Andrew Correia is a data scientist, currently a Senior Data Scientist at TripAdvisor.
Andrew has a BA in Mathematics from the University of Massachusetts Dartmouth, a Ph.D. in Biostatistics from Harvard, and has worked as a quantitative or statistical analyst, and data scientist, at the NMR Group Inc, Brigham and Women’s Hospital, and SessionM.
He wrote the following (slightly edited) reply to the question: “What do data analysts and data scientists spend their time doing?”
New data scientists and advanced degrees
Almost all new data scientists that I’ve seen enter the field with an advanced degree (Masters or PhD) in a quantitative discipline.
Machine learning and statistics, and of course mathematics, are the backbone of data science, but because it is an applied discipline, practicing data scientists come from all sorts of academic backgrounds.
My background is in biostatistics, and I work with people who have advanced degrees in subjects like machine learning, aeronautics, electrical engineering, statistics, neuroscience, and physics.
And now, a new wave of people entering the field actually have degrees in data science, as several universities have begun offering advanced degrees directly in this area. All of this is to say, there are a lot of paths a person can take to becoming a data scientist.
More important than someone’s area of study is his/her knowledge and familiarity with some of the core tools and methods that we use every day.
While nothing in data science gets more attention than the advanced modeling approaches that seem capable of predicting almost anything – deep learning comes to mind, though there are others – model building and prediction make up only a piece of what we do.
Before we can make any predictions and build sophisticated models, we need data to feed into those models.
This means data scientists need to be adept at working with databases and piecing together data from different places to construct the important variables, or features, that we ultimately pass to our models to make predictions.
Understanding the problem
At the onset of any new project, a big part of my day-to-day is simply understanding the problem: working with people in other areas of the business to get a feel for what they want us, as data scientists, to help them with, and exactly what insights they hope to glean from our models.
From there, we have to dive into the databases to figure out where exactly we can find the data we need to answer the questions of interest.
This means working with our database engineers, the people who maintain and update our databases, to figure out which tables have the information I need to build a model that can solve the real-life business question(s) we’re trying to address.
Accessing the data means knowing how to use tools like SQL and Hive (and now, Spark) which enable us to write queries that pull and process huge amounts of data from these databases and write them to a file so that we can then load them into a program better-suited for model building.
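A minimal sketch of this step, using Python's built-in sqlite3 module as a stand-in for a real warehouse (the `visits` table and its columns are hypothetical): run an aggregation query that turns raw events into per-user features, then write the result to a file for the modeling step.

```python
import csv
import sqlite3

# Toy in-memory database standing in for a warehouse table; the table and
# column names (visits: user_id, page, dwell_seconds) are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, page TEXT, dwell_seconds REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "hotel", 30.0), (1, "flight", 12.5), (2, "hotel", 45.0)],
)

# Aggregate raw events into per-user features we can later feed to a model.
rows = conn.execute(
    """
    SELECT user_id,
           COUNT(*) AS n_visits,
           AVG(dwell_seconds) AS avg_dwell
    FROM visits
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()

# Write the features to a file so they can be loaded into a modeling program.
with open("features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "n_visits", "avg_dwell"])
    writer.writerows(rows)

print(rows)  # [(1, 2, 21.25), (2, 1, 45.0)]
```

In practice the same shape of query would run against Hive or Spark SQL over far larger tables; only the dialect and scale change.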
Model building is the fun part. This is when we get to try different types of modeling approaches and see which ones perform best for the problem we’re trying to solve.
There are too many types of potential approaches to list here, but some of the most popular methods now include:
- deep learning approaches (neural networks, convolutional neural networks, recurrent neural networks)
- tree-based approaches (xgboost, random forest)
- linear models (linear regression, logistic regression, mixed effects models), and
- recommendation models (matrix factorization, collaborative filtering).
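The "try different approaches and compare" loop can be sketched in a few lines of plain Python. The data and both models here are toy assumptions, far simpler than the methods listed above: a majority-class baseline is pitted against a 1-nearest-neighbor classifier, and each is scored on held-out points.

```python
import math

# Hypothetical toy data: ((feature_1, feature_2), label) pairs. In practice
# the features would come from the database queries described earlier.
train = [((1.0, 1.0), 0), ((1.5, 2.0), 0), ((2.0, 1.0), 0),
         ((3.0, 4.0), 1), ((3.5, 3.0), 1)]
test = [((1.2, 1.1), 0), ((3.2, 3.5), 1)]

def majority_baseline(train, x):
    # Predict the most common training label, ignoring the features entirely.
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def nearest_neighbor(train, x):
    # Predict the label of the closest training point (1-NN).
    return min(train, key=lambda p: math.dist(p[0], x))[1]

def accuracy(model, train, test):
    # Fraction of held-out points the model labels correctly.
    return sum(model(train, x) == y for x, y in test) / len(test)

for model in (majority_baseline, nearest_neighbor):
    print(model.__name__, accuracy(model, train, test))
```

Real comparisons swap in the model families listed above (via libraries like scikit-learn or xgboost), but the skeleton – fit candidates, score them on data they have not seen, keep the winner – is the same.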
There are also a lot of different programs that you could use to train and test your models. By far the two most popular are Python and R, and proficiency in at least one of those is a must-have if you want to become a data scientist. The software of choice will vary from organization to organization – for example in my last job I used R almost exclusively; now the opposite is true and I do almost all of my model building in Python.
Throughout the model building process there is a consistent back and forth of constructing a model and then evaluating how well it’s performing.
This is done by constructing a training data set and a testing data set – the training data set might be 80% of our historical data and the testing data set is the other 20%.
We then build the model using the training data and evaluate its performance on the test set, since the model has never seen that data.
In this way, we can mimic what will happen in the real world when we use all of our historical data to train the model and make predictions for, e.g., the next day or the next week.
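The 80/20 split described above takes only a few lines; this sketch uses made-up historical data and the standard library's random module.

```python
import random

# Hypothetical historical data: ten (features, label) pairs.
random.seed(0)  # fixed seed so the split is reproducible
data = [((i, i % 3), i % 2) for i in range(10)]

# Shuffle, then keep 80% for training and hold out 20% for testing.
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

print(len(train), len(test))  # 8 2
```

Because the held-out 20% never touches the fitting step, accuracy on it approximates how the model would fare on genuinely new data, e.g. next week's traffic.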
Once we have settled on a model we’re happy with, the last step before releasing it into production is doing a live A/B test on the website to see how it performs in the real world.
To do this, we’ll work with production engineers and analysts to set up the test so that one portion of visitors to our website experiences content driven by the new model, while the other portion experiences content as it has been – in other words, a test and a control group. If we see positive results here, then we have confidence that we can roll the model out into production, so that all users experience content driven in some way by the new model, which will then ideally increase revenue for the company.
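One common way to judge such an A/B test (the conversion counts below are invented for illustration) is a two-proportion z-test: compare the conversion rates of the test and control groups and ask whether the lift is larger than chance would explain.

```python
import math

# Hypothetical A/B results: conversions out of visitors in each group.
conv_a, n_a = 120, 2400   # control group (existing content)
conv_b, n_b = 150, 2400   # test group (content driven by the new model)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)

# Standard error of the difference under the pooled null hypothesis,
# then the z statistic for the observed lift.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

print(f"lift = {p_b - p_a:.4f}, z = {z:.2f}")
```

A z near or above roughly 1.96 corresponds to significance at the conventional 5% level (two-sided); in practice the analysts would also weigh effect size, test duration, and guardrail metrics before rolling the model out.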
Key data science skills and experience
Degree: Advanced degree in a quantitative discipline.
Methods: Recommendation models like matrix factorization and collaborative filtering; deep learning approaches like NNs, CNNs, RNNs; statistical approaches like linear and logistic regression, GLMs, and generalized linear mixed models; tree-based approaches like xgboost and random forests.
Programming/Software: SQL, Hive, Spark; R, Python.