Reddit user Alrik asked a very perceptive question about data science:
“Lately, I’ve been seeing a lot of online ed sites offering data science courses (e.g., Coursera, Udemy, etc). There are also plenty of books, going all the way down to Data Science for Dummies. But despite all that, if you look at job postings, they’re almost exclusively looking for folks with at least one PhD. My own feelings on the matter are that it’s possible to get into data science without a PhD (I did it), but that graduate-level quantitative research skills are generally necessary for the job, and that it’s exceedingly difficult to learn those skills without practicing those skills in an environment that matters. Additionally, while I believe that programming skills are something most people can pick up from an online course (lots of great Python courses, and even a few R courses), I think that coding is just the tip of the iceberg — a good data scientist needs to have a solid understanding of math and statistics so as to be able to develop a decent model, and those skills are much more difficult to develop through independent study.”
- Just how much mathematics or statistics does a data scientist need to function effectively?
- Does necessary knowledge of mathematics or statistics depend on the level at which one is operating as a data analyst or data scientist, or on the nature of the job?
- How well can necessary mathematics and statistics skills be learned on the job, or through professional education?
It’s probably fair to say that the more mathematics and statistics one has under one’s belt, the easier it will be to make headway in data science. But how much is enough, and under what circumstances?
What about people who have had some exposure to mathematics and statistics but are not specialists: for example, graduates and undergraduates in biology, crime and justice, economics, environmental studies, political science, or psychology? And what about people who studied or are studying English, foreign languages, philosophy, or music? Are they forever ruled out of being data scientists because of their general lack of mathematics and statistics? If not, where should they start?
There is some gung-ho and poor advice floating around the web claiming that for some applied areas of data science, such as machine learning, one needs hardly any mathematics or statistics.
For example, this advice from Sharp Sight Labs is fairly clear that, in their opinion, very little mathematics or statistics is needed to read the seminal books An Introduction to Statistical Learning and Applied Predictive Modeling which they describe as follows:
This means that it’s possible for you to build a good predictive model without almost any knowledge of calculus or linear algebra. If you’re still not convinced of this, then take a careful look at An Introduction to Statistical Learning or Applied Predictive Modeling. These are two excellent books on machine learning (AKA, statistical learning; AKA, model building). There’s almost no calculus or linear algebra in either of them.
This is just horribly wrong. Among other things, An Introduction to Statistical Learning discusses, in depth, all of the following, which are significant parts of linear algebra, and some of which require multivariable calculus: principal component analysis, basis functions, regression splines, smoothing splines, hyperplanes, linear and non-linear optimization, k-means clustering.
And the Sharp Sight Labs enthusiasm is misguided in another respect: both these books contain many mathematical formulas and expressions. To interpret these expressions, let alone understand them, requires more than a modicum of mathematical maturity. Here are a few examples:
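To see why a topic like principal component analysis is linear algebra at heart, here is a minimal sketch in Python, restricted to two-dimensional data so that the eigenvalues of the covariance matrix can be written in closed form. The function name `pca_2d` and the sample points are made up for illustration and do not come from either book.

```python
import math

def pca_2d(points):
    """PCA for 2-D points: eigenvalues and first principal direction.

    Hypothetical helper for illustration only.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    gap = math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    lam1, lam2 = trace / 2 + gap, trace / 2 - gap

    # Eigenvector for lam1, i.e. a solution of (S - lam1 * I) v = 0.
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

# Made-up points lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
eigenvalues, direction = pca_2d(points)
print(eigenvalues, direction)
```

The first eigenvalue is the variance along the first principal direction. Even in this toy case, reading the result requires knowing what a covariance matrix, an eigenvalue and an eigenvector are.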
What the Sharp Sight Labs post seems to be advocating is what I would describe as a data technician: someone who learns to push the buttons on a black box and get a result, without much, if any, understanding of what’s going on. To me, this is the antithesis of what it means to be a data scientist.
So yes, I believe that the more mathematics and statistics you have under your belt, the better a data scientist you will be. But how much, when, and for what purpose is another matter. How people who want to get into data science from non-quantitative fields can get to grips with the necessary mathematics and statistics is a further question, which we hope to address more fully later.
I received this great response from Reddit user ambassador_pineapple:
OK, I am going to go on a rant. I LOVE that someone is finally talking about this.
Some background first: I have run teams of data scientists at large banks, I come from a physics and mathematics educational background, and I have taught data science. Now I run my own A.I. startup.
Some of the most important insights I have obtained in my career have come because of a deep understanding of metric spaces and n-dimensional manifolds. Advanced linear algebra has these hidden gems which no one knows about but which get used to drive insights from data using some Python tools. For example, I once had a guy 10 years older than me, with a PhD, trying to run a damn logistic regression to build a segmentation model, who had not even bothered to check the distribution of the predictors or the data types. Just went ahead, called the regression package on garbage data with no justification for what he was doing. This is just one of many many senior folks at places you would not believe.
Anyone who is trying to get into this profession should know this is like any other applied science. Road ahead is long and hard but rewarding. Don’t listen to assholes selling snake oil, telling you math and stats are not important. You ARE DOING mathematical and statistical work. If you don’t know deeper, more important concepts which formulate the basis of what an OLS really does, you cannot effectively use it either.
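To make the point about OLS concrete, here is a minimal sketch of what an ordinary least squares fit with a single predictor actually computes: the intercept and slope that minimize the sum of squared residuals. The function names `fit_ols` and `sum_squared_residuals` and the data are my own illustrative assumptions, not part of the comment above.

```python
def fit_ols(xs, ys):
    """Closed-form OLS for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope is the centered cross-product of x and y
    # divided by the centered sum of squares of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """The quantity that OLS minimizes."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_ols(xs, ys)
best = sum_squared_residuals(xs, ys, intercept, slope)
# Moving the slope away from the OLS solution can only increase the error.
nudged = sum_squared_residuals(xs, ys, intercept, slope + 0.1)
print(intercept, slope, best, nudged)
```

A regression package returns these same numbers for this data, but it will not tell you whether a straight line and a squared-error criterion were sensible choices in the first place.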
I’m very interested in the discussion and follow-up to this. I’m nearing the end of a PhD in a molecular biology-type field and have developed some decent coding skills as well as having at least a basic understanding of statistics.
Finding MOOCs to further develop coding skills (in R, Python, etc) that have a statistical application is very easy — there are lots of them around. However, finding resources that will help me take what I already know about math and stats to the next level in an independent-study format has been really challenging.
I’m sure there are many others out there who already understand things like p-values (what they really mean), confidence intervals, linear regression, t-tests, ANOVA, and other “basic” stats concepts from their education in fields other than stats/maths; but it seems the resources that cover more advanced topics are limited for these people (in my experience).