A previous post on mathematics and statistics for data science got a lot of attention here, on *Twitter*, and on *reddit* at r/datascience.

For those of you who do not normally inhabit *reddit* – and perhaps even for those of you who do – here’s a selection of comments and replies from a number of Reddit contributors on the topic of mathematics and statistics for data science.

For those of you thinking about moving into, or already in, data science, some of these comments are very helpful, IMO. They are all quite thoughtful.

Please read and enjoy.

ambassador_pineapple: OK, I am going to go on a rant. I LOVE that someone is finally talking about this.

Some background first: I have run teams of data scientists at large banks, I come from a physics and mathematics educational background, and I have taught data science. Now I run my own A.I. startup.

Some of the most important insights I have obtained in my career have come because of a deep understanding of metric spaces and n-dimensional manifolds. Advanced linear algebra has these hidden gems which no one knows about but which can be used to drive insights from data using some Python tools. For example, I once had a guy 10 years older than me, with a PhD, trying to run a damn logistic regression to build a segmentation model, and he had not even bothered to check the distribution of the predictors or the data types. He just went ahead and called the regression package on garbage data with no justification for what he was doing. This is just one of many, many senior folks at places you would not believe.

Anyone who is trying to get into this profession should know this is like any other applied science. Road ahead is long and hard but rewarding. Don’t listen to assholes selling snake oil, telling you math and stats are not important. You ARE DOING mathematical and statistical work. If you don’t know deeper, more important concepts which formulate the basis of what an OLS really does, you cannot effectively use it either.
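To make the "check the distribution of the predictors or the data types" point concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the commenter's own workflow: the function name `summarize_predictor` and the `income` column are hypothetical, and a real project would more likely use a data-frame library.

```python
import statistics

def summarize_predictor(values):
    """Return basic diagnostics for one predictor column (a list of raw values)."""
    present = [v for v in values if v is not None]
    # Treat ints and floats as numeric; bools and strings are flagged instead.
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    summary = {
        "n": len(values),
        "missing": len(values) - len(present),
        "non_numeric": len(present) - len(numeric),
    }
    if len(numeric) >= 3:
        mean = statistics.fmean(numeric)
        sd = statistics.pstdev(numeric)
        summary["mean"] = mean
        summary["median"] = statistics.median(numeric)
        # Population skewness (third standardized moment); near 0 for symmetric data.
        summary["skewness"] = (
            sum((x - mean) ** 3 for x in numeric) / (len(numeric) * sd ** 3)
            if sd > 0 else 0.0
        )
    return summary

# Hypothetical column: one extreme value, one missing entry, one stray string.
income = [32, 35, 38, 41, 45, 52, 400, None, "n/a"]
report = summarize_predictor(income)
print(report)
```

A large gap between mean and median, a high skewness, or any non-numeric entries are all signals to stop and look at the data before calling a regression routine on it.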

Gnonpi: I totally see what you mean with the regression applied to garbage data. Data science is not only about creating and training models, it’s also about preprocessing (data distribution, correlation, choosing the right model) and postprocessing (is the model overtrained? Is it biased? Can I interpret what it has learned?). And that requires a mathematical background.

fooliam (MS | Data Scientist | Sports): Hell, the majority of my time is spent on the preprocessing, and I don’t think I’m the exception. If I had to break it down, I’d say at least half of my time on any particular project is spent in preprocessing data and data exploration before I even think about trying to build a model.

datavinci: So true. Machine learning is mostly statistics on steroids.

TaXxER: I’m not even sure about the “on steroids” part. Mostly ML and stats are the same things. ML is just a very successful rebranding of stats. And very wrong metaphors (e.g. explaining neural networks as “brain-like”) are used as part of this rebranding.

AGINSB: This analogy was from a Caltech course available online on the subject, but neural networks are brain-like in the same way that airplanes are bird-like.

ryanmcstylin: I am glad I figured this out early. I made a pretty penny running a swim lesson business in high school & college, so I wanted to go into business. I didn’t understand the concept of studying business in undergrad when you could get an MBA with a liberal arts degree. With this in mind I started college going after an Econ major and math minor. After my first year I realized I could graduate early & would still be taking a half year of garbage classes to fill hours requirements. I didn’t want to leave college early so I started chasing majors. I ended up with

This was back in 2008 before Machine Learning was a buzzword. I knew business people liked to see stats, but you don’t even need college level math for a business degree. Now we have this modern day gold rush with a labor force that can’t even recognize gold, much less mine it.

lmcinnes: It’s a fancy word for a relatively easy concept. If you’ve got decent background you can easily pick up what you need when you need it. I wouldn’t panic. There will always be things that are out of your ken, but you can often pick up the bits you need as you encounter them. Math is a *big* subject and you can’t reasonably expect to have expertise in all of it; the best you can hope for is enough knowledge and background to learn anything else you need.

Neil1859: I think this is the most important tip in this whole discussion, thank you.

test_username_exists: Analogous to (and on top of) OP’s point about math / stats, I personally wish the community would start putting more focus on the software *design* side of code and best practices instead of how to answer cute linked list questions / fit a model using scikit-learn; too many people think that the job is writing one-off Kaggle scripts and can’t deliver reproducible, well-tested code in a collaborative team environment, at least based on my own experience.

ydobonobody: I agree with this in principle, but in practice probably 90% of the code I write is run once, so it’s a pretty easy decision to slack off when it comes to good coding practices.

pinkerton_jones: You haven’t lived until you’ve had to explain to some of these folks what a median is, and why you use it with ordinal data.
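As a small illustration of the median-with-ordinal-data point, here is a sketch in standard-library Python. The survey scale and the responses are made up for the example. The idea is that a median only needs the categories to be ordered, whereas a mean of the numeric codes silently assumes the categories are equally spaced.

```python
import statistics

# Hypothetical five-point ordinal scale, listed from lowest to highest.
LEVELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
RANK = {label: i for i, label in enumerate(LEVELS)}

def ordinal_median(responses):
    """Median category of ordinal responses; median_low always returns an observed rank."""
    ranks = [RANK[r] for r in responses]
    return LEVELS[statistics.median_low(ranks)]

responses = ["agree", "neutral", "strongly agree", "disagree",
             "agree", "strongly disagree", "agree"]

print(ordinal_median(responses))

# Recoding the categories with any other increasing numbers changes the mean
# of the codes, but the median still points at the same category.
codes_a = [0, 1, 2, 3, 4]
codes_b = [1, 2, 3, 10, 100]
mean_a = statistics.fmean(codes_a[RANK[r]] for r in responses)
mean_b = statistics.fmean(codes_b[RANK[r]] for r in responses)
median_b = statistics.median_low([codes_b[RANK[r]] for r in responses])
```

Using `median_low` rather than `median` avoids averaging two middle codes when the number of responses is even, which would produce a value that is not a category at all.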

Drewdledoo: I’m very interested in the discussion and follow-up to this. I’m nearing the end of a PhD in a molecular biology-type field and have developed some decent coding skills as well as having at least a basic understanding of statistics.

Finding MOOCs to further develop coding skills (in R, Python, etc) that have a statistical application is very easy — there are lots of them around. However, finding resources that will help me take what I already know about math and stats to the next level in an independent-study format has been really challenging.

I’m sure there are many others out there who already understand things like p-values (what they *really* mean), confidence intervals, linear regression, t-tests, ANOVA, and other “basic” stats concepts from their education in fields other than stats/maths; but it seems the resources that cover more advanced topics are limited for these people (in my experience).

GreatOwl1: I think it’s because math is probably one of the hardest topics to self-teach.

garyernestdavis: Here’s what I think is needed (and what colleges, universities and MOOCs cannot provide):

An individual needs to know something about mathematics or statistics, for example (a) what is singular value decomposition? Principal component analysis? (b) What actually is a p-value? What are the assumptions of linear regression and how important is it if they don’t hold exactly?

That person can: (1) scout around for books, which may or may not be readable (usually not) and struggle through the relevant sections or chapters; (2) find someone who knows and can help them directly; (3) enroll in a course that covers a lot of other stuff as well.

Just-in-time custom learning environments just don’t seem to exist.

How economical would they be to set up I wonder?
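As one example of the kind of short, just-in-time answer being asked for, the two questions in (a) are closely related. The following is a standard textbook identity, sketched here with notation that is not from the original comment: $X$ is an $n \times p$ data matrix whose columns have been centered, and $\sigma_k$ are its singular values.

```latex
% Singular value decomposition of the centered data matrix
X = U \Sigma V^{\top}

% Sample covariance matrix, rewritten using the SVD (U^T U = I)
C = \frac{1}{n-1} X^{\top} X
  = \frac{1}{n-1} V \Sigma^{\top} U^{\top} U \Sigma V^{\top}
  = V \left( \frac{\Sigma^{\top} \Sigma}{n-1} \right) V^{\top}

% Hence the columns of V are the principal directions, and the
% variance captured along the k-th direction is
\lambda_k = \frac{\sigma_k^{2}}{n-1}
```

In other words, principal component analysis can be read off from the singular value decomposition of the centered data, which is why the two topics are usually learned together.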

harriswill: I’m sure there are many others out there who already understand things like p-values (what they really mean), confidence intervals, linear regression, t-tests, ANOVA, and other “basic” stats concepts from their education in fields other than stats/maths; but it seems the resources that cover more advanced topics are limited for these people (in my experience).

ocyx: It’s comical that people claim to know statistics or maths without taking abstract courses in the topics. You don’t really “know” things like p-values, hypothesis testing, inference, Linear Algebra in general, etc. unless you’ve taken an abstract course that describes the derivations in depth. You have to learn the theory to realize why the shit you’re using is useful.

Blix-: Can you recommend a course list people should take? I’ve taken linear algebra, stats, up to Calc 2, and differential equations. I plan on taking Calc 3, but I’m not sure after that.

ocyx: You’ve taken intro Linear Algebra and intro Statistics (I know this because you haven’t taken multivariable calculus yet). So I would take an abstract Linear Algebra course covering inner product and vector spaces, as well as a mathematical statistics class that shows you the theory underlying stats. Beyond that, look online, and try to take theory-based classes when it’s an important topic like Linear Algebra or Statistics.

BtDB: This. A huge portion of my job is reporting and metrics at this point. It stuns me that the people who NEED to see my reports to make good, informed decisions have an absolutely terrible time extrapolating any sort of useful information from a report, or worse, extrapolate the wrong information. Often it’s not the data, or how it is portrayed, but rather the complete inability to make sound inferences from a given data set.

Trek7553: I read the article, and I guess I’m just not sold. I understand why a serious data scientist would need to understand the math and stats, but I don’t understand why it’s necessary in order to be useful to a business.

I have only a basic understanding of stats, but I know data very well. I am working on my first predictive model and I understand how to build the dataset, examine the distribution, handle missing values, normalize where necessary, run various models, and examine the output. I can pick the one that is most useful (based on which accuracy metrics matter most to the business) and productionalize that to create predictions. I can tell the business which features were most important to the model so they understand generally how it works.

Why does it matter that I can’t articulate the math driving the model? The final product may not be as good as if I could precisely tune it, but it’s far better than nothing. What am I missing?

I don’t mean to argue, it’s a genuine question.

ambassador_pineapple: I completely understand where you are coming from and you do bring up a good point. There is absolutely nothing wrong with what you said as long as there is a more senior person who reviews your work and you have peers in your team you work with. The nature of this work is that you must keep learning. Over time, you will gain more experience and learn more details and insights into different mathematical topics. For example, say you wanted to run some model using kernel density methods. You will start with Wikipedia (of course) and then maybe it will lead you to more interesting and esoteric signal analysis techniques. Hopefully, you will eventually end up understanding more about the underlying principles of signal analysis.

There is nothing wrong with the person who will learn as time goes on. Hell, I learn a lot each day. My mathematical experience builds on itself and is just an advantage. Anyone can start building it. Where things start to fall apart is when a person continues to just use algorithms blindly, moves up in their career (trust me, you can move up without good results. Corporate world is scary), and is still blind to how these things work. Then you are put in charge of some business process which risks millions of dollars or some intelligence program in charge of people’s safety. Using some algorithm which, let’s say, requires your data to follow a Gaussian distribution, when your data is Laplacian in nature or just has some weird empirical distribution, will lead to some bad stuff happening.

And I of course don’t mean to argue either. Always good to discuss these things!
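The Gaussian-versus-Laplacian example can be checked cheaply before trusting a method that assumes normality. The sketch below is an illustration under stated assumptions, not the commenter's procedure: it uses simulated data and compares sample excess kurtosis, which is 0 for a Gaussian and 3 for a Laplace distribution. The helper name `excess_kurtosis` is hypothetical, and in practice a histogram or Q-Q plot would be a sensible companion check.

```python
import random
import statistics

def excess_kurtosis(xs):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    mean = statistics.fmean(xs)
    var = statistics.pvariance(xs, mu=mean)
    fourth = sum((x - mean) ** 4 for x in xs) / len(xs)
    return fourth / var ** 2 - 3.0

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20_000

# Simulated Gaussian data.
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]

# The difference of two independent Exponential(1) draws follows a
# Laplace(0, 1) distribution, which has much heavier tails.
laplace = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]

k_gauss = excess_kurtosis(gaussian)
k_laplace = excess_kurtosis(laplace)
print(f"Gaussian excess kurtosis: {k_gauss:.2f}")
print(f"Laplace excess kurtosis:  {k_laplace:.2f}")
```

A value far above zero means extreme observations are much more common than a Gaussian model expects, which is exactly the situation where a Gaussian-based method can understate risk.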

Trek7553: That totally makes sense. On this first project, we did hire an outside consultant to guide us through the process. On future projects I will be the most senior so I guess I will just have to keep learning as I go.

Thanks for your reply!

garyernestdavis: I guess my answer is that it matters because you are unaware of what you don’t know. So long as you function as what seems to me to be a data technician you’re probably OK. Sorry for the name, I do not mean to be demeaning, there are just few widely accepted job descriptions in this industry. When you have a need, as a simple example, to understand the assumptions of multivariate linear regression, because predictions don’t turn out to be accurate, then you’ll see a need to dig deeper into the statistics.

If you have no need in your current job then I agree: why dig deeper?

I did say in the article that “I believe, the more mathematics and statistics you have under your belt the better data scientist you will be. But how much, and when, for what purpose, is another matter.”

Trek7553: That makes sense. I don’t think data technician is quite accurate, but I agree that I’m not a data scientist. My team is small, so I’m a “report developer/dashboard developer/data warehouse architect/data analyst”. Officially my title is “Data & Analytics Manager”.

I did not mean to attack your article, sorry if it felt that way. I’m just trying to understand how it applies to me :).

garyernestdavis: Not at all – I didn’t feel you were attacking anything. It was a genuine question – thanks for asking.