So much mathematics!
There’s no hard and fast rule for this, and there’s no doubt that one can get by as a data technician – someone who, for example, can work on a predictive model by compiling or retrieving a dataset, visualizing the distribution of the data, dealing with missing values, running different models, and examining the output from the models to create predictions – without knowing much in the way of mathematics or statistics behind the models.
Take a course, read a book?
Common, but not very satisfactory, advice for people lacking certain mathematical or statistical skills is to take a course of one sort or another – college or MOOC, for example – to get acquainted with these skills and gain some practice in them.
So one goes off and enters an abstract world of linear algebra, for example, and stays there long enough – if one has the requisite staying power – to gain some possibly small degree of familiarity with the mathematical notation and notions.
This goes against everything we believe at Math 4 Plus & datascience.university. The sort of learning where one takes a course or reads a technical book, on linear algebra, or statistics, for example, is unsituated learning. It’s hard for many people because it can appear entirely unmotivated.
Matrix, matrices? Who came up with that?
In this style of learning one reads, for example, that a matrix is a rectangular array of numbers. So freakin’ what? Who cares about rectangular arrays of numbers, and why?
If you are lucky you might learn that Chinese mathematicians in the Han dynasty, two centuries BCE, used matrices (plural of “matrix”) in the solution of simultaneous linear equations.
You might, again if you are lucky, learn that the English mathematician Arthur Cayley published “A Memoir on the Theory of Matrices” in 1858, a paper that helped establish matrix theory as a branch of mathematics. You can see, if you read Cayley’s memoir, that matrices for him were also a way of coding linear equations in several variables.
But even if you are so lucky to learn about these bits of historical background you’ll still be left wondering what all this has to do with data analysis.
There has to be a better path into linear algebra for people motivated by data analysis … and there is!
Data arrays & matrices
It is common practice to represent numerical data in the form of spreadsheets or database tables. These tables are naturally formatted as rectangular arrays of numbers, and that’s where, and why, matrices particularly, and linear algebra more generally, are relevant to data analysis.
For an aspiring data scientist a more natural way to approach linear algebra is through the very matrix tables in which data is often stored. Then the apparently abstract operations of linear algebra make much more sense, because they are motivated by a need to manipulate, simplify, and analyze data.
Of course not all data is numeric in nature, and not all data comes in rectangular arrays. But a vast amount does, and most data analysis software deals naturally and most readily with numeric data in rectangular arrays, so that’s what we will restrict ourselves to for now.
A data matrix
The U.S. Centers for Disease Control and Prevention (CDC) keeps a lot of data on diseases and health conditions. One such data set stores the percentage of adults diagnosed as diabetic, for each U.S. county, for 2004 – 2013.
The data is stored in an Excel table, and this table is a rectangular array with 3226 rows and 47 columns. We call this data array a matrix, bearing in mind that “matrix” is another term for a rectangular array. This particular array contains more than numbers – it also contains county names and county identifiers. We can extract just the percent diabetic from this array to get a numeric array of data: this is our numeric matrix, part of which is shown below:
This is a numeric matrix that contains all the data on percent diabetic for each U.S. county for the years 2004 – 2013.
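The extraction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny table, the county names, the column layout, and every number in it are made-up placeholders rather than values from the CDC file, and `extract_numeric` is a hypothetical helper name chosen for this sketch.

```python
# A tiny stand-in for the mixed table: each row holds a county identifier,
# a county name, and then one percentage per year.
# All values are invented placeholders, NOT real CDC figures.
table = [
    ["00001", "County A", 8.1, 8.4, 8.9],
    ["00002", "County B", 10.2, 10.5, 10.3],
    ["00003", "County C", 6.7, 7.0, 7.4],
]


def extract_numeric(rows, first_numeric_col=2):
    """Keep only the numeric entries of each row (hypothetical helper).

    Assumes every column from `first_numeric_col` onward holds a number.
    """
    return [[float(x) for x in row[first_numeric_col:]] for row in rows]


# The numeric matrix: one row per county, one column per year.
data_matrix = extract_numeric(table)

# The shape of the matrix is (number of rows, number of columns).
n_rows = len(data_matrix)
n_cols = len(data_matrix[0])
print(n_rows, n_cols)
```

With the real table the same idea would apply, only with many more rows and columns; in practice one would probably reach for a spreadsheet or data analysis library rather than plain lists.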
What about vectors?
A numeric vector is just a finite ordered list of numbers. Usually these lists are written as either rows or columns.
A row vector would look like this:
and a column vector would look like this:
Notice that the rows of a matrix are row vectors and the columns of a matrix are column vectors.
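To make the relationship between matrices and vectors concrete, here is a minimal Python sketch that pulls the row vectors and column vectors out of a small matrix. The matrix entries are arbitrary numbers chosen for illustration, and the helper names `row` and `column` are assumptions of this sketch, not part of any library.

```python
# A small 2 x 3 matrix stored as a list of rows (arbitrary example values).
matrix = [
    [1, 2, 3],
    [4, 5, 6],
]


def row(m, i):
    """Return row i of the matrix as a row vector (a list of numbers)."""
    return list(m[i])


def column(m, j):
    """Return column j of the matrix as a column vector (a list of numbers)."""
    return [r[j] for r in m]


# Every row of the matrix is a row vector ...
row_vectors = [row(matrix, i) for i in range(len(matrix))]

# ... and every column of the matrix is a column vector.
column_vectors = [column(matrix, j) for j in range(len(matrix[0]))]

print(row_vectors)
print(column_vectors)
```

Plain Python lists do not record whether a vector is written as a row or as a column; that distinction only matters once we start combining vectors and matrices.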
More to follow
Linear algebra deals with operations on, and transformations of, matrices and vectors, usually carried out to get the matrices, or vectors, into a simpler form than they are first presented to us. This is especially important with very large data sets, which give rise to very large matrices.