Graphs and data science
Data scientists deal with graphs a lot, for instance in:
- Social graphs
- Brain networks
- Protein-protein interactions
- Graph-based machine learning
- Graph databases
“Understanding these relationships between data, whether it is Web pages, products purchased, features on a vehicle, words in a message, or symptoms, treatments, and outcomes from sick patients, is the first step to accepting that graphs will be the way we think about data in the future.
The data that we have today, and often the ways we look at data, are already steeped in the theory of graphs. In the future, our ability to understand data with graphs will take us beyond searching for an answer. Creating and analyzing graphs will bring us to answers automatically. When we let data connect itself, meaning will begin to emerge automatically.”
Graphs and matrices
Looking at the graph above, we can see that this has nothing to do with linear algebra, right?
Well, actually, it has everything to do with linear algebra!
That’s because the complete graph information can be coded in a matrix: in this case an 18×18 matrix, with one row and one column for each of the 18 vertices of the graph.
This matrix is called the adjacency matrix of the graph and it’s constructed like this:
We put a:
- “1” in row 1, column 2, because vertex 1 is connected by an edge to vertex 2;
- “1” in row 1, column 3, because vertex 1 is connected by an edge to vertex 3;
- “1” in row 1, column 4, because vertex 1 is connected by an edge to vertex 4;
- “1” in row 1, column 5, because vertex 1 is connected by an edge to vertex 5;
- “0” in row 1 and all other columns, because vertex 1 is not connected by an edge to any other vertices.
Then we repeat this for the rows corresponding to vertex 2, vertex 3, …, through vertex 18.
Here’s the result:
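The same construction is easy to code. Here is a minimal sketch in Python with NumPy; since the full 18-vertex graph isn’t reproduced here, it uses a hypothetical 5-vertex graph containing just the edges listed above for vertex 1:

```python
import numpy as np

# Hypothetical 5-vertex graph (a stand-in for the 18-vertex graph in
# the figure): vertex 1 is connected to vertices 2, 3, 4, and 5.
edges = [(1, 2), (1, 3), (1, 4), (1, 5)]

n = 5
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = 1  # "1" in row i, column j, because i and j share an edge...
    A[j - 1, i - 1] = 1  # ...and symmetrically, since the edges are undirected

print(A)
```

Note that the matrix comes out symmetric: an edge between vertices i and j puts a “1” in row i, column j and in row j, column i.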
Talk to a computer
First, the matrix can be coded in most programming languages, so we can store the structure of the graph, via the adjacency matrix, in computer memory. How else were we going to do this?
Second, the adjacency matrix can be used to tell us some interesting things about the graph. For instance, how many paths of length 2, length 3, length 4, and so on, are there from one vertex to another?
We could try enumerating these case by case, by hand, but a far niftier and computationally more efficient way is to use the idea of a matrix power. Yes, it turns out there’s a reasonably natural way to multiply matrices, and if we multiply a square matrix by itself we get the square of the matrix; another multiplication gives us the cube of the matrix, and so on.
So let’s take the adjacency matrix above and use some software to calculate its third power (I used Mathematica® for this purpose; R or Python will do it just fine, as, of course, will MATLAB). Reading off its entries, we find:
- 7 paths of length 3 from vertex 1 to vertex 2
- 10 paths of length 3 from vertex 4 to vertex 1
- 14 paths of length 3 from vertex 5 to vertex 16
and much more!
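In Python this is a one-liner with NumPy. A minimal sketch, again using a hypothetical 5-vertex graph (vertex 1 joined to vertices 2, 3, 4, 5) rather than the 18-vertex graph in the figure:

```python
import numpy as np

# Hypothetical star graph on 5 vertices: vertex 1 joined to 2, 3, 4, 5.
A = np.zeros((5, 5), dtype=int)
for j in (1, 2, 3, 4):
    A[0, j] = A[j, 0] = 1

# Entry (i, j) of the k-th power of A counts the paths of length k
# from vertex i+1 to vertex j+1 (rows and columns are 0-indexed).
A3 = np.linalg.matrix_power(A, 3)

print(A3[0, 1])  # paths of length 3 from vertex 1 to vertex 2: prints 4
```

For this little graph the four paths of length 3 from vertex 1 to vertex 2 are 1–2–1–2, 1–3–1–2, 1–4–1–2, and 1–5–1–2, which matches the matrix entry.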
Paths in big social graphs
The graphs you get will be HUGE.
If you’ve coded part of them as a matrix, you can use the powers of the matrix to compute how many paths there are from one person to another through that part of the social graph.
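For a huge graph, most entries of the adjacency matrix are 0, so in practice you wouldn’t store the full matrix at all. A pure-Python sketch of the same path-counting idea, storing only the edges as an adjacency list (the names and the tiny friendship graph are made up):

```python
# Hypothetical friendship graph, stored sparsely: each person maps
# to the list of their friends.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat"],
}

def count_paths(start, k):
    """Count paths of length k from `start` to every vertex.

    Each loop iteration is one sparse matrix-vector product: it pushes
    the current path counts one step along every edge, exactly as
    multiplying by the adjacency matrix would.
    """
    counts = {v: 0 for v in friends}
    counts[start] = 1
    for _ in range(k):
        nxt = {v: 0 for v in friends}
        for v, c in counts.items():
            for w in friends[v]:
                nxt[w] += c
        counts = nxt
    return counts

print(count_paths("ann", 3)["bob"])  # paths of length 3 from ann to bob
```

This never builds the n×n matrix, so it scales to graphs where the dense matrix wouldn’t fit in memory.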
So now do you want to learn about matrix multiplication?
Maybe it’s not quite so abstract any more.
Maybe you can see there might be a use for it in data science.
Maybe linear algebra could turn out to be useful in data science?
Maybe, just maybe, the mathematician’s apparently abstract way of thinking could be useful – if we could just see the motivation first it would be a big help.