It’s common in social sciences such as economics, political science and psychology to regard a correlation coefficient of around $r = 0.7$ in a linear regression as a reason for considerable celebration.

However, it’s $r^2$ that indicates the fraction of the variation in the dependent variable accounted for by variation in the independent variable, and for $r = 0.7$ that fraction is just about 50%.
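As a reminder of where that fraction comes from, here is the usual definition for a simple linear regression with an intercept, where the squared correlation coefficient equals the coefficient of determination. The symbols $\hat{y}_i$ (fitted values) and $\bar{y}$ (mean of the dependent variable) are notation introduced here, and $r = 0.7$ is taken as an illustrative value consistent with the "just about 50%" figure:

$$
r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad r = 0.7 \;\Rightarrow\; r^2 = 0.49 .
$$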

In the physical sciences, a much higher linear model fit between the independent and dependent variables is demanded, and typically physics journals look with suspicion on $r^2$ values less than 0.95.

So if we get $r^2 = 0.99$ for a linear regression we’re in really good shape, right? We must be, surely, because only 1% of the variation in the dependent variable is not accounted for by variation in the independent variable.

Not necessarily, as the following simple example illustrates.

### An example from biology

The *Mycoplasma genitalium* bacterium has a genome of 580,076 nucleotides. In that genome there are 9,020 occurrences of the start codon ATG, and the list of locations of these ATG codons begins and ends as follows: 214, 263, 355, 452, 467, 547, 568, 686, 734, 822, 831, 850, 930, 1023, … , 579349, 579358, 579437, 579508, 579579, 579717, 579804, 579846, 579889, 579892, 579927, 579961, 580026, 580042.

A question we might ask is: are these locations of the ATG codon *uniformly distributed* along the genome?

One very simple way to approach this issue might be to produce 9020 uniformly random whole numbers in the range 1 through 580,076 and plot the locations of the ATG codons against these (sorted) random integers. In other words, see, via a linear regression, how much of the variability in the ATG locations is accounted for by variation in a uniformly random integer variable.

(Readers more familiar with linear regression might argue this is *not* a good idea since the independent variable is uniformly distributed rather than normally distributed – one of the basic assumptions of linear regression.)
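The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the plots in this post: the helper names (`find_codon_positions`, `fit_line`, `sorted_uniform_integers`) are made up for the example, positions are assumed to be 1-based, and a short random sequence stands in for the real genome, which is not reproduced here.

```python
import random


def find_codon_positions(sequence, codon="ATG"):
    """Return the 1-based start position of every occurrence of codon."""
    positions = []
    start = sequence.find(codon)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based
        start = sequence.find(codon, start + 1)
    return positions


def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


def sorted_uniform_integers(count, upper, rng):
    """Draw count uniform random integers in 1..upper and sort them."""
    return sorted(rng.randint(1, upper) for _ in range(count))


rng = random.Random(0)

# ASSUMPTION: a random 50,000-letter sequence used as a stand-in genome.
genome = "".join(rng.choice("ACGT") for _ in range(50_000))

atg_positions = find_codon_positions(genome)
random_positions = sorted_uniform_integers(len(atg_positions), len(genome), rng)

# Regress the (already sorted) ATG locations on the sorted random integers.
slope, intercept, r_squared = fit_line(random_positions, atg_positions)
print(f"{len(atg_positions)} ATG sites, slope={slope:.3f}, r^2={r_squared:.4f}")
```

With the real genome loaded as a string, the same three calls would give the regression discussed in the text.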

Here’s a plot of the ATG codon locations against 9020 uniformly random whole numbers sorted from smallest to largest:

The data points – the ATG locations versus sorted random whole numbers – are shown in blue and the regression line is shown in red.

For this regression we get $r^2 > 0.99$, telling us that less than 1% of the variation in the ATG codon locations is unaccounted for by variation in random whole numbers.

Yet the picture tells us something else: the data points are above the regression line for a while, then dip below it, and then rise above again.
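That kind of pattern can also be checked numerically rather than by eye. The sketch below uses synthetic data (a straight line plus a gentle sine wave, chosen purely for illustration and unrelated to the genome) to show that a fit can have a very high $r^2$ while its residuals change sign only a handful of times. The function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy ** 2 / (sxx * syy)


def residuals(xs, ys, slope, intercept):
    """Observed value minus fitted value at each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def count_sign_runs(values):
    """Count maximal runs of consecutive values with the same sign."""
    runs = 0
    previous = None
    for v in values:
        sign = v >= 0
        if sign != previous:
            runs += 1
            previous = sign
    return runs


# ASSUMPTION: synthetic data, a line with a small systematic wobble.
xs = list(range(1000))
ys = [x + 20.0 * math.sin(2.0 * math.pi * x / 1000.0) for x in xs]

slope, intercept, r_squared = fit_line(xs, ys)
res = residuals(xs, ys, slope, intercept)
runs = count_sign_runs(res)
print(f"r^2={r_squared:.4f}, sign runs in residuals={runs} of {len(res)} points")
```

Long stretches of same-signed residuals are a hint that the straight line is missing some structure, whatever the value of $r^2$. Note that residuals from sorted data are correlated with their neighbours even under a uniform model, so a low run count on its own is suggestive rather than conclusive.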

### A closer look

We can get a clearer picture of the differences between the data points and the regression line – the residuals – by looking at a plot of the residuals:

Maybe this is just an artifact of a particular random choice of 9020 whole numbers from 1 through 580,076?

To test this we can repeat our random choice many times. When we do this we find the pattern persists: there is a small yet genuine difference between the locations of the ATG codons in the *Mycoplasma genitalium* genome and a sorted random list of 9020 whole numbers between 1 and 580,076.
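One way to carry out such a repetition is to redo the random draw many times and average the residual at each rank, so that draw-to-draw noise cancels and any systematic shape remains. The following is a sketch under stated assumptions: `mean_residuals_over_trials` is a hypothetical helper, and the `locations` list is an artificial, mildly non-uniform stand-in, not the real ATG positions.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_residuals_over_trials(locations, genome_length, trials, rng):
    """Average, rank by rank, the regression residuals over many random draws."""
    n = len(locations)
    totals = [0.0] * n
    for _ in range(trials):
        # Fresh sorted uniform sample for each trial.
        xs = sorted(rng.randint(1, genome_length) for _ in range(n))
        slope, intercept = fit_line(xs, locations)
        for i, (x, y) in enumerate(zip(xs, locations)):
            totals[i] += y - (slope * x + intercept)
    return [total / trials for total in totals]


genome_length = 100_000
n_sites = 500

# ASSUMPTION: artificial sorted locations with a mild departure from uniformity.
locations = [
    round(genome_length * ((i + 0.5) / n_sites) ** 1.2) for i in range(n_sites)
]

mean_res = mean_residuals_over_trials(
    locations, genome_length, trials=200, rng=random.Random(1)
)
print(
    f"mean residual at start={mean_res[0]:.0f}, "
    f"middle={mean_res[n_sites // 2]:.0f}, end={mean_res[-1]:.0f}"
)
```

If the averaged residuals keep a clear shape instead of shrinking toward zero, the pattern is not an accident of one particular random draw.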

This is a small, but possibly significant, difference between the ATG codon locations and random positions in the genome.

As a biologist, wouldn’t you want to look into this further?

### The moral?

Be wary of “high” $r^2$ values: always look carefully at a plot of the residuals in a regression and try to understand what that plot is telling you.
