Let’s look at this question via some real data, used in Chapter 7 of the excellent text Stat Labs: Mathematical Statistics Through Applications by Deborah Nolan & Terry Speed.
The data is a paired list of Dungeness crab carapace (upper section of the exoskeleton) sizes, before and after molting.
Take a look at the data on the STAT LABS website before importing it into R. Note that there are 5 columns, all with headers, and that the order of the post molt and pre molt carapace sizes is swapped for our purpose: we want to predict the pre molt size from the post molt size, so we want the post molt size – the independent, or predictor, variable – to be listed first.
We can import the data directly into R from its URL, and give it a name:
> crabdata <- read.table("https://www.stat.berkeley.edu/~statlabs/data/crabs.data", header=TRUE, sep="")
We can see what structure this data has in R:
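> str(crabdata)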
'data.frame': 472 obs. of 5 variables:
$ presz : num 114 118 120 126 127 …
$ postsz: num 128 133 135 143 139 …
$ inc : num 14.1 15.1 15.4 17.1 12.6 12.9 15.6 15.1 17.1 13.2 …
$ year : int NA NA NA NA NA NA NA NA NA NA …
$ lf : int 0 0 0 0 0 0 0 0 0 0 …
This is a data frame with 5 columns. We want to extract just the first two columns and swap their order:
> carapacesizes <- data.frame(crabdata$postsz, crabdata$presz)
We check the structure of this new data frame:
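> str(carapacesizes)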
'data.frame': 472 obs. of 2 variables:
$ crabdata.postsz: num 128 133 135 143 139 …
$ crabdata.presz : num 114 118 120 126 127 …
We can now plot the carapace pre molt sizes versus the post molt sizes:
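With the post molt sizes in the first column of carapacesizes, a plot() call puts them on the horizontal axis, as intended; for example:

> plot(carapacesizes)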
We then fit a linear model, taking the pre molt size as the response and the post molt size as the predictor (an explicit formula is needed here, because lm() applied to the bare data frame would treat its first column – the post molt size – as the response):

> model <- lm(crabdata.presz ~ crabdata.postsz, data = carapacesizes)
The returned coefficients tell us that a least squares line of best fit to the data points has equation:
premolt size = -25.2137 + 1.07316 × postmolt size
How does this line look in relation to the data points?
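One way to see, sketched here since the original commands are not shown, is to redraw the scatterplot and overlay the fitted line; abline() accepts a fitted lm object directly:

> plot(carapacesizes)
> abline(model)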
The R² value reported in the summary for the model suggests, as is visually evident, that the line of least squares fit does indeed seem to be a good fit to the data points.
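The full summary is produced by summary(model); if only the R² value is wanted, it can be extracted directly:

> summary(model)$r.squared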
In this sense we seem to have a good linear model. But do we? Good for what purpose?
A key step is to look at the residuals – the differences between the data points and the values predicted by the linear model:
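These are stored in the fitted model object:

> res <- residuals(model)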
Let’s look at a histogram of the residuals:
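For example:

> hist(res)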
One check on normality, or otherwise, is the kurtosis of the empirical distribution of residuals:
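Base R has no built-in kurtosis function; one option is kurtosis() from the moments package, which reports ordinary rather than excess kurtosis, so a normal distribution scores 3:

> install.packages("moments")
> library(moments)
> kurtosis(res)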
The kurtosis of the residuals comes out well above the value of 3 for a normal distribution, so this tells us what our eye can see: the distribution of residuals is far from normal.
What other evidence might there be for this?
A common test for normality is the Anderson-Darling test, implemented via the function ad.test(), which requires the R package nortest:
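Assuming the residuals are stored in res as above:

> install.packages("nortest")
> library(nortest)
> ad.test(res)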
Anderson-Darling normality test
A = 2.5821, p-value = 1.599e-06
The low p-value of approximately 1.6 × 10⁻⁶ is strong evidence against the residuals being normally distributed, as we suspected already.
So, contrary to the statement on p. 150 of Stat Labs: Mathematical Statistics Through Applications that “the residuals are roughly normally distributed”, the residuals are anything but normally distributed.
Does this matter? What might it affect? Basically, the non-normality of the residuals restricts the predictive power of the linear model: it is probably quite safe to use this model to interpolate between data points, but not nearly so safe to use the model to extrapolate beyond the data. In particular, the usual prediction intervals for a linear model are computed on the assumption of normally distributed errors, so intervals quoted for carapace sizes outside the range of the data cannot be taken at face value.
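As a hypothetical illustration (the post molt sizes here are made up), predict() will happily return fitted values both inside and well beyond the observed range, but only the former deserves much trust:

> predict(model, newdata = data.frame(crabdata.postsz = c(140, 200)))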
The moral: a (very) high R² value is one thing, a linear model that can be used to predict beyond the data is another.