Ben Hambrecht raised a question on Twitter (@BenHambrecht): *Can a high r^2, even with correct modeling, be overinterpreted?* in relation to the post “Why is r^2 = 0.99 not necessarily good news?”

Let’s look at this question via some real data, used in Chapter 7 of the excellent text *Stat Labs – Mathematical Statistics Through Applications* by Deborah Nolan & Terry Speed.

The data is a paired list of Dungeness crab carapace (upper section of the exoskeleton) sizes, before and after molting.

The data is available at the STAT LABS: Data page, and the issue is whether the post-molt carapace size can be used as a predictor of the pre-molt carapace size.

Take a look at the data on the STAT LABS website before importing it into R. Note that there are 5 columns, all with headers, and that the order of the post molt and pre molt carapace sizes is swapped for our purpose: we want to predict the pre molt size from the post molt size, so we want the post molt size – the independent, or predictor, variable – to be listed first.

We can import the data directly into R, and name it, from the URL for the data page:

> crabdata <- read.table("https://www.stat.berkeley.edu/~statlabs/data/crabs.data", header=TRUE, sep="")

We can see what structure this data has in R:

> str(crabdata)

'data.frame': 472 obs. of 5 variables:

$ presz : num 114 118 120 126 127 …

$ postsz: num 128 133 135 143 139 …

$ inc : num 14.1 15.1 15.4 17.1 12.6 12.9 15.6 15.1 17.1 13.2 …

$ year : int NA NA NA NA NA NA NA NA NA NA …

$ lf : int 0 0 0 0 0 0 0 0 0 0 …

This is a data frame with 5 columns. We want to extract just the first two columns and swap their order:

> carapacesizes <- data.frame(crabdata$postsz, crabdata$presz)

We check the structure of this new data frame:

> str(carapacesizes)

'data.frame': 472 obs. of 2 variables:

$ crabdata.postsz: num 128 133 135 143 139 …

$ crabdata.presz : num 114 118 120 126 127 …

We can now plot the carapace pre molt sizes versus the post molt sizes:

> plot(carapacesizes)

The data points seem to hover close to a straight line, so we fit, and name, a linear model, regressing the pre-molt size on the post-molt size:

> model <- lm(crabdata.presz ~ crabdata.postsz, data = carapacesizes)

> model

Call:

lm(formula = crabdata.presz ~ crabdata.postsz, data = carapacesizes)

Coefficients:

    (Intercept)  crabdata.postsz

       -25.2137          1.07316

The returned coefficients tell us that a least squares line of best fit to the data points has equation:

*premolt-size = -25.2137 + 1.07316 × postmolt-size*
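As a quick sanity check, we can plug the first post-molt size in the data (128) into this equation by hand; the helper function below is ours, just for illustration:

```r
# Evaluate the fitted line at a given post-molt size (coefficients as printed above).
premolt_hat <- function(postmolt) -25.2137 + 1.07316 * postmolt

premolt_hat(128)   # ~112.2, close to the observed pre-molt size of 114
```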

How does this line look in relation to the data points?
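One way to see it is to redraw the scatterplot and overlay the fitted line (assuming the `carapacesizes` and `model` objects from above are still in the session):

```r
# Replot the data and add the least squares line; abline() reads the
# intercept and slope directly from the fitted model object.
plot(carapacesizes)
abline(model, col = "red")
```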

The r^2 value from the summary of the model:

> summary(model)

suggests, as is visually evident, that the line of least squares fit does indeed seem to be a good fit to the data points.
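If we only want that one number, it can be extracted directly from the summary object (again assuming the `model` object above):

```r
# Multiple R-squared reported by summary(model).
summary(model)$r.squared
```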

In this sense we seem to have a good linear model. But do we? Good for what purpose?

A key is to look at the *residuals* – the differences between the observed data values and the values predicted by the linear model:

> resids <- model$residuals

Let’s look at a histogram of the residuals:

> hist(resids)

This does not look at all like a normal distribution, which it should, at least approximately, if the usual inferences and prediction intervals from our linear model are to be trusted.
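Another quick visual check is a normal quantile–quantile plot of the residuals (assuming the `resids` vector above); heavy tails show up as points peeling away from the reference line at both ends:

```r
# Normal Q-Q plot: points far off the line in the tails signal non-normality.
qqnorm(resids)
qqline(resids)
```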

One check on normality, or otherwise, is the kurtosis of the empirical distribution of the residuals. The **kurtosis()** function is not part of base R; it is available, for example, in the **moments** package (whose convention gives a normal distribution kurtosis 3):

> install.packages("moments")

> library(moments)

> kurtosis(resids)

[1] 12.08553

This is far greater than the kurtosis of 3 for a normal distribution, and so confirms what the eye can see: the distribution of residuals is far from normal.
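For reference, the convention used here is the moment definition of kurtosis – the fourth central moment divided by the squared variance – which is easy to compute by hand (a sketch; the function name is ours):

```r
# Sample kurtosis, moment convention: E[(x - mean)^4] / (E[(x - mean)^2])^2.
kurt <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

set.seed(1)
kurt(rnorm(1e5))   # close to 3 for draws from a normal distribution
```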

What other evidence might there be for this?

A common test for normality is the Anderson-Darling test, implemented via the function **ad.test()**, which requires the R package **nortest**:

> install.packages("nortest")

> library(nortest)

> ad.test(resids)

Anderson-Darling normality test

data: resids

A = 2.5821, p-value = 1.599e-06

The very low p-value of about 1.6 × 10⁻⁶ indicates it is highly unlikely that the residuals come from a normal distribution, as we suspected already.

So, contrary to the statement on p. 150 of *Stat Labs – Mathematical Statistics Through Applications* that “the residuals are roughly normally distributed”, the residuals are anything but normally distributed.

Does this matter? What might it affect? Chiefly, the non-normality of the residuals limits the predictive power of the linear model: it is probably quite safe to use this model to *interpolate* between data points, but much less safe to use it to *extrapolate* beyond the range of the data.
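To make this concrete, here is a sketch using **predict()**, assuming the model fitted above with `crabdata.postsz` as the predictor; the post-molt sizes 140 and 220 are illustrative values, the first inside and the second well beyond the observed range:

```r
# Prediction intervals from predict(); their nominal 95% coverage relies on
# roughly normal residuals, which we have just seen does not hold here.
new_sizes <- data.frame(crabdata.postsz = c(140, 220))  # 140 in range, 220 far beyond
predict(model, newdata = new_sizes, interval = "prediction")
```

The intervals for both values look equally tidy, which is exactly why the non-normal tails are easy to overlook when extrapolating.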

The moral: a (very) high r^2 value is one thing; a linear model that can be used to predict beyond the data is another.
