Using Cross-Validation to Solve Anscombe's Quartet
How can we use machine learning techniques to solve a classic statistics problem?
Anscombe’s Quartet is a collection of four data sets with nearly identical summary statistics. It was constructed by the statistician Francis Anscombe in 1973 to illustrate the importance of plotting data and the impact of outliers on statistical analysis.
We can “solve” Anscombe’s Quartet with cross-validation in the sense that we can use statistics to determine which of the four data sets is actually well represented by a linear model. Before we dive in, let’s take a look at the data sets. Seaborn provides an easy way to load Anscombe’s Quartet, as shown below.
Load the data:
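A minimal sketch using seaborn’s built-in `load_dataset` (the `anscombe` name and the `dataset`/`x`/`y` columns come from seaborn itself):

```python
import seaborn as sns

# Anscombe's Quartet ships with seaborn as a tidy DataFrame
# with columns: dataset ("I" through "IV"), x, and y.
df = sns.load_dataset("anscombe")
print(df.head())
```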
Plot the Quartet:
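One way to reproduce the classic four-panel figure is `sns.lmplot`, which draws a scatter plot plus a fitted regression line per panel; the styling choices here are illustrative, not necessarily the original post’s:

```python
# One panel per data set, each with its own fitted regression line.
sns.lmplot(
    data=df, x="x", y="y",
    col="dataset", col_wrap=2,  # arrange the four panels in a 2x2 grid
    ci=None,                    # omit the confidence band for a cleaner look
    height=3,
)
```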
Visually, the four data sets look quite different. Group I is the only group that actually looks like observations of a linear relationship with random noise. However, they all appear to have the same linear regression line.
Next, let’s calculate all the summary statistics to show that they are (nearly) identical.
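A sketch of one way to compute these per group, assuming the `df` loaded above and using `scipy.stats.linregress` for the fit:

```python
import pandas as pd
from scipy import stats

def summarize(g: pd.DataFrame) -> pd.Series:
    # Fit y = slope * x + offset and collect the usual summary statistics.
    fit = stats.linregress(g["x"], g["y"])
    return pd.Series({
        "mean_x": g["x"].mean(), "mean_y": g["y"].mean(),
        "std_x": g["x"].std(),   "std_y": g["y"].std(),
        "correlation": fit.rvalue,
        "slope": fit.slope, "offset": fit.intercept,
        "R2": fit.rvalue ** 2,
    })

print(df.groupby("dataset").apply(summarize))
```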
|     | mean_x | mean_y | std_x   | std_y | correlation | slope | offset | R2   |
| --- | ------ | ------ | ------- | ----- | ----------- | ----- | ------ | ---- |
| I   | 9.0    | 7.5    | 3.31662 | 2.03  | 0.816       | 0.5   | 3.0    | 0.67 |
| II  | 9.0    | 7.5    | 3.31662 | 2.03  | 0.816       | 0.5   | 3.0    | 0.67 |
| III | 9.0    | 7.5    | 3.31662 | 2.03  | 0.816       | 0.5   | 3.0    | 0.67 |
| IV  | 9.0    | 7.5    | 3.31662 | 2.03  | 0.817       | 0.5   | 3.0    | 0.67 |
As expected, all summary statistics are (nearly) identical. But what if we wanted to actually figure out which data set is best described by the linear model? We can do that with cross-validation.
The idea of cross-validation is simple: you randomly hold out some portion of your data and fit the model on the reduced set. Then, you predict on the held-out set and look at the residuals. This process is repeated k times (“k-fold cross-validation”) so that every data point lands in the test set exactly once. Finally, you calculate the root mean square error (RMSE) and mean bias error (MBE) of all the residuals. A “good” model will have a low RMSE and an MBE near zero.
We can repeat this process for each of the four groups in Anscombe’s Quartet, as shown in the sketch below. Typically, we use cross-validation to pick the best model for a data set. In this case, we are finding which data set best fits a simple linear regression model.
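Here is a sketch of that procedure with scikit-learn. The original fold count and random seed aren’t shown in the post, so the split settings below (5 folds, fixed seed) are assumptions, and exact numbers will vary with the splits:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

cv_scores = {}
for name, g in df.groupby("dataset"):
    X = g[["x"]].to_numpy()
    y = g["y"].to_numpy()
    residuals = []
    # k-fold split: each of the 11 points lands in a test fold exactly once.
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        residuals.extend(y[test_idx] - model.predict(X[test_idx]))
    residuals = np.asarray(residuals)
    cv_scores[name] = {
        "rmse": np.sqrt(np.mean(residuals**2)),  # spread of out-of-sample errors
        "mbe": residuals.mean(),                 # systematic bias of predictions
    }

print(pd.DataFrame(cv_scores).T)
```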
|     | rmse     | mbe       |
| --- | -------- | --------- |
| I   | 1.223902 | -0.050753 |
| II  | 1.368272 | -0.145880 |
| III | 1.454342 | 0.208448  |
| IV  | 2.006684 | 0.497576  |
Perfect! Group I has the lowest RMSE and the MBE closest to zero in the Quartet.
Conclusions
Traditional measures such as correlation and coefficient of determination do not provide useful insight into which of the four groups is well characterized by a linear model. More generally, these measures are not appropriate for the task of model selection.
Model selection and validation procedures have been developed by the machine learning community as an alternative to these traditional measures. These newer procedures focus on the “predictive power” of a model. Typically these methods are deployed when trying to select between various fancy, non-linear ML models (say, different forms of a deep neural network).
However, the Anscombe’s Quartet example shows that these procedures are also quite useful when evaluating linear models. Cross-validation allows us to systematically determine that group I is best represented by a linear model with slope \(0.5\) and offset \(3.0\).