Author’s note: If you haven’t read my introduction to multilevel modeling (MLM) for NFL data, go do that before reading this post.
The field of statistics owes its existence to one thing: variance. No one needs a statistical model to predict that the sun will rise tomorrow (at least not for the next several billion years). And no one needs a statistical model to predict a game-ending interception by Tony Romo in 2014. (I kid.)
So if doing statistics means figuring out how and why things vary, then the manner in which our techniques handle variance is kind of important. Ordinary least squares regression (OLS), upon which most NFL analyses of the past decade have been based, handles variance in clustered samples poorly; MLM does better.
Variance Components and Assumption Violations
When I wrote the other day that NFL data violates the “independent observations” assumption of OLS, I could have rephrased it this way: OLS assumes zero variation in the intercept. And when I wrote about “fixed effects,” I could have rephrased it as, “OLS assumes zero variation in the slopes.” For instance, in Wednesday’s OLS, every team in every season got the same 0.53 intercept for Pythagorean Win Percentage (PythW%) and the same 0.08 slope for ANY/A.
The inability of OLS to account for variation in intercepts and slopes is due to the fact that it only considers total variance. To wit, you’ll recall that I reported only one number for the variance in PythW%, 2.5%. MLM, on the other hand, allows you to examine variation in intercepts and slopes because it partitions total variance into two components: within-cluster (aka within-group) variance and between-cluster (aka between-group) variance. The within-group variance is based on how far each member of the group deviates from their group average (aka the group mean), while the between-group variance is based on how far each group deviates from the overall average (aka the grand mean).
I can illustrate how this works by way of Wednesday’s clustered chart:
You’ll recall that the long black line was the overall average (i.e., the grand mean). New to the party are the short black lines representing each group’s average (i.e., the group means). From here, it’s easy. The within-group variance is the average squared distance from the four red dots in each group to that group’s short black line, while the between-group variance is the average squared distance from the short black lines to the long black line. These add up to the total variance used in OLS, which is the average squared distance from all 32 red dots to the long black line.
In our NFL example, those calculations result in a within-group variance of 1.4% and a between-group variance of 1.1%, which (a) add up to our original total variance of 2.5%, and (b) allow us to examine variation in intercepts and slopes. So we got that going for us, which is nice.
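For anyone who wants to see the partition as arithmetic rather than as dots and lines, here is a minimal sketch in Python. The three regimes and their PythW% values are made up purely for illustration (they are not the data behind the chart), the function name `partition_variance` is my own, and the sketch assumes every group is the same size, which is what makes the two pieces add up exactly.

```python
from statistics import fmean

def partition_variance(groups):
    """Split total variance into within- and between-group parts.

    `groups` is a list of equal-sized lists of observations. All three
    variances are plain averages of squared distances, as in the chart.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    group_means = [fmean(g) for g in groups]

    # Within: average squared distance from each observation to its own group mean.
    within = fmean((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)
    # Between: average squared distance from each group mean to the grand mean.
    between = fmean((gm - grand_mean) ** 2 for gm in group_means)
    # Total: average squared distance from every observation to the grand mean.
    total = fmean((x - grand_mean) ** 2 for x in all_values)
    return within, between, total

# Hypothetical PythW% values: three coaching regimes, four seasons each.
regimes = [
    [0.60, 0.55, 0.70, 0.65],
    [0.40, 0.45, 0.35, 0.50],
    [0.50, 0.55, 0.45, 0.60],
]
within, between, total = partition_variance(regimes)
print(within, between, total)
```

With equal-sized groups, `within + between` matches `total` up to floating-point rounding, which is the same bookkeeping as 1.4% + 1.1% = 2.5% above.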
The Intraclass Correlation Coefficient
Variance partitioning also allows us to double-check whether or not MLM is the appropriate technique given our data.[1]
This is done via the intraclass correlation coefficient (ICC), which represents the average correlation among individual observations within the same cluster (e.g., team seasons within a head coaching regime). If the ICC is large, then members of the same group are highly similar (i.e., the data is highly clustered); and vice versa. Calculating it requires nothing more than arithmetic:
ICC = b / (b + w)
It’s just the ratio of between-group variance, b, to total variance, b + w, and therefore represents the proportion of all variation due to differences between groups.
In our NFL example, the between-group variance was 1.1% and the total variance was 2.5%, so the ICC equals 0.44. That means 44% of the total variation in PythW% was due to differences between head coaching regimes, which is a large enough proportion for us to go ahead with MLM.[2]
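The same arithmetic as a small Python sketch, using the two variance components reported above. The function name `icc` and its argument names are my own choices for illustration.

```python
def icc(between, within):
    """Intraclass correlation: the between-group share of total variance."""
    total = between + within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return between / total

# Variance components from the NFL example, in the article's units (%).
rho = icc(between=1.1, within=1.4)
print(round(rho, 2))  # 0.44
```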
The Design Effect
Because the t-statistic is the ratio of a regression coefficient to its standard error, smaller standard errors produce larger t-statistics. Unfortunately for OLS, the correct standard errors for a clustered sample tend to be larger, sometimes much larger, than the ones OLS computes under its assumption of a truly random sample. Therefore, the fundamental problem with OLS is that it understates standard errors and artificially inflates t-statistics if your sample is actually clustered, and that makes some predictors appear meaningful when they actually aren’t (i.e., it increases Type I error).
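If you would rather see the Type I error problem than take my word for it, here is a rough simulation sketch in plain Python. Everything in it is an assumption for illustration: the data are simulated, not NFL data; the predictor is shared by every member of a group (the worst case for ignoring clustering, and not how ANY/A behaves); the outcome has an ICC of 0.44 by construction; and the predictor has no true effect at all. The names `ols_t_stat` and `false_positive_rate` are hypothetical helpers, not part of any library.

```python
import math
import random

def ols_t_stat(x, y):
    """t-statistic for the slope in a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (n - 2) / sxx)  # the usual OLS standard error
    return slope / se

def false_positive_rate(n_groups=8, group_size=4, icc=0.44, n_sims=2000, seed=1):
    """Share of simulations where OLS calls a useless predictor 'significant'."""
    rng = random.Random(seed)
    sd_between, sd_within = math.sqrt(icc), math.sqrt(1 - icc)
    hits = 0
    for _ in range(n_sims):
        x, y = [], []
        for _ in range(n_groups):
            x_group = rng.gauss(0, 1)           # predictor shared by the whole group
            u_group = rng.gauss(0, sd_between)  # group-level shift in the outcome
            for _ in range(group_size):
                x.append(x_group)
                y.append(u_group + rng.gauss(0, sd_within))  # x has no true effect
        if abs(ols_t_stat(x, y)) > 2.04:  # approx. two-sided 5% cutoff at 30 df
            hits += 1
    return hits / n_sims

rate = false_positive_rate()
print(rate)
```

If OLS were behaving, the printed rate would sit near 0.05. Under these assumptions it should come out well above that, because the OLS standard error treats 32 clustered observations as if they were 32 independent ones.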
The design effect (Deff), which is directly related to the ICC, tells us the amount of bias in OLS standard errors when we ignore clustering. The formula is
Deff = 1 + (m - 1)ρ
where m is the average cluster size and ρ is the ICC. A general rule of thumb is that researchers should use MLM when Deff is greater than 2.0 because it means the sampling variances OLS reports are half of what they should be under cluster sampling (so its standard errors are only about 71% of the correct size).[3]
To put this into practice, let’s return to our NFL example. The ICC was .44 and the average cluster size was 4, which means Deff = 1 + (3*.44) = 2.32. Therefore, the OLS sampling variance associated with the effect of ANY/A on PythW% was less than half of what it should have been (a standard error about two-thirds of the proper size), and so MLM is a more appropriate technique to use.
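Here is the same calculation as a short Python sketch. The function name `design_effect` is my own, and the last two lines rest on the standard reading of Deff as a ratio of sampling variances, so the shortfall in the standard error itself is its square root.

```python
import math

def design_effect(avg_cluster_size, icc):
    """Design effect: Deff = 1 + (m - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc

# NFL example: average cluster size of 4 and an ICC of .44.
deff = design_effect(avg_cluster_size=4, icc=0.44)

variance_ratio = 1 / deff       # naive OLS variance relative to the adjusted one
se_ratio = 1 / math.sqrt(deff)  # naive OLS standard error relative to the adjusted one

print(round(deff, 2), round(variance_ratio, 2), round(se_ratio, 2))
```

Setting `icc=0` returns a Deff of exactly 1, which is the sanity check: with no clustering, OLS standard errors need no adjustment.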
DT : IR :: TL : DR
Unlike OLS, MLM partitions total variance into within-group variance and between-group variance. This has the following advantages (among others):
- It allows us to examine variation in intercepts and slopes.
- Via the ICC, it allows us to quantify the amount of clustering in our data.
- Via the Deff, it allows us to quantify the amount of bias introduced to OLS if we ignore clustering.
[1] Note: I said double-check, not determine ad hoc.
[2] As with any correlation, describing 0.44 as small, moderate, or large depends on the field of study. The general rule of thumb is that anything approaching 0.20 suggests using MLM. That said, MLM might still be advisable with an ICC half that size if theory supports it.
[3] 1/2.0 = .50, or half, on the variance scale; for the standard error itself, 1/√2.0 ≈ .71.