Let’s say we want to find out how much of a team’s Pythagorean Win Percentage (PythW%) in a given season is driven by their differential in Adjusted Net Yards per Pass Attempt (ANY/A) during that season (i.e., offensive ANY/A minus defensive ANY/A). We collect our data from Pro Football Reference, run our typical ordinary least squares (OLS) regression, and obtain something like the following:1
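For concreteness, here’s a minimal sketch of that regression in Python. The numbers below are synthetic stand-ins (the real inputs came from Pro Football Reference), with the 0.08 slope baked in so the fit mirrors the quoted result:

```python
import numpy as np

# Synthetic stand-ins for the data described above: one row per team season.
# The true slope is set to 0.08 so the fit mirrors the quoted result;
# none of these numbers are real NFL data.
rng = np.random.default_rng(0)
n = 128                                    # team seasons
anya_diff = rng.normal(0.0, 1.5, n)        # offensive ANY/A minus defensive ANY/A
pythw = 0.50 + 0.08 * anya_diff + rng.normal(0.0, 0.05, n)

# Ordinary least squares: PythW% = intercept + slope * ANY/A Differential
X = np.column_stack([np.ones(n), anya_diff])
beta, *_ = np.linalg.lstsq(X, pythw, rcond=None)
intercept, slope = beta                    # slope recovers roughly 0.08
```

This is the whole OLS story: one intercept, one slope, every team season treated identically.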
Our analysis shows that teams add 0.08 to their PythW% for every yard they add to their ANY/A Differential, and the p-value for the effect is below 0.001. Things look great, but there’s a hidden flaw in what we’ve done. Namely, we’ve violated two underlying assumptions of regression analysis without even knowing it:
- Independent observations
- Fixed effects
Before descending into statistics jargon, let me first ask two questions:
- In general, do you believe a team’s PythW% last season has no impact on their PythW% this season?
- In general, do you believe the effect of ANY/A Differential on PythW% is the same for every team in every season?
If you answered “no” to both of these questions, excellent. Multilevel modeling (MLM) is potentially for you, and the jargon can commence.
By default, we treat NFL performance as having come from a random sample, when in reality it comes from a clustered sample, meaning that performance for one sampling unit (e.g., a team season) correlates with performances for other units in the same cluster. The classic example of a clustered sample is when data from two students are correlated because they go to the same school or have the same teacher. In an NFL context, this is analogous to data from two seasons being correlated because their “school” is New England Patriots headquarters or their “teacher” is Bill Belichick. Under clustered (rather than random) sampling, OLS understates its uncertainty: standard errors come out too small and p-values too optimistic. MLM does not have this problem.
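What clustering means statistically can be sketched in a few lines. This is purely illustrative, with made-up spreads for the regime and season effects: every season in a regime gets shifted together by that regime’s baseline, and the intraclass correlation measures how much of the total variance that shared shift accounts for.

```python
import numpy as np

# Illustrative sketch of a clustered sample: each "regime" has its own
# baseline PythW%, and all of its seasons get shifted together by that
# baseline. The regime and noise spreads are made-up numbers.
rng = np.random.default_rng(1)
n_regimes, seasons_per = 8, 4
regime_effect = rng.normal(0.0, 0.10, n_regimes)           # between-regime spread
noise = rng.normal(0.0, 0.05, (n_regimes, seasons_per))    # within-regime spread
pythw = 0.575 + regime_effect[:, None] + noise             # shape: (regime, season)

# Intraclass correlation: the share of total variance that sits between
# regimes. The closer to 1, the more strongly seasons cluster.
between = regime_effect.var()
within = noise.var()
icc = between / (between + within)
```

An ICC near zero would mean the random-sample assumption was harmless; anything well above zero is exactly the within-cluster correlation that trips up OLS.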
It’s important to stress here that clustering isn’t some arbitrary abstraction I’ve forced on the data because I noticed a pattern. Rather, it’s an organic process that happened prior to collecting the data in the first place. An NFL team is a relatively homogeneous social group in real life, so two of its performances will tend to be more similar to one another than to performances by other teams.
According to preliminary work I’ve done for future posts, the driving force underlying NFL clustering is the head coach, which makes sense given how much influence he typically has over the product on the field. For most franchises, few things on the football operations side of the organization elude the guiding hand of the head coach. With that in mind, the following graph illustrates clustering in NFL performance. It uses the same PythW% data as before, but plots it in the context of eight head coaching regimes:2
The overall average across all 32 team seasons was 57.5%, which is represented by the black line, and the total variance was 2.5%. We immediately see from the graph, however, that there’s considerable variation between different head coaching regimes: PythW% data for five of the eight clusters (1, 3, 6, 7, and 8) are packed together relatively tightly, and averages for the eight are, by and large, not equal.
The major distinction I’m trying to draw here is as follows. OLS would treat every team season as a random, independent observation from a population with a mean PythW% of 57.5% and a variance of 2.5%. MLM, in contrast, would group seasons by head coaching regime, with each regime having its own mean and variance. Not only does the MLM approach produce honest standard errors and p-values; it also conforms better to the state of reality implied by the graph.
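The core mechanic of that MLM approach, the “random intercept,” can be hand-rolled on synthetic data: each regime’s mean gets shrunk toward the overall mean instead of being taken at face value. This is not a full MLM fitting routine, just the partial-pooling idea with made-up numbers:

```python
import numpy as np

# Hand-rolled version of the MLM "random intercept" idea on synthetic
# PythW% data: 8 regimes, 4 seasons each. Not a full MLM fit, just the
# core partial-pooling mechanic with made-up spreads.
rng = np.random.default_rng(2)
pythw = 0.575 + rng.normal(0, 0.10, 8)[:, None] + rng.normal(0, 0.05, (8, 4))

grand_mean = pythw.mean()
group_means = pythw.mean(axis=1)

# Method-of-moments variance components (good enough for a sketch):
within_var = pythw.var(axis=1, ddof=1).mean()
between_var = max(group_means.var(ddof=1) - within_var / 4, 0.0)

# Shrinkage factor: how much to trust a regime's own four-season mean
# versus the league-wide mean.
k = between_var / (between_var + within_var / 4)
shrunk = grand_mean + k * (group_means - grand_mean)
```

When regimes genuinely differ (large between-regime variance), k approaches 1 and each regime mostly keeps its own mean; when they don’t, k approaches 0 and everything collapses back toward the single OLS-style mean.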
Going back to the first graph, the trendline implied that the effect of ANY/A Differential on PythW% is identical for every team in every season. Belichick’s Patriots? Belichick’s Browns? Chip Kelly’s Eagles? Rich Kotite’s Eagles? They all could be expected to improve by 0.08 PythW% for every one-yard increase in ANY/A Differential. Hopefully, you’ve realized this is one step beyond madness.
But don’t take my word for it. Here’s a revised graph showing the effect of ANY/A Differential on PythW%, except this time I’ve given a separate trendline to each of our eight clusters:
Yes, four of the eight effects mimic the black line, but the dark green cluster deviates slightly from average, the pink and gold clusters deviate moderately, and the red cluster deviates substantially. This is a big deal: consider how our conclusion for Team Red changes when we move from an OLS framework to an MLM framework (i.e., from the first graph to this one). Before, we would have asserted that Team Red adds 0.08 to their PythW% for every yard they add to their ANY/A Differential, because our analysis assumed they were just like every other team. Now, though, we see that their ANY/A Differential actually isn’t that influential at all.3
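The varying-slopes idea behind that revised graph can be sketched the same way: one pooled slope for everybody versus a separate slope per cluster. The clusters and slopes below are synthetic; the last cluster is given a near-zero true slope to play the role of Team Red.

```python
import numpy as np

# Sketch of varying slopes: one pooled OLS slope versus a separate slope
# per cluster. Synthetic data; the last cluster's near-zero true slope
# stands in for "Team Red" above.
rng = np.random.default_rng(3)
true_slopes = np.array([0.08, 0.08, 0.08, 0.08, 0.07, 0.05, 0.05, 0.01])
x = rng.normal(0.0, 1.5, (8, 4))                  # ANY/A Differential
y = 0.55 + true_slopes[:, None] * x + rng.normal(0.0, 0.01, (8, 4))

def ols_slope(x, y):
    """Simple OLS slope: cov(x, y) / var(x)."""
    x, y = x.ravel(), y.ravel()
    return np.cov(x, y)[0, 1] / x.var(ddof=1)

pooled = ols_slope(x, y)                          # one slope fits all
per_cluster = np.array([ols_slope(x[i], y[i]) for i in range(8)])
# per_cluster[-1] should sit well below the pooled slope, like Team Red's
```

The pooled slope averages over clusters that genuinely behave differently, which is exactly how it manufactures the misleading 0.08-for-everyone conclusion.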
DT : IR :: TL : DR
OLS assumes that NFL team seasons are independent of each other and that predictor variables affect every team season to the same extent. In the real world, these assumptions are frequently violated because team seasons are clustered by head coaching regime. MLM accounts for that clustering, so it leads to better inferences and reveals information otherwise hidden by standard OLS.
The analyses here are strictly pedagogical in nature. Don’t fret if you notice something that begs for a more nuanced explanation. This fire is a slow burn by design. ↩
Data came from the most recent four seasons of Jack Del Rio’s Jaguars (Cluster 1), Gary Kubiak’s Texans (2), Tom Coughlin’s Giants (3), Mike Smith’s Falcons (4), Pete Carroll’s Seahawks (5), Sean Payton’s Saints (6), Mike McCarthy’s Packers (7), and Bill Belichick’s Patriots (8). ↩
In fact, the slope has dropped to 0.03, which isn’t statistically significant. ↩