Nearly a quarter century ago, current Houston Rockets general manager Daryl Morey modified Bill James’ Pythagorean Win Percentage (PythW%) formula by determining that in an NFL context the exponent, x, equals 2.37.
In 2011, Football Outsiders (FO) adapted Clay Davenport’s “Pythagenport” method (PythPortW%), which allows x to vary between teams. In their formulation, x is calculated as follows for each NFL team:
Finally, writing for Football Perspective in 2013, Neil Paine adapted David Smyth’s “Pythagenpat” method (PythPatW%), which also allows x to vary between teams. Here, though, x is calculated as follows:[1]
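For readers who want to see the arithmetic, here's a rough sketch of the three formulas in Python. The Pythagorean form and Morey's 2.37 exponent come from the text above. The two variable-exponent formulas are assumptions: they're the commonly cited versions (a coefficient of 1.5 on a base-10 log for Pythagenport, and a power of 0.251 for Pythagenpat), and they may not match the original posts exactly. The function and parameter names (`pf` for points scored, `pa` for points allowed, `games`) are just illustrative.

```python
import math

MOREY_EXPONENT = 2.37  # fixed NFL exponent cited in the text


def pyth_win_pct(pf, pa, x=MOREY_EXPONENT):
    """Pythagorean win percentage: PF^x / (PF^x + PA^x)."""
    return pf ** x / (pf ** x + pa ** x)


def pythagenport_exponent(pf, pa, games, coef=1.5):
    # ASSUMPTION: commonly cited form, x = 1.5 * log10((PF + PA) / G).
    return coef * math.log10((pf + pa) / games)


def pythagenpat_exponent(pf, pa, games, power=0.251):
    # ASSUMPTION: commonly cited form, x = ((PF + PA) / G) ** 0.251.
    return ((pf + pa) / games) ** power


if __name__ == "__main__":
    # Hypothetical team: 400 points scored, 300 allowed, 16 games.
    pf, pa, games = 400, 300, 16
    x_port = pythagenport_exponent(pf, pa, games)
    x_pat = pythagenpat_exponent(pf, pa, games)
    print("PythW%    ", round(pyth_win_pct(pf, pa), 4))
    print("PythPortW%", round(pyth_win_pct(pf, pa, x_port), 4))
    print("PythPatW% ", round(pyth_win_pct(pf, pa, x_pat), 4))
```

The only thing that changes between the three metrics is the exponent; the win-percentage formula itself is shared, which is one reason to expect them to agree closely.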
The moral of this history is that the underlying motivation for both PythPortW% and PythPatW% was to improve upon PythW%. To the extent that they produced a higher correlation (or lower root mean squared error) with actual wins than does PythW%, it seems as though the missions were a success.
That said, in Chase Stuart’s own use of PythPatW%, he made the following astute observation (my emphasis added):
It’s important to keep in mind that the differences between Pythagenport and Pythagenpat are too minor to have any practical effect. The same is mostly — but not entirely — true with either of those versus the Pythagorean record.
From a measurement perspective, I can’t overstate the importance of what Stuart’s implying here. Remember, we don’t create performance stats in NFL analytics just to make ourselves look smart. Rather, their fundamental purpose is two-fold:
- To translate an abstract concept (e.g., “team goodness”) into a concrete measurement (e.g., PythW%, DVOA, SRS, etc.)
- To use these measurements as an objective means for judging teams and players
Whether we explicitly acknowledge it or not, the whole point of NFL analytics is practical application. So if, as Stuart notes, there’s no practical difference between PythW%, PythPortW%, and PythPatW%, then preferring one over the others might simply boil down to a matter of taste, and that does us no favors in the eyes of the public: “You say Pythagorean. He says Pythagenpat. Pythagorean, Pythagenpat; let’s call the whole thing off.”
Therefore, today I’m going to borrow a technique from measurement methodology, and provide a more rigorous answer to the question, “Is there a practical difference between PythW%, PythPortW%, and PythPatW%?”
Before getting to my validity analysis, it’s helpful to provide some more detail on what we already know statistically.
Using data from 1990 to 2010, FO reported that PythPortW% had a correlation of 0.9134 with actual win percentage (ActW%), which was higher than the 0.9120 correlation for PythW%. Meanwhile, Paine found that the root mean squared error (RMSE) between PythPatW% and ActW% for 1970-2012 was 1.204, which was lower than the 1.205 RMSE for PythPortW%.[2]
You might notice that, for both newer metrics, most of the heavy lifting in drawing a distinction is being done by intentional rounding: Using two decimal places instead of three or four produces hardly any distinction at all. Therefore, as far as previous research is concerned, it seems that Stuart’s observation is correct: There’s no practical difference between the three methods.
Even so, a more rigorous way to approach this problem is founded on so-called statistical unity, which occurs when 1.00 is a plausible value for the correlation between two metrics. If unity exists, then there’s no distinction between the conclusions we can draw from either metric (i.e., they lack evidence supporting discriminant validity).
Here’s a synopsis of what I did:
1. I calculated ActW%, PythW%, PythPortW%, and PythPatW% for every NFL team from 1970 to 2013 (n = 1,285).
2. From the full group of post-merger teams, I took a random sample of 642 teams (i.e., half of them), and calculated
   - The correlations between the four metrics for that sample.
   - The standard errors of the correlations.
3. I repeated Step 2 a total of 25 times so that the correlations and standard errors converged.
4. I averaged the results of the 25 iterations, and constructed a 95% confidence interval for the final correlation estimate.
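The steps above can be sketched roughly as follows. This is not the code behind the analysis; it's a minimal illustration under stated assumptions. In particular, the standard error uses the textbook formula sqrt((1 - r²) / (n - 2)), and the interval uses a z-value of 1.96. Both are assumptions, although with n = 642 they do reproduce the standard errors and intervals reported in the results table to the precision shown. The function names and the `seed` parameter are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def corr_se(r, n):
    # ASSUMPTION: textbook standard error of a Pearson correlation.
    return math.sqrt((1 - r * r) / (n - 2))


def split_half_estimate(xs, ys, iterations=25, z=1.96, seed=0):
    """Average r and SE over repeated random half-samples, then build a CI."""
    rng = random.Random(seed)
    n = len(xs)
    half = n // 2  # e.g., 642 when n = 1,285
    rs, ses = [], []
    for _ in range(iterations):
        idx = rng.sample(range(n), half)  # sample without replacement
        r = pearson_r([xs[i] for i in idx], [ys[i] for i in idx])
        rs.append(r)
        ses.append(corr_se(r, half))
    r_bar = sum(rs) / iterations
    se_bar = sum(ses) / iterations
    return r_bar, se_bar, (r_bar - z * se_bar, r_bar + z * se_bar)
```

To use it, pass two parallel lists with one entry per team-season, such as ActW% and PythW%, and repeat for each pair of metrics.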
Below is a table showing the final result after Step 4:
| Comparison | r | SE | 95% CI | Distinct? |
|---|---|---|---|---|
| ActW% ↔ PythW% | 0.91337 | 0.01609 | (0.88183, 0.94491) | YES |
| ActW% ↔ PythPortW% | 0.91539 | 0.01591 | (0.88420, 0.94658) | YES |
| ActW% ↔ PythPatW% | 0.91533 | 0.01592 | (0.88413, 0.94653) | YES |
| PythW% ↔ PythPortW% | 0.99955 | 0.00118 | (0.99724, 1.00187) | NO |
| PythW% ↔ PythPatW% | 0.99959 | 0.00113 | (0.99738, 1.00180) | NO |
| PythPortW% ↔ PythPatW% | 0.99999 | 0.00021 | (0.99957, 1.00040) | NO |
Going from left to right in the table, the “Comparison” column identifies the two metrics being correlated, “r” gives the average correlation over 25 random NFL samples, “SE” gives the average standard error for the correlation, and “95% CI” gives the correlation’s 95% confidence interval. The final column indicates whether that interval excludes 1.00 (i.e., whether the two metrics are statistically distinct).
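As a quick check on the table, here's a small sketch that rebuilds each interval from the reported r and SE and tests whether it contains 1.00. It assumes the intervals are r ± 1.96 × SE, which matches the reported bounds to within rounding; the function names are illustrative.

```python
def confidence_interval(r, se, z=1.96):
    # ASSUMPTION: normal-approximation interval, r +/- z * SE.
    return (r - z * se, r + z * se)


def is_distinct(r, se, z=1.96, unity=1.0):
    """True when the interval excludes 1.00, i.e., no statistical unity."""
    lo, hi = confidence_interval(r, se, z)
    return not (lo <= unity <= hi)


# (comparison, r, SE) as reported in the table
results = [
    ("ActW% vs PythW%", 0.91337, 0.01609),
    ("ActW% vs PythPortW%", 0.91539, 0.01591),
    ("ActW% vs PythPatW%", 0.91533, 0.01592),
    ("PythW% vs PythPortW%", 0.99955, 0.00118),
    ("PythW% vs PythPatW%", 0.99959, 0.00113),
    ("PythPortW% vs PythPatW%", 0.99999, 0.00021),
]

for label, r, se in results:
    lo, hi = confidence_interval(r, se)
    verdict = "YES" if is_distinct(r, se) else "NO"
    print(f"{label}: ({lo:.5f}, {hi:.5f}) distinct? {verdict}")
```

Because the inputs are already rounded to five decimal places, the last digit of a recomputed bound can differ slightly from the table.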
Based on my results, we can reasonably conclude the following:[3]
- PythW%, PythPortW%, and PythPatW% are distinct from ActW% because none of their confidence intervals include 1.00 (i.e., statistical unity).
- PythW%, PythPortW%, and PythPatW% are not distinct from each other because all of their confidence intervals include 1.00.
- PythW%, PythPortW%, and PythPatW% are also not distinct in how well they track ActW%, because their confidence intervals for the correlation with ActW% all overlap with each other.
Therefore, contrary to their underlying purpose, PythPortW% and PythPatW% do not improve on PythW% statistically. Unless you believe that NFL metrics have the precision of a Six Sigma® process,[4] using PythPortW% and PythPatW% to judge teams is no more valid (and no less valid) than using PythW%.
But if they’re essentially saying the same thing, then the question becomes, “Which one should we use?” As is often the case in situations like this, the answer depends on how they differ theoretically; but that’s a post for another day.
DT : IR :: TL : DR
For the past quarter century, the NFL analytics community has been trying to improve on the PythW% formula developed by Daryl Morey. Using a discriminant validity test based on the concept of statistical unity, we can conclude that newer formulas have not succeeded in doing so. Therefore, preferring one over the others requires an evaluation of each one’s theoretical chops.
[3] Even without comprehending the statistics whatsoever, the fact that I had to intentionally round these values to five decimal places should be enough of an indication.
[4] Note: They don’t.