Reliability analysis is a vital, though underutilized, tool in NFL research because it helps establish a stat’s trustworthiness (or lack thereof). Among the many reasons our stats-based arguments end up being wrong, one we often overlook is jumping the gun: We cite a stat before it’s become a reliable indicator of “true” performance. One way to avoid this pitfall is to identify when a given stat “stabilizes,” i.e., the number of plays we must observe before it represents 50% “true” ability and 50% luck. Today, I’m going to tell you about a reliability analysis I did that asked the question, “When does Adjusted Net Yards per Attempt (ANY/A) stabilize?”
Primarily, I’m taking my cue from this post by Monte McNair at Outside the Hashes, which reported a similar analysis for more basic quarterback (QB) stats. Here’s a brief summary of the methods he used:
1. Collect data for all QBs with at least 200 total pass attempts from 2000 to 2009.
2. Starting with QBs that had 200+ attempts, randomly select two sets of 100 attempts for each QB, and calculate the correlation (r) between the two sets (aka split-half reliability).
3. Perform 25 iterations of Step 2 so as to not end up with a misleading r after only one random split.
4. Repeat Steps 2 and 3 for QBs with 300+, 400+, 500+ attempts, and so on. In other words, randomly split the 300+ group into two sets of 150, etc.
5. Using the formula (attempts/2) * [(1 - r)/r], calculate the number of attempts at which the explained variance, R², would mathematically equal 0.5. For instance, using yards per attempt by QBs with 800+ attempts, he found a correlation of 0.49, so this calculation was 400 * (0.51/0.49) = 416 attempts, which represented the point at which yards per attempt was half-skill/half-luck.
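For readers who want to try this at home, here is a minimal Python sketch of the steps above. It is not McNair’s actual code: the `plays_by_qb` layout (a dict mapping each QB to a list of per-attempt values), the function names, and the choice to average r across the 25 iterations are all my assumptions.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_half_r(plays_by_qb, half_size, iterations=25, seed=0):
    """Average split-half correlation over several random splits.

    plays_by_qb is a hypothetical structure: {qb_name: [value per attempt]}.
    QBs with fewer than 2 * half_size attempts are skipped.
    Averaging r across iterations is an assumption, not a documented step.
    """
    rng = random.Random(seed)
    eligible = [p for p in plays_by_qb.values() if len(p) >= 2 * half_size]
    rs = []
    for _ in range(iterations):
        first, second = [], []
        for plays in eligible:
            # Draw two non-overlapping random sets of half_size attempts.
            sample = rng.sample(plays, 2 * half_size)
            first.append(sum(sample[:half_size]) / half_size)
            second.append(sum(sample[half_size:]) / half_size)
        rs.append(pearson_r(first, second))
    return sum(rs) / len(rs)


def stabilization_point(half_size, r):
    """Attempts at which the stat is half-skill/half-luck: (attempts/2) * (1 - r) / r."""
    return half_size * (1 - r) / r
```

Plugging in the yards-per-attempt example, `stabilization_point(400, 0.49)` comes out to roughly 416 attempts, matching the figure above.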
Although McNair’s analysis — and those of sabermetricians — provided a road map, I improved the route to our destination as follows:
- I used QB data from 2002 to 2013, which has more inherent stability given the static number of teams and divisional alignment.
- Because this is a reliability analysis of ANY/A, I used net attempts (aka dropbacks) instead of attempts.
- To control for the possibility that QB movement between teams would muddy the waters of my random sampling, I only included QBs if they had 200+ (or 300+ or 400+, etc.) dropbacks for the same team.
- Rather than concluding that stability occurs around some number of attempts, I calculated a weighted average of my Step 5 results (by sample size) to produce a concrete threshold.[1]
- Using Tom Tango’s method for regressing MLB stats to the mean,[2] which Neil Paine used in an NFL team context on the now-defunct Pro-Football-Reference blog, I calculated the “true” ANY/A for a hypothetical QB that’s posted 6.0 ANY/A through X number of dropbacks.
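The weighted-average step is simple enough to sketch in a few lines of Python. This assumes “by sample size” means weighting each group’s half-skill/half-luck threshold by its number of QBs; the function name and the example numbers are made up for illustration and are not taken from my results.

```python
def weighted_threshold(rows):
    """Weighted average of stabilization thresholds.

    rows is a list of (n_qbs, threshold) pairs, one per dropback group.
    Weighting by the number of QBs is an assumed reading of "by sample size".
    """
    total_n = sum(n for n, _ in rows)
    return sum(n * threshold for n, threshold in rows) / total_n


# Made-up groups for illustration only: 100 QBs at 300 and 50 QBs at 360.
print(weighted_threshold([(100, 300.0), (50, 360.0)]))  # 320.0
```

Weighting this way keeps the small groups at the high end of the dropback range from pulling the threshold around as much as the large ones.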
Behold the brass tacks:
| Dropbacks | n | r | R² = 0.50 | Avg ANY/A | Obs 6.00 ANY/A |
In this table, n is the number of included QBs, “R² = 0.50” is the number of dropbacks at which the given random split would produce an ANY/A that’s half-skill/half-luck, and “Obs 6.00 ANY/A” is the “true” ANY/A for a QB with a 6.00 ANY/A after that row’s number of dropbacks. So for instance, there were 88 QBs who had (at least) two sets of 300 dropbacks for the same team, and they averaged 5.82 ANY/A overall. Given their split-half correlation of 0.46, this group could be expected to generate a 0.50 correlation at 346 dropbacks. And given their 5.82 average ANY/A, a QB that produced 6.00 ANY/A after 300 dropbacks would have a “true” ANY/A of 5.87.
In the “Wtd Average” row, we find out when ANY/A stabilizes: 326 dropbacks. In this row, we also see that a QB with 6.00 ANY/A after 326 dropbacks has a “true” ANY/A of 5.88.[3]
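Here is a small Python sketch of that regression-to-the-mean calculation. The function and parameter names are mine; the 326-dropback threshold and the 5.75 league average are the figures used in this post. It weights the observed ANY/A by the QB’s own dropbacks and the league average by the stabilization point.

```python
def regress_to_mean(observed, dropbacks, league_avg=5.75, stabilization=326):
    """Estimate "true" ANY/A by blending observed ANY/A with the league average.

    The observed value is weighted by the QB's dropbacks; the league
    average is weighted by the stabilization point (326 in this post).
    """
    return (observed * dropbacks + league_avg * stabilization) / (
        dropbacks + stabilization
    )


# A QB with 6.00 ANY/A after exactly 326 dropbacks lands halfway
# between his observed figure and the league average.
print(round(regress_to_mean(6.00, 326), 3))  # 5.875
```

The more dropbacks a QB has, the less the league average matters: at 326 dropbacks the blend is 50/50, and beyond that the observed number carries most of the weight.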
Compared to the basic QB stats McNair evaluated, ANY/A stabilizes before interception rate (~2,500 attempts) and touchdown rate (~1,125 attempts), but after completion rate (~250 attempts) and sack rate (~200 dropbacks). Most interesting for the present analysis, however, is that ANY/A stabilizes before standard yards per attempt (~400 attempts), which suggests ANY/A is more reliable.
DT: IR :: TL : DR
If our arguments are to garner more weight with the public, identifying when various stats “stabilize” should be an important goal for NFL analytics. To get the ball rolling, I found out when ANY/A stabilizes.
Using QB data from 2002 to 2013, I iterated split-half correlations and used a couple of mathematical formulas to establish that ANY/A stabilizes at 326 dropbacks. Furthermore, this value can be used to calculate the “true” ANY/A for any QB after any number of dropbacks: Simply add 326 dropbacks’ worth of league average ANY/A to his current performance.
In short, if we’re going to use ANY/A to judge a QB, we should hold off until he’s had 326 dropbacks. Otherwise, our judgments, like ANY/A, will probably contain more error than truth.
[1] No, I’m not claiming this threshold is magical. Read the next bullet point for proof.
[2] Hey, did you know measurement reliability forms the foundation of regression to the mean?
[3] If you want to do the math, the formula is [(6.00 * 326) + (5.75 * 326)] / (326 + 326) = 3,830.5/652 = 5.88. Or more generally, it’s [(Observed ANY/A * Observed dropbacks) + (League Average ANY/A * 326)] / (Observed dropbacks + 326).