As I’ve already demonstrated with Adjusted Net Yards per Attempt (ANY/A), one way to evaluate the reliability of an NFL stat is to calculate its split-half correlation, which lets us estimate the number of plays we must observe before the stat represents 50% “true” ability and 50% luck. Between my work and that of Monte McNair at Outside the Hashes, here’s our current knowledge about the point at which various quarterback (QB) stats “stabilize:”

Stat | Play Type | Stabilization Point | Observed after 200 plays | "True" after 200 plays |
---|---|---|---|---|
Sack Rate | Dropbacks | ~200 | 4.0% | 5.4% |
Completion Rate | Attempts | ~250 | 65.0% | 62.9% |
Yards per Completion | Completions | ~325 | 12.5 | 11.9 |
Adjusted Net Yards per Attempt | Dropbacks | 326 | 7.0 | 6.1 |
Yards per Attempt | Attempts | ~400 | 7.5 | 7.1 |
Touchdown Rate | Attempts | ~1125 | 5.5% | 4.2% |
Interception Rate | Attempts | ~2500 | 2.0% | 3.1% |

So given research thus far, Sack Rate is currently the most reliable QB stat, followed closely by Completion Rate, Yards per Completion, ANY/A, and Yards per Attempt; Touchdown Rate and Interception Rate lag far behind.

Today, I add to this list a bugaboo of the NFL analytics community: Passer Rating (PR).

### Methods

They’re virtually the same as last time:

1. I collected data for all QBs with at least 200 total pass attempts from 2002 to 2013.
2. To control for team effects, I included only those QBs who had 200+ (or 300+ or 400+, etc.) attempts for the same team.
3. Starting with QBs who had 200+ attempts, I randomly selected two sets of 100 attempts for each QB and calculated the correlation (*r*) between the two sets.
4. I performed 25 iterations of Step 3 so that *r* converged.
5. I repeated Steps 3 and 4 for QBs with 300+, 400+, 500+ attempts, and so on.
6. For each attempts group, I calculated the number of attempts at which the explained variance, *R*^{2}, would mathematically equal 0.5.^{1}
7. I calculated a weighted average of my Step 6 results.^{2}
8. I calculated the “true” PR for a hypothetical QB that’s posted a 95.0 PR through X number of attempts.
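
The split-half procedure described above can be sketched in Python roughly as follows. This is a minimal illustration, not the code behind the table: the per-attempt array layout, the function names, and the choice to average *r* across the 25 iterations are all assumptions. The rating function uses the standard NFL passer rating formula. The `stabilization_point` formula, n(1 − r)/r, is inferred from the Results table (it reproduces that column to within rounding of *r*) and is not taken from the footnotes.

```python
import numpy as np

def passer_rating(plays):
    """NFL passer rating for an (attempts x 4) array of per-attempt rows.
    Assumed column layout: completion (0/1), yards, touchdown (0/1),
    interception (0/1)."""
    att = len(plays)
    cmp_, yds, td, ints = plays.sum(axis=0)
    parts = [
        (cmp_ / att - 0.3) * 5,      # completion percentage component
        (yds / att - 3) * 0.25,      # yards per attempt component
        (td / att) * 20,             # touchdown rate component
        2.375 - (ints / att) * 25,   # interception rate component
    ]
    # Each component is clamped to the range [0, 2.375].
    return sum(min(max(p, 0.0), 2.375) for p in parts) / 6 * 100

def split_half_r(qbs, n, iterations=25, seed=0):
    """Mean split-half correlation across QBs for two random, disjoint
    sets of n attempts each. Every QB array needs at least 2 * n rows."""
    rng = np.random.default_rng(seed)
    rs = []
    for _ in range(iterations):
        first, second = [], []
        for plays in qbs:
            idx = rng.permutation(len(plays))[: 2 * n]
            first.append(passer_rating(plays[idx[:n]]))
            second.append(passer_rating(plays[idx[n:]]))
        rs.append(np.corrcoef(first, second)[0, 1])
    # Averaging across iterations is an assumption about how r "converged".
    return float(np.mean(rs))

def stabilization_point(n, r):
    """Attempts at which the stat is half skill, half luck.
    Assumed formula, inferred from the Results table."""
    return n * (1 - r) / r
```

For example, `stabilization_point(300, 0.55)` gives roughly 245, in line with the 300-attempt row of the Results table.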

### Results

Here’s the payoff:

Attempts | n | r | R^{2} = 0.50 | Avg Rating | Obs 95.0 Rating |
---|---|---|---|---|---|
100 | 192 | 0.35 | 189 | 78.0 | 83.8 |
150 | 162 | 0.40 | 226 | 78.2 | 84.9 |
200 | 128 | 0.46 | 235 | 78.3 | 86.0 |
250 | 110 | 0.52 | 235 | 78.4 | 87.0 |
300 | 88 | 0.55 | 245 | 78.5 | 87.6 |
350 | 78 | 0.62 | 214 | 78.6 | 88.8 |
400 | 69 | 0.66 | 204 | 78.6 | 89.5 |
450 | 66 | 0.66 | 230 | 78.7 | 89.5 |
500 | 60 | 0.68 | 240 | 78.8 | 89.7 |
Wtd Average | | | 221 | 78.4 | 86.7 |

Like last time, *n* is the number of QBs, “*R*^{2} = 0.50” is the number of attempts at which PR is half-skill/half-luck, and “Obs 95.0” is the “true” PR for a QB with a 95.0 PR after that row’s number of attempts. So for instance, there were 110 QBs who had (at least) two sets of 250 attempts for the same team, and those QBs had an average PR of 78.4. Given the split-half correlation of 0.52, PR stabilizes at 235 attempts for this group. And given their 78.4 average PR, a QB with a 95.0 PR after 250 attempts would have a “true” PR of 87.0.

In the “Wtd Average” row, we find that PR represents 50% skill and 50% luck after **221 attempts**. Therefore, if we want to know a QB’s “true” PR after a given number of attempts, we add 221 attempts’ worth of average PR performance to his current PR.^{3} For example, that row tells us that a QB with a 95.0 PR after 221 attempts has a “true” PR of 86.7.^{4}
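
As a rough check on that arithmetic, here is a small Python sketch that weights each row by its number of QBs and then regresses an observed 95.0 rating toward the average. The function names are hypothetical, and the formula in `true_rating` is one reading of “adding 221 attempts’ worth of average performance,” not a quote from the footnotes.

```python
# Columns copied from the results table (one entry per attempts group).
n_qbs = [192, 162, 128, 110, 88, 78, 69, 66, 60]
stab_points = [189, 226, 235, 235, 245, 214, 204, 230, 240]
avg_ratings = [78.0, 78.2, 78.3, 78.4, 78.5, 78.6, 78.6, 78.7, 78.8]

def weighted_mean(values, weights):
    """Average of values, weighted by the number of QBs in each group."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def true_rating(observed, attempts, average, stabilization):
    """Blend the observed rating with `stabilization` attempts of average
    play (assumed form of the regression-to-the-mean adjustment)."""
    return (observed * attempts + average * stabilization) / (attempts + stabilization)

wtd_stab = weighted_mean(stab_points, n_qbs)   # about 221.1 attempts
wtd_avg = weighted_mean(avg_ratings, n_qbs)    # about 78.4

print(round(wtd_stab, 2), round(wtd_avg, 1))
print(round(true_rating(95.0, 221, 78.4, 221), 1))  # 86.7
```

The same function reproduces the 250-attempt row: `true_rating(95.0, 250, 78.4, 235)` rounds to 87.0.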

### Discussion

Admittedly, this result made me do a double-take. After adding the league-average number of sacks (about 28 over 221 attempts) to convert attempts into dropbacks and make an apples-to-apples comparison possible, PR appears to stabilize 77 dropbacks before ANY/A (i.e., 326 – 221 – 28 = 77). So it’s a more *reliable* QB stat; but does that make it a *better* QB stat?

No.

ANY/A is still a better stat to use when judging QBs because of the distinction between reliability and validity: Just because we might be able to judge a QB *earlier* using PR doesn’t mean that said judgment is *more likely to be correct*. Reliability tells us about a measure’s consistency; validity tells us about the accuracy of our judgments based on that measure. Below is a classic illustration of this distinction:

It helps to think of this in the context of a QB’s throwing accuracy on a 10-yard inside slant, where his arm is our measuring tool. Starting at the top right and moving clockwise:

- QBs like Blaine Gabbert tend to miss the receiver, but their misses are all over the place — sometimes low and ahead of the receiver, other times high and behind them, and so on. Gabbert’s arm isn’t reliable (i.e., you never know where the ball’s going to go), but we can still draw the valid conclusion that he’s an inaccurate passer.
- QBs like Drew Brees tend to hit the receiver, and almost always in the hands. Brees’ arm is reliable, and we can draw the valid conclusion that he’s an accurate passer.
- QBs like a young Alex Smith tend to consistently throw high and behind the receiver. Back then, Smith’s arm was reliable (i.e., we knew where the ball was going to go), but it would have been wrong to conclude that he was an inaccurate passer. Having only seen him miss high and behind, he could just as easily have been an accurate passer with a consistent flaw in his mechanics. (This is exactly what turned out to be the case.)
- Finally, QBs like Matthew Stafford tend to overthrow the receiver, but sometimes those overthrows are behind the receiver, while other times they’re out in front. Stafford’s arm is unreliable because we’re not that sure which way the overthrow’s going to miss, and we also can’t conclude he’s an inaccurate passer because we’ve only seen him miss to the high side (i.e., maybe there’s a flaw in his mechanics).

And so it goes with comparing PR to ANY/A as a measure of QB “goodness:” PR is more like a young Alex Smith; ANY/A more like Drew Brees. PR may stabilize before ANY/A, but in comparison it leaves much more of QB “goodness” out of the equation (i.e., the value of sacks, touchdowns, and interceptions), thereby failing us on the grounds of content validity.

### DT : IR :: TL : DR

I found that it takes 221 pass attempts before a QB’s Passer Rating represents 50% of his “true” skill and 50% luck. That’s fewer than the 326 dropbacks it takes for ANY/A to stabilize. Nevertheless, it’s still preferable to use ANY/A for judging QBs because drawing conclusions slightly faster is less important than drawing conclusions much more accurately.

Danny, what was the weighted average calculation you used? I am not getting the weighted average dropbacks of 221.

221 is the average sum product of the “n” column and the “R^2 = .50” column:

[(192*189)+(162*226)+(128*235)+(110*235)+(88*245)+(78*214)+(69*204)+(66*230)+(60*240)]/(192+162+128+110+88+78+69+66+60)

=

(36288+36612+30080+25850+21560+16692+14076+15180+14400)/953

= 210738/953

= 221.13