Welcome to Measurement Validity and Reliability

Measurement validity is the degree to which our statistics-based conclusions are not erroneous, and reliability is the degree to which a stat represents the “true” value of a particular characteristic. For example, using a stat like Football Outsiders’ (FO) Defense-Adjusted Value Over Average (DVOA) to judge the “goodness” of NFL teams is valid only if such judgments are unlikely to be wrong. DVOA’s reliability, on the other hand, depends on how much it represents “true goodness” rather than “random goodness.”

For obvious reasons, measurement validity and reliability are a big deal in education and psychology, so it makes sense that one of the most influential documents on the subject is The Standards for Educational and Psychological Testing.[1]

Three of the five types of validity evidence listed in The Standards are relevant to NFL analytics:

  1. Evidence based on test content
  2. Evidence based on internal structure
  3. Evidence based on relations to other variables

Test Content

According to The Standards (p. 11):

Evidence based on test content can include logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores.


I’ll translate by way of a familiar NFL example. Say you wanted to argue that Drew Brees was a better quarterback (QB) than Colin Kaepernick in 2013. Well, citing Brees’ superior Passer Rating would be a less valid way to make that argument than citing his superior Total QBR. That’s because Total QBR measures a much larger proportion of what a QB does when he’s on the field, and therefore represents more of the content domain.

Internal Structure

On this front, The Standards says (p. 13):

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based.


This is a tough one to explain, so bear with me.

Let’s take DVOA as an example. Conceptually, it’s a measure of an abstract idea (aka construct) called “NFL team goodness,” with the team’s individual plays serving as “test items,” and its three team units (i.e., offense, defense, and special teams) serving as “test components.” The internal structure of DVOA, then, is how it quantifies (a) the relationships between performances on different plays, and (b) the relationships between performances by different team units. Simply put, it’s the weighting system FO uses to calculate DVOA. Therefore, validity evidence supporting the use of DVOA based on its internal structure would show that its weighting system closely matches how performances on different plays and performances by different team units are related to one another in real life.

Just imagine a hypothetical scenario whereby NFL teams tended to perform as well on 1st-and-10 as they did on 2nd-and-5. If DVOA weighted these two performances equally, then that would be a credit to the validity of using DVOA as a measure of “NFL team goodness.”


Relations to Other Variables

Back to The Standards (p. 13):

Analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence. External variables may include measures of some criteria that the test is expected to predict, as well as relationships to other tests hypothesized to measure the same constructs, and tests measuring related or different constructs.


This passage implies three subcomponents relevant to NFL analytics:

  1. Convergent evidence
  2. Generalizability evidence
  3. Test-criterion evidence

Again, I’ll translate in the context of DVOA. Convergent evidence might consist of showing that it correlates positively with other established measures of “team goodness” (e.g., Pythagorean Win Percentage, Pro Football Reference’s Simple Rating System, etc.). Meanwhile, this FO article is a good example of generalizability evidence: DVOA’s current version (7), which is based on the 1991-2011 NFL seasons, can also be used to judge teams outside that era.
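To make the convergent-evidence idea concrete, here is a minimal sketch of the correlation check described above. The ratings are invented for six hypothetical teams (they are not real DVOA or Simple Rating System values), and the names `new_stat` and `established_stat` are placeholders of my own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up ratings for six hypothetical teams (NOT real DVOA or SRS values).
new_stat = [25.0, 12.5, 3.0, -1.5, -10.0, -22.0]
established_stat = [9.1, 6.0, 0.5, 1.2, -4.4, -8.3]

# A strongly positive r would count as convergent evidence for new_stat.
convergent_r = pearson_r(new_stat, established_stat)
print(f"convergent r = {convergent_r:.2f}")
```

A high positive correlation only says the two stats rank teams similarly; it does not say which one is closer to “true goodness.”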

Finally, test-criterion evidence comes either in the form of predictive evidence or concurrent evidence. The former could be something like the correlation between Long-Term Career Forecasts for rookie QBs and their subsequent NFL performance. This post two years ago by Chase Stuart of Football Perspective is an example of the latter, as it correlated various individual passing stats with team wins in the same season (i.e., concurrently).
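The difference between concurrent and predictive evidence comes down to which season’s criterion you pair the stat with. The sketch below uses invented numbers for six hypothetical teams; it is not a reconstruction of the analyses in either post mentioned above, and every variable name is my own assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: one passing stat in Year 1, and team wins in Years 1 and 2.
stat_year1 = [95.0, 88.0, 80.0, 76.0, 70.0, 62.0]
wins_year1 = [12, 11, 9, 8, 6, 4]
wins_year2 = [10, 7, 9, 8, 8, 5]

# Concurrent evidence: stat and criterion measured in the same season.
concurrent_r = pearson_r(stat_year1, wins_year1)

# Predictive evidence: stat measured first, criterion measured later.
predictive_r = pearson_r(stat_year1, wins_year2)

print(f"concurrent r = {concurrent_r:.2f}, predictive r = {predictive_r:.2f}")
```

With these made-up numbers the same-season correlation is the stronger of the two, which is the usual pattern: explaining the present is easier than forecasting the future.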


Reliability

Quoth The Standards (pp. 25, 27):

The hypothetical difference between an examinee’s observed score…and the examinee’s true or universe score…is called measurement error. Information about measurement error…includes the identification of the major sources of error, summary statistics bearing on the size of such errors, and the degree of generalizability of scores across alternate forms, scorers, administrations, or other relevant dimensions.


Reliability is all about distinguishing the signal from the noise. For instance, this post by Monte McNair of Outside the Hashes provides reliability evidence, as he calculated how long we need to wait before QB performance is equal parts skill and luck. Or in the language of reliability, he answered the question, “How many pass attempts must we observe before we can say a given QB stat is 50% true score and 50% measurement error?”[2] Another example is this FO post, which showed that historical DVOAs stayed relatively the same when they were updated from Version 6 to Version 7.
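The “50% true score, 50% measurement error” question can be sketched with the textbook classical test theory model, in which a stat averaged over n attempts has reliability equal to true-score variance divided by true-score variance plus per-attempt error variance over n. This is a generic illustration, not necessarily the method used in the post above, and the variance figures are assumptions I picked purely for the example.

```python
def reliability(n_attempts, true_var, error_var_per_attempt):
    """Share of observed variance that is true score after n attempts."""
    return true_var / (true_var + error_var_per_attempt / n_attempts)


def attempts_for_reliability(target, true_var, error_var_per_attempt):
    """Attempts needed to reach a target reliability (0 < target < 1).

    Solves target = T / (T + E / n) for n.
    """
    return target * error_var_per_attempt / ((1 - target) * true_var)


# Assumed values for illustration only (not estimates from real NFL data):
# a completion-style stat with a per-attempt binomial variance of
# 0.6 * 0.4 = 0.24, and a between-QB true-talent standard deviation of 0.03.
error_var = 0.6 * 0.4
true_var = 0.03 ** 2

# At the 50/50 point, the formula reduces to n = error_var / true_var.
n_half = attempts_for_reliability(0.5, true_var, error_var)
print(f"attempts for 50/50 signal vs. noise: {n_half:.0f}")
print(f"reliability at that point: {reliability(n_half, true_var, error_var):.2f}")
```

The takeaway is that the break-even sample size depends on the ratio of noise to real talent spread: the more alike QBs truly are on a stat, the longer we have to wait before it tells us anything.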

DT : IR :: TL : DR

If we want NFL analytics to be trusted, then it’s our responsibility as members of the NFL analytics community to provide clear answers to the following validity and reliability questions:

  1. Do our stats adequately capture the performance area of interest?
  2. Do our weighting systems closely match the performance patterns we see in real life?
  3. Do our stats correlate highly with other similar stats?
  4. Do our stats apply to out-of-sample teams/players?
  5. Do our stats accurately predict the future and/or explain the current?
  6. Do our stats maximize signal and minimize noise?

  1. Like a Corvette, this thing is top-of-class in staving off depreciation, so e-mail me if you want it for, um, educational and psychological purposes only. 

  2. Evidence suggests Tony Romo’s game-ending interceptions are 100% true score.
