Here’s a small but representative subset of Adjusted Net Yards per Attempt (ANY/A) data that we might use to calculate an age curve for NFL quarterbacks (QB):
Well, this isn’t good: 100% of age groups are missing ANY/As, 80% of QBs are missing ANY/As, and 37% of the entire data set is missing ANY/As. How can we have faith in our age curve when we’re lacking so much of the necessary information to calculate it? Does it matter that some of what’s missing follows a pattern? Is there anything we can do about this; and if so, how? The answers to these questions are “yes,” “yes,” “yes,” and “keep reading this post.”
Looking Under the Hood
Before talking about solutions, let me make two things clear. First, a problem arises only when it’s our outcome variable that’s missing; in this case, ANY/A. If data is only missing on a predictor variable for which we might want to adjust (e.g., schedule strength), we can whistle past the graveyard.
Second, a bona fide missing data problem requires first diagnosing what’s causing it. Academics (Little & Rubin, 1987) have identified three main causes for missing data. As is their wont, however, they’ve named the causes in a way that creates more confusion than education. So, for your convenience, I’ve translated these missing data mechanisms into English, and come up with examples related to our QB data set.
When data for an outcome variable is missing by chance, it became missing only via some random process. You can’t predict its missingness from other variables you measured, and you couldn’t have predicted its missingness from variables you didn’t measure. If what we’re missing for the eight QBs above was due to an act of God or a clerical error in the play-by-play, for example, it would be missing by chance.
When outcome data is missing by proxy, it became missing only via a process related to a potential predictor variable, whether or not we’ve measured it. For instance, Carson Palmer, Randall Cunningham, and Jim McMahon have missing data because injured QBs are less likely to produce a qualifying ANY/A at a given age. Similarly, Bobby Hebert’s missing ANY/As came from backing up Dave Wilson in 1986 and holding out in 1990. For these QBs, injury status, starting status, and holdout status were proxies for missingness, indirectly indicating a higher probability of it.
Finally, when outcome data is missing by default, it became missing via a process related to the outcome variable itself. In our QB example, this is illustrated by Aaron Brooks, Daunte Culpepper, David Carr, and Joe Pisarcik, all of whom produced an awful ANY/A in their final qualifying year and, for all intents and purposes, were never heard from again. In other words, the likelihood of missing ANY/A data for these QBs at a given age was higher because of their own ANY/As at other ages.
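To make the three mechanisms concrete, here’s a minimal simulation sketch. Everything in it is an assumption made up for illustration: the ANY/A distribution, the 30% injury rate, the missingness probabilities, and the names `simulate` and `observed_mean`. It also assumes injury is unrelated to a QB’s true ANY/A, which keeps the by-proxy case simple.

```python
import random
import statistics

def simulate(n=20000, seed=0):
    """Generate hypothetical (injured, ANY/A) pairs; not real QB data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        injured = rng.random() < 0.3   # hypothetical proxy variable
        any_a = rng.gauss(5.5, 1.2)    # hypothetical true ANY/A
        rows.append((injured, any_a))
    return rows

def observed_mean(rows, is_missing, rng):
    """Average ANY/A over the rows that survive a missingness rule."""
    kept = [any_a for injured, any_a in rows
            if not is_missing(injured, any_a, rng)]
    return statistics.fmean(kept)

rows = simulate()
rng = random.Random(1)
true_mean = statistics.fmean(any_a for _, any_a in rows)

# Missing by chance: a coin flip that ignores everything about the QB.
by_chance = observed_mean(rows, lambda inj, a, r: r.random() < 0.37, rng)

# Missing by proxy: missingness depends only on injury status.
by_proxy = observed_mean(
    rows, lambda inj, a, r: r.random() < (0.8 if inj else 0.2), rng)

# Missing by default: missingness depends on the ANY/A value itself.
by_default = observed_mean(
    rows, lambda inj, a, r: r.random() < (0.8 if a < 5.0 else 0.1), rng)

print(f"true={true_mean:.2f} chance={by_chance:.2f} "
      f"proxy={by_proxy:.2f} default={by_default:.2f}")
```

Under these assumptions, the by-chance and by-proxy averages should land close to the true average, while the by-default average should come out noticeably too high, because the low ANY/As are the ones most likely to vanish.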
Dealing with Data That’s Missing by Chance or by Proxy
When data is missing by chance or missing by proxy, we can ignore that it’s missing, but it’s often a good idea to account for it using any number of techniques. The general strategy underlying all of them is that they use the data points we have to account for the data points we don’t have.
If our missing QB data looked like the rows from Palmer to Hebert, then the easiest way out would be to exclude all QBs that were injured or a backup or a holdout (or suspended) at any point from Age 26 to Age 31 (aka listwise deletion). The next easiest way would be to replace all of the missing data with reasonable approximations, e.g., the overall average ANY/A, the average ANY/A for a given age group, or the average ANY/A for a given QB (aka mean replacement).
Although better than doing nothing, both of these fixes still have their problems. In our current example, listwise deletion would mean excluding 80% of QBs, which to me doesn’t sound like improving our analysis. Meanwhile, mean replacement would inflate the averages at a given age and artificially reduce their variances, thereby potentially biasing our results.
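Here’s a small sketch of both quick fixes on a made-up table. The four QBs and their ANY/As are hypothetical placeholders, not the data set above, and `listwise_delete` and `mean_replace_by_age` are names I’ve invented for the illustration. Mean replacement is done by age group here, which is only one of the three variants mentioned.

```python
import statistics

# Hypothetical ANY/A by age (None = missing); not real QB data.
data = {
    "QB A": [6.1, 5.8, None, 6.4],
    "QB B": [5.2, None, None, 4.9],
    "QB C": [6.8, 6.5, 6.9, 6.2],
    "QB D": [4.7, 5.1, 4.4, None],
}

def listwise_delete(table):
    """Keep only QBs with no missing ages."""
    return {qb: vals for qb, vals in table.items() if None not in vals}

def mean_replace_by_age(table):
    """Fill each missing value with the observed average for that age."""
    n_ages = len(next(iter(table.values())))
    age_means = []
    for i in range(n_ages):
        observed = [vals[i] for vals in table.values() if vals[i] is not None]
        age_means.append(statistics.fmean(observed))
    return {
        qb: [age_means[i] if v is None else v for i, v in enumerate(vals)]
        for qb, vals in table.items()
    }

complete = listwise_delete(data)
filled = mean_replace_by_age(data)

# Compare the spread at the third age before and after filling.
age3_observed = [vals[2] for vals in data.values() if vals[2] is not None]
age3_filled = [vals[2] for vals in filled.values()]

print("QBs left after listwise deletion:", list(complete))
print("variance before:", statistics.variance(age3_observed))
print("variance after: ", statistics.variance(age3_filled))
```

In this toy table, listwise deletion leaves a single QB, and filling the third age with its average shrinks the sample variance because the filled-in values sit exactly at the mean.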
Luckily, there are additional techniques that don’t suffer from these issues (e.g., multiple imputation, maximum likelihood, etc.), but they’re too complicated to get into here.
Dealing with Data That’s Missing by Default
Unfortunately, data on NFL aging is rarely missing by chance or by proxy. Instead, we inevitably see the pattern exhibited by Brooks, Culpepper, Carr, and Pisarcik. Metaphorically speaking, data missing by default is the ZEBOV strain of NFL age curve research: It’s the most prevalent, the most fatal, and the most difficult to cure.
Recall that the techniques I listed in the last section use our good (aka available) data to “fix” the bad (aka missing) data. Well, when data’s missing by default, this strategy won’t work because it’s precisely our good data that’s the source of the problem; it’s corrupted, if you will.
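A toy simulation shows why the available data can’t be trusted in this situation. All of the numbers are assumptions chosen for illustration: a true decline of 0.1 ANY/A per year, the talent and noise spreads, and a rule that a QB disappears for good after any season below 4.5. The function names are likewise hypothetical.

```python
import random
import statistics

def simulate_careers(n_qbs=5000, ages=range(26, 32), seed=0):
    """Return (true, observed) ANY/A tables for hypothetical QBs."""
    rng = random.Random(seed)
    true_rows, observed_rows = [], []
    for _ in range(n_qbs):
        talent = rng.gauss(0.0, 0.8)  # hypothetical QB-level skill
        career = [5.5 - 0.1 * (age - 26) + talent + rng.gauss(0.0, 0.8)
                  for age in ages]
        observed, active = [], True
        for value in career:
            observed.append(value if active else None)
            if active and value < 4.5:  # hypothetical cutoff: an awful year
                active = False          # ends the QB's qualifying seasons
        true_rows.append(career)
        observed_rows.append(observed)
    return true_rows, observed_rows

def age_means(rows):
    """Average the non-missing values in each age column."""
    return [statistics.fmean(v for v in column if v is not None)
            for column in zip(*rows)]

true_rows, observed_rows = simulate_careers()
true_curve = age_means(true_rows)
observed_curve = age_means(observed_rows)

for age, t, o in zip(range(26, 32), true_curve, observed_curve):
    print(f"Age {age}: true={t:.2f} observed={o:.2f}")
```

Under these assumptions, the two curves agree at Age 26, when everyone is still observed, and then drift apart: the observed averages at later ages come only from QBs who never had a bad year, so the observed curve understates the decline.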
The only way to get around this is to use some sort of “antivirus” that simultaneously disinfects our good data and protects it from re-exposure when it fixes the bad data. Any of these three advanced techniques would do the trick:
- Selection models
- Shared parameter models
- Pattern mixture models
Detailing their methods is a post (or three) for another day, but here’s the good news. I previously made a case for using latent growth modeling (LGM) in the context of NFL age curves. Well, guess what. Each of these three is a specific application of LGM.
DT : IR :: TL : DR
There are three causes of missing outcome data:
- Chance
- Proxy variables
- The outcome itself
In NFL age curve research, we can ignore the first two types, although it’s still advisable to “fill in” the missing data in some way. We can’t ignore the third type, however, and it’s by far the most common. The only way out is to use LGM; otherwise, our age curves are biased.
Little, R.J.A., & Rubin, D.B. (1987). Statistical analysis with missing data. New York: Wiley.