Last fall, I wrote about internal structure validity evidence, explaining what it is, why it’s important, and how it applies to NFL data analysis. Long story short, many metrics use multiple objective observations as indicators of a subjective trait, and weight those indicators in some systematic manner. That weighting system is the internal structure of the metric, and how well it fits real-world data can be tested via a statistical technique called confirmatory factor analysis (CFA).
For instance, consider Adjusted Net Yards per Attempt (ANY/A), which — in the language of the last paragraph — uses pass yards per dropback (Yds/Db), sack yards per dropback (SkYds/Db), pass touchdowns per dropback (PassTDs/Db), and interceptions per dropback (INTs/Db) as objective indicators of a subjective quarterbacking (QB) trait we’ll call “QB quality.”
In terms of ANY/A’s weighting system, the interesting part actually has nothing to do with the fact that PassTDs are multiplied by 20 and INTs are multiplied by -45. Those multipliers are only weights in the sense that they’re the values needed to put all four stats on the same yardage scale. Once you do that bit of math, however, everything’s weighted equally with respect to its relative impact on ANY/A: 180 yards of Yds/Db make the same contribution as 180 yardage-equivalent PassTDs/Db (i.e., 9 TDs) or 180 yardage-equivalent INTs/Db (i.e., 4 INTs).
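That scale conversion is easy to sanity-check in code. Here’s a minimal sketch using the standard ANY/A formula (the season totals below are made up for illustration):

```python
def any_a(pass_yds, pass_tds, ints, sack_yds, attempts, sacks):
    """Adjusted Net Yards per Attempt: TDs count as +20 yards, INTs as -45."""
    return (pass_yds + 20 * pass_tds - 45 * ints - sack_yds) / (attempts + sacks)

# Hypothetical season line: 4,000 pass yards, 30 TDs, 10 INTs,
# 200 sack yards, on 550 attempts plus 30 sacks
baseline = any_a(4000, 30, 10, 200, 550, 30)

# Equal weighting after conversion: 180 extra passing yards moves ANY/A
# exactly as much as 9 extra TDs (9 * 20 = 180 yardage-equivalents)
extra_yards = any_a(4180, 30, 10, 200, 550, 30)
extra_tds = any_a(4000, 39, 10, 200, 550, 30)
print(extra_yards == extra_tds)  # prints True
```

Both changes add exactly 180 to the numerator, which is the sense in which the indicators are weighted equally once they’re on a common scale.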
That’s the internal structure of ANY/A, i.e., how it combines directly observable stats into one unobservable QB trait. But does that internal structure conform to how Yds/Db, SkYds/Db, PassTDs/Db, and INTs/Db operate in the real world of the NFL? To answer that, we need CFA, and that’s what I’m going to use in this series of posts.
The (Not-So) Basic Math of CFA
In my return post last week, I noted that CFA is part of a larger class of statistical techniques called structural equation modeling (SEM). For all intents and purposes, SEM could stand for “simultaneous equations modeling” because its underlying math isn’t too far off from what Algebra II taught you about solving simultaneous equations.
For instance, we can represent the internal structure of ANY/A graphically as follows:

[Figure: path diagram of the one-factor ANY/A CFA model]
In CFA notation, ovals are unobservable traits, rectangles are observed metrics, and single-headed arrows indicate predictive paths. Meanwhile, lambdas (λ) represent the impact of the unobservable trait on the observed metric, phis (Φ) represent variance in the unobservable trait, and deltas (δ) represent unexplained variance in each of the observed metrics (including measurement error).
I’m not going to go into the specifics here, but Φs in the above form of CFA model are largely irrelevant. Using the rest of the notation, however, we can write out the ANY/A equations that need to be solved simultaneously:
- Yds/Db = (λ11 * QB Quality) + δ1.
- PassTDs/Db = (λ21 * QB Quality) + δ2.
- INTs/Db = (λ31 * QB Quality) + δ3.
- SkYds/Db = (λ41 * QB Quality) + δ4.
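If those four equations feel abstract, a small simulation makes them concrete. This sketch (with made-up standardized loadings, not estimates from real data) generates the four indicators from a single latent “QB quality” score exactly as the equations specify, then checks that the indicators covary the way the model implies:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # big sample, so empirical covariances settle near their targets

# Hypothetical standardized loadings (lambdas) -- illustrative, not estimates
lam = np.array([0.9, 0.7, -0.6, -0.5])   # Yds/Db, PassTDs/Db, INTs/Db, SkYds/Db
delta_var = 1 - lam**2                    # unique variances, keeping Var(X_i) = 1

qb_quality = rng.standard_normal(n)                     # the latent trait (Phi = 1)
deltas = rng.standard_normal((n, 4)) * np.sqrt(delta_var)
indicators = qb_quality[:, None] * lam + deltas         # X_i = lambda_i*F + delta_i

# With Phi = 1, the model implies Cov(X_i, X_j) = lambda_i * lambda_j (i != j)
emp = np.cov(indicators, rowvar=False)
print(emp[0, 1], lam[0] * lam[1])  # empirical vs. implied, nearly identical
```

The key point: once the loadings and error variances are set, every covariance among the four indicators is pinned down, which is exactly what CFA exploits when it evaluates model fit.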
Anyone familiar with regression analysis will immediately recognize that these equations have a similar form: Y = (coefficient * variable) + error. And indeed, the standardized λs above can be (loosely) interpreted as standardized regression coefficients. That said, there are four important differences at play here:
- Unlike regression, which can only solve one equation at a time, SEM (via this CFA example) solves all four equations simultaneously.
- Unlike regression, which assumes, for instance, that Yds/Db is measured without error, SEM expressly estimates measurement error via the δs.
- Unlike regression, which uses ordinary least squares (OLS) estimation to calculate coefficients, SEM uses maximum likelihood (ML) estimation.
- Unlike regression, which judges model fit with R², SEM judges model fit by comparing the covariance matrix the model implies to the covariance matrix actually observed in the data.
The (Not-So) Basic Math of CFA Model Fit
The fourth difference requires a bit of unpacking. Again, I won’t go into all the specifics right now, but the statistic that tells you how well a regression model fits real-world ANY/A data, R², is simply a function of the squared differences between actual ANY/A values and the ANY/A values predicted by the model.
In CFA, however, the concept of model fit is handled entirely differently, and here’s an introductory explanation of how it works.
In our current ANY/A example, the implied covariance matrix, based on the model shown above (and its simultaneous equations), is

| | Yds/Db | PassTDs/Db | INTs/Db | SkYds/Db |
| --- | --- | --- | --- | --- |
| **Yds/Db** | λ11²Φ11 + δ1 | | | |
| **PassTDs/Db** | λ21λ11Φ11 | λ21²Φ11 + δ2 | | |
| **INTs/Db** | λ31λ11Φ11 | λ31λ21Φ11 | λ31²Φ11 + δ3 | |
| **SkYds/Db** | λ41λ11Φ11 | λ41λ21Φ11 | λ41λ31Φ11 | λ41²Φ11 + δ4 |
Whereas the actual, real-world covariance matrix, based on the data I’ve collected — more on that in Part 2 — is

[Table: observed covariance matrix of Yds/Db, PassTDs/Db, INTs/Db, and SkYds/Db, computed from the collected data]
In CFA (and SEM writ large), model fit is based on how well the actual, real-world values in this second covariance matrix conform to the model-implied covariance matrix above it. What’s more, whereas regression offers only a couple of R²-based measures of model fit, SEM has a whole host of model-fit measures:
- χ² likelihood ratio test (χ²)
- Akaike Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
- Comparative Fit Index (CFI)
- Non-Normed Fit Index (NNFI)1
- Root Mean Square Error of Approximation (RMSEA)
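Here’s a numerical sketch of that covariance-comparison logic, using hypothetical loadings rather than anything fitted to real data. It builds the model-implied covariance matrix (Σ = ΛΦΛ′ + Θδ), evaluates the ML discrepancy function against a slightly perturbed stand-in for the observed matrix, and converts the resulting χ² into two of the indices listed above via their standard formulas:

```python
import numpy as np

# Hypothetical standardized loadings -- stand-ins, not values fitted to my data
lam = np.array([0.9, 0.7, -0.6, -0.5])       # Yds/Db, PassTDs/Db, INTs/Db, SkYds/Db
# Model-implied covariance matrix: Sigma = Lambda Phi Lambda' + Theta (Phi = 1)
sigma = np.outer(lam, lam) + np.diag(1 - lam**2)

# A stand-in "observed" covariance matrix S: Sigma with one covariance perturbed
s = sigma.copy()
s[0, 1] = s[1, 0] = sigma[0, 1] + 0.05

# ML discrepancy function, and the chi-square statistic it yields
p, n = 4, 500                                 # number of indicators, sample size
f_ml = (np.log(np.linalg.det(sigma)) + np.trace(s @ np.linalg.inv(sigma))
        - np.log(np.linalg.det(s)) - p)
chi_sq = (n - 1) * f_ml
df = p * (p + 1) // 2 - 8                     # 10 sample moments - 8 free parameters

# Two of the fit indices above
rmsea = np.sqrt(max(chi_sq - df, 0) / (df * (n - 1)))
chi_base, df_base = 480.0, 6                  # hypothetical independence-model fit
cfi = 1 - max(chi_sq - df, 0) / max(chi_base - df_base, chi_sq - df, 0)
```

If S matched Σ exactly, the discrepancy (and hence χ²) would be zero; the fit indices just rescale how far from zero you land, relative to degrees of freedom and, for CFI, relative to a worst-case baseline model.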
DT: IR :: TL:DR
The steps in doing a confirmatory factor analysis (CFA) of Adjusted Net Yards per Attempt (ANY/A) are as follows:
- Conceptualize the ANY/A model.
- Collect ANY/A data.
- Run the analysis in a statistics program of your choosing.
- Look at the various model fit indices.
- If fit is good, interpret the model coefficients.
- Write it up.
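Putting steps 3 through 5 together: you’d normally hand this to an SEM package, but the core of step 3 is just minimizing the ML discrepancy function, which can be sketched directly. In this toy version the “observed” covariance matrix is generated from hypothetical loadings, so a perfect fit exists by construction:

```python
import numpy as np
from scipy.optimize import minimize

# Target covariance matrix built from known, hypothetical loadings
# (in a real analysis, s would be computed from the collected data)
lam_true = np.array([0.9, 0.7, -0.6, -0.5])
s = np.outer(lam_true, lam_true) + np.diag(1 - lam_true**2)

def f_ml(theta):
    """ML discrepancy between s and the model-implied covariance matrix."""
    lam, delta = theta[:4], theta[4:]
    sigma = np.outer(lam, lam) + np.diag(delta)
    return (np.log(np.linalg.det(sigma)) + np.trace(s @ np.linalg.inv(sigma))
            - np.log(np.linalg.det(s)) - 4)

# Step 3: "run the analysis" -- minimize the discrepancy over the 8 free
# parameters (4 loadings, 4 unique variances; latent variance fixed at 1)
start = np.full(8, 0.5)
res = minimize(f_ml, start, method="L-BFGS-B",
               bounds=[(-2.0, 2.0)] * 4 + [(0.01, 2.0)] * 4)

# Steps 4-5: with a perfect-fitting model, the minimized discrepancy (and
# hence chi-square) lands near zero, and the recovered loadings match
# lam_true up to an overall sign flip (the latent scale is indeterminate)
lam_hat = res.x[:4]
```

Dedicated SEM software adds standard errors, identification checks, and the full menu of fit indices on top of this, but the estimation engine is the same discrepancy-minimization idea.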
In Part 2 of this Modeling Monday series, I’ll be writing it up.
aka the Tucker-Lewis Index, TLI ↩