I’ve previously shown that Adjusted Net Yards per Attempt (ANY/A) is one of the most reliable advanced passing stats out there, stabilizing after only 326 dropbacks (Dbs). That said, reliability is only one side of the coin when it comes to evaluating the usefulness of a metric; the other side is validity, which has five components that I detailed several months ago. The component of interest in this three-part series is internal structure validity, and the specific statistical technique I’m using to test it is called **confirmatory factor analysis (CFA)**, the fundamentals of which I introduced in Part 1.

Today, I’m going to present my CFA’s methods and results, and also discuss its implications and limitations.

## Methods

My data set comprised Pass Yards/Db, Pass TDs/Db, Pass INTs/Db, and Sack Yards/Db for the 1,114 qualifying QB seasons from 1978 to 2014. I then standardized each of these metrics because they have wildly different ranges.^{1}
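For the curious, standardizing just means converting each metric to z-scores so all four are on a common scale. Here's a minimal sketch with made-up values (the numbers below are illustrative, not rows from my actual data set):

```python
import numpy as np

# Hypothetical per-dropback metrics for a few QB seasons
# (columns: Yds/Db, TDs/Db, INTs/Db, SkYds/Db — values are made up).
metrics = np.array([
    [7.1, 0.048, 0.022, 0.45],
    [6.2, 0.035, 0.030, 0.60],
    [8.0, 0.055, 0.015, 0.38],
    [5.9, 0.030, 0.041, 0.72],
])

# Standardize each column: subtract its mean, divide by its std. dev.
z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0, ddof=1)

# Every standardized column now has mean ~0 and unit variance,
# so the wildly different ranges no longer matter.
print(z.mean(axis=0))            # ≈ [0, 0, 0, 0]
print(z.std(axis=0, ddof=1))     # ≈ [1, 1, 1, 1]
```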

Next, I randomly split the sample into two groups of 557 QB seasons so I could test whether or not the internal structure of ANY/A is sample-dependent. It’s important to note here that one advantage of CFA — and structural equation modeling writ large — is that it’s possible to estimate coefficients and test equality between groups *at the same time*. This is called multigroup invariance analysis, and you can learn more about it here.
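The random split itself is trivial; here's one way to do it in Python (the seed is arbitrary — this is a sketch of the procedure, not the exact split I used):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

n = 1114  # qualifying QB seasons, 1978-2014

# Shuffle the season indices, then cut the deck in half.
idx = rng.permutation(n)
group1, group2 = idx[: n // 2], idx[n // 2 :]

# Two disjoint groups of 557 seasons each
print(len(group1), len(group2))  # 557 557
```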

Speaking of which, the ANY/A model that I specified to be the same for both groups is graphically represented below:

Mplus v4.21 then used maximum likelihood to estimate coefficients for:

- the λs in the model, which represent the influence of an unobserved/unobservable concept called “QB Quality” on the four observed metrics;^{2}
- the δs in the model, which represent the variance in each observed metric that isn’t explained by “QB Quality”; and
- the Φ in the model, which represents the theoretical variance of “QB Quality.”
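Concretely, this single-factor model says each standardized metric equals its λ times “QB Quality” plus residual noise, which implies a covariance matrix of Σ = Φλλᵀ + diag(δ). A small sketch with made-up parameter values (not my actual estimates) shows how the pieces fit together:

```python
import numpy as np

# Illustrative (NOT estimated) parameters for a one-factor model:
lam = np.array([0.9, 0.8, -0.2, -0.25])       # λ: loadings on "QB Quality"
phi = 1.0                                      # Φ: variance of "QB Quality"
delta = np.array([0.19, 0.36, 0.96, 0.9375])   # δ: residual variances

# Model-implied covariance matrix: Sigma = Φ·λλᵀ + diag(δ)
sigma = phi * np.outer(lam, lam) + np.diag(delta)

# With standardized indicators, λ² + δ = 1 for each metric,
# so the diagonal of Sigma is ~1 (each metric has unit variance).
print(np.diag(sigma))  # ≈ [1, 1, 1, 1]
```

Estimation (as Mplus does via maximum likelihood) amounts to choosing λ, Φ, and δ so that Σ is as close as possible to the observed covariance matrix — simultaneously for both groups in the multigroup case.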

## Results

Unlike with ordinary least squares (OLS) regression, where model fit (e.g., R^{2}) and coefficient estimates go hand in hand, CFA requires that fit be adequate before even considering coefficients.

So, looking first at CFA model fit, I focused on the Comparative Fit Index (CFI), the Tucker-Lewis Index (TLI)^{3}, and the Root Mean Squared Error of Approximation (RMSEA) for various mathematical reasons that have been detailed in the academic literature. What one wants to see is a CFI and TLI greater than or equal to 0.95 and an RMSEA less than or equal to 0.06.
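For the curious, these three indices are simple functions of the chi-square statistics for the fitted model and a baseline (independence) model. A sketch with made-up chi-square values (not my Mplus output) illustrates the standard formulas:

```python
import math

# Hypothetical chi-square results — invented to illustrate the formulas.
chi2_m, df_m = 45.0, 13      # hypothesized model
chi2_b, df_b = 2000.0, 12    # baseline (independence) model
n = 1114                     # sample size

# CFI: improvement over the baseline model's misfit
cfi = 1 - max(chi2_m - df_m, 0) / max(chi2_b - df_b, chi2_m - df_m, 0)

# TLI: similar, but penalizes model complexity via the df ratios
tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1)

# RMSEA: misfit per degree of freedom, adjusted for sample size
rmsea = math.sqrt(max(chi2_m - df_m, 0) / (df_m * (n - 1)))

print(round(cfi, 3), round(tli, 3), round(rmsea, 3))  # 0.984 0.985 0.047
```

With these invented numbers, the model would clear the CFI/TLI ≥ 0.95 and RMSEA ≤ 0.06 benchmarks.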

Now, remember that my CFA simultaneously evaluated both groups of 557 QB seasons. Therefore, “model fit” here actually tells us whether or not the proposed ANY/A model *fits the data for both groups well, and equally so*. In other words, in this particular analysis, it isn’t enough for the model to fit Group 1 well but fit Group 2 poorly, or vice versa.

So how good was model fit?

Swimmingly! The RMSEA is slightly above what you want to see, but the 90% confidence interval includes 0.06, so I won’t quibble. The important point here is that the theoretical model of ANY/A provides a good fit to real-world ANY/A data, and that fit isn’t sample-dependent.

Now that we’ve established good model fit, we can move on to interpreting the estimated coefficients. Here they are:

The key sections to pay attention to are those titled “ANYA By” for the two groups. As you can see, the values are equal in the “Estimates” column. This confirms that my CFA is testing whether or not the influence of “QB Quality” on the various ANY/A components is the same for Group 1 as it is for Group 2.

That said, the “StdYX” column for the “ANYA By” section — i.e., where the standardized coefficient estimates reside — tells a sadder tale. What we want to see is all four estimates greater than or equal to ±.707 in magnitude because that means at least 50% of the variation in each observed ANY/A component is explained by the unobservable “QB Quality” concept. Unfortunately, we see that INTs/Db and SkYds/Db aren’t even close to meeting that threshold. Furthermore, the fact that the coefficients for Yds/Db and TDs/Db are nearly four times those of INTs/Db and SkYds/Db in absolute value means that the former pair are much stronger indicators of QB Quality than the latter pair.
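If the ±.707 threshold seems arbitrary, note that squaring a standardized loading gives the share of the indicator's variance that the latent factor explains — and 0.707² ≈ 0.5. A quick illustration (the 0.9 and 0.2 loadings are made-up stand-ins, not my estimates):

```python
# Squared standardized loading = proportion of variance explained
# by the latent factor. ±.707 is just the |λ| that crosses 50%.
explained = {lam: round(lam ** 2, 3) for lam in (0.707, 0.9, 0.2)}
print(explained)  # {0.707: 0.5, 0.9: 0.81, 0.2: 0.04}
# 0.5  : exactly the 50% threshold
# 0.81 : a strong indicator (Yds/Db-like)
# 0.04 : a weak indicator (INTs/Db-like)
```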

## Discussion

Given my previous findings about the reliability of these metrics, you might find the above paragraph somewhat contradictory: Yds/Db being a good indicator of QB quality makes sense and INTs/Db being a poor indicator makes sense, but SkYds/Db seems like it shouldn’t be a poor indicator and TDs/Db shouldn’t be a good indicator. This is because, as the saying goes in measurement evaluation circles, reliability simply sets the ceiling for validity. A metric is always less valid than it is reliable, but some metrics do a better job of preserving their reliability than others. Here, Yds/Db preserved its high reliability, TDs/Db preserved its middling reliability, SkYds/Db failed to preserve its high reliability, and INTs/Db failed to even preserve its low reliability. In short, my results suggest focusing one’s attention on Yds/Db and TDs/Db when evaluating QB quality.

That said, my CFA-based conclusions about ANY/A come with a couple of caveats. First, it’s reasonable to assume that ANY/A seasons for the same QB in the same offense are highly correlated, so I could have done a multilevel CFA. I chose not to for this post because … baby steps. Second, the fact that this *internal structure* validity evidence is weaker than we’d like doesn’t preclude drawing more positive conclusions based on *other types* of validity evidence.

## DT : IR :: TL : DR

Via a multigroup CFA invariance analysis, I found that the theoretical model underlying ANY/A fits real-world ANY/A well, and that goodness of fit isn’t sample-dependent. However, I also found that Yds/Db and TDs/Db are better indicators of “QB Quality” than are INTs/Db and SkYds/Db.