While many researchers have training in basic statistical modeling for continuous response variables, categorical response variables arise frequently in research efforts and throughout the literature. Modeling categorical response variables effectively is a key consideration that can greatly affect the impact and validity of one's analyses.
In Stats Bootcamp II, Becki Cleveland, PhD, Assistant Professor of Medicine, and Todd Schwartz, PhD, Professor of Biostatistics, both at the University of North Carolina at Chapel Hill, presented an overview of methods for modeling categorical variables rheumatologic researchers may encounter. The session is available for on-demand viewing for registered ACR Convergence participants through October 31, 2023, on the virtual meeting website.
Dr. Cleveland defined and reviewed several types of categorical variables to help researchers differentiate between types of data. She then covered how to determine whether two variables are related using a chi-square test, which compares a contingency table of observed counts against the counts that would be expected if the two variables were independent. The test indicates whether the difference between observed and expected frequencies is larger than chance alone would produce, that is, whether a relationship exists between the two variables.
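As a minimal sketch of the chi-square test described above, the following Python snippet uses SciPy to compare observed counts in a hypothetical 2x2 table against the counts expected under independence (the table values are illustrative, not from the session):

```python
# Chi-square test of independence on a hypothetical 2x2 contingency table.
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: outcome present / outcome absent
observed = [[30, 20],
            [15, 35]]

# Returns the test statistic, p-value, degrees of freedom, and the
# table of counts expected if the two variables were independent.
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}, df = {dof}")
print("expected counts under independence:")
print(expected)
```

A small p-value suggests the observed and expected frequencies differ by more than chance; as Dr. Cleveland noted, it does not by itself quantify the strength of the association.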
“What it cannot tell you is the details about this relationship,” Dr. Cleveland said. “It cannot tell you the strength of the association between these two variables.”
Dr. Schwartz’s presentation focused on the statistical modeling of categorical data. Each model is interpreted using odds ratios, which are computed from the modeled probabilities. The recommended statistical model depends on the type of categorical response variable present.
Dichotomous response variables are typically used to measure “yes/no” or “true/false” outcomes. Dr. Schwartz recommended a logistic regression model for these measurements.
However, researchers often work with polytomous data, in which the outcome has more than two possible levels. When those levels are ordered, cumulative logit models are appropriate. Dr. Schwartz demonstrated how to use a proportional odds model, a partial proportional odds model, and a non-proportional odds model.
When the effect of an independent variable is constant for each increase in the level of the response, using a proportional odds model is the correct method, he said.
When the proportional odds assumption is violated, the non-proportional odds model can be used instead. While this model is similar to the proportional odds equation, the main difference is that the non-proportional model adds a subscript to each of the betas, allowing a separate slope for each cumulative logit.
“In other words, I’m not going to assume that slope is the same across all of the cumulative logits, but I’m going to give the model the flexibility to estimate separate effects for each cumulative logit,” Dr. Schwartz explained.
He described the partial proportional odds model as a hybrid model that maintains separate intercepts to capture the cumulative nature of the logits. For a subset of the explanatory variables, it allows a separate slope for each of the cumulative logits; for others, it will fit a common slope.
“When there’s a common slope, I can exponentiate that and get my odds ratio that applies across all the cumulative logits,” Dr. Schwartz said. “When it’s separate, I have to estimate a separate odds ratio and interpret that separately for each of the different cumulative logits.”
When a response is polytomous and nominal, the generalized logit model is the appropriate choice. In these cases, combining categories or forming cumulative logits doesn’t make sense. Therefore, each category is compared against a reference category.
Dr. Schwartz concluded his presentation by highlighting that when correlated data structures are present, researchers can use correlated data extensions via generalized estimating equations.