Kenneth Wilcox, University of Notre Dame
Combining topic modeling and regression: Supervised topic modeling with covariates
As text data become larger and increasingly accessible in psychological research, the need for appropriate statistical models has grown. One statistical framework, topic modeling, models patterns of word co-occurrences using discrete latent topics. Unsupervised topic models such as Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003) are typically used to estimate topic proportions as a summary of the text (e.g., to understand the content of survey responses). In psychology, it is common to use these estimated topic proportions as regression predictors in a two-stage approach. However, it is well known (e.g., in the factor analysis literature) that such a two-stage procedure can be problematic, because the second stage treats the estimated latent variables as if they were observed without error. We propose a novel extension of the supervised topic model (Blei & McAuliffe, 2008) that jointly estimates a topic model and a regression model that incorporates both the latent topics and other covariates as predictors. Our model, Supervised Latent Dirichlet Allocation with Covariates (SLDAX), is fit in a single stage rather than two, allows for evaluation of the incremental validity of the topics given other established measures (and vice versa), and models relationships between the topics and the outcome. To estimate the SLDAX model, we derived a Gibbs sampling algorithm and developed an accompanying R package, psychtm, that implements it. We evaluated the performance of the model under different data characteristics in a simulation study and demonstrate its application on an empirical data set.
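
To make the joint structure concrete, the following is a minimal Python sketch that simulates data from a model of the kind described above. It is not the psychtm API and not the authors' Gibbs sampler. The function name simulate_sldax, the symmetric Dirichlet hyperparameters, the single covariate, the coefficient values, and the assumption that the outcome is normally distributed with a mean that is linear in the covariates and the empirical topic proportions of each document (with no separate intercept, since the proportions sum to one) are all illustrative assumptions.

```python
import numpy as np


def simulate_sldax(n_docs=200, n_topics=3, vocab_size=50, doc_len=40,
                   alpha=0.5, gamma=0.5, sigma=0.5, seed=0):
    """Simulate documents, one covariate, and an outcome (hypothetical sketch)."""
    rng = np.random.default_rng(seed)

    # Topic-word distributions (K x V) and document-topic proportions (D x K).
    topic_word = rng.dirichlet(np.full(vocab_size, gamma), size=n_topics)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    # One observed covariate per document (assumed standard normal).
    x = rng.normal(size=(n_docs, 1))

    # Assumed regression coefficients: one for the covariate, one per topic.
    eta_x = np.array([0.5])
    eta_z = np.linspace(-1.0, 1.0, n_topics)

    docs = []
    zbar = np.zeros((n_docs, n_topics))
    for d in range(n_docs):
        # Draw a topic for each word position, then a word from that topic.
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        words = np.array([rng.choice(vocab_size, p=topic_word[k]) for k in z])
        docs.append(words)
        # Empirical topic proportions: share of words assigned to each topic.
        zbar[d] = np.bincount(z, minlength=n_topics) / doc_len

    # The outcome depends on both the covariate and the latent topic proportions.
    y = x @ eta_x + zbar @ eta_z + rng.normal(scale=sigma, size=n_docs)
    return docs, x, zbar, y


docs, x, zbar, y = simulate_sldax()
```

In a two-stage analysis, zbar would first be estimated from docs alone and then entered into a regression for y as if it were observed. A joint model instead treats the topic assignments as latent while estimating the regression coefficients.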