
Topic modeling uses unsupervised machine learning to detect underlying themes in text and is routinely applied to social media to gain insight into healthcare issues. However, the inherent noisiness of social media text prevents the technique from reaching its full potential. We hypothesized that restricting the medical concepts in social media texts to specific, relevant semantic types and applying topic modeling to those concepts could overcome the difficulties that traditional topic modeling faces with such texts. We therefore developed a semantic-type-based topic modeling pipeline to discover self-reported health-related topics. The pipeline integrates semantic type information and Systematized Nomenclature of Medicine (SNOMED) precoordinated expressions into a traditional topic modeling approach to make it more effective at clustering meaningful, distinct topics. Using social media texts about statins as an illustration, we evaluated the efficacy of the new approach and validated a newly identified topic using real-world clinical data. Based on expert evaluations, the approach produced more novel, distinguishable, and meaningful health-related topics than traditional topic modeling. In addition, our electronic health record validation of a newly identified topic in two real-world clinical databases indicated that statin users had a higher prevalence of depression or anxiety than matched non-users. These results indicate that the new pipeline can improve the extraction of themes from noisy online discussions, thereby contributing to deeper insights for healthcare research.
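The semantic-type restriction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each post has already been annotated with (concept name, semantic type) pairs by some concept-extraction tool, and the allow-list `KEPT_SEMANTIC_TYPES`, the helper names, and the toy posts are all hypothetical. The abstract does not state which semantic types were retained or how SNOMED precoordinated expressions were applied, so that step is omitted here.

```python
from collections import Counter

# Hypothetical allow-list of semantic types; the abstract does not name
# the types the authors actually kept.
KEPT_SEMANTIC_TYPES = {
    "Sign or Symptom",
    "Disease or Syndrome",
    "Mental or Behavioral Dysfunction",
}


def filter_concepts(annotated_post, kept_types=KEPT_SEMANTIC_TYPES):
    """Return the concept names whose semantic type is in the allow-list."""
    return [name for name, sem_type in annotated_post if sem_type in kept_types]


def build_concept_documents(annotated_posts, kept_types=KEPT_SEMANTIC_TYPES):
    """Turn each post into a bag of retained concepts, dropping empty posts."""
    documents = []
    for post in annotated_posts:
        concepts = filter_concepts(post, kept_types)
        if concepts:
            documents.append(Counter(concepts))
    return documents


# Toy annotations (assumed output of an upstream concept-extraction step).
posts = [
    [("Myalgia", "Sign or Symptom"),
     ("Atorvastatin", "Pharmacologic Substance"),
     ("Depression", "Mental or Behavioral Dysfunction")],
    [("Pharmacy", "Organization")],
    [("Anxiety", "Mental or Behavioral Dysfunction"),
     ("Anxiety", "Mental or Behavioral Dysfunction")],
]

documents = build_concept_documents(posts)
```

The resulting bags of concepts, rather than the raw post text, would then serve as input to whichever topic model is chosen, so that drug names, organizations, and other off-target vocabulary cannot dominate the topics.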
Science, Terminology as Topic, Q, R, Medicine, Humans, Electronic Health Records, Systematized Nomenclature of Medicine, Social Media, Research Article, Semantics
