Background: Most studies of health-related quality of life (HRQoL) rely on administering patient-reported outcome (PRO) instruments to users, and are limited by researchers’ ability to recruit patients into studies. Social media presents a rich opportunity to gather health information from users with limited intervention. We sought to estimate the HRQoL of Twitter users using automated semantic processing methods.
Methods: We gathered Tweets from 1326 users and had them complete an empirically validated measure of HRQoL, the CDC-4 Healthy Days questionnaire. We processed Tweets by calculating the positive or negative sentiment of each Tweet, by applying natural language processing to measure the presence of any health-related topics, and by calculating other global features of each participant’s Tweet set, including the number of Tweets. We used each of these features to estimate dichotomized HRQoL ("high" or "low"), and analyzed model performance using receiver operating characteristic (ROC) curve analysis.
Results: Despite a poor signal-to-noise ratio in our data from general Twitter users, we were able to estimate HRQoL with 60% accuracy, corresponding to an area under the ROC curve (AUC) of 0.64.
Conclusion: Social media data present a promising opportunity for researchers to examine the health status of the general population of Twitter users. Future studies will need to refine how users are identified, depending on the nature of the research question being asked.
@inproceedings{Sarma2018,
author = {Sarma, K. V. and Spiegel, B. M. R. and Reid, M. W. and Chen, S. and Merchant, R. M. and Seltzer, E. and Arnold, C. W.},
booktitle = {AMIA Annual Symposium Proceedings},
title = {{Estimating Health-Related Quality of Life of Twitter Users: Methods for Semantic Processing of Social Media Posts}},
year = {2018},
}