J Biomed Inform. 2020 Jul 02. pii: S1532-0464(20)30128-3. [Epub ahead of print] 108: 103500
BACKGROUND: Real-time surveillance in health informatics has become a growing area of interest among researchers worldwide. Progress in this field has led to various public health informatics initiatives. Surveillance systems that use social media information have been developed to predict disease outbreaks early and to monitor diseases. In the past few years, the availability of social media data, particularly Twitter data, has enabled real-time syndromic surveillance that provides immediate analysis and instant feedback to those charged with following up and investigating potential outbreaks. In this paper, we review the recent work, trends, and machine learning (ML) text classification approaches used by surveillance systems that draw on social media data in the healthcare domain. We also highlight limitations and challenges, followed by possible future directions for this domain.
METHODS: To study the landscape of health informatics research that performs surveillance on health-related data posted on social media or web-based platforms, we present a bibliometric analysis of 1240 publications indexed in multiple scientific databases (IEEE, ACM Digital Library, ScienceDirect, PubMed) from 2010 to 2018. The papers were further reviewed according to the machine learning algorithms used to analyze health-related text posted on social media platforms.
FINDINGS: Based on a corpus of 148 selected articles, the study identifies the types of social media or web-based platforms used for surveillance in the healthcare domain, along with the health topic(s) they studied. Within this corpus, 26 articles used machine learning techniques; these were examined to identify the most commonly used ML techniques. The largest share of studies (24%) focused on the surveillance of flu or influenza-like illness (ILI). Twitter (64%) was the most popular data source for surveillance research using social media text data, and Support Vector Machine (SVM) (33%) was the most used ML algorithm for text classification.
CONCLUSIONS: Including online data in surveillance systems has improved disease prediction ability over traditional syndromic surveillance systems. However, social media-based surveillance systems have many limitations and challenges, including noise, demographic bias, and privacy issues. Our paper outlines future directions that can be useful for researchers working in this area. Researchers can use this paper as a library for social media-based surveillance systems in the healthcare domain and can expand such systems by incorporating the future work discussed in our paper.
Keywords: Health informatics; Machine learning; Outbreak detection; Social media; Surveillance systems