J Am Med Inform Assoc. 2022 May 31. pii: ocac066. [Epub ahead of print]
OBJECTIVE: We aim to investigate the application and accuracy of artificial intelligence (AI) methods for automated medical literature screening in systematic reviews.
MATERIALS AND METHODS: We systematically searched PubMed, Embase, and the IEEE Xplore Digital Library to identify potentially relevant studies. We included studies of automated literature screening that reported the study question, the source of the dataset, and the algorithm models developed for literature screening. Literature screening results produced by human investigators were considered the reference standard. Quantitative synthesis of accuracy was conducted using a bivariate model.
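The abstract does not describe how the bivariate synthesis was implemented. The sketch below is an assumed, minimal illustration of one common approach, a Reitsma-type bivariate random-effects model that pools logit sensitivity and logit specificity across studies; the study counts are invented, and this is not the authors' actual analysis code.

```python
# Illustrative sketch (not the authors' code): bivariate random-effects pooling of
# logit-sensitivity and logit-specificity, as commonly used in diagnostic test
# accuracy meta-analysis. All counts below are hypothetical.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

# Hypothetical per-study 2x2 counts: (TP, FN, TN, FP) versus the human reference standard.
counts = np.array([
    [90, 10, 400, 150],
    [45,  5, 300, 120],
    [70,  8, 500, 200],
], dtype=float)

tp, fn, tn, fp = counts.T
y = np.column_stack([logit(tp / (tp + fn)), logit(tn / (tn + fp))])  # observed logits
s2 = np.column_stack([1 / tp + 1 / fn, 1 / tn + 1 / fp])             # within-study variances

def neg_log_lik(params):
    """Marginal likelihood: y_i ~ Normal(mu, Sigma + S_i), constants dropped."""
    mu = params[:2]
    sd = np.exp(params[2:4])      # between-study SDs (log-parameterized)
    rho = np.tanh(params[4])      # between-study correlation
    sigma = np.array([[sd[0] ** 2, rho * sd[0] * sd[1]],
                      [rho * sd[0] * sd[1], sd[1] ** 2]])
    nll = 0.0
    for yi, s2i in zip(y, s2):
        v = sigma + np.diag(s2i)
        diff = yi - mu
        nll += 0.5 * (np.log(np.linalg.det(v)) + diff @ np.linalg.solve(v, diff))
    return nll

res = minimize(neg_log_lik, x0=np.zeros(5), method="Nelder-Mead")
pooled_sens, pooled_spec = expit(res.x[:2])
print(f"pooled recall (sensitivity) ~ {pooled_sens:.3f}, pooled specificity ~ {pooled_spec:.3f}")
```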
RESULTS: Eighty-six studies were included in our systematic review, and 17 studies were further included in the meta-analysis. The combined recall, specificity, and precision were 0.928 [95% confidence interval (CI), 0.878-0.958], 0.647 (95% CI, 0.442-0.809), and 0.200 (95% CI, 0.135-0.287) when recall was maximized, but 0.708 (95% CI, 0.570-0.816), 0.921 (95% CI, 0.824-0.967), and 0.461 (95% CI, 0.375-0.549) when precision was maximized in the AI models. No significant difference in recall was found across subgroup analyses by algorithm, number of screened records, or fraction of included records.
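For readers unfamiliar with the reported indices, the short snippet below shows how recall, specificity, and precision are computed from a screening confusion matrix, with the human screening decisions as the reference standard. The counts are hypothetical and chosen only to mirror the high-recall/low-precision pattern reported above.

```python
# Illustrative only (counts are invented): effectiveness indices for an automated screener.
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recall (sensitivity), specificity, and precision versus the human reference."""
    return {
        "recall": tp / (tp + fn),        # fraction of truly relevant records retained
        "specificity": tn / (tn + fp),   # fraction of irrelevant records excluded
        "precision": tp / (tp + fp),     # fraction of retained records that are relevant
    }

# Example: 1,000 candidate records, 50 of which the human reviewers judged relevant.
print(screening_metrics(tp=47, fp=188, tn=762, fn=3))
# -> recall 0.94, specificity ~0.80, precision 0.20: high recall can coexist with
#    low precision when only a small fraction of the records is relevant.
```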
DISCUSSION AND CONCLUSION: This systematic review and meta-analysis showed that recall is more important than specificity or precision in literature screening, and a recall above 0.95 should be prioritized. We recommend reporting the effectiveness indices of automated algorithms separately. At the current stage, manual literature screening remains indispensable for medical systematic reviews.
Keywords: artificial intelligence; diagnostic test accuracy; evidence-based medicine; natural language processing; systematic review