J Clin Epidemiol. 2025 Feb 26. pii: S0895-4356(25)00079-4. [Epub ahead of print] 111746
BACKGROUND: Machine learning (ML) promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application in SR conduct attracted attention.
OBJECTIVE: To provide an overview of LLM applications in SR conduct in health research.
STUDY DESIGN: We systematically searched MEDLINE, Web of Science, IEEEXplore, ACM Digital Library, Europe PMC (preprints), Google Scholar, and conducted an additional hand search (last search: 26 February 2024). We included scientific articles in English or German, published from April 2021 onwards, building upon the results of a mapping review that has not yet identified LLM applications to support SRs. Two reviewers independently screened studies for eligibility; after piloting, one reviewer extracted data, checked by another.
RESULTS: Our database search yielded 8054 hits, and we identified 33 articles from our hand search. We finally included 37 articles on LLM support. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n=15, 41%), study selection (n=14, 38%), and data extraction (n=11, 30%). The mostly recurring LLM was GPT (n=33, 89%). Validation studies were predominant (n=21, 57%). In half of the studies, authors evaluated LLM use as promising (n=20, 54%), one quarter as neutral (n=9, 24%) and one fifth as non-promising (n=8, 22%).
CONCLUSIONS: Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance.
Keywords: ChatGPT; Health research; Large language models; Machine learning; Scoping review; Systematic reviews as topic