J Clin Epidemiol. 2026 Mar 12:112221. pii: S0895-4356(26)00096-X. [Epub ahead of print]
OBJECTIVES: With the exponential growth of biomedical literature, conducting systematic reviews is becoming increasingly burdensome. We aimed to evaluate the performance of large language models (LLMs) in automating some or all steps of systematic reviews and meta-analyses.
STUDY DESIGN AND SETTING: In this systematic review, we searched PubMed, Embase, the Cochrane Library, and preprint platforms up to January 14, 2025. We included any study assessing the performance of LLMs (e.g., GPT, Claude, Mistral) in any step of the systematic review process. Pairs of reviewers independently extracted data and assessed risk of bias. We summarized performance as median (IQR) positive percent agreement (PPA) and negative percent agreement (NPA) between LLMs and human reviewers, analogous to sensitivity and specificity, respectively.
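As a minimal illustration of the agreement metrics described above, the sketch below computes PPA and NPA for one LLM-vs-human comparison, treating the human reviewers' inclusion decisions as the reference standard. The decision vectors are invented for illustration and do not come from the review.

```python
def percent_agreement(llm, human):
    """Positive and negative percent agreement of LLM screening
    decisions against human reviewer decisions (reference standard).

    PPA = TP / (TP + FN), analogous to sensitivity.
    NPA = TN / (TN + FP), analogous to specificity.
    Decisions are truthy (include) or falsy (exclude).
    """
    tp = sum(1 for l, h in zip(llm, human) if l and h)
    fn = sum(1 for l, h in zip(llm, human) if not l and h)
    tn = sum(1 for l, h in zip(llm, human) if not l and not h)
    fp = sum(1 for l, h in zip(llm, human) if l and not h)
    ppa = tp / (tp + fn) if (tp + fn) else float("nan")
    npa = tn / (tn + fp) if (tn + fp) else float("nan")
    return ppa, npa


# Hypothetical screening decisions for 8 references (1 = include).
human = [1, 1, 1, 0, 0, 0, 0, 1]
llm   = [1, 1, 0, 0, 0, 1, 0, 1]
ppa, npa = percent_agreement(llm, human)  # 0.75, 0.75
```

In the review itself, per-assessment PPA and NPA values such as these were then aggregated across studies as medians with interquartile ranges.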
RESULTS: From 3,889 unique references, we included 63 studies, of which 52 reported performance metrics, yielding a total of 148 LLM performance assessments. Most assessments concerned GPT models (n=114, 77%). The most frequently evaluated tasks were title and abstract screening (n=78, 53%), data extraction (n=23, 16%), and full-text screening (n=20, 14%). For title and abstract screening, overall median PPA was 0.92 (IQR 0.69-0.98) and median NPA was 0.89 (IQR 0.72-0.95). For full-text screening, overall median PPA was 0.93 (IQR 0.87-1.00) and median NPA was 0.92 (IQR 0.78-0.97). Late-generation LLMs released after GPT-4 appeared to outperform earlier models. For other tasks, authors reported generally good performance, but variability in the reported metrics precluded complete quantitative synthesis. Global accuracy for data extraction tasks ranged from 0.36 to 1.00, with a median of 0.95 (IQR 0.91-0.97, n=11). For risk of bias assessment, accuracy ranged from 0.44 to 0.90 (median 0.62, IQR 0.53-0.76, n=6).
CONCLUSION: LLMs, particularly newer generations, show promise for automating some repetitive steps of systematic reviews, such as screening. However, their successful integration will require appropriate safeguards and careful implementation.
Keywords: Artificial intelligence; Large language models; Meta-analyses; Methodology; Systematic reviews