bims-arines 2025-03-09 papers

J Clin Epidemiol. 2025 Feb 26. pii: S0895-4356(25)00079-4. [Epub ahead of print] 111746

Large language models for conducting systematic reviews: on the rise, but not yet ready for use - a scoping review.

Judith-Lisa Lieberum, Markus Töws, Maria-Inti Metzendorf, Felix Heilmeyer, Waldemar Siemens, Christian Haverkamp, Daniel Böhringer, Joerg J Meerpohl, Angelika Eisele-Metzger.

BACKGROUND: Machine learning (ML) promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application in SR conduct have attracted attention.
OBJECTIVE: To provide an overview of LLM applications in SR conduct in health research.
STUDY DESIGN: We systematically searched MEDLINE, Web of Science, IEEE Xplore, ACM Digital Library, Europe PMC (preprints), and Google Scholar, and conducted an additional hand search (last search: 26 February 2024). We included scientific articles in English or German, published from April 2021 onwards, building upon the results of a mapping review that had not yet identified LLM applications to support SRs. Two reviewers independently screened studies for eligibility; after piloting, one reviewer extracted data, which were checked by another.
RESULTS: Our database search yielded 8054 hits, and we identified 33 articles from our hand search. We ultimately included 37 articles on LLM support. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n=15, 41%), study selection (n=14, 38%), and data extraction (n=11, 30%). The most frequently used LLM was GPT (n=33, 89%). Validation studies were predominant (n=21, 57%). In about half of the studies, authors evaluated LLM use as promising (n=20, 54%), in one quarter as neutral (n=9, 24%), and in one fifth as non-promising (n=8, 22%).
CONCLUSIONS: Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance.

Keywords: ChatGPT; Health research; Large language models; Machine learning; Scoping review; Systematic reviews as topic

DOI: https://doi.org/10.1016/j.jclinepi.2025.111746

J Am Med Inform Assoc. 2025 Feb 27. pii: ocaf030. [Epub ahead of print]

Enhancing systematic literature reviews with generative artificial intelligence: development, applications, and performance evaluation.

Ying Li, Surabhi Datta, Majid Rastegar-Mojarad, Kyeryoung Lee, Hunki Paek, Julie Glasgow, Chris Liston, Long He, Xiaoyan Wang, Yingxin Xu.

OBJECTIVES: We developed and validated a large language model (LLM)-assisted system for conducting systematic literature reviews in health technology assessment (HTA) submissions.
MATERIALS AND METHODS: We developed a five-module system using abstracts acquired from PubMed: (1) literature search query setup; (2) study protocol setup using population, intervention/comparison, outcome, and study type (PICOs) criteria; (3) LLM-assisted abstract screening; (4) LLM-assisted data extraction; and (5) data summarization. The system incorporates a human-in-the-loop design, allowing real-time PICOs criteria adjustment. This is achieved by collecting information on disagreements between the LLM and human reviewers regarding inclusion/exclusion decisions and their rationales, enabling informed PICOs refinement. We generated four evaluation sets including relapsed and refractory multiple myeloma (RRMM) and advanced melanoma to evaluate the LLM's performance in three key areas: (1) recommending inclusion/exclusion decisions during abstract screening, (2) providing valid rationales for abstract exclusion, and (3) extracting relevant information from included abstracts.
RESULTS: The system demonstrated relatively high performance across all evaluation sets. For abstract screening, it achieved an average sensitivity of 90%, F1 score of 82, accuracy of 89%, and Cohen's κ of 0.71, indicating substantial agreement between human reviewers and LLM-based results. In identifying specific exclusion rationales, the system attained accuracies of 97% and 84%, and F1 scores of 98 and 89 for RRMM and advanced melanoma, respectively. For data extraction, the system achieved an F1 score of 93.
DISCUSSION: Results showed high sensitivity, Cohen's κ, and prevalence-adjusted bias-adjusted κ (PABAK) for abstract screening, and high F1 scores for data extraction. This human-in-the-loop AI-assisted SLR system demonstrates the potential of GPT-4's in-context learning capabilities by eliminating the need for manually annotated training data. In addition, this LLM-based system offers subject matter experts greater control through prompt adjustment and real-time feedback, enabling iterative refinement of PICOs criteria based on performance metrics.
CONCLUSION: The system demonstrates potential to streamline systematic literature reviews, potentially reducing time, cost, and human errors while enhancing evidence generation for HTA submissions.
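
The screening metrics named in this abstract (sensitivity, F1 score, accuracy, Cohen's κ, PABAK) can all be derived from a 2x2 table of LLM decisions against human decisions. The following is a minimal sketch under stated assumptions: the function name and the counts are hypothetical and are not taken from the paper, and the human reviewers' decisions are treated as the reference standard.

```python
def screening_metrics(tp, fp, fn, tn):
    """Agreement metrics for include/exclude decisions.

    tp: both LLM and human include    fp: LLM includes, human excludes
    fn: LLM excludes, human includes  tn: both exclude
    Assumes every row and column of the 2x2 table is non-empty.
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)          # share of human includes found by the LLM
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / n              # observed agreement
    # Agreement expected by chance, from the marginal include/exclude rates.
    p_include = ((tp + fp) / n) * ((tp + fn) / n)
    p_exclude = ((fn + tn) / n) * ((fp + tn) / n)
    p_chance = p_include + p_exclude
    kappa = (accuracy - p_chance) / (1 - p_chance)
    pabak = 2 * accuracy - 1              # kappa with chance agreement fixed at 0.5
    return {
        "sensitivity": sensitivity,
        "f1": f1,
        "accuracy": accuracy,
        "kappa": kappa,
        "pabak": pabak,
    }


# Hypothetical counts for 200 screened abstracts (illustration only).
metrics = screening_metrics(tp=45, fp=15, fn=5, tn=135)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because PABAK fixes chance agreement at 0.5, it depends only on observed agreement, which is why it is often reported alongside Cohen's κ when included abstracts are rare.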

Keywords: GPT-4; human-in-the-loop AI; information extraction; large language model; systematic literature review

DOI: https://doi.org/10.1093/jamia/ocaf030

BMC Med Res Methodol. 2025 Mar 06. 25(1): 59

Validity of using a semi-automated screening tool in a systematic review assessing non-specific effects of respiratory vaccines.

Charlie Holland, Daniel B Oakes, Mohinder Sarna, Kevin Ek Chai, Leo Ng, Hannah C Moore.

BACKGROUND: Abstract screening for a systematic review can take two researchers thousands of hours. We aimed to determine the reliability and validity of Research Screener (RS), a semi-automated abstract screening tool, within a systematic review on non-specific and broader effects of respiratory vaccines on acute lower respiratory infection hospitalisations and antimicrobial prescribing patterns in young children.
METHODS: We searched Medline, Embase, CINAHL, Scopus and ClinicalTrials.gov from inception until 24 January 2024. We included human studies involving non-specific and broader effects of respiratory vaccines and excluded studies investigating live-attenuated vaccines. The RS trial compared relevant abstracts flagged by RS to manual screening. RS ranks abstracts by relevance based on seed articles used to validate the search strategy. Abstracts are re-ranked following reviewers' feedback. Two reviewers screened in RS independently with a third reviewer resolving conflicts; three reviewers screened manually with a fourth reviewer resolving conflicts.
RESULTS: After removal of duplicates, 9,727 articles were identified for abstract screening. Of those, 3,000 were randomly selected for screening in RS, with 18% (540) screened in RS and 100% manually. In RS, 99 relevant articles were identified. After comparing RS to manual screening and completing full-text review on 26 articles not captured by RS, 4 articles were missed by RS (2 due to human error, 2 not yet screened). Hence, RS captured articles accurately whilst reducing the screening load.
CONCLUSIONS: RS is a valid and reliable tool that reduces the amount of time spent screening articles for large-scale systematic reviews. RS is a useful tool that should be considered for streamlining the process of systematic reviews.
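
The general idea of ranking abstracts against seed articles and re-ranking after reviewer feedback can be illustrated with a toy example. This sketch is not Research Screener's algorithm, which the abstract does not describe; it assumes a simple bag-of-words cosine similarity, and all function names and texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts for one text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_unscreened(abstracts, relevant_texts, screened_ids):
    """Order unscreened abstract ids by similarity to the known-relevant texts."""
    profile = Counter()
    for text in relevant_texts:
        profile.update(bag_of_words(text))
    scores = {
        aid: cosine(bag_of_words(text), profile)
        for aid, text in abstracts.items()
        if aid not in screened_ids
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical abstracts and seed article text (illustration only).
abstracts = {
    "a1": "influenza vaccine and hospitalisation for respiratory infection in children",
    "a2": "antibiotic prescribing after pneumococcal vaccine in infants",
    "a3": "surgical outcomes after knee replacement in adults",
}
seeds = ["respiratory vaccine effects on respiratory infection hospitalisation in young children"]

first_pass = rank_unscreened(abstracts, seeds, screened_ids=set())

# Reviewer feedback: a1 was screened and judged relevant, so it joins the
# relevance profile and leaves the queue before the remaining abstracts are re-ranked.
second_pass = rank_unscreened(abstracts, seeds + [abstracts["a1"]], screened_ids={"a1"})
print(first_pass, second_pass)
```

In a workflow of this kind, reviewers screen from the top of the queue, so the screening load falls only if relevant abstracts concentrate near the top of the ranking.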

Keywords: Research Screener; Respiratory vaccines; Semi-automated screening; Systematic review; Validation

DOI: https://doi.org/10.1186/s12874-025-02511-7