bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-03-30
Five papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Am Med Inform Assoc. 2025 Mar 22. pii: ocaf050. [Epub ahead of print]
     OBJECTIVE: Screening is a labor-intensive component of systematic review, involving the repetitive application of inclusion and exclusion criteria to a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening.
    MATERIALS AND METHODS: LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library systematic reviews to evaluate their accuracy in zero-shot binary classification for abstract screening (a minimal prompting and ensemble sketch follows this record). Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best-performing LLM-prompt combinations were then validated on a comprehensive dataset of replicated search results (n = 119 695).
    RESULTS: On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and balanced accuracy (LLMmax = 0.904, humanmax = 0.865). When evaluated on the comprehensive dataset, the best-performing LLM-prompt combinations exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity, with a maximal precision of 0.458 on the development dataset decreasing to 0.145 on the comprehensive dataset, while conferring workload reductions of between 37.55% and 99.11%.
    DISCUSSION: Automated abstract screening can reduce the screening workload in systematic review while maintaining quality. Performance variation between reviews highlights the importance of domain-specific validation before autonomous deployment. LLM-human ensembles can achieve similar benefits while maintaining human oversight over all records.
    CONCLUSION: LLMs may reduce the human labor cost of systematic review with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.
    Keywords:  abstract screening; artificial intelligence; evidence synthesis; foundation model; large language model; systematic review
    DOI:  https://doi.org/10.1093/jamia/ocaf050
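    The study above performs zero-shot binary classification with prompted LLMs and then combines screeners into LLM-human and LLM-LLM ensembles. As a rough illustration (not the authors' prompts or pipeline), the Python sketch below assumes the OpenAI Python client; the prompt wording, model name, and OR-rule ensemble helper are illustrative assumptions.

      # Minimal zero-shot abstract-screening sketch (illustrative only).
      # Assumes: pip install openai, and OPENAI_API_KEY set in the environment.
      from openai import OpenAI

      client = OpenAI()

      SYSTEM_PROMPT = (
          "You are screening records for a systematic review. Given the review's "
          "inclusion/exclusion criteria and a title/abstract, answer with exactly "
          "one word: INCLUDE or EXCLUDE."
      )

      def screen_record(criteria: str, title: str, abstract: str,
                        model: str = "gpt-4o") -> bool:
          """Return True if the LLM votes to include the record (zero-shot)."""
          response = client.chat.completions.create(
              model=model,
              temperature=0,
              messages=[
                  {"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user",
                   "content": f"Criteria:\n{criteria}\n\nTitle: {title}\n\nAbstract: {abstract}"},
              ],
          )
          return response.choices[0].message.content.strip().upper().startswith("INCLUDE")

      def ensemble_include(votes: list[bool]) -> bool:
          """OR-rule ensemble: keep a record if any screener (LLM or human) includes it.
          This preserves sensitivity at the cost of precision, mirroring the
          trade-off reported above."""
          return any(votes)

    An OR-rule ensemble only ever adds inclusions, which is why such ensembles can reach perfect sensitivity while precision, and therefore the workload saving, depends on how over-inclusive each member is.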
  2. JMIR Med Inform. 2025 Mar 27. 13 e65371
     BACKGROUND: A challenge in updating systematic reviews is the workload of screening the articles. Many screening models using natural language processing technology have been implemented to scrutinize articles based on their titles and abstracts. While these approaches show promise, traditional models typically treat abstracts as uniform text. We hypothesize that selective training on specific abstract components could enhance model performance for systematic review screening.
    OBJECTIVE: We evaluated the efficacy of a novel screening model that selects specific components from abstracts to improve performance and developed an automatic systematic review update model using an abstract component classifier to categorize abstracts based on their components.
    METHODS: A screening model was built from the articles included and excluded in an existing systematic review and used as the scheme for automatically updating that review. A prior publication was selected as the target systematic review, and the articles included or excluded during its screening process were used as training data. Titles and abstracts were classified into 5 components (Title, Introduction, Methods, Results, and Conclusion), and 31 component-composition datasets were created by combining the 5 component datasets. We implemented 31 screening models using these component-composition datasets and compared their performance across 3 pretrained models: Bidirectional Encoder Representations from Transformers (BERT), BioLinkBERT, and BioM-ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). To automate component selection, we also developed an Abstract Component Classifier Model and used its classifications to create component datasets. From these automatically classified datasets, we created the 10 component-composition datasets corresponding to the 10 best-performing screening models trained on the manually classified datasets, implemented 10 screening models on them, and compared their performance with that of the models built from the manually classified component-composition datasets. The primary evaluation metric was the F10-score, an F-beta score that weights recall heavily (a worked F10 example follows this record).
    RESULTS: A total of 256 included articles and 1261 excluded articles were extracted from the selected systematic review. Among the screening models implemented using manually classified datasets, the performance of some models surpassed that of the models trained on all components (BERT: 9 models, BioLinkBERT: 6 models, and BioM-ELECTRA: 21 models). Among the models implemented using datasets classified by the Abstract Component Classifier Model, some models (BERT: 7 models and BioM-ELECTRA: 9 models) likewise surpassed the models trained on all components. These models achieved an 88.6% reduction in manual screening workload while maintaining high recall (0.93).
    CONCLUSIONS: Component selection from the title and abstract can improve the performance of screening models and substantially reduce the manual screening workload in systematic review updates. Future research should focus on validating this approach across different systematic review domains.
    Keywords:  bidirectional encoder representations from transformer; efficiency; guideline updates; language model; literature; natural language processing; screening model; systematic review; updating systematic reviews
    DOI:  https://doi.org/10.2196/65371
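    The primary metric above, the F10-score, is the F-beta score with beta = 10, so recall is weighted beta-squared (100) times more than precision: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). A worked example with made-up labels, using scikit-learn's fbeta_score (the numbers are illustrative, not taken from the study):

      # F10 rewards over-inclusive screeners: missing a relevant record hurts far
      # more than letting a few irrelevant ones through to manual screening.
      from sklearn.metrics import fbeta_score, precision_score, recall_score

      y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 relevant, 6 irrelevant records
      y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # over-inclusive screening model

      precision = precision_score(y_true, y_pred)  # 4/7 = 0.571
      recall = recall_score(y_true, y_pred)        # 4/4 = 1.000
      f10 = fbeta_score(y_true, y_pred, beta=10)   # about 0.993, dominated by recall
      print(precision, recall, f10)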
  3. JMIR Form Res. 2025 Mar 28. 9 e58366
      
    Keywords:  AI; ChatGPT 3.5; LLM; adoption; analysis; app; article screening; artificial intelligence; chatbot; data; dataset; large language model; reviewer; screening; systematic review
    DOI:  https://doi.org/10.2196/58366
  4. Appl Health Econ Health Policy. 2025 Mar 28.
       INTRODUCTION: The growth of scientific literature in health economics and policy represents a challenge for researchers conducting literature reviews. This study explores the adoption of a machine learning (ML) tool to enhance title and abstract screening. By retrospectively assessing its performance against the manual screening of a recent scoping review, we aimed to evaluate its reliability and potential for streamlining future reviews.
    METHODS: ASReview was utilised in 'Simulation Mode' to evaluate the percentage of relevant records found (RRF) during title/abstract screening (a short RRF illustration follows this record). A dataset of 10,246 unique records from three databases was considered, with 135 relevant records labelled. Performance was assessed across three scenarios with varying levels of prior knowledge (PK) (i.e., 5, 10, or 15 records), using both sampling and heuristic stopping criteria, with 100 simulations conducted for each scenario.
    RESULTS: The ML tool demonstrated strong performance in facilitating the screening process. Using the sampling criterion, median RRF values stabilised at 97% with 25% of the sample screened, saving reviewers approximately 32 working days. The heuristic criterion showed similar median values but greater variability, because screening concluded prematurely once the threshold was reached. While higher PK levels improved early-stage performance, the ML tool's accuracy stabilised as screening progressed, even with minimal PK.
    CONCLUSIONS: This study highlights the potential of ML tools to enhance the efficiency of title and abstract screening in health economics and policy literature reviews. To fully realise this potential, it is essential for regulatory bodies to establish comprehensive guidelines that ensure ML-assisted reviews uphold rigorous evidence quality standards, thereby enhancing their integrity and reliability.
    DOI:  https://doi.org/10.1007/s40258-025-00963-y
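    The RRF metric above can be illustrated without the ASReview package: given records ordered by a model's ranking, RRF at x% screened is the share of all relevant records already seen. The ranking and label positions below are invented purely to show the calculation; the sketch does not call ASReview itself.

      # Toy RRF ("relevant records found") curve for ML-assisted screening.
      import random

      def rrf_at(labels_in_rank_order: list[int], fraction_screened: float) -> float:
          """Fraction of relevant records found after screening the top-ranked fraction."""
          n_screened = int(len(labels_in_rank_order) * fraction_screened)
          found = sum(labels_in_rank_order[:n_screened])
          total_relevant = sum(labels_in_rank_order)
          return found / total_relevant if total_relevant else 0.0

      # 10,246 records with 135 relevant ones, concentrated near the top of the
      # ranking, as a well-trained active learner would tend to place them.
      random.seed(0)
      labels = [0] * 10246
      for i in random.sample(range(2500), 135):
          labels[i] = 1

      for frac in (0.05, 0.10, 0.25):
          print(f"RRF after screening {frac:.0%}: {rrf_at(labels, frac):.1%}")

    A stopping rule that ends screening at a fixed threshold, like the heuristic criterion above, trades a predictable stopping point for the greater variability the abstract reports.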
  5. JMIR Cancer. 2025 Mar 28. 11 e65984
     Background: Natural language processing systems for data extraction from unstructured clinical text require expert-driven input for labeled annotations and model training. The natural language processing competency of large language models (LLMs) can enable automated extraction of important patient characteristics from electronic health records, which is useful for accelerating cancer clinical research and informing oncology care.
    Objective: This scoping review aims to map the current landscape, including definitions, frameworks, and future directions of LLMs applied to data extraction from clinical text in oncology.
    Methods: On June 2, 2024, we queried Ovid MEDLINE for primary, peer-reviewed research studies published since 2000, using oncology- and LLM-related keywords. This scoping review included studies that evaluated the performance of an LLM applied to data extraction from clinical text in oncology contexts. Study attributes and main outcomes were extracted to outline key trends in research on LLM-based data extraction (a minimal extraction sketch follows this record).
    Results: The literature search yielded 24 studies for inclusion. The majority of studies assessed original and fine-tuned variants of the BERT LLM (n=18, 75%), followed by the ChatGPT conversational LLM (n=6, 25%). LLMs for data extraction were most commonly applied in pan-cancer clinical settings (n=11, 46%), followed by breast (n=4, 17%) and lung (n=4, 17%) cancer contexts, and were evaluated using multi-institution datasets (n=18, 75%). Comparing the studies published in 2022-2024 versus 2019-2021, both the total number of studies (18 vs 6) and the proportion of studies using prompt engineering increased (5/18, 28% vs 0/6, 0%), while the proportion using fine-tuning decreased (8/18, 44.4% vs 6/6, 100%). Advantages of LLMs included positive data extraction performance and reduced manual workload.
    Conclusions: LLMs applied to data extraction in oncology can serve as useful automated tools to reduce the administrative burden of reviewing patient health records and increase time for patient-facing care. Recent advances in prompt engineering, fine-tuning methods, and multimodal data extraction present promising directions for future research. Further studies are needed to evaluate the performance of LLM-enabled data extraction in clinical domains beyond the training dataset and to assess the scope and integration of LLMs into real-world clinical environments.
    Keywords:  AI; LLM; NLP; artificial intelligence; chatbot; conversational agent; data extraction; digital health; electronic health record; health information; health technology; large language model; natural language processing; oncology; scoping review
    DOI:  https://doi.org/10.2196/65984
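    As a rough illustration of the prompt-based extraction surveyed above, the sketch below asks an LLM to return a few patient characteristics as JSON. The field list, prompt wording, example note, and use of the OpenAI Python client are assumptions made for illustration; they are not a schema or pipeline from any included study, and real clinical use would need validation, privacy safeguards, and clinician oversight.

      # Minimal LLM data-extraction sketch for unstructured clinical text (illustrative).
      # Assumes: pip install openai, and OPENAI_API_KEY set in the environment.
      import json
      from openai import OpenAI

      client = OpenAI()

      FIELDS = ["cancer_type", "stage", "histology", "biomarkers"]  # hypothetical schema

      def extract_characteristics(note: str, model: str = "gpt-4o") -> dict:
          """Ask the LLM to return the requested fields as JSON, with null when absent."""
          prompt = (
              "Extract the following fields from the clinical note and respond with "
              f"JSON only, using null for anything not stated: {', '.join(FIELDS)}.\n\n"
              f"Note:\n{note}"
          )
          response = client.chat.completions.create(
              model=model,
              temperature=0,
              response_format={"type": "json_object"},  # request a JSON object back
              messages=[{"role": "user", "content": prompt}],
          )
          return json.loads(response.choices[0].message.content)

      # Invented, de-identified example note:
      print(extract_characteristics(
          "68-year-old with stage IIIA lung adenocarcinoma, EGFR exon 19 deletion."))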