bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-08-10
Six papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Stud Health Technol Inform. 2025 Aug 07. 329:1886-1887
      Systematic reviews involve time-intensive processes of screening titles, abstracts, and full texts to identify relevant studies. This study evaluates the potential of large language models (LLMs) to automate citation screening across three datasets with varying inclusion rates. Six LLMs were tested using zero- to five-shot in-context learning, with demonstrations selected by PubMedBERT-based semantic similarity. Majority voting and ensemble learning were applied to enhance performance. Results showed that no single LLM consistently excelled across the datasets, with sensitivity and specificity influenced by inclusion rates. Overall, ensemble learning and majority voting improved citation-screening performance.
    Keywords:  Large language model; ensemble learning; majority voting
    DOI:  https://doi.org/10.3233/SHTI251264
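    Illustration (not from the paper): the abstract above describes majority voting over the include/exclude decisions of several LLMs but publishes no code, so the Python sketch below shows only the basic idea; the six votes for a single citation are hypothetical.

        from collections import Counter

        def majority_vote(votes):
            """Return the screening label chosen by most models for one citation.
            Ties are resolved by first-seen order, which suffices for a sketch."""
            return Counter(votes).most_common(1)[0][0]

        # Hypothetical include/exclude decisions from six LLMs for one citation.
        votes = ["include", "exclude", "include", "include", "exclude", "include"]
        print(majority_vote(votes))  # -> "include"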
  2. BMC Med Inform Decis Mak. 2025 Aug 07. 25(1): 293
       BACKGROUND: Systematic reviews (SRs) and rapid reviews (RRs) are critical methodologies for synthesizing existing research evidence. However, the growing volume of literature has made the process of screening studies one of the most challenging steps in conducting systematic reviews.
    METHODS: This systematic review aimed to compare the performance of Abstrackr and GPT models (including GPT-3.5 and GPT-4) in literature screening for systematic reviews. We identified relevant studies through comprehensive searches in PubMed, Cochrane Library, and Web of Science, focusing on those that provided key performance metrics such as recall, precision, specificity, and F1 score.
    RESULTS: GPT models demonstrated superior performance compared to Abstrackr in precision (0.51 vs. 0.21), specificity (0.84 vs. 0.71), and F1 score (0.52 vs. 0.31), reflecting a higher overall efficiency and better balance in screening. This makes GPT models particularly effective in reducing false positives during fine-screening tasks.
    CONCLUSION: Abstrackr and GPT models each offer distinct advantages in literature screening. Abstrackr is more suitable for the initial screening phases, whereas GPT models excel in fine-screening tasks. To optimize the efficiency and accuracy of systematic reviews, future screening tools could integrate the strengths of both models, potentially leading to the development of hybrid systems tailored to different stages of the screening process.
    DOI:  https://doi.org/10.1186/s12911-025-03138-w
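    Illustration (not from the review): the precision, specificity, and F1 figures quoted above follow from the standard screening confusion matrix; the Python sketch below computes them from hypothetical gold labels and tool predictions (1 = include, 0 = exclude).

        def screening_metrics(y_true, y_pred):
            """Compute screening metrics from gold labels and predictions (1 = include)."""
            tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
            fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
            tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
            fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
            recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
            precision = tp / (tp + fp) if tp + fp else 0.0
            specificity = tn / (tn + fp) if tn + fp else 0.0
            f1 = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)
            return {"recall": recall, "precision": precision,
                    "specificity": specificity, "f1": f1}

        y_true = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical reviewer decisions
        y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # hypothetical tool predictions
        print(screening_metrics(y_true, y_pred))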
  3. NPJ Digit Med. 2025 Aug 08. 8(1): 509
      Clinical evidence synthesis largely relies on systematic reviews (SRs) of clinical studies from the medical literature. Here, we propose a generative artificial intelligence (AI) pipeline named TrialMind to streamline the study search, study screening, and data extraction tasks in SRs. We chose published SRs to build TrialReviewBench, which contains 100 SRs and 2,220 clinical studies. For study search, TrialMind achieves high recall (0.711-0.834 vs. a human baseline of 0.138-0.232). For study screening, it outperforms previous document-ranking methods by a factor of 1.5-2.6. For data extraction, it exceeds GPT-4's accuracy by 16-32%. In a pilot study, human-AI collaboration with TrialMind improved recall by 71.4% and reduced screening time by 44.2%, while in data extraction, accuracy increased by 23.5% with a 63.4% time reduction. Medical experts preferred TrialMind's synthesized evidence over GPT-4's in 62.5%-100% of cases. These findings show the promise of human-AI collaboration for accelerating clinical evidence synthesis.
    DOI:  https://doi.org/10.1038/s41746-025-01840-7
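    Illustration (not from the paper): the study-search recall reported above is the fraction of a review's gold-standard included studies that the search retrieves. A minimal Python sketch with placeholder identifiers:

        def search_recall(retrieved, gold):
            """Fraction of gold-standard included studies found by the search."""
            return len(set(retrieved) & set(gold)) / len(set(gold))

        # Placeholder study identifiers, not real PMIDs.
        gold_included = {"pmid_01", "pmid_02", "pmid_03", "pmid_04", "pmid_05"}
        retrieved = {"pmid_01", "pmid_03", "pmid_04", "pmid_09"}
        print(search_recall(retrieved, gold_included))  # -> 0.6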
  4. Stud Health Technol Inform. 2025 Aug 07. 329:1648-1649
      A rapidly expanding array of Artificial Intelligence (AI) tools, with continually evolving features and functionalities, offers unprecedented opportunities to streamline literature reviews by expediting the screening, extraction, and synthesis phases. We present preliminary findings from an evaluation of the strengths and limitations of various AI tools.
    Keywords:  AI; Artificial intelligence; evaluation; literature review
    DOI:  https://doi.org/10.3233/SHTI251145
  5. Stud Health Technol Inform. 2025 Aug 07. 329:723-727
      The fundamental process of evidence extraction in evidence-based medicine relies on identifying PICO elements, with Outcomes being the most complex and often overlooked. To address this, we introduce EvidenceOutcomes, a large annotated corpus of clinically meaningful outcomes. A robust annotation guideline was developed in collaboration with clinicians and NLP experts, and three annotators labeled the Results and Conclusions sections of 500 PubMed abstracts and 140 EBM-NLP abstracts, achieving an inter-rater agreement of 0.76. A fine-tuned PubMedBERT model achieved F1 scores of 0.69 at the entity level and 0.76 at the token level. EvidenceOutcomes offers a benchmark for advancing machine learning algorithms that extract clinically meaningful outcomes.
    Keywords:  Biomedical Literature Research; NLP; PICO Outcomes; RCT
    DOI:  https://doi.org/10.3233/SHTI250935
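    Illustration (not from the paper): the two F1 granularities reported above differ in what counts as a match, i.e. per-token tags versus exact entity spans. The Python sketch below contrasts them on hypothetical BIO tag sequences.

        def spans(tags):
            """Extract (start, end) entity spans from a BIO tag sequence."""
            out, start = [], None
            for i, tag in enumerate(tags + ["O"]):   # sentinel flushes a trailing span
                if tag != "I" and start is not None:
                    out.append((start, i))
                    start = None
                if tag == "B":
                    start = i
            return out

        def f1(n_pred, n_gold, n_correct):
            p = n_correct / n_pred if n_pred else 0.0
            r = n_correct / n_gold if n_gold else 0.0
            return 2 * p * r / (p + r) if p + r else 0.0

        gold = ["O", "B", "I", "O", "B"]   # hypothetical gold outcome tags
        pred = ["O", "B", "I", "I", "B"]   # hypothetical model predictions

        # Token level: each non-O position is judged independently.
        tok_correct = sum(g == p and g != "O" for g, p in zip(gold, pred))
        tok_pred = sum(p != "O" for p in pred)
        tok_gold = sum(g != "O" for g in gold)
        print("token-level F1:", round(f1(tok_pred, tok_gold, tok_correct), 2))

        # Entity level: a prediction counts only if the whole span matches exactly.
        gold_spans, pred_spans = set(spans(gold)), set(spans(pred))
        print("entity-level F1:", round(f1(len(pred_spans), len(gold_spans),
                                           len(gold_spans & pred_spans)), 2))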
  6. Stud Health Technol Inform. 2025 Aug 07. 329:239-243
       BACKGROUND: Automated classification of medical literature is increasingly vital, especially in oncology. As shown in previous work, LLMs can be used as part of a flexible framework to accurately classify biomedical literature and trials. In the present study, we aimed to explore to what extent a consensus-based approach could improve classification performance.
    METHODS: Three LLMs (Mixtral-8x7B, Meta-Llama-3.1-70B, and Qwen2.5-72B) were used to classify oncological trials across four datasets using nine classification questions. Metrics (accuracy, precision, recall, F1-score) were assessed for the individual models and for the consensus results.
    RESULTS: Consensus was achieved in 93.93% of cases, improving accuracy (98.34%), precision (97.01%), recall (98.11%), and F1-score (97.55%) over individual models.
    CONCLUSIONS: The consensus-based LLM framework delivers high accuracy and adaptability for classifying oncological trials, with potential applications in biomedical research and trial management.
    Keywords:  knowledge synthesis; large language models; natural language processing; oncology; text classification
    DOI:  https://doi.org/10.3233/SHTI250837
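    Illustration (not from the paper): the abstract above reports consensus results but does not publish the exact rule, so the Python sketch below assumes a simple rule in which a label is accepted only when all three models agree and is otherwise flagged for manual review; the trial-phase labels are hypothetical.

        from collections import Counter

        def consensus(answers, min_agree=3):
            """Return the shared label if at least min_agree models agree, else None."""
            label, count = Counter(answers).most_common(1)[0]
            return label if count >= min_agree else None

        per_trial_answers = [
            ["phase 2", "phase 2", "phase 2"],   # unanimous -> accepted
            ["phase 2", "phase 3", "phase 2"],   # 2 of 3 -> no consensus under this rule
            ["phase 1", "phase 2", "phase 3"],   # disagreement -> manual review
        ]
        for answers in per_trial_answers:
            print(answers, "->", consensus(answers))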