bims-arines 2025-09-28 papers

bims-arines

Biomed News

on AI in evidence synthesis

Issue of 2025–09–28
twelve papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD

Validating Loon Lens 1.0 for Autonomous Abstract Screening and Confidence-Guided Human-in-the-Loop Workflows in Systematic Reviews.
Assessing the performance of large language models in literature screening for pharmacovigilance: a comparative study.
A foundation model for human-AI collaboration in medical literature mining.
Write Your Abstracts Carefully - The Impact of Abstract Reporting Quality on Findability by Semi-Automated Title-Abstract Screening Tools.
Automated production of comparison tables for shared decision making: Comparing a human-generated table (Option Grid), a search engine process, and outputs from four large language models.
Extension of the Consolidated Criteria for Reporting Qualitative Research Guideline to Large Language Models (COREQ+LLM): Protocol for a Multiphase Study.
Machine Learning in Health Economic Evaluations: Protocol for a Scoping Review.
Critical Limitations in Systematic Reviews of Large Language Models in Health Care.
Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models.
Integrating generative AI in perinatology: applications for literature review.
Author's Reply: Critical Limitations in Systematic Reviews of Large Language Models in Health Care.
Biostatisticians Meet AI: Navigating Shifts While Preserving Principles.

Value Health. 2025 Sep 23. pii: S1098-3015(25)02565-3. [Epub ahead of print]

Validating Loon Lens 1.0 for Autonomous Abstract Screening and Confidence-Guided Human-in-the-Loop Workflows in Systematic Reviews.

Ghayath Janoudi, Mara Uzun, Tim Disher, Mia Jurdana, Ena Fuzul, Josip Ivkovic, Brian Hutton.

   OBJECTIVES: Title and Abstract (TiAb) screening is a labour-intensive step in systematic literature reviews (SLR). We examine the performance of Loon Lens 1.0, an agentic AI platform for autonomous TiAb screening and test whether its confidence scores can target minimal human oversight.
METHODS: Eight SLRs by Canada's Drug Agency were re-screened through dual-human reviewers and adjudicated process (3,796 citations, 287 includes, 7.6%) and separately by Loon Lens, based on predefined eligibility criteria. Accuracy, sensitivity, precision, and specificity were measured and bootstrapped to generate 95% confidence intervals. Logistic regression with (i) confidence alone and (ii) confidence + Include/Exclude decision predicted errors and informed simulated human-in-the-loop (HITL) strategies.
RESULTS: Loon Lens achieved 95.5% accuracy (95% CI 94.8-96.1), 98.9% sensitivity (97.6-100), 95.2% specificity (94.5-95.9) and 63.0% precision (58.4-67.3). Errors clustered in Low-Medium-confidence Includes. The extended logistic regression model (confidence + decision; C-index 0.98) estimated a 75% error probability for Low-confidence Includes versus <0.1% for Very-High-confidence Excludes. Simulated HITL review of Low + Medium-confidence Includes only (145 citations, 3.8%), lifted precision to 81.4% and overall accuracy to 98.2% while preserving sensitivity (99.0%). Adding High-confidence Includes (221 citations, 5.8%) pushed precision to 89.9% and accuracy to 99.0%.
CONCLUSIONS: Across eight SLRs (3,796 citations), Loon Lens 1.0 reproduced adjudicated human screening with 98.9% sensitivity and 95.2% specificity. In simulation, restricting human-in-the-loop review to ≤5.8% of citations, by prioritising low- and medium-confidence Include calls, reduced false positives and increased precision to 89.9% while maintaining sensitivity and raising overall accuracy to 99.0%. These findings indicate that confidence-guided oversight can concentrate reviewer effort on a small subset of records.

Keywords:  artificial intelligence; health technology assessment; large language model; literature screening; systematic review

DOI:  https://doi.org/10.1016/j.jval.2025.09.008
Front Drug Saf Regul. 2024 ;4 1379260

Assessing the performance of large language models in literature screening for pharmacovigilance: a comparative study.

Dan Li, Leihong Wu, Mingfeng Zhang, Svitlana Shpyleva, Ying-Chi Lin, Ho-Yin Huang, Ting Li, Joshua Xu.

  Pharmacovigilance plays a crucial role in ensuring the safety of pharmaceutical products. It involves the systematic monitoring of adverse events and the detection of potential safety concerns related to drugs. Manual literature screening for pharmacovigilance related articles is a labor-intensive and time-consuming task, requiring streamlined solutions to cope with the continuous growth of literature. The primary objective of this study is to assess the performance of Large Language Models (LLMs) in automating literature screening for pharmacovigilance, aiming to enhance the process by identifying relevant articles more effectively. This study represents a novel application of LLMs including OpenAI's GPT-3.5, GPT-4, and Anthropic's Claude2, in the field of pharmacovigilance, evaluating their ability to categorize medical publications as relevant or irrelevant for safety signal reviews. Our analysis encompassed N-shot learning, chain-of-thought reasoning, and evaluating metrics, with a focus on factors impacting accuracy. The findings highlight the promising potential of LLMs in literature screening, achieving a reproducibility of 93%, sensitivity of 97%, and specificity of 67% showcasing notable strengths in terms of reproducibility and sensitivity, although with moderate specificity. Notably, performance improved when models were provided examples consisting of abstracts, labels, and corresponding reasoning explanations. Moreover, our exploration identified several potential contributing factors influencing prediction outcomes. These factors encompassed the choice of key words and prompts, the balance of the examples, and variations in reasoning explanations. By configuring advanced LLMs for efficient screening of extensive literature databases, this study underscores the transformative potential of these models in drug safety monitoring. Furthermore, these insights gained from this study can inform the development of automated systems for pharmacovigilance, contributing to the ongoing efforts to ensure the safety and efficacy of pharmacovigilance products.

Keywords:  LLMs; artificial intelligence; large language models; literature based discovery; pharmacovigilance

DOI:  https://doi.org/10.3389/fdsfr.2024.1379260
Nat Commun. 2025 Sep 24. 16(1): 8361

A foundation model for human-AI collaboration in medical literature mining.

Zifeng Wang, Lang Cao, Qiao Jin, Joey Chan, Nicholas Wan, Behdad Afzali, Hyun-Jin Cho, Chang-In Choi, Mehdi Emamverdi, Manjot K Gill, Sun-Hyung Kim, Yijia Li, Yi Liu, Yiming Luo, Hanley Ong, Justin F Rousseau, Irfan Sheikh, Jenny J Wei, Ziyang Xu, Christopher M Zallek, Kyungsang Kim, Yifan Peng, Zhiyong Lu, Jimeng Sun.

Applying artificial intelligence (AI) for systematic literature review holds great potential for enhancing evidence-based medicine, yet has been limited by insufficient training and evaluation. Here, we present LEADS, an AI foundation model trained on 633,759 samples curated from 21,335 systematic reviews, 453,625 clinical trial publications, and 27,015 clinical trial registries. In experiments, LEADS demonstrates consistent improvements over four cutting-edge large language models (LLMs) on six literature mining tasks, e.g., study search, screening, and data extraction. We conduct a user study with 16 clinicians and researchers from 14 institutions to assess the utility of LEADS integrated into the expert workflow. In study selection, experts using LEADS achieve 0.81 recall vs. 0.78 without, saving 20.8% time. For data extraction, accuracy reached 0.85 vs. 0.80, with 26.9% time savings. These findings encourage future work on leveraging high-quality domain data to build specialized LLMs that outperform generic models and enhance expert productivity in literature mining.

DOI: https://doi.org/10.1038/s41467-025-62058-5
J Clin Epidemiol. 2025 Sep 24. pii: S0895-4356(25)00320-8. [Epub ahead of print] 111987

Write Your Abstracts Carefully - The Impact of Abstract Reporting Quality on Findability by Semi-Automated Title-Abstract Screening Tools.

I Spiero, A M Leeuwenberg, K G M Moons, L Hooft, J A A Damen.

   INTRODUCTION: Evidence synthesis, such as the conduct of a systematic review or clinical guideline development, is time-consuming, laborious, and costly. This is largely due to the vast amounts of titles and abstracts that need to be screened. Semi-automated screening tools can accelerate this by prioritizing the most likely relevant abstracts by using an active learning strategy. The reliability of such tools in prioritizing abstracts is related to the modelling methods that the tool uses (i.e. the ability of models to make reliable predictions of study relevance), and to the quality of the data that the modelling methods are applied to (i.e. the consistency and completeness of reporting in the titles and abstracts of studies). Here, we aimed to gain insight in the latter by evaluating the association between abstract reporting characteristics and findability by semi-automated screening tools.
METHODS: We tested the impact of reporting quality of abstracts on semi-automated screening tools by evaluating whether (I) abstract reporting quality (as scored by TRIPOD), (II) abstract structure, and (III) abstract terminology usage, are associated with findability of relevant studies during semi-automated title-abstract screening. We performed simulations using a publicly available semi-automated screening tool, ASReview, and data from two previously conducted comprehensive systematic reviews of prognostic model studies.
RESULTS: We found that better abstract reporting quality was clearly associated with greater findability by the semi-automated screening tool. To a smaller extent, the use of abstract subheadings was also associated with findability. Other abstract structure characteristics and abstract terminology usage were not associated with findability.
CONCLUSIONS: We conclude that better reporting quality of abstracts is associated with better findability by semi-automated title-abstract screening tools. This stresses the importance of adhering to abstract reporting guidelines, not only for consistent and transparent reporting across studies in general, but also to enhance the identification of relevant studies by screening tools during evidence synthesis.

Keywords:  Evidence synthesis; active learning; prioritized screening; reporting guidelines; reporting quality of abstracts; technology-assisted reviewing

DOI:  https://doi.org/10.1016/j.jclinepi.2025.111987
Patient Educ Couns. 2025 Sep 19. pii: S0738-3991(25)00723-2. [Epub ahead of print]142 109356

Automated production of comparison tables for shared decision making: Comparing a human-generated table (Option Grid), a search engine process, and outputs from four large language models.

Padhraig Ryan, Glyn Elwyn.

   OBJECTIVES: To explore the ability of artificial intelligence to produce comparison tables to facilitate shared decision-making.
METHODS: An expert human-generated comparison table (Option Grid ™) was compared to four comparison tables produced by large language models and one produced using a Google search process that a patient might undertake. Each table was prepared for a patient with osteoarthritis of the knee, considering a knee replacement. The results were compared to the Option Grid.™ The information items in each comparison table were divided into eight categories: the intervention process; benefits; side effects & adverse effects; pre-operative care; post-operative care & physical recovery; repeat surgery; decision-making process; and alternative interventions. We assessed the accuracy of each information item in a binary manner (accurate, inaccurate).
RESULTS: OpenBioLLM-70b and two proprietary ChatGPT models generated similar frequencies of information items across most categories, but omitted information on alternative interventions. The Google search process yielded the highest number of information items (n = 41), and OpenBioLLM-8b yielded the lowest (n = 20). Accuracy, compared to the human Option Grid, was 97 % for the ChatGPT models and the open-source OpenBioLLM-70b, and 95 % for OpenBioLLM-8b and the Google search process. The human-generated Option Grid had superior readability.
CONCLUSIONS: Large language models produced comparison tables that are 3-5 % less accurate than a human generated Option Grid. Comparison tables produced by large language models may be less readable and require additional checking and editing.
PRACTICE IMPLICATIONS: Subject to fact-checking and feedback, large language models may have a role to play in scaling up the production of evidence-based comparison tables that could assist patients and others.

Keywords:  Artificial intelligence; Deep learning; Machine learning; Patient participation; Shared decision-making

DOI:  https://doi.org/10.1016/j.pec.2025.109356
JMIR Res Protoc. 2025 Sep 24. 14 e78682

Extension of the Consolidated Criteria for Reporting Qualitative Research Guideline to Large Language Models (COREQ+LLM): Protocol for a Multiphase Study.

Leonard Fehring, Julian Frings, Paul Rust, Christian Kempny, Petra A Thürmann, Sven Meister.

   BACKGROUND: Qualitative research provides essential insights into human behaviors, perceptions, and experiences in health sciences. The COREQ (Consolidated Criteria for Reporting Qualitative Research) checklist, published in 2007 and endorsed by the Enhancing the Quality and Transparency of Health Research Network, advanced transparency of qualitative research reporting. However, the recent integration of large language models (LLMs) into qualitative research introduces novel opportunities and methodological challenges that existing guidelines do not address. LLMs are increasingly applied to research design as well as processing, analysis, interpretation, and even direct interaction ("conversing") with qualitative data. However, their probabilistic nature, dependence on underlying training data, and susceptibility to hallucinations necessitate dedicated reporting to ensure transparency, reproducibility, and methodological validity.
OBJECTIVE: This protocol outlines the methodological development process of COREQ+LLM, an extension to the COREQ checklist, to support transparent reporting of LLM use in qualitative research. The three main objectives are to (1) identify and categorize current applications of LLMs used as qualitative research tools, (2) assess how LLM use in qualitative studies in health care is reported in published studies, and (3) develop and refine reporting items for COREQ+LLM through a structured consensus process among international experts.
METHODS: Following the Enhancing the Quality and Transparency of Health Research Network guidance for reporting guideline development, this study comprises 4 main phases. Phase 1 is a systematic scoping review of peer-reviewed literature from January 2020 to April 2025, examining the use and reporting of LLMs in qualitative research. The scoping review protocol was registered with the Open Science Framework on June 6, 2025, and will adhere to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines. Phase 2 will use a Delphi process to reach consensus on candidate items for inclusion in the COREQ+LLM checklist among an interdisciplinary international panel of experts. Phase 3 includes pilot testing, and phase 4 involves publication and dissemination.
RESULTS: As of September 2025, the steering committee has been established, and the initial search strategy for the scoping review has identified 5049 records, with 4201 (83.20%) remaining after duplicate removal. Title and abstract screening is underway and will inform the initial draft of candidate checklist items. The COREQ+LLM extension is scheduled for completion by December 2025.
CONCLUSIONS: The integration of LLMs in qualitative research requires dedicated reporting guidelines to ensure methodological rigor, transparency, and interpretability. COREQ+LLM will address current reporting gaps by offering specific guidance for documenting LLM integration in qualitative research workflows. The checklist will assist researchers in transparently documenting LLM use, support reviewers and editors in evaluating methodological quality, and foster trust in LLM-supported qualitative research. By December 2025, COREQ+LLM will provide a rigorously developed tool to enhance the transparency, validity, and reproducibility of LLM-supported qualitative studies.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/78682.

Keywords:  AI; COREQ; Consolidated Criteria for Reporting Qualitative Research; LLMs; artificial intelligence; large language models; qualitative research; reporting guideline

DOI:  https://doi.org/10.2196/78682
JMIR Res Protoc. 2025 Sep 24. 14 e77494

Machine Learning in Health Economic Evaluations: Protocol for a Scoping Review.

Hanan Daghash, Ashleigh Kernohan, Rosiered Brownson-Smith, Rohan Pandey, Ananya Ananthakrishnan, Cen Cong, Victoria Riccalton, Edward Meinert, Gurdeep S Sagoo.

   BACKGROUND: In recent years, the development of machine learning (ML) applications has increased substantially, indicating the potential role of ML in transforming health care. However, the integration of ML approaches into health economic evaluations is underexplored and has several challenges.
OBJECTIVE: This scoping review aims to explore the applications of ML in health economic evaluations. This review will also seek to identify some potential challenges to the use of ML in health economic evaluations.
METHODS: This review will use PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) methods. The search will be conducted on MEDLINE (Ovid), Embase (Ovid), IEEE Xplore, and Cochrane Library databases. The eligibility criteria of the selection process will be based on the study types, data sources, methods, and outcomes (SDMO) framework approach.
RESULTS: The database search yielded 4141 records after removal of retractions and duplicates. Title and abstract screening of 3718 records has been completed, resulting in 30 reports retrieved for eligibility assessment. Data extraction and charting are currently in progress. The results will be published in peer-reviewed journals by the end of 2025.
CONCLUSIONS: This review will help to build up the current understanding of how ML applications are integrated in health economics evaluations. This will also explore the potential barriers to and challenges of using ML in health economics evaluations.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/77494.

Keywords:  cost-effectiveness analysis; economic evaluations; economic modelling; health economics; machine learning

DOI:  https://doi.org/10.2196/77494
J Med Internet Res. 2025 Sep 24. 27 e81769

Critical Limitations in Systematic Reviews of Large Language Models in Health Care.

Zvi Weizman.



Keywords:  AI; LLM; artificial intelligence; clinical; digital health; health care; large language models; letter; review

DOI:  https://doi.org/10.2196/81769
Proc Mach Learn Res. 2024 Aug;pii: https://proceedings.mlr.press/v252/yun24a.html. [Epub ahead of print]252

Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models.

Hye Sun Yun, David Pogrebitskiy, Iain J Marshall, Byron C Wallace.

Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of natural language processing (NLP) models to date. In this work, we evaluate whether modern large language models (LLMs) can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs-including ones trained on biomedical texts-perform poorly when the outcome measures are complex and tallying the results requires inference. This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim.
J Perinat Med. 2025 Sep 23.

Integrating generative AI in perinatology: applications for literature review.

Rodrigo Ayala-Yáñez, Amos Grünebaum, Frank A Chervenak.

  Perinatology relies on continuous engagement with an expanding body of clinical literature, yet the volume and velocity of publications increasingly exceed the capacity of clinicians to keep pace. Generative artificial intelligence (GAI) tools - such as ChatGPT4, Claude AI, Gemini, and Perplexity AI - offer a novel approach to assist with literature retrieval, comparison of clinical guidelines, and manuscript drafting. This study evaluates the strengths and limitations of these tools in maternal-fetal medicine, using structured clinical prompts to simulate real-world applications. Perplexity AI demonstrated the best citation accuracy, while ChatGPT4 and Claude excelled in content summarization but required manual verification of citations. In simulated trials, GAI tools reduced the time to generate clinically relevant summaries by up to 70 % compared to traditional PubMed searches. However, risks such as hallucinated references and overreliance on machine-generated text persist. Use cases include summarizing aspirin use guidelines for preeclampsia and comparing ACOG vs. NICE protocols. GAI should be viewed as a supportive assistant, not a substitute, for expert review. To ensure responsible integration, clinicians must develop AI literacy, apply rigorous oversight, and adhere to ethical standards. When used judiciously, GAI can enhance efficiency, insight, and evidence-based decision-making in perinatal care.

Keywords:  generative artificial intelligence; perinatology; practice guidelines; systematic review and automation

DOI:  https://doi.org/10.1515/jpm-2025-0392
J Med Internet Res. 2025 Sep 24. 27 e82729

Author's Reply: Critical Limitations in Systematic Reviews of Large Language Models in Health Care.

Andre Python, HongYi Li, Jun-Fen Fu.



Keywords:  AI; LLM; LLM review; artificial intelligence; clinical; digital health; large language model; letter; review

DOI:  https://doi.org/10.2196/82729
Stat Med. 2025 Sep;44(20-22): e70271

Biostatisticians Meet AI: Navigating Shifts While Preserving Principles.

Bin Zhu.



Keywords:  ChatGPT; artificial intelligence; benchmark; biostatistics; large language model

DOI:  https://doi.org/10.1002/sim.70271