bims-arines 2026-05-17 papers

bims-arines

Biomed News

on AI in evidence synthesis

Issue of 2026–05–17
ten papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD

When to stop reviewing: validation of stop criteria in ASReview.
Performance of large language models and prompt engineering strategies for data extraction in systematic reviews.
Empowering open medium-sized generative language models for effective structured search in biomedical systematic reviews.
Evaluating the Methodological Quality of Artificial Intelligence-Assisted Systematic Reviews: Protocol for a Mixed Methods Meta-Research Study.
Using large language models for automated assessment of reporting quality and completeness of prediction model studies.
Post-market surveillance for literature on QuantiFERON TB Gold Plus: what improvements can artificial intelligence bring?
Evaluating large language models for evidence-based clinical question answering.
Multi-Stage LLM Pipeline to Support Qualitative Content Analysis - A Proof of Concept Experiment with Expert Validation.
Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow.
Acceptability of a Large Language Model (LLM)-Generated Guideline-Based Checklist Among Otolaryngologists: A Cross-Sectional Survey and Thematic Analysis.

BMC Med Res Methodol. 2026 May 09. pii: 109. [Epub ahead of print]26(1):

When to stop reviewing: validation of stop criteria in ASReview.

C Kempny, K Annac, D Wahidie, Y Yilmaz-Aslan, P Brzoska.

   BACKGROUND: Systematic reviews are essential for evidence-based research, but they often require a great deal of time and effort. Although title and abstract (T&A) screening is just one part of the review process, it can be very time-consuming when search strategies retrieve large numbers of records. Given the exponential growth of scientific publications in recent decades, tools such as ASReview, which use machine learning (ML) for active learning-based screening, aim to reduce the workload. However, since ASReview helps users prioritise potentially relevant studies rather than supporting them in screening all records, a key question arises: at what point can the screening process safely stop without risking the omission of relevant studies when the entire dataset is not being reviewed?
METHODS: This simulation study tested three proposed stop criteria for terminating screening in ASReview without loss of relevant data: (1) stopping after a calculated number of relevant studies based on an initial sample; (2) stopping after a fixed number of consecutively studies deemed irrelevant; (3) stopping after a predefined percentage of the dataset has been screened. A total of 35,000 automated title and abstract screenings were conducted using five datasets from the SYNERGY repository. Key outcomes included the percentage of studies screened until the last relevant study was found and the number of relevant studies missed under each stop criterion.
RESULTS: The proportion of the dataset that needed to be screened to identify all relevant studies (as pre-classified in the SYNERGY dataset) varied greatly across datasets, ranging from 2.9% to 76.9% on average. None of the tested stop criteria could consistently identify all relevant studies across all datasets. Stop criterion 1 was reliable in only 2% of simulations. Stop criterion 2 showed high variability, with thresholds ranging from 2% to 61%, depending on the dataset. Stop criterion 3 failed to define a universal percentage applicable across datasets.
CONCLUSIONS: ASReview can reduce screening workload by prioritizing potentially relevant studies through ML-based ranking, thereby allowing researchers to identify relevant studies earlier in the screening process. However, no stop criterion reliably ensures that all relevant studies are identified. Early stopping may result in missed studies, depending on dataset characteristics. Current stop criteria should be applied cautiously and potentially combined with quality assurance measures. Further research is needed to develop more robust and generalizable stopping rules.
TRIAL REGISTRATION: Not applicable - this is a simulation study, not a registered systematic review.

Keywords:  ASReview; Machine learning; Screening efficiency; Stop Criteria; Systematic reviews; Title and abstract screening

DOI:  https://doi.org/10.1186/s12874-026-02866-5
Front Digit Health. 2026 ;8 1799623

Performance of large language models and prompt engineering strategies for data extraction in systematic reviews.

Takehiko Oami, Yohei Okada, Kenjiro Maeda, Taka-Aki Nakada.

   Background: Systematic reviews depend on manual data extraction and synthesis, which are time-consuming and prone to human error. Although large language models (LLMs) have the potential to automate parts of this process, their accuracy, reproducibility, and efficiency across different models and prompt strategies remain insufficiently characterized.
Methods: This study evaluated the performance of three LLMs, including ChatGPT-4o, Claude 3 Sonnet, and Gemini 1.5 Pro, for data extraction from trials addressing five clinical questions (CQs) in the Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024 (J-SSCG 2024). Using portable document format files of eligible studies, LLMs extracted predefined background characteristics and clinical outcomes. Outputs generated using an original prompt were compared with those produced using chain-of-thought and self-reflection (SR) prompt strategies. Two independent reviewers assessed accuracy against a reference standard derived from manual extraction by the guideline members. Inter-session consistency across three sessions and processing time were also evaluated.
Results: For background data extraction, mean no-error proportions ranged from 81.6% (ChatGPT-4o) to 92.4% (Claude 3 Sonnet) across models. For outcome data extraction, mean no-error proportions ranged from 27.8% (Gemini 1.5 Pro) to 80.7% (Claude 3 Sonnet). Missing or incorrect values accounted for most extraction errors, whereas fabricated outputs were relatively uncommon. Prompt engineering strategies resulted in only modest changes in extraction accuracy across models. Inter-session consistency ranged from 76.3% (ChatGPT-4o) to 91.3% (Gemini 1.5 Pro) for background data extraction and from 44.8% (ChatGPT-4o) to 65.6% (Claude 3 Sonnet) for outcome data extraction. Mean processing times ranged from 29.2 to 39.7 s per article for background data extraction and from 19.3 to 46.3 s for outcome data extraction using standard prompts. When SR prompts were used, processing times increased to 59.0 to 97.7 s for background data extraction and to 52.7 to 107.1 s for outcome data extraction.
Conclusions: LLMs can reliably support background data extraction in systematic reviews. However, outcome data extraction remains challenging, emphasizing the continued need for human oversight. Extraction performance varied across models and prompt engineering strategies.Clinical Trial Registration: The study was registered in the University Hospital Medical Information Network (UMIN) clinical trials registry, identifier (UMIN000054461).

Keywords:  clinical practice guidelines; data extraction; large language model; sepsis; systematic review

DOI:  https://doi.org/10.3389/fdgth.2026.1799623
Int J Med Inform. 2026 May 06. pii: S1386-5056(26)00203-0. [Epub ahead of print]216 106463

Empowering open medium-sized generative language models for effective structured search in biomedical systematic reviews.

Leandra Budau, Richard Finney, Faezeh Ensan.

   BACKGROUND: Systematic Literature Reviews (SLRs) are essential in biomedical research, particularly for informing public health policy and clinical decision-making. However, the manual generation of Boolean queries for literature searches is resource-intensive, prone to errors, and difficult to scale. Recent advances in large language models (LLMs) have demonstrated potential, yet most existing approaches rely on zero-shot prompting of commercial models, overlooking the cost-efficiency and domain adaptability of fine-tuned open-source alternatives.
METHODS: This study proposes a novel, three-stage framework that employs medium-sized, open-source generative models, specifically BioGPT and BioT5, for automated Boolean query generation over PubMed. We develop and release datasets comprising PubMed article titles, MeSH terms, and keywords, and fine-tune the models using both title-only and title-plus-metadata prompts. We evaluate performance on two benchmark datasets: CLEF TAR and FASS-BSLR. Our experiments include comparisons with state-of-the-art baselines, prompt-based large language models, and ablation studies exploring the effects of training data size, metadata inclusion, and post-processing with PubMed's Automatic Term Mapping.
RESULTS: Fine-tuned BioGPT outperforms both traditional TAR models and commercial LLMs across key retrieval metrics. On the CLEF TAR dataset, it achieves a Precision of 0.2544, F1 of 0.2392, MAP@1000 of 0.1424, and NDCG@1000 of 0.2490, which surpasses all baselines. On the FASS dataset, it reaches a Recall of 0.1801 and NDCG@1000 of 0.0900, again outperforming all competing models. While slightly behind BioGPT, BioT5 still outperforms most baselines. Notably, BioGPT's Recall of 0.1801 on FASS is more than twice that of PubMed-Title and PubMed-Keyword, and exceeds GPT-3.5 Turbo, GPT-4, Gemini-2, and Llama-3.
CONCLUSION: This work demonstrates that fine-tuned, open-source, medium-sized generative models can match or exceed the performance of much larger commercial LLMs in Boolean query generation for biomedical SLRs. These models offer a cost-effective, privacy-preserving, and scalable alternative for structured retrieval of biomedical scholarly texts.

Keywords:  Automated boolean query generation; Biomedical systematic literature reviews; Clinical decision-making; Medium-sized open-source generative models

DOI:  https://doi.org/10.1016/j.ijmedinf.2026.106463
JMIR Res Protoc. 2026 May 14. 15 e90588

Evaluating the Methodological Quality of Artificial Intelligence-Assisted Systematic Reviews: Protocol for a Mixed Methods Meta-Research Study.

Mohammad Jay, Mary Morgan, Sharon Elizabeth Straus, Emma Wilson, Rinku Sutradhar, Catherine Yu, Lesley Gotlib Conn, Christoffer Dharma, Lorraine Lipscombe, Antoine Eskander.

   Background: Artificial intelligence (AI), including large language models (LLMs), is increasingly integrated into systematic review (SR) workflows. AI tools may accelerate searching, screening, data extraction, and reporting, but their effects on methodological quality, reporting completeness, transparency, and reproducibility remain uncertain. Existing evaluations largely examine isolated tasks, and inconsistent disclosure of AI use limits reproducibility and oversight.
Objective: This 4-phase mixed methods meta-research study will (1) compare the methodological quality of AI-assisted versus traditional SRs; (2) refine, finalize, and apply a preliminary AI Transparency and Disclosure Index (AITDI); (3) evaluate reproducibility by comparing outputs across repeated runs of the same AI model, across different AI models, and between AI models and human reviewers at multiple SR stages; and (4) explore knowledge user perspectives on rigor, transparency, and trust in AI-assisted SRs.
Methods: We will conduct a matched cohort analysis of SRs published from 2023 to 2025 in biomedical journals. Each AI-assisted SR will be matched 1:2 with traditional SRs by publication year, clinical domain, review type, and meta-analysis status. Two independent reviewers will apply A Measurement Tool to Assess Systematic Reviews, version 2 (AMSTAR 2; methodological quality), PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 (reporting completeness), and, when applicable, Risk of Bias in SRs (ROBIS; risk-of-bias rigor). A preliminary AITDI will be refined and then applied to all AI-assisted SRs. Reproducibility will be assessed using SR-derived task sets to compare outputs across repeated runs of the same model, across different models, and between AI and human reviewers at key SR stages. Semistructured interviews with authors, editors, clinicians, policymakers, and patient partners will be analyzed using reflexive thematic analysis.
Results: As of December 2025, the study has been preregistered on the Open Science Framework (OSF; DOI: 10.17605/OSF.IO/Q5JRW), the search strategy has been finalized, and title/abstract screening has begun. Data extraction is planned for March-May 2026, followed by AITDI refinement and reproducibility testing from May 2026 to October 2026. Qualitative interviews are anticipated from October 2026 to February 2027, with final analyses by April 2027 and dissemination planned for mid-2027.
Conclusions: This study will provide one of the first empirical comparisons of methodological quality, transparency, and reproducibility of AI-assisted versus traditional SRs in the LLM era. Findings will inform expectations for responsible AI integration and support refinement of reporting and methodological best practices, including future development of AI-specific reporting and appraisal extensions (eg, PRISMA-LLM [Preferred Reporting Items for Systematic Reviews and Meta-Analyses-large language model] and AMSTAR-LLM [A Measurement Tool to Assess Systematic Reviews-large language model]).

Keywords:  AMSTAR-2; PRISMA 2020; artificial intelligence; evidence synthesis; large language models; meta-research; reproducibility; systematic review; transparency

DOI:  https://doi.org/10.2196/90588
J Clin Epidemiol. 2026 May 13. pii: S0895-4356(26)00195-2. [Epub ahead of print] 112320

Using large language models for automated assessment of reporting quality and completeness of prediction model studies.

I Spiero, A M Leeuwenberg, J Kamperman, L Hooft, K G M Moons, J A A Damen.

   BACKGROUND AND OBJECTIVE: Transparent and complete reporting in scientific papers is important for interpretation of study results and for downstream evidence generation such as systematic reviews and clinical guidelines. Many reporting checklists and tools have been developed for various types of biomedical research studies, but adherence assessment to these checklists is costly and laborious. Recent developments in large language models (LLMs) can accelerate assessment by automatically scoring the items of a checklist. We aimed to evaluate whether LLMs can accurately and efficiently assess the quality and completeness of reporting in prediction model studies based on the TRIPOD reporting guideline.
METHODS: We selected and evaluated five LLMs (Gemini-2.5-pro, Gemma3, GPT-5, Granite3.3, and Llama3.2) in their ability to automatically assess the 93 items of the TRIPOD checklist of 70 manually scored papers. The LLMs were asked to score each item with 'reported' or 'not reported' and to provide a supporting text quote. We evaluated the LLMs in terms of score correctness (sensitivity, precision, and F1-score), quote correctness, and resources required, using the manual double-reviewer scored papers as reference.
RESULTS: We found that Gemini-2.5-pro and GPT-5 performed best in scoring items with F1-scores of 0.74 and 0.73, respectively. With regard to quote correctness, the Gemini-2.5-pro model returned correct quotes in 80-97% of the cases based on manual evaluation. The LLMs took 20-30 minutes to conduct one complete checklist assessment.
CONCLUSION: We conclude that Gemini-2.5-pro and GPT-5 are suitable to implement in a tool in a semi-automated manner rather than in full automation to assist researchers, reviewers, and editors to check the quality and completeness of reporting of prediction model studies according to the TRIPOD reporting checklist. By assisting in the reporting quality and adherence scoring, the time spent on adherence assessments can be drastically decreased and thereby reporting in literature can be improved.

Keywords:  automation; large language models; natural language processing; prediction modeling; reporting guidelines

DOI:  https://doi.org/10.1016/j.jclinepi.2026.112320
IJTLD Open. 2026 May;3(5): 293-297

Post-market surveillance for literature on QuantiFERON TB Gold Plus: what improvements can artificial intelligence bring?

J Reniewicz, R Alagna, L Kordylas, M Latacz, V Suryaprakash, U Nowak, A Kois-Ostrowska, J Weleszczuk, A Blacha, V Nikolayevskyy.

   BACKGROUND: Post-market surveillance (PMS) under the European Union In Vitro Diagnostic Regulation (IVDR) demands proactive, literature-based evidence, but mature assays like QuantiFERON TB Gold Plus (QFT-Plus) generate volumes of peer-reviewed and other literature that can strain manual workflows.
METHODS: We ran a comparative study of an AI-enabled literature-surveillance platform (jointly developed with Huma.ai called the Huma.ai Platform) versus manual search for QFT-Plus PMS. PubMed and PubMed Central were queried for publications in 2024; human studies published in English underwent duplicate screening and full-text appraisal. Outcomes were yield, precision, overlap/unique entries, and reviewer time.
RESULTS: The Huma.ai Platform retrieved 673 records, with 661 relevant to screening (98.21% precision). Manual searching retrieved 111, with 106 relevant to screening (95.50% precision): there were 103 shared and three manual-only items (metadata gaps). The Huma.ai Platform contributed 561 unique papers, 5 of which were excluded after full-text appraisal. In total, 664 articles were evaluated; no new safety signals were identified. Screening time averaged ∼16 s per article with Huma.ai Platform versus ∼60 s manually; full-text time (∼15 min per article) was similar.
CONCLUSION: AI-assisted surveillance substantially increases coverage and reduces screening effort while maintaining high precision. Thus it supports efficient, reproducible PMS for QFT-Plus.

Keywords:  diagnosis; in vitro diagnostic medical device; interferon-gamma release assay; literature searches; natural language processing; tuberculosis

DOI:  https://doi.org/10.5588/ijtldopen.25.0711
Patterns (N Y). 2026 May 08. 7(5): 101519

Evaluating large language models for evidence-based clinical question answering.

Can Wang, Yiqun Chen.

  Large language models show potential in clinical applications, yet reliability for evidence-based medicine requires rigorous evaluation. We curated a multi-source benchmark with more than 20,000 question answering pairs from systematic reviews and clinical guidelines to assess performance on GPT-5, GPT-4o-mini, Claude 4, and DeepSeek-v3. Accuracy was highest with structured guidelines (90%), lower with narrative sources (70%), and lowest with systematic reviews (50%-60%). All models struggled with ambiguous evidence. We found that higher citation counts for source material correlated with increased accuracy and observed moderate geographic variation in performance. However, accuracy did not vary significantly by publication year or domain prevalence. Retrieval-augmented generation bolstered performance; providing the top three PubMed-retrieved articles yielded a 23% accuracy gain. These patterns were consistent across models, demonstrating that source clarity and targeted retrieval drive performance. We conclude that stratified evaluation and retrieval strategies are essential for ensuring factual alignment and reliability in high-stakes clinical decision-making.

Keywords:  AI evaluation; AI for medicine; benchmark datasets; biomedical NLP; clinical question answering; evidence-based medicine; large language models

DOI:  https://doi.org/10.1016/j.patter.2026.101519
Stud Health Technol Inform. 2026 May 07. 335 110-116

Multi-Stage LLM Pipeline to Support Qualitative Content Analysis - A Proof of Concept Experiment with Expert Validation.

Eva Forster, Nadja Kartschmit, Elisabeth Klager, Emmilie Mosor, Benjamin Schuster, Erika Mosor, Tanja Stamm, Klaus Donsa.

   BACKGROUND: Qualitative interview studies are a cornerstone of health and social science research, but manual analysis is time-intensive and difficult to scale, particularly in larger datasets. While Large Language Models (LLMs) offer new opportunities, concerns about transparency, reproducibility, and methodological validity have limited their scientific adoption.
OBJECTIVES: We present a four-stage LLM pipeline comprising segmentation, coding, concept development, and quote extraction, designed to replicate expert-driven qualitative analysis with a complete, auditable analysis trail.
METHODS: The pipeline was applied to 28 semi-structured interview transcripts on health data donation and evaluated by five researchers who conducted the original manual analysis using the QUEST framework.
RESULTS: The pipeline produced 12 higher-level and 73 lower-level concepts in 45 minutes, demonstrating substantial efficiency gains compared to manual analysis. Expert assessment confirmed high content validity, strong thematic overlap with manual results, and all outputs traceable to source text. The majority of evaluators deemed outputs suitable for scientific use following minor revisions.
CONCLUSION: LLM-assisted qualitative analysis, embedded in a transparent pipeline and subject to expert oversight, interpretation and contextualisation, can produce verifiable, high-quality results and substantially enhance the scalability of qualitative research.

Keywords:  Data Analysis; Health Service Research; Interviews as Topic; Large Language Models; Natural Language Processing; Qualitative Research

DOI:  https://doi.org/10.3233/SHTI260065
STAR Protoc. 2026 May 13. pii: S2666-1667(26)00186-3. [Epub ahead of print]7(2): 104533

Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow.

Ahmed Abdeen Hamed, Luis M Rocha.

  We present a protocol to evaluate ChatGPT's ability to generate disease-centric biomedical associations. It outlines how we generate the associations, validate the biological entities using biomedical ontologies, and verify associations using literature. The protocol includes a self-consistency strategy to assess generative reliability across ChatGPT models. To address ontology exact-match limitations, we provide a use case performing semantic verification through a workflow enabled by Retrieval-Augmented Generation (RAG) powered by open-source large language models (LLMs). This enables LLMs to establish truth over content generated by other LLMs and expose hallucination.

Keywords:  Computer sciences; Health sciences; genetics

DOI:  https://doi.org/10.1016/j.xpro.2026.104533
Cureus. 2026 Apr;18(4): e106804

Acceptability of a Large Language Model (LLM)-Generated Guideline-Based Checklist Among Otolaryngologists: A Cross-Sectional Survey and Thematic Analysis.

Shahid Iqbal, Mobin Ahmadi, David Ahmadian, Peter Eskander.

  Purpose Large language models (LLMs) have been useful for synthesizing clinical practice guidelines into decision-support tools; however, their utility for clinicians has not been formally evaluated. This study aims to generate a structured clinical checklist from an otolaryngology guideline using an LLM and to assess clinician perceptions of its accuracy, usability, safety, and likelihood of adoption. Materials and methods An LLM (ChatGPT version 5.2, OpenAI, San Francisco, CA, USA) was provided with the American Academy of Otolaryngology-Head and Neck Surgery Clinical Practice Guideline: Evaluation of the Neck Mass in Adults and instructed to generate a concise checklist restricted to guideline content. A structured questionnaire comprising Likert-type scale items and free-text responses was distributed electronically to otolaryngologists. Quantitative responses were summarized descriptively, and thematic analysis was performed on free-text comments to identify key perceptions and concerns. Results Twenty-two otolaryngologists completed the survey, including attending physicians and trainees. Most respondents agreed that the checklist was accurate, clear, and safe; however, fewer indicated that it would save time or that they would be likely to use or recommend it in practice. Attending otolaryngologists more frequently endorsed checklist safety and expressed a greater willingness to use or recommend the checklist than trainees. Thematic analysis identified perceived clinical completeness and educational value as strengths, while omissions of specific examination elements were noted as limitations. Conclusions LLM-generated checklists derived from clinical practice guidelines were generally perceived as accurate and safe by otolaryngologists, but acceptance did not consistently translate into willingness to adopt them in practice. Perceived utility varied by level of training. These findings highlight both the potential and current limitations of LLM-generated decision-support tools and highlight the need for human oversight and further evaluation before routine clinical implementation.

Keywords:  artificial intelligence; checklist development; clinical practice guidelines; evidence-based clinical guidelines; large language models (llm)

DOI:  https://doi.org/10.7759/cureus.106804