bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-03-01
eleven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. medRxiv. 2026 Feb 17. pii: 2026.02.07.26345640. [Epub ahead of print]
       Background: The ability of large language models (LLMs) to work collaboratively and screen studies in a systematic review (SR) is under-explored. Hence, we aimed to evaluate the effectiveness of LLMs in automating the process of screening in systematic reviews.
    Methods: This is an observational study that included labeled data (titles and abstracts) for five SRs. Originally, two reviewers screened the citations independently for eligibility, and a third reviewer cross-checked each citation for quality assurance. GPT-4, Claude-3-Sonnet, and Gemini-Pro-1.0 were applied with zero-shot chain-of-thought prompting. Collaborative approaches included (i) conflict resolution using benefit of the doubt, (ii) majority voting using an independent third LLM, and (iii) conflict resolution using an informed third LLM. Performance was assessed using accuracy, precision for exclusion, and recall for inclusion. Work saved over sampling (WSS) was computed to estimate the reduction in manual human effort.
    Results: A total of 11,300 articles were included in this study. The individual models GPT-4, Claude-3-Sonnet, and Gemini-Pro-1.0 exhibited high precision for exclusion (99.7%, 99.7%, and 99.2%, respectively) and high recall for inclusion (95.5%, 96.6%, and 85.7%, respectively). However, the collaborative approach using the two best-performing models (GPT-4 and Claude-3-Sonnet) achieved an average precision of 99.9% and a recall of 98.5% (across all collaborative approaches). Furthermore, the proposed collaborative approach resulted in an average WSS of 63.5%, compared with an average WSS of 45.2% for the individual models. Conversational LLM interactions showed a consistent pattern of results.
    Limitations: This study was limited by its reliance on proprietary models and by evaluation on oncology datasets.
    Conclusion: Evidence shows that collaborative LLMs enable efficient, high-performing screening in systematic reviews, supporting continuous evidence updates.
    Primary funding source: NIH (U24CA265879-01-1) and Carolyn-Ann-Kennedy-Bacon Fund.
    DOI:  https://doi.org/10.64898/2026.02.07.26345640
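The majority-voting arm of the collaborative approach, and the WSS efficiency metric used throughout, can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the paper does not spell out its WSS variant, so the common definition (effort saved minus the recall given up) is assumed:

```python
from collections import Counter

def majority_vote(verdicts):
    """Majority label among an odd number of model verdicts
    (e.g. GPT-4, Claude-3-Sonnet, and a third LLM on one citation)."""
    return Counter(verdicts).most_common(1)[0][0]

def work_saved_over_sampling(tp, fp, tn, fn):
    """WSS as commonly defined: the fraction of records a human reviewer
    is spared from screening, penalised by any loss of recall."""
    n = tp + fp + tn + fn
    recall = tp / (tp + fn)
    return (tn + fn) / n - (1 - recall)
```

With, say, 95 true inclusions found, 5 missed, 880 correct exclusions, and 10 wrong flags, WSS comes out at about 0.84, i.e. roughly 84% of manual effort saved at 95% recall.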
  2. Evid Based Dent. 2026 Feb 24.
     BACKGROUND: The exponential growth of biomedical literature (over a million new PubMed entries each year) has outpaced traditional evidence-synthesis methods. Systematic reviews, long the cornerstone of evidence-based dentistry, are resource-intensive and often outdated within a few years, widening the gap between current research and clinical practice.
    METHODS: We outline Retrieval-Augmented Generation (RAG) as a methodology for dynamic evidence reviews. RAG strengthens Large Language Models (LLMs) by combining their generative capacity with real-time retrieval from a continuously updated, curated knowledge base. This design grounds every answer in verifiable sources and mitigates the factual errors and hallucinations seen in standalone LLMs.
    RESULTS/IMPLICATIONS: RAG enables on-demand dynamic synthesis of the latest evidence, allowing clinicians and researchers to ask complex, natural-language questions and receive concise, fully cited answers. For dental clinicians, this approach enables rapid, citation-linked answers to practice-relevant questions, such as material selection, healing outcomes, or procedural comparisons, without relying on outdated narrative summaries. We describe three complementary integration pathways (RAG on pre-retrieved article pools, public living review portals, and machine-actionable journal publications), each with distinct requirements and benefits. Looking forward, emerging agentic AI systems, capable of planning multi-step searches and iterative updates, may further enhance these capabilities. Although this framework is conceptually grounded and supported by emerging methodological evidence, prospective empirical validation, benchmarking against existing review approaches, and real-world deployment studies will be required to fully assess its performance, reliability, and impact on clinical decision-making.
    CONCLUSION: RAG offers a scalable, transparent alternative to static systematic reviews and can shorten the research-to-practice timeline. By automating retrieval and initial synthesis while keeping human critical appraisal and ethical judgment central, it points toward an era of augmented rather than automated intelligence in evidence-based dentistry.
    DOI:  https://doi.org/10.1038/s41432-026-01206-2
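The retrieve-then-generate loop the authors describe can be illustrated with a toy pipeline. Everything here is hypothetical (the corpus, the overlap scorer, the prompt wording); a real deployment would use a dense or hybrid retriever over a curated knowledge base and an LLM to draft the cited answer:

```python
def retrieve(query, corpus, k=2):
    """Rank documents by simple term overlap with the query (a stand-in
    for a real retriever over a continuously updated knowledge base)."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(q & set(d["text"].lower().split())))
    return ranked[:k]

def build_prompt(query, docs):
    """Ground the model's answer in retrieved, citable sources."""
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    return (f"Answer using ONLY the sources below and cite them by id.\n"
            f"Sources:\n{context}\n\nQuestion: {query}")

# Hypothetical two-document knowledge base for illustration.
corpus = [
    {"id": "S1", "text": "zirconia crowns show high survival in posterior teeth"},
    {"id": "S2", "text": "composite resin restorations need periodic repair"},
]
prompt = build_prompt("survival of zirconia crowns",
                      retrieve("survival of zirconia crowns", corpus))
```

The prompt handed to the LLM then contains only verifiable, id-tagged sources, which is what mitigates the hallucination risk of a standalone model.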
  3. AMIA Annu Symp Proc. 2024;2024:1549-1556
      Reference errors, such as citation and quotation errors, are common in scientific papers. Such errors can result in the propagation of inaccurate information, but are difficult and time-consuming to detect, posing a significant threat to the integrity of scientific literature. To support automatic detection of reference errors, we evaluated the ability of large language models in OpenAI's GPT family to detect quotation errors. Specifically, we prepared an expert-annotated, general-domain dataset of statement-reference pairs from journal articles, one-third of which is in biomedicine. Large language models were evaluated in different settings with varying amounts of reference information provided by retrieval augmentation. Results showed that large language models are able to detect erroneous citations with limited context and without fine-tuning. This study contributes to the growing literature that seeks to utilize artificial intelligence to assist in the writing, reviewing, and publishing of scientific papers as well as grounding of language model responses.
  4. Pol Arch Intern Med. 2026 Feb 27. pii: 17243. [Epub ahead of print]
      Clinical research published in internal medicine journals relies heavily on statistical analysis and quantitative inference, making the quality of statistical reporting and statistical peer review central to the credibility of this literature. Despite long-standing methodological recommendations, the quality of statistical analyses and reporting in medical journals remains suboptimal, and the proportion of manuscripts undergoing formal statistical review has not improved over recent decades. At the same time, generative artificial intelligence (AI) tools have been increasingly adopted in biomedical research, raising expectations that they may support statistical analysis and elements of the peer-review process. This narrative review synthesizes evidence published between 2023 and 2025 on the use of AI-assisted tools in statistical analysis and statistical review within medical research. The reviewed studies show that large language models can support selected tasks, including generation of analytical code, reproduction of simple statistical procedures, preliminary selection of statistical tests, and detection of certain formal statistical errors. However, AI performance is highly variable and frequently limited by incomplete consideration of statistical assumptions and reduced reliability in complex analytical scenarios. Current generative AI tools should not be regarded as fully autonomous instruments for statistical analysis or statistical peer review. Their effective use depends on statistical expertise, independent validation, and contextual judgment by human users. The review discusses implications for statistical practice and statistical review in internal medicine, a research setting characterized by heterogeneous observational data, multimorbidity, and frequent use of non-randomized study designs, including pragmatic clinical trials.
    DOI:  https://doi.org/10.20452/pamw.17243
  5. BMJ Open. 2026 Feb 26. 16(2): e109725
     INTRODUCTION: Artificial intelligence (AI) is rapidly evolving, offering an expanding suite of capabilities that go beyond the traditional focus on prediction and classification. Generative AI (GenAI) and agentic AI could create transformative practices to support real-world evidence (RWE) generation for health research by streamlining studies, accelerating insights and improving decision-making. However, there is no published overview describing the range of applications in RWE generation. This review aims to describe where and how GenAI and agentic AI are applied across the domains of healthcare research tasks for RWE generation, to map applications by tasks and methods across the product lifecycle continuum, and to identify emerging gaps and opportunities.
    METHODS AND ANALYSIS: This Living Scoping Review (LSR) will include studies reporting an application and/or evaluation of GenAI or agentic AI applied to one or more RWE generation research tasks. Searches will be conducted in Embase, MEDLINE and additional sources (eg, grey literature). Citations will be independently screened by two human senior reviewers to create a substantive training dataset, after which a commercially available screening algorithm (Robot Screener) will complete screening alongside a human reviewer. The LSR will include reports of studies (primary or reviews) describing and/or evaluating the application of any GenAI model for RWE generation in healthcare, in English, published from 1 January 2025 to the date of search. Data will be extracted from all studies included in the LSR by one independent senior reviewer using a piloted template, with a 10% quality check by a second senior reviewer. Descriptive statistics will be used to summarise the applications of GenAI per RWE research task and the results of GenAI evaluations. Thematic analysis will be used to describe GenAI application patterns, trends, gaps and opportunities. The LSR protocol and reports will be updated annually, and findings will be published on a publicly available website (eg, that of ISPE, the International Society for Pharmacoepidemiology).
    ETHICS AND DISSEMINATION: Ethical approval is not required due to use of previously published data. Planned dissemination includes peer-reviewed publication, presentation and short summaries.
    Keywords:  Artificial Intelligence; EPIDEMIOLOGY; Research Design
    DOI:  https://doi.org/10.1136/bmjopen-2025-109725
  6. J Am Med Inform Assoc. 2026 Feb 26. pii: ocag026. [Epub ahead of print]
     OBJECTIVE: Clinical practice guidelines (CPGs) provide evidence-based recommendations for patient care; however, integrating them into artificial intelligence (AI) remains challenging. Previous approaches, such as rule-based systems or black-box AI models, face significant limitations, including poor interpretability, inconsistent adherence to guidelines, and narrow domain applicability. To address this, we develop and validate CPGPrompt, an auto-prompting system that converts narrative clinical guidelines into structured prompts for large language models (LLMs).
    MATERIALS AND METHODS: Our framework translates CPGs into structured decision trees and utilizes an LLM to dynamically navigate them for patient case evaluation. Synthetic vignettes were generated across 3 domains-headache, lower back pain, and prostate cancer-and distributed into 4 categories to test different decision scenarios. System performance was assessed on both binary specialty referral decisions and fine-grained pathway classification tasks.
    RESULTS: The binary specialty referral classification achieved consistently strong performance across all domains (F1: 0.85-1.00), with high recall (1.00 ± 0.00). In contrast, multiclass pathway assignment showed reduced performance, with domain-specific variations: headache (F1: 0.47), lower back pain (F1: 0.72), and prostate cancer (F1: 0.77).
    DISCUSSION: Domain-specific performance differences reflected the structure of each guideline. The headache guideline highlighted challenges with negation handling. The lower back pain guideline required temporal reasoning. In contrast, prostate cancer pathways benefited from quantifiable laboratory tests, resulting in more reliable decision-making.
    CONCLUSION: CPGPrompt demonstrates generalizability across diverse clinical domains while maintaining high sensitivity for referral decisions. Its transparent, auditable framework enables the systematic identification of failure modes and provides advantages over black-box AI approaches. However, persistent challenges with subjective clinical assessments indicate a need for targeted improvements and greater clinical robustness.
    Keywords:  AI; clinical decision support; clinical practice guidelines; decision trees; large language models
    DOI:  https://doi.org/10.1093/jamia/ocag026
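The core mechanism, translating a guideline into a decision tree that is walked node by node, can be sketched as follows. The node schema and the guideline content are hypothetical; in CPGPrompt an LLM, not a dictionary lookup, would answer each node's question from the patient vignette:

```python
# Hypothetical decision tree distilled from a narrative guideline.
TREE = {
    "root": {"question": "Red-flag symptom present?",
             "yes": "refer", "no": "conservative"},
    "refer": {"decision": "specialty referral"},
    "conservative": {"decision": "primary-care pathway"},
}

def navigate(tree, answers, node="root"):
    """Walk the guideline tree using case-specific yes/no answers,
    returning the pathway decision at the leaf that is reached."""
    current = tree[node]
    while "decision" not in current:
        nxt = current["yes"] if answers[current["question"]] else current["no"]
        current = tree[nxt]
    return current["decision"]
```

Because every traversal is an explicit sequence of answered questions, each decision is auditable, which is the transparency advantage the authors claim over black-box classification.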
  7. Intern Emerg Med. 2026 Feb 26.
       BACKGROUND: Large language models (LLMs) are increasingly used in biomedical research for statistical support, yet their reliability in selecting appropriate tests and generating correct software commands remains insufficiently evaluated. This study compared the performance of ChatGPT-5, Claude, and DeepSeek in identifying statistical tests and generating corresponding Stata 15 commands.
    METHODS: Thirty-two examples were adapted from the UCLA Institute for Digital Research and Education Stata tutorial. Each model was tested twice independently using standardized prompts. Responses were classified using a four-level taxonomy: COR (reference-equivalent, i.e., no deviation), SYN (minor syntactic deviation, i.e., low-risk deviation), ALT (alternative valid specification, i.e., low-risk deviation), and CMM (conceptual mismatch with potential inferential impact, i.e., high-risk deviation). Accuracy was defined as the proportion of outputs with no or low-risk deviations, calculated as (COR + SYN + ALT)/32. Model comparisons used Fisher's exact test, and reproducibility across rounds was assessed with McNemar's test and Fisher's exact test.
    RESULTS: All three models correctly identified the statistical test in all 32 examples (100% accuracy in both rounds). For Stata command generation, accuracy was high and comparable across models (round 1: ChatGPT = 90.6%, Claude = 93.8%, DeepSeek = 93.8%; round 2: ChatGPT = 90.6%, Claude = 96.9%, DeepSeek = 87.5%; p > 0.05). High-risk deviations were rare (≤ 12.5% in any model-round combination). Reproducibility between rounds was excellent (ChatGPT = 100%, Claude = 96.9%, DeepSeek = 93.8%; p > 0.05).
    CONCLUSION: ChatGPT-5, Claude, and DeepSeek demonstrated high accuracy and reproducibility in structured statistical reasoning tasks, with rare high-risk deviations that could potentially affect statistical inference. These findings support the use of advanced LLMs as complementary tools for applied statistical reasoning.
    Keywords:  AI; Accuracy; Artificial intelligence; ChatGPT; Claude; Comparison; DeepSeek; Large language model (LLM); Performance; Reproducibility; Scientific writing; Stata; Statistical analysis
    DOI:  https://doi.org/10.1007/s11739-026-04291-4
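The study's accuracy definition, (COR + SYN + ALT)/32, amounts to counting any output with no or low-risk deviation as acceptable. A minimal sketch with made-up ratings (the real per-example labels are not published in the abstract):

```python
def command_accuracy(labels):
    """Share of model outputs whose deviation is absent or low-risk:
    (COR + SYN + ALT) / total, per the study's definition."""
    acceptable = {"COR", "SYN", "ALT"}
    return sum(label in acceptable for label in labels) / len(labels)

# 32 hypothetical ratings for one model-round: 2 conceptual mismatches (CMM).
ratings = ["COR"] * 27 + ["SYN"] * 2 + ["ALT"] + ["CMM"] * 2
```

These hypothetical ratings give 30/32, about 93.8%, the same scale as the reported round results.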
  8. PLOS Digit Health. 2026 Feb;5(2):e0000576
       BACKGROUND: The exponential growth of Big Qualitative (Big Qual) data in healthcare research presents methodological challenges for traditional analysis approaches. This study evaluates the effectiveness of machine-assisted analysis using artificial intelligence (AI) tools compared to human-only analysis for processing large-scale qualitative datasets, using the Royal College of Anaesthetists' 7th National Audit Project (NAP7) baseline survey as a test case.
    METHODOLOGY/PRINCIPAL FINDINGS: We conducted a comparative methodological study analysing 5,196 free-text responses about peri-operative cardiac arrest experiences. Three researchers established a human-coded reference standard following SRQR guidelines. We then applied machine-assisted analysis using Pulsar for exploratory analysis and Caplena for sentiment and thematic analysis, evaluating performance against the human gold standard using STARD-AI reporting standards. Performance metrics included accuracy, precision, recall, F1-scores, and Cohen's Kappa, with confidence intervals calculated using bootstrap resampling. Machine-assisted analysis substantially reduced analysis time, with particularly dramatic improvements in theme identification speed. The machine-assisted approach achieved good thematic and sentiment classification accuracy compared to the human reference standard, though human analysis identified an emergent 'ambiguous' sentiment category that current AI tools cannot accommodate, highlighting limitations in commercial platforms' flexibility for inductive analysis.
    CONCLUSIONS/SIGNIFICANCE: Machine-assisted analysis offers substantial efficiency gains with acceptable accuracy trade-offs for large-scale qualitative data analysis. However, human expertise remains essential for capturing nuanced meanings, identifying emergent categories, and providing domain-specific interpretation. This hybrid approach represents a viable methodology for Big Qual research, though current AI tools' constraints in accommodating emergent classification schemes remain a limitation. Our findings establish benchmarks for future development of more flexible AI systems adapted to qualitative research paradigms.
    DOI:  https://doi.org/10.1371/journal.pdig.0000576
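Confidence intervals of the kind reported here can be obtained with a percentile bootstrap over per-response agreement with the human reference standard. This is illustrative only; the paper's resampling count and data are not given, so the outcomes below are invented:

```python
import random

def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion, e.g. the accuracy of
    machine-assigned codes scored 1 (match) or 0 (mismatch) against
    the human-coded reference standard."""
    rng = random.Random(seed)
    stats = sorted(
        sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_boot)
    )
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

For 100 hypothetical responses with 85 matches, the 95% interval brackets the point estimate of 0.85, widening or narrowing with sample size as expected.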
  9. AMIA Annu Symp Proc. 2024;2024:248-256
      Medical vocabularies are essential tools for capturing, classifying, and analyzing healthcare data. However, the creation and maintenance of these vocabularies are often labor-intensive and costly. This preliminary study evaluates the feasibility of using large language models (LLMs) to automate three key tasks in medical vocabulary management: term similarity, subsumption, and grouping. Using 1,533 cardiovascular terms from SNOMED CT, we applied GPT-4o and assessed the performance of 3 elementary tasks against OHDSI standardized vocabularies. While LLMs demonstrated high precision across tasks (0.78 for term similarity, 0.74 for term subsumption, 0.78 for term grouping), recall was notably lower (0.41 for term similarity, 0.08 for term subsumption, 0.52 for term grouping), indicating gaps in coverage. Overall, LLMs show promise for medical vocabulary tasks but require further refinement for clinical specificity and completeness. Future work should focus on enhancing recall, reducing hallucinations, and evaluating scalability across broader terminology sets.
  10. J Med Internet Res. 2026 Feb 24.
     BACKGROUND: Despite the transformative potential of large language models (LLMs) in healthcare, the rapid development of these tools has outpaced their rigorous evaluation. While AI-specific reporting guidelines have been developed to standardize the reporting of AI studies, there is currently no specific tool available for risk-of-bias assessment of LLM question-answer (LLM-QA) studies. Existing risk-of-bias tools for medical research are not well suited to the unique challenges of evaluating LLM-QA studies, which creates a critical gap in assessing their safety and effectiveness.
    OBJECTIVE: To develop the Alberta Risk of Bias Assessment Tool for LLM-QA studies (AQAT:RoB) to systematically evaluate validity and risk of bias of LLM-QA studies.
    METHODS: We conducted two literature reviews: the first on quality assessment tools for LLM-QA studies and the second on LLM-QA studies, which informed the first draft of AQAT:RoB. The draft AQAT:RoB was further refined through a pre-specified iterative process of modified Delphi, consensus meeting, and validation. The first Delphi process occurred between May 1 and May 20, 2025, and the first consensus meeting was held on May 22. The first round of validation was completed by 4 evaluators, who were not part of the consensus meeting, on 16 randomly selected studies. As this first round of validation surpassed our a priori thresholds of ≥80% agreement and a Cohen's kappa of ≥0.61 between evaluators, no further rounds of development and validation were undertaken. A second Delphi process occurred between February 20 and February 23, 2026 to vote on post-pilot changes in response to peer review.
    RESULTS: The AQAT:RoB consists of seven high-level domains (Questions, Reference Answers, LLM Answers, Evaluators, Outcomes, Reporting, and Other), which are subdivided into 9 sub-domains. Each sub-domain includes at least one "Support for Judgement" and at least one "Type of Bias" and is rated "low", "high", or "unclear" for risk of bias. Pilot evaluation was completed by internal validators who were not part of the consensus discussion and were asked to complete the AQAT:RoB form on each assigned study. Each of the 16 studies was evaluated by two evaluators independently. Pilot validation showed a percent agreement of 86.1% and a Cohen's kappa of 0.70 between assessors.
    CONCLUSIONS: The AQAT:RoB demonstrates promising initial reliability for assessing the validity/risk of bias of LLM-QA studies. The tool will benefit from future refinements, external validation, and periodic updates to keep pace with the evolving technology.
    DOI:  https://doi.org/10.2196/87057
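The two agreement statistics reported for the pilot, percent agreement and Cohen's kappa, can be computed directly from two evaluators' ratings. A minimal sketch with made-up "low"/"high"/"unclear" judgements (the real pilot data are not in the abstract):

```python
from collections import Counter

def agreement_and_kappa(r1, r2):
    """Percent agreement and Cohen's kappa for two raters over the same
    items, e.g. per-domain risk-of-bias judgements on the same studies."""
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_exp = sum(c1[k] * c2[k] for k in set(r1) | set(r2)) / n ** 2  # chance agreement
    return p_obs, (p_obs - p_exp) / (1 - p_exp)
```

Kappa discounts the agreement expected by chance, which is why it sits below the raw percent agreement (here, 0.70 versus 86.1% in the pilot).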