bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-11-23
ten papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Dent. 2025 Nov 18. pii: S0300-5712(25)00690-6. [Epub ahead of print] 106245
       OBJECTIVES: Evidence-based dentistry relies heavily on systematic reviews and meta-analyses (SRMAs), which are considered the most robust form of evidence. Still, conducting SRMAs is time- and resource-intensive, with high error rates in data extraction. Artificial intelligence (AI) and large language models (LLMs) offer the potential to automate and accelerate SRMA processes such as data extraction. However, assessing the reliability and accuracy of these new AI-based technologies for SRMAs is crucial. This study evaluated the accuracy of four LLMs (DeepSeek v3 R1, Claude 3.5 Sonnet, ChatGPT-4o, and Gemini 2.0-flash) in extracting primary numeric outcome data across various dental topics.
    METHODS: LLMs were queried via APIs using default settings and a SMART-format prompt. Descriptive analysis was conducted at the sub-outcome, outcome, and study levels. Errors were classified as hallucinations, missed data, or omitted data.
    RESULTS: Overall extraction accuracy was exceptionally high at the sub-outcome level, with only 3 hallucinations (from Gemini). Total errors increased at the outcome and study levels. Gemini generally performed significantly worse than the other models (p<0.01). Claude 3.5 Sonnet and DeepSeek v3 generally exhibited superior accuracy and lower omission rates in full-text extraction compared to Gemini 2.0-flash and ChatGPT-4o.
    CONCLUSIONS: This first comparative evaluation of multiple LLMs for data extraction from full-text PDFs in dental research highlights their significant potential but also their limitations. Performance varied notably between models, and cost did not directly correlate with superior performance. While single data point extraction was highly accurate, errors increased at higher aggregation levels. Standardized outcome reporting in primary studies could benefit future LLM extraction, and our results offer a solid benchmark for future performance comparisons.
    CLINICAL SIGNIFICANCE: This study demonstrates that LLMs can achieve high accuracy in extracting single numeric outcomes, but omission errors in full-text analyses limit their independent use in SRMAs. Improving outcome reporting standards and leveraging accurate, lower-cost models may enhance evidence synthesis efficiency in dentistry and beyond.
    Keywords:  artificial intelligence; data extraction; dentistry; large language model; meta-analysis; systematic review
    DOI:  https://doi.org/10.1016/j.jdent.2025.106245
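A minimal Python sketch of the kind of error classification described in entry 1, under loose assumptions: each model returns a dictionary of extracted sub-outcome values that can be compared against a human-curated reference. The outcome names, tolerance, and the collapsed error categories are illustrative, not the authors' protocol.

# Illustrative only: compare one model's extracted sub-outcome values against a
# human-curated reference. The paper distinguishes hallucinated, missed, and
# omitted data; this simplified version collapses non-extracted values into
# "omitted" and adds an "incorrect" bucket for wrong numbers.

def classify_extraction(reference: dict, extracted: dict, tol: float = 1e-6) -> dict:
    report = {"correct": [], "incorrect": [], "omitted": [], "hallucination": []}
    for outcome, true_value in reference.items():
        if outcome not in extracted or extracted[outcome] is None:
            report["omitted"].append(outcome)        # in the paper, not extracted
        elif abs(extracted[outcome] - true_value) <= tol:
            report["correct"].append(outcome)
        else:
            report["incorrect"].append(outcome)      # extracted but numerically wrong
    for outcome in extracted:
        if outcome not in reference:
            report["hallucination"].append(outcome)  # value not present in the paper
    return report

if __name__ == "__main__":
    reference = {"mean_probing_depth_mm": 3.2, "sd_probing_depth_mm": 0.8}
    model_output = {"mean_probing_depth_mm": 3.2, "plaque_index": 1.1}
    print(classify_extraction(reference, model_output))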
  2. Z Evid Fortbild Qual Gesundhwes. 2025 Nov 14. pii: S1865-9217(25)00205-3. [Epub ahead of print]
      With the increasing availability of powerful large language models (LLMs), the use of artificial intelligence (AI) in qualitative research is gaining growing attention. This article critically examines the potential and limitations of such systems along key research steps, such as category development, coding, and interpretation. Drawing on our own experiences and recent studies, we discuss both functional benefits and methodological, ethical, and data protection-related challenges. The findings suggest that AI-based systems can be meaningfully employed as complementary tools for reflection, for example to generate alternative perspectives or serve as a second or third opinion in individual projects. At the same time, it becomes evident that the core principles of qualitative research cannot be automated. We therefore advocate for a research-driven, critically reflective use of AI, grounded in methodological rigor, ethical responsibility, and ongoing scholarly discourse.
    Keywords:  Artificial intelligence (AI); Large language models (LLMs); Methodological reflection; Qualitative research
    DOI:  https://doi.org/10.1016/j.zefq.2025.10.004
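A minimal sketch of one complementary use discussed in entry 2: asking a model for a "second opinion" on already-coded qualitative material. The prompt wording and the unimplemented send_to_model stub are assumptions; nothing here automates the interpretive work itself.

# Illustrative only: build a "second opinion" prompt for already-coded material.
# The wording is an assumption, and the model call is deliberately left as a stub;
# the article stresses that such output complements, not replaces, human coding.

def second_opinion_prompt(excerpt: str, human_codes: list[str]) -> str:
    return (
        "You are assisting with qualitative coding.\n"
        f"Interview excerpt:\n{excerpt}\n\n"
        f"A human coder assigned these codes: {', '.join(human_codes)}.\n"
        "Suggest up to three alternative or additional codes and briefly justify each, "
        "so the research team can compare perspectives."
    )

def send_to_model(prompt: str) -> str:
    """Placeholder for whichever LLM interface a team uses; intentionally not implemented."""
    raise NotImplementedError

if __name__ == "__main__":
    print(second_opinion_prompt("I only trust advice from my own doctor.",
                                ["trust", "doctor-patient relationship"]))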
  3. J Crit Care. 2025 Nov 19. pii: S0883-9441(25)00345-4. [Epub ahead of print] 92: 155358
       BACKGROUND: Large language models (LLMs) are capable of processing extensive textual data and synthesizing evidence to answer complex clinical questions. The labor-intensive nature of systematic reviews with meta-analyses (SRMAs) presents a unique opportunity to evaluate the utility of LLMs as a novel method for evidence synthesis.
    OBJECTIVE: This study assessed the ability of OpenAI's o3 DeepResearch model to approximate the direction of effect, magnitude of effect and certainty of evidence for clinical questions addressed by published meta-analyses in top critical care medicine journals.
    METHODS: We constructed standardized prompts based on the PICO (Population, Intervention, Comparator, Outcome) elements of a convenience sample of 23 systematic reviews with meta-analyses published in high-impact critical care journals. The LLM's estimates of effect size and certainty of evidence ratings were compared to those reported in the original SRMAs.
    RESULTS: The LLM demonstrated a concordance rate of 83% (19 of 23 studies) for the magnitude of effect size and 91% (21 of 23 studies) for the direction of effect. Concordance for certainty of evidence was also 91%. Discrepancies were due to differences in study selection between the LLM and SRMAs, rather than model hallucination or misinterpretation.
    CONCLUSIONS: LLMs show promise as a new tool for rapid evidence synthesis in critical care, with outputs comparable to traditional meta-analyses in many cases. While not a replacement for systematic reviews, LLMs may enhance clinical decision-making, perform rapid evidence synthesis, and streamline future research workflows.
    Keywords:  Artificial intelligence; Critical care; Evidence synthesis; Large language models; Meta-analysis
    DOI:  https://doi.org/10.1016/j.jcrc.2025.155358
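A minimal sketch of the concordance comparison in entry 3, assuming each PICO question yields a paired record of the SRMA's and the LLM's direction of effect and certainty rating. The field names and toy rows are illustrative, not the study's data.

# Illustrative only: compute agreement between published SRMAs and LLM outputs
# on direction of effect and certainty of evidence.

from dataclasses import dataclass

@dataclass
class Comparison:
    question: str
    srma_direction: str      # e.g. "favours intervention", "no difference"
    llm_direction: str
    srma_certainty: str      # e.g. "low", "moderate", "high"
    llm_certainty: str

def concordance(rows: list[Comparison]) -> dict:
    n = len(rows)
    return {
        "direction": sum(r.srma_direction == r.llm_direction for r in rows) / n,
        "certainty": sum(r.srma_certainty == r.llm_certainty for r in rows) / n,
    }

if __name__ == "__main__":
    rows = [
        Comparison("Early mobilisation vs usual care", "favours intervention",
                   "favours intervention", "moderate", "moderate"),
        Comparison("Drug A vs placebo", "no difference", "favours intervention",
                   "low", "low"),
    ]
    print(concordance(rows))   # direction 0.5, certainty 1.0 in this toy example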
  4. Nature. 2025 Nov 19.
      
    Keywords:  Computer science; Economics; Machine learning
    DOI:  https://doi.org/10.1038/d41586-025-03776-0
  5. Sci Rep. 2025 Nov 17. 15(1): 40122
      This paper evaluates the effectiveness of large language models (LLMs) in extracting complex information from text data. Using a corpus of Spanish news articles, we compare how accurately various LLMs and outsourced human coders reproduce expert annotations on five natural language processing tasks, ranging from named entity recognition to identifying nuanced political criticism in news articles. We find that LLMs consistently outperform outsourced human coders, particularly in tasks requiring deep contextual understanding. These findings suggest that current LLM technology offers researchers without programming expertise a cost-effective alternative for sophisticated text analysis.
    DOI:  https://doi.org/10.1038/s41598-025-23798-y
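A minimal sketch of the comparison design in entry 5: scoring each annotator (an LLM or an outsourced coder) against expert labels, task by task. The task names and labels are placeholders, not the study's data.

# Illustrative only: per-task accuracy of one annotator against expert annotations.

from collections import defaultdict

def accuracy_by_task(expert: dict, annotator: dict) -> dict:
    """expert / annotator: {(task, doc_id): label}"""
    hits, totals = defaultdict(int), defaultdict(int)
    for key, gold in expert.items():
        task = key[0]
        totals[task] += 1
        hits[task] += int(annotator.get(key) == gold)
    return {task: hits[task] / totals[task] for task in totals}

if __name__ == "__main__":
    expert = {("NER", 1): "PERSON", ("criticism", 1): "yes"}
    llm = {("NER", 1): "PERSON", ("criticism", 1): "no"}
    print(accuracy_by_task(expert, llm))   # {'NER': 1.0, 'criticism': 0.0}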
  6. JMIR Form Res. 2025 Nov 20. 9: e73822
       Background: The accurate extraction of biomedical entities in scientific articles is essential for effective metadata annotation of research datasets, ensuring data findability, accessibility, interoperability, and reusability in collaborative research.
    Objective: This study aimed to introduce a novel 4-step cache-augmented generation approach to identify biomedical entities for an automated metadata annotation of datasets, leveraging GPT-4o and PubTator 3.0.
    Methods: The method integrates four steps: (1) generation of candidate entities using GPT-4o, (2) validation via PubTator 3.0, (3) term extraction based on a metadata schema developed for the specific research area, and (4) a combined evaluation of PubTator-validated and schema-related terms. Applied to 23 articles published in the context of the Collaborative Research Center OncoEscape, the process was validated through supervised, face-to-face interviews with article authors, allowing an assessment of annotation precision using random-effects meta-analysis.
    Results: The approach yielded a mean of 19.6 schema-related and 6.7 PubTator-validated biomedical entities per article. Within the study's specific context, the overall annotation precision was 98% (95% CI 94%-100%), with most prediction errors concentrated in articles outside the primary basic research domain of the schema. In a subsample (n=20), available supplemental material was included in the prediction process, but it did not improve precision (98%, 95% CI 95%-100%). Moreover, the mean number of schema-related entities was 20.1 (P=.56) and the mean number of PubTator-validated entities was 6.7 (P=.68); these values did not increase with the additional information provided in the supplement.
    Conclusions: This study highlights the potential of large language model-supported metadata annotation. The findings underscore the practical feasibility of full-text analysis and suggest its potential for integration into routine workflows for biomedical metadata generation.
    Keywords:  AI; CAG; GPT-4o; PubTator 3.0; artificial intelligence; biomedical entities; cache-augmented generation; metadata annotation
    DOI:  https://doi.org/10.2196/73822
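A skeletal sketch of the 4-step pipeline outlined in entry 6. The GPT-4o call, the PubTator 3.0 lookup, and the OncoEscape metadata schema are not reproduced here; the stubs only show how the steps could fit together.

# Illustrative only: skeleton of a cache-augmented annotation pipeline.
# Steps 1 and 2 are left as stubs because the actual model and PubTator 3.0
# interfaces are not reproduced in this sketch.

def generate_candidates(article_text: str) -> set[str]:
    """Step 1: ask the LLM for candidate biomedical entities (call not shown)."""
    raise NotImplementedError

def validate_with_pubtator(entities: set[str]) -> set[str]:
    """Step 2: keep only entities recognised by a PubTator 3.0 lookup (call not shown)."""
    raise NotImplementedError

def match_schema(entities: set[str], schema_terms: set[str]) -> set[str]:
    """Step 3: keep entities that map onto the project-specific metadata schema."""
    return {e for e in entities if e.lower() in schema_terms}

def annotate(article_text: str, schema_terms: set[str]) -> dict:
    """Step 4: combine validated and schema-related terms into one annotation record."""
    candidates = generate_candidates(article_text)
    validated = validate_with_pubtator(candidates)
    schema_related = match_schema(candidates, schema_terms)
    return {"pubtator_validated": validated, "schema_related": schema_related}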
  7. Eur Heart J Digit Health. 2025 Nov;6(6): 1257-1263
       Aims: The aim of the current study was to assess the utility of a state-of-the-art large language model (LLM) system grounded in curated, clearly defined clinical practice recommendations to support clinicians in obtaining point-of-care guideline recommendations for individual patient treatment while maintaining transparency.
    Methods and results: We combined cloud-based and locally run LLMs with versatile open-source tools to form a multi-query, multimodal, retrieval-augmented generation chain that closely reflects European cardiology guidelines in its responses. We compared the performance of this generation chain to other LLMs, including GPT-3.5 and GPT-4, on a 306-question multiple-choice exam with questions consisting of short patient vignettes from various cardiology subspecialties, originally written to prepare candidates for the European Exam in Core Cardiology. On the multiple-choice test, our system demonstrated an overall accuracy of 73.53%, while GPT-3.5 and GPT-4 had overall accuracies of 44.03% and 62.26%, respectively. Our system outperformed GPT-3.5 and GPT-4 for the following categories of questions: coronary artery disease, arrhythmia, other, valvular heart disease, cardiomyopathies, endocarditis, adult congenital heart disease, pericardial disease, cardio-oncology, pulmonary hypertension, and non-cardiac surgery. For maximum transparency, the system also provided reference quotes for its recommendations.
    Conclusion: Our system demonstrated superior performance in question-answering tasks on a set of core cardiology questions as compared with contemporary publicly available chat models. The current study illustrates that LLMs can be tailored to provide documented and accountable guideline recommendations for actual clinical needs while ensuring that recommendations are derived from up-to-date, trustworthy, and traceable documents.
    Keywords:  Clinical practice guidelines; Large language model; Retrieval-augmented generation
    DOI:  https://doi.org/10.1093/ehjdh/ztaf111
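A minimal sketch of the retrieval-augmented answering idea in entry 7: retrieve guideline passages for a vignette, then ask the model to answer while citing the retrieved excerpts. The naive keyword retriever and prompt wording are assumptions standing in for the authors' multi-query, multimodal chain.

# Illustrative only: rank guideline passages by keyword overlap and build a
# citation-demanding prompt; the actual answering model is not shown.

def retrieve(question: str, passages: list[str], k: int = 3) -> list[str]:
    """Rank guideline passages by naive keyword overlap with the question."""
    q_tokens = set(question.lower().split())
    scored = sorted(passages, key=lambda p: -len(q_tokens & set(p.lower().split())))
    return scored[:k]

def build_prompt(question: str, context: list[str]) -> str:
    quoted = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (f"Answer the multiple-choice question using only the excerpts below, "
            f"and cite the excerpt numbers you relied on.\n{quoted}\n\nQuestion: {question}")

if __name__ == "__main__":
    guideline = ["Anticoagulation is recommended in atrial fibrillation with elevated stroke risk.",
                 "Beta-blockers are first-line for rate control in atrial fibrillation."]
    question = "Which therapy is first-line for rate control in atrial fibrillation?"
    print(build_prompt(question, retrieve(question, guideline)))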
  8. Turk J Biol. 2025;49(5): 585-599
       Background/aim: Artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT and DeepSeek, is being increasingly applied in clinical care, research, and education. The aim of this review is to examine how these tools may transform the conduct of medical and biological research and to define their limitations.
    Materials and methods: A narrative synthesis of the literature was performed, encompassing studies published between 2020 and 2025. Peer-reviewed journals, systematic reviews, and high-impact original research articles were included to ensure an evidence-based overview. The principal applications, validation metrics, and clinical implications across orthopedics, oncology, cardiology, internal medicine, and the biological sciences were analyzed.
    Results: LLMs demonstrate strong potential in supporting physicians during clinical decision-making, enhancing patient education, and assisting researchers in their work. They are valuable for language-related tasks and for generating structured, clear, and comprehensible content. However, concerns persist regarding data privacy, algorithmic bias, factual accuracy, and excessive dependence on data-driven outputs. Responsible implementation requires safeguards such as human oversight, model transparency, and domain-specific training.
    Conclusion: AI tools such as ChatGPT, DeepSeek, and similar models are transforming the way healthcare is delivered and studied. Their current capabilities appear highly promising. However, clinicians, technical experts, and policymakers must collaborate to ensure the safe, equitable, effective, and ethical integration of these technologies into real-world healthcare workflows.
    Keywords:  Artificial intelligence; ChatGPT; DeepSeek; clinical decision support; large language models; medical education
    DOI:  https://doi.org/10.55730/1300-0152.2765
  9. J Med Internet Res. 2025 Nov 19. 27: e78393
       Background: Prostate-specific antigen (PSA) testing remains the cornerstone of early prostate cancer detection. Society guidelines for prostate cancer screening via PSA testing serve to standardize patient care and are often used by trainees, junior staff, or generalist medical practitioners to guide medical decision-making. However, adherence to guidelines is a time-consuming and challenging task, and rates of inappropriate PSA testing are high. Retrieval-augmented generation (RAG) is a method to enhance the reliability of large language models (LLMs) by grounding responses in trusted external sources.
    Objective: This study aimed to evaluate a RAG-enhanced LLM system, grounded in current European Association of Urology and American Urological Association guidelines, to assess its effectiveness in providing guideline-concordant PSA screening recommendations compared to junior clinicians.
    Methods: A series of 44 fictional outpatient case scenarios was developed to represent a broad spectrum of clinical presentations. A RAG pipeline was developed, comprising a life expectancy estimation module based on the Charlson Comorbidity Index, followed by LLM-generated recommendations constrained to retrieved excerpts from the European Association of Urology and American Urological Association guidelines. Five junior clinicians were tasked to provide PSA testing recommendations for the same scenarios in closed-book and open-book formats. Answers were compared for accuracy in a binomial fashion. Fleiss κ was computed to assess interrater agreement among clinicians.
    Results: The RAG-LLM tool provided guideline-concordant recommendations in 95.5% (210/220) of case scenarios, compared to junior clinicians, who were correct in 62.3% (137/220) of scenarios in a closed-book format and 74.1% (163/220) of scenarios in an open-book format. The difference was statistically significant for both closed-book (P<.001) and open-book (P<.001) formats. Interrater agreement among clinicians was fair, with Fleiss κ of 0.294 and 0.321 for closed-book and open-book formats, respectively.
    Conclusions: Use of RAG techniques allows LLMs to integrate complex guidelines into day-to-day medical decision-making. RAG-LLM tools in urology have the capability to enhance clinical decision-making by providing guideline-concordant recommendations for PSA testing, potentially improving the consistency of health care delivery, reducing cognitive load on clinicians, and reducing unnecessary investigations and costs. While this study used synthetic cases in a controlled simulation environment, it establishes a foundation for future validation in real-world clinical settings.
    Keywords:  AI; LLM; artificial intelligence; guideline concordance; junior clinician; large language model
    DOI:  https://doi.org/10.2196/78393
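A heavily simplified sketch of the two-stage logic in entry 9: a Charlson-based life-expectancy estimate gates whether PSA testing is considered, and the final recommendation is meant to come from retrieved guideline text. The threshold and rule text below are placeholders, not the EAU/AUA recommendations.

# Illustrative only: life-expectancy gate plus guideline-constrained recommendation.
# The Charlson threshold and the echoed rule are placeholders, not guideline values.

def life_expectancy_adequate(charlson_index: int, threshold: int = 5) -> bool:
    """Stand-in for the Charlson-based life-expectancy module (threshold is assumed)."""
    return charlson_index < threshold

def recommend_psa(age: int, charlson_index: int, retrieved_rule: str) -> str:
    if not life_expectancy_adequate(charlson_index):
        return "PSA testing not recommended (limited life expectancy)."
    # In the published system the recommendation is constrained to retrieved guideline
    # excerpts; here we simply echo the retrieved rule for the given age.
    return f"Apply retrieved guideline excerpt for age {age}: {retrieved_rule}"

if __name__ == "__main__":
    print(recommend_psa(58, 2, "Offer risk-adapted PSA testing after shared decision-making."))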
  10. Cureus. 2025 Oct;17(10): e94949
      Background: Large language models (LLMs) are increasingly integrated into academic and professional research workflows, yet their capability to accurately select appropriate statistical tests for hypothesis testing remains underexplored. Incorrect statistical test selection can lead to invalid conclusions and compromise scientific validity, making this evaluation critical for determining the reliability of LLMs in research applications. The study objective was to evaluate and compare the accuracy of six prominent LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in selecting appropriate statistical tests for various hypothesis testing scenarios.
    Materials and methods: A comparative, cross-sectional evaluation was conducted using 20 standardized statistical testing scenarios, designed to cover 20 different hypothesis testing situations, including comparisons of means, proportions, non-parametric alternatives, paired versus independent samples, and correlation and regression analyses. All models were prompted with identical instructions and evaluated by five independent experts with profound knowledge of biostatistics. Responses were assessed for accuracy and rated on five domains (clarity and accessibility, identification of necessary assumptions, pedagogical value, problem-solving approach, and statistical reasoning) using a five-point Likert scale. Analysis of variance (ANOVA) was applied for between-group comparisons, and p<0.05 was considered significant.
    Results: All six LLMs achieved 100% accuracy in statistical test selection across all 20 hypothesis scenarios. However, significant variations emerged in explanatory quality. Claude demonstrated superior performance in clarity and accessibility (4.65 ± 0.58, p=0.05), while the problem-solving approach showed the most consistent excellence across models. Statistical reasoning ratings ranged from 3.16 to 4.66, with complex regression methods receiving lower ratings than basic statistical tests. Gemini excelled in pedagogical value (4.50 ± 0.68), while ChatGPT ranked lowest in statistical reasoning despite strong problem-solving capabilities.
    Conclusions: All LLMs demonstrated perfect accuracy in statistical test selection; however, differences exist in the quality of the explanations and reasoning provided. These findings suggest that current-generation LLMs have become dependable tools for statistical consultation in hypothesis testing scenarios. However, users should consider model-specific strengths when seeking detailed explanations or educational content.
    Keywords:  chatgpt; claude; deepseek; gemini; grok; hypothesis; large language models; le chat; statistical test
    DOI:  https://doi.org/10.7759/cureus.94949
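A minimal sketch of the between-model comparison in entry 10: a one-way ANOVA on expert Likert ratings for a single domain, with toy numbers in place of the study's data (requires scipy).

# Illustrative only: one-way ANOVA across models for one rating domain.
# The ratings below are invented toy values, not the study's data.

from scipy.stats import f_oneway

ratings = {
    "Claude":  [5, 4, 5, 5, 4],
    "ChatGPT": [4, 4, 3, 4, 4],
    "Gemini":  [5, 4, 4, 5, 5],
}

f_stat, p_value = f_oneway(*ratings.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
for model, r in ratings.items():
    print(model, sum(r) / len(r))   # mean rating per model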