bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-11-23
ten papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Dent. 2025 Nov 18. pii: S0300-5712(25)00690-6. [Epub ahead of print] 106245
       OBJECTIVES: Evidence-based dentistry relies heavily on systematic reviews and meta-analyses (SRMAs), which are considered the most robust form of evidence. Still, conducting SRMAs is time- and resource-intensive, with high error rates in data extraction. Artificial intelligence (AI) and large language models (LLMs) offer the potential to automate and accelerate SRMA processes such as data extraction. However, assessing the reliability and accuracy of these new AI-based technologies for SRMAs is crucial. This study evaluated the accuracy of four LLMs (DeepSeek v3 R1, Claude 3.5 Sonnet, ChatGPT-4o, and Gemini 2.0-flash) in extracting primary numeric outcome data across various dental topics.
    METHODS: LLMs were queried via APIs using default settings and a SMART-format prompt. Descriptive analysis was conducted at the sub-outcome, outcome, and study levels. Errors were classified as hallucinations, missed data, or omitted data.
    RESULTS: Overall extraction accuracy was exceptionally high at the sub-outcome level, with only 3 hallucinations (from Gemini). Total errors increased at the outcome and study levels. Gemini generally performed significantly worse than the other models (p<0.01). Claude 3.5 Sonnet and DeepSeek v3 generally exhibited superior accuracy and lower omission rates in full-text extraction compared to Gemini 2.0-flash and ChatGPT-4o.
    CONCLUSIONS: This first comparative evaluation of multiple LLMs for data extraction from full-text PDFs in dental research highlights their significant potential but also their limitations. Performance varied notably between models, and cost did not directly correlate with superior performance. While single data point extraction was highly accurate, errors increased at higher aggregation levels. Standardized outcome reporting in primary studies could benefit future LLM extraction, and our results offer a solid benchmark for future performance comparisons.
    CLINICAL SIGNIFICANCE: This study demonstrates that LLMs can achieve high accuracy in extracting single numeric outcomes, but omission errors in full-text analyses limit their independent use in SRMAs. Improving outcome reporting standards and leveraging accurate, lower-cost models may enhance evidence synthesis efficiency in dentistry and beyond.
    Keywords:  artificial intelligence; data extraction; dentistry; large language model; meta-analysis; systematic review
    DOI:  https://doi.org/10.1016/j.jdent.2025.106245
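A minimal Python sketch of the kind of error classification described in entry 1, under loose assumptions: each model returns a dictionary of extracted sub-outcome values that can be compared against a human-curated reference. The outcome names, tolerance, and the collapsed error categories are illustrative, not the authors' protocol.

# Illustrative only: compare one model's extracted sub-outcome values against a
# human-curated reference. The paper distinguishes hallucinated, missed, and
# omitted data; this simplified version collapses non-extracted values into
# "omitted" and adds an "incorrect" bucket for wrong numbers.

def classify_extraction(reference: dict, extracted: dict, tol: float = 1e-6) -> dict:
    report = {"correct": [], "incorrect": [], "omitted": [], "hallucination": []}
    for outcome, true_value in reference.items():
        if outcome not in extracted or extracted[outcome] is None:
            report["omitted"].append(outcome)        # in the paper, not extracted
        elif abs(extracted[outcome] - true_value) <= tol:
            report["correct"].append(outcome)
        else:
            report["incorrect"].append(outcome)      # extracted but numerically wrong
    for outcome in extracted:
        if outcome not in reference:
            report["hallucination"].append(outcome)  # value not present in the paper
    return report

if __name__ == "__main__":
    reference = {"mean_probing_depth_mm": 3.2, "sd_probing_depth_mm": 0.8}
    model_output = {"mean_probing_depth_mm": 3.2, "plaque_index": 1.1}
    print(classify_extraction(reference, model_output))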
  2. Z Evid Fortbild Qual Gesundhwes. 2025 Nov 14. pii: S1865-9217(25)00205-3. [Epub ahead of print]
      With the increasing availability of powerful large language models (LLMs), the use of artificial intelligence (AI) in qualitative research is gaining growing attention. This article critically examines the potential and limitations of such systems along key research steps, such as category development, coding, and interpretation. Drawing on our own experiences and recent studies, we discuss both functional benefits and methodological, ethical, and data protection-related challenges. The findings suggest that AI-based systems can be meaningfully employed as complementary tools for reflection, for example to generate alternative perspectives or serve as a second or third opinion in individual projects. At the same time, it becomes evident that the core principles of qualitative research cannot be automated. We therefore advocate for a research-driven, critically reflective use of AI, grounded in methodological rigor, ethical responsibility, and ongoing scholarly discourse.
    Keywords:  Artificial intelligence (AI); Large language models (LLMs); Methodological reflection; Qualitative research
    DOI:  https://doi.org/10.1016/j.zefq.2025.10.004
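A minimal sketch of one complementary use discussed in entry 2: asking a model for a "second opinion" on already-coded qualitative material. The prompt wording and the unimplemented send_to_model stub are assumptions; nothing here automates the interpretive work itself.

# Illustrative only: build a "second opinion" prompt for already-coded material.
# The wording is an assumption, and the model call is deliberately left as a stub;
# the article stresses that such output complements, not replaces, human coding.

def second_opinion_prompt(excerpt: str, human_codes: list[str]) -> str:
    return (
        "You are assisting with qualitative coding.\n"
        f"Interview excerpt:\n{excerpt}\n\n"
        f"A human coder assigned these codes: {', '.join(human_codes)}.\n"
        "Suggest up to three alternative or additional codes and briefly justify each, "
        "so the research team can compare perspectives."
    )

def send_to_model(prompt: str) -> str:
    """Placeholder for whichever LLM interface a team uses; intentionally not implemented."""
    raise NotImplementedError

if __name__ == "__main__":
    print(second_opinion_prompt("I only trust advice from my own doctor.",
                                ["trust", "doctor-patient relationship"]))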
  3. J Crit Care. 2025 Nov 19. pii: S0883-9441(25)00345-4. [Epub ahead of print] 92: 155358
       BACKGROUND: Large language models (LLMs) are capable of processing extensive textual data and synthesizing evidence to answer complex clinical questions. The labor-intensive nature of systematic reviews with meta-analyses (SRMAs) presents a unique opportunity to evaluate the utility of LLMs as a novel method for evidence synthesis.
    OBJECTIVE: This study assessed the ability of OpenAI's o3 DeepResearch model to approximate the direction of effect, magnitude of effect and certainty of evidence for clinical questions addressed by published meta-analyses in top critical care medicine journals.
    METHODS: We constructed standardized prompts based on the PICO (Population, Intervention, Comparator, Outcome) elements of a convenience sample of 23 systematic reviews with meta-analyses published in high-impact critical care journals. The LLM's estimates of effect size and certainty of evidence ratings were compared to those reported in the original SRMAs.
    RESULTS: The LLM demonstrated a concordance rate of 83% (19 of 23 studies) for the magnitude of effect size and 91% (21 of 23 studies) for the direction of effect. Concordance for certainty of evidence was also 91%. Discrepancies were due to differences in study selection between the LLM and SRMAs, rather than model hallucination or misinterpretation.
    CONCLUSIONS: LLMs show promise as a new tool for rapid evidence synthesis in critical care, with outputs comparable to traditional meta-analyses in many cases. While not a replacement for systematic reviews, LLMs may enhance clinical decision-making, perform rapid evidence synthesis, and streamline future research workflows.
    Keywords:  Artificial intelligence; Critical care; Evidence synthesis; Large language models; Meta-analysis
    DOI:  https://doi.org/10.1016/j.jcrc.2025.155358
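A minimal sketch of the concordance comparison in entry 3, assuming each PICO question yields a paired record of the SRMA's and the LLM's direction of effect and certainty rating. The field names and toy rows are illustrative, not the study's data.

# Illustrative only: compute agreement between published SRMAs and LLM outputs
# on direction of effect and certainty of evidence.

from dataclasses import dataclass

@dataclass
class Comparison:
    question: str
    srma_direction: str      # e.g. "favours intervention", "no difference"
    llm_direction: str
    srma_certainty: str      # e.g. "low", "moderate", "high"
    llm_certainty: str

def concordance(rows: list[Comparison]) -> dict:
    n = len(rows)
    return {
        "direction": sum(r.srma_direction == r.llm_direction for r in rows) / n,
        "certainty": sum(r.srma_certainty == r.llm_certainty for r in rows) / n,
    }

if __name__ == "__main__":
    rows = [
        Comparison("Early mobilisation vs usual care", "favours intervention",
                   "favours intervention", "moderate", "moderate"),
        Comparison("Drug A vs placebo", "no difference", "favours intervention",
                   "low", "low"),
    ]
    print(concordance(rows))   # direction 0.5, certainty 1.0 in this toy example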
  4. Nature. 2025 Nov 19.
      
    Keywords:  Computer science; Economics; Machine learning
    DOI:  https://doi.org/10.1038/d41586-025-03776-0
  5. Sci Rep. 2025 Nov 17. 15(1): 40122
      This paper evaluates the effectiveness of large language models (LLMs) in extracting complex information from text data. Using a corpus of Spanish news articles, we compare how accurately various LLMs and outsourced human coders reproduce expert annotations on five natural language processing tasks, ranging from named entity recognition to identifying nuanced political criticism in news articles. We find that LLMs consistently outperform outsourced human coders, particularly in tasks requiring deep contextual understanding. These findings suggest that current LLM technology offers researchers without programming expertise a cost-effective alternative for sophisticated text analysis.
    DOI:  https://doi.org/10.1038/s41598-025-23798-y
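A minimal sketch of the comparison design in entry 5: scoring each annotator (an LLM or an outsourced coder) against expert labels, task by task. The task names and labels are placeholders, not the study's data.

# Illustrative only: per-task accuracy of one annotator against expert annotations.

from collections import defaultdict

def accuracy_by_task(expert: dict, annotator: dict) -> dict:
    """expert / annotator: {(task, doc_id): label}"""
    hits, totals = defaultdict(int), defaultdict(int)
    for key, gold in expert.items():
        task = key[0]
        totals[task] += 1
        hits[task] += int(annotator.get(key) == gold)
    return {task: hits[task] / totals[task] for task in totals}

if __name__ == "__main__":
    expert = {("NER", 1): "PERSON", ("criticism", 1): "yes"}
    llm = {("NER", 1): "PERSON", ("criticism", 1): "no"}
    print(accuracy_by_task(expert, llm))   # {'NER': 1.0, 'criticism': 0.0}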
  6. JMIR Form Res. 2025 Nov 20. 9: e73822
       Background: The accurate extraction of biomedical entities in scientific articles is essential for effective metadata annotation of research datasets, ensuring data findability, accessibility, interoperability, and reusability in collaborative research.
    Objective: This study aimed to introduce a novel 4-step cache-augmented generation approach to identify biomedical entities for an automated metadata annotation of datasets, leveraging GPT-4o and PubTator 3.0.
    Methods: The method integrates four steps: (1) generation of candidate entities using GPT-4o, (2) validation via PubTator 3.0, (3) term extraction based on a metadata schema developed for the specific research area, and (4) a combined evaluation of PubTator-validated and schema-related terms. Applied to 23 articles published in the context of the Collaborative Research Center OncoEscape, the process was validated through supervised, face-to-face interviews with article authors, allowing an assessment of annotation precision using random-effects meta-analysis.
    Results: The approach yielded a mean of 19.6 schema-related and 6.7 PubTator-validated biomedical entities per article. Within the study's specific context, the overall annotation precision was 98% (95% CI 94%-100%), with most prediction errors concentrated in articles outside the primary basic research domain of the schema. In a subsample (n=20), available supplemental material was included in the prediction process, but it did not improve precision (98%, 95% CI 95%-100%). Moreover, the mean number of schema-related entities was 20.1 (P=.56) and the mean number of PubTator-validated entities was 6.7 (P=.68); these values did not increase with the additional information provided in the supplement.
    Conclusions: This study highlights the potential of large language model-supported metadata annotation. The findings underscore the practical feasibility of full-text analysis and suggest its potential for integration into routine workflows for biomedical metadata generation.
    Keywords:  AI; CAG; GPT-4o; PubTator 3.0; artificial intelligence; biomedical entities; cache-augmented generation; metadata annotation
    DOI:  https://doi.org/10.2196/73822
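A skeletal sketch of the 4-step pipeline outlined in entry 6. The GPT-4o call, the PubTator 3.0 lookup, and the OncoEscape metadata schema are not reproduced here; the stubs only show how the steps could fit together.

# Illustrative only: skeleton of a cache-augmented annotation pipeline.
# Steps 1 and 2 are left as stubs because the actual model and PubTator 3.0
# interfaces are not reproduced in this sketch.

def generate_candidates(article_text: str) -> set[str]:
    """Step 1: ask the LLM for candidate biomedical entities (call not shown)."""
    raise NotImplementedError

def validate_with_pubtator(entities: set[str]) -> set[str]:
    """Step 2: keep only entities recognised by a PubTator 3.0 lookup (call not shown)."""
    raise NotImplementedError

def match_schema(entities: set[str], schema_terms: set[str]) -> set[str]:
    """Step 3: keep entities that map onto the project-specific metadata schema."""
    return {e for e in entities if e.lower() in schema_terms}

def annotate(article_text: str, schema_terms: set[str]) -> dict:
    """Step 4: combine validated and schema-related terms into one annotation record."""
    candidates = generate_candidates(article_text)
    validated = validate_with_pubtator(candidates)
    schema_related = match_schema(candidates, schema_terms)
    return {"pubtator_validated": validated, "schema_related": schema_related}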
  7. Eur Heart J Digit Health. 2025 Nov;6(6): 1257-1263
       Aims: The aim of the current study was to assess the utility of a state-of-the-art large language model (LLM) system grounded in curated, clearly defined clinical practice recommendations to support clinicians in obtaining point-of-care guideline recommendations for individual patient treatment while maintaining transparency.
    Methods and results: We combined cloud-based and locally run LLMs with versatile open-source tools to form a multi-query, multimodal, retrieval-augmented generation chain that closely reflects European cardiology guidelines in its responses. We compared the performance of this generation chain to other LLMs, including GPT-3.5 and GPT-4, on a 306-question multiple-choice exam with questions consisting of short patient vignettes from various cardiology subspecialties, originally written to prepare candidates for the European Exam in Core Cardiology. On the multiple-choice test, our system demonstrated an overall accuracy of 73.53%, while GPT-3.5 and GPT-4 had overall accuracies of 44.03% and 62.26%, respectively. Our system outperformed GPT-3.5 and GPT-4 for the following categories of questions: coronary artery disease, arrhythmia, other, valvular heart disease, cardiomyopathies, endocarditis, adult congenital heart disease, pericardial disease, cardio-oncology, pulmonary hypertension, and non-cardiac surgery. For maximum transparency, the system also provided reference quotes for its recommendations.
    Conclusion: Our system demonstrated superior performance in question-answering tasks on a set of core cardiology questions as compared with contemporary publicly available chat models. The current study illustrates that LLMs can be tailored to provide documented and accountable guideline recommendations for actual clinical needs while ensuring that recommendations are derived from up-to-date, trustworthy, and traceable documents.
    Keywords:  Clinical practice guidelines; Large language model; Retrieval-augmented generation
    DOI:  https://doi.org/10.1093/ehjdh/ztaf111
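A minimal sketch of the retrieval-augmented answering idea in entry 7: retrieve guideline passages for a vignette, then ask the model to answer while citing the retrieved excerpts. The naive keyword retriever and prompt wording are assumptions standing in for the authors' multi-query, multimodal chain.

# Illustrative only: rank guideline passages by keyword overlap and build a
# citation-demanding prompt; the actual answering model is not shown.

def retrieve(question: str, passages: list[str], k: int = 3) -> list[str]:
    """Rank guideline passages by naive keyword overlap with the question."""
    q_tokens = set(question.lower().split())
    scored = sorted(passages, key=lambda p: -len(q_tokens & set(p.lower().split())))
    return scored[:k]

def build_prompt(question: str, context: list[str]) -> str:
    quoted = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (f"Answer the multiple-choice question using only the excerpts below, "
            f"and cite the excerpt numbers you relied on.\n{quoted}\n\nQuestion: {question}")

if __name__ == "__main__":
    guideline = ["Anticoagulation is recommended in atrial fibrillation with elevated stroke risk.",
                 "Beta-blockers are first-line for rate control in atrial fibrillation."]
    question = "Which therapy is first-line for rate control in atrial fibrillation?"
    print(build_prompt(question, retrieve(question, guideline)))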
  8. Turk J Biol. 2025;49(5): 585-599
       Background/aim: Artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT and DeepSeek, is being increasingly applied in clinical care, research, and education. The aim of this review is to examine how these tools may transform the conduct of medical and biological research and to define their limitations.
    Materials and methods: A narrative synthesis of the literature was performed, encompassing studies published between 2020 and 2025. Peer-reviewed journals, systematic reviews, and high-impact original research articles were included to ensure an evidence-based overview. The principal applications, validation metrics, and clinical implications across orthopedics, oncology, cardiology, internal medicine, and the biological sciences were analyzed.
    Results: LLMs demonstrate strong potential in supporting physicians during clinical decision-making, enhancing patient education, and assisting researchers in their work. They are valuable for language-related tasks and for generating structured, clear, and comprehensible content. However, concerns persist regarding data privacy, algorithmic bias, factual accuracy, and excessive dependence on data-driven outputs. Responsible implementation requires safeguards such as human oversight, model transparency, and domain-specific training.
    Conclusion: AI tools such as ChatGPT, DeepSeek, and similar models are transforming the way healthcare is delivered and studied. Their current capabilities appear highly promising. However, clinicians, technical experts, and policymakers must collaborate to ensure the safe, equitable, effective, and ethical integration of these technologies into real-world healthcare workflows.
    Keywords:  Artificial intelligence; ChatGPT; DeepSeek; clinical decision support; large language models; medical education
    DOI:  https://doi.org/10.55730/1300-0152.2765
  9. J Med Internet Res. 2025 Nov 19. 27: e78393
       Background: Prostate-specific antigen (PSA) testing remains the cornerstone of early prostate cancer detection. Society guidelines for prostate cancer screening via PSA testing serve to standardize patient care and are often used by trainees, junior staff, or generalist medical practitioners to guide medical decision-making. However, adherence to guidelines is a time-consuming and challenging task, and rates of inappropriate PSA testing are high. Retrieval-augmented generation (RAG) is a method to enhance the reliability of large language models (LLMs) by grounding responses in trusted external sources.
    Objective: This study aimed to evaluate a RAG-enhanced LLM system, grounded in current European Association of Urology and American Urological Association guidelines, to assess its effectiveness in providing guideline-concordant PSA screening recommendations compared to junior clinicians.
    Methods: A series of 44 fictional outpatient case scenarios was developed to represent a broad spectrum of clinical presentations. A RAG pipeline was developed, comprising a life expectancy estimation module based on the Charlson Comorbidity Index, followed by LLM-generated recommendations constrained to retrieved excerpts from the European Association of Urology and American Urological Association guidelines. Five junior clinicians were tasked to provide PSA testing recommendations for the same scenarios in closed-book and open-book formats. Answers were compared for accuracy in a binomial fashion. Fleiss κ was computed to assess interrater agreement among clinicians.
    Results: The RAG-LLM tool provided guideline-concordant recommendations in 95.5% (210/220) of case scenarios, compared to junior clinicians, who were correct in 62.3% (137/220) of scenarios in a closed-book format and 74.1% (163/220) of scenarios in an open-book format. The difference was statistically significant for both closed-book (P<.001) and open-book (P<.001) formats. Interrater agreement among clinicians was fair, with Fleiss κ of 0.294 and 0.321 for closed-book and open-book formats, respectively.
    Conclusions: Use of RAG techniques allows LLMs to integrate complex guidelines into day-to-day medical decision-making. RAG-LLM tools in urology have the capability to enhance clinical decision-making by providing guideline-concordant recommendations for PSA testing, potentially improving the consistency of health care delivery, reducing cognitive load on clinicians, and reducing unnecessary investigations and costs. While this study used synthetic cases in a controlled simulation environment, it establishes a foundation for future validation in real-world clinical settings.
    Keywords:  AI; LLM; artificial intelligence; guideline concordance; junior clinician; large language model
    DOI:  https://doi.org/10.2196/78393
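A heavily simplified sketch of the two-stage logic in entry 9: a Charlson-based life-expectancy estimate gates whether PSA testing is considered, and the final recommendation is meant to come from retrieved guideline text. The threshold and rule text below are placeholders, not the EAU/AUA recommendations.

# Illustrative only: life-expectancy gate plus guideline-constrained recommendation.
# The Charlson threshold and the echoed rule are placeholders, not guideline values.

def life_expectancy_adequate(charlson_index: int, threshold: int = 5) -> bool:
    """Stand-in for the Charlson-based life-expectancy module (threshold is assumed)."""
    return charlson_index < threshold

def recommend_psa(age: int, charlson_index: int, retrieved_rule: str) -> str:
    if not life_expectancy_adequate(charlson_index):
        return "PSA testing not recommended (limited life expectancy)."
    # In the published system the recommendation is constrained to retrieved guideline
    # excerpts; here we simply echo the retrieved rule for the given age.
    return f"Apply retrieved guideline excerpt for age {age}: {retrieved_rule}"

if __name__ == "__main__":
    print(recommend_psa(58, 2, "Offer risk-adapted PSA testing after shared decision-making."))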
  10. Cureus. 2025 Oct;17(10): e94949
      Background: Large language models (LLMs) are increasingly integrated into academic and professional research workflows, yet their capability to accurately select appropriate statistical tests for hypothesis testing remains underexplored. Incorrect statistical test selection can lead to invalid conclusions and compromise scientific validity, making this evaluation critical for determining the reliability of LLMs in research applications. The study objective was to evaluate and compare the accuracy of six prominent LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in selecting appropriate statistical tests for various hypothesis testing scenarios.
    Materials and methods: A comparative, cross-sectional evaluation was conducted using 20 standardized statistical testing scenarios, designed to cover 20 different hypothesis testing situations, including comparisons of means, proportions, non-parametric alternatives, paired versus independent samples, and correlation and regression analyses. All models were prompted with identical instructions and evaluated by five independent experts with profound knowledge of biostatistics. Responses were assessed for accuracy and rated on five domains (clarity and accessibility, identification of necessary assumptions, pedagogical value, problem-solving approach, and statistical reasoning) using a five-point Likert scale. Analysis of variance (ANOVA) was applied for between-group comparisons, and p<0.05 was considered significant.
    Results: All six LLMs achieved 100% accuracy in statistical test selection across all 20 hypothesis scenarios. However, significant variations emerged in explanatory quality. Claude demonstrated superior performance in clarity and accessibility (4.65 ± 0.58, p=0.05), while the problem-solving approach showed the most consistent excellence across models. Statistical reasoning ratings ranged from 3.16 to 4.66, with complex regression methods receiving lower ratings than basic statistical tests. Gemini excelled in pedagogical value (4.50 ± 0.68), while ChatGPT ranked lowest in statistical reasoning despite strong problem-solving capabilities.
    Conclusions: All LLMs demonstrated perfect accuracy in statistical test selection; however, differences exist in the quality of the explanations and reasoning provided. These findings suggest that current-generation LLMs have become dependable tools for statistical consultation in hypothesis testing scenarios. However, users should consider model-specific strengths when seeking detailed explanations or educational content.
    Keywords:  chatgpt; claude; deepseek; gemini; grok; hypothesis; large language models; le chat; statistical test
    DOI:  https://doi.org/10.7759/cureus.94949
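A minimal sketch of the between-model comparison in entry 10: a one-way ANOVA on expert Likert ratings for a single domain, with toy numbers in place of the study's data (requires scipy).

# Illustrative only: one-way ANOVA across models for one rating domain.
# The ratings below are invented toy values, not the study's data.

from scipy.stats import f_oneway

ratings = {
    "Claude":  [5, 4, 5, 5, 4],
    "ChatGPT": [4, 4, 3, 4, 4],
    "Gemini":  [5, 4, 4, 5, 5],
}

f_stat, p_value = f_oneway(*ratings.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
for model, r in ratings.items():
    print(model, sum(r) / len(r))   # mean rating per model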