bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-11-09
sixteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Cochrane Evid Synth Methods. 2025 Nov;3(6): e70061
       Introduction: Data extraction is a critical but resource-intensive step of the evidence review process. Whilst there is evidence that artificial intelligence (AI) and large language models (LLMs) can improve the efficiency of data extraction from randomized controlled trials, their potential for other study designs is unclear. In this context, this study aimed to evaluate the performance of a bespoke LLM pipeline (a Retrieval-Augmented Generation pipeline utilizing LLaMa 3-70B) to automate data extraction from a range of study designs by assessing the accuracy and reliability of the extractions measured as error types and acceptability.
    Methods: Accuracy was assessed by retrospectively comparing the LLM extractions against human extractions from a review previously conducted by the authors. A total of 173 data fields from 24 articles (including experimental, observational, qualitative, and modeling studies) were assessed, of which three were used for prompt engineering. Reliability was assessed by calculating the mean maximum agreement rate (the highest proportion of identical returns from 10 consecutive extractions) for 116 data fields from 16 of the 24 studies. An evaluation framework was developed to assess the accuracy and reliability of LLM outputs measured as error types and acceptability (acceptability was assessed on whether it would be usable in real-world settings if the model acted as one reviewer and a human as a second reviewer).
    Results: Of the 173 data fields evaluated for accuracy, 68% were rated by human reviewers as acceptable (consistent with what is deemed to be acceptable data extraction from a human reviewer). However, acceptability ratings varied depending on the data field extracted (33% to 100%), with at least 90% acceptability for "objective," "setting," and "study design," but 54% or less for data fields such as "outcome" and "time period." For reliability, the mean maximum agreement rate was 0.71 (SD: 0.28), with variation across different data fields.
    Conclusion: This evaluation demonstrates the potential for LLMs, when paired with human quality assurance, to support data extraction in evidence reviews that include a range of study designs. However, further improvements in performance and validation are required before the model can be introduced into review workflows.
    Keywords:  artificial intelligence; data extraction; evidence synthesis; feasibility; large language model; public health; systematic review
    DOI:  https://doi.org/10.1002/cesm.70061
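
    The reliability metric used in item 1 above, the mean maximum agreement rate, is straightforward to reproduce: for each data field, take the highest proportion of identical answers across the 10 repeated extractions, then average over fields. A minimal sketch in Python, using made-up field names and responses rather than the study's data:

```python
from collections import Counter

def max_agreement_rate(responses):
    """Highest proportion of identical answers among repeated LLM extractions."""
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)

# Hypothetical example: 10 repeated extractions for two data fields.
repeated_extractions = {
    "study design": ["RCT"] * 9 + ["cohort"],          # 9/10 identical
    "time period":  ["2019-2021"] * 6 + ["2019"] * 4,  # 6/10 identical
}

rates = {field: max_agreement_rate(r) for field, r in repeated_extractions.items()}
mean_rate = sum(rates.values()) / len(rates)
print(rates)      # {'study design': 0.9, 'time period': 0.6}
print(mean_rate)  # 0.75
```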
  2. Ann Intern Med. 2025 Nov 04.
       BACKGROUND: Data extraction is a critical but error-prone and labor-intensive task in evidence synthesis. Unlike other artificial intelligence (AI) technologies, large language models (LLMs) do not require labeled training data for data extraction.
    OBJECTIVE: To compare an AI-assisted versus a traditional, human-only data extraction process.
    DESIGN: Study within reviews (SWAR) using a prospective, parallel-group comparison with blinded data adjudicators.
    SETTING: Workflow validation within 6 ongoing systematic reviews of interventions under real-world conditions.
    INTERVENTION: Initial data extraction using an LLM (Claude, versions 2.1, 3.0 Opus, and 3.5 Sonnet) verified by a human reviewer.
    MEASUREMENTS: Concordance, time on task, accuracy, sensitivity, positive predictive value, and error analysis.
    RESULTS: The 6 systematic reviews in the SWAR yielded 9341 data elements from 63 studies. Concordance between the 2 methods was 77.2% (95% CI, 76.3% to 78.0%). Compared with the reference standard, the AI-assisted approach had an accuracy of 91.0% (CI, 90.4% to 91.6%) and the human-only approach an accuracy of 89.0% (CI, 88.3% to 89.6%). Sensitivities were 89.4% (CI, 88.6% to 90.1%) and 86.5% (CI, 85.7% to 87.3%), respectively, with positive predictive values of 99.2% (CI, 99.0% to 99.4%) and 98.9% (CI, 98.6% to 99.1%). Incorrect data were extracted in 9.0% (CI, 8.4% to 9.6%) of AI-assisted cases and 11.0% (CI, 10.4% to 11.7%) of human-only cases, with corresponding proportions of major errors of 2.5% (CI, 2.2% to 2.8%) versus 2.7% (CI, 2.4% to 3.1%). Missed data items were the most frequent error type in both approaches. The AI-assisted method reduced data extraction time by a median of 41 minutes per study.
    LIMITATIONS: Assessing concordance and classifying errors required subjective judgment. Consistently tracking time on task was challenging.
    CONCLUSION: Data extraction assisted by AI may offer a viable, more efficient alternative to human-only methods.
    PRIMARY FUNDING SOURCE: Agency for Healthcare Research and Quality and RTI International.
    DOI:  https://doi.org/10.7326/ANNALS-25-00739
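
    For orientation, the accuracy, sensitivity, and positive predictive value reported in item 2 above follow from a standard cross-tabulation of extracted versus reference data elements. A minimal sketch with hypothetical counts; the review's exact operationalisation of errors and correctly blank items may differ:

```python
def extraction_metrics(tp, fp, fn, tn=0):
    """Standard confusion-matrix summaries for extracted data elements.
    tp: correctly extracted items, fp: incorrectly extracted items,
    fn: items missed by the extractor, tn: items correctly left blank (if tracked)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "ppv": tp / (tp + fp),
    }

# Hypothetical counts for one review, not the paper's data.
print(extraction_metrics(tp=880, fp=7, fn=90, tn=23))
```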
  3. Sleep Med Rev. 2025 Oct 30. pii: S1087-0792(25)00145-5. [Epub ahead of print] 84: 102192
      This proof-of-concept study examined the performance of two prominent large language model (LLM)-based AI tools, ChatGPT 4o and Claude 3.5, in extracting data for four specific tasks: group size, event count, mean value, and standard deviation. Utilizing an established database that analyzed data extraction errors in systematic reviews on sleep medicine, we tested the ability of both AI tools to extract data from 648 randomized controlled trials (RCTs) using single- and multiple-sentence prompting approaches. The accuracy of the extracted data was compared to error-corrected metadata, with an overall accuracy reaching up to 71.5 % (95 % CI: 69.3 %, 73.7 %) for Claude and 69.1 % (95 % CI: 66.8 %, 71.3 %) for ChatGPT. Claude demonstrated superior performance over ChatGPT across all tasks, with the largest accuracy difference of up to 12.7 % (OR = 1.70, 95 % CI: 1.38, 2.10). The single-sentence prompt led to lower accuracy compared to the multiple-sentence prompts, with the largest percentage difference being -11.0 % (OR = 0.64, 95 % CI: 0.52, 0.78). Both AI tools achieved strong performance in extracting group size data. These findings underscore the potential of AI tools like Claude, especially when combined with effective prompting strategies such as multiple-sentence prompts, to assist data extraction in sleep medicine research.
    Keywords:  Automatic data extraction; ChatGPT; Claude; Evidence synthesis; Large language models; Sleep medicine
    DOI:  https://doi.org/10.1016/j.smrv.2025.102192
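
    The single- versus multiple-sentence prompting contrast in item 3 above can be illustrated with two prompt templates. The wording below and the `call_llm` helper are illustrative placeholders, not the prompts or API wrappers used in the study:

```python
# Sketch of the two prompting styles compared in the study; `call_llm` is a
# placeholder for whichever chat API (Claude or ChatGPT) is being evaluated.

SINGLE_SENTENCE = (
    "Extract the group size, event count, mean value and standard deviation for "
    "each arm of this randomized controlled trial: {text}"
)

MULTIPLE_SENTENCES = (
    "You will extract numerical results from a randomized controlled trial. "
    "First identify each study arm. Then report, per arm: (1) the number of "
    "participants analysed, (2) the number of events, (3) the mean of the "
    "outcome, and (4) its standard deviation. If a value is not reported, "
    "answer 'not reported'. Trial text: {text}"
)

def extract(trial_text: str, prompt_template: str, call_llm) -> str:
    """Run one extraction with the chosen prompting style."""
    return call_llm(prompt_template.format(text=trial_text))
```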
  4. Nurse Educ Pract. 2025 Oct 28. pii: S1471-5953(25)00371-3. [Epub ahead of print] 89: 104614
       AIM: To evaluate and compare human-led and artificial intelligence-automated critical appraisal of evidence.
    BACKGROUND: Critical appraisal is essential in evidence-based practice, yet many nurses lack the skills to perform it. Large language models offer potential support, but their role in critical appraisal remains underexplored.
    DESIGN: We conducted a comparative study to evaluate the performance of five commonly used large language models versus two human reviewers in appraising four systematic reviews on interventions to reduce medication administration errors.
    METHODS: We compared large language models and two human reviewers in independently appraising four systematic reviews using the JBI Critical Appraisal Checklist. These models were Perplexity Sonar (Pro), Claude 3.7 Sonnet, Gemini 2.0 Flash, GPT-4.5 and Grok-2. All models received identical full texts and standardized prompts. Responses were analyzed descriptively and agreement was assessed using Cohen's Kappa.
    RESULTS: Large language models showed full agreement with human reviewers on five of 11 JBI items. Most disagreements occurred in appraising search strategy, inclusion criteria and publication bias. The agreement between human reviewers and large language models ranged from slight to moderate. The highest level of agreement was observed with Claude (κ = 0.732), while the lowest level was observed with Gemini (κ = 0.394).
    CONCLUSION: Large language models can support aspects of critical appraisal of evidence but lack the contextual reasoning and methodological insight required for complex judgments. While Claude 3.7 Sonnet aligned most closely with human reviewers, human oversight remains essential. Large language models should serve as adjuncts to, not substitutes for, human judgment in evidence-based practice.
    Keywords:  Artificial intelligence in healthcare; Evidence-based practice; Multimodal large language models; Nursing
    DOI:  https://doi.org/10.1016/j.nepr.2025.104614
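
    Agreement between an LLM and a human reviewer on the 11 JBI checklist items, as in item 4 above, can be quantified with Cohen's kappa. A minimal sketch using scikit-learn and made-up Yes/No/Unclear ratings rather than the study's data:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings on the 11 JBI Critical Appraisal Checklist items for one review.
human = ["yes", "yes", "no", "yes", "unclear", "yes", "yes", "no", "yes", "yes", "yes"]
llm   = ["yes", "yes", "yes", "yes", "unclear", "yes", "no", "no", "yes", "yes", "yes"]

kappa = cohen_kappa_score(human, llm)
print(f"Cohen's kappa = {kappa:.3f}")
```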
  5. BMJ Open. 2025 Nov 05. 15(11): e106546
       INTRODUCTION: Traditional data extraction strategies, such as human double extraction, are both time-consuming and labour-intensive. Artificial intelligence (AI) has emerged as a promising tool for facilitating data extraction. However, it is not yet suitable as a standalone solution. We will conduct a randomised controlled trial (RCT) to compare the efficiency and accuracy of the AI-human data extraction strategy with human double extraction.
    METHODS AND ANALYSIS: This study is designed as a randomised, controlled, parallel trial. Participants will be randomly assigned to either the AI group or the non-AI group at a 1:2 allocation ratio. The AI group will use a hybrid approach that combines AI extraction followed by human verification by the same participant, while the non-AI group will use human double extraction. Data will be collected for two tasks: event count and group size. Ten RCTs will be selected from an established database that analysed data extraction errors in systematic reviews of sleep medicine. The primary outcome measure will be the percentage of correct extractions by both groups for each data extraction task.
    ETHICS AND DISSEMINATION: The trial is approved by the Ethics Council of Anhui Medical University (No. 81250507). We plan to publish the main results as an academic publication in an international peer-reviewed journal in 2026.
    TRIAL REGISTRATION NUMBER: Chinese Clinical Trial Register (Identifier: ChiCTR2500100393).
    Keywords:  Artificial Intelligence; Information Extraction; Randomized Controlled Trial
    DOI:  https://doi.org/10.1136/bmjopen-2025-106546
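
    The 1:2 allocation described in the protocol above could be implemented with permuted-block randomisation; the block size, seed, and group labels below are illustrative choices, not taken from the protocol:

```python
import random

def block_randomise(n_participants, seed=2026, block=("AI", "non-AI", "non-AI")):
    """Assign participants at a 1:2 AI:non-AI ratio using permuted blocks of 3."""
    rng = random.Random(seed)
    allocation = []
    while len(allocation) < n_participants:
        b = list(block)
        rng.shuffle(b)       # shuffle each block independently
        allocation.extend(b)
    return allocation[:n_participants]

print(block_randomise(12))   # e.g. ['non-AI', 'AI', 'non-AI', ...]
```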
  6. CPT Pharmacometrics Syst Pharmacol. 2025 Nov 04.
      Artificial intelligence (AI) is increasingly being explored as a tool to support pharmacometric modeling, particularly in addressing the coding challenges associated with NONMEM. In this study, we evaluated the ability of seven Large Language Models (LLMs) to generate NONMEM code across 13 pharmacometrics tasks, including a range of population pharmacokinetic (PK) and pharmacodynamic (PD) models. We further developed a standardized scoring rubric to assess code accuracy and created an optimized prompt to improve LLM performance. Our results showed that the OpenAI o1 and gpt-4.1 models achieved the best performance, both generating code with high accuracy for all tasks when using our optimized prompt. Overall, LLMs performed well in writing basic NONMEM model structures, providing a useful foundation for pharmacometrics model coding. However, user review and refinement remain essential, especially for complex models with special dataset alignment or advanced coding techniques. We also discussed the applications of AI in pharmacometrics education, particularly strategies to prevent overreliance on AI for coding. This work provides a benchmark for current LLMs' performance in NONMEM coding and introduces a practical prompt that can facilitate more accurate and efficient use of AI in pharmacometrics research and education.
    DOI:  https://doi.org/10.1002/psp4.70125
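
    A flavour of the task set in item 6 above: prompting an LLM for a basic population PK control stream. The prompt wording, model specification, and `call_llm` placeholder are illustrative; the study's optimized prompt and scoring rubric are not reproduced here:

```python
# Illustrative prompt for one of the simpler tasks in this space (a
# one-compartment oral-absorption population PK model). `call_llm` is a
# placeholder for the chat API being benchmarked.

TASK = (
    "Write a complete NONMEM control stream for a one-compartment population "
    "pharmacokinetic model with first-order absorption and elimination. "
    "Use ADVAN2/TRANS2, exponential inter-individual variability on CL and V, "
    "a proportional residual error model, and FOCE with interaction. "
    "Assume the dataset columns are ID, TIME, DV, AMT, EVID, MDV."
)

def generate_nonmem_code(call_llm) -> str:
    """Return the model-generated control stream for manual review and scoring."""
    return call_llm(TASK)
```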
  7. Curr Opin Ophthalmol. 2025 Oct 23.
       PURPOSE OF REVIEW: Rapid advances in large language models (LLMs) have led to the emergence of agentic artificial intelligence (AI) systems capable of autonomously performing complex scientific tasks. This review examines recent developments in agentic AI, highlighting their transformative potential for ophthalmology research and clinical practice, and discusses associated ethical considerations.
    RECENT FINDINGS: Recent studies demonstrate that agentic AI systems can autonomously execute tasks traditionally performed by human researchers, including peer review, hypothesis generation, systematic reviews, and experimental design. Notable examples include AI-generated manuscripts accepted through peer review, automated systematic reviews outperforming humans in accuracy and efficiency, and complex biomedical analyses performed across diverse domains. Although direct ophthalmology-specific applications remain nascent, the field's data-rich nature positions it ideally for adopting agentic AI in several areas such as automated chart review, health economics modeling, and enhanced image analysis.
    SUMMARY: Agentic AI represents a paradigm shift in scientific research, offering significant opportunities to enhance productivity, rigor, and innovation in ophthalmology. However, integration into clinical and research workflows necessitates careful consideration of ethical issues, including authorship attribution, data privacy, bias mitigation, and accountability. Clear governance frameworks, rigorous validation standards, and interdisciplinary training will be essential to responsibly harness agentic AI in ophthalmology.
    Keywords:  artificial intelligence; autonomous research; ethics; large language models; ophthalmology
    DOI:  https://doi.org/10.1097/ICU.0000000000001179
  8. Korean J Radiol. 2025 Nov 03.
      Recent systematic reviews have raised concerns about the quality of reporting in studies evaluating the accuracy of large language models (LLMs) in medical applications. Incomplete and inconsistent reporting hampers the ability of reviewers and readers to assess study methodology, interpret results, and evaluate reproducibility. To address this issue, the MInimum reporting items for CLear Evaluation of Accuracy Reports of Large Language Models in healthcare (MI-CLEAR-LLM) checklist was developed. This article presents an extensively updated version. While the original version focused on proprietary LLMs accessed via web-based chatbot interfaces, the updated checklist incorporates considerations relevant to application programming interfaces and self-managed models, typically based on open-source LLMs. As before, the revised MI-CLEAR-LLM focuses on reporting practices specific to LLM accuracy evaluations: specifically, the reporting of how LLMs are specified, accessed, adapted, and applied in testing, with special attention to methodological factors that influence outputs. The checklist includes essential items across categories such as model identification, access mode, input data type, adaptation strategy, prompt optimization, prompt execution, stochasticity management, and test data independence. This article also presents reporting examples from the literature. Adoption of the updated MI-CLEAR-LLM can help ensure transparency in reporting and enable more accurate and meaningful evaluation of studies.
    Keywords:  Application programming interface; Artificial intelligence; Chatbot; Checklist; Generative; Guideline; Healthcare; Large language model; Large multimodal model; Local deployment; Medicine; Radiology; Reporting
    DOI:  https://doi.org/10.3348/kjr.2025.1522
  9. J Orthop Surg Res. 2025 Nov 07. 20(1): 977
       BACKGROUND: Core Outcome Sets (COS) are essential for standardizing outcome reporting in clinical research, yet their development remains resource-intensive and time-consuming. Traditional COS development requires months of expert work for manual outcome extraction and classification from literature. While machine learning (ML) has shown promise in automating systematic reviews, its application to COS development, particularly for outcome identification and classification, remains underexplored. This study evaluates whether ML models can accurately extract and classify verbatim outcomes from clinical studies according to the COMET taxonomy and determines the amount of manually annotated data needed to support reliable model performance.
    METHODS: We developed an ML pipeline using a dataset of 149 full-text studies on lower limb lengthening surgery. The pipeline comprised a Sentence-BERT-based extraction model for identifying verbatim outcomes and a classification model for assigning outcomes to COMET taxonomy domains. We systematically assessed performance using training sets ranging from 5 to 85 articles to establish a practical threshold for reliable model behavior. Model performance was validated using a 28-article hold-out set with standard metrics: precision, recall, and F1-score.
    RESULTS: A training size of 20 articles proved sufficient for stable model performance. The extraction model achieved an F1-score of 94% with precision and recall above 90%. The classification model attained a weighted-average F1-score of 86%, with 87% precision and 88% recall. When applied to the full dataset, the system successfully identified 94% of manually extracted outcomes. The distribution of outcome domains identified by ML closely mirrored manual classification with high accuracy.
    CONCLUSION: This study demonstrates the feasibility of applying ML-based outcome extraction and classification within a specific COS development context for lower limb lengthening surgery. By reducing annotation requirements from 149 to just 20 articles while maintaining high accuracy, our approach offers a scalable, reproducible solution that substantially reduces the manual workload in COS development. This pipeline can play a significant role in streamlining evidence synthesis processes, potentially accelerating the generation of outcome lists for consensus-building exercises in COS development.
    Keywords:  Core outcome sets; Lower limb lengthening surgery; Machine learning; Outcome classification; Outcome extraction; Transfer learning
    DOI:  https://doi.org/10.1186/s13018-025-06386-8
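
    The two-stage pipeline in item 9 above (a Sentence-BERT encoder feeding an outcome-sentence extractor and a domain classifier) can be prototyped with off-the-shelf components. A minimal sketch assuming the sentence-transformers and scikit-learn packages; the encoder name, training sentences, classifier choice, and COMET-style domain labels are illustrative, not the paper's:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative training set: is the sentence a verbatim outcome, and which domain?
sentences = [
    "The primary outcome was time to full weight bearing.",
    "Pin-site infection occurred in 12 patients.",
    "Patients were recruited from three centres.",
]
is_outcome = [1, 1, 0]                                        # stage 1: outcome extraction
domain = ["physiological/clinical", "adverse events", None]   # stage 2: domain classification

encoder = SentenceTransformer("all-MiniLM-L6-v2")             # stand-in for the paper's encoder
X = encoder.encode(sentences)

extractor = LogisticRegression(max_iter=1000).fit(X, is_outcome)

outcome_idx = [i for i, y in enumerate(is_outcome) if y == 1]
classifier = LogisticRegression(max_iter=1000).fit(
    X[outcome_idx], [domain[i] for i in outcome_idx]
)

new = encoder.encode(["Leg length discrepancy was measured at 12 months."])
print(extractor.predict(new))   # 1 if flagged as a verbatim outcome
```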
  10. ACS Omega. 2025 Oct 28. 10(42): 49545-49556
      We developed an end-to-end workflow to automate the construction of materials science databases from published literature, addressing a traditionally manual, time-intensive, and labor-intensive process. The work systematically evaluates and compares different machine learning (ML) methods to optimize each task. For identifying relevant publications, we tested various ML techniques and concluded that a combination of large language model (LLM)-based embeddings, clustering, and direct LLM queries is most effective. In the subsequent data extraction phase, we employed OpenAI's GPT-4 to extract materials and their properties, achieving accuracy comparable to manually curated data sets. Additionally, we integrated AI/ML methods to automatically generate SMILES from chemical structure images, expanding the workflow's applicability to organic materials. To validate the workflow, we applied it to studying organic donor materials in organic photovoltaic devices and benchmarked its performance against a manually curated data set derived from 503 papers. The results demonstrate the workflow's efficiency and accuracy. Finally, based on our findings, we provide recommendations for selecting the best ML methods for each task and propose further improvements for future tool development. This workflow represents a major advancement in accelerating the development of materials science databases and enables data science applications in a broader range of research topics that were historically infeasible due to the lack of available data sets.
    DOI:  https://doi.org/10.1021/acsomega.5c03612
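
    The screening step in item 10 above combines LLM-based embeddings, clustering, and direct LLM queries. A minimal sketch of the embedding-plus-clustering part, with a small open encoder standing in for the LLM embeddings, made-up abstracts, and the LLM-query step omitted:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Abstracts to screen (made up); clusters separate on-topic photovoltaics papers
# from unrelated literature before any direct LLM relevance query is made.
abstracts = [
    "A new non-fullerene acceptor improves organic photovoltaic efficiency.",
    "Donor polymer design for bulk-heterojunction solar cells.",
    "A randomized trial of statins for cardiovascular prevention.",
    "Deep learning for protein structure prediction.",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # cluster labels grouping the photovoltaics abstracts together
```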
  11. Proteomics. 2025 Nov 02. e70070
      Omics analyses are crucial for understanding molecular mechanisms in biological research. The vast quantity of detected biomolecules presents a significant challenge in identifying potential biomarkers. Traditional methods rely on labor-intensive literature mining to extract meaningful insights from long lists of regulated candidate biomolecules. To address this, we developed OmixLitMiner 2 (OLM2) to improve the efficiency of omics data interpretation, speed up the validation of results and accelerate further evaluation based on the selection of marker candidates for subsequent experiments. The updated tool utilizes UniProt for synonym and protein name retrieval and employs the PubMed database as well as PubTator 3.0 for searching titles or abstracts of available biomedical literature. It allows for advanced keyword-based searches and provides classification of proteins or genes with respect to their representation in the literature in relation to scientific questions. OLM2 offers improved functionality over the previous version and comes with a user-friendly Google Colab interface. In comparison to the previous version, OLM2 improves the retrieval of relevant publications and the classification of biomolecules. We use a case study of spatially resolved proteomic data from the mouse brain cortex to demonstrate that the tool significantly reduces the time required compared with manual searches and enhances the interpretability of molecular analysis.
    Keywords:  OmixLitMiner; PubTator3.0; literature mining; proteomics; transcriptomics
    DOI:  https://doi.org/10.1002/pmic.70070
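
    The literature-lookup step that OLM2 automates (counting PubMed records that mention a protein or gene together with a keyword) can be approximated with NCBI's public E-utilities. A minimal sketch using the esearch endpoint; the gene and keyword are arbitrary examples, and the UniProt and PubTator 3.0 components of OLM2 are not shown:

```python
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def count_pubmed_hits(gene: str, keyword: str) -> int:
    """Count PubMed records mentioning a gene/protein together with a keyword."""
    term = f"({gene}[Title/Abstract]) AND ({keyword}[Title/Abstract])"
    resp = requests.get(ESEARCH, params={"db": "pubmed", "term": term, "retmode": "json"})
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

print(count_pubmed_hits("GFAP", "cortex"))
```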
  12. Campbell Syst Rev. 2025 Dec;21(4): e70073
      The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence-based decision-making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models that automatically generates structured data tables according to users' queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi-faceted visual summaries and semantic grouping capabilities to resolve cross-document data inconsistencies. A within-subjects study with nutrition and NLP researchers demonstrates SciDaSynth's effectiveness in producing high-quality structured data more efficiently than baseline methods. We discuss design implications for human-AI collaborative systems supporting data extraction tasks.
    Keywords:  data extraction; knowledge base; large language models; scientific literature
    DOI:  https://doi.org/10.1002/cl2.70073
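
    The core query-to-table step in item 12 above can be sketched as a single structured-output prompt. The `call_llm` helper and the column names are illustrative placeholders, and SciDaSynth's multimodal input handling, visual summaries, and interactive validation are not captured here:

```python
import json

# `call_llm` is a placeholder for the underlying chat model.

def extract_table(passage: str, query: str, call_llm) -> list[dict]:
    """Ask the model to answer a user query as a structured JSON table."""
    prompt = (
        f"Question: {query}\n"
        "From the passage below, return a JSON array of objects with keys "
        "'study', 'population', 'intervention', 'value', 'unit'. "
        "Use null for anything not reported.\n\n"
        f"Passage:\n{passage}"
    )
    return json.loads(call_llm(prompt))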
  13. JMIRx Med. 2025 Nov 03. 6: e74899
       Background: Studies have shown that large language models (LLMs) are promising in therapeutic decision-making, with findings comparable to those of medical experts, but these studies used highly curated patient data.
    Objective: This study aimed to determine if LLMs can make guideline-concordant treatment decisions based on patient data as typically present in clinical practice (lengthy, unstructured medical text).
    Methods: We conducted a retrospective study of 80 patients with severe aortic stenosis who were scheduled for either surgical (SAVR; n=24) or transcatheter aortic valve replacement (TAVR; n=56) by our institutional heart team in 2022. Various LLMs (BioGPT, GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, LLaMA-2, Mistral, PaLM 2, and DeepSeek-R1) were queried using either anonymized original medical reports or manually generated case summaries to determine the most guideline-concordant treatment. We measured agreement with the heart team using Cohen κ coefficients, reliability using intraclass correlation coefficients (ICCs), and fairness using the frequency bias index (FBI; FBI >1 indicated bias toward TAVR).
    Results: When presented with original medical reports, LLMs showed poor performance (Cohen κ coefficient: -0.47 to 0.22; ICC: 0.0-1.0; FBI: 0.95-1.51). The LLMs' performance improved substantially when case summaries were used as input and additional guideline knowledge was added to the prompt (Cohen κ coefficient: -0.02 to 0.63; ICC: 0.01-1.0; FBI: 0.46-1.23). Qualitative analysis revealed instances of hallucinations in all LLMs tested.
    Conclusions: Even advanced LLMs require extensively curated input for informed treatment decisions. Unreliable responses, bias, and hallucinations pose significant health risks and highlight the need for caution in applying LLMs to real-world clinical decision-making.
    Keywords:  aortic stenosis; clinical practice guidelines; foundation models; large language models; medical data processing; reasoning models; treatment decision-making
    DOI:  https://doi.org/10.2196/74899
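
    The abstract in item 13 does not define the frequency bias index; assuming the standard forecast-verification definition (recommended positives divided by observed positives), it could be computed as follows, with made-up recommendations rather than the study's data:

```python
def frequency_bias_index(llm_choices, reference_choices, target="TAVR"):
    """Frequency bias index under the forecast-verification definition:
    (times the model recommends `target`) / (times the reference chose `target`).
    Values > 1 indicate over-recommendation of `target`.
    Assumption: this matches the FBI used in the abstract above."""
    return llm_choices.count(target) / reference_choices.count(target)

# Hypothetical example, not the study's data: heart team chose TAVR 56/80 times.
reference = ["TAVR"] * 56 + ["SAVR"] * 24
llm = ["TAVR"] * 68 + ["SAVR"] * 12
print(round(frequency_bias_index(llm, reference), 2))  # 1.21 -> biased toward TAVR
```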