J Med Internet Res. 2025 Jul 14;27:e64452.
Eden Avnat, Michal Levy, Daniel Herstain, Elia Yanko, Daniel Ben Joya, Michal Tzuchman Katz, Dafna Eshel, Sahar Laros, Yael Dagan, Shahar Barami, Joseph Mermelstein, Shahar Ovadia, Noam Shomron, Varda Shalev, Raja-Elie E Abdulnour.
Background: Clinical problem-solving requires processing of semantic medical knowledge, such as illness scripts, and numerical medical knowledge of diagnostic tests for evidence-based decision-making. Although large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate nonlanguage, evidence-based answers to clinical questions is inherently limited by tokenization.
Objective: This study aimed to evaluate the performance of LLMs on two question types, numerical (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs across medical aspects and comparing their performance with that of humans.
Methods: To generate straightforward multiple-choice questions and answers (Q and As) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (containing data from more than 50,000 peer-reviewed studies) to create the EBM questions and answers (EBMQA) dataset. EBMQA comprises 105,222 Q and As, categorized by medical topics (eg, medical disciplines) and nonmedical topics (eg, question length) and classified into numerical or semantic types. We benchmarked a dataset of 24,000 Q and As on two state-of-the-art LLMs, GPT-4 (OpenAI) and Claude 3 Opus (Anthropic). We evaluated the LLMs' accuracy on the semantic and numerical question types and across the sublabeled topics. In addition, we examined the LLMs' question-answering rate by allowing them to abstain from responding to questions. For validation, we compared the performance of six human medical experts and the two LLMs on a separate set of 100 numerical EBMQA questions.
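The paper's exact prompting and scoring pipeline is not reproduced here; as a rough illustration of the benchmarking procedure described above, the following Python sketch shows how a multiple-choice EBMQA-style item with an explicit abstain option might be posed to a model and scored. The `ask_llm` callable, the `MCQItem` structure, and the abstain phrasing are assumptions for illustration, not the study's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ABSTAIN = "I do not know"  # hypothetical opt-out choice, mirroring the abstention setting in the Methods

@dataclass
class MCQItem:
    question: str
    options: list[str]  # candidate answers, one of which is correct
    correct: str        # gold answer derived from the knowledge graph

def score_item(item: MCQItem, ask_llm: Callable[[str], str]) -> Optional[bool]:
    """Pose one multiple-choice item; return True/False if answered, None if the model abstains."""
    choices = item.options + [ABSTAIN]
    prompt = (
        item.question
        + "\n"
        + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(choices))
        + "\nReply with the text of exactly one option."
    )
    reply = ask_llm(prompt).strip().lower()
    if ABSTAIN.lower() in reply:
        return None  # abstention lowers the question-answering rate, not the accuracy
    return item.correct.lower() in reply

def accuracy_and_answer_rate(results: list[Optional[bool]]) -> tuple[float, float]:
    """Accuracy over answered items and the fraction of items the model chose to answer."""
    answered = [r for r in results if r is not None]
    answer_rate = len(answered) / len(results) if results else 0.0
    accuracy = sum(answered) / len(answered) if answered else 0.0
    return accuracy, answer_rate
```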
Results: In an analysis of 24,542 Q and As, Claude 3 and GPT-4 performed better on semantic Q and As (68.7%, n=1593 and 68.4%, n=1709, respectively) than on numerical Q and As (61.3%, n=8583 and 56.7%, n=12,038, respectively), with Claude 3 outperforming GPT-4 in numerical accuracy (P<.001). A median accuracy gap of 7% (IQR 5%-10%) was observed between the best and worst sublabels per topic, with different LLMs excelling in different sublabels. Among the medical discipline sublabels, Claude 3 performed well on neoplastic disorders but struggled with genitourinary disorders (69%, n=676 vs 58%, n=464; P<.001), while GPT-4 excelled on cardiovascular disorders but struggled with neoplastic disorders (60%, n=1076 vs 53%, n=704; P=.0002). Furthermore, humans (82.3%) surpassed both Claude 3 (64.3%; P<.001) and GPT-4 (55.8%; P<.001) in the validation test. The Spearman correlation between question-answering rate and accuracy was not significant for either Claude 3 or GPT-4 (ρ=0.12, P=.69 and ρ=0.43, P=.13, respectively).
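As a small illustration of the correlation analysis reported above (not the authors' code), a Spearman correlation between per-sublabel question-answering rates and accuracies can be computed with SciPy; the values below are placeholders, not study data.

```python
from scipy.stats import spearmanr

# Placeholder per-sublabel values; the study's actual rates are reported in the paper.
answer_rate = [0.91, 0.88, 0.95, 0.90, 0.87]  # fraction of questions the model chose to answer
accuracy = [0.62, 0.58, 0.66, 0.61, 0.64]     # accuracy on the questions it did answer

rho, p_value = spearmanr(answer_rate, accuracy)
print(f"Spearman rho = {rho:.2f}, P = {p_value:.2f}")
```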
Conclusions: Both LLMs performed better on semantic than on numerical Q and As, with Claude 3 surpassing GPT-4 on numerical Q and As. However, both LLMs showed inter- and intramodel gaps across different medical aspects and remained inferior to humans. In addition, their decision to answer or abstain from a question does not reliably predict how accurately they perform when they do attempt an answer. Thus, their medical advice should be treated with caution.
Keywords: benchmark; dataset; evidence-based medicine; large language models; questions and answers