Ann Surg Oncol. 2025 Oct 15.
BACKGROUND: Large language models (LLMs) have gained prominence in medical applications, yet their performance in specialized clinical tasks remains underexplored. Prostate cancer, a complex malignancy requiring guideline-based management, presents a rigorous testbed for evaluating artificial intelligence (AI)-assisted decision-making. This study compared the clinical accuracy, reasoning ability, and language quality of DeepSeek-R1 and ChatGPT variants in addressing prostate cancer diagnosis and treatment.
METHODS: A dataset of 98 prostate cancer multiple-choice questions drawn from MedQA, MedMCQA, and China's National Medical Licensing Examination was constructed, alongside three real-world clinical cases. Responses were generated by five LLMs (DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, -o3, and -o4-mini) and evaluated for accuracy across three repeated runs. For the case-based simulations, only R1 and o3 were compared with practicing urologists. A Clinical Decision Quality Assessment Scale (CDQAS) assessed outputs across four domains: readability, medical knowledge accuracy, diagnostic test appropriateness, and logical coherence. Blinded scoring was performed by senior urologic oncologists. Statistical analyses used one-way ANOVA in GraphPad Prism v10.1.2 (GraphPad Software, Boston, MA, USA).
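The accuracy comparison described above can be outlined as follows. This is a minimal Python sketch, not the study's actual pipeline (the authors used GraphPad Prism): the per-run accuracy values are hypothetical placeholders, and scipy.stats.f_oneway stands in for the reported one-way ANOVA.

```python
# Sketch of the repeated-run accuracy comparison; all values are
# illustrative only, not data from the study.
from scipy.stats import f_oneway

# Hypothetical fraction of the 98 questions answered correctly in each
# of three repeated runs per model.
runs = {
    "DeepSeek-R1":     [0.969, 0.959, 0.969],
    "DeepSeek-V3":     [0.908, 0.918, 0.898],
    "ChatGPT-4o":      [0.888, 0.878, 0.898],
    "ChatGPT-o3":      [0.939, 0.929, 0.939],
    "ChatGPT-o4-mini": [0.918, 0.908, 0.918],
}

# Mean accuracy per model across the three runs.
for model, accs in runs.items():
    print(f"{model}: {sum(accs) / len(accs):.2%}")

# One-way ANOVA across models, analogous to the reported analysis.
f_stat, p_value = f_oneway(*runs.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```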
RESULTS: DeepSeek-R1 achieved the highest accuracy (96.60%) on the multiple-choice tasks, significantly outperforming the other models (p < 0.05 to p < 0.0001). In the simulated case evaluations, both R1 and o3 performed comparably to physicians in overall readability and diagnostic appropriateness. Whereas R1 demonstrated superior guideline compliance and evidence-based reasoning, o3 showed advantages in workflow clarity, sequencing, and response fluency; notably, o3 also generated fewer explicit errors than R1. Human clinicians maintained strengths in terminology precision and logical reasoning.
CONCLUSION: DeepSeek-R1 and ChatGPT-o3 exhibit complementary strengths in prostate cancer clinical decision-making, with R1 favoring factual accuracy and o3 excelling in expressive clarity. Although both models approach human-level performance in structured evaluations, human oversight and continued domain-specific optimization remain essential for their safe and effective integration into clinical workflows.
Keywords: ChatGPT; Clinical decision-making; DeepSeek-R1; Large language models (LLMs); Prostate cancer