J Pediatr Soc North Am. 2025 Feb;10:100135.
Background: Large language models, including Chat Generative Pre-trained Transformer (ChatGPT) and Google Gemini, have accelerated public access to information, but their accuracy in answering medical questions remains unknown. In pediatric orthopaedics, no study has used board-certified expert opinion to evaluate the accuracy of artificial intelligence (AI) chatbots against evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on developmental dysplasia of the hip (DDH) with respect to accuracy, supplementary and incomplete response patterns, and readability.
Methods: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions derived from 9 evidence-based recommendations in the 2022 AAOS CPG on DDH. Responses were obtained on July 1, 2024, anonymized, and independently evaluated by two pediatric orthopaedic attending surgeons. Supplementary responses were further graded by whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, and Gunning Fog Index) were compared. Cohen's kappa (κ) was calculated to assess inter-rater reliability. Chi-square analyses and single-factor analysis of variance were used to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
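For reference, the readability and agreement statistics above follow standard definitions; assuming the conventional formulas (the study does not restate them), with W = words, S = sentences, Y = syllables, C = complex words (three or more syllables), p_o = observed rater agreement, and p_e = chance-expected agreement:

```latex
% Standard formulas (assumed versions) for the metrics named above
\begin{align*}
\text{Flesch Reading Ease} &= 206.835 - 1.015\,\frac{W}{S} - 84.6\,\frac{Y}{W}\\
\text{Flesch--Kincaid Grade Level} &= 0.39\,\frac{W}{S} + 11.8\,\frac{Y}{W} - 15.59\\
\text{Gunning Fog Index} &= 0.4\left(\frac{W}{S} + 100\,\frac{C}{W}\right)\\
\kappa &= \frac{p_o - p_e}{1 - p_e}
\end{align*}
```

Higher Reading Ease scores and lower grade-level and Fog scores indicate more accessible text; κ ranges from ≤ 0 (no agreement beyond chance) to 1 (perfect agreement).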
Results: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate for 5/9, 6/9, and 6/9 recommendations; supplementary for 8/9, 7/9, and 9/9; and incomplete for 7/9, 6/9, and 7/9, respectively. Of the 24 supplementary responses, 5 (20.8%), 16 (66.7%), and 3 (12.5%) required no, some, and many modifications, respectively. There were no significant differences among chatbots in accuracy (P = 0.853), supplementary responses (P = 0.325), necessary modifications (P = 0.661), or incomplete responses (P = 0.825). κ was highest for accuracy at 0.17, indicating only slight inter-rater agreement. Google Gemini was significantly more readable by Flesch-Kincaid reading level, Flesch Reading Ease, and Gunning Fog Index (all P < 0.05).
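As an illustration, the accuracy comparison can be reproduced from the reported counts. The following is a minimal Python sketch assuming responses were tabulated as a 3×2 chatbot-by-accuracy contingency table; the tabulation is our assumption, but it recovers the reported P = 0.853:

```python
# Minimal sketch of the accuracy comparison, assuming a 3x2
# contingency table (chatbot x accurate/inaccurate) built from
# the reported counts out of 9 recommendations per chatbot.
from scipy.stats import chi2_contingency

observed = [
    [5, 4],  # ChatGPT-4.0: 5/9 accurate, 4/9 not
    [6, 3],  # ChatGPT-3.5: 6/9 accurate, 3/9 not
    [6, 3],  # Google Gemini: 6/9 accurate, 3/9 not
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, P = {p:.3f}")  # P ~= 0.853
```

With expected cell counts this small (several below 5), an exact test is sometimes preferred over chi-square, though the study reports chi-square results.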
Conclusions: In the setting of DDH, AI chatbots demonstrated limited accuracy, high rates of supplementary and incomplete responses, and text written at a complex reading level. Pediatric orthopaedic surgeons can counsel patients and their families to set appropriate expectations regarding the utility of these novel tools.
Key Concepts: (1) Responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were inadequately accurate, frequently provided supplementary information that required modification, and frequently lacked essential details from the AAOS CPGs on DDH. (2) Accurate, supplementary, and incomplete response patterns did not differ significantly among the three chatbots. (3) Google Gemini provided the most readable responses of the three chatbots. (4) Pediatric orthopaedic surgeons can play a role in counseling patients and their families on the limited utility of AI chatbots for patient education.
Level of Evidence: IV.
Keywords: American Academy of Orthopaedic Surgeons; ChatGPT; Clinical practice guideline; Developmental dysplasia of the hip; Google Gemini