Postgrad Med J. 2026 Feb 26. pii: qgag018. [Epub ahead of print]
BACKGROUND: Despite growing interest in using the Chat Generative Pre-trained Transformer (ChatGPT) for academic writing, limited evidence exists regarding its ability to generate abstracts that are structurally compliant and ethically acceptable in orthopedic surgery.
OBJECTIVE: To assess the performance of ChatGPT-generated abstracts using only article titles from recent publications in major orthopedic journals.
METHODS: We extracted 90 human-written abstracts from three leading orthopedic journals and used each title to generate abstracts with ChatGPT-3.5 and ChatGPT-4.0. A total of 180 AI-generated abstracts were created using a standardized prompt. Each abstract was evaluated for format compliance, adherence to word limit, word count, consistency in study design, sample size correlation, and conclusion relevance. Plagiarism and AI detectability were assessed. Four orthopedic surgeons independently reviewed a subset of abstracts to identify their source.
RESULTS: GPT-4.0 achieved perfect compliance with journal format and word count, while GPT-3.5 met these criteria in only 34.4% (31 of 90) and 86.7% (78 of 90) of cases, respectively (P < .001). However, only half of the abstracts presented fully relevant conclusions. Plagiarism was flagged in 45% to 70% of cases across the two detection programs. AI detection scores were significantly higher in GPT-generated abstracts than in human-written ones (P < .001). Human reviewers showed limited ability to distinguish human from AI-generated abstracts, with minimal inter-rater agreement (Cohen's kappa = 0.25).
CONCLUSION: Although ChatGPT, particularly GPT-4.0, can generate abstracts that meet structural requirements and reproduce surface-level elements of academic style, significant limitations remain in content accuracy, originality, and ethical acceptability.
KEY MESSAGES:
What is already known on this topic: The expanding application of artificial intelligence (AI) has driven the development of large language models (LLMs) that generate natural language with steadily improving performance, owing to better context handling, broader multimodal capabilities, and optimized architectures. However, their capacity to generate structurally compliant and ethically acceptable abstracts in orthopedic surgery remains unclear.
What this study adds: This study demonstrates that while GPT-4.0 adheres to formatting and word-count requirements better than GPT-3.5, both models frequently generate inaccurate conclusions and exhibit high plagiarism rates, despite being difficult for human reviewers to distinguish from human-written text.
How this study might affect research, practice, or policy: Although ChatGPT shows potential as a supportive tool for drafting orthopedic research abstracts, our findings indicate that its unregulated or exclusive use raises significant ethical and practical concerns. To safeguard the integrity of academic publishing, clear, field-specific guidelines governing the responsible use of LLMs in scientific writing are needed.
Keywords: ChatGPT; academic writing; artificial intelligence; large language model; orthopedics