J Oral Maxillofac Surg. 2025 Mar 28. pii: S0278-2391(25)00187-9. [Epub ahead of print]
BACKGROUND: The peer review process faces challenges of reviewer fatigue and bias. Artificial intelligence (AI) may help address these issues, but its application in the oral and maxillofacial surgery peer review process remains unexplored.
PURPOSE: The purpose of the study was to measure and compare manuscript review performance among 4 large language models and human reviewers. Large language models are AI systems trained on vast text datasets that can generate human-like responses.
STUDY DESIGN/SETTING/SAMPLE: In this cross-sectional study, we evaluated original research articles submitted to the Journal of Oral and Maxillofacial Surgery between January and December 2023. Manuscripts were randomly selected from all submissions that received at least one external peer review.
PREDICTOR VARIABLE: The predictor variable was the source of review: human reviewers or AI models. We tested 4 AI models: Generative Pretrained Transformer-4o and Generative Pretrained Transformer-o1 (OpenAI, San Francisco, CA), Claude (version 3.5; Anthropic, San Francisco, CA), and Gemini (version 1.5; Google, Mountain View, CA). These models are referred to by their architectural design characteristics, ie, dense transformer, sparse-expert, multimodal, and base transformer, to highlight their technical differences rather than their commercial identities.
OUTCOME VARIABLES: Primary outcomes included reviewer recommendations (accept = 3 to reject = 0) and responses to 6 Journal of Oral and Maxillofacial Surgery editor questions. Secondary outcomes comprised temporal stability analysis (consistency of AI evaluations over time), domain-specific assessments (methodology, statistical analysis, clinical relevance, originality, and presentation clarity; each rated on a 1 to 5 scale), and model clustering patterns.
ANALYSES: Agreement between AI and human recommendations was assessed using weighted Cohen's kappa. Intermodel reliability and temporal stability (24-hour interval) were evaluated using intraclass correlation coefficients. Domain scoring patterns were analyzed using multivariate analysis of variance with post hoc comparisons and hierarchical clustering.
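For illustration only, the following minimal Python sketch (not the study's analysis code) shows how weighted Cohen's kappa could be computed for an AI-versus-human recommendation comparison using scikit-learn; the recommendation values are hypothetical, and linear weighting is an assumption, since the abstract does not state which weighting scheme was used.

```python
# Illustrative sketch only; scores below are hypothetical, not study data.
from sklearn.metrics import cohen_kappa_score

# Recommendations on the journal's ordinal scale: accept = 3 ... reject = 0,
# one entry per manuscript (22 manuscripts, as in the study sample size).
human_recs = [0, 0, 1, 0, 2, 0, 1, 0, 0, 3, 0, 1, 0, 0, 2, 0, 0, 1, 0, 0, 2, 0]
ai_recs    = [1, 2, 1, 1, 2, 1, 2, 1, 1, 3, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1]

# Weighted kappa penalizes larger ordinal disagreements more heavily;
# linear weights are shown here as an assumed choice.
kappa = cohen_kappa_score(human_recs, ai_recs, weights="linear")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```

In the same spirit, intraclass correlation coefficients for intermodel reliability and 24-hour temporal stability would be computed on repeated AI ratings of the same manuscripts, though the specific ICC form used is not reported in the abstract.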
RESULTS: Of 22 manuscripts, human reviewers recommended rejection for 15 (68.2%), while AI rejection rates were significantly lower (0% to 9.1%; P < .001). AI models demonstrated high consistency in their evaluations over time (intraclass correlation coefficient = 0.88, P < .001) and showed moderate agreement with human decisions (κ = 0.38 to 0.46).
CONCLUSIONS: While AI models showed reliable internal consistency, they were less likely than human reviewers to recommend rejection. This suggests their optimal use is as screening tools that complement expert human review rather than as replacements for it.