J Neurosurg. 2025 Dec 19:1-9.
Benjamin S Hopkins, Ishan Shah, Jonathan Dallas, Austin J Borja, David Gomez, Robert G Briggs, David J Cote, Lawrance Chung, Gillian Shasby, Jonathan Sisti, James T Rutka, Gabriel Zada.
OBJECTIVE: The rapid development of artificial intelligence (AI) presents an opportunity to streamline the peer-review process and to provide key information to academic journals, editorial staff, reviewers, and authors. This study aimed to fine-tune several standard large language models (LLMs) and transformer models on the text of peer-reviewer comments and editorial outcome decisions in order to identify text-based associations with journal decisions of acceptance versus rejection.
METHODS: This study, conducted with the participation of the Journal of Neurosurgery Publishing Group (JNSPG), included anonymized final decisions and reviewer comments for all article submissions to the Journal of Neurosurgery (JNS) and its subsidiary journals from 2021 to 2023. All final decisions were grouped as binary (acceptance/revision vs rejection/transfer). Leading words (i.e., "acceptance" or "rejection") were removed from the textual reviewer comments, which were then analyzed using various machine learning models and LLMs, including logistic regression, BERT, GPT-2, GPT-3, GPT-4o, and gated recurrent unit (GRU) variants, to predict the final manuscript decision. Performance was measured using receiver operating characteristic (ROC) curves. Shapley Additive Explanations (SHAP) analysis was conducted to evaluate the impact of individual words on model predictions.
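To make the approach concrete, a minimal sketch of this kind of fine-tuning pipeline is given below. It is a hedged illustration, not the authors' actual implementation: the model checkpoint, the leading-word regex, the 80/20 split, and all hyperparameters are placeholder assumptions.

```python
# Illustrative sketch only: fine-tune a BERT-style classifier on reviewer
# comments with binary labels (1 = acceptance/revision, 0 = rejection/
# transfer) and score held-out predictions with ROC AUC. The checkpoint,
# regex, split, and hyperparameters are placeholder assumptions.
import re

import torch
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Strip leading decision words so the model cannot read the answer
# directly from the start of a comment, as described in METHODS.
LEADING_WORDS = re.compile(r"^\s*(acceptance|rejection)\b[:,]?\s*", re.IGNORECASE)


class ReviewDataset(Dataset):
    def __init__(self, texts, labels, tokenizer):
        texts = [LEADING_WORDS.sub("", t) for t in texts]
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=512, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item


def fine_tune_and_score(texts, labels, epochs=3, lr=2e-5):
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0)
    train_loader = DataLoader(ReviewDataset(X_tr, y_tr, tok),
                              batch_size=8, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in train_loader:
            opt.zero_grad()
            model(**batch).loss.backward()
            opt.step()
    # Probability of the acceptance/revision class on the held-out set.
    model.eval()
    probs = []
    with torch.no_grad():
        for batch in DataLoader(ReviewDataset(X_te, y_te, tok), batch_size=8):
            logits = model(**{k: v for k, v in batch.items()
                              if k != "labels"}).logits
            probs.extend(torch.softmax(logits, dim=-1)[:, 1].tolist())
    return roc_auc_score(y_te, probs)

# Usage (hypothetical data): auc = fine_tune_and_score(comments, decisions)
```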
RESULTS: In the ROC analysis, the fine-tuned GPT-4o mini and GPT-3 models achieved the highest area under the curve (AUC) values of 0.91, followed by BERT and GPT-2 with AUC values of 0.84. These were followed by the bidirectional GRU and untrained GPT-3 models, with AUC values of 0.75 and 0.70, respectively. The unidirectional GRU and untrained GPT-4o models demonstrated the lowest AUC values, at 0.68 and 0.67, respectively. In the SHAP analysis, the logistic regression model identified words such as "future," "interesting," and "written" as significant positive predictors of acceptance, whereas "clear," "unclear," and "does" were associated with rejection. The GRU model identified "study," "useful," and "journal" as significant positive predictors, and "unclear," "reading," and "incidence" as negative predictors.
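For illustration, a hedged sketch of word-level attribution in the spirit of this SHAP analysis is shown below, applied to a simple logistic regression over bag-of-words features. The example comments, labels, and the choice of shap.LinearExplainer are assumptions for demonstration, not the study's actual setup.

```python
# Illustrative sketch only: word-level attribution in the spirit of the
# SHAP analysis, applied to a simple logistic regression over bag-of-words
# features. The comments and labels below are invented placeholders.
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

comments = [
    "interesting, clearly written study with useful future directions",
    "the aims are unclear and the manuscript does not address prior concerns",
]
labels = [1, 0]  # 1 = acceptance/revision, 0 = rejection/transfer

vec = TfidfVectorizer()
X = vec.fit_transform(comments).toarray()
clf = LogisticRegression().fit(X, labels)

# LinearExplainer yields per-feature SHAP values for linear models;
# positive values push a prediction toward acceptance.
explainer = shap.LinearExplainer(clf, X)
shap_values = explainer.shap_values(X)

words = vec.get_feature_names_out()
mean_impact = shap_values.mean(axis=0)
for word, impact in sorted(zip(words, mean_impact),
                           key=lambda t: -abs(t[1]))[:10]:
    print(f"{word:>12s}  {impact:+.3f}")
```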
CONCLUSIONS: This proof-of-concept study demonstrates that fine-tuned AI models, particularly GPT-3, can predict manuscript acceptance with reasonable accuracy using only textual reviewer comments. Emerging themes that influence article outcome include clarity, utility, suitability, cohort size, and diligence in addressing reviewer queries. These findings suggest that, when fine-tuned, AI models hold significant potential to assist and facilitate the peer-review process.
Keywords: artificial intelligence; journal; large language model; peer review