bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-09-21
four papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Arch Bone Jt Surg. 2025;13(8):460-469
       Objectives: Large language models (LLMs) may improve the process of conducting systematic literature reviews. Our aim was to evaluate the utility of one popular LLM chatbot, Chat Generative Pre-trained Transformer (ChatGPT), in systematic literature reviews when compared to traditionally conducted reviews.
    Methods: We identified five systematic reviews published in the Journal of Bone and Joint Surgery from 2021 to 2022. We retrieved the clinical questions, methodologies, and included studies for each review. We evaluated ChatGPT's performance on three tasks. (1) For each published systematic review's core clinical question, ChatGPT designed a relevant database search strategy. (2) ChatGPT screened the abstracts of those articles identified by that search strategy for inclusion in a review. (3) For one systematic review, ChatGPT reviewed each individual manuscript identified after screening to identify those that fit inclusion criteria. We compared the performance of ChatGPT on each of these three tasks to the previously published systematic reviews.
    Results: ChatGPT captured a median of 91% (interquartile range [IQR] 84%-94%) of articles in the published systematic reviews. After screening of these abstracts, ChatGPT captured a median of 75% (IQR 70%-79%) of articles included in the published systematic reviews. On in-depth screening of manuscripts, ChatGPT captured only 55% of target publications; however, this improved to 100% on review of the manuscripts that ChatGPT had identified at this step. Qualitative analysis of ChatGPT's performance highlighted the importance of prompt design and engineering.
    Conclusion: Using published reviews as a gold standard, ChatGPT demonstrated the ability to replicate fundamental tasks of orthopedic systematic review. Cautious, supervised use of this general-purpose LLM may aid the systematic literature review process. Further study and discussion of the role of LLMs in literature review are needed.
    Keywords:  ChatGPT; Large language models; Orthopedics; Systematic review
    DOI:  https://doi.org/10.22038/ABJS.2025.84896.3874
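
To make the screening task in this paper concrete, below is a minimal sketch of how a single title/abstract might be screened against inclusion criteria through the OpenAI API. The model name, prompt wording, and criteria are illustrative assumptions, not the authors' published workflow; the study worked through the ChatGPT chat interface.

```python
# A sketch of LLM title/abstract screening via the OpenAI Python SDK.
# The model name, system prompt, and CRITERIA are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical inclusion/exclusion criteria; a real review would paste in
# the criteria from its registered protocol.
CRITERIA = (
    "Include: comparative clinical studies of adult orthopedic surgery. "
    "Exclude: case reports, animal or cadaver studies, narrative reviews."
)

def screen_record(title: str, abstract: str) -> str:
    """Return a single INCLUDE/EXCLUDE decision for one citation."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        temperature=0,   # reduce run-to-run variation
        messages=[
            {"role": "system",
             "content": "You screen citations for a systematic review. "
                        "Reply with exactly one word: INCLUDE or EXCLUDE."},
            {"role": "user",
             "content": f"Criteria: {CRITERIA}\n\nTitle: {title}\n\n"
                        f"Abstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content.strip()
```

Forcing a one-word answer at temperature 0 makes the decisions easy to tabulate against a human gold standard, though a production pipeline would also need batching, rate limiting, and logging.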
  2. Cochrane Evid Synth Methods. 2025 Sep;3(5):e70048
       Background: Risk of bias (RoB) assessment is a highly skilled task that is time-consuming and subject to human error. RoB automation tools have previously used machine learning models built using relatively small task-specific training sets. Large language models (LLMs; e.g., ChatGPT) are complex models built using non-task-specific Internet-scale training sets. They demonstrate human-like abilities and might be able to support tasks like RoB assessment.
    Methods: Following a published peer-reviewed protocol, we randomly sampled 100 Cochrane reviews. New or updated reviews that evaluated medical interventions, included ≥ 1 eligible trial, and presented human consensus assessments using Cochrane RoB1 or RoB2 were eligible. We excluded reviews performed under emergency conditions (e.g., COVID-19), and those on public health or welfare. We randomly sampled one trial from each review. Trials using individual- or cluster-randomized designs were eligible. We extracted human consensus RoB assessments of the trials from the reviews, and methods texts from the trials. We used 25 review-trial pairs to develop a ChatGPT prompt to assess RoB using trial methods text. We used the prompt and the remaining 75 review-trial pairs to estimate human-ChatGPT agreement for "Overall RoB" (primary outcome) and "RoB due to the randomization process", and ChatGPT-ChatGPT (intrarater) agreement for "Overall RoB". We used ChatGPT-4o (February 2025) throughout.
    Results: The 75 reviews were sampled from 35 Cochrane review groups, and all used RoB1. The 75 trials spanned five decades, and all but one were published in English. Human-ChatGPT agreement for "Overall RoB" assessment was 50.7% (95% CI 39.3%-62.0%), substantially higher than expected by chance (p = 0.0015). Human-ChatGPT agreement for "RoB due to the randomization process" was 78.7% (95% CI 69.4%-88.0%; p < 0.001). ChatGPT-ChatGPT agreement was 74.7% (95% CI 64.8%-84.6%; p < 0.001).
    Conclusions: ChatGPT appears to have some ability to assess RoB and is unlikely to be guessing or "hallucinating". The estimated agreement for "Overall RoB" is well above estimates of agreement reported for some human reviewers, but below the highest estimates. LLM-based systems for assessing RoB may be able to help streamline and improve evidence synthesis production.
    Keywords:  ChatGPT; LLM; RoB; artificial intelligence; evidence synthesis; large language model; risk of bias
    DOI:  https://doi.org/10.1002/cesm.70048
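
The primary outcome here is raw percent agreement with a 95% confidence interval. Below is a back-of-the-envelope sketch of that arithmetic, using a Wald normal-approximation interval and invented ratings; it illustrates the calculation only, not the study's prespecified analysis.

```python
# Raw percent agreement over N review-trial pairs with a Wald 95% CI.
# Illustrative only; the published protocol defines the actual analysis.
import math

def percent_agreement(rater_a: list[str], rater_b: list[str]):
    """Raw agreement and normal-approximation 95% CI for paired ratings."""
    n = len(rater_a)
    p = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width

# 38 agreements out of 75 pairs reproduces roughly the 50.7%
# (95% CI 39.3%-62.0%) figure reported for "Overall RoB".
human   = ["low"] * 38 + ["high"] * 37
chatgpt = ["low"] * 38 + ["low"] * 37
p, lo, hi = percent_agreement(human, chatgpt)
print(f"{p:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```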
  3. Cureus. 2025 Aug;17(8):e90026
      Systematic and scoping reviews are essential in palliative care, yet they are time-consuming and resource-intensive. Recent advancements in artificial intelligence, particularly large language models (LLMs), have shown promise in enhancing the efficiency of literature screening. However, their feasibility and accuracy in scoping reviews remain unclear. In this study, we aimed to evaluate the feasibility and performance of LLM-assisted citation screening for a scoping review on nonpharmacological interventions for delirium in patients with cancer.
    This prospective simulation study assessed the accuracy of three LLMs, GPT-4 Turbo, GPT-4o, and model o1 (OpenAI, San Francisco, CA, USA), in screening titles and abstracts. The dataset was derived from a previously conducted scoping review. Two reference standards from conventional human review were used for comparison: title/abstract screening results (reference standard 1) and full-text screening results (reference standard 2). LLMs were prompted using standardized inclusion and exclusion criteria based on the Population, Concept, and Context (PCC) framework. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for each model.
    Compared with reference standard 1, the sensitivity and specificity were 0.43 (95% CI, 0.06-0.80) and 0.99 (95% CI, 0.99-1.00) for GPT-4 Turbo, 0.71 (95% CI, 0.38-1.00) and 0.97 (95% CI, 0.96-0.98) for GPT-4o, and 1.00 (95% CI, 1.00-1.00) and 0.91 (95% CI, 0.89-0.92) for o1, respectively. Compared with reference standard 2, the sensitivity and specificity were 1.00 (95% CI, 1.00-1.00) and 0.99 (95% CI, 0.99-1.00) for GPT-4 Turbo, 1.00 (95% CI, 1.00-1.00) and 0.97 (95% CI, 0.96-0.98) for GPT-4o, and 1.00 (95% CI, 1.00-1.00) and 0.90 (95% CI, 0.89-0.92) for o1, respectively. All models demonstrated high NPVs, indicating strong reliability in excluding irrelevant studies. However, PPVs were low across all models, reflecting a high false-positive rate.
    Newer LLMs, particularly model o1, demonstrated high sensitivity and acceptable specificity, supporting their use as preliminary screening tools in scoping reviews. High NPVs suggest LLMs are reliable for ruling out irrelevant citations, thereby streamlining the initial screening phase. However, consistently low PPVs raise concerns about increased reviewer burden due to false positives, emphasizing the necessity of human validation. These findings support the cautious integration of LLMs into literature screening workflows, treating their outputs as supportive tools rather than replacements for expert judgment.
    Keywords:  artificial intelligence (ai); cancer; large language models; nonpharmacological intervention; scoping review
    DOI:  https://doi.org/10.7759/cureus.90026
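
All four metrics quoted above derive from a 2x2 confusion matrix of LLM screening decisions against a human reference standard. Here is a minimal sketch with invented counts, chosen to mirror the high-sensitivity, high-NPV, low-PPV pattern the study reports; the numbers are not the study's data.

```python
# Screening metrics from a 2x2 confusion matrix (counts are invented).

def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Sensitivity, specificity, PPV, and NPV for citation screening."""
    return {
        "sensitivity": tp / (tp + fn),  # relevant records correctly flagged
        "specificity": tn / (tn + fp),  # irrelevant records correctly excluded
        "ppv": tp / (tp + fp),          # flagged records that were truly relevant
        "npv": tn / (tn + fn),          # excluded records truly irrelevant
    }

# Hypothetical screen: 7 relevant and 993 irrelevant records, where the
# model flags all 7 relevant plus 90 false positives.
print(screening_metrics(tp=7, fp=90, fn=0, tn=903))
```

The low PPV in this toy example comes from class imbalance: when few records are truly relevant, even a small false-positive rate swamps the true positives, which is why the authors stress human validation of flagged records.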
  4. J Nurs Scholarsh. 2025 Sep 16.
       BACKGROUND: Conducting bias assessments in systematic reviews is a time-consuming process that involves subjective judgments. The use of artificial intelligence (AI) technologies to perform these assessments can potentially save time and enhance consistency. Nevertheless, the efficacy of AI technologies in conducting bias assessments remains inadequately explored.
    AIM: This study aims to evaluate the efficacy of ChatGPT-4o in assessing bias using the revised Cochrane RoB2 tool, focusing on randomized controlled trials in nursing.
    METHODS: ChatGPT-4o was provided with the RoB2 assessment guide in the form of a PDF document and instructed to perform bias assessments for the 80 open-access RCTs included in the study. The results of the bias assessments conducted by ChatGPT-4o for each domain were then compared with those of the meta-analysis authors using Cohen's weighted kappa analysis.
    RESULTS: Weighted Cohen's kappa values showed the best agreement for bias in measurement of the outcome (D4, 0.22) and bias arising from the randomization process (D1, 0.20), while negative values for bias due to missing outcome data (D3, -0.12) and bias in selection of the reported result (D5, -0.09) indicated poor agreement. The highest accuracy was observed in D5 (0.81) and the lowest in D1 (0.60). F1 scores were highest for bias due to deviations from intended interventions (D2, 0.74) and lowest in D3 (0.00) and D5 (0.00). Specificity was higher in D5 (0.93) and D3 (0.82), while sensitivity and precision were low in these domains.
    CONCLUSIONS: Agreement between ChatGPT-4o and the meta-analysis authors on the same RCT assessments was generally low. This indicates that ChatGPT-4o requires substantial enhancement before it can serve as a reliable tool for risk-of-bias assessment.
    CLINICAL RELEVANCE: AI-based tools have the potential to expedite bias assessment in systematic reviews. However, this study demonstrates that ChatGPT-4o, in its current form, lacks sufficient consistency, indicating that such tools should be integrated cautiously and used under continuous human oversight, particularly in evidence-based evaluations that inform clinical decision-making.
    Keywords:  ChatGPT-4o; RoB2; artificial intelligence; meta-analysis; risk of bias
    DOI:  https://doi.org/10.1111/jnu.70048
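
The agreement statistic in this study is weighted Cohen's kappa over ordinal RoB2 judgments (low / some concerns / high). Below is a minimal sketch using scikit-learn's implementation, with invented labels rather than the study's data.

```python
# Weighted Cohen's kappa between two raters over ordinal RoB2 judgments.
# The example labels are invented for illustration, not the study's data.
from sklearn.metrics import cohen_kappa_score

# Per-trial judgments for one RoB2 domain (hypothetical).
human   = ["low", "some concerns", "high", "low", "some concerns", "low"]
chatgpt = ["low", "low",           "high", "low", "high",          "low"]

# Linear weights penalize a "low" vs "high" disagreement more than a
# "low" vs "some concerns" one; the ordinal order comes from `labels`.
kappa = cohen_kappa_score(
    human, chatgpt,
    labels=["low", "some concerns", "high"],
    weights="linear",
)
print(f"weighted kappa = {kappa:.2f}")
```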