bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025–06–29
six papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. PeerJ Comput Sci. 2025;11:e2822
       Background: Large language models (LLMs) offer a potential solution to the labor-intensive nature of systematic reviews. This study evaluated the ability of the GPT model to identify articles that discuss perioperative risk factors for esophagectomy complications. To test the performance of the model, we also evaluated GPT-4 against a narrower inclusion criterion, assessing its ability to discriminate relevant articles that solely identified preoperative risk factors for esophagectomy.
    Methods: A literature search was run by a trained librarian to identify studies (n = 1,967) discussing risk factors for esophagectomy complications. The articles underwent title and abstract screening by three independent human reviewers and GPT-4. The Python script used for the analysis made Application Programming Interface (API) calls to GPT-4 with the screening criteria expressed in natural language (a hedged sketch of such a script follows this entry). GPT-4's inclusion and exclusion decisions were compared with those made by the human reviewers.
    Results: The agreement between the GPT model and the human decisions was 85.58% for perioperative factors and 78.75% for preoperative factors. The AUC values were 0.87 and 0.75 for the perioperative and preoperative risk factor queries, respectively. In the evaluation of perioperative risk factors, the GPT model demonstrated a high recall for included studies at 89%, a positive predictive value of 74%, and a negative predictive value of 84%, with a low false positive rate of 6% and a macro-F1 score of 0.81. For preoperative risk factors, the model showed a recall of 67% for included studies, a positive predictive value of 65%, and a negative predictive value of 85%, with a false positive rate of 15% and a macro-F1 score of 0.66. The interobserver reliability was substantial, with a kappa score of 0.69 for perioperative factors and 0.61 for preoperative factors. Despite lower accuracy under the more stringent criteria, the GPT model proved valuable in streamlining the systematic review workflow. In a preliminary evaluation, the inclusion and exclusion justifications provided by the GPT model were reported as useful by study screeners, especially in resolving discrepancies during title and abstract screening.
    Conclusion: This study demonstrates promising use of LLMs to streamline the workflow of systematic reviews. The integration of LLMs into systematic reviews could lead to significant time and cost savings; however, caution must be taken for reviews involving more stringent or narrower inclusion and exclusion criteria. Future research should explore integrating LLMs into other steps of the systematic review, such as full-text screening or data extraction, and compare different LLMs for their effectiveness in various types of systematic reviews.
    Keywords:  Abstract screening; ChatGPT; Large language model; Screening; Systematic review
    DOI:  https://doi.org/10.7717/peerj-cs.2822
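    A minimal sketch of what such an API-based screening script could look like, assuming the OpenAI Python client; the prompt wording, criteria text, model name, and record format are illustrative assumptions rather than the authors' actual code:

      # Hedged sketch of LLM-assisted title/abstract screening via API calls.
      from openai import OpenAI

      client = OpenAI()  # expects OPENAI_API_KEY in the environment

      CRITERIA = (
          "Include the record only if it discusses perioperative risk factors "
          "for complications of esophagectomy; otherwise exclude it."
      )

      def screen_record(title: str, abstract: str, model: str = "gpt-4") -> str:
          """Return 'INCLUDE' or 'EXCLUDE' for one title/abstract record."""
          response = client.chat.completions.create(
              model=model,
              temperature=0,  # keep decisions as reproducible as possible
              messages=[
                  {"role": "system",
                   "content": "You are screening records for a systematic review. "
                              "Answer with exactly one word: INCLUDE or EXCLUDE."},
                  {"role": "user",
                   "content": f"Criteria: {CRITERIA}\n\nTitle: {title}\n\nAbstract: {abstract}"},
              ],
          )
          answer = response.choices[0].message.content.strip().upper()
          return "INCLUDE" if answer.startswith("INCLUDE") else "EXCLUDE"

    Decisions collected this way can then be compared with the human reviewers' decisions to compute agreement, recall, predictive values, and macro-F1 as reported above.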
  2. J Med Internet Res. 2025 Jun 24;27:e70450
       BACKGROUND: The revised Risk-of-Bias tool (RoB2) overcomes the limitations of its predecessor but introduces new implementation challenges. Studies demonstrate low interrater reliability and substantial time requirements for RoB2 implementation. Large language models (LLMs) may assist in RoB2 implementation, although their effectiveness remains uncertain.
    OBJECTIVE: This study aims to evaluate the accuracy of LLMs in RoB2 assessments to explore their potential as research assistants for bias evaluation.
    METHODS: We systematically searched the Cochrane Library (through October 2023) for reviews using RoB2, categorized by the effect of interest (adhering to intervention or assignment to intervention). From 86 eligible reviews of randomized controlled trials (covering 1399 RCTs), we randomly selected 46 RCTs (23 per category). In addition, 3 experienced reviewers independently assessed all 46 RCTs using RoB2, recording the assessment time for each trial. Reviewer judgments were reconciled through consensus. Furthermore, 6 RCTs (3 from each category) were randomly selected for prompt development and optimization. The remaining 40 trials established the internal validation standard, while Cochrane Review judgments served as external validation. Primary outcomes were extracted as reported in the corresponding Cochrane Reviews. We calculated accuracy rates, Cohen κ, and time differentials.
    RESULTS: We identified significant differences between Cochrane and reviewer judgments, particularly in domains 1, 4, and 5, likely due to differing standards for assessing randomization and blinding. Among the 20 articles focusing on adhering, 18 Cochrane Reviews and 19 reviewer judgments classified them as "High risk," while assignment-focused RCTs showed a more heterogeneous risk distribution. Compared with Cochrane Reviews, LLMs demonstrated accuracy rates of 57.5% and 70% for overall (assignment) and overall (adhering), respectively. When compared with reviewer judgments, LLMs' accuracy rates were 65% and 70% for these domains. The average accuracy rates for the remaining 6 domains were 65.2% (95% CI 57.6-72.7) against Cochrane Reviews and 74.2% (95% CI 64.7-83.9) against reviewers. At the signaling question level, LLMs achieved 83.2% average accuracy (95% CI 77.5-88.9), with accuracy exceeding 70% for most questions except 2.4 (assignment), 2.5 (assignment), 3.3, and 3.4. When domain judgments were derived from LLM-generated signaling-question answers using the RoB2 algorithm rather than from direct LLM domain judgments, accuracy improved substantially for domain 2 (adhering; from 55% to 95%) and overall (adhering; from 70% to 90%); a simplified sketch of this algorithm-based derivation follows this entry. LLMs demonstrated high consistency between iterations (average 85.2%, 95% CI 85.15-88.79) and completed assessments in 1.9 minutes versus 31.5 minutes for human reviewers (mean difference 29.6, 95% CI 25.6-33.6 minutes).
    CONCLUSIONS: LLMs achieved commendable accuracy when guided by structured prompts, particularly when methodological details were processed through structured reasoning. While not replacing human assessment, LLMs demonstrate strong potential for assisting RoB2 evaluations. Larger studies with improved prompting could enhance performance.
    Keywords:  artificial intelligence; efficiency; large language models; risk of bias 2; systematic review
    DOI:  https://doi.org/10.2196/70450
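    A minimal sketch of deriving a domain-level judgment from signaling-question answers, in the spirit of the algorithm-based approach described above. The mapping below is a simplified approximation of the published RoB2 decision rules for domain 1 (randomization process), shown for illustration only; the authoritative algorithms are defined in the RoB2 guidance:

      # Simplified, illustrative approximation of a RoB2 domain algorithm;
      # not the official decision rules. Answers use the RoB2 response options:
      # Y/PY (yes/probably yes), N/PN (no/probably no), NI (no information).
      YES = {"Y", "PY"}
      NO = {"N", "PN"}

      def domain1_judgment(q1_1: str, q1_2: str, q1_3: str) -> str:
          """Map answers to signaling questions 1.1-1.3 to a risk-of-bias level."""
          if q1_2 in NO:
              return "High risk"            # allocation sequence not concealed
          if q1_1 in YES | {"NI"} and q1_2 in YES and q1_3 in NO | {"NI"}:
              return "Low risk"             # adequate randomization, no baseline imbalance
          return "Some concerns"            # everything in between

      # Example: an LLM answers PY, Y, PN to questions 1.1-1.3
      print(domain1_judgment("PY", "Y", "PN"))   # -> Low risk

    Feeding LLM-generated signaling-question answers through fixed rules of this kind, rather than asking the model for the domain judgment directly, is the step the authors report as improving domain-level accuracy.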
  3. Diagnostics (Basel). 2025 Jun 6;15(12):1451
      Background/Objectives: Diagnostic accuracy studies are essential for evaluating the performance of medical tests. The risk of bias (RoB) in these studies is commonly assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool. This study aimed to assess the capabilities and reasoning accuracy of large language models (LLMs) in evaluating the RoB in diagnostic accuracy studies using QUADAS-2, compared with human experts. Methods: Four LLMs were used for the AI assessment: the ChatGPT 4o, X.AI Grok 3, Gemini 2.0 Flash, and DeepSeek V3 models. Ten recent open-access diagnostic accuracy studies were selected. Each article was independently assessed by human experts and by the LLMs using QUADAS-2. Results: Across the 110 signaling-question assessments (11 questions for each of the 10 articles) performed by each of the four AI models, the mean percentage of correct assessments across all models was 72.95%. The most accurate model was Grok 3, followed by ChatGPT 4o, DeepSeek V3, and Gemini 2.0 Flash, with accuracies ranging from 74.45% to 67.27%. When analyzed by domain, the most accurate responses were for "flow and timing", followed by "index test", and then similarly for "patient selection" and "reference standard" (a small tabulation sketch of such accuracy figures follows this entry). An extensive list of reasoning errors was documented. Conclusions: This study demonstrates that LLMs can achieve a moderate level of accuracy in evaluating the RoB in diagnostic accuracy studies. However, they are not yet a substitute for expert clinical and methodological judgment. LLMs may serve as complementary tools in systematic reviews, with compulsory human supervision.
    Keywords:  artificial intelligence; diagnostic accuracy; evidence-based medicine; large language models; risk of bias
    DOI:  https://doi.org/10.3390/diagnostics15121451
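    A small sketch of how such accuracy figures can be tabulated, where each record is one signaling-question assessment by one model compared against the human expert judgment; the field names and example values are illustrative assumptions, not the study's data:

      # Tabulate per-model and per-domain accuracy of LLM QUADAS-2 assessments
      # against expert judgments (illustrative records only).
      from collections import defaultdict

      assessments = [
          # (model, QUADAS-2 domain, question id, llm_answer, expert_answer)
          ("Grok 3", "patient selection", "1.1", "yes", "yes"),
          ("Grok 3", "index test", "2.1", "no", "unclear"),
          ("ChatGPT 4o", "flow and timing", "4.1", "yes", "yes"),
          # ... 110 signaling-question assessments per model in the study
      ]

      correct, total = defaultdict(int), defaultdict(int)
      for model, domain, _qid, llm, expert in assessments:
          for key in (("model", model), ("domain", domain)):
              total[key] += 1
              correct[key] += int(llm == expert)

      for key in sorted(total):
          print(f"{key[0]:>6} = {key[1]:<20} accuracy = {correct[key] / total[key]:.1%}")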
  4. Glob Epidemiol. 2025 Dec;10:100207
      Current large language models (LLMs) face significant challenges in attempting to synthesize and critically assess conflicting causal claims in scientific literature about exposure-associated health effects. This paper examines the design and performance of AIA2, an experimental AI system (freely available at http://cloud.cox-associates.com/) designed to help explore and illustrate potential applications of current AI in assisting analysis of clusters of related scientific articles, focusing on causal claims in complex domains such as epidemiology, toxicology, and risk analysis. Building on an earlier AI assistant, AIA1, which critically reviewed causal claims in individual papers, AIA2 advances the approach by systematically comparing multiple studies to identify areas of agreement and disagreement, suggest explanations for differences in conclusions, flag methodological gaps and inconsistencies, synthesize and summarize well-supported conclusions despite conflicts, and propose recommendations to help resolve knowledge gaps. We illustrate these capabilities with a case study of formaldehyde exposure and leukemia using a cluster of four papers that feature very different approaches and partly conflicting conclusions. AIA2 successfully identifies major points of agreement and contention, discusses the robustness of the evidence for causal claims, and recommends future research directions to address current uncertainties. AIA2's outputs suggest that current AI can offer a promising, practicable approach to AI-assisted review of clusters of papers, promoting methodological rigor, thoroughness, and transparency in review and synthesis, notwithstanding current limitations of LLMs. We discuss the implications of AI-assisted literature review systems for improving evidence-based decision-making, resolving conflicting scientific claims, and promoting rigor and reproducibility in causal research and health risk analysis.
    Keywords:  AI-assisted literature review; Causal inference; Formaldehyde; Health risk analysis; Leukemia; Systematic review
    DOI:  https://doi.org/10.1016/j.gloepi.2025.100207
  5. Acad Med. 2025 Jun 24.
       ABSTRACT: How can artificial intelligence (AI) be used to support qualitative data analysis (QDA)? To address this question, the authors conducted 3 scholarly activities. First, they used a readily available large language model, ChatGPT-4, to analyze 3 existing narrative datasets (February 2024). ChatGPT generated accurate brief summaries; for all other attempted tasks the initial prompt failed to produce desired results. After iterative prompt engineering, some tasks (e.g., keyword counting, summarization) were successful, whereas others (e.g., thematic analysis, keyword highlighting, word tree diagram, cross-theme insights) never generated satisfactory results. Second, the authors conducted a brief scoping review of AI-supported QDA (through May 2024). They identified 130 articles (104 original research, 26 nonresearch) of which 64 were published in 2023 or 2024. Seventy studies inductively analyzed data for themes, 39 used keyword detection, 30 applied a coding rubric, 28 used sentiment analysis, and 13 applied discourse analysis. Seventy-five used unsupervised learning (e.g., transformers, other neural networks). Third, building on these experiences and drawing from additional literature, the authors examined the potential capabilities, shortcomings, dangers, and ethical repercussions of AI-supported QDA. They note that AI has been used for QDA for more than 25 years. AI-supported QDA approaches include inductive and deductive coding, thematic analysis, computational grounded theory, discourse analysis, analysis of large datasets, preanalysis transcription and translation, and suggestions for study planning and interpretation. Concerns include the imperative of a "human in the loop" for data collection and analysis, the need for researchers to understand the technology, the risk of unsophisticated analyses, inevitable influences on workforce, and apprehensions regarding data privacy and security. Reflexivity should embrace both strengths and weaknesses of AI-supported QDA. The authors conclude that AI has a long history of supporting QDA through widely varied methods. Evolving technologies make AI-supported QDA more accessible and introduce both promises and pitfalls.
    DOI:  https://doi.org/10.1097/ACM.0000000000006134
  6. Front Artif Intell. 2025;8:1526820
      Understanding the environmental factors that facilitate the occurrence and spread of infectious diseases in animals is crucial for risk prediction. As part of the H2020 Monitoring Outbreaks for Disease Surveillance in a Data Science Context (MOOD) project, scoping literature reviews have been conducted for various diseases. However, pathogens continuously mutate and generate variants with different sensitivities to these factors, necessitating regular updates to these reviews. In this paper, we evaluate the potential benefits of artificial intelligence (AI) for updating such scoping reviews, comparing different combinations of AI methods for this task. These methods use generative large language models (LLMs) and lighter language models to automatically identify risk factors in scientific articles (one such combination is sketched after this entry).
    Keywords:  artificial intelligence (AI); covariates analysis; infectious diseases; large language models (LLM); natural language processing (NLP); scoping review
    DOI:  https://doi.org/10.3389/frai.2025.1526820
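    One way a lighter language model can be combined with a generative LLM for this kind of task, sketched purely as an illustration of the pipelines compared in the paper; the specific models, query, prompt, and top-k cut-off are assumptions, not the authors' configuration:

      # Hedged sketch: a lightweight sentence-embedding model pre-selects candidate
      # sentences, and a generative LLM extracts risk factors from them.
      from openai import OpenAI
      from sentence_transformers import SentenceTransformer, util

      encoder = SentenceTransformer("all-MiniLM-L6-v2")   # lighter language model
      client = OpenAI()                                    # generative LLM access

      QUERY = ("environmental or climatic risk factor for the occurrence "
               "or spread of an infectious disease in animals")

      def candidate_sentences(sentences, top_k=5):
          """Rank article sentences by semantic similarity to the risk-factor query."""
          scores = util.cos_sim(encoder.encode(QUERY), encoder.encode(sentences))[0]
          ranked = sorted(zip(sentences, scores.tolist()), key=lambda p: p[1], reverse=True)
          return [s for s, _ in ranked[:top_k]]

      def extract_risk_factors(sentences):
          """Ask the LLM to list risk factors mentioned in the candidate sentences."""
          response = client.chat.completions.create(
              model="gpt-4o",
              temperature=0,
              messages=[{"role": "user",
                         "content": "List the disease risk factors mentioned, one per line:\n"
                                    + "\n".join(candidate_sentences(sentences))}],
          )
          return response.choices[0].message.content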