bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-08-31
seven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Value Health. 2025 Aug 13. pii: S1098-3015(25)02432-5. [Epub ahead of print]
    Generative Artificial Intelligence for Navigating Systematic Reviews (GAINSR) working group
       OBJECTIVES: Artificial intelligence (AI) is widely used in healthcare for various purposes, with generative AI (GAI) increasingly being applied to systematic review (SR) processes. We aimed to summarize the evidence on the performance metrics of GAI in the SR process.
    METHODS: PubMed, EMBASE, Scopus, and ProQuest Dissertations & Theses Global were searched from their inception up to March 2025. Only experimental studies that compared GAI with other GAI tools or with human reviewers at any stage of the SR process were included. A modified Quality Assessment of Diagnostic Accuracy Studies version 2 (QUADAS-2) tool was used to assess the quality of the studies that used GAI in the study selection process. We summarized the findings of the included studies using a narrative approach.
    RESULTS: Out of 7418 records screened, 30 studies were included. These studies used GAI tools such as ChatGPT, Bard, and Microsoft Bing AI. GAI appears to be effective for participant, intervention, comparator, and outcome (PICO) formulation and for data extraction, including extraction of complex information. However, because of inconsistent reliability, GAI is not recommended for literature search and study selection, as it may retrieve irrelevant articles and yield inconsistent results. Evidence on whether GAI can be used for risk of bias assessment was mixed. Studies using GAI for study selection were generally of high quality based on the modified Quality Assessment of Diagnostic Accuracy Studies version 2.
    CONCLUSIONS: GAI shows promising support in participant, intervention, comparator, and outcome-based question formulation and data extraction. Although it holds potential to enhance the SR process in healthcare, further practical application and validated evidence are needed before it can be fully integrated into standard workflows.
    Keywords:  GPT; artificial intelligence; evidence synthesis; healthcare; systematic review
    DOI:  https://doi.org/10.1016/j.jval.2025.07.001
  2. BMC Med Res Methodol. 2025 Aug 25. 25(1): 199
       BACKGROUND: Literature screening constitutes a critical component in evidence synthesis; however, it typically requires substantial time and human resources. Artificial intelligence (AI) has shown promise in this field, yet the accuracy and effectiveness of AI tools for literature screening remain uncertain. This study aims to evaluate the performance of several existing AI-powered automated tools for literature screening.
    METHODS: This diagnostic accuracy study employed a cohort design to evaluate the performance of five AI tools (ChatGPT 4.0, Claude 3.5, Gemini 1.5, DeepSeek-V3, and RobotSearch) in literature screening. We selected a random sample of 1,000 publications from a well-established literature cohort: 500 in the randomized controlled trial (RCT) group and 500 in the others group. Diagnostic accuracy was measured using several metrics, including the false negative fraction (FNF), the false positive fraction (FPF), time used for screening, and the redundancy number needed to screen.
    RESULTS: We reported the FNF for the RCT group and the FPF for the others group. In the RCT group, RobotSearch exhibited the lowest FNF at 6.4% (95% CI: 4.6% to 8.9%), whereas Gemini exhibited the highest at 13.0% (95% CI: 10.3% to 16.3%). In the others group, the FPFs of the four large language models ranged from 2.8% (95% CI: 1.7% to 4.7%) to 3.8% (95% CI: 2.4% to 5.9%), all significantly lower than RobotSearch's rate of 22.2% (95% CI: 18.8% to 26.1%). In terms of screening efficiency, the mean screening time per article was 1.3 s for ChatGPT, 6.0 s for Claude, 1.2 s for Gemini, and 2.6 s for DeepSeek.
    CONCLUSIONS: The AI tools assessed in this study demonstrated commendable performance in literature screening; however, they are not yet suitable as standalone solutions. These tools can serve as effective auxiliary aids, and a hybrid approach that integrates human expertise with AI may enhance both the efficiency and accuracy of the literature screening process.
    Keywords:  Artificial intelligence; ChatGPT; Claude; DeepSeek; Gemini; Large language models; Literature screening; RobotSearch
    DOI:  https://doi.org/10.1186/s12874-025-02644-9
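    The FNF reported above is simply the share of true RCTs a tool wrongly rejected (false negatives out of the 500 RCTs). As a sanity check, a minimal Python sketch, assuming a Wilson score interval (the paper does not state which CI method it used), reproduces RobotSearch's reported figure of 6.4% (95% CI: 4.6% to 8.9%):

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96):
    """Wilson score 95% confidence interval for a proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# RobotSearch in the RCT group: 500 true RCTs, FNF = 6.4% -> 32 missed trials
fn, n = 32, 500
fnf = fn / n
lo, hi = wilson_ci(fn, n)
print(f"FNF = {fnf:.1%} (95% CI: {lo:.1%} to {hi:.1%})")
# → FNF = 6.4% (95% CI: 4.6% to 8.9%)
```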
  3. Acad Radiol. 2025 Aug 22. pii: S1076-6332(25)00750-0. [Epub ahead of print]
       RATIONALE AND OBJECTIVES: To evaluate the performance, stability, and decision-making behavior of large language models (LLMs) for title and abstract screening for radiology systematic reviews, with attention to prompt framing, confidence calibration, and model robustness under disagreement.
    MATERIALS AND METHODS: We compared five LLMs (GPT-4o, GPT-4o mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Llama 3.3 70B) on two imaging-focused systematic reviews (n = 5438 and n = 267 abstracts) using binary and ternary classification tasks, confidence scoring, and reclassification of true and synthetic disagreements. Disagreements were framed as either "LLM vs human" or "human vs human." We also piloted autonomous PubMed retrieval using OpenAI and Gemini Deep Research tools.
    RESULTS: LLMs achieved high specificity and variable sensitivity across reviews and tasks, with F1 scores ranging from 0.389 to 0.854. Ternary classification showed low abstention rates (<5%) and modest sensitivity gains. Confidence scores were significantly higher for correct predictions. In disagreement tasks, models more often selected the human label when disagreements were framed as "LLM vs human," consistent with authority bias. GPT-4o showed greater resistance to this effect, while others were more prone to defer to perceived human input. In the autonomous search task, OpenAI achieved moderate recall and high precision; Gemini's recall was poor but precision remained high.
    CONCLUSION: LLMs hold promise for systematic review screening tasks but require careful prompt design and circumspect human-in-the-loop oversight to ensure robust performance.
    Keywords:  Large language models; Systematic reviews; Title and abstract screening
    DOI:  https://doi.org/10.1016/j.acra.2025.08.014
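    The F1 scores in this abstract combine precision and sensitivity (recall) from the screening confusion matrix. A short sketch with made-up counts (tp, fp, and fn below are hypothetical, not taken from the paper) shows the calculation:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # share of included abstracts that were truly relevant
    recall = tp / (tp + fn)     # sensitivity: share of relevant abstracts found
    return 2 * precision * recall / (precision + recall)

# Hypothetical screening result: 50 true positives, 10 false positives, 15 false negatives
score = f1_score(50, 10, 15)
print(f"F1 = {score:.3f}")  # → F1 = 0.800
```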
  4. J Evid Based Soc Work (2019). 2025 Aug 20. 1-15
       PURPOSE: This study explores the alignment between themes identified by Artificial Intelligence (AI)-powered tools and those from a traditional, manual scoping review, focusing on generative AI's role in streamlining time-intensive research processes.
    MATERIALS AND METHODS: Thematic findings from a human-driven scoping review on peer support specialists in medical settings for opioid use disorder (OUD) were compared with outputs from NotebookLM, UTVERSE, and Gemini. Fifteen peer-reviewed articles were uploaded to each AI tool, and a standardized prompt directed the generative AI to identify themes using only the provided articles, which were then compared to the human-coded findings.
    RESULTS: The AI models identified between 53% and 80% of the themes found in the original manual analysis. While AI tools identified novel themes that could broaden the scope of analysis, they also generated inaccurate or misleading themes and overlooked others entirely.
    DISCUSSION: The variability in generative AI performance highlights its potential and limitations in thematic analysis. AI identified additional themes and misinterpreted or missed others. Human expert review remains necessary to validate the accuracy and relevance of generative AI, while addressing ethical considerations in alignment with the values of the social work profession.
    CONCLUSION: A hybrid approach that combines generative AI with expert review has the potential to support current manual research approaches and establish a robust methodology. Continued evaluation, addressing limitations, and establishing best practices for human-AI collaboration and transparent reporting are crucial for the social work research field.
    Keywords:  Generative AI; ethics; human-AI collaboration; social work research; systematic literature review
    DOI:  https://doi.org/10.1080/26408066.2025.2548853
  5. J Clin Epidemiol. 2025 Aug 25. pii: S0895-4356(25)00277-X. [Epub ahead of print] 111944
       INTRODUCTION: Published systematic reviews display heterogeneous methodological quality, which can impact decision-making. Large language models (LLMs) can support the assessment of the methodological quality of systematic reviews and make it more efficient, aiding the incorporation of their evidence into guideline recommendations. We aimed to develop an LLM-based tool to support the assessment of the methodological quality of systematic reviews.
    METHODS: We assessed the performance of eight LLMs in evaluating the methodological quality of systematic reviews. Specifically, we provided 100 systematic reviews to eight LLMs (five base models and three fine-tuned models) to evaluate their methodological quality based on a 27-item validated tool (ReMarQ). The fine-tuned models had been trained on a different sample of 300 manually assessed systematic reviews. We compared the answers provided by the LLMs with those independently provided by human reviewers, computing the accuracy, kappa coefficient, and F1-score for this comparison.
    RESULTS: The best performing LLM was a fine-tuned GPT-3.5 model (mean accuracy=96.5% [95%CI=89.9-100%]; mean kappa coefficient=0.90 [95%CI=0.71-1.00]; mean F1-score=0.91 [95%CI=0.83-1.00]). This model displayed an accuracy >80% and a kappa coefficient >0.60 for all individual items. When we had this LLM assess the same set of systematic reviews 60 times, answers to 18 of 27 items were always consistent (i.e., always the same), and only 11% of assessed systematic reviews showed any inconsistency.
    CONCLUSION: Overall, LLMs have the potential to accurately support the assessment of the methodological quality of systematic reviews based on a validated tool comprising dichotomous items.
    Keywords:  Artificial intelligence; large language models; systematic reviews
    DOI:  https://doi.org/10.1016/j.jclinepi.2025.111944
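    The kappa coefficient above measures LLM-human agreement on ReMarQ's dichotomous items, corrected for chance agreement. A minimal sketch of Cohen's kappa for yes/no judgments (the ratings below are made up for illustration, not drawn from the study):

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' dichotomous (0/1) judgments."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    a_yes = sum(rater_a) / n
    b_yes = sum(rater_b) / n
    p_exp = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)          # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical item answers (1 = "yes") from an LLM and a human reviewer
llm   = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
human = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa(llm, human)
print(f"kappa = {kappa:.2f}")  # → kappa = 0.58
```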
  6. Expert Opin Drug Metab Toxicol. 2025 Aug 25. 1-20
       INTRODUCTION: Advanced artificial intelligence (AI) frameworks, particularly large language models (LLMs), have recently attracted attention for automating drug-drug interaction (DDI) extraction and prediction tasks. However, there is a scarcity of reviews on how LLMs can rapidly identify known and novel DDIs.
    AREAS COVERED: This review summarizes the state of LLM-based DDI extraction and prediction, based on a broad search of PubMed, Embase, Web of Science, Scopus, IEEE Xplore, the Cochrane Library, the ACM Digital Library, Google Scholar, and Semantic Scholar for literature published between January 2000 and February 2025. For DDI extraction from biomedical text and databases, we detail methods utilizing transformer-based models, such as domain-specific BioBERT and general GPT-based architectures. For DDI prediction, we discuss prediction frameworks including hybrid models (e.g. SmileGNN, DrugDAGT), conversational agents (e.g. ChatGPT), and prompt-based methods (e.g. DDIPrompt).
    EXPERT OPINION: LLMs offer potential for advancing pharmacovigilance and clinical decision support. However, realizing this and establishing clinical trust requires urgently addressing current limitations, particularly enhancing model explainability, improving reliability (mitigating hallucinations), and resolving data quality issues. Future research must prioritize rigorous clinical validation (prospective studies), developing robust explainable AI (XAI) techniques, refining data curation, and integrating multimodal patient data.
    Keywords:  DDI extraction; DDI prediction; Drug–drug interactions; Pharmacovigilance; Transformer-based frameworks; biomedical text mining; clinical decision support; large language models
    DOI:  https://doi.org/10.1080/17425255.2025.2551724
  7. Rev Recent Clin Trials. 2025 Aug 18.
       BACKGROUND: The pharmaceutical industry operates within a complex regulatory environment, requiring strict compliance with global guidelines. Regulatory affairs (RA) departments are pivotal in ensuring drug approvals and compliance. However, the increasing complexity and volume of regulatory requirements have put a strain on traditional processes, driving the adoption of automation tools to streamline these operations.
    OBJECTIVE: This review aims to explore the key automation tools used in regulatory affairs, focusing on their role in streamlining submissions, ensuring compliance, centralizing data, and reducing human error. It also aims to examine the emerging technologies in the field and their potential for enhancing automation.
    METHODS: A comprehensive review of current automation tools in regulatory affairs was conducted. The key tools explored include Submission Management Systems (SMS), Regulatory Information Management (RIM) systems, Electronic Document Management Systems (EDMS), and Regulatory Intelligence Tools. Additionally, the role of emerging technologies like Artificial Intelligence (AI) and Machine Learning (ML) in automating regulatory processes was evaluated.
    RESULTS: Automation tools such as SMS, RIM, EDMS, and Regulatory Intelligence Tools have been found to significantly improve the efficiency of regulatory affairs operations. These tools streamline submissions, centralize data, and ensure compliance. AI and ML technologies further enhance automation by enabling predictive analytics and automating risk assessments. Despite these advantages, challenges remain, including high implementation costs, data security concerns, and the need to adapt to varying global regulations; overcoming these challenges and limitations is crucial to adopting regulatory automation.
    DISCUSSION: This study highlights that automation tools are important for modernizing regulatory affairs by improving efficiency, accuracy, and compliance. The integration of Artificial Intelligence (AI) and Machine Learning (ML) adds predictive and adaptive capabilities, transforming static processes into dynamic systems. These technologies hold immense potential to reshape regulatory operations globally.
    CONCLUSION: Automation tools are becoming essential in the pharmaceutical industry to maintain regulatory compliance, reduce time-to-market, and manage the increasing complexity of drug development in a globalized industry. As emerging technologies like AI, ML, and blockchain continue to evolve, they promise to further revolutionize regulatory affairs processes.
    Keywords:  Regulatory affairs; automation tools; electronic document management systems (EDMS); pharmaceutical industry; regulatory information management (RIM); regulatory intelligence; submission management systems
    DOI:  https://doi.org/10.2174/0115748871366461250802092217