bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-08-17
seven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. JMIR Form Res. 2025 Aug 11;9:e68666
       Background: Systematic reviews are essential for synthesizing research in health sciences; however, they are resource-intensive and prone to human error. The data extraction phase, in which key details of studies are identified and recorded in a systematic manner, may benefit from the application of automation processes. Recent advancements in artificial intelligence, specifically in large language models (LLMs) such as ChatGPT, may streamline this process.
    Objective: This study aimed to develop and evaluate a custom Generative Pre-trained Transformer (GPT), named Systematic Review Extractor Pro, for automating the data extraction phase of systematic reviews in health research.
    Methods: OpenAI's GPT Builder was used to create a GPT tailored to extract information from academic manuscripts. The Role, Instruction, Steps, End goal, and Narrowing (RISEN) framework was used to inform prompt engineering for the GPT. A sample of 20 studies from two distinct systematic reviews was used to evaluate the GPT's performance in extraction. Agreement rates between the GPT outputs and human reviewers were calculated for each study subsection.
    Results: The mean time for human data extraction was 36 minutes per study, compared to 26.6 seconds for GPT generation, followed by 13 minutes of human review. The GPT demonstrated high overall agreement rates with human reviewers, achieving 91.45% for review 1 and 89.31% for review 2. It was particularly accurate in extracting study characteristics (review 1: 95.25%; review 2: 90.83%) and participant characteristics (review 1: 95.03%; review 2: 90.00%), with lower performance observed in more complex areas such as methodological characteristics (87.07%) and statistical results (77.50%). The GPT correctly extracted data in 14 instances (3.25%) in review 1 and 4 instances (1.16%) in review 2 in which the human reviewer was incorrect.
    Conclusions: The custom GPT significantly reduced extraction time and showed evidence that it can extract data with high accuracy, particularly for participant and study characteristics. This tool may offer a viable option for researchers seeking to reduce resource demands during the extraction phase, although more research is needed to evaluate test-retest reliability, performance across broader review types, and accuracy in extracting statistical data. The tool developed in this study has been made open access.
    Keywords:  AI; ChatGPT; LLM; artificial intelligence; data extraction; large language models; systematic reviews
    DOI:  https://doi.org/10.2196/68666
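    The per-subsection agreement rates above are simple field-match percentages between the GPT output and the human extraction. A minimal sketch in Python, where the field names, normalization rule, and example records are illustrative assumptions rather than the authors' instrument:
      # Field-level agreement between LLM and human extraction records.
      # Field names and the normalization rule are illustrative assumptions.
      def normalize(value):
          return str(value).strip().lower()

      def agreement_rate(llm_record, human_record, fields):
          matches = sum(normalize(llm_record.get(f)) == normalize(human_record.get(f))
                        for f in fields)
          return 100 * matches / len(fields)

      fields = ["sample_size", "mean_age", "design", "primary_outcome"]
      llm = {"sample_size": 64, "mean_age": "41.2", "design": "RCT",
             "primary_outcome": "pain"}
      human = {"sample_size": "64", "mean_age": "41.2", "design": "RCT",
               "primary_outcome": "pain score"}
      print(f"Agreement: {agreement_rate(llm, human, fields):.2f}%")  # 75.00%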
  2. Value Health. 2025 Aug 13. pii: S1098-3015(25)02514-8. [Epub ahead of print]
     OBJECTIVES: Systematic literature reviews (SLRs) are essential for synthesizing high-quality evidence in clinical research, health economics and outcomes research (HEOR), and health technology assessments (HTAs). However, the growing volume of published data has made SLRs time-consuming, labor-intensive, and costly. To address these challenges, we introduce A4SLR, an Agentic Artificial Intelligence (AI)-Assisted SLR framework that provides a flexible, extensible methodology for automating the entire SLR process, from initial query formulation to evidence synthesis, across various study fields.
    METHODS: A4SLR comprises eight modules integrated with specialized AI agents powered by large language models: Search, I/E (inclusion/exclusion) criteria deployment, Abstract/full-text screening, Text/table pre-processing, Data extraction, Assessment, Risk of bias analysis, and Report. We implemented and validated this framework using two use cases: non-small cell lung cancer and perinatal mood and anxiety disorders. The framework's performance was evaluated quantitatively and qualitatively.
    RESULTS: Our implementation demonstrated high accuracy in article screening (F1 scores: 0.917-0.977), risk of bias assessment (Cohen's κ: 0.8442-0.9064), and data extraction (F-scores: 0.96-0.998), including patient characteristics, safety and efficacy outcomes, economic model parameters, and cost-effectiveness data. Notably, the Text/table pre-processing agent yielded comprehensive coverage of data elements, particularly in the challenging task of accurately matching outcome values to their corresponding study arms.
    CONCLUSIONS: Our findings highlight the potential of the A4SLR framework to transform the evidence synthesis process by addressing the limitations of manual SLRs, thereby enhancing HEOR and HTAs. Designed as a scalable, user-centric, extensible approach, A4SLR provides a robust solution for generating comprehensive, up-to-date evidence to support researchers and decision-makers across diverse clinical and therapeutic areas.
    Keywords:  Agentic-AI; Article Screening; Automation; Data Extraction; Large Language Models; Systematic Literature Review
    DOI:  https://doi.org/10.1016/j.jval.2025.08.002
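    The screening and risk-of-bias figures above (F1, Cohen's κ) come from comparing AI labels against a human reference standard; a minimal sketch with scikit-learn, using made-up labels:
      # F1 for include/exclude screening and Cohen's kappa for
      # risk-of-bias ratings; all labels below are invented examples.
      from sklearn.metrics import f1_score, cohen_kappa_score

      human_screen = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = include, 0 = exclude
      ai_screen    = [1, 0, 1, 0, 0, 0, 1, 0]
      print("Screening F1:", f1_score(human_screen, ai_screen))   # ~0.857

      human_rob = ["low", "high", "some", "low", "high"]
      ai_rob    = ["low", "high", "some", "some", "high"]
      print("RoB kappa:", cohen_kappa_score(human_rob, ai_rob))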
  3. Value Health. 2025 Aug 12. pii: S1098-3015(25)02511-2. [Epub ahead of print]
     OBJECTIVES: This exploratory study aimed to develop a large language model (LLM)-based process to automate components of network meta-analysis (NMA), including model selection, analysis, output evaluation, and results interpretation. Automating these tasks with LLMs can enhance efficiency, consistency, and scalability in health economics and outcomes research, while ensuring analyses adhere to the established guidelines required by health technology assessment agencies. Improvements in efficiency and scalability may become relevant as the European Union Health Technology Assessment Regulation (HTAR) comes into force, given anticipated analysis requirements and timelines.
    METHODS: Using Claude 3.5 Sonnet [V2], a process was designed to automate statistical model selection, NMA output evaluation, and results interpretation based on an 'analysis-ready' dataset. Validation was assessed by replicating examples from the National Institute for Health and Care Excellence (NICE) Technical Support Document 2 (TSD2); replicating results of published NMAs not produced by the NICE Decision Support Unit (DSU); and generating comprehensive outputs (e.g., heterogeneity, inconsistency, convergence).
    RESULTS: The automated LLM-based process produced accurate results. Compared with the TSD2 examples, differences were minimal, within expectations given the different sampling frameworks used, and comparable to the differences between estimates from the corresponding R vignettes and TSD2. Similar consistency was noted for the non-DSU published NMA examples. Additionally, the LLM process generated and interpreted comprehensive NMA outputs.
    CONCLUSIONS: This exploratory study demonstrates the feasibility of using LLMs to automate key components of NMAs, determining the requisite NMA framework based only on the input data. Exploring these capabilities further could clarify their role in streamlining NMA workflows.
    Keywords:  automated analysis; health technology assessment (HTA); joint clinical assessments (JCAs); large language models (LLMs); network meta-analysis (NMA)
    DOI:  https://doi.org/10.1016/j.jval.2025.08.001
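    One plausible shape for the model-selection step described above is to pass a structured dataset summary to the LLM and request a TSD2-consistent recommendation; the prompt wording, dataset summary, and decision format below are assumptions, not the authors' pipeline (requires the anthropic package and an API key):
      # Sketch: ask Claude 3.5 Sonnet to choose an NMA model from a
      # dataset summary. Prompt and summary are illustrative assumptions.
      import anthropic

      summary = ("Binomial outcomes, 12 studies, 6 treatments, "
                 "3 multi-arm trials, pairwise I-squared ~58%.")
      client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
      response = client.messages.create(
          model="claude-3-5-sonnet-20241022",
          max_tokens=300,
          messages=[{"role": "user",
                     "content": "Following NICE TSD2, recommend a fixed- or "
                                "random-effects NMA and a likelihood/link for "
                                "this dataset, with a one-sentence rationale:\n"
                                + summary}],
      )
      print(response.content[0].text)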
  4. Plast Reconstr Surg Glob Open. 2025 Aug;13(8):e7057
      Generative artificial intelligence (AI) large language models are an emerging technology, with ChatGPT and Gemini being 2 well-known examples. The current literature discusses clinical applications and limitations of AI, but its role in research has not yet been extensively evaluated. This study aimed to assess the role of ChatGPT and Gemini in developing novel and clinically relevant research ideas (RIs) for systematic reviews (SRs) in head and neck reconstruction. ChatGPT and Gemini were prompted to provide 10 novel and clinically relevant RIs for SRs in the following domains: head and neck reconstruction in general, microsurgery, and complications in reconstructive head and neck procedures. A comprehensive search was then performed for SRs in MEDLINE, Cochrane Library, and Embase to determine the novelty of the RIs generated. A total of 60 RIs were generated, with half created by ChatGPT and the other half by Gemini. Overall, 3613 entries were found through the literature search. After deduplication and screening, a total of 50 studies that partially addressed the AI-generated RIs were identified and were included in the present review. Out of the 60 AI-generated RIs, 42 had not been previously studied and were therefore considered novel. No statistically significant differences were found between the outputs generated by Gemini and ChatGPT. Both ChatGPT and Gemini were able to effectively generate novel and clinically relevant RIs for SRs, although their suggestions were generally broad. This study demonstrated that AI could potentially aid in the process of conducting novel SRs.
    DOI:  https://doi.org/10.1097/GOX.0000000000007057
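    The novelty comparison between the two models reduces to a 2x2 contingency test; a sketch with SciPy, where the per-model counts are hypothetical (the abstract reports only the pooled 42/60 figure) and a Fisher exact test stands in for whatever test the authors used:
      # Fisher's exact test on novel vs previously studied research ideas.
      # Per-model counts are hypothetical; only the 42/60 total is reported.
      from scipy.stats import fisher_exact

      table = [[22, 8],    # ChatGPT: novel, previously studied
               [20, 10]]   # Gemini:  novel, previously studied
      odds_ratio, p_value = fisher_exact(table)
      print(f"OR = {odds_ratio:.2f}, p = {p_value:.3f}")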
  5. JMIR Form Res. 2025 Aug 15;9:e69892
     Background: Young adults take their asthma maintenance medication 67% of the time or less. Understanding the specific needs and behaviors of young adults with asthma is essential for developing targeted interventions to improve disease self-management. Artificial intelligence (AI) has demonstrated its utility in summarizing and identifying patterns in qualitative research and may support or augment human coding efforts. However, there is a paucity of literature to support this assertion.
    Objective: The objective of this study was to begin exploring the medication management-related needs of young adults with asthma via a pilot feasibility study. We aimed to understand how best to assist young adults with asthma self-management and to identify potential areas where digital health interventions can provide support. We further aimed to compare the performance of human coders with that of multiple AI platforms in thematic analysis.
    Methods: This study purposefully sampled young adults aged 18 to 29 years who had a prescription for an inhaled corticosteroid (ICS) and were either students or staff of a large metropolitan university in the northeastern United States. Semistructured interviews lasting 40 minutes on average were conducted with 4 participants via a teleconferencing application to elicit young adults' opinions on the topic. Interviews were recorded and transcribed verbatim using Otter.ai (Otter.ai, Inc). Investigators listened to the recordings to confirm the accuracy of the transcriptions and to make corrections when necessary. After two rounds of line-by-line coding, the codes were reviewed by the investigators and grouped into broader, overarching themes. All investigators reviewed and discussed the final codes. Human qualitative data analyses were performed using NVivo 14 software (QSR International). After completing the human analyses, the investigators performed thematic analysis with multiple AI platforms (Google Gemini, Microsoft Copilot, and OpenAI's ChatGPT) to compare the final themes with the investigator-derived themes.
    Results: Human analysis yielded 4 themes: support from clinicians, social support, digital self-management support, and educational support. The AI-based analyses generated similar themes under different labels. Conceptual overlap between the human-derived themes and those from Gemini, Copilot, and ChatGPT was high: although the specific labels differed, they referred to the same underlying concepts.
    Conclusions: Findings from our pilot exploratory study offer insights into the necessity of a holistic approach to supporting young adults with asthma. Based on the health belief model, if the identified multifaceted needs are addressed, health care systems may support medication adherence and improve health outcomes for this understudied patient population. Our pilot study also offers preliminary evidence that artificial intelligence may be leveraged, with appropriate caution, for successful thematic analysis of qualitative data.
    Keywords:  AI; ChatGPT; Copilot; Gemini; asthma; medication management-related needs; thematic analysis; young adults
    DOI:  https://doi.org/10.2196/69892
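    One way to operationalize "different labels, same concept" when comparing human and AI themes is embedding similarity between theme labels; a sketch with sentence-transformers, where the AI theme labels and the model choice are illustrative assumptions (the study's comparison was made by the investigators, not computed this way):
      # Match each human theme to its closest AI-generated theme label.
      from sentence_transformers import SentenceTransformer, util

      human_themes = ["support from clinicians", "social support",
                      "digital self-management support", "educational support"]
      ai_themes = ["clinician guidance", "peer and family encouragement",
                   "app-based self-care tools", "asthma education needs"]  # invented

      model = SentenceTransformer("all-MiniLM-L6-v2")
      sims = util.cos_sim(model.encode(human_themes), model.encode(ai_themes))
      for i, theme in enumerate(human_themes):
          j = int(sims[i].argmax())
          print(f"{theme} <-> {ai_themes[j]} (cosine {sims[i][j].item():.2f})")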
  6. ACS Appl Mater Interfaces. 2025 Aug 13.
      The evolution of large language models (LLMs) is reshaping the landscape of scientific writing, enabling the generation of machine-written review papers with minimal human intervention. This paper presents a pipeline for the automated production of scientific survey articles using Retrieval-Augmented Generation (RAG) and modular LLM agents. The pipeline processes user-selected literature or citation network-derived corpora through vectorized content, reference, and figure databases to generate structured, citation-rich reviews. Two distinct strategies are evaluated: one based on manually curated literature and the other on papers selected through citation network analysis. Results demonstrate that increasing the diversity and quantity of the input materials improves the depth and coherence of the generated output. Although current iterations produce promising drafts, they fail to meet top-tier publication standards, particularly in critical analysis and originality. Results were obtained for a case study on a particular topic, namely, Langmuir and Langmuir-Blodgett films, but the proposed pipeline applies to any user-selected topic. The paper concludes with suggestions for how the system could be enhanced through specialized modules and discusses broader implications for scientific publishing, including ethical considerations, authorship attribution, and the risk of review proliferation. This work represents an opportunity to discuss the advantages and pitfalls introduced by the possibility of using AI assistants to support scientific knowledge synthesis.
    Keywords:  AI; large language models; machine written; scientific review writing
    DOI:  https://doi.org/10.1021/acsami.5c08837
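    The retrieval step at the heart of the pipeline described above can be sketched in a few lines: embed the corpus, pull the top-k passages for a section query, and assemble the generation prompt. The corpus snippets, query, and embedding model below are placeholders, and the agent routing, reference database, and citation handling are omitted:
      # Minimal RAG retrieval for machine-written review sections.
      from sentence_transformers import SentenceTransformer, util

      corpus = ["Langmuir monolayers form at the air-water interface...",
                "Langmuir-Blodgett deposition transfers monolayers onto solids...",
                "LB films have been applied in chemical sensors..."]  # placeholders
      query = "Draft the 'deposition methods' section of the review."

      model = SentenceTransformer("all-MiniLM-L6-v2")
      hits = util.semantic_search(model.encode(query), model.encode(corpus), top_k=2)[0]
      context = "\n".join(corpus[h["corpus_id"]] for h in hits)
      prompt = f"Using only these sources, {query}\n\nSources:\n{context}"
      print(prompt)  # would be handed to the generation agent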
  7. JMIR Res Protoc. 2025 Aug 14;14:e64640
    GAMER Working Group
     BACKGROUND: The integration of artificial intelligence (AI) has revolutionized medical research, offering innovative solutions for data collection, patient engagement, and information dissemination. Powerful generative AI (GenAI) tools, including chatbots, have emerged, facilitating user interactions with virtual conversational agents. However, the increasing use of GenAI tools in medical research presents challenges, including ethical concerns, data privacy issues, and the potential for generating false content. These issues necessitate standardized reporting to ensure transparency and scientific rigor.
    OBJECTIVE: The Generative Artificial Intelligence Tools in Medical Research (GAMER) project aims to establish comprehensive, standardized guidelines for reporting the use of GenAI tools in medical research.
    METHODS: The GAMER guidelines are being developed following the methodology recommended by the Enhancing the Quality and Transparency of Health Research (EQUATOR) Network, involving a scoping review and an expert Delphi consensus. The scoping review searched PubMed, Web of Science, Embase, CINAHL, PsycINFO, and Google Scholar (first 200 results) using keywords such as "generative AI" and "medical research" to identify reporting elements in GenAI-related studies. The Delphi process involves 30-50 experts with ≥3 years of experience in AI applications or medical research, selected based on publication records and expertise across disciplines (eg, clinicians and data scientists) and regions (eg, Asia and Europe). A survey using a 7-point scale will establish consensus on checklist items. In the testing phase, authors are invited to apply the GAMER checklist to GenAI-related manuscripts and provide feedback via a questionnaire, while experts assess reliability (κ statistic) and usability (time taken, 7-point Likert scale). The study has been approved by the Ethics Committee of the Institute of Health Data Science at Lanzhou University (HDS-202406-01).
    RESULTS: The GAMER project was launched in July 2023 by the Evidence-Based Medicine Center of Lanzhou University and the WHO Collaborating Centre for Guideline Implementation and Knowledge Translation, and its development phase concluded in July 2024. The scoping review was completed in November 2023, and the Delphi process was conducted from October 2023 to April 2024. The testing phase began in March 2025 and is ongoing. The expected outcome of the GAMER project is a reporting checklist accompanied by relevant terminology, examples, and explanations to guide stakeholders in better reporting the use of GenAI tools.
    CONCLUSIONS: GAMER aims to guide researchers, reviewers, and editors in the transparent and scientific application of GenAI tools in medical research. By providing a standardized reporting checklist, GAMER seeks to enhance the clarity, completeness, and integrity of research involving GenAI tools, thereby promoting collaboration, comparability, and cumulative knowledge generation in AI-driven health care technologies.
    INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/64640.
    Keywords:  ChatGPT; Delphi method; chatbots; generative AI; large language models; reporting guidelines; transparency
    DOI:  https://doi.org/10.2196/64640
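    A Delphi consensus tally on a 7-point scale is mechanically simple; the sketch below assumes a common ">=6 counts as agreement, 75% to retain" convention, which is not stated in the GAMER protocol, and the checklist items and expert ratings are invented:
      # Round-level Delphi consensus on candidate checklist items.
      # Thresholds, items, and ratings are assumptions for illustration.
      ratings = {
          "Report model name and version": [7, 6, 7, 6, 5, 7, 6, 7],
          "Report full prompts used":      [4, 6, 5, 7, 3, 6, 5, 4],
      }
      for item, scores in ratings.items():
          agree = sum(s >= 6 for s in scores) / len(scores)
          verdict = "retain" if agree >= 0.75 else "revise and re-vote"
          print(f"{item}: {agree:.0%} agreement -> {verdict}")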