bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-05-04
seven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Eval Clin Pract. 2025 Apr;31(3): e70100
    WCT EVI MAP group
       BACKGROUND: The introduction of systematic reviews in medicine has prompted a paradigm shift in how evidence is employed for decision-making across various fields. The methodology involves structured comparisons, critical appraisals, and pooled data analysis to inform decision-making. The process itself is resource-intensive and time-consuming, which can impede the timely incorporation of the latest evidence into clinical practice.
    AIM: This article introduces digital tools designed to enhance systematic review processes, emphasizing their functionality, availability, and independent validation in peer-reviewed literature.
    METHODS: We discuss digital evidence synthesis tools for systematic reviews, identifying tools for all review processes, tools for search strategy development, reference management, study selection, data extraction, and critical appraisal. Emphasis is on validated, functional tools with independently published method evaluations.
    RESULTS: Tools like EPPI-Reviewer, Covidence, DistillerSR, and JBI-SUMARI provide comprehensive support for systematic reviews. Additional tools cater to evidence search (e.g., PubMed PICO, Trialstreamer), reference management (e.g., Mendeley), prioritization in study selection (e.g., Abstrackr, EPPI-Reviewer, SWIFT-ActiveScreener), and risk-of-bias assessment (e.g., RobotReviewer). Machine learning and AI integration facilitate workflow efficiency but require informed evaluation by end users before adoption.
    CONCLUSION: The development of digital tools, particularly those incorporating AI, represents a significant advancement in systematic review methodology. These tools not only support the systematic review process but also have the potential to improve the timeliness and quality of evidence available for decision-making. The findings are relevant to clinicians, researchers, and those involved in the production or support of systematic reviews, with broader applicability to other research methods.
    Keywords:  artificial intelligence; automation tools; machine-learning; pathology; systematic review automation
    DOI:  https://doi.org/10.1111/jep.70100
  2. Pharmacoecon Open. 2025 Apr 29.
      The emergence of generative artificial intelligence (GenAI) offers the potential to enhance health economics and outcomes research (HEOR) by streamlining traditionally time-consuming and labour-intensive tasks, such as literature reviews, data extraction, and economic modelling. To effectively navigate this evolving landscape, health economists need a foundational understanding of how GenAI can complement their work. This primer aims to introduce health economists to the essentials of using GenAI tools, particularly large language models (LLMs), in HEOR projects.
      For health economists new to GenAI technologies, chatbot interfaces like ChatGPT offer an accessible way to explore the potential of LLMs. For more complex projects, knowledge of application programming interfaces (APIs), which provide scalability and integration capabilities, and prompt engineering strategies, such as few-shot and chain-of-thought prompting, is necessary to ensure accurate and efficient data analysis, enhance model performance, and tailor outputs to specific HEOR needs. Retrieval-augmented generation (RAG) can further improve LLM performance by incorporating current external information. LLMs have significant potential in many common HEOR tasks, such as summarising medical literature, extracting structured data, drafting report sections, generating statistical code, answering specific questions, and reviewing materials to enhance quality.
      However, health economists must also be aware of ongoing limitations and challenges, such as the propensity of LLMs to produce inaccurate information ('hallucinate'), security concerns, issues with reproducibility, and the risk of bias. Implementing LLMs in HEOR requires robust security protocols to handle sensitive data in compliance with the European Union's General Data Protection Regulation (GDPR) and the United States' Health Insurance Portability and Accountability Act (HIPAA). Deployment options such as local hosting, secure API use, or cloud-hosted open-source models offer varying levels of control and cost, each with unique trade-offs in security, accessibility, and technical demands.
      Reproducibility and transparency also pose unique challenges. To ensure the credibility of LLM-generated content, explicit declarations of the model version, prompting techniques, and benchmarks against established standards are recommended. Given the 'black box' nature of LLMs, a clear reporting structure is essential to maintain transparency and validate outputs, enabling stakeholders to assess the reliability and accuracy of LLM-generated HEOR analyses.
      The ethical implications of using artificial intelligence (AI) in HEOR, including LLMs, are complex and multifaceted, requiring careful assessment of each use case to determine the necessary level of ethical scrutiny and transparency. Health economists must balance the potential benefits of AI adoption against the risks of maintaining current practices, while also considering issues such as accountability, bias, intellectual property, and the broader impact on the healthcare system.
      As LLMs and AI technologies advance, their potential role in HEOR will become increasingly evident. Key areas of promise include creating dynamic, continuously updated HEOR materials, providing patients with more accessible information, and enhancing analytics for faster access to medicines. To maximise these benefits, health economists must understand and address challenges such as data ownership and bias. The coming years will be critical for establishing best practices for GenAI in HEOR. This primer encourages health economists to adopt GenAI responsibly, balancing innovation with scientific rigor and ethical integrity to improve healthcare insights and decision-making.
    DOI:  https://doi.org/10.1007/s41669-025-00580-4
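The few-shot prompting the primer above describes can be sketched in a few lines. This is a minimal illustration, not the primer's own implementation: the field names, example abstract, and output schema are all hypothetical, and the messages list follows the chat format used by common LLM APIs (system/user/assistant roles).

```python
# Sketch of a few-shot prompt for structured data extraction.
# One worked example (the "shot") teaches the model the output schema
# before the new abstract is presented.

def build_few_shot_messages(new_abstract: str) -> list[dict]:
    """Assemble a few-shot chat prompt for data extraction."""
    system = (
        "You extract structured data from clinical trial abstracts. "
        "Reply with JSON containing: sample_size, intervention, outcome."
    )
    example_in = ("A trial randomized 120 patients to drug A vs placebo; "
                  "mortality fell 12%.")
    example_out = ('{"sample_size": 120, "intervention": "drug A", '
                   '"outcome": "mortality"}')
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": example_in},       # the worked example
        {"role": "assistant", "content": example_out}, # its expected answer
        {"role": "user", "content": new_abstract},     # the new record
    ]

messages = build_few_shot_messages(
    "We enrolled 250 adults to compare drug B with usual care...")
print(len(messages))  # 4: system, example pair, new query
```

Adding more example pairs, or a "think step by step" instruction before the answer, turns the same scaffold into the chain-of-thought variant the primer mentions.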
  3. BMC Med Res Methodol. 2025 Apr 28. 25(1): 116
       BACKGROUND: Large language models (LLMs) like ChatGPT have shown great potential in aiding medical research. Screening records imposes a heavy workload in evidence-based medicine, especially in meta-analysis. However, few studies have tried to use LLMs to help screen records in meta-analysis.
    OBJECTIVE: In this research, we aimed to explore the possibility of incorporating multiple LLMs to facilitate the screening step based on the title and abstract of records during meta-analysis.
    METHODS: Various LLMs were evaluated, including GPT-3.5, GPT-4, Deepseek-R1-Distill, Qwen-2.5, Phi-4, Llama-3.1, Gemma-2, and Claude-2. To assess our strategy, we selected three meta-analyses from the literature, together with a glioma meta-analysis embedded in the study, as additional validation. For the automatic selection of records from curated meta-analyses, a four-step strategy called LARS-GPT was developed, consisting of (1) criteria selection and single-prompt (prompt with one criterion) creation, (2) best combination identification, (3) combined-prompt (prompt with one or more criteria) creation, and (4) request sending and answer summary. Recall, workload reduction, precision, and F1 score were calculated to assess the performance of LARS-GPT.
    RESULTS: Performance varied across single-prompts, with a mean recall of 0.800. Based on these single-prompts, we found combinations that performed better than the pre-set threshold. Finally, with the best combination of criteria identified, LARS-GPT achieved a 40.1% workload reduction on average with a recall greater than 0.9.
    CONCLUSIONS: We show that automatic selection of literature for meta-analysis is feasible with LLMs. We provide the approach as a pipeline, LARS-GPT, which achieved a substantial workload reduction while maintaining a pre-set recall.
    Keywords:  ChatGPT; Deepseek; Large language model; Meta-analysis; Phi
    DOI:  https://doi.org/10.1186/s12874-025-02569-3
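The screening metrics reported for LARS-GPT can all be derived from a confusion table of the LLM's include/exclude decisions against the reviewers' gold standard. A minimal sketch follows; the counts are illustrative, not the paper's data, and "workload reduction" is taken to mean the share of records the LLM excludes (and a human therefore need not read).

```python
# Screening metrics for title/abstract screening.
# tp: relevant records the LLM includes; fn: relevant records it wrongly
# excludes; fp: irrelevant records it includes; tn: irrelevant records
# it correctly excludes.

def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    recall = tp / (tp + fn)                 # share of relevant records kept
    precision = tp / (tp + fp)              # share of kept records relevant
    f1 = 2 * precision * recall / (precision + recall)
    total = tp + fp + fn + tn
    workload_reduction = (tn + fn) / total  # records a human skips
    return {"recall": recall, "precision": precision,
            "f1": f1, "workload_reduction": workload_reduction}

m = screening_metrics(tp=45, fp=155, fn=5, tn=795)
print(round(m["recall"], 2), round(m["workload_reduction"], 2))  # 0.9 0.8
```

The trade-off the abstract describes is visible here: a high recall threshold caps how many records may be excluded, so workload reduction is bounded by how well the prompts separate relevant from irrelevant records.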
  4. NPJ Digit Med. 2025 Apr 27. 8(1): 227
      Delays in translating new medical evidence into clinical practice hinder patient access to the best available treatments. Our data reveals an average delay of nine years from the initiation of human research to its adoption in clinical guidelines, with 1.7-3.0 years lost between trial publication and guideline updates. A substantial part of these delays stems from slow, manual processes in updating clinical guidelines, which rely on time-intensive evidence synthesis workflows. The Next Generation Evidence (NGE) system addresses this challenge by harnessing state-of-the-art biomedical Natural Language Processing (NLP) methods. This novel system integrates diverse evidence sources, such as clinical trial reports and digital guidelines, enabling automated, data-driven analyses of the time it takes for research findings to inform clinical practice. Moreover, the NGE system provides precision-focused literature search filters tailored specifically for guideline maintenance. In benchmarking against two German oncology guidelines, these filters demonstrate exceptional precision in identifying pivotal publications for guideline updates.
    DOI:  https://doi.org/10.1038/s41746-025-01648-5
  5. R Soc Open Sci. 2025 Apr;12(4): 241776
      Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26-73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (odds ratio = 4.85, 95% CI [3.06, 7.70], p < 0.001). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy.
    Keywords:  algorithmic bias; large language models; overgeneralization; science communication
    DOI:  https://doi.org/10.1098/rsos.241776
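The headline comparison above (odds ratio = 4.85, 95% CI [3.06, 7.70]) comes from a 2x2 table of summaries that do or do not contain broad generalizations. A small sketch of the standard log-odds-ratio calculation with a Wald 95% confidence interval follows; the counts are hypothetical and chosen only to illustrate the formula, not to reproduce the study's result.

```python
import math

def odds_ratio_ci(a: int, b: int, c: int, d: int) -> tuple[float, float, float]:
    """a/b: LLM summaries with/without overgeneralization;
    c/d: human summaries with/without. Returns (OR, lower, upper)."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of log(OR)
    lower = math.exp(math.log(or_) - 1.96 * se)
    upper = math.exp(math.log(or_) + 1.96 * se)
    return or_, lower, upper

# Hypothetical counts: 60/100 LLM summaries vs 24/100 human summaries
# contain a broad generalization.
or_, lower, upper = odds_ratio_ci(a=60, b=40, c=24, d=76)
print(round(or_, 2))  # 4.75
```

An odds ratio near 5, as reported, means the odds of an overgeneralized claim are roughly five times higher in the LLM-written summaries than in the human-written ones.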
  6. J Laparoendosc Adv Surg Tech A. 2025 Apr 26.
      Aim: This study assesses the reliability of artificial intelligence (AI) large language models (LLMs) in identifying relevant literature comparing inguinal hernia repair techniques. Material and Methods: We used LLM chatbots (Bing Chat AI, ChatGPT versions 3.5 and 4.0, and Gemini) to find comparative studies and randomized controlled trials on inguinal hernia repair techniques. The results were then compared with existing systematic reviews (SRs) and meta-analyses and checked for the authenticity of the listed articles. Results: The LLMs returned 22 studies from 2006 to 2023 across eight journals, while the SRs encompassed a total of 42 studies. Through thorough external validation, 63.6% of the studies (14 out of 22), including 10 identified through ChatGPT 4.0 and 6 via Bing AI (with an overlap of 2 studies between them), were confirmed to be authentic. Conversely, 36.3% (8 out of 22) were revealed to be fabrications by Google Gemini (Bard), with two (25.0%) of these fabrications mistakenly linked to valid DOIs. Four (28.6%) of the 14 real studies were acknowledged in the SRs, which represents 18.1% of all LLM-generated studies. The LLMs missed a total of 38 (90.5%) of the studies included in the previous SRs, while 10 real studies were found by the LLMs but were not included in the previous SRs. Among those 10 studies, 6 were reviews and 1 was published after the SRs, leaving a total of three comparative studies missed by the reviews. Conclusions: This study reveals the mixed reliability of AI language models in scientific searches, emphasizing cautious application of AI in academia and the importance of continuous evaluation of AI tools in scientific investigations.
    Keywords:  artificial intelligence; inguinal hernia; laparoscopic surgery; minimally invasive surgery; open surgery; robotic surgery
    DOI:  https://doi.org/10.1089/lap.2024.0277
  7. medRxiv. 2025 Apr 23. pii: 2024.09.16.24313707. [Epub ahead of print]
       Study Objectives: The coding of semi-structured interview transcripts is a critical step for thematic analysis of qualitative data. However, the coding process is often labor-intensive and time-consuming. The emergence of generative artificial intelligence (GenAI) presents new opportunities to enhance the efficiency of qualitative coding. This study proposed a computational pipeline using GenAI to automatically extract themes from interview transcripts.
    Methods: Using transcripts from interviews conducted with maternity care providers in South Carolina, we leveraged ChatGPT for inductive coding to generate codes from interview transcripts without a predetermined coding scheme. Structured prompts were designed to instruct ChatGPT to generate and summarize codes. The performance of GenAI was evaluated by comparing the AI-generated codes with those generated manually.
    Results: GenAI demonstrated promise in detecting and summarizing codes from interview transcripts. ChatGPT exhibited an overall accuracy exceeding 80% in inductive coding. More impressively, GenAI reduced the time required for coding by 81%.
    Discussion: GenAI models are capable of efficiently processing language datasets and performing multi-level semantic identification. However, challenges such as inaccuracy, systematic biases, and privacy concerns must be acknowledged and addressed. Future research should focus on refining these models to enhance reliability and address inherent limitations associated with their application in qualitative research.
    DOI:  https://doi.org/10.1101/2024.09.16.24313707
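The evaluation above compares AI-generated inductive codes against manually assigned ones. One simple way to score such agreement is to treat each transcript segment's codes as a set and count a match when the AI's set overlaps the manual set. This is a minimal sketch under that assumption; the segments, codes, and matching rule are invented for illustration and are not the study's actual scoring protocol.

```python
# Segment-level agreement between AI-generated and manual qualitative codes.
# A segment counts as correctly coded if the AI assigned at least one of
# the codes a human coder assigned to that segment.

def coding_accuracy(ai_codes: list[set], manual_codes: list[set]) -> float:
    hits = sum(1 for ai, gold in zip(ai_codes, manual_codes) if ai & gold)
    return hits / len(manual_codes)

manual = [{"access to care"}, {"staffing"}, {"telehealth"}, {"burnout"}]
ai = [{"access to care", "cost"}, {"staffing"}, {"scheduling"}, {"burnout"}]
print(coding_accuracy(ai, manual))  # 0.75
```

Stricter variants (exact set match, or chance-corrected statistics such as Cohen's kappa) would give lower figures than this lenient overlap rule.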