bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-09-07
fourteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Med Syst. 2025 Sep 04. 49(1): 110
      The use of generative AI in systematic review workflows has gained attention for enhancing study selection efficiency. However, evidence on its screening performance remains inconclusive, and direct comparisons between different generative AI models are still limited. The objective of this study is to evaluate the performance of ChatGPT-4o and Claude 3.5 Sonnet in the study selection process of a systematic review in obstetrics. A literature search was conducted using PubMed, EMBASE, Cochrane CENTRAL, and EBSCO Open Dissertations from inception to February 2024. Titles and abstracts were screened using a structured prompt-based approach, comparing decisions by ChatGPT, Claude, and junior researchers with decisions by an experienced researcher serving as the reference standard. For the full-text review, short and long prompt strategies were applied. We reported title/abstract screening and full-text review performance using accuracy, sensitivity (recall), precision, F1-score, and negative predictive value. In the title/abstract screening phase, human researchers demonstrated the highest accuracy (0.9593), followed by Claude (0.9448) and ChatGPT (0.9138). The F1-score was highest among human researchers (0.3853), followed by Claude (0.3724) and ChatGPT (0.2755). Negative predictive value (NPV) was high across all screeners: ChatGPT (0.9959), Claude (0.9961), and human researchers (0.9924). In the full-text screening phase, ChatGPT with a short prompt achieved the highest accuracy (0.904), the highest F1-score (0.90), and an NPV of 1.00, surpassing the performance of Claude and human researchers. Generative AI models perform close to human levels in study selection, as demonstrated here in obstetrics. Further research should explore their integration into evidence synthesis across different fields.
    Keywords:  Artificial intelligence; Chat-GPT; Claude; Large language model; Systematic review
    DOI:  https://doi.org/10.1007/s10916-025-02246-4
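    The metrics above (accuracy, sensitivity/recall, precision, F1-score, and NPV) all derive from a confusion matrix of screening decisions against the reference standard. A minimal sketch, with illustrative counts rather than the study's data:

```python
# Screening-performance metrics from a confusion matrix of include/exclude
# decisions, using an experienced reviewer's judgments as the reference.
# Counts below are illustrative placeholders, not the study's data.

def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0,
        "npv": tn / (tn + fn) if (tn + fn) else 0.0,      # negative predictive value
    }

# Example: hypothetical title/abstract screening counts for one AI screener.
print(screening_metrics(tp=20, fp=80, tn=1900, fn=5))
```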
  2. JMIR AI. 2025 Sep 05. 4: e68592
       Background: Artificial intelligence (AI) is becoming increasingly popular in the scientific field, as it can analyze extensive datasets, summarize results, and assist in writing academic papers.
    Objective: This study investigates the role of AI in the process of conducting a systematic literature review (SLR), focusing on its contributions and limitations at three key stages of SLR development: study selection, data extraction, and study composition. Glaucoma-related SLRs served as case studies, and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-based SLRs served as benchmarks.
    Methods: Four AI platforms were tested on their ability to reproduce four PRISMA-based, glaucoma-related SLRs. We used Connected Papers and Elicit to search for relevant records; we then assessed Elicit's and ChatPDF's ability to extract and organize the information contained in the retrieved records. Finally, we tested Jenni AI's capacity to compose an SLR.
    Results: Neither Connected Papers nor Elicit retrieved all of the records identified using the PRISMA method. On average, data extracted from Elicit were accurate in 51.40% (SD 31.45%) of cases and imprecise in 13.69% (SD 17.98%); 22.37% (SD 27.54%) of responses were missing, while 12.51% (SD 14.70%) were incorrect. Data extracted from ChatPDF were accurate in 60.33% (SD 30.72%) of cases and imprecise in 7.41% (SD 13.88%); 17.56% (SD 20.02%) of responses were missing, and 14.70% (SD 17.72%) were incorrect. Jenni AI's generated content exhibited satisfactory language fluency and technical proficiency but was insufficient in defining methods, elaborating results, and stating conclusions.
    Conclusions: The PRISMA method continues to exhibit clear superiority in terms of reproducibility and accuracy during the literature search, data extraction, and study composition phases of the SLR writing process. While AI can save time and assist with repetitive tasks, the active participation of the researcher throughout the entire process is still crucial to maintain control over the quality, accuracy, and objectivity of their work.
    Keywords:  AI; AI in systematic reviews; AI-assisted academic writing; AI-assisted data analysis; ChatPDF; Connected Papers; Elicit; JenniAI; SLR; artificial intelligence; systematic literature review
    DOI:  https://doi.org/10.2196/68592
  3. Cochrane Evid Synth Methods. 2025 Sep;3(5): e70044
       Introduction: Systematic reviews and meta-analyses synthesize randomized trial data to guide clinical decisions but require significant time and resources. Artificial intelligence (AI) offers a promising solution to streamline evidence synthesis, aiding study selection, data extraction, and risk of bias assessment. This study aims to evaluate the performance of ChatGPT-4o in assessing the risk of bias in randomised controlled trials (RCTs) using the Risk of Bias 2 (RoB 2) tool, comparing its results with those conducted by human reviewers in Cochrane Reviews.
    Methods: A sample of Cochrane Reviews utilizing the RoB 2 tool was identified through the Cochrane Database of Systematic Reviews (CDSR). Protocols, qualitative systematic reviews, and reviews employing alternative risk of bias assessment tools were excluded. The study utilized ChatGPT-4o to assess the risk of bias using a structured set of prompts corresponding to the RoB 2 domains. The agreement between ChatGPT-4o and consensus-based human reviewer assessments was evaluated using weighted kappa statistics. Additionally, accuracy, sensitivity, specificity, positive predictive value, and negative predictive value were calculated. All analyses were performed using R Studio (version 4.3.0).
    Results: A total of 42 Cochrane Reviews were screened, yielding a final sample of eight eligible reviews comprising 84 RCTs. The primary outcome of each included review was selected for risk of bias assessment. ChatGPT-4o demonstrated moderate agreement with human reviewers for the overall risk of bias judgments (weighted kappa = 0.51, 95% CI: 0.36-0.66). Agreement varied across domains, ranging from fair (κ = 0.20 for selection of the reported results) to moderate (κ = 0.59 for measurement of outcomes). ChatGPT-4o exhibited a sensitivity of 53% for identifying high-risk studies and a specificity of 99% for classifying low-risk studies.
    Conclusion: This study shows that ChatGPT-4o can perform risk of bias assessments using RoB 2 with fair to moderate agreement with human reviewers. While AI-assisted risk of bias assessment remains imperfect, advancements in prompt engineering and model refinement may enhance performance. Future research should explore standardised prompts and investigate interrater reliability among human reviewers to provide a more robust comparison.
    Keywords:  artificial intelligence; evidence synthesis; large language models; risk of bias
    DOI:  https://doi.org/10.1002/cesm.70044
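    The agreement statistic used here, weighted kappa, treats the three RoB 2 judgements (low risk, some concerns, high risk) as ordered categories so that near-misses are penalized less than opposite calls. A minimal sketch using scikit-learn, with hypothetical labels rather than the review's data:

```python
# Weighted kappa between model and consensus human RoB 2 judgments.
# The three ordinal categories are mapped to ranks so that disagreements
# between adjacent categories are penalized less than low-vs-high ones.
# Labels below are hypothetical, not the study's data.
from sklearn.metrics import cohen_kappa_score

RANK = {"low": 0, "some concerns": 1, "high": 2}

human = ["low", "some concerns", "high", "low", "high", "some concerns"]
model = ["low", "some concerns", "low", "low", "high", "high"]

kappa = cohen_kappa_score(
    [RANK[x] for x in human],
    [RANK[x] for x in model],
    weights="linear",   # "quadratic" is also common for ordinal scales
)
print(f"weighted kappa = {kappa:.2f}")
```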
  4. Integr Med Res. 2025 Dec;14(4): 101217
     Background: An evidence map is a tool that visualizes the state of research in a field to identify gaps and set priorities, but maintaining one carries the burden of continuous literature monitoring. Pharmacopuncture is a therapeutic modality used in Korean medicine that involves the injection of medicinal extracts into acupoints. This study aimed to develop an artificial intelligence (AI)-based automated system for building and maintaining a living evidence map in the field of pharmacopuncture research and to verify its performance.
    Methods: A web-based system that automates literature search, selection, data extraction, and classification using the PubMed API and Gemini AI was developed. The accuracy of nine tasks was evaluated, and time efficiency was measured, using manual review by experts as the reference standard. A visualization system using interactive bubble charts was implemented to provide a research-gap identification function.
    Results: The AI system achieved an overall accuracy of 94.00% (error rate of 6.00%) for 202 articles, including detailed data extraction for 90 articles. Task-specific performance varied from sample size extraction (0% error rate) to pharmacopuncture name extraction (22.22% error rate), with high accuracy of over 90% in most tasks. Time efficiency was improved by 68.9% (190 vs. 59 minutes, including quality control), demonstrating that daily updates are practically feasible.
    Conclusions: The developed visualization system significantly improves the existing static evidence organization method by intuitively identifying research gaps. The AI-based living evidence map enables continuous evidence monitoring in the field of pharmacopuncture research with high accuracy and significant time savings.
    Keywords:  Artificial intelligence; Automatization; Living evidence map; Pharmacopuncture
    DOI:  https://doi.org/10.1016/j.imr.2025.101217
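    The abstract does not publish the system's code; the retrieval step it describes can be sketched against NCBI's public E-utilities endpoints. The query string, date window, and downstream Gemini classification step below are placeholders, not the published pipeline:

```python
# Fetch recent PubMed records for a topic via NCBI E-utilities (esearch + efetch).
# Query, date window, and the AI-classification step are illustrative placeholders;
# the published system's actual pipeline and prompts are not reproduced here.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(query: str, days: int = 1, retmax: int = 100) -> list[str]:
    params = {
        "db": "pubmed", "term": query, "reldate": days,
        "datetype": "edat", "retmax": retmax, "retmode": "json",
    }
    r = requests.get(f"{EUTILS}/esearch.fcgi", params=params, timeout=30)
    r.raise_for_status()
    return r.json()["esearchresult"]["idlist"]

def fetch_abstracts(pmids: list[str]) -> str:
    params = {"db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "text"}
    r = requests.get(f"{EUTILS}/efetch.fcgi", params=params, timeout=30)
    r.raise_for_status()
    return r.text

if __name__ == "__main__":
    pmids = search_pubmed("pharmacopuncture", days=7)
    if pmids:
        abstracts = fetch_abstracts(pmids)
        # Classification/extraction with an LLM (e.g., Gemini) would follow here.
        print(f"{len(pmids)} new records retrieved")
```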
  5. J Am Med Inform Assoc. 2025 Aug 31. pii: ocaf137. [Epub ahead of print]
       OBJECTIVES: Systematic reviews in comparative effectiveness research require timely evidence synthesis. With the rapid advancement of medical research, preprint articles play an increasingly important role in accelerating knowledge dissemination. However, as preprint articles are not peer-reviewed before publication, their quality varies significantly, posing challenges for evidence inclusion in systematic reviews.
    MATERIALS AND METHODS: We developed AutoConfidenceScore (automated confidence score assessment), an advanced framework for predicting preprint publication, which reduces reliance on manual curation and expands the range of predictors, including three key advancements: (1) automated data extraction using natural language processing techniques, (2) semantic embeddings of titles and abstracts, and (3) large language model (LLM)-driven evaluation scores. Additionally, we employed two prediction models: a random forest classifier for binary outcome and a survival cure model that predicts both binary outcome and publication risk over time.
    RESULTS: The random forest classifier achieved an area under the receiver operating characteristic curve (AUROC) of 0.747 using all features. The survival cure model achieved an AUROC of 0.731 for binary outcome prediction and a concordance index of 0.667 for time-to-publication risk.
    DISCUSSION: Our study advances the framework for preprint publication prediction through automated data extraction and multiple feature integration. By combining semantic embeddings with LLM-driven evaluations, AutoConfidenceScore significantly enhances predictive performance while reducing manual annotation burden.
    CONCLUSION: AutoConfidenceScore has the potential to facilitate incorporation of preprint articles during the appraisal phase of systematic reviews, supporting researchers in more effective utilization of preprint resources.
    Keywords:  evidence synthesis; evidence-based medicine; large language models; preprint article; systematic reviews
    DOI:  https://doi.org/10.1093/jamia/ocaf137
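    The binary-outcome arm described above can be sketched as a random forest trained on a feature matrix (e.g., abstract embeddings concatenated with LLM-derived scores) and evaluated by AUROC. The synthetic features below stand in for the paper's extracted predictors; the survival cure model is not shown:

```python
# Random forest for binary publication prediction, evaluated by AUROC.
# Synthetic features stand in for the paper's predictors (embeddings,
# LLM evaluation scores, extracted metadata); the survival cure model is omitted.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))                                     # embedding + score features
y = (X[:, 0] + rng.normal(scale=1.5, size=500) > 0).astype(int)    # published vs. not

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUROC = {auroc:.3f}")
```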
  6. Psychiatry. 2025 Aug 28. 1-10
       OBJECTIVE: The proliferation of access to generative AI tools has the potential to radically alter the process of writing manuscripts. This report evaluates NotebookLM as a tool for conducting a literature review in an ethical and responsible manner.
    METHOD: We uploaded 22 relevant papers from the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS) to NotebookLM and asked questions pertaining to a hypothetical research paper. We investigated the capabilities, limitations, ethical considerations, and privacy implications of using NotebookLM and engaged in a dialogue with the tool through a series of user-written prompts and AI responses.
    RESULTS: We found that the variability and utility of responses were determined in large part by the ability to write meaningful prompts and the extent to which new prompts provided additional information. Investigating how NotebookLM identified key findings enhanced our prompt generation and, subsequently, the iterative refinement of output to produce information relevant to our mock literature review.
    CONCLUSIONS: The utility of NotebookLM will likely vary by the quality of source material uploaded into the program and the researcher's familiarity with prompt generation. There are a number of benefits and drawbacks to using this tool as a search engine or conversation partner. Ethical considerations and privacy implications of using NotebookLM are discussed.
    DOI:  https://doi.org/10.1080/00332747.2025.2541531
  7. Cochrane Evid Synth Methods. 2025 Sep;3(5): e70046
      Automation, including Machine Learning (ML), is increasingly being explored to reduce the time and effort involved in evidence syntheses, yet its adoption and reporting practices remain under-examined across disciplines (e.g., health sciences, education, and policy). This review assesses the use of automation, including ML-based techniques, in 2271 evidence syntheses published between 2017 and 2024 in the Cochrane Database of Systematic Reviews and the journals Campbell Systematic Reviews and Environmental Evidence. We focus on automation across four review steps: search, screening, data extraction, and analysis/synthesis. We systematically identified eligible studies from the three sources and developed a classification system to distinguish between manual, rules-based, ML-enabled, and ML-embedded tools. We then extracted data on tool use, ML integration, reporting practices, motivations for (and against) ML adoption, and the application of stopping criteria for ML-assisted screening. Only ~5% of studies explicitly reported using ML, with most applications limited to screening tasks. Although ~12% employed ML-enabled tools, ~90% of those did not clarify whether ML functionalities were actually utilized. Living reviews showed higher relative ML integration (~15%), but overall uptake remains limited. Previous work has shown that common barriers to broader adoption included limited guidance, low user awareness, and concerns over reliability. Despite ML's potential to streamline evidence syntheses, its integration remains limited and inconsistently reported. Improved transparency, clearer reporting standards, and greater user training are needed to support responsible adoption. As the research literature grows, automation will become increasingly essential, but only if challenges in usability, reproducibility, and trust are addressed.
    Keywords:  artificial intelligence; living reviews; machine learning; screening automation; systematic reviews
    DOI:  https://doi.org/10.1002/cesm.70046
  8. Health Data Sci. 2025;5: 0322
     Background: The traditional manual literature screening approach is limited by its time-consuming nature and high labor costs. A pressing issue is how to leverage large language models to enhance the efficiency and quality of evidence-based evaluations of drug efficacy and safety.
    Methods: This study utilized a manually curated reference literature database, comprising vaccine, hypoglycemic agent, and antidepressant evaluation studies, previously developed by our team through conventional systematic review methods. This validated database served as the gold standard for the development and optimization of LitAutoScreener. Following the PICOS (Population, Intervention, Comparison, Outcomes, Study Design) principles, a chain-of-thought reasoning approach with few-shot learning prompts was implemented to develop the screening algorithm. We subsequently evaluated the performance of LitAutoScreener using 2 independent validation cohorts, assessing both classification accuracy and processing efficiency.
    Results: For title-abstract screening in the respiratory syncytial virus vaccine safety validation cohort, our tools based on GPT (GPT-4o), Kimi (moonshot-v1-128k), and DeepSeek (deepseek-chat 2.5) demonstrated high accuracy in inclusion/exclusion decisions (99.38%, 98.94%, and 98.85%, respectively). Recall rates were 100.00%, 99.13%, and 98.26%, with statistically significant performance differences (χ2 = 5.99, P = 0.048), where GPT outperformed the other models. Exclusion-reason concordance rates were 98.85%, 94.79%, and 96.47% (χ2 = 30.22, P < 0.001). In full-text screening, all models maintained perfect recall (100.00%), with accuracies of 100.00% (GPT), 100.00% (Kimi), and 99.45% (DeepSeek). Processing times averaged 1 to 5 s per article for title-abstract screening and 60 s for full-text processing (including PDF preprocessing).
    Conclusions: LitAutoScreener offers a new approach for efficient literature screening in drug intervention studies, achieving high accuracy and significantly improving screening efficiency.
    DOI:  https://doi.org/10.34133/hds.0322
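    The exact LitAutoScreener prompts are not reported in the abstract; the approach it names, PICOS-structured eligibility criteria with few-shot, chain-of-thought examples, can be sketched as below. The criteria, examples, and output format are placeholders:

```python
# Building a PICOS-structured screening prompt with few-shot, chain-of-thought
# examples. Criteria, examples, and output format are illustrative placeholders;
# they are not the published LitAutoScreener prompts.

PICOS_CRITERIA = """\
Population: adults receiving the vaccine of interest
Intervention: respiratory syncytial virus vaccine
Comparison: placebo or no vaccination
Outcomes: safety outcomes (adverse events)
Study design: randomized controlled trials"""

FEW_SHOT = """\
Example 1
Title/Abstract: <an RCT of the vaccine reporting adverse events>
Reasoning: population, intervention, outcomes, and design all match the criteria.
Decision: INCLUDE

Example 2
Title/Abstract: <a narrative review of vaccine policy>
Reasoning: not a randomized trial, so the study-design criterion fails.
Decision: EXCLUDE (reason: study design)"""

def build_prompt(title: str, abstract: str) -> str:
    return (
        "You are screening records for a systematic review.\n"
        f"Eligibility criteria (PICOS):\n{PICOS_CRITERIA}\n\n"
        f"{FEW_SHOT}\n\n"
        "Now assess the following record. Reason step by step through each\n"
        "PICOS element, then answer INCLUDE or EXCLUDE with the failing criterion.\n\n"
        f"Title: {title}\nAbstract: {abstract}\nReasoning:"
    )

print(build_prompt("Example title", "Example abstract"))
```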
  9. World J Methodol. 2025 Dec 20. 15(4): 102290
       BACKGROUND: Meta-analysis is a critical tool in evidence-based medicine, particularly in cardiology, where it synthesizes data from multiple studies to inform clinical decisions. This study explored the potential of using ChatGPT to streamline and enhance the meta-analysis process.
    AIM: To investigate the potential of ChatGPT to conduct meta-analyses in interventional cardiology by comparing the results of ChatGPT-generated analyses with those of randomly selected, human-conducted meta-analyses on the same topic.
    METHODS: We systematically searched PubMed for meta-analyses on interventional cardiology published in 2024. Five meta-analyses were randomly chosen. ChatGPT 4.0 was used to perform meta-analyses on the extracted data. We compared the results from ChatGPT with the original meta-analyses, focusing on key effect sizes, such as risk ratios (RR), hazard ratios, and odds ratios, along with their confidence intervals (CI) and P values.
    RESULTS: The ChatGPT results showed high concordance with those of the original meta-analyses. For most outcomes, the effect measures and P values generated by ChatGPT closely matched those of the original studies, except for the RR of stent thrombosis in the Sreenivasan et al study, where ChatGPT reported a non-significant effect size, while the original study found it to be statistically significant. While minor discrepancies were observed in specific CI and P values, these differences did not alter the overall conclusions drawn from the analyses.
    CONCLUSION: Our findings suggest the potential of ChatGPT in conducting meta-analyses in interventional cardiology. However, further research is needed to address the limitations of transparency and potential data quality issues, ensuring that AI-generated analyses are robust and trustworthy for clinical decision-making.
    Keywords:  Artificial intelligence; Cardiology; ChatGPT; Large language model; Meta-analysis; Methodology; Statistical analysis
    DOI:  https://doi.org/10.5662/wjm.v15.i4.102290
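    The effect sizes compared above come from standard pooling of study-level estimates. A minimal sketch of fixed-effect, inverse-variance pooling of log risk ratios with a 95% CI, using hypothetical counts; the reviewed meta-analyses (and ChatGPT's reproductions of them) typically involve random-effects models and dedicated software:

```python
# Fixed-effect, inverse-variance pooling of log risk ratios with a 95% CI.
# Event counts are hypothetical; published meta-analyses typically use dedicated
# software and often random-effects models rather than this minimal version.
import math

# (events_treatment, n_treatment, events_control, n_control) per study
studies = [(12, 150, 20, 148), (8, 200, 15, 205), (5, 90, 9, 92)]

weights, log_rrs = [], []
for a, n1, c, n2 in studies:
    log_rr = math.log((a / n1) / (c / n2))
    var = 1 / a - 1 / n1 + 1 / c - 1 / n2          # variance of log RR
    log_rrs.append(log_rr)
    weights.append(1 / var)

pooled = sum(w * lr for w, lr in zip(weights, log_rrs)) / sum(weights)
se = math.sqrt(1 / sum(weights))
rr, lo, hi = (math.exp(x) for x in (pooled, pooled - 1.96 * se, pooled + 1.96 * se))
print(f"pooled RR = {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```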
  10. Res Sq. 2025 Aug 25. pii: rs.3.rs-7216581. [Epub ahead of print]
    Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise particularly in few-shot settings (i.e., limited training data). In this article, we address the performance challenges of LLMs for few-shot biomedical NER by investigating a dynamic prompting strategy involving retrieval-augmented generation (RAG). In our approach, the annotated in-context learning examples are selected based on their similarities with the input texts, and the prompt is dynamically updated for each instance during inference. We implemented and optimized static and dynamic prompt engineering techniques and evaluated them on five biomedical NER datasets. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further improved performance, with TF-IDF and SBERT retrieval methods yielding the best results, improving average F1-scores by 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. These findings highlight the utility of contextually adaptive prompts via RAG for biomedical NER.
    DOI:  https://doi.org/10.21203/rs.3.rs-7216581/v1
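    The dynamic prompting strategy described above selects, for each input, the annotated examples most similar to it and places them in the prompt. A minimal sketch using TF-IDF cosine similarity for retrieval; the tiny example pool and prompt wording are placeholders:

```python
# Dynamic few-shot selection for NER prompts: retrieve the annotated examples
# most similar to the input text (TF-IDF cosine similarity) and place them in
# the prompt. The example pool and prompt wording are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pool = [
    ("Aspirin reduced headache severity.", "Aspirin=CHEMICAL"),
    ("The BRCA1 mutation raises cancer risk.", "BRCA1=GENE; cancer=DISEASE"),
    ("Metformin is first-line therapy for type 2 diabetes.", "Metformin=CHEMICAL; type 2 diabetes=DISEASE"),
]

vectorizer = TfidfVectorizer().fit([text for text, _ in pool])
pool_vecs = vectorizer.transform([text for text, _ in pool])

def build_ner_prompt(text: str, k: int = 2) -> str:
    sims = cosine_similarity(vectorizer.transform([text]), pool_vecs)[0]
    top = sims.argsort()[::-1][:k]              # indices of the k most similar examples
    shots = "\n".join(f"Text: {pool[i][0]}\nEntities: {pool[i][1]}" for i in top)
    return f"Label the biomedical entities.\n{shots}\nText: {text}\nEntities:"

print(build_ner_prompt("Ibuprofen relieved the patient's arthritis pain."))
```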
  11. Can Urol Assoc J. 2025 Aug 28.
       INTRODUCTION: We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).
    METHODS: A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each of the SRs in duplicate, with differences adjudicated by a third expert. We created a customized GPT, "Urology AMSTAR 2 Quality Assessor," and graded the 114 SRs in three iterations using a zero-shot method. We then performed an enhanced trial focusing on critical criteria by giving GPT detailed, step-by-step instructions for each SR using a chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial were calculated against the human results. Internal validity across the three trials was computed.
    RESULTS: GPT had an overall congruence of 75% with the human results, with 77% for critical criteria and 73% for non-critical criteria. The average F1 score was 0.66. Internal validity across the three iterations was high, at 85%. GPT accurately assigned 89% of studies to the correct overall category. When given specific, step-by-step instructions, congruence on critical criteria improved to 91%, and overall quality assessment accuracy improved to 93%.
    CONCLUSIONS: GPT showed promising ability to efficiently and accurately assess the quality of SRs in urology.
    DOI:  https://doi.org/10.5489/cuaj.9243
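    The headline numbers here (congruence against adjudicated human grades, internal validity across iterations) reduce to simple proportions over per-item AMSTAR 2 judgements. A minimal sketch with hypothetical yes/no grades:

```python
# Congruence of GPT AMSTAR 2 grades against adjudicated human grades, and
# internal validity as the share of items graded identically across iterations.
# The yes/no grades below are hypothetical, not the study's data.
human = ["yes", "no", "yes", "yes", "no", "yes"]
gpt_runs = [
    ["yes", "no", "yes", "no", "no", "yes"],   # iteration 1
    ["yes", "no", "yes", "yes", "no", "yes"],  # iteration 2
    ["yes", "yes", "yes", "no", "no", "yes"],  # iteration 3
]

congruence = [
    sum(g == h for g, h in zip(run, human)) / len(human) for run in gpt_runs
]
internal_validity = sum(
    len({run[i] for run in gpt_runs}) == 1 for i in range(len(human))
) / len(human)

print(f"congruence per iteration: {[f'{c:.0%}' for c in congruence]}")
print(f"internal validity: {internal_validity:.0%}")
```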
  12. Qual Health Res. 2025 Sep 05. 10497323251365198
    In this essay, I offer my take on contemporary matters relevant to the existing, emerging, and imagined intersections between qualitative health research (QHR) and generative artificial intelligence (GenAI). The essay's central argument is that the increasing reliance on GenAI in QHR is eroding scholarly craftspersonship and should be challenged. In order to present and justify this argument, I posit five coordinated observations: The growing body of literature on using GenAI in qualitative research is reducing qualitative research to coding and pattern recognition; the turn to GenAI disincentivizes reading and stultifies qualitative health researchers; the infatuation with GenAI amplifies the process of McDonaldization of QHR; the time that GenAI saves us isn't being used to become better researchers; and our tendency to humanize GenAI may dehumanize us, whereas craftspersonship is a state of being human. Grounded in these observations, I make a case for embedding a techno-negative stance called neo-luddism in the political culture of QHR. I suggest that this might be an urgent task, for the relation of cruel techno-optimism that some qualitative researchers have established with GenAI can rapidly lead to their own obsolescence. Needless to say, no GenAI has been purposely employed to craft this article.
    Keywords:  AI; CAQDAS; Chat GPT; McDonaldization; craft skills; large language models; neo-luddism
    DOI:  https://doi.org/10.1177/10497323251365198
  13. Reg Anesth Pain Med. 2025 Sep 02. pii: rapm-2025-106852. [Epub ahead of print]
       INTRODUCTION: The use of artificial intelligence (AI) in the scientific process is advancing at a remarkable speed, thanks to continued innovations in large language models. While AI provides widespread benefits, including editing for fluency and clarity, it also has drawbacks, including fabricated content, perpetuation of bias, and lack of accountability. The editorial board of Regional Anesthesia & Pain Medicine (RAPM) therefore sought to develop best practices for AI usage and disclosure.
    METHODS: A steering committee from the American Society of Regional Anesthesia and Pain Medicine used a modified Delphi process to address definitions, disclosure requirements, authorship standards, and editorial oversight for AI use in publishing. The committee reviewed existing publication guidelines and identified areas of ambiguity, which were translated into questions and distributed to an expert workgroup of authors, reviewers, editors, and AI researchers.
    RESULTS: Two survey rounds, with 91% and 87% response rates, were followed by focused discussion and clarification to identify consensus recommendations. The workgroup achieved consensus on recommendations to authors about definitions of AI, required items to report, disclosure locations, authorship stipulations, and AI use during manuscript preparation. The workgroup formulated recommendations to reviewers about monitoring and evaluating the responsible use of AI in the review process, including the endorsement of AI-detection software, identification of concerns about undisclosed AI use, situations where AI use may necessitate the rejection of a manuscript, and use of checklists in the review process. Finally, there was consensus about AI-driven work, including required and optional disclosures and the use of checklists for AI-associated research.
    DISCUSSION: Our modified Delphi study identified practical recommendations on AI use during the scientific writing and editorial process. The workgroup highlighted the need for transparency, human accountability, protection of patient confidentiality, editorial oversight, and the need for iterative updates. The proposed framework enables authors and editors to harness AI's efficiencies while maintaining the fundamental principles of responsible scientific communication and may serve as an example for other journals.
    Keywords:  EDUCATION; Methods; TECHNOLOGY
    DOI:  https://doi.org/10.1136/rapm-2025-106852
  14. Front Artif Intell. 2025;8: 1579375
     Introduction: Large language models and their applications have gained significant attention due to their strengths in natural language processing.
    Methods: In this study, ChatGPT and DeepSeek are used as AI models to assist in diagnosis based on their responses to clinical questions. ChatGPT, Claude, and DeepSeek are also used to analyze images to assess their potential diagnostic capabilities, applying various sensitivity analyses. We employ prompt engineering techniques and evaluate the models' ability to generate high-quality responses, proposing several prompts and using them to elicit key information on conjunctivitis.
    Results: Our findings show that DeepSeek excels in offering precise and comprehensive information on specific topics related to conjunctivitis, providing detailed explanations and in-depth medical insights. In contrast, ChatGPT provides generalized public-facing information on the infection, which makes it more suitable for broader, less technical discussions. In this study, DeepSeek also achieved better performance, with a 7% hallucination rate compared with ChatGPT's 13%. Claude demonstrated perfect accuracy (100%) in binary classification, significantly outperforming ChatGPT's 62.5% accuracy.
    Discussion: DeepSeek showed limited performance in interpreting the conjunctivitis image dataset. This comparative analysis serves as an insightful reference for scholars and health professionals applying these models in varying medical contexts.
    Keywords:  ChatGPT; DeepSeek; comprehensiveness; eye infection; prompts
    DOI:  https://doi.org/10.3389/frai.2025.1579375