bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-05-18
twelve papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. BMC Med Res Methodol. 2025 May 10. 25(1): 130
       BACKGROUND: Systematic reviews (SRs) are essential to formulate evidence-based guidelines but require time-consuming and costly literature screening. Large Language Models (LLMs) can be a powerful tool to expedite SRs.
    METHODS: We conducted a comparative study to evaluate the performance of a commercial tool, Rayyan, and an in-house LLM-based system in automating the screening of a completed SR on Vitamin D and falls. The SR retrieved 14,439 articles, and Rayyan was trained with 2,000 manually screened articles to categorize the rest as most likely to exclude/include, likely to exclude/include and undecided. We analyzed Rayyan's title/abstract screening performance using different inclusion thresholds. For the LLM, we used prompt engineering for title/abstract screening and Retrieval-Augmented Generation (RAG) for full-text screening. We evaluated performance using article exclusion rate (AER), false negative rate (FNR), specificity, positive predictive value (PPV), and negative predictive value (NPV). Additionally, we compared the time required to complete screening steps of the SR using both approaches against the manual screening method.
    RESULTS: Using Rayyan, treating articles categorized as undecided or likely to include as included at title/abstract screening resulted in an AER of 72.1% and an FNR of 5%. The total estimated screening time, including manual review of articles flagged by Rayyan, was 54.7 hours. Lowering the Rayyan inclusion threshold to 'likely to exclude' reduced the FNR to 0% and the AER to 50.7%, but increased the screening time to 81.3 hours. Using the LLM system, after title/abstract and full-text screening, 78 articles remained for manual review, including all 20 identified by traditional methods. The LLM achieved an AER of 99.5%, specificity of 99.6%, PPV of 25.6%, and NPV of 100%, with a total screening time of 25.5 hours, including manual review of the 78 articles, reducing the manual screening time by 95.5%.
    CONCLUSIONS: The LLM-based system significantly enhances SR efficiency compared with manual methods and Rayyan, while maintaining a low FNR.
    Keywords:  Large language models; Prompt engineering; Rayyan AI; Retrieval-augmented generation; Systematic reviews
    DOI:  https://doi.org/10.1186/s12874-025-02583-5
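    A minimal sketch (not from the paper) showing how the LLM arm's reported metrics follow from the counts given in the abstract (14,439 records retrieved; 78 flagged for manual review, among them all 20 true includes); the metric definitions below are my reading of the abstract, not the authors' code:

      # Screening metrics reconstructed from the counts in the abstract above.
      total_retrieved = 14_439
      flagged_for_review = 78        # articles the LLM passed to human reviewers
      true_includes = 20             # includes identified by the traditional SR

      tp = true_includes                       # includes correctly flagged
      fp = flagged_for_review - true_includes  # non-includes flagged anyway
      fn = 0                                   # abstract: all 20 includes retained
      tn = total_retrieved - flagged_for_review - fn

      aer = (tn + fn) / total_retrieved        # share of articles excluded automatically
      fnr = fn / (fn + tp)
      specificity = tn / (tn + fp)
      ppv = tp / (tp + fp)
      npv = tn / (tn + fn)
      print(f"AER {aer:.1%}  FNR {fnr:.1%}  Spec {specificity:.1%}  "
            f"PPV {ppv:.1%}  NPV {npv:.1%}")
      # -> AER 99.5%  FNR 0.0%  Spec 99.6%  PPV 25.6%  NPV 100.0%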
  2. Value Health. 2025 May 08. pii: S1098-3015(25)02335-6. [Epub ahead of print]
    ISPOR Working Group on Generative AI
       OBJECTIVE: This article presents a taxonomy of generative artificial intelligence (AI) for health economics and outcomes research (HEOR), explores emerging applications, outlines methods to improve the accuracy and reliability of AI-generated outputs and describes current limitations.
    METHODS: Foundational generative AI concepts are defined, and current HEOR applications are highlighted, including for systematic literature reviews, health economic modeling, real-world evidence generation, and dossier development. Techniques such as prompt engineering (e.g., zero-shot, few-shot, chain-of-thought, and persona pattern prompting), retrieval-augmented generation, model fine-tuning, domain-specific models, and the use of agents are introduced to enhance AI performance. Limitations associated with the use of generative AI foundation models are described.
    RESULTS: Generative AI demonstrates significant potential in HEOR, offering enhanced efficiency, productivity, and innovative solutions to complex challenges. While foundation models show promise in automating complex tasks, challenges persist in scientific accuracy and reproducibility, bias and fairness, and operational deployment. Strategies to address these issues and improve AI accuracy are discussed.
    CONCLUSION: Generative AI has the potential to transform HEOR by improving efficiency and accuracy across diverse applications. However, realizing this potential requires building HEOR expertise and addressing the limitations of current AI technologies. Ongoing research and innovation will be key to shaping AI's future role in our field.
    DOI:  https://doi.org/10.1016/j.jval.2025.04.2167
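    To make the prompt patterns named above concrete, a small illustrative sketch in Python; the wording and the 'condition X' criteria are invented for illustration and are not drawn from the ISPOR paper:

      # Illustrative prompt patterns (zero-shot, few-shot, persona + chain-of-thought)
      # for a hypothetical SLR screening task; not the Working Group's prompts.
      record = "Title: ... Abstract: ..."   # a citation to be screened

      zero_shot = (
          "Decide whether the following record meets the inclusion criteria of a "
          "systematic literature review on treatments for condition X. "
          f"Answer INCLUDE or EXCLUDE.\n\n{record}"
      )

      few_shot = (
          "Example 1: <record about a phase III trial of drug A> -> INCLUDE\n"
          "Example 2: <record about an animal study> -> EXCLUDE\n\n"
          f"Now classify this record:\n{record}\nAnswer INCLUDE or EXCLUDE."
      )

      persona_chain_of_thought = (
          "You are an experienced HEOR analyst. Think step by step: state the "
          "population, intervention, and study design of the record, compare each "
          "against the review's criteria, then give a final INCLUDE/EXCLUDE call.\n\n"
          f"{record}"
      )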
  3. NPJ Digit Med. 2025 May 13. 8(1): 274
      Integrating large language models (LLMs) into healthcare can enhance workflow efficiency and patient care by automating tasks such as summarising consultations. However, the fidelity between LLM outputs and ground truth information is vital to prevent miscommunication that could lead to compromise in patient safety. We propose a framework comprising (1) an error taxonomy for classifying LLM outputs, (2) an experimental structure for iterative comparisons in our LLM document generation pipeline, (3) a clinical safety framework to evaluate the harms of errors, and (4) a graphical user interface, CREOLA, to facilitate these processes. Our clinical error metrics were derived from 18 experimental configurations involving LLMs for clinical note generation, consisting of 12,999 clinician-annotated sentences. We observed a 1.47% hallucination rate and a 3.45% omission rate. By refining prompts and workflows, we successfully reduced major errors below previously reported human note-taking rates, highlighting the framework's potential for safer clinical documentation.
    DOI:  https://doi.org/10.1038/s41746-025-01670-7
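    A minimal sketch of how sentence-level hallucination and omission rates such as those above could be computed from clinician annotations; the label scheme and the toy lists are hypothetical, not the CREOLA taxonomy or the study data:

      # One label per sentence of an LLM-generated note (hypothetical annotations).
      generated = ["ok", "ok", "hallucination", "ok", "ok"]
      # One label per sentence of the source consultation.
      source = ["covered", "covered", "omitted", "covered"]

      hallucination_rate = generated.count("hallucination") / len(generated)
      omission_rate = source.count("omitted") / len(source)
      print(f"hallucination {hallucination_rate:.2%}, omission {omission_rate:.2%}")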
  4. NPJ Digit Med. 2025 May 15. 8(1): 281
      This paper presents the results of a novel scoping review on the practical models for generating three different types of synthetic health records (SHRs): medical text, time series, and longitudinal data. The innovative aspects of the review, which incorporate study objectives, data modality, and research methodology of the reviewed studies, uncover the importance and the scope of the topic for the digital medicine context. In total, 52 publications met the eligibility criteria for generating medical time series (22), longitudinal data (17), and medical text (13). Privacy preservation was found to be the main research objective of the studied papers, along with class imbalance, data scarcity, and data imputation as the other objectives. The adversarial network-based, probabilistic, and large language models exhibited superiority for generating synthetic longitudinal data, time series, and medical texts, respectively. Finding a reliable performance measure to quantify SHR re-identification risk remains the major research gap in this area.
    DOI:  https://doi.org/10.1038/s41746-024-01409-w
  5. Front Digit Health. 2025;7: 1569554
       Background: Concise synopses of clinical evidence support treatment decision-making but are time-consuming to curate. Large language models (LLMs) offer potential but they may provide inaccurate information. We objectively assessed the abilities of four commercially available LLMs to generate synopses for six treatment regimens in multiple myeloma and amyloid light chain (AL) amyloidosis.
    Methods: We compared the performance of four LLMs: Claude 3.5, ChatGPT 4.0, Gemini 1.0, and Llama-3.1. Each LLM was prompted to write synopses for six regimens. Two hematologists independently assessed accuracy, completeness, relevance, clarity, coherence, and hallucinations using Likert scales. Mean scores with 95% confidence intervals (CI) were calculated across all domains and inter-rater reliability was evaluated using Cohen's quadratic weighted kappa.
    Results: Claude demonstrated the highest performance in all domains, outperforming the other LLMs in accuracy: mean Likert score 3.92 (95% CI 3.54-4.29); ChatGPT 3.25 (2.76-3.74); Gemini 3.17 (2.54-3.80); Llama 1.92 (1.41-2.43); completeness: mean Likert score 4.00 (3.66-4.34); ChatGPT 2.58 (2.02-3.15); Gemini 2.58 (2.02-3.15); Llama 1.67 (1.39-1.95); and extent of hallucinations: mean Likert score 4.00 (4.00-4.00); ChatGPT 2.75 (2.06-3.44); Gemini 3.25 (2.65-3.85); Llama 1.92 (1.26-2.57). Llama performed considerably poorer across all the studied domains. ChatGPT and Gemini had intermediate performance. Notably, none of the LLMs registered perfect accuracy, completeness, or relevance.
    Conclusion: Claude performed at a consistently higher level than the other LLMs, but all tested LLMs required careful editing from a domain expert to become usable. More time will be needed to determine the suitability of LLMs to independently generate clinical synopses.
    Keywords:  cancer treatment synopses; clinical evidence summarization; comparative analysis; large language models; multiple myeloma
    DOI:  https://doi.org/10.3389/fdgth.2025.1569554
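    A minimal sketch of the two statistics reported above, a mean Likert score with a t-based 95% CI and Cohen's quadratic weighted kappa between two raters, using invented ratings rather than the study data:

      import numpy as np
      from scipy import stats
      from sklearn.metrics import cohen_kappa_score

      scores = np.array([4, 4, 3, 5, 4, 3])     # one rater's Likert ratings (invented)
      mean = scores.mean()
      half_width = stats.t.ppf(0.975, df=len(scores) - 1) * stats.sem(scores)
      print(f"mean {mean:.2f} (95% CI {mean - half_width:.2f}-{mean + half_width:.2f})")

      rater1 = [4, 4, 3, 5, 4, 3]               # two raters scoring the same synopses
      rater2 = [4, 3, 3, 5, 5, 3]
      kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")
      print(f"quadratic weighted kappa {kappa:.2f}")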
  6. PLOS Digit Health. 2025 May;4(5): e0000849
      There is a growing number of articles about conversational AI (i.e., ChatGPT) for generating scientific literature reviews and summaries. Yet, comparative evidence lags behind its wide adoption by many clinicians and researchers. We explored ChatGPT's utility for literature search from an end-user perspective through the lens of clinicians and biomedical researchers. We quantitatively compared basic versions of ChatGPT's utility against conventional search methods such as Google and PubMed. We further tested whether ChatGPT user-support tools (i.e., plugins, web-browsing function, prompt-engineering, and custom-GPTs) could improve its responses across four common and practical literature search scenarios: (1) high-interest topics with an abundance of information, (2) niche topics with limited information, (3) scientific hypothesis generation, and (4) newly emerging clinical practice questions. Our results demonstrated that basic ChatGPT functions had limitations in consistency, accuracy, and relevancy. User-support tools showed improvements, but the limitations persisted. Interestingly, each literature search scenario posed different challenges: an abundance of secondary information sources for high-interest topics, and uncompelling literature for new/niche topics. This study tested practical examples highlighting both the potential and the pitfalls of integrating conversational AI into literature search processes, and underscores the necessity for rigorous comparative assessments of AI tools in scientific research.
    DOI:  https://doi.org/10.1371/journal.pdig.0000849
  7. J Med Internet Res. 2025 May 14. 27: e70122
      Generative large language models (LLMs), such as ChatGPT, have significant potential for qualitative data analysis. This paper aims to provide an early insight into how LLMs can enhance the efficiency of text coding and qualitative analysis, and evaluate their reliability. Using a dataset of semistructured interviews with blind gamers, this study provides a step-by-step tutorial on applying ChatGPT 4-Turbo to the grounded theory approach. The performance of ChatGPT 4-Turbo is evaluated by comparing its coding results with manual coding results assisted by qualitative analysis software. The results revealed that ChatGPT 4-Turbo and manual coding methods exhibited reliability in many aspects. The application of ChatGPT 4-Turbo in grounded theory enhanced the efficiency and diversity of coding and updated the overall grounded theory process. Compared with manual coding, ChatGPT showed shortcomings in depth, context, connections, and coding organization. Limitations and recommendations for applying artificial intelligence in qualitative research were also discussed.
    Keywords:  ChatGPT; computer-assisted software; grounded theory; human-AI collaboration; manual coding; performance
    DOI:  https://doi.org/10.2196/70122
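    For orientation, a sketch of how one open-coding step of this kind might be run against the OpenAI chat API (openai>=1.0); the prompt wording, interview excerpt, and parameters are assumptions, not the authors' tutorial:

      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      excerpt = "I rely on the audio cues, but most games don't label their menus."
      prompt = (
          "You are assisting with grounded-theory open coding. Read the interview "
          "excerpt below and return two to four short descriptive codes, one per "
          "line, staying close to the participant's own words.\n\n"
          f"Excerpt: {excerpt}"
      )

      response = client.chat.completions.create(
          model="gpt-4-turbo",
          messages=[{"role": "user", "content": prompt}],
          temperature=0,
      )
      codes = [line.strip("- ").strip()
               for line in response.choices[0].message.content.splitlines()
               if line.strip()]
      print(codes)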
  8. Appl Psychol Health Well Being. 2025 Jun;17(3): e70038
       STUDY OBJECTIVES: The coding of semistructured interview transcripts is a critical step for thematic analysis of qualitative data. However, the coding process is often labor-intensive and time-consuming. The emergence of generative artificial intelligence (GenAI) presents new opportunities to enhance the efficiency of qualitative coding. This study proposed a computational pipeline using GenAI to automatically extract themes from interview transcripts.
    METHODS: Using transcripts from interviews conducted with maternity care providers in South Carolina, we leveraged ChatGPT for inductive coding to generate codes from interview transcripts without a predetermined coding scheme. Structured prompts were designed to instruct ChatGPT to generate and summarize codes. The performance of GenAI was evaluated by comparing the AI-generated codes with those generated manually.
    RESULTS: GenAI demonstrated promise in detecting and summarizing codes from interview transcripts. ChatGPT exhibited an overall accuracy exceeding 80% in inductive coding. More impressively, GenAI reduced the time required for coding by 81%.
    DISCUSSION: GenAI models are capable of efficiently processing language datasets and performing multi-level semantic identification. However, challenges such as inaccuracy, systematic biases, and privacy concerns must be acknowledged and addressed. Future research should focus on refining these models to enhance reliability and address inherent limitations associated with their application in qualitative research.
    Keywords:  Coding; Generative AI; Inductive coding; Maternal health; Thematic analysis
    DOI:  https://doi.org/10.1111/aphw.70038
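    One way the >80% accuracy above could be operationalised is the share of manually assigned codes that the GenAI pipeline also produced for the same segment; the codes below are invented, and the study's matching may have relied on expert judgement rather than exact string overlap:

      manual = {"access to care", "provider shortage", "transport barriers", "trust"}
      genai  = {"access to care", "workforce shortage", "transport barriers", "trust"}

      matches = manual & genai        # exact-match overlap between code sets
      accuracy = len(matches) / len(manual)
      print(f"{len(matches)}/{len(manual)} manual codes recovered ({accuracy:.0%})")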
  9. Cureus. 2025 Apr;17(4): e82005
      Background: This research compared the simple and advanced statistical results of SPSS (IBM Corp., Armonk, NY, USA) with ChatGPT-4 and ChatGPT o3-mini (OpenAI, San Francisco, CA, USA) in statistical data output and interpretation with behavioral healthcare data. It evaluated their methodological approaches, quantitative performance, interpretability, adaptability, ethical considerations, and future trends.
    Methods: Fourteen statistical analyses were conducted on two real datasets that produced peer-reviewed, published scientific articles in 2024. Descriptive statistics, Pearson r, multiple correlation with Pearson r, Spearman's rho, simple linear regression, one-sample t-test, paired t-test, two-independent-sample t-test, multiple linear regression, one-way analysis of variance (ANOVA), repeated measures ANOVA, two-way (factorial) ANOVA, and multivariate ANOVA were computed. The two datasets adhered to systematically structured timeframes, March 19, 2023, through June 11, 2023, and June 7, 2023, through July 7, 2023, thereby ensuring the integrity and temporal representativeness of the data gathering. The analyses were conducted by inputting verbal (text) commands into ChatGPT-4 and ChatGPT o3-mini along with the relevant SPSS variables, which were copied and pasted from the SPSS datasets.
    Results: The study found high concordance between SPSS and ChatGPT-4 in fundamental statistical analyses, such as measures of central tendency, variability, and simple Pearson and Spearman correlation analyses, where the results were nearly identical. ChatGPT-4 also closely matched SPSS in the three t-tests and simple linear regression, with minimal effect size variations. Discrepancies emerged in complex analyses. ChatGPT o3-mini showed inflated correlation values and significant results where none were expected, indicating computational deviations. ChatGPT o3-mini produced inflated coefficients in the multiple correlation and R-squared values in two-way ANOVA and multiple regression, suggesting differing assumptions. ChatGPT-4 and ChatGPT o3-mini produced identical F-statistics with repeated measures ANOVA but reported incorrect degrees of freedom (df) values. While ChatGPT-4 performed well in one-way ANOVA, it miscalculated degrees of freedom in multivariate ANOVA (MANOVA), leading to significant discrepancies. ChatGPT o3-mini also generated erroneous F-statistics in factorial ANOVA, highlighting the need for further optimization in multivariate statistical modeling.
    Conclusions: This study underscored the rapid advancements in artificial intelligence (AI)-driven statistical analyses while highlighting areas that require further refinement. ChatGPT-4 accurately executed fundamental statistical tests, closely matching SPSS. However, its reliability diminished in more advanced statistical procedures, requiring further validation. ChatGPT o3-mini, while optimized for Science, Technology, Engineering, and Mathematics (STEM) applications, produced inconsistencies in correlation and multivariate analyses, limiting its dependability for complex research applications. Ensuring its alignment with established statistical methodologies will be essential for widespread scientific research adoption as AI evolves.
    Keywords:  artificial intelligence in scientific writing; chatgpt; chatgpt 4; chatgpt o3-mini-model; openai; statistical analysis (spss)
    DOI:  https://doi.org/10.7759/cureus.82005
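    Because the entry above repeatedly flags incorrect degrees-of-freedom values, the textbook formulas below offer a quick sanity check on any AI-generated ANOVA table; the sample sizes and toy data are arbitrary, not the study's datasets:

      from scipy import stats

      # One-way ANOVA: k groups, N observations in total -> df = (k - 1, N - k)
      k, N = 3, 45
      print("one-way ANOVA df:", k - 1, N - k)                  # (2, 42)

      # Repeated measures ANOVA: n subjects under k conditions -> df = (k - 1, (k - 1)(n - 1))
      n, k = 20, 3
      print("repeated measures df:", k - 1, (k - 1) * (n - 1))  # (2, 38)

      # Cross-check against a one-way ANOVA actually computed by SciPy
      g1, g2, g3 = [5, 6, 7, 5], [6, 7, 8, 6], [7, 8, 9, 7]
      f, p = stats.f_oneway(g1, g2, g3)
      print(f"F(2, 9) = {f:.2f}, p = {p:.3f}")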
  10. Nature. 2025 May 14.
      
    Keywords:  Lab life; Machine learning; Publishing
    DOI:  https://doi.org/10.1038/d41586-025-01512-2