bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-06-15
six papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Dent. 2025 Jun 04. pii: S0300-5712(25)00321-5. [Epub ahead of print] 105877
       OBJECTIVES: This study assessed the performance of chatbots in the screening step of a systematic review (SR), using tooth segmentation on dental radiographs with artificial intelligence (AI) as an exemplary topic.
    METHODS: A comprehensive systematic search was performed in December 2024 across seven databases: PubMed, Scopus, Web of Science, Embase, IEEE, Google Scholar, and arXiv. Five chatbots (ChatGPT-4, Claude 2 100k, Claude Instant 100k, Meta's LLaMA 3, and Gemini) were evaluated for their ability to screen articles on AI-based tooth segmentation on radiographs. The evaluations took place from January to February 2025 and focused on screening quality, measured against the expert reviewers' screening using accuracy, precision, sensitivity, specificity, and F1-score, as well as on inter-rater agreement between chatbots using Cohen's kappa.
    RESULTS: A total of 891 studies were screened. Significant variability in the numbers of included and excluded studies was observed across chatbots (chi-square test, p < 0.001), with Claude Instant 100k having the highest inclusion rate (54.88%) and ChatGPT-4 the lowest (29.52%). Gemini excluded the most studies (67.90%), while ChatGPT-4 marked the highest proportion of studies for full-text review (5.39%). Fleiss' kappa (-0.147, p < 0.001) indicated systematic disagreement between chatbots, i.e., agreement worse than chance. Performance metrics varied: ChatGPT-4 had the highest precision (24%) and accuracy (75%) measured against the human expert reviewers, while Claude Instant 100k had the highest sensitivity (96%) but the lowest precision (16%).
    CONCLUSION: Chatbots showed limited accuracy and low inter-rater agreement during study screening. Human oversight remains necessary during systematic reviewing.
    CLINICAL SIGNIFICANCE: In principle, chatbots can streamline SR tasks such as screening. However, human oversight remains critical to maintain the integrity of the review.
    Keywords:  Artificial intelligence; Dentistry; Healthcare; Oral radiology; Systematic review
    DOI:  https://doi.org/10.1016/j.jdent.2025.105877
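    For readers less familiar with the screening metrics reported above, the sketch below (Python, using invented include/exclude labels rather than the study's data or code) shows how accuracy, precision, sensitivity, specificity, F1-score, and Cohen's kappa can be derived from a chatbot's decisions paired with the expert reviewers' reference decisions.

      # Hypothetical sketch: screening metrics for one chatbot vs expert reviewers.
      # Labels: 1 = include, 0 = exclude. The decisions below are invented.
      from typing import Dict, List

      def screening_metrics(chatbot: List[int], expert: List[int]) -> Dict[str, float]:
          tp = sum(c == 1 and e == 1 for c, e in zip(chatbot, expert))
          fp = sum(c == 1 and e == 0 for c, e in zip(chatbot, expert))
          fn = sum(c == 0 and e == 1 for c, e in zip(chatbot, expert))
          tn = sum(c == 0 and e == 0 for c, e in zip(chatbot, expert))
          n = tp + fp + fn + tn
          precision   = tp / (tp + fp) if tp + fp else 0.0
          sensitivity = tp / (tp + fn) if tp + fn else 0.0
          specificity = tn / (tn + fp) if tn + fp else 0.0
          f1 = (2 * precision * sensitivity / (precision + sensitivity)
                if precision + sensitivity else 0.0)
          # Cohen's kappa: observed agreement corrected for chance agreement.
          p_o = (tp + tn) / n
          p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
          kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
          return {"accuracy": p_o, "precision": precision, "sensitivity": sensitivity,
                  "specificity": specificity, "f1": f1, "kappa": kappa}

      print(screening_metrics([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
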
  2. Eur J Public Health. 2025 Jun 10. pii: ckaf072. [Epub ahead of print]
      Large language models (LLMs) such as OpenAI's ChatGPT (generative pretrained transformer) offer potentially great benefits for systematic review production and quality assessment, but careful assessment and comparison with standard practice are needed. Two custom GPT models were developed to compare an LLM's performance against human judgments in "Risk-of-bias (ROB)" assessment and "Levels of engagement reached (LOER)" classification, and inter-rater agreement was calculated. For overall judgments, the ROB GPT classified slightly more studies as "low risk" (27.8% vs 22.2%) and "some concerns" (58.3% vs 52.8%) than the research team, whose "high risk" judgments were double those of the GPT (25.0% vs 13.9%). For total judgments, the research team assigned slightly more "low risk" (59.7% vs 55.1%) and almost double the "high risk" (11.1% vs 5.6%) compared with the ROB GPT, which assigned more "some concerns" (39.4% vs 29.2%) (P = .366). In the LOER analysis, the GPT vs the researchers classified 91.7% vs 25.0% of studies at the "Collaborate" level, 5.6% vs 61.1% as "Shared leadership", and 2.8% vs 13.9% as "Involve", while the GPT classified no studies in the first two engagement levels vs 8.3% and 13.9%, respectively, by the researchers (P = .169). A mixed-effects ordinal logistic regression showed an odds ratio (OR) of 0.97 (95% confidence interval [CI] 0.647-1.446, P = .874) for ROB and an OR of 1.00 (95% CI 0.397-2.543, P = .992) for LOER compared with the researchers. Partial agreement on some judgments was observed. Further evaluation of these promising tools is needed to enable their effective and reliable introduction into scientific practice.
    DOI:  https://doi.org/10.1093/eurpub/ckaf072
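    As a rough illustration of the inter-rater agreement reported above (not the study's code; the judgments below are invented), a weighted Cohen's kappa can be computed for the ordinal risk-of-bias categories so that "low risk" vs "high risk" disagreements are penalised more heavily than adjacent-category disagreements.

      # Hypothetical sketch: agreement between a custom "ROB GPT" and a research
      # team on ordinal risk-of-bias judgments; all judgments are invented.
      from sklearn.metrics import cohen_kappa_score

      categories = ["low risk", "some concerns", "high risk"]
      gpt_judgments  = ["low risk", "some concerns", "some concerns", "low risk", "high risk"]
      team_judgments = ["low risk", "some concerns", "high risk", "some concerns", "high risk"]

      # Linear weights penalise disagreements by how many categories apart they are.
      kappa = cohen_kappa_score(gpt_judgments, team_judgments,
                                labels=categories, weights="linear")
      print(f"Linearly weighted Cohen's kappa: {kappa:.2f}")
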
  3. Value Health. 2025 Jun 04. pii: S1098-3015(25)02369-1. [Epub ahead of print]
       OBJECTIVE: Cost-effectiveness analyses (CEAs) generate extensive data that can support a broad range of health economic research. However, manual data collection is time-consuming and prone to errors. Developments in Artificial Intelligence (AI) and large language models (LLMs) offer a potential solution for automating this process. This study aims to evaluate the accuracy of LLM-based data extraction and assess its feasibility for supporting CEA data collection.
    METHODS: We evaluated the performance of a custom ChatGPT model (GPT), the Tufts CEA Registry (TCRD), and researcher-validated data (RVE) in extracting 36 predetermined variables from 34 selected structured papers. Concordance rates between GPT and RVE, TCRD and RVE, and GPT and TCRD were calculated and compared. Paired t-tests assessed differences in accuracy, and concordance across the 36 variables was reported.
    RESULTS: The concordance rate between GPT and RVE was comparable to the concordance rate between TCRD and RVE (mean 0.88, SD 0.06 vs. mean 0.90, SD 0.06, p = 0.71). The performance of GPT varied across variables. GPT outperformed TCRD in capturing "Population and Intervention Details" but struggled with complex "Utility" variables.
    CONCLUSION: This study demonstrates that LLMs such as GPT can be promising tools for automating CEA data extraction, offering accuracy comparable to established registries. However, human supervision and expertise are essential to address challenges with complex variables.
    Keywords:  Artificial Intelligence (AI); Cost-effectiveness analysis (CEA); Data extraction; Large language models (LLMs)
    DOI:  https://doi.org/10.1016/j.jval.2025.05.008
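    As a rough sketch of the concordance analysis above (not the authors' code; all values are invented), per-variable concordance is simply the share of papers in which an extracted value matches the researcher-validated value, and a paired t-test can then compare the per-variable concordance of the two extraction sources.

      # Hypothetical sketch: concordance of GPT and registry (TCRD) extractions
      # against researcher-validated data (RVE), compared with a paired t-test.
      from scipy.stats import ttest_rel

      def concordance(extracted, reference):
          """Share of papers where the extracted value matches the reference value."""
          return sum(e == r for e, r in zip(extracted, reference)) / len(reference)

      # A single variable across four papers (invented values).
      rve = ["QALY", "QALY", "LY", "QALY"]
      gpt = ["QALY", "QALY", "QALY", "QALY"]
      print(concordance(gpt, rve))  # 0.75

      # Per-variable concordance rates for the two comparisons (invented; the
      # study used 36 variables), compared with a paired t-test.
      gpt_vs_rve  = [0.91, 0.85, 0.79, 0.94, 0.88]
      tcrd_vs_rve = [0.93, 0.88, 0.84, 0.92, 0.90]
      t_stat, p_value = ttest_rel(gpt_vs_rve, tcrd_vs_rve)
      print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
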
  4. J Med Syst. 2025 Jun 12. 49(1): 80
      In the context of Evidence-Based Practice (EBP), Systematic Reviews (SRs), Meta-Analyses (MAs), and overviews of reviews have become cornerstones for the synthesis of research findings. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 and Preferred Reporting Items for Overviews of Reviews (PRIOR) statements have become the major reporting guidelines for SRs/MAs and for overviews of reviews, respectively. In recent years, advances in Generative Artificial Intelligence (genAI) have been proposed as a potential major paradigm shift in scientific research. The main aim of this research was to examine the performance of four large language models (LLMs) in the analysis of adherence to PRISMA 2020 and PRIOR in a sample of 20 SRs and 20 overviews of reviews. We tested the free versions of four commonly used LLMs: ChatGPT (GPT-4o), DeepSeek (V3), Gemini (2.0 Flash), and Qwen (2.5 Max). Adherence to PRISMA 2020 and PRIOR was compared with scores previously assigned by human experts, using several statistical tests. All four LLMs showed low performance in the analysis of adherence to PRISMA 2020, overestimating the percentage of adherence by 23% to 30%. For PRIOR, the LLMs showed smaller differences in the estimation of adherence (6% to 14%), and ChatGPT performed similarly to the human experts. This is the first report of the performance of four commonly used LLMs for the analysis of adherence to PRISMA 2020 and PRIOR. Future studies of adherence to other reporting guidelines will be helpful in health sciences research.
    Keywords:  Evidence-based practice; Generative artificial intelligence; Meta-research; Overview of reviews; Reporting guidelines; Systematic reviews; Umbrella reviews
    DOI:  https://doi.org/10.1007/s10916-025-02212-0
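    As a simple illustration of the adherence comparison above (invented data; the study's scoring rubric and statistical tests are not reproduced here), per-review adherence can be computed as the percentage of checklist items judged as reported, and the LLM-vs-expert difference then summarised in percentage points.

      # Hypothetical sketch: adherence to a reporting checklist, with
      # 1 = item judged adequately reported, 0 = not reported. Data are invented.
      def adherence_pct(checklist):
          """Percentage of checklist items judged as reported."""
          return 100.0 * sum(checklist) / len(checklist)

      # Item-level judgments for three reviews (shortened checklists for brevity).
      expert = [[1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
                [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],
                [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]]
      llm    = [[1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
                [1, 1, 0, 1, 1, 1, 1, 1, 0, 1],
                [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]

      diffs = [adherence_pct(l) - adherence_pct(e) for l, e in zip(llm, expert)]
      print(f"LLM overestimates adherence by {sum(diffs) / len(diffs):.1f} "
            f"percentage points on average")
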
  5. AMIA Jt Summits Transl Sci Proc. 2025;2025:607-613
      Although rare diseases (RD) are gaining priority in healthcare worldwide, developing research policies for studying them in public settings remains challenging due to the limited evidence available. Evidence generation is crucial for rare diseases and requires systematic assessment of study quality across multiple sources. Given the scarcity of patients, literature, and clinical trial data for orphan drugs, we developed RD-LIVES, a tool designed to automatically accelerate evidence collection from the literature and clinical trials for systematic reviews and meta-analyses. The tool enhances understanding of treatment outcomes, helps determine appropriate follow-up durations, and informs the required treatment effect size for new drugs. Using Idiopathic Pulmonary Fibrosis (IPF) as an example, we demonstrate how RD-LIVES automates evidence collection and element extraction. The results indicate that RD-LIVES can play a vital role in designing costly prospective trials and has the potential to increase the likelihood of successful trial outcomes.
    Keywords:  IPF; Idiopathic Pulmonary Fibrosis; forced vital capacity; meta-analysis; overall survival; pirfenidone; progression-free survival; systematic review
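    The RD-LIVES pipeline itself is not reproduced here; as a loose illustration of the kind of automated literature retrieval such a tool might start from (an assumption about the workflow, not the authors' implementation), the sketch below queries PubMed through the public NCBI E-utilities API for IPF/pirfenidone studies. The query string and parameters are illustrative only.

      # Loose illustration only: retrieving candidate IPF literature from PubMed
      # via NCBI E-utilities. Not RD-LIVES code; query and parameters are assumptions.
      import json
      import urllib.parse
      import urllib.request

      ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
      query = ('"idiopathic pulmonary fibrosis"[Title/Abstract] '
               'AND pirfenidone[Title/Abstract]')
      params = urllib.parse.urlencode({
          "db": "pubmed",
          "term": query,
          "retmode": "json",
          "retmax": 20,
      })

      with urllib.request.urlopen(f"{ESEARCH}?{params}") as response:
          result = json.load(response)["esearchresult"]

      print(f"{result['count']} records matched; first PMIDs: {result['idlist']}")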