bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025–07–06
eight papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Cancer Innov. 2025 Aug;4(4): e70021
       Background: Conducting a systematic review (SR) is a time-intensive process and represents the first phase in developing a clinical practice guideline (CPG). Completing a CPG through the Program in Evidence-Based Care (PEBC), a globally acknowledged guideline program supported by Ontario Health (Cancer Care Ontario), typically takes about 2 years. Thus, expediting an SR can significantly reduce the overall time required to complete a CPG. Our recently published review identified two artificial intelligence (AI) tools, DistillerSR and EPPI-Reviewer, that reduced the time spent on title and abstract screening in the SR process when developing a CPG. However, the consistency and generalizability of these tools remain unclear within or across different SRs related to cancer. This study protocol aims to evaluate and compare the performance of DistillerSR and EPPI-Reviewer against human reviewers for title and abstract screening (Stage I screening) in cancer CPG development.
    Methods: We will conduct a retrospective simulation study to evaluate and compare the performance of DistillerSR and EPPI-Reviewer across 10 previously published CPGs by PEBC. These CPGs include the five cancer types with the highest incidence (lung, breast, prostate, colorectal, and bladder). We will run 30 simulation trials for one CPG per AI tool. Primary outcomes are workload savings and time savings in Stage I screening. The secondary outcome is the percentage of missing articles among the final included articles, which informs the accuracy and comprehensiveness of the AI tools. Descriptive and inferential statistical analyses will be conducted to evaluate the outcomes.
    Results: This is a study protocol. The data presented in the tables are illustrative examples rather than actual study results, in accordance with the journal's standard structure. All data included in the final study will be thoroughly validated.
    Discussion: This will be the first study to investigate and compare the performance of DistillerSR and EPPI-Reviewer in Stage I screening of SRs in CPGs across different cancer types. These findings will inform the reliable use of AI tools in future cancer-related CPGs. The results from this retrospective study will need to be confirmed by prospective studies.
    Keywords:  DistillerSR; EPPI‐reviewer; abstract screening; artificial intelligence; cancer screening; clinical practice guidelines; simulation study; systematic review; workload and time savings
    DOI:  https://doi.org/10.1002/cai2.70021
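
To make the protocol's outcome measures concrete, the following minimal Python sketch computes workload savings, time savings, and the percentage of missed included articles for one hypothetical simulation trial. The function name, the assumed seconds per record, and the example counts are illustrative assumptions, not values or code from the protocol.

```python
def stage1_outcomes(total_records, ai_excluded, human_included, ai_missed,
                    seconds_per_record=30):
    """Outcome measures for one simulated Stage I (title/abstract) screening trial."""
    # Primary outcomes: share of records the human reviewers never have to screen,
    # and the screening time that share represents.
    workload_savings = ai_excluded / total_records
    time_savings_hours = ai_excluded * seconds_per_record / 3600
    # Secondary outcome: percentage of the guideline's final included articles
    # that the AI tool would have excluded at Stage I.
    pct_missed = 100 * ai_missed / human_included
    return workload_savings, time_savings_hours, pct_missed

# Illustrative numbers only.
print(stage1_outcomes(total_records=5000, ai_excluded=3200, human_included=40, ai_missed=1))
```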
  2. Int J Med Inform. 2025 Jul 01. pii: S1386-5056(25)00252-7. [Epub ahead of print] 203:106035
       BACKGROUND: Healthcare literature reviews underpin evidence-based practice and clinical guideline development, with citation screening as a critical yet time-consuming step. This study evaluates the effectiveness of individual large language models (LLMs) versus ensemble approaches in automating citation screening to improve the efficiency and scalability of evidence synthesis in healthcare research.
    METHODS: Performance was assessed across three healthcare-focused reviews: LLM-Healthcare (865 citations, broad scope, 49.8% inclusion rate), MCI-Speech (959 citations, narrow scope, 6.5% inclusion rate), and Multimodal-LLM (73 citations, moderate scope, 68.5% inclusion rate). Six LLMs (GPT-4o Mini, GPT-4o, Gemini Flash, Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, Llama 3.1 405B Instruct) were evaluated using zero- and few-shot learning strategies with PubMedBERT for demonstration selection. We compared individual model performance with ensemble methods, including majority voting and random forest (RF), based on sensitivity and specificity.
    RESULTS: No individual LLM consistently outperformed others across all tasks. The review with narrow inclusion criteria and a low inclusion rate exhibited high specificity but lower sensitivity. Ensemble methods consistently surpassed individual LLMs: the RF ensemble with GPT-4o performed best in LLM-Healthcare (sensitivity: 0.96, specificity: 0.89); majority voting with 1-shot LLMs (sensitivity: 0.75, specificity: 0.86) and the RF ensemble with 4-shot LLMs (sensitivity: 0.62, specificity: 0.97) excelled in MCI-Speech; and four RF ensembles achieved perfect classification (sensitivity: 1.0, specificity: 1.0) in Multimodal-LLM.
    CONCLUSION: Ensemble approaches improve individual LLMs' performances in citation screening across diverse healthcare review tasks, highlighting their potential to enhance evidence synthesis workflows that support clinical decision-making. However, broader validation is needed before real-world implementation.
    Keywords:  Ensemble learning; Large language model; Majority voting
    DOI:  https://doi.org/10.1016/j.ijmedinf.2025.106035
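
The two ensembling strategies compared in the study above (majority voting and a random forest meta-classifier over per-LLM decisions) can be sketched in a few lines of Python. This is a generic illustration under assumed data structures (a 0/1 include vote per citation per LLM), not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def majority_vote(votes):
    """votes: array of shape (n_citations, n_llms) holding 0/1 include decisions."""
    votes = np.asarray(votes)
    return (votes.mean(axis=1) >= 0.5).astype(int)  # include if at least half the LLMs vote include

def rf_ensemble(train_votes, train_labels, test_votes):
    """Treat per-LLM votes as features and let a random forest learn how to weight them."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(train_votes, train_labels)
    return rf.predict(test_votes)

# Three LLMs screening four citations (toy data).
votes = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
print(majority_vote(votes))  # -> [1 0 1 0]
```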
  3. Stud Health Technol Inform. 2025 Jun 26;328:121-125
      Data cleaning has a significant role in improving data quality. Although manual data cleaning is possible, it is time-consuming and error-prone, which highlights the need for automated data cleaning approaches. ChatGPT is one of the tools that may be used to automate the data cleaning process. In the present study, we aimed to evaluate the performance of ChatGPT-4o in data cleaning. According to the study results, ChatGPT-4o achieved mean accuracies of 94.3%, 92.5%, 92.8%, and 70.0% in cleaning the gender, hemoglobin, route, and urine glucose variables, respectively. Accuracy was consistent across three trials for the gender, hemoglobin, and route variables. However, significant variation was observed across trials for the urine glucose variable. While the findings emphasize the potential of ChatGPT-4o in data cleaning, further research addressing the limitations of our study is needed.
    Keywords:  ChatGPT; Data Cleaning; Data Quality; Large Language Models (LLMs)
    DOI:  https://doi.org/10.3233/SHTI250685
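
As a rough illustration of the evaluation design in the study above (repeated cleaning runs scored against a manually cleaned gold standard), the short Python sketch below computes per-trial and mean accuracy for one variable. The variable, values, and number of trials are invented for the example.

```python
def accuracy(cleaned, gold):
    """Fraction of values the model cleaned to the same value as the gold standard."""
    return sum(c == g for c, g in zip(cleaned, gold)) / len(gold)

gold = ["F", "M", "F", "M", "F"]           # manually cleaned "gender" column
trials = {                                  # three independent model runs on the same raw data
    "trial_1": ["F", "M", "F", "M", "F"],
    "trial_2": ["F", "M", "F", "F", "F"],
    "trial_3": ["F", "M", "F", "M", "F"],
}
per_trial = {name: accuracy(vals, gold) for name, vals in trials.items()}
mean_acc = sum(per_trial.values()) / len(per_trial)
print(per_trial, f"mean: {mean_acc:.1%}")
```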
  4. medRxiv. 2025 Jun 10. pii: 2025.06.09.25329285. [Epub ahead of print]
      We have developed a free, public web-based tool, Trials to Publications, https://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/TrialPubLinking/trial_pub_link_start.cgi, which employs a machine learning model to predict which publications are likely to present clinical outcome results from a given registered trial in ClinicalTrials.gov. The tool has reasonably high precision, yet in a recent study we found that when registry mentions are not explicitly listed in metadata, textual clues (in the title, abstract, or other metadata) could identify only roughly one-third to one-half of the publications with high confidence. This finding has led us to expand the scope of the tool to search for explicit mentions of registry numbers located within the full text of publications. We have now retrieved ClinicalTrials.gov registry number mentions (NCT numbers) from the full text of three online biomedical article collections (open access PubMed Central, EuroPMC, and OpenAlex), as well as retrieving biomedical citations that are mentioned within the ClinicalTrials.gov registry itself. These methods greatly increase the recall of identifying linked publications and should assist those carrying out evidence syntheses as well as those studying the meta-science of clinical trials.
    Highlights: Those conducting systematic reviews, other evidence syntheses, and meta-science analyses often need to examine published evidence arising from clinical trials. Finding publications linked to a given trial is a difficult manual process, but several automated tools have been developed. The Trials to Publications tool is the only free, public, currently maintained web-based tool that predicts publications linked to a given trial in ClinicalTrials.gov. A recent analysis indicated that the Trials to Publications tool has good precision but limited recall. In the present paper, we greatly enhanced the recall by identifying registry mentions in the full text of articles indexed in open access PubMed Central, EuroPMC, and OpenAlex. The tool now has reasonably comprehensive coverage of registry mentions, both for identifying articles that present trial outcome results and for other types of articles that are linked to, or that discuss, the trials. This should greatly save effort during web searches of the literature.
    DOI:  https://doi.org/10.1101/2025.06.09.25329285
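
The full-text step described above boils down to finding explicit ClinicalTrials.gov identifiers (the letters NCT followed by eight digits) in article text and linking them back to the registry. The regular expression and function below are a generic sketch of that idea, not the tool's actual code.

```python
import re

# NCT identifiers are "NCT" followed by eight digits; allow optional whitespace or a hyphen.
NCT_PATTERN = re.compile(r"NCT[\s-]?(\d{8})", re.IGNORECASE)

def extract_nct_ids(full_text):
    """Return the set of normalized NCT identifiers mentioned in a document."""
    return {f"NCT{digits}" for digits in NCT_PATTERN.findall(full_text)}

print(extract_nct_ids("Registered at ClinicalTrials.gov (NCT01234567); see also nct 07654321."))
# -> {'NCT01234567', 'NCT07654321'}
```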
  5. JMIR AI. 2025 Apr 04;4:e64447
       Background: The application of large language models (LLMs) in analyzing expert textual online data is a topic of growing importance in computational linguistics and qualitative research within health care settings.
    Objective: The objective of this study was to understand how LLMs can help analyze expert textual data. Topic modeling enables scaling thematic analysis to the content of a large corpus, but the resulting topics still require interpretation. We investigate the use of LLMs to help researchers scale this interpretation.
    Methods: The primary methodological phases of this project were (1) collecting data representing posts to an online nurse forum, as well as cleaning and preprocessing the data; (2) using latent Dirichlet allocation (LDA) to derive topics; (3) using human categorization for topic modeling; and (4) using LLMs to complement and scale the interpretation of thematic analysis. The purpose is to compare the outcomes of human interpretation with those derived from LLMs.
    Results: There is substantial agreement (247/310, 80%) between LLM and human interpretation. For two-thirds of the topics, human evaluation and LLMs agree on alignment and convergence of themes. Furthermore, LLM subthemes offer depth of analysis within LDA topics, providing detailed explanations that align with and build upon established human themes. Nonetheless, LLMs identify coherence and complementarity where human evaluation does not.
    Conclusions: LLMs enable the automation of the interpretation task in qualitative research. However, challenges remain in using LLMs to evaluate the resulting themes.
    Keywords:  ChatGPT; artificial intelligence; generative AI; health care; large language models; machine learning
    DOI:  https://doi.org/10.2196/64447
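
Methodological phases 2-4 above (derive LDA topics, then have humans and an LLM interpret them) can be outlined in a short Python sketch using scikit-learn; the corpus, topic count, and the prompt that would be sent to an LLM are all invented placeholders, not the study's data or code.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = [  # stand-in for cleaned forum posts
    "staffing ratios on night shift feel unsafe",
    "charting software keeps crashing during handoff",
    "burnout and mandatory overtime are wearing us down",
    "new scheduling app helped with shift swaps",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:5]]
    # In the study, a prompt like this would go to an LLM, and its suggested
    # theme and subthemes would be compared against the human categorization.
    prompt = f"Suggest a theme and subthemes for a topic whose top words are: {', '.join(top_words)}."
    print(k, top_words)
```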
  6. MethodsX. 2025 Dec;15:103431
      This paper presents a protocol for using ChatGPT to perform content analysis. The protocol involves converting a codebook, outlining categories, descriptions, coding rules, and possible values, into a structured prompt that guides ChatGPT's analysis. The protocol was validated through analysis of 980 research articles to identify research approaches and data collection methods. ChatGPT achieved high performance in identifying data collection methods, but faced challenges with poorly defined or underrepresented categories, particularly in mixed methods research. Overall, while it scored well for quantitative (0.96) and qualitative (0.82) studies, it struggled with mixed methods (0.60), highlighting the need for clear methodological definitions.
    • The protocol enhances coding efficiency and demonstrates the feasibility of using AI for content analysis, potentially streamlining the coding process in research.
    • Challenges arose in categories that were not clearly defined (big data), underrepresented (ethnography), or hierarchically related (Interview & Discourse/Textual analysis).
    • Interrater metrics indicated a substantial level of agreement, reinforcing the potential of ChatGPT in content analysis while emphasizing the importance of clear methodological definitions.
    Keywords:  ChatGPT; Content Analysis using ChatGPT; Content analysis; Data collection method; Large language model; Protocol; Research approach; Scientific abstracts
    DOI:  https://doi.org/10.1016/j.mex.2025.103431
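
The core of the protocol above is mechanical: each codebook entry (category, description, coding rule, allowed values) becomes one block of a structured prompt. A minimal Python sketch of that conversion follows; the codebook entries and prompt wording are invented examples, not the published protocol's prompt.

```python
codebook = {
    "research_approach": {
        "description": "Overall methodological approach of the study.",
        "rule": "Assign exactly one value based on the abstract.",
        "values": ["quantitative", "qualitative", "mixed methods"],
    },
    "data_collection_method": {
        "description": "Primary data collection method reported.",
        "rule": "If several methods are reported, choose the one used for the main analysis.",
        "values": ["survey", "interview", "experiment", "big data", "ethnography"],
    },
}

def codebook_to_prompt(codebook, abstract):
    """Turn codebook categories, rules, and allowed values into one structured prompt."""
    lines = ["You are coding a research abstract. For each category, return one allowed value."]
    for name, spec in codebook.items():
        lines.append(f"- {name}: {spec['description']} Rule: {spec['rule']} "
                     f"Allowed values: {', '.join(spec['values'])}.")
    lines.append(f"Abstract: {abstract}")
    lines.append("Answer as JSON with one key per category.")
    return "\n".join(lines)

print(codebook_to_prompt(codebook, "We surveyed 300 nurses about their use of workflow tools..."))
```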
  7. Med Ref Serv Q. 2025 Jul 02:1-12
      Libraries with systematic review services rely on technology, often selected based on institutional subscriptions, for internal communication and data collection. Many libraries rely on manual data entry despite the availability of no- or low-code software, such as Microsoft Power Automate® or Zapier, for automating and optimizing team workflows. This case study describes how one library implemented Power Automate® flows to automate email reminders, support project management tasks, coordinate workflows across a large team, collect data, and facilitate assessment and reporting.
    Keywords:  Automation; library services; low-code; no-code; project management; statistics; systematic reviews
    DOI:  https://doi.org/10.1080/02763869.2025.2520222
  8. Arch Toxicol. 2025 Jul 04.
      Risk of bias is a critical factor influencing the reliability and validity of toxicological studies, impacting evidence synthesis and decision-making in regulatory and public health contexts. The traditional approaches for assessing risk of bias are often subjective and time-consuming. Recent advancements in artificial intelligence (AI) offer promising solutions for automating and enhancing bias detection and evaluation. This article reviews key types of bias (such as selection, performance, detection, attrition, and reporting biases) in in vivo, in vitro, and in silico studies. It further discusses specialized tools, including the SYRCLE and OHAT frameworks, designed to address such biases. The integration of AI-based tools into risk of bias assessments can significantly improve the efficiency, consistency, and accuracy of evaluations. However, AI models are themselves susceptible to algorithmic and data biases, necessitating robust validation and transparency in their development. The article highlights the need for standardized, AI-enabled risk of bias assessment methodologies, training, and policy implementation to mitigate biases in AI-driven analyses. Strategies for leveraging AI to screen studies, detect anomalies, and support systematic reviews are explored. By adopting these advanced methodologies, toxicologists and regulators can enhance the quality and reliability of toxicological evidence, promoting evidence-based practices and ensuring more informed decision-making. The way forward includes fostering interdisciplinary collaboration, developing bias-resilient AI models, and creating a research culture that actively addresses bias through transparent and rigorous practices.
    Keywords:  AI bias; Artificial intelligence; Evidence-based toxicology; OHAT; Regulatory toxicology; Risk of bias; SYRCLE; Systematic review; ToxRTool; Toxicology
    DOI:  https://doi.org/10.1007/s00204-025-03978-5