bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-10-05
eight papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Stud Health Technol Inform. 2025 Oct 02. 332: 22-26
      Artificial intelligence, particularly Large Language Models (LLMs) such as ChatGPT, is emerging as a potentially transformative support for traditionally complex and time-consuming Systematic Literature Reviews (SLRs). In this study, we compared the traditional SLR process, executed in accordance with Cochrane guidelines, with an AI-assisted approach using ChatGPT across various stages, from research question formulation to report writing. Effectiveness was assessed through quantitative measurement of time savings at each phase. Results showed substantial time reductions in several operational tasks, including creating the Gantt chart, generating search terms, and suggesting selection criteria. However, critical issues arose in stages requiring interpretative judgement, such as analyzing results, assessing risk of bias, and final drafting. While AI cannot replace the role of the researcher, it is a valuable tool for optimizing the SLR workflow. The combination of human expertise and LLM capabilities presents a promising solution, provided it is accompanied by continuous development of AI systems to improve their reliability, transparency and interoperability.
    Keywords:  Artificial Intelligence (AI); Evidence Based Medicine (EBM); Large Language Model (LLM); Systematic Literature Review (SLR)
    DOI:  https://doi.org/10.3233/SHTI251488
  2. Cochrane Evid Synth Methods. 2025 Nov;3(6): e70050
       Background: Elicit AI aims to simplify and accelerate the systematic review process without compromising accuracy. However, research on Elicit's performance is limited.
    Objectives: To determine whether Elicit AI is a viable tool for systematic literature searches and title/abstract screening stages.
    Methods: We compared the studies included in four evidence syntheses with those identified using the subscription-based version of Elicit Pro in Review mode. We calculated sensitivity and precision and observed patterns in Elicit's performance.
    Results: The sensitivity of Elicit was poor, averaging 39.5% (25.5-69.2%) compared to 94.5% (91.1-98.0%) in the original reviews. However, Elicit identified some included studies not identified by the original searches and had an average precision of 41.8% (35.6-46.2%), which was higher than the 7.55% average of the original reviews (0.65-14.7%).
    Discussion: At the time of this evaluation, Elicit did not search with high enough sensitivity to replace traditional literature searching. However, the high precision of searching in Elicit could prove useful for preliminary searches, and the unique studies identified mean that Elicit can be used by researchers as a useful adjunct.
    Conclusion: Whilst Elicit searches are currently not sensitive enough to replace traditional searching, Elicit is continually improving, and further evaluations should be undertaken as new developments take place.
    Keywords:  artificial intelligence (AI); evidence synthesis; literature searching; research methodology; systematic review
    DOI:  https://doi.org/10.1002/cesm.70050
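
    Sensitivity and precision here follow the usual retrieval definitions: sensitivity is the share of a review's included studies that a search retrieves, and precision is the share of retrieved records that end up included. A minimal sketch in Python, using invented study sets rather than data from this evaluation:

      # Sketch: search sensitivity and precision; the sets below are invented examples.
      def search_metrics(retrieved: set, included: set):
          """Return (sensitivity, precision) for one search strategy."""
          hits = retrieved & included                      # included studies the search found
          return len(hits) / len(included), len(hits) / len(retrieved)

      included = {f"study_{i}" for i in range(1, 11)}      # 10 studies in the final review
      elicit   = {"study_1", "study_2", "study_3", "study_4", "noise_1", "noise_2"}
      original = (included - {"study_9"}) | {f"noise_{i}" for i in range(3, 100)}

      for name, retrieved in [("Elicit", elicit), ("original search", original)]:
          sens, prec = search_metrics(retrieved, included)
          print(f"{name}: sensitivity {sens:.1%}, precision {prec:.1%}")

    With these invented sets the same pattern appears as in the paper: the narrower search scores higher on precision but lower on sensitivity.
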
  3. Cochrane Evid Synth Methods. 2025 Jul;3(4): e70033
       Aim: We aimed to compare data extraction from randomized controlled trials performed by Elicit with that performed by human reviewers.
    Background: Elicit is an artificial intelligence tool which may automate specific steps in conducting systematic reviews. However, the tool's performance and accuracy have not been independently assessed.
    Methods: For comparison, we sampled 20 randomized controlled trials from which data had been extracted manually by a human reviewer. We assessed the variables study objectives, sample characteristics and size, study design, interventions, outcomes measured, and intervention effects, and classified the results into "more," "equal," "partially equal," and "deviating" extractions. The STROBE checklist was used to report the study.
    Results: We analysed 20 randomized controlled trials from 11 countries. The studies covered diverse healthcare topics. Across all seven variables, Elicit extracted "more" data in 29.3% of cases, "equal" in 20.7%, "partially equal" in 45.7%, and "deviating" in 4.3%. Elicit provided "more" information for the variables study design (100%) and sample characteristics (45%). In contrast, for more nuanced variables, such as "intervention effects," Elicit's extractions were less detailed, with 95% rated as "partially equal."
    Conclusions: Elicit was capable of extracting partly correct data for our predefined variables. Variables like "intervention effect" or "intervention" may require a human reviewer to complete the data extraction. Our results suggest that verification by human reviewers is necessary to ensure that all relevant information is captured completely and correctly by Elicit.
    Implications: Systematic reviews are labor-intensive. The data extraction process may be facilitated by artificial intelligence tools. Use of Elicit may require a human reviewer to double-check the extracted data.
    Keywords:  artificial intelligence; data extraction; human reviewer; randomized controlled trial; systematic review
    DOI:  https://doi.org/10.1002/cesm.70033
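
    The percentages above are proportions of ratings pooled across the seven extraction variables for the 20 trials (140 ratings in total). A minimal tally sketch, with fabricated ratings standing in for the reviewers' judgements:

      # Sketch: summarising ratings of AI-extracted data; the ratings below are fabricated.
      from collections import Counter

      CATEGORIES = ["more", "equal", "partially equal", "deviating"]

      # one rating per trial for each extraction variable (only two variables shown here)
      ratings = {
          "study design":         ["more"] * 20,
          "intervention effects": ["partially equal"] * 19 + ["deviating"],
      }

      pooled = Counter(label for labels in ratings.values() for label in labels)
      total = sum(pooled.values())
      for category in CATEGORIES:
          print(f"{category}: {pooled.get(category, 0) / total:.1%} of {total} ratings")
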
  4. BMC Med Res Methodol. 2025 Sep 29. 25(1): 219
       BACKGROUND: Supervised learning can accelerate article screening in systematic reviews, but still requires labor-intensive manual annotation. While large language models (LLMs) like GPT-3.5 offer a rapid and convenient alternative, their reliability remains a challenge. This study aims to design an efficient and reliable annotation method for article screening.
    METHODS: Given that relevant articles are typically a small subset of the articles retrieved during screening, we propose a human-LLM collaborative annotation method that focuses on verifying positive annotations made by the LLM. Initially, we optimized the prompt using a manually annotated standard dataset, refining it iteratively to achieve near-perfect recall for the LLM. Subsequently, the LLM, guided by the optimized prompt, annotated the articles, followed by human verification of the LLM-identified positive samples. This method was applied to screen articles on precision oncology randomized controlled trials, evaluating both its efficiency and reliability.
    RESULTS: For prompt optimization, a standard dataset of 200 manually annotated articles was divided equally (1:1) into a tuning set and a validation set. Through iterative prompt optimization, the LLM achieved near-perfect recall, reaching 100% in the tuning set and 85.71% in the validation set. Using the optimized prompt, we conducted collaborative annotation. To evaluate its performance, we manually reviewed a random sample of 300 articles that had been annotated using the collaborative annotation method. The results showed that the collaborative annotation achieved an F1 score of 0.9583, reducing the annotation workload by approximately 80% compared to manual annotation alone. Additionally, we trained a BioBERT-based supervised model on the collaborative annotation data, which outperformed the model trained on data annotated solely by the LLM, further validating the reliability of the collaborative annotation method.
    CONCLUSIONS: The human-LLM collaborative annotation method demonstrates potential for enhancing the efficiency and reliability of article screening, offering valuable support for systematic reviews and meta-analyses.
    Keywords:  Article screening; Human-LLM collaboration; Precision oncology randomized controlled trials
    DOI:  https://doi.org/10.1186/s12874-025-02674-3
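
    The efficiency gain in this collaborative design comes from reviewers re-checking only the records the LLM flags as positive, which is a small fraction of the corpus when relevant articles are rare. A minimal sketch of that verification loop and the reported metrics, with invented counts, and with llm_label and human_verify standing in for the prompted GPT-3.5 call and the reviewer check:

      # Sketch: human-LLM collaborative screening; all counts and labels here are invented.
      def collaborative_screen(records, llm_label, human_verify):
          llm_positives = [r for r in records if llm_label(r)]       # recall-tuned LLM pass
          confirmed = [r for r in llm_positives if human_verify(r)]  # humans check positives only
          workload_saved = 1 - len(llm_positives) / len(records)     # records humans never read
          return confirmed, workload_saved

      def f1_score(tp, fp, fn):
          precision = tp / (tp + fp)
          recall = tp / (tp + fn)
          return 2 * precision * recall / (precision + recall)

      # invented evaluation: 300 records, 55 LLM positives, 50 confirmed relevant,
      # and 2 relevant records the LLM missed (verification leaves no false positives)
      print(f"F1 = {f1_score(tp=50, fp=0, fn=2):.4f}")
      print(f"workload saved = {1 - 55 / 300:.0%}")
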
  5. Rev Med Chil. 2025 Sep; 153(9): 641-645. pii: S0034-98872025000900641. [Epub ahead of print]
      Recently, there has been a surge in technological tools designed to automate tasks across various areas of health sciences, including the identification of evidence used in the development of evidence syntheses that inform clinical practice guideline (CPG) recommendations. Simultaneously, there has been a significant increase in the production of systematic reviews, meaning that much of the relevant evidence is already included in existing reviews.
    AIM: To compare the performance of the semi-automated Epistemonikos Evidence Matrix tool with that of a traditional manual literature search in identifying studies for the development of clinical practice guidelines.
    MATERIALS AND METHODS: During the development of three CPGs (focused on HIV/AIDS, pediatric asthma management, and stroke management), we compared studies identified through a traditional search strategy in MEDLINE, Embase, and the Cochrane Library with those found using a strategy based on existing systematic reviews, via the Epistemonikos database. The traditional search employed keyword-based strategies and a specific filter for randomized controlled trials. In contrast, the Epistemonikos-based strategy relied on the semi-automated Evidence Matrix tool, which identifies studies shared across two or more systematic reviews.
    RESULTS: Across the three guidelines, 8,466 potentially relevant articles were identified using the traditional method, compared to 6,771 using the Epistemonikos-based method. Of these, 155 studies (1.8%) were deemed truly relevant in the traditional search, versus 103 (1.5%) in the Epistemonikos-based approach (p = 0.14). The approach based on existing reviews demonstrated significantly higher precision (94% vs. 78%, p < 0.01) but lower sensitivity (58% vs. 88%, p < 0.01) compared to the traditional search.
    CONCLUSIONS: The evidence search strategy based on existing systematic reviews is an efficient and reliable alternative for identifying relevant studies to support evidence-based decision-making.
    DOI:  https://doi.org/10.4067/s0034-98872025000900641
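
    The precision and sensitivity contrasts above can be checked as differences between two proportions. The abstract does not name the test used, so the sketch below applies a pooled two-proportion z-test, with invented counts, as one plausible way such p-values are obtained:

      # Sketch: two-sided z-test for a difference between two proportions; counts are invented.
      from math import sqrt
      from statistics import NormalDist

      def two_proportion_z(x1, n1, x2, n2):
          p1, p2 = x1 / n1, x2 / n2
          pooled = (x1 + x2) / (n1 + n2)                        # pooled proportion under H0
          se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
          z = (p1 - p2) / se
          return z, 2 * (1 - NormalDist().cdf(abs(z)))

      # e.g. precision: truly relevant records over records each strategy put forward
      z, p = two_proportion_z(x1=103, n1=110, x2=155, n2=199)
      print(f"z = {z:.2f}, p = {p:.4g}")
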
  6. ArXiv. 2024 Apr 01. pii: arXiv:2311.11211v3. [Epub ahead of print]
      Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.
  7. Forensic Sci Med Pathol. 2025 Oct 02.
      The predictive capability of machine learning plays a crucial role in aiding forensic practitioners' decision-making when forming opinions. However, the intricate specialization and complexity involved in developing machine learning models impede their comprehensive utilization within forensic science research and practical identification. Advanced Data Analysis (ADA) tools based on ChatGPT-4 offer a strategy to address this challenge by simplifying the machine learning process. The objective of this study was to assess the efficacy of machine learning models autonomously built by ADA across diverse tasks, by providing ADA with an array of data types, with postmortem interval (PMI), injury time, and sudden cardiac death (SCD) serving as illustrative examples. ChatGPT ADA is capable of autonomously standardizing data and selecting the optimal machine learning model based on the raw data. A comparison of ADA's predictions with those generated by machine learning models developed by professional data analysts showed that ADA demonstrated robust predictive performance across diverse datasets. Furthermore, no statistically significant differences were observed in the evaluation metrics between ADA's models and those constructed by data analysts. In conclusion, for the forensic field, with its growing number of applications, ChatGPT ADA simplifies the intricate process of building machine learning models and, by emulating human discourse, offers a prospective instrument for the comprehensive implementation of machine learning in forensic research and practice. However, ADA should not supplant researchers but rather serve as a supplementary research tool, and its misuse as an "all in" predatory analysis instrument should be avoided.
    Keywords:  ADA; ChatGPT; LLM; Machine learning
    DOI:  https://doi.org/10.1007/s12024-025-01100-w
  8. ACS Omega. 2025 Sep 23. 10(37): 42127-42134
      The rapid growth of scientific literature demands advanced methodologies to analyze and synthesize research trends efficiently. This paper explores the integration of complex network analysis and large language models (LLMs) to automate the generation of literature analyses, focusing on the field of wearable sensors for health monitoring. Using OpenAlex as a source of scientific papers in this field, paper citation networks were constructed and partitioned into thematic clusters, revealing key subtopics such as flexible graphene-based sensors, gait analysis, and machine learning applications. These clusters, characterized by their term importance and interconnectivity, served as input for LLMs (ChatGPT) to generate structured outlines and descriptive summaries. While the LLMs produced coherent overviews, limitations emerged, including superficial analyses and inaccuracies in the referenced literature. The study demonstrates the potential of combining network-based methodologies with LLMs to create scalable literature reviews, albeit with limitations in depth and accuracy that remain to be addressed. The analyses highlight wearable sensors' transformative role in healthcare, driven by advancements in materials science, artificial intelligence, and device integration, while also identifying critical gaps such as standardization, biocompatibility, and energy efficiency. This hybrid approach offers a promising framework for accelerating scholarly synthesis, though for now human oversight remains essential to ensure rigor and relevance.
    DOI:  https://doi.org/10.1021/acsomega.5c04542
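
    The pipeline described here builds a citation network, partitions it into thematic clusters, and only then asks an LLM for outlines and summaries of each cluster. A minimal sketch of the network half using networkx, with a hand-written edge list standing in for citation pairs retrieved from OpenAlex:

      # Sketch: cluster a citation network into thematic groups before LLM summarisation.
      # The edge list is a hand-written stand-in for citations pulled from OpenAlex.
      import networkx as nx
      from networkx.algorithms.community import greedy_modularity_communities

      edges = [
          ("graphene_sensor_A", "graphene_sensor_B"),
          ("graphene_sensor_B", "graphene_sensor_C"),
          ("gait_analysis_A", "gait_analysis_B"),
          ("gait_analysis_B", "gait_analysis_C"),
          ("ml_health_A", "gait_analysis_A"),
      ]

      graph = nx.Graph(edges)                        # undirected view of who cites whom
      clusters = greedy_modularity_communities(graph)

      for i, cluster in enumerate(clusters, start=1):
          print(f"cluster {i}: {sorted(cluster)}")
          # each cluster's titles and key terms would then go into an LLM prompt
          # asking for a structured outline of that subtopic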