bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-08-03
ten papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Med Ref Serv Q. 2025 Jul 31. 1-13
      While AI has been used in health sciences libraries for decades, the emergence of publicly available large language models (LLMs) has the potential to change how researchers conduct literature searches for systematic reviews. Using a recently published systematic review as a model, we compared the review's published Medline OVID search strategy with 3 strategies that ChatGPT created when prompted with the review's objective. We then ran the published strategy and each LLM-generated strategy through Medline and compared the results with the articles that the review's authors had identified as important.
    Keywords:  Artificial intelligence; extant datasets; large language models; libraries; precision; recall; search strategies; systematic reviews
    DOI:  https://doi.org/10.1080/02763869.2025.2537075
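    A minimal Python sketch of the kind of comparison described above, computing recall and precision of each strategy's Medline output against the review's key references; the PMID sets below are invented placeholders, not the study's data:

      def recall_precision(retrieved, relevant):
          """Return (recall, precision) of a retrieved PMID set against a relevant set."""
          hits = retrieved & relevant
          recall = len(hits) / len(relevant) if relevant else 0.0
          precision = len(hits) / len(retrieved) if retrieved else 0.0
          return recall, precision

      # Hypothetical PMIDs: results of the published strategy, one ChatGPT-generated
      # strategy, and the review's key included studies.
      published_hits = {"38100001", "38100002", "38100003", "38100004", "38100005"}
      chatgpt_hits = {"38100001", "38100003", "38100006", "38100007"}
      key_references = {"38100001", "38100002", "38100003", "38100004"}

      for name, hits in [("published", published_hits), ("ChatGPT", chatgpt_hits)]:
          r, p = recall_precision(hits, key_references)
          print(f"{name}: recall={r:.2f}, precision={p:.2f}")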
  2. BMC Med Res Methodol. 2025 Jul 31. 25(1): 182
       BACKGROUND: Risk of bias (RoB) assessment is an essential part of systematic reviews that requires reading and understanding each eligible trial as well as the RoB tools. RoB assessment is subject to human error and is time-consuming. Machine learning-based tools have been developed to automate RoB assessment using simple models trained on limited corpora. ChatGPT is a conversational agent based on a large language model (LLM) that was trained on an internet-scale corpus and has demonstrated human-like abilities in multiple areas, including healthcare. LLMs might be able to support systematic reviewing tasks such as assessing RoB. We aim to assess interrater agreement in overall (rather than domain-level) RoB assessment between human reviewers and ChatGPT in randomized controlled trials of medical interventions.
    METHODS: We will randomly select 100 individually- or cluster-randomized, parallel, two-arm trials of medical interventions from recent Cochrane systematic reviews that have been assessed using the RoB1 or RoB2 family of tools. We will exclude reviews and trials that were performed under emergency conditions (e.g., COVID-19), as well as public health and welfare interventions. We will use 25 of the trials and human RoB assessments to engineer a ChatGPT prompt for assessing overall RoB, based on trial methods text. We will obtain ChatGPT assessments of RoB for the remaining 75 trials and human assessments. We will then estimate interrater agreement using Cohen's κ.
    RESULTS: The primary outcome for this study is overall human-ChatGPT interrater agreement. We will report observed agreement with an exact 95% confidence interval, expected agreement under random assessment, Cohen's κ, and a p-value testing the null hypothesis of no difference in agreement. Several other analyses are also planned.
    CONCLUSIONS: This study is likely to provide the first evidence on interrater agreement between human RoB assessments and those provided by LLMs and will inform subsequent research in this area.
    Keywords:  Artificial intelligence; ChatGPT; Large language model; Machine learning; Risk of bias; Systematic reviewing
    DOI:  https://doi.org/10.1186/s12874-025-02631-0
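    A short Python sketch of the planned agreement analysis, computing observed agreement and Cohen's κ between human and ChatGPT overall RoB judgments; the judgments below are invented, and scikit-learn's cohen_kappa_score is assumed to be available:

      from sklearn.metrics import cohen_kappa_score

      # Invented overall RoB judgments for seven trials (human vs. ChatGPT).
      human = ["low", "high", "some concerns", "high", "low", "low", "high"]
      chatgpt = ["low", "high", "low", "high", "low", "some concerns", "high"]

      observed = sum(h == c for h, c in zip(human, chatgpt)) / len(human)
      kappa = cohen_kappa_score(human, chatgpt)
      print(f"observed agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")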
  3. J Med Internet Res. 2025 Jul 29. 27: e69700
       Background: Health Evidence provides access to quality appraisals for >10,000 evidence syntheses on the effectiveness and cost-effectiveness of public health and health promotion interventions. Maintaining Health Evidence has become increasingly resource-intensive due to the exponential growth of published literature. Innovative screening methods using artificial intelligence (AI) can potentially improve efficiency.
    Objective: The objectives of this project are to: (1) assess the ability of AI-assisted screening to correctly predict nonrelevant references at the title and abstract level and investigate the consistency of this performance over time, and (2) evaluate the impact of AI-assisted screening on the overall monthly manual screening set.
    Methods: Training and testing were conducted using the DistillerSR AI Preview & Rank feature. A set of manually screened references (n=43,273) was uploaded and used to train the AI feature and assign probability scores to each reference to predict relevance. A minimum threshold was established where the AI feature correctly identified all manually screened relevant references. The AI feature was tested on a separate set of references (n=72,686) from the May 2019 to April 2020 monthly searches. The testing set was used to determine an optimal threshold that ensured >99% of relevant references would continue to be added to Health Evidence. The performance of AI-assisted screening at the title and abstract screening level was evaluated using recall, specificity, precision, negative predictive value, and the number of references removed by AI. The number and percentage of references removed by AI-assisted screening and the change in monthly manual screening time were estimated using an implementation reference set (n=272,253) from November 2020 to 2023.
    Results: The minimum threshold in the training set of references was 0.068, which correctly removed 37% (n=16,122) of nonrelevant references. Analysis of the testing set identified an optimal threshold of 0.17, which removed 51,706 (71.14%) references using AI-assisted screening. A slight decrease in recall between the 0.068 minimum threshold (99.68%) and the 0.17 optimal threshold (94.84%) was noted, resulting in four missed references included via manual screening at the full-text level. This was accompanied by an increase in specificity from 35.95% to 71.70%, doubling the proportion of references AI-assisted screening correctly predicted as not relevant. Over 3 years of implementation, the number of references requiring manual screening was reduced by 70%, reducing the time spent manually screening by an estimated 382 hours.
    Conclusions: Given the magnitude of newly published peer-reviewed evidence, the curation of evidence supports decision makers in making informed decisions. AI-assisted screening can be an important tool to supplement manual screening and reduce the number of references that require manual screening, ensuring that the continued availability of curated high-quality synthesis evidence in public health is possible.
    Keywords:  automation; citation screening; database management; machine learning; methodology; natural language processing; systematic review; text classification; title and abstract screening
    DOI:  https://doi.org/10.2196/69700
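    A Python sketch of the threshold-setting idea described above: choose the highest relevance-score cut-off that still keeps recall for relevant references above a target, so that everything scoring below it can be dropped from manual screening. The scores and labels are synthetic, not the Health Evidence data:

      import numpy as np

      rng = np.random.default_rng(0)
      n = 10_000
      relevant = rng.random(n) < 0.05                 # ~5% of references are relevant
      scores = np.where(relevant,
                        rng.beta(5, 2, n),            # relevant: higher scores
                        rng.beta(2, 5, n))            # nonrelevant: lower scores

      target_recall = 0.99
      best_threshold = 0.0
      for t in np.linspace(0, 1, 1001):
          kept = scores >= t
          recall = (kept & relevant).sum() / relevant.sum()
          if recall >= target_recall:
              best_threshold = t                      # highest cut-off meeting the target
          else:
              break

      excluded = (scores < best_threshold).sum()
      print(f"threshold={best_threshold:.3f}, "
            f"excluded from manual screening={excluded} ({excluded / n:.1%})")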
  4. Int J Med Inform. 2025 Jul 23. pii: S1386-5056(25)00265-5. [Epub ahead of print] 204: 106048
       BACKGROUND: Large language models (LLMs) have the potential to offer solutions for automating many of the manual tasks involved in scientific reviews, including data extraction, literature screening, summarization, and quality assessment.
    OBJECTIVES: This study aims to evaluate the performance of LLMs in the task of title and abstract screening and full-text data extraction of a scoping review study, by identifying their effectiveness, efficiency, and potential integration into human-based and manual tasks.
    MATERIALS AND METHODS: The following three key steps of a scientific scoping review were automated: 1) Title and Abstract Screening, 2) Full-Text Screening, and 3) Data Extraction based on nine study dimensions. The four most recent lightweight open-source LLMs (Mistral, Vicuna, and Llama 3.2 with 1B and 3B parameters) were applied and evaluated across these steps.
    RESULTS: Llama 3.2-3B demonstrated the best performance in the title and abstract screening, achieving an accuracy of 66%, excelling in the exclusion of papers. For full-text screening, it maintained the highest overall accuracy of 65%, effectively identifying excluded papers. In data extraction, the Mistral model outperformed others across most dimensions, though Llama 3.2-3B excelled in extracting objectives and study implications.
    DISCUSSION AND CONCLUSION: The present study underscores both the potential and limitations of LLMs in automating scoping reviews. Automating the entire scoping review without human intervention is sub-optimal. Using a more controlled approach balances the strengths of LLMs with the need for human judgment, supporting not only the replication of scientific reviews but also their continuous refinement and follow-up over time.
    Keywords:  Automation; Disability; Large language models; Scoping review
    DOI:  https://doi.org/10.1016/j.ijmedinf.2025.106048
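    A rough Python sketch of prompt-based title and abstract screening with a small, locally served model; the prompt wording, the Ollama endpoint, the model tag, and the record are assumptions for illustration, not the authors' protocol:

      import json
      import urllib.request

      PROMPT = """You are screening records for a scoping review on {topic}.
      Title: {title}
      Abstract: {abstract}
      Answer with exactly one word: INCLUDE or EXCLUDE."""

      def ask_llm(prompt, model="llama3.2:3b"):
          """Query a locally served model via the Ollama HTTP API (assumed to be running)."""
          req = urllib.request.Request(
              "http://localhost:11434/api/generate",
              data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
              headers={"Content-Type": "application/json"},
          )
          with urllib.request.urlopen(req) as resp:
              return json.loads(resp.read())["response"].strip()

      record = {"topic": "assistive technology for disability",
                "title": "A hypothetical study title",
                "abstract": "A hypothetical abstract describing the study."}
      print("screening decision:", ask_llm(PROMPT.format(**record)))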
  5. Cochrane Evid Synth Methods. 2025 Jul;3(4): e70038
      Open access scholarly resources have the potential to simplify the literature search process, support more equitable access to research knowledge, and reduce biases from lack of access to relevant literature. OpenAlex is the world's largest open access database of academic research. However, it is not known whether OpenAlex is suitable for comprehensively identifying research for systematic reviews. We present an approach to measure the utility of OpenAlex as part of undertaking a systematic review, and present findings in the context of undertaking a systematic map on the implementation of diabetic eye screening. Procedures were developed to investigate OpenAlex's content coverage and capture, focusing on: (1) availability of relevant research records; (2) retrieval of relevant records from a Boolean search of OpenAlex; (3) retrieval of relevant records from combining a PubMed Boolean search with a citations and related-items search of OpenAlex; and (4) efficient estimation of relevant records not identified elsewhere. The searches were conducted in July 2024 and repeated in March 2025 following removal of certain closed access abstracts from the OpenAlex data set. The original systematic review searches yielded 131 relevant records, and 128 (98%) of these were present in OpenAlex. OpenAlex Boolean searches retrieved 126 (96%) of the 131 records, and partial screening yielded two relevant records not previously known to the review team. Retrieval was reduced to 123 (94%) when the searches were repeated in March 2025. However, the volume of records from the OpenAlex Boolean search was considerably greater than assessed for the original systematic map. Combining a Boolean search from PubMed and OpenAlex network graph searches yielded 93% recall. It is feasible and useful to investigate the use of OpenAlex as a key information resource for health topics. This approach can be modified to investigate OpenAlex for other systematic reviews. However, the volume of records obtained from searches is larger than that obtained from conventional sources, something that could be reduced using machine learning. Further investigations are needed, and our approach should be replicated in other reviews.
    DOI:  https://doi.org/10.1002/cesm.70038
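    A small Python sketch of the record-availability check in step (1), resolving known relevant DOIs against the public OpenAlex API; the endpoint form and the placeholder DOIs are assumptions, not the review's data:

      import urllib.error
      import urllib.request

      def in_openalex(doi):
          """Return True if OpenAlex resolves the DOI to a work record."""
          url = f"https://api.openalex.org/works/https://doi.org/{doi}"
          try:
              with urllib.request.urlopen(url, timeout=30) as resp:
                  return resp.status == 200
          except urllib.error.HTTPError:
              return False

      relevant_dois = [
          "10.1000/example-included-study-1",   # placeholder DOIs only
          "10.1000/example-included-study-2",
      ]
      found = sum(in_openalex(d) for d in relevant_dois)
      print(f"coverage: {found}/{len(relevant_dois)} relevant records found in OpenAlex")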
  6. ArXiv. 2025 Jun 03. pii: arXiv:2506.03321v1. [Epub ahead of print]
      We investigated the feasibility of predicting Medical Subject Headings (MeSH) Publication Types (PTs) from MEDLINE citation metadata using pre-trained Transformer-based models BERT and DistilBERT. This study addresses limitations in the current automated indexing process, which relies on legacy NLP algorithms. We evaluated monolithic multi-label classifiers and binary classifier ensembles to enhance the retrieval of biomedical literature. Results demonstrate the potential of Transformer models to significantly improve PT tagging accuracy, paving the way for scalable, efficient biomedical indexing.
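    A hedged Python sketch of the general approach: a DistilBERT sequence classifier configured for multi-label Publication Type tagging via Hugging Face transformers. The checkpoint, label subset, and threshold are illustrative assumptions, and the classification head is untrained here, so the scores are meaningless until fine-tuning on labelled MEDLINE citations:

      import torch
      from transformers import AutoModelForSequenceClassification, AutoTokenizer

      PT_LABELS = ["Randomized Controlled Trial", "Review", "Case Reports"]  # illustrative subset

      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
      model = AutoModelForSequenceClassification.from_pretrained(
          "distilbert-base-uncased",
          num_labels=len(PT_LABELS),
          problem_type="multi_label_classification",  # sigmoid outputs, BCE loss when training
      )

      citation = "Effect of drug X on outcome Y: a double-blind randomized trial. ..."
      inputs = tokenizer(citation, truncation=True, return_tensors="pt")
      with torch.no_grad():
          probs = torch.sigmoid(model(**inputs).logits)[0]

      predicted = [label for label, p in zip(PT_LABELS, probs) if p > 0.5]
      print(dict(zip(PT_LABELS, probs.tolist())), predicted)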
  7. PLoS One. 2025;20(8): e0329349
       BACKGROUND AND OBJECTIVE: Systematic reviews and meta-analyses are critical in forensic medicine; however, these processes are labor-intensive and time-consuming. ASReview, an open-source machine learning framework, has demonstrated potential to improve the efficiency and transparency of systematic reviews in other disciplines. Nevertheless, its applicability to forensic medicine remains unexplored. This study evaluates the utility of ASReview for forensic medical literature review.
    METHODS: A three-stage experimental design was implemented. First, stratified five-fold cross-validation was conducted to assess ASReview's compatibility with forensic medical literature. Second, incremental learning and sampling methods were employed to analyze the model's performance on imbalanced datasets and the effect of training set size on predictive accuracy. Third, gold standards were translated into computational languages to evaluate ASReview's capacity to address real-world systematic review objectives.
    RESULTS: ASReview exhibited robust viability for screening forensic medical literature. The tool efficiently prioritized relevant studies while excluding irrelevant records, thereby improving review productivity. Model performance remained stable when labeled training data constituted less than 80% of the total sample size. Notably, when the training set proportion ranged from 10% to 55%, ASReview's predictions aligned closely with human reviewer decisions.
    CONCLUSION: ASReview represents a promising tool for forensic medical literature review. Its ability to handle imbalanced datasets and gather goal-oriented information enhances the efficiency and transparency of systematic reviews and meta-analyses in forensic medicine. Further research is required to optimize implementation strategies and validate its utility across diverse forensic medical contexts.
    DOI:  https://doi.org/10.1371/journal.pone.0329349
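    ASReview's own simulation API is not reproduced here; instead, a generic Python stand-in illustrates the cross-validation step, with a TF-IDF plus logistic regression ranker evaluated by stratified five-fold cross-validation on a small synthetic, imbalanced set of records:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import StratifiedKFold
      from sklearn.pipeline import make_pipeline

      # Synthetic, imbalanced corpus: 10% relevant records.
      texts = (["autopsy findings in drowning deaths"] * 10 +
               ["unrelated orthopaedic surgery outcomes"] * 90)
      labels = [1] * 10 + [0] * 90

      ranker = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
      cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

      for fold, (train_idx, test_idx) in enumerate(cv.split(texts, labels), start=1):
          ranker.fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
          scores = ranker.predict_proba([texts[i] for i in test_idx])[:, 1]
          auc = roc_auc_score([labels[i] for i in test_idx], scores)
          print(f"fold {fold}: ranking AUC = {auc:.2f}")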
  8. Cochrane Evid Synth Methods. 2025 Jul;3(4): e70037
       Introduction: Plain language summaries in Cochrane reviews are designed to present key information in a way that is understandable to individuals without a medical background. Despite Cochrane's author guidelines, these summaries often fail to achieve their intended purpose. Studies show that they are generally difficult to read and vary in their adherence to the guidelines. Artificial intelligence is increasingly used in medicine and academia, with its potential being tested in various roles. This study aimed to investigate whether ChatGPT-4o could produce plain language summaries that are as good as the already published plain language summaries in Cochrane reviews.
    Methods: We conducted a randomized, single-blinded study with a total of 36 plain language summaries: 18 human-written and 18 ChatGPT-4o-generated summaries, with both versions covering the same Cochrane reviews. The sample size was calculated to be 36, and each summary was evaluated four times. Each summary was reviewed twice by members of a Cochrane editorial group and twice by laypersons. The summaries were assessed in three different ways: First, all assessors evaluated the summaries for informativeness, readability, and level of detail using a Likert scale from 1 to 10. They were also asked whether they would submit the summary and whether they could identify who had written it. Second, members of a Cochrane editorial group assessed the summaries using a checklist based on Cochrane's guidelines for plain language summaries, with scores ranging from 0 to 10. Finally, the readability of the summaries was analyzed using objective tools such as Lix and Flesch-Kincaid scores. Randomization and allocation to either ChatGPT-4o or human-written summaries were conducted using random.org's random sequence generator, and assessors were blinded to the authorship of the summaries.
    Results: The plain language summaries generated by ChatGPT-4o scored 1 point higher on informativeness (p < .001) and level of detail (p = .004), and 2 points higher on readability (p = .002), compared with human-written summaries. Lix and Flesch-Kincaid scores were high for both groups of summaries, though the ChatGPT summaries were slightly easier to read (p < .001). Assessors found it difficult to distinguish between ChatGPT and human-written summaries, with only 20% correctly identifying ChatGPT-generated text. ChatGPT summaries were preferred for submission over the human-written summaries (64% vs. 36%, p < .001).
    Conclusion: ChatGPT-4o shows promise in creating plain language summaries for Cochrane reviews that are at least as good as human-written summaries and, in some cases, slightly better. This study suggests that ChatGPT-4o could become a tool for drafting easy-to-understand plain language summaries for Cochrane reviews, with quality approaching or matching that of human authors.
    Clinical Trial Registration and Protocol: Available at https://osf.io/aq6r5.
    Keywords:  ChatGPT; artificial intelligence; plain language summaries; randomized controlled trial
    DOI:  https://doi.org/10.1002/cesm.70037
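    A minimal Python sketch of one of the objective readability checks mentioned above, the Lix index (average sentence length plus the percentage of words longer than six characters); the sample text is a placeholder, not a Cochrane summary:

      import re

      def lix(text):
          """Lix readability index: words/sentences + 100 * long_words/words."""
          words = re.findall(r"[A-Za-z]+", text)
          sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
          long_words = [w for w in words if len(w) > 6]
          return len(words) / len(sentences) + 100 * len(long_words) / len(words)

      sample = ("This review looked at whether the treatment helped people feel better. "
                "We found moderate-certainty evidence of a small improvement in symptoms.")
      print(f"Lix score: {lix(sample):.1f}")   # roughly: below 30 = easy, above 50 = difficult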
  9. BMC Med Res Methodol. 2025 Jul 31. 25(1): 184
      
    Keywords:  Bi-directional encoder representations from transformers; Evidence-based medicine; Language model; Named entity recognition; Natural language processing; PICO; Systematic literature review
    DOI:  https://doi.org/10.1186/s12874-025-02624-z
  10. Cureus. 2025 Jun;17(6): e86972
      Pharmacovigilance (PV) is a science that plays a crucial role in protecting patients by detecting adverse drug reactions (ADRs). PV can do this by collecting and analyzing data from a wide variety of healthcare sources. However, traditional PV methods face limitations, particularly in accurately and efficiently analyzing large datasets. This limitation leads to underreported ADRs, which negatively impact many patients. With the recent rise in artificial intelligence, however, PV as a science has the potential to improve. This can be done by incorporating different subsets of AI, such as machine learning (ML) and natural language processing (NLP), into PV. The aim of this study is to describe how integrating AI, specifically ML and NLP, into PV systems can improve data collection, data processing, and the detection of ADRs. A comprehensive literature search was conducted using PubMed and Google Scholar to examine studies that were conducted within the last 30 years. Twenty-eight studies were included in this paper. Inclusion criteria included articles that were written in English, articles focusing on PV as a science, ADRs, AI's current role in PV, and AI's potential role in PV. Exclusion criteria included studies that were not published in English and studies that were published more than 30 years ago. The findings from several systematic reviews that explore the implementation of AI into PV indicate that AI can improve PV by enhancing the efficiency and accuracy of detecting ADRs. Through ML algorithms, ADRs can be identified more quickly and accurately compared with traditional PV methods, while NLP enables AI to extract relevant patient data from unstructured data sources such as electronic health records (EHRs) and to report certain drug interactions more accurately and efficiently. However, there are limitations to incorporating AI into PV. These include ethical, legal, and privacy concerns; interpretative limitations when datasets are incomplete or missing information; the lack of current research; and the need to conduct more research on this topic to definitively determine whether AI should be incorporated into PV. With the exponential development of technology such as AI, there is considerable promise in strengthening PV into a more accurate and efficient ADR detection system. While there is some research highlighting AI's potential to enhance PV, much more research needs to be conducted to fully substantiate this claim. Incorporating AI into PV does, however, have the potential to change ADR detection methods for the better.
    Keywords:  adverse drug reactions (adr); ai and machine learning; artificial intelligence in medicine; natural language processing (nlp); pharmacovigilance
    DOI:  https://doi.org/10.7759/cureus.86972
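    A minimal Python illustration of the NLP idea described above, flagging candidate drug and reaction mentions in unstructured clinical text with a simple term lexicon; real pharmacovigilance pipelines rely on trained named-entity recognition models, and the terms and note below are invented:

      import re

      DRUG_TERMS = {"amoxicillin", "warfarin", "metformin"}
      REACTION_TERMS = {"rash", "bleeding", "nausea", "dizziness"}

      note = ("Patient started amoxicillin five days ago and now reports "
              "a pruritic rash and mild nausea.")

      tokens = set(re.findall(r"[a-z]+", note.lower()))
      drugs = sorted(DRUG_TERMS & tokens)
      reactions = sorted(REACTION_TERMS & tokens)

      if drugs and reactions:
          print(f"candidate ADR signal: drugs={drugs}, reactions={reactions}")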