bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-03-23
six papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. BMC Med Res Methodol. 2025 Mar 18. 25(1): 75
       BACKGROUND: Artificial intelligence (AI) tools are increasingly being used to assist researchers with various research tasks, particularly in the systematic review process. Elicit is one such tool that can generate a summary of the question asked, setting it apart from other AI tools. The aim of this study is to determine whether AI-assisted research using Elicit adds value to the systematic review process compared to traditional screening methods.
    METHODS: We compared the results of an umbrella review conducted independently of AI with the results of an AI-based search using the same criteria. Elicit's contribution was assessed against three criteria: repeatability, reliability, and accuracy. For repeatability, the search process was repeated three times on Elicit (trials 1, 2, and 3). For accuracy, the articles retrieved by Elicit were reviewed against the same inclusion criteria as the umbrella review. Reliability was assessed by comparing the publications identified with and without the AI-based search.
    RESULTS: The repeatability test returned 246, 169, and 172 results for trials 1, 2, and 3, respectively. Concerning accuracy, 6 articles were included at the conclusion of the selection process. Regarding reliability, the comparison revealed 3 articles common to both searches, 3 identified exclusively by Elicit, and 17 identified exclusively by the AI-independent umbrella review search.
    CONCLUSION: Our findings suggest that AI research assistants, like Elicit, can serve as valuable complementary tools for researchers when designing or writing systematic reviews. However, AI tools have several limitations and should be used with caution. When using AI tools, certain principles must be followed to maintain methodological rigour and integrity. Improving the performance of AI tools such as Elicit and contributing to the development of guidelines for their use during the systematic review process will enhance their effectiveness.
    Keywords:  Accuracy; Artificial intelligence tools; Reliability; Systematic review writing
    DOI:  https://doi.org/10.1186/s12874-025-02528-y
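    A minimal, hypothetical sketch of the overlap analysis described above, assuming the records included by each search are available as sets of DOIs; the identifiers, counts, and recall calculation below are placeholders for illustration, not study data.

        # Hypothetical illustration of the overlap analysis in item 1: compare the
        # records included via Elicit with those from the AI-independent umbrella
        # review. The DOI values below are placeholders, not study data.
        elicit_included = {"10.1000/a", "10.1000/b", "10.1000/c",
                           "10.1000/d", "10.1000/e", "10.1000/f"}
        manual_included = {"10.1000/a", "10.1000/b", "10.1000/c",
                           "10.1000/g", "10.1000/h"}

        common = elicit_included & manual_included       # found by both searches
        elicit_only = elicit_included - manual_included  # found only with Elicit
        manual_only = manual_included - elicit_included  # missed by Elicit

        # Recall of the AI-assisted search against the AI-independent reference set
        recall = len(common) / len(manual_included)
        print(f"common={len(common)} elicit_only={len(elicit_only)} "
              f"manual_only={len(manual_only)} recall={recall:.0%}")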
  2. J Food Prot. 2025 Mar 18. pii: S0362-028X(25)00040-7. [Epub ahead of print] 100488
      Systematic reviews in food safety research are vital but hindered by the amount of human labor required. The objective of this study was to evaluate the effectiveness of semi-automated active learning models, as an alternative to manual screening, in screening articles by title and abstract for subsequent full-text review and inclusion in a systematic review of food safety literature. We used a dataset of 3,738 articles, previously screened manually in a systematic scoping review of digital food safety tools, of which 214 articles were selected (labeled) via title-abstract screening for further full-text review. On this dataset, we compared three models: (i) Naive Bayes/Term Frequency-Inverse Document Frequency (TF-IDF), (ii) Logistic Regression/Doc2Vec, and (iii) Logistic Regression/TF-IDF, under two scenarios: 1) screening an unlabeled dataset, and 2) screening a labeled benchmark dataset. We show that screening with active learning models offers a significant improvement over manual (random) screening across all models. In the first scenario, given a stopping criterion of screening 5% of total records consecutively without labeling an article as relevant, the three models achieve recalls of (mean ± standard deviation) 99.2±0.8%, 97.9±2.7%, and 98.8±0.4%, respectively, while having viewed only 62.6±3.2%, 58.9±2.9%, and 57.6±3.2% of total records. In general, there was a tradeoff between recall and the number of articles that needed to be screened. In the second scenario, all models perform similarly overall, with similar Work Saved Over Sampling values at the 90% and 95% recall criteria, but models using the TF-IDF feature extractor typically outperform the model using Doc2Vec at finding relevant articles early in screening. In particular, all models outperformed random screening at any recall level. This study demonstrates the promise of incorporating active learning models to facilitate literature synthesis in digital food safety.
    Keywords:  Heuristic Stopping Criteria; Human-in-the-loop; Statistical Stopping Criteria; Text Vectorization; Word Embeddings; Work Saved Over Sampling
    DOI:  https://doi.org/10.1016/j.jfp.2025.100488
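    A simplified sketch of the kind of certainty-based active-learning screening loop evaluated in item 2, using a TF-IDF/Naive Bayes model from scikit-learn and the heuristic stopping rule of a run of consecutive irrelevant records; the data handling, query strategy details, and function names are assumptions for illustration, not the authors' implementation.

        # Simplified active-learning screening loop in the spirit of item 2: a
        # TF-IDF + Naive Bayes model repeatedly suggests the unscreened record most
        # likely to be relevant, and screening stops after a run of consecutive
        # irrelevant records (heuristic stopping criterion).
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB

        def screen(texts, labels, seed_idx, stop_fraction=0.05):
            """texts: title+abstract strings; labels: 1 = relevant (stands in for the
            human screener's decisions); seed_idx: indices screened up front, which
            must include at least one relevant and one irrelevant record.
            Returns the order in which records were screened."""
            X = TfidfVectorizer(stop_words="english").fit_transform(texts)
            screened, order = set(seed_idx), list(seed_idx)
            run_without_relevant = 0
            stop_run = max(1, int(stop_fraction * len(texts)))
            while len(screened) < len(texts) and run_without_relevant < stop_run:
                train = list(screened)
                clf = MultinomialNB().fit(X[train], [labels[i] for i in train])
                pool = [i for i in range(len(texts)) if i not in screened]
                probs = clf.predict_proba(X[pool])[:, list(clf.classes_).index(1)]
                nxt = pool[int(np.argmax(probs))]      # certainty-based query
                screened.add(nxt); order.append(nxt)   # the "human" screens this record
                run_without_relevant = 0 if labels[nxt] == 1 else run_without_relevant + 1
            return order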
  3. JMIR Cancer. 2025 Mar 19. 11 e63347
       Background: Plain language summaries (PLSs) of Cochrane systematic reviews are a simple format for presenting medical information to the lay public. This is particularly important in oncology, where patients have a more active role in decision-making. However, current PLS formats often exceed the readability requirements for the general population. There is still a lack of cost-effective and more automated solutions to this problem.
    Objective: This study assessed whether a large language model (eg, ChatGPT) can improve the readability and linguistic characteristics of Cochrane PLSs about oncology interventions, without changing evidence synthesis conclusions.
    Methods: The dataset included 275 scientific abstracts and corresponding PLSs of Cochrane systematic reviews about oncology interventions. ChatGPT-4 was tasked with converting each scientific abstract into a PLS using 3 prompts, as follows: (1) rewrite this scientific abstract into a PLS to achieve a Simple Measure of Gobbledygook (SMOG) index of 6, (2) rewrite the PLS from prompt 1 so it is more emotional, and (3) rewrite this scientific abstract so it is easier to read and more appropriate for the lay audience. ChatGPT-generated PLSs were analyzed for word count, level of readability (SMOG index), and linguistic characteristics using Linguistic Inquiry and Word Count (LIWC) software and compared with the original PLSs. Two independent assessors reviewed the conclusiveness categories of ChatGPT-generated PLSs and compared them with the original abstracts to evaluate consistency. The conclusion of each abstract about the efficacy and safety of the intervention was categorized as conclusive (positive/negative/equal), inconclusive, or unclear. Group comparisons were conducted using the Friedman nonparametric test.
    Results: ChatGPT-generated PLSs using the first prompt (SMOG index 6) were the shortest and easiest to read, with a median SMOG score of 8.2 (95% CI 8-8.4), compared with the original PLSs (median SMOG score 13.1, 95% CI 12.9-13.4). These PLSs had a median word count of 240 (95% CI 232-248) compared with the original PLSs' median word count of 364 (95% CI 339-388). The second prompt (emotional tone) generated PLSs with a median SMOG score of 11.4 (95% CI 11.1-12), again lower than the original PLSs. PLSs produced with the third prompt (write simpler and easier) had a median SMOG score of 8.7 (95% CI 8.4-8.8). ChatGPT-generated PLSs across all prompts demonstrated reduced analytical tone and increased authenticity, clout, and emotional tone compared with the original PLSs. Importantly, the conclusiveness categorization of the original abstracts was unchanged in the ChatGPT-generated PLSs.
    Conclusions: ChatGPT can be a valuable tool in simplifying PLSs as medically related formats for lay audiences. More research is needed, including oversight mechanisms to ensure that the information is accurate, reliable, and culturally relevant for different audiences.
    Keywords:  AI; ChatGPT; Cochrane; artificial intelligence; decision-making; health communication; health literacy; large language model; medical information; neoplasms; oncology; patient education; plain language
    DOI:  https://doi.org/10.2196/63347
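    The SMOG grade used in item 3 is computed from counts of polysyllabic words and sentences. The sketch below is a rough illustration with a crude vowel-group syllable heuristic; the published analysis would rely on a validated readability tool rather than this approximation.

        # Rough SMOG readability calculation as used in item 3 to compare
        # ChatGPT-generated and original plain language summaries. The syllable
        # counter is a crude vowel-group heuristic for illustration only.
        import re
        from math import sqrt

        def count_syllables(word):
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def smog(text):
            sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
            words = re.findall(r"[A-Za-z']+", text)
            polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
            # McLaughlin's SMOG grade formula (intended for samples of 30+ sentences)
            return 1.0430 * sqrt(polysyllables * (30 / len(sentences))) + 3.1291

        sample = ("The treatment reduced pain. Patients recovered quickly. "
                  "No serious side effects were reported.")
        print(round(smog(sample), 1))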
  4. Proc (IEEE Int Conf Healthc Inform). 2024 Jun. 2024: 694-702
      Large Language Models (LLMs), enhanced with Clinical Practice Guidelines (CPGs), can significantly improve Clinical Decision Support (CDS). However, approaches for incorporating CPGs into LLMs are not well studied. In this study, we develop three distinct methods for incorporating CPGs into LLMs: Binary Decision Tree (BDT), Program-Aided Graph Construction (PAGC), and Chain-of-Thought-Few-Shot Prompting (CoT-FSP), and focus on CDS for COVID-19 outpatient treatment as the case study. Zero-Shot Prompting (ZSP) is our baseline method. To evaluate the effectiveness of the proposed methods, we create a set of synthetic patient descriptions and conduct both automatic and human evaluation of the responses generated by four LLMs: GPT-4, GPT-3.5 Turbo, LLaMA, and PaLM 2. All four LLMs exhibit improved performance when enhanced with CPGs compared to the baseline ZSP. BDT outperformed both CoT-FSP and PAGC in automatic evaluation. All of the proposed methods demonstrate high performance in human evaluation. LLMs enhanced with CPGs outperform plain LLMs with ZSP in providing accurate recommendations for COVID-19 outpatient treatment, highlighting the potential for broader applications beyond the case study.
    Keywords:  artificial intelligence; clinical decision support; generative ai; large language models; prompting
    DOI:  https://doi.org/10.1109/ichi61247.2024.00111
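    A generic illustration of the Binary Decision Tree (BDT) idea from item 4: guideline logic encoded as yes/no questions whose answers would come from an LLM reading the patient description. The questions, recommendations, and stubbed answer function below are placeholders, not the paper's actual COVID-19 guideline content or prompts.

        # Generic Binary Decision Tree sketch for guideline-based CDS (item 4).
        # In practice, answer() would wrap an LLM call that reads the patient
        # description and returns True/False for each node's question.
        from dataclasses import dataclass
        from typing import Callable, Optional

        @dataclass
        class Node:
            question: Optional[str] = None        # yes/no question about the patient
            yes: Optional["Node"] = None
            no: Optional["Node"] = None
            recommendation: Optional[str] = None  # set only on leaf nodes

        def traverse(node: Node, answer: Callable[[str], bool]) -> str:
            while node.recommendation is None:
                node = node.yes if answer(node.question) else node.no
            return node.recommendation

        tree = Node(
            question="Is the patient at high risk of progression to severe disease?",
            yes=Node(
                question="Is the preferred antiviral available and not contraindicated?",
                yes=Node(recommendation="Recommend the preferred outpatient antiviral."),
                no=Node(recommendation="Recommend an alternative therapy per guideline.")),
            no=Node(recommendation="Supportive care and routine follow-up."),
        )

        print(traverse(tree, answer=lambda q: "high risk" in q))  # stubbed LLM answers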
  5. Drug Saf. 2025 Mar 15.
       INTRODUCTION: Manual identification of case narratives with specific relevant information can be challenging when working with large numbers of adverse event reports (case series). The process can be supported with a search engine, but building search queries often remains a manual task. Suggesting terms to add to the search query could support assessors in the identification of case narratives within a case series.
    OBJECTIVE: The aim of this study is to explore the feasibility of identifying case narratives containing specific characteristics with a narrative search engine supported by artificial intelligence (AI) query suggestions.
    METHODS: The narrative search engine uses Best Match 25 (BM25) and suggests additional query terms from two word embedding models providing English and biomedical words to a human in the loop. We calculated the percentage of relevant narratives retrieved by the system (recall) and the percentage of retrieved narratives relevant to the search (precision) on an evaluation dataset including narratives from VigiBase, the World Health Organization global database of adverse event reports for medicines and vaccines. Exact-match search and BM25 search with the Relevance Model (RM3), an alternative way to expand queries, were used as comparators.
    RESULTS: The gold standard included 55/750 narratives labelled as relevant. Our narrative search engine retrieved on average 56.4% of the relevant narratives (recall), which was higher than with exact-match search (21.8%), without a significant drop in precision (from 54.5% to 43.1%). Recall was also higher than with RM3 (34.4%).
    CONCLUSIONS: Our study demonstrates that a narrative search engine supported by AI query suggestions can be a viable alternative to an exact-match search and BM25 search with RM3, since it can facilitate the retrieval of additional relevant narratives during signal assessments.
    DOI:  https://doi.org/10.1007/s40264-025-01529-6
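    A minimal BM25 scorer in the spirit of item 5, ranking case narratives against a query that may have been expanded with assessor-accepted suggestion terms (in the paper these come from word-embedding models; here they are simply extra strings). Tokenization, parameters, and the example narratives are simplifications and placeholders.

        # Minimal Okapi BM25 ranking of case narratives against an expanded query.
        import math
        from collections import Counter

        def bm25_rank(narratives, query_terms, k1=1.5, b=0.75):
            docs = [n.lower().split() for n in narratives]
            N = len(docs)
            avgdl = sum(len(d) for d in docs) / N
            df = Counter(t for d in docs for t in set(d))   # document frequencies
            scores = []
            for d in docs:
                tf = Counter(d)
                score = 0.0
                for q in query_terms:
                    if df[q] == 0:
                        continue
                    idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
                    score += idf * tf[q] * (k1 + 1) / (
                        tf[q] + k1 * (1 - b + b * len(d) / avgdl))
                scores.append(score)
            return sorted(range(N), key=lambda i: scores[i], reverse=True)

        narratives = ["patient developed a rash after vaccination",
                      "severe anaphylaxis reported following the second dose",
                      "no adverse reaction was observed"]
        query = ["rash", "anaphylaxis", "urticaria"]        # base term + suggested terms
        print(bm25_rank(narratives, query))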
  6. medRxiv. 2025 Mar 07. pii: 2025.03.06.25323516. [Epub ahead of print]
      Indexing articles by their publication type and study design is essential for efficient search and filtering of the biomedical literature, but is understudied compared to indexing by MeSH topical terms. In this study, we leveraged the human-curated publication types and study designs in PubMed to generate a dataset of more than 1.2M articles (titles and abstracts) and used state-of-the-art Transformer-based models for automatic tagging of publication types and study designs. Specifically, we trained PubMedBERT-based models using a multi-label classification approach, and explored undersampling, feature verbalization, and contrastive learning to improve model performance. Our results show that PubMedBERT provides a strong baseline for publication type and study design indexing; undersampling, feature verbalization, and unsupervised contrastive loss have a positive impact on performance, whereas supervised contrastive learning degrades performance. We obtained the best overall performance with 80% undersampling and feature verbalization (0.632 macro-F1, 0.969 macro-AUC). The model outperformed previous models (MultiTagger) across all metrics, and the performance difference was statistically significant (p < 0.001). Despite its stronger performance, the model still has room for improvement, and future work could explore features based on the full text as well as model interpretability. We make our data and code available at https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/AMIA.
    DOI:  https://doi.org/10.1101/2025.03.06.25323516
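    A sketch of the multi-label setup described in item 6: a PubMedBERT encoder with a sigmoid/binary-cross-entropy head over publication-type and study-design tags via the Hugging Face transformers API. The checkpoint name, label set, and example input are assumptions for illustration; the authors' actual code is in the linked repository.

        # Multi-label tagging sketch (item 6): PubMedBERT with a BCE-with-logits
        # head, so each publication-type/study-design label is scored independently.
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        LABELS = ["Randomized Controlled Trial", "Systematic Review",
                  "Cohort Study", "Case Report"]                 # placeholder tag set
        CKPT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed checkpoint

        tok = AutoTokenizer.from_pretrained(CKPT)
        model = AutoModelForSequenceClassification.from_pretrained(
            CKPT, num_labels=len(LABELS),
            problem_type="multi_label_classification")           # BCE-with-logits loss

        batch = tok(["Title and abstract text of an article ..."],
                    truncation=True, padding=True, return_tensors="pt")
        targets = torch.tensor([[1.0, 0.0, 0.0, 0.0]])           # multi-hot gold tags

        out = model(**batch, labels=targets)                     # out.loss for training
        probs = torch.sigmoid(out.logits)                        # per-label scores
        predicted = [l for l, p in zip(LABELS, probs[0]) if p > 0.5]
        print(out.loss.item(), predicted)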