bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-06-01
eight papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Comput Biol Med. 2025 May 28. pii: S0010-4825(25)00815-7. [Epub ahead of print] 193: 110464
       OBJECTIVE: Developing search strategies for synthesizing evidence on drug harms requires specialized expertise and knowledge. The aim of this study was to evaluate ChatGPT's ability to enhance search strategies for systematic reviews of drug harms by detecting errors of omission and generating the omitted keywords.
    MATERIALS AND METHODS: A literature search in PubMed identified systematic reviews of drug harms published in 10 high-impact journals between 1-Nov-2013 and 27-Nov-2023. Sixteen search strategies used in these reviews were selected, each with a single error of omission introduced. ChatGPT's (GPT-4) performance was evaluated on error detection, on the similarity between the extracted and generated search strategies under strict and semantic keyword matching, and on the proportion of omitted keywords generated.
    RESULTS: ChatGPT identified the introduced error in all search strategies. Under strict matching, the mean Jaccard similarity was 0.17 (range: 0.00-0.52); with semantic matching this increased to 0.23 (range: 0.00-0.53). Similarly, the mean proportion of omitted keywords recreated by ChatGPT was 49% under strict matching, increasing to 71% with semantic matching (a minimal similarity sketch follows this entry).
    DISCUSSION AND CONCLUSION: ChatGPT effectively detected errors and generated relevant keywords, showing potential as a tool for evidence retrieval on drug harms.
    Keywords:  Artificial intelligence; Drug-related side effects and adverse reactions; Evidence-based medicine; Patient safety; Systematic reviews as topic
    DOI:  https://doi.org/10.1016/j.compbiomed.2025.110464
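    To make the strict-matching metrics above concrete, here is a minimal Python sketch (not the study's code) computing Jaccard similarity and the proportion of omitted keywords recovered; the keyword sets are hypothetical and the study's semantic matching is not reproduced.

        # Strict (exact-string) matching between an original and a model-generated keyword set.
        def jaccard(original: set, generated: set) -> float:
            """Jaccard similarity: |intersection| / |union|."""
            if not original and not generated:
                return 1.0
            return len(original & generated) / len(original | generated)

        def omitted_recovered(omitted: set, generated: set) -> float:
            """Proportion of deliberately omitted keywords the model regenerated."""
            return len(omitted & generated) / len(omitted) if omitted else 0.0

        # Hypothetical example terms (not from the paper):
        original = {"drug-related side effects", "adverse drug reaction", "toxicity"}
        generated = {"adverse drug reaction", "toxicity", "drug safety"}
        print(round(jaccard(original, generated), 2))  # 0.5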
  2. J Biomed Inform. 2025 May 28. pii: S1532-0464(25)00089-9. [Epub ahead of print] 104860
       BACKGROUND: Systematic reviews (SRs) require substantial time and human resources, especially during the screening phase. Large Language Models (LLMs) have shown the potential to expedite screening. However, their use in generating structured PICOS (Population, Intervention/Exposure, Comparison, Outcome, Study design) summaries from titles and abstracts to assist human reviewers during screening remains unexplored.
    OBJECTIVE: To assess the impact of open-source (Mistral-Nemo-Instruct-2407) LLM-generated structured PICOS summaries on the speed and accuracy of title and abstract screening.
    METHODS: Four neurology trainees were grouped into two pairs based on previous screening experience. Pair A (A1, A2) consisted of less experienced trainees (1-2 SRs), while Pair B (B1, B2) consisted of more experienced trainees (≥3 SRs). Reviewers A1 and B1 received titles, abstracts, and LLM-generated structured PICOS summaries for each article. Reviewers A2 and B2 received only titles and abstracts. All reviewers independently screened the same set of 1,003 articles using predefined eligibility criteria. Screening times were recorded, and performance metrics were calculated.
    RESULTS: PICOS-assisted reviewers screened significantly faster (A1: 116 min; B1: 90 min) than those without assistance (A2: 463 min; B2: 370 min), an approximately 75% reduction in screening workload. Sensitivity was perfect for PICOS-assisted reviewers (100%), whereas it was lower for those without assistance (88.0% and 92.0%). PICOS-assisted reviewers also demonstrated higher accuracy (99.9%), specificity (99.9%), F1 scores (98.0%), and strong inter-rater reliability (Cohen's kappa of 99.8%); a sketch of these metrics follows this entry. The less experienced reviewer with PICOS assistance (A1) outperformed the more experienced reviewer without assistance (B2) in both efficiency and sensitivity.
    CONCLUSION: LLM-generated PICOS summaries enhance the speed and accuracy of title and abstract screening by providing an additional layer of structured information. With PICOS assistance, a less experienced reviewer surpassed a more experienced peer. Future research should explore the applicability of this approach in fields beyond neurology and its integration into fully automated systems.
    Keywords:  Automation; LLM; Meta-analysis; Screening; Systematic review
    DOI:  https://doi.org/10.1016/j.jbi.2025.104860
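    The screening metrics reported above can be illustrated with a short Python sketch (illustrative only, not the authors' analysis); the toy decision lists below are hypothetical.

        # Sensitivity, specificity, F1, and Cohen's kappa from include/exclude decisions.
        def screening_metrics(gold, pred):
            tp = sum(g and p for g, p in zip(gold, pred))
            tn = sum(not g and not p for g, p in zip(gold, pred))
            fp = sum(not g and p for g, p in zip(gold, pred))
            fn = sum(g and not p for g, p in zip(gold, pred))
            n = tp + tn + fp + fn
            sensitivity = tp / (tp + fn)
            specificity = tn / (tn + fp)
            precision = tp / (tp + fp)
            f1 = 2 * precision * sensitivity / (precision + sensitivity)
            po = (tp + tn) / n                                             # observed agreement
            pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2  # chance agreement
            kappa = (po - pe) / (1 - pe)
            return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1, "kappa": kappa}

        # Toy example: gold-standard vs. one reviewer's include (True) / exclude (False) decisions
        gold = [True, True, False, False, False, True]
        pred = [True, True, False, False, True, True]
        print(screening_metrics(gold, pred))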
  3. Front Artif Intell. 2025; 8: 1587244
      The exponential growth of scientific literature presents challenges for pharmaceutical, biotechnological, and Medtech industries, particularly in regulatory documentation, clinical research, and systematic reviews. Ensuring accurate data extraction, literature synthesis, and compliance with industry standards requires AI tools that not only streamline workflows but also uphold scientific rigor. This study evaluates the performance of AI tools designed for bibliographic review, data extraction, and scientific synthesis, assessing their impact on decision-making, regulatory compliance, and research productivity. The AI tools assessed include general-purpose models like ChatGPT and specialized solutions such as ELISE (Elevated LIfe SciencEs), SciSpace/Typeset, Humata, and Epsilon. The evaluation is based on three main criteria, Extraction, Comprehension, and Analysis, with Compliance and Traceability as additional dimensions (together, ECACT). Human experts established reference benchmarks, while AI Evaluator models ensure objective performance measurement. The study introduces the ECACT score, a structured metric assessing AI reliability in scientific literature analysis, regulatory reporting, and clinical documentation. Results demonstrate that ELISE consistently outperforms other AI tools, excelling in precise data extraction, deep contextual comprehension, and advanced content analysis. ELISE's ability to generate traceable, well-reasoned insights makes it particularly well-suited for high-stakes applications such as regulatory affairs, clinical trials, and medical documentation, where accuracy, transparency, and compliance are paramount. Unlike other AI tools, ELISE provides expert-level reasoning and explainability, ensuring AI-generated insights align with industry best practices. ChatGPT is efficient in data retrieval but lacks precision in complex analysis, limiting its use in high-stakes decision-making. Epsilon, Humata, and SciSpace/Typeset exhibit moderate performance, with variability affecting their reliability in critical applications. In conclusion, while AI tools such as ELISE enhance literature review, regulatory writing, and clinical data interpretation, human oversight remains essential to validate AI outputs and ensure compliance with scientific and regulatory standards. For pharmaceutical, biotechnological, and Medtech industries, AI integration must strike a balance between automation and expert supervision to maintain data integrity, transparency, and regulatory adherence.
    Keywords:  AI tool; Elevated LIfe SciencE solution; data extraction; scientific literature; systematic review
    DOI:  https://doi.org/10.3389/frai.2025.1587244
  4. BMC Med Res Methodol. 2025 May 30. 25(1): 150
       BACKGROUND: With the rise of large language models, the application of artificial intelligence in research is expanding, possibly accelerating specific stages of the research process. This study aims to compare the accuracy, completeness, and relevance of chatbot-generated responses against human responses in evidence synthesis as part of a scoping review.
    METHODS: We employed a structured survey-based methodology to analyse and compare responses from two human researchers and four chatbots (ZenoChat, ChatGPT 3.5, ChatGPT 4.0, and ChatFlash) to questions based on a pre-coded sample of 407 articles. These questions were part of an evidence synthesis for a scoping review dealing with digitally supported interaction between healthcare workers.
    RESULTS: The analysis revealed no significant differences in judgments of correctness between answers by chatbots and those given by humans. However, chatbots' answers were found to recognise the context of the original text better, and they provided more complete, albeit longer, responses. Human responses were less likely to add new content to the original text or include interpretation. Amongst the chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash, with ChatGPT 3.5 and ChatGPT 4.0 tying for third. Correct contextualisation of the answer was positively correlated with completeness and correctness of the answer.
    CONCLUSIONS: Chatbots powered by large language models may be a useful tool to accelerate qualitative evidence synthesis. Given the current speed of chatbot development and fine-tuning, the successful applications of chatbots to facilitate research will very likely continue to expand over the coming years.
    Keywords:  Artificial intelligence; ChatFlash; ChatGPT; Chatbot; Large language model; ZenoChat
    DOI:  https://doi.org/10.1186/s12874-025-02532-2
  5. AMIA Annu Symp Proc. 2024; 2024: 818-827
      Indexing articles by their publication type and study design is essential for efficient search and filtering of the biomedical literature, but is understudied compared to indexing by MeSH topical terms. In this study, we leveraged the human-curated publication types and study designs in PubMed to generate a dataset of more than 1.2M articles (titles and abstracts) and used state-of-the-art Transformer-based models for automatic tagging of publication types and study designs. Specifically, we trained PubMedBERT-based models using a multi-label classification approach (a minimal fine-tuning sketch follows this entry), and explored undersampling, feature verbalization, and contrastive learning to improve model performance. Our results show that PubMedBERT provides a strong baseline for publication type and study design indexing; undersampling, feature verbalization, and unsupervised contrastive loss have a positive impact on performance, whereas supervised contrastive learning degrades performance. We obtained the best overall performance with 80% undersampling and feature verbalization (0.632 macro-F1, 0.969 macro-AUC). The model outperformed previous models (MultiTagger) across all metrics, and the performance difference was statistically significant (p < 0.001). Despite its stronger performance, the model still has room for improvement, and future work could explore features based on full text as well as model interpretability. We make our data and code available at https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/AMIA.
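    As an illustration of the multi-label set-up described above, the sketch below fine-tunes a PubMedBERT-style checkpoint with Hugging Face Transformers; the checkpoint name and label subset are assumptions, and the paper's undersampling, feature verbalization, and contrastive losses are omitted.

        # Minimal multi-label classification sketch (illustrative; not the authors' code).
        import torch
        from transformers import AutoTokenizer, AutoModelForSequenceClassification

        LABELS = ["Randomized Controlled Trial", "Systematic Review", "Cohort Study"]  # hypothetical subset
        ckpt = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"         # assumed checkpoint

        tokenizer = AutoTokenizer.from_pretrained(ckpt)
        model = AutoModelForSequenceClassification.from_pretrained(
            ckpt,
            num_labels=len(LABELS),
            problem_type="multi_label_classification",  # trains with BCEWithLogitsLoss
        )

        text = "Title and abstract of a PubMed article ..."
        enc = tokenizer(text, truncation=True, return_tensors="pt")
        target = torch.tensor([[1.0, 0.0, 0.0]])   # article tagged as an RCT only

        out = model(**enc, labels=target)          # out.loss is the multi-label BCE loss
        probs = torch.sigmoid(out.logits)          # independent per-label probabilities
        predicted = [label for label, p in zip(LABELS, probs[0]) if p > 0.5]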
  6. J Am Acad Dermatol. 2025 May 28. pii: S0190-9622(25)02207-8. [Epub ahead of print]
      
    Keywords:  Systematic Review; cutaneous squamous cell carcinoma; generative AI; reporting guideline
    DOI:  https://doi.org/10.1016/j.jaad.2025.03.101
  7. AMIA Annu Symp Proc. 2024; 2024: 493-502
      Digital health technologies (DHTs) have revolutionized clinical trials, offering unprecedented opportunities to streamline processes, enhance patient engagement, and improve data quality. Growing access to technology devices and broadband is contributing to the increasing number of DHT-enabled trials. Ideally, DHTs have the potential to make clinical research more inclusive and diverse. However, while the variety of digital technologies and implementations is a strong display of healthcare innovation, major challenges arise concerning the generalizability of DHTs and their translation into real-world medical practice. In this study, we report our efforts to accelerate the literature review process for the use of DHTs in randomized controlled trials (RCTs) by leveraging large language models (LLMs), which existing LLM task evaluations have identified as possible tools for scaling evidence harvesting. We designed three tasks for automating title screening and information extraction of DHT-enabled RCTs using multiple LLMs, which yielded promising results towards large-scale literature review (a generic screening sketch follows this entry).
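    A generic sketch of LLM-assisted title screening is shown below; the model name, prompt, and inclusion criterion are illustrative assumptions and do not reproduce the authors' three tasks.

        # Title screening with an LLM via the OpenAI Python client (illustrative only).
        from openai import OpenAI

        client = OpenAI()  # expects OPENAI_API_KEY in the environment

        CRITERION = ("Randomized controlled trial that uses a digital health technology "
                     "(e.g., wearable, mobile app, remote sensor).")  # assumed criterion

        def screen_title(title: str) -> str:
            prompt = (
                "You screen titles for a literature review.\n"
                f"Inclusion criterion: {CRITERION}\n"
                f"Title: {title}\n"
                "Answer with exactly one word: INCLUDE, EXCLUDE, or UNSURE."
            )
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # assumed model
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            return resp.choices[0].message.content.strip().upper()

        print(screen_title("Smartwatch-based remote monitoring after cardiac surgery: a randomized trial"))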
  8. Bioengineering (Basel). 2025 May 02. 12(5): 486. [Epub ahead of print]
      This work introduces TrialSieve, a novel framework for biomedical information extraction that enhances clinical meta-analysis and drug repurposing. By extending traditional PICO (Patient, Intervention, Comparison, Outcome) methodologies, TrialSieve incorporates hierarchical, treatment group-based graphs, enabling more comprehensive and quantitative comparisons of clinical outcomes. TrialSieve was used to annotate 1609 PubMed abstracts, yielding 170,557 annotations and 52,638 final spans across 20 unique annotation categories that capture a diverse range of biomedical entities relevant to systematic reviews and meta-analyses. The performance (accuracy, precision, recall, F1-score) of four natural-language processing (NLP) models (BioLinkBERT, BioBERT, KRISSBERT, PubMedBERT) and the large language model (LLM) GPT-4o was evaluated using the human-annotated TrialSieve dataset (a span-level evaluation sketch follows this entry). BioLinkBERT had the best accuracy (0.875) and recall (0.679) for biomedical entity labeling, whereas PubMedBERT had the best precision (0.614) and F1-score (0.639). Error analysis showed that NLP models trained on noisy, human-annotated data can match or, in most cases, surpass human performance. This finding highlights the feasibility of fully automating biomedical information extraction, even when relying on imperfectly annotated datasets. An annotator user study (n = 39) revealed significant (p < 0.05) gains in efficiency and human annotation accuracy with the unique TrialSieve tree-based annotation approach. In summary, TrialSieve provides a foundation to improve automated biomedical information extraction for frontend clinical research.
    Keywords:  artificial intelligence; biocuration; biomedical information extraction; biomedical literature annotation; biomedical literature schema; large language model; named entity recognition; natural-language processing; text mining
    DOI:  https://doi.org/10.3390/bioengineering12050486
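    For the entity-labeling evaluation above, span-level precision, recall, and F1 can be computed as in the brief sketch below (using the seqeval package); the BIO tags are hypothetical and do not reflect TrialSieve's 20 annotation categories.

        # Span-level evaluation of entity labels in BIO format (illustrative only).
        from seqeval.metrics import classification_report, f1_score

        # One toy sentence: gold vs. predicted tags, token by token
        y_true = [["B-Intervention", "I-Intervention", "O", "B-Outcome", "O"]]
        y_pred = [["B-Intervention", "I-Intervention", "O", "O", "O"]]

        print(f"span F1: {f1_score(y_true, y_pred):.3f}")
        print(classification_report(y_true, y_pred))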