bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-02-23
nine papers selected by
Farhad Shokraneh



  1. Front Pharmacol. 2025;16:1454245.
       Introduction: Researchers are increasingly exploring the use of artificial intelligence (AI) tools in evidence synthesis, a labor-intensive, time-consuming, and costly effort. This review explored and quantified the potential efficiency benefits of using automated tools as part of core evidence synthesis activities compared with human-led methods.
    Methods: We searched the MEDLINE and Embase databases for English-language articles published between 2012 and 14 November 2023, and hand-searched the ISPOR presentations database (2020-2023) for articles presenting quantitative results on workload efficiency in systematic literature reviews (SLR) when AI automation tools were utilized. Data on efficiencies (time- and cost-related) were collected.
    Results: We identified 25 eligible studies: 13 used machine learning, 10 used natural language processing, and one each used a systematic review automation tool and a non-specified AI tool. In 17 studies, a >50% time reduction was observed, with 5- to 6-fold decreases in abstract review time. When the number of abstracts reviewed was examined, decreases of 55%-64% were noted. Studies examining work saved over sampling at 95% recall reported 6- to 10-fold decreases in workload with automation. No studies quantified the economic impact associated with automation, although one study found that there was an overall labor reduction of >75% over manual methods during dual-screen reviews.
    Discussion: AI can both reduce workload and create time efficiencies when applied to evidence-gathering efforts in SLRs. These improvements can facilitate the implementation of novel decision-making approaches that consider the real-life value of health technologies. Further research should quantify the economic impact of automation in SLRs.
    Keywords:  artificial intelligence; efficiencies; evidence synthesis; machine learning; systematic review
    DOI:  https://doi.org/10.3389/fphar.2025.1454245
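
    A note on the metric: "work saved over sampling" (WSS) at 95% recall, cited in the Results above, is conventionally the fraction of records reviewers can skip, beyond what random sampling would allow, while still finding 95% of the relevant studies. A minimal worked sketch in Python with purely illustrative counts (none taken from the review):

        def wss(tn: int, fn: int, total: int, recall: float = 0.95) -> float:
            """Work saved over sampling at a target recall level.

            tn    -- irrelevant records the tool correctly screened out
            fn    -- relevant records the tool missed
            total -- all records retrieved by the search
            """
            return (tn + fn) / total - (1.0 - recall)

        # Illustrative only: 10,000 retrieved records, 6,000 correctly skipped, 20 missed.
        print(round(wss(tn=6000, fn=20, total=10000), 3))  # 0.552, i.e. ~55% of screening work saved
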
  2. Reg Anesth Pain Med. 2025 Feb 16. pii: rapm-2024-106358. [Epub ahead of print]
       INTRODUCTION: Artificial intelligence (AI), particularly large language models such as the Chat Generative Pre-Trained Transformer (ChatGPT), has demonstrated potential in streamlining research methodologies. Systematic reviews and meta-analyses, often considered the pinnacle of evidence-based medicine, are inherently time-intensive and demand meticulous planning, rigorous data extraction, thorough analysis, and careful synthesis. Despite promising applications of AI, its utility in conducting systematic reviews with meta-analysis remains unclear. This study evaluated ChatGPT's accuracy in conducting key tasks of a systematic review with meta-analysis.
    METHODS: This validation study used data from a published meta-analysis on emotional functioning after spinal cord stimulation. ChatGPT-4o performed title/abstract screening, full-text study selection, and data pooling for this systematic review with meta-analysis. Comparisons were made against human-executed steps, which were considered the gold standard. Outcomes of interest included accuracy, sensitivity, specificity, positive predictive value, and negative predictive value for screening and full-text review tasks. We also assessed for discrepancies in pooled effect estimates and forest plot generation.
    RESULTS: For title and abstract screening, ChatGPT achieved an accuracy of 70.4%, sensitivity of 54.9%, and specificity of 80.1%. In the full-text screening phase, accuracy was 68.4%, sensitivity 75.6%, and specificity 66.8%. ChatGPT successfully pooled data for five forest plots, achieving 100% accuracy in calculating pooled mean differences, 95% CIs, and heterogeneity estimates (I2 score and tau-squared values) for most outcomes, with minor discrepancies in tau-squared values (range 0.01-0.05). Forest plots showed no significant discrepancies.
    CONCLUSION: ChatGPT demonstrates modest to moderate accuracy in screening and study selection tasks, but performs well in data pooling and meta-analytic calculations. These findings underscore the potential of AI to augment systematic review methodologies, while also emphasizing the need for human oversight to ensure accuracy and integrity in research workflows.
    Keywords:  CHRONIC PAIN; Meta-Analysis; Methods; Spinal Cord Stimulation
    DOI:  https://doi.org/10.1136/rapm-2024-106358
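
    The screening figures above follow from a standard confusion-matrix comparison of the model's include/exclude calls against the human gold standard; a minimal sketch (counts are illustrative, not the study's data):

        def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
            """Agreement of automated screening decisions with a human gold standard."""
            return {
                "accuracy": (tp + tn) / (tp + fp + tn + fn),
                "sensitivity": tp / (tp + fn),   # share of truly relevant records caught
                "specificity": tn / (tn + fp),   # share of irrelevant records rejected
                "ppv": tp / (tp + fp),
                "npv": tn / (tn + fn),
            }

        # Illustrative counts only.
        print(screening_metrics(tp=40, fp=25, tn=110, fn=35))
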
  3. Musculoskelet Surg. 2025 Feb 17.
      Artificial intelligence (AI) is transforming orthopedic research by optimizing academic workflows, improving evidence synthesis, and expanding access to advanced data analysis tools. Generative AI models such as ChatGPT and GPT-4, alongside specialized platforms such as Consensus and SciSpace, empower researchers to refine search queries, enhance literature reviews, synthesize documents, and conduct advanced statistical analyses. These technologies enable the interpretation of large datasets, saving time and boosting efficiency. For orthopedic residents, AI is particularly impactful, revolutionizing their education and fostering greater independence in research. This review explores the key applications of AI as a research assistant in orthopedics, as well as its ethical considerations and challenges.
    Keywords:  Artificial intelligence; Generative AI models; Large language model; Orthopaedic surgery; Orthopaedics; Research enhancement
    DOI:  https://doi.org/10.1007/s12306-025-00894-w
  4. J Med Libr Assoc. 2025 Jan 14. 113(1): 31-38
       Objective: Sexual and gender minority (SGM) populations experience health disparities compared to heterosexual and cisgender populations. The development of accurate, comprehensive sexual orientation and gender identity (SOGI) measures is fundamental to quantify and address SGM disparities, which first requires identifying SOGI-related research. As part of a larger project reviewing and synthesizing how SOGI has been assessed within the health literature, we provide an example of the application of automated tools for systematic reviews to the area of SOGI measurement.
    Methods: In collaboration with research librarians, a three-phase approach was used to prioritize screening for a set of 11,441 SOGI measurement studies published since 2012. In Phase 1, search results were stratified into two groups (title with vs. without measurement-related terms); titles with measurement-related terms were manually screened. In Phase 2, supervised clustering using DoCTER software was used to sort the remaining studies based on relevance. In Phase 3, supervised machine learning using DoCTER was used to further identify which studies deemed low relevance in Phase 2 should be prioritized for manual screening.
    Results: 1,607 studies were identified in Phase 1. Across Phases 2 and 3, the research team excluded 5,056 of the remaining 9,834 studies using DoCTER. In manual review, the percentage of relevant studies among the screened results was low, ranging from 0.1 to 7.8 percent.
    Conclusions: Automated tools used in collaboration with research librarians have the potential to save hundreds of hours of human labor in large-scale systematic reviews of SGM health research.
    Keywords:  Automation; Health; Methods; Sexual and Gender Minorities; Systematic Review
    DOI:  https://doi.org/10.5195/jmla.2025.1860
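
    DoCTER's internals are not described in the abstract; as a generic, hypothetical illustration of how supervised relevance prioritization of titles and abstracts can work (TF-IDF features plus a linear classifier via scikit-learn, not the DoCTER implementation):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        # A small set of manually screened records supplies training labels (1 = relevant).
        train_texts = ["validation of a two-step gender identity measure in a survey",
                       "knockout mouse model of bone regeneration"]
        train_labels = [1, 0]
        unscreened = ["cognitive testing of a sexual orientation item in a health cohort"]

        vec = TfidfVectorizer(ngram_range=(1, 2))
        clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

        # Rank unscreened records so the likeliest-relevant ones are screened by humans first.
        scores = clf.predict_proba(vec.transform(unscreened))[:, 1]
        for text, score in sorted(zip(unscreened, scores), key=lambda pair: -pair[1]):
            print(f"{score:.2f}  {text}")
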
  5. J Med Libr Assoc. 2025 Jan 14. 113(1): 65-77
       Objective: This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses.
    Methods: Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements used in the established gold-standard summary of the librarian. A subset of questions was randomly selected for verification of references provided by aiChat.
    Results: Of the 216 evaluated questions, aiChat's response was assessed as "correct" for 180 (83.3%) questions, "partially correct" for 35 (16.2%) questions, and "incorrect" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, and 60 (37%) were confirmed as nonfabricated.
    Conclusions: Overall, the performance of a generative AI tool was promising. However, many included references could not be independently verified, and attempts were not made to assess whether any additional concepts introduced by aiChat were factually accurate. Thus, we envision this being the first of a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.
    Keywords:  Artificial Intelligence; Biomedical Informatics; Evidence Synthesis; Generative AI; Information Science; LLMs; Large Language Models; Library Science
    DOI:  https://doi.org/10.5195/jmla.2025.1985
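
    The abstract does not reproduce the standardized prompt, but the COSTAR framework it cites is commonly expanded as Context, Objective, Style, Tone, Audience, Response format. A hypothetical sketch using a public GPT-4-class endpoint as a stand-in for the internal aiChat tool (model name, prompt wording, and example question are illustrative assumptions):

        from openai import OpenAI  # public stand-in; aiChat itself is an internal tool

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        def costar_prompt(question: str) -> str:
            # COSTAR: Context, Objective, Style, Tone, Audience, Response format.
            return (
                "Context: You support medical librarians answering clinical evidence requests.\n"
                f"Objective: Summarize the best available evidence for: {question}\n"
                "Style: Concise evidence synthesis citing primary studies.\n"
                "Tone: Neutral and factual; state uncertainty explicitly.\n"
                "Audience: Clinicians and medical librarians.\n"
                "Response: A short summary followed by a numbered reference list."
            )

        reply = client.chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            messages=[{"role": "user", "content": costar_prompt(
                "Does early mobilization after hip fracture surgery reduce length of stay?")}],
        )
        print(reply.choices[0].message.content)  # any cited references still need manual verification
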
  6. BMC Neurol. 2025 Feb 19. 25(1): 69
       OBJECTIVE: To evaluate the potential of two large language models (LLMs), GPT-4 (OpenAI) and PaLM2 (Google), in automating migraine literature analysis by conducting sentiment analysis of migraine medications in clinical trial abstracts.
    BACKGROUND: Migraine affects over one billion individuals worldwide, significantly impacting their quality of life. A vast amount of scientific literature on novel migraine therapeutics continues to emerge, but an efficient method by which to perform ongoing analysis and integration of this information poses a challenge.
    METHODS: "Sentiment analysis" is a data science technique used to ascertain whether a text has positive, negative, or neutral emotional tone. Migraine medication names were extracted from lists of licensed biological products from the FDA, and relevant abstracts were identified using the MeSH term "migraine disorders" on PubMed and filtered for clinical trials. Standardized prompts were provided to the APIs of both GPT-4 and PaLM2 to request an article sentiment as to the efficacy of each medication found in the abstract text. The resulting sentiment outputs were classified using both a binary and a distribution-based model to determine the efficacy of a given medication.
    RESULTS: In both the binary and distribution-based models, the most favorable migraine medications identified by GPT-4 and PaLM2 aligned with evidence-based guidelines for migraine treatment.
    CONCLUSIONS: LLMs have potential as complementary tools in migraine literature analysis. Despite some inconsistencies in output and methodological limitations, the results highlight the utility of LLMs in enhancing the efficiency of literature review through sentiment analysis.
    Keywords:  Artificial intelligence; Headaches; Large language model; Literature review; Migraine
    DOI:  https://doi.org/10.1186/s12883-025-04071-1
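
    The binary versus distribution-based classification of the returned sentiments could be aggregated roughly as follows (a hypothetical sketch, not the authors' code; the labels are invented):

        from collections import Counter

        # Per-abstract sentiment labels returned by an LLM for one medication (illustrative).
        labels = ["positive", "positive", "neutral", "negative", "positive"]
        counts = Counter(labels)

        # Binary model: call the medication favorable if positive abstracts outnumber negative ones.
        binary_verdict = "favorable" if counts["positive"] > counts["negative"] else "not favorable"

        # Distribution-based model: report the share of each sentiment class instead.
        total = sum(counts.values())
        distribution = {c: counts[c] / total for c in ("positive", "neutral", "negative")}

        print(binary_verdict, distribution)
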
  7. Healthc Inform Res. 2025 Jan;31(1): 48-56
       OBJECTIVES: The objective of this study was to develop the weightage identified network of keywords (WINK) technique for selecting and utilizing keywords to perform systematic reviews more efficiently. This technique aims to improve the thoroughness and precision of evidence synthesis by employing a more rigorous approach to keyword selection.
    METHODS: The WINK methodology involves generating network visualization charts to analyze the interconnections among keywords within a specific domain. This process integrates both computational analysis and subject expert insights to enhance the accuracy and relevance of the findings. In the example considered, the networking strength between the contexts of environmental pollutants with endocrine function as Q1 and systemic health with oral health-related terms as Q2 was examined, and keywords with limited networking strength were excluded. Utilizing the Medical Subject Headings (MeSH) terms identified from the WINK technique, a search string was built and compared to an initial search with fewer keywords.
    RESULTS: The application of the WINK technique in building the search string yielded 69.81% and 26.23% more articles for Q1 and Q2, respectively, compared to conventional approaches. This significant increase demonstrates the technique's effectiveness in identifying relevant studies and ensuring comprehensive evidence synthesis.
    CONCLUSIONS: By prioritizing keywords with higher weightage and utilizing network visualization charts, the WINK technique ensures comprehensive evidence synthesis and enhances accuracy in systematic reviews. Its effectiveness in identifying relevant studies marks a significant advancement in systematic review methodology, offering a more robust and efficient approach to keyword selection.
    Keywords:  Bibliometrics; Classification; Data Mining; Medical Subject Headings; Search Engine
    DOI:  https://doi.org/10.4258/hir.2025.31.1.48
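
    The WINK network charts amount to weighting keywords by how strongly they co-occur across records in the domain; a minimal, hypothetical sketch with networkx (keyword lists and the cut-off are invented for illustration):

        import itertools
        import networkx as nx

        # Keyword lists harvested from records in the target domain (illustrative values only).
        records = [
            ["endocrine disruptors", "bisphenol A", "thyroid function"],
            ["endocrine disruptors", "thyroid function", "phthalates"],
            ["bisphenol A", "oral health"],
        ]

        G = nx.Graph()
        for kws in records:
            for a, b in itertools.combinations(sorted(set(kws)), 2):
                if G.has_edge(a, b):
                    G[a][b]["weight"] += 1
                else:
                    G.add_edge(a, b, weight=1)

        # Treat weighted degree as the "weightage"; keep only strongly networked keywords.
        strength = dict(G.degree(weight="weight"))
        keep = [k for k, s in sorted(strength.items(), key=lambda kv: -kv[1]) if s >= 2]
        print(" OR ".join(f'"{k}"' for k in keep))
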
  8. Nature. 2025 Feb 17.
      
    Keywords:  Language; Machine learning; Software
    DOI:  https://doi.org/10.1038/d41586-025-00437-0
  9. J Med Libr Assoc. 2025 Jan 14. 113(1): 58-64
       Objective: Use of the search filter 'exp animals/not humans.sh' is a well-established method in evidence synthesis to exclude non-human studies. However, the shift to automated indexing of Medline records has raised concerns about the use of subject-heading-based search techniques. We sought to determine how often this string inappropriately excludes human studies among automatically indexed as compared with manually indexed records in Ovid Medline.
    Methods: We searched Ovid Medline for studies published in 2021 and 2022 using the Cochrane Highly Sensitive Search Strategy for randomized trials. We identified all results excluded by the non-human-studies filter. Records were divided into sets based on indexing method: automated, curated, or manual. Each set was screened to identify human studies.
    Results: Human studies were incorrectly excluded in all three conditions, but automated indexing inappropriately excluded human studies at nearly double the rate of manual indexing. Looking specifically at human clinical randomized controlled trials (RCTs), the rate of inappropriate exclusion among automatically indexed records was seven times that of manually indexed records.
    Conclusions: Given our findings, searchers are advised to carefully review the effect of the 'exp animals/not humans.sh' search filter on their search results, pending improvements to the automated indexing process.
    Keywords:  Abstract and Indexing; Automated Indexing; Evidence Synthesis; Medical Subject Headings (MeSH)
    DOI:  https://doi.org/10.5195/jmla.2025.1972
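
    The rate comparison reported in the Results reduces to simple proportions of filter-excluded records that were in fact human studies; a minimal sketch with invented counts (not the study's data):

        def inappropriate_exclusion_rate(human_excluded: int, total_excluded: int) -> float:
            """Share of records removed by the 'exp animals/not humans.sh' filter that were human studies."""
            return human_excluded / total_excluded

        # Invented counts, one pair per indexing method.
        rates = {
            "automated": inappropriate_exclusion_rate(80, 2000),
            "curated": inappropriate_exclusion_rate(55, 2000),
            "manual": inappropriate_exclusion_rate(40, 2000),
        }
        print(rates, "automated vs manual:", rates["automated"] / rates["manual"])  # 2.0x
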