bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-07-20
nine papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Cochrane Evid Synth Methods. 2025 Jul;3(4): e70036
       Background: Systematic reviews are essential but time-consuming and expensive. Large language models (LLMs) and artificial intelligence (AI) tools could potentially automate data extraction, but no comprehensive workflow has been tested for different review types.
    Objective: To evaluate Elicit's and ChatGPT's abilities to extract data from journal articles as a replacement for one of two human data extractors in systematic reviews.
    Methods: Human-extracted data from three systematic reviews (30 articles in total) was compared to data extracted by Elicit and ChatGPT. The AI tools extracted population characteristics, study design, and review-specific variables. Performance metrics were calculated against human double-extracted data as the gold standard, followed by a detailed error analysis.
    Results: Precision, recall, and F1-score were all 92% for Elicit, versus 91%, 89%, and 90%, respectively, for ChatGPT. Recall was highest for study design (Elicit: 100%; ChatGPT: 90%) and population characteristics (Elicit: 100%; ChatGPT: 97%), while review-specific variables achieved 77% in Elicit and 80% in ChatGPT. Elicit had four instances of confabulation while ChatGPT had three. There was no significant difference between the two AI tools' performance (recall difference: 3.3 percentage points, 95% CI: -5.2% to 11.9%, p = 0.445).
    Conclusion: AI tools demonstrated high and similar performance in data extraction compared to human reviewers, particularly for standardized variables. Error analysis revealed confabulations in 4% of data points. We propose adopting AI-assisted extraction to replace the second human extractor, with the second human instead focusing on reconciling discrepancies between AI and the primary human extractor.
    Keywords:  ChatGPT; Elicit; artificial intelligence; data extraction; large language models; research methodology; systematic review methodology
    DOI:  https://doi.org/10.1002/cesm.70036
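    A minimal sketch of the kind of comparison described in entry 1, assuming each article's extraction is reduced to a set of (variable, value) data points and that the human double-extracted data serve as the gold standard; the variable names and values below are hypothetical:

      # Sketch: score AI-extracted data points against a human gold standard.
      def prf(gold: set, ai: set) -> tuple[float, float, float]:
          """Precision, recall, and F1 over (variable, value) data points."""
          tp = len(gold & ai)   # correctly extracted data points
          fp = len(ai - gold)   # wrong or unsupported extractions (incl. confabulations)
          fn = len(gold - ai)   # missed data points
          precision = tp / (tp + fp) if tp + fp else 0.0
          recall = tp / (tp + fn) if tp + fn else 0.0
          f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
          return precision, recall, f1

      gold = {("design", "RCT"), ("n_randomized", "120"), ("mean_age", "54")}
      ai   = {("design", "RCT"), ("n_randomized", "120"), ("mean_age", "45")}  # one transcription error
      print(prf(gold, ai))  # about (0.67, 0.67, 0.67) on this toy example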
  2. J Am Med Inform Assoc. 2025 Jul 18. pii: ocaf117. [Epub ahead of print]
       OBJECTIVES: To explore the performance of 4 large language model (LLM) chatbots in applying 2 of the most commonly used tools for the appraisal of systematic reviews (SRs) and meta-analyses.
    MATERIALS AND METHODS: We explored the performance of the 4 LLM chatbots (ChatGPT, Gemini, DeepSeek, and QWEN) in applying the ROBIS and AMSTAR 2 tools to a sample of 20 SRs, in comparison with assessments by human experts.
    RESULTS: Gemini showed the best agreement with human experts for both ROBIS and AMSTAR 2 (accuracy: 58% and 70%, respectively). The second-best LLM chatbots were ChatGPT and QWEN for ROBIS and AMSTAR 2, respectively.
    DISCUSSION: Some LLM chatbots underestimated the risk of bias or overestimated the confidence in the results of published SRs, which is consistent with recent articles on other tools.
    CONCLUSION: This is one of the first studies comparing the performance of several LLM chatbots for the automated analyses of ROBIS and AMSTAR 2.
    Keywords:  evidence-based medicine; generative artificial intelligence; meta-analysis; risk of bias; systematic reviews
    DOI:  https://doi.org/10.1093/jamia/ocaf117
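    For entry 2, a minimal sketch of how agreement with human experts might be computed, assuming each of the 20 SRs receives one overall judgement per tool; the labels and data below are hypothetical, not the study's:

      # Sketch: percent agreement between LLM and human overall judgements.
      def percent_agreement(human: list[str], llm: list[str]) -> float:
          """Share of SRs where the LLM's overall judgement matches the experts'."""
          assert len(human) == len(llm)
          return sum(h == l for h, l in zip(human, llm)) / len(human)

      # Hypothetical overall ROBIS judgements for 5 of the 20 SRs.
      human = ["high", "high", "low", "unclear", "high"]
      llm   = ["high", "low",  "low", "high",    "high"]
      print(f"accuracy = {percent_agreement(human, llm):.0%}")  # 60% on this toy data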
  3. Comput Methods Programs Biomed. 2025 Jul 12. 270: 108962. pii: S0169-2607(25)00379-7. [Epub ahead of print]
       BACKGROUND AND OBJECTIVES: Systematic reviews are widely used to identify the evidence and provide an overview of the available knowledge on various public health and medical questions. They can summarize all the available data and support knowledge-based decisions about policy, practice, and academic research. However, conducting systematic reviews is often time-consuming and costly.
    METHODS: We developed a command-line tool in R to extract data automatically from full-text scientific papers. ExtractPDF is a data extraction tool that provides a reliable computational workflow for extracting words, or combinations of words, from numerous portable document format (PDF) files.
    RESULTS: The software was applied to extract information from 299 papers included in a published systematic scoping review in the field of public health risk assessment. The output is a table of extracted information for each type of information of interest in each PDF file. During the data extraction stage, these tables served as a second reviewer alongside a human reviewer, assisting with and validating the data extraction items.
    CONCLUSIONS: The ExtractPDF tool has a novel pipeline architecture for automating the extraction of information from unstructured formats such as PDF files. It helped expedite the data extraction stage, reducing both the human resources required and errors. The tool's performance and reliability were very good, with average metrics of 0.89 for precision, 0.92 for recall, 0.86 for accuracy, and 0.91 for F1-score.
    Keywords:  Automation; Data extraction tool; Environmental chemicals; Scientific papers; Systematic scoping review
    DOI:  https://doi.org/10.1016/j.cmpb.2025.108962
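    ExtractPDF itself is an R command-line tool; purely as an illustration of the underlying idea in entry 3 (pull the text out of each included PDF and tabulate occurrences of terms of interest), here is a rough Python analogue using the pypdf package. The folder name and search terms are hypothetical:

      import csv
      from pathlib import Path
      from pypdf import PdfReader

      TERMS = ["risk assessment", "exposure route", "bisphenol"]  # hypothetical terms of interest

      def term_counts(pdf_path: Path) -> dict[str, int]:
          """Count occurrences of each search term in the full text of one PDF."""
          text = " ".join((page.extract_text() or "") for page in PdfReader(pdf_path).pages).lower()
          return {term: text.count(term) for term in TERMS}

      # One row per PDF, one column per term: a simple extraction table for a second reviewer.
      with open("extraction_table.csv", "w", newline="") as out:
          writer = csv.writer(out)
          writer.writerow(["file"] + TERMS)
          for pdf in sorted(Path("included_papers").glob("*.pdf")):  # hypothetical folder of included PDFs
              counts = term_counts(pdf)
              writer.writerow([pdf.name] + [counts[t] for t in TERMS])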
  4. Cochrane Evid Synth Methods. 2025 May;3(3): e70031
       Introduction: We describe the first known use of large language models (LLMs) to screen titles and abstracts in a review of public policy literature. Our objective was to assess the percentage of articles GPT-4 recommended for exclusion that should have been included ("false exclusion rate").
    Methods: We used GPT-4 to exclude articles from a database for a literature review of quantitative evaluations of federal and state policies addressing the opioid crisis. We exported our bibliographic database to a CSV file containing titles, abstracts, and keywords and asked GPT-4 to recommend whether to exclude each article. We conducted preliminary testing of these recommendations on a subset of articles and a final test on a sample of the entire database. We designated a false exclusion rate of no more than 10% as an adequate performance threshold.
    Results: GPT-4 recommended excluding 41,742 of the 43,480 articles (96%) containing an abstract. Our preliminary test identified only one false exclusion; our final test identified no false exclusions, yielding an estimated false exclusion rate of 0.00 [0.00, 0.05]. Fewer than 1% (417 of the 41,742 articles) were incorrectly excluded. After manually assessing the eligibility of all remaining articles, we identified 608 of the 1738 articles that GPT-4 did not exclude as eligible; that is, 65% of the articles recommended for inclusion should have been excluded.
    Discussion/Conclusions: GPT-4 performed well at recommending articles to exclude from our literature review, resulting in substantial time and cost savings. A key limitation is that we did not use GPT-4 to determine inclusions, nor did our model perform well on this task. However, GPT-4 dramatically reduced the number of articles requiring review. Systematic reviewers should conduct performance evaluations to ensure that an LLM meets a minimally acceptable quality standard before relying on its recommendations.
    Keywords:  GPT-4; evidence synthesis; large language model; opioids; policy
    DOI:  https://doi.org/10.1002/cesm.70031
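    A minimal sketch of the screening step described in entry 4, assuming the bibliographic export is a CSV with title, abstract, and keywords columns and using the OpenAI Python client; the model identifier, prompt wording, and column names are assumptions, not the study's actual configuration:

      import csv
      from openai import OpenAI  # requires an API key in the environment

      client = OpenAI()

      # Hypothetical stand-in for the study's eligibility prompt.
      PROMPT = (
          "You are screening records for a review of quantitative evaluations of federal or state "
          "policies addressing the opioid crisis. Based on the record below, answer with exactly "
          "one word: EXCLUDE or UNSURE.\n\n{record}"
      )

      def recommend_exclusion(title: str, abstract: str, keywords: str) -> str:
          record = f"Title: {title}\nAbstract: {abstract}\nKeywords: {keywords}"
          reply = client.chat.completions.create(
              model="gpt-4",  # assumed model identifier
              messages=[{"role": "user", "content": PROMPT.format(record=record)}],
              temperature=0,
          )
          return reply.choices[0].message.content.strip()

      with open("records.csv", newline="") as f:  # hypothetical export of the bibliographic database
          for row in csv.DictReader(f):
              print(row["title"][:60], "->", recommend_exclusion(row["title"], row["abstract"], row["keywords"]))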
  5. Cochrane Evid Synth Methods. 2025 May;3(3): e70030
       Introduction: Evidence syntheses are crucial in healthcare and elsewhere but are resource-intensive, often taking years to produce. Artificial intelligence and machine learning (AI/ML) tools may improve production efficiency in certain review phases, but little is known about their impact on entire reviews.
    Methods: We performed prespecified analyses of a convenience sample of eligible healthcare- or welfare-related reviews commissioned at the Norwegian Institute of Public Health between August 1, 2020 (the first commission to use AI/ML) and January 31, 2023 (administrative cut-off). The main exposure was AI/ML use following an internal support team's recommendation versus no use. Ranking (e.g., priority screening), classification (e.g., study design), clustering (e.g., documents), and bibliometric analysis (e.g., OpenAlex) tools were included, but we did not include or exclude specific tools. Generative AI tools were not widely available during the study period. The outcomes were resources (person-hours) and time from commission to completion (approval for delivery, including peer review; weeks). Analyses accounted for nonrandomized assignment and censored outcomes (reviews ongoing at cut-off). Researchers classifying exposures were blinded to outcomes. The statistician was blinded to exposure.
    Results: Among 39 reviews, 7 (18%) were health technology assessments versus systematic reviews, 19 (49%) focused on healthcare versus welfare, 18 (46%) planned meta-analysis, and 3 (8%) were ongoing at cut-off. AI/ML tools were used in 27 (69%) reviews. Reviews that used AI/ML as recommended used more resources (mean 667 vs. 291 person-hours) but were completed slightly faster (27.6 vs. 28.2 weeks). These differences were not statistically significant (relative resource use 3.71; 95% CI: 0.36-37.95; p = 0.269; relative time-to-completion: 0.92; 95% CI: 0.53-1.58; p = 0.753).
    Conclusions: The association between AI/ML use and the outcomes remains uncertain. Multicenter studies or meta-analyses may be needed to determine whether these tools meaningfully reduce resource use and time to produce evidence syntheses.
    Keywords:  artificial intelligence; automation; business process management; evidence synthesis; machine learning; research waste; systematic reviewing
    DOI:  https://doi.org/10.1002/cesm.70030
  6. J Med Internet Res. 2025 Jul 15. 27 e75666
       Advances in artificial intelligence (AI) promise to reshape the landscape of scientific inquiry. Amid these developments, OpenAI's latest tool, Deep Research, stands out for its potential to revolutionize how researchers engage with the literature. However, this leap forward presents a paradox: while AI-generated reviews offer speed and accessibility with minimal effort, they raise fundamental concerns about citation integrity, critical appraisal, and the erosion of deep scientific thinking. These concerns are particularly problematic in the context of biomedical research, where evidence quality may influence clinical practice and decision-making. In this piece, we present an empirical evaluation of Deep Research and explore both its remarkable capabilities and its inherent limitations. Through structured experimentation, we assess its effectiveness in synthesizing literature, highlight key shortcomings, and reflect on the broader implications of these tools for research training and the integrity of evidence-based practice. With AI tools increasingly blurring the lines between knowledge generation and critical inquiry, we argue that while AI democratizes access to knowledge, wisdom remains distinctly human.
    Keywords:  AI; LLMs; artificial intelligence; hallucination; large language model; medical research; scientific writing
    DOI:  https://doi.org/10.2196/75666
  7. Neurosurgery. 2025 Jan 16. 97(2): 387-398
       BACKGROUND AND OBJECTIVES: Scholarly output is accelerating in medical domains, making it challenging to keep up with the latest neurosurgical literature. The emergence of large language models (LLMs) has facilitated rapid, high-quality text summarization. However, LLMs cannot autonomously conduct literature reviews and are prone to hallucinating source material. We devised a novel strategy that combines Reference Publication Year Spectroscopy-a bibliometric technique for identifying foundational articles within a corpus-with LLMs to automatically summarize and cite salient details from articles. We demonstrate our approach for four common spinal conditions in a proof of concept.
    METHODS: Reference Publication Year Spectroscopy identified seminal articles from the corpora of literature for cervical myelopathy, lumbar radiculopathy, lumbar stenosis, and adjacent segment disease. The article text was split into 1024-token chunks. Queries from three knowledge domains (surgical management, pathophysiology, and natural history) were constructed. The most relevant article chunks for each query were retrieved from a vector database using chain-of-thought prompting. LLMs automatically summarized the literature into a comprehensive narrative with fully referenced facts and statistics. Information was verified through manual review, and spine surgery faculty were surveyed for qualitative feedback.
    RESULTS: Our tandem approach cost less than $1 per condition and ran within 5 minutes. Generative Pre-trained Transformer-4 was the best-performing model, with a near-perfect 97.5% citation accuracy. Surveys of spine faculty helped refine the prompting scheme to improve the cohesion and accessibility of the summaries. The final artificial intelligence-generated text provided high-fidelity summaries of each pathology's most clinically relevant information.
    CONCLUSION: We demonstrate the rapid, automated summarization of seminal articles for four common spinal pathologies, with a generalizable workflow implemented using consumer-grade hardware. Our tandem strategy fuses bibliometrics and artificial intelligence to bridge the gap toward fully automated knowledge distillation, obviating the need for manual literature review and article selection.
    Keywords:  Artificial intelligence; Bibliometrics; Big data; Large language models; Literature review; Machine learning; Spinal surgery
    DOI:  https://doi.org/10.1227/neu.0000000000003354
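    Entry 7 describes a retrieval-augmented pipeline (chunking, embedding, vector retrieval, then LLM summarization). A minimal sketch of the chunk-and-retrieve step only, assuming OpenAI embeddings and plain cosine similarity rather than the authors' actual vector database or prompts:

      import numpy as np
      from openai import OpenAI  # embedding model and client usage are assumptions, not the paper's code

      client = OpenAI()

      def chunk(words: list[str], size: int = 700) -> list[str]:
          """Crude word-based stand-in for the paper's 1024-token chunking."""
          return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

      def embed(texts: list[str]) -> np.ndarray:
          """One embedding vector per text chunk."""
          resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
          return np.array([d.embedding for d in resp.data])

      def top_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
          """Return the k chunks most similar to the query by cosine similarity."""
          vecs, qvec = embed(chunks), embed([query])[0]
          sims = vecs @ qvec / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(qvec))
          return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

      # The retrieved chunks, with their source citations, would then be passed to an LLM
      # (GPT-4 in the paper) to produce a referenced narrative summary.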
  8. Value Health. 2025 Jul 11. pii: S1098-3015(25)02455-6. [Epub ahead of print]
    ISPOR Working Group on Generative AI
       INTRODUCTION: Generative artificial intelligence (AI), particularly large language models (LLMs), holds significant promise for Health Economics and Outcomes Research (HEOR). However, standardized reporting guidance for LLM-assisted research is lacking. This article introduces the ELEVATE-GenAI framework and checklist: reporting guidelines specifically designed for HEOR studies involving LLMs.
    METHODS: The framework was developed through a targeted literature review of existing reporting guidelines and AI evaluation frameworks, together with expert input from the ISPOR Working Group on Generative AI. It comprises ten domains, including model characteristics, accuracy, reproducibility, and fairness and bias. The accompanying checklist translates the framework into actionable reporting items. To illustrate its use, the framework was applied to two published HEOR studies: one focused on systematic literature review tasks and the other on economic modeling.
    RESULTS: The ELEVATE-GenAI framework offers a comprehensive structure for reporting LLM-assisted HEOR research, while the checklist facilitates practical implementation. Its application to the two case studies demonstrates its relevance and usability across different HEOR contexts.
    LIMITATIONS: Although the framework provides robust reporting guidance, further empirical testing is needed to assess its validity, completeness, and usability, as well as its generalizability across diverse HEOR use cases.
    CONCLUSION: The ELEVATE-GenAI framework and checklist address a critical gap by offering structured guidance for transparent, accurate, and reproducible reporting of LLM-assisted HEOR research. Future work will focus on extensive testing and validation to support broader adoption and refinement.
    Keywords:  Artificial Intelligence; Generative AI; Large Language Model; Reporting Guidelines
    DOI:  https://doi.org/10.1016/j.jval.2025.06.018
  9. J Med Internet Res. 2025 Jul 14. 27 e64452
       Background: Clinical problem-solving requires processing of semantic medical knowledge, such as illness scripts, and numerical medical knowledge of diagnostic tests for evidence-based decision-making. As large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate nonlanguage evidence-based answers to clinical questions is inherently limited by tokenization.
    Objective: This study aimed to evaluate LLMs' performance on two question types: numeric (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs in medical aspects and comparing their performance to humans.
    Methods: To generate straightforward multichoice questions and answers (Q and As) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (containing data from more than 50,000 peer-reviewed studies) and created the EBM questions and answers (EBMQA) dataset. EBMQA comprises 105,222 Q and As, categorized by medical topics (eg, medical disciplines) and nonmedical topics (eg, question length), and classified into numerical or semantic types. We benchmarked a dataset of 24,000 Q and As on two state-of-the-art LLMs, GPT-4 (OpenAI) and Claude 3 Opus (Anthropic). We evaluated the LLMs' accuracy on semantic and numerical question types and by sublabeled topic. In addition, we examined the question-answering rate of the LLMs by allowing them to abstain from responding to questions. For validation, we compared the results for 100 unrelated numerical EBMQA questions between six human medical experts and the two language models.
    Results: In an analysis of 24,542 Q and As, Claude 3 and GPT-4 performed better on semantic Q and As (68.7%, n=1593 and 68.4%, n=1709, respectively) than on numerical Q and As (61.3%, n=8583 and 56.7%, n=12,038, respectively), with Claude 3 outperforming GPT-4 in numeric accuracy (P<.001). A median accuracy gap of 7% (IQR 5%-10%) was observed between the best and worst sublabels per topic, with different LLMs excelling in different sublabels. Focusing on Medical Discipline sublabels, Claude 3 performed well in neoplastic disorders but struggled with genitourinary disorders (69%, n=676 vs 58%, n=464; P<.0001), while GPT-4 excelled in cardiovascular disorders but struggled with neoplastic disorders (60%, n=1076 vs 53%, n=704; P=.0002). Furthermore, humans (82.3%, n=82.3) surpassed both Claude 3 (64.3%, n=64.3; P<.001) and GPT-4 (55.8%, n=55.8; P<.001) in the validation test. The Spearman correlation between question-answering rate and accuracy was not significant for either Claude 3 or GPT-4 (ρ=0.12, P=.69 and ρ=0.43, P=.13, respectively).
    Conclusions: Both LLMs performed better on semantic than on numerical Q and As, with Claude 3 surpassing GPT-4 on numerical Q and As. However, both LLMs showed inter- and intramodel gaps across different medical aspects and remained inferior to humans. In addition, their ability to respond to or abstain from answering a question does not reliably predict how accurately they perform when they do attempt to answer. Thus, their medical advice should be interpreted with caution.
    Keywords:  benchmark; dataset; evidence-based medicine; large language models; questions and answers
    DOI:  https://doi.org/10.2196/64452
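    For entry 9, a minimal sketch of the accuracy, answering-rate, and correlation calculations, using scipy and hypothetical toy figures rather than the EBMQA data:

      from scipy.stats import spearmanr

      def accuracy_and_answer_rate(results: list[tuple[str | None, str]]) -> tuple[float, float]:
          """results: (model answer, or None if it abstained; correct answer) per question."""
          answered = [(a, c) for a, c in results if a is not None]
          accuracy = sum(a == c for a, c in answered) / len(answered) if answered else 0.0
          return accuracy, len(answered) / len(results)

      results = [("A", "A"), ("B", "C"), (None, "B"), ("D", "D")]  # toy question-level results
      print(accuracy_and_answer_rate(results))  # about (0.67, 0.75) on this toy data

      # Hypothetical (accuracy, answering-rate) pairs across topics for one model.
      per_topic = [(0.69, 0.94), (0.58, 0.91), (0.64, 0.97), (0.61, 0.88), (0.66, 0.90)]
      rho, p = spearmanr([a for a, _ in per_topic], [r for _, r in per_topic])
      print(f"Spearman rho = {rho:.2f}, p = {p:.2f}")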