bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-04-26
eight papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Knee Surg Sports Traumatol Arthrosc. 2026 Apr 21.
     PURPOSE: To evaluate the accuracy, agreement, and efficiency of a dual large language model (LLM) approach using Generative Pre-Trained Transformer 5.2 (GPT-5.2) and Google Gemini 3 Pro for automated data extraction in orthopaedic systematic reviews.
    METHODS: Eight studies from a previously published systematic review on paediatric revision anterior cruciate ligament reconstruction were used to test extraction accuracy, agreement, and efficiency against a pre-defined gold standard. Both GPT-5.2 and Gemini 3 Pro were prompted via the OpenAI and Google application programming interfaces (APIs). Each study had a total of 48 equally weighted data fields to extract, spanning six domains: study characteristics, participant details, injury characteristics, primary surgery details, revision surgery details, and outcomes. Extractions were graded as correct, partially correct, or incorrect in reference to the gold standard.
    RESULTS: Across all 384 fields, both LLMs produced fully correct outputs in 315 (82%) cases, while at least one model was fully correct in 365 (95.1%). Among the six extraction domains, study characteristics (100%, 32/32), injury characteristics (93.8%, 30/32), and outcomes (91.1%, 102/112) showed the highest percentage of at least one model being correct. The entire extraction task was completed in 27 and 35.8 min by GPT-5.2 and Gemini 3 Pro, respectively, for a total API cost of $3.22 USD.
    CONCLUSION: A parallel-LLM approach using GPT-5.2 and Gemini 3 Pro achieved strong accuracy with a high degree of efficiency for automated data extraction in an orthopaedic systematic review. Most errors were due to omission of minor details in complex domains such as surgical details. At least one model was fully correct in over 95% of fields, supporting the use of a dual-LLM framework as a reliable first-pass tool for human verification.
    LEVEL OF EVIDENCE: Level IV.
    Keywords:  artificial intelligence; automation; extraction; large language model; systematic review
    DOI:  https://doi.org/10.1002/ksa.70412
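The paper's "both correct" and "at least one model correct" metrics can be sketched as follows. This is a minimal illustration, not the authors' code; the field names, values, and the substring rule used for "partially correct" grading are invented for the sketch.

```python
# Hypothetical sketch of grading dual-LLM extractions against a gold standard.
# A field counts toward "both correct" when both models match the gold value,
# and toward "at least one correct" when either does.

def grade(extracted, gold):
    """Return 'correct', 'partial', or 'incorrect' for one extracted field."""
    if extracted == gold:
        return "correct"
    # Invented partial-match rule for illustration: a non-empty substring
    # of the gold value counts as partially correct.
    if extracted and str(extracted).lower() in str(gold).lower():
        return "partial"
    return "incorrect"

def dual_llm_summary(gold_fields, model_a_fields, model_b_fields):
    """Return (fraction both fully correct, fraction at least one correct)."""
    both_correct = at_least_one = 0
    for key, gold in gold_fields.items():
        g_a = grade(model_a_fields.get(key), gold)
        g_b = grade(model_b_fields.get(key), gold)
        both_correct += (g_a == "correct" and g_b == "correct")
        at_least_one += (g_a == "correct" or g_b == "correct")
    n = len(gold_fields)
    return both_correct / n, at_least_one / n
```

In the study's terms, the second returned fraction is the 95.1% figure: the share of fields recoverable by the dual-model design before human verification.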
  2. Syst Rev. 2026 Apr 21.
       BACKGROUND: Systematic reviews are essential for evidence-based decision-making, but the screening stage is often labor-intensive and susceptible to human error. Machine learning (ML) approaches, including active learning (AL), have increasingly been used to support title and abstract screening. One such approach is the SAFE procedure, which has been proposed to guide the use of AL-assisted screening in systematic reviews. However, evidence on how well this procedure performs in large, heterogeneous datasets generated by broad search strategies remains limited. This study therefore evaluates the effectiveness and reliability of AL-assisted screening with particular focus on the SAFE procedure. Specifically, it examines the comprehensiveness and necessity of the recommended SAFE procedure, assesses the influence of different labeling strategies, and investigates whether AL-assisted screening can help reduce manual screening errors.
    METHODS: Screening of four large, heterogeneous datasets from medication management systematic reviews was simulated using ASReview. The datasets ranged from 3,475 to 16,218 records. For these datasets, 0.08% to 1% of records were included in the final systematic review. Our simulations systematically varied all parameters defined by the SAFE procedure. Recall versus sampling behavior was analyzed, with a focus on the impact of parameter choices on retrieving records selected for full-text inclusion and on reducing the number of records to be screened.
    RESULTS: AL-assisted screening can effectively reduce the number of records to screen by almost 90% without increasing the risk of missing relevant records in comparison to manual screening. For three of the four datasets, the best performance was achieved with the SAFE procedure combined with the elas-u4 and elas-h3 models and full-text labeling. Under these conditions, ASReview identified all studies included after full-text review and reduced the screening workload by 89-90%. In practical terms, this means that screening only 10-11% of the original records was sufficient to identify all final included studies in these datasets. This parameter combination identified 87% of the studies ultimately included after full-text review in the remaining dataset (16,218 records; 0.6% included at title/abstract screening and 0.08% included after full-text review). For this dataset, the best performance, identifying all studies included after full-text review while reducing the screening workload by 90%, was achieved when using the SAFE procedure with the simpler Naive Bayes model, the TF-IDF feature extractor, and title/abstract labeling.
    CONCLUSIONS: AL-assisted screening can safely and effectively reduce the workload needed to screen the large, heterogeneous datasets common in medication management systematic reviews. We recommend the modified SAFE procedure using full-text labels and the elas models. If the estimated proportion of full-text includes is very low, it may be more appropriate to use the original SAFE procedure with title/abstract labeling.
    Keywords:  Artificial intelligence; Machine learning; Pharmacy; Systematic review
    DOI:  https://doi.org/10.1186/s13643-026-03185-y
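The two quantities this simulation study reports, recall of the final includes and workload reduction at a given stopping point, can be computed from a model-ranked list of records. This sketch is generic and does not use ASReview's actual API; record IDs are placeholders.

```python
# Sketch: recall and workload reduction when screening stops after the top
# n_screened records of an active-learning-ranked list.

def screening_metrics(ranked_ids, include_ids, n_screened):
    """ranked_ids: record IDs in the order the model presents them;
    include_ids: IDs of studies included after full-text review;
    n_screened: how many top-ranked records were screened manually."""
    seen = set(ranked_ids[:n_screened])
    found = sum(1 for rid in include_ids if rid in seen)
    recall = found / len(include_ids)
    workload_reduction = 1 - n_screened / len(ranked_ids)
    return recall, workload_reduction
```

The study's headline result corresponds to recall = 1.0 at a workload reduction of 0.89-0.90, i.e. all full-text includes surfaced within the first 10-11% of ranked records.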
  3. Res Synth Methods. 2026 Apr 22. 1-19
      Systematic reviews (SRs) are critical for evidence-based research but are time-consuming and labor-intensive. The rapid expansion of academic publications further challenges the performance and applicability of existing screening and classification methods. While large language models (LLMs) present new opportunities for automation, limited research has examined whether they can achieve classification performance comparable to human reviewers in large-scale, multi-class settings. To improve classification performance, we proposed an LLM-based framework that leverages full-text key-insight extraction to enhance literature classification. We constructed a manually curated dataset of 900 articles from 17 published SRs to quantitatively evaluate the classification capabilities of LLMs. Empirical results showed that key-insight-based classification (KBC) significantly outperforms abstract-based classification (ABC). We also implemented a confidence-weighted voting (CWV) mechanism using multiple LLMs to improve robustness. The CWV method achieved the highest macro F1-score of 0.796, substantially exceeding KBC (0.732), ABC (0.676), and unsupervised K-means clustering (0.446). By employing zero-shot LLMs, our approach demonstrated adaptability across diverse domains and classification tasks without requiring fine-tuning, providing empirical evidence of LLMs' potential in supporting large-scale SRs and showing that a carefully designed pipeline can enable LLMs to achieve classification performance comparable to human reviewers.
    Keywords:  artificial intelligence; evidence synthesis; large language model; literature screening; paper classification; systematic review
    DOI:  https://doi.org/10.1017/rsm.2026.10094
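The confidence-weighted voting (CWV) idea can be reduced to a simple rule: each model casts a vote weighted by its own confidence, and the label with the largest summed confidence wins. A minimal sketch, with the prediction values invented; the paper's actual weighting scheme may differ in detail.

```python
# Sketch of a confidence-weighted voting rule across several LLM classifiers.
from collections import defaultdict

def cwv(predictions):
    """predictions: list of (label, confidence) pairs, one per model.
    Returns the label with the highest total confidence."""
    scores = defaultdict(float)
    for label, confidence in predictions:
        scores[label] += confidence
    return max(scores, key=scores.get)
```

Under this rule a single very confident model can outvote two hesitant ones, which is the intended robustness gain over unweighted majority voting.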
  4. J Clin Epidemiol. 2026 Apr 20. pii: S0895-4356(26)00147-2. [Epub ahead of print] 112272
      Podcasts can make health evidence easier to follow, but it is unclear whether AI-assisted production can match human production when both use the same audio format. We will run a randomised, two-arm, non-inferiority trial comparing AI-assisted podcasts with human-produced podcasts. Adults (≥18 years; English-proficient) will be recruited from the general public via Prolific, an online research participant recruitment platform, and randomly allocated 1:1 to listen to three short episodes (6-8 minutes each) based on the same Cochrane Plain Language Summaries. The AI arm uses Wondercraft AI in a human-in-the-loop workflow; the human arm features experienced communicators working to an identical brief. In both arms, content is limited to the Plain Language Summary, with authorship masked for participants and expert raters. The primary outcome is comprehension, measured by a 10-item test per episode, with the primary analysis using the participant-level mean score across the three episodes, aligned with the QUEST "Understanding" dimension. Secondary outcomes include format accessibility (listenability), quality of information, perceived trust, and safety. Non-inferiority margins are pre-specified; for comprehension, the margin is 1 point on the 10-item scale. If non-inferiority is shown, we will also assess superiority. We plan to recruit 458 participants. Differences between arms will be estimated using appropriate repeated-measures models, with two-sided 95% confidence intervals. This trial evaluates whether a vetted AI workflow can match human communicators on comprehension, quality, safety, accessibility, and trust when both deliver podcasts derived from the same evidence base. 
By providing head-to-head evidence in the same audio format, the study will address a practical question faced by journals and health organisations already experimenting with AI tools: can AI generate clear, safe, and trusted audio content at scale, and where does human input remain essential?
    Keywords:  Cochrane reviews; QUEST framework; artificial intelligence; health communication; non-inferiority trial; plain language summaries; podcasts; systematic reviews
    DOI:  https://doi.org/10.1016/j.jclinepi.2026.112272
  5. JMIR Res Protoc. 2026 Apr 21. 15 e82725
       Background: Despite the growing emphasis on open science and equity in research, qualitative data capturing diverse human experiences and perspectives are rarely reused beyond the original study. Increasingly, data repositories are used to make these data publicly available, but it is unclear whether these data can be effectively identified by researchers interested in secondary data analysis.
    Objective: We describe a protocol for identifying and characterizing archived qualitative datasets in leading public repositories, developing an artificial intelligence-based tool to enhance qualitative data reuse, and validating that tool using existing data.
    Methods: We will search 4 leading repositories to assess the scope and identifiability of existing publicly available qualitative datasets. We will subsequently build the Human Experiences and Reflections (HEARs) Archive, a directory of deidentified study data that is only accessible indirectly through the use of the HEARs Portal. The HEARs Portal will be supported by large language model-based tools using the retrieval-augmented generation framework. The artificial intelligence tools' performance will be assessed across 3 domains: relevance of identified studies, validity as evaluated by comparison with human qualitative data analysis, and robustness against the addition of irrelevant information.
    Results: A preliminary review of existing data repositories has begun. The anticipated study completion date is December 31, 2026.
    Conclusions: The proposed project will provide evidence regarding the existing capacity for identifying and accessing qualitative data through leading repositories. It will also provide evidence on the validity of the HEARs Data Connector for identifying and describing qualitative datasets in ways that can assist researchers interested in secondary analysis. Establishing the validity of the HEARs Data Connector and developing an evidence-based ongoing improvement and monitoring strategy will be essential for establishing trust within the qualitative research community.
    Keywords:  LLM; information science; large language models; qualitative research; secondary data analysis; transcripts
    DOI:  https://doi.org/10.2196/82725
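The retrieval step of a retrieval-augmented generation (RAG) pipeline like the one planned for the HEARs Portal can be illustrated with a toy ranker: dataset descriptions are scored by similarity to a query before being passed to an LLM. Real systems use learned embeddings; the bag-of-words cosine similarity and the example documents here are stand-ins.

```python
# Toy sketch of the retrieval step in a RAG pipeline: rank documents by
# bag-of-words cosine similarity to a query and return the top k.
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two whitespace-tokenized strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    return sorted(docs, key=lambda d: cosine(query, d), reverse=True)[:k]
```

The retrieved descriptions would then be supplied as context to the language model, grounding its answer in archived study metadata rather than its training data.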
  6. Neurosurg Pract. 2026 Jun;7(2): e000230
     BACKGROUND AND OBJECTIVES: Neurosurgery Publications encourages the creation of graphical abstracts to accompany published articles. The goal of this study was to develop a pipeline for the automatic conversion of Neurosurgery Publications articles into graphical abstracts using Cascading Style Sheets (CSS) templates and iterative prompting of a frontier vision language model and to conduct a human evaluation of this pipeline.
    METHODS: We developed an automated pipeline to convert extracted manuscript content into standardized graphical abstracts. The pipeline implements a custom CSS profile designed to match existing journal standards. Using Claude 3.5 Sonnet, we generated structured Hypertext Markup Language (HTML) summaries organized into 6 sections: Objectives, Background, Methods, Results, Discussion, and Conclusion. The model selected up to 2 representative figures per manuscript based on caption analysis. We evaluated performance using 100 randomly selected articles published between 2020 and 2024 (95 from Neurosurgery, 4 from Operative Neurosurgery, 1 from Neurosurgery Practice). Three Editorial Review Board members independently assessed abstracts using 3 binary criteria: (1) proper formatting, (2) factual accuracy, and (3) visual appeal.
    RESULTS: Generated graphical abstracts achieved proper formatting in 85% of cases (95% CI: 76.7%-90.7%), factual accuracy in 99% (95% CI: 94.4%-99.9%), and visual appropriateness in 82% (95% CI: 73.3%-88.3%). Overall, 70% of abstracts (95% CI: 60.5%-78.1%) met all 3 criteria and were deemed "publication ready" without manual intervention. Error analysis revealed poor figure selection (40.0%) as the most common failure mode, followed by title replacement errors from PDF extraction (26.7%).
    CONCLUSION: Our artificial intelligence-CSS pipeline demonstrates the feasibility of automating graphical abstract generation for neurosurgical manuscripts, achieving publication-ready quality in 70% of cases with 99% factual accuracy. This technology offers a scalable augmentation tool that can reduce the design burden for authors, enhancing visual scientific communication in neurosurgical publishing while complementing human expertise.
    Keywords:  Cascading Style Sheets templates; large language models; medical publishing; scientific communication; vision-language models; visual abstracts
    DOI:  https://doi.org/10.1227/neuprac.0000000000000230
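The confidence intervals reported above (e.g. 85%, 95% CI 76.7%-90.7% for 85 of 100 abstracts) are consistent with Wilson score intervals for a binomial proportion, sketched here for reference.

```python
# Wilson score interval for a binomial proportion at confidence level
# given by z (1.96 for a two-sided 95% CI).
import math

def wilson_ci(successes, n, z=1.96):
    """Return (lower, upper) bounds of the Wilson score interval."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

Unlike the simpler Wald interval, the Wilson interval stays inside [0, 1] and behaves well for proportions near 1, which matters for results like the 99% factual accuracy figure.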
  7. BMJ Health Care Inform. 2026 Apr 24;33(1). pii: e101959. [Epub ahead of print]
       OBJECTIVE: Meaningful assessments of how large language models (LLMs) incorporate clinical guidelines require large-scale testing over many queries. Here, we evaluate the prevalence of clinical guideline omissions and hallucinations in a large sample of diagnostic LLM outputs.
    METHODS: We used simulated case vignettes and zero-shot prompting to generate diagnostic outputs and rationales from GPT-4.1 and DeepSeek-V3. English case vignettes were created for hypercholesterolaemia and type-2 diabetes mellitus. Each vignette contained identical medical information, while sociodemographic characteristics varied in terms of sex, ethnicity and location. We calculated the prevalence of existing and hallucinated clinical guidelines in LLM outputs across disease, LLM and sociodemographic characteristics.
    RESULTS: We analysed a total of 12 197 LLM outputs, quantifying three hazard areas: omissions (up to 97% for DeepSeek-V3 and 46% for GPT-4.1), hallucinations (up to 9%) and inconsistencies (guideline citation rates ranging from 0% to 78.39% across sociodemographic vignettes). Omission and hallucination rates were generally similar across vignettes with different sex or ethnicity data, yet were particularly sensitive to patient location.
    DISCUSSION: This study highlights significant variability in clinical guideline prediction across two different diseases, three different sociodemographic variables and two LLMs, even when the LLMs were instructed by identical prompts, establishing clinical guideline prediction in LLM outputs as a stochastic event.
    CONCLUSION: The stochastic nature of LLMs creates a unique challenge for evidence generation and clinical deployment. Being able to measure and capture this stochasticity within high-quality research designs will be a prerequisite to advancing the responsible deployment of LLMs in healthcare.
    Keywords:  Artificial intelligence; Decision Making, Computer-Assisted; Evidence-Based Medicine; Large Language Models
    DOI:  https://doi.org/10.1136/bmjhci-2025-101959
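The omission and hallucination tallies described above reduce to set arithmetic over each output's cited guidelines versus a reference list: known guidelines never cited count as omissions, and cited names absent from the reference list count as hallucinations. A minimal sketch; the guideline names in the test are illustrative, not the study's actual reference set.

```python
# Sketch: per-output omission and hallucination rates for guideline citations.

def tally(output_citations, known_guidelines):
    """output_citations: guideline names cited in one LLM output;
    known_guidelines: the reference list of applicable guidelines.
    Returns (omission rate, hallucination rate)."""
    known = set(known_guidelines)
    cited = set(output_citations)
    omitted = known - cited
    hallucinated = cited - known
    return len(omitted) / len(known), len(hallucinated) / max(len(cited), 1)
```

Averaging these rates over the 12 197 outputs, stratified by disease, model, and vignette demographics, yields the prevalence figures the study reports.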
  8. Radiol Technol. 2026 May-Jun;97(5): 303-309
       PURPOSE: To evaluate the factual accuracy and citation fidelity of Scopus AI's outputs in response to a single health care-related research question about the importance of human trafficking prevention education for professionals.
    METHODS: This study employed a mixed-methods content verification approach. A single health care-related research question was entered into Scopus AI (Elsevier), which generated a summary, expanded summary, and concept map. Quantitative data were collected by classifying each statement in the Scopus AI output as accurate, misleading, or incorrect. Qualitative analysis provided contextual insights into citation use, source type, and interpretation of content.
    RESULTS: Of the 30 statements analyzed from the Scopus AI output, 27 (90.0%) were rated as accurate, and 3 (10.0%) were categorized as misleading. No incorrect or hallucinated content was detected. Qualitative analysis revealed that Scopus AI consistently cited legitimate, peer-reviewed sources. However, in 2 cases, the tool referenced secondary sources without clarification, raising questions about source hierarchy.
    DISCUSSION: Though Scopus AI produced largely reliable academic content, this study underscores the need for user verification and scholarly judgment, particularly regarding secondary sources and citation transparency. The findings highlight the importance of teaching students to critically evaluate artificial intelligence (AI)-generated material. In response to the findings, a classroom activity titled "Fact-Check the Bot" was developed to promote critical AI literacy. This activity guides learners in assessing AI-generated claims using a verification matrix and original literature and can be adapted for use with other AI tools.
    CONCLUSION: This study demonstrates the potential and the limitations of generative AI in academic research and offers a model for integrating verification practices into educational settings to enhance students' critical engagement with AI tools.
    Keywords:  AI literacy; Scopus AI; critical thinking; digital literacy; literature verification; generative AI