bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-05-03
nine papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. BMJ Evid Based Med. 2026 Apr 29. pii: bmjebm-2025-114044. [Epub ahead of print]
  OBJECTIVE: To examine the potential errors of a general-purpose large language model (LLM) (ie, Claude 3.5 Sonnet) in data extraction from randomised controlled trials (RCTs).
    DESIGN AND SETTING: An empirical study comparing Claude 3.5 Sonnet extractions against a human-performed verification dataset. The extraction tasks for Claude 3.5 Sonnet were based solely on original RCT portable document format (PDF) files. For PDFs that could not be directly extracted by Claude 3.5 Sonnet, optical character recognition was employed to convert them into text format before extraction.
    PARTICIPANTS: A random sample of 664 trials was selected from a well-established trial bank and a final data pool was established based on rigorous manual cross-checking as a reference standard.
    DATA SOURCES: PubMed, EMBASE, Scopus, Web of Science (all databases) and the Cochrane Central Register of Controlled Trials (CENTRAL) up to February 2023.
    ELIGIBILITY CRITERIA FOR SELECTING STUDIES: RCTs on children involving medication and adverse events.
    MAIN OUTCOME MEASURES: Claude 3.5 Sonnet was applied to extract basic information (eg, trial design, population information and source of funding) and adverse outcomes (ie, names of adverse events, numbers of events). Claude 3.5 Sonnet outputs were compared against the final data pool and all errors were recorded. Results are presented as error rates with 95% CIs, estimated using a generalised linear mixed model.
    RESULTS: For the 664 trials, a total of 23 069 data cells were extracted via Claude 3.5 Sonnet, with 10 624 for basic information and 12 445 for adverse outcomes. The overall error rate for data extraction was 6.6% (95% CI 5.4% to 8.2%), with 5.7% (95% CI 5.2% to 6.1%) in basic information and 7.6% (95% CI 4.9% to 11.8%) in adverse outcomes. When the 1542 total errors were stratified by error type, misallocation (assigning data to incorrect fields; 57.1%, 881/1542) and missed or omitted data (incomplete extraction of available data; 23.2%, 357/1542) were the two most frequent error types, with misallocation occurring more often in basic information (53.3%, 470/881) and missed or omitted data occurring more often in adverse outcomes (96.1%, 343/357). Post hoc analysis examining the association between trial reporting quality (assessed using the Consolidated Standards of Reporting Trials (CONSORT) 2025 checklist) and LLM data extraction error rates indicated that higher CONSORT adherence was associated with lower extraction error rates.
    CONCLUSIONS: The data extraction error rate of Claude 3.5 Sonnet was relatively low, but it serves as a caution for LLM applications in evidence synthesis. Detailed checking of LLM outputs should be a primary consideration for evidence synthesisers.
    Keywords:  Child Health; Drug-Related Side Effects and Adverse Reactions; Evidence-Based Practice
    DOI:  https://doi.org/10.1136/bmjebm-2025-114044
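    For illustration, the headline error rate can be reproduced as a simple proportion from the abstract's totals; the paper itself estimates its 95% CIs with a generalised linear mixed model, so the Wilson interval in this Python sketch is only a rough stand-in, not the authors' method.

      # Cell-level extraction error rate with a Wilson 95% CI.
      # Totals are taken from the abstract above; the paper's CIs come from
      # a generalised linear mixed model, so this interval is only indicative.
      from statsmodels.stats.proportion import proportion_confint

      errors, cells = 1542, 23069
      rate = errors / cells
      low, high = proportion_confint(errors, cells, alpha=0.05, method="wilson")
      print(f"error rate {rate:.1%} (95% CI {low:.1%} to {high:.1%})")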
  2. Can J Psychiatry. 2026 Apr 30. 7067437261445767
      Background: Large language models (LLMs) may reduce the burden associated with performing systematic reviews by prescreening abstracts from a literature search for eligibility for inclusion in full-text review.
    Methods: We developed an iterative, LLM-based workflow for screening abstracts: after manual specification of eligibility criteria and seed examples, an ensemble of five LLMs deliberates through a Delphi process to classify a batch of abstracts; these labels are used to train a logistic regression model that ranks the remaining abstracts and identifies a new batch of abstracts for LLM escalation, until all abstracts are labelled by the LLMs or by probability thresholds. We tested our workflow on abstracts screened in three published systematic reviews in psychiatry. Our primary endpoint was the recall metric, and our secondary endpoint was the work saved over sampling at 95% recall (WSS@95%).
    Results: In a dataset on autism biomarkers, 1,655 (35%) of 4,745 retrieved abstracts were judged to be relevant by the original authors. The Delphi-LLM workflow correctly identified 1,605 (97.0%) of these 1,655 abstracts (precision = 54.2%, WSS@95% = 38.1%). These metrics were better than those of non-LLM approaches (recall ≤ 91%, WSS@95% ≤ 26%) and, overall, balanced recall and work saved optimally compared with single-LLM agents (recall = 84.9-99.9%, WSS@95% = 16.7-39.8%). The recall and work-saved metrics were similarly reliable, and among the top performers, in two low-prevalence datasets: an attention-deficit hyperactivity disorder treatment review (10% of 2,891 abstracts relevant) and a posttraumatic stress disorder trajectory review (7% of 4,453 abstracts relevant). For these two datasets, recall was 100.0% and 96.4%, and WSS@95% was 17.3% and 18.5%, respectively.
    Conclusions: We presented the design and validation of a novel abstract screening workflow that centres on a Delphi-style aggregation process to harness the strengths of five open-source LLMs that can be run on consumer-level workstations. This multi-LLM workflow showed acceptable and reliable performance for use as an automated prescreening method to facilitate systematic reviews.
    Keywords:  Delphi method; abstract screening; large language models; systematic reviews; text embedding
    DOI:  https://doi.org/10.1177/07067437261445767
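    For readers unfamiliar with the secondary endpoint, WSS@95% is conventionally computed as (TN + FN)/N - (1 - recall) at the ranking cutoff that reaches 95% recall; a minimal sketch on hypothetical screening labels (1 = relevant), not the authors' code:

      def wss_at_recall(ranked_labels, target=0.95):
          """Work saved over sampling at a target recall, for a ranked list."""
          total_relevant = sum(ranked_labels)
          seen = 0
          for i, label in enumerate(ranked_labels, start=1):
              seen += label
              if seen >= target * total_relevant:
                  # fraction of the list never screened, minus the allowed recall loss
                  return (len(ranked_labels) - i) / len(ranked_labels) - (1 - target)
          return 0.0

      print(wss_at_recall([1, 1, 0, 1, 0, 0, 0, 0, 1, 0]))  # 0.05 on this toy list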
  3. Int J Biometeorol. 2026 Apr 28;70(5):143. [Epub ahead of print]
      
    Keywords:  ChatGPT; Geomagnetic activity; Mortality; Physiological effects; Solar cycle; Solar variability
    DOI:  https://doi.org/10.1007/s00484-026-03220-6
  4. JAMIA Open. 2026 Apr;9(2): ooag051
     Objective: This proof-of-concept study of automatic data extraction of health technology assessment (HTA) attributes from HTA reports on medicines aimed to explore which attributes could be extracted, and how accurately, using different data extraction methods. Such extraction enables easy access to insights into HTA recommendations for policymaking and policy-related research.
    Materials and Methods: In total, 14 relevant attributes (eg, assessment outcome or date) were identified for extraction using two classical natural language processing (NLP) methods (rule-based and classification models) and a generative AI method (large language model (LLM)-based, i.e., Claude 3 Opus). The performance of these techniques was compared using 50 HTA reports published by the National Institute for Health and Care Excellence (NICE, United Kingdom).
    Results: All three methods were able to extract certain attributes with high accuracy, with differences between the extraction methods and the type of attribute. The LLM-based extraction was the only method able to extract attributes at the medicine-indication combination level, and it performed best overall (88-98% semantic accuracy for 12/14 attributes). Extraction of Outcome relative effectiveness analyses (REA) and Comparator was the most challenging and had the lowest accuracy (∼70% for the LLM-based extraction).
    Discussion & Conclusion: Automatic data extraction for relevant attributes from HTA reports is possible, but there is still room for improvement. LLM-based extraction outperformed the two NLP methods, but challenges regarding the use of commercial software and reproducibility remain. Future research should focus on expanding the system to other HTA organizations and further refining the LLM-based extraction.
    Keywords:  Automated data extraction; generative AI; health technology assessment; large language models; natural language processing
    DOI:  https://doi.org/10.1093/jamiaopen/ooag051
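    As a sketch of what LLM-based attribute extraction from an HTA report can look like in practice (the attribute names and prompt wording below are illustrative, not the paper's prompts or full 14-attribute schema, and the call to Claude 3 Opus itself is omitted):

      import json

      ATTRIBUTES = ["assessment_outcome", "assessment_date", "comparator"]  # illustrative subset

      def build_prompt(report_text: str) -> str:
          # Ask the model for strict JSON so outputs can be parsed automatically.
          return (
              "Extract the following attributes from the HTA report below. "
              f"Return JSON with exactly these keys: {', '.join(ATTRIBUTES)}. "
              "Use null when an attribute is not stated.\n\n" + report_text
          )

      def parse_response(raw: str) -> dict:
          # Keep only the expected keys; missing ones come back as None.
          data = json.loads(raw)
          return {key: data.get(key) for key in ATTRIBUTES}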
  5. JMIR Form Res. 2026 Apr 27. 10 e55127
     Background: Risk of bias (RoB) assessment of randomized clinical trials (RCTs) is vital to answering systematic review questions accurately. Manual RoB assessment for hundreds of RCTs is a cognitively demanding and lengthy process. Automation has the potential to assist reviewers by rapidly identifying text descriptions in RCTs that indicate potential risks of bias. However, no RoB text span-annotated corpus is available to fine-tune or evaluate large language models (LLMs), and there are no established guidelines for annotating RoB spans in RCTs.
    Objective: The revised Cochrane risk of bias tool (RoB 2) provides comprehensive guidelines for RoB assessment; however, due to the tool's inherent subjectivity, it cannot be used directly as RoB annotation guidelines. This study aimed to develop precise RoB text span annotation instructions that address this subjectivity and thus aid corpus annotation.
    Methods: We leveraged the RoB 2 guidelines to develop visual instructional placards that serve as annotation guidelines for RoB spans and risk judgments. Expert annotators used these visual placards to annotate a dataset named RoBuster, consisting of 41 full-text RCTs from the domains of physiotherapy and rehabilitation. We report interannotator agreement (IAA) between 2 annotators for text span annotations before and after applying the visual instructions on a subset (n=9) of RoBuster, and we report IAA on bias risk judgments using Cohen κ. Moreover, we used a portion of RoBuster (n=10) to evaluate an LLM (here, GPT-3.5) within a straightforward evaluation framework, both to gauge its performance on the challenging task of RoB span extraction and to demonstrate the utility of the corpus.
    Results: We present a corpus of 41 RCTs with fine-grained text span annotations comprising more than 28,427 tokens belonging to 22 RoB classes. The IAA at the text span level, calculated using the F1 measure, varies from 0% to 90%, while Cohen κ for risk judgments ranges between -0.235 and 1.0. Using visual instructions for annotation increases the IAA by more than 17 percentage points. The LLM (GPT-3.5) shows promising but varied observed agreement with the expert annotations across the different bias questions.
    Conclusions: Despite having comprehensive bias assessment guidelines and visual instructional placards, RoB annotation remains a complex task. Using visual placards for bias assessment and annotation enhances IAA compared to cases where visual placards are absent; however, text annotation remains challenging for the subjective questions and the questions for which annotation data are unavailable in RCTs. Similarly, while GPT-3.5 demonstrates effectiveness, its accuracy diminishes with more subjective RoB questions and low information availability.
    Keywords:  LLM; RCT; RoBuster; corpus; corpus annotation; effectiveness; information extraction; large language models; natural language processing; physiotherapy; randomized controlled trials; rehabilitation; reviewer; risk of bias; tools
    DOI:  https://doi.org/10.2196/55127
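    The span-level IAA reported above is an F1 over annotated text spans; one simple token-overlap version, computed on hypothetical token-index sets rather than the RoBuster data:

      def span_f1(tokens_a: set, tokens_b: set) -> float:
          """Token-overlap F1 between two annotators' spans for one RoB class."""
          if not tokens_a and not tokens_b:
              return 1.0  # both annotators marked nothing: perfect agreement
          overlap = len(tokens_a & tokens_b)
          if overlap == 0:
              return 0.0
          precision = overlap / len(tokens_a)
          recall = overlap / len(tokens_b)
          return 2 * precision * recall / (precision + recall)

      print(span_f1({10, 11, 12, 13}, {11, 12, 13, 14, 15}))  # ~0.67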
  6. JMIR AI. 2026 Apr 29. 5 e77311
       BACKGROUND: The exponential growth of digital information has led to an unprecedented expansion in the volume of unstructured text data. Efficient classification of these data is critical for timely evidence synthesis and informed decision-making in health care. Machine learning techniques have shown considerable promise for text classification tasks. However, multiclass classification of papers by study publication type has been largely overlooked compared to binary or multilabel classification. Addressing this gap could significantly enhance knowledge translation workflows and support systematic review processes.
    OBJECTIVE: This study aimed to fine-tune and evaluate domain-specific transformer-based language models on a gold-standard dataset for multiclass classification of clinical literature into mutually exclusive categories: original studies, reviews, evidence-based guidelines, and nonexperimental studies.
    METHODS: The titles and abstracts of McMaster's Premium Literature Service (PLUS) dataset, comprising 162,380 papers, were used to fine-tune seven domain-specific transformers. Clinical experts classified the papers into four mutually exclusive publication types. PLUS data were split in an 80:10:10 ratio into training, validation, and testing sets, with the Clinical Hedges dataset used for external validation. A grid search evaluated the impact of class weight (CW) adjustments, learning rate (LR), batch size (BS), warmup ratio, and weight decay (WD), totaling 1890 configurations. Models were assessed using 10 metrics, including the area under the receiver operating characteristic curve (AUROC), the F1-score (harmonic mean of precision and recall), and the Matthews correlation coefficient (MCC). Per-class performance was assessed using a one-versus-rest approach, and overall performance using the macro average. Optimal models identified from validation results were further tested on both PLUS and Clinical Hedges, with calibration assessed visually.
    RESULTS: The ten best-performing models achieved macro AUROC ≥ 0.99, F1-score ≥ 0.89, and MCC ≥ 0.88 on the validation and testing sets. Performance declined on Clinical Hedges. Models were consistently better at classifying original studies and reviews. BioBERT (Bidirectional Encoder Representations from Transformers fine-tuned on biomedical text)-based models had superior calibration, especially for original studies and reviews. Optimal configurations included lower LRs (1 × 10^-5 and 3 × 10^-5), midrange BSs (32-128), and lower WD (0.005-0.010). CW adjustments improved recall but generally reduced performance on other metrics. Models generally struggled to classify nonexperimental and guideline studies accurately, potentially due to class imbalance and content heterogeneity.
    CONCLUSIONS: This study used a comprehensive hyperparameter search to highlight the effectiveness of fine-tuned transformer models, notably BioBERT variants, for multiclass clinical literature classification. Although class weighting generally decreased overall performance, addressing class imbalance through alternative methods, such as hierarchical classification or targeted resampling, warrants future exploration. Hyperparameter configuration was crucial for robust performance, in line with the previous literature. These findings support future modeling research and the practical deployment of human-in-the-loop systems for knowledge synthesis and translation workflows.
    Keywords:  classification; deep learning; information science; medical informatics; natural language processing
    DOI:  https://doi.org/10.2196/77311
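    The grid search can be pictured as a Cartesian product over the five tuned dimensions; the value lists below are hypothetical (the abstract reports only some of them), so the count differs from the paper's 1890 configurations:

      from itertools import product

      grid = {
          "class_weight": [None, "balanced"],
          "learning_rate": [1e-5, 3e-5, 5e-5],
          "batch_size": [32, 64, 128],
          "warmup_ratio": [0.0, 0.1],
          "weight_decay": [0.005, 0.01],
      }
      configs = [dict(zip(grid, values)) for values in product(*grid.values())]
      print(len(configs))  # 72 with these illustrative lists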
  7. J Med Internet Res. 2026 May 01. 28 e88766
       Background: Generative artificial intelligence (GenAI) tools are increasingly used in scientific research to support literature searches, evidence synthesis, and manuscript preparation. While these systems promise substantial efficiency gains, concerns have emerged regarding their reliability, particularly their tendency to cite inaccurate, fabricated, or retracted literature. The unrecognized inclusion of retracted studies poses a serious risk to research integrity and evidence-based decision-making. Whether commonly used GenAI tools can reliably detect, exclude, or transparently communicate the retraction status of scientific publications remains unclear.
    Objective: This study aimed to evaluate the ability of freely available GenAI tools to correctly handle retracted scientific articles during literature searches. Primary and secondary outcomes focused on accuracy, reliability, and consistency in recognizing retracted literature.
    Methods: In this pragmatic trial, nine widely used free-access GenAI tools (ChatGPT 4, ChatGPT 5, Claude, Gemini, Perplexity, Microsoft Copilot, SciSpace, ScienceOS, and Consensus) were evaluated. Each tool was asked five predefined, standardized questions addressing topic overview, article identification, article summarization, and explicit assessment of retraction status. Overall, 15 retracted articles (the 10 most cited and the 5 most recently retracted as of May 23, 2025) were selected from the Retraction Watch database. All questions were repeated twice to assess intratool consistency. Responses were independently rated as correct or incorrect by 2 researchers. Descriptive statistics were used to summarize performance and to compare general-purpose and research-focused AI tools. Interreviewer agreement was assessed using the Cohen kappa coefficient.
    Results: None of the evaluated AI tools consistently handled retracted articles correctly, and no model achieved perfect accuracy across all question sets. ChatGPT 5 performed best on the primary outcome (fully correct responses to all five predefined tasks), answering all five questions correctly for 8 of 15 articles (53.3%). The research-focused tools (SciSpace, ScienceOS, and Consensus) failed to produce a single fully correct response set. Retracted articles were frequently included in topic overviews without warning, with error rates exceeding 40% for several tools. When specifically asked about retraction status, most systems failed to provide correct or complete information. OpenEvidence, which covers only the health care literature, reported data for just a subset of the retracted articles; it demonstrated strong performance in topic overviews but low accuracy in identifying retracted articles.
    Conclusions: Freely available GenAI tools are currently not able to detect, exclude, or appropriately flag retracted scientific literature. The widespread and confident reproduction of retracted studies represents a substantial threat to research integrity, particularly in medical and evidence-based fields. Until retraction-aware verification mechanisms are systematically integrated, independent source checking remains essential when using AI-assisted literature tools.
    Keywords:  AI; artificial intelligence; data accuracy; ethics; evidence-based practice; retraction of publication; retractions; scientific misconduct
    DOI:  https://doi.org/10.2196/88766
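    Interreviewer agreement of the kind reported here is typically computed with Cohen's kappa over the two raters' correct/incorrect labels; a minimal sketch on invented ratings:

      from sklearn.metrics import cohen_kappa_score

      rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]  # 1 = response rated correct
      rater_2 = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]
      print(cohen_kappa_score(rater_1, rater_2))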
  8. Trials. 2026 Apr 25.
      We have previously described a free, public web-based tool, Trials to Publications (https://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/TrialPubLinking/trial_pub_link_start.cgi), which employs a machine-learning model based on title, abstract, and other metadata features to predict which publications are likely to present clinical outcome results from a given registered trial in ClinicalTrials.gov. We have now updated and expanded the scope of the tool by extracting mentions of ClinicalTrials.gov registry numbers (NCT numbers) from the full text of three online biomedical article collections (open-access PubMed Central (PMC), Europe PMC, and OpenAlex), as well as by retrieving biomedical publications that are mentioned within the ClinicalTrials.gov registry itself. These mentions greatly increase the number of linked publications identified by the tool and should assist those carrying out evidence syntheses as well as those studying the metascience of clinical trials.
    Keywords:  Bibliographic databases; Clinical trials; Information retrieval; Linking trials to publications; Systematic reviews
    DOI:  https://doi.org/10.1186/s13063-026-09747-8
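    Mention extraction of this kind is largely a pattern-matching task, since NCT numbers have a fixed shape ("NCT" followed by eight digits); a minimal sketch, not the tool's actual pipeline:

      import re

      NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")

      text = "The trial was registered (NCT01234567) before enrolment."
      print(NCT_PATTERN.findall(text))  # ['NCT01234567']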
  9. ArXiv. 2026 Feb 12. pii: arXiv:2603.09986v1. [Epub ahead of print]
      Hallucinations, the tendency of large language models to produce responses containing factually incorrect and unsupported claims, are a serious problem in natural language processing for which no effective mitigation yet exists. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur in textbook-grounded QA and how responses to medical QA prompts vary across models. We conducted two experiments: the first to determine the prevalence of hallucinations for a prominent open-source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second to determine the prevalence of hallucinations and clinician preference for model responses. In experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility; in experiment two, across models, lower hallucination rates aligned with higher usefulness scores (ρ = -0.71, p = 0.058). Clinicians produced high agreement in experiment 1 (quadratic weighted κ = 0.92) and, in experiment 2, τ_b = 0.06 to 0.18 and κ = 0.57 to 0.61.
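    The cross-model association in experiment two is a rank correlation; a minimal sketch with invented per-model values (the paper's data are not reproduced):

      from scipy.stats import spearmanr

      hallucination_rate = [0.20, 0.12, 0.31, 0.08, 0.25, 0.15]
      usefulness_score = [3.1, 4.0, 2.4, 4.4, 2.9, 3.6]
      rho, p = spearmanr(hallucination_rate, usefulness_score)
      print(rho, p)  # strongly negative rho on this toy data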