bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026–05–24
eighteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Stud Health Technol Inform. 2026 May 21. 336 934-938
      This study presents OpenExtract, an open-source pipeline for automated data extraction in large-scale systematic literature reviews. The pipeline queries large language models (LLMs) to predict data entries based on relevant sections of scientific articles. To test the efficacy of OpenExtract, we apply it to a systematic literature review in digital health and compare its outputs with those of human researchers. OpenExtract achieves precision and recall scores of > 0.8 in this task, indicating that it can be effective at extracting data automatically and efficiently. OpenExtract: https://github.com/JimAchterbergLUMC/OpenExtract.
    Keywords:  Data Extraction; Digital Health; Large Language Models; Retrieval Augmented Generation; Survey; Systematic Literature Review
    DOI:  https://doi.org/10.3233/SHTI260316
  2. Stud Health Technol Inform. 2026 May 21. 336 670-674
      Systematic reviews are essential for evidence-based healthcare but remain highly resource-intensive, with most retrieved studies ultimately excluded after manual screening. This study developed and evaluated a hybrid expert-LLM workflow to reduce human workload while maintaining accuracy and transparency. Within the Thyroid Risk Stratification Tool (ThyRST) project, ChatGPT-5 was used to classify 14,858 records on thyroid nodule malignancy risk into thematic categories. Recurrent irrelevant concepts were refined through expert consensus involving clinicians and informaticians and embedded as exclusion rules in structured prompts. The model then labelled each abstract as INCLUDE, EXCLUDE, or MAYBE, producing outputs for audit and verification. A random sample of 100 records was independently reviewed by human assessors to evaluate performance. The workflow achieved 96% concordance (κ = 0.91) with human reviewers, with only one false exclusion, and reduced manual screening time by approximately 70%. These results demonstrate that a transparent Delphi-inspired expert-LLM workflow can accurately and reproducibly automate early-stage evidence screening, providing substantial efficiency gains while preserving human oversight and methodological rigor. The approach offers a practical pathway toward the responsible integration of generative AI in systematic review methodology and digital health research.
    Keywords:  ChatGPT; Evidence Screening; Expert Consensus; Large Language Models; Systematic Review; Thyroid Cancer
    DOI:  https://doi.org/10.3233/SHTI260255
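As a rough illustration of the agreement statistics reported in this abstract, the sketch below computes raw concordance and Cohen's κ for two raters assigning INCLUDE, EXCLUDE, or MAYBE labels. The function name `cohen_kappa` and the six toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of records on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and this sketch simply reports 1.0 in that case (an assumption).
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)

# Hypothetical toy labels, for illustration only.
llm = ["INCLUDE", "EXCLUDE", "EXCLUDE", "MAYBE", "EXCLUDE", "INCLUDE"]
human = ["INCLUDE", "EXCLUDE", "EXCLUDE", "INCLUDE", "EXCLUDE", "INCLUDE"]
agreement, kappa = cohen_kappa(llm, human)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

κ falls below raw concordance because it discounts the agreement that two raters would reach by chance given how often each uses every label.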
  3. Knee Surg Sports Traumatol Arthrosc. 2026 May 20.
       PURPOSE: Large language models (LLMs) are a form of artificial intelligence (AI) that have emerged as potential tools to augment systematic review workflows. This study aimed to evaluate GPT-5 as a third reviewer for full-text screening across orthopaedic subspecialties.
    METHODS: Three review topics were selected. Python scripts were developed to call on the GPT-5 model via the OpenAI application programming interface (API) to perform full-text screening using predefined inclusion and exclusion criteria. Two human reviewers simultaneously performed screening based on the same criteria. Performance metrics such as specificity, sensitivity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and F1 scores for GPT-5 were calculated based on a gold-standard inclusion and exclusion list developed by a third human adjudicator. Efficiency metrics included total cost and completion time.
    RESULTS: The numbers of full texts screened were 35, 70 and 146 across the three review topics. For topic one, sensitivity, specificity, PPV, NPV, accuracy and F1 scores were 100% each. For topic two, sensitivity, specificity, PPV, NPV, accuracy and F1 scores were 93.3%, 98.2%, 93.3%, 98.2%, 97.1% and 93.3%, respectively. For topic three, sensitivity, specificity, PPV, NPV, accuracy and F1 scores were 93.3%, 100%, 100%, 99.2%, 99.3% and 96.7%, respectively. Time to completion ranged between 18.1 and 58 min. Cost ranged from $0.84 to $3.29 USD.
    CONCLUSION: GPT-5 demonstrated high diagnostic accuracy as a third reviewer for full-text screening across three different subspecialties, with high agreement with final consensus adjudication decisions. These findings suggest that modern LLMs can potentially augment dual-review screening workflows by providing efficient decision-support while preserving methodological rigour. However, the small number of included studies within each topic resulted in wide confidence intervals, and additional validation across larger datasets is necessary.
    LEVEL OF EVIDENCE: Not applicable.
    Keywords:  GPT‐5; artificial intelligence; automation; full‐text; large language model; systematic review
    DOI:  https://doi.org/10.1002/ksa.70462
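The screening metrics named in this abstract all derive from a 2×2 confusion matrix against the gold-standard list. The sketch below shows the arithmetic. The abstract does not report the underlying counts, so the counts used here (14 true positives, 1 false positive, 1 false negative, 54 true negatives, totalling 70) are an assumption: they are simply one set that happens to reproduce the percentages reported for topic two. The function name `screening_metrics` is hypothetical.

```python
def screening_metrics(tp, fp, fn, tn):
    """Compute screening metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one positive and one
    negative in both the gold standard and the model's decisions).
    """
    return {
        "sensitivity": tp / (tp + fn),   # included studies correctly included
        "specificity": tn / (tn + fp),   # excluded studies correctly excluded
        "ppv": tp / (tp + fp),           # model inclusions that were correct
        "npv": tn / (tn + fn),           # model exclusions that were correct
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Assumed counts (not from the paper), chosen to sum to 70 full texts.
metrics = screening_metrics(tp=14, fp=1, fn=1, tn=54)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

With only about 15 included studies in such a matrix, a single additional missed study would shift sensitivity by several percentage points, which is the reason the authors flag wide confidence intervals.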
  4. BMJ Evid Based Med. 2026 May 17. pii: bmjebm-2025-114055. [Epub ahead of print]
       OBJECTIVES: To evaluate the performance of large language models (LLMs) in risk of bias assessment and to examine whether prompt engineering improves their accuracy and alignment with expert reasoning.
    METHODS: We analysed 158 randomised controlled trials from 10 dental systematic reviews; their risk of bias assessments were reviewed and revised to serve as the reference standard. Two LLMs (DeepSeek-V3 and GPT-5) were evaluated under four prompting strategies: direct command, command with reference, constrained output and formula-constrained output. The direct command served as the blank control group, simulating the approach commonly used by clinicians, whereas the other three groups employed different prompt engineering strategies. The performance of LLMs across the seven domains of RoB-1 was evaluated using accuracy and agreement. The reasoning process of the LLMs was expressed in the form of syllogisms and its similarity to expert reasoning was assessed using MMD².
    RESULTS: LLMs showed limited capability in risk of bias assessment under the blank control condition, with mean accuracies of 0.72 for DeepSeek-V3 and 0.65 for GPT-5. With formula-constrained prompting, the performance of both LLMs improved significantly, and the overall accuracy increased to 0.85 for both DeepSeek-V3 and GPT-5 (both vs the blank control group, p<0.001). Agreement metrics showed a similar pattern, with higher agreement under formula-constrained prompting than under the other prompting strategies (p<0.001 for both models). In addition, the syllogistic output format provided a clear representation of the reasoning process underlying risk of bias assessment. Compared with constrained output, formula-constrained prompting also produced reasoning that was more closely aligned with the reference answers, as indicated by lower MMD² values (DeepSeek-V3: 0.0765 vs 0.1239; GPT-5: 0.0548 vs 0.1068).
    CONCLUSION: Prompt engineering substantially improved the performance of LLMs in risk of bias assessment. Although LLMs cannot currently replace human reviewers, they may serve as efficient and transparent tools to support this process.
    Keywords:  Dentistry; Evidence-Based Practice; Information Science; Methods; Systematic Reviews as Topic
    DOI:  https://doi.org/10.1136/bmjebm-2025-114055
  5. Front Res Metr Anal. 2026 ;11 1807672
       Introduction: Large language models (LLMs) show great promise as tools for assisting scientific peer review, but their agreement with human experts in quantitative assessment of academic content needs further investigation. This study examined ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5's consistency and reliability in evaluating conference abstracts compared to one another and to human reviewers.
    Methods: Three LLMs independently graded 160 abstracts from a regional conference, while 14 human reviewers each assessed a subset using an identical rubric with eight criteria scored on a 1-5 scale. We compared AI and human scoring patterns using boxplots, calculated intraclass correlation coefficients (ICCs) for inter-rater reliability both among LLMs and between human and LLMs, and examined Bland-Altman plots to identify agreement patterns and systematic bias.
    Results: The three LLMs demonstrated high internal consistency with narrow interquartile ranges and few outliers in composite scores, while human reviewers exhibited greater scoring variability. LLMs also achieved good-to-excellent agreement with each other across all criteria (ICCs: 0.59-0.87). ChatGPT and Claude reached moderate agreement with human reviewers on overall quality and content-specific criteria, with ICCs = 0.45-0.60 for composite score, impression, clarity, objective, and results. The two LLMs' concordance with humans reached only fair levels on subjective dimensions, with ICCs ranging from 0.23 to 0.38 for impact, engagement, and applicability. Gemini performed notably worse, showing fair agreement on half the criteria and poor reliability on impact and applicability. Bland-Altman analysis revealed acceptable or negligible systematic bias, with mean differences of 0.24 (ChatGPT), 0.42 (Gemini), and -0.02 (Claude) from human mean ratings.
    Discussion: With appropriate model selection, LLMs could reach moderate agreement with human experts on abstract overall quality and objective criteria, supporting their potential use for pre-screening low-quality submissions or serving as additional reviewers. Their ability to apply rubrics consistently across large volumes of abstracts offers advantages in efficiency and standardization that exceed human feasibility. However, LLMs' reduced performance on subjective dimensions indicates that they should complement rather than replace human judgment in abstract evaluation, with expert review remaining essential for comprehensive assessment.
    Keywords:  abstract evaluation; artificial intelligence; inter-rater reliability; large language models; peer-review
    DOI:  https://doi.org/10.3389/frma.2026.1807672
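The Bland-Altman mean differences quoted in this abstract are the average of paired (model minus human) scores. A minimal sketch of that calculation, together with the conventional 95% limits of agreement (bias ± 1.96 standard deviations of the differences), follows. The function name `bland_altman` and the five pairs of scores are hypothetical and do not come from the study.

```python
from statistics import mean, stdev

def bland_altman(model_scores, human_scores):
    """Return (bias, lower limit, upper limit) for paired scores."""
    if len(model_scores) != len(human_scores):
        raise ValueError("score lists must have equal length")
    diffs = [m - h for m, h in zip(model_scores, human_scores)]
    bias = mean(diffs)                # systematic over- or under-scoring
    spread = 1.96 * stdev(diffs)      # half-width of the 95% limits of agreement
    return bias, bias - spread, bias + spread

# Hypothetical composite scores on a 1-5 rubric, for illustration only.
model = [4.0, 3.5, 4.5, 3.0, 4.0]
human = [3.5, 3.5, 4.0, 3.5, 4.0]
bias, lower, upper = bland_altman(model, human)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A bias near zero indicates no systematic over- or under-scoring, but the limits of agreement still show how far individual ratings can diverge.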
  6. Behav Res Methods. 2026 May 22. pii: 170. [Epub ahead of print]58(6):
      Manual data extraction in meta-research is often tedious, time-consuming, and error-prone. In this paper, we investigate whether the current generation of large language models (LLMs) can be used to extract accurate information from scientific papers. Across the meta-research literature, these tasks usually range from extracting verbatim information (e.g., the number of participants in a study, effect sizes, or whether a study is preregistered) to making subjective inferences. Using a publicly available dataset containing a wide range of metascientific variables from 43 network psychometrics papers, we tested five LLMs (Claude 4.6 Opus, Claude 4.5 Sonnet, Claude 4.5 Haiku, GPT-5.2, and GPT-5 mini). We used an automated API-based pipeline to extract variables from the documents. This approach allows batch processing of research papers. As such, it represents a more efficient and scalable way to extract metascientific data than the default chat interface. The extraction accuracy ranged from 79.6% to 91.3% across the models. The extraction performance was generally higher for more explicit, verbatim information and worse for variables that required more complicated inference. Furthermore, most models were able to convey uncertainty in more contentious cases. We provide a comparison of the accuracy and cost-effectiveness of the individual models and discuss the characteristics of variables that are and are not suitable for automatic coding. Furthermore, we describe some of the common pitfalls and best practices of automated LLM data extraction. The proposed procedure can substantially reduce the time and costs associated with conducting meta-research.
    Keywords:  Large language models; Metaresearch; Metascience; Network psychometrics
    DOI:  https://doi.org/10.3758/s13428-026-03052-7
  7. AEM Educ Train. 2026 Jun;10 e70189
       Objectives: The increasing volume of medical education research necessitates efficient, reliable, and scalable methods for conducting quality appraisals. The Medical Education Research Study Quality Instrument (MERSQI) is a widely used tool, although its manual scoring process remains resource-intensive. This study evaluated how well large language models (LLMs) appraise medical education research using the MERSQI tool in comparison with human judges.
    Methods: Three LLMs (GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro) assigned MERSQI domain scores to 1423 medical education research articles. The authors compared AI-generated scores with human-generated scores using intraclass correlation coefficients (ICCs) across the six MERSQI domains. They evaluated the agreement between AI- and human-generated MERSQI composite scores using Bland-Altman plots.
    Results: Domain-level ICC values ranged from fair (0.24) to near perfect (0.81), with the lowest agreement observed in the 'sampling,' 'validity evidence,' and 'data analysis' domains. No single LLM consistently outperformed the others across all domains. Composite score agreement with human ratings was substantial and similar across LLMs (ICC range: 0.65-0.69). GPT-5 produced slightly lower composite scores than humans, while Claude Sonnet 4 and Gemini 2.5 Pro produced higher scores, with Gemini showing the largest deviation. The Bland-Altman plots for Gemini 2.5 Pro suggested proportional bias, indicating its agreement with human scores varied across the range of study quality.
    Conclusions: These LLMs demonstrated substantial agreement with human raters for MERSQI composite scores, but domain-level agreement varied. Systematic differences in scoring patterns highlight the need for human oversight and additional calibrations before integrating LLMs into systematic review appraisal workflows.
    Keywords:  AI‐assisted research; MERSQI; artificial intelligence; medical education
    DOI:  https://doi.org/10.1002/aet2.70189
  8. Stud Health Technol Inform. 2026 May 21. 336 1058-1059
      Large language models (LLMs) can accelerate the early stages of thematic analysis while preserving rigor. We piloted a lightweight, open-source LLM pipeline on a single semi-structured interview from a mixed-methods case study of Québec's electronic health-record rollout. Manual coders applied a predefined codebook to identify concerns about the deployment; four LLMs (including Hermes 3) performed the same task. Hermes 3 achieved the highest extraction accuracy, identifying 70 concern dimensions of which 54 were valid. Combined human-machine coding yielded 67 unique valid concerns, with the LLM uncovering 11 dimensions missed manually. These results show that LLMs can markedly reduce manual effort and reveal overlooked themes, yet expert validation remains essential. A hybrid workflow, LLM-driven extraction combined with researcher manual coding, is recommended.
    Keywords:  Human–machine coding synergy; Large language models (LLMs); Reducing manual coding workload; Thematic analysis
    DOI:  https://doi.org/10.3233/SHTI260349
  9. BMC Med Res Methodol. 2026 May 21.
       BACKGROUND: The exponential growth of biomedical literature challenges the feasibility, reproducibility, and bias control of diagnostic meta-analyses based on manual screening.
    METHODS: We propose a scalable framework integrating automated topic modeling (Latent Dirichlet Allocation, LDA) for thematic pre-screening with hierarchical multivariate meta-analysis to jointly synthesize sensitivity and specificity. Abstracts from eight databases were processed using linguistic normalization, lemmatization, and probabilistic topic modeling to prioritize diagnostically relevant studies. Selected studies were synthesized using bivariate hierarchical random-effects models on the logit scale, allowing incorporation of methodological and clinical moderators.
    RESULTS: Applied to dengue diagnostic algorithms in Latin America, the framework reduced 5,766 retrieved records to 10 studies contributing 94 algorithms. Machine-learning-based models showed significantly higher joint diagnostic performance than traditional statistical models (logit coefficient = 1.5657, p < 0.001), while external validation was associated with a significant loss of performance (-0.7962, p < 0.001). Estimated sensitivity ranged from 37.97% to 86.66% and specificity from 63.26% to 94.81% across algorithm types and phases.
    CONCLUSIONS: The proposed workflow offers a methodologically rigorous and scalable approach to diagnostic evidence synthesis that is adaptable to clinical domains characterized by rapid literature growth and heterogeneous diagnostic evidence.
    Keywords:  Dengue diagnosis; Diagnostic accuracy; Hierarchical meta-analysis; LDA; Machine learning in healthcare; Systematic review automation; Topic modeling
    DOI:  https://doi.org/10.1186/s12874-026-02873-6
  10. Stud Health Technol Inform. 2026 May 21. 336 949-953
      Drug-indication knowledge underpins evidence-based prescribing but remains challenging to maintain manually. Existing resources incompletely represent off-label uses that may be captured within artificial intelligence models or described in the biomedical literature. This study presents THERA-IE (Therapeutic Hypothesis Extraction and Relationship Analytics - Indication Extraction), an open-source framework that combines natural language processing, transformer, and large language model (LLM) approaches to identify putative therapeutic indications. Twenty FDA-approved drugs were analyzed using SemMedDB, PubMedBERT, and three open-source LLMs (Llama 3.2, MedLlama 3, and Qwen 3). LLMs were implemented in both model-only ("Naïve") and literature-supported ("Literature-Based") modes. System performance was benchmarked against DrugBank. Naïve Llama 3.2 achieved the highest overall F1-score (0.60 ± 0.18), while the Literature-Based Llama 3.2 attained the highest precision (0.67 ± 0.29). SemMedDB achieved the highest recall (0.77 ± 0.25) but also produced the most false positives. THERA-IE identified an average of 4.8 ± 2.1 plausible off-label indications per drug. These findings demonstrate the potential of THERA-IE to enable scalable, reproducible extraction of therapeutic knowledge from biomedical literature.
    Keywords:  LLM; drug indications; natural language processing; off-label use
    DOI:  https://doi.org/10.3233/SHTI260319
  11. Stud Health Technol Inform. 2026 May 21. 336 468-472
      The rapid expansion of clinical knowledge presents significant challenges for maintaining current and comprehensive clinical practice guidelines (CPGs). Manual curation processes are resource-intensive and often result in delayed integration of new evidence. We present a reproducible pipeline for semi-automated surveillance of CPGs. The system ingests heterogeneous guideline documents, parses main text (PyMuPDF) and segments sentences (spaCy), normalizes terms via a concept layer, embeds passages with biomedical language models, and indexes them with FAISS for dense retrieval. A retrieval-augmented generation (RAG) step drafts editorial suggestions with provenance for expert review. PubMedBERT achieved the highest performance across metrics, with 80% inter-annotator agreement, supporting expert-guided AI guideline updates.
    Keywords:  Clinical practice guidelines (CPGs); FAISS; biomedical NLP; information retrieval; nonparametric statistics; retrieval-augmented generation
    DOI:  https://doi.org/10.3233/SHTI260199
  12. Bioinform Adv. 2026 ;6(1): vbag116
       Motivation: The exponential growth of academic literature has presented unprecedented opportunities. However, it also underscores the need for advanced search methodologies to support efficient knowledge discovery. While effective for structured queries, traditional keyword-based search engines often struggle with the inherent variability of language, where the same concept can be expressed in many ways, leading to incomplete or imprecise retrieval of relevant research. Another issue that must be considered is that of lexical ambiguity, such as polysemy or homonymy, whereby several words and abbreviations can have multiple meanings. This results in items placed in the results list that are irrelevant to the search context. Recent advances in natural language processing have enabled semantic similarity techniques that move beyond basic text matching toward context-aware search.
    Results: We developed VectorSage (https://vectorsage.nube.uni-greifswald.de), an advanced biomedical search system for retrieving PubMed abstracts using a hybrid approach that combines term relevance scoring with embedding-based semantic similarity. VectorSage employs a global ranking mechanism to enhance further search relevance by sorting the retrieved documents, ensuring a balance between semantic relevance and keyword specificity. This method enables efficient literature exploration and knowledge discovery.
    DOI:  https://doi.org/10.1093/bioadv/vbag116
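The abstract above does not specify how VectorSage combines its two signals, so the following is only a generic sketch of hybrid ranking: a crude lexical score (fraction of query terms present) is blended with cosine similarity between embedding vectors using a weight `alpha`. All names (`term_score`, `hybrid_rank`, `alpha`), the two-dimensional toy vectors, and the documents are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def term_score(query_terms, doc_terms):
    """Fraction of query terms that appear in the document."""
    return sum(t in doc_terms for t in query_terms) / len(query_terms)

def hybrid_rank(query_terms, query_vec, docs, alpha=0.5):
    """Rank document ids by a weighted mix of lexical and semantic scores."""
    scored = []
    for doc_id, (doc_terms, doc_vec) in docs.items():
        score = (alpha * term_score(query_terms, doc_terms)
                 + (1 - alpha) * cosine(query_vec, doc_vec))
        scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Toy corpus: (set of terms, made-up 2-d embedding) per document.
docs = {
    "doc_a": ({"heart", "attack", "risk"}, [0.9, 0.1]),
    "doc_b": ({"myocardial", "infarction"}, [0.8, 0.2]),
    "doc_c": ({"attack", "network", "security"}, [0.1, 0.9]),
}
ranking = hybrid_rank(["heart", "attack"], [1.0, 0.0], docs)
print(ranking)
```

In this toy corpus the synonym-only document (`doc_b`) outranks the keyword-matching but off-topic one (`doc_c`), which is the kind of behaviour the Motivation paragraph attributes to context-aware search.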
  13. Stud Health Technol Inform. 2026 May 21. 336 1041-1042
       INTRODUCTION: Manual construction of UMLS concept sets is time-consuming and inconsistent across users.
    METHODS: CUI-Curate, a GPT-5 and graph-based retrieval framework, was developed to automate clinical concept set generation from UMLS source vocabularies.
    RESULTS: Across five target concepts, CUI-Curate achieved higher recall (mean gain = +0.17) while maintaining high precision (mean 0.94), comparable to manual curation, and substantially reducing manual effort.
    CONCLUSION: Automated, LLM-assisted curation offers an efficient and reproducible alternative to manual UMLS browsing.
    Keywords:  Large language models; Unified Medical Language System; clinical text mining; graphRAG; knowledge graph; natural language processing
    DOI:  https://doi.org/10.3233/SHTI260341
  14. Stud Health Technol Inform. 2026 May 21. 336 854-858
      Automating the classification of clinical evidence levels in biomedical literature can support precision oncology by accelerating variant interpretation and informed decision-making. This study compares the performance of two state-of-the-art large language models (LLMs) (GPT-4.1-mini and Gemini-2.5-Flash) and two machine learning (ML) algorithms (decision tree and XGBoost) for classifying publications according to the Clinical Interpretation of Variants in Cancer (CIViC) evidence level system. Zero- and few-shot prompting strategies were tested for LLMs, while Term Frequency-Inverse Document Frequency (TF-IDF) and word embedding representations were evaluated for ML models. XGBoost with TF-IDF achieved the highest performance (micro-F1 = 0.83), outperforming both LLMs and decision trees. All models performed best on mid-range evidence levels (B to D) and struggled with high (A) and inferential (E) levels, reflecting dataset imbalance and linguistic ambiguity. These findings suggest that, at present, abstract-level evidence classification is largely driven by explicit lexical cues, with limited added benefit from standalone LLM-based approaches.
    Keywords:  Clinical Evidence Level; Large Language Models; Machine Learning; Text Classification
    DOI:  https://doi.org/10.3233/SHTI260300
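For readers unfamiliar with the micro-F1 figure quoted above, the sketch below pools true positives, false positives, and false negatives over all classes before computing F1. It assumes each publication carries exactly one evidence level, in which case micro-F1 equals plain accuracy. The abstract does not state the labelling scheme in that detail, and the six toy labels are hypothetical.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1: pool TP, FP, and FN over all classes, then compute F1."""
    labels = set(true_labels) | set(predicted_labels)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(true_labels, predicted_labels):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1          # predicted this class, truth was another
            elif truth == label:
                fn += 1          # truth was this class, predicted another
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical evidence levels (A to E), for illustration only.
truth = ["B", "C", "D", "A", "E", "B"]
preds = ["B", "C", "D", "B", "D", "B"]
score = micro_f1(truth, preds)
print(f"micro-F1 = {score:.2f}")
```

Because every misclassification counts once as a false positive and once as a false negative, rare classes such as A and E contribute little to micro-F1, so a high pooled score can coexist with weak performance on those levels.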
  15. BioData Min. 2026 May 16.
       BACKGROUND: Identifying confounding variables is fundamental for robust observational studies, yet the traditional manual process is a time-consuming and subjective barrier for researchers. Recent advances in Retrieval-Augmented Generation (RAG) offer a promising solution, but most existing systems rely on full-text access, cloud-hosted APIs, or manually curated knowledge graphs, raising concerns about privacy, copyright, and computational cost, and making local deployment difficult.
    OBJECTIVE: This study developed and evaluated a heuristic tool to scope candidate confounders for adjustment in observational studies. Using a locally deployed, abstract-only RAG architecture, our tool generates a traceable shortlist of candidate confounders from PICO (Population, Intervention, Comparison, Outcome) queries over medical abstracts.
    METHODS: We implemented a three-stage architecture for PICO-based scoping of candidate confounders. The pipeline was deployed on an all-in-one local server and evaluated using 1,000 expert-curated PICO queries spanning 20 clinical specialties. Performance was assessed along four dimensions (internal consistency, output volume, efficiency, and clinical acceptance) by a multi-institutional clinician panel, and was compared with a graph-only SemMedDB baseline.
    RESULTS: Across repeated runs, the pipeline showed high internal consistency (candidate confounder list consistency 94.6%±8.7%; PMID set consistency 79.4%±23.5%). It suggested a median of 6 candidate confounders (IQR 8) for adjustment and retrieved a median of 33 unique PMIDs (IQR 7) per query. Median processing time was 44.50 s (IQR 31.72). Expert review yielded an overall clinical acceptance rate of 87.12%.
    CONCLUSIONS: In an exploratory capacity, a locally deployed, abstract-only RAG workflow can generate clinically interpretable and traceable candidate confounder suggestions to support early-stage observational study design, particularly in settings with privacy constraints or limited access to full texts and cloud resources.
    TRIAL REGISTRATION: NA.
    Keywords:  Clinical research support; Confounder identification; Large language models; Local deployment; Observational study design; PICO framework; Retrieval-augmented generation
    DOI:  https://doi.org/10.1186/s13040-026-00562-0
  16. Vet Clin North Am Small Anim Pract. 2026 May 18. pii: S0195-5616(26)00052-5. [Epub ahead of print]
      This article reviews computer systems that support veterinary clinical practice using artificial intelligence language models for language interpretation and generation, such as systems for client communication, medical records, clinical decision support, and clinical practice assessment. It provides guidance on incorporating tools based on large language models into clinical workflows to improve efficiency, clinical accuracy, and provider performance. Key inherent risks and recommendations for the responsible use of this technology by veterinary professionals are provided.
    Keywords:  AI in veterinary medicine; Artificial intelligence; Clinical decision support; Large language models; Natural language processing; Responsible AI
    DOI:  https://doi.org/10.1016/j.cvsm.2026.03.014
  17. Stud Health Technol Inform. 2026 May 21. 336 1045-1046
      Recommendations from clinical practice guidelines are crucial for improving patient care. We compare LLM/VLM-based extraction approaches against a rule set baseline. The results show that most recommendations can be extracted, but a potential risk for subsequent use remains.
    Keywords:  Clinical Practice Guidelines; Information Extraction; LLMs
    DOI:  https://doi.org/10.3233/SHTI260343