bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026–06–07
thirteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. BMC Med Res Methodol. 2026 May 30.
    Air and Health Atlas Study Group
       BACKGROUND: The exponential growth of scientific publications has increased the complexity of evidence synthesis. Systematic reviews remain essential but highly resource-intensive. Large language models (LLMs) offer new opportunities to support or partially automate key steps of this process. This study evaluates the performance of Elicit's Systematic Reviews workflow in comparison to the traditional methodology, using as reference a published umbrella review on the association between air pollution and acute lower respiratory infections (ALRI).
    METHODS: A parallel workflow was developed to reproduce each phase of the traditional review. For article retrieval, the traditional workflow used a Boolean search, while the AI-assisted workflow used a natural-language query submitted to Elicit. Screening was conducted in two steps, emulating the traditional PECOS-based criteria. Full-text evaluation and quality appraisal were performed through Elicit's "data extraction" functionalities. For quality appraisal the validated AMSTAR-2 EH questionnaire was applied.
    RESULTS: The traditional Boolean search identified 324 unique articles. When compared with the 500 records retrieved by Elicit, an overlap of 8% was observed, which prevented a direct, recall-oriented comparison of search performance. To enable a controlled comparison of downstream steps, we applied Elicit's screening and data-extraction functions to the 324 records identified through the Boolean search, using an empirically defined screening-score threshold in Elicit to select studies for further evaluation. In the title and abstract screening, 33 articles were identified through the traditional workflow and 70 through Elicit, with 30 articles overlapping (recall 90.9%, precision 42.9%). The full-text screening selected 15 articles with the traditional methodology and 24 with Elicit, with all 15 articles from the traditional methodology included in the Elicit selection (recall 100%, precision 62.5%). In the quality assessment, Elicit showed 24.4% disagreement on general items and 30.4% on additional items of the AMSTAR-2 EH. Errors clustered around multi-component questions, items requiring expert interpretation, and information located in supplementary materials.
    CONCLUSIONS: Elicit can support several phases of systematic reviews and reduce manual workload, but it cannot independently reproduce the methodological rigor required for high-quality evidence synthesis. At present, LLM-based tools are best positioned as complementary systems within human-supervised workflows.
    Keywords:  AMSTAR-2 EH; Artificial Intelligence; Data extraction; Evidence synthesis; Large language models; Systematic reviews
    DOI:  https://doi.org/10.1186/s12874-026-02892-3
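    The recall and precision figures quoted in the RESULTS above follow directly from the reported counts. The snippet below is a minimal sketch of that arithmetic, assuming the traditional workflow's selection is treated as the reference set; the function and variable names are illustrative and are not taken from the paper.

```python
def recall_precision(n_reference: int, n_tool: int, n_overlap: int) -> tuple[float, float]:
    """Return (recall, precision) in percent.

    n_reference: records selected by the reference (traditional) workflow
    n_tool:      records selected by the tool under evaluation
    n_overlap:   records selected by both
    """
    recall = 100.0 * n_overlap / n_reference
    precision = 100.0 * n_overlap / n_tool
    return recall, precision


# Counts as reported in the abstract.
title_abstract = recall_precision(n_reference=33, n_tool=70, n_overlap=30)
full_text = recall_precision(n_reference=15, n_tool=24, n_overlap=15)

print(f"title/abstract: recall {title_abstract[0]:.1f}%, precision {title_abstract[1]:.1f}%")
print(f"full text:      recall {full_text[0]:.1f}%, precision {full_text[1]:.1f}%")
```

    Read this way, precision below 100% means Elicit passed extra records to the next stage; it does not mean relevant records were lost.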
  2. Ann Rheum Dis. 2026 Jun 02. pii: S0003-4967(26)00282-7. [Epub ahead of print]
       OBJECTIVES: Systematic literature reviews (SLRs) provide the scientific basis for European Alliance of Associations for Rheumatology (EULAR) task force projects, but they are highly time- and labour-intensive in an ever-growing research landscape, repetitive, and susceptible to human error. In this study, we evaluate the performance of machine learning (ML) models developed for semiautomated title and abstract screening, aiming at supporting and accelerating future review processes.
    METHODS: Title and abstract screening of 4 SLRs, conducted to inform the 2025 update of the EULAR management recommendations for rheumatoid arthritis, was replicated using different ML models, with manual screening serving as reference standard. Eligible software was identified via the Systematic Review Toolbox, online and ML-based searches, and reference checking. Tools were included if they were actively maintained, accessible, and offered screening functionality beyond commonly used reference managers.
    RESULTS: Nine tools employing ML-based record prioritisation and relevance prediction were identified, of which 3 met all inclusion criteria and were evaluated in detail. These were applied to search results from 3 SLRs on randomised controlled trials (RCTs) and 1 that also included observational studies. Across reviews, substantial workload reductions were achieved (mean 77.8%, SD 12.8%) while consistently capturing over 95% of all relevant records. Additionally, ML-supported classification and filtering of RCTs prior to screening resulted in a mean reduction of abstracts to screen by 57.3% (SD 7.4%).
    CONCLUSIONS: Replication of manual title and abstract screening demonstrates the considerable potential of ML tools to support systematic reviews, while highlighting important limitations and pitfalls. Prospective evaluations will identify optimal strategies for integrating these tools into future review processes while maintaining high methodological standards.
    DOI:  https://doi.org/10.1016/j.ard.2026.05.006
  3. J Clin Epidemiol. 2026 May 29. pii: S0895-4356(26)00224-6. [Epub ahead of print] 112349
       OBJECTIVES: Living guidelines are an emerging approach to ensure timely synthesis of the research evidence. However, pragmatic methods for maintenance are needed to ensure sustainability. Our study aimed to simulate and evaluate the performance and efficiency of various single database evidence retrieval workflows augmented by AI-enabled pre-ranking with cutoff for living guideline development and maintenance.
    METHODS: A retrospective simulation study was conducted using data from the 2023 International Polycystic Ovary Syndrome Guidelines. Simulations were run across four databases (Medline, Embase, PubMed and OpenAlex) to identify the peer-reviewed articles included in the guidelines. Workflows were evaluated at the guideline (all articles) and topic level. Single database topic-specific searches were compared against single database overarching searches. The performance of overarching searches with AI-enabled pre-ranking with cutoff at guideline and topic level was also evaluated. Metrics included recall, precision, F score, number of articles needed to screen per relevant study (NNR) and overall screening workload.
    RESULTS: Across 38 eligible topics (854 articles), overarching searches outperformed topic-specific searches at guideline level for both recall (92% to 96% versus 76% to 89%) and efficiency, reducing overall screening workload by 63% to 70%, and requiring teams to screen 28 to 48 articles per relevant study versus 76 to 160 between comparable databases (Embase, Medline). At individual topic level, topic-specific searches were more efficient than overarching searches integrated with topic-specific rankings. However, topic-specific searches had significantly lower recall (p<0.01) in comparison. AI-enabled ranking provided only marginal efficiency gains at guideline level (3% to 21% NNR reduction) compared to topic level (85% to 95% NNR reduction). Lastly, performance of automated article retrieval via PubMed API was equivalent to manual retrieval via Ovid Medline.
    CONCLUSION: Single database overarching searches outperform single database topic-specific searches and should be considered during guideline maintenance when most of the guideline needs updating. While topic-specific searches may be more efficient in instances where only a few areas need to be updated, using a single database approach may result in lower recall. Single database overarching searches integrated with topic-specific rankings can be considered in such cases.
    Keywords:  Evidence synthesis; evidence retrieval; learning health systems; living evidence; living guidelines; vector search
    DOI:  https://doi.org/10.1016/j.jclinepi.2026.112349
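    The NNR metric used above (articles needed to screen per relevant study) is the number of records screened divided by the number of relevant records found, i.e. the reciprocal of precision. The sketch below illustrates the calculation and a relative NNR reduction. The helper names are hypothetical, the record counts in the first example are invented for illustration, and pairing the reported NNR endpoints (28 with 76, 48 with 160) is an assumption, since the abstract does not say how the ranges correspond.

```python
def nnr(records_screened: int, relevant_found: int) -> float:
    """Articles needed to screen per relevant study (equals 1 / precision)."""
    if relevant_found <= 0:
        raise ValueError("NNR is undefined when no relevant records are found")
    return records_screened / relevant_found


def relative_reduction(baseline: float, new: float) -> float:
    """Percentage reduction of `new` relative to `baseline`."""
    return 100.0 * (baseline - new) / baseline


# Invented counts, for illustration only: 2400 records screened, 50 relevant.
example_nnr = nnr(records_screened=2400, relevant_found=50)

# NNR endpoints as reported in the abstract (overarching vs topic-specific);
# the low-with-low and high-with-high pairing is an assumption.
low_pair = relative_reduction(baseline=76, new=28)
high_pair = relative_reduction(baseline=160, new=48)

print(f"example NNR: {example_nnr:.0f}")
print(f"relative NNR reduction: {low_pair:.1f}% and {high_pair:.1f}%")
```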
  4. Int J Med Inform. 2026 Jun 02. pii: S1386-5056(26)00256-X. [Epub ahead of print] 218: 106516
       OBJECTIVE: This study evaluated model-configuration and stopping-rule decisions when using active learning-based title-and-abstract screening in health technology evidence syntheses.
    METHODS: We conducted retrospective simulations using seven pre-labelled datasets from systematic, scoping, and overview reviews in health technology. Simulations were implemented with ASReview Makita and compared lightweight configurations based on one-hot encoding or term frequency-inverse document frequency with naive Bayes, logistic regression, random forest, and support vector machine classifiers. Performance was evaluated using normalised recall regret ("loss"), work saved over sampling at 95% (WSS@95) and 100% recall (WSS@100), early recall, and K%-consecutive-irrelevant stopping rules. Repeated simulations and exploratory dataset-level analyses were conducted for the highest-ranked configuration.
    RESULTS: SVM + TF-IDF (with bigrams) had the most favourable overall performance, with an average loss of 0.08 (95% CI 0.06 to 0.09), WSS@95 of 0.70 (95% CI 0.59 to 0.79), and WSS@100 of 0.50 (95% CI 0.30 to 0.69). At a fixed 7% consecutive-irrelevant stopping rule, all datasets reached at least 95% recall in the main analysis, with mean recall of 98%. In repeated simulations, the fixed 7% rule achieved mean recall of 97%; however, one very low-prevalence dataset did not reach 95% recall until K = 33%. Exploratory analyses suggested that relevant-record prevalence, textual similarity among relevant records, and abstract completeness may help explain variation in model performance and stopping-rule reliability, although these analyses were hypothesis-generating.
    CONCLUSION: Active learning-based screening reduced workload in these health technology datasets, but its use requires explicit implementation choices. SVM + TF-IDF (with bigrams) was the most pragmatic initial configuration, and a 7% consecutive-irrelevant rule was a useful stopping heuristic. However, stopping decisions should depend on the review's tolerance for missed studies, dataset quality, topic heterogeneity, and available safeguards, rather than on a fixed threshold alone.
    Keywords:  ASReview; Computational simulation; Evidence-based health science; Systematic review
    DOI:  https://doi.org/10.1016/j.ijmedinf.2026.106516
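    Two of the quantities above can be made concrete with a small sketch: work saved over sampling at a target recall (WSS@r) and a K%-consecutive-irrelevant stopping rule. This is not ASReview or Makita code. It assumes the common definition WSS@r = (N - n_screened) / N - (1 - r), and it assumes K% is taken relative to the total number of records. The label sequence is invented for illustration.

```python
import math


def wss_at(labels: list[int], target_recall: float = 0.95) -> float:
    """WSS at target recall for labels given in screening order (1 = relevant)."""
    n = len(labels)
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("no relevant records in labels")
    needed = math.ceil(target_recall * total_relevant)
    found = 0
    for screened, label in enumerate(labels, start=1):
        found += label
        if found >= needed:
            return (n - screened) / n - (1.0 - target_recall)
    raise AssertionError("unreachable: all relevant records are in labels")


def stop_index(labels: list[int], k_percent: float) -> int:
    """Records screened when the run of consecutive irrelevant records
    first reaches K% of the dataset size (assumed interpretation of K%)."""
    n = len(labels)
    window = max(1, math.ceil(n * k_percent / 100))
    run = 0
    for screened, label in enumerate(labels, start=1):
        run = 0 if label else run + 1
        if run >= window:
            return screened
    return n  # rule never triggered: everything was screened


def recall_at(labels: list[int], screened: int) -> float:
    """Recall after screening the first `screened` records."""
    return sum(labels[:screened]) / sum(labels)


# Invented ranking: 20 records, 4 relevant, the last one found late.
labels = [1, 1, 0, 1, 0, 0, 1] + [0] * 13

print(f"WSS@95 = {wss_at(labels, 0.95):.2f}")
for k in (7, 15):
    stop = stop_index(labels, k)
    print(f"K = {k}%: stop after {stop} records, recall {recall_at(labels, stop):.2f}")
```

    In this toy ranking the 7% rule stops before the last relevant record is reached, while a larger K captures it. That mirrors the abstract's caution about low-prevalence datasets, without reproducing its data.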
  5. JCO Clin Cancer Inform. 2026 Apr;10(2): e2500386
       PURPOSE: The growing volume of biomedical literature, especially in oncology, necessitates automated tools for extracting clinically relevant information. Large language models (LLMs) offer promising capabilities for data extraction. However, their potential to extract clinically relevant information from case reports detailing rare treatment interactions remains underexplored.
    METHODS: We systematically searched PubMed for case reports on interactions between radiotherapy (RT) and pembrolizumab, cetuximab, or cisplatin. A random sample of 100 report abstracts for each therapy was manually classified by two independent medical experts using 23 Boolean questions about patient demographics, treatment, cancer type, and outcomes with mutually exclusive answers, forming a ground truth. An LLM-based system with the open-source Generative Pretrained Transformer (GPT) models (GPT-OSS-120B and GPT-OSS-20B) was applied to classify these reports and the remaining data set entries using the defined question structure. Performance of the approach was evaluated using the standard classification metrics accuracy, precision, recall, and F1-scores.
    RESULTS: The searches yielded 320 (pembrolizumab), 147 (cetuximab), and 2055 (cisplatin) publications. Inter-rater agreement for manual classification was high (Cohen's kappa = 0.85), though lower for specific outcome and cancer type questions. The LLM-based classification (GPT-OSS-120B model) achieved high overall performance with an F1-score of 93.64% (95.19% accuracy, 93.23% precision, 94.05% recall). Performance was consistent across systemic therapies (STs), with the GPT-OSS-20B model showing similar results (F1-score 93.22%). Analysis of the entire data sets revealed that 56.14% of publications described patients who received both RT and ST. Proportions of positive and negative outcomes varied by therapy and sequencing.
    CONCLUSION: LLM-based classification systems demonstrate high performance for curating scientific case reports on RT and ST interactions. These findings support their potential for high-throughput hypothesis generation and knowledge base construction, particularly for underutilized case reports, with even smaller open-source models proving to be effective.
    DOI:  https://doi.org/10.1200/CCI-25-00386
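    As a quick consistency check on the metrics above, the F1-score is the harmonic mean of precision and recall. The sketch below applies that formula to the precision and recall reported for GPT-OSS-120B. It assumes the reported values are pooled in a way that keeps this identity intact; the abstract does not state the averaging scheme, and per-question averaging would not necessarily satisfy it.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall as reported for the GPT-OSS-120B model.
f1 = f1_score(precision=93.23, recall=94.05)
print(f"F1 = {f1:.2f}%")
```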
  6. Res Sq. 2026 May 28. pii: rs.3.rs-9726844. [Epub ahead of print]
       BACKGROUND: Qualitative methods are widely used in health services research to derive context-specific insights and depth of understanding. Manual coding, a foundational technique in rigorous qualitative analysis, is highly resource- and time-intensive and difficult to scale. This is a particular challenge in health services research, where repeated rounds of interviews are common and rapid turnaround is often required. Natural Language Processing (NLP) tools, specifically Large Language Models (LLMs), have shown potential to enhance efficiency in qualitative analysis. However, there is limited research providing guidance on how to integrate LLMs while maintaining rigor and trustworthiness. In this proof-of-concept study, we propose, apply, and evaluate an NLP-assisted coding method in a health services research setting.
    METHODS: We analyzed 22 interviews with public health officials, law enforcement, community organizers, and medical professionals in one California county to examine existing substance use service gaps and needs. A primarily deductive codebook was iteratively refined until two coders achieved an inter-coder reliability (ICR) > 0.95 and was applied to the transcripts using ATLAS.ti. We developed an NLP-assisted method that uses a semantic shift algorithm to segment transcripts, which are then passed to GPT-4 for code assignment and explanation using the codebook and coding guidelines developed during the manual process. We evaluated the method with a quantitative assessment of agreement between human and NLP-assigned codes, a qualitative and quantitative soundness assessment by two reviewers, and a comparative efficiency analysis.
    RESULTS: The NLP-assisted method had moderate agreement with human coding (modified pooled Cohen's Kappa = 0.66), and 71.8% of codes were rated as sound by reviewers. Sound codes were more often observed for high-certainty and straightforward codes, and when text chunks were semantically well defined. The NLP-assisted method had more difficulty with non-linear conversation and entity-dependent codes. Coding time was reduced significantly from ~40 hours for the traditional method to ~1 hour for the NLP-assisted method.
    CONCLUSIONS: These findings suggest that LLMs can be effectively incorporated into qualitative processes while maintaining rigor if humans are embedded into the process. By maintaining a human-in-the-loop workflow, our methodology allows researchers to maintain familiarity with the data, define the research question(s) and codebook, and determine if there are results that are not sound. By incorporating LLMs into the coding stage of the process, key limitations of traditional qualitative methods in health services research can be addressed, such as scalability, and resource and time limitations.
    DOI:  https://doi.org/10.21203/rs.3.rs-9726844/v1
  7. Int J Nurs Stud. 2026 May 19. pii: S0020-7489(26)00256-7. [Epub ahead of print] 182: 105584
       BACKGROUND: Qualitative data analysis in nursing research remains labor-intensive and vulnerable to researcher bias. While large language models offer transformative potential for automating thematic extraction and improving analytical consistency, their methodological rigor, alignment with human analysis, and applicability to nursing contexts remain underexplored.
    AIM: This study examined whether large language models can assist qualitative descriptive analysis by generating preliminary, data-near summaries of participants' accounts and whether these AI-generated outputs align with human-generated descriptive syntheses. Using kinesiophobia in postoperative bone tumor patients as a case study, we propose a triangulated framework that combines large language models and human coding to enhance analytical rigor and efficiency.
    METHODS: Semi-structured interviews (N = 15) with postoperative bone tumor patients were analyzed using two approaches: (1) large language model analysis via ChatGPT and DeepSeek; and (2) human-coded analysis by an experienced qualitative researcher. Methodological trustworthiness was assessed through coding consistency and time-efficiency metrics.
    RESULTS: Both large language models, aligned with the human analyst, identified four common themes: (1) Disease and treatment experiences; (2) Mind-body dynamics in rehabilitation; (3) Utilization of health education; and (4) Roles of family support. The thematic output of the large language models showed strong overlap with the human-coded analysis (Cohen's κ = 0.89) while substantially reducing coding time. Remaining discrepancies may reflect differences in interpreting implicit emotional cues, variation in analytic focus and scope between human and model outputs, and the potential illusion of model understanding.
    CONCLUSION: Large language models hold promise as valuable supplementary tools in qualitative nursing research, improving efficiency and reducing potential bias of human-coded analysis. Yet, human expertise remains essential for interpreting psychosocial nuances and ensuring contextual relevance. This study introduces a hybrid large language model-human methodology, enhancing qualitative rigor while maintaining the patient-centered ethos of nursing. Future research should assess the scalability of this approach across diverse study populations.
    Keywords:  Bone tumor patients; Large language models; Nursing research; Qualitative data analysis
    DOI:  https://doi.org/10.1016/j.ijnurstu.2026.105584
  8. J Multimorb Comorb. 2026 Jan-Dec;16:16 26335565261444423
       Background: People living with multimorbidity often experience unmet social care needs, which can negatively affect wellbeing and increase pressure on health and social care systems. Artificial intelligence (AI)-enabled tools may support more timely and tailored responses to these needs. Large language models (LLMs) are emerging as tools to support qualitative research, although research detailing their integration into qualitative analytic workflows remains limited.
    Methods: We conducted a secondary thematic analysis of 75 qualitative interview transcripts involving people with multimorbidity and their carers. The dataset was coded according to an analytic framework of exploratory, interpretive, and integrative layers of meaning. The dataset was analysed according to two parallel analytic streams: human reflexive thematic analysis, and qualitative analysis using Claude Sonnet 4. Model outputs were iteratively reviewed and compared against manual thematic analysis for convergence and divergence.
    Results: Across the analytic workflow, twelve themes from the original human-led analysis were used as a reference framework for examining areas of alignment, extension, or divergence in LLM-generated interpretations. The LLM-assisted analysis highlighted shifts in analytic emphasis and candidate interpretive nuances, including emotive tone and latent cross-cutting concerns, while requiring human oversight to determine evidential grounding.
    Conclusions: We present a structured methodological illustration for integrating LLM-assisted outputs within qualitative analysis. Using convergence-divergence mapping, we examine how LLM-generated interpretations may function as an additional analytic lens that can support reflexivity, transparency, and analytic auditability in qualitative research applied within the context of multimorbidity.
    Keywords:  artificial intelligence; large language models; multimorbidity; qualitative analysis; social care
    DOI:  https://doi.org/10.1177/26335565261444423
  9. Int J Med Inform. 2026 May 29. pii: S1386-5056(26)00231-5. [Epub ahead of print] 218: 106491
       PURPOSE: Reproducibility in rehabilitation evidence synthesis is influenced not only by search strategy and adjudication architecture but also by the structural clarity of operational taxonomy. This study evaluated whether shared operational definitions support classification stability across AI-assisted adjudication architectures.
    METHODS: A previously established deduplicated rehabilitation corpus was analyzed using a fixed multi-adjudicator architecture under standardized operational constraints. Inter-architecture concordance (agreement, Cohen's κ, and Gwet's AC1) was assessed. Corpus expansion was modeled through staged database inclusion, and stability bounds were estimated under best- and worst-case perturbation scenarios without re-adjudication of newly identified records. Risk of bias assessment was not performed, as the objective was classification concordance rather than therapeutic effect estimation.
    RESULTS: High inter-architecture concordance was observed under fixed operational definitions. Sensitivity-envelope modeling identified both stability-preserving and stability-failing boundary conditions. Under worst-case forced-discordance assumptions, κ declined substantially as modeled corpus expansion increased, indicating that robustness was conditional rather than unconditional.
    CONCLUSIONS: Explicit operational taxonomy may constrain classification variability across AI-assisted adjudication architectures when citation-level metadata are incomplete. AI systems cannot recover procedural specificity that is not encoded within bibliographic records. Because taxonomy was held constant rather than experimentally varied, the relative contribution of taxonomy versus architecture remains an empirical question for future work. Although evaluated within a dry needling corpus, the underlying metadata-signal problem may extend to other intervention domains, with implications for AI-assisted evidence workflows in healthcare decision-making.
    Keywords:  Artificial intelligence; Classification stability; Evidence synthesis; Metadata quality; Operational taxonomy; Rehabilitation
    DOI:  https://doi.org/10.1016/j.ijmedinf.2026.106491
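    The three concordance measures named above (raw agreement, Cohen's κ, and Gwet's AC1) differ only in how chance agreement is estimated. The sketch below computes all three for two raters using the standard textbook formulas. It is not the study's code, and the include/exclude labels are invented to show how κ and AC1 can diverge when one category dominates.

```python
from collections import Counter


def agreement_stats(rater_a: list[str], rater_b: list[str]) -> tuple[float, float, float]:
    """Return (observed agreement, Cohen's kappa, Gwet's AC1) for two raters.

    Assumes equal-length label lists with at least two distinct categories.
    """
    n = len(rater_a)
    categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    count_a, count_b = Counter(rater_a), Counter(rater_b)

    # Cohen's kappa: chance agreement from the product of each rater's marginals.
    pe_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    # Gwet's AC1: chance agreement from the average marginal per category.
    mean_marginal = [(count_a[c] + count_b[c]) / (2 * n) for c in categories]
    pe_ac1 = sum(p * (1 - p) for p in mean_marginal) / (q - 1)

    kappa = (observed - pe_kappa) / (1 - pe_kappa)
    ac1 = (observed - pe_ac1) / (1 - pe_ac1)
    return observed, kappa, ac1


# Invented labels: 20 records, heavily skewed toward "out".
rater_a = ["in", "in"] + ["out"] * 18
rater_b = ["in", "out", "in"] + ["out"] * 17

observed, kappa, ac1 = agreement_stats(rater_a, rater_b)
print(f"agreement {observed:.2f}, kappa {kappa:.2f}, AC1 {ac1:.2f}")
```

    With 90% raw agreement the skewed marginals pull κ well below AC1, which is one common reason for reporting both statistics.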
  10. bioRxiv. 2026 May 26. pii: 2026.05.21.727015. [Epub ahead of print]
      Infectious and immune-mediated diseases (IIDs) represent a broad and rapidly expanding biomedical literature domain in which scalable evidence extraction, disease ontology refinement, and interpretable knowledge integration are essential for biomedical discovery. We constructed an IID-specific biomedical knowledge graph (IID KG) from PubMed abstracts and PMC full-text articles by integrating nested named entity recognition, ontology-guided identifier assignment, full-text relation extraction, and relation-resolution strategies. A gold-standard corpus of 500 PubMed abstracts and 8 PMC full-text articles was manually annotated for nested biomedical entities across six entity types. The resulting models were applied to 30,128,068 PubMed abstracts and 1,385,500 IID-related PMC full-text articles. A unified IID ontology was developed from 411,341 disease terms using hierarchical text classification, large language model-based refinement, ontology cross-referencing, and expert review, yielding 179,657 confirmed MeSH mappings. The final IID KG contains approximately 1,837,513 unique entities and 16,295,390 unique relations across eight relation types. The resource was released publicly together with repurposing workflows, supporting ontology-aligned literature mining, disease mechanism analysis, and drug-repurposing hypothesis generation for IID research.
    DOI:  https://doi.org/10.64898/2026.05.21.727015
  11. J Prosthet Dent. 2026 Jun 02. pii: S0022-3913(26)00346-X. [Epub ahead of print]
       STATEMENT OF PROBLEM: Large language model (LLM)-based artificial intelligence (AI) platforms have emerged as tools to support clinical decision-making in dentistry, but their alignment with high-level evidence from systematic reviews in implant prosthodontics remains unclear.
    PURPOSE: The purpose of this study was to evaluate the degree of alignment between responses generated by ChatGPT and Google Gemini and the conclusions of published systematic reviews in implant prosthodontics.
    MATERIAL AND METHODS: Systematic reviews published between 2023 and 2025 addressing clinical questions in implant prosthodontics were included, with their conclusions used as reference standards and operationalized as expected-answer statements. Methodological quality of the included reviews was assessed using Assessing the Methodological Quality of Systematic Reviews 2 (AMSTAR 2). Standardized population, intervention, comparison, outcome (PICO)-based questions were submitted to ChatGPT and Google Gemini using identical prompts and no prior context. Agreement between AI responses and review conclusions was scored on a 5-point Likert scale by 2 blinded evaluators, with interrater reliability assessed using weighted Cohen kappa. Platform comparisons used the Wilcoxon matched-pairs signed-rank test, and domain analyses used the Kruskal-Wallis test with Dunn post hoc comparisons (α=.05).
    RESULTS: Seventy-four systematic reviews were included and categorized into 5 prosthodontic domains. Both ChatGPT and Google Gemini showed high agreement across domains, with no significant differences between platforms or domains (P>.05). Interrater agreement was almost perfect (κ=0.88-0.97). Although agreement was similar, ChatGPT more often reported moderate certainty, whereas Google Gemini more frequently expressed high certainty.
    CONCLUSIONS: ChatGPT and Google Gemini showed high agreement with systematic review conclusions in implant prosthodontics. Differences in certainty expression highlighted the need for cautious interpretation and professional oversight.
    DOI:  https://doi.org/10.1016/j.prosdent.2026.05.001
  12. Urolithiasis. 2026 Jun 06. pii: 110. [Epub ahead of print]54(1):
      Large language models (LLMs) are increasingly investigated for their potential role in guideline-based clinical information support. However, their consistency with subspecialty guidelines, particularly in urolithiasis, remains underexplored. This study aimed to evaluate the performance of four LLMs (GPT-4, GPT-4-turbo, Claude, and Gemini) in generating guideline-concordant responses to clinical questions related to urolithiasis. A total of 105 clinical questions were independently developed by the authors based on urolithiasis management principles. Each LLM generated responses in two separate sessions. Two experienced urologists evaluated the outputs for accuracy and concordance with guideline recommendations. Inter-rater agreement analysis demonstrated fair agreement between evaluators. Differences across models and guideline categories were assessed using appropriate statistical tests. All four LLMs demonstrated high guideline adherence, with mean total scores ranging from 86.5 ± 5.2 (Gemini) to 92.8 ± 3.1 (Claude). Claude achieved the highest correlation with expert ratings (r = 0.94, p < 0.01). There were no statistically significant differences across models or among the nine clinical categories (p > 0.05). Session-to-session repeatability was also high for all models, with intra-model correlation coefficients exceeding 0.90. LLMs, particularly Claude, can provide reliable, guideline-consistent answers to urolithiasis-related clinical queries. Their consistent performance across themes suggests utility as adjunctive informational tools for guideline-based urological education and support, although further validation in real-world clinical settings remains necessary.
    Keywords:  Artificial intelligence; Clinical decision support; Guideline adherence; Large language models; Natural language processing; Urolithiasis
    DOI:  https://doi.org/10.1007/s00240-026-02013-1