bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-04-12
eleven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Int J Med Inform. 2026 Apr 02. 214: 106422. pii: S1386-5056(26)00162-0. [Epub ahead of print]
       INTRODUCTION: Large language model-based artificial intelligence tools are increasingly explored to support systematic reviews, yet evidence regarding their reliability in full-text screening remains limited. This study evaluated the performance of two versions of ChatGPT (4.0 and 5.0) compared with human reviewers during article selection for a systematic review on influenza vaccine effectiveness.
    METHODS: A total of 170 full-text articles were independently assessed for eligibility using predefined inclusion and exclusion criteria. Human reviewers served as the gold standard. ChatGPT 4.0 and 5.0 were prompted using standardized instructions mirroring the review protocol. Agreement with human decisions was evaluated using accuracy, sensitivity, specificity, precision, F1-score, and Cohen's κ. Intra-model reproducibility was assessed for ChatGPT 5.0.
    RESULTS: ChatGPT 4.0 achieved an accuracy of 0.71 (95% CI: 0.64-0.78) and a Cohen's κ of 0.43, indicating moderate agreement with human reviewers. ChatGPT 5.0 demonstrated improved performance, with accuracy increasing by 0.06 to 0.77 (95% CI: 0.70-0.83), sensitivity of 0.87, specificity of 0.70, and κ of 0.55, corresponding to moderate-to-substantial agreement. Intra-model reproducibility for ChatGPT 5.0 showed 80% agreement (κ = 0.60), indicating partial but imperfect consistency.
    CONCLUSIONS: ChatGPT 5.0 outperformed ChatGPT 4.0 in full-text screening accuracy and reproducibility, approaching but not matching human performance. These findings support the use of current LLMs as decision-support tools rather than autonomous reviewers in systematic reviews. Transparent reporting of model versions, prompts, and input quality is essential to ensure credible AI-assisted evidence synthesis.
    Keywords:  Artificial intelligence; ChatGPT; Chatbot; Full-text screening; Large language models; Reproducibility; Systematic review
    DOI:  https://doi.org/10.1016/j.ijmedinf.2026.106422
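All of the agreement statistics reported above (accuracy, sensitivity, specificity, precision, F1, Cohen's κ) follow from a single 2×2 confusion matrix of model decisions against the human gold standard. A minimal sketch, using hypothetical counts rather than the study's actual data:

```python
def screening_metrics(tp, fp, fn, tn):
    """Agreement metrics for binary include/exclude screening decisions
    against a human gold standard (human decision = truth)."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)        # recall on truly included articles
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance is the product of the two raters' marginal rates
    p_o = accuracy
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_e = p_yes + p_no
    kappa = (p_o - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f1": f1, "kappa": kappa}

# Hypothetical counts for 170 full-text articles (not the study's data):
print(screening_metrics(tp=80, fp=25, fn=14, tn=51))
```

Note that a model can score well on accuracy while κ stays moderate: κ discounts the agreement expected by chance from the marginal include/exclude rates.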
  2. Adv Clin Exp Med. 2026 Apr 09.
       BACKGROUND: Risk-of-bias (RoB) assessment is essential for evidence synthesis but remains time-consuming and inherently subjective. Artificial intelligence (AI) may improve the efficiency of systematic reviews; however, its reliability in reproducing expert RoB judgements remains uncertain.
    OBJECTIVES: To compare the performance of AI models and human raters in RoB assessment of randomized controlled trials (RCTs) using the revised Joanna Briggs Institute (JBI) critical appraisal tool.
    MATERIAL AND METHODS: Thirteen RCTs published between 2023 and 2025 in orthopedic journals were independently assessed by 2 human raters (an expert [R1] and a novice [R2]) and 2 AI models (ChatGPT-4.0 [CGPT] and DeepSeek-R1 [DS]) using the 13-domain JBI checklist. Deep-reasoning functionalities (e.g., chain-of-thought prompting) were applied. Inter-rater agreement, deviations from the expert assessment (reference standard), and binary disagreements (e.g., Yes vs No) were analyzed to evaluate consistency.
    RESULTS: The AI models demonstrated high inter-model agreement (91%), exceeding human-AI agreement (CGPT vs R1: 64%; DS vs R1: 68%). However, both AI systems showed substantial divergence from expert judgements in interpretive domains, including allocation concealment (Q2), blinding (Q7), and overall trial design (Q13), with deviation rates ranging from 30% to 38.5%. Binary decision reversals were more frequent in AI assessments (CGPT: 8.9%; DS: 7.7%) than in the human comparison (R2 vs R1: 2.4%). Human raters showed stronger agreement in contextual interpretation (R1-R2: 89.3%), whereas AI models performed better in rule-based domains (Q8/Q9: 100% agreement).
    CONCLUSIONS: AI can reliably support the automation of objective components of RoB assessment but remains limited in handling interpretive, context-dependent judgements. A hybrid approach combining AI-assisted pre-screening with expert evaluation may enhance the scalability of systematic reviews without compromising methodological rigor.
    Keywords:  artificial intelligence; critical appraisal; randomized controlled trials; risk of bias; systematic review
    DOI:  https://doi.org/10.17219/acem/216070
  3. J Med Internet Res. 2026 Apr 08. 28: e87057
       Background: Despite the transformative potential of large language models (LLMs) in health care, the rapid development of these tools has outpaced their rigorous evaluation. While artificial intelligence-specific reporting guidelines have been developed to address standardized reporting of artificial intelligence studies, there is currently no specific tool available for risk of bias assessment of LLM question-answer (QA) studies. Existing risk-of-bias tools for medical research are not well suited to the unique challenges of evaluating LLM-QA studies, which creates a critical gap in assessing their safety and effectiveness.
    Objective: This study aims to develop the Alberta Quality Assessment Tool: Risk of Bias (AQAT:RoB) for LLM-QA studies to systematically evaluate the validity and risk of bias in LLM-QA studies.
    Methods: We conducted 2 literature reviews. The first was on quality assessment tools for LLM-QA studies, and the second was on LLM-QA studies, which informed the first draft of the AQAT:RoB. The draft AQAT:RoB was further refined through a prespecified iterative process of modified Delphi, consensus meeting, and validation. The first Delphi process occurred between May 1 and May 20, 2025, and the first consensus meeting was held on May 22. The first round of validation was completed by 4 evaluators, who were not part of the consensus meeting, on 16 randomly selected studies. As this first round of validation surpassed our a priori threshold of ≥80% agreement and a Cohen κ of ≥0.61 between evaluators, no further rounds of development and validation were undertaken. A second Delphi process occurred between February 20 and February 23, 2026, to vote on postpilot changes in response to peer review.
    Results: The AQAT:RoB consists of 5 high-level domains (Questions, Reference Answers, LLM Answers, Evaluators, Outcomes). These domains are subdivided into 9 subdomains. Each subdomain includes at least one "Support for Judgment" and at least one "Type of Bias" and is to be rated "low," "high," or "unclear" for risk of bias. A pilot evaluation was completed by internal validators who were not part of the consensus discussion and were asked to complete the AQAT:RoB form for each assigned study. Each of the 16 studies was evaluated by 2 evaluators independently. Pilot validation showed a percent agreement of 86.1% and a Cohen κ of 0.70 between assessors.
    Conclusions: The AQAT:RoB demonstrates promising initial reliability for assessing the validity or risk of bias in LLM-QA studies. The tool will benefit from future refinements, external validation, and periodic updates to keep pace with evolving technology.
    Keywords:  AQAT: RoB; Alberta Risk of Bias Assessment Tool for LLM-QA studies; artificial intelligence; chatbot; large language model; quality assessment; question-answer studies; risk of bias
    DOI:  https://doi.org/10.2196/87057
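The validation thresholds above (percent agreement and unweighted Cohen κ between paired evaluators) extend naturally to the tool's three-level "low"/"high"/"unclear" ratings. A minimal sketch with hypothetical ratings (not the study's data):

```python
from collections import Counter

def agreement_stats(ratings_a, ratings_b):
    """Percent agreement and unweighted Cohen's kappa for two raters
    assigning categorical risk-of-bias judgements to the same items."""
    n = len(ratings_a)
    # observed agreement: fraction of items rated identically
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # chance agreement: product of each rater's marginal frequencies
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, kappa

# Hypothetical "low"/"high"/"unclear" judgements from two evaluators:
a = ["low", "low", "high", "unclear", "low", "high", "low", "low"]
b = ["low", "low", "high", "low", "low", "high", "unclear", "low"]
pct, kappa = agreement_stats(a, b)
print(f"agreement={pct:.1%}, kappa={kappa:.2f}")
```

A tool can thus pass an 80% raw-agreement threshold while κ falls short of 0.61 when one category dominates the marginals, which is why the study prespecified both criteria.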
  4. Res Synth Methods. 2026 Apr 06. 1-19
    Evidence synthesis findings hinge upon well-designed, effective search strategies. When developing these strategies, evidence synthesis teams make multiple decisions (e.g., selecting information sources, developing search string architecture, and picking supplementary search methods) that directly affect the breadth of discovered evidence and thus evidence synthesis outcomes. Despite the number of decisions required when developing search strategies, limited guidance exists to inform these decisions using a data-driven approach. To help address this gap, we developed CiteSource, an R package and accompanying Shiny application that supports data-driven search strategy development and reporting. CiteSource allows users to assign and retain metadata across three custom fields: source, label, and string to indicate where the records were found, what method or string was used to find them, and whether they were included after screening. CiteSource allows users to visually map the overlap between sets of records, create data summaries of citation records, and export citation records with the newly assigned metadata. CiteSource's analysis and visualization outputs can be harnessed for a variety of use cases, such as optimizing literature source selection, honing and understanding the effectiveness of search strings, and evaluating the impacts of literature sources and supplementary search methods. Overall, CiteSource provides a tool for evidence synthesizers to make informed data-driven decisions that boost the efficiency, rigor, and transparency of search strategies and associated reporting.
    Keywords:  evidence synthesis; information retrieval; reproducibility; search strategy; systematic searching
    DOI:  https://doi.org/10.1017/rsm.2026.10084
  5. Med Sci Educ. 2026 Feb;36(1): 11-15
    Systematic reviews in medical education often classify outcomes using the Kirkpatrick framework, but manual coding is time-consuming and subjective. We conducted a proof-of-concept study testing ChatGPT (GPT-5, August 2025 release) on 32 full-text articles from a published systematic review of sepsis education. Agreement with human-coded outcomes was modest: percent agreement of 50%, unweighted κ = 0.170 (95% CI 0.000-0.458), weighted κ = 0.351 (95% CI 0.074-0.629). Most disagreements were between adjacent levels.
    Supplementary Information: The online version contains supplementary material available at 10.1007/s40670-026-02639-1.
    Keywords:  Artificial intelligence in medical education; ChatGPT; Generative AI; Kirkpatrick framework; Systematic reviews
    DOI:  https://doi.org/10.1007/s40670-026-02639-1
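The gap between unweighted and weighted κ above reflects the abstract's note that most disagreements fell between adjacent Kirkpatrick levels: weighted κ penalises near-misses on an ordinal scale less than distant ones. A minimal linear-weighted implementation, using hypothetical ratings (not the study's data):

```python
def weighted_kappa(rater_a, rater_b, levels, weight="linear"):
    """Weighted Cohen's kappa for ordinal codes (e.g. Kirkpatrick
    levels 1-4). Linear weights penalise adjacent-level disagreements
    less than distant ones; "quadratic" squares the distance."""
    k = len(levels)
    index = {lev: i for i, lev in enumerate(levels)}
    n = len(rater_a)

    def w(i, j):  # normalised disagreement weight between level ranks
        d = abs(i - j) / (k - 1)
        return d if weight == "linear" else d * d

    # joint distribution of observed rating pairs, plus marginals
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[index[a]][index[b]] += 1 / n
    marg_a = [sum(row) for row in obs]
    marg_b = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    d_o = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_e = sum(w(i, j) * marg_a[i] * marg_b[j]
              for i in range(k) for j in range(k))
    return 1 - d_o / d_e

# Hypothetical Kirkpatrick codings by a human and a model:
human = [1, 2, 2, 3, 4, 2, 3, 1]
model = [1, 2, 3, 3, 4, 1, 2, 1]
print(round(weighted_kappa(human, model, levels=[1, 2, 3, 4]), 3))
```

With identical ratings the function returns 1.0; in the hypothetical example every disagreement is off by one level, so the linear-weighted value sits well above what the unweighted κ of the same data would be.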
  6. Mach Learn Knowl Extr. 2025 Mar 26. 7(2): 28
    As climate change transforms our environment and human intrusion into natural ecosystems escalates, there is a growing demand for disease spread models to forecast and plan for the next zoonotic disease outbreak. Accurate parametrization of these models requires data from diverse sources, including the scientific literature. Despite the abundance of scientific publications, the manual extraction of these data via systematic literature reviews remains a significant bottleneck, requiring extensive time and resources, and is susceptible to human error. This study examines the application of a large language model (LLM) as an assessor for screening prioritisation in climate-sensitive zoonotic disease research. By framing the selection criteria of articles as a question-answer task and utilising zero-shot chain-of-thought prompting, the proposed method achieves a work saving of at least 70% compared to manual screening at a recall level of 95% (NWSS@95%). This was validated across four datasets containing four distinct zoonotic diseases and a critical climate variable (rainfall). The approach additionally produces explainable AI rationales for each ranked article. The effectiveness of the approach across multiple diseases demonstrates the potential for broad application in systematic literature reviews. The substantial reduction in screening effort, along with the provision of explainable AI rationales, marks an important step toward automated parameter extraction from the scientific literature.
    Keywords:  AI-assisted disease surveillance; automated AI literature screening; biomedical text mining for disease tracking; climate-sensitive zoonotic disease modelling; information retrieval in medical literature; large language models in systematic reviews; systematic literature review automation; zero-shot relevancy ranking
    DOI:  https://doi.org/10.3390/make7020028
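The work-saved-over-sampling metric behind figures like the one above is computed from a ranked screening list: screen from the top until 95% of the relevant records are found, and count how much of the list never needed screening, minus the (1 - recall) baseline that random sampling would achieve. A minimal sketch on toy data (the labels below are hypothetical, not the study's datasets):

```python
import math

def wss_at_recall(ranked_labels, recall=0.95):
    """Work saved over sampling (WSS@R) for a ranked screening list:
    1 = relevant, 0 = irrelevant, best-ranked first."""
    n = len(ranked_labels)
    total_relevant = sum(ranked_labels)
    target = math.ceil(recall * total_relevant)
    found = 0
    for cutoff, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= target:
            break
    # fraction of records never screened, minus the sampling baseline
    return (n - cutoff) / n - (1 - recall)

# Toy ranking with 4 relevant records pushed near the top:
labels = [1, 1, 0, 1, 1] + [0] * 15
print(round(wss_at_recall(labels), 3))
```

In this toy list all relevant records appear within the first 5 of 20 positions, so three quarters of the list goes unscreened and WSS@95% works out to 0.70, i.e. a 70% effort saving at 95% recall.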
  7. Med Sci Educ. 2026 Feb;36(1): 73-79
    Thematic analysis is a form of qualitative analysis performed to identify patterns within text-based datasets such as open-ended responses. While thematic analysis is used extensively in medical education research, it has several limitations, such as subjective interpretation by graders and the time required to manually code responses in large datasets. There is potential to overcome many of these challenges with the use of Artificial Intelligence (AI) platforms, such as the free-to-use large language model, ChatGPT. The goal of this study was to evaluate whether AI can be used in thematic analysis to replace manual graders as the gold standard. The dataset used in this study was related to first year medical students' thoughts and feelings regarding the act of cadaveric dissections. Three different methods were used to instruct the AI to grade the responses, and each method was repeated three times. Various measures related to precision and accuracy were compared, both within the repeated tests using AI and between the AI generated results and those obtained by manual coders. Results show that Method 3 had greater accuracy and agreement with the manual coders, but less precision compared to the other two methods. All methods had an agreement greater than 80%. These findings demonstrate that AI has promise in being used for thematic analysis, but the method used to instruct the AI has a strong influence on the results. When using AI for thematic analyses, it is imperative to carefully document and refine the methodology and to retain human oversight to ensure accurate results.
    Supplementary Information: The online version contains supplementary material available at 10.1007/s40670-025-02587-2.
    Keywords:  AI; AI in medical education; ChatGPT; Generative AI; Medical education; Qualitative research; Thematic analysis
    DOI:  https://doi.org/10.1007/s40670-025-02587-2
  8. Online J Public Health Inform. 2026 Apr 07. 18: e80824
       Background: Public opinion, which may be influenced by personal experiences, news, and social media, can impact compliance with public health measures (PHMs) during health emergencies. Artificial intelligence (AI) tools offer opportunities to analyze public opinion in real time during health emergencies. However, their performance in accurately identifying sentiment and themes in health-related online content remains unclear.
    Objective: This study aimed to evaluate the performance of natural language processing-based and large language model (LLM)-based AI tools when compared to human coding for sentiment analysis, topic modeling, and thematic analysis of public health datasets. Tools were selected to reflect those available to public health analysts and decision-makers.
    Methods: Data were collected via Google Alerts (GA) and social media posts from X (formerly known as Twitter) relevant to COVID-19 mitigation PHMs from December 2022 to February 2023. Following relevance screening, the sentiment of the complete datasets was analyzed by a human rater, with descriptive statistics used to summarize the overall sentiment profile. Subsets of 400 GA articles and 400 tweets were manually coded for sentiment by 2 human raters. Results were compared with outputs from 5 AI tools, including VADER (Valence Aware Dictionary and Sentiment Reasoner), SentimentGI, SentimentQDAP, Microsoft Azure, and OpenAI's ChatGPT-4. Topic modeling of the GA and X datasets was conducted using latent Dirichlet allocation in R and zero-shot prompting in ChatGPT-4 and compared with manual topic summaries. Thematic analysis of positive and negative sentiment datasets was conducted by a human rater and ChatGPT-4, with outputs evaluated for proficiency and reasonableness.
    Results: Of 2227 GA results and 3484 tweets, 58% (n=1238) and 71% (n=2473), respectively, were relevant to PHMs. Human-coded sentiment analysis showed mostly neutral reporting in the news media, while social media expressed more polarized views. Across both datasets, AI tools demonstrated poor concordance with human-coded sentiment (Cohen κ <0.5 for all tools and sentiment categories). Topic modeling with ChatGPT-4 aligned more closely with human-rated topics than latent Dirichlet allocation, and of the 20 LLM-generated thematic outputs, 13 were rated proficient, and 7 were rated partially proficient. LLM outputs provided coherent, high-level summaries but lacked contextual insight. Human and LLM thematic analyses both identified themes of vaccine effectiveness, debate regarding PHMs, and public trust.
    Conclusions: Accessible AI tools demonstrate limited reliability for sentiment classification of health-related online text but show promise for rapid thematic exploration when combined with human oversight. These tools could complement traditional qualitative research in the context of health emergencies; however, they require human review to enhance the accuracy of interpretation. Further research is needed for non-English datasets.
    Keywords:  AI; COVID-19; artificial intelligence; equity; public health informatics; public opinion; sentiment analysis; social media
    DOI:  https://doi.org/10.2196/80824
  9. JMIR AI. 2026 Apr 06. 5: e81149
       BACKGROUND: Translating evidence-based therapies from "bench to bedside" remains challenging, and implementation science (IS) experts are crucial for this process. Qualitative analyses are essential, but require extensive time and cost for manual coding. Now, many turn to artificial intelligence (AI) to accelerate the pace of qualitative analysis, but significant questions remain about the quality, validity, and ethics of applying large language models like ChatGPT (OpenAI) to qualitative data. To this end, we have developed a method for AI-assisted rapid qualitative analysis that addresses these concerns.
    OBJECTIVE: This study aimed to develop AI-assisted rapid qualitative analysis for implementation science as an open-source encoder-based small language model (SLM) to aid IS experts. We focus on 2 efficient and high-performing SLMs: distilled bidirectional encoder representations from transformers (DistilBERT) and efficiently learning an encoder that classifies token replacements accurately (ELECTRA). The objective is to assess these models' accuracy in reproducing expert coding, their generalizability to new coding scenarios, and enhancing their accessibility for nontechnical experts through user-friendly tools.
    METHODS: Two previously coded IS datasets were used to train DistilBERT and ELECTRA models. These datasets were coded by IS experts using a mixed deductive and inductive approach, with initial categories derived from the domains of an IS framework: Practical, Robust Implementation, and Sustainability Model. We fine-tuned and evaluated DistilBERT and ELECTRA on these datasets, measuring performance by area under the precision-recall curve and Cohen κ. To facilitate use by nonprogrammers, we then developed an open-source Python package (pytranscripts) to streamline transcript processing, model classification, and evaluation. Additionally, a companion Streamlit web application allows users to upload interview transcripts and obtain automated coding and analytics without any coding expertise.
    RESULTS: Our findings demonstrate the success of leveraging SLMs to significantly accelerate qualitative analysis while maintaining high levels of accuracy and agreement with human annotators, although results are not universal and depend on how researchers approach qualitative coding. On the original dataset, DistilBERT achieved near-perfect agreement with human coders (Cohen κ=0.95), while ELECTRA showed substantial agreement (Cohen κ=0.71). However, both models' performance declined on the second, more ambiguous dataset, with DistilBERT's Cohen κ dropping to 0.48 and ELECTRA's to 0.39. Two primary drivers of performance drop appear to be related to the number of codes applied to the dataset, and whether coders apply multiple codes to each piece of data or constrain themselves to applying one.
    CONCLUSIONS: This work demonstrates that SLMs can meaningfully assist qualitative researchers with coding tasks as long as attention is paid to how experts code data that will train the SLM. This can be especially valuable in settings where deploying large language models is impractical or undesirable.
    Keywords:  artificial intelligence; health care; implementation science; interviews; natural language processing; qualitative analysis; small language model
    DOI:  https://doi.org/10.2196/81149
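The abstract above scores its classifiers by area under the precision-recall curve, which for binary labels is commonly summarised as average precision: the mean of the precision values at each rank where a true positive appears. A self-contained sketch with hypothetical scores (not the study's models or data):

```python
def average_precision(labels, scores):
    """Average precision (area under the precision-recall curve) for
    binary labels, computed from classifier confidence scores."""
    # rank items by descending score, then average the precision
    # observed at each position holding a true positive
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    tp = 0
    ap = 0.0
    for rank, (_score, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            ap += tp / rank   # precision at this recall step
    return ap / total_pos

# Hypothetical confidence scores for 6 coded interview segments:
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(round(average_precision(labels, scores), 3))
```

Unlike accuracy, average precision is insensitive to the large number of true negatives that dominate multi-label qualitative coding, which is presumably why the authors pair it with Cohen κ.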
  10. J Med Internet Res. 2026 Apr 10. 28: e95004
       Unlabelled: This commentary reviews the study by Jones et al, which evaluated whether GPT-4 could improve the readability of injectable medication guidelines while preserving important safety information. The study found that GPT-4 produced modest readability gains comparable to manual revision, but also introduced omissions and meaning changes in a minority of sections. These findings highlight both the potential and limitations of early large language models (LLMs) in clinical contexts. However, this study reflects the capabilities of a specific model in a rapidly evolving domain. Since the release of GPT-4, advances in multistep reasoning, model-critique workflows, and structured validation have substantially improved the ability of newer systems to detect omissions, maintain factual fidelity, and support controlled editing. As a result, some documented limitations may stem from the constraints of a single-model, single-pass workflow rather than intrinsic flaws in LLM-assisted guideline revision. This commentary highlights the need for evaluation frameworks that can keep pace with LLM progress and emphasizes that clinical oversight and user-centered testing remain essential. Updated research using contemporary models is needed to determine how emerging architectures can more safely support clarity, consistency, and maintenance of clinical guidelines.
    Keywords:  artificial intelligence; clinical decision support; clinical guidelines; large language model; patient safety; readability
    DOI:  https://doi.org/10.2196/95004