bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-01-11
ten papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Syst Rev. 2026 Jan 03.
       BACKGROUND: Systematic reviews are fundamental to evidence-based medicine, but the process of screening studies is time-consuming and prone to errors, especially when conducted by a single reviewer. False exclusions of relevant studies can significantly impact the quality and reliability of reviews. Artificial intelligence (AI) tools have emerged as potential secondary reviewers for detecting such false exclusions, yet empirical evidence comparing their performance is limited.
    METHODS: This study protocol outlines a comprehensive evaluation of four AI tools (ASReview, DistillerSR Artificial Intelligence System [DAISY], Evidence for Policy and Practice Information [EPPI]-Reviewer, and Rayyan) in their capacity to act as secondary reviewers during single-reviewer title and abstract screening for systematic reviews. Utilizing a database of single-reviewer screening decisions from two published systematic reviews, we will assess how effective AI tools are at detecting false exclusions while assisting single-reviewer screening compared to the dual-reviewer reference standard. Additionally, we aim to determine the overall screening performance of AI tools in assisting single-reviewer screening.
    DISCUSSION: This research seeks to provide valuable insights into the potential of AI-assisted screening for detecting falsely excluded studies during single screening. By comparing the performance of multiple AI tools, we aim to guide researchers in selecting the most effective assistive technologies for their review processes.
    SYSTEMATIC REVIEW REGISTRATION: Open Science Framework: https://osf.io/dky26.
    Keywords:  AI tools; Falsely excluded studies; Rapid reviews
    DOI:  https://doi.org/10.1186/s13643-025-03031-7
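    A minimal illustrative sketch (not part of the protocol) of the comparison it describes: counting how often a hypothetical AI "second reviewer" flags records that the single reviewer excluded but the dual-reviewer reference standard considers relevant. The field names and toy records below are hypothetical.

# Illustrative only: detecting false exclusions during single-reviewer screening,
# judged against a dual-reviewer reference standard. Not the protocol's code.
def false_exclusion_detection(records):
    """records: dicts with boolean keys 'single_excluded' (single reviewer excluded
    the record), 'ai_flagged' (AI tool flagged the exclusion for re-checking), and
    'reference_included' (dual-reviewer standard deems the record relevant)."""
    false_exclusions = [r for r in records
                        if r["single_excluded"] and r["reference_included"]]
    correct_exclusions = [r for r in records
                          if r["single_excluded"] and not r["reference_included"]]
    caught = sum(r["ai_flagged"] for r in false_exclusions)
    overflagged = sum(r["ai_flagged"] for r in correct_exclusions)
    return {
        "false_exclusions": len(false_exclusions),
        "detected_by_ai": caught,
        "detection_sensitivity": caught / len(false_exclusions) if false_exclusions else float("nan"),
        "rescreening_burden": overflagged / len(correct_exclusions) if correct_exclusions else float("nan"),
    }

toy_records = [
    {"single_excluded": True,  "ai_flagged": True,  "reference_included": True},
    {"single_excluded": True,  "ai_flagged": False, "reference_included": True},
    {"single_excluded": True,  "ai_flagged": True,  "reference_included": False},
    {"single_excluded": False, "ai_flagged": False, "reference_included": True},
]
print(false_exclusion_detection(toy_records))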
  2. J Pediatr Soc North Am. 2026 Feb;14:100294
       Background: Current research has focused on the use of large language models (LLMs) to augment systematic reviews. LLMs are limited by their vulnerability to "hallucinations"; retrieval-augmented generation (RAG) reduces these by limiting the model's source knowledge to user-provided material. The purpose of this study was to evaluate the accuracy and reliability of a RAG-LLM in quality assessment of observational studies in pediatric orthopaedic literature as compared to manual review.
    Methods: Previously published systematic reviews of observational studies in pediatric orthopaedics containing reported Newcastle-Ottawa Scale (NOS) scores from our group were included. After uploading observational study source files, NotebookLM (Google, Mountain View, CA) evaluated each of the included studies using the NOS scoring sheet. Agreement among scores across all NotebookLM trials was determined using a two-way random, average measures, absolute agreement intraclass correlation coefficient [ICC(2,k)]. Agreement among individual scores generated by each NotebookLM instance (LM1, LM2, LM3, and LM4) and ground truth (published manual review score) was calculated using a two-way random, single measures, absolute agreement intraclass correlation coefficient [ICC(2,1)].
    Results: Two systematic reviews comprising a total of 27 observational studies were included. ICC across all measurements (ICC(2,k)-Reviewer-LM1,2,3,4) was 0.69 (95% CI: 0.46-0.84), indicating moderate agreement. ICC comparing individual NotebookLM scores to ground truth demonstrated poor agreement [ICC(2,1) LM1-Reviewer = 0.27 (95% CI: -0.064 to 0.57), LM2-Reviewer = 0.18 (95% CI: -0.12 to 0.48), LM3-Reviewer = 0.081 (95% CI: -0.24 to 0.41), and LM4-Reviewer = 0.23 (95% CI: -0.14 to 0.55)]. Percent agreement ranged from 14.8% to 29.6%. Single measures ICCs comparing individual NotebookLM scores across multiple trials demonstrated moderate-to-poor agreement.
    Conclusions: NotebookLM demonstrated low reliability and accuracy in performing quality assessment of observational studies. Caution should be taken when implementing LLMs to augment research efforts in pediatric orthopaedics.
    Key Concepts: (1) NotebookLM (Google, Mountain View, CA) demonstrated low reliability and accuracy in performing quality assessment of observational studies. (2) Caution should be taken when implementing artificial intelligence tools such as large language models (LLMs) to augment research efforts, even retrieval-augmented generation (RAG)-LLM models that reduce hallucinations. (3) Until emerging artificial intelligence technologies are further validated, it remains essential that researchers and clinicians continue to critically appraise new studies independently.
    Level of Evidence: IV.
    Keywords:  Generative artificial intelligence; Meta-analysis; Pediatric orthopaedics; Retrieval-augmented generation large language models; Systematic review
    DOI:  https://doi.org/10.1016/j.jposna.2025.100294
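    A minimal illustrative sketch of the two-way random, absolute-agreement ICCs reported above, computed from a subjects-by-raters score matrix with NumPy; the toy ratings are hypothetical, not the study's data.

# Illustrative only: ICC(2,1) (single measures) and ICC(2,k) (average measures),
# two-way random effects, absolute agreement, from a subjects x raters matrix.
import numpy as np

def icc_two_way_random(scores):
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape                                        # n subjects, k raters
    grand = scores.mean()
    ss_rows = k * np.sum((scores.mean(axis=1) - grand) ** 2)   # between-subject SS
    ss_cols = n * np.sum((scores.mean(axis=0) - grand) ** 2)   # between-rater SS
    ss_total = np.sum((scores - grand) ** 2)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    icc21 = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)
    icc2k = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return icc21, icc2k

# Hypothetical NOS scores: 5 studies rated by a human reviewer and 4 LLM runs.
ratings = np.array([[7, 6, 5, 7, 6],
                    [8, 7, 7, 8, 7],
                    [5, 4, 6, 5, 4],
                    [6, 6, 5, 7, 6],
                    [9, 7, 8, 8, 7]])
single, average = icc_two_way_random(ratings)
print(f"ICC(2,1) = {single:.2f}, ICC(2,k) = {average:.2f}")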
  3. Graefes Arch Clin Exp Ophthalmol. 2026 Jan 03.
       PURPOSE: To systematically evaluate and compare the performance of four leading large language models (LLMs) in generating medical literature reviews across topics of varying research maturity, thereby providing insights for their effective and responsible application in academic writing.
    METHODS: In this comparative study, using standardized prompts, we instructed four leading LLMs (GPT-4, Gemini 2.5 Pro, Grok-3, and DeepSeek R1) to generate literature reviews on nine topics related to small incision lenticule extraction (SMILE) surgery. These topics were categorized into three groups by research maturity: well-researched, controversial, and open. Seven ophthalmology experts evaluated the generated content across four dimensions: quality, accuracy, bias, and relevance, while all references were verified for authenticity. Performance differences among models were evaluated using group comparison tests followed by post-hoc analysis.
    RESULTS: Significant performance variations were identified across all four models and dimensions (p < 0.001). Specifically, Gemini ranked highest in content quality, accuracy, and bias control. In contrast, DeepSeek, despite its high-quality score, received the lowest relevance score. Grok-3 demonstrated the highest reference authenticity (p < 0.001), whereas GPT-4's was the lowest (p < 0.001). All models showed diminished performance on open topics and exhibited severe reference fabrication ("hallucinations").
    CONCLUSION: Rather than excelling universally, LLMs exhibit distinct and task-specific strengths that mandate a task-driven, hybrid strategy in tool selection. Reference fabrication was found to be a pervasive issue across all models, regardless of the task topic, elevating human verification from a best practice to an essential safeguard for academic integrity.
    Keywords:  AI evaluation; Large language models; Literature review; Medical content generation; SMILE surgery
    DOI:  https://doi.org/10.1007/s00417-025-07092-1
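    The abstract does not name its group comparison or post-hoc tests, so the sketch below assumes a non-parametric Kruskal-Wallis comparison of expert ratings across the four models, followed by Bonferroni-corrected pairwise Mann-Whitney U tests; the ratings shown are hypothetical.

# Illustrative only: one plausible group comparison with post-hoc analysis.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

ratings = {  # hypothetical expert quality ratings per model
    "GPT-4":       [6, 7, 5, 6, 7, 6, 5],
    "Gemini 2.5":  [8, 9, 8, 9, 8, 9, 8],
    "Grok-3":      [7, 7, 6, 8, 7, 6, 7],
    "DeepSeek R1": [7, 8, 6, 7, 8, 7, 6],
}

h, p = kruskal(*ratings.values())
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4f}")

pairs = list(combinations(ratings, 2))
for a, b in pairs:
    u, p_pair = mannwhitneyu(ratings[a], ratings[b], alternative="two-sided")
    p_adj = min(p_pair * len(pairs), 1.0)   # Bonferroni correction
    print(f"{a} vs {b}: U = {u:.1f}, adjusted p = {p_adj:.4f}")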
  4. bioRxiv. 2025 Dec 24. pii: 2025.12.22.696019. [Epub ahead of print]
      The rapid expansion of biomedical literature has made comprehensive manual synthesis increasingly difficult to perform effectively, creating a pressing need for AI systems capable of reasoning across verified evidence rather than merely retrieving it. However, existing retrieval-augmented generation (RAG) methods often fall short when faced with complex biomedical questions that require iterative reasoning and multi-step synthesis. Here, we developed Queryome, a deep research system consisting of specialized large language model (LLM) agents that can adapt their orchestration dynamically to a wide range of queries. Using a hybrid semantic-lexical retrieval engine spanning 28.3 million PubMed abstracts, it performs iterative, evidence-grounded synthesis. On the MIRAGE benchmark, Queryome achieved 88.98% accuracy, surpassing prior systems by up to 14 points, and improved reasoning accuracy on the biomedical Humanity's Last Exam (HLE) subset from 15.8% to 19.3%. Moreover, in a review-article construction task, it earned the highest composite score in comparison with Deep Research from OpenAI, Google, Perplexity, and Scite.AI, reflecting its strong literature retrieval and synthesis capabilities.
    DOI:  https://doi.org/10.64898/2025.12.22.696019
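    A generic illustration of hybrid semantic-lexical retrieval of the kind described above, not Queryome's implementation: a textbook BM25 lexical score is fused with a dense-embedding cosine score, and the documents and random vectors are toy stand-ins.

# Illustrative only: score-fusion hybrid retrieval (lexical BM25 + dense cosine).
import math
import numpy as np

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """docs: list of token lists; returns one BM25 score per document."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = {t: sum(t in d for d in docs) for t in set(query_terms)}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(s)
    return np.array(scores)

def hybrid_rank(query_vec, doc_vecs, lexical, alpha=0.5):
    """Blend min-max-normalised dense cosine and lexical scores, best first."""
    cos = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    def _minmax(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    return np.argsort(-(alpha * _minmax(cos) + (1 - alpha) * _minmax(lexical)))

docs = [["semantic", "retrieval", "pubmed"],
        ["lexical", "bm25", "search"],
        ["evidence", "synthesis", "review"]]
lex = bm25_scores(["semantic", "search"], docs)
rng = np.random.default_rng(0)
doc_vecs, query_vec = rng.normal(size=(3, 8)), rng.normal(size=8)  # stand-in embeddings
print(hybrid_rank(query_vec, doc_vecs, lex))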
  5. medRxiv. 2025 Dec 29. pii: 2025.12.19.25342712. [Epub ahead of print]
       Objectives: Case reports and case series comprise a significant portion of the biomedical literature, yet unlike case reports, case series are not indexed by the National Library of Medicine as a Publication Type. This limits clinicians' and researchers' ability to retrieve, identify, and analyze evidence from this type of study.
    Materials and Methods: PubMed articles mentioning "case series" in the title or abstract were characterized to learn what the authors themselves consider to be case series. We then set aside articles better indexed as other standard publication types (case reports, cohort studies, reviews, and clinical trials), as well as those that discuss (rather than report the results of) case series studies, to create a corpus of typical case series articles. A random sample of these articles was evaluated by two annotators, who confirmed that the great majority satisfy a formal definition of "case series".
    Results: The corpus was utilized in an automated transformer-based machine learning indexing model. Case series performance of this model on hold-out data was excellent (precision = 0.887, recall = 0.952, F1 = 0.918, PR-AUC = 0.941) and manual evaluation of 100 articles tagged as "case series" revealed that 88% satisfied a formal definition of case series.
    Discussion and Conclusion: This study demonstrates the feasibility of automatically indexing case series articles. Indexing should enhance their discoverability, and hence their medical value, for evidence synthesis groups as well as general users of the biomedical literature.
    DOI:  https://doi.org/10.64898/2025.12.19.25342712
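    A minimal sketch of the hold-out metrics reported above (precision, recall, F1, PR-AUC) for a binary "case series" tagger, computed with scikit-learn; the labels and scores are toy values, not the study's data.

# Illustrative only: hold-out evaluation metrics for a binary indexing model.
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             average_precision_score)

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])      # gold: case series or not
y_score = np.array([0.91, 0.85, 0.20, 0.78, 0.35,
                    0.05, 0.66, 0.48, 0.88, 0.12])      # model probabilities
y_pred = (y_score >= 0.5).astype(int)                   # decision threshold

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("PR-AUC:   ", average_precision_score(y_true, y_score))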
  6. Integr Med Res. 2025 Dec;14(4):101222
      Generative artificial intelligence (GenAI) chatbots powered by large language models (LLMs) are increasingly used in health research to support a range of academic and clinical activities. While increasingly adopted in biomedical research, their application in traditional, complementary, and integrative medicine (TCIM) remains underexplored. TCIM presents unique challenges, including complex interventions, culturally embedded practices, and variable terminology. This article provides a practical, evidence-informed guide to help TCIM researchers engage responsibly with GenAI chatbots through prompt engineering, the design of clear, structured, and purposeful prompts to improve output relevance and accuracy. The guide outlines strategies to tailor GenAI chatbot interactions to the methodological and epistemological diversity of TCIM. It presents use cases across the research process, including research question development, study design, literature searches, selection of reporting guidelines and appraisal tools, quantitative and qualitative analysis, writing and dissemination, and implementation planning. For each stage, the guide offers examples and best practices while emphasizing that AI-generated content should always serve as a starting point, not a final product, and must be reviewed and verified using credible sources. Potential risks such as hallucinated outputs, embedded bias, and ethical challenges are discussed, particularly in culturally sensitive contexts. Transparency in GenAI chatbot use and researcher accountability are emphasized as essential principles. While GenAI chatbots can expand access to research support and foster innovation in TCIM, they cannot substitute for critical thinking, methodological rigour, or domain-specific expertise. Used responsibly, GenAI chatbots can augment human judgment and contribute meaningfully to the evolution of TCIM scholarship.
    Keywords:  AI chatbots; Generative artificial intelligence; Large language models; Prompt engineering
    DOI:  https://doi.org/10.1016/j.imr.2025.101222
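    One possible way to assemble the kind of clear, structured, purposeful prompt the guide recommends (role, task, context, constraints, output format); the template wording and fields below are illustrative, not taken from the article.

# Illustrative only: a structured prompt template for a GenAI chatbot.
PROMPT_TEMPLATE = """You are assisting a researcher in traditional, complementary,
and integrative medicine (TCIM).

Task: {task}
Context: {context}
Constraints:
- Use only the sources I provide; answer "not found" rather than guessing.
- Flag any culturally specific terminology that may need expert review.
Output format: {output_format}
"""

prompt = PROMPT_TEMPLATE.format(
    task="Draft three focused research questions on acupuncture for chronic low back pain.",
    context="Planned systematic review; population is adults in primary care.",
    output_format="Numbered list, one sentence per question.",
)
print(prompt)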
  7. Med Teach. 2026 Jan 10:1-10.
       INTRODUCTION: The integration of artificial intelligence (AI) tools into medical education presents new opportunities for enhancing students' research skills and scientific writing. However, concerns remain about the potential for cognitive disengagement and the ethical use of AI when lacking appropriate educational supervision. This study aimed to evaluate a novel educational strategy combining structured AI assistance with mentor guidance to support narrative review writing among third-year medical students.
    METHODS: A structured framework was implemented during the endocrine module, involving AI-assisted objective formulation, mentor-guided objective refinement, literature search and summarization, and review drafting followed by AI-assisted rephrasing. Students worked in groups, each supervised by a trained mentor. A validated questionnaire assessed student perceptions across four domains: framework and guidelines, AI-generated objectives, skills developed and mentor role, and overall satisfaction. Descriptive statistics were computed, and chi-square tests evaluated associations between perceptions and AI tool usage (ChatGPT vs. DeepSeek).
    RESULTS: Eighty-seven students completed the survey. Perceived improvement in research readiness was observed; confidence in literature searching rose from 29.8% to 69%, while 75.8% reported increased familiarity with PubMed/Google Scholar. Most students (80.5%) expressed satisfaction with the hybrid AI-mentor approach, and 82.8% agreed it prepared them for future research. There were no significant differences in perceived outcomes between the AI tools used. Mentor involvement was deemed essential by 69% of students, and only a minority believed AI alone could achieve the same outcomes. Common challenges included limited access to articles and peer collaboration difficulties, while key learning outcomes included improved summarization and ethical AI use.
    DISCUSSION: This study supports the integration of AI tools within a structured, mentor-guided educational framework to enhance critical evaluation and scientific writing in medical education. Human oversight and mentorship drive skill development and minimize the risk of unmoderated AI use in academic settings.
    Keywords:  Artificial intelligence (AI); mentorship; research skills; review writing; undergraduate medical students
    DOI:  https://doi.org/10.1080/0142159X.2025.2604240
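    A minimal sketch of the chi-square test of association between the AI tool used and a perception item, as described in the Methods above; the counts below are hypothetical, not the survey's data.

# Illustrative only: chi-square test of association (tool used vs. perception).
import numpy as np
from scipy.stats import chi2_contingency

#                       agreed  did not agree
contingency = np.array([[35, 10],    # ChatGPT users
                        [30, 12]])   # DeepSeek users

chi2, p, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")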
  8. PLoS One. 2026;21(1):e0339769
      This study addresses the challenge of distinguishing human translations from those generated by Large Language Models (LLMs) by utilizing dependency triplet features and evaluating 16 machine learning classifiers. Using 10-fold cross-validation, the SVM model achieves the highest mean F1-score of 93%, while all other classifiers consistently differentiate between human and machine translations. SHAP analysis helps identify key dependency features that distinguish human and machine translations, improving our understanding of how LLMs produce translationese. The findings provide practical insights for enhancing translation quality assessment and refining translation models across various languages and text genres, contributing to the advancement of natural language processing techniques. The dataset and implementation code of our study are available at: https://github.com/KiemaG5/LLM-translationese.
    DOI:  https://doi.org/10.1371/journal.pone.0339769
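    A minimal sketch of the kind of 10-fold cross-validated SVM evaluation the abstract reports, run here on a hypothetical matrix of dependency-triplet counts; the authors' actual features and code are in the repository linked above.

# Illustrative only: 10-fold cross-validated F1 for an SVM classifier.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(200, 50)).astype(float)   # toy dependency-triplet counts
y = rng.integers(0, 2, size=200)                      # 1 = human, 0 = LLM translation

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
f1_scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print(f"mean F1 over 10 folds: {f1_scores.mean():.2f}")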