bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-12-28
eight papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Antioxid Redox Signal. 2025 Dec 11.
      The exponential growth of biomedical literature has rendered traditional search methods inadequate. Artificial intelligence (AI) tools have emerged, and continue to develop, as transformative solutions for literature search and knowledge mining. This first article of a series, intended to address different components of biomedical research, provides a comprehensive analysis of recent advancements, practical applications, and challenges in deploying AI for biomedical research. The objective of this work is to synthesize the evolution, capabilities, and limitations of AI-driven tools for literature discovery, summarization, and evidence synthesis, offering actionable insights for researchers across disciplines. AI tools have progressed from keyword-based retrieval to semantic and multimodal approaches. Platforms such as Elicit, BioGPT, and PubTator 3.0 enable rapid extraction of gene-disease associations and evidence-based insights, while ResearchRabbit and Connected Papers visualize citation networks. Systematic review tools like Rayyan and Covidence reduce screening time by up to 50%. Variability in output quality, risk of hallucination, and lack of algorithmic transparency pose challenges. Open-source solutions (e.g., BioGPT, DeepChem) and explainability-focused tools (e.g., Scite.ai) offer promising pathways to mitigate these concerns. AI-driven literature workflows can accelerate hypothesis generation, systematic reviews, and translational research. However, close human expert oversight remains indispensable to ensure rigor and interpretive accuracy. These technologies are not a passing trend; they are forging the contours of tomorrow's research landscape. The peril lies as much in reckless adoption as in willful oblivion. This editorial serves as a general roadmap for integrating trustworthy AI tools into biomedical research, fostering high-impact innovation.
    DOI:  https://doi.org/10.1177/15230864251405885
  2. BMC Health Serv Res. 2025 Dec 20.
      
    Keywords:  Abstract screening; Artificial intelligence; Large language models; Scoping review
    DOI:  https://doi.org/10.1186/s12913-025-13901-4
  3. J Clin Epidemiol. 2025 Dec 18. pii: S0895-4356(25)00442-1. [Epub ahead of print] 112109
     OBJECTIVES: To implement and evaluate a semi-automated approach to facilitate rating the Grading of Recommendations Assessment, Development and Evaluation (GRADE) certainty of evidence (CoE) for direct comparisons within two living network meta-analyses.
    METHODS: For each of three GRADE domains (study limitations, indirectness, and inconsistency), decision rules were developed and used to generate automated judgements for each domain and the overall certainty. Inputs included risk-of-bias and indirectness ratings for each study and measures of heterogeneity. Indirectness ratings were made by two independent reviewers and resolved through consensus. Using an online tool customized to our project, two independent raters viewed forest plots and additional data and could confirm or modify the suggested rating. Disagreements were resolved by consensus. We evaluated inter-rater reliability and accuracy.
    RESULTS: Across 374 direct comparisons, there was perfect agreement (100%) between the automated judgement and reviewer consensus when only a single study was available (n=292), and near-perfect agreement when more than one study was available (99 to 100% for the three GRADE domains and 96% for overall rating). Inter-rater reliability was near perfect (Gwet's AC1 kappa ranging from 96% to 100%).
    CONCLUSION: Automated judgements using established decision rules agreed with expert judgement for the vast majority of GRADE CoE ratings.
    Keywords:  Grading of Recommendations Assessment, Development and Evaluation (GRADE); Living Systematic reviews; Network meta-analysis (NMA); Randomized Controlled Trials (RCTs); Rheumatoid arthritis (RA); Semi-Automation; Systematic Reviews
    DOI:  https://doi.org/10.1016/j.jclinepi.2025.112109
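    The decision-rule approach in the entry above is only described at a high level in the abstract. The sketch below is a minimal, hypothetical Python illustration of how per-study inputs (risk of bias, indirectness, heterogeneity) might be mapped to automated domain judgements and a suggested overall certainty; all thresholds, labels, and function names are assumptions for illustration, not the authors' actual rules.
```python
# Hypothetical sketch of rule-based GRADE domain judgements; the actual
# decision rules used in the paper are not given in the abstract, so the
# thresholds and labels below are illustrative assumptions only.

RATINGS = ["high", "moderate", "low", "very low"]

def downgrade(level: str, steps: int) -> str:
    """Move a certainty rating down by `steps` levels, bounded at 'very low'."""
    return RATINGS[min(RATINGS.index(level) + steps, len(RATINGS) - 1)]

def rate_domains(rob_ratings, indirectness_ratings, i_squared):
    """Return per-domain downgrade decisions (0 = no serious concern, 1 = serious concern)."""
    return {
        # Study limitations: downgrade if most studies are at high risk of bias.
        "study_limitations": 1 if rob_ratings.count("high") / len(rob_ratings) > 0.5 else 0,
        # Indirectness: downgrade if any included study was judged seriously indirect.
        "indirectness": 1 if "serious" in indirectness_ratings else 0,
        # Inconsistency: not assessable with a single study; otherwise use heterogeneity.
        "inconsistency": 0 if len(rob_ratings) == 1 else (1 if (i_squared or 0) > 50 else 0),
    }

def overall_certainty(domains, start="high"):
    """Suggested overall certainty: start at 'high' and downgrade once per serious concern."""
    return downgrade(start, sum(domains.values()))

# Example: two studies, one at high risk of bias, no serious indirectness, I-squared = 62%.
d = rate_domains(["high", "low"], ["not serious", "not serious"], i_squared=62)
print(d, overall_certainty(d))  # suggests 'moderate'
```
    In the study itself, such a suggested rating was then shown to two independent raters, who could confirm or modify it before consensus.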
  4. J Am Med Inform Assoc. 2025 Dec 23. pii: ocaf223. [Epub ahead of print]
       OBJECTIVES: To develop AutoReporter, a large language model (LLM) system that automates evaluation of adherence to research reporting guidelines.
    MATERIALS AND METHODS: Eight prompt-engineering and retrieval strategies coupled with reasoning and general-purpose LLMs were benchmarked on the SPIRIT-CONSORT-TM corpus. The top-performing approach, AutoReporter, was validated on BenchReport, a novel benchmark dataset of expert-rated reporting guideline assessments from 10 systematic reviews.
    RESULTS: AutoReporter, a zero-shot, no-retrieval prompt coupled with the o3-mini reasoning LLM, demonstrated strong accuracy (CONSORT: 90.09%; SPIRIT: 92.07%) and substantial agreement with human raters (CONSORT: Cohen's κ = 0.70; SPIRIT: Cohen's κ = 0.77), at a runtime of 617.26 s (CONSORT) and 544.51 s (SPIRIT) and a cost of 0.68 USD (CONSORT) and 0.65 USD (SPIRIT). AutoReporter achieved a mean accuracy of 91.8% and substantial agreement (Cohen's κ > 0.6) with expert ratings on the BenchReport benchmark.
    DISCUSSION: Structured prompting alone can match or exceed fine-tuned domain models while forgoing manually annotated corpora and computationally intensive training.
    CONCLUSION: Large language models can feasibly automate reporting guideline adherence assessments for scalable quality control in scientific research reporting. AutoReporter is publicly accessible at https://autoreporter.streamlit.app.
    Keywords:  adherence; concordance; large language model; quality control; reporting guideline
    DOI:  https://doi.org/10.1093/jamia/ocaf223
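    AutoReporter's top-performing configuration is described as a zero-shot, no-retrieval prompt sent to the o3-mini reasoning model. The sketch below shows the general shape such an adherence check could take for a single guideline item; the prompt wording, output format, and helper function are assumptions, and only the model name comes from the abstract.
```python
# Hypothetical sketch of a zero-shot, no-retrieval adherence check in the spirit of
# AutoReporter; the paper's actual prompt and output schema are not given in the
# abstract, so everything below is an illustrative assumption.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def check_item(manuscript_text: str, guideline: str, item: str) -> str:
    """Ask a reasoning model whether one reporting-guideline item is adhered to."""
    prompt = (
        f"You are assessing adherence to the {guideline} reporting guideline.\n"
        f"Item: {item}\n\n"
        f"Manuscript:\n{manuscript_text}\n\n"
        "Answer with exactly one of: 'adherent', 'not adherent', or 'not applicable', "
        "followed by a one-sentence justification quoting the relevant passage."
    )
    response = client.chat.completions.create(
        model="o3-mini",  # reasoning model named in the abstract
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example call for a single CONSORT item (illustrative item wording):
# print(check_item(open("trial_report.txt").read(), "CONSORT 2010",
#                  "8a: Method used to generate the random allocation sequence"))
```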
  5. J Korean Med Sci. 2025 Dec 22. 40(49): e342
      Choosing the right statistical tests is essential for reliable results, but errors, like picking the wrong test or misinterpreting data, can easily lead to incorrect conclusions. Research integrity implies presenting research that is honest, clear, and statistically correct. By identifying statistical errors, artificial intelligence (AI) systems such as Statcheck and GRIM-Test increase the reliability of research and assist reviewers. AI helps non-experts analyze data, but it can be unpredictable for experts dealing with complex data analysis. Still, its ease of use and growing abilities show promise. Recent studies show that AI is increasingly helpful in research, assisting in spotting errors in methodology, citations, and statistical analyses. Tools like LLMs, Black Spatula, YesNoError, and GRIM-Test improve accuracy, but they need good data and human checks. AI has moderate accuracy overall but performs better in controlled settings. Statcheck and GRIM-Test are especially good at spotting statistical errors. As more studies are retracted, AI offers helpful, albeit imperfect, support. It can speed up peer review and reduce reviewer workload, but it still has limits, such as bias and a lack of expert judgment. AI also brings risks such as misreading results, ethical issues, and privacy concerns, so editors must make final decisions. To use AI safely and effectively, large, well-labeled datasets, teamwork across fields, and secure systems are required. Human oversight is always necessary to review research processes and ensure their reliability; humans must make the final decision and use AI responsibly.
    Keywords:  Artificial Intelligence; Publications; Scientific Misconduct; Statistics
    DOI:  https://doi.org/10.3346/jkms.2025.40.e342
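    The GRIM (Granularity-Related Inconsistency of Means) test mentioned above is a simple arithmetic check: for n integer-valued responses, a reported mean is only possible if some integer total divided by n rounds to it. The compact sketch below is a simplified illustration (single-item scales, standard rounding) rather than the published tools' implementation.
```python
# Minimal sketch of the GRIM check for means reported from integer-valued data
# (e.g., single Likert items). Simplified: assumes one item per respondent and
# conventional rounding to `decimals` places.
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if the reported mean is arithmetically possible for n integer scores."""
    target = round(reported_mean, decimals)
    # Only the integer totals nearest to mean*n could produce the reported mean.
    for total in (int(reported_mean * n), int(reported_mean * n) + 1):
        if round(total / n, decimals) == target:
            return True
    return False

# Example: a mean of 5.19 from n = 28 integer responses is impossible,
# while 5.18 (145/28 = 5.178...) is possible.
print(grim_consistent(5.19, 28))  # False
print(grim_consistent(5.18, 28))  # True
```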
  6. Eur J Appl Physiol. 2025 Dec 24.
      The integration of Large Language Models (LLMs) into scientific writing presents significant opportunities for scholars but also risks, including misinformation and plagiarism. A new body of literature is taking shape to verify the capability of LLMs to execute the complex tasks inherent to academic publishing. In this context, this study was driven by the need to critically assess an LLM's out-of-the-box performance in generating evidence synthesis reviews. To this end, the signature topic of the authors' group, cross-education of voluntary force, was chosen as a model. We prompted a popular LLM (Gemini 2.5 Pro, Deep Research enabled) to generate a scoping review on the neural mechanisms underpinning cross-education. The resulting unedited manuscript was submitted for formal peer review to four leading subject-matter experts. Their qualitative feedback on the manuscript's structure, content, and integrity was collated and analyzed. Peer reviewers identified critical failures at fundamental stages of the review process. The LLM failed to: (1) identify specific research questions; (2) adhere to established methodological frameworks; (3) implement trustworthy search strategies; (4) objectively synthesize data. Importantly, the Results section was deemed interpretative rather than descriptive. Referencing was agreed to be the worst issue: references were inaccurate, biased toward open-access sources (84%), and contained instances of plagiarism. The LLM also failed to hierarchize evidence, presenting minor or underexplored findings as established evidence. The LLM generated a non-systematic, poorly structured, and unreliable narrative review. These findings suggest that the selected LLM is incapable of autonomously performing scientific synthesis and requires substantial human supervision to correct the observed issues.
    Keywords:  Evidence synthesis; Generative AI; Neurophysiology; Peer review; Plagiarism; Scholarly Publishing
    DOI:  https://doi.org/10.1007/s00421-025-06100-w
  7. Qual Health Res. 2025 Dec 24. 10497323251401503
      Artificial intelligence (AI) is now routinely deployed in qualitative health research. Comparative evaluations indicate that these systems can reproduce coding methods but can falter on culturally nuanced or emotionally complex material. Conventional reflexivity guidelines focus on investigator positionality and provide limited guidance for assessing algorithmic influence at early stages of the analysis process. We introduce the AI-Reflexivity Checklist (ARC), a pre-analysis, evidence-informed checkpoint that sets the appropriate human-in-the-loop (HITL) posture (delegate, assist/augment, or human-led) for LLM-assisted qualitative coding of textual data. Literature from science and technology studies, empirical studies of AI-assisted qualitative analysis, and pragmatic workflow models informed the identification of five decision domains: descriptive scope, contextual variation, experiential depth, ethical exposure, and output reversibility. These domains are operationalized as five sequential prompts completed before AI is introduced. If the planned task is purely descriptive, meanings are stable across contexts, experiential nuance is minimal, ethical risk is low, and outputs can be fully revised or reversed, automation is permitted with routine human verification. Elevated ratings on the experiential or ethical domains point to an assist/human-led posture unless pilot evidence meets pre-specified acceptance criteria; lack of reversibility remains a blocker because it precludes audit and repair. ARC extends existing reflexivity practice to encompass algorithmic actors, offers a brief record suitable for review, and mitigates early path-dependency toward indiscriminate automation.
    Keywords:  artificial intelligence; ethics; healthcare; large-language models; qualitative; reflexivity
    DOI:  https://doi.org/10.1177/10497323251401503
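    The ARC decision logic summarized in the entry above can be illustrated as a small rule set. The sketch below encodes the abstract's description (reversibility as a blocker, full automation only when all five domains are unproblematic, elevated experiential or ethical ratings pushing toward assist or human-led postures); the boolean ratings, the pilot-evidence flag, and the exact rule ordering are simplifying assumptions about what is, in practice, a reflexive qualitative instrument.
```python
# Illustrative encoding of the ARC decision logic as described in the abstract.
# The ratings, their scale, and the posture rules below are simplifying assumptions.
from dataclasses import dataclass

@dataclass
class ARCRatings:
    descriptive_scope: bool       # is the planned coding task purely descriptive?
    stable_context: bool          # are meanings stable across contexts?
    low_experiential_depth: bool  # is experiential nuance minimal?
    low_ethical_exposure: bool    # is ethical risk low?
    reversible_outputs: bool      # can AI outputs be fully revised or reversed?
    pilot_meets_criteria: bool = False  # pre-specified acceptance criteria met in a pilot?

def hitl_posture(r: ARCRatings) -> str:
    """Return a suggested human-in-the-loop posture: delegate, assist/augment, or human-led."""
    if not r.reversible_outputs:
        # Irreversibility is a blocker: it precludes audit and repair.
        return "human-led"
    if all([r.descriptive_scope, r.stable_context,
            r.low_experiential_depth, r.low_ethical_exposure]):
        return "delegate (with routine human verification)"
    if (not r.low_experiential_depth or not r.low_ethical_exposure) and not r.pilot_meets_criteria:
        return "human-led"
    return "assist/augment"

# Example: descriptive task with stable meanings but notable experiential depth, no pilot yet.
print(hitl_posture(ARCRatings(True, True, False, True, True)))  # 'human-led'
```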
  8. J Clin Epidemiol. 2025 Dec 23. pii: S0895-4356(25)00451-2. [Epub ahead of print] 112118
     OBJECTIVES: Incomplete reporting of research limits its usefulness and contributes to research waste. Numerous reporting guidelines have been developed to support complete and accurate reporting of healthcare research studies. Completeness of reporting can be measured by evaluating adherence to reporting guidelines. However, assessing adherence to a reporting guideline often lacks uniformity. In 2019, we developed a reporting adherence tool for the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement. With recent advances in regression and artificial intelligence (AI)/machine learning (ML)-based methods, TRIPOD+AI (www.tripod-statement.org) was developed to replace the TRIPOD statement. The aim of this study was to develop an updated adherence tool for TRIPOD+AI.
    STUDY DESIGN AND SETTING: Based on the TRIPOD+AI full reporting guideline, including the accompanying Explanation and Elaboration light, and TRIPOD+AI for Abstracts, we updated and expanded the original TRIPOD adherence tool and refined the adherence elements and their scoring rules through discussions within the author team and a pilot test.
    RESULTS: The updated tool comprises 37 main items and 136 adherence elements and includes several automated scoring rules. We developed separate TRIPOD+AI adherence tools for model development, model evaluation, and studies describing both in a single paper.
    CONCLUSION: A uniform approach to assessing reporting adherence to TRIPOD+AI allows comparisons across various fields, enables monitoring of reporting over time, and incentivizes primary study authors to comply.
    PLAIN LANGUAGE SUMMARY: Accurate and complete reporting is crucial in biomedical research to ensure findings can be effectively used. To support researchers in reporting their findings well, reporting guidelines have been developed for different study types. One such guideline is TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis), which focuses on research studies about medical prediction tools. In 2024, TRIPOD was updated to TRIPOD+AI to address the increasing use of artificial intelligence and machine learning in prediction model studies. In 2019, we developed a scoring system to evaluate how well research papers on prediction tools adhered to the TRIPOD guideline, resulting in a reporting completeness score. This score allows for easier comparison of reporting completeness across various medical fields and makes it possible to monitor improvement in reporting over time. With the introduction of TRIPOD+AI, an update of the scoring system was required to align with the new reporting recommendations. We achieved this by reviewing our previous scoring system and incorporating the new items from TRIPOD+AI to better suit studies involving AI. We believe that this system will facilitate comparisons of prediction model reporting completeness across different fields and encourage improved reporting practices.
    Keywords:  Reporting completeness; TRIPOD; TRIPOD+AI; adherence; artificial intelligence; machine learning; prediction models; reporting guidelines
    DOI:  https://doi.org/10.1016/j.jclinepi.2025.112118
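    The abstract does not spell out how the 136 adherence elements roll up into a score, so the sketch below illustrates one plausible aggregation in the spirit of the earlier TRIPOD adherence tool: an item counts as adhered to only when all of its applicable elements are reported, and the paper-level score is the proportion of applicable items adhered to. The data structure, rule, and example values are assumptions for illustration, not the tool's published scoring rules.
```python
# Hypothetical adherence-score aggregation for a TRIPOD-style tool.
# Element score: True = reported, False = not reported, None = not applicable.
from typing import Dict, List, Optional

ElementScores = Dict[str, List[Optional[bool]]]

def item_adhered(elements: List[Optional[bool]]) -> Optional[bool]:
    """An item is adhered to if every applicable element is reported; None if nothing applies."""
    applicable = [e for e in elements if e is not None]
    if not applicable:
        return None
    return all(applicable)

def adherence_score(scores: ElementScores) -> float:
    """Proportion of applicable items adhered to, in [0, 1]."""
    item_results = [item_adhered(elems) for elems in scores.values()]
    applicable_items = [r for r in item_results if r is not None]
    return sum(applicable_items) / len(applicable_items)

# Example with three fictitious items and their element-level judgements:
example = {
    "Title 1":    [True],
    "Methods 8a": [True, True, False],  # one missing element -> item not adhered to
    "Results 13": [True, None],         # non-applicable elements are ignored
}
print(f"{adherence_score(example):.2f}")  # 0.67
```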