bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-10-12
Three papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Acta Psychol (Amst). 2025 Oct 03. 260: 105626. pii: S0001-6918(25)00939-4. [Epub ahead of print]
      The term "cognitive efficiency" (CE) lacks a unified definition and consistent measurement across diverse academic disciplines, hindering interdisciplinary research. Concurrently, while artificial intelligence (AI) tools are rapidly evolving, systematic methodologies for their application in literature reviews remain nascent. This paper addresses these two critical gaps. First, through an AI-assisted systematic review of 96 scholarly articles, we propose a consolidated definition of CE as "a measure of an individual's memory recall and ability to process information within a given reaction time," providing much-needed clarity. Second, we present a novel, iterative methodology for conducting systematic reviews that strategically integrates the strengths of currently accessible AI tools with essential human judgment and expertise. Our findings highlight AI's proficiency in individual article comprehension and theme identification, while also demonstrating its current limitations in complex data synthesis and inter-paper comparison. This research offers both a clearer conceptualization of cognitive efficiency and a robust, reproducible framework for leveraging AI to enhance the efficiency and rigor of future systematic literature reviews.
    Keywords:  Artificial intelligence (AI); Cognitive efficiency (CE); Conceptual clarity; Human-AI collaboration; Literature review methodology; Systematic review
    DOI:  https://doi.org/10.1016/j.actpsy.2025.105626
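
    The abstract above describes an iterative workflow in which AI tools handle per-article comprehension and theme identification while humans retain judgment over synthesis. As a purely illustrative sketch (not the authors' implementation; the model name, prompt wording, and helper function are assumptions), one iteration of that per-article step might look like this in Python:

      # Illustrative sketch only: AI-assisted comprehension of a single article,
      # with human judgment kept in the loop for synthesis and comparison.
      from openai import OpenAI  # assumes the OpenAI Python client is installed

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      def extract_themes(article_text: str) -> str:
          """Ask an LLM to summarise one article and list candidate CE themes."""
          response = client.chat.completions.create(
              model="gpt-4o",  # assumed model; the paper does not name one here
              messages=[
                  {"role": "system",
                   "content": "You assist a systematic review on cognitive efficiency."},
                  {"role": "user",
                   "content": "Summarise this article and state how it defines and "
                              "measures cognitive efficiency:\n\n" + article_text},
              ],
          )
          return response.choices[0].message.content

      # A reviewer then reads each AI summary, accepts or corrects the candidate
      # themes, and only afterwards compares and synthesises across papers, the
      # step the authors report current AI tools handle poorly.
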
  2. Sci Rep. 2025 Oct 07. 15(1): 34993
      Large language models (LLMs) perform tasks such as summarizing information and analyzing sentiment to generate meaningful, natural responses. Generative AI (GenAI) built on LLMs therefore has potential utility for conducting qualitative research. Using a qualitative study that assessed the impact of the COVID-19 pandemic on the sexual and reproductive health of adolescent girls and young women (AGYW) in rural western Kenya, our objective was to compare thematic analysis conducted by GenAI using an LLM with qualitative analysis conducted by humans, with regard to the major themes identified, the selection of supportive quotes, and the quality of those quotes; secondarily, we explored quantitative and qualitative sentiment analysis conducted by the GenAI. We interfaced with GPT-4o through Google Colaboratory. After inputting the transcripts and pre-processing, we constructed a standardized task prompt. Two investigators independently reviewed the GenAI output using a rubric based on qualitative research standards. Compared with the human-derived themes, we did not find disagreement with the sub-themes raised by GenAI, but we did not consider some of them to rise to the level of a theme. Performance was low and variable with regard to the selection of quotes that were consistent with, and strongly supportive of, the thematic and sentiment analysis. Hallucinations ranged from a single changed word or phrase to truncations or recombinations of text that altered meaning. GenAI identified numerous relevant biases, primarily related to its underlying training data and its lack of cultural understanding. Few prior studies have directly compared LLM-driven thematic coding with human coding in qualitative analysis, and our study, grounded in qualitative research rigor, allowed a thorough evaluation. GenAI implemented in GPT-4o was unable to provide a thematic analysis indistinguishable from a human analysis. We suggest that it can currently be used as an aid in identifying themes, keywords, and basic narrative, and potentially as a check for human error or bias. However, until it can eliminate hallucinations, provide better contextual understanding of quotes, and undertake deeper scrutiny of the data, it is not reliable or sophisticated enough to produce a rigorous thematic analysis equal in quality to that of experienced qualitative researchers.
    DOI:  https://doi.org/10.1038/s41598-025-18969-w
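
    One failure mode reported above is that GenAI-selected quotes were sometimes altered, from a single changed word to truncated or recombined text. A simple check, offered here only as an illustrative sketch and not taken from the paper, is to verify each returned quote verbatim against the source transcripts before accepting it:

      # Illustrative sketch: flag GenAI-returned quotes that do not occur verbatim
      # in the interview transcripts, a cheap first-pass hallucination check.
      import re

      def normalise(text: str) -> str:
          """Lower-case and collapse whitespace so trivial formatting differences pass."""
          return re.sub(r"\s+", " ", text.lower()).strip()

      def verify_quotes(quotes: list[str], transcripts: list[str]) -> dict[str, bool]:
          """Return, for each quote, whether it appears verbatim in any transcript."""
          corpus = [normalise(t) for t in transcripts]
          return {q: any(normalise(q) in t for t in corpus) for q in quotes}

      # Hypothetical strings for demonstration only.
      transcripts = ["... we could not go to the clinic during the lockdown ..."]
      quotes = ["we could not go to the clinic during the lockdown",
                "the clinic was closed for the whole year"]
      print(verify_quotes(quotes, transcripts))  # the second quote is flagged as unverified
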
  3. BMC Oral Health. 2025 Oct 10. 25(1): 1594
    BACKGROUND: This study aimed to assess and compare the ability of ChatGPT-4o and Gemini Pro to generate structured abstracts from full-text systematic reviews and meta-analyses in orthodontics, based on adherence to the PRISMA for Abstracts (PRISMA-A) checklist, using a customised prompt developed for this purpose.
    MATERIALS AND METHODS: A total of 162 full-text systematic reviews and meta-analyses published in Q1-ranked orthodontic journals since January 2019 were included. Each full-text article was processed by ChatGPT-4o and Gemini Pro using a structured prompt aligned with the PRISMA-A checklist. Outputs were scored with a tailored Overall Quality Score (OQS) derived from the 11 PRISMA-A checklist items. Inter-rater and time-dependent reliability were assessed with intraclass correlation coefficients (ICCs), and model outputs were compared using Mann-Whitney U tests.
    RESULTS: Both models yielded satisfactory OQS values in generating PRISMA-A-compliant abstracts; however, ChatGPT-4o consistently achieved higher scores than Gemini Pro. The most notable differences were observed in the "Included Studies" and "Synthesis of Results" sections, where ChatGPT-4o produced more complete and structurally coherent outputs. ChatGPT-4o achieved a mean OQS of 21.67 (SD 0.58) versus 21.00 (SD 0.71) for Gemini Pro, a statistically significant difference (p < 0.001).
    CONCLUSIONS: Both LLMs demonstrated the ability to generate PRISMA-A-compliant abstracts from systematic reviews, with ChatGPT-4o consistently achieving higher quality scores than Gemini Pro. While tested in orthodontics, the approach holds potential for broader applications across evidence-based dental and medical research. Systematic reviews and meta-analyses are essential to evidence-based dentistry but can be challenging and time-consuming to report in accordance with established standards. The structured prompt developed in this study may assist researchers in generating PRISMA-A-compliant outputs more efficiently, helping to accelerate the completion and standardisation of high-level clinical evidence reporting.
    Keywords:  ChatGPT; Gemini; Large language models; Meta-analyses; PRISMA-A; Prompt engineering; Systematic review
    DOI:  https://doi.org/10.1186/s12903-025-06982-4
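
    The comparison above rests on two standard analyses: intraclass correlation coefficients for inter-rater and time-dependent reliability, and Mann-Whitney U tests for the model comparison. The sketch below illustrates both with placeholder scores; it is not the authors' code, and the simulated data, library choices (scipy, pingouin), and rater labels are assumptions:

      # Illustrative sketch: Mann-Whitney U test on OQS scores for two models,
      # plus an ICC for inter-rater reliability, using placeholder data.
      import numpy as np
      import pandas as pd
      from scipy.stats import mannwhitneyu
      import pingouin as pg  # assumes pingouin is installed for intraclass_corr

      rng = np.random.default_rng(0)
      # Placeholder scores only; the study scored 162 abstracts per model.
      chatgpt_oqs = rng.normal(21.67, 0.58, size=162).round()
      gemini_oqs = rng.normal(21.00, 0.71, size=162).round()

      u_stat, p_value = mannwhitneyu(chatgpt_oqs, gemini_oqs, alternative="two-sided")
      print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.4g}")

      # Inter-rater reliability: long-format table of two raters scoring the same outputs.
      ratings = pd.DataFrame({
          "abstract": np.tile(np.arange(162), 2),
          "rater": np.repeat(["rater1", "rater2"], 162),
          "oqs": np.concatenate([chatgpt_oqs, chatgpt_oqs + rng.integers(-1, 2, 162)]),
      })
      icc = pg.intraclass_corr(data=ratings, targets="abstract", raters="rater", ratings="oqs")
      print(icc[["Type", "ICC"]])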