bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-10-26
six papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Int J Dent. 2025;2025:2677641.
       Introduction: Dental implantology has seen rapid technological advancements, with artificial intelligence (AI) increasingly integrated into diagnostic, planning, and surgical processes. The release of Chat Generative Pre-trained Transformer (ChatGPT) and its subsequent updates, including the deep research function, presents opportunities for AI-assisted systematic reviews. However, its efficacy compared with traditional manual searching has not been evaluated.
    Materials and Methods: A systematic review was conducted on May 6, 2025, to evaluate recent innovations in dental implantology and AI. Two parallel searches were performed: one using ChatGPT 4.1's deep research tool in the PubMed database and another manual PubMed search by two independent reviewers. Both searches used identical keywords and Boolean operators targeting studies from 2020 to 2025. Inclusion criteria were peer-reviewed studies related to implant design, osseointegration, guided placement, and other predefined outcomes.
    Results: The manual search identified 124 articles, of which 23 met the inclusion criteria. ChatGPT retrieved 114 articles and selected 13 for inclusion, yet included only 11 in its synthesis. Two of the articles cited by the AI tool were nonexistent, and numerous relevant studies were not retrieved; the remaining articles were genuine and had also been identified by the manual search. ChatGPT had high specificity (98%) and low sensitivity (47.8%), with a statistically significant difference compared with manual search and selection.
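    These figures are consistent with the standard definitions of the two metrics. A minimal sketch of the arithmetic, assuming the 23 manually included articles define the relevant set and that the two fabricated citations are counted as false positives (an interpretation for illustration, since the abstract does not spell out the confusion matrix):

      # Sensitivity and specificity recomputed from the counts reported above.
      # The mapping of counts to a confusion matrix is an assumption made for
      # illustration; the abstract does not state it explicitly.
      relevant_total = 23                  # articles included by the manual search
      not_relevant_total = 124 - 23        # manually screened articles that were excluded

      true_positives = 11                  # relevant articles ChatGPT kept in its synthesis
      false_positives = 2                  # ChatGPT selections outside the relevant set
      false_negatives = relevant_total - true_positives
      true_negatives = not_relevant_total - false_positives

      sensitivity = true_positives / (true_positives + false_negatives)
      specificity = true_negatives / (true_negatives + false_positives)

      print(f"sensitivity = {sensitivity:.1%}")   # -> 47.8%
      print(f"specificity = {specificity:.1%}")   # -> 98.0%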
    Discussion: AI tools like ChatGPT show promise in literature searching, synthesis, and writing assistance, especially in improving readability and identifying trending topics in science. Nevertheless, the current state of the deep research function lacks the reliability required for conducting systematic reviews, owing to issues such as fabricated references and missed articles. The results highlight the need for human supervision and improved safeguards.
    Conclusions: ChatGPT's deep research function can support, but not replace, manual systematic searching and study selection. It offers substantial benefits in writing support and preliminary synthesis owing to acceptable accuracy, but its limited reliability and low sensitivity (47.8%) require cautious use and transparent reporting of any AI involvement in scientific research.
    Keywords:  ChatGPT; artificial intelligence; deep research; implantology
    DOI:  https://doi.org/10.1155/ijod/2677641
  2. Cureus. 2025 Sep;17(9):e92590.
      Objective: While large language models (LLMs) show great promise for various medical applications, their black-box nature and the difficulty of reproducing results have been noted as significant challenges. In contrast, conventional text mining is a well-established methodology, yet its mastery remains time-consuming. This study aimed to determine whether an LLM could achieve literature analysis outcomes comparable to those from traditional text mining, thereby clarifying both its utility and inherent limitations.
    Methods: We analyzed the abstracts of 5,112 medical papers retrieved from PubMed using the single keyword "text mining." We used Google Gemini 2.5 (Google Inc., Mountain View, CA, USA) and instructed it to extract distinctive words, concepts, trends, and co-occurrence network concepts. These results were then qualitatively compared with those obtained from the conventional text mining tools VOSviewer and KH Coder.
    Results: Google Gemini appeared to conceptually aggregate individual words and identify research trends. The concepts for co-occurrence networks also showed visual similarity to the networks generated by the traditional tools. However, the LLM's analytical output was based on its own unique interpretation and could not be directly compared with the statistically derived co-occurrence patterns. Furthermore, since this study relied on a visual comparison of network diagrams rather than rigorous quantitative analysis, the conclusions remain qualitative.
    Conclusion: Google Gemini demonstrated an ability to extract keywords, concepts, and trends, and it produced a co-occurrence network visually similar to those generated by conventional text mining tools. While it showed particular strengths in conceptual summarization and trend detection, its limitations, including its black-box nature, reproducibility challenges, and subjective interpretations, became apparent. With a proper understanding of these constraints, LLMs may serve as a valuable complementary tool, with the potential to accelerate literature analysis in medical research.
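    For readers unfamiliar with the co-occurrence networks produced by tools such as VOSviewer or KH Coder, the sketch below shows the basic counting step those tools perform statistically; it is illustrative only, uses invented mini-abstracts, and is not the study's actual pipeline:

      # Counting term co-occurrences within abstracts: each unordered pair of
      # distinct terms appearing in the same abstract adds one to that edge.
      from collections import Counter
      from itertools import combinations

      abstracts = [
          "text mining of clinical notes with topic models",
          "deep learning for text mining in radiology reports",
          "topic models and deep learning for clinical notes",
      ]
      stopwords = {"of", "with", "for", "in", "and", "the"}

      edge_counts = Counter()
      for abstract in abstracts:
          terms = sorted({w for w in abstract.lower().split() if w not in stopwords})
          edge_counts.update(combinations(terms, 2))

      # the heaviest edges become the links of the co-occurrence network
      for (a, b), weight in edge_counts.most_common(5):
          print(f"{a} -- {b}: {weight}")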
    Keywords:  co-occurrence network; large language model; medical literature analysis; pubmed database; text mining
    DOI:  https://doi.org/10.7759/cureus.92590
  3. Cochrane Evid Synth Methods. 2025 Nov;3(6):e70042.
       Background: Public health events of international concern highlight the need for up-to-date evidence curated through sustainable, accessible processes. In developing the Global Repository of Epidemiological Parameters (grEPI), we explore the performance of an agentic-AI-assisted pipeline (GREP-Agent) for evidence screening that capitalizes on recent advances in large language models (LLMs).
    Methods: In this study, the performance of GREP-Agent was evaluated on a dataset of 2,000 citations from a systematic review on measles using four LLMs (GPT4o, GPT4o-mini, Llama3.1, and Phi4). The GREP-Agent framework integrates multiple LLMs and human feedback to fine-tune its performance and to optimize workload reduction and accuracy when screening research articles. The contribution of each component of this agentic-AI system to performance is measured using accuracy, precision, recall, and F1-score.
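    The evaluation metrics named here are the usual screening-classification quantities. A minimal sketch, using invented include/exclude labels rather than study data, shows how they are derived from human reference decisions:

      # Accuracy, precision, recall (sensitivity), and F1 for screening decisions,
      # computed against human reference labels. The labels are illustrative only.
      gold      = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # 1 = include per human reviewers
      predicted = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]   # 1 = include per the LLM screener

      tp = sum(g == 1 and p == 1 for g, p in zip(gold, predicted))
      fp = sum(g == 0 and p == 1 for g, p in zip(gold, predicted))
      fn = sum(g == 1 and p == 0 for g, p in zip(gold, predicted))
      tn = sum(g == 0 and p == 0 for g, p in zip(gold, predicted))

      accuracy  = (tp + tn) / len(gold)
      precision = tp / (tp + fp)
      recall    = tp / (tp + fn)           # the sensitivity reported in the results
      f1        = 2 * precision * recall / (precision + recall)

      print(accuracy, precision, recall, f1)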
    Results: The results show how each phase of the GREP-Agent system incrementally improves accuracy regardless of the LLM. We found that GREP-Agent was able to increase sensitivity across a broad range of open-source and proprietary LLMs to 84.2%-88.9% after fine-tuning and to 86.4%-95.3% by varying workload reduction strategies. Performance was significantly affected by the clarity of the screening questions and by the thresholds set for the optimized workload reduction strategies.
    Conclusions: The GREP-Agent shows promise in improving the efficiency and effectiveness of evidence synthesis in dynamic public health contexts. Further development and refinement of adaptable human-in-the-loop AI systems for screening literature are essential to support future public health response activities, while maintaining a human-centric approach.
    DOI:  https://doi.org/10.1002/cesm.70042
  4. Stat Med. 2025 Oct;44(23-24):e70263.
      Modern large language models (LLMs) have reshaped the workflows of people across countless fields, and biostatistics is no exception. These models offer novel support in drafting study plans, generating software code, or writing reports. However, reliance on LLMs carries the risk of inaccuracies due to potential hallucinations that may produce fabricated "facts", leading to erroneous statistical statements and conclusions. Such errors could compromise the high precision and transparency fundamental to our field. This tutorial aims to illustrate the impact of LLM-based applications on various contemporary biostatistical tasks. We will explore both the risks and opportunities presented by this new era of artificial intelligence. Our ultimate conclusion emphasizes that advanced applications should only be used in combination with sufficient background knowledge. Over time, consistently verifying LLM outputs may lead to an appropriately calibrated trust in these tools among users.
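    One concrete form of the verification the authors recommend is to re-derive an LLM-suggested result independently. A minimal sketch using an invented example (a claimed sample size for a two-group comparison checked by Monte Carlo simulation), not anything taken from the tutorial itself:

      # Check a claimed sample size by simulating the power of a two-sample
      # z-test for a mean difference (standard deviation assumed known).
      import random
      import statistics

      def simulated_power(n_per_group, effect=0.5, sd=1.0, reps=2000, seed=1):
          rng = random.Random(seed)
          critical = 1.959964          # two-sided 5% critical value, standard normal
          hits = 0
          for _ in range(reps):
              a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
              b = [rng.gauss(effect, sd) for _ in range(n_per_group)]
              se = (sd**2 / n_per_group + sd**2 / n_per_group) ** 0.5
              z = (statistics.fmean(b) - statistics.fmean(a)) / se
              hits += abs(z) > critical
          return hits / reps

      # e.g., an LLM asserts n = 64 per group gives ~80% power for effect = 0.5;
      # re-running the calculation is the calibration step the tutorial urges.
      print(simulated_power(64))       # expect a value near 0.80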
    Keywords:  causal analysis; diagnostic accuracy; generative AI; individual-level surrogacy; large language model; latent class analysis; meta-analysis; sample size planning; simulation study; translation programming languages
    DOI:  https://doi.org/10.1002/sim.70263
  5. Forensic Sci Med Pathol. 2025 Oct 23.
      
    Keywords:  Advanced data analysis; Artificial intelligence; ChatGPT; Forensic science; Machine learning
    DOI:  https://doi.org/10.1007/s12024-025-01113-5
  6. Sch Psychol. 2025 Oct 20.
      Generative artificial intelligence (AI) applications are becoming increasingly influential in psychology training, practice, and research. In this study, the procedures (e.g., coding process) and products (e.g., codes, categories, themes, core story) of a qualitative content analysis (QCA) conducted by Chat Generative Pre-trained Transformer (ChatGPT)-4 and by novice human researchers were compared, and the advantages and disadvantages of each approach were considered. Data included open-ended survey responses from trainers (N = 60) in school psychology programs regarding assessment practices during the COVID-19 pandemic. Findings indicated that ChatGPT-4 conducted QCA with products that were, overall, similar to those of the human coders, and in significantly less time. However, ChatGPT-4's process was not transparent, and some codes and themes were unclear. Meanwhile, human coding allowed for the selection and implementation of a purposeful, coherent methodological approach and an auditable, systematic process resulting in defensible themes. Considerations for the use of AI in qualitative research are discussed, and future research directions are provided.
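    The study's comparison is qualitative, but one simple quantitative complement would be response-level agreement between the AI and human codings. A minimal sketch with invented code labels, reporting percent agreement and Cohen's kappa (neither of which is reported in the abstract):

      # Percent agreement and Cohen's kappa between two coders over matched
      # code assignments. The code labels below are invented for illustration.
      from collections import Counter

      human   = ["remote", "remote", "equity", "training", "equity", "remote"]
      chatgpt = ["remote", "equity", "equity", "training", "equity", "remote"]

      n = len(human)
      observed = sum(h == c for h, c in zip(human, chatgpt)) / n

      # chance agreement from each coder's marginal code frequencies
      h_counts, c_counts = Counter(human), Counter(chatgpt)
      expected = sum(h_counts[k] * c_counts[k] for k in h_counts) / (n * n)

      kappa = (observed - expected) / (1 - expected)
      print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")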
    DOI:  https://doi.org/10.1037/spq0000715