bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-10-26
six papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Int J Dent. 2025;2025:2677641.
       Introduction: Dental implantology has seen rapid technological advancements, with artificial intelligence (AI) increasingly integrated into diagnostic, planning, and surgical processes. The release of Chat Generative Pre-trained Transformer (ChatGPT) and its subsequent updates, including the deep research function, presents opportunities for AI-assisted systematic reviews. However, its efficacy compared with traditional manual searching has not been evaluated.
    Materials and Methods: A systematic review was conducted on May 6, 2025, to evaluate recent innovations in dental implantology and AI. Two parallel searches were performed: one using ChatGPT 4.1's deep research tool in the PubMed database and another manual PubMed search by two independent reviewers. Both searches used identical keywords and Boolean operators targeting studies from 2020 to 2025. Inclusion criteria were peer-reviewed studies related to implant design, osseointegration, guided placement, and other predefined outcomes.
    Results: The manual search identified 124 articles, of which 23 met the inclusion criteria. ChatGPT retrieved 114 articles and selected 13 for inclusion, yet included only 11 in its synthesis. Two of the articles cited by the AI tool were nonexistent, and numerous relevant studies were not retrieved; the remaining articles were genuine and had also been identified by the manual search. ChatGPT had high specificity (98%) and low sensitivity (47.8%), with a statistically significant difference compared with manual search and selection.
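    These figures are consistent with the standard definitions of the two metrics. A minimal sketch of the arithmetic, assuming the 23 manually included articles define the relevant set and that the two fabricated citations are counted as false positives (an interpretation for illustration, since the abstract does not spell out the confusion matrix):

      # Sensitivity and specificity recomputed from the counts reported above.
      # The mapping of counts to a confusion matrix is an assumption made for
      # illustration; the abstract does not state it explicitly.
      relevant_total = 23                  # articles included by the manual search
      not_relevant_total = 124 - 23        # manually screened articles that were excluded

      true_positives = 11                  # relevant articles ChatGPT kept in its synthesis
      false_positives = 2                  # ChatGPT selections outside the relevant set
      false_negatives = relevant_total - true_positives
      true_negatives = not_relevant_total - false_positives

      sensitivity = true_positives / (true_positives + false_negatives)
      specificity = true_negatives / (true_negatives + false_positives)

      print(f"sensitivity = {sensitivity:.1%}")   # -> 47.8%
      print(f"specificity = {specificity:.1%}")   # -> 98.0%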
    Discussion: AI tools like ChatGPT show promise in literature searching, synthesis, and writing assistance, especially in improving readability and identifying trending topics in science. Nevertheless, the current state of the deep research function lacks the reliability required for conducting systematic reviews, owing to issues such as fabricated references and missed articles. The results highlight the need for human supervision and improved safeguards.
    Conclusions: ChatGPT's deep research function can support, but not replace, manual systematic searching and study selection. It offers substantial benefits in writing support and preliminary synthesis owing to acceptable accuracy, but its limited reliability and low sensitivity (47.8%) require cautious use and transparent reporting of any AI involvement in scientific research.
    Keywords:  ChatGPT; artificial intelligence; deep research; implantology
    DOI:  https://doi.org/10.1155/ijod/2677641
  2. Cureus. 2025 Sep;17(9):e92590.
      Objective: While large language models (LLMs) show great promise for various medical applications, their black-box nature and the difficulty of reproducing results have been noted as significant challenges. In contrast, conventional text mining is a well-established methodology, yet its mastery remains time-consuming. This study aimed to determine whether an LLM could achieve literature analysis outcomes comparable to those from traditional text mining, thereby clarifying both its utility and inherent limitations.
    Methods: We analyzed the abstracts of 5,112 medical papers retrieved from PubMed using the single keyword "text mining." We used Google Gemini 2.5 (Google Inc., Mountain View, CA, USA) and instructed it to extract distinctive words, concepts, trends, and co-occurrence network concepts. These results were then qualitatively compared with those obtained from the conventional text mining tools VOSviewer and KH Coder.
    Results: Google Gemini appeared to conceptually aggregate individual words and identify research trends. The concepts for co-occurrence networks also showed visual similarity to the networks generated by the traditional tools. However, the LLM's analytical output was based on its own unique interpretation and could not be directly compared with the statistically derived co-occurrence patterns. Furthermore, since this study relied on a visual comparison of network diagrams rather than rigorous quantitative analysis, the conclusions remain qualitative.
    Conclusion: Google Gemini demonstrated an ability to extract keywords, concepts, and trends, and it produced a co-occurrence network visually similar to those generated by conventional text mining tools. While it showed particular strengths in conceptual summarization and trend detection, its limitations, including its black-box nature, reproducibility challenges, and subjective interpretations, became apparent. With a proper understanding of these constraints, LLMs may serve as a valuable complementary tool, with the potential to accelerate literature analysis in medical research.
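    For readers unfamiliar with the co-occurrence networks produced by tools such as VOSviewer or KH Coder, the sketch below shows the basic counting step those tools perform statistically; it is illustrative only, uses invented mini-abstracts, and is not the study's actual pipeline:

      # Counting term co-occurrences within abstracts: each unordered pair of
      # distinct terms appearing in the same abstract adds one to that edge.
      from collections import Counter
      from itertools import combinations

      abstracts = [
          "text mining of clinical notes with topic models",
          "deep learning for text mining in radiology reports",
          "topic models and deep learning for clinical notes",
      ]
      stopwords = {"of", "with", "for", "in", "and", "the"}

      edge_counts = Counter()
      for abstract in abstracts:
          terms = sorted({w for w in abstract.lower().split() if w not in stopwords})
          edge_counts.update(combinations(terms, 2))

      # the heaviest edges become the links of the co-occurrence network
      for (a, b), weight in edge_counts.most_common(5):
          print(f"{a} -- {b}: {weight}")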
    Keywords:  co-occurrence network; large language model; medical literature analysis; pubmed database; text mining
    DOI:  https://doi.org/10.7759/cureus.92590
  3. Cochrane Evid Synth Methods. 2025 Nov;3(6):e70042.
       Background: Public health events of international concern highlight the need for up-to-date evidence curated through sustainable, accessible processes. In developing the Global Repository of Epidemiological Parameters (grEPI), we explore the performance of an agentic-AI-assisted pipeline (GREP-Agent) for evidence screening that capitalizes on recent advances in large language models (LLMs).
    Methods: In this study, the performance of GREP-Agent was evaluated on a dataset of 2,000 citations from a systematic review on measles using four LLMs (GPT4o, GPT4o-mini, Llama3.1, and Phi4). The GREP-Agent framework integrates multiple LLMs and human feedback to fine-tune its performance and to optimize workload reduction and accuracy when screening research articles. The contribution of each component of this agentic-AI system to performance is measured using accuracy, precision, recall, and F1-score.
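    The evaluation metrics named here are the usual screening-classification quantities. A minimal sketch, using invented include/exclude labels rather than study data, shows how they are derived from human reference decisions:

      # Accuracy, precision, recall (sensitivity), and F1 for screening decisions,
      # computed against human reference labels. The labels are illustrative only.
      gold      = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # 1 = include per human reviewers
      predicted = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]   # 1 = include per the LLM screener

      tp = sum(g == 1 and p == 1 for g, p in zip(gold, predicted))
      fp = sum(g == 0 and p == 1 for g, p in zip(gold, predicted))
      fn = sum(g == 1 and p == 0 for g, p in zip(gold, predicted))
      tn = sum(g == 0 and p == 0 for g, p in zip(gold, predicted))

      accuracy  = (tp + tn) / len(gold)
      precision = tp / (tp + fp)
      recall    = tp / (tp + fn)           # the sensitivity reported in the results
      f1        = 2 * precision * recall / (precision + recall)

      print(accuracy, precision, recall, f1)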
    Results: The results show how each phase of the GREP-Agent system incrementally improves accuracy regardless of the LLM. We found that GREP-Agent was able to increase sensitivity across a broad range of open-source and proprietary LLMs to 84.2%-88.9% after fine-tuning and to 86.4%-95.3% by varying workload reduction strategies. Performance was significantly affected by the clarity of the screening questions and by the thresholds set for the optimized workload reduction strategies.
    Conclusions: The GREP-Agent shows promise in improving the efficiency and effectiveness of evidence synthesis in dynamic public health contexts. Further development and refinement of adaptable human-in-the-loop AI systems for screening literature are essential to support future public health response activities, while maintaining a human-centric approach.
    DOI:  https://doi.org/10.1002/cesm.70042
  4. Stat Med. 2025 Oct;44(23-24):e70263.
      Modern large language models (LLMs) have reshaped the workflows of people across countless fields, and biostatistics is no exception. These models offer novel support in drafting study plans, generating software code, or writing reports. However, reliance on LLMs carries the risk of inaccuracies due to potential hallucinations that may produce fabricated "facts", leading to erroneous statistical statements and conclusions. Such errors could compromise the high precision and transparency fundamental to our field. This tutorial aims to illustrate the impact of LLM-based applications on various contemporary biostatistical tasks. We will explore both the risks and opportunities presented by this new era of artificial intelligence. Our ultimate conclusion emphasizes that advanced applications should only be used in combination with sufficient background knowledge. Over time, consistently verifying LLM outputs may lead to an appropriately calibrated trust in these tools among users.
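    One concrete form of the verification the authors recommend is to re-derive an LLM-suggested result independently. A minimal sketch using an invented example (a claimed sample size for a two-group comparison checked by Monte Carlo simulation), not anything taken from the tutorial itself:

      # Check a claimed sample size by simulating the power of a two-sample
      # z-test for a mean difference (standard deviation assumed known).
      import random
      import statistics

      def simulated_power(n_per_group, effect=0.5, sd=1.0, reps=2000, seed=1):
          rng = random.Random(seed)
          critical = 1.959964          # two-sided 5% critical value, standard normal
          hits = 0
          for _ in range(reps):
              a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
              b = [rng.gauss(effect, sd) for _ in range(n_per_group)]
              se = (sd**2 / n_per_group + sd**2 / n_per_group) ** 0.5
              z = (statistics.fmean(b) - statistics.fmean(a)) / se
              hits += abs(z) > critical
          return hits / reps

      # e.g., an LLM asserts n = 64 per group gives ~80% power for effect = 0.5;
      # re-running the calculation is the calibration step the tutorial urges.
      print(simulated_power(64))       # expect a value near 0.80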
    Keywords:  causal analysis; diagnostic accuracy; generative AI; individual-level surrogacy; large language model; latent class analysis; meta-analysis; sample size planning; simulation study; translation programming languages
    DOI:  https://doi.org/10.1002/sim.70263
  5. Forensic Sci Med Pathol. 2025 Oct 23.
      
    Keywords:  Advanced data analysis; Artificial intelligence; ChatGPT; Forensic science; Machine learning
    DOI:  https://doi.org/10.1007/s12024-025-01113-5
  6. Sch Psychol. 2025 Oct 20.
      Generative artificial intelligence (AI) applications are becoming increasingly influential in psychology training, practice, and research. In this study, the procedures (e.g., coding process) and products (e.g., codes, categories, themes, core story) of a qualitative content analysis (QCA) conducted by Chat Generative Pre-trained Transformer (ChatGPT)-4 and by novice human researchers were compared, and the advantages and disadvantages of each approach were considered. Data included open-ended survey responses from trainers (N = 60) in school psychology programs regarding assessment practices during the COVID-19 pandemic. Findings indicated that ChatGPT-4 conducted QCA with products that were, overall, similar to those of the human coders, and in significantly less time. However, ChatGPT-4's process was not transparent, and some codes and themes were unclear. Meanwhile, human coding allowed for the selection and implementation of a purposeful, coherent methodological approach and an auditable, systematic process resulting in defensible themes. Considerations for the use of AI in qualitative research are discussed, and future research directions are provided.
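    The study's comparison is qualitative, but one simple quantitative complement would be response-level agreement between the AI and human codings. A minimal sketch with invented code labels, reporting percent agreement and Cohen's kappa (neither of which is reported in the abstract):

      # Percent agreement and Cohen's kappa between two coders over matched
      # code assignments. The code labels below are invented for illustration.
      from collections import Counter

      human   = ["remote", "remote", "equity", "training", "equity", "remote"]
      chatgpt = ["remote", "equity", "equity", "training", "equity", "remote"]

      n = len(human)
      observed = sum(h == c for h, c in zip(human, chatgpt)) / n

      # chance agreement from each coder's marginal code frequencies
      h_counts, c_counts = Counter(human), Counter(chatgpt)
      expected = sum(h_counts[k] * c_counts[k] for k in h_counts) / (n * n)

      kappa = (observed - expected) / (1 - expected)
      print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")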
    DOI:  https://doi.org/10.1037/spq0000715