bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-06-22
seven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Clin Nurs. 2025 Jun 16.
       AIM: To examine the feasibility of using a large language model (LLM) as a screening tool during structured literature reviews to facilitate evidence-based practice.
    DESIGN: A proof-of-concept study.
    METHODS: This paper outlines an innovative method that combines ChatGPT with computer coding for large-scale, effective, and efficient abstract screening. The authors, new to ChatGPT and computer coding, used online education and ChatGPT to upskill. The method was empirically tested on 400 abstracts relating to public involvement in nursing education, drawn from four databases (CINAHL, Scopus, ERIC and MEDLINE), using four versions of ChatGPT. Results were compared with those of a human nursing researcher and reported using the CONSORT 2010 extension for pilot and feasibility trials checklist.
    RESULTS: ChatGPT-3.5 Turbo was most effective for rapid screening and took a broad inclusionary approach, with a false-negative rate lower than the human researcher's. More recent versions (ChatGPT-4, 4 Turbo, and 4 omni) were less effective and produced more false negatives than ChatGPT-3.5 Turbo and the human researcher. These more recent versions did not appear to appreciate the nuance and complexity of the concepts that underpin nursing practice.
    CONCLUSION: LLMs can be useful in reducing the time nurses spend screening research abstracts without compromising on literature review quality, indicating the potential for expedited synthesis of research evidence to bridge the research-practice gap. However, the benefits of using LLMs can only be realised if nurses actively engage with LLMs, explore LLMs' capabilities to address complex nursing issues, and report on their findings.
    IMPLICATIONS FOR THE PROFESSIONAL AND/OR PATIENT CARE: Nurses need to engage with LLMs to explore their capabilities and suitability for nursing purposes.
    PATIENT OR PUBLIC CONTRIBUTION: No patient or public contribution.
    DOI:  https://doi.org/10.1111/jocn.17818
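The screening workflow in paper 1 can be sketched in a few lines: each abstract is sent to a chat model with an include/exclude instruction, the reply is parsed into a decision, and the decisions are compared against a human screener's to obtain the false-negative rate. The prompt wording, the parsing rule, and the toy labels below are illustrative assumptions, not the authors' actual protocol.

```python
# Hypothetical prompt for a single abstract (the model call itself is omitted;
# any chat API would slot in here).
def build_screening_prompt(abstract: str, topic: str) -> str:
    """Compose an include/exclude instruction for one abstract."""
    return (
        f"You are screening abstracts for a literature review on: {topic}.\n"
        "Reply with exactly one word, INCLUDE or EXCLUDE.\n\n"
        f"Abstract: {abstract}"
    )

def parse_decision(reply: str) -> bool:
    """Map a model reply to a boolean include decision."""
    return reply.strip().upper().startswith("INCLUDE")

def false_negative_rate(model, human):
    """Share of human-included records that the model wrongly excluded."""
    relevant = [m for m, h in zip(model, human) if h]
    if not relevant:
        return 0.0
    return sum(1 for m in relevant if not m) / len(relevant)

# Toy comparison: the human includes 4 of 6 records; the model misses 1 of the 4.
human = [True, True, True, True, False, False]
model = [True, True, True, False, True, False]
print(false_negative_rate(model, human))  # 0.25
```

The false-negative rate is the key quantity in the study's comparison, since a screening tool that silently drops relevant records would compromise review quality even if it saves time.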
  2. J Pediatr Urol. 2025 Jun 02. pii: S1477-5131(25)00303-1. [Epub ahead of print]
       INTRODUCTION: The European Association of Urology - European Society for Pediatric Urology (EAU-ESPU) guidelines comprise a comprehensive publication of evidence-based clinical guidelines for the field of pediatric urology. The goal is to produce recommendations that optimize patient care and provide an assessment of benefits, harms, and possible alternative treatment options. Artificial intelligence (AI) has evolved immensely and is often used in urology. The emergence of Chat Generative Pre-trained Transformer (ChatGPT) and CoPilot added a new dimension to AI and made its widespread use possible. ChatGPT and CoPilot are both large language models (LLMs).
    OBJECTIVES: The aim of the current study was to test the ability of LLMs to provide a trustworthy update of two of the chapters of the EAU-ESPU Pediatric Urology Guideline.
    STUDY DESIGN: Three LLMs (ChatGPT-3.5, ChatGPT-4.0, and CoPilot) were asked to perform a systematic update of the hydrocele and varicocele chapters. For both chapters, two standard conversations were written: one free-form, human-style dialogue and one conversation that included minor prompt engineering, i.e. few-shot prompting. All conversations were performed five times by an independent researcher, and outcomes were scored for accuracy, consistency, and reliability by two reviewers using several predefined criteria.
    RESULTS: A total of sixty conversations were analyzed. All three LLMs were unable to update the guidelines with the recent relevant literature because they lacked access to the correct scientific databases. Furthermore, high variability was seen in the responses provided by the LLMs, even though the input text was identical every time. Basic prompting in the structured conversations improved response consistency compared with the human-style dialogues. Despite the use of basic prompting, the reproducibility, consistency, and reliability of the updates provided by the LLMs were assessed to be inadequate.
    DISCUSSION: AI and LLM-specific plug-ins are developing at a very fast pace. A follow-up project would be to create, in cooperation with AI experts, dedicated plug-ins and advanced prompt engineering for existing LLMs so that they can update the guidelines with access to the relevant databases and correct instructions to follow the guideline handbook.
    CONCLUSION: At the moment, LLMs cannot replace the members of the EAU Guidelines panel in updating the clinical guidelines: they demonstrated inadequate consistency, reliability, and accuracy, and were unable to incorporate new literature.
    Keywords:  Artificial intelligence; Clinical guidelines; Large language models; Pediatric urology
    DOI:  https://doi.org/10.1016/j.jpurol.2025.05.030
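The few-shot prompting variant used in paper 2's structured conversations can be sketched as follows: worked examples are prepended to the request so the model sees the expected answer format. The example abstracts, labels, and wording below are invented for illustration and are not the study's actual prompts.

```python
# Hypothetical few-shot prompt builder for a guideline-update relevance check.
# The examples teach the model the expected "Abstract: ... / Answer: ..." format.
FEW_SHOT_EXAMPLES = [
    ("Randomized trial of early vs delayed hydrocele repair in infants.",
     "RELEVANT: reports outcomes that could change the hydrocele chapter."),
    ("Case report of a rare scrotal tumor.",
     "NOT RELEVANT: single case, outside the guideline update scope."),
]

def build_few_shot_prompt(new_abstract: str) -> str:
    """Prepend labeled examples, then leave an open answer slot for the model."""
    parts = ["Decide whether each abstract is relevant to the guideline update."]
    for text, label in FEW_SHOT_EXAMPLES:
        parts.append(f"Abstract: {text}\nAnswer: {label}")
    parts.append(f"Abstract: {new_abstract}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt("Cohort study of varicocele repair and fertility.")
print(prompt.count("Answer:"))  # 3: one per example plus the open slot
```

The study found that even this basic structure improved response consistency relative to free-form dialogue, which matches the general intuition that constraining the output format reduces variability.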
  3. Oncology. 2025 Jun 13. 1-16
       BACKGROUND: Most tools that automatically extract information from medical publications are domain-agnostic and process publications from any field. However, retrieving only trials from dedicated fields could have advantages for further processing of the data.
    METHODS: We trained a small transformer model to classify trials into randomized controlled trials (RCTs) vs. non-RCTs and oncology publications vs. non-oncology publications. In addition, we used two large language models (GPT-4o and GPT-4o mini) for the same task. We assessed the performance of the three models and then developed a simple set of rules to extract the tumor entity from the retrieved oncology RCTs.
    RESULTS: On the unseen test set of 100 publications, the small transformer achieved an F1-score of 0.96 (95% CI: 0.92-1.00) with a precision of 1.00 and a recall of 0.92 for predicting whether a publication was an RCT. For predicting whether a publication covered an oncology topic, the F1-score was 0.84 (95% CI: 0.77-0.91) with a precision of 0.75 and a recall of 0.95. GPT-4o achieved an F1-score of 0.94 (95% CI: 0.90-0.99) with a precision of 0.89 and a recall of 1.00 for predicting whether a publication was an RCT. For predicting whether a publication covered an oncology topic, the F1-score was 0.91 (95% CI: 0.85-0.97) with a precision of 0.91 and a recall of 0.91. The rule-based system correctly assigned every oncology RCT in the test set to a tumor entity.
    CONCLUSION: In conclusion, classifying publications depending on whether they were randomized controlled oncology trials or not was feasible and enabled further processing using more specialized tools such as rule-based systems and potentially dedicated machine learning models.
    DOI:  https://doi.org/10.1159/000546970
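The scores reported in paper 3 can be sanity-checked from a confusion matrix. For the small transformer's RCT result (precision 1.00, recall 0.92), a test set with, say, 25 true RCTs would correspond to 23 true positives, 0 false positives, and 2 false negatives; only the precision/recall/F1 values come from the abstract, and the raw counts here are illustrative.

```python
# Standard definitions of precision, recall, and F1 from confusion-matrix counts.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Illustrative counts consistent with the reported precision 1.00 / recall 0.92.
p, r, f1 = precision_recall_f1(tp=23, fp=0, fn=2)
print(round(p, 2), round(r, 2), round(f1, 2))  # 1.0 0.92 0.96
```

The computed F1 of 0.96 agrees with the value reported for the RCT classification task, so the three numbers in the abstract are internally consistent.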
  4. Ther Innov Regul Sci. 2025 Jun 14.
       INTRODUCTION: Generative artificial intelligence (AI) has the potential to transform and accelerate how information is accessed during the regulation of human drug and biologic products.
    OBJECTIVES: Determine whether a generative AI-supported application with retrieval-augmented generation (RAG) architecture can be used to correctly answer questions about the information contained in FDA guidance documents.
    METHODS: Five large language models (LLMs), namely Flan-UL2, GPT-3.5 Turbo, GPT-4 Turbo, Granite, and Llama 2, were evaluated in conjunction with the RAG application Golden Retriever to assess their ability to answer questions about the information contained in clinically oriented FDA guidance documents. Models were configured to precise mode with a low temperature setting to generate precise, non-creative answers, ensuring reliable clinical regulatory review guidance for users.
    RESULTS: During preliminary testing, GPT-4 Turbo was the highest performing LLM. It was therefore selected for additional evaluation where it generated a correct response with additional helpful information 33.9% of the time, a correct response 35.7% of the time, a response with some of the required correct information 17.0% of the time, and a response with any incorrect information 13.4% of the time. The RAG application was able to cite the correct source document 89.2% of the time.
    CONCLUSION: The ability of the generative AI application to identify the correct guidance document and answer questions could significantly reduce the time needed to find the correct answer to questions about FDA guidance documents. However, because sponsors and FDA staff may rely on the information in FDA guidance documents to guide important drug development decisions, the use of incorrect information could have a significantly negative impact on the drug development process. Based on our results, the cited source documents can reduce the time needed to find the document that contains the information, but further refinement of generative AI will likely be required before this tool can be relied on to answer questions about the information contained in FDA guidance documents. Rephrasing questions to include additional context, reconfiguring the embedding and chunking parameters, and other prompt engineering techniques may improve the rate of fully correct and complete responses.
    Keywords:  Document search; FDA guidance; Generative AI; Large language model
    DOI:  https://doi.org/10.1007/s43441-025-00798-8
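The retrieval step of a RAG pipeline like the one in paper 4 can be sketched without any model: documents are split into chunks, the chunk most similar to the question is retrieved, and its source document is cited with the answer. Real systems use dense embeddings; the word-overlap score below is a stand-in so the sketch runs standalone, and the document titles and texts are invented, not actual FDA guidances.

```python
from collections import Counter

# Toy "guidance corpus" (invented titles and contents).
DOCS = {
    "Guidance A: Adaptive Designs": "adaptive trial designs interim analyses error control",
    "Guidance B: Pediatric Studies": "pediatric study plans dosing extrapolation safety data",
}

def chunk(text: str, size: int = 5):
    """Split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query: str, passage: str) -> int:
    """Word-overlap similarity; a production system would use embeddings."""
    q, p = Counter(query.lower().split()), Counter(passage.lower().split())
    return sum(min(q[w], p[w]) for w in q)

def retrieve(query: str):
    """Return (source document title, best-matching chunk) for a question."""
    return max(
        ((doc, c) for doc, text in DOCS.items() for c in chunk(text)),
        key=lambda pair: score(query, pair[1]),
    )

doc, passage = retrieve("Which guidance covers pediatric dosing plans")
print(doc)  # Guidance B: Pediatric Studies
```

Returning the source title alongside the passage is what enables the citation accuracy the study measured (the correct source document was cited 89.2% of the time); the generation step would then answer only from the retrieved passage.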
  5. Curr Pharm Teach Learn. 2025 Jun 19. pii: S1877-1297(25)00139-X. [Epub ahead of print]17(10): 102418
       BACKGROUND: Artificial intelligence (AI) has emerged as a promising tool to support qualitative data analysis, yet its role in faculty-led studies that incorporate student researchers remains under investigation. This study examined differences in inductive thematic analysis generated by student and faculty researchers using AI compared to traditional faculty-led coding.
    METHODS: Three qualitative datasets were analyzed using OpenAI's ChatGPT by faculty and student researchers.
    RESULTS: Findings showed AI-assisted analyses identified most themes accurately, though faculty-generated AI results aligned more closely with expert-reviewed themes than student-generated AI results.
    CONCLUSIONS: AI may be a valuable tool for enhancing efficiency, particularly in the initial evaluation of qualitative data.
    Keywords:  Artificial intelligence; Educational technology; Qualitative research; Research personnel; Universities
    DOI:  https://doi.org/10.1016/j.cptl.2025.102418
  6. Oncology. 2025 Jun 13. 1-18
       PURPOSE: The automated classification of clinical trials and key categories within the medical literature is increasingly relevant, particularly in oncology, as the volume of publications and trial reports continues to expand. Large Language Models (LLMs) may provide new opportunities for automating diverse classification tasks, and could be used for general-purpose text classification to retrieve information about oncological trials.
    METHODS AND MATERIALS: A general text classification framework with adaptable prompts, models, and classification categories was developed. The framework was tested with four datasets comprising nine binary classification questions related to oncological trials. Evaluation was conducted using a locally hosted Mixtral-8x7B-Instruct v0.1-GPTQ model and three cloud-based LLMs: Mixtral-8x7B-Instruct v0.1, Llama3.1-70B-Instruct, and Qwen-2.5-72B.
    RESULTS: The system consistently produced valid responses with the local Mixtral-8x7B-Instruct model and the Llama3.1-70B-Instruct model, and achieved response validity rates of 99.70% and 99.88% for the cloud-based Mixtral and Qwen models, respectively. Across all models, the framework achieved an overall accuracy of >94%, precision of >92%, recall of >90%, and an F1-score of >92%. Question-specific accuracy ranged from 86.33% to 99.83% for the local Mixtral model, 85.49% to 99.83% for the cloud-based Mixtral model, 90.50% to 99.83% for the Llama3.1 model, and 77.13% to 99.83% for the Qwen model.
    CONCLUSIONS: The LLM-based classification framework exhibits robust accuracy and adaptability across various oncological trial classification tasks. While challenges remain, such as strong prompt dependence and high computational and hardware demands, LLMs will play a crucial role in automating the classification of oncological trials and literature as the technology continues to advance.
    DOI:  https://doi.org/10.1159/000546946
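The adaptable framework of paper 6 can be sketched as a prompt template plus a reply parser: the template, the model, and the answer categories are all parameters, and replies that cannot be mapped onto a category count against the response validity rate. The template text, category names, and toy replies below are illustrative, not the study's.

```python
# Hypothetical configurable binary-classification harness.
def build_prompt(template: str, text: str, categories: list[str]) -> str:
    """Fill a user-supplied template with the text and the allowed answers."""
    return template.format(text=text, options=" or ".join(categories))

def parse_reply(reply: str, categories: list[str]):
    """Map a free-text model reply onto one category, or None if invalid."""
    cleaned = reply.strip().lower()
    for cat in categories:
        if cleaned.startswith(cat.lower()):
            return cat
    return None

def validity_rate(replies, categories):
    """Fraction of replies that parse into a valid category."""
    parsed = [parse_reply(r, categories) for r in replies]
    return sum(p is not None for p in parsed) / len(parsed)

CATS = ["yes", "no"]
TEMPLATE = ("Is the following publication a randomized oncology trial? "
            "Answer {options}.\n{text}")
prompt = build_prompt(TEMPLATE, "Phase III trial of drug X vs placebo.", CATS)

replies = ["Yes, this is an RCT.", "no", "It depends."]  # toy model outputs
print(round(validity_rate(replies, CATS), 2))  # 0.67
```

The validity-rate check mirrors the 99.70%/99.88% figures in the abstract: before measuring accuracy, the framework must first confirm that a reply is interpretable at all.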
  7. Nature. 2025 Jun 19.
      
    Keywords:  Machine learning; Medical research; Research data
    DOI:  https://doi.org/10.1038/d41586-025-01942-y