bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-12-07
twelve papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Curr Rev Musculoskelet Med. 2025 Dec 02. 19(1): 7
       PURPOSE OF REVIEW: To analyze the efficacy and efficiency of current large language models (LLMs), specifically GPT-5, in screening titles and abstracts for three review topics within different orthopedic subspecialties.
    RECENT FINDINGS: Python scripts were developed to call the GPT-5 model via OpenAI's application programming interface (API). Two human reviewers simultaneously performed screening based on the same inclusion and exclusion criteria. Performance metrics, including specificity, sensitivity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and F1 score, were calculated for GPT-5 against a gold-standard inclusion and exclusion list developed by a third human adjudicator. Efficiency metrics included total cost and time to completion for each task. The number of titles and abstracts to screen ranged from 668 to 1,131 across the three review topics. All performance metrics were above 92.3% across the three topics, with sensitivities ranging from 94.1% to 100%. Time to completion ranged from 38.5 to 174.3 minutes, and cost ranged from $1.32 to $3.73 USD. GPT-5 demonstrated exceptional accuracy, sensitivity, specificity, PPV, NPV, and F1 scores in automating title and abstract screening for three orthopedic systematic review topics in three different subspecialties. These results are consistent with previous studies investigating the role of AI in screening, which likewise reported improved accuracy and time to completion relative to human reviewers. The average screening rate ranged from 6.5 to 17.4 abstracts per minute and the average price from $0.002 to $0.0036 USD per abstract, suggesting a high degree of efficiency compared with current standards.
    Keywords:  Artificial intelligence; Automation; GPT-5; LLM; Large language model; Systematic review
    DOI:  https://doi.org/10.1007/s12178-025-10001-y
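    The screening pipeline described above (Python scripts calling GPT-5 through OpenAI's API, scored against a gold-standard inclusion list) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the model identifier, prompt wording, and criteria below are assumptions.

      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      # Hypothetical screening criteria; each review topic would supply its own.
      CRITERIA = "Include comparative clinical studies on the review topic; exclude case reports."

      def screen_record(title: str, abstract: str, model: str = "gpt-5") -> bool:
          """Ask the model for an INCLUDE/EXCLUDE decision on one title and abstract."""
          response = client.chat.completions.create(
              model=model,  # identifier assumed; the paper only names "GPT-5"
              messages=[
                  {"role": "system",
                   "content": "You screen titles and abstracts for a systematic review. Reply INCLUDE or EXCLUDE only."},
                  {"role": "user",
                   "content": f"Criteria: {CRITERIA}\n\nTitle: {title}\n\nAbstract: {abstract}"},
              ],
          )
          return response.choices[0].message.content.strip().upper().startswith("INCLUDE")

      def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
          """Confusion-matrix metrics against the gold-standard adjudicated list."""
          sensitivity = tp / (tp + fn)
          specificity = tn / (tn + fp)
          ppv = tp / (tp + fp)
          npv = tn / (tn + fn)
          accuracy = (tp + tn) / (tp + fp + tn + fn)
          f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
          return {"sensitivity": sensitivity, "specificity": specificity, "PPV": ppv,
                  "NPV": npv, "accuracy": accuracy, "F1": f1}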
  2. Sci Rep. 2025 Dec 01.
      The capability of Large Language Models (LLMs) to support and facilitate research activities has sparked growing interest in their integration into scientific workflows. This paper evaluates the performance of 6 different LLMs, compared against human researchers, in conducting the tasks necessary to produce a systematic literature review. The evaluation was split into 3 tasks: literature search, article screening and selection (task 1); data extraction and analysis (task 2); and final paper drafting (task 3). The results were compared with a human-produced systematic review on the same topic, which served as the reference standard. The evaluation was repeated over two rounds to assess between-version changes and improvements of LLMs over time. Of the 18 scientific articles to be retrieved from the literature for task 1, the best LLM identified 13. Data extraction and analysis for task 2 was only partially accurate and cumbersome. The full papers generated by the LLMs for task 3 were short and uninspiring, often not fully adhering to the standard PRISMA 2020 template for a systematic review. Currently, LLMs are not capable of conducting a scientific systematic review in the medical domain without prompt-engineering strategies. However, their capabilities are advancing rapidly, and, with appropriate supervision, they can provide valuable support throughout the review process.
    Keywords:  Artificial intelligence; Evidence-based medicine; Generative artificial intelligence; Large language models; Scientific writing; Systematic review
    DOI:  https://doi.org/10.1038/s41598-025-28993-5
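    As a worked example of the task 1 result above (13 of the 18 reference-standard articles retrieved), literature-search recall against the human review reduces to a simple set comparison; the record identifiers below are hypothetical.

      # Hypothetical identifiers; the 18 gold-standard articles come from the
      # human-produced reference review, as described in the study above.
      reference = {f"ref{i:02d}" for i in range(1, 19)}      # 18 included articles
      llm_retrieved = {f"ref{i:02d}" for i in range(1, 14)}  # best LLM found 13 of them

      recall = len(reference & llm_retrieved) / len(reference)
      print(f"Task 1 recall: {recall:.2f}")  # 13/18 = 0.72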
  3. JCO Clin Cancer Inform. 2025 Dec; 9: e2500233
       PURPOSE: The rapid expansion of scientific literature has made it increasingly challenging for clinicians and researchers to efficiently identify relevant evidence. While large language models (LLMs) offer promising solutions for automating literature review tasks, few tools support integrated workflows that enable trend analysis as well. This study aimed to develop and evaluate Rapid Clinical Evidence eXplorer (RaCE-X), a Generative Pre-trained Transformer (GPT)-based automated pipeline designed to streamline abstract screening, extract structured information, and visualize key trends in clinical research.
    METHODS: We used GPT-4.1 mini to screen 865 PubMed abstracts based on predefined screening criteria. Structured information was then extracted from the 87 relevant abstracts based on a predefined information model covering nine fields. A gold standard data set was created through expert review to assess model performance. The extracted information was visualized through an interactive dashboard. Usability was evaluated using the Post-Study System Usability Questionnaire (PSSUQ) and open-ended feedback from five clinical research coordinators.
    RESULTS: RaCE-X demonstrated high screening performance (precision = 0.954, recall = 0.988, F1 = 0.971) and achieved strong average performance in information extraction (precision = 0.977, recall = 0.989, F1 = 0.983), with no hallucinations identified. Usability testing indicated generally positive feedback (overall PSSUQ score = 2.8), with users noting that RaCE-X was intuitive and effective for data interpretation.
    CONCLUSION: RaCE-X enables efficient GPT-based abstract screening, structured information extraction, and research trend exploration, thereby facilitating the summary of clinically relevant evidence from the biomedical literature. This study demonstrates the feasibility of using LLMs to reduce manual workload and accelerate evidence-based research practices.
    DOI:  https://doi.org/10.1200/CCI-25-00233
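    The extraction step described above (a predefined nine-field information model applied to each screened abstract) maps naturally onto a JSON-constrained prompt. The sketch below is illustrative only: the model name, field list, and prompt are assumptions, not the published RaCE-X pipeline.

      import json
      from openai import OpenAI

      client = OpenAI()

      # Hypothetical field names standing in for the paper's nine-field information model.
      FIELDS = ["study design", "population", "intervention", "comparator", "outcomes",
                "sample size", "setting", "cancer type", "key finding"]

      def extract_fields(abstract: str) -> dict:
          """Return a JSON object with one value per field (null when not reported)."""
          prompt = ("Extract the following fields from the abstract and return JSON with "
                    f"exactly these keys, using null when a field is not reported: {FIELDS}\n\n"
                    f"Abstract: {abstract}")
          response = client.chat.completions.create(
              model="gpt-4.1-mini",  # assumed identifier for the GPT-4.1 mini model named in the study
              messages=[{"role": "user", "content": prompt}],
              response_format={"type": "json_object"},  # force machine-readable output
          )
          return json.loads(response.choices[0].message.content)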
  4. Int J Med Inform. 2025 Nov 24. pii: S1386-5056(25)00422-8. [Epub ahead of print] 207: 106205
       OBJECTIVE: Current systematic literature reviews largely rely on manual screening of articles retrieved through keyword search, which is time-consuming and difficult to scale. To address this limitation, large language model (LLM)-based approaches offer the potential to automate the screening process. In this study, we aim to enhance the efficiency and accuracy of literature screening by developing an LLM-based method and exploring techniques such as rule-based preprocessing, prompt engineering (i.e., retrieval-augmented generation (RAG)) and ensemble strategies.
    METHODS: We explored a hybrid framework that combines RAG prompting with LLM-based classification strategies. Our methods were developed and evaluated on a corpus of 6331 biomedical articles, focusing on identifying literature discussing the applications of LLMs in patient care using Electronic Health Record (EHR) data. We evaluated three recent models (DeepSeek-V3, DeepSeek-R1, and GPT-4o) under three prompting strategies: binary classification prompting, RAG prompting, and justification-based prompting. Given the context of literature screening, recall (sensitivity) was prioritized in this study to maximize the inclusion of relevant studies. We also considered other metrics, including precision, specificity, and negative predictive value (NPV), to minimize the inclusion of irrelevant articles. To evaluate generalizability, the models and prompts were further tested on ten additional topics related to "Cancer Immunotherapy and Targeted Therapy" and "LLMs in Medicine."
    RESULTS: The hybrid approach combining rule-based preprocessing with DeepSeek-R1 using RAG prompting (rule + DeepSeek-reasoner@Prompt II) achieved the best overall performance among individual models, with a precision of 0.34, recall of 0.77, NPV of 1.00, and g-mean of 0.87. Ensemble methods greatly outperformed this approach in precision, achieving a perfect score of 1.00 in the main use case, but showed comparable performance across the other metrics, with the exception of F1. In generalizability tests, DeepSeek-R1 achieved the highest F1-score (0.93) and accuracy (0.88) across the additional topics, whereas ensemble methods did not show substantial improvement.
    CONCLUSION: This study introduces an LLM-based approach that integrates RAG prompting and ensemble strategies to enhance literature screening, offering substantial gains in accuracy and scalability. The findings establish a foundation for advancing LLM-driven evidence synthesis in biomedical research and clinical decision support.
    Keywords:  Ensembles; Information retrieval; Large language model; Literature screening; Prompt engineering; Retrieval-augmented generation
    DOI:  https://doi.org/10.1016/j.ijmedinf.2025.106205
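    One way to read the ensemble result above (perfect precision in the main use case) is as an agreement rule layered on top of several RAG-prompted screeners. The sketch below illustrates that idea and the g-mean reported in the study; the unanimous-vote rule, function names, and prompt template are assumptions, not necessarily the ensemble the authors used.

      from math import sqrt
      from typing import Callable

      # A screener takes a record's title/abstract text and returns True to include it.
      Screener = Callable[[str], bool]

      def rag_prompt(record: str, retrieved_context: str, criteria: str) -> str:
          """Retrieval-augmented prompt: ground the screening decision in retrieved passages."""
          return (f"Context from related literature:\n{retrieved_context}\n\n"
                  f"Screening criteria: {criteria}\n\nRecord: {record}\n"
                  "Answer INCLUDE or EXCLUDE.")

      def ensemble_include(record: str, screeners: list[Screener]) -> bool:
          """Unanimous vote: include a record only if every model agrees, favoring precision."""
          return all(screener(record) for screener in screeners)

      def g_mean(sensitivity: float, specificity: float) -> float:
          """Geometric mean of sensitivity and specificity, as reported in the study."""
          return sqrt(sensitivity * specificity)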
  5. J Coll Physicians Surg Pak. 2025 Dec;35(12): 1626-1628
      Artificial intelligence (AI) tools have been integrated into medical research and writing at a rapid pace since ChatGPT was launched in November 2022. This development has created unprecedented opportunities for efficiency and accessibility in research and writing. This viewpoint examines the potential benefits and risks associated with the adoption of AI tools, particularly among medical students and early-career and healthcare researchers. The authors argue that uncritical use of AI can lead to superficial learning and compromise the development of essential critical thinking skills. Effective use of AI tools requires background knowledge, as illustrated by examples in research question generation, literature review, data interpretation in clinical trials, and manuscript preparation. The authors emphasise the value of traditional skills, such as critical analysis, in-depth reading, and independent literature searching, in the medical professions and suggest strategies for the ethical and effective integration of AI tools into research workflows, with a focus on building a strong foundation of knowledge before relying on these tools. The viewpoint offers recommendations for educators and senior researchers in guiding the next generation of medical professionals. There is a need for collaboration and dialogue among all key stakeholders to ensure that AI tools enhance, rather than diminish, the quality and integrity of medical research and education.
    Keywords:  Artificial intelligence; Medical writing; Medical research
    DOI:  https://doi.org/10.29271/jcpsp.2025.12.1626
  6. Find ACL ACL. 2025 Jul; 2025: 21421-21443
      Evidence-based medicine (EBM) is at the forefront of modern healthcare, emphasizing the use of the best available scientific evidence to guide clinical decisions. Due to the sheer volume and rapid growth of medical literature and the high cost of curation, there is a critical need to investigate Natural Language Processing (NLP) methods to identify, appraise, synthesize, summarize, and disseminate evidence in EBM. This survey presents an in-depth review of 129 research studies on leveraging NLP for EBM, illustrating its pivotal role in enhancing clinical decision-making processes. The paper systematically explores how NLP supports the five fundamental steps of EBM - Ask, Acquire, Appraise, Apply, and Assess. The review not only identifies current limitations within the field but also proposes directions for future research, emphasizing the potential for NLP to revolutionize EBM by refining evidence extraction, evidence synthesis, appraisal, summarization, enhancing data comprehensibility, and facilitating a more efficient clinical workflow.
    DOI:  https://doi.org/10.18653/v1/2025.findings-acl.1103
  7. JMIR Med Educ. 2025 Dec 02. 11: e70190
       BACKGROUND: Large language models (LLMs) offer the potential to improve virtual patient-physician communication and reduce health care professionals' workload. However, limitations in accuracy, outdated knowledge, and safety issues restrict their effective use in real clinical settings. Addressing these challenges is crucial for making LLMs a reliable health care tool.
    OBJECTIVE: This study aimed to evaluate the efficacy of Med-RISE, an information retrieval and augmentation tool, in comparison with baseline LLMs, focusing on enhancing accuracy and safety in medical question answering across diverse clinical domains.
    METHODS: This comparative study introduces Med-RISE, an enhanced version of a retrieval-augmented generation framework specifically designed to improve question-answering performance across wide-ranging medical domains and diverse disciplines. Med-RISE consists of 4 key steps: query rewriting, information retrieval (providing local and real-time retrieval), summarization, and execution (a fact and safety filter before output). This study integrated Med-RISE with 4 LLMs (GPT-3.5, GPT-4, Vicuna-13B, and ChatGLM-6B) and assessed their performance on 4 multiple-choice medical question datasets: MedQA (US Medical Licensing Examination), PubMedQA (original and revised versions), MedMCQA, and EYE500. Primary outcome measures included answer accuracy and hallucination rates, with hallucinations categorized into factuality (inaccurate information) or faithfulness (inconsistency with instructions) types. This study was conducted between March 2024 and August 2024.
    RESULTS: The integration of Med-RISE with each LLM led to a substantial increase in accuracy, with improvements ranging from 9.8% to 16.3% (mean 13%, SD 2.3%) across the 4 datasets. The accuracy gains were 16.3%, 12.9%, 13%, and 9.8% for GPT-3.5, GPT-4, Vicuna-13B, and ChatGLM-6B, respectively. In addition, Med-RISE effectively reduced hallucinations, with reductions ranging from 11.8% to 18% (mean 15.1%, SD 2.8%), factuality hallucinations decreasing by 13.5%, and faithfulness hallucinations decreasing by 5.8%. The hallucination rate reductions were 17.7%, 12.8%, 18%, and 11.8% for GPT-3.5, GPT-4, Vicuna-13B, and ChatGLM-6B, respectively.
    CONCLUSIONS: The Med-RISE framework significantly improves the accuracy and reduces the hallucinations of LLMs in medical question answering across benchmark datasets. By providing local and real-time information retrieval and fact and safety filtering, Med-RISE enhances the reliability and interpretability of LLMs in the medical domain, offering a promising tool for clinical practice and decision support.
    Keywords:  ChatGPT; health care communication; large language models; medical question answering; retrieval-augmented generation
    DOI:  https://doi.org/10.2196/70190
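    The four-step structure described above (query rewriting, local plus real-time retrieval, summarization, and an execution step with a fact and safety filter) can be sketched as a simple pipeline. The function names and bodies below are placeholders and assumptions for illustration; the abstract does not publish the actual Med-RISE code.

      def rewrite_query(question: str, llm) -> str:
          """Step 1: rewrite the clinical question into a retrieval-friendly query."""
          return llm(f"Rewrite as a concise literature search query: {question}")

      def retrieve(query: str, local_index, web_search) -> list[str]:
          """Step 2: combine local (curated) and real-time retrieval."""
          return local_index.search(query) + web_search(query)

      def summarize(passages: list[str], llm) -> str:
          """Step 3: condense the retrieved passages into a short evidence summary."""
          return llm("Summarize the key evidence:\n" + "\n".join(passages))

      def execute(question: str, context: str, llm, passes_safety_check) -> str:
          """Step 4: answer from the summarized context, then apply a fact/safety filter."""
          answer = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
          return answer if passes_safety_check(answer, context) else "Unable to answer safely."

      def answer_question(question, llm, local_index, web_search, passes_safety_check):
          """Chain the four steps for one question."""
          query = rewrite_query(question, llm)
          passages = retrieve(query, local_index, web_search)
          context = summarize(passages, llm)
          return execute(question, context, llm, passes_safety_check)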
  8. Cureus. 2025 Oct;17(10): e95719
      Introduction: Artificial intelligence (AI) is becoming more integrated into different research tasks, and this ongoing development opens opportunities to optimize resources, e.g., using AI in resource-intensive and time-consuming tasks like qualitative analysis of interview data. We aimed to test whether Microsoft's Copilot could perform a content analysis on interview data using Graneheim and Lundman's method comparable to human analysis.
    Methodology: We used a company-protected version of Microsoft's AI-powered assistant Copilot, which is based on large language models. The company-protected Copilot version ensured data security. A manual analysis of six interviews was conducted before this study using Graneheim and Lundman's method of content analysis. We conducted four analyses using Copilot and compared the results with those obtained through manual analysis. Copilot was prompted to use Graneheim and Lundman's method, and we also tried providing it with an objective and a context.
    Results: When prompted to use Graneheim and Lundman's method, Copilot was able to perform content analyses with high resemblance to the manual one, especially in selecting meaningful units and coding them, which falls within the descriptive analysis. It could also create subthemes and overarching themes resembling the manual ones; however, the interpretive analysis lacked nuance compared to the manual one. Copilot produced more accurate manifest content when given only Graneheim and Lundman's method. When given the objective, the analysis was shorter with fewer meaningful units. When given the context of the interviews, Copilot over-interpreted, and the analysis was mainly descriptive.
    Conclusions: Copilot was able to perform a content analysis very similar to the manual one regarding the descriptive analyses of manifest content using Graneheim and Lundman's method. However, its interpretation of latent content lacked nuance, a limitation Copilot itself acknowledged. Copilot performed best when guided by the methodological framework alone, rather than the study's objective or context. While content analysis remains a co-creative process requiring manual input, especially during interpretation, Copilot shows promising potential in supporting the early stages of analysis focused on manifest content.
    Keywords:  analyzing data; artificial intelligence (ai); content analysis; microsoft copilot; qualitative research
    DOI:  https://doi.org/10.7759/cureus.95719
  9. Qual Health Res. 2025 Dec 05. 10497323251389800
      The rapid advancement of artificial intelligence (AI) is increasingly shaping research methodologies across disciplines. However, its integration in qualitative research remains controversial due to epistemological, ethical, and human-centered concerns. This study explores the perspectives of 14 expert qualitative researchers from socio-anthropological and healthcare fields working in Italian academic and hospital settings, with a focus on the opportunities, challenges, and future directions of AI use in qualitative inquiry. Through semi-structured interviews and reflexive thematic analysis, four main themes were developed. First, participants expressed ambivalent attitudes, balancing curiosity with technophobia and emphasizing the need for human oversight and contextual interpretation. Second, an anthropological and philosophical dimension was constructed, underscoring the importance of reflexivity, creativity, and researcher identity as essential counterbalances to AI's mechanistic tendencies. Third, researchers acknowledged AI's practical benefits in tasks such as transcription and data management, yet remained skeptical of its ability to perform complex interpretative work. Finally, ethical and sustainability concerns were raised, including algorithmic bias, data privacy, and the environmental impact of AI technologies. The findings reveal persistent epistemological tensions but also highlight emerging opportunities for AI to enhance research efficiency and accessibility, provided that human interpretative agency remains central. Participants stressed the importance of developing robust ethical frameworks, fostering critical reflexivity, and adopting innovative conceptual approaches to responsibly integrate AI into qualitative research and education. This study offers valuable insights for scholars and practitioners navigating the evolving landscape of AI in qualitative inquiry, advocating a balanced approach that leverages AI's potential while safeguarding the human core of qualitative research.
    Keywords:  artificial intelligence; perceptions; qualitative research; reflexive thematic analysis
    DOI:  https://doi.org/10.1177/10497323251389800
  10. Nature. 2025 Dec 05.
      
    Keywords:  Ethics; Machine learning; Technology
    DOI:  https://doi.org/10.1038/d41586-025-03936-2
  11. Front Artif Intell. 2025; 8: 1689178
      The evaluation of medical Artificial Intelligence (AI) systems presents significant challenges, with performance often varying drastically across studies. This narrative review identifies prompt quality (the way questions are formulated for the AI) as a critical yet under-recognized variable influencing these outcomes. The analysis explores scientific literature published between January 2018 and August 2025 to investigate the impact of prompt engineering on the perceived accuracy and reliability of conversational AI in medicine. Results reveal a "performance paradox," where AI sometimes surpasses human experts in controlled settings yet underperforms in broader meta-analyses. This inconsistency is strongly linked to the type of prompt used. Critical concerns are highlighted, such as "prompting bias," which may invalidate study conclusions, and AI "hallucinations" that generate dangerously incorrect information. Furthermore, a significant gap exists between the optimal prompts formulated by experts and the natural queries of the general public, raising issues of safety and health equity. Finally, we were interested in determining what optimal balance exists between the complexity of a prompt and the value of the generated response and, in this context, whether a path toward identifying the best possible prompt can be defined.
    Keywords:  AI ethics; artificial intelligence; general public; generated responses; medical AI; medical information; performance evaluation; prompt engineering
    DOI:  https://doi.org/10.3389/frai.2025.1689178