bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-03-02
seven papers selected by
Farhad Shokraneh



  1. Ann Intern Med. 2025 Feb 25.
       BACKGROUND: Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.
    OBJECTIVE: To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.
    DESIGN: Diagnostic test accuracy.
    SETTING: 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).
    PARTICIPANTS: None.
    MEASUREMENTS: Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).
    RESULTS: Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: Where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.
    LIMITATIONS: Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.
    CONCLUSION: A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences.
    PRIMARY FUNDING SOURCE: None.
    DOI:  https://doi.org/10.7326/ANNALS-24-02189
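    A minimal Python sketch of the screening-and-scoring approach in this entry, assuming the openai>=1.0 chat completions client; the prompt wording, model label, and helper functions are illustrative placeholders, not the authors' optimized template. Performance is scored by comparing LLM include/exclude decisions against the original SR authors' decisions.

    from openai import OpenAI  # assumes the openai>=1.0 Python client

    client = OpenAI()

    def screen_abstract(criteria: str, title: str, abstract: str) -> bool:
        """Ask the model for a binary include/exclude decision (illustrative prompt)."""
        prompt = (
            f"Eligibility criteria:\n{criteria}\n\n"
            f"Title: {title}\nAbstract: {abstract}\n\n"
            "Answer INCLUDE or EXCLUDE only."
        )
        resp = client.chat.completions.create(
            model="gpt-4-0125-preview",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return "INCLUDE" in resp.choices[0].message.content.upper()

    def sensitivity_specificity(llm: list[bool], authors: list[bool]):
        """Compare LLM decisions with the original review authors' decisions."""
        tp = sum(l and a for l, a in zip(llm, authors))
        tn = sum(not l and not a for l, a in zip(llm, authors))
        fp = sum(l and not a for l, a in zip(llm, authors))
        fn = sum(not l and a for l, a in zip(llm, authors))
        return tp / (tp + fn), tn / (tn + fp)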
  2. Neonatology. 2025 Feb 25. 1-16
     BACKGROUND: Only a few studies have addressed the potential of large language models (LLMs) in risk of bias assessments, and the results have varied. The aim of this study was to analyze how well ChatGPT performs in risk of bias assessments of neonatal studies.
    METHODS: We searched all Cochrane neonatal intervention reviews published in 2024 and extracted all risk of bias assessments. The full reports were then retrieved and uploaded to ChatGPT-4o alongside the guidance for performing an original Cochrane risk of bias analysis. The concordance between the original assessment and that provided by ChatGPT-4o was evaluated by intraclass correlation coefficients and Cohen's kappa statistics, with 95% confidence intervals for each risk of bias domain and for the overall assessment.
    RESULTS: From nine reviews, a total of 61 randomized studies were analyzed. A total of 427 judgements were compared. The overall kappa was 0.43 (95% CI: 0.35 to 0.51) and the overall intraclass correlation coefficient was 0.65 (95% CI: 0.59 to 0.70). Cohen's kappa was assessed for each domain; the best agreement was observed for allocation concealment (kappa = 0.73, 95% CI: 0.55 to 0.90), whereas the poorest agreement was found for incomplete outcome data (kappa = -0.03, 95% CI: -0.07 to 0.02).
    CONCLUSION: ChatGPT-4o failed to achieve sufficient agreement in the risk of bias assessments. Future studies should examine whether the performance of other LLMs would be better or whether the agreement of ChatGPT-4o could be further enhanced by better prompting. Currently, the use of ChatGPT-4o in risk of bias assessments should not be promoted.
    DOI:  https://doi.org/10.1159/000544857
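    A minimal sketch of the agreement statistics used in this entry (Cohen's kappa per domain and overall, plus an intraclass correlation coefficient), assuming pandas, scikit-learn, and pingouin are available; the toy judgements and the low/unclear/high numeric coding are illustrative, not the study's data.

    import pandas as pd
    from sklearn.metrics import cohen_kappa_score
    import pingouin as pg  # assumed dependency for the ICC

    # Hypothetical long-format table: one row per study x domain judgement.
    df = pd.DataFrame({
        "study":    ["s1", "s1", "s2", "s2"],
        "domain":   ["allocation_concealment", "incomplete_outcome_data"] * 2,
        "cochrane": ["low", "high", "high", "unclear"],
        "chatgpt":  ["low", "low",  "high", "high"],
    })

    # Overall and per-domain Cohen's kappa between the two raters.
    print("overall kappa:", cohen_kappa_score(df["cochrane"], df["chatgpt"]))
    for domain, grp in df.groupby("domain"):
        print(domain, cohen_kappa_score(grp["cochrane"], grp["chatgpt"]))

    # ICC needs numeric ratings in long format (rater x target); pingouin
    # reports several ICC variants in one table.
    codes = {"low": 0, "unclear": 1, "high": 2}
    long = pd.concat([
        df.assign(rater="cochrane", score=df["cochrane"].map(codes)),
        df.assign(rater="chatgpt",  score=df["chatgpt"].map(codes)),
    ])
    long["target"] = long["study"] + ":" + long["domain"]
    print(pg.intraclass_corr(data=long, targets="target",
                             raters="rater", ratings="score"))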
  3. J Clin Med. 2025 Feb 18. pii: 1363. [Epub ahead of print] 14(4):
      Objectives: This review aimed to evaluate the role of ChatGPT in original research articles within the field of oral and maxillofacial surgery (OMS), focusing on its applications, limitations, and future directions.
    Methods: A literature search was conducted in PubMed using predefined search terms and Boolean operators to identify original research articles utilizing ChatGPT published up to October 2024. The selection process involved screening studies based on their relevance to OMS and ChatGPT applications, with 26 articles meeting the final inclusion criteria.
    Results: ChatGPT has been applied in various OMS-related domains, including clinical decision support in real and virtual scenarios, patient and practitioner education, scientific writing and referencing, and its ability to answer licensing exam questions. As a clinical decision support tool, ChatGPT demonstrated moderate accuracy (approximately 70-80%). It showed moderate to high accuracy (up to 90%) in providing patient guidance and information. However, its reliability remains inconsistent across different applications, necessitating further evaluation.
    Conclusions: While ChatGPT presents potential benefits in OMS, particularly in supporting clinical decisions and improving access to medical information, it should not be regarded as a substitute for clinicians and must be used as an adjunct tool. Further validation studies and technological refinements are required to enhance its reliability and effectiveness in clinical and research settings.
    Keywords:  ChatGPT; chatbot; generative artificial intelligence; oral; oral and maxillofacial surgery; review literature as topic; surgery
    DOI:  https://doi.org/10.3390/jcm14041363
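    A minimal sketch of a PubMed Boolean keyword search like the one described in this entry, using Biopython's Entrez E-utilities wrapper; the query string and date limits are illustrative guesses, not the authors' actual strategy.

    from Bio import Entrez  # Biopython's E-utilities wrapper

    Entrez.email = "you@example.org"  # NCBI asks for a contact address

    query = ('("ChatGPT" OR "large language model") AND '
             '("oral and maxillofacial surgery" OR "oral surgery")')

    # Publication-date window up to the review's October 2024 cutoff.
    handle = Entrez.esearch(db="pubmed", term=query, retmax=200,
                            datetype="pdat", mindate="1900/01/01",
                            maxdate="2024/10/31")
    record = Entrez.read(handle)
    handle.close()

    print("hits:", record["Count"])
    print("first PMIDs:", record["IdList"][:10])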
  4. World Neurosurg. 2025 Feb 25. pii: S1878-8750(25)00165-2. [Epub ahead of print] 123809
       INTRODUCTION: Artificial intelligence (AI) has become an increasingly prominent tool in the field of neurosurgery, revolutionizing various aspects of patient care and surgical practices. AI-powered systems can provide real-time feedback to surgeons, enhancing precision and reducing the risk of complications during surgical procedures. The objective of this study is to review the role of AI in training neurosurgical residents, improving accuracy during surgery and reducing complications.
    METHODS: The literature search involved querying PubMed with relevant keywords to identify English-language, full-text publications concerning human subjects, from the database's inception until May 2024, initially generating 247,747 results. Articles were then screened for topic relevance based on abstract content. Further articles were retrieved from the sources cited by the initially reviewed articles. A comprehensive review was then performed on various studies, including observational studies, case-control studies, cohort studies, clinical trials, meta-analyses, and reviews, by 4 reviewers individually and then collectively.
    RESULTS: More than 4,000 studies on AI in neurosurgery have been produced over the past decade alone. The majority of studies regarding clinical diagnosis, risk prediction, and intraoperative guidance remain retrospective in nature. In its current form, the AI-based paradigm performed inferiorly to neurosurgery residents in test taking.
    CONCLUSION: AI has potential for broad applications in neurosurgery as a diagnostic, predictive, intraoperative, or educational tool. Further research is warranted on the prospective use of AI-based technology for the delivery of neurosurgical care.
    Keywords:  artificial intelligence; complications; machine-based learning; neurosurgery
    DOI:  https://doi.org/10.1016/j.wneu.2025.123809
  5. JMIR Hum Factors. 2025 Feb 25. 12 e52358
    Background: Emergency and acute medicine doctors require easily accessible evidence-based information to safely manage a wide range of clinical presentations. The inability to find evidence-based local guidelines on the trust's intranet leads to information retrieval from the World Wide Web. Artificial intelligence (AI) has the potential to make evidence-based information retrieval faster and easier.
    Objective: The aim of the study is to conduct a time-motion analysis, comparing cohorts of junior doctors using (1) an AI-supported search engine versus (2) the traditional hospital intranet. The study also aims to examine the impact of the AI-supported search engine on the duration of searches and workflow when seeking answers to clinical queries at the point of care.
    Methods: This pre- and postobservational study was conducted in 2 phases. In the first phase, clinical information searches by 10 doctors caring for acutely unwell patients in acute medicine were observed during 10 working days. Based on these findings and input from a focus group of 14 clinicians, an AI-supported, context-sensitive search engine was implemented. In the second phase, clinical practice was observed for 10 doctors for an additional 10 working days using the new search engine.
    Results: The hospital intranet group (n=10) had a median of 23 months of clinical experience, while the AI-supported search engine group (n=10) had a median of 54 months. Participants using the AI-supported engine conducted fewer searches. User satisfaction and query resolution rates were similar between the 2 phases. Searches with the AI-supported engine took 43 seconds longer on average. Clinicians rated the new app with a favorable Net Promoter Score of 20.
    Conclusions: We report a successful feasibility pilot of an AI-driven search engine for clinical guidelines. Further development of the engine, including the incorporation of large language models, might improve accuracy and speed. More research is required to establish clinical impact in different user groups. Focusing on new staff at the beginning of their posts might be the most suitable study design.
    Keywords:  artificial intelligence; clinical experience; clinical impact; clinical practice; developing; emergency care; hospital care; information retrieval; information search; machine learning; mobile phone; study design; testing; training; user group; user satisfaction; users
    DOI:  https://doi.org/10.2196/52358
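    For reference, the Net Promoter Score reported in this entry is the percentage of promoters (ratings 9-10 on a 0-10 scale) minus the percentage of detractors (ratings 0-6); a minimal sketch with made-up ratings follows.

    def net_promoter_score(ratings):
        """NPS = % promoters (9-10) minus % detractors (0-6) on a 0-10 scale."""
        promoters = sum(r >= 9 for r in ratings)
        detractors = sum(r <= 6 for r in ratings)
        return 100 * (promoters - detractors) / len(ratings)

    # Made-up responses from 10 clinicians; this particular set yields 20.0.
    print(net_promoter_score([10, 9, 8, 7, 9, 3, 10, 6, 9, 5]))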
  6. Foot Ankle Spec. 2025 Feb 22. 19386400251319567
       INTRODUCTION: As artificial intelligence (AI) becomes increasingly integrated into medicine and surgery, its applications are expanding rapidly-from aiding clinical documentation to providing patient information. However, its role in medical decision-making remains uncertain. This study evaluates an AI language model's alignment with clinical consensus statements in foot and ankle surgery.
    METHODS: Clinical consensus statements from the American College of Foot and Ankle Surgeons (ACFAS; 2015-2022) were collected and rated by ChatGPT-o1 as inappropriate, neither appropriate nor inappropriate, or appropriate. The statements were entered into ChatGPT-o1 ten times in a random order, and the model was prompted to assign a corresponding rating each time. The AI-generated scores were compared to the expert panel's ratings, and intra-rater analysis was performed.
    RESULTS: The analysis of 9 clinical consensus documents and 129 statements revealed an overall Cohen's kappa of 0.29 (95% CI: 0.12, 0.46), indicating fair alignment between expert panelists and ChatGPT. Overall, ankle arthritis and heel pain showed the highest concordance at 100%, while flatfoot exhibited the lowest agreement at 25%, reflecting variability between ChatGPT and expert panelists. Among the ChatGPT ratings, Cohen's kappa values ranged from 0.41 to 0.92, highlighting variability in internal reliability across topics.
    CONCLUSION: ChatGPT achieved overall fair agreement and demonstrated variable consistency when repetitively rating ACFAS expert panel clinical practice guidelines representing a variety of topics. These data reflect the need for further study of the causes, impacts, and solutions for this disparity between artificial intelligence and human intelligence.
    LEVEL OF EVIDENCE: Level IV: Retrospective cohort study.
    Keywords:  artificial intelligence; machine learning; medical informatics; natural language processing; surgical decision-making
    DOI:  https://doi.org/10.1177/19386400251319567
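    A minimal sketch of the repeated-rating protocol in this entry: each consensus statement is rated several times in random order, then agreement with the expert panel (Cohen's kappa on the modal rating) and per-statement consistency are computed. rate_statement() is a placeholder for the actual ChatGPT-o1 call, and the statements and panel ratings are invented.

    import random
    from sklearn.metrics import cohen_kappa_score

    LEVELS = ["inappropriate", "neither", "appropriate"]

    def rate_statement(statement: str) -> str:
        """Placeholder for the real model call and response parsing."""
        return random.choice(LEVELS)

    def repeated_ratings(statements, n_reps=10):
        runs = []
        for _ in range(n_reps):
            order = random.sample(range(len(statements)), len(statements))
            run = [None] * len(statements)
            for i in order:  # statements presented in a fresh random order each run
                run[i] = rate_statement(statements[i])
            runs.append(run)
        return runs

    statements = ["statement A", "statement B", "statement C"]
    panel = ["appropriate", "inappropriate", "appropriate"]
    runs = repeated_ratings(statements)

    # Modal rating per statement across repetitions, then kappa vs the panel.
    modal = [max(LEVELS, key=[run[i] for run in runs].count)
             for i in range(len(statements))]
    print("kappa vs panel:", cohen_kappa_score(panel, modal))

    # Per-statement consistency: share of repetitions matching the modal rating.
    consistency = [sum(run[i] == modal[i] for run in runs) / len(runs)
                   for i in range(len(statements))]
    print("consistency:", consistency)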