bims-helfai Biomed News
on AI in health care
Issue of 2026-01-18
27 papers selected by
Sergei Polevikov



  1. Nature. 2026 Jan;649(8097): 584-589
      The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment. Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information. Here we analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding. For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across multiple state-of-the-art LLMs, including OpenAI's GPT-4o and Alibaba Cloud's Qwen2.5-Coder-32B-Instruct, with misaligned responses observed in as many as 50% of cases. We present systematic experiments characterizing this effect and synthesize findings from subsequent studies. These results highlight the risk that narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs. Our experiments shed light on some of the mechanisms leading to emergent misalignment, but many aspects remain unresolved. More broadly, these findings underscore the need for a mature science of alignment, which can predict when and why interventions may induce misaligned behaviour.
    DOI:  https://doi.org/10.1038/s41586-025-09937-5
  2. J Eval Clin Pract. 2026 Feb;32(1): e70365
     BACKGROUND: Clinical documentation is a major contributor to physician burnout, and artificial intelligence (AI) scribes are increasingly being adopted to help reduce the burden of documentation. These tools automatically generate clinical notes from patient-provider conversations using speech recognition and natural language processing. However, questions remain about their usability and effectiveness.
    AIM: To synthesise the existing evidence on usability-related barriers and facilitators influencing the adoption and use of AI scribes for clinical documentation in healthcare settings.
    METHOD: The scoping review employed the methodology developed by Arksey and O'Malley in 2005 and further expanded by Levac and Colquhoun in 2010. We searched PubMed, Scopus, Ovid MEDLINE, and Web of Science to identify relevant studies published in English between 2015 and 2025. All findings were reported according to PRISMA guidelines for scoping reviews.
    RESULTS: Of 4588 identified records, 14 studies met the inclusion criteria, employing qualitative, quantitative, and mixed-methods designs. AI scribes were consistently associated with reduced cognitive load, faster documentation, improved work-life balance, and positive user experience. However, common barriers included frequent errors, excessive note length, limited formatting options, and poor integration with electronic health records (EHR). Editing demands varied by clinician experience, with some finding that time savings were lost when substantial corrections were needed. Overall, usability was rated more favourably in routine or protocol-driven visits, with mixed outcomes reported on long-term burnout and workflow impact.
    CONCLUSION: AI scribes show promise in reducing documentation burden and improving clinical workflow, but important usability challenges remain. Enhancing accuracy, streamlining integration, and allowing greater customization will be essential to support broader adoption and sustained use in clinical practice.
    DOI:  https://doi.org/10.1111/jep.70365
  3. Eur Radiol. 2026 Jan 16.
      Multiple articles have touted the longitudinal promise of artificial intelligence (AI) in radiology, including projections of streamlining repetitive tasks, improving workflow, and reducing physician burnout. The purpose of this article is to review publications directly assessing the impact of AI on radiologist burnout and the impact of AI on the established drivers of radiologist burnout. Our analysis found conflicting, inconclusive, and limited data that AI reduces radiologist burnout, and the balance of data does not support that AI improves the drivers of burnout. How AI affects radiologist burnout remains a "black box", with the final impact yet to be determined. KEY POINTS: Question While AI has been touted to reduce radiologist burnout, the literature to date supporting this claim has not been explored. Findings Our analysis found inconclusive, limited data that AI reduces radiologist burnout, and that the balance of data does not support that AI improves the drivers of burnout. Clinical relevance Despite the optimism towards AI implementation in radiology, how AI truly affects radiologist burnout remains a "black box", with the final impact yet to be determined.
    Keywords:  Artificial intelligence; Burnout; Radiologist; Radiology; Wellness
    DOI:  https://doi.org/10.1007/s00330-025-12278-6
  4. Med Sci Monit. 2026 Jan 17. 32 e950916
      BACKGROUND We suggest that testing a large language model (LLM) chatbot in terms of the accuracy of the references it provides could be a powerful, quantifiable means of rating its inherent degree of misinformation, since the accuracy of the bibliographic data can be directly verified. Given the growing reliance on artificial intelligence (AI) tools in academic research and clinical decision-making, such a rating could be extremely useful. MATERIAL AND METHODS In this study, we compared 3 versions of ChatGPT and 3 versions of Gemini by asking them to provide references about 25 highly cited topics in otorhinolaryngology (those with "guidelines" in the title). Answers were sought on 3 consecutive days to assess the variability and consistency of responses. In total, the 6 chatbots returned 1947 references, which were carefully checked against PubMed, Web of Science, and Google Scholar, and rated according to accuracy. Ratings were given based on correct authorship, complete bibliographic details, and proper DOI numbers. RESULTS Common discrepancies noted were wrong author names and erroneous DOI numbers. Across the 6 chatbots, ChatGPT-4.1 (with web search enabled) achieved the best accuracy, with a score of 51%, with Gemini 2.5 Pro being second at 41%. The 2 versions with a web search facility performed better than the 4 versions without. Topics having higher citation counts were associated with lower error rates, suggesting that more widely disseminated scientific findings result in more accurate references. CONCLUSIONS Our findings provide a solid benchmark for rating AI-driven bibliographic retrieval and underline the need for further refinement before these tools can be reliably integrated into academia and clinical applications.
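    For readers who want to automate this kind of check, the sketch below shows how a chatbot-supplied reference could be verified against public Crossref metadata. This is not the authors' workflow (they verified references manually against PubMed, Web of Science, and Google Scholar); the helper name, field comparisons, and example DOI are illustrative assumptions.
      # Minimal sketch of automated reference checking against Crossref metadata.
      # Illustrative only: shows how DOI, author, and title fields could be
      # cross-checked programmatically; not the study's manual verification.
      import requests

      def check_reference(doi, claimed_first_author, claimed_title):
          """Return pass/fail flags for one chatbot-supplied reference."""
          resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
          if resp.status_code != 200:
              return {"doi_resolves": False, "author_ok": False, "title_ok": False}
          meta = resp.json()["message"]
          authors = [a.get("family", "").lower() for a in meta.get("author", [])]
          title = (meta.get("title") or [""])[0].lower()
          return {
              "doi_resolves": True,
              "author_ok": claimed_first_author.lower() in authors,
              "title_ok": claimed_title.lower()[:40] in title,  # crude prefix match
          }

      # Example: accuracy = fraction of references passing all three checks.
      refs = [("10.1000/example-doi", "smith", "Clinical practice guideline ...")]
      flags = [check_reference(*r) for r in refs]
      accuracy = sum(all(f.values()) for f in flags) / len(flags)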
    DOI:  https://doi.org/10.12659/MSM.950916
  5. Zhejiang Da Xue Xue Bao Yi Xue Ban. 2026 Jan 12. 1-8
       OBJECTIVES: To systematically evaluate the performance of generative artificial intelligence (GenAI) models, DeepSeek-V3 and the Qwen3 series, in the differential diagnosis of weight loss.
    METHODS: A search was conducted in the PubMed database for all case reports published in the American Journal of Case Reports between January 1, 2012 and June 2, 2025, containing the term "weight loss" in the title or abstract. Two senior general practitioners independently verified and assessed whether each case met the diagnostic criteria for weight loss (emaciation). Cases that did not meet these criteria, had incomplete information, or fell within the scope of clearly defined specialized diagnoses and treatments were excluded. The remaining cases were then compiled into standardized clinical case summaries. These summaries were presented to DeepSeek-V3 and the Qwen3 series models (Qwen3-235B-A22B, Qwen3-30B-A3B, and Qwen3-32B) to generate ranked lists of the top 10 differential diagnoses. The models were not specifically fine-tuned for this task. Sensitivity, precision, and F1-score were used to evaluate performance. Intergroup comparisons were performed using McNemar's test and Cochran's Q test.
    RESULTS: A total of 87 cases were analyzed. For DeepSeek-V3, the sensitivity for Top1, Top5, and Top10 diagnoses was 26.44%, 56.32%, and 65.52%, respectively, with corresponding precision values of 26.44%, 11.26%, and 6.55%. For Qwen3-235B-A22B, the sensitivity values were 21.84%, 43.68%, and 59.77%, with corresponding precision values of 21.84%, 8.74%, and 5.98%. DeepSeek-V3 demonstrated significantly better performance than Qwen3-235B-A22B in sensitivity, precision, and F1-score at the Top5 level (P=0.043). Among the Qwen3 series models, Qwen3-235B-A22B showed the best performance in sensitivity, precision, and F1-score for the Top1 diagnosis, outperforming Qwen3-32B and Qwen3-30B-A3B. However, the differences among the three Qwen3 models across all diagnostic levels were not statistically significant (all P>0.05).
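    The reported Top-k figures are consistent with counting a case as a hit when the reference diagnosis appears among the model's top k suggestions (sensitivity = hits/cases; precision = hits/(k x cases)). A minimal sketch of that calculation, with hypothetical variable names, assuming one reference diagnosis per case:
      # Top-k sensitivity, precision, and F1 for ranked differential-diagnosis lists.
      def topk_metrics(predictions, reference, k):
          """predictions: list of ranked diagnosis lists; reference: gold label per case."""
          n_cases = len(reference)
          hits = sum(ref in preds[:k] for preds, ref in zip(predictions, reference))
          sensitivity = hits / n_cases            # cases with the diagnosis in the top k
          precision = hits / (k * n_cases)        # correct suggestions / all suggestions made
          f1 = (2 * precision * sensitivity / (precision + sensitivity)
                if (precision + sensitivity) else 0.0)
          return sensitivity, precision, f1

      # e.g., 87 cases, k = 5, 49 hits -> sensitivity ~56.3%, precision ~11.3%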
    CONCLUSIONS: The Chinese-developed GenAI models evaluated here exhibit a "breadth over precision" pattern in the differential diagnosis of weight loss, with DeepSeek-V3 performing better at key diagnostic levels. Although the sensitivity and precision for the top-ranked diagnosis require improvement, these models can serve as effective clinical decision support tools, broadening the diagnostic perspectives of general practitioners. They may hold significant application value in the management of undifferentiated diseases.
    Keywords:  Artificial intelligence; Differential diagnosis; Language model; Undifferentiated disease; Weight loss
    DOI:  https://doi.org/10.3724/zdxbyxb-2025-0463
  6. Wiad Lek. 2025;78(11): 2481-2488
     OBJECTIVE: To analyze the potential of artificial intelligence for realizing innovative ideas in the system of training future doctors, improving the educational process in medical universities, and analyzing the concept of "health" as a key term in professional activity. To achieve this goal, we examine the importance of this concept in regulatory documents and propose an original (authorial) solution to these issues.
    PATIENTS AND METHODS: A systematic literature search was carried out in the following databases: PubMed, Scopus, Web of Science, and Google Scholar. Additional grey literature was identified through institutional repositories, conference proceedings, and relevant policy documents of the Ukrainian government and the European Union. Keywords and their combinations such as "artificial intelligence", "medical education", "empathy", "anamnesis", "academic integrity", "iatrogenesis", "digital transformation", "higher medical school", "information culture", "AI in healthcare", "Poland", and "Ukraine" were used. Inclusion criteria encompassed peer-reviewed articles published between 2018 and 2025 in English, Ukrainian, or Polish that focused on the use of AI in medical education, clinical training, and healthcare organization, as well as studies analyzing its advantages, limitations, ethical considerations, and offering comparative or practical recommendations. Exclusion criteria included non-scientific sources, articles unrelated to medical education or healthcare, and studies lacking clear methodology or outcome measures.
    CONCLUSION: Artificial intelligence has significant potential for organizing the administrative work of medical institutions. This includes managing information about staff and patients, organizing communication schedules with specialists of the relevant profile, and optimizing schedules.
    Keywords:  creativity; medicine; artificial intelligence; communication; education; personality; university
    DOI:  https://doi.org/10.36740/WLek/214800
  7. Cureus. 2026 Jan;18(1): e101412
      Hospital scheduling, particularly for on-call shifts and daily assignments, is a complex task that must account for numerous factors, such as service requirements, staff preferences, and unplanned absences. Traditional methods often result in significant administrative burden and can lead to staff frustration, potentially affecting the quality of care. This study explores the use of artificial intelligence-based large language models (AI LLMs) to automate hospital scheduling through widely accessible tools, aiming to simplify the process, reduce manual effort, and enhance fairness. ChatGPT® (OpenAI, San Francisco, CA, USA) is used to translate natural language instructions into VBA (Visual Basic for Applications) macros, which automate the creation of on-call and daily activity schedules. The process involves collecting staff preferences via a Microsoft Excel (Microsoft Corporation, Redmond, WA, USA) sheet, followed by AI-generated VBA macros that automate the creation of the schedule, ensuring adherence to various constraints such as equitable shift distribution and prioritization of specific roles. The system was developed by a non-IT professional and does not require advanced programming skills. The implementation of this AI-driven scheduling system resulted in a significant reduction in administrative time and increased schedule fairness, as decisions were based on clear, consistently applied rules. The system also minimized conflicts within teams, improving both organizational efficiency and staff satisfaction. However, the development of such a process was not without its challenges, particularly in terms of rule formulation and Excel cell references. The integration of Microsoft Excel® and AI LLM provides a simple and reproducible solution for hospital schedule organization, reducing administrative burden and promoting fairness. This model, which can be adapted to other sectors facing similar challenges, enables teams to retain control over the process.
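    As an illustration of the kind of equity rule such AI-generated macros encode (shown here in Python rather than the authors' VBA, with assumed data shapes), a minimal scheduling sketch:
      # Illustrative equitable on-call assignment: balance shift counts while
      # respecting unavailable days. Not the study's VBA code; data shapes assumed.
      from collections import defaultdict

      def assign_on_call(days, staff, unavailable):
          """unavailable: dict name -> set of days the person cannot take call."""
          shift_count = defaultdict(int)
          schedule = {}
          for day in days:
              candidates = [s for s in staff if day not in unavailable.get(s, set())]
              if not candidates:
                  raise ValueError(f"No one available on {day}")
              # Equity rule: pick the available person with the fewest shifts so far.
              chosen = min(candidates, key=lambda s: shift_count[s])
              schedule[day] = chosen
              shift_count[chosen] += 1
          return schedule

      week = ["Mon", "Tue", "Wed", "Thu", "Fri"]
      print(assign_on_call(week, ["A", "B", "C"], {"B": {"Mon"}}))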
    Keywords:  artificial intelligence in medicine; chat gpt; emergency medicine; excel; pediatrics; scheduling
    DOI:  https://doi.org/10.7759/cureus.101412
  8. J Craniofac Surg. 2026 Jan 13.
       OBJECTIVE: This study aimed to evaluate and compare the accuracy, reliability, and comprehensibility of information provided by 4 artificial intelligence (AI)-based language models (ChatGPT-4, Google Gemini, Microsoft Copilot, and DeepSeek-v3) for orthognathic surgery.
    METHODS: A cross-sectional content analysis was carried out to evaluate the responses generated by ChatGPT-4, Gemini, Copilot, and DeepSeek-v3. A total of 118 questions covering 12 domains related to orthognathic surgery were formulated, and the AI-generated answers were systematically assessed. A 5-point Likert scale was used to independently score the responses. Descriptive statistics were used. The Fisher exact test was applied to examine relationships between categorical variables when the expected value was <5. All analyses were performed with IBM SPSS 27.
    RESULTS: Significant differences were observed among the AI models (P=0.022). DeepSeek-v3 demonstrated the highest proportion of objectively true responses (87.3%), outperforming Gemini, ChatGPT-4, and Copilot. While ChatGPT-4 and DeepSeek-v3 performed significantly better in the "postoperative" domain by providing "objectively true" answers (P=0.038), Gemini and Copilot generated a greater proportion of "selected facts." Domain-specific variations were statistically significant only for Gemini (P<0.001).
    CONCLUSIONS: The results indicate that the reliability of AI-assisted language models in delivering medical information is subject to variation depending on the specific topic addressed. In its first comparative assessment within this study, DeepSeek-v3 outperformed the other evaluated models in terms of informational accuracy.
    Keywords:  Artificial intelligence; ChatGPT-4; DeepSeek-v3; Google Gemini; Microsoft Copilot; orthognathic surgery
    DOI:  https://doi.org/10.1097/SCS.0000000000012367
  9. JAMA Psychiatry. 2026 Jan 14.
       Importance: The potential of tools using artificial intelligence (AI) to address the many challenges in delivery of mental health care has been widely discussed. However, the possible negative consequences of AI for such care have received less attention.
    Observations: Integrating AI with mental health care has the potential to expand access and improve quality of care. It may also contribute to improvements in diagnosis, risk stratification, and development of novel therapeutics. At the same time, availability of AI chatbots and stratification algorithms may diminish access to human-delivered care. Reliance on AI tools may have other unanticipated adverse consequences on clinical practice, including diminished human clinician skill. The probabilistic nature of many of these tools, including large language models, makes their capacity to cause harm difficult to determine.
    Conclusions and Relevance: The likely benefits of AI for psychiatric care delivery must be balanced against substantial risks. Strategies to mitigate this risk may require regulation to enhance transparency and systematically evaluate the impact of AI in practice, as well as clinician training to make optimal use of these emerging methods.
    DOI:  https://doi.org/10.1001/jamapsychiatry.2025.4116
  10. Cureus. 2025 Dec;17(12): e99286
     INTRODUCTION: Accurate and up-to-date educational resources are crucial for medical professionals to deliver effective patient care, particularly in conditions like pediatric asthma, which has a high disease burden in children. Timely interventions are essential to manage this condition appropriately and to ensure better outcomes. With the rapid advancement of artificial intelligence in healthcare, AI tools like Google Gemini are being explored as quick and accessible alternatives for generating medical content.
    METHODS: A cross-sectional observational study was conducted to focus on four core topics related to the management of pediatric asthma. Prompts for each of the core topics were entered in Google Gemini and UpToDate to generate responses. The WebFx Readability Tool was used to assess readability utilizing metrics such as Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, word count, sentence count, words per sentence, difficult word count, and percentage. The collected data were analyzed using the Mann-Whitney U test, and a p-value of < 0.05 was considered statistically significant.
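    The readability indices named above are standard formulas over word, sentence, and syllable counts; the study computed them with the WebFx tool, but they can be reproduced directly, as in this minimal sketch (syllable counts are assumed to come from elsewhere):
      # Standard readability formulas: inputs are total words, sentences, syllables,
      # and the number of words with 3+ syllables (polysyllables).
      import math

      def flesch_reading_ease(words, sentences, syllables):
          return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

      def flesch_kincaid_grade(words, sentences, syllables):
          return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

      def smog_index(polysyllables, sentences):
          return 1.043 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291

      # Example: 600 words, 40 sentences, 900 syllables, 90 polysyllabic words
      print(round(flesch_reading_ease(600, 40, 900), 1),
            round(flesch_kincaid_grade(600, 40, 900), 1),
            round(smog_index(90, 40), 1))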
    RESULTS: When comparing the readability characteristics between UpToDate and Google Gemini, statistically significant differences were found, indicating that Google Gemini is more accessible for individuals with lower literacy skills. UpToDate received higher scores on the Simple Measure of Gobbledygook (SMOG) index across all four core topics, indicating that its content is harder for the general population to understand. Google Gemini had a higher percentage of difficult words across all four topics.
    CONCLUSION: Google Gemini was found to use more complex vocabulary while still maintaining overall accessibility, making it appropriate for patients with lower literacy levels. Although certain readability parameters demonstrated Google Gemini to be a more reader-friendly tool for assessing and understanding medical content, the high percentage of difficult words may make it more challenging for younger individuals and lower socio-economic populations to access.
    Keywords:  artificial intelligence; asthma; clinical decision support; educational content; google gemini; medical education; uptodate
    DOI:  https://doi.org/10.7759/cureus.99286
  11. Int J Med Inform. 2025 Dec 31. pii: S1386-5056(25)00467-8. [Epub ahead of print] 209: 106250
     OBJECTIVE: The rapid expansion of digital healthcare has heightened the volume of patient communication, thereby increasing the workload for healthcare professionals. Large Language Models (LLMs) hold promise for offering automated responses to patient questions relayed through eHealth platforms, yet concerns persist regarding their effectiveness, accuracy, and limitations in healthcare settings. This study aims to evaluate the current evidence on the performance and perceived suitability of LLMs in healthcare, focusing on their role in supporting clinical decision-making and patient communication.
    MATERIALS AND METHODS: A systematic search in PubMed and Embase up to June 11, 2025 identified 330 studies, of which 20 met the inclusion criteria for comparing the accuracy and adequacy of medical information provided by LLMs versus healthcare professionals and guidelines. The search strategy combined terms related to LLMs, healthcare professionals, and patient questions. The ROBINS-I tool assessed the risk of bias.
    RESULTS: A total of nineteen studies focused on medical specialties and one on the primary care setting. Twelve studies favored the responses generated by LLMs, six reported mixed results, and two favored the healthcare professionals' response. Bias components generally scored moderate to low, indicating a low risk of bias.
    DISCUSSION AND CONCLUSIONS: The review summarizes current evidence on the accuracy and adequacy of medical information provided by LLMs in response to patient questions, compared to healthcare professionals and clinical guidelines. While LLMs show potential as supportive tools in healthcare, their integration should be approached cautiously due to inconsistent performance and possible risks. Further research is essential before widespread adoption.
    Keywords:  Artificial intelligence; Healthcare; Large Language Models (LLMs); Natural Language Processing (NLP); Patient questions
    DOI:  https://doi.org/10.1016/j.ijmedinf.2025.106250
  12. BMC Med Ethics. 2026 Jan 10.
       BACKGROUND: This systematic review aims to synthesize the current knowledge about the applications and challenges of Artificial Intelligence (AI) technologies in healthcare, while evaluating the extent to which the European Union (EU) AI Act and the European Health Data Space (EHDS) contribute to ensuring responsible, secure, and ethically sound adoption of AI in clinical practice.
    METHODS: This review adheres to the guidelines set by the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) and has also been registered in PROSPERO. The PubMed®, Web of Science™, Scopus, and ScienceDirect® databases were searched. In addition, records identified through other sources (grey literature) were assessed for eligibility and included. All studies published between 2020 and 2024 about the application of AI and its regulation and ethical implications, particularly in healthcare, were included. Eligible studies were assessed for potential risk of bias during data extraction and quality evaluation screening.
    RESULTS: A total of 76 studies were included. Although AI technologies have several applications in the healthcare sector such as disease diagnosis, treatment, clinical data management, automated surgery, remote health monitoring, elderly patient care and/or biomedical research, important ethical issues are raised when using AI, namely data privacy, safety, lack of transparency, explainability, trust and potential biases.
    CONCLUSIONS: A proper application and compliance with established ethical principles, and legal regulations such as the EU AI Act and the EHDS are fundamental to ensure a responsible, safe, sustainable and trustworthy use of AI in healthcare.
    Keywords:  Artificial intelligence; EHDS; EU AI act; Ethics; Healthcare; Regulation
    DOI:  https://doi.org/10.1186/s12910-025-01372-5
  13. Cutis. 2025 Nov;116(5): 184-185
      Artificial intelligence (AI) technology has proven to be a valuable tool in the diagnosis and classification of dermatologic conditions. To our knowledge, no prior studies have investigated the role of dermatologists in the development of AI programs for classification of dermatologic conditions other than skin cancers. In this study, we aimed to analyze AI tools used for diagnosing and classifying skin disease and evaluate the role of dermatologists in the creation of AI technology. Additionally, we investigated the number of clinical images used in datasets to train AI programs and compared tools that were created with dermatologist input to those created without dermatologist/clinician involvement.
    DOI:  https://doi.org/10.12788/cutis.1295
  14. Health Sci Rep. 2026 Jan;9(1): e71492
       Background and Aims: ChatGPT is a popular large language model with potential educational applications in medicine. However, its performance in standardized, multi-disciplinary medical exams has not been comprehensively assessed. This study evaluates ChatGPT's accuracy and quality in Iran's national medical pre-internship exam.
    Methods: We tested ChatGPT (GPT-3.5, May 3rd version) on 195 multiple-choice questions from the March 2022 Iranian pre-internship exam, covering 23 medical specialties. Questions with visual content were excluded. Each question was asked in a new chat to avoid memory bias. Responses were evaluated by 55 experts using a 5-point Likert scale and compared against the official answer key. Data were analyzed descriptively using SPSS.
    Results: ChatGPT answered 68.6% of questions correctly. Expert ratings averaged 4.23/5 (SD = 1.21), indicating good to excellent quality. Best-performing specialties included pharmacology (85.7%), otorhinolaryngology (83.3%), and dermatology (83.3%). Lower performance was observed in pulmonology (42.9%) and epidemiology (50%).
    Conclusion: ChatGPT shows promise as a supplemental educational tool in medical education, but its accuracy varies by specialty. Faculty guidance is essential to ensure responsible integration until further improvements and validations are made.
    Keywords:  ChatGPT; artificial intelligence; examination; medical education
    DOI:  https://doi.org/10.1002/hsr2.71492
  15. Maedica (Bucur). 2025 Dec;20(4): 765-770
     Objectives: This study evaluated the diagnostic performance of two large language models (LLMs), ChatGPT and Google Gemini, in identifying common retinal and optic nerve diseases, benchmarked against an experienced ophthalmologist.
    Methods: Thirty standardized case vignettes, each comprising a brief clinical history and a high-resolution fundus image, were independently evaluated by ChatGPT, Google Gemini and an ophthalmologist. Ten retinal and optic nerve diseases were included. Diagnostic accuracy was calculated against a gold standard defined by consensus of two retina specialists. Inter-rater agreement was assessed using Cohen's kappa (κ). Secondary outcomes included interpretation time and clarity of explanation.
    Results: The ophthalmologist achieved the highest diagnostic accuracy (96.7%), followed by ChatGPT (90.0%) and Google Gemini (86.7%). Agreement between ChatGPT and Gemini was moderate (κ = 0.51, p = 0.004). ChatGPT showed moderate agreement with the ophthalmologist (κ = 0.47, p = 0.002), while Gemini demonstrated fair agreement with the ophthalmologist (κ = 0.36, p = 0.01). ChatGPT was the fastest (mean 21.7 seconds), followed by Gemini (25.7 seconds) and the ophthalmologist (149.8 seconds). Clarity of interpretation was highest for the ophthalmologist (mean 4.53/5), followed by ChatGPT (3.60/5) and Gemini (2.96/5), with significant differences between groups.
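    Cohen's kappa here is the usual chance-corrected agreement, kappa = (p_o - p_e) / (1 - p_e); with paired diagnosis labels it can be computed directly, e.g. with scikit-learn (the label lists below are made up for illustration):
      # Agreement between two raters' diagnosis labels, as in the ChatGPT-vs-Gemini
      # and model-vs-ophthalmologist comparisons; labels here are illustrative only.
      from sklearn.metrics import cohen_kappa_score

      chatgpt = ["AMD", "CRVO", "glaucoma", "AMD", "papilloedema"]
      gemini  = ["AMD", "BRVO", "glaucoma", "DR",  "papilloedema"]
      kappa = cohen_kappa_score(chatgpt, gemini)   # kappa = (p_o - p_e) / (1 - p_e)
      print(round(kappa, 2))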
    Conclusion: Ophthalmologists remain superior in diagnostic accuracy and clarity. However, ChatGPT and Google Gemini demonstrated strong performance in several retinal conditions. Their rapid evaluation times indicate potential utility as adjunct tools in triage, screening and education.
    Keywords:  ChatGPT; Google Gemini; artificial intelligence; retinal and neuro-ophthalmic diseases
    DOI:  https://doi.org/10.26574/maedica.2025.20.4.765
  16. Adv Ophthalmol Pract Res. 2026 Feb-Mar;6(1): 8-19
       Background: Vision and vision-language foundation models, a subset of advanced artificial intelligence (AI) frameworks, have shown transformative potential in various medical fields. In ophthalmology, these models, particularly large language models and vision-based models, have demonstrated great potential to improve diagnostic accuracy, enhance treatment planning, and streamline clinical workflows. However, their deployment in ophthalmology has faced several challenges, particularly regarding generalizability and integration into clinical practice. This systematic review aims to summarize the current evidence on the use of vision and vision-language foundation models in ophthalmology, identifying key applications, outcomes, and challenges.
     Main text: A comprehensive search on PubMed, Web of Science, Scopus, and Google Scholar was conducted to identify studies published between January 2020 and July 2025. Studies were included if they developed or applied foundation models, such as vision-based models and large language models, to clinically relevant ophthalmic applications. A total of 10 studies met the inclusion criteria, covering areas such as retinal diseases, glaucoma, and ocular surface tumor. The primary outcome measures were model performance metrics, integration into clinical workflows, and the clinical utility of the models. Additionally, the review explored the limitations of foundation models, such as the reliance on large datasets, computational resources, and interpretability challenges. The majority of studies demonstrated that foundation models could achieve high diagnostic accuracy, with several reports indicating excellent performance comparable to or exceeding those of experienced clinicians. Foundation models achieved high accuracy rates up to 95% for diagnosing retinal diseases, and similar performances for detecting glaucoma progression. Despite promising results, concerns about algorithmic bias, overfitting, and the need for diverse training data were common. High computational demands, EHR compatibility, and the need for clinician validation also posed challenges. Additionally, model interpretability issues hindered clinician trust and adoption.
    Conclusions: Vision and vision-language foundation models in ophthalmology show significant potential for advancing diagnostic accuracy and treatment strategies, particularly in retinal diseases, glaucoma, and ocular oncology. However, challenges such as data quality, transparency, and ethical considerations must be addressed. Future research should focus on refining model performance, improving interpretability and generalizability, and exploring strategies for integrating these models into routine clinical practice to maximize their impact in clinical ophthalmology.
    Keywords:  Artificial intelligence; Clinical integration; Ophthalmology; Vision foundation models; Vision-language models
    DOI:  https://doi.org/10.1016/j.aopr.2025.10.004
  17. Aesthetic Plast Surg. 2026 Jan 12.
       BACKGROUND: Double eyelid surgery is a common cosmetic procedure that creates a crease in the upper eyelid. Due to insufficient understanding of the procedure, numerous consultations have emerged, placing a heavy burden on plastic surgeons. The rise of large language models (LLMs) offers a potential solution to this issue.
    METHODS: This study collected sixteen questions of common concern to individuals seeking the surgery via an online questionnaire and assessed the efficacy of fifteen popular LLMs in answering these questions with both English and Chinese inputs. All responses from the LLMs were scored multidimensionally by three expert eyelid plastic surgeons across dimensions including professionalism, patient friendliness, informativeness, practicality, and logical clarity. The scoring results were statistically analyzed using the Friedman test and Nemenyi post-hoc test.
    RESULTS: With English input, ERNIE-Bot, ChatGPT-4o, and Gemini-2.0-Flash consistently ranked among the top three across most evaluation dimensions. In contrast, Claude-3.7-Sonnet, HuatuoGPT, ZoeGPT, CompliantGPT, and BastionGPT ranked lower across all dimensions, with performance significantly lagging behind the top performers. For Chinese input, DeepSeek-R1 maintained a leading position across all dimensions, forming the first tier alongside DeepSeek-V3, Gemini-2.0-Flash, and ERNIE-Bot. Meanwhile, Claude-3.5-Haiku, ZoeGPT, Llama3.3-70B-Instruct, CompliantGPT, HuatuoGPT, and BastionGPT ranked lower in multiple dimensions, with a significant gap relative to first-tier models.
    CONCLUSION: This study demonstrated LLMs' potential as medical consultation tools for double eyelid surgery, providing useful guidance for both English and Chinese users. Future research should focus on fine-tuning LLMs with more specialized medical data and exploring workflows for surgeon-LLM collaboration to validate their clinical utility.
    LEVEL OF EVIDENCE: V.
    Keywords:  Double eyelid surgery; Large language models; Medical consultation; Performance evaluation
    DOI:  https://doi.org/10.1007/s00266-025-05458-8
  18. Comput Struct Biotechnol J. 2026;31: 157-168
      Biomedical data continues to grow significantly, coming from different sources and being updated daily. This makes manual extraction not only time-consuming but also impossible to keep up with due to this constant increase. In this context, biomedical relation extraction, which aims to automate the discovery of relationships between entities from free texts, becomes an essential step for knowledge discovery. While fine-tuning Transformer models such as T5, PubMedBERT, BioBERT, ClinicalT5, and RoBERTa has shown satisfactory results, it requires specific datasets, which are time-consuming to create and costly since they require domain experts. One ideal solution is the use of Generative Artificial Intelligence (GenAI), as it is directly applicable to a problem without the need for data creation. In this paper, we explore these generative large language models (LLMs) to evaluate whether they can be reliable when it comes to processing biomedical data. To do so, we study the relation extraction task of four major biomedical tasks, namely chemical-protein relation extraction, disease-protein relation extraction, drug-drug interaction, and protein-protein interaction. To address this need, our study focuses on comparing the performance of fine-tuned Transformer models with generative models such as Mistral-7B, LLaMA2-7B, GLiNER, LLaMA3-8B, Gemma, RAG, and Me-LLaMA-13B, using the same datasets in both experiments, showing that fine-tuned Transformer models achieve performance levels roughly twice those obtained by generative LLMs. These models require more pretraining on specific data, as demonstrated by Me-LLaMA (pretrained on MIMIC-III), which shows a significant improvement in performance compared to the model pretrained on a general domain. In terms of performance, fine-tuned Transformer models on domain-specific biomedical data achieved average scores ranging from 84.42 to 90.35, while generative models obtained significantly lower scores, between 36.64 and 53.94. Among the generative LLMs, LLaMA3-8B, RAG, and Me-LLaMA-13B achieved the top three scores, with Me-LLaMA, pretrained on MIMIC-III, reaching 45.76, illustrating the benefit of domain-specific pretraining.
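    For orientation, a minimal sketch of the fine-tuned-Transformer style of relation classifier the study uses as its baseline: a sentence with entity markers is classified into a relation label. The checkpoint name, marker scheme, and label set below are placeholders, not the paper's exact setup.
      # Sentence-level biomedical relation classification with an encoder such as
      # PubMedBERT; fine-tuning would train the classification head on labeled pairs.
      import torch
      from transformers import AutoTokenizer, AutoModelForSequenceClassification

      model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed checkpoint
      labels = ["no_relation", "inhibitor", "activator"]                    # placeholder label set

      tokenizer = AutoTokenizer.from_pretrained(model_name)
      model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(labels))

      sentence = "[E1] Aspirin [/E1] irreversibly inhibits [E2] COX-1 [/E2]."
      inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
      with torch.no_grad():
          logits = model(**inputs).logits        # untrained head here; shown for structure only
      print(labels[int(logits.argmax(dim=-1))])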
    Keywords:  Generative artificial intelligence; Large language models; Named entity recognition; Natural language processing; Pretrained language models; Relation extraction
    DOI:  https://doi.org/10.1016/j.csbj.2025.12.004
  19. J Am Med Inform Assoc. 2026 Jan 13. pii: ocaf233. [Epub ahead of print]
       BACKGROUND: The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.
    METHODS: We followed the Preferred Reporting Items for Systematic Review and Meta-Analyses guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified clinical specialties and tasks for each included article, and summarized evaluation methods.
    RESULTS: Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarizations and patient communications made up the smallest, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.
    CONCLUSION: Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.
    Keywords:  electronic health records; evaluation; large language models; natural language processing; review
    DOI:  https://doi.org/10.1093/jamia/ocaf233
  20. JAMIA Open. 2026 Feb;9(1): ooaf098
       Objectives: The surge in publications increases screening time required to maintain high-quality literature reviews. One of the most time-consuming phases is title and abstract screening. Machine learning tools have semi-automated this process for systematic reviews, with limited success for scoping reviews. ChatGPT, a chatbot based on a large language model, might support scoping review screening by identifying key concepts and themes. We hypothesize that ChatGPT outperforms the semi-automated tool Rayyan, increasing efficiency at acceptable costs while maintaining a low type II error.
    Materials and Methods: We conducted a retrospective study using human screening decisions on a scoping review of 15 307 abstracts as a benchmark. A training set of 100 abstracts was used for prompt engineering for ChatGPT and training Rayyan. Screening decisions for all abstracts were obtained via an application programming interface for ChatGPT and manually for Rayyan. We calculated performance metrics, including accuracy, sensitivity, and specificity with Stata.
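    A minimal sketch of this kind of API-driven screening loop and the confusion-matrix metrics reported below; the prompt wording, model name, and inclusion criteria are placeholders rather than the study's actual protocol:
      # Illustrative title/abstract screening via the OpenAI Python SDK, plus
      # accuracy/sensitivity/specificity against human decisions.
      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      def screen(abstract, criteria):
          resp = client.chat.completions.create(
              model="gpt-4o",  # placeholder model name
              messages=[{"role": "user",
                         "content": f"Inclusion criteria: {criteria}\n\nAbstract: {abstract}\n\n"
                                    "Answer with exactly one word: include or exclude."}],
          )
          return resp.choices[0].message.content.strip().lower().startswith("include")

      def metrics(ai, human):
          tp = sum(a and h for a, h in zip(ai, human))
          tn = sum((not a) and (not h) for a, h in zip(ai, human))
          fp = sum(a and (not h) for a, h in zip(ai, human))
          fn = sum((not a) and h for a, h in zip(ai, human))
          return {"accuracy": (tp + tn) / len(human),
                  "sensitivity": tp / (tp + fn),
                  "specificity": tn / (tn + fp)}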
    Results: ChatGPT 4.0 provided decisions for 15 306 abstracts, vastly outperforming Rayyan. ChatGPT 4.0 demonstrated an accuracy of 68%, specificity of 67%, sensitivity of 88%-89%, a negative predictive value of 99%, and an 11% false negative rate when compared to human researchers' decisions. The workload savings were 64%, achieved at reasonable cost.
    Discussion and Conclusion: This study demonstrated ChatGPT's potential to be applied in the first phase of the literature appraisal process for scoping reviews. However, human oversight remains paramount. Additional research on ChatGPT's parameters, prompts, and screening scenarios is necessary to validate these results and to develop a standardized approach.
    Keywords:  ChatGPT; artificial intelligence; automation; large language model; scoping review; screening
    DOI:  https://doi.org/10.1093/jamiaopen/ooaf098
  21. Neuroimage. 2026 Jan 13. pii: S1053-8119(26)00038-8. [Epub ahead of print] 121720
      Efficient allocation of attentional resources is critical when humans collaborate with artificial intelligence (AI): they must focus on their task while monitoring the AI to intervene if it fails. Inefficient allocation-such as excessive monitoring or overreliance-can impair performance and cause critical errors. Whether humans appropriately offload attentional effort to an AI depends on factors such as the AI's competency, the user's expertise, and their propensity to trust. Yet, trust in AI is a latent variable that is difficult to measure. Here, we introduce an EEG-based approach to directly track how attentional resources are shared between a human and an AI. Participants performed a visual search task either alone or with an AI whose competency was varied. The N2pc component-an established neural marker of selective visual attention-was used to index attention deployment. Results showed that the N2pc amplitude varied with the AI's competency: smaller amplitudes indicated greater offloading and trust in the high- versus low-competency condition. The findings demonstrate that neurophysiological markers such as the N2pc can serve as implicit, non-disruptive measures of trust that inform about the cognitive mechanisms underlying trust calibration. The study thus establishes the N2pc as a promising marker for quantifying attention allocation in collaborative human-AI search tasks and extends its relevance from visual attention research to the study of trust in automation.
    DOI:  https://doi.org/10.1016/j.neuroimage.2026.121720
  22. medRxiv. 2026 Jan 08. pii: 2026.01.06.26343564. [Epub ahead of print]
       Background: Gender disparities in academic medicine have been previously reported, but prior bibliometric studies have been limited by small sample sizes and reliance on manual gender annotation methods. These bottlenecks constrain previous analyses to only a small subset of clinical literature. To assess gender-based differences in authorship trends, research impact, and scholarly output over time in clinical research at scale, we hypothesized that large language models (LLMs) can be an effective tool to facilitate systematic bibliometric analysis of academic research trends.
    Methods: We conducted a retrospective, cross-sectional bibliometric study evaluating manuscripts published between January 2015 and September 2025 across over 1,000 PubMed-indexed academic medical journals. Over 1 million manuscripts, written by more than 10 million authors across 13 medical specialties, were analyzed. To enable this large-scale study, the genders of manuscript authors were annotated using a scalable LLM-based pipeline compatible with consumer-grade hardware.
    Results: We found that the proportion of female principal investigators has increased over time across different medical subspecialties. However, studies led by male authors tended to be published in higher-impact journals and cited more frequently than those led by female authors. We also observed that researchers of the same gender tended to work together when compared to colleagues of the opposite gender.
    Conclusions: While our findings revealed persistent gender-based differences in authorship trends, citation practices, and journal placement, we also observed ongoing, meaningful progress in female representation within academic medical research over time. Our results suggest that LLMs can be a powerful tool to scalably and periodically track this continued progress in future academic medical research.
    Plain Language Summary: Academic research is important to advance the field and practice of medicine. To obtain an accurate picture of the differences in medical research and impact between male and female researchers, we leveraged large language models (LLMs) to identify author genders for over one million medical research papers published between 2015 and 2025. We found that the number of women serving as lead researchers has increased over time across many medical specialties. However, important gaps in achieving gender equality in medical research remain. Our study ultimately helps demonstrate that LLMs can help us monitor gender-based trends in academic research in the future.
    DOI:  https://doi.org/10.64898/2026.01.06.26343564
  23. JMIR Med Inform. 2026 Jan 14. 14 e74240
       Background: Artificial intelligence (AI) offers potential solutions to address the challenges faced by a strained mental health care system, such as increasing demand for care, staff shortages, and pressured accessibility. While developing AI-based tools for clinical practice is technically feasible and has the potential to produce real-world impact, only a few are actually implemented into clinical practice. Implementation starts at the algorithm development phase, as this phase bridges theoretical innovation and practical application. The design and the way the AI tool is developed may either facilitate or hinder later implementation and use.
    Objective: This study aims to examine the development process of a suicide risk prediction algorithm using real-world electronic health record (EHR) data through a qualitative case study approach for clinical use in mental health care. It explores which challenges the development team encountered in creating the algorithm and how they addressed these challenges. This study identifies key considerations for the integration of technical and clinical perspectives in algorithms, facilitating the evolution of mental health organizations toward data-driven practice. The studied algorithm remains exploratory and has not yet been implemented in clinical practice.
    Methods: An exploratory, multimethod qualitative case study was conducted, using a hybrid approach with both inductive and deductive analysis. Data were collected through desk research, reflective team meetings, and iterative feedback sessions with the development team. Thematic analysis was used to identify development challenges and the team's responses. Based on these findings, key considerations for future algorithm development were derived.
    Results: Key challenges included defining, operationalizing, and measuring suicide incidents within EHRs due to issues such as missing data, underreporting, and differences between data sources. Predicting factors were identified by consulting clinical experts; however, psychosocial variables had to be constructed as they could not directly be extracted from EHR data. Risk of bias occurred when traditional suicide prevention questionnaires, unequally distributed across patients, were used as input. Analyzing unstructured data by natural language processing was challenging due to data noise, but ultimately enabled successful sentiment analysis, which provided dynamic, clinically relevant information for the algorithm. A complex model enhanced predictive accuracy but posed challenges regarding understandability, which was highly valued by clinicians.
    Conclusions: To advance mental health care as a data-driven field, several critical considerations must be addressed: ensuring robust data governance and quality, fostering cultural shifts in data documentation practices, establishing mechanisms for continuous monitoring of AI tool usage, mitigating risks of bias, balancing predictive performance with explainability, and maintaining a clinician "in-the-loop" approach. Future research should prioritize sociotechnical aspects related to the development, implementation, and daily use of AI in mental health care practice.
    Keywords:  artificial intelligence; electronic health records; implementation science; mental health services; prediction algorithms; suicide prevention
    DOI:  https://doi.org/10.2196/74240
  24. Eur Radiol. 2026 Jan 13.
       OBJECTIVES: Studies have reported promising results regarding artificial intelligence (AI) as a tool for improved mammographic screening interpretive performance. We analyzed AI malignancy risk scores from two versions of the same commercial AI model.
    MATERIALS AND METHODS: This retrospective cohort study used data from 117,709 screening examinations performed in BreastScreen Norway 2009-2018. The mammograms were processed by two versions of the commercially available AI model, Transpara (versions 1.7 and 2.1). The distributions of exam-level risk scores (AI score 1-10) and risk categories were evaluated for both AI versions on all examinations, including 737 screen-detected and 200 interval cancers. Scores of 1-7 were categorized as low risk, 8-9 as intermediate risk, and 10 as high risk of malignancy.
    RESULTS: Area under the receiver operating characteristic curve was 0.908 (95% CI: 0.986-0.920) for version 1.7 and 0.928 (95% CI: 0.917-0.939) for 2.1 when screen-detected and interval cancers were considered as positive cases (p < 0.001). A total of 87.1% (642/737) and 93.5% (689/737) of the screen-detected cancers had an AI score of 10 with version 1.7 and 2.1, respectively. Among interval cancers, 45.0% (90/200) had AI score 10 with version 1.7 and 44.5% (89/200) had AI score 10 with version 2.1.
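    The score-to-category mapping and the AUC comparison can be expressed compactly; a sketch using scikit-learn with made-up arrays (not the BreastScreen Norway data):
      # Exam-level AI scores (1-10) mapped to the study's risk categories, and AUC
      # computed with screen-detected plus interval cancers as positives.
      from sklearn.metrics import roc_auc_score

      def risk_category(score):
          return "low" if score <= 7 else ("intermediate" if score <= 9 else "high")

      scores_v17 = [10, 3, 8, 10, 1, 9]      # illustrative version 1.7 scores for six exams
      scores_v21 = [10, 2, 10, 10, 1, 8]     # illustrative version 2.1 scores for the same exams
      cancer     = [1, 0, 1, 1, 0, 0]        # 1 = screen-detected or interval cancer

      print([risk_category(s) for s in scores_v21])
      print(roc_auc_score(cancer, scores_v17), roc_auc_score(cancer, scores_v21))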
    CONCLUSION: A higher proportion of screen-detected breast cancers had the highest AI score of 10 with the newer version of the AI model compared to the older version. For interval cancers, there was no difference in the proportion of cases assigned to the highest score between the two versions.
    KEY POINTS: Question Studies have reported promising results regarding the use of AI in mammography screening, but comparisons of updated versus older versions are less studied. Findings In our study, 87.1% (642/737) of the screen-detected cancers were classified with a high malignancy risk score by the old version, while it was 93.5% (689/737) for the newer version. Clinical relevance Understanding how version updates of AI models might impact screening mammography performance will be important for future quality assurance and validation of AI models.
    Keywords:  Artificial intelligence; Breast cancer; Mammography; Screening
    DOI:  https://doi.org/10.1007/s00330-025-12240-6
  25. J Food Drug Anal. 2025 Dec 15. 33(4): 487-500
      Artificial intelligence (AI) technologies are increasingly integrated into healthcare, yet their economic value remains uncertain. Traditional economic evaluation methods may not adequately capture the unique features of AI, including dynamic model evolution, scalability, and broader societal impacts. This systematic review synthesized existing evidence on the cost-effectiveness of AI-based healthcare interventions and assessed the methodological rigor of published studies. A comprehensive search identified health economic evaluations of AI applications published between September 2019 and March 2025, following PRISMA and SWiM guidelines and registered in PROSPERO (CRD42025641230). Eligible studies were full economic evaluations comparing AI-based interventions with non-AI alternatives, and data were extracted on study characteristics, analytical methods, decision-analytic models, perspectives, outcomes, and AI-specific costs. Methodological quality was evaluated using the CHEERS checklist. A total of 52 studies from 15 countries were included, most published after 2020, focusing on diabetic retinopathy screening, cancer detection, and cardiovascular disease applications. Cost-utility analysis was the predominant method (79%), followed by cost-effectiveness analysis (15%). Nearly all studies (98%) concluded that AI-based strategies were cost-effective, cost-beneficial, or cost-saving. However, reporting of AI-specific costs was inconsistent: while over 90% of studies detailed expenses such as software licensing, per-test charges, or maintenance fees, some omitted cost information entirely, limiting comparability. Overall, AI-based healthcare interventions are generally reported as cost-effective, but methodological heterogeneity, incomplete cost reporting, and potential publication bias constrain the reliability and comparability of current evidence. Standardized economic evaluation frameworks that incorporate comprehensive cost structures and account for the evolving nature of AI are urgently needed.
    DOI:  https://doi.org/10.38212/2224-6614.3570
  26. J Korean Med Sci. 2026 Jan 12. 41(2): e24
     BACKGROUND: The integration of artificial intelligence, specifically large language models, into editorial processes is gaining interest due to its potential to streamline manuscript assessments, particularly regarding ethical and transparency reporting in public health journals. This study aims to evaluate the capability and limitations of ChatGPT-4.0 in accurately detecting missing ethical and transparency statements in research articles published in high-ranked (Q1) versus low-ranked (Q4) public health journals.
    METHODS: Articles from top-tier (Q1) and low-tier (Q4) public health journals were analyzed using ChatGPT-4.0 for the presence of essential ethical components, including ethics approval, informed consent, animal ethics, conflicts of interest, funding notes, and open data sharing statements. Performance metrics such as sensitivity, recall, and precision were calculated.
    RESULTS: ChatGPT exhibited high sensitivity and recall across all evaluated components, accurately identifying all missing ethics statements. However, precision varied significantly between categories, with notably high precision for data availability statements (0.96) and significantly lower precision for funding statements (0.16). A comparative analysis between Q1 and Q4 journals showed a marked increase in missing ethics statements in the Q4 group, particularly for open data sharing statements (4 vs. 50 cases), ethics approval (2 vs. 5 cases), and informed consent statements (3 vs. 8 cases).
    CONCLUSION: ChatGPT-4.0 in preliminary screening shows considerable promise, providing high accuracy in identifying missing ethics statements. However, limitations regarding precision highlight the necessity for additional human checks. A balanced integration of artificial intelligence and human judgment is recommended to enhance editorial checks and maintain ethical standards in public health publishing.
    Keywords:  Artificial Intelligence; ChatGPT-4.0; Editorial Policies; Ethics; Natural Language Processing; Public Health
    DOI:  https://doi.org/10.3346/jkms.2026.41.e24
  27. Ann Card Anaesth. 2026 Jan 01. 29(1): 81-88
       INTRODUCTION: Patient education significantly improves outcomes, especially in high-risk procedures. However, traditional educational resources often fail to address patient literacy and emotional needs adequately. Large language models like ChatGPT (OpenAI) and Gemini (Google) offer promising alternatives, potentially enhancing both accessibility and comprehensibility of procedural information. This study evaluates and compares the effectiveness of ChatGPT and Gemini in generating accurate, readable, and clinically relevant patient education materials (PEMs) for pulmonary artery catheter insertion.
    METHODOLOGY: A comparative, single-blinded study was conducted using structured validation methods and a common prompt for both generative artificial intelligence (AI) chatbots. AI-generated PEMs were assessed by board-certified anesthesiologists and intensivists. Face validity was determined using a 5-point Likert scale evaluating appropriateness, clarity, relevance, and trustworthiness. Content validity was measured by calculating content validity index. Accuracy and completeness were evaluated by a separate expert panel using a 10-point Likert scale. Readability and sentiment analysis were assessed via automated online tools.
    RESULTS: Both chatbots achieved robust face and content validity (S-CVI = 0.91). ChatGPT scored significantly higher on accuracy [9.00 vs. 8.00; P = 0.021] and perceived trustworthiness, while Gemini outperformed in readability (Flesch Reading Ease score: 65 vs. 54; Flesch-Kincaid Grade Level: 7.58 vs. 8.64) and clarity. Both outputs maintained a neutral emotional tone.
    CONCLUSION: AI chatbots show promise as innovative tools for patient education. By leveraging the strengths of both AI-driven technologies and human expertise, healthcare providers can enhance patient education and empower individuals to make informed decisions about their health and medical care involving complex clinical procedures.
    Keywords:  Face validity; generative artificial intelligence; patient education; pulmonary arteries; readability; sentiment analysis
    DOI:  https://doi.org/10.4103/aca.aca_145_25