bims-librar Biomed News
on Biomedical librarianship
Issue of 2026–06–14
fifty papers selected by
Thomas Krichel, Open Library Society



  1. Journal Mass Commun Educ. 2025 Dec;80(4): 393-416
      This study considers a practice-based learning project in which 173 undergraduates enrolled in a journalism class reported on oral health issues in marginalized communities. Health fairs at local libraries provided a starting place for students to cover oral health care, an underreported topic in the media. Three nonprofit journalism organizations partnered with the effort to publish student-produced content. Through written responses from all students and 30 in-depth interviews with professional journalists, librarians, health professionals, and students, this qualitative case study presents a replicable community-centered intervention based on a journalism teaching hospital model that connects students with professionals and community stakeholders.
    Keywords:  collaborative learning; community journalism; community of practice; experiential learning; pedagogy; social learning; teaching; undergraduate education
    DOI:  https://doi.org/10.1177/10776958251360340
  2. Front Artif Intell. 2026 ;9 1751832
      The Fourth Industrial Revolution (4IR) has accelerated the adoption of Artificial Intelligence (AI) across knowledge intensive sectors, yet its integration in Library and Information Services (LIS) within resource constrained educational environments remains limited and under examined. This empirical paper seeks to explore the impact of AI integration in LIS and the associated challenges. This research examines the various applications of AI in libraries. The paper intends to identify the benefits and challenges associated with AI integration in libraries. This study adopted a qualitative descriptive design informed by phenomenological principles to study how AI application is perceived by key stakeholders to be beneficial and what barriers constrain AI uptake, and in what new areas AI can be applied in teachers' college libraries. The study was anchored to the Technology Acceptance Model (TAM) where the effect of Perceived Usefulness (PU) and Perceived Ease of Use (PEOU) on the stakeholders' preparedness to adopt AI was analysed as an interpretive lens. Sixty participants (library staff, ICT experts and Computer Science students) were interviewed in-depth, participated in focus group discussions, and were observed using structured observations. All data was analysed thematically using Braun and Clarke's six-step framework. Evidence found that there was no real implementation of AI in academic libraries. Actual usage was almost non-existent, mainly automated processes within existing library systems, such as those integrated within Koha. Nonetheless, the participants perceived AI as potentially useful for automating mundane tasks, increasing ease in locating information, providing an improved user experience, and for continuous availability through the use of chatbots. 
In spite of the strong perceived usefulness, the actual PEOU and institutional preparedness for AI are limited by infrastructure challenges, lack of funds, limited digital literacies of users, ethical issues, and wider issues of digital disparity. The study stresses the need for a context-specific AI governance strategy, building capacity of all stakeholders, and adoption of a phased approach in low-resource academic libraries. The findings provide new insights into Global South literature by presenting an in-depth, multi-stakeholder view of teachers' college AI readiness, emphasizing the interplay between technological novelty, infrastructural challenges, and ethics.
    Keywords:  TAM-based study; artificial intelligence; constraints; integration; library and information services; machine learning; perceived benefits
    DOI:  https://doi.org/10.3389/frai.2026.1751832
  3. PLoS One. 2026 ;21(6): e0351303
       INTRODUCTION: This paper describes the development and validation of highly sensitive search hedges for Ovid MEDLINE and Ovid APA PsycInfo that effectively identify literature on transgender and gender diverse (TGD) populations.
    METHODS: Two librarians developed the search hedges using relevant keywords and controlled vocabulary terms, building on previous work on identifying transgender populations in evidence synthesis. The hedges were tested and refined to capture diverse and expansive gender identities across cultures and disciplines. The hedges were validated for sensitivity using a gold standard set of 144 articles from the Knowsy portal of evidence syntheses tagged as Two-Spirit, transgender, or gender non-binary. To assess precision an international research team of subject experts independently screened a randomized sample of search results in a two-stage screening process with an additional screener resolving disputes.
    RESULTS: The final search hedges demonstrated 100% sensitivity in both MEDLINE and APA PsycInfo, identifying all 144 relevant articles from the Knowsy gold standard set. The MEDLINE search hedge achieved 71% precision, and the APA PsycInfo hedge achieved 67% precision. These results balance comprehensive retrieval with the need to minimize non-relevant articles, supporting an efficient screening process.
    CONCLUSIONS: These search hedges in MEDLINE and APA PsycInfo are valuable tools for researchers and librarians to more effectively identify literature on TGD populations. These tools will be crucial for ongoing work in addressing gaps in research and health disparities faced by TGD populations and will be particularly valuable for researchers conducting evidence synthesis projects related to this population.
    DOI:  https://doi.org/10.1371/journal.pone.0351303
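The sensitivity and precision figures in this abstract follow the usual retrieval definitions. The following is a minimal sketch of how such figures could be computed, not the authors' procedure. The gold standard size of 144 comes from the abstract; the record identifiers, the number of extra records, and the 100-record screening sample are hypothetical.

```python
def sensitivity(retrieved_ids, gold_ids):
    """Fraction of gold-standard records that the search retrieved."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold standard set is empty")
    return len(gold & set(retrieved_ids)) / len(gold)


def precision(relevant_in_sample, sample_size):
    """Fraction of screened search results that were judged relevant."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return relevant_in_sample / sample_size


# Hypothetical identifiers: 144 gold-standard records, all retrieved,
# plus 500 further records returned by the search.
gold = [f"rec{i}" for i in range(144)]
retrieved = gold + [f"extra{i}" for i in range(500)]
hedge_sensitivity = sensitivity(retrieved, gold)

# Hypothetical screening sample: 71 of 100 sampled results judged relevant.
sample_precision = precision(71, 100)
```

Precision is estimated here from a screened sample rather than from the full result set, mirroring the randomized-sample screening the abstract describes.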
  4. Front Public Health. 2026 ;14 1827765
       Introduction: China's vigorous advancement of the "Internet Plus Healthcare Services" initiative, as a core component of social security system optimization for improving residents' health welfare, furnishes a robust impetus for the integration of Generative Artificial Intelligence (GenAI) within the healthcare sector, particularly in the domain of health information provision, where GenAI is widely acknowledged to harbor substantial transformative potential. Nevertheless, empirical research specifically probing the modalities through which users adopt GenAI for health information seeking remains limited in extant scholarly literature.
    Methods: This study garnered primary data via a structured online survey and employed partial least squares structural equation modeling (PLS-SEM) to dissect the antecedent factors and underlying mechanisms governing users' health information seeking intention via GenAI, grounded in the perspective of user perceptions.
    Results: PLS-SEM results indicate that user perceptions of GenAI (perceived competence, perceived convenience, and perceived anthropomorphism) positively influence both user trust in GenAI and subjective norms, which in turn positively affect users' health information seeking intention through GenAI. Moreover, digital health literacy significantly moderates the relationship between user perceptions of GenAI and their health information seeking intention.
    Discussion: These findings yield valuable empirical insights for facilitating the optimized and scaled adoption of GenAI in health information services, enhancing the public health output of digital health policies, improving residents' health welfare, and further alleviating the operational burdens borne by traditional healthcare resources.
    Keywords:  GenAI users; generative artificial intelligence; health information seeking intention; internet plus healthcare; perceived characteristics
    DOI:  https://doi.org/10.3389/fpubh.2026.1827765
  5. medRxiv. 2026 Jun 04. pii: 2026.06.03.26354854. [Epub ahead of print]
       Objectives: Study design indexing of biomedical publications is crucial for evidence retrieval and synthesis. We sought to evaluate the accuracy and suitability of a transformer-based model (TM) for indexing clinical study designs, in comparison to National Library of Medicine (NLM) indexing. However, this is challenging for at least three reasons: First, to date, all automated systems have been trained and evaluated on manual NLM indexing assignments, which are themselves subject to errors. Second, TM's probabilistic predictive scores take into account uncertainty, and can be converted to TRUE/FALSE assignments in different ways depending on the needs of users, while NLM labels are categorical. Third, our goal (to tag only articles that exhibit a given design) differs from that of NLM, which tags articles that discuss as well as exhibit that design.
    Materials and Methods: Therefore, we carried out a limited evaluation of the TM model that focuses only on the articles that received the most confident predictions, that is, the highest scores that are almost certainly TRUE and the lowest scores that are almost certainly FALSE, but which disagreed with NLM assignments. This was performed both for articles published in 2016 (when NLM decisions were manual) and in 2025 (when NLM decisions were automated). To establish ground truth, dual annotators indexed the articles independently, following written definitions, for four prominent study designs: cohort, case-control, cross-sectional, and case report.
    Results: For three designs (case-control, case report, cross-sectional), the articles having the top 100 predictive TM scores (when NLM failed to assign that design) were judged to exhibit that design in the great majority (86-100%) of cases. Conversely, the articles having the lowest 100 predictive TM scores (when NLM did assign the study design) exhibited the design in relatively few (0-21%) of cases. The most confident predictions of the TM model were highly accurate and not redundant with automated NLM indexing; the exception was cohort study articles, in which both TM and NLM labels showed high error rates of both omission and commission.
    Discussion and Conclusion: TM may have value for identifying articles exhibiting study designs, which is especially important for clinical decision-making as well as systematic reviews and other evidence syntheses. NLM indexing of cohort studies cannot be regarded as a reliable gold standard for training or evaluation of automated systems, warranting efforts to create a new manually annotated corpus.
    DOI:  https://doi.org/10.64898/2026.06.03.26354854
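The selection step described in this abstract (the most confident model predictions that disagree with NLM labels) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the record layout `(article_id, score, nlm_tagged)`, the function names, and the 0.5 default threshold are all assumptions.

```python
def confident_disagreements(records, k=100):
    """Select the most confident predictions that disagree with NLM.

    records: iterable of (article_id, score, nlm_tagged) tuples, where
    score is the model's predictive score for one study design and
    nlm_tagged says whether NLM assigned that design.
    Returns (high, low): the k highest-scored articles NLM did not tag,
    and the k lowest-scored articles NLM did tag.
    """
    records = list(records)
    untagged = [r for r in records if not r[2]]
    tagged = [r for r in records if r[2]]
    high = sorted(untagged, key=lambda r: r[1], reverse=True)[:k]
    low = sorted(tagged, key=lambda r: r[1])[:k]
    return high, low


def to_label(score, threshold=0.5):
    """Convert a probabilistic score to TRUE/FALSE; the threshold is a
    user choice (hypothetical default), not a value from the abstract."""
    return score >= threshold


# Hypothetical records for a single study design.
records = [
    ("a", 0.99, False),
    ("b", 0.02, True),
    ("c", 0.60, True),
    ("d", 0.95, False),
    ("e", 0.10, False),
]
high, low = confident_disagreements(records, k=1)
```

Raising or lowering `threshold` trades omissions against false assignments, which is one way the scores "can be converted to TRUE/FALSE assignments in different ways depending on the needs of users".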
  6. bioRxiv. 2026 Jun 01. pii: 2026.05.28.728511. [Epub ahead of print]
      Interpreting gene function in specific biological contexts is essential for biomedical research, yet manual literature review is labor-intensive. We developed GeneKnow, a source-grounded framework that uses generative AI models within a controlled hybrid workflow to produce reliable, traceable literature synthesis supported by authentic citations. Through systematic benchmarking, we showed that GeneKnow outperforms mainstream web-interface AI tools in generating trustworthy context-specific gene function syntheses without fabricated citations and minimizing hallucinations.
    DOI:  https://doi.org/10.64898/2026.05.28.728511
  7. Clin Radiol. 2026 Apr 28. pii: S0009-9260(26)00147-9. [Epub ahead of print]99 107371
       AIM: When radiologists encounter an unfamiliar musculoskeletal neoplasm on conventional and advanced MR images, they now have an option to search the internet for MRI image examples. The purpose of this project was to systematically evaluate the quality of images returned by a widely used general search engine.
    MATERIALS AND METHODS: Systematic internet searches were conducted for 16 benign and malignant musculoskeletal neoplasms, focusing on T1-weighted (T1W), T2-weighted (T2W), contrast-enhanced, diffusion-weighted (DWI), and out-of-phase MR images. The top five images from each search were evaluated for image quality and clinical relevance using a 5-point scoring key.
    RESULTS: The general internet search engine returned the correct sequence among the top five results for 88% of lesions when searched for T1W images, 100% for T2W images, 100% for contrast-enhanced images, 63% for DWI, and 25% for out-of-phase images. The Fleiss Kappa statistic demonstrated substantial agreement (Kappa = 0.72) for binary "useful" vs "not useful" image designation, and moderate agreement (Kappa = 0.48) for all five categories.
    CONCLUSION: The general internet search engine returned useful results when searching for conventional MRI sequences but performed sub-optimally when searching for advanced MR image examples.
    DOI:  https://doi.org/10.1016/j.crad.2026.107371
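Fleiss' kappa, the agreement statistic reported in this abstract, can be computed from a table of per-item category counts. The sketch below implements the standard formula; the rating tables are hypothetical, since the abstract does not give the number of raters or the raw ratings.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i][j] is the number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Expected agreement from the overall category proportions.
    n_categories = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical binary ratings ("useful", "not useful") from 3 raters.
perfect = fleiss_kappa([[3, 0], [0, 3]])
mixed = fleiss_kappa([[2, 1], [1, 2]])
```

With the first table every rater agrees on every item, so kappa is 1; with the second, agreement falls below chance and kappa is negative.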
  8. Respir Med. 2026 Jun 06. pii: S0954-6111(26)00316-1. [Epub ahead of print]260 108948
       OBJECTIVE: The utilization of artificial intelligence-based chatbots in the healthcare field is increasingly prevalent. Nonetheless, the quality of these chatbots' responses on this topic is unknown, particularly considering the sparse data about exercise and physical activity in patients with pulmonary arterial hypertension (PAH). The aim was to evaluate and compare the accuracy and readability of responses provided by ChatGPT, Gemini, and DeepSeek regarding exercise training and physical activity in patients with PAH.
    METHODS: ChatGPT, Gemini, and DeepSeek were prompted with the command "Can you list the 20 most frequently asked questions about exercise training and physical activity for patients with PAH worldwide?" The identified questions were reviewed by the research team, and 10 clinically relevant questions were selected. These questions were then posed to each chatbot in separate chat sessions. The accuracy of responses was assessed utilizing a 4-point Likert-type scale. For the readability assessment, the Flesch-Kincaid Grade Level (FKGL) was utilized. Data were analyzed utilizing the SPSS software.
    RESULTS: Overall, median accuracy scores ranged from 1 to 2 among the AI chatbots, with a significant difference observed only between ChatGPT and DeepSeek in favor of DeepSeek (p = 0.007). The readability scores of ChatGPT (9.09 ± 1.87) and DeepSeek (8.79 ± 1.35) were similar, whereas Gemini's score (10.91 ± 1.23) was higher than that of the other chatbots (p = 0.011).
    CONCLUSION: All three chatbots provided responses to inquiries on exercise training and physical activity in PAH with acceptable accuracy. Additionally, responses generated by ChatGPT and DeepSeek were easier to read compared with those generated by Gemini.
    Keywords:  Artificial-intelligent; ChatGPT; DeepSeek; Exercise training; Gemini; Pulmonary arterial hypertension
    DOI:  https://doi.org/10.1016/j.rmed.2026.108948
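The Flesch-Kincaid Grade Level used in this study, and the related Flesch Reading Ease Score, are fixed formulas over word, sentence, and syllable counts. The sketch below applies the standard English-language formulas to counts supplied by the caller; how the study's tooling counted syllables and sentences is not stated in the abstract, and the example counts are hypothetical.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease Score: higher values mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade needed."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
fres = flesch_reading_ease(100, 5, 150)
fkgl = flesch_kincaid_grade(100, 5, 150)
```

Both scores depend only on average sentence length and average syllables per word, so longer sentences and longer words push FKGL up and FRES down.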
  9. Pain Physician. 2026 May;29(3): 251-257
       BACKGROUND: The transforaminal epidural steroid injection (TFESI) is a widely used interventional procedure for managing radicular pain. Although TFESI is well established as a safe and effective treatment, patients frequently seek detailed explanations regarding its procedural steps, expected outcomes, and potential risks. Artificial intelligence (AI)-based platforms, particularly large language models (LLMs) such as ChatGPT, have emerged as accessible sources of periprocedural medical information. However, the accuracy, readability, and empathy of AI-generated responses in the context of interventional pain management remain uncertain.
    OBJECTIVES: To compare the accuracy and readability of responses generated by ChatGPT and fellowship-trained pain medicine physicians to common patient questions about TFESIs and to assess the potential utility of AI in patient education and periprocedural guidance.
    STUDY DESIGN: A cross-sectional comparative study.
    METHODS: Twenty frequently asked patient questions about TFESIs were retrospectively identified from pain clinic consultations and submitted individually to ChatGPT-4o and to fellowship-level physicians. Two interventional pain specialists independently evaluated all responses for accuracy and empathy using a 5-point Likert scale; discrepancies were resolved by a third reviewer. Readability was analyzed using the Readable® tool kit across 7 indices: Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGRL), Gunning Fog Index, SMOG Index, Coleman-Liau Index, average word and sentence length, and estimated overall reach.
    RESULTS: Both sources delivered highly accurate responses. However, ChatGPT's answers had significantly lower FRES scores, reflecting reduced reading ease, and higher scores across all other readability indices, indicating greater linguistic complexity and lower accessibility. These responses required a higher level of education to understand. Although empathy scores for ChatGPT were lower than the physicians', the difference was not statistically significant.
    LIMITATIONS: This study assessed a single AI platform (ChatGPT-4o). Accuracy and empathy ratings were performed subjectively by 2 pain specialists, which might have limited generalizability. Additionally, AI-generated responses can vary with software updates, reducing reproducibility across time.
    CONCLUSION: ChatGPT provides accurate information regarding TFESIs but demonstrates lower readability and a less empathetic tone than answers given by fellowship-trained physicians. With targeted improvements in clarity and patient-centered communication, AI holds potential as a useful adjunct in patient education and clinical support.
    Keywords:  Artificial intelligence; pain medicine; readability; transforaminal epidural steroid injections
  10. JMIR Form Res. 2026 Jun 08. 10 e89156
       Background: Systemic lupus erythematosus (SLE) is a complex, fluctuating disease, creating a continuous need for reliable patient information. A prior study concluded that patients with SLE often turn to the internet, including artificial intelligence (AI) chatbots, for information regarding SLE. The rise of AI chatbots as a primary information source presents a critical challenge regarding the accuracy of the information they provide.
    Objective: This study aimed to evaluate the performance of the latest generation of AI chatbots (ChatGPT-4o, DeepSeek-V3, and Gemini 2.5 Flash) in answering frequently asked questions about SLE.
    Methods: Twenty-two frequently asked questions about SLE in Bahasa Indonesia (the Indonesian language) were posed to each chatbot. Responses were independently and blindly evaluated for accuracy by 5 clinical immunologists using a 4-point Likert scale. Readability was assessed using the Flesch reading ease score formula. Statistical comparisons for accuracy and readability were performed using repeated-measures ANOVA or the Friedman test, followed by the Bonferroni test for pairwise comparisons. The Spearman ρ was used to evaluate correlations among accuracy, readability, and word count.
    Results: Gemini 2.5 Flash demonstrated the highest accuracy, with a mean score of 1.25 (SD 0.53), significantly outperforming ChatGPT-4o (mean 1.71, SD 0.61; P<.001). Gemini 2.5 Flash significantly outperformed ChatGPT-4o in 2 evaluated domains. The interreliability analysis revealed a statistically significant level of agreement among the 5 evaluators across all responses (Kendall W=0.389; P<.001). Readability for all 3 chatbots was low (median Flesch reading ease score 42.22-46.66). Gemini 2.5 Flash produced the longest responses (8509 total words), followed by DeepSeek-V3 (5410 words) and ChatGPT-4o (3632 words). A significant negative correlation was found between word count and lower accuracy (ρ=-0.401; P=.001).
    Conclusions: Our study found that ChatGPT-4o, DeepSeek-V3, and Gemini 2.5 Flash provided overall satisfactory responses to SLE-related questions. The highest accuracy was demonstrated by Gemini 2.5 Flash; however, the absolute differences in scores among the 3 AI chatbots were relatively small. All 3 AI chatbots demonstrated low readability, which may limit accessibility for patient use. This finding highlights a critical "blind spot" in which clinical accuracy, as rated by experts, does not equate to patient accessibility. Thus, further research is required to develop more comprehensive evaluation frameworks incorporating safety, factuality, and calibration of AI chatbots across different medical fields and topics.
    Keywords:  AI; ChatGPT; DeepSeek; Gemini; SLE; artificial intelligence; chatbot; systemic lupus erythematosus
    DOI:  https://doi.org/10.2196/89156
  11. Appl Clin Inform. 2026 Jun 09.
       BACKGROUND: Parents of children with rare and serious illnesses often have unmet information needs. Large language models (LLMs) can help parents seek medical information. However, few studies have observed parents' use of LLMs or how they would use it in conjunction with their patient portal.
    OBJECTIVES: We provided parents of children with cancer or vascular anomalies (VAs) access to a secure HIPAA-compliant chatbot. We characterized how parents used the tool while accessing their child's patient portal and evaluated the chatbot responses.
    METHODS: Parents participated in think-aloud sessions (n=48). Parents accessed a HIPAA-compliant GPT 4 Endpoint and entered queries about their child's illness. We examined query length and chatbot response length, accuracy, and readability. We also conducted content analysis on parent queries.
    RESULTS: We analyzed 451 queries and 451 responses. Parents' queries ranged from 1 to 104 words. They entered primarily short well-formed questions or phrases/statements. Some entered single words or incomplete phrases. Content was related to diagnosis/etiology, treatment, symptoms/side effects, laboratory values, imaging results, clinician notes/documentation, and supportive resources, with some differences between VA and cancer contexts. Chatbot responses ranged from 9 to 883 words. The mean accuracy rating was 4.9±0.5 and the mean Flesch Reading Ease score was 28.4±15.0 (college-graduate level).
    CONCLUSIONS: Parents' queries varied in length, complexity, and content, with some differences indicating unique information needs by disease context. Chatbot responses were accurate yet written at a reading level potentially challenging for some users. Future studies should consider these patterns and characteristics when designing health-related chatbot-based tools.
    DOI:  https://doi.org/10.1055/a-2888-9182
  12. Eur Spine J. 2026 Jun 12.
       OBJECTIVE: Large language models (LLMs) are increasingly used as clinical information tools; however, their ability to accurately interpret evidence-based spine guidelines remains unclear. This study compared the performance of ChatGPT-5.1, Gemini, and Perplexity in interpreting the North American Spine Society (NASS) guideline for lumbar disc herniation with radiculopathy.
    METHODS: Nineteen open-ended clinical questions derived from the NASS guideline were submitted to each LLM under standardized conditions. Responses were evaluated by two blinded clinicians using validated Likert scales for clinical accuracy (1-5), reliability, and usability (1-7). Semantic similarity to guideline-based answers was assessed using the Universal Sentence Encoder, surface-level textual similarity using ROUGE-L F1 scores, and readability using multiple established readability indices. Reference reliability was analyzed using the Reference Hallucination Score.
    RESULTS: Perplexity demonstrated significantly higher clinical accuracy (3.95 ± 0.70) compared with ChatGPT-5.1 (3.45 ± 0.68) and Gemini (3.50 ± 0.65) (p = 0.018). Reliability and usability scores were also highest for Perplexity (4.85 ± 1.05 and 4.75 ± 0.95, respectively; both p < 0.01). Semantic similarity scores were greater for Perplexity (0.71 ± 0.06) than for ChatGPT-5.1 (0.64 ± 0.07) (p < 0.001), whereas Gemini achieved the highest ROUGE-L F1 scores (0.14 ± 0.04; p < 0.001). Readability indices were comparable across models, indicating similar levels of textual complexity. ChatGPT-5.1 exhibited the highest reference hallucination (8.10 ± 2.85), while Perplexity showed the lowest (4.15 ± 2.70) (p < 0.001).
    CONCLUSIONS: LLMs show significant variability in guideline-based clinical reasoning. Although none should be used as independent decision-making tools, reference-oriented models may provide more reliable adjunctive support for evidence-based spine practice.
    Keywords:  Artificial intelligence; Clinical guidelines; Large language models; Lumbar disc herniation; Radiculopathy
    DOI:  https://doi.org/10.1007/s00586-026-10069-1
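ROUGE-L F1, the surface-similarity measure named in this abstract, is based on the longest common subsequence (LCS) of tokens shared by a candidate answer and a reference. The sketch below is a minimal version with equal weighting of precision and recall; the lowercase whitespace tokenization and the example sentences are assumptions, not details from the study.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two strings (lowercase whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences: the LCS is "the cat on the mat" (5 tokens).
score = rouge_l_f1("the cat is on the mat", "the cat sat on the mat")
```

Because the measure rewards shared word order rather than meaning, a fluent paraphrase of a guideline statement can score low, which is consistent with the small ROUGE-L values reported alongside higher semantic similarity scores.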
  13. J Exp Orthop. 2026 Apr;13(2): e70770
       Purpose: To assess the accuracy, potential safety concerns and readability of single-shot answers generated by the free GPT-4o ChatGPT interface to 15 predefined surgeon-level hip arthroscopy (HAS) learning-curve questions, using expert ratings (Mika scale) and interrater reliability analysis.
    Methods: Fifteen questions were selected based on frequency in HAS teaching courses. Each question was submitted once to ChatGPT in a new chat without additional prompting. Eight high-volume hip arthroscopists, serving as faculty and trainers, independently rated every answer using the 4-point Mika scale (1 = excellent, 4 = unsatisfactory). Consensus ratings were defined by the modal score or, in case of ties, by panel discussion with safety-oriented adjudication. Interrater reliability was evaluated using intraclass correlation coefficients (ICCs). Readability metrics were assessed using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL).
    Results: After consensus, 2 of 15 responses (13.3%) were rated excellent, 9 (60%) satisfactory with minimal clarification required, 3 (20%) satisfactory with moderate clarification required and 1 (6.7%) unsatisfactory, yielding a mean accuracy score of 2.2 ± 0.8 (median, 2.0; range, 1-4). The single unsatisfactory answer addressed patient positioning, and pharmacologic venous thromboembolism prophylaxis was rated satisfactory but raised safety concerns. Interrater reliability was moderate for single ratings (ICC(2, 1) = 0.58) and excellent for the mean of all raters (ICC(2, 8) = 0.92). Readability indicated a college-level demand (mean FRES 34, mean FKGL 13).
    Conclusions: GPT-4o provided mostly satisfactory and useful answers to common HAS questions posed by surgeons in their learning curve, but a minority of responses required substantial clarification or were judged unsafe if applied uncritically. These findings support the use of large language models as an adjunct educational tool, while highlighting the need for expert verification in safety-critical topics.
    Level of Evidence: Level IV, cross-sectional, comparative simulation study.
    Keywords:  ChatGPT; artificial intelligence; hip arthroscopy; large language models; surgical learning curve
    DOI:  https://doi.org/10.1002/jeo2.70770
  14. Digit Health. 2026 Jan-Dec;12:12 20552076261459527
       Objective: This study assesses ChatGPT-4o's responses to common patient inquiries regarding urinary incontinence (UI), a condition that significantly impacts quality of life but often goes untreated due to low healthcare-seeking behavior. The evaluation focuses on four key metrics: understandability, actionability, reliability, and readability.
    Material and Methods: In this non-human subject qualitative study, 13 patient-focused questions (derived from AUA/SUFU and EAU guidelines) were posed to ChatGPT-4o in Turkish. The questions were categorized into four themes: Definition, Diagnosis, Management, and Surgical Considerations. Three blinded experts (a urogynecologist, a urologist, and a pelvic floor physiotherapist) independently evaluated the responses using the Patient Education Materials Assessment Tool (PEMAT) for understandability and actionability and the modified DISCERN (mDISCERN) tool for reliability. Readability was measured using the Çetinkaya-Uzun formula, specifically designed for Turkish text. Statistical analysis included descriptive statistics and the Intraclass Correlation Coefficient (ICC) to determine inter-rater reliability.
    Results: In evaluating ChatGPT-4o's performance in urinary incontinence education, experts found strong agreement in their assessments, with inter-rater reliability scores of 0.80 (95% CI: 0.70-0.91) for PEMAT and 0.82 (95% CI: 0.70-0.91) for mDISCERN. The AI's responses were consistently highly understandable, particularly when explaining diagnoses (achieving a peak score of 94.4%), yet they were significantly less actionable, meaning they often failed to provide clear, practical steps for patients to follow. This gap was most evident in surgical considerations, which were deemed the least actionable at 68.2%. The overall reliability of the content was rated as "fair" across all categories, with surgical information being the most reliable. Most responses were classified as "difficult," requiring a university-level education to comprehend, with surgery-related topics being the most linguistically complex.
    Conclusion: While ChatGPT-4o yields comprehensible health information, its limited actionability and high linguistic complexity pose barriers to patients with lower health literacy.
    Keywords:  artificial intelligence; chatgpt-4o; patient education; urinary incontinence
    DOI:  https://doi.org/10.1177/20552076261459527
  15. Phlebology. 2026 Jun 09. 2683555261460252
       Objectives: Lipedema is a chronic disorder characterized by pain and disproportionate fat distribution, and its diagnosis is frequently overlooked. The aim of this study was to evaluate and compare the responses generated by contemporary artificial intelligence models (ChatGPT-5o, Gemini-3, and Perplexity AI) to structured clinical questions developed in accordance with the 2024 S2k Lipedema Guideline. The models were analyzed in terms of clinical accuracy, readability, and reference reliability to assess their performance in delivering guideline-based medical information.
    Methods: This cross-sectional and comparative study was conducted by submitting 30 structured clinical questions, prepared on the basis of the relevant guideline, to three large language models. Responses collected on 10 February 2026 were evaluated using a seven-point Likert scale (reliability) and a five-point scale (accuracy). Text readability was assessed using six established indices, including the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), and Gunning Fog Index (GFOG). Reference reliability was examined by analyzing hallucination tendencies as defined in the literature.
    Results: A statistically significant difference in reliability was observed among the models (p = .041); Perplexity (4.95 ± 1.20) achieved significantly higher scores than ChatGPT-5o (4.38 ± 1.05) (p = .038). In readability analyses, Perplexity (12.80 ± 2.10) required a significantly higher educational level according to FKGL scores compared to both ChatGPT-5o (p = .041) and Gemini-3 (p = .036). Regarding reference reliability, ChatGPT-5o outperformed Perplexity in source verifiability (p = .031), bibliographic precision (p = .044), and total RHS scores (p = .027), emerging as the most robust model in this domain. No statistically significant differences were found among the models in terms of clinical accuracy and usefulness (p > .05). Inter-rater agreement was excellent (Kappa: 0.92-0.97).
    Conclusion: In this study, ChatGPT-5o distinguished itself in reference quality, whereas Perplexity demonstrated superior reliability. However, the complex linguistic structures accompanying efforts to maintain high medical accuracy may constitute a significant barrier for individuals with limited e-health literacy. Although these systems show strong potential as medical information resources, they cannot yet replace expert physician oversight in terms of patient safety. A balanced approach between technical reliability and patient-centered simplification remains necessary.
    Keywords:  ChatGPT; artificial intelligence; lipedema; online medical information; patient education; readability
    DOI:  https://doi.org/10.1177/02683555261460252
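For readers unfamiliar with the readability indices named in the abstract above, the following is a minimal sketch of the two standard Flesch formulas (FRES and FKGL). It is not the study's tooling: the function names are illustrative, and `naive_counts` uses a crude vowel-group heuristic for syllables that dedicated readability software would replace with a dictionary-based counter.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score: higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def naive_counts(text: str) -> tuple[int, int, int]:
    """Rough (words, sentences, syllables) counts; a heuristic, not a validated counter."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    # Count each run of vowels as one syllable, with at least one per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return len(words), sentences, syllables


if __name__ == "__main__":
    w, s, syl = naive_counts("The cat sat. The dog ran.")
    print(flesch_reading_ease(w, s, syl), flesch_kincaid_grade(w, s, syl))
```

Both formulas depend only on average sentence length and average syllables per word, which is why long sentences and polysyllabic medical terms push FKGL up.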
  16. J Clin Med. 2026 May 22. pii: 4025. [Epub ahead of print]15(11):
      Background/Objectives: Inborn errors of immunity (IEI) are rare and complex pediatric disorders that create significant information gaps for families and non-specialist healthcare professionals. Large language models (LLMs) such as ChatGPT are increasingly used as on-demand health information resources; however, evidence on their performance in rare pediatric diseases remains limited. This study aimed to evaluate the reliability, quality, readability, understandability, reproducibility, and safety-related concerns of ChatGPT-4o responses to frequently searched questions about pediatric IEI posed by healthcare professionals and patients/caregivers. Methods: This cross-sectional evaluation used the publicly accessible ChatGPT-4o interface to generate responses to 20 frequently searched questions about pediatric IEI, equally distributed between healthcare professional (n = 10) and patient/caregiver queries (n = 10). Three pediatric allergy-immunology specialists independently evaluated response quality using the modified DISCERN (mDISCERN) and Global Quality Scale (GQS) tools, supplemented by a structured expert-based assessment of misinformation, safety-related concerns, suspected factual issues, missing disclaimers, and clinically meaningful inter-iteration inconsistency. Text readability was assessed using four validated indices (ARI, FRES, FKGL, GFR), comprehensibility using the Patient Education Materials Assessment Tool (PEMAT), and reproducibility using natural language processing methods. Results: ChatGPT-4o demonstrated strong overall performance, with median mDISCERN and GQS scores of 4 (IQR: 3-5) for both query types. Readability scores substantially exceeded recommended thresholds, with FKGL scores of 12.96 ± 0.69 and 10.83 ± 0.67 for professional and patient/caregiver queries, respectively. Mean PEMAT understandability scores were 71.80 ± 5.75% for professional queries and 80.80 ± 4.73% for patient/caregiver queries (p = 0.001). 
Reproducibility was high, with semantic similarity rates of 86.10 ± 3.84% and 87.30 ± 3.68%, respectively. Suspected factual issues were identified in 4 of 20 responses (20%), safety-related concerns in 3 (15%), clinically meaningful inter-iteration inconsistencies in 3 (15%), and missing medical disclaimers in all 20 responses (100%). Conclusions: ChatGPT-4o showed strong performance across validated quality metrics for pediatric IEI information support; however, its high reading level, universal absence of medical disclaimers, and occasional clinically meaningful inconsistencies limit its suitability as a standalone source for clinically sensitive guidance. These findings underscore the need for AI-driven patient education tools with improved readability, adaptive complexity adjustment, and safety-oriented communication.
    Keywords:  ChatGPT; artificial intelligence; health literacy; inborn errors of immunity; patient education; pediatric; readability
    DOI:  https://doi.org/10.3390/jcm15114025
  17. BMC Musculoskelet Disord. 2026 Jun 09.
       BACKGROUND: Large language models (LLMs) are increasingly used by patients to obtain health information. Postoperative rehabilitation after anterior cruciate ligament reconstruction (ACLR) has distinct phase boundaries and safety considerations. Therefore, responses should be not only clear and understandable, but also medically accurate, safe, and stage-fit. This study compared the performance of three publicly accessible LLMs in standardized post-ACLR rehabilitation question answering.
    METHODS: This was a standardized, blinded, expert-rated comparative evaluation study. On a single prespecified data collection day in March 2026, 30 English-language rehabilitation questions were submitted separately to GPT-5.4, Doubao, and MiniMax-M2.7. The questions covered five postoperative rehabilitation phases. Responses were anonymized and randomly reordered before blinded rating by five orthopaedic clinicians across five domains: Accuracy, Safety, Stage-fit, Completeness, and Understandability. Paired non-parametric tests, effect size analyses, intraclass correlation coefficients, and linear mixed-effects modelling were used for statistical analysis.
    RESULTS: A total of 90 model-generated responses and 450 expert rating records were included. Overall scores differed significantly among the three models (Friedman χ² = 46.067, P < 0.001; Kendall's W = 0.768). GPT-5.4 achieved the highest overall score (4.61 ± 0.13), followed by MiniMax-M2.7 (4.53 ± 0.19), whereas Doubao had the lowest score (3.86 ± 0.29). GPT-5.4 performed best in Accuracy, Safety, and Stage-fit; MiniMax-M2.7 achieved the highest score for Completeness; and Doubao achieved the highest mean score for Understandability. Inter-rater agreement was good [ICC(3,k) = 0.893], and sensitivity analysis supported the primary findings.
    CONCLUSIONS: The three models showed distinct rating profiles in standardized single-turn post-ACLR rehabilitation question answering. Evaluation of patient-facing rehabilitation information should not rely solely on linguistic fluency, but should prioritize medical accuracy, safety, and Stage-fit. These findings provide preliminary benchmark evidence in a phase-sensitive rehabilitation setting, but they should not be interpreted as evidence supporting clinical implementation, clinician substitution, or patient benefit.
    Keywords:  Anterior cruciate ligament reconstruction; Expert evaluation; Large language models; Patient education; Postoperative rehabilitation
    DOI:  https://doi.org/10.1186/s12891-026-10051-4
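The Friedman statistic and Kendall's W reported in the abstract above are linked by a standard identity, W = χ² / (n(k − 1)), where n is the number of blocks and k the number of treatments. The sketch below checks the reported values under the assumption (not stated explicitly in the abstract) that the 30 questions were the blocks and the 3 models the treatments; the function name is illustrative.

```python
def kendalls_w_from_friedman(chi_square: float, n_blocks: int, k_treatments: int) -> float:
    """Kendall's coefficient of concordance derived from a Friedman chi-square."""
    return chi_square / (n_blocks * (k_treatments - 1))


if __name__ == "__main__":
    # Reported: Friedman chi-square = 46.067; assumed n = 30 questions, k = 3 models.
    w = kendalls_w_from_friedman(46.067, n_blocks=30, k_treatments=3)
    print(round(w, 3))
```

Under that assumption the result agrees with the reported W of 0.768 to three decimals.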
  18. J Oral Pathol Med. 2026 Jun 11.
       BACKGROUND: Early diagnosis is crucial in improving oral cancer outcomes. Patient education materials support timely recognition and management. However, these resources are often written above recommended reading levels and beyond patients' health literacy, which limits accessibility.
    OBJECTIVES: To assess the readability of available patient information on oral cancer by the NHS, to evaluate three large language models (LLMs; ChatGPT, Claude and Gemini) in simplifying texts while preserving their content, and to propose an improved leaflet based on UK materials, expert review and LLM adjustment to match average UK reading levels.
    METHODS: Materials were collected from NHS-affiliated websites. Original and LLM-simplified texts were assessed using validated readability tools (FRES, FKGL, GFI, CLI and SMOG). Content fidelity was assessed using character 3-5-gram cosine similarity, sentence-content retention and latent semantic analysis (LSA). An expert review was applied to the proposed leaflet.
    RESULTS: LLM-revisions significantly improved readability across all five indices (p < 0.0001). Mean FRES of original texts was 66.4 ± 7.7, while Claude (81.6 ± 6.2) was the only model to surpass the 80 benchmark. Semantic similarity to source text remained high (LSA means 0.97 ± 0.04, 0.94 ± 0.09 and 0.96 ± 0.08; character 3-5-gram cosine similarity 0.85 ± 0.05, 0.80 ± 0.08 and 0.82 ± 0.08 for respective models). Baseline readability of the proposed leaflet was comparable to NHS materials (FRES 65.7); Claude increased this to 81.2.
    CONCLUSIONS: LLM-based simplification enhanced readability while preserving content fidelity. This approach can help enhance accessibility, particularly for populations disproportionately affected by oral cancer. With human oversight, it could be adopted at the policy level to standardise patient education and reduce health literacy disparities.
    Keywords:  National Health Service (UK); content fidelity; health literacy; large language models; medical education; oral cancer; patient information; readability
    DOI:  https://doi.org/10.1111/jop.70158
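One of the content-fidelity measures in the abstract above, cosine similarity over character n-grams of length 3 to 5, can be sketched as follows. This is an illustration only: the abstract does not say how the text was normalized or weighted, so the lowercasing and raw counts here are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text: str, n_min: int = 3, n_max: int = 5) -> Counter:
    """Count all character n-grams of length n_min..n_max (lowercased; an assumption)."""
    text = text.lower()
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is too short to yield any n-gram
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    original = "See your dentist if a mouth ulcer does not heal within three weeks."
    simplified = "See your dentist if a mouth sore has not healed in three weeks."
    print(cosine_similarity(char_ngrams(original), char_ngrams(simplified)))
```

Character n-grams reward surface overlap, so a simplification that keeps key terms but rewrites sentence structure can still score high; this is why the study pairs the measure with LSA and sentence-content retention.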
  19. BMC Psychiatry. 2026 Jun 12.
       OBJECTIVE: Large language models (LLMs) are increasingly used by patients and caregivers as sources of health information. However, their performance in addressing attention-deficit/hyperactivity disorder (ADHD)-related questions has not been systematically compared. This study aimed to evaluate and compare the accuracy, reproducibility, quality, usefulness, and reliability of responses generated by ChatGPT (GPT-4o), Gemini, and DeepSeek R1.
    METHODS: In this cross-sectional comparative study, 22 commonly asked ADHD-related questions identified from publicly available digital sources were categorized into four domains: basic knowledge, diagnosis and assessment, treatment and medication, and long-term outcomes. Each question was presented to all three models using the same standardized prompts in separate chat sessions. The generated responses were independently evaluated by two specialists in child and adolescent psychiatry. Reproducibility was examined by repeating the same queries on different days. Descriptive statistics and non-parametric repeated-measures analyses were used to compare model performance.
    RESULTS: All models showed high overall accuracy, with mean scores of 91% for ChatGPT (GPT-4o), 89% for Gemini, and 87% for DeepSeek R1. Reproducibility followed a similar pattern (89%, 86%, and 84%, respectively). Gemini and DeepSeek performed relatively better in basic knowledge and diagnostic domains, whereas ChatGPT (GPT-4o) showed stronger performance in treatment and long-term outcome-related questions. Significant differences were observed in quality, usefulness, and reliability across models, with ChatGPT (GPT-4o) achieving the highest overall expert-rated scores.
    CONCLUSION: Although large language models generally provided accurate responses to ADHD-related questions, notable differences were observed in the depth, clarity, and clinical usefulness of the information across models. These systems may serve as supportive sources of information for patients and caregivers; however, their responses should be interpreted with caution and should not replace professional clinical evaluation or medical advice.
    Keywords:  Attention-deficit/hyperactivity disorder; Digital health information; Large language models
    DOI:  https://doi.org/10.1186/s12888-026-08280-x
  20. BMC Psychiatry. 2026 Jun 10.
       BACKGROUND: Bipolar disorder is a clinically sensitive and diagnostically complex condition in which unclear or incomplete psychoeducational information may contribute to misunderstanding of symptoms, delayed help-seeking, and unsafe interpretation of treatment options. Large language models are increasingly used as on-demand sources of mental health information, yet comparative evidence on the quality and readability of AI-generated information about bipolar disorder remains limited.
    METHODS: This cross-sectional content analysis evaluated 180 responses generated by ChatGPT, Gemini, and DeepSeek to 20 bipolar disorder-related questions derived from Google Trends. Each question was asked in three independent new sessions for each model. Information quality was assessed using the 20-item EQIP instrument, and readability was evaluated using Flesch-Kincaid Grade Level, Flesch Reading Ease, and word count. To address the non-independence of repeated responses nested within prompts, a linear mixed-effects model was used with AI model and question category as fixed effects and question ID as a random intercept.
    RESULTS: In the mixed-effects analysis, AI model significantly predicted EQIP scores. Compared with ChatGPT, Gemini and DeepSeek generated higher EQIP scores, with DeepSeek showing the largest estimated difference. Question category also contributed to information quality, although category-level pairwise comparisons did not remain significant after Bonferroni adjustment. Higher EQIP scores were moderately associated with longer responses and more favorable readability indices. Inter-rater analyses showed moderate absolute agreement for total EQIP scores and variable item-level agreement.
    CONCLUSIONS: Within the specific models, access conditions, prompts, date, and settings tested in this study, AI-generated bipolar disorder information differed across models in EQIP-rated quality and readability. These findings should be interpreted as content-quality findings rather than evidence of clinical accuracy, safety, or patient benefit. AI-generated psychoeducation should therefore be treated as a supplementary information source requiring expert review rather than a replacement for clinician-guided education.
    Keywords:  Artificial intelligence; Bipolar disorder; Digital mental health; EQIP; Large language models; Patient information; Psychoeducation; Readability
    DOI:  https://doi.org/10.1186/s12888-026-08262-z
  21. Cureus. 2026 May;18(5): e108426
      Background Clear, accurate, and empathetic communication is essential in pediatric anesthesia, where parental anxiety and information needs are high. Traditional patient information leaflets (PILs), while clinically robust, may lack emotional engagement. Large language model (LLM)-based chatbots, such as ChatGPT and Google Gemini, offer a novel, interactive approach to patient education, yet their role in pediatric anesthesia remains inadequately explored. Objective To evaluate and compare the readability, accuracy, completeness, sentiment, and parental satisfaction of artificial intelligence (AI)-generated patient education materials (ChatGPT and Google Gemini) with a clinician-authored departmental PIL (DPIL) for pediatric general anesthesia.  Methods This pilot cross-sectional study evaluated responses generated by ChatGPT and Google Gemini to seven frequently asked questions derived from the departmental PIL. Three blinded leaflets were presented in randomized order using a computer-generated sequence and evaluated by 10 anesthetists for accuracy and completeness using 10-point Likert scales. Readability was assessed using Flesch Reading Ease and Flesch-Kincaid Grade Level scores. Sentiment analysis and parental satisfaction were also assessed. Both descriptive and inferential statistical analyses were performed.  Results The DPIL demonstrated the highest readability, followed by ChatGPT, with Gemini scoring the lowest. All materials exceeded the recommended sixth-grade readability level. No significant differences were observed in accuracy or completeness among the three sources (p > 0.05). Parents consistently perceived ChatGPT responses as more reassuring and relatable, while the DPIL was viewed as informative but formal. Gemini responses were often considered linguistically complex. ChatGPT demonstrated a neutral and more empathetic sentiment compared with the other leaflets. 
Conclusion Clinician-authored PILs remain the most reliable source of pediatric anesthesia information. AI-generated content, particularly ChatGPT, may enhance clarity and emotional reassurance when used as a clinician-reviewed adjunct rather than a replacement.
    Keywords:  artificial intelligence and education; chatgpt; large language models (llms); medical communication; parental satisfaction; patient information leaflets; pediatric anaesthesia; readability score
    DOI:  https://doi.org/10.7759/cureus.108426
  22. Eur Spine J. 2026 Jun 10.
       AIM/BACKGROUND: Large language models (LLMs) are increasingly used by patients to obtain medical information. Adolescent idiopathic scoliosis (AIS), a chronic condition requiring long-term monitoring and treatment decisions, generates substantial demand for reliable and understandable patient education. Although LLMs may function as accessible explanatory tools, their suitability for patient-oriented use remains uncertain. This study aimed to perform an expert-led, patient-centered evaluation of two widely accessible LLMs, Claude Sonnet 4.5 and GPT 5.2, focusing on their ability to deliver accurate, clear, and conceptually adequate responses to common AIS-related patient questions.
    METHODS: A cross-sectional comparative design was used with 100 high-frequency patient questions covering ten clinical domains. Responses generated by both models using standardized zero-shot prompts were independently assessed by expert clinicians: factual accuracy by three raters (two orthopedic spine surgeons and one senior pediatric physiotherapist), and clarity and conceptual coverage by two raters (one surgeon and the physiotherapist). A structured evaluation framework examined three dichotomous dimensions relevant to patient education: factual accuracy, clarity and understandability, and conceptual coverage. Model performances were compared using McNemar's test, and inter-model agreement was assessed with Krippendorff's alpha.
    RESULTS: Both models demonstrated equally high factual accuracy (91%). However, clarity was limited, with only one-third of responses rated as sufficiently understandable. A significant difference was observed in conceptual coverage, with Claude Sonnet 4.5 outperforming GPT 5.2 (46% vs. 29%, p = 0.012), particularly in domains requiring integrative explanations.
    CONCLUSION: Despite strong factual accuracy, current LLMs show deficiencies in clarity and conceptual depth, limiting their reliability as standalone patient education tools for AIS. These findings highlight the necessity of clinician mediation and the importance of patient-centered evaluation criteria before clinical adoption.
    CLINICAL TRIAL REGISTRATION: As this study is not a clinical trial, clinical trial registration is not applicable.
    Keywords:  Adolescent idiopathic scoliosis; Artificial intelligence; Clinical communication; Health literacy; Large language models; Patient education
    DOI:  https://doi.org/10.1007/s00586-026-10046-8
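The paired comparison described in the abstract above (two models answering the same questions, each response rated on a dichotomous dimension) is the setting for McNemar's test, which uses only the discordant pairs. Below is a minimal sketch of the exact (binomial) two-sided version. The abstract does not report the discordant counts or which variant of the test was used, so the numbers in the usage example are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs where only model A met the criterion
    c: pairs where only model B met the criterion
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided under Binomial(n, 0.5), one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical discordant counts, not taken from the study.
    print(mcnemar_exact(b=20, c=6))
```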
  23. Neurosurgery. 2026 Jun 09.
       BACKGROUND AND OBJECTIVES: Effective patient education is essential in neurosurgery, but many materials exceed recommended readability levels, which can limit comprehension and informed consent. Simplification can also alter tone, potentially introducing bias. Recent studies have used large language models such as Chat Generative Pre-trained Transformer (ChatGPT) to simplify neurosurgical patient education materials (PEMs), but the impact of this process on sentiment and emotional tone remains unclear. Our objective was to assess the sentiment and emotional tone of neurosurgical PEMs before and after conversion to a lower reading level by ChatGPT.
    METHODS: A total of 336 neurosurgical PEMs covering stroke, spinal stenosis, hydrocephalus, epilepsy, and pituitary brain tumors were analyzed for readability, sentiment, and emotion. Each was then simplified to a seventh grade level using GPT-4.0. Readability was evaluated using Flesch-Kincaid Grade, Flesch Reading Ease, Gunning Fog Index, Automated Readability Index, Coleman-Liau Index, and Simple Measure of Gobbledygook. Sentiment and emotional tone were described using the Valence Aware Dictionary and sEntiment Reasoner (VADER) algorithm and National Research Council Canada Emotion Lexicon. Paired statistical t-tests assessed the significance of changes.
    RESULTS: Simplification produced substantial improvements in readability across all 6 indices and all neurosurgical topics (P < .001). Sentiment shifted toward increased positivity, reflected by higher VADER compound scores, more positive tokens, and fewer neutral tokens. Disgust decreased significantly across every topic, whereas sadness, surprise, and joy increased modestly; fear and anger showed no significant change. Topic-level analyses mirrored global patterns, demonstrating consistent directional effects. Overall, simplification achieved large readability gains while introducing small but measurable alterations in emotional tone.
    CONCLUSION: The decrease in neutral and negative sentiment suggests a shift toward more persuasive language. Modest but consistent shifts in sentiment and emotional tone accompanying artificial intelligence-assisted simplification highlight the potential for unintended affective shifts during artificial intelligence simplification and warrant monitoring when deploying large language models for patient-facing materials. Current PEMs pose a communication barrier between patient and provider, but providers must be careful.
    Keywords:  Chat Generative Pre-trained Transformer; ChatGPT; Large language models; Patient education materials; Readability
    DOI:  https://doi.org/10.1227/neu.0000000000004107
  24. Healthcare (Basel). 2026 Jun 01. pii: 1535. [Epub ahead of print]14(11):
       BACKGROUND/OBJECTIVES: Large language models (LLMs) are increasingly consulted for information about cleft lip and palate (CLP), yet the reliability of their outputs across clinical domains has not been evaluated. This study aimed to compare the quality of CLP-related information generated by GPT-4o and Gemini 2.5 Pro across multiple thematic domains using a validated quality instrument and a reliability-first analytic framework.
    METHODS: Fifty-four standardized CLP questions across six domains were submitted to GPT-4o (OpenAI) and Gemini 2.5 Pro (Google DeepMind) on 25 September 2024 via their public interfaces, using new, history-free sessions and default settings, yielding 108 responses. Three independent, CLP-experienced raters scored each response using the Global Quality Score (GQS; 1-5 scale assessing accuracy, completeness, and clinical usefulness). Before comparing models, we applied a reliability-first filter: only domains where all three raters showed substantial agreement (Fleiss' kappa [κ] ≥ 0.60) were included in statistical comparisons. Domains that failed this threshold were analyzed qualitatively to identify the source of disagreement. A descriptive taxonomy of errors was developed for low-scoring responses.
    RESULTS: Three domains met the reliability threshold (General Care Information, General Cleft Information, and Pre-Treatment Information; 30 paired questions). Both models performed at a high and practically equivalent level: GPT-4o median GQS 4.33 (IQR 4.00-5.00) versus Gemini 2.5 Pro 5.00 (IQR 4.00-5.00); the difference was not statistically significant (Wilcoxon V = 139.00, p = 0.691; Hodges-Lehmann median difference 0.00, 95% CI -0.33 to 0.67). Three domains were excluded because rater agreement was insufficient; qualitative review showed this reflected genuine clinical practice variation rather than clear model errors. The most common inaccuracies were overgeneralization of outcomes, outdated surgical timing, and omission of multidisciplinary team roles.
    CONCLUSIONS: Both models provided high-quality CLP information in domains supported by clinical consensus, indicating they may serve as useful adjuncts for general patient and family counseling. Clinicians should, however, verify any treatment-specific content against current institutional protocols before relaying it to patients. Future research should assess readability, alignment with health literacy, and patient comprehension of AI-generated CLP information.
    Keywords:  ChatGPT; GPT-4o; Gemini; Global Quality Score; cleft lip and palate; health information quality; inter-rater reliability; large language model
    DOI:  https://doi.org/10.3390/healthcare14111535
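The "reliability-first filter" in the abstract above keeps a domain for statistical comparison only when Fleiss' kappa across the three raters reaches 0.60. A minimal sketch of that gate follows. It treats the five GQS levels as nominal categories, which is an assumption about how the authors computed kappa, and the function names and example data are hypothetical.

```python
def fleiss_kappa(table: list[list[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    table[i][j] = number of raters who assigned item i to category j;
    every row must sum to the same number of raters (at least 2).
    """
    n_items = len(table)
    n_raters = sum(table[0])
    # Per-item observed agreement.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    n_categories = len(table[0])
    p_cat = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        return float("nan")  # undefined when every rating falls in one category
    return (p_bar - p_e) / (1 - p_e)


def passes_reliability_filter(table: list[list[int]], threshold: float = 0.60) -> bool:
    """True when agreement is high enough to include the domain in model comparisons."""
    return fleiss_kappa(table) >= threshold


if __name__ == "__main__":
    # Hypothetical domain: 4 responses, 3 raters, GQS levels 1..5 as columns.
    domain = [[0, 0, 0, 0, 3], [0, 0, 0, 3, 0], [0, 0, 0, 1, 2], [0, 0, 0, 3, 0]]
    print(fleiss_kappa(domain), passes_reliability_filter(domain))
```

Gating on agreement first means that a null difference between models is only interpreted where the raters themselves agree on what a good answer looks like.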
  25. Sci Rep. 2026 Jun 06.
      Large language models (LLMs) have demonstrated strong performance in answering knowledge-based questions in healthcare education. Specialty examinations offer a standardized and objective framework to assess these capabilities. However, to date, no study has evaluated LLM performance on the Turkish Pharmacy Specialty Examination (EUS), a nationally standardized exam used to admit candidates to pharmacy specialization programs. Therefore, this study aimed to comparatively evaluate LLM performance on EUS questions in terms of accuracy, self-reported confidence, and readability. This study conducted a comparative evaluation of three LLMs (ChatGPT-5.1, DeepSeek-R1, and Gemini 2.5 Flash) using 84 publicly available multiple-choice questions from the EUS between 2017 and 2025. Each question was submitted to each model in a separate, newly initiated session using a standardized prompt. Model performance was assessed based on answer accuracy, self-reported confidence (1-5 scale), and readability of generated responses, using the Flesch Reading Ease (FRE), Gunning Fog Index (GFI), and Simple Measure of Gobbledygook (SMOG) indices. All statistical analyses were performed using non-parametric repeated-measures methods, including Cochran's Q test for paired categorical comparisons and the Friedman test with Durbin-Conover post-hoc analyses for readability scores, with two-tailed significance set at p < 0.05. Overall, the evaluated LLMs exhibited high performance. Gemini 2.5 Flash achieved the highest overall accuracy rate (92.9%), followed by ChatGPT-5.1 (90.5%) and DeepSeek-R1 (89.3%), with no statistically significant difference among the models (p = 0.584). Self-reported confidence was predominantly maximal (5/5), with ChatGPT-5.1, DeepSeek-R1, and Gemini 2.5 Flash assigning maximum confidence to 87.5, 55.6, and 66.7% of incorrect responses, respectively. Significant differences in readability were observed among the evaluated LLMs. 
ChatGPT-5.1 generated texts with lower GFI and SMOG scores compared with DeepSeek-R1 and Gemini 2.5 Flash (p < 0.05), indicating lower linguistic complexity. No statistically significant differences were identified among models for FRE. LLMs demonstrated high and comparable accuracy when answering domain-specific pharmacy examination questions; however, occasional overconfidence in incorrect responses highlights the need for careful oversight. Differences in linguistic complexity underscore the importance of selecting models optimized for readability in educational settings. Overall, these findings suggest that LLMs may have potential as supplementary tools in pharmacy education within examination-based contexts, provided that expert guidance and critical appraisal are maintained to ensure reliability and clarity.
    Keywords:  Artificial intelligence; ChatGPT; DeepSeek; Gemini; Large language models; Pharmacy specialty examination
    DOI:  https://doi.org/10.1038/s41598-026-57001-7
  26. Naunyn Schmiedebergs Arch Pharmacol. 2026 Jun 12.
      The increasing use of oral anticancer drugs (OADs) in cancer therapy shifts greater responsibility towards patients, thereby also placing a higher informational burden on them. While intensified pharmacological/pharmaceutical care programs have proven beneficial for patients undergoing OAD treatment, their universal availability is currently limited. Given that patients frequently seek health information online, AI-powered chatbots may present a promising resource to address these increasing, yet often unmet information needs. This study aims to evaluate the readability, completeness of relevant information, and accuracy of answers provided by AI-powered chatbots in response to patient questions about OAD treatment. Microsoft Bing's Copilot and Google's Gemini were queried in triplicate in June 2024 on four patient questions regarding ten commonly prescribed and ten recently approved OADs. Readability of chatbot answers was assessed using the Flesch reading-ease score (scale 0-100). Completeness of relevant information and accuracy were evaluated based on corresponding standardized written patient information materials. Both chatbots' answers demonstrated low readability according to the overall mean Flesch reading-ease scores of 38.8 (Copilot) and 50.9 (Gemini). Overall median completeness of relevant information of Copilot's and Gemini's answers was 61.1% (IQR, 35.3-78.7%) and 73.8% (IQR, 50.0-100.0%), respectively. Conversely, accuracy of chatbot answers was consistently high, with an overall median accuracy of 100.0% (IQR, 83.3-100.0%) for Copilot and 100.0% (IQR, 98.5-100.0%) for Gemini. AI-powered chatbots provide overall accurate information on OADs. However, their moderate completeness of relevant information and low readability may limit their current practical utility in meeting cancer patients' information needs.
    Keywords:  Artificial intelligence; Drug safety; Large language model; Medication safety; Oral anticancer drugs; Patient safety
    DOI:  https://doi.org/10.1007/s00210-026-05561-w
  27. Eur J Dent Educ. 2026 Jun 12.
       OBJECTIVES: The performance of five popular, widely available large language models (LLMs), namely ChatGPT-4o, Gemini 2.5 Flash, Llama 4, DeepSeek-V3, and Microsoft Copilot, in operative dentistry education was evaluated using a multiple-choice question-based assessment system.
    MATERIAL AND METHODS: The assessment used a set of 150 MCQs covering endodontics, dental caries, paediatric, preventive, aesthetic and restorative dentistry, biomaterials, and periodontics. The LLMs' performance was assessed using classification metrics (accuracy, sensitivity, predictive reliability), textual similarity metrics (BLEU score, cosine similarity, Word Error Rate), and readability metrics (Flesch Reading Ease score).
    RESULTS: The highest classification accuracy was achieved by Gemini 2.5 Flash and ChatGPT-4o, which also showed high sensitivity and high overall predictive reliability. ChatGPT-4o showed the greatest textual similarity to the reference answers, with a BLEU score of 0.10 ± 0.0279, a high cosine similarity of 0.48 ± 0.0422, a relatively low Word Error Rate (WER) of 5.57 ± 0.7301, and a Flesch Reading Ease score of 13.53 ± 4.9449.
    CONCLUSION: In medical education, ChatGPT-4o exhibited the highest accuracy, reference textual overlap, semantic alignment, and readability, together with fewer errors, among the five evaluated LLMs, making it a valuable assistant for dental healthcare professionals.
    DOI:  https://doi.org/10.1111/eje.70213
  28. Paediatr Child Health. 2026 Jun;31(4): 299-309
       Objectives: To conduct an inductive content analysis of publicly available online resources to describe existing content provided by Canadian government and social service health organizations for parents of infants.
    Methods: Online resources were obtained as part of a larger environmental scan on Canadian infant sleep information. Inclusion criteria required that resources be written in English or French, focus on sleep in infants under 2 years, be created by a Canadian governmental or health service organization, and include behavioural infant sleep information. Resources were independently coded by two research assistants using inductive content analysis.
    Results: A total of 51 resources were identified, with most being in English (n = 45, 88.24%). Most resources contained information on developing a bedtime routine (n = 41, 80.39%), feeding (n = 44, 86.27%), and infant sleep patterns and needs (n = 48, 94.12%). Information on sleep patterns and needs mainly focused on age-specific sleep considerations. Information on individual and family considerations and on the causes of sleep difficulties and disruptions was less common.
    Conclusions: This content analysis demonstrated that the majority of content on infant sleep from Canadian government and social service organizations is related to age-specific sleep patterns and/or infant feeding. Less information is available on how parents can influence infant sleep (particularly when experiencing difficulties with infant sleep), individual and family considerations (especially on cultural considerations and family well-being), and causes of sleep difficulties. Future research and resource development could address potential knowledge gaps by using community co-design and collaboration.
    Keywords:  Health information; Health promotion; Infant; Parent; Qualitative research; Sleep
    DOI:  https://doi.org/10.1093/pch/pxag018
  29. Int J Gynaecol Obstet. 2026 Jun 11.
      This study aimed to systematically evaluate the quality of widely accessible, English-language, patient-facing resources on heavy menstrual bleeding (HMB), identifying strengths, gaps, and opportunities for improvement. Resources were obtained from (1) a systematic Google Trends-informed Google search and (2) submissions from subject matter experts. Eligible resources were independently reviewed and scored using validated tools for credibility (QUEST), aesthetics (modified Abbott's scale), and readability. To evaluate clinical utility, resources were graded using a tool designed by international experts for comprehensiveness and accuracy across six domains. Inter-rater reliability was assessed with the intraclass correlation coefficient (ICC) and Cohen's kappa. Of 353 resources identified, 63 met the inclusion criteria. Mean QUEST score (maximum = 28) was 18.7 ± 5.1, with 4/63 (6%) resources meeting all credibility criteria. Mean aesthetics score (maximum = 7) was 5.3 ± 1.1, with only 15/63 (24%) resources including graphics. Most resources (40/63, 64%) were written at a grade 10-12 reading level. Mean comprehensiveness score (maximum = 6) was 4.6 ± 0.8, with gaps most often observed in advice on when to seek emergency care and information about iron deficiency. Mean accuracy score (maximum = 12) was 8.0 ± 2.3, with only 3/63 (13%) resources containing completely accurate content. There was moderate-to-substantial agreement across reviewers (ICC = 0.60-0.65 for aesthetics and credibility; κ = 0.38-0.49 for accuracy and comprehensiveness). In conclusion, most online resources accessed by patients are moderately credible, difficult to read, and underutilize figures. Content gaps relate mainly to education on safety and iron deficiency. Very few resources are entirely accurate. We recommend targeted revisions to patient education resources to better inform and empower patients experiencing HMB.
    Keywords:  digital health resources; health communication; health literacy; information‐seeking behavior; menorrhagia; patient education
    DOI:  https://doi.org/10.1002/ijgo.71102
  30. Lupus. 2026 Jun 08. 9612033261458430
       Objective: We evaluated the reading and comprehension levels of systemic lupus erythematosus (SLE) patient educational materials (PEMs) available online from both nonprofit and for-profit organizations.
    Methods: We analyzed PEMs from four nonprofit organizations (American College of Rheumatology [ACR], Lupus Foundation of America [LFA], Lupus Research Alliance [LRA], Lupus Society of Illinois [LSI]) and the platforms of three for-profit companies (Aurinia, AstraZeneca, GlaxoSmithKline [GSK]). Reading and comprehension scores were calculated using six standard tools, and comparisons were performed using one-way ANOVA and Tukey's post-hoc analysis. A p-value ≤0.05 was considered statistically significant.
    Results: The average Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES) across all PEMs were 10.05 ± 0.70 and 52.68 ± 4.10, respectively. Materials from nonprofit organizations had a FKGL of 10.35 ± 0.88 and a FRES of 51.19 ± 4.91, indicating a reading level requiring 10th- to 12th-grade proficiency. For-profit organizations had a slightly lower FKGL of 9.98 ± 0.42 and a higher FRES of 54.35 ± 2.65. These differences were not statistically significant (p = 0.53 and 0.16). However, significant within-group differences were observed. Among nonprofits, PEMs from LFA had the most favorable readability metrics (FKGL 9.19 ± 0.44, FRES 58.34 ± 3.11), compared to ACR (10.64 ± 0.56, 49.81 ± 2.24), LRA (10.53 ± 0.61, 50.54 ± 2.69), and LSI (11.28 ± 0.45, 45.64 ± 2.03) (p < 0.01). Among for-profits, PEMs from Aurinia had significantly better readability scores (FKGL 9.59 ± 0.46, FRES 55.77 ± 3.05) than those from AstraZeneca (9.93 ± 0.38, 53.13 ± 2.48) and GSK (9.86 ± 0.25, 54.28 ± 2.16) (p < 0.001).
    Conclusion: Most SLE patient education materials available online are written at or above a 10th-grade level. These findings highlight the urgent need to improve PEM readability to support patients with lower health literacy.
    Keywords:  health literacy; patient education materials; patient-centered care; reading and comprehension; systemic lupus erythematosus
    DOI:  https://doi.org/10.1177/09612033261458430
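    Note on the readability metrics in item 30: FKGL and FRES are both computed from average sentence length and average syllables per word. The sketch below uses the standard published Flesch and Flesch-Kincaid coefficients; the function names and the sample counts are hypothetical, and the abstract does not say which software or syllable-counting rules the authors used, so this illustrates the formulas only, not the study's pipeline.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score: higher values mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade needed."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical counts for a short passage (not data from the study).
    words, sentences, syllables = 100, 5, 150
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
```

    Because both scores depend on the same two ratios, shortening sentences and choosing words with fewer syllables raises FRES and lowers FKGL together.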
  31. Surg Endosc. 2026 Jun;40(6): 5123-5135
       BACKGROUND: The internet has become a central and increasingly influential source of health information for patients around the world. Patients commonly use the internet to seek information about specific treatments and to connect with others with similar health concerns. However, the quality, reliability, and integrity of online health information resources remain a concern as many websites often lack quality control, peer review, or compliance with up-to-date guidelines. We aimed to evaluate the quality and readability of online information regarding bariatric surgery.
    METHODS: A web-based search using the terms "bariatric surgery" and "weight-loss surgery" was performed on the top search engines. The first 10 results from each search engine were reviewed and evaluated using validated instruments, including the JAMA Benchmarks, DISCERN, EQIP, and the Global Quality Scale, together with standardized readability and visibility indices. Visual content and images representing bariatric surgery were also evaluated regarding their context and representation.
    RESULTS: Seventeen bariatric surgery websites were included. Quality and integrity scores were inconsistent across validated instruments, with a median JAMA score of 1 (IQR: 0-3) and 29.4% of websites meeting no transparency criteria. Critical patient-centered content domains were frequently underrepresented, such as risk reporting (5.9%), long-term outcomes (41.2%), financial considerations (11.8%), and shared decision-making support (58.8%). Readability of content was a common issue (median Flesch-Kincaid Grade Level 12.0; IQR: 9.9-14.6). Bariatric surgery-related visual content was mainly outcome-focused and showed imbalances in gender and racial representation (71.4% female and 83.0% lighter skin tones). More complex bariatric surgeries and non-surgical obesity treatments were inconsistently discussed (BPD/DS 41.2%, SADI-S 29.4%, medical therapy 23.5%).
    CONCLUSION: Despite high visibility, popular web-based resources for bariatric surgery demonstrate major deficiencies in transparency, readability, and patient-centered content, and may hinder the informed consent and shared decision-making process, highlighting the need for substantial improvement in the quality and integrity of bariatric surgery websites.
    Keywords:  Bariatric surgery; Internet; Patient education; Patient information; Websites
    DOI:  https://doi.org/10.1007/s00464-026-12845-y
  32. Cureus. 2026 May;18(5): e108351
      Background: Water birth is increasingly popular, yet debates persist regarding its safety and efficacy. Publicly available water birth content must be accurate to support informed decision-making, so its accuracy and sentiment need to be assessed. Objective: To identify whether certain YouTube video characteristics play a role in determining whether the contents of the video are aligned with the published recommendations set forth by the American College of Obstetricians and Gynecologists (ACOG). Materials and methods: An analysis of the top 100 English-language YouTube videos on "water birth" sorted by "most viewed" was conducted on March 9, 2023. After inclusion and exclusion criteria were applied, the remaining videos were included in the analysis. Video characteristics were recorded. Video accuracy was assessed against 16 ACOG water birth guidelines, each scored 1 (accurate), 0 (not mentioned), or -1 (inaccurate). Transcripts were analyzed using MonkeyLearn Sentiment Analyzer to determine sentiment. Results: Of the videos analyzed, the majority were neutral in their accuracy, while a smaller proportion were deemed accurate or contained inaccuracies. Critical safety topics, such as umbilical cord avulsion or neonatal infection risks, were almost universally omitted. Videos created by healthcare professionals demonstrated greater accuracy, while personal vlogs were predominantly neutral. Sentiment analysis revealed that most videos conveyed a negative sentiment, followed by positive and then neutral sentiment. Notable geographic disparities were observed, with North American content exhibiting greater emotional polarization compared to international content. Conclusion: Most widely viewed YouTube content on water birth lacks alignment with ACOG guidelines, particularly regarding risk communication, posing misinformation risks.
    Keywords:  health misinformation; maternal health care; maternal medicine; water birth; youtube videos
    DOI:  https://doi.org/10.7759/cureus.108351
  33. Cureus. 2026 May;18(5): e108600
      Aims and objectives: Social media platforms such as YouTube allow adolescents and young adults to document substance use and share related beliefs with large audiences. This study primarily aimed to identify and describe hazardous alcohol consumption methods depicted in highly viewed YouTube videos. Secondary exploratory objectives were to characterize the apparent demographic characteristics of on-screen performers, quantify video popularity and engagement, and evaluate the quality, reliability, and scientific accuracy of alcohol-related information presented in these videos. Methods: We conducted a retrospective content analysis of YouTube videos identified between January and March 2025 using 48 search terms related to hazardous alcohol use. Key variables included the number of views, the number and apparent characteristics of participants, and the method of alcohol consumption depicted. Video content and quality were assessed using the global quality score (GQS) and a modified DISCERN reliability score, and explicit scientific claims were classified as substantiated or unsubstantiated by two board-certified toxicologists. Results: A total of 278 videos involving 15 distinct methods of alcohol consumption were analyzed, most of which met predefined criteria for hazardous use. Risky practices included alcohol inhalation, alcohol enemas, vodka eyeballing, drunkorexia, funneling, drinking hand sanitizer, marijuana moonshine, and alcohol-soaked tampons, among others. Only three videos (1.1%) contained trigger warnings. Collectively, the videos were viewed 75 million times (mean 269,784 views) and liked five million times, and they featured 722 participants or observers, predominantly male, Caucasian, and aged 21-25 years. The median GQS and reliability scores were 1 (interquartile range [IQR] 2-3) and 1 (IQR 1-2), respectively, and 78.7% (159/202) of scientific claims in informational videos conflicted with published toxicology literature. Interrater agreement was substantial to excellent (Cohen's kappa 0.66-0.76). Conclusions: Hazardous alcohol use is highly visible in popular YouTube videos, which rarely include accurate risk information or explicit harm-reduction messages. These low-quality, often misleading depictions suggest that YouTube and similar platforms may contribute to the alcohol-related informational environment for youth and could serve as venues for future efforts to address misinformation and hazardous drinking norms.
    Keywords:  alcohol; content analysis; hazardous alcohol consumption; hazardous effects; social media; toxicology; youtube
    DOI:  https://doi.org/10.7759/cureus.108600
  34. Medicine (Baltimore). 2026 Jun 05. 105(23): e49173
      This study aimed to assess the quality and reliability of health information in the 100 most-viewed YouTube videos related to semaglutide for weight loss, as of December 2024. The study also explored the relationship between engagement metrics and content quality, with attention to the prevalence of misinformation. A cross-sectional evaluation was conducted in December 2024. The top 100 English-language YouTube videos retrieved using the search term "semaglutide weight loss" were analyzed. Each video was assessed using 2 validated tools: the Global Quality Score and the Modified DISCERN (quality assessment tool for consumer health information) instrument. Viewer engagement data - including likes, comments, and views - were recorded. Statistical analyses included descriptive statistics and multiple linear regression to examine relationships between engagement metrics and content quality. Videos from academic and healthcare-affiliated sources generally scored higher in quality assessments, while those produced by individual users tended to lack source citations and balanced information. Although certain engagement metrics, such as the number of likes and comments, showed modest associations with higher Global Quality Scores, view count did not consistently predict quality. A notable portion of user-generated videos lacked discussion of semaglutide's risks and contraindications. The study highlights the variability in quality among semaglutide-related videos on YouTube. Engagement does not necessarily reflect the reliability of content, underscoring the importance of guiding viewers toward credible health sources. Enhancing digital health literacy and promoting greater visibility of evidence-based content may help improve the quality of health information encountered on widely used digital platforms.
    Keywords:  DISCERN; Global Quality Score; Semaglutide; YouTube; digital literacy; health communication; misinformation; obesity
    DOI:  https://doi.org/10.1097/MD.0000000000049173
  35. J Craniofac Surg. 2026 Jun 08.
       BACKGROUND: Infantile hemangiomas are a benign vascular abnormality that presents in infancy. Parents may seek a variety of resources to gather information, including social media. This study seeks to compare information on infantile hemangiomas on social media platforms.
    METHODS: The top videos across YouTube, Instagram, and TikTok were analyzed. A total of 361 videos were screened, with 150 included in the final analysis. Engagement was assessed by the number of views. Reliability was determined using the modified DISCERN (mDISCERN) score. A Kruskal-Wallis test was used to compare mean mDISCERN scores across platforms.
    RESULTS: Fifty videos from each platform were included. The videos cumulatively had 1,239,836 likes and 27,416,017 views. Videos were predominantly posted by family members (N=63) and health care providers (N=44). Video content largely focused on patient stories (N=77) and clinician explanations (N=63). There was a significant difference in mDISCERN between platforms (P<0.001). YouTube videos scored significantly higher than both Instagram and TikTok (both P<0.001). No difference was noted in mDISCERN between Instagram and TikTok (P=0.073). There was a significant difference in the number of views across the different platforms (P<0.001). TikTok videos had significantly more views than both YouTube and Instagram (both P<0.01). There was no significant difference in the number of views between YouTube and Instagram.
    CONCLUSIONS: There are a significant number of videos available on social media discussing infantile hemangiomas. TikTok had the most user engagement despite having less reliable videos than YouTube. Improving the reliability of videos on social media is necessary to decrease the dissemination of misinformation.
    Keywords:  Infantile hemangioma; modified DISCERN; reliability; social media; strawberry hemangioma
    DOI:  https://doi.org/10.1097/SCS.0000000000012916
  36. J Cardiothorac Surg. 2026 Jun 06.
       BACKGROUND: The rapid expansion of short-form educational video platforms has substantially increased public access to health information; however, the characteristics and quality of videos concerning patent ductus arteriosus (PDA) have not been systematically evaluated. This study aimed to evaluate the quality and reliability of short-form videos related to PDA posted on TikTok and Bilibili.
    METHODS: The Chinese keyword "patent ductus arteriosus" was used to retrieve relevant videos from TikTok and Bilibili, yielding 140 videos for the final analysis. Uploaders were classified according to publicly available account information. Professional uploaders were defined as accounts identifying the uploader as a healthcare professional and displaying official platform verification and/or an explicit affiliation with a recognized medical institution. Credentials were verified using publicly visible profile elements, including verification badges, profile descriptions, professional titles, and stated institutional affiliations. All included videos were independently evaluated by two reviewers. Because paired reviewer-level ratings were available for the Global Quality Score (GQS), inter-rater reliability for GQS was assessed before consensus adjudication using the intraclass correlation coefficient (ICC) and quadratic weighted Cohen's kappa. Video quality and reliability were assessed using five established instruments: the Global Quality Score (GQS), Video Information and Quality Index (VIQI), Patient Education Materials Assessment Tool (PEMAT), the JAMA benchmark criteria, and modified DISCERN (mDISCERN). Only the first 100 algorithm-ranked videos from each platform were screened, in order to reflect the content most likely to be encountered by typical users, although this approach may preferentially capture videos favored by platform recommendation systems. No independent clinical subject-matter expert (such as a neonatologist or cardiologist) was separately involved in the formal scoring process; instead, the evaluation focused on quality, reliability, transparency, and understandability using established assessment instruments. Clinical accuracy was not independently assessed or adjudicated in this study.
    RESULTS: A total of 140 short videos related to patent ductus arteriosus (PDA) were included in the analysis, with 57 from Bilibili and 83 from TikTok. TikTok videos demonstrated significantly higher audience engagement than those on Bilibili, with markedly greater numbers of likes, favorites, shares, and comments. Bilibili videos were slightly longer in duration, and there was no significant difference in posting time between the two platforms. Videos on TikTok also achieved significantly higher scores across all five quality assessment tools (mDISCERN, GQS, VIQI, PEMAT, and the JAMA benchmark), and most high-quality videos were uploaded by professional individuals. In the present study, these professional individuals were defined on the basis of publicly visible healthcare-related identity information and platform verification status. When stratified by uploader type, videos created by professionals consistently outperformed those from non-professional individuals and institutions in both quality scores and engagement metrics. Professional videos were predominantly found on TikTok. Correlation analyses indicated weak to moderate positive associations between most quality indicators and likes, favorites, and shares on both platforms, although the correlation coefficients remained low. Notably, the average JAMA benchmark score was approximately half of the maximum possible score on both platforms. Inter-rater reliability for GQS was acceptable, with a single-measure ICC of 0.632, an average-measure ICC of 0.774, and a quadratic weighted Cohen's kappa of 0.630.
    CONCLUSIONS: The overall quality of PDA-related health information on major Chinese short-video platforms appears to be moderate. TikTok and professional uploaders demonstrated clear advantages in reliability, comprehensibility, and communication effectiveness. Platform attributes and uploader background exert significant influence on video quality and dissemination performance. Future efforts should focus on strengthening platform oversight, encouraging greater involvement of qualified healthcare professionals, and standardizing the disclosure of information sources and conflicts of interest. Such measures are essential for improving the accuracy, quality, and trustworthiness of online cardiovascular health information and for better supporting parents of children with PDA and the general public. These findings should be interpreted as reflecting informational quality, structure, transparency, and understandability rather than independently verified clinical accuracy.
    Keywords:  Bilibili; Health Information; Patent ductus arteriosus; Social Media; TikTok; Video Quality
    DOI:  https://doi.org/10.1186/s13019-026-04393-2
  37. BMC Public Health. 2026 Jun 10.
       BACKGROUND: Allergic Rhinitis (AR) affects over 500 million people globally, posing a significant public health burden. Video-sharing platforms like YouTube and Bilibili have become primary sources of health information. This study aimed to compare content quality, user engagement, and alignment with seasonal search trends of AR-related videos on these two culturally distinct platforms.
    METHODS: We retrieved the top 200 AR-related videos from each platform (keywords: "allergic rhinitis" for YouTube, "" for Bilibili) published between January 2022 and January 2025. After applying predefined exclusion criteria (irrelevance, non-English YouTube videos, out-of-timeframe, advertisements), 240 videos (91 YouTube, 149 Bilibili) were retained. Quality was assessed using the Patient Education Material Assessment Tool (PEMAT), Video Information Quality Index (VIQI), Global Quality Scale (GQS), and the Modified DISCERN Scale (mDISCERN). We also integrated Google Trends and Baidu Index data to analyze video characteristics, engagement, and correlations with seasonal search trends.
    RESULTS: AR-related video quality was generally low, though YouTube scored significantly higher than Bilibili in median PEMAT-Total (76.5 vs. 71.4), GQS (4 vs. 3), and mDISCERN (3 vs. 1) (all P < 0.001). YouTube featured more content from medical professionals (33.3%, n = 30/91), whereas Bilibili's was predominantly from non-professionals (61.4%, n = 92/149). Bilibili showed higher user engagement, with greater interactions through comments and donations. Both platforms showed spring and autumn search peaks, coinciding with allergen seasons. A moderate positive correlation emerged between Bilibili videos and Baidu Index (r = 0.37; P = 0.03), while YouTube videos correlated strongly with Google Trends (r = 0.63; P = 0.03). Professional content scored higher on both platforms, but treatment-related Bilibili videos correlated negatively with search volume (r = -0.69; P = 0.01), signaling potential misinformation risks during peak periods.
    CONCLUSIONS: For AR-related videos, YouTube offers better content quality, while Bilibili excels in user engagement. Both platforms need improved content quality and coverage. A seasonal "demand-content" loop exists between AR search trends and video content, carrying misinformation risks during peak periods. We recommend year-round promotion of evidence-based content, adding medical warnings to misleading information, and encouraging collaboration between medical professionals and social media creators.
    Keywords:  Allergic rhinitis; Baidu Index; Bilibili; Cross-sectional study; Google Trends; Public engagement; Public health; Search engine; Seasonal trend; Social media; YouTube
    DOI:  https://doi.org/10.1186/s12889-026-28015-7
  38. BMC Public Health. 2026 Jun 13.
       BACKGROUND: Long COVID, characterized by persistent symptoms such as fatigue and cognitive impairment, continues to pose a global public health burden. Bilibili and TikTok are major platforms for public health information; however, the lack of systematic evaluation contributes to low-quality information and may hinder effective health management. We aimed to evaluate the quality and reliability of long COVID-related videos on these platforms.
    METHODS: This cross-sectional study analyzed long COVID-related videos from Bilibili and TikTok. Inter-rater reliability was assessed using Cohen's kappa coefficient. Quality and reliability were evaluated using the Global Quality Score (GQS) and modified DISCERN (mDISCERN). Multivariate linear regression was performed to identify the independent predictors of video quality.
    RESULTS: The median GQS and mDISCERN scores were both significantly higher for Bilibili than for TikTok (P = 0.016 and P = 0.039, respectively). Professional-source videos showed significantly higher quality than non-professional ones (P < 0.001). Regression analyses revealed that a professional source was the strongest independent predictor of both GQS and mDISCERN scores (all P < 0.001). Video duration and shares were significantly, albeit weakly, associated with the GQS, whereas other engagement metrics were not.
    CONCLUSIONS: The overall quality of long COVID videos was suboptimal. Professional source was the primary independent predictor of higher quality, while engagement and duration showed limited influence. Strengthening platform review mechanisms and promoting evidence-based information are needed to improve public health communication.
    Keywords:  Bilibili; Content reliability; Global Quality Score (GQS); Health information quality; Long COVID; Modified DISCERN (mDISCERN); Professional sources; Short-video platforms; TikTok; User engagement
    DOI:  https://doi.org/10.1186/s12889-026-27621-9
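    Note on the inter-rater statistic used in item 38 and several other entries: unweighted Cohen's kappa compares the agreement observed between two raters with the agreement expected by chance from each rater's own label frequencies. The sketch below is a minimal illustration of that definition; the function name and the example ratings are hypothetical and are not data from any of the studies listed here. Some entries report a quadratic weighted kappa, which this sketch does not cover.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical ratings of four videos (1 = high quality, 0 = low quality).
    reviewer_1 = [1, 1, 0, 0]
    reviewer_2 = [1, 0, 0, 0]
    print(cohens_kappa(reviewer_1, reviewer_2))
```

    A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is why it is reported alongside or instead of raw percent agreement.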
  39. Naunyn Schmiedebergs Arch Pharmacol. 2026 Jun 12.
      This cross-sectional study aimed to systematically evaluate the information quality and reliability of popular science content related to drug-induced liver injury (DILI) on two major Chinese video platforms, TikTok (Douyin) and BiliBili, and to analyze its content characteristics. On December 20, 2025, searches were conducted in the Chinese versions of the TikTok (Douyin) and BiliBili mobile applications using "" as the single keyword. Content that did not meet the requirements was excluded according to the exclusion criteria, and the top 100 eligible videos from each platform were retained for analysis (N = 200). Two trained reviewers independently performed blinded assessments using the Global Quality Score (GQS), Journal of the American Medical Association (JAMA) benchmark criteria, and the modified DISCERN (mDISCERN) tool. Non-parametric tests were used to compare differences between groups, and Spearman correlation analysis was employed to explore the relationship between video characteristics and quality scores. User engagement metrics (likes, favorites, shares) for TikTok (Douyin) videos were significantly higher than those for BiliBili (p < 0.001). In terms of information quality, TikTok (Douyin) videos scored significantly higher than BiliBili on the GQS, JAMA, and mDISCERN scales (p < 0.001). Quality differed across content types: videos on "medication knowledge" received the highest mDISCERN reliability scores, and "disease knowledge" videos scored higher in GQS practicality. Correlation analysis showed a weak positive correlation between user engagement metrics and mDISCERN scores. Among the 107 videos mentioning liver injury-related drugs, chemical drugs (antibacterial, chemotherapeutic, anti-tuberculosis, and anti-inflammatory drugs) and traditional Chinese medicines (such as He Shou Wu) were mentioned most frequently. In this cross-sectional sample, TikTok (Douyin) videos demonstrated higher quality scores and user engagement than those on BiliBili, while professionals outperformed general users only on the JAMA criteria. Although some videos mentioned medications associated with liver injury, the information was generally oversimplified and biased toward trending topics. Hence, active information seekers should critically appraise the scientific soundness of medical short videos on platforms like TikTok and BiliBili before making healthcare decisions.
    Keywords:  BiliBili; Drug-induced liver injury; Information quality; Short videos; TikTok
    DOI:  https://doi.org/10.1007/s00210-026-05560-x
  40. J Pain Symptom Manage. 2026 Jun 09. pii: S0885-3924(26)00814-6. [Epub ahead of print]
       BACKGROUND: Cancer pain remains a major public health concern and a leading cause of suffering in patients with cancer. With the rapid expansion of short video platforms, such as TikTok, an increasing number of users are turning to these platforms for information on cancer pain management. This study evaluated the quality, reliability, and content characteristics of cancer pain-related short videos on TikTok.
    METHODS: A total of 241 videos were included in the final analysis. Data on video characteristics, uploader type, engagement metrics, and medical content were extracted. Two independent reviewers assessed video quality using the Global Quality Score (GQS), the modified DISCERN tool (mDISCERN), and the JAMA benchmark criteria.
    RESULTS: The included videos received substantial user engagement; however, overall quality was moderate, with median scores of 3.00 for GQS, 3.00 for mDISCERN, and 2.00 for JAMA. Healthcare professionals (HCPs) uploaded the majority of videos (77.18%) and provided significantly higher-quality and more reliable content than non-healthcare professionals (p<0.001). HCP videos more frequently covered diagnosis, prognosis, and clinical manifestations, whereas videos from non-healthcare professionals received higher comment engagement despite lower reliability.
    CONCLUSIONS: Spearman correlation analysis showed that user engagement metrics were strongly correlated with each other but had negligible associations with video quality. These findings indicate that although TikTok serves as an important platform for disseminating cancer pain information, substantial gaps remain in content accuracy, particularly among non-professional creators. Increased involvement of healthcare professionals and enhanced platform-level oversight may help improve the quality of cancer pain-related educational content shared on short-video platforms.
    Keywords:  Cancer pain; Health communication; Pain management; Social media; TikTok; Video quality
    DOI:  https://doi.org/10.1016/j.jpainsymman.2026.05.018
  41. Healthcare (Basel). 2026 May 28. pii: 1495. [Epub ahead of print]14(11):
      Background: TikTok has emerged as a major source of health information, including content related to intrauterine devices (IUDs). However, the accuracy, quality, and engagement patterns of IUD-related content on this platform remain insufficiently characterized. This study evaluated the informational quality, thematic focus, misinformation prevalence, and engagement metrics of widely viewed IUD-related TikTok videos. Methods: A descriptive cross-sectional content analysis was conducted of TikTok videos retrieved using 12 IUD-related search terms. Engagement metrics, creator characteristics, and content features were extracted, including educational, testimonial, and advice-seeking videos. Advice-seeking videos were included to capture user-generated concerns and inquiries that may influence engagement with health-related content on social media platforms. Informational reliability and quality were assessed using the Modified DISCERN instrument and the Global Quality Scale (GQS). Differences across groups were examined using t-tests, ANOVA, and chi-square tests. Results: A total of 458 videos were included. Nearly half were testimonial or advice-seeking (47.8%), while 38.9% were educational. Most content was produced by non-healthcare creators (76.4%). Engagement metrics did not differ significantly across video type, source, or creator qualification (all p > 0.05). Frequently discussed topics included adverse effects (36.5%), insertion experiences (31.2%), and device removal or discontinuation (19.9%). Overall informational quality was low, with mean GQS and Modified DISCERN scores of 2.0 and 1.8, respectively. Physician-created content demonstrated significantly higher quality and reliability scores (both p < 0.001). Conclusions: Widely viewed IUD-related TikTok content demonstrates high engagement but generally low informational quality.
    Keywords:  TikTok; content analysis; information quality; intrauterine devices; medical misinformation; reproductive health; social media
    DOI:  https://doi.org/10.3390/healthcare14111495
  42. Optom Vis Sci. 2026 06;103(6): e70071
       PURPOSE: To systematically evaluate the quality of eye disease videos on TikTok, WeChat, and rednote, explore links between engagement and quality, and offer evidence-based guidance for ophthalmic health communication.
    METHODS: The top 100 videos retrieved using the keywords "cataract," "glaucoma," and "high myopia" were screened on TikTok, WeChat, and rednote on 3 October 2025. Two reviewers independently assessed video quality using the Journal of the American Medical Association (JAMA) benchmark criteria, the global quality score (GQS), modified DISCERN, and the Patient Education Materials Assessment Tool (PEMAT). Group differences were analyzed using Kruskal-Wallis and χ²/Fisher exact tests, and adjusted associations were examined using Poisson regression with robust standard errors.
    RESULTS: A total of 827 eligible videos were analyzed. Most videos were uploaded by physicians and focused on disease knowledge. Across TikTok, WeChat, and rednote, video characteristics, engagement, source, content, presentation form, and quality scores differed significantly. In adjusted analyses, compared with TikTok, WeChat videos had lower likes and comments, whereas rednote videos had lower engagement across all four outcomes. High-myopia videos showed higher engagement across all outcomes, while glaucoma videos showed higher collections and shares. Hospital-uploaded videos were associated with lower engagement, whereas news agency videos were associated with higher engagement. Personal experience videos were associated with higher comments and collections. Higher JAMA scores were consistently associated with lower engagement, whereas modified DISCERN and PEMAT actionability showed inverse associations only for selected outcomes.
    CONCLUSIONS: This study represents the first large-scale cross-sectional evaluation of science communication on potentially blinding eye diseases across major Chinese short-video platforms. High engagement does not equate to high quality; in fact, engagement metrics were significantly negatively correlated with reliability, scientific accuracy, and understandability. Clinicians should uphold scientific rigor and use accessible and friendly language to improve public eye health literacy.
    DOI:  https://doi.org/10.1002/ovs2.70071
  43. Digit Health. 2026 Jan-Dec;12:12 20552076261458906
       Purpose: This study aimed to evaluate the quality of dry eye disease (DED) treatment-related short videos on popular Chinese platforms.
    Methods: To better evaluate the quality of short videos related to DED treatment, the Dry-eye-related Short Videos Standardization Score (DSVSS), a preliminary disease-specific checklist, was developed tailored to DED clinical guidelines. On May 17, 2025, 305 videos (150 from Douyin, 155 from Bilibili) were retrieved using the keyword "dry eye treatment", and their quality and guideline consistency were evaluated with the Global Quality Score (GQS) and the DSVSS checklist. Basic data, including duration, likes, comments, collections, and shares were recorded. Statistical analysis was performed using the Mann-Whitney U test, Kruskal-Wallis H test, and Spearman's rank correlation to assess group differences and correlations.
    Results: Videos from Douyin were generally shorter but achieved higher user engagement, while videos from Bilibili were longer with lower interaction (both P<0.001). Median GQS was 3 for Douyin and 2 for Bilibili (P=0.041), and median DSVSS was 3 for both (P=0.116). Videos performed poorly on the DSVSS checklist items for definition and classification [0.05 (IQR 0.03-0.07)], emphasizing chronicity [0.07 (0.05-0.10)], and individualized treatment [0.08 (0.03-0.10)], but performed well on avoiding exaggeration [0.84 (0.63-0.91)], absence of advertising [0.78 (0.66-0.88)], and providing warnings for special populations [0.91 (0.87-0.96)] (P<0.001).
    Conclusions: This study identified critical deficiencies in current short videos on DED treatment, underscoring the need for more professional, guideline-based content and stricter platform supervision to improve the quality of online health information.
    Keywords:  bilibili; douyin; dry eye disease; dry eye treatment; information quality; short videos
    DOI:  https://doi.org/10.1177/20552076261458906
  44. Aesthetic Plast Surg. 2026 Jun 11.
       BACKGROUND: Labiaplasty has experienced growing popularity, with over 10,800 procedures performed annually in the USA. Discussions about this surgery are shifting to social media, particularly TikTok, where health information is often presented with limited regulation or oversight. This raises concerns about the accuracy, quality, and influence of labiaplasty-related content.
    METHODS: We conducted a cross-sectional observational study analyzing the 110 most relevant TikTok videos under the term "labiaplasty" (July-August 2025). Video characteristics, engagement metrics (likes, shares, comments), and creator types were recorded. Content quality was assessed using the Global Quality Scale (GQS) by human reviewers and an AI model (ChatGPT-4.5-turbo). Sentiment analysis of video comments was performed by two human raters and the AI model. Statistical analyses included Wilcoxon signed-rank and Mann-Whitney U tests.
    RESULTS: Surgeons (52%) and patients (40%) produced most videos, primarily on educational (39%) or postoperative (28%) content. Overall, median human-rated GQS was 3.5 [IQR, 2.13-4.88], while the AI median was 3 [IQR, 2-4]. Videos with ≥2000 likes were more often created by patients (52% vs. 32%, p=0.012) and had significantly lower GQS scores (human: 2.5 vs. 4, p=0.003; AI: 2 vs. 3, p<0.001). Human inter-rater reliability for sentiment classification was slight (κ=0.161), with minimal agreement between AI and humans (κ=0.077).
    CONCLUSION: Labiaplasty content on TikTok is predominantly generated by surgeons and patients, yet lower-quality videos achieve higher engagement. Surgeons should proactively create accurate, relatable content to counterbalance misinformation. Refinement of AI tools is needed for reliable quality and sentiment assessment on social media.
    LEVEL OF EVIDENCE IV.
    Keywords:  Communication [mesh]; Genitalia, female [mesh]; Social media [mesh]; Surgery, plastic [mesh]; Vulva [mesh]
    DOI:  https://doi.org/10.1007/s00266-026-05959-0
  45. Int J Womens Health. 2026 ;18 607589
       Introduction: Maternal eHealth literacy (MeHL) is a critical factor influencing self-care and health promotion during pregnancy. Inadequate MeHL may hinder pregnant women's effective use of online maternal health resources, increasing vulnerability to misinformation and suboptimal health decisions. However, research specifically exploring MeHL among pregnant women remains limited. Prior studies have largely focused on general populations or employed quantitative approaches, providing limited insight into how pregnant women navigate and make sense of online maternal health information in practice. Qualitative exploration is therefore needed to capture the contextual and multidimensional nature of these experiences.
    Objective: To explore how pregnant and postpartum women in Thailand seek, appraise, and apply online health information during pregnancy.
    Methods: This was the qualitative phase of a mixed-methods study that aimed to develop an instrument for assessing electronic health literacy (eHL) among Thai pregnant women. Using a descriptive qualitative design guided by the eHealth Literacy Framework (eHLF), we explored pregnant women's internet use and eHL-related experiences through in-depth interviews conducted between January and February 2024 with a purposive sample of 12 pregnant and 8 postpartum women from urban and suburban hospitals in Thailand.
    Results: Participants primarily used Google, YouTube, and Facebook for pregnancy information. eHLF-guided analysis identified themes across all seven eHL domains. Although participants demonstrated skills in accessing information, they expressed concerns regarding source credibility, data privacy, and complex medical terminology, and highlighted the need for Thai-language, user-friendly digital resources provided or endorsed by trusted national healthcare institutions.
    Conclusion: Pregnant women in these Thai hospital-based samples are active users of eHealth information but face challenges in navigating credibility and system usability. Healthcare providers and institutions should develop and promote reliable, accessible, and tailored digital health resources to enhance MeHL.
    Keywords:  eHealth literacy; maternal eHealth literacy; maternal health; pregnancy information; pregnant women
    DOI:  https://doi.org/10.2147/IJWH.S607589
  46. Digit Health. 2026 Jan-Dec;12:12 20552076261458077
       Objective: This study explored the associations between online health information seeking (OHIS), healthcare utilization, and exercise-related self-management behaviors among adults in China during the COVID-19 pandemic, focusing on individuals with long-term conditions (LTCs). It was guided by the biopsychosocial model and the Information-Motivation-Behavioral Skills (IMB) model.
    Methods: A cross-sectional analysis used observational data from 1,831 respondents in the 2021 China General Social Survey (CGSS). OHIS was defined as the frequency of using the internet to obtain health or medical information in the past 12 months. Healthcare utilization was measured by the frequency of medical visits, including both traditional Chinese and Western medicine. Exercise-related self-management was represented by regular physical exercise. Multinomial logistic regression was applied while controlling for demographic, psychosocial, and health-related factors.
    Results: OHIS adopters reported, on average, more frequent medical visits and higher levels of physical exercise than non-adopters. Among individuals with LTCs, however, OHIS was associated with less frequent medical visits but a higher likelihood of physical exercise, suggesting a potential pathway linking OHIS to exercise-related self-management behavior.
    Conclusion: OHIS was positively associated with healthcare utilization and exercise-related self-management behaviors during a period of restricted healthcare access in China. These findings suggest that accessible and reliable online health information may complement patients' exercise-related self-management capacities in developing countries, offering insights for integrating digital health strategies into primary care and LTC management.
    Keywords:  COVID-19; exercise-related self-management; healthcare service utilization; long-term conditions; multinomial logit model; online health information seeking
    DOI:  https://doi.org/10.1177/20552076261458077
  47. Healthcare (Basel). 2026 May 28. pii: 1505. [Epub ahead of print]14(11):
       BACKGROUND: Cyberchondria is characterized by compulsive online health information seeking with additional psychological characteristics of behavioral addictions. Alexithymia, a transdiagnostic factor, is associated with difficulties in recognizing and differentiating emotions from bodily sensations. These characteristics may facilitate cyberchondria as a maladaptive strategy employed to cope with health anxiety. The present scoping review aims to examine the evidence regarding the association between alexithymia and cyberchondria.
    METHODS: The scoping review was performed in accordance with the PRISMA-ScR guidelines. A comprehensive search of major databases (i.e., PubMed, Scopus, PsycINFO, and Web of Science) and grey literature sources (i.e., ProQuest and Google Scholar) was conducted. Data extraction was centered on the study's design, the characteristics of the sample, the tools utilized, the primary findings, and other relevant variables.
    RESULTS: A total of 139 records were identified from the databases, and four studies met the inclusion criteria. An additional study was selected from grey literature. The included studies involved different populations, including healthcare workers, university students, and patients with chronic conditions. Across these populations, a significant association between alexithymia and cyberchondria was consistently reported, considering both total scores and their respective dimensions. Furthermore, alexithymia mediated or moderated the relationship between other psychological factors (e.g., perceived stress, somatosensory amplification) and cyberchondria.
    CONCLUSIONS: The scoping review revealed limited but growing research indicating the potential influence of alexithymia on cyberchondria, with implications for clinical and healthcare contexts. The findings also highlighted gaps in the literature and the need for further research in this area.
    Keywords:  alexithymia; cyberchondria; health anxiety; health-related internet use
    DOI:  https://doi.org/10.3390/healthcare14111505