bims-librar Biomed News
on Biomedical librarianship
Issue of 2026–05–31
48 papers selected by
Thomas Krichel, Open Library Society



  1. Med Ref Serv Q. 2026 May 28. 1-17
      Problem-based learning (PBL) is part of an integrated dental school curriculum using patient cases to learn about biomedical sciences. In PBL, students must find and evaluate information sources to support their patient cases. The library supports PBL students through course-related LibGuides, recommended textbooks, and librarian assistance. A feedback survey was conducted with first-year dental students to assess their perceptions of the usefulness of PBL library resources and the benefits of librarian support. Results indicate that students found PBL resources useful and few asked for librarian assistance. Analyzing this data led to changes in library instruction.
    Keywords:  Assessment; LibGuides; PBL; case report; dental students; health sciences; librarian assistance; library instruction; problem-based learning; survey
    DOI:  https://doi.org/10.1080/02763869.2026.2673219
  2. Am J Pharm Educ. 2026 May 22. pii: S0002-9459(26)01363-X. [Epub ahead of print] 102007
       OBJECTIVE: To develop and validate a comprehensive search hedge for pharmacy education research to support efficient and precise evidence synthesis.
    METHODS: A title search for pharmacy education systematic reviews was undertaken to discover papers captured by a sensitive search hedge, defined as a pre-constructed set of database search terms designed to retrieve literature on a specific topic. Primary studies were extracted from 7 high-quality reviews. These were used to build a corpus of 100 PubMed identifiers (PMIDs) for measuring relative recall. The hedge was adapted from terms previously created by a member of the author team and tested by the research team for maximum sensitivity.
    RESULTS: The resulting PubMed search hedge was translated and re-validated for use in Embase via Elsevier, and MEDLINE via Ovid and EBSCO in June 2025. All searches attained 100% relative recall.
    CONCLUSION: The use of relative recall methodology by a team of expert searchers led to a highly sensitive validated search hedge. The validated hedge can be trusted to identify relevant literature for pharmacy education researchers, while the approach may be adapted to other pharmacy research domains and health science disciplines.
    Keywords:  information storage and retrieval; pharmacy education; scholarship of teaching and learning; search hedge; validation study
    DOI:  https://doi.org/10.1016/j.ajpe.2026.102007
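    The relative recall measure used in item 2 can be illustrated with a minimal sketch. It assumes relative recall is the share of a known-relevant reference set that a search retrieves; the function name and the PMIDs below are hypothetical placeholders, not data from the study.

```python
def relative_recall(known_relevant, retrieved):
    """Return (recall, missed) for a search against a reference set.

    known_relevant: identifiers already known to be relevant
                    (for example, PMIDs taken from existing reviews).
    retrieved:      identifiers returned by the search hedge under test.
    """
    known = set(known_relevant)
    if not known:
        raise ValueError("the reference set must not be empty")
    found = known & set(retrieved)
    # Relative recall: fraction of the reference set the search captured.
    recall = len(found) / len(known)
    # Missed records show which terms the hedge may still lack.
    missed = sorted(known - found)
    return recall, missed


# Hypothetical identifiers, for illustration only.
reference_set = ["101", "102", "103", "104"]
search_results = ["102", "104", "999", "101"]

recall, missed = relative_recall(reference_set, search_results)
print(f"relative recall = {recall:.0%}, missed = {missed}")
```

    Under this assumption, a hedge reaches 100% relative recall when every record in the reference set appears in the search results, regardless of how many other records the search also returns.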
  3. Int J Med Inform. 2026 May 15. pii: S1386-5056(26)00219-4. [Epub ahead of print]217 106479
       BACKGROUND: Large language models (LLMs) are increasingly used by patients for medical information and reassurance. In psychodermatology, where communication must address psychological distress, stigma, and functional impact, the safety and quality of AI-generated educational content have not been systematically assessed.
    OBJECTIVE: To evaluate the quality and safety of patient-facing educational responses generated by contemporary LLMs for psychodermatologic conditions in a bilingual setting, comparing English and a non-English language (Turkish).
    METHODS: This cross-sectional, scenario-based evaluation analyzed responses from five LLMs (GPT-4o, GPT-5, Claude 4 Sonnet, Gemini 2.5 Flash, and LLaMA 3.1 70B) to 16 standardized psychodermatology scenarios. Three blinded clinician-evaluators (two from dermatology and one from psychiatry) independently rated each response across six clinical communication domains. Readability indices and word counts were assessed, and repeated-measures nonparametric analyses were performed.
    RESULTS: Gemini 2.5 Flash achieved the highest overall quality scores in both English and Turkish, significantly outperforming LLaMA 3.1, Claude 4 Sonnet, and GPT-4o (P < 0.05). Across models, empathy and stigma-free communication scored highest, whereas actionability and risk management scored lowest. English outputs were longer than Turkish (mean 373.9 vs 269.4 words; P < 0.001). LLaMA 3.1 showed significantly lower quality in Turkish (66.3%) compared with English (77.2%; P < 0.001). Interrater agreement was good (ICC = 0.703).
    CONCLUSIONS: While LLMs demonstrated strong empathic and stigma-sensitive communication in psychodermatology, they consistently lacked actionable guidance and robust risk framing. These findings support cautious, clinician-supervised use of LLMs as adjunctive tools for patient education in psychodermatologic care.
    Keywords:  Artificial Intelligence; Health Communication; Large Language Models; Patient Education; Psychodermatology
    DOI:  https://doi.org/10.1016/j.ijmedinf.2026.106479
  4. JMIR Cancer. 2026 May 28. 12 e79065
       Background: With increasing numbers of survivors with cancer, the importance of patient-centered information provision and communication to alleviate psychological burdens, such as anxiety and depression, is growing. However, substantial individual differences exist in the information-seeking behaviors of patients with cancer and in their use of support services, and few studies have comprehensively examined cognitive and psychological factors such as treatment status, sex, trust in information sources, and patient-provider relationships.
    Objective: This study aimed to integrate the theory of planned behavior and the patient-provider relationship model to identify latent subgroups among Japanese survivors with cancer using information-seeking behaviors, difficulties in information seeking, trust in information sources, and intentions to use psychosocial support services recommended by medical institutions.
    Methods: A CHERRIES (Checklist for Reporting Results of Internet E-Surveys)-compliant cross-sectional web survey was conducted in December 2024 with 350 Japanese survivors with cancer (at least 1 year post diagnosis, either undergoing treatment, or within 5 years after completing treatment). Exploratory factor analysis examined items such as difficulties in information seeking, trust in information sources, and assessment of relationships with physicians. Using the resulting factor structure and sociodemographic and clinical characteristics, latent class analysis was conducted. Differences between classes were examined using the chi-square test, Kruskal-Wallis test, and post hoc analyses.
    Results: Latent class analysis classified participants into 3 groups: women under observation, men under observation, and patients under treatment. In the women under observation group, evaluation of the reliability of information from nonmedical institutions was significantly higher than in the male group (χ²₂=12.30; P=.002). Although information-seeking behavior among men under observation was relatively limited, their evaluation of relationships with physicians was significantly higher than that of the treatment group (χ²₂=12.20; P=.002). The proportion of men who regarded promoting communication with doctors and health care professionals as a benefit of using support services was also significantly higher than that of women (χ²₂=12.57; P=.002 and Class 2 > Class 1; P=.001). In the treatment group, searches for information on life during treatment (χ²₂=7.22; P=.03), use of the Cancer Consultation Support Center (χ²₂=17.31; P<.001), and use of the National Cancer Center website (χ²₂=7.59; P=.02) were significantly higher than among men under observation. The treatment group also reported greater difficulty in seeking information (χ²₂=11.90; P=.003).
    Conclusions: Information-seeking behaviors, trusted sources, and perceived difficulties differed by sex and treatment stage among Japanese survivors with cancer. Patients undergoing treatment showed high information needs but greater difficulty in seeking information, suggesting reduced perceived behavioral control. Men under follow-up emphasized relationships with physicians, whereas women relied more on nonmedical information sources. These findings indicate that psychosocial support and information provision should be optimized according to patient-provider communication patterns.
    Keywords:  latent class analysis; mental health; patient preference; patient-centered communication; survivor with cancer; theory of planned behavior
    DOI:  https://doi.org/10.2196/79065
  5. Sci Rep. 2026 May 27.
      Large language models (LLMs) are increasingly used as a first-line source of information for everyday questions, including pediatric health guidance, yet the quality and readability of their outputs remain uncertain. This study aimed to comparatively evaluate the clinical reliability, quality, and readability of responses generated by four widely used LLMs to real-world, caregiver-style pediatric prompts. A curated set of 28 caregiver-style pediatric questions spanning six common themes was posed on August 5, 2025, to four widely used LLMs (ChatGPT-4o, Gemini 2.5 Pro, Grok-4, DeepSeek-V2), with one fresh session per question-model pair; the first, unedited outputs were retained. Under blinded conditions, responses were scored by four pediatricians using a modified DISCERN instrument (reliability/structure) and a global quality score (perceived usefulness). Readability was assessed with standard indices and length measures. Quality varied across platforms. Grok achieved the highest DISCERN scores, indicating stronger reliability and structural rigor, whereas Gemini received the highest global quality ratings. DeepSeek was consistently rated lower by experts but yielded the most readable outputs; ChatGPT showed intermediate performance. None of the models consistently met health-literacy targets for patient materials (approximately 6th-8th grade reading level). Grok generated the longest and most complex responses (often college level), while DeepSeek and Gemini produced comparatively simpler, more concise text. Across platforms, most responses were classified as moderate in reliability and usefulness; overtly unsafe advice was not identified. These findings suggest that current LLMs provide moderately useful pediatric health information but require improvements in readability, sourcing, and consistency before routine patient-facing use.
    Keywords:  Artificial intelligence (AI); Large language models (LLMs); Parenting; Patient education; Pediatrics; Readability
    DOI:  https://doi.org/10.1038/s41598-026-54812-6
  6. Digit Health. 2026 Jan-Dec;12:12 20552076261455128
       Introduction: Large language models (LLMs) are increasingly being used by patients for health information, yet their reliability in orthodontics remains uncertain. This study aims to evaluate the accuracy, reliability, quality, and readability of orthodontic retention information generated by ChatGPT 3.5, ChatGPT 4, Gemini, and Copilot.
    Materials and Methods: Twenty-three frequently asked questions about orthodontic retainers were collected and categorised into general retainer questions (n=8), fixed retainer questions (n=5), and removable retainer questions (n=10). Questions were entered into each AI model once under standardised conditions. Responses were anonymous and independently assessed by two consultant orthodontists. Accuracy was scored using a five-point Likert scale, reliability with the modified DISCERN tool, quality with the Global Quality Scale (GQS), and readability with the Flesch Reading Ease Score (FRES). Statistical analysis included ANOVA, Kruskal-Wallis, post-hoc tests, and intraclass correlation coefficients (ICC).
    Results: Evaluator agreement was excellent across all domains (ICC 0.821-0.957). ChatGPT 3.5 achieved the highest accuracy (mean 4.49), while ChatGPT 4 and Copilot scored highest in reliability (means 30.47 and 30.11). ChatGPT models outperformed Gemini and Copilot in quality, with over 75% of their responses rated good to excellent. Readability was low across all models; however, Copilot produced relatively more readable text (mean FRES score of 53.93).
    Limitations: This study is limited by its focus on single-turn responses, which may not reflect the iterative interactions typical of real patient-AI conversations. In addition, the evolving nature of AI models may affect reproducibility, and the restriction to English may limit generalizability across languages.
    Conclusion: All AI models demonstrated moderate competence in providing orthodontic retention information, but their reliability was inconsistent, and readability was poor, necessitating human oversight and methodological refinement rather than serving as replacements for professional advice.
    Keywords:  ChatGPT; Copilot; Gemini; artificial intelligence; large language models; orthodontic retention
    DOI:  https://doi.org/10.1177/20552076261455128
  7. Front Med (Lausanne). 2026 ;13 1758735
       Background: With the rapid advancement of artificial intelligence, large language models (LLMs) are now employed across diverse domains. In nursing, their capacity for high-quality content generation is especially promising, offering practical value for clinical management, research, and education. Among the leading Chinese models is DeepSeek-R1.
    Objective: This study aims to evaluate and compare the effectiveness of DeepSeek-R1 and ChatGPT-4.0 as online information sources for nursing professionals seeking evidence-based care strategies for gout patients.
    Methods: We identified the 15 highest-priority questions on gout and related nursing strategies by surveying the research site, patients, and healthcare providers. These questions, posed in Chinese, were separately submitted to DeepSeek-R1 and ChatGPT-4.0. The Flesch Kincaid Grade Level (FKGL) and the Flesch Reading Ease (FRE) were used to evaluate the readability of their answers. The mDISCERN score was employed to compare the accuracy of their responses, and the age of statistical reference materials was assessed to compare their timeliness. GraphPad Prism 8.0.1 was used for all statistical analyses and figure preparation.
    Results: Readability and citation characteristics differed between the two LLMs. The FKGL of DeepSeek-R1 (13.04 ± 1.62) exceeded that of ChatGPT-4.0 (11.41 ± 1.74; p = 0.013), whereas FRE was lower for DeepSeek-R1 (40.50 ± 8.12) than for ChatGPT-4.0 (49.08 ± 8.90; p = 0.010). The mDISCERN quality score was numerically higher for ChatGPT-4.0 (4.30 ± 0.73) than for DeepSeek-R1 (3.98 ± 0.70), but this difference was not statistically significant (p = 0.16). DeepSeek-R1 cited 21 sources and ChatGPT-4.0 23; clinical guidelines predominated in both corpora (38.1% vs. 47.8%, respectively). The mean publication age (years elapsed from 2025) was significantly younger for DeepSeek-R1 (3.57 ± 2.33) than for ChatGPT-4.0 (5.42 ± 2.34; p < 0.05). In addition, DeepSeek-R1 provided 4 reference links that were invalid.
    Conclusion: Both DeepSeek-R1 and ChatGPT-4.0 drew chiefly from high-level evidence and produced accurate, professional answers; ChatGPT-4.0 rendered them in markedly clearer prose. While DeepSeek-R1 offered more up-to-date citations, several of its reference links were non-functional.
    Keywords:  ChatGPT-4.0; DeepSeek-R1; gout; large language models (LLMs); nursing
    DOI:  https://doi.org/10.3389/fmed.2026.1758735
  8. J Am Podiatr Med Assoc. 2026 May 21. pii: 33. [Epub ahead of print]116(3):
       BACKGROUND: This study aims to compare the quality, reliability, and readability of information provided by artificial intelligence-based language models, ChatGPT-5 and DeepSeek V3, regarding foot and ankle disorders.
    METHODS: The quality, reliability, and readability of the texts generated by both AI models were analyzed using DISCERN, the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P), the Global Quality Score (GQS), and the CLEAR scoring system. DISCERN was used to assess information reliability, PEMAT-P to evaluate understandability and actionability, GQS to assess overall quality, and CLEAR to evaluate content quality and accuracy. Standardized questions were asked to both models for 35 different foot and ankle disorders, and the generated texts were evaluated by two independent orthopedic specialists using a blinded method. Readability analysis was performed using word count, the Flesch-Kincaid Grade Level (FKGL; required reading level), and the Flesch Reading Ease (FRE; ease of readability) scoring systems.
    RESULTS: ChatGPT-5 scored significantly higher than DeepSeek V3 in DISCERN, PEMAT-P, GQS, and CLEAR evaluations (p < 0.05), indicating that ChatGPT-5 provides more reliable, comprehensive, and higher-quality information. DeepSeek V3 demonstrated better readability, producing simpler and more understandable content, as reflected in its lower FKGL score and higher FRE score.
    CONCLUSIONS: While ChatGPT-5 delivers more detailed and reliable health information, DeepSeek V3 offers simpler and more readable texts. Both models have distinct advantages for patient education. Future research should assess the impact of AI-generated health information on patient decision-making and its clinical application potential.
    Keywords:  ChatGPT-5; DeepSeek V3; artificial intelligence; information reliability; patient education; readability
    DOI:  https://doi.org/10.3390/japma116030033
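    Several items in this issue report the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) indices. The sketch below shows how the two standard formulas combine sentence, word, and syllable counts. The syllable counter is a crude heuristic written for this illustration; it is an assumption, not the tool used in any of the studies, and published readability calculators will give somewhat different values.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text):
    """Return (sentences, words, syllables) using simple regex splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(sentences, words, syllables):
    """FRE: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(sentences, words, syllables):
    """FKGL: approximate US school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical sample sentence, for illustration only.
sample = "Rest your foot. Use ice for pain. Call your doctor if it swells."
stats = text_stats(sample)
print(f"FRE  = {flesch_reading_ease(*stats):.1f}")
print(f"FKGL = {flesch_kincaid_grade(*stats):.1f}")
```

    Both indices depend only on average sentence length and average syllables per word, which is why shorter sentences and simpler vocabulary raise FRE and lower FKGL without saying anything about accuracy.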
  9. Healthcare (Basel). 2026 May 08. pii: 1278. [Epub ahead of print]14(10):
      Introduction: Artificial intelligence (AI)-based chatbots are becoming an increasingly popular source of health information, particularly for common dermatological conditions such as scabies. However, concerns remain about the accuracy, reliability, quality and readability of the information they provide. Objectives: The aim of this study was to evaluate the accuracy, reliability, quality and readability of responses generated by different AI chatbots in answer to patient questions about scabies. Methods: Scabies-related questions were collected from Quora, a publicly accessible question-and-answer platform, and screened for relevance. Following expert review, 20 representative questions were selected. Responses were generated by three large language models: ChatGPT-5.2, DeepSeek and Claude Sonnet 4.5. The outputs were evaluated by expert reviewers using the hallucination rate, modified DISCERN (mDISCERN), Global Quality Score (GQS), Flesch Reading Ease Score (FRES), and an accuracy assessment based on a 5-point Likert scale. Results: ChatGPT-5.2 demonstrated the highest information quality (mDISCERN: 33.6 ± 1.8) and readability (FRES: 63.25 ± 11.5). DeepSeek achieved the highest global quality score (GQS: 5.00 ± 0.00) and accuracy score (5.00 ± 0.00). Claude Sonnet 4.5 had lower scores across most metrics. There were significant differences in hallucination rates among the models (p = 0.003), with DeepSeek exhibiting higher rates. Overall, statistically significant differences were observed among the models in terms of quality, readability and accuracy. Conclusions: AI chatbots provide generally informative but variable-quality responses to scabies-related questions. While DeepSeek demonstrated higher accuracy and overall quality, it also showed higher hallucination rates, whereas ChatGPT-5.2 provided more readable and reliable responses. These findings highlight variability across models and the need for cautious use. AI tools should be considered supportive resources rather than substitutes for professional medical advice.
    Keywords:  artificial intelligence; chatbots; dermatology; patient education; readability; scabies
    DOI:  https://doi.org/10.3390/healthcare14101278
  10. Healthcare (Basel). 2026 May 13. pii: 1339. [Epub ahead of print]14(10):
      Background: Gingival recession is a common periodontal condition. With the increasing use of artificial intelligence (AI)-based chatbots, patients frequently seek online health information. However, the reliability, accuracy, and readability of AI-generated patient-oriented information on gingival recession remain unclear. Objective: To evaluate the quality, accuracy, and readability of ChatGPT-generated responses to patient-oriented questions related to gingival recession. Methods: A total of 288 patient-oriented questions were developed by an expert panel and categorized into fourteen thematic domains. Responses generated by ChatGPT (version 3.5) were independently evaluated by five oral health professionals using a modified Brief DISCERN instrument, an accuracy scoring system, and the Global Quality Score (GQS). Readability was assessed using the Flesch Reading Ease and Flesch-Kincaid Grade Level indices. Results: Significant differences were observed among thematic categories for DISCERN, accuracy, GQS, and readability scores (all p < 0.01). The highest modified Brief DISCERN, accuracy, and GQS scores were recorded for the Information Sources/AI Reliability category (DISCERN: 19.60 ± 2.29; accuracy: 4.67 ± 0.49; GQS: 4.33 ± 0.49), whereas the lowest scores were observed for the What Happens If Left Untreated? category (DISCERN: 14.27 ± 1.75; accuracy: 3.23 ± 0.43). Strong positive correlations were identified between DISCERN and accuracy (r = 0.784, p < 0.001) and between accuracy and GQS (r = 0.868, p < 0.001). Readability indices were not significantly correlated with accuracy or quality measures. Conclusions: ChatGPT provided patient-oriented information on gingival recession with variable performance across thematic domains; however, readability remained a limitation. AI-generated content should therefore be considered a supplementary resource rather than a substitute for clinician-guided patient communication.
    Keywords:  ChatGPT; artificial intelligence; gingival recession; health information quality; patient education; readability
    DOI:  https://doi.org/10.3390/healthcare14101339
  11. J Clin Med. 2026 May 19. pii: 3908. [Epub ahead of print]15(10):
      Background/Objectives: Migraine is a common and disabling neurological disorder, and many individuals increasingly seek information online. With the growing use of large language models (LLMs), such as ChatGPT, for patient education, concerns have emerged regarding the quality and reliability of the responses they generate, particularly in Arabic, where evidence remains limited. This study aimed to evaluate the reliability, quality, and accuracy of Arabic-language responses to frequently asked questions (FAQs) about migraine. Methods: A total of 25 FAQs were selected using a multisource approach and entered into four LLMs (ChatGPT-4.1, Gemini 3 Flash, DeepSeek-V3.2, and Grok 4.1), generating 100 responses. Responses were evaluated by a panel of expert neurologists using the modified DISCERN (mDISCERN), Global Quality Scale (GQS), and an accuracy scale. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Results: Significant differences were observed between chatbots for mDISCERN and GQS (both p < 0.001), whereas accuracy did not differ significantly across models (p = 0.072). DeepSeek and Grok demonstrated the highest mDISCERN scores (34.07 ± 1.31 and 34.29 ± 2.59, respectively), while DeepSeek achieved the highest GQS (4.95 ± 0.13). The clearest between-model differences were observed in source transparency and communication of uncertainty. Inter-rater reliability was good across all instruments (ICC range, 0.799-0.831). Conclusions: Medical content generated by the chatbots was broadly comparable, whereas important differences were observed in how that content was communicated. These tools may support patient education; however, their use should remain guided by clinical oversight and professional judgment.
    Keywords:  AI chatbot; Arabic language; DISCERN; artificial intelligence; large language model; migraine; patient education; quality assessment
    DOI:  https://doi.org/10.3390/jcm15103908
  12. Cranio. 2026 May 26. 1-8
       OBJECTIVE: This study aims to evaluate the accuracy and quality of responses generated by large language model-based chatbots to frequently asked questions related to temporomandibular disorders (TMD).
    METHODS: Ten questions were selected based on the most common inquiries made by patients with TMD to artificial intelligence (AI) chatbots. The responses of four widely used AI chatbots (ChatGPT Pro, ChatGPT 3.5, DeepSeek, Grok 3.0) were collected. Three expert evaluators assessed each chatbot's response using a modified Global Quality Scale (GQS).
    RESULTS: A statistically significant difference was observed among the four AI chatbots (p = 0.0097; η² = 0.09). ChatGPT Pro and Grok achieved significantly higher GQS scores than DeepSeek (p = 0.037).
    CONCLUSION: While some AI chatbots show potential in answering TMD-related patient questions, variability in accuracy and reliability currently limits their use in clinical settings. Further training and validation are needed before integration into patient education or clinical decision-support systems.
    Keywords:  ChatGPT; Temporomandibular disorders; artificial intelligence
    DOI:  https://doi.org/10.1080/08869634.2026.2670637
  13. J Vitreoretin Dis. 2026 May 20. 24741264261448425
      Purpose: To determine whether large language models (LLMs) can be harnessed to improve the readability of educational material for retina patients. Methods: Forty-one documents (fact sheets presented in portable document format), each representing a vitreoretinal condition, from the American Society of Retina Specialists (ASRS) Retina Health Fact Sheets website, were downloaded in November of 2024. The multimodal LLM Generative Pre-trained Transformer 4 (GPT-4) was accessed through ChatGPT to generate patient education material on the same 41 vitreoretinal conditions. The model was then prompted to adjust the texts to a sixth-grade reading level. The text outputs for each of the 41 conditions were then analyzed through a readability calculator, and the Average Reading Level Consensus Calc (ARLCalc) score, a normalized average of 8 validated readability formulas that reflect a consensus readability grade level of the text, was recorded. Results: The ARLCalc scores for the ASRS Fact Sheet, GPT-4 Response, GPT-4 Enhanced, and ASRS Enhanced responses were 12.85 (± 0.89), 12.37 (± 0.97), 8.66 (± 0.87), and 9.37 (± 1.09), respectively. A statistically significant difference was found between the 4 groups (P < .001). Conclusions: LLMs may be used as a tool to improve the readability of patient-facing text. Patient education material created by specialty-trained authorship committees remains the gold standard for providing accurate medical information.
    Keywords:  large language models; readability; retina
    DOI:  https://doi.org/10.1177/24741264261448425
  14. J Am Acad Orthop Surg. 2026 May 26.
       INTRODUCTION: Online patient educational materials (PEMs) have poor readability, limiting their intended purposes in improving patient comprehension of health topics. Orthopaedic oncology PEMs are particularly complex. Although ChatGPT has demonstrated limited success in simplifying PEMs to the recommended sixth-grade reading level, other large language models (LLMs) have not been thoroughly evaluated. The goals of this study were to (1) assess baseline readability of online orthopaedic oncology PEMs, (2) evaluate five LLMs (ChatGPT-4o, Google Gemini, DeepSeek AI, Microsoft Copilot, and Meta AI) for improving readability while preserving accuracy and comprehension, and (3) examine tradeoffs when PEMs were simplified below the sixth-grade level.
    METHODS: Seventy-two PEMs were collected from academic and professional sources. Readability metrics included the Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), and Flesch Reading Ease (FRE). Each PEM was rewritten by the five LLMs using the prompt: "rewrite this document to a sixth-grade reading level." Two independent graders then evaluated outputs for comprehension and accuracy (F1 score). ANOVA with pairwise comparisons assessed differences among LLMs and versus baseline (PEMs as written). A secondary analysis evaluated the effect on readability, accuracy, and comprehension of prompts to the fifth-grade, fourth-grade, and third-grade reading level.
    RESULTS: Baseline FKGL (8.7 ± 1.5) was between the eighth-grade and ninth-grade reading level, and GFI (10.5 ± 1.9) was slightly higher. Baseline FRE was 53.9 ± 8.2. All LLMs significantly improved readability (P < 0.001), and ChatGPT-4o, DeepSeek AI, and Google Gemini produced the most readable outputs. Google Gemini achieved the highest F1 score of 0.986 (range: 0.765-0.986) and 100% comprehension. Accuracy and comprehension were compromised for Meta AI when prompted below sixth grade.
    CONCLUSION: ChatGPT-4o, Google Gemini, and DeepSeek AI effectively improved readability while preserving comprehension and accuracy. These findings may guide patient use of LLMs and inform healthcare-AI partnerships.
    DOI:  https://doi.org/10.5435/JAAOS-D-25-00883
  15. BMC Nephrol. 2026 May 25.
       BACKGROUND: Large language models are increasingly becoming a key resource for hemodialysis patients to access information on disease management. However, the information reliability, readability, and guideline concordance of LLM-generated hemodialysis-related educational texts remain insufficiently evaluated.
    METHODS: This study identified 42 dialysis-related questions from an initial pool of 200 candidate questions extracted from Google Trends, relevant clinical guidelines, and online forums. Using a standardized single-turn, zero-shot prompting strategy with default web-interface settings, these questions were independently input into five models (ChatGPT-4o, DeepSeek-V2.5, Gemini 2.5 Pro, Perplexity Pro, and Copilot). Two trained raters independently evaluated the outputs using the DISCERN, EQIP, JAMA, and GQS scales in a blinded review, with disagreements adjudicated by a third senior nephrologist. Readability was quantified using the FKGL, FRES, GFI, CLI, and SMOG metrics. Additionally, guideline concordance and potential text-level safety concerns in the generated outputs were reviewed against internationally authoritative hemodialysis-related guidelines such as KDIGO, and qualitative methods were employed to describe hallucinations in the model outputs.
    RESULTS: Significant differences were observed across the five LLMs for all four information-quality metrics (P < 0.001 for DISCERN and EQIP; P = 0.002 for GQS and JAMA). RAG-based models, particularly Perplexity and Copilot, showed relatively higher information reliability. None of the outputs met the recommended sixth-grade readability benchmark, and greater guideline concordance was often accompanied by higher linguistic complexity. RAG-based models also showed relatively better alignment with reference guideline statements, whereas non-retrieval-based models more often omitted guideline-recommended elements or provided less specific responses. Qualitative review identified several examples of model-generated "medical hallucinations," including contraindicated self-management suggestions, potentially inappropriate dietary advice, and out-of-scope clinical instructions presented as self-care, indicating potential text-level safety concerns if used without professional review.
    CONCLUSION: RAG-based models showed relatively better evidence support, information reliability, and guideline concordance in hemodialysis-related educational text generation. However, all evaluated LLMs produced outputs with readability barriers and occasional potentially unsafe or out-of-scope recommendations at the text level. These findings do not establish the actual clinical safety or effectiveness of LLM use among hemodialysis patients, but they indicate that unsupervised patient-facing use should be approached cautiously and that expert review and plain-language adaptation are necessary before such outputs are used as educational materials.
    Keywords:  Artificial intelligence; Complications; Guideline concordance; Hemodialysis; Information quality; Large language models; Readability
    DOI:  https://doi.org/10.1186/s12882-026-05048-z
  16. Otolaryngol Head Neck Surg. 2026 May 29.
       OBJECTIVES: Artificial Intelligence (AI) is increasingly integrated into medicine, including otolaryngology. However, concerns remain regarding the accuracy of generated content and the tendency of large language models (LLMs) to fabricate references. This study evaluates the accuracy, appropriateness, readability, and hallucination of references in 2 prevalent large language models, ChatGPT and Claude, in response to common otolaryngological questions.
    STUDY DESIGN: Prospective observational study.
    SETTINGS: Academic tertiary care center.
    METHODS: Thirty-six otolaryngologic questions were individually entered into ChatGPT 4.0 Plus and Claude in separate sessions, with explicit instructions to avoid utilizing previous memories. To assess reproducibility, each query was submitted twice. Two otolaryngologists independently rated the accuracy of responses. Readability was evaluated using the Flesch Reading Ease (FRE) score. Reference hallucinations were assessed by analyzing the reference validity and relevance.
    RESULTS: ChatGPT and Claude had an FRE of 47 and 25.2 out of 100, respectively. For patient readability, ChatGPT scored a 3.60 while Claude scored a 4.68 out of 5. Claude scored slightly higher on accuracy, receiving a score of 4.42 out of 5 while ChatGPT received a 3.81. Both models hallucinated at least half of their references, with some citations irrelevant or incorrectly formatted. Thematic analysis revealed frequent vagueness, poor clinical prioritization, and excessive jargon across both models.
    CONCLUSION: Both ChatGPT and Claude often produced partially inaccurate, jargon-filled responses and failed to consistently provide valid references when answering common otolaryngologic patient questions. Our results highlight the need for better understanding and regulation of LLM limitations in clinical and patient-facing applications.
    Keywords:  Artificial Intelligence; ChatGPT; Large Language Models; medical misinformation; patient resources
    DOI:  https://doi.org/10.1002/ohn.70309
  17. BMC Anesthesiol. 2026 May 28.
       PURPOSE: This study aims to compare the responses provided by commonly used artificial intelligence-based chatbots such as ChatGPT-3.5, ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-R1 about dental local anesthesia, sedation, and general anesthesia in terms of accuracy, reliability, and readability.
    METHODS: Sixty questions were created from the American Dental Association (ADA) and American Society of Anesthesiologists (ASA) guidelines. Thirty were patient questions, thirty professional questions. Each group contained ten open-ended, ten multiple-choice, and ten true/false questions. The questions were submitted to ChatGPT-3.5, ChatGPT-4o, DeepSeek-R1, and Gemini 2.0 Flash. Four blinded pediatric dentists evaluated the answers with a modified global quality scale. Clinical safety and risk analysis were evaluated using a 3-point Likert scale. Readability was measured by Flesch Reading Ease Score (FRE) and Flesch-Kincaid Grade Level (FKGL). The intraclass correlation coefficient (ICC) tested inter-rater reliability. Significance was set at p < 0.05.
    RESULTS: DeepSeek-R1 demonstrated the highest overall accuracy and inter-rater agreement, providing the most accurate and reliable responses across all question types (p < 0.001). It was followed by Gemini 2.0 Flash, ChatGPT-4o, and ChatGPT-3.5. In terms of readability, Gemini 2.0 Flash consistently produced the most accessible responses, while DeepSeek-R1 was significantly less readable (p = 0.012). ChatGPT-3.5 showed variability by question type, with MCQs being easier to read than open-ended ones (p = 0.027). No significant readability differences were observed across question types for ChatGPT-4o, Gemini 2.0 Flash, or DeepSeek-R1 (p > 0.05).
    CONCLUSION: Chatbot performance depends on both question type and evaluation criteria. DeepSeek-R1 excelled in accuracy and quality. Gemini 2.0 Flash produced the clearest, patient-friendly responses. AI chatbots can support communication in dental anesthesia. Choosing the right model may improve education, assist clinical training, and guide professional decisions.
    Keywords:  Anesthesia, General, Sedation; Anesthesia, Local; Artificial Intelligence; Chatbot; Dentistry; Readability
    DOI:  https://doi.org/10.1186/s12871-026-03934-5
  18. Front Public Health. 2026 ;14 1799204
       Background: Pertussis (whooping cough) is a highly contagious respiratory infection that continues to cause substantial morbidity and mortality, particularly among infants, and has re-emerged globally among adolescents and adults. Large language models (LLMs) are increasingly used for health communication and science popularization; however, evidence regarding their readability, quality, and educational suitability for disease-specific patient education remains limited. To date, no systematic evaluation has focused on LLM-generated pertussis health education materials.
    Objective: This study aimed to systematically evaluate and compare the performance of five mainstream LLMs in generating pertussis-related science popularization content, with particular attention to readability, informational quality, and educational suitability.
    Methods: A cross-sectional simulation study was conducted using 20 frequently asked pertussis-related questions covering five domains: basic knowledge, symptom presentation, diagnostic methods, treatment and management, and prevention and prognosis. On October 28, 2025, all questions were identically input into five publicly accessible LLMs. Text readability was assessed using seven classical indices. Two independent pharmacists performed blinded evaluations using the Chinese version of the Patient Education Materials Assessment Tool for print materials (C-PEMAT-P) and the Global Quality Score (GQS). Additionally, two independent clinical experts assessed the factual accuracy and guideline concordance of each LLM-generated response against the Chinese Pertussis Diagnosis and Treatment Guidelines (2024) using a 4-point scale. Inter-rater agreement was evaluated using Cohen's kappa coefficient.
    Results: ChatGPT, DeepSeek, and Doubao achieved significantly higher C-PEMAT and GQS scores than Wenxin Yiyan and Gemini (p < 0.001), indicating superior understandability, actionability, and overall quality. Median C-PEMAT scores across all models suggested generally acceptable accessibility for patient education. Regarding factual accuracy and guideline concordance, ChatGPT achieved the highest mean score. No harmful advice or direct guideline contradictions were identified in any model output. Correlation analyses showed weak associations between traditional readability metrics and GQS, whereas C-PEMAT demonstrated a moderate positive correlation with GQS (r = 0.34).
    Conclusion: Mainstream LLMs show preliminary capability in generating pertussis-related health education materials, although substantial inter-model variability persists. Domain-specific patient education assessment tools better capture perceived informational quality than generic readability metrics. These findings support the cautious, assistive use of LLMs in health communication within a human-AI collaborative framework.
    Keywords:  artificial intelligence; large language models; online medical education; pertussis; public health education
    DOI:  https://doi.org/10.3389/fpubh.2026.1799204
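    The inter-rater agreement statistic named in the abstract above, Cohen's kappa, can be sketched as follows. This is a minimal illustration of the standard unweighted formula, kappa = (p_o - p_e) / (1 - p_e), and not the authors' analysis code; the abstract does not say whether a weighted variant was used for the ordinal scales, and the function name and sample ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same category by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed per category.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (observed - expected) / (1 - expected)


# Hypothetical scores from two raters on a 4-point scale (illustrative only).
scores_a = [4, 3, 4, 2, 3, 4, 1, 3]
scores_b = [4, 3, 3, 2, 3, 4, 2, 3]
kappa = cohens_kappa(scores_a, scores_b)
```

    Values near 1 indicate agreement well beyond chance, and values near 0 indicate agreement no better than chance.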
  19. J Clin Med. 2026 May 19. pii: 3896. [Epub ahead of print]15(10):
      Objectives: Coronary artery bypass grafting (CABG) remains a fundamental surgical treatment for advanced coronary artery disease. With the increasing use of large language models to obtain health information, patients are increasingly turning to these systems to understand surgical options. However, their performance in generating patient-oriented CABG information has not been sufficiently evaluated. Therefore, this study aimed to compare the responses generated by ChatGPT and DeepSeek-R1 to patient questions about CABG in terms of scientific accuracy, comprehensibility, and level of unnecessary detail. Methods: Forty patient-oriented questions were developed based on online sources and clinical experience. Responses were obtained from ChatGPT and DeepSeek under standardized conditions. A blinded panel of four cardiovascular surgeons evaluated the responses using a five-point Likert scale across three domains. Statistical analyses were performed using paired tests. Results: DeepSeek generated significantly longer responses than ChatGPT (212.88 ± 48.13 vs. 188.7 ± 50.34 words; p < 0.001). Accuracy scores were higher for DeepSeek (median 4.5 vs. 4.25; p = 0.004), whereas comprehensibility and unnecessary detail scores were similar between the models. Overall scores were high for both models (4.32 ± 0.28 vs. 4.27 ± 0.30; p = 0.34). Conclusions: The responses generated by both models were generally evaluated favorably by the expert panel, with only limited differences observed between them. DeepSeek demonstrated higher accuracy, whereas ChatGPT tended to produce shorter and more concise responses. However, given the variability observed at the individual-question level, these findings should be interpreted with caution. Large language models may support patient information delivery but should not be considered reliable stand-alone sources for clinical decision-making or patient counseling.
    Keywords:  artificial intelligence; coronary artery bypass grafting; large language models; patient education
    DOI:  https://doi.org/10.3390/jcm15103896
  20. Eur J Obstet Gynecol Reprod Biol. 2026 May 23. pii: S0301-2115(26)00275-7. [Epub ahead of print]323 115207
       BACKGROUND: Preeclampsia is a serious hypertensive disorder of pregnancy. Many patients nowadays turn to social media platforms for help. This study aimed to assess the quality of preeclampsia-related videos on popular social media platforms.
    METHODS: In this cross-sectional study, preeclampsia-related videos were collected from Douyin and YouTube on a single day. Basic characteristics, engagement metrics, and uploader types were extracted for each video. Video quality was assessed using four validated tools: the Global Quality Scale (GQS), modified DISCERN, JAMA benchmark criteria, and the Content Completeness Score (CCS). Statistical analyses included descriptive analysis, non-parametric tests and the Spearman correlation analysis.
    RESULTS: A total of 188 videos (87 from YouTube, 101 from Douyin) were included. YouTube videos were significantly longer than Douyin videos (median 166.00 vs. 101.00 s, p < 0.05), but Douyin videos had drastically higher user engagement (likes, comments; p < 0.05). Douyin videos scored higher on overall quality but had a statistically lower reliability score (mDISCERN median: 2.00 vs. 2.00, Z = -3.12, p < 0.05). No significant difference was found between platforms for transparency and content completeness. Videos from institutional and professional sources had higher scores across all quality metrics than those from individual uploaders. Notably, user engagement metrics showed weak or negligible correlations with all measures of informational quality and completeness on both platforms.
    CONCLUSION: The reliability of preeclampsia information on both Douyin and YouTube was suboptimal and inconsistent. Higher social engagement does not guarantee higher informational quality. These findings underscore the need for users to critically evaluate sources.
    Keywords:  Cross-sectional; Douyin; Preeclampsia; Social media; TikTok; YouTube
    DOI:  https://doi.org/10.1016/j.ejogrb.2026.115207
  21. Front Public Health. 2026 ;14 1845389
       Background: Large language models (LLMs) hold considerable potential in medical and health education; however, their reliability and interpretability in highly sensitive areas and in decision-making remain unclear. This study focuses on four publicly available LLMs and systematically evaluates their applicability in fertility preservation scenarios for breast cancer patients, thereby providing guidance for targeted use.
    Methods: This study used Google Trends to identify and filter topics related to fertility preservation for breast cancer patients, and analysed the dialogue outputs of four models: GPT-5.4 Thinking, Gemini 3.0, DeepSeek-V3.2, and Microsoft Copilot. To ensure consistent responses and a fair comparison of baseline LLM performance, only one response was generated per query, with no regeneration; all queries were submitted to the four models using standardized prompts. The study found that 26 fertility-preserving response outputs in breast cancer patients exhibited varying patterns, revealing characteristics relevant to fertility-preserving treatments and decision-making for breast cancer patients. Outputs were assessed with the reliability tools DISCERN, EQIP (Evaluation of Information Quality to Patients), GQS (Global Quality Score) and JAMA (JAMA benchmark criteria), and with six widely used readability metrics: Automated Readability Index (ARI), Coleman-Liau Index (CLI), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Flesch Reading Ease Score (FRES).
    Results: The findings indicate statistically significant differences in the reliability of the LLMs when providing highly sensitive, decision-intensive, multidisciplinary medical consultations regarding fertility preservation for breast cancer patients. The average intraclass correlation coefficients for all LLMs ranged from 0.715 to 0.978 (all p-values < 0.001). Microsoft Copilot demonstrated superior information reliability and structural quality [DISCERN 56.5 (49.25, 61); EQIP 65.0 (51.25, 75.0); GQS 3.0 (3.0, 4.0); JAMA 1.0 (1.0, 2.0)], scoring higher than GPT-5.4 Thinking, Gemini 3.0 and DeepSeek-V3.2, which suggests that it can provide more reliable information and better decision-making support. The responses generated by all LLMs were too complex for the general public and failed to meet the recommended reading level of years 6 to 8; most outputs were written at a level equivalent to secondary school education or to that required for legal documents.
    Conclusion: This study reveals differences in the information provided by various LLMs regarding fertility preservation decisions for breast cancer patients, and recommends selecting a model suited to the specific clinical context; the Microsoft Copilot model demonstrated the best performance. Although LLMs demonstrate a certain degree of reliability when handling complex health enquiries, none have met the readability benchmark recommended for a year 6 reading level. Future research should focus on improving the reliability and readability of health information generated by LLMs to enhance comprehension among a wider audience.
    Keywords:  centralized decision-making; fertility preservation; large language models (LLMs); readability; reliability
    DOI:  https://doi.org/10.3389/fpubh.2026.1845389
  22. Ann Jt. 2026 ;11 23
       Background: Patients can obtain medical web information instantly, but there is no guarantee of its accuracy, reliability, or quality. The goal of this study is to critically analyze and provide a comprehensive assessment of the current state of online accessible patient information about robotic-assisted hip arthroplasty (RAHA).
    Methods: The top 50 search results for 'robotic hip arthroplasty' on Google, Bing, and Yahoo were screened. After excluding duplicate entries, advertisements, non-English websites, video platforms, and unrelated websites, 27 patient-oriented websites were included. Three independent reviewers assessed their quality and readability using the DISCERN instrument, Flesch-Kincaid Reading Ease (FRE), Flesch-Kincaid Grade Level (FGL), and Journal of the American Medical Association (JAMA) benchmarks.
    Results: The average DISCERN score was 45.2/80, indicating a fair level of information quality. The mean FRE score was 41.9/100, corresponding to difficult-to-read text, while the mean FGL score was 11.6, indicating that comprehension requires at least a high school education level. The average JAMA benchmark score was 0.96/4, reflecting poor adherence to established credibility and transparency criteria.
    Conclusions: Web-based reporting on academic, private hospital, and general websites needs to be significantly improved, as evidenced by the poor quality and readability of online patient information. Due to its potential benefits over traditional hip replacement surgeries, RAHA is a novel surgery that has recently attracted a lot of attention. However, without efforts to create consistent guidelines and collaborate on patient education strategies, there is a risk of erroneous expectations, unhappy patients, and dissatisfied healthcare providers.
    Keywords:  Patient education; health-related internet use; information quality; readability analysis; robotic hip arthroplasty
    DOI:  https://doi.org/10.21037/aoj-2025-1-87
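    For readers unfamiliar with the two readability scores reported in the abstract above, the following minimal sketch shows how they are conventionally computed. It uses the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas, not code from the study; the function names and the sample counts are hypothetical, and real tools differ in how they count syllables and sentences.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores (towards 100) mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade needed."""
    if words <= 0 or sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a 100-word passage (illustrative only).
fre = flesch_reading_ease(words=100, sentences=5, syllables=150)
fgl = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
```

    Both scores depend only on average sentence length and average syllables per word, which is why long sentences and technical terminology push texts into the "difficult" range.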
  23. BMC Health Serv Res. 2026 May 25.
       BACKGROUND: Health literacy is a key determinant of individuals' ability to use medications safely and correctly. The readability of patient information leaflets (PILs) plays a critical role in patient safety and treatment adherence. Inadequate readability may limit patients' understanding of medication instructions and increase the risk of misuse. This study aimed to comparatively examine the readability levels of patient information leaflets for medications commonly used in the United Kingdom, Germany, and Türkiye.
    METHODS: A cross-sectional design was used in the study. The research was conducted using a document analysis approach. Patient information leaflets for 20 medications with identical names, dosages, and pharmaceutical forms in the United Kingdom, Germany, and Türkiye were analyzed. The sections "What is X and what is it used for?", "What you need to know before using X", "How to use X", "Possible side effects", and "How to store X" were examined separately. Readability was assessed using the Flesch Reading Ease formula for English texts, the Amstad formula for German texts, and the Ateşman formula for Turkish texts.
    RESULTS: Mean readability scores corresponded predominantly to the "difficult" or "very difficult" readability categories in four of the five analyzed sections in Germany and Türkiye and in three sections in the United Kingdom. Texts under the headings "What you need to know before using X" and "Possible side effects" were most frequently classified as "difficult" or "very difficult," largely due to long sentence structures and dense technical terminology. Although variations in readability scores were observed between countries, the overall findings suggest that the extensive use of long sentence structures and technical terminology may limit the comprehensibility of the leaflets for the general public.
    CONCLUSION: Despite existing regulatory requirements, patient information leaflets prepared in different linguistic and healthcare system contexts are not sufficiently accessible to the general public. Simplifying leaflet content and restructuring it in a more user-centered manner-both linguistically and structurally-is essential to promote safe medication use and protect public health. These findings may guide regulatory authorities and pharmaceutical companies in improving policies and practices related to leaflet readability.
    CLINICAL TRIAL NUMBER: Not applicable.
    Keywords:  Health literacy; Medication safety; Patient information leaflets; Public health; Readability
    DOI:  https://doi.org/10.1186/s12913-026-14822-6
  24. Plast Reconstr Surg Glob Open. 2026 May;14(5): e7747
       Background: Migraine is a prevalent neurological disorder that affects quality of life. Surgery is a specialized option for patients with chronic or refractory migraine and requires understanding complex information. As patients rely on online resources, health literacy-the ability to obtain, understand, and use health information-becomes crucial for informed decision-making. However, online medical information often exceeds recommended readability levels.
    Methods: A systematic Google search for "migraine surgery" was performed, excluding duplicates, sponsored content, non-English websites, scientific journals, and multimedia-only sources. Texts from 31 articles across 11 websites were extracted and assessed using 6 validated readability indices: Gunning Fog Index, Coleman-Liau Index, Flesch-Kincaid, Automated Readability Index, Simple Measure of Gobbledygook Index, and Flesch Reading Ease. Mean scores were calculated, and analysis of variance with Tukey honestly significant difference post hoc tests was used to evaluate differences between websites.
    Results: The mean readability level across all websites was higher than the National Institutes of Health and American Medical Association recommended sixth-grade level, corresponding instead to high school or early college-level texts. Flesch Reading Ease averaged 44.9 ("difficult"). Readability varied significantly between websites; some were more accessible, whereas others were highly complex. Technical terminology, long sentences, and limited formatting or visual aids contributed to poor readability.
    Conclusions: Online patient information on migraine surgery is generally written at a level that is too difficult for the average reader, posing barriers to informed decision-making. Improving accessibility through simpler language, glossaries, and visual aids-and testing materials with users-could enhance patient understanding, autonomy, and engagement in healthcare decisions.
    DOI:  https://doi.org/10.1097/GOX.0000000000007747
  25. Sci Rep. 2026 May 23.
      This study aimed to comparatively evaluate the medical information delivery capacity and content quality of current large language models (LLMs), specifically ChatGPT (GPT-5.2), Gemini (3.1), and DeepSeek (V4), regarding oral cavity cancer (OCC) based on expert opinions. Twenty open-ended questions addressing the risk factors, diagnosis, and treatment of OCC were directed to the three models. The responses were evaluated using a blinded method by 31 expert physicians from Oral and Maxillofacial Surgery, Otorhinolaryngology (ENT), and Medical Oncology. The Modified Global Quality Scale (1-5 points) was utilised for evaluation. Statistical analyses were performed using Kruskal-Wallis, ANOVA, and Bonferroni post-hoc tests, while Fleiss' Kappa coefficient determined inter-expert consistency. The general performance scores of the models were high (3.57-4.15). In the overall assessment, Gemini received statistically significantly higher scores than the DeepSeek model (p = 0.036). Significant performance differences were identified across 15 of 20 questions (p < 0.05); ChatGPT excelled on clinical and treatment-oriented questions, while Gemini stood out on comprehensive informational items. While no statistically significant difference was found among the specialist groups for the overall evaluation and 19 out of 20 questions (p > 0.05), a significant difference was observed solely for Q6 (p = 0.042). Although LLMs have the potential to generate high-quality information about OCC, their performance varies by content type and model architecture. While Gemini demonstrated more consistent performance overall, expert supervision remains essential before these tools can be used as reliable sources of clinical information. Clinicians must be aware of the specific strengths and limitations of different LLMs in OCC to better guide patients who increasingly use such tools for medical information.
    Keywords:  Artificial intelligence; Health literacy; Large language models; Mouth neoplasms; Quality of health care
    DOI:  https://doi.org/10.1038/s41598-026-53630-0
  26. Orthod Craniofac Res. 2026 May 30.
      Individuals with orofacial clefts experience a substantial lifetime burden of medical, surgical and dental care, often requiring complex treatment decisions. This scoping review aimed to collate and critically assess the available literature regarding the quality and readability of online cleft-related information. Four electronic databases (MEDLINE via PubMed, Embase via Ovid, Web of Science Core Collection and Scopus) were searched from inception to 17 June 2025. Eligible studies evaluated the quality or readability of cleft-related information on websites, social media platforms or YouTube. Thirty-four studies met the inclusion criteria. Quality assessment tools were used to evaluate accuracy, reliability and comprehensiveness, while readability instruments measured ease of understanding for lay audiences. Twenty-one of the 34 included studies (61.8%) reported low or inconsistent quality of cleft-related online information, with many websites and videos lacking reliability, completeness or readability appropriate for the general public. Online health information about orofacial clefts is abundant but frequently suboptimal in quality and readability. Families may be exposed to misleading or difficult-to-understand content, which could hinder informed decision-making. However, non-English language studies were under-represented in this review, which may limit the generalisability of findings. Clinicians and professional organisations should guide families towards trustworthy resources and develop accessible, high-quality online information. Improving the reliability and readability of cleft-related content has the potential to enhance patient education, shared decision-making and long-term treatment outcomes.
    Keywords:  online information; orofacial cleft; quality of information
    DOI:  https://doi.org/10.1111/ocr.70151
  27. Eur J Dent Educ. 2026 May 25.
       OBJECTIVE: This study aims to analyze the educational quality and technical accuracy of YouTube videos regarding inferior alveolar nerve (IAN) lateralization and transposition surgeries and to evaluate the reliability of critical surgical steps using a novel 'Modified IAN-Surgical Technical Index'.
    MATERIAL AND METHODS: A comprehensive search was conducted on YouTube on December 6, 2025, using four distinct keyword groups. From an initial pool of 200 videos (the first 50 results per keyword), 32 videos meeting the inclusion criteria were included in the final analysis. The videos were evaluated by two independent researchers using the JAMA Benchmark Criteria, the Global Quality Scale (GQS) and the Modified IAN-Surgical Technical Index. Additionally, the videos were categorized based on surgical technique, source type and educational quality.
    RESULTS: A total of 32 videos were analysed. No statistically significant difference in quality was observed between lateralization (n = 18) and transposition (n = 14) videos (p > 0.05). Videos with a GQS score of ≥ 3 received significantly higher user interaction in terms of view and like counts (p < 0.05). Regarding technical steps, 'Foramen Isolation' (p = 0.031) and 'Active Retraction' (p = 0.014) were significantly more prevalent in high-quality videos. Notably, the use of grafting/barriers was significantly more common in academic sources compared to commercial sources (p = 0.039).
    CONCLUSION: YouTube videos regarding IAN surgery demonstrate a heterogeneous distribution in terms of both general educational quality and technical accuracy. However, within this specific surgical field, high-quality educational materials garner significantly higher user interaction. The results indicate that modified indices incorporating procedure-specific technical steps are more effective than general quality scales for the analysis of surgical videos.
    Keywords:  YouTube; inferior alveolar nerve; surgical education; video analysis
    DOI:  https://doi.org/10.1111/eje.70188
  28. Healthcare (Basel). 2026 May 18. pii: 1376. [Epub ahead of print]14(10):
      Background: Social media platforms, particularly Instagram, have become significant sources of health information, yet the quality of dental content remains underexplored. This study compared the scope and reliability of information in popular Turkish-language Instagram posts on tooth whitening by poster source and examined associations with post format, purpose, and whitening approach. As a secondary aim, the association between information quality and normalized audience engagement was investigated within this algorithmically curated sample. Methods: This cross-sectional content analysis included 500 publicly accessible Turkish Instagram posts retrieved under the hashtag #dişbeyazlatma. The posts were classified by source, purpose, format, and whitening approach. Content scope and information reliability were assessed using the Descriptive Coverage Index (DCI) and Modified Treatment-Information Reliability (MTIR) scores by two calibrated evaluators. Engagement Rate was calculated as (likes + comments)/follower count × 100. Results: Most posts originated from dentist/clinic accounts (75.8%) and were marketing-oriented (72.0%). Dentist/clinic accounts demonstrated significantly higher MTIR scores than independent users and brand accounts (p < 0.001), whereas DCI did not differ significantly across sources. Raw engagement differences disappeared after normalization (p = 0.408). Reel posts scored higher than photo posts on both measures; carousel posts scored higher than photos on MTIR but not DCI. In-office whitening content scored significantly higher than DIY- or OTC-focused posts on both measures (p < 0.001). A weak positive association was observed between MTIR and Engagement Rate (r = 0.141). Conclusions: Popular Turkish Instagram posts on tooth whitening exhibited substantial variability in content scope and reliability. Independent users commanded greater raw audience reach yet predominantly produced DIY-focused content with substantially lower content scope scores than in-office and multi-method posts, and among the lowest reliability scores, raising a public health concern within this high-visibility content stratum. These findings may inform content development strategies for dental professionals and public health communicators targeting Turkish-speaking audiences.
    Keywords:  content analysis; engagement rate; misinformation; oral health information quality; social media; tooth whitening
    DOI:  https://doi.org/10.3390/healthcare14101376
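    The Engagement Rate defined in the abstract above, (likes + comments)/follower count × 100, can be written as a short sketch. The function name, the handling of zero-follower accounts, and the sample numbers are assumptions made for illustration; the abstract does not describe how such edge cases were treated.

```python
def engagement_rate(likes, comments, followers):
    """Engagement Rate in percent: (likes + comments) / follower count * 100."""
    if followers <= 0:
        # Assumption: accounts without followers cannot be normalized.
        raise ValueError("follower count must be positive")
    return (likes + comments) / followers * 100


# Hypothetical post: 120 likes and 30 comments on an account with 5000 followers.
rate = engagement_rate(likes=120, comments=30, followers=5000)
```

    Dividing by follower count is what makes small and large accounts comparable, which is consistent with the abstract's report that raw engagement differences between sources disappeared after normalization.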
  29. Yonsei Med J. 2026 Jun;67(6): 502-507
       PURPOSE: Ovarian cancer is the eighth most common malignancy among women worldwide and remains associated with poor survival. Patients and caregivers often experience unmet informational needs and turn to online resources such as YouTube; however, the quality and reliability of ovarian cancer-related content have not been evaluated.
    MATERIALS AND METHODS: On October 4, 2022, the 100 most-viewed YouTube videos retrieved using the keyword "ovarian cancer" were screened, and 86 English-language videos were included. Data collected included uploader type, views, length, likes, dislikes, comments, and content category. Educational quality and informational reliability were assessed using the Global Quality Score (GQS) and modified DISCERN criteria. Three obstetrics and gynecology residents independently rated all videos. Group comparisons were performed using the Mann-Whitney U and Kruskal-Wallis tests, and inter-rater reliability was assessed by intraclass correlation coefficients (ICC).
    RESULTS: Of the 86 videos, 56.98% were uploaded by medical sources and 43.02% by non-medical sources. Informational content comprised 49%, followed by patient experience (43%). The videos collectively accumulated over 16 million views. Non-medical videos were longer and received more likes and comments, with a higher like ratio. The mean GQS was 3.5±1.1 (moderate quality), and the mean modified DISCERN was 2.2±1.1 (low reliability). Modified DISCERN scores were higher in medical than non-medical videos (p=0.001). Inter-rater agreement was excellent (ICC>0.9).
    CONCLUSION: YouTube provides broad access to ovarian cancer-related information, but a substantial gap exists between educational quality and informational reliability. Professional involvement and platform-level strategies may improve evidence attribution and the educational value of popular videos.
    Keywords:  Ovarian neoplasms; health education; information dissemination; patient education as topic; social media
    DOI:  https://doi.org/10.3349/ymj.2025.0374
  30. Medicine (Baltimore). 2026 May 29. 105(22): e49080
      Video platforms are major sources of health information; however, the accuracy of musculoskeletal content is uncertain. Triangular fibrocartilage complex (TFCC) injury is common, and many patients seek guidance online. The present study aims to evaluate TFCC-related videos on Bilibili and TikTok to measure dissemination, quality, and reliability, and to identify features associated with higher-quality content. This study conducted a systematic screening and assessment of videos related to the TFCC on the Bilibili and TikTok platforms. Video characteristics were collected, and quality was assessed by Global Quality Score, modified DISCERN, and Journal of the American Medical Association (JAMA) benchmark criteria. A total of 211 TFCC-related videos were analyzed. TikTok exhibited higher dissemination metrics (likes, comments, collections, and shares; P < .001) but a shorter duration than Bilibili. TikTok also hosted a higher proportion of medical professionals (51% vs 27%) and demonstrated significantly higher JAMA benchmark criteria scores (P = .04). Videos from orthopedic and rehabilitation specialists achieved superior Global Quality Score, modified DISCERN, and JAMA scores (P < .001) compared to nonmedical uploaders. Video length correlated positively with quality on Bilibili (r = 0.39, P < .05), whereas engagement metrics did not correlate with information quality. Video quality and reliability of TFCC-related content varied by platform and creator, although the overall quality remained suboptimal. TikTok videos achieved broader reach and higher average quality, whereas medical professionals produced the most reliable content. Uploaders should ensure accuracy, originality, and clarity, while platforms should refine algorithms to highlight evidence-based videos.
    Keywords:  accurate health information; public health; social media; triangular fibrocartilage complex injury; video quality
    DOI:  https://doi.org/10.1097/MD.0000000000049080
  31. Clin Rheumatol. 2026 May 25.
       BACKGROUND: YouTube is increasingly being used for health information; however, the quality of videos on erythema nodosum (EN), the most common form of septal panniculitis, remains unclear.
    OBJECTIVE: This study evaluated English-language YouTube videos on EN for their quality and reliability.
    METHODS: In this cross-sectional study, the search was conducted on November 15, 2025, using the keywords "erythema nodosum," "erythema nodosum causes," "erythema nodosum symptoms," and "erythema nodosum treatment." The first 100 videos were screened for each search term. After applying the exclusion criteria, 61 videos were included in the analysis and categorized according to uploader type and presentation format. Quality and reliability were measured using the Global Quality Scale, modified DISCERN tool, JAMA Benchmark Criteria, and Patient Education Materials Assessment Tool for Audiovisual Content. The statistical analyses included inter-rater agreement, group comparisons, and correlations.
    RESULTS: Among the 61 videos, 47.5% were of low quality, 24.6% were moderate quality, and 27.9% were of high quality. Physician-uploaded videos were generally of higher quality, whereas patient-generated content lacked educational value. Traditional narration and slides dominated, with limited use of animations or patient stories. Viewer engagement, including likes and comments, correlated with quality, but view count did not. The longer and more recent videos tended to score better. The assessment tools showed complementary correlations.
    CONCLUSION: The quality and reliability of YouTube videos on EN are highly variable, with nearly half containing low-quality information. Physician-produced videos were generally more reliable, whereas patient-generated content showed limited educational value. These findings highlight the need for greater expert involvement, improved source transparency, and more engaging evidence-based educational content to reduce misinformation and support patient education on EN. Key Points • The quality of English-language YouTube videos on erythema nodosum demonstrates substantial variability, with nearly half categorized as low quality. • The highest quality content is mostly produced by physicians, while patient experience videos are educationally insufficient. • The numbers of likes and comments show a positive correlation with content quality, whereas view counts are not a reliable indicator of quality.
    Keywords:  Erythema nodosum; Information science; Internet; Social media
    DOI:  https://doi.org/10.1007/s10067-026-08178-9
  32. BMC Cardiovasc Disord. 2026 May 26.
       BACKGROUND: Congenital heart disease (CHD) is the most common birth defect and a major contributor to childhood morbidity and mortality. Short-video platforms, particularly TikTok and Bilibili, have become prominent sources of public health information. However, the quality and reliability of CHD-related educational content on these platforms remain insufficiently characterized.
    METHODS: We conducted a cross-sectional analysis in October 2025 by retrieving the top 100 CHD-related videos from TikTok and Bilibili (n = 200). Two independent reviewers assessed each video using the Global Quality Scale (GQS) and a modified DISCERN instrument. Video characteristics, uploader type, and user engagement indicators were extracted. Analyses included descriptive statistics, nonparametric group comparisons, and Spearman correlation tests.
    RESULTS: Overall video quality was moderate to high, with a median GQS score of 4 (IQR 3-4), while reliability was moderate, with a median modified DISCERN score of 3 (IQR 3-4). Medical professionals were the most common uploaders (49%). A notable tension between reach and scientific rigor was observed: TikTok videos achieved substantially higher engagement (median likes 733 vs 60; p < 0.001) yet exhibited lower average GQS scores than Bilibili. Narrative/storytelling formats generated the highest interaction (median likes 3,248 on TikTok) but received the lowest quality and reliability ratings, whereas expert explanations and animation/diagram-based videos showed superior credibility. Engagement indicators displayed weak to negligible negative correlations with GQS and modified DISCERN scores. After 2023, uploader profiles shifted toward professionalization (medical professional uploaders increased from 26.3% to 62.9%), but this change was not associated with statistically significant improvements in quality metrics.
    CONCLUSION: Although CHD-related short videos on major platforms demonstrate generally acceptable quality, content popularity remains poorly aligned with scientific reliability. These findings highlight the need to balance engagement with accuracy in digital health communication. Future efforts should strengthen science communication training for professional creators and implement platform-level quality assurance mechanisms to enhance the public health value of short-video content.
    Keywords:  Bilibili; Congenital heart disease; Content analysis; Health information; Quality; Reliability; Short video; TikTok
    DOI:  https://doi.org/10.1186/s12872-026-06019-w
  33. J Thorac Dis. 2026 Apr 30. 18(4): 299
       Background: In recent years, TikTok and Bilibili, two popular Chinese social media platforms, have played an important role in health communication by providing a wide range of health-related content to diverse audiences. This study aimed to evaluate the content, quality, and reliability of aortic dissection (AD)-related videos on these two platforms.
    Methods: Using the Chinese-language keyword for "aortic dissection", we initially retrieved the top 150 videos per platform under the default ranking. We recorded video duration, engagement metrics, and uploader identity. Video quality was assessed using the Global Quality Score (GQS) and the modified DISCERN (mDISCERN) scale. Between-group differences were compared using the Mann-Whitney U and Kruskal-Wallis tests. Spearman's rank correlation was applied to examine associations among variables.
    Results: A total of 228 videos were included. Content primarily focused on treatment (55.26%), with relatively sparse coverage of diagnosis (33.77%) and prognosis (14.91%). The median GQS was 2.50 [interquartile range (IQR), 2.00-3.00], and the median mDISCERN score was 3.00 (IQR, 2.00-3.00). Videos on TikTok had higher GQS and mDISCERN scores (P<0.05). Compared with individual users and organizations, videos produced by specialist physicians achieved higher GQS and mDISCERN scores (P<0.05). Engagement metrics were not associated with GQS or mDISCERN (P>0.05).
    Conclusions: The overall quality and reliability of AD-related videos on TikTok and Bilibili remain suboptimal, with insufficient attention to diagnosis and prognosis. Specialist-produced content shows higher reliability, whereas popularity does not reflect accuracy. Future efforts should strengthen platform oversight and review, encourage greater participation by specialist physicians in producing AD content, and optimize health information dissemination strategies.
    Keywords:  Aortic dissection (AD); Bilibili; TikTok; information quality; social media
    DOI:  https://doi.org/10.21037/jtd-2026-1-0121
  34. Contracept Reprod Med. 2026 May 25.
       BACKGROUND: TikTok is a key source of contraceptive information, including female permanent contraception (FPC), for young adults. Therefore, this cross-sectional study seeks to assess the quality of health information regarding FPC as presented in top-viewed TikTok videos.
    METHODS: Two independent reviewers analyzed the 101 most-viewed videos with the hashtags #tubal, #tuballigation, and #tubestied for creator demographics, tonality, and content. Statistical analysis using the two-sample Wilcoxon rank-sum (Mann-Whitney) test was performed to compare DISCERN and Patient Education Materials Assessment Tool (PEMAT) scores between medical professionals and laypeople.
    RESULTS: Many videos portray personal experiences and have a negative tone, highlighting distrust, side effects, and tubal failure resulting in pregnancy. Educational content has a significantly higher average medical accuracy (p = 0.003) and content accessibility (p = 0.02) scores when created by medical professionals as compared to laypeople.
    CONCLUSIONS: As TikTok promotes engagement over shared concerns, many videos portray a narrative of dissatisfaction and complication following permanent contraception procedures. Healthcare professionals may find value in utilizing the platform to intentionally share credible content and counter misinformation regarding permanent contraceptive methods.
    Keywords:  Consumer health information; Contraception; Mobile social media; TikTok; Tubal ligation
    DOI:  https://doi.org/10.1186/s40834-026-00459-7
  35. Sci Rep. 2026 May 28.
      Urolithiasis is a common and highly recurrent urological disorder. With the growing role of short video platforms in health communication, Douyin (TikTok) has become an important medium for public medical education. However, the quality and user engagement characteristics of urolithiasis-related content remain insufficiently evaluated. This study aimed to assess the content quality and user engagement of urolithiasis-related videos on TikTok and to explore how video source, content theme, and presentation style are associated with educational value and audience response. A total of 400 Chinese-language TikTok videos across five stone types, including renal, ureteral, bladder, urethral, and urinary tract stones, were analyzed. Content quality was evaluated using four validated instruments, namely the Journal of the American Medical Association (JAMA) Benchmark Criteria, Global Quality Scale (GQS), modified DISCERN (mDISCERN) instrument, and the Patient Education Materials Assessment Tool (PEMAT). User engagement metrics included likes, comments, shares, and saves. Statistical analyses were conducted to examine variations in quality and engagement across source, content, and presentation form. Overall content quality was low (mean JAMA 1.23; GQS 2.84; mDISCERN 3.09). Videos focusing on prevention and recurrence demonstrated higher PEMAT understandability and actionability scores and were associated with greater user engagement. Hospital- and news agency-based videos showed higher content quality; however, lower engagement was primarily observed in hospital-based videos. Animation and image-text formats were associated with improved understandability. Traditional Chinese Medicine (TCM)-themed videos, although less frequent, demonstrated relatively high actionability and user engagement. The quality and communication effectiveness of urolithiasis-related videos on TikTok varied substantially. 
Prevention-oriented content, culturally relevant framing, and visually supported formats were associated with improved user engagement and understanding. Integrating evidence-based information with platform-adaptive design may enhance the effectiveness of digital health communication.
    Keywords:  Health information; Quality assessment; Short videos; TikTok; Urolithiasis
    DOI:  https://doi.org/10.1038/s41598-026-54245-1
  36. J Thorac Dis. 2026 Apr 30. 18(4): 281
       Background: TikTok (known as Douyin in mainland China) has emerged as a major source of health information. Aortic dissection (AD) is a rapidly fatal emergency in which delayed recognition or misinformation can have catastrophic consequences, yet the quality of short-video content on this condition remains unclear. This study aimed to systematically assess the quality and reliability of AD-related videos on TikTok and to examine their association with user engagement and video features.
    Methods: A systematic search was conducted on TikTok using the keyword "aortic dissection" to identify videos published before March 1, 2026. Videos were included if they addressed AD and excluded if they were duplicates, irrelevant to the topic, or involved medical insurance content; 151 videos were ultimately analyzed. Video features (uploader type: healthcare professionals, general users, or news media; duration; and engagement metrics) and quality/reliability were evaluated using the Global Quality Scale (GQS, 1-5) and modified DISCERN (mDISCERN, 0-5) by two independent specialists. Continuous variables are presented as median [interquartile range (IQR)], and analyses used the Mann-Whitney U test, Spearman correlation, and multivariable linear regression.
    Results: Across the 151 videos, overall quality was moderate [median GQS: 3 (IQR, 2-4) and median mDISCERN: 2 (IQR, 1-2)]. Most videos (92.72%) were uploaded by health professionals (mainly physicians). Videos posted by health professionals had significantly higher GQS and mDISCERN scores than those posted by non-health professionals (GQS: P<0.001; mDISCERN: P=0.003). In the adjusted models, video length was positively associated with GQS (β=0.002, P=0.003) and mDISCERN scores (β=0.001, P=0.049), whereas higher numbers of likes and comments were associated with lower GQS (likes: β=-0.002, P=0.005; comments: β=-0.041, P=0.004) and mDISCERN scores (likes: β=-0.001, P=0.02; comments: β=-0.024, P=0.007), demonstrating a distinct "popularity paradox".
    Conclusions: Although Chinese-language TikTok videos on AD are predominantly created by health professionals, overall quality remains suboptimal. Longer videos tend to be higher quality, whereas high-engagement content exhibits lower reliability. Measures such as adopting structured storytelling formats and platform-certified labels could help transform TikTok into a reliable tool for public education on critical illnesses.
    Keywords:  Aortic dissection (AD); Global Quality Scale (GQS); TikTok; modified DISCERN; social media
    DOI:  https://doi.org/10.21037/jtd-2025-1-2540
  37. Sci Rep. 2026 May 29.
      Hydrocephalus is a syndrome characterized by disturbances in cerebrospinal fluid dynamics, impacting individuals across all age groups. Short video platforms have emerged as a primary medium for disseminating medical knowledge, offering a convenient means for patients and their families to access relevant information. Nevertheless, the varying quality of these videos presents significant challenges to the effective communication of medical knowledge. This study provides a systematic evaluation of popular science videos on hydrocephalus available on leading Chinese short video platforms. This study included 181 short science videos related to hydrocephalus collected from Bilibili and TikTok. The characteristics of these videos were analyzed, and their reliability and quality were assessed utilizing the modified DISCERN instrument, the Global Quality Score (GQS), and the JAMA benchmark criteria. Overall video quality was low on both platforms. The median GQS and modified DISCERN scores did not differ significantly between platforms (P > 0.05), while JAMA scores showed a significant difference (P < 0.05). On TikTok, longer videos were positively correlated with higher GQS and modified DISCERN scores; no such correlation was found on Bilibili. Treatment-related content was most common, with childhood hydrocephalus emphasized on TikTok and adult hydrocephalus on Bilibili. Expert-produced videos achieved significantly higher quality scores than content from non-experts (P < 0.05). On the TikTok and Bilibili platforms, the overall quality and reliability of hydrocephalus-related short videos remain suboptimal, and a portion of the content does not meet established standards for science communication. However, TikTok demonstrates higher user engagement and better video quality compared to Bilibili, with expert-produced videos proving to be more reliable. 
Educational videos focused on childhood hydrocephalus also show higher engagement than those addressing adult hydrocephalus. These findings highlight the urgent need to improve the quality of hydrocephalus-related short-video content and to promote early intervention strategies.
    Keywords:  GQS; Hydrocephalus; JAMA; Modified DISCERN score; Short video
    DOI:  https://doi.org/10.1038/s41598-026-55440-w
  38. J Thorac Dis. 2026 Apr 30. 18(4): 344
       Background: Rib fracture is one of the most common thoracic injuries and is associated with impaired respiratory function, severe pain, and adverse clinical outcomes. With the increasing demand for accessible health information, short-video platforms such as TikTok and Bilibili have emerged as major channels for public medical knowledge dissemination. However, due to the lack of standardized content regulation, the quality, accuracy, and reliability of health-related videos remain highly variable. In this context, a systematic evaluation of such content is needed. Therefore, this study aims to comprehensively assess the quality, reliability, and dissemination characteristics of rib fracture-related videos on TikTok and Bilibili, and to further examine the impact of uploader background on content quality.
    Methods: A cross-sectional study was conducted on 136 videos (TikTok=85; Bilibili=51). Video characteristics, uploader information, and engagement metrics were systematically collected. Content quality and reliability were evaluated using four validated instruments: the Patient Education Materials Assessment Tool for Audiovisual Materials (PEMAT-A/V), Video Information and Quality Index (VIQI), Global Quality Score (GQS), and modified DISCERN (mDISCERN). Non-parametric statistical tests were applied for group comparisons, and Spearman correlation analysis was used to explore associations between video quality and engagement indicators.
    Results: TikTok videos were significantly shorter and demonstrated higher engagement metrics compared with Bilibili videos (all P<0.001). In contrast, Bilibili videos showed superior educational quality. The GQS score was significantly higher on Bilibili (3 vs. 2, P=0.019), while PEMAT-A/V and VIQI scores were also higher, although without statistical significance. Videos uploaded by medical professionals or institutional accounts achieved significantly higher scores across all quality assessment tools (P<0.01). Correlation analysis indicated that video quality was weakly to moderately negatively associated with engagement metrics, particularly comment counts (e.g., TikTok mDISCERN: r=-0.270, P<0.05).
    Conclusions: The overall quality and reliability of rib fracture-related short videos on TikTok and Bilibili remain suboptimal. While Bilibili provides relatively higher-quality and more educational content, TikTok demonstrates stronger dissemination capacity. Uploader background plays a critical role in determining content quality. Enhancing professional participation and establishing standardized content review and certification mechanisms are essential to improve the accuracy, reliability, and public health value of medical information on short-video platforms.
    Keywords:  Rib fracture; health information quality; short video; social media
    DOI:  https://doi.org/10.21037/jtd-2025-1-2502
  39. BMC Gastroenterol. 2026 May 25.
       BACKGROUND: Choledocholithiasis, a disease with a rising incidence and potential for severe complications, has prompted many to seek health information online. TikTok and Bilibili have emerged as key platforms for disseminating such information. This study evaluates the quality and reliability of short videos on choledocholithiasis on these platforms.
    METHODS: This study analyzed the top 100 choledocholithiasis-related videos from TikTok and Bilibili. The Global Quality Score (GQS), modified DISCERN (mDISCERN) tool, and the Journal of the American Medical Association (JAMA) criteria were employed to assess video quality. Cohen's kappa coefficient was used to assess inter-rater agreement. Group comparisons were conducted using Mann-Whitney U and Kruskal-Wallis H tests, while Spearman's correlation was utilized for correlation analysis.
    RESULTS: A total of 170 videos were included, predominantly uploaded by hepatobiliary surgeons. The content of these videos mainly focuses on treatment (81.76%), with limited coverage of etiology and diagnosis. The overall video quality was mediocre, with median scores of 3 for GQS (IQR: 3.00-4.00), 2 for mDISCERN (IQR: 2.00-2.00), and 2 for JAMA (IQR: 2.00-2.00). Videos from hepatobiliary surgeons generally exhibited superior quality. Notably, video quality showed no correlation with engagement metrics.
    CONCLUSIONS: The content of choledocholithiasis-related videos is structurally deficient. The videos' quality, reliability, and transparency are relatively poor, though those uploaded by hepatobiliary surgeons stand out for their superior quality and reliability. Importantly, video quality is independent of engagement metrics.
    Keywords:  Bilibili; Choledocholithiasis; Health information; Short video; TikTok
    DOI:  https://doi.org/10.1186/s12876-026-04960-w
  40. Sci Rep. 2026 May 29.
      This study aims to evaluate the information quality and reliability of Exercise-Induced Fatigue short videos on Douyin and to analyze their associations with video sources, content themes, and user engagement metrics. In this cross-sectional study, user engagement data from 190 Douyin videos were extracted using the Octopus web-scraping tool. Two independent reviewers evaluated video quality and reliability using the Global Quality Scale (GQS) and the modified DISCERN instrument (mDISCERN). The findings revealed that a majority of the videos (77.37%) were created by Non-professional Individuals. Although videos from Professional Individuals or Organizations demonstrated significantly higher quality and reliability (p < 0.001), their user engagement did not significantly differ from that of non-professional videos. The most popular content theme was Clinical Manifestations, yet this category represented an area of notably lower information quality (median GQS: 2 out of 5, IQR: 1-3; p < 0.001). Overall information reliability was insufficient (median mDISCERN: 2 out of 5, IQR: 1-3), with critical deficiencies in "mentioning uncertainties" (1.05%) and "providing additional resources" (25.26%). A key finding was that video quality and reliability were significantly negatively correlated with the number of Comments (GQS: ρ = -0.24, p = 0.001; mDISCERN: ρ = -0.18, p = 0.012), while no significant correlations were observed with other engagement metrics (Likes, Favorites, or Shares). Despite the high popularity of Exercise-Induced Fatigue content on Douyin, the platform suffers from overall low information quality and structural imbalances. High-engagement content themes tend to exhibit low informational value, and the misalignment between user interaction and information reliability poses potential risks to the public information environment. Collaborative efforts are urgently needed to improve content quality and optimize the digital health information environment.
    DOI:  https://doi.org/10.1038/s41598-026-47639-8
  41. Digit Health. 2026 Jan-Dec;12:12 20552076261455205
       Objective: This study evaluated the information quality and user engagement of osteoporosis-related videos on Bilibili and TikTok, and examined their associations with uploader characteristics, content topics, and video quality factors.
    Methods: On October 11, 2025, we conducted a systematic evaluation of the information quality and reliability of the top 100 Chinese-language short videos related to osteoporosis on the TikTok and Bilibili platforms, ultimately including 171 valid videos for analysis. Using the Global Quality Scale (GQS) and a modified DISCERN instrument, we assessed multiple dimensions of video content. Furthermore, Spearman correlation analysis and the Kruskal-Wallis test were employed to examine how platform type, uploader category, and content characteristics influence video quality.
    Results: A total of 171 videos were included in the analysis, comprising 82 from Bilibili and 89 from TikTok. Bilibili videos exhibited significantly longer durations compared to TikTok videos (median 307.5 seconds vs. 151.0 seconds; P < 0.001). Conversely, TikTok videos demonstrated significantly higher user engagement metrics, including median number of likes (3407.0 vs. 65.0; P < 0.001), collections (1432.0 vs. 78.0; P < 0.001), and shares (677.0 vs. 48.5; P < 0.001). Regarding uploader characteristics, professional institutions contributed only 4.1% of the total sample. Nevertheless, videos uploaded by professional institutions achieved the highest median GQS score (4.50) and mDISCERN score (4.00), significantly surpassing those uploaded by professional individuals and non-professional individuals (P = 0.011 and P < 0.001, respectively). User engagement metrics strongly intercorrelated (r = 0.87-0.94, all P < 0.001) but correlated only weakly with quality scores (|r| < 0.27).
    Conclusions: Bilibili videos feature longer durations and more detailed content, whereas TikTok videos demonstrate superior user engagement. Videos uploaded by professional institutions attained higher quality ratings compared to other uploader types, although this finding is based on a small subsample (n=7) and should be interpreted with caution. Medication-related content attracted the greatest public attention. Nevertheless, the weak correlation between user engagement and quality scores indicates that high popularity does not equate to high informational reliability. These findings underscore the need to strengthen professional credentialing mechanisms and optimize algorithmic recommendations to enhance both the scientific accuracy and communicative reach of osteoporosis health information on short video platforms.
    Keywords:  BiliBili; GQS; TikTok; health information quality; mDISCERN; osteoporosis; short-video platforms
    DOI:  https://doi.org/10.1177/20552076261455205
  42. Sci Rep. 2026 May 28.
      Social media platforms, particularly short-video applications, have emerged as crucial channels for disseminating public health information. Human papillomavirus (HPV) infection in males represents a significant sexually transmitted disease burden with under-addressed health impacts. Despite its prevalence, the quality of HPV-related content targeting male audiences on these platforms remains unevaluated, posing potential risks to public health literacy. We conducted a cross-sectional analysis of 265 TikTok and Bilibili videos addressing male HPV infection. Video reliability and educational quality were assessed using the modified DISCERN (mDISCERN) instrument and Global Quality Scale (GQS). The study used statistical analysis to examine the content and quality of videos on the two platforms, the identities of the uploaders and the correlations between these factors and user engagement data. Overall video quality was low (median GQS = 2.00; mDISCERN = 2.00). TikTok outperformed Bilibili in both engagement metrics and quality scores (GQS: TikTok median = 2.00 [IQR 2.00-3.00] vs. Bilibili = 2.00 [2.00-2.00], p < 0.01; mDISCERN: 2.00 [2.00-3.00] vs. 2.00 [2.00-2.00]). Content gaps were prominent: only 15.8% (n = 42) covered prevention strategies (e.g., vaccination), while symptom descriptions dominated (29.8%). Specialist-generated videos scored significantly higher in quality (GQS = 3.00 [2.00-3.00]; mDISCERN = 2.00 [2.00-3.00]) than non-specialist content (GQS = 2.00 [2.00-2.00]; mDISCERN = 2.00 [2.00-2.00]). Engagement metrics showed no correlation with quality scores. Short-video platforms exhibit suboptimal quality in disseminating male HPV information, with TikTok marginally superior to Bilibili. Specialist involvement enhances content reliability, underscoring the importance of leveraging professional health communication on social media. 
Public health initiatives must prioritise engaging experts to amplify accurate prevention messaging, particularly regarding vaccination, and address current informational inequities, thereby improving community health outcomes.
    Keywords:  Bilibili; Health information; Male HPV infection; Short video; TikTok; Video quality
    DOI:  https://doi.org/10.1038/s41598-026-54717-4
  43. Front Health Serv. 2026 ;6 1830227
       Introduction: Artificial intelligence-based chatbots are increasingly used by patients to obtain medical information before healthcare encounters. However, the reliability and safety of chatbot-generated responses remain uncertain, particularly for topics involving procedural sedation.
    Methods: This study evaluated the quality, safety, and confabulation risk of responses generated by ChatGPT, Gemini, and Copilot to common patient questions about procedural sedation. A set of standardized patient-oriented questions was submitted to each chatbot, and responses were independently evaluated by clinical experts using predefined criteria assessing informational quality, clinical safety, and the presence of confabulated or misleading content.
    Results: The results demonstrated variability in response quality across chatbots, with several answers containing incomplete information, safety omissions, or potentially misleading statements. Although many responses provided generally understandable explanations, important clinical details relevant to patient safety were inconsistently addressed.
    Discussion: These findings suggest that while AI chatbots may support patient education, their responses regarding procedural sedation may contain safety gaps and confabulated content that limit their reliability as standalone sources of medical information. Careful oversight and clinician-guided use of AI-generated health information may therefore be necessary to ensure safe and accurate patient communication.
    Keywords:  artificial intelligence; chatbots; confabulation; patient education; procedural sedation
    DOI:  https://doi.org/10.3389/frhs.2026.1830227
  44. Womens Health (Lond). 2026 Jan-Dec;22:22 17455057261456878
       Background: Social media is a widely used source of health information, yet the quality and accuracy of shared content can vary. Menstrual and hormonal health are especially prone to misinformation, given their complexity, individual variability, and the limited strength and consistency of available research evidence. This study explored Instagram content related to nutrition and menstrual health, identifying commonly recurring themes and assessing the quality and credibility of the information by comparing claims with current evidence and examining account holder credentials.
    Objectives: To identify predominant content themes and evaluate the quality and credibility of nutrition-related menstrual health information on Instagram.
    Design: Qualitative study using inductive content analysis.
    Methods: Instagram accounts posting nutrition content about hormonal and menstrual health were identified through a structured search conducted on 29/01/25, with the ten most recent posts per account analysed. Inductive content analysis was used to code 52 posts from eligible accounts into emerging themes using NVivo software.
    Results: Of 50 Instagram accounts identified, 21 met inclusion criteria. Four main themes were identified: Marketing and Engagement Strategies; Nutrition and Dietary Recommendations; Hormonal and Physiological Claims; and Symptoms, Wellbeing and Lifestyle Factors. Across these themes, many posts included claims that were not fully supported by current scientific evidence or lacked important context and nuance, particularly those relating to hormonal regulation, cycle-based nutrition advice and supplementation.
    Conclusion: Findings from this study suggest that nutrition-related Instagram content on menstrual health frequently includes claims that are not fully aligned with current evidence or are presented without sufficient context or nuance. These findings highlight the importance of critically evaluating online health information and the need for clearer, evidence-based guidance in this area.
    Keywords:  content analysis; cycle syncing; hormone balance; menstrual cycle; nutrition; social media
    DOI:  https://doi.org/10.1177/17455057261456878
  45. BMC Pediatr. 2026 May 28.
      
    Keywords:  Bilibili; Information quality; Pediatric inguinal hernia; TikTok
    DOI:  https://doi.org/10.1186/s12887-026-07048-2
  46. Orthop Rev (Pavia). 2026 ;18 162078
       Background: Trust in physicians remains a cornerstone of effective healthcare delivery; however, the rapid expansion of online and non-physician health information sources has introduced new challenges to patient decision-making. In orthopedic practice, delayed evaluation of musculoskeletal symptoms, particularly those concerning for malignancy, may be influenced by misinformation and alternative care pathways.
    Methods: We conducted a cross-sectional, nationally weighted survey of U.S. adults (n=200) to evaluate care-seeking behavior, trust in information sources, and endorsement of cancer-related misconceptions. The survey assessed willingness to delay care for persistent bone or back pain, reliance on physician versus non-physician information sources, and responses to conflicting health information. Stratified analyses were performed based on chronic pain status and prior cancer-related experience.
    Results: Although 85% of respondents identified physicians as their most trusted source of health information, 55% reported regular use of online or social media platforms. A substantial proportion of participants reported willingness to delay care for ≥3 weeks or until symptom progression. Individuals with chronic pain demonstrated significantly higher odds of delayed care-seeking (p<0.05) and misinformation endorsement (OR 1.5-2.5, p<0.05). Reliance on online information was independently associated with delayed medical evaluation (OR ~2.0, p<0.01). Notably, a subset of respondents reported prioritizing online information over physician recommendations when conflicts arose, representing the highest-risk group for delayed care.
    Conclusion: A pronounced trust-behavior paradox exists in orthopedic care, wherein high trust in physicians does not consistently translate into timely care-seeking. Misinformation and reliance on non-physician information sources contribute to diagnostic delay, particularly among individuals with chronic pain or prior cancer-related experience. Targeted patient education and engagement in digital information spaces are critical to mitigating these risks.
    Keywords:  Cancer misconceptions; Care-seeking behavior; Musculoskeletal oncology; Online health information
    DOI:  https://doi.org/10.52965/001c.162078