bims-librar Biomed News
on Biomedical librarianship
Issue of 2026–06–28
38 papers selected by
Thomas Krichel, Open Library Society



  1. Epilepsy Res. 2026 Jun 19. pii: S0920-1211(26)00126-9. [Epub ahead of print]226 107856
       BACKGROUND: Limited health literacy among individuals with epilepsy is associated with poor health outcomes, highlighting the importance of accessible, evidence-based self-care resources. Mobile health applications represent a promising avenue for supporting epilepsy self-management; however, the quality and reliability of app content may significantly affect patient trust, clinical interactions, and health outcomes. Therefore, this study aims to evaluate the quality of available Persian-language epilepsy-related mobile applications.
    METHODS: Persian-language apps related to epilepsy and seizures were systematically identified from Google Play, Café Bazaar, and IranApps using relevant keywords. After excluding non-Persian and duplicate apps, the eligible applications were independently evaluated using the user version of the Mobile Application Rating Scale (uMARS) and DISCERN tools.
    RESULTS: Of 659 identified applications, 78 were epilepsy-related; following exclusions, 11 apps met the inclusion criteria for full evaluation. The mean overall uMARS score was 2.8 ± 0.5 out of 5, with six of the 11 apps (54%) scoring above 3. The section-specific mean scores were as follows: engagement, 2.2 ± 0.5; functionality, 4.0 ± 0.4; esthetics, 3.3 ± 0.9; and information, 2.3 ± 0.4, out of 5. DISCERN total scores ranged from 26 to 40 out of 80 (mean 34.5 ± 4.2), and the mean reliability score was 18.2 ± 3.9.
    CONCLUSION: The results showed that Persian-language epilepsy-related apps demonstrated high functionality but limited support for behavior change, engagement, and esthetics. Information quality was generally poor, and none of the evaluated apps were free of charge. These findings highlight the urgent need for developing high-quality, evidence-based epilepsy apps that support comprehensive self-care and behavioral change strategies for Persian-speaking users.
    Keywords:  Content Evaluation; DISCERN; Epilepsy; Farsi; MHealth; UMARS
    DOI:  https://doi.org/10.1016/j.eplepsyres.2026.107856
  2. JMIR Form Res. 2026 Jun 23. 10 e81967
       Background: The quality of online information regarding the risks associated with meat consumption could play a crucial role in shaping consumers' behavior.
    Objective: This study aimed to investigate the quality of Italian, British, and American websites addressing this topic.
    Methods: A cross-sectional assessment of the top 100 British, Italian, and American web pages on the risks attributable to meat consumption was performed using the JAMA benchmarks tool, evaluating authorship by certified professionals and the inclusion of information on recommended meat consumption, potential meat substitutes, and coverage of issues such as diet sustainability and cancer, cardiovascular, and chronic disease prevention. Websites were then classified according to their stance toward meat consumption (neutral, promoting, or demonizing).
    Results: American and British websites were classified as high quality in 61% (61/100) and 78.1% (75/96) of cases, respectively, while only 22.3% (21/94) of Italian websites were classified as high quality. Multinomial regression showed that web pages with a demonizing stance toward meat consumption and those authored by certified health professionals were less likely to be Italian than American. Similarly, web pages discussing environmental risks and chronic diseases associated with excessive meat consumption were less likely to be Italian. Compared with American web pages, those promoting meat consumption and those authored by qualified professionals were less likely to be British. Web pages discussing chronic disease risks were also less likely to be British, whereas those mentioning cancer risks were more likely to be British.
    Conclusions: The widespread prevalence of poor online information quality, especially in certain countries, demands action. Promoting user education in assessing the reliability of websites and involving health professionals in this educational effort may represent viable strategies.
    Keywords:  Italy; United Kingdom; United States; cancer prevention; comparative analyses; diet sustainability; health risks; meat consumption; online information
    DOI:  https://doi.org/10.2196/81967
  3. J Craniofac Surg. 2026 Jun 26.
       BACKGROUND: Scar is an inevitable pathological product of tissue injury repair, and pathological scars often occur in exposed areas, bringing severe psychological burden and economic losses to patients. With the popularization of digital healthcare, patients increasingly rely on artificial intelligence (AI) for self-consultation, but the core capabilities of free generative AI in scar management have not been systematically evaluated.
    OBJECTIVE: This study compared and evaluated the comprehensive performance of ChatGPT-5.4 mini and Gemini 3 Flash in answering clinical and psychological questions of scar patients, investigated multi-dimensional differences, and provided support for the application of AI in patient education.
    METHODS: Fifteen core questions from scar patients were extracted and input into ChatGPT-5.4 mini and Gemini 3 Flash, respectively. The DISCERN-AI scale and Global Quality Scale (GQS) were used for evaluation, while multiple standardized tools were applied to quantify text readability and complexity. All data were subjected to a normality test and difference analysis using SPSS software.
    RESULTS: Both models demonstrated high clinical reliability, with no significant difference in target topic clarity (P=0.806). ChatGPT had better overall quality, with a GQS score of 4.8 (4.5, 4.9), which was significantly higher than Gemini's 4.6 (4.4, 4.7) (P=0.033). ChatGPT was also more rigorous in stating medical limitations and uncertain treatment options (5.0 versus 4.5, P<0.05). In contrast, Gemini performed better in patient demand relevance and empathy (4.5 versus 4.0, P=0.026). Both models achieved moderate scores in shared decision-making support. Readability analysis showed that the reading thresholds of both models were excessively high, far exceeding the internationally recommended 6th- to 8th-grade standard for patient education materials.
    CONCLUSION: ChatGPT-5.4 mini and Gemini 3 Flash have complementary advantages and potential as auxiliary tools for digital health education in scar patients, but both have a serious readability gap. For future large-scale applications, readability prompt intervention should be introduced, and it should be clearly stated that AI cannot replace professional diagnosis and treatment to ensure the inclusiveness and safety of digital medical information.
    Keywords:  ChatGPT-5.4 mini; Gemini 3 Flash; patient education; scar
    DOI:  https://doi.org/10.1097/SCS.0000000000013099
  4. Eur Arch Otorhinolaryngol. 2026 Jun 22.
       PURPOSE: To propose a novel, standardized, and safety-centered tool for Large Language Models (LLMs) evaluation.
    METHODS: The Medical Evaluation of Large Language Model Answers Questionnaire (MELMA-Q) was developed as a 30-item clinician-rated instrument spanning seven domains: Medical Accuracy/Groundedness, Clinical Reasoning/Management, Safety/Ethics/Trustworthiness, Linguistic Quality/Semantic Fidelity, Understandability/Literacy Adaptation, Usefulness/Decision Support, and Performance/Answer Behavior. The MELMA Clinical Acceptability Framework (MELMA-CAF) is a two-tier system that incorporates a non-compensatory safety gate and weighted scoring. Five standardized otolaryngology scenarios were posed to three LLMs (ChatGPT 5.2, Gemini Flash 3, DeepSeek v3.2), generating 15 responses, which were independently scored by five blinded ENT specialists. A web-based implementation (MELMA-W) operationalized rubric-based scoring and was compared with clinician ratings.
    RESULTS: All responses passed Tier A safety screening. Mean total MELMA-Q scores ranged from 72.4 to 85.6 across models; inter-rater reliability was excellent (ICC 0.89; 95% CI 0.84-0.93). MELMA-W validation using paired model × domain observations showed systematically higher clinician scores (bias of 0.804 Likert points).
    CONCLUSIONS: MELMA-Q and MELMA-W provide a structured, safety-centered pilot framework for evaluating LLM-generated medical responses in otolaryngology; however, broader validation across larger datasets, additional raters, and other clinical specialties remains required.
    Keywords:  Clinician-rated evaluation; Large language models; MELMA-Q; MELMA-W; Model-agnostic scoring and validation; Otolaryngology
    DOI:  https://doi.org/10.1007/s00405-026-10245-5
  5. Explor Res Clin Soc Pharm. 2026 Sep;23 100812
       Background: Large language model (LLM)-based chatbots are increasingly used by patients for health information; however, their reliability in high-risk cardiovascular therapies such as oral anticoagulation remains uncertain. This study evaluated the perceived accuracy, clarity, and completeness of LLM-generated responses to common patient queries compared with standard-derived expert responses (SDERs).
    Methods: A cross-sectional comparative study evaluated responses generated by ChatGPT-4.5, Gemini Pro 2.5, and DeepSeek-V3 to 11 frequently asked questions related to five oral anticoagulants: warfarin, dabigatran, apixaban, rivaroxaban, and edoxaban. Responses were generated using a standardized patient-focused prompt. SDERs were developed by cardiologists and clinical pharmacists using authoritative references. All responses were anonymized and independently assessed by two blinded clinical pharmacists using a five-point Likert scale evaluating accuracy, clarity, and completeness. Interrater reliability was assessed using linearly weighted Cohen's κ, and group comparisons were analyzed using the Friedman test with post hoc adjustments.
    Results: Interrater reliability ranged from fair to almost perfect (κ = 0.31-0.84). ChatGPT-4.5 achieved the highest mean ratings across all evaluation domains, particularly for completeness. Significant differences were observed among response sources for accuracy, clarity, and completeness (p < 0.05). Warfarin-related queries demonstrated significant differences in accuracy and completeness across response sources, whereas responses for direct oral anticoagulants showed no significant differences.
    Conclusion: ChatGPT-4.5 received the highest mean expert ratings for patient education regarding oral anticoagulants. Performance differences were most evident for warfarin-related queries, whereas responses for direct oral anticoagulants were broadly comparable across sources.
    Keywords:  ChatGPT; DeepSeek; Gemini; Large language models; Oral anticoagulants; Patient education
    DOI:  https://doi.org/10.1016/j.rcsop.2026.100812
  6. Foot Ankle Spec. 2026 Jun 26. 19386400261456922
       BACKGROUND: Patients increasingly use the Internet and artificial intelligence (AI) platforms such as ChatGPT for medical information, raising concerns about the accuracy and clinical depth of AI-generated content. This study evaluated the reliability and clinical utility of ChatGPT (GPT-3.5 and GPT-4.0) for common foot and ankle conditions compared with patient education materials from the American Orthopaedic Foot & Ankle Society (AOFAS) FootCareMD.
    METHODS: Between January 20 and 26, 2025, standardized prompts were used to query GPT-3.5 and GPT-4.0 across 15 common foot and ankle conditions. ChatGPT responses were compared with AOFAS FootCareMD content based on the number of symptoms, risk factors, and treatment options provided. Two fellowship-trained foot and ankle orthopaedic surgeons independently evaluated response accuracy, categorizing outputs as <50%, 50% to 74%, 75% to 99%, or 100% accurate. Paired t-tests were used for statistical comparisons, and inter-rater reliability was assessed using Cohen's weighted kappa.
    RESULTS: GPT-4.0 generated significantly more symptoms than AOFAS content (P = .015). In contrast, GPT-3.5 listed significantly fewer treatment options than both AOFAS and GPT-4.0 (P = .042). When addressing surgical management, both ChatGPT versions frequently provided vague or incomplete information. GPT-3.5 referenced surgery without procedural detail in 53% of responses, while GPT-4.0 lacked detailed surgical explanations or omitted them entirely in 80% of responses. Overall accuracy ratings were high, with 77% of responses judged as 75% to 99% accurate and only 3.4% rated below 50% accuracy. However, inter-rater agreement between surgeons was poor (κ = -0.02) for responses labeled as 100% accurate, highlighting subjectivity in grading AI-generated medical content.
    CONCLUSION: ChatGPT effectively provides general information on foot and ankle conditions, regarding causes and symptoms, and GPT-4.0 offers more comprehensive treatment discussions than GPT-3.5. Nevertheless, its limited depth and specificity regarding surgical options restrict its clinical usefulness. Until further improvements are made, AI-generated content should serve as a supplement rather than a replacement for expert-reviewed patient education resources.
    LEVEL OF EVIDENCE: Level III Case Control Study.
    Keywords:  ChatGPT; artificial intelligence; foot and ankle conditions; health information accuracy
    DOI:  https://doi.org/10.1177/19386400261456922
  7. Sarcoidosis Vasc Diffuse Lung Dis. 2026 Jun 22. 43(2): 18399
       BACKGROUND AND AIM: Hypersensitivity pneumonitis (HP) is a complex, immune-mediated interstitial lung disease in which accurate diagnosis and long-term management require integration of clinical, radiologic, and exposure-related information. Patients increasingly use artificial intelligence (AI)-based chatbots to obtain disease-related information; however, the quality, readability, and patient usability of such content remain unclear. This study aimed to evaluate the quality, reliability, readability, and patient-centered usability of AI chatbot-generated information on HP.
    MATERIALS AND METHODS: Using Google Trends, we identified four of the most frequently searched patient-oriented questions regarding HP: (1) What is HP and what causes it? (2) What are the clinical features of HP? (3) How is HP treated? (4) How is HP diagnosed? These questions were submitted verbatim to eight AI chatbots (ChatGPT-5.1, Claude 3, Microsoft Copilot, DeepSeek V3, Gemini Pro, Grok 4, Kimi K2, Perplexity AI). A total of 32 responses were independently evaluated in a blinded fashion by four pulmonology professors specializing in interstitial lung diseases. Content quality and reliability were assessed using DISCERN; understandability and actionability with PEMAT-P; global written readability with the Written Readability Rating (WRR); and structural readability with the Flesch-Kincaid Grade Level (FKGL).
    RESULTS: All chatbot outputs required advanced literacy, with FKGL scores ranging from 20.17 to 29.07 and a mean of approximately 24-25, indicating college or postgraduate reading level. No chatbot produced content within the recommended patient-appropriate range (FKGL ≤ 8). WRR scores declined with increasing clinical complexity, from 67.85 for definitional content (Q1) to 51.227 for diagnostic explanations (Q4). DISCERN scores varied substantially across models (35.001-57.103), with most chatbots falling into the "fair-good" range, reflecting partially reliable but incomplete information. [..] Conclusion: AI chatbots can generate clinically rich explanations of HP but currently produce content that is too complex and insufficiently actionable for most patients. [..].
    DOI:  https://doi.org/10.36141/svdld.2026.18399
  8. J Perinat Med. 2026 Jun 25.
       OBJECTIVES: To evaluate the readability and quality of publicly available patient information pamphlets on ultrasound and assess their accessibility for patients with varying literacy levels.
    METHODS: This was a cross-sectional descriptive study using the publicly available online International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) patient information library. A total of 155 English-language patient information materials ("pamphlets") on pregnancy and gynecology topics available in early 2025 were analyzed. Readability was assessed using Readability Studio™ software and four validated indices: Gunning Fog, SMOG, Coleman-Liau, and Flesch Reading Ease (FRE). The DISCERN instrument, a validated 16-item tool, was applied independently to evaluate reliability, clarity, and balance of treatment information. The main outcome measures were the grade level of readability and DISCERN quality scores.
    RESULTS: Only one pamphlet (1 %) met the recommended eighth-grade readability standard. Most pamphlets (124; 80 %) were written at or above the 11th-grade level (mean Gunning Fog 14.8, SMOG 13.5, Coleman-Liau 12.9, FRE 45.2). The LIX index classified the majority as "difficult to technical." Despite the high reading level, DISCERN scores were uniformly high (4-5/5), indicating strong reliability, clarity, and balance of information, but poor accessibility for the average patient.
    CONCLUSIONS: ISUOG patient information materials are accurate, reliable, and evidence-based but written well above recommended readability standards, limiting comprehension for many patients. Simplifying language, shortening sentences, and involving health-literacy and cultural experts may improve accessibility and promote global equity in patient education.
    Keywords:  health equity; health literacy; patient education; readability
    DOI:  https://doi.org/10.1515/jpm-2026-0007
  9. Front Public Health. 2026 ;14 1785229
       Purpose: To systematically evaluate and compare the quality, readability, and query-model consistency of adenomyosis-related content generated by two large language models, ChatGPT (GPT-5) and DeepSeek (R1).
    Materials and methods: In total, 25 high-frequency patient queries were obtained based on Google Trends. Each query was processed using two interaction modes, namely, three consecutive repetitions and three independent cycles, on both large language models (ChatGPT GPT-5.0-web, released December 2025; DeepSeek R1-web, released November 2025). The generated texts (n = 300) were subsequently assessed for their readability [evaluated by Automated Readability Index (ARI), Flesch Reading Ease Score (FRES), and Gunning Fog Index (GFI)] and quality [assessed by DISCERN score, and Ensuring Quality Information for Patients (EQIP) tool]. Statistical comparisons were performed using non-parametric tests and t-tests.
    Results: In the cyclic mode, both ChatGPT and DeepSeek maintained stable output text readability and quality. DeepSeek-generated text demonstrated significantly superior readability across both interaction modes (lower ARI: 11.32 vs. 14.56, p < 0.001; higher FRES: 46 vs. 27, p < 0.001; lower GFI: 12.47 vs. 14.16, p < 0.001) and higher information quality (higher DISCERN: 62 vs. 43, p < 0.001; higher EQIP: 75 vs. 70, p < 0.001). Under the repetition mode, DeepSeek's output exhibited significant fluctuations across multiple metrics (ARI: p = 0.021; FRES: p = 0.015; GFI: p = 0.004; DISCERN: p = 0.013; EQIP: p < 0.001), while ChatGPT's output remained stable (all p > 0.05). Notably, the readability scores for both models indicated reading levels equivalent to undergraduate education, which is above the recommended level for general public health information.
    Conclusion: The findings of this study demonstrate that when generating information on adenomyosis, DeepSeek outperforms ChatGPT in terms of readability and several information quality metrics, whereas ChatGPT exhibits greater consistency in its outputs. However, the reading difficulty of texts generated by both models exceeds the level suitable for the general public, representing a key practical constraint limiting direct public use. Based on these results, AI chatbots may serve as complementary tools in patient education; however, their outputs should undergo expert review and be optimized for comprehensibility before broader clinical application. For clinicians and patients, these findings emphasize the importance of critically appraising AI-generated information and using it as a supplement to, rather than a substitute for, professional medical consultation.
    Keywords:  ChatGPT; adenomyosis; artificial intelligence; deepseek; health information quality; patient education; readability
    DOI:  https://doi.org/10.3389/fpubh.2026.1785229
  10. Healthcare (Basel). 2026 Jun 18. pii: 1769. [Epub ahead of print]14(12):
       Background/Objectives: Lung cancer is a leading cause of cancer-related mortality worldwide. As patients increasingly utilize large language models (LLMs) for health information, evaluating the readability and patient-centeredness of these tools is critical. This study aims to compare the performance of ChatGPT-4o mini, Microsoft Copilot, and Google Gemini in providing lung cancer information, focusing on their utility for individuals with limited health literacy.
    Methods: In this cross-sectional study (March 2026), 30 responses to ten standardized lung cancer-related queries were analyzed. Outputs were assessed using JAMA benchmarks and mDISCERN for quality, the SMOG index for readability, and PEMAT-P for understandability and actionability. Inter-rater reliability was analyzed using intraclass correlation coefficients (ICCs).
    Results: ChatGPT-4o mini demonstrated superior readability, achieving a sixth-grade level (SMOG: 6.23 ± 0.72, p < 0.001). Gemini achieved higher JAMA scores, indicating stronger academic rigour. While PEMAT-P scores were highest for ChatGPT (63.7%), all models exhibited moderate mDISCERN quality. Inter-rater reliability was excellent for JAMA (ICC = 1.000) and PEMAT-P (ICC = 0.883), though moderate for mDISCERN (ICC = 0.365), reflecting inherent interpretative subjectivity in qualitative assessment. No hallucinations were observed.
    Conclusions: Current LLMs exhibit a trade-off between accessibility and academic rigour: ChatGPT favours patient-friendly readability, while Gemini emphasizes structured content. The observed inter-rater variability in mDISCERN underscores the complexity of standardizing qualitative AI evaluation. These findings suggest that LLMs function best as complementary aids rather than substitutes for physician-led communication.
    Keywords:  artificial intelligence; health literacy; health promotion; health promotion programmes; large language models; lung cancer; patient education; readability; vulnerable populations
    DOI:  https://doi.org/10.3390/healthcare14121769
  11. Rev Assoc Med Bras (1992). 2026 ;pii: S0104-42302026000302202. [Epub ahead of print]72(3): e20251434
       OBJECTIVE: The aim of this study was to assess the quality and readability of ChatGPT and Gemini's responses to frequently asked questions about early intervention for individuals with at-risk infants.
    METHODS: Ten frequently asked questions about early intervention were selected by three researchers (a child development specialist, a physiotherapist, and a midwife) from a list generated by ChatGPT and Gemini. Questions were sent to ChatGPT version 4.0 and Gemini 1.5, and initial responses were recorded without follow-up queries. Ten independent experts (two special education specialists, two child development specialists, two physiotherapists, two midwives, and two pediatricians) assessed the quality of ChatGPT and Gemini's responses using a four-grade rating system. Readability levels were analyzed using the Flesch-Kincaid Grade Level through WordCalc software.
    RESULTS: One of the answers given by ChatGPT was of higher quality than Gemini (p=0.025), while one answer given by Gemini was of higher quality than ChatGPT (p=0.033). The answers to the other questions were of similar quality, with Gemini having a lower level.
    CONCLUSION: This study compares the quality and readability of the answers given by artificial intelligence-based language models to demonstrate their potential to appeal to different user groups. While the models generally provided answers of similar quality, quantitative differences in readability were observed, suggesting potential suitability for different audiences. These findings contribute to understanding the role of AI tools in health communication.
    DOI:  https://doi.org/10.1590/1806-9282.20251434
  12. J Perioper Pract. 2026 Jun 23. 17504589261454163
       BACKGROUND: Effective patient education materials on prehabilitation are essential for optimising patients before surgery. With the growing use of generative artificial intelligence (AI) chatbots in health care communication, it is important to evaluate their suitability compared with established human-generated resources. We aimed to compare patient education materials on prehabilitation, generated by artificial intelligence chatbots (ChatGPT-4o, Gemini 2.5, and DeepSeek V3) with a National Health Service leaflet, assessing factual accuracy, readability, and emotional tone.
    METHODS: A comparative observational study design was used. All four patient education materials were blinded and evaluated by ten experts using a 10-point Likert-type scale. Readability was assessed using the Flesch Reading Ease and the Flesch-Kincaid Grade Level. Sentiment analysis was done using an online tool. Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) scores were calculated.
    RESULTS: National Health Service patient education material received the highest mean ± SD accuracy score from experts (9.8 ± 0.3), outperforming all artificial intelligence models (p = 0.000). Among artificial intelligence models, Gemini scored highest. For readability, ChatGPT and the National Health Service were comparable. Sentiment analysis showed varying tones across all models. Patient Education Materials Assessment Tool for Printable Materials scores showed high understandability across all patient education materials (>75%), but actionability was highest for the National Health Service (93.3%).
    CONCLUSION: Artificial intelligence chatbots can generate readable and promising patient education materials. Traditional materials remain superior in accuracy and completeness. A hybrid 'human-in-the-loop' approach is recommended for effective patient education.
    Keywords:  Artificial Intelligence; Chatbots; Patient education; Perioperative care; Prehabilitation; Readability
    DOI:  https://doi.org/10.1177/17504589261454163
  13. Australas J Ageing. 2026 Jun;45(2): e70201
       OBJECTIVES: The global population of people aged 60 years and above is growing. With this, there is a rising need for home care services and demand for online health information to facilitate the decision-making among older adults and promote consumer-directed care. This exploratory study aimed to examine how aged care provider websites present and structure information about Home Care Packages (HCPs) to support consumer decision-making.
    METHODS: A web-based content analysis of Australian provider websites was conducted by accessing websites using Google Chrome in incognito mode and using an evaluation tool that assessed language readability (Flesch Reading Ease and Flesch-Kincaid Grade Level scores), usability features (Web Content Accessibility Guidelines V2.0) and the relevance and depth of HCP information. Descriptive statistics were calculated using SPSS v.29. Qualitative data were analysed using deductive content analysis. Data were extracted into Microsoft Excel and deductively coded using the structured evaluation tool developed for this study. Coded content was then narratively synthesised, augmented by researcher notes and illustrative quotes.
    RESULTS: From 25 systematically sampled websites, the median Flesch Reading Ease score was 52.5 (IQR 12.52) corresponding to a high school reading level of 10th-12th grade. Most sites featured small text (n = 21) and inconsistent information regarding services, pricing and case management. None of the websites in this analysis was found to be both user-friendly and comprehensive in their HCP content.
    CONCLUSIONS: Poor usability and inconsistent information may limit older adults' ability to access and understand HCP details, reducing engagement with these online resources.
    Keywords:  aged; consumer health information; health services for the aged
    DOI:  https://doi.org/10.1111/ajag.70201
  14. J Esthet Restor Dent. 2026 Jun 25.
       OBJECTIVE: This study aimed to evaluate the validity and reliability of responses generated by GPT-4o, Microsoft Copilot, Google Gemini, and DeepSeek to 20 frequently asked patient questions about tooth whitening.
    MATERIALS AND METHODS: Twenty common questions about tooth whitening were selected based on clinical experience and AI-generated suggestions. Each question was submitted three times to each chatbot through its official web interface. The responses were evaluated by two professors and four specialists in restorative dentistry using a five-point Likert scale based on a modified Global Quality Score. Validity was analyzed considering low-threshold and high-threshold criteria. Reliability was tested using Cronbach's alpha coefficient, whereas inter-rater reliability was calculated utilizing the intraclass correlation coefficient.
    RESULTS: In the low-threshold validity analysis, GPT-4o and DeepSeek yielded the highest validity rate by providing valid responses to all 20 questions. Microsoft Copilot and Google Gemini showed lower validity rates. No significant difference was found among the chatbots in low-threshold validity rates. In the high-threshold validity analysis, GPT-4o and DeepSeek showed the highest valid response rates, whereas Google Gemini and Microsoft Copilot showed lower rates. No significant difference was found among the chatbots in high-threshold validity rates. In the reliability analysis, the highest internal consistency was observed for DeepSeek, followed by Microsoft Copilot, Google Gemini, and GPT-4o.
    CONCLUSIONS: The evaluated chatbots showed different performance levels in terms of the validity and reliability of their responses to frequently asked patient questions about tooth whitening. GPT-4o and DeepSeek yielded the highest rates in the low-threshold and high-threshold validity analyses, whereas DeepSeek showed the highest internal consistency.
    CLINICAL SIGNIFICANCE: This study indicated that the evaluated AI chatbots generated generally valid but variable responses to frequently asked patient questions about tooth whitening. The findings support the professionally supervised use of chatbot-generated information as supplementary patient education material in dentistry.
    Keywords:  ChatGPT; DeepSeek; Gemini; artificial intelligence; copilot; large language models; whitening
    DOI:  https://doi.org/10.1111/jerd.70223
  15. Front Digit Health. 2026 ;8 1847603
       Background: Large language models (LLMs) are increasingly used by patients seeking cardiovascular health information through digital platforms. However, their accuracy and suitability for providing guidance on heterogeneous diseases such as cardiomyopathies and heart failure remain inadequately evaluated. This study systematically benchmarked state-of-the-art LLMs on patient-oriented heart failure and cardiomyopathy queries regarding clinical appropriateness and comprehensibility.
    Methods: Six prominent LLM chatbots were tested on 50 expert-curated questions covering disease understanding and lifestyle advice. A web-based evaluation platform randomized and blinded responses for assessment by twelve reviewers (cardiologists, medical students, AI auto-graders). Responses were rated on a 1-5 Likert scale across nine domains, including appropriateness, readability, and empathy. Reviewers also chose their preferred model per question.
    Results: Linguistic complexity and output length varied substantially. Gemini provided the most readable responses (Flesch-Kincaid Grade 11.3 ± 1.9) but was among the most verbose (668.7 ± 116.1 words). Across 2,700 ratings, Gemini received the highest composite mean rating (4.41 ± 0.77), excelling in completeness and factual reliability, followed by Grok (4.23 ± 0.76). Confabulation avoidance scored consistently high across all models (4.49 ± 0.02), while conciseness scored lowest (3.81 ± 0.05). Consistently, evaluators selected Gemini as their preferred information source in 43.7% of cases, followed by Grok (30.3%). Rating tendencies varied by evaluator group: Auto-graders gave the highest average scores (mean 4.58 ± 0.60), followed by students (4.10 ± 0.88), while experts were more conservative (3.79 ± 0.93).
    Discussion: All LLMs showed good accuracy in avoiding medical misinformation, though variability exists in readability and comprehensiveness. While major factual errors or hallucinations were rare in our blinded evaluation, they were not entirely absent.
    Keywords:  artificial intelligence; cardiomyopathy; digital health; heart failure; patient education
    DOI:  https://doi.org/10.3389/fdgth.2026.1847603
  16. Ir J Med Sci. 2026 Jun 27.
       BACKGROUND: Accessible written information is fundamental to patient-centred care and informed decision-making, yet health materials frequently exceed the reading ability of their intended audiences. In breast health - where patient information leaflets are widely used to support screening, diagnosis and treatment decisions - readability assessments offer an objective measure of accessibility.
    AIMS: To evaluate the readability of breast health patient information leaflets (PILs) produced by two major Irish health information providers - the Health Service Executive (HSE) and the Irish Cancer Society (ICS) - and to assess their compliance with national literacy recommendations.
    METHODS: A cross-sectional readability analysis was conducted on 27 publicly available breast health PILs (13 HSE; 14 ICS). Text was extracted and assessed using five validated readability indices: Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI). Median readability scores and interquartile ranges were calculated, and differences between providers were analysed using Mann-Whitney U tests.
    RESULTS: Across all indices, both organisations' materials exceeded the recommended grade 7 readability threshold. ICS materials were significantly more difficult to read than HSE materials (p < 0.001). Median grade-level scores ranged from 7.1 (FKGL) to 9.0 (GFI) for HSE and 8.15 (FKGL) to 10.35 (GFI) for ICS. No significant differences were observed across topic categories.
    CONCLUSION: Most breast health PILs in Ireland exceed recommended readability levels, limiting accessibility for many patients. Routine readability assessment and plain-language revision are warranted to promote equitable, patient-centred communication.
    Keywords:  Breast neoplasms; Health communication; Health literacy; Patient education; Readability
    DOI:  https://doi.org/10.1007/s11845-026-04486-w
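    Two of the indices named in the abstract above, FRES and FKGL, are closed-form functions of word, sentence, and syllable counts. The sketch below uses the standard published coefficients; the function names and the sample counts are illustrative assumptions and are not taken from the study, which does not describe its tooling.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score: higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical leaflet excerpt: 100 words, 5 sentences, 150 syllables.
    counts = dict(words=100, sentences=5, syllables=150)
    print(round(flesch_reading_ease(**counts), 2))
    print(round(flesch_kincaid_grade(**counts), 2))
```

    Counting syllables reliably is the hard part in practice, so the counts are passed in here rather than derived from raw text.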
  17. Lasers Med Sci. 2026 Jun 26. pii: 134. [Epub ahead of print]41(1):
      To investigate the use of YouTube videos as an educational and informational resource regarding the use of red light therapy in eye diseases. On May 1, 2025, a comprehensive search was conducted on the YouTube platform ( https://www.youtube.com ) using the keywords "Red light therapy in eye disease" and "Use of red light therapy in eye diseases for patient information". Videos under 60 s in duration, those lacking audio, duplicate entries, irrelevant content, and videos with disabled comment sections were excluded from the analysis. These exclusion criteria were predefined prior to data collection. Following video selection (E.Ç., M.T.), all videos were independently evaluated by two ophthalmologists, and statistical analyses were performed using the finalized dataset. Parameters assessed included total views, number of comments, likes, time elapsed since upload, video duration, view rate, interaction index, video source, and content characteristics. In addition, videos were categorized according to quality level based on Global Quality Score (GQS) scores as low-to-moderate quality (GQS < 4) and high quality (GQS ≥ 4). Comparisons were then performed according to quality category with respect to uploader source, video purpose, and engagement-related parameters. Out of 110 YouTube videos reviewed, 53 were excluded based on predefined criteria. The remaining 57 videos covered myopia (22.8%), dry eye (33.3%), diabetic retinopathy (12.3%), and age-related macular degeneration (31.6%). Inter-rater agreement was high (mDISCERN ICC: 0.952; JAMA ICC: 0.923; GQS ICC: 0.890). Videos uploaded by physicians had significantly higher mDISCERN, JAMA, and GQS scores. Educational videos had higher mDISCERN and JAMA scores than patient-information videos. A moderate positive correlation was found between the number of likes and the interaction index (r = 0.458, p < 0.01). 
The interaction index was not significantly associated with mDISCERN, JAMA, or GQS scores, indicating that engagement metrics were not reliable indicators of informational quality. Among disease categories, age-related macular degeneration videos had the highest mean quality scores, although the differences were not statistically significant. Most videos were classified as low-to-moderate quality (82.5%), while 17.5% were classified as high quality. High-quality videos had significantly higher view counts and like counts than low-to-moderate quality videos. The overall quality and formal reliability of YouTube videos on red light therapy in ophthalmology were low to moderate. Physician-uploaded and educational videos provided higher-quality information. Although high-quality videos showed greater engagement, popularity metrics were not reliable indicators of informational quality. More evidence-based and accessible video content is needed to support patient education.
    Keywords:  Eye disease; Red light therapy; Scoring systems; YouTube
    DOI:  https://doi.org/10.1007/s10103-026-04920-6
  18. Ann Surg Oncol. 2026 Jun 22.
       BACKGROUND: Esophageal cancer is associated with substantial morbidity and mortality worldwide and is frequently diagnosed at advanced stages, leading patients and their relatives to seek health-related information beyond traditional clinical encounters. In recent years, YouTube has become a popular source of medical information. Nevertheless, questions persist regarding the accuracy, credibility, and overall reliability of the content available on the platform.
    METHODS: This cross-sectional study evaluated publicly available YouTube videos related to esophageal cancer. Data were collected on December 3, 2025, using a browser without a user login to minimize algorithm-driven bias. Viewer engagement metrics (views, likes, and comments), source categories, and country of origin were recorded for each video. Content quality and reliability were assessed using the DISCERN instrument, Journal of the American Medical Association (JAMA) benchmark criteria, and Global Quality Score (GQS). Non-parametric statistical analyses were used to compare quality outcomes across source categories and evaluate the correlations between the engagement metrics and quality scores.
    RESULTS: A total of 78 videos met the inclusion criteria, most of which originated in the USA (83.3%). Health-related channels constituted the largest source category (35.9%), followed by patient experience-based videos (23.1%), and private institutions (20.5%). Viewer engagement metrics (views, likes, and comments) did not differ significantly among source types (p > 0.05). In contrast, the content quality varied substantially. Videos produced by public institutions achieved the highest DISCERN, JAMA, and GQS values, whereas patient-experience-based videos demonstrated significantly lower quality and reliability (p < 0.001). Engagement metrics were strongly intercorrelated but showed no association with quality scores.
    CONCLUSION: YouTube videos related to esophageal cancer frequently exhibit moderate informational quality, and popularity metrics do not reflect content reliability. Source credibility plays a critical role in determining video quality, underscoring the need for greater involvement of healthcare professionals and public institutions in digital health content production.
    Keywords:  Esophageal cancer; Online health information; Video content analysis; YouTube
    DOI:  https://doi.org/10.1245/s10434-026-19996-1
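    Several abstracts in this issue, including the one above, relate engagement metrics to quality scores with rank-based (Spearman) correlation. A minimal sketch of that statistic follows, computed as the Pearson correlation of average ranks. The view counts and GQS values are invented for illustration and do not come from any of the studies.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    views = [120, 4500, 980, 15000, 300]  # hypothetical view counts
    gqs = [4, 2, 3, 2, 5]                 # hypothetical GQS ratings (1-5)
    print(spearman_rho(views, gqs))
```

    Because only ranks enter the calculation, the heavy right skew typical of view and like counts does not dominate the result, which is one reason these studies favor it over Pearson correlation.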
  19. Health Informatics J. 2026 Apr-Jun;32(2): 14604582261463699
       Background: WeChat has become a central platform for health information seeking in China, particularly among young adults. Understanding the mechanisms underlying this behavior is crucial for improving digital health literacy and designing effective interventions.
    Objective: This study applies the Comprehensive Model of Information Seeking (CMIS) to examine how WeChat use shapes users' perceptions of health information characteristics and utility, influencing health information-seeking behaviors among Chinese young adults.
    Methods: Data were obtained from a cross-sectional online survey of 890 WeChat users aged 18-35 conducted in May 2024. Structural equation modeling (SEM) and the PROCESS macro were applied to test the hypothesized associations and mediating mechanisms involving perceived characteristics and utility.
    Results: WeChat use, salience, and beliefs predicted perceived utility. Perceived characteristics and utility were positively associated with health information-seeking behaviors and mediated the relationship with WeChat use both in parallel and sequentially.
    Conclusion: Integrating platform-specific use into CMIS can promote health information seeking and inform digital health communication.
    Keywords:  WeChat use; beliefs; comprehensive model of information seeking; health information seeking; salience
    DOI:  https://doi.org/10.1177/14604582261463699
  20. Cureus. 2026 May;18(5): e109619
       BACKGROUND/OBJECTIVES: This study aimed to evaluate drug information-seeking behaviors, preferred sources, perceived reliability, and related challenges among physicians at King Abdulaziz Medical City, Western Region (KAMC-WR), Saudi Arabia.
    METHODS: In this cross-sectional study, we surveyed physicians at KAMC-WR, a tertiary care center in Saudi Arabia's Western Region, from October 15 to December 15, 2025. Data were collected via a validated, self-administered online questionnaire adapted from prior studies. The survey evaluated demographics, drug information needs, seeking behaviors, challenges, and the reliability of sources. Physicians were categorized as passive or active information seekers based on the number of clinical situations in which they sought drug-related information, with passive seekers reporting searches in zero to one clinical situation and active seekers reporting searches in two or more clinical situations. Data were analyzed using descriptive statistics, chi-square/Fisher's exact tests, the Mann-Whitney U test, and logistic regression in SPSS version 28. A p-value of <0.05 was considered statistically significant.
    RESULTS: Among 148 responding physicians, most searched for drug information daily or weekly. Frequently sought details included dosage regimens (146, 98.6%), adverse effects (138, 93.2%), and contraindications (138, 93.2%). Websites and clinical pharmacists were the most commonly used sources, while mobile apps were preferred for accessibility. Key challenges included polypharmacy management, source variability, and restricted subscription access. Specialists and consultants demonstrated higher odds of being active seekers compared with residents (adjusted OR=12.5, 95% CI: 1.66-93.89, p=0.014).
    CONCLUSIONS: In this study conducted at KAMC-WR, physicians frequently sought drug information during routine clinical practice, particularly through digital resources, but encountered access and workflow barriers. Improving institutional drug information resources and pharmacist collaboration could promote safer and more effective prescribing.
    Keywords:  drug information; information-seeking behavior; medication safety; physicians; saudi arabia
    DOI:  https://doi.org/10.7759/cureus.109619
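    The passive/active grouping described in the abstract above is a simple threshold rule: zero to one clinical situation is passive, two or more is active. A sketch of that rule follows; the function name and the sample responses are hypothetical, and the study's actual analysis was run in SPSS.

```python
from collections import Counter


def seeker_category(n_situations: int) -> str:
    """Classify a physician by the number of clinical situations
    in which they reported seeking drug information."""
    if n_situations < 0:
        raise ValueError("number of situations cannot be negative")
    return "passive" if n_situations <= 1 else "active"


if __name__ == "__main__":
    # Hypothetical per-physician counts, not data from the study.
    responses = [0, 1, 2, 5, 3, 1]
    print(Counter(seeker_category(n) for n in responses))
```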
  21. Andrology. 2026 Jun 22.
       BACKGROUND: The internet, including websites and large language models (LLMs), is an increasingly important information resource for medical patients. However, information quality varies, potentially leading to misinformation. Andrological patients may particularly rely on online sources due to the sensitive nature of their conditions.
    OBJECTIVES: To assess the prevalence and quality of internet research among andrological patients and evaluate common online sources for comprehensibility, readability, and accuracy.
    MATERIALS AND METHODS: Patients (n = 283) at four German andrological centers completed a questionnaire on their online information behavior between November 2022 and October 2023. Common online sources were objectively evaluated using the DISCERN tool and Flesch readability index in 2023. Popular LLMs were also assessed in October 2024 and compared to traditional websites.
    RESULTS: 67% (n = 190/283) of andrological patients seek medical information before appointments, primarily using the internet (52%, n = 148/283) and general practitioners (49%, n = 139/283). Patients under 50 predominantly use online sources (60%, n = 74/122). Official medical association websites and Wikipedia are preferred, but 30% also use commercial sites (n = 91/283). In general, most common websites and LLMs provide sufficient information but lack easy comprehensibility and readability.
    DISCUSSION: The study highlights the widespread use of online resources by andrological patients and emphasizes the importance of high-quality medical information. While online tools show promise, physician-patient communication cannot be replaced by chatting with LLMs.
    CONCLUSION: Official medical associations should provide accurate and easily understood information for andrological patients. Efforts to improve the online media literacy and communication skills of medical personnel are necessary.
    TRIAL REGISTRATION NUMBER: The study was registered in the German WHO primary registry, the German Clinical Trials Register (DRKS) (Number: DRKS00029651).
    Keywords:  andrology; digital health; erectile dysfunction; online information; patient information; readability
    DOI:  https://doi.org/10.1111/andr.70286
  22. Health Commun. 2026 Jun 22. 1-12
      Women during emerging adulthood are disproportionately susceptible to sexually transmitted infections due to biological and socioeconomic vulnerabilities. Central to safe sex is the process by which women seek information regarding their male partners' intentions to use condoms, which is complicated by the gendered nature of sexual risks and the influence of sociocultural factors. Drawing on the theory of motivated information management, we examined how relationship quality, sexual relationship power, and sexual shame were associated with sexual risk information-seeking among Chinese emerging adult women in committed heterosexual relationships. Analyses of data from 294 participants showed that relationship quality and sexual relationship power were positively associated with outcome expectancy and efficacy assessments, subsequently contributing to more information seeking behaviors about condom use. Additionally, higher levels of sexual shame were associated with lower levels of efficacy assessments, and the effect of shame on efficacy was completely mediated by negative emotions, relationship quality, and sexual relationship power. These findings contribute to theoretical understandings of sexual health information management situated within specific relational and socio-cultural contexts and offer practical implications for improving sexual negotiation in Chinese cultural contexts.
    DOI:  https://doi.org/10.1080/10410236.2026.2691105
  23. Medicine (Baltimore). 2026 Jun 19. 105(25): e49362
      With the rapid growth of short video platforms, TikTok and Bilibili are becoming important channels for the public to access health information. Given that Asia is underrepresented in English-language medical literature, this study focuses on Mandarin Chinese videos to provide region-specific insights. This study aims to evaluate the content, quality, reliability, and transparency of sarcopenia-related videos on TikTok and Bilibili. Using the keyword "sarcopenia," a search was conducted on TikTok and Bilibili to retrieve the top 150 videos based on the comprehensive ranking. Video duration, engagement metrics, uploader type, and content information were extracted. Videos were evaluated using the Global Quality Score (GQS), modified DISCERN (mDISCERN), Journal of the American Medical Association (JAMA) benchmark criteria, and content completeness scores. Group comparisons were performed using Mann-Whitney U and Kruskal-Wallis tests, and correlation analysis was performed using Spearman correlation. A total of 188 videos were included. The content primarily focused on symptoms and treatment, with less coverage on diagnosis and prognosis. The median GQS score was 2.00 (interquartile range [IQR]: 2.00-3.00), the median mDISCERN score was 2.00 (IQR: 1.75-3.00), the median JAMA score was 1.00 (IQR: 1.00-1.00), and the median content completeness score was 6.00 (IQR: 3.00-9.00). Compared with nonprofessional individuals and nonprofessional organizations, videos uploaded by professional individuals achieved higher scores in GQS, mDISCERN, JAMA, and content completeness (P < .05). No significant correlations were found between engagement metrics and mDISCERN scores (P > .05). The overall quality and reliability of sarcopenia-related videos on TikTok and Bilibili are suboptimal. Videos uploaded by professional individuals demonstrated higher quality and reliability. 
Future efforts should strengthen platform content oversight, encourage more professional individuals to contribute to health education, and promote the dissemination of high-quality videos.
    Keywords:  Bilibili; TikTok; content quality; health education; sarcopenia
    DOI:  https://doi.org/10.1097/MD.0000000000049362
  24. JMIR Infodemiology. 2026 Jun 24. 6 e85397
       Background: Tetanus is a severe but vaccine-preventable neurological disease that remains a public health concern, especially in resource-limited settings. As social media becomes an important source of health information, concerns persist regarding the quality and reliability of tetanus-related content online.
    Objective: This study aimed to evaluate the quality, reliability, and thematic characteristics of tetanus-related videos on YouTube and TikTok and to examine the relationship between engagement metrics and information quality.
    Methods: A cross-sectional study was conducted using tetanus-related videos retrieved from YouTube and TikTok on August 1, 2025. The top 100 eligible videos from each platform were included (n=200). Video quality was assessed using the Global Quality Scale, whereas reliability and transparency were evaluated using the modified DISCERN tool and the Journal of the American Medical Association benchmark criteria. A thematic content analysis based on predefined coding categories was also performed. Video characteristics, source types, and engagement metrics were also collected. Spearman correlation analysis was used to examine associations between engagement indicators and quality scores.
    Results: YouTube videos showed significantly higher quality and reliability than TikTok videos, with higher median Global Quality Scale, modified DISCERN, and Journal of the American Medical Association scores (all P<.001). Compared with TikTok, YouTube videos more frequently discussed symptoms (92% vs 81%, P=.02), prevention (95% vs 78%, P<.001), treatment (88% vs 70%, P=.002), and wound management (77% vs 38%, P<.001). Lower-quality videos commonly contained incomplete prevention information, vague symptom descriptions, and limited source attribution. Videos produced by official medical organizations and professional health care creators generally achieved higher quality scores. Although engagement indicators were strongly correlated with each other, their associations with informational quality were relatively limited.
    Conclusions: YouTube provided more comprehensive and reliable tetanus-related information than TikTok, although content quality on both platforms remained inconsistent. Greater involvement from health care professionals and clearer evidence-based communication may help improve the quality of health information shared on social media platforms.
    Keywords:  TikTok; YouTube; content analysis; health communication; infodemiology; reliability; social media; tetanus; video quality
    DOI:  https://doi.org/10.2196/85397
  25. Medicine (Baltimore). 2026 Jun 19. 105(25): e49340
      Temporomandibular disorders (TMD) are common conditions that may substantially impair quality of life, yet the quality of health information available on short-video platforms remains unclear. This study evaluated the coverage of TMD-related content, the educational quality, and the reliability of TMD-related Chinese videos on TikTok and Bilibili. On October 5, 2025, TikTok and Bilibili were searched using the Chinese keyword "." After screening, 200 eligible videos were included, comprising 100 from TikTok and 100 from Bilibili. Video characteristics, uploader categories, engagement indicators, and content coverage were recorded. Video quality and reliability were assessed using the Global Quality Scale (GQS), modified DISCERN (mDISCERN), and Journal of the American Medical Association (JAMA) benchmark criteria. The median video duration was 136.50 seconds (Q1, 66.00; Q3, 235.25). The median GQS, mDISCERN, and JAMA scores were 3.00 (Q1, 2.00; Q3, 3.00), 3.00 (Q1, 2.00; Q3, 4.00), and 2.00 (Q1, 2.00; Q3, 3.00), respectively, indicating overall moderate quality and reliability. Diagnosis, symptoms, and treatment were the most frequently covered domains, whereas epidemiology was the least well covered; only 9.0% of videos provided a complete explanation of epidemiology, and 70.0% did not mention it at all. Bilibili videos were significantly longer than TikTok videos (median, 164.00 vs 81.50 seconds, P < .001), whereas TikTok videos showed significantly higher engagement in likes, comments, shares, and saves (all P < .001). No significant differences were observed between the 2 platforms in GQS, mDISCERN, or JAMA scores. Videos uploaded by specialized healthcare professionals had significantly higher GQS, mDISCERN, and JAMA scores than videos uploaded by other sources (all P < .001). Engagement indicators were strongly correlated with one another but did not reflect better informational quality. 
TMD-related videos on TikTok and Bilibili attracted substantial public attention; however, their overall educational quality and reliability were only moderate. Greater involvement of specialized healthcare professionals, clearer source disclosure, and more balanced topic coverage may help improve the quality of TMD-related health communication on short-video platforms.
    Keywords:  Bilibili; TikTok; health communication; social media; temporomandibular disorders
    DOI:  https://doi.org/10.1097/MD.0000000000049340
  26. Medicine (Baltimore). 2026 Jun 26. 105(26): e49438
      Sleep apnea-hypopnea syndrome (SAHS), a prevalent sleep-disordered breathing disease, burdens global health. In the digital era, short-video platforms such as TikTok and Bilibili have become a major source of health information for the public, but their quality is scarcely studied, raising accuracy and reliability concerns. This study aimed to systematically evaluate the reliability and quality of SAHS educational videos on TikTok and Bilibili using validated tools, such as modified DISCERN (mDISCERN), Global Quality Score (GQS), and Journal of the American Medical Association benchmark criteria (JAMA), and to analyze the associations between content quality, uploader types, and user engagement metrics. A cross-sectional analysis was conducted by retrieving the top 150 videos from each platform using "Sleep Apnea-Hypopnea Syndrome" as the keyword. After excluding duplicates and irrelevant videos, 274 videos (150 from TikTok, 124 from Bilibili) were analyzed using the GQS, the mDISCERN, and the JAMA. Video characteristics, uploader identity, content coverage, and user engagement metrics were evaluated and compared across platforms and uploader types. TikTok videos were significantly shorter but received higher user engagement (likes, comments, shares, and collections) compared with Bilibili videos. Healthcare professionals were the primary uploaders on TikTok (57%), whereas individual users dominated on Bilibili (54%). Video quality scores (GQS, mDISCERN, JAMA) were significantly higher on TikTok than on Bilibili (P < .001). Videos uploaded by healthcare professionals scored the highest in quality and reliability. Strong positive correlations were found among the engagement metrics, but only weak correlations existed between engagement and quality scores. TikTok had higher engagement and better quality than Bilibili, but the overall video quality of the SAHS content on both platforms still needs improvement. 
Healthcare professional-uploaded videos are more reliable. These findings highlight the need for better regulation and monitoring of health content on short-video platforms.
    Keywords:  Bilibili; TikTok; health information quality; short-video platforms; sleep apnea-hypopnea syndrome
    DOI:  https://doi.org/10.1097/MD.0000000000049438
  27. Trop Med Health. 2026 Jun 24.
       BACKGROUND: Snakebite envenomation is a severe medical emergency and major global health threat. As short-video platforms increasingly shape public access to health information, accurate and reliable digital education is essential for snakebite prevention and first-aid decision-making. Although the quality of medical videos on short-video platforms has been examined for several diseases, little is known about the quality, reliability, and content characteristics of snakebite envenomation-related videos on major Chinese platforms. This study aimed to systematically assess the quality, reliability, and content characteristics of snakebite envenomation-related videos on TikTok (Douyin) and Bilibili, and to explore strategies for improving the quality of online health information.
    METHODS: On February 17, 2026, we searched TikTok (Douyin) and Bilibili using the keywords "snakebite envenomation" and "snake envenoming". After videos with content irrelevant to the study keywords, commercial advertisements, duplicate videos, and videos uploaded within the preceding week were excluded according to predefined criteria, 220 eligible videos were included in the final analysis. For each video, we extracted duration, audience engagement metrics, uploader type, and other basic characteristics. Video quality was independently evaluated in a double-blind manner using the Global Quality Score (GQS), modified DISCERN (mDISCERN), and Journal of the American Medical Association (JAMA) benchmarks. Group differences were analyzed using Mann-Whitney U and Kruskal-Wallis tests, and correlations among variables were assessed using Spearman's rank correlation.
    RESULTS: Of the 300 initially screened videos, 220 were included in the final analysis, comprising 114 (51.82%) videos from TikTok (Douyin) and 106 (48.18%) from Bilibili. Most videos were presented in Mandarin Chinese. Etiology, clinical manifestations, treatment, and diagnosis were frequently covered, whereas epidemiology and prevention were less commonly addressed. Bilibili videos were significantly longer than TikTok (Douyin) videos (311.00 [150.75, 514.25] vs 101.00 [58.50, 171.00] seconds; P < 0.001), whereas TikTok (Douyin) videos had significantly more shares (59.00 [10.25, 717.75] vs 34.00 [3.00, 282.25]; P = 0.049). The overall median GQS, mDISCERN, and JAMA scores were 3.00 (3.00, 4.00), 3.00 (3.00, 3.00), and 1.00 (1.00, 2.00), respectively. No significant platform-based differences were observed in GQS or mDISCERN scores, but JAMA scores differed significantly between platforms (P = 0.020). Videos uploaded by healthcare professionals and nonprofit organizations had significantly higher GQS scores than those uploaded by individual users (4.00 [4.00, 5.00] and 4.00 [3.00, 4.00] vs 3.00 [2.00, 3.00], respectively; P < 0.001), as well as higher JAMA scores (both 2.00 [2.00, 2.00] vs 1.00 [1.00, 1.00]; P < 0.001). Engagement metrics were strongly interrelated (all P < 0.001), but showed limited associations with quality scores. GQS and JAMA correlated only with shares (P = 0.015 and P = 0.045, respectively), whereas mDISCERN was positively correlated with likes, collections, comments, and shares (P = 0.042, P = 0.025, P = 0.011, and P = 0.004, respectively).
    CONCLUSIONS: Snakebite envenomation-related videos on TikTok (Douyin) and Bilibili provided moderately useful health information, but showed important limitations in preventive content coverage and information transparency. Videos uploaded by healthcare professionals and nonprofit organizations were generally of higher quality and transparency than those uploaded by individual users, while audience engagement was not consistently aligned with content quality. These findings highlight the need for greater professional involvement, clearer source disclosure, and more standardized prevention- and first-aid-oriented content to improve the reliability and public health value of snakebite-related information on short-video platforms.
    Keywords:  Bilibili; GQS; JAMA; Snakebite envenomation; TikTok; mDISCERN
    DOI:  https://doi.org/10.1186/s41182-026-01006-5
  28. Health Informatics J. 2026 Apr-Jun;32(2): 14604582261464445
       Objectives: Short video platforms have become the main channel for the public to obtain information about cancer. In recent years, TikTok has gradually become an important source of health information for Chinese patients. This study aims to assess the content, quality and reliability of videos related to tongue cancer on the TikTok platform.
    Methods: From November 11 to 12, 2025, we searched for "tongue cancer" on the TikTok platform. Information was collected for the videos that met the filtering criteria. The Global Quality Score (GQS) and the modified DISCERN (mDISCERN) were used to evaluate the quality and reliability of the videos. Finally, a Spearman correlation analysis was conducted on all the indicators of the videos.
    Results: This analysis included 79 videos. The videos were relatively short, with a median duration of 85.00 (IQR: 45.00, 127.50), but the interaction indicators were relatively high. Symptoms and treatment were mentioned more frequently, while epidemiology, etiology, diagnosis and prevention were mentioned less. The GQS score of videos uploaded by doctors was significantly higher, with a median of 4.00 (IQR: 3.00, 4.00), than that of videos uploaded by individual users, with a median of 1.50 (IQR: 1.00, 2.00). The trend of the mDISCERN score was similar: both GQS and mDISCERN scores were higher for videos uploaded by doctors than for those uploaded by individual users.
    Conclusion: TikTok videos on tongue cancer are short but highly engaging. Videos uploaded by doctors have higher quality and reliability than those by individual users. Symptoms and treatment are covered more than epidemiology, etiology, diagnosis, and prevention. Therefore, increasing the number of videos uploaded by doctors is crucial for improving the quality and reliability of videos on the TikTok platform.
    Keywords:  GQS; Quality; Reliability; TikTok; Tongue cancer; mDISCERN
    DOI:  https://doi.org/10.1177/14604582261464445
  29. Medicine (Baltimore). 2026 Jun 19. 105(25): e49399
      Pulmonary embolism (PE) is a life-threatening cardiovascular emergency, yet public awareness remains insufficient. Social media platforms like TikTok and Bilibili have become crucial channels for the general population to access health information. Evaluating the quality and reliability of their content is significant for enhancing public health communication. Employing a cross-sectional design, this investigation systematically retrieved and analyzed short videos pertaining to pulmonary embolism on TikTok and Bilibili, incorporating 186 videos (99 from TikTok, 87 from Bilibili). Video quality and reliability were appraised utilizing the Global Quality Score (GQS) and the modified DISCERN (mDISCERN) tools. Spearman correlation analysis was conducted to explore associations among video length, engagement metrics, and quality/reliability scores. Significant differences were observed between TikTok and Bilibili in video duration, engagement metrics (likes, comments, shares), and quality/reliability scores (P < .05). Videos on Bilibili were significantly longer than those on TikTok, whereas TikTok exhibited superior audience interaction. Video content predominantly emphasized clinical manifestations and etiology, with inadequate attention accorded to prevention and epidemiology. Videos uploaded by pulmonologists significantly outperformed those from other healthcare professionals and science communicators in mDISCERN scores (P = .004). Video length was positively correlated with both GQS and mDISCERN scores. Engagement metrics were strongly intercorrelated; however, higher engagement did not necessarily correspond to higher video quality or reliability. Short-video platforms hold potential for disseminating pulmonary embolism-related health information, but content quality is inconsistent, and high engagement does not equate to high quality. It is recommended that platforms optimize content review mechanisms and encourage professional physicians to participate in content creation to enhance the accuracy and educational value of health-related information.
    Keywords:  Bilibili; TikTok; information quality; pulmonary embolism; short-video
    DOI:  https://doi.org/10.1097/MD.0000000000049399
  30. Medicine (Baltimore). 2026 Jun 19. 105(25): e49361
      TikTok and similar short-form video services are now widely adopted as prominent platforms for circulating health knowledge. However, the quality of content on keloid remains unclear. Keloid, a pathological scar with significant physical and psychological impact, necessitates accurate public education. This cross-sectional study analyzed 122 keloid-related TikTok videos on February 9, 2026, using an unlogged search to minimize bias. Video quality was assessed with 3 scoring systems: the global quality score, the modified DISCERN, and the benchmark criteria of the Journal of the American Medical Association, and uploader categories, content themes, and engagement metrics were analyzed. Videos featured a median length of 65.5 seconds and strong user engagement, but holistic quality was modest (median global quality score = 3.0, modified DISCERN = 2.0, Journal of the American Medical Association = 3.0). Content predominantly covered treatment (86.1%) and clinical manifestations (52.5%), whereas etiology, diagnosis, and recurrence were underrepresented. Videos from plastic surgeons and healthcare professionals had significantly higher quality scores than those from individual users (P < .05). There was no relationship found between engagement metrics and quality. In conclusion, keloid-related TikTok videos achieve wide reach but have limited informational quality, emphasizing the need for enhanced professional involvement and more comprehensive content to improve educational value.
    Keywords:  TikTok; health information quality; keloid; short video; social media
    DOI:  https://doi.org/10.1097/MD.0000000000049361
  31. Aesthet Surg J Open Forum. 2026 ;8 ojag098
       Background: The rising demand for cosmetic gluteal fat grafting is partially attributed to social influence and evolving beauty standards. On the social media platform TikTok alone, content related to what is colloquially known as the Brazilian Butt Lift (BBL) garnered over 15 billion views in the past 3 years, raising concerns about the potential influence of visually striking yet possibly misleading media content on patient decisions and perceptions. This study aimed to analyze the source and content of BBL-related TikTok videos to assess the quality of information presented to potential patients.
    Objectives: To characterize top-ranked Brazilian butt lift (BBL) content on TikTok by evaluating (1) creator type and video category, (2) engagement metrics, (3) the quality of educational content using the Patient Education Materials Assessment Tool (PEMAT), and (4) the geographic distribution of patient-reported procedure locations and provider/clinic locations.
    Methods: Using the TikTok application, 14 phrases related to the Brazilian butt lift procedure were analyzed. Video analysis included engagement metrics, digital creator type, and video category. The quality of educational content was assessed using the validated PEMAT. A locational analysis of the digital creators was also performed, focusing on the geographic area where patients received their procedures or where physician/cosmetic clinics were located. As an observational content analysis, causal inferences could not be made.
    Results: Three hundred fifty videos were included in our study. Patients had the highest percentage of videos (29.1%), followed by "other" (29.0%), and plastic surgeons (20%). Educational videos accounted for the highest percentage of video types (26.6%). Educational videos posted by plastic surgeons had significantly higher understandability and actionability scores than those of non-healthcare creators when using PEMAT (P < .001) but had substantially lower views, likes, and saves (P < .05). Locational analysis revealed that 77.1% of patient-generated videos with identifiable procedure location referenced procedures reported as having been performed internationally or in Miami (P = .021).
    Conclusions: The Brazilian butt lift garners high engagement on TikTok. Educational content is common among video subtypes and is high quality when posted by plastic surgeons; however, educational videos receive higher engagement statistics when posted by non-healthcare creators. Content posted by self-identified BBL patients more frequently referenced procedures performed internationally or in Miami, locations that have been associated in prior epidemiologic literature with higher complication and mortality rates in gluteal fat grafting and cosmetic tourism.
    Level of Evidence: 5 (Therapeutic)
    DOI:  https://doi.org/10.1093/asjof/ojag098
  32. Front Public Health. 2026 ;14 1830047
       Objective: This study aimed to systematically evaluate and compare the content quality and reliability of weight management-related short videos on TikTok and Bilibili, identify key factors influencing video quality, and develop an automated quality prediction tool to support the construction of a healthier digital health communication ecosystem.
    Methods: A cross-sectional study design was adopted. The top 100 weight management-related videos ranked by overall relevance were collected from TikTok and Bilibili, respectively (total n = 200). Two independent researchers evaluated video quality and information reliability using the Global Quality Score (GQS) and the DISCERN instrument. Spearman's rank correlation and Poisson regression analyses were conducted to explore associations between video characteristics and evaluation scores. An XGBoost-based model was developed to predict high-quality videos, and the SHAP framework was applied to interpret the model's decision-making mechanism.
    Results: Videos on Bilibili had significantly higher GQS scores than those on TikTok (p < 0.001). However, DISCERN scores were generally low on both platforms, with no statistically significant difference. On TikTok, most videos were uploaded by nonprofessional individuals and primarily focused on personal experience sharing (68%), whereas the distribution of uploader sources on Bilibili was relatively balanced. Videos published by professional individuals demonstrated significantly higher quality and reliability than those published by nonprofessional individuals. Regression analyses indicated that video duration and number of likes were positive predictors of both quality and reliability. The XGBoost model achieved good discriminative performance in the test set (AUC = 0.8072). SHAP analysis revealed that when video duration exceeded 12.75 min, its contribution to predicting high-quality videos shifted from negative to positive. Based on these findings, an accessible online evaluation platform for high-quality short videos was developed and deployed.
    Conclusion: Videos published by professional individuals possess greater academic value and practical significance. However, weight management information on short-video platforms exhibits a mismatch between popularity and quality. The automated evaluation model and online tool proposed in this study provide strong support for the public in identifying reliable scientific information and for regulatory authorities in developing intelligent governance systems.
    Keywords:  health communication; information quality; machine learning; short video; weight management
    DOI:  https://doi.org/10.3389/fpubh.2026.1830047
  33. Sci Rep. 2026 Jun 22.
      The prognosis of heart failure is highly dependent on patient self-management, and short-video platforms have become a key channel for the public to access information about heart failure (HF). This cross-sectional study systematically evaluated HF-related short videos on the TikTok and Bilibili platforms. After screening from January 10 to 11, 2026, 190 videos were included (103 from Bilibili and 87 from TikTok). Two cardiologists conducted quality assessments using the Global Quality Score (GQS), the modified DISCERN scale, and the Patient Education Materials Assessment Tool for Audiovisual Materials (PEMAT-A/V). Results showed higher engagement for TikTok videos and longer duration for Bilibili videos (both P < 0.05). Content primarily focused on symptoms (71.1%) and treatment (66.3%), with insufficient coverage of prevention (36.3%). Overall quality was moderately low: median scores were 3 on both the Global Quality Score and modified DISCERN scale; PEMAT understandability score was 69%, and actionability score was 50%. Bilibili videos scored higher on actionability (P = 0.029), and videos uploaded by professional institutions demonstrated the best quality (P = 0.045). Longer video duration and inclusion of symptom-related content were independent predictors of higher GQS scores. Notably, video interactivity showed no positive correlation with content quality, revealing an obvious quality-dissemination decoupling phenomenon. This finding carries important clinical implications, as high-quality professional medical content fails to gain corresponding public dissemination, which hinders standardized popular science and long-term self-management among HF patients. The overall quality of heart failure short videos urgently requires improvement, necessitating enhanced professional content supply and rigorous quality control via multiparty collaborative mechanisms. Additionally, short-video health education should be integrated into routine clinical management and telemedicine systems to optimize patient self-management.
    Keywords:  Health education; Heart failure; Quality analysis; Short videos; Social media
    DOI:  https://doi.org/10.1038/s41598-026-58582-z
  34. Cureus. 2026 May;18(5): e109524
       Introduction: Patients' utilization of the internet as a resource for obtaining medical information continues to expand, with increased prevalence and access to educational materials. One method of obtaining medical information online is artificial intelligence (AI)-generated patient education materials (PEMs). As such, the medical community has a fundamental obligation to assess the accuracy, quality, and readability of AI-generated PEMs as patient resources - a critical step in promoting health literacy, combating misinformation, and, ultimately, empowering patients. Given that the perceived severity of patellar tendon ruptures (PTR) can vary, providing clear information is important to support informed decision-making. This study aimed to evaluate and compare the readability and quality of AI-generated responses to patient questions about patellar tendon repair, using four different AI chatbots: ChatGPT 3.5, ChatGPT 4, Gemini 1.0, and Perplexity.
    Methods: There were no significant differences in readability among the four different chatbots, and they all provided responses that were better than the average American reading level. The mean DISCERN scores were as follows: Perplexity (64.2±9.2), ChatGPT 3.5 (49±7.97), Gemini 1.0 (59.2±7.43), and ChatGPT 4 (52±6.28). Even though Perplexity demonstrated the highest mean DISCERN scores among the evaluated AI models, no statistically significant differences in readability were observed among the four chatbots, although results approached significance (p = 0.075). Question 15 of the DISCERN criteria, regarding shared decision-making, was consistently rated at a high level across each AI tool, with an average rating of 4.2 out of 5.
    Results: There were no significant differences in readability among the four different chatbots, and they all provided responses that averaged above the average American reading level. The mean DISCERN scores were as follows: Perplexity (64.2±9.2), ChatGPT 3.5 (49±7.97), Gemini 1.0 (59.2±7.43), and ChatGPT 4 (52±6.28). Perplexity's score was significantly higher than that of ChatGPT 3.5, indicating that the responses of Perplexity were more accurate and reliable than those of ChatGPT 3.5. Question 15 of the DISCERN criteria, regarding shared decision-making, was consistently rated at a high level across each AI tool, with an average rating of 4.2 out of 5.
    Conclusion: This study found that readability remains consistent across various AI tools, while the quality of the information may vary. Perplexity outperformed ChatGPT 3.5 in providing accurate information on patellar tendon ruptures. AI tools demonstrated variability in informational quality scores, although these differences were not statistically significant, highlighting the importance of carefully evaluating AI-generated content before using it as a patient education resource.
    Keywords:  artificial intelligence; chatbot; chatgpt; patellar tendon rupture; patient education
    DOI:  https://doi.org/10.7759/cureus.109524
  35. Digit Health. 2026 Jan-Dec;12: 20552076261462658
       Background: Short video platforms have become important channels for the public to obtain health information, but the quality of health science popularization content varies greatly. Existing studies lack a comprehensive exploration of the determinants of video quality and their interaction mechanisms.
    Objective: This study aimed to identify the key features influencing the quality of cerebrovascular disease health science popularization short videos and clarify their configurational effects.
    Methods: Python web-crawling technology was used to collect health science popularization short videos on TikTok related to cerebrovascular diseases over the past year, and two medical professionals evaluated video quality using the Global Quality Score (GQS) tool. Eight machine learning models were constructed to identify key quality-related features. The joint effect of six features was analyzed for necessity and sufficiency using the fuzzy-set qualitative comparative analysis (fsQCA) method. Finally, the Kruskal-Wallis H test was employed to evaluate differences in quality among videos of varying duration.
    Results: A total of 541 valid videos were collected. Most videos were posted by medical staff (77.27%), among which high-quality videos (with GQS > 3) accounted for 14.42%. The importance of video duration reached 30.5%, making it the most crucial feature affecting video quality. The fsQCA results indicated that short duration was one of the conditions for high-quality videos, with the optimal duration being 3 to 5 minutes.
    Conclusions: Video duration was the main determinant of the quality of cerebrovascular disease health science popularization short videos. Improving the short-video communication skills of medical professionals and optimizing video duration are effective ways to enhance the quality of health science popularization content.
    Keywords:  fuzzy-set qualitative comparative analysis; health science popularization; machine learning algorithms; video duration; video quality
    DOI:  https://doi.org/10.1177/20552076261462658