bims-librar Biomed News
on Biomedical librarianship
Issue of 2026-05-10
38 papers selected by
Thomas Krichel, Open Library Society



  1. J Libr Outreach Engagem. 2026 Feb 16. 5(1): 132-143
      This pilot study examined perceived science learning and attitudes of youth participants (N=100) in a four-lesson, hands-on, nutrition-based STEM (science, technology, engineering, and mathematics) education backpack program offered for children in kindergarten through eighth grade at a public library in rural Appalachian Mississippi. Using a constructivist theoretical framework, our team developed the program and implemented it via drive-through distribution. Science kits included all materials, supplies, and books; postage-paid evaluation postcards; and shelf-stable lunches and snacks meeting USDA summer meal guidelines. Twenty-three of 100 youth participants (23% response rate) returned at least one evaluation postcard. Participants were primarily female (65%), non-Hispanic (90%), White (90%), and in kindergarten through second grade (54.5%). The majority of youth "agreed" or "super-agreed" with the following: I learned about science from the activity (94.2%); I liked doing the science activity (94.3%); I liked reading the related book (90.4%); I had fun completing the activity (98.1%); I would recommend the activity to others (94.2%); I would do the activity again (93.3%). Learning about science was positively correlated with most factors measured, including recommending an activity to others (Tau-b=0.471, p<.001). Liking the science lesson activity had the strongest positive correlation with recommending an activity (Tau-b=0.792, p<.001). Both pre- and post-program, participants perceived the library as "a good place to find science information and activities," with more than 85% of participants "agreeing" or "super-agreeing" with the statement. Hands-on, nutrition-based STEM education promotes perceived science learning and positive attitudes toward science, warranting its further development for and implementation in public libraries across Mississippi to promote both science learning and the pursuit of science and health careers.
    Keywords:  STEM education; nutrition; public library; science; science learning
    DOI:  https://doi.org/10.21900/j.jloe.v5.1902
  2. Med Ref Serv Q. 2026 May 07. 1-13
      The rise of artificial intelligence (AI) has introduced new possibilities for hospital and clinic libraries. A research project surveyed hospital and clinic librarians regarding AI use in their libraries. AI tools are being used for cataloging, retrieving information, conducting systematic reviews, and answering research questions. Limitations such as hallucinations affect the value of AI for library research. Librarians' understanding of these tools is crucial to maximizing the benefits of AI while maintaining professional and ethical standards. This study examined AI usage in hospital and clinical libraries while acknowledging several limitations that affect the interpretation and generalizability of the findings.
    DOI:  https://doi.org/10.1080/02763869.2026.2639471
  3. Res Synth Methods. 2026 May 05. 1-11
      To examine the extent to which information sources other than journal articles are sought for systematic reviews. Cross-sectional study of published systematic reviews. We examined all published systematic reviews included in MEDLINE in a 4-week period in 2019. Both systematic reviews and protocols of reviews were eligible for inclusion. (1) Number and types of information sources sought in systematic reviews; (2) proportion of reviews that explicitly searched for study reports other than journal articles; (3) proportion of reviews that searched resources containing study reports other than journal articles. A total of 1,262 systematic reviews fulfilled the eligibility criteria. The median number of information resources searched for all systematic reviews was 4. Of the 1,262 reviews, study reports other than journal articles were sought in 40% (n = 502) of systematic reviews (97% (n = 64) of Cochrane reviews and 37% (n = 438) of non-Cochrane reviews). Trial registers were searched in 88% of Cochrane reviews and 21% of non-Cochrane reviews. In 99.3% (n = 1,253) of all the systematic reviews, the searches performed had the potential to identify study reports other than journal articles. Between a third and a half of systematic reviews search for study reports other than journal articles. Systematic review searches often cover resources that include study reports other than journal articles, whether or not the reviewers explicitly sought them.
    Keywords:  bibliographic databases; evidence synthesis; information resources; systematic review
    DOI:  https://doi.org/10.1017/rsm.2026.10086
  4. Public Underst Sci. 2026 May 04. 9636625261437376
      This study examines six types of science, health, and medical sources, focusing on public perceptions of each source's gender, credibility, benevolence, and political affiliation. Results reveal that medical doctors were rated highest in credibility and benevolence. All of the expert sources were more likely to be reported as male, reflecting persistent stereotypes. Public health experts and academic scientists were perceived as more liberal, whereas medical doctors and industry scientists did not have perceived political affiliations. Across all sources, perceptions of political partisanship corresponded with lower credibility perceptions. Implications for science and health communication research and practice are considered.
    Keywords:  gender and science; polarization and partisanship; survey research
    DOI:  https://doi.org/10.1177/09636625261437376
  5. Fam Med. 2026 Feb;58(2): 105-111
      Ideally, educators should use the best available evidence to make decisions about their practices as teachers, scholars, and policymakers. However, the rapid increase of scholarly literature in medical education poses a major challenge. Knowledge syntheses (aka reviews), which contextualize and integrate information into a single resource, have become essential tools for navigating this information overload. This article presents an overview of knowledge synthesis in medical education, starting by defining it and providing an overview of the general steps. It then examines four key types of syntheses: systematic reviews, scoping reviews, meta-reviews, and realist reviews, providing examples of each type and, when possible, pointing to reporting guidelines and resources for conducting the type. The article then addresses common methodological pitfalls, including inadequate time planning, limited collaboration with end-users, insufficiently actionable findings, and narrow search strategies. The article concludes by presenting emerging innovations, such as artificial intelligence-supported methodologies, living reviews, and alternative knowledge translation activities.
    DOI:  https://doi.org/10.22454/FamMed.2026.196942
  6. J Vet Pharmacol Ther. 2026 May 06.
      At the Texas A&M University College of Veterinary Medicine (TAMU-CVM), the veterinary pharmacology faculty and library faculty have collaborated to teach aspects of Evidence-Based Veterinary Medicine (EBVM) since 2010. These skills are integral to drug and therapeutic decision-making and are required for veterinary graduate Day-One competency. Herein, we explain the progression of incorporation of EBVM teaching at TAMU-CVM to make clear that the development of teaching and assessment activities did not occur as a single design exercise, but rather in an iterative and reflective manner over several years. We describe the courses in which we have one or more lecture or laboratory sessions focused on scaffolding the skills of EBVM across 3 semesters, including the skills of writing clinical questions, searching the biomedical literature for evidence, critically appraising the evidence, and then applying the evidence to answer the clinical question to make a clinical recommendation. We share the specific contributions of the librarians and the pharmacologist in creating opportunities for students to develop the competencies of EBVM.
    Keywords:  EBVM; clinical skills; competencies; veterinary education
    DOI:  https://doi.org/10.1111/jvp.70084
  7. bioRxiv. 2026 Apr 28. pii: 2026.04.24.719925. [Epub ahead of print]
      Modern biomedical imaging workflows generate large volumes of derived images and short videos that must be reviewed, compared, curated, and reused following primary acquisition and analysis. In practice, these assets are often dispersed across nested filesystem hierarchies on local drives, external media, or network storage, limiting efficient retrieval, deduplication, and figure assembly. We present PixelDeck, an open-source, local-first browser application for organizing and interactively browsing large biomedical image and video libraries on commodity workstations. PixelDeck integrates recursive folder import, SHA-256-based duplicate detection, metadata extraction, thumbnail and preview generation, full-text search, and asynchronous export within a responsive interface, supported by a modular ingestion pipeline, managed storage layer, and interactive browsing environment optimized for high-volume media collections. The system is implemented using a Next.js and React frontend, a SQLite metadata store accessed via Prisma, managed local media storage, and a background worker that executes import and export tasks asynchronously, enabling scalable processing on standard hardware. To evaluate performance, we conducted structured benchmark imports using public histopathology images curated from PanopTILs, SICAPv2, and PanNuke datasets, where dataset-specific import behavior, duplicate detection, and ingestion metrics were recorded as reproducible outputs. Embedding-based analysis further demonstrates dataset-level separation consistent with underlying image characteristics. These results show that PixelDeck provides an efficient, scalable local curation layer for heterogeneous biomedical imaging collections, enabling streamlined dataset exploration and preparation for downstream analysis.
    DOI:  https://doi.org/10.64898/2026.04.24.719925
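    The SHA-256-based duplicate detection described above can be illustrated with a short, self-contained sketch; this is illustrative only, not PixelDeck's code, and the directory name and file extensions are assumptions:

        import hashlib
        from collections import defaultdict
        from pathlib import Path

        def sha256_of(path, chunk_size=1 << 20):
            """Stream a file through SHA-256 so large images and videos never load fully into memory."""
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(chunk_size), b""):
                    digest.update(chunk)
            return digest.hexdigest()

        def find_duplicates(root, extensions=(".png", ".tif", ".jpg", ".mp4")):
            """Recursively hash media files under root and group paths that share a digest."""
            groups = defaultdict(list)
            for path in Path(root).rglob("*"):
                if path.is_file() and path.suffix.lower() in extensions:
                    groups[sha256_of(path)].append(path)
            return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

        if __name__ == "__main__":
            for digest, paths in find_duplicates("./imports").items():
                print(digest[:12], [str(p) for p in paths])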
  8. bioRxiv. 2026 Apr 30. pii: 2026.01.09.697335. [Epub ahead of print]
      The rapid expansion of biomedical literature demands automated summarization tools that can reliably condense research articles into concise, accurate summaries. We benchmarked 62 text summarization methods, ranging from frequency-based and TextRank extractors to encoder-decoder models (EDMs) and large language models (LLMs), on 1,000 biomedical abstracts with author-generated highlights as reference summaries. Models were evaluated using a composite suite of lexical, semantic, and factual metrics, including ROUGE, BLEU, METEOR, embedding-based similarity, and factuality scores. Our results indicate that general-purpose language models (LMs) achieve the highest overall performance across lexical and semantic dimensions, outperforming both reasoning-oriented and domain-specific models. Notably, medium-sized models often outperform frontier-scale counterparts, suggesting an optimal balance between model capacity and computational efficiency. Statistical extractive methods consistently lag behind neural approaches. These findings provide a systematic reference for selecting biomedical summarization tools and highlight that broad pretraining remains more effective than narrow domain adaptation for generating high-quality scientific summaries.
    DOI:  https://doi.org/10.64898/2026.01.09.697335
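    The lexical metrics named above (ROUGE, BLEU, METEOR) are n-gram overlap statistics. As a point of reference, a minimal ROUGE-1 F1 over whitespace tokens can be computed as below; this is an illustrative sketch, not the evaluation harness used in the study:

        from collections import Counter

        def rouge1_f1(reference, candidate):
            """ROUGE-1: unigram recall against the reference and precision against the candidate, combined as F1."""
            ref_counts = Counter(reference.lower().split())
            cand_counts = Counter(candidate.lower().split())
            overlap = sum((ref_counts & cand_counts).values())
            if overlap == 0:
                return 0.0
            recall = overlap / sum(ref_counts.values())
            precision = overlap / sum(cand_counts.values())
            return 2 * precision * recall / (precision + recall)

        # Toy example: a reference highlight versus a model-generated summary.
        print(rouge1_f1("the drug reduced tumor growth in mice",
                        "tumor growth in mice was reduced by the drug"))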
  9. J Equine Vet Sci. 2026 May 01. pii: S0737-0806(26)00159-0. [Epub ahead of print] 163: 105924
       BACKGROUND: Artificial intelligence (AI) platforms are becoming increasingly popular as resources for equine information. However, these platforms generate responses from a wide range of sources and do not always distinguish between fact and opinion.
    AIMS/OBJECTIVES: The objective of this study was to assess the accuracy and quality of AI-generated answers to equine-related questions. Researchers hypothesized that AI platforms could answer basic equine questions effectively but would perform poorly on complex topics or questions.
    METHODS: Forty questions were written covering general horse care, facilities management, nutrition, genetics, and reproduction. Each question was categorized by difficulty level: beginner, intermediate, advanced, or trending. Three AI platforms were tested: ChatGPT (CGPT), Microsoft Copilot (MicCP), and ExtensionBot (ExtBot). Responses were scored for accuracy, relevance, thoroughness, and source quality (5 points each; total 20). Data were analyzed using PROC GLM in SAS (v. 9.4).
    RESULTS: Total score was affected by level (P = 0.002). Intermediate questions had the highest total score (15.95 ± 1.99). Accuracy was affected by platform (P < 0.001), level (P < 0.001), and topic (P = 0.015). CGPT (4.18 ± 0.93) and MicCP (4.08 ± 0.83) outperformed ExtBot (3.26 ± 1.21). Relevance was affected by platform (P = 0.042) and level (P < 0.001). Thoroughness was affected by platform (P < 0.001). Source quality differed by platform (P = 0.037).
    CONCLUSION: AI platforms could serve as resources, but they currently fall short of the knowledge that Equine Extension Specialists can offer. AI platforms had difficulty addressing complex topics and demonstrated inconsistent performance across criteria.
    Keywords:  ChatGPT; Extension factsheets; ExtensionBot; Microsoft copilot
    DOI:  https://doi.org/10.1016/j.jevs.2026.105924
  10. Medicine (Baltimore). 2026 Apr 24. 105(17): e48539
      To evaluate the clinical appropriateness of ChatGPT's responses to questions frequently asked by osteosarcoma patients and their families. Ten questions frequently asked by osteosarcoma patients and their families were identified. Each question was submitted to OpenAI's GPT-5-based ChatGPT (August 2025 version) using separate user accounts. Two orthopedic oncology specialists independently evaluated the responses for clinical appropriateness using a 4-point Likert scale. Interrater agreement was analyzed with weighted Cohen kappa. Interrater agreement was found to be substantial (K = 0.667). One response was rated as an excellent response that did not require clarification, 5 responses were rated as satisfactory responses that required minimal clarification, and 4 responses were rated as satisfactory responses that required moderate clarification. There were no unsatisfactory responses requiring substantial clarification. ChatGPT's responses to osteosarcoma-related questions were found to be largely clinically appropriate. Nevertheless, given its limitations, artificial intelligence should be regarded as a supportive tool that requires physician oversight.
    Keywords:  ChatGPT; artificial intelligence; malignant bone tumors; osteosarcoma; patient education
    DOI:  https://doi.org/10.1097/MD.0000000000048539
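    Interrater agreement of the kind reported above can be computed with scikit-learn's weighted Cohen's kappa; the ratings below are hypothetical, and the linear weighting is an assumption, since the entry does not state which weighting scheme was used:

        from sklearn.metrics import cohen_kappa_score

        # Hypothetical ratings from two reviewers on a 1-4 appropriateness scale for ten responses.
        rater_a = [1, 2, 2, 3, 2, 2, 3, 3, 2, 1]
        rater_b = [1, 2, 3, 3, 2, 2, 3, 2, 2, 2]

        # weights="linear" penalizes disagreements in proportion to their distance on the ordinal scale.
        print(round(cohen_kappa_score(rater_a, rater_b, weights="linear"), 2))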
  11. Int J Rheum Dis. 2026 May;29(5): e70676
       BACKGROUND: Although artificial intelligence (AI) is increasingly recognized for enhancing efficiency in healthcare services, its role in exercise and rehabilitation strategies remains unclear.
    OBJECTIVES: To assess the quality, reliability, accuracy, and readability of three large language models (LLMs), ChatGPT-5, DeepSeek-R1, and Gemini 2.5, in response to questions commonly asked by patients with rheumatoid arthritis (RA) regarding exercise and rehabilitation strategies.
    METHODS: Using a cross-sectional comparative design, a structured assessment framework of exercise- and rehabilitation-related questions was developed and applied between 22 and 29 September 2025, with questions grouped into five thematic domains: exercise and physical activity (S1), hand function (S2), joint protection techniques (S3), breathing and pulmonary health (S4), and general topics (S5). Information quality was evaluated with the modified DISCERN tool, content reliability with the Reliability Score, and accuracy with a five-point Likert Accuracy Scale. Readability was determined through the Flesch Reading Ease scale.
    RESULTS: DeepSeek-R1 and ChatGPT-5 achieved significantly higher scores for quality, reliability, accuracy, and readability compared with Gemini 2.5. In the S1 and S2 subgroups, both models consistently outperformed Gemini 2.5 across all evaluation metrics. Mean readability scores were 50.20 for DeepSeek-R1, 46.66 for ChatGPT-5, and 37.33 for Gemini 2.5, indicating that all responses were classified as difficult to read.
    CONCLUSIONS: This study highlighted that DeepSeek-R1 and ChatGPT-5 generated more accurate and reliable RA-related responses than Gemini 2.5; however, the complex language used by all models may limit accessibility for patients with low health literacy, underscoring the need for professional supervision in RA exercise planning.
    Keywords:  artificial intelligence; exercise; large language model; patient information; rehabilitation; rheumatoid arthritis
    DOI:  https://doi.org/10.1111/1756-185x.70676
  12. Rev Assoc Med Bras (1992). 2026; pii: S0104-42302026000202211. [Epub ahead of print] 72(2): e20251453
       OBJECTIVE: The aim of this study was to compare the accuracy, scientific quality, and clarity of responses generated by GPT-4o and Gemini to frequently asked patient questions related to carotid artery disease and carotid endarterectomy.
    METHODS: In total, 40 unique carotid endarterectomy-related questions were compiled from online sources and clinical experience. Each was entered into separate new sessions with GPT-4o and Gemini 2.5 Flash in Turkish, and responses were collected without modification. Notably, four blinded cardiovascular surgeons independently rated each answer (1-5 Likert scale) in three domains: Accuracy, Scientific Quality, and Clarity. Mean response lengths and domain scores were compared using appropriate paired tests.
    RESULTS: GPT-4o produced longer responses than Gemini (258.1±101.6 vs. 193.2±43.7 words; p<0.001). Overall, GPT-4o had higher Accuracy scores (4.33±0.39 vs. 4.16±0.33; p=0.04), with no significant differences in Scientific Quality or Clarity (p=0.377 and p=0.154, respectively). In rater-level analyses, Gemini scored higher in Clarity for one rater, whereas GPT-4o was superior in Accuracy and Scientific Quality for another. Overall mean scores were comparable (4.17±0.36 vs. 4.13±0.31; p=0.636). Physician referral was recommended in 62.5% of GPT-4o responses and 52.5% of Gemini responses (p=0.366).
    CONCLUSION: Both GPT-4o and Gemini provided "good"-quality responses to carotid endarterectomy patient questions, with GPT-4o showing a modest accuracy advantage and no significant differences in the other domains. Explicit disclaimers on both platforms underscore their supportive, not definitive, role in patient education. Physicians should remain the primary source for individualized decisions, and AI-generated information should always be verified.
    DOI:  https://doi.org/10.1590/1806-9282.20251453
  13. BMC Oral Health. 2026 May 06.
       AIM: To evaluate the accuracy and consistency of responses generated by artificial intelligence (AI) chatbots in pediatric dentistry, specifically concerning fluoride usage.
    STUDY DESIGN: Descriptive cross-sectional study.
    METHODS: Four AI chatbots (ChatGPT, Gemini, Claude, Copilot) and four groups of dental professionals (pediatric dentists, general dentists, pediatric dentistry PhD students, and fifth-year dental students) answered 23 true-false questions based on IAPD, AAPD and EAPD guidelines. Each chatbot was tested 28 times per question in separate sessions. Accuracy was analyzed across four categories: Individual Topical Fluoride Applications, Professional Topical Fluoride Applications, Systemic Fluoride Applications, and Fluorosis. All groups were statistically compared with each other to evaluate differences in response accuracy across AI chatbots and human participant categories.
    RESULTS: Significant differences were observed in the accuracy of chatbot responses across fluoride application categories (p < 0.05). Claude achieved perfect accuracy in Systemic Fluoride Applications (100%), while the other AI models performed lower, with ChatGPT scoring the lowest (94.3%); Gemini showed the highest accuracy in Fluorosis-related questions (76.8%). Among professionals, pediatric dentists (82.3%) consistently had the highest accuracy.
    STATISTICS: Chi-square and Fisher's Exact tests were used to assess differences in response accuracy between groups. A p-value < 0.05 was considered statistically significant.
    CONCLUSIONS: Claude and Gemini demonstrated greater reliability in fluoride-related questions than ChatGPT and Copilot. However, expert oversight remains crucial in pediatric dental care.
    Keywords:  Accuracy; Artificial intelligence; Chatbots; Fluoride; Pediatric dentistry
    DOI:  https://doi.org/10.1186/s12903-026-08502-4
  14. Clin Spine Surg. 2026 Apr 21.
       STUDY DESIGN: Cross-sectional study.
    OBJECTIVE: To evaluate whether the answers of different versions of ChatGPT to frequently asked questions about adolescent idiopathic scoliosis (AIS), compiled from the patient education websites of the American Academy of Orthopaedic Surgeons (AAOS) and the Scoliosis Research Society (SRS), provide appropriate and sufficient information to patients.
    SUMMARY OF BACKGROUND DATA: Artificial intelligence chatbots have gained popularity due to their ability to analyze substantial scientific data using machine learning techniques and generate human-like responses in medicine. These responses can guide patients and families who are seeking information online after a diagnosis of AIS.
    METHODS: Thirty frequently asked questions, selected by expert spine surgeons, were posed to 3 versions of ChatGPT using a new internet browser window for each question, and the responses were recorded. Three orthopedic spine surgeons graded the accuracy of the responses against 2 selected expert websites using a Likert scale. Finally, the response accuracy was evaluated for patient use.
    RESULTS: Median Likert scores for ChatGPT-3.5, ChatGPT-4, and ChatGPT-4o were 4 (1-5), 4 (2-5), and 4 (2-5), respectively. No significant differences were observed among versions within individual categories (all P>0.05). However, a significant difference was found in the overall response scores (P=0.004). Post hoc analysis revealed that ChatGPT-4o achieved significantly higher accuracy than ChatGPT-3.5 (P=0.005, Bonferroni-adjusted), whereas other pairwise comparisons were not significant. When the adequacy of the responses was evaluated, 26/30 (86%) of ChatGPT-3.5 responses were acceptable for patient use, whereas ChatGPT-4 and ChatGPT-4o provided appropriate responses in 29/30 (96%) of the questions.
    CONCLUSIONS: Successive ChatGPT versions demonstrated improved response reliability, with ChatGPT-4o showing a statistically significant advantage over ChatGPT-3.5. Given that ChatGPT-4 and ChatGPT-4o provided accurate and patient-appropriate answers in 96% of cases, these tools may assist in online patient education under clinician supervision.
    LEVEL OF EVIDENCE: Level III.
    Keywords:  ChatGPT; adolescent idiopathic scoliosis; artificial intelligence; patient education
    DOI:  https://doi.org/10.1097/BSD.0000000000002084
  15. JMIR Perioper Med. 2026 May 04. 9: e81374
       Background: Artificial intelligence (AI) models are being increasingly integrated into clinical care. Moreover, the availability of publicly accessible AI resources makes them attractive to patients seeking clinical information. Little is known regarding the use of large language models as patient resources for navigating major cancer diagnoses.
    Objective: This study aimed to evaluate the content, readability, and safety of ChatGPT (OpenAI; GPT-4o)-generated responses to common perioperative queries about hepatic, pancreatic, and colon cancers.
    Methods: A 28-question survey was developed based on frequently asked surgical questions for select malignancies. Surgical oncologists rated ChatGPT-4o-generated responses on a 5-point Likert scale for accuracy, quality, and tangibility. Readability was assessed using the Flesch-Kincaid Reading Grade Level (FKRGL) and Flesch Reading Ease (FRE). Respondents provided free-text comments and reported their comfort with patients using ChatGPT. Survey completion implied consent.
    Results: A total of 7 attending surgical oncologists with a median of 7 (IQR 4-13) years in practice completed the survey. Responses received mean scores of 3.5/5 (SD 0.28) for quality, 3.6/5 (SD 0.34) for accuracy, and 3.6/5 (SD 0.29) for tangibility. The responses had a median FKRGL score of 14.6 (IQR 13.3-15.6) and FRE score of 29.4 (IQR 20.5-36.3). On a post hoc analysis for select questions, the median FKRGL was 15.6 (IQR 14.4-16.7), decreasing to 7.1 (IQR 6.1-8.3) and 14.5 (IQR 13.2-15.4) with prompting and rephrasing, and the median FRE was 18.1 (IQR 14.6-24.7), increasing to 73.8 (IQR 66.6-79.3) and 32.0 (IQR 27.0-37.7) with prompting and rephrasing. Numerous inaccuracies and content gaps were reported, and approximately 43% (3/7) of providers did not report feeling "comfortable" in having patients consult publicly available AI for medical information.
    Conclusions: This study provides cautionary, yet optimistic, findings regarding the value of publicly accessible ChatGPT as a patient resource for abdominal malignancies. Providers should be prepared to effectively counsel patients to identify their educational attainment level when using ChatGPT to mitigate readability challenges.
    Keywords:  generative artificial intelligence; health literacy; patient education; perioperative care; surgical oncology
    DOI:  https://doi.org/10.2196/81374
  16. Sci Rep. 2026 May 03.
      Generative AI is rapidly entering patient education workflows, yet its safety profile for concussion management remains undefined. Utilizing the CHART framework, this cross-sectional audit assessed five platforms, specifically isolating retrieval-augmented generation (RAG) architectures against standard pre-trained Large Language Models (LLMs). We extracted 11 high-volume patient queries from Google Trends and administered them via a zero-shot protocol. Two blinded neurosurgeons then scored the outputs against the 2023 Amsterdam Consensus Statement using four validated instruments: DISCERN and EQIP to evaluate treatment and information quality, GQS for global content quality, and JAMA benchmarks for transparency. Reliability metrics diverged significantly across models (DISCERN and EQIP, p < 0.001). Perplexity Pro secured the highest DISCERN (47.36 ± 4.84) and EQIP (65.00 ± 5.48) values, statistically surpassing foundational models like ChatGPT and Gemini (p < 0.01), a performance gap likely driven by its RAG design. In contrast, GQS scores did not differ significantly across models (p = 0.373), and JAMA-based transparency remained uniformly low (p < 0.001). Readability was assessed using six standard indices (FRES, FKGL, GFI, CLI, ARI, and SMOG), revealing that all models exceeded the 6th-grade reading level; most surpassed 10th-grade, with Perplexity Pro lowest at FKGL = 7.46. Although retrieval-augmented systems improve clinical accuracy, current iterations fail to provide transparent or readable advice. Clinical integration therefore requires rigorous human-in-the-loop verification and a shift toward plain-language algorithm optimization.
    Keywords:  Artificial intelligence; Chatbot; Concussion; Health literacy; Large language models; Patient education
    DOI:  https://doi.org/10.1038/s41598-026-51281-9
  17. JMIR AI. 2026 May 04. 5: e91369
       BACKGROUND: Large language models (LLMs) are increasingly used to generate patient-oriented medical information. In geriatrics, such information must balance accuracy, relevance, and safety, as older adults may be particularly susceptible to misleading or harmful advice. However, systematic evaluations of expert perceptions across multiple geriatric conditions remain limited.
    OBJECTIVE: This study aimed to explore geriatricians' perceptions of the accuracy, relevance, and potential harm of LLM-generated patient information across common geriatric conditions and to examine variability and interrater agreement in expert ratings.
    METHODS: In this cross-sectional expert rating study, 10 geriatricians evaluated 50 LLM-generated statements covering 5 geriatric conditions (sarcopenia, osteoporosis, urinary incontinence, depression, and dementia). Statements addressed diagnostic, etiological, prognostic, risk-related, and therapeutic aspects. Experts rated perceived accuracy, relevance, and potential harm using 5-point Likert scales. Rating distributions were summarized using medians and IQRs. The Kendall coefficient of concordance (W) was used exploratorily to assess agreement in the relative ordering of statements within predefined strata. Readability was assessed using Flesch-Kincaid Grade Level and Flesch Reading Ease.
    RESULTS: Expert ratings indicated high perceived accuracy (median 4.32, IQR 4.01-4.59) and perceived relevance (median 4.51, IQR 4.06-4.66), while perceived potential harm remained low (median 1.59, IQR 1.17-1.92). IQR values ranged from 0.00 to 1.38 with most values clustering below 0.5, indicating limited dispersion in expert ratings. Agreement in the relative ordering of statements varied across domains, with W values ranging from 0.27 to 0.62 (median 0.53, IQR 0.46-0.58), indicating moderate concordance. No statements combined low perceived accuracy with high perceived potential harm. Readability analysis indicated generally accessible language, with a median Flesch-Kincaid Grade Level of 8.3 (IQR 7.4-9.6) and a median Flesch Reading Ease score of 60.8 (IQR 50.1-66.9).
    CONCLUSIONS: LLM-generated patient information for common geriatric conditions was rated as largely accurate and relevant, with low potential harm in typical scenarios. Variability in expert emphasis and the exploratory nature of agreement analyses highlight the limitations of perception-based evaluation. Future studies should incorporate guideline-based validation, readability optimization, and patient-centered outcomes to more comprehensively evaluate the safety and suitability of LLM-generated information for geriatric patient education.
    Keywords:  ChatGPT; LLMs; artificial intelligence in health care; expert consensus; geriatric medicine; large language models; medical informatics; patient education
    DOI:  https://doi.org/10.2196/91369
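    Kendall's coefficient of concordance (W), used above to assess agreement in the relative ordering of statements, can be sketched as follows; the scores are hypothetical and the sketch omits the tie-correction term:

        import numpy as np
        from scipy.stats import rankdata

        def kendalls_w(ratings):
            """Kendall's W for an (m raters x n items) score matrix; ties receive average ranks."""
            ratings = np.asarray(ratings, dtype=float)
            m, n = ratings.shape
            ranks = np.vstack([rankdata(row) for row in ratings])  # rank items within each rater
            rank_sums = ranks.sum(axis=0)
            s = ((rank_sums - rank_sums.mean()) ** 2).sum()
            return 12.0 * s / (m ** 2 * (n ** 3 - n))

        # Three hypothetical raters scoring four statements on a 1-5 scale.
        print(round(kendalls_w([[5, 4, 2, 1],
                                [4, 5, 2, 1],
                                [5, 3, 2, 2]]), 2))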
  18. Cureus. 2026 Mar;18(3): e106221
       BACKGROUND: Diabetes, obesity, and hypertension are common chronic conditions in which the role of lifestyle alteration is central to control. Education materials may accompany these interventions, but will only be helpful if clear and credible. With the development of large language models (LLMs) such as ChatGPT and Google Gemini, their potential to contribute to the production of health education materials should be taken seriously.
    AIM: The aim of this study is to conduct a cross-sectional comparison of five LLMs (ChatGPT-4o, Google Gemini 2.5, Claude Sonnet 4, Grok 3, and Perplexity) in generating patient education brochures on diet and exercise for diabetes, hypertension, and obesity, evaluating their readability, originality, and reliability.
    PRIMARY OBJECTIVE: The primary objective is to compare the readability and reliability of AI-generated patient education materials.
    SECONDARY OBJECTIVE: The secondary objective is to assess lexical complexity and originality of generated content.
    METHODS: This cross-sectional study used standardized questions to generate brochures based on each response provided by the LLMs. Outputs were evaluated for readability (Flesch-Kincaid test), word complexity, novelty (PapersOwl plagiarism software), and consistency (modified DISCERN instrument). Descriptive statistics and one-way ANOVA were used, where p < 0.05 was deemed significant.
    RESULTS: ChatGPT produced the shortest and most readable content, evidenced by its lowest grade level (5.2 ± 0.8) and highest Flesch Reading Ease rating (70.0 ± 5.1). Gemini and Claude produced longer, more elaborate brochures that received higher reliability ratings (3.0 ± 0.0 and 3.0 ± 1.0, respectively) but were at higher reading levels (≈ 9th grade). Grok obtained intermediate scores on all measures, while Perplexity produced shorter responses (≈ 444 words) but had the lowest reliability score (1.3 ± 0.6). There were no major differences in originality scores across the tools.
    CONCLUSION: Every model had a strength: ChatGPT in readability, Gemini and Claude in reliability, Grok in balance, and Perplexity in conciseness. Every model demonstrated at least one parameter where it outperformed the others. The results support LLMs for producing patient-comprehensible leaflets, but human editing, updating against the latest guidelines, and human supervision will be needed before clinical application.
    Keywords:  artificial intelligence (ai); cross-sectional studies; diabetes; diet; exercise; hypertension; large language model (llm); obesity; patient education guide; quality evaluation
    DOI:  https://doi.org/10.7759/cureus.106221
  19. JMIR Bioinform Biotechnol. 2026 May 05. 7: e90572
       Artificial intelligence (AI)-generated content on glucagon-like peptide-1 receptor agonists (GLP-1RAs) provided informationally detailed responses, but its readability remains suboptimal for many patients. Incorporating literacy-sensitive design principles into AI health communication is essential to ensure equitable access to digital medical information.
    Keywords:  AI; ChatGPT; GLP-1RA; Google Gemini; artificial intelligence; health literacy; readability
    DOI:  https://doi.org/10.2196/90572
  20. Cureus. 2026 Mar;18(3): e106160
       INTRODUCTION AND AIM: Artificial intelligence (AI) chatbots are increasingly used by patients to obtain medical information before seeking clinical care; however, the accuracy, readability, and consistency of AI-generated information in rheumatology remain uncertain. This study aimed to assess the readability, accuracy, and consistency of large language model (LLM)-generated patient education content for rheumatoid arthritis (RA), osteoarthritis (OA), and psoriatic arthritis (PsA).
    METHODS: From August 18, 2025, to August 24, 2025, three standardized patient-facing questions per disease were submitted daily to three LLMs, specifically ChatGPT (San Francisco, CA: OpenAI), Google Gemini (Mountain View, CA: Google LLC), and OpenEvidence (Cambridge, MA: OpenEvidence Inc.), with histories cleared between submissions. Readability (Hemingway grade level), word count, accuracy (on a 1-5 scale), and day-to-day consistency (Jaccard similarity) were measured. Responses from each model's most consistent day were accuracy-rated by rheumatology fellows and attendings.
    RESULTS: A total of 189 responses were collected. No model consistently met the American Medical Association (AMA)/National Institutes of Health (NIH)-recommended reading levels for sixth through eighth grade. OpenEvidence produced the most technical content (≥17th grade, indicating postbaccalaureate readability), while ChatGPT and Gemini averaged an 11.9-12.0 grade level. Gemini generated the longest responses (>500 words). ChatGPT showed the highest day-to-day stability (range: <0.07), Gemini moderate variability, and OpenEvidence both the widest range (0.24) and the highest average similarity (0.383). Accuracy ratings varied as follows: OpenEvidence generally scored higher for RA and PsA, while ChatGPT and Gemini were similar across diseases. OA responses showed minimal difference. Best and worst responses mirrored these trends.
    CONCLUSIONS: Current LLMs generate rheumatology information above the recommended reading levels. ChatGPT was the most consistent, Gemini the most detailed, and OpenEvidence the most technical. Persistent barriers to readability highlight the need for health-literacy-optimized AI communication.
    Keywords:  artificial intelligence in healthcare; degenerative joint disease; information literacy; information quality; large language models; patient education material; psoriatic arthritis; rheumatoid arthritis
    DOI:  https://doi.org/10.7759/cureus.106160
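    The day-to-day consistency measure above is Jaccard similarity; one plausible token-set version (the entry does not specify the tokenization actually used) is sketched below:

        def jaccard(text_a, text_b):
            """Jaccard similarity between two responses, treated as sets of lowercase word tokens."""
            a, b = set(text_a.lower().split()), set(text_b.lower().split())
            if not a and not b:
                return 1.0
            return len(a & b) / len(a | b)

        day1 = "Rheumatoid arthritis is a chronic autoimmune disease of the joints."
        day2 = "Rheumatoid arthritis is a long-term autoimmune condition affecting the joints."
        print(round(jaccard(day1, day2), 3))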
  21. Prim Care Companion CNS Disord. 2026 May 05. pii: 25m04077. [Epub ahead of print] 28(3)
      Objective: To identify the readability levels of measures used in assessing psychosis.
    Methods: Measures were identified through a literature search. Fourteen measures met the inclusion criteria (written in English, developed in the US between 1997 and 2024, and publicly available) and were analyzed using 4 validated formulas: Gunning Fog, Simple Measure of Gobbledygook, FORCAST, and Flesch Reading Ease Score. Measures with an average readability score exceeding 6.00 were above the recommended reading level.
    Results: All measures exhibited mean readability scores above the recommended sixth-grade level. The mean reading levels of the instruction and item sections were 9.08 (SD=1.44, range 7.13-10.70) and 9.06 (SD=1.98, range 7.08-13.79), respectively.
    Conclusion: The findings indicate that measures used in assessing psychosis are written above the recommended reading levels and do not conform to suggested standards. The study highlights a significant gap in the readability of psychosis assessment measures, emphasizing the need for improvements to ensure accurate symptom assessment and effective treatment monitoring for individuals with psychotic disorders.
    DOI:  https://doi.org/10.4088/PCC.25m04077
  22. Ann Vasc Surg. 2026 May 06. pii: S0890-5096(26)00282-7. [Epub ahead of print]
       OBJECTIVES: Health literacy is a key determinant of patient outcomes. Patient education materials (PEMs) are intended to improve health literacy. The National Institutes of Health (NIH) and the American Medical Association (AMA) recommend that PEMs be written at or below an 8th-grade and a 6th-grade reading level, respectively. Despite these recommendations, repeated studies demonstrate that most PEMs are written at significantly higher reading levels, limiting their effectiveness for most patients. The purpose of this study was to evaluate the readability of publicly available PEMs from the Society for Vascular Surgery (SVS).
    METHODS: Fifteen SVS PEMs were identified and downloaded from the SVS website. They were then converted to plain text. Their readability was assessed using the following six validated indices: Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease (FRE), Gunning-Fog Index (GFI), Coleman-Liau Index (CLI), Automated Readability Index (ARI), and Simple Measure of Gobbledygook (SMOG). Descriptive statistics, ANOVA, Kruskal-Wallis, and Cohen's d were used to compare results against NIH/AMA benchmarks and assess differences across indices.
    RESULTS: All 15 SVS PEMs analyzed were written above the recommended 6th-grade reading level (p < 0.01) and 12 exceeded the 8th-grade level (p < 0.05). Despite significant variations between readability metrics, all indices produced consistent grade-level estimates for all PEMs and showed that the average readability score from each metric was above the 6th-grade level. The mean readability grade levels across the six validated indices ranged from 8.63 to 14.56, with Cerebrovascular Disease, Arterial Dissection, and Deep Venous Thrombosis being the most difficult to understand. There were statistically significant variations observed between the readability metrics utilized (ANOVA p < 0.0001; Kruskal-Wallis p < 0.0001), with an eta-squared analysis suggesting that 30.6% of the variance in readability scores was attributable to the choice of readability calculator used. FRE scores categorized 73% of PEMs as "fairly difficult" or "very difficult". Of note, the SMOG index produced the lowest grade levels, while the Coleman-Liau index produced the highest (difference = 3.33 grade levels).
    CONCLUSION: PEMs from the SVS consistently exceed the readability thresholds recommended by the NIH and AMA, mirroring trends observed in other medical specialties. Poorly designed PEMs can exacerbate existing disparities in health literacy and further worsen poor outcomes in an already sick patient population. These results underscore the need for the SVS to improve patient handouts to better align with national readability standards.
    Keywords:  Flesch; PEM; SMOG; health literacy; patient education material; readability
    DOI:  https://doi.org/10.1016/j.avsg.2026.04.043
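    The readability indices used in this and several neighboring entries are fixed linear formulas over sentence, word, and syllable counts. A minimal sketch of the two Flesch measures follows; the vowel-group syllable heuristic is an assumption, and dedicated readability tools count syllables more carefully:

        import re

        def count_syllables(word):
            """Crude heuristic: count vowel groups; real readability tools use better syllable counters."""
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def flesch_scores(text):
            """Flesch Reading Ease and Flesch-Kincaid Grade Level from the standard published formulas."""
            sentences = max(1, len(re.findall(r"[.!?]+", text)))
            words = re.findall(r"[A-Za-z']+", text)
            syllables = sum(count_syllables(w) for w in words)
            wps = len(words) / sentences   # words per sentence
            spw = syllables / len(words)   # syllables per word
            fre = 206.835 - 1.015 * wps - 84.6 * spw
            fkgl = 0.39 * wps + 11.8 * spw - 15.59
            return round(fre, 1), round(fkgl, 1)

        print(flesch_scores("Carotid disease narrows the arteries in your neck. "
                            "Surgery can lower your risk of stroke."))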
  23. Cureus. 2026 Apr;18(4): e106538
       BACKGROUND: Keloids are fibroproliferative scars that can disproportionately affect individuals with skin of color. Given the complexity of treatment options, many patients seek guidance through online health resources. However, these materials may not be written at a level accessible to the general public, particularly those with limited health literacy.
    OBJECTIVE: To evaluate the readability of the top 100 Google search results for "keloid scar" and determine whether these online patient education materials (PEMs) meet the readability standards recommended by the American Medical Association (AMA).
    METHODS: A Google search was conducted on April 29, 2025, using incognito mode with cleared cookies to minimize bias. The first 100 websites were screened, with exclusions for duplicate content, insufficient text (<250 words), scientific articles, clinician-targeted pages, and non-educational material. A total of 40 websites met the inclusion criteria. Readability was assessed using six validated formulas: Flesch-Kincaid Grade Level, Simple Measure of Gobbledygook (SMOG), Gunning Fog, Coleman-Liau, Automated Readability Index, and Linsear Grade Level. Mean grade-level scores were calculated and stratified by source type.
    RESULTS: Only 12.5% (5/40) of keloid-related PEMs had an average grade-level score at or below the AMA's recommended sixth-grade threshold. The average readability across all materials was at the 10th-grade level (mean=9.54; range 6.00-12.84). When considering individual formulas, the highest reading level assigned per resource ranged from grade 11.09 to 15.02. Government websites had the lowest mean readability (sixth-grade level), while "Other" sources scored highest (mean=10.38). Academic/hospital-based sites (mean=9.26), commercial (mean=9.79), and non-profit sources (mean=9.53) also exceeded recommended levels. A one-way analysis of variance (ANOVA) showed no significant difference in readability across source types (p=0.122).
    LIMITATIONS: Only English-language, text-based materials were analyzed; multimedia content and regional search variations were not assessed.
    CONCLUSIONS: Online PEMs related to keloid scars exceed recommended readability levels, potentially limiting their utility for patients with low health literacy. These findings underscore the need for more accessible, culturally inclusive materials, especially for populations disproportionately affected by keloids.
    Keywords:  dermatology; health literacy; keloid scars; online health information; patient education; readability; skin of color
    DOI:  https://doi.org/10.7759/cureus.106538
  24. Stroke. 2026 May 06.
       BACKGROUND: Informed consent forms (ICFs) for clinical trials are often written above the recommended eighth-grade level. We aimed to compare the readability of original ICFs used for National Institutes of Health-funded stroke-related clinical trials with ICFs edited for readability using artificial intelligence.
    METHODS: Publicly available ICFs associated with National Institutes of Health-funded stroke-related clinical trials were accessed through ClinicalTrials.gov (search period: inception to August 12, 2025). Using ChatGPT-4o, we created a customized Generative Pre-Trained Transformer (GPT) designed to lower the reading level to eighth grade or below while maintaining ICF content. We processed each ICF using this GPT to create edited ICFs. Standard readability metrics, including the Flesch-Kincaid grade level (primary outcome), were compared between original and edited ICFs using paired t tests or the McNemar test (cross-sectional design). We also assessed semantic similarity using the MPNet language model, which produced continuous scores from 0 (no similarity) to 1 (perfect similarity).
    RESULTS: ICFs were available for 46 stroke trials, including behavioral (n=21), device (n=15), drug (n=5), and other (n=5) intervention types. Mean reading levels were 11.52 for the original and 9.47 for the GPT-edited ICFs using the Flesch-Kincaid grade level (P<0.001). Only 1 (2%) of the original ICFs and 18 (39%) of the GPT-edited ICFs had a Flesch-Kincaid reading level at or below eighth grade (P<0.001). Both the Simple Measure of Gobbledygook and Gunning Fog Index favored the GPT-edited ICFs by 1 to 2 grade levels. The Flesch Reading Ease score favored the GPT-edited ICFs by about 8 points. The mean similarity score was 0.85 (SD=0.04).
    CONCLUSIONS: GPT-edited ICFs achieved a readability reduction of approximately 2 grade levels compared with the original ICFs while preserving high semantic similarity. Customized GPTs may be a useful tool to improve the readability of clinical trial ICFs.
    Keywords:  artificial intelligence; comprehension; health literacy; informed consent; stroke
    DOI:  https://doi.org/10.1161/STROKEAHA.126.055985
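    The semantic-similarity check described above can be reproduced in outline with the sentence-transformers library; the "all-mpnet-base-v2" checkpoint and the example sentences are assumptions, since the entry only states that an MPNet model was used:

        # pip install sentence-transformers
        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-mpnet-base-v2")  # assumed checkpoint

        original = "You may stop taking part in the study at any time without penalty."
        edited = "You can leave the study whenever you want, and nothing bad will happen to you."

        embeddings = model.encode([original, edited], convert_to_tensor=True)
        score = util.cos_sim(embeddings[0], embeddings[1]).item()  # cosine similarity
        print(round(score, 2))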
  25. Rev Assoc Med Bras (1992). 2026; pii: S0104-42302026000102203. [Epub ahead of print] 72(1): e20250438
       BACKGROUND: YouTube, a video-sharing platform, aids health information sharing. Although social media's role in heart failure care remains unclear, it can enhance interaction, education, and engagement, fostering patient-centered care and encouraging treatment adherence and active health management.
    OBJECTIVE: The aim of the study was to evaluate the quality and usability of heart failure-related YouTube videos as a source of information for patients.
    METHODS: A total of 100 English-language YouTube videos on heart failure were analyzed. Videos were categorized based on uploader identity (healthcare vs. non-healthcare professionals) and assessed using quality criteria for consumer health information, Global Quality Scale, Journal of the American Medical Association criteria, and Video Power Index. Quantile regression analysis was performed to identify independent predictors of video quality.
    RESULTS: Of the videos analyzed, 69% were uploaded by healthcare professionals. The mean quality criteria for consumer health information score was 21, the mean Global Quality Scale score was 3, and the mean Journal of the American Medical Association score was 3. Videos from professionals and longer videos had significantly higher quality scores. Quantile regression showed that video duration predicted high Global Quality Scale values at the 75th and 90th percentiles, while professional source was a consistent predictor across most quantiles.
    CONCLUSION: The overall quality of YouTube videos on heart failure was found to be low to moderate, with substantial room for improvement. Videos uploaded by healthcare professionals, however, consistently demonstrated higher quality across evaluation metrics. Longer videos tend to have higher quality, but popularity does not correlate with content reliability. Efforts should be made to improve video content for better patient education.
    DOI:  https://doi.org/10.1590/1806-9282.20250438
  26. J Laparoendosc Adv Surg Tech A. 2026 May 06. 10926429261449963
       INTRODUCTION: YouTube has become a widely used tool for surgical education, offering open access to procedural videos for trainees and professionals alike. However, the reliability and pedagogical quality of these publicly available resources remain uncertain. In the context of minimally invasive inguinal hernia repair, we hypothesized that robotic (RT) surgery videos provide superior educational value compared with laparoscopic (LAP) ones. This study aimed to systematically evaluate and compare the quality of RT and LAP transabdominal preperitoneal (TAPP) inguinal hernia repair videos available on YouTube.
    METHODS: Based on an a priori sample size calculation for a moderate effect size (Cohen's d = 0.5), we determined that 63 videos per group would be required for adequate statistical power. On March 19, 2025, a structured search was performed on YouTube using the term "Transabdominal preperitoneal repair for inguinal hernia." This strategy generated an initial pool of 300 potentially eligible videos, which were screened sequentially until the predetermined sample size of 63 videos per group was achieved. Eligible content featured TAPP repairs via RT or LAP approach. Duplicates, non-inguinal TAPP procedures, videos consisting exclusively of animations, conference lectures, or irrelevant videos were excluded. The primary objective was to evaluate videos containing operative demonstrations of surgical procedures. After this selection, two blinded hernia surgeons independently assessed all videos using a newly developed 21-item qualitative evaluation tool and the validated LAParoscopic surgery Video Educational GuidelineS (LAP-VEGaS) score, a tool for evaluating surgery videos submitted for presentation or publication. Group comparisons were conducted using Welch's t-test and Mann-Whitney U test. Effect size was reported using Cohen's d.
    RESULTS: From 300 videos screened, 126 met inclusion criteria (63 RT, 63 LAP). RT videos scored significantly higher than LAP videos on the newly developed qualitative evaluation tool (mean score 0.54 vs. 0.44; P < .001; Cohen's d = -0.60), indicating a moderate effect size. Similarly, RT videos demonstrated higher LAP-VEGaS scores (7.46 vs. 6.34), although this difference did not reach statistical significance (P = .091). These findings suggest that RT videos present superior adherence to technical and educational standards. Both assessment tools demonstrated adequate inter-rater agreement and internal consistency, supporting their reliability for evaluating educational video content.
    CONCLUSION: YouTube contains a large repository of TAPP repair videos, but quality is inconsistent. The new qualitative tool demonstrated strong reliability and internal consistency, supporting its use for educational video assessment. RT videos showed greater adherence to technical and educational standards compared with LAP. RT videos may therefore offer more structured learning content, but general quality improvements remain necessary across both approaches.
    Keywords:  TAPP; hernia; inguinal; surgical education; videos
    DOI:  https://doi.org/10.1177/10926429261449963
  27. Sci Rep. 2026 May 08.
       BACKGROUND: IgA nephropathy, a common chronic kidney disease, relies on accurate health knowledge for long-term patient management. Short videos have become a significant channel for the public to access medical information, but their overall quality varies widely, and short videos on IgA nephropathy remain largely unexamined. A cross-sectional study design was employed to systematically retrieve videos from both platforms (TikTok and Bilibili) using "IgA nephropathy" as the keyword. Based on the inclusion criteria, 186 short videos were obtained. Video characteristics, uploader categories, and interaction metrics were collected. Quality assessment was performed using the global quality scale (GQS), modified DISCERN (mDISCERN), Journal of the American Medical Association (JAMA) benchmark criteria, and the Video Information and Quality Index (VIQI). The overall quality of videos across both platforms was moderate. Videos published by healthcare personnel scored significantly higher than those from individual users on the GQS, mDISCERN, and JAMA criteria. Monologue-style videos demonstrated superior quality compared to other expression forms. TikTok significantly outperformed Bilibili in terms of VIQI-1/2 scores and also garnered higher levels of user interaction. Moderate to strong positive correlations were observed between interaction metrics and both GQS and VIQI scores, whereas no significant correlation was found with mDISCERN or JAMA scores. The primary focus of video content was on treatment and prognosis, while information on etiology and financial burden was notably insufficient. There are platform-specific differences in the visual presentation and information accuracy of IgA nephropathy-related short videos. Insufficient professional involvement and the absence of key information may limit the public's comprehensive understanding of the disease. Enhancing healthcare personnel participation, improving the quality of visual presentation, supplementing core medical information, and optimizing platform recommendation mechanisms will contribute to increasing the effectiveness of short videos in public health education.
    Keywords:  Bilibili; IgA nephropathy; Information quality; Social media; TikTok
    DOI:  https://doi.org/10.1038/s41598-026-52584-7
  28. Am J Health Promot. 2026 May 04. 8901171261447644
      Purpose: The rapid growth of semaglutide use for weight loss has been accompanied by a proliferation of patient-shared experiences and non-evidence-based claims on video platforms. This unchecked information environment poses significant risks to public health, including potential self-medication and misunderstanding of treatment risks, underscoring the urgent need to evaluate the quality of semaglutide-related video content to safeguard digital health literacy. This study assesses the quality, reliability, and user engagement of semaglutide-related short videos on TikTok and Bilibili.
    Approach: This cross-sectional study analyzed the top 100 semaglutide-related videos from TikTok and Bilibili, using keyword searches. Videos were evaluated using JAMA benchmark criteria, Global Quality Scale (GQS), and DISCERN tools.
    Setting: The top 100 videos were retrieved from TikTok (Mar 4, 2025) and Bilibili (Mar 8, 2025) using "" (Semaglutide) as the search keyword.
    Participants: 200 videos and their characteristics.
    Results: Among 200 videos, no statistically significant inter-platform differences in JAMA, GQS or DISCERN scores were observed. Non-professional organizations achieved higher JAMA scores than individual creators (P < .01). Medical information videos scored higher than personal experience content (P < .0001). Engagement metrics (likes) correlated weakly with quality (r = 0.151, P < .05), while longer videos were associated with higher DISCERN scores (r = 0.273, P < .001) but not increased engagement.
    Conclusions: Semaglutide-related videos on TikTok and Bilibili show moderate quality, with medical professionals and institutions producing more reliable content. However, user engagement remains a poor indicator of quality. These findings call for platform governance to algorithmically promote evidence-based content and verify credible creators, while public health efforts should steer user attention from popularity to credibility, thereby protecting informed decision-making.
    Keywords:  Bilibili; TikTok; health policy; public health; semaglutide; short videos
    DOI:  https://doi.org/10.1177/08901171261447644
  29. Medicine (Baltimore). 2026 Apr 24. 105(17): e48400
      Social media platforms have become important channels for public access to health information. Recent studies have evaluated stroke-related videos on TikTok, Bilibili, and other platforms; however, evidence focusing specifically on Mandarin-language ischemic stroke-related videos remains limited. This study aimed to evaluate the content, quality, and reliability of Mandarin-language ischemic stroke-related videos on TikTok and Bilibili. In this cross-sectional study, the quality and reliability of Mandarin-language ischemic stroke-related videos on TikTok and Bilibili were evaluated on October 2, 2025. Video duration, engagement metrics, and uploader identity were collected. The Global Quality Scale (GQS) and modified DISCERN (mDISCERN) tools were used to assess video quality and reliability. Mann-Whitney U and Kruskal-Wallis tests were used for group comparisons, and Spearman rank correlation was used for correlation analysis. A total of 186 videos were included. The videos primarily focused on clinical manifestations (21.33%) and treatment (18.86%), with limited content on prognosis (9.35%). The median GQS score was 3.0 (interquartile range: 2.0-3.0), and the median mDISCERN score was 2.0 (interquartile range: 2.0-3.0). There were no significant differences in GQS and mDISCERN scores between platforms (P > .05). Videos uploaded by specialized healthcare professionals had the highest GQS and mDISCERN scores (P < .05). There was no correlation between engagement metrics (likes, comments, shares) and video quality (P > .05). The overall quality of ischemic stroke-related short videos on social media platforms is suboptimal, and professional background significantly influences video quality and reliability. Future efforts should strengthen content supervision on platforms and optimize health information dissemination strategies to enhance the accuracy of ischemic stroke-related video content.
    Keywords:  Bilibili; TikTok; health information quality; ischemic stroke; social media
    DOI:  https://doi.org/10.1097/MD.0000000000048400
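The group comparisons named in entry 29 (Mann-Whitney U between two platforms, Kruskal-Wallis across uploader categories) can be illustrated with a short sketch. The toy data frame and column names below are invented for illustration and are not the study's own code.

```python
# Hypothetical illustration of the non-parametric group comparisons
# described in entry 29 (not the authors' actual analysis code).
import pandas as pd
from scipy.stats import mannwhitneyu, kruskal

# Toy data frame: one row per video, with assumed column names.
df = pd.DataFrame({
    "platform": ["TikTok", "Bilibili", "TikTok", "Bilibili", "TikTok", "Bilibili"],
    "uploader": ["specialist", "layperson", "nurse", "specialist", "layperson", "nurse"],
    "gqs":      [3, 4, 2, 5, 2, 3],
})

# Two-group comparison of GQS scores between platforms (Mann-Whitney U).
tiktok = df.loc[df["platform"] == "TikTok", "gqs"]
bilibili = df.loc[df["platform"] == "Bilibili", "gqs"]
u_stat, p_platform = mannwhitneyu(tiktok, bilibili, alternative="two-sided")

# Multi-group comparison of GQS scores across uploader categories (Kruskal-Wallis).
groups = [g["gqs"].to_numpy() for _, g in df.groupby("uploader")]
h_stat, p_uploader = kruskal(*groups)

print(f"Mann-Whitney U={u_stat:.1f}, p={p_platform:.3f}")
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_uploader:.3f}")
```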
  30. Sci Rep. 2026 May 08.
      Epilepsy is one of the most prevalent chronic neurological disorders worldwide, affecting approximately 70 million people globally and imposing substantial burdens on patients, families, and healthcare systems. Its multifaceted treatment landscape spanning antiepileptic drug (AED) therapy, epilepsy surgery, ketogenic dietary therapy, and neuromodulation makes accurate health information critical for patient decision-making and treatment adherence. Short-video platforms such as TikTok (Douyin) and Bilibili have emerged as primary channels through which the public accesses health-related content, yet the quality and reliability of epilepsy-related content on these platforms remain largely unexamined. A cross-sectional content analysis was conducted. We systematically retrieved videos via keyword search on TikTok (Douyin) and Bilibili, using the terms "dianxian" (epilepsy) and "jingfeng" (seizure/convulsion). For each platform, we collected the top 100 unique videos ranked by the platform's default relevance algorithm, with duplicate results from the two search terms removed. After applying pre-specified inclusion and exclusion criteria, 182 videos were included in the final analysis. Two physicians independently assessed the videos using a multi-instrument framework with clear applicable boundaries: Global Quality Score (GQS, for overall educational quality across all content types), modified DISCERN (mDISCERN, exclusively for treatment information reliability), JAMA benchmark criteria (for source transparency, not direct clinical accuracy), and a novel Treatment Misinformation Risk Scale (TMRS, specifically for epilepsy treatment-related content). Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Engagement metrics and uploader characteristics were also recorded, with sensitivity analyses performed to control for confounding from uneven content theme distribution between platforms. A total of 182 videos were analyzed (96 from TikTok, 86 from Bilibili). The overall educational quality was suboptimal (mean GQS: 2.65 ± 0.93; mDISCERN: 2.12 ± 0.89 for treatment-containing videos). Bilibili videos demonstrated significantly higher performance across all instruments: overall educational quality (GQS: 3.11 ± 0.87 vs. 2.24 ± 0.84, P < 0.001), treatment information reliability (mDISCERN: 2.56 ± 0.81 vs. 1.74 ± 0.76, P < 0.001), and source transparency (JAMA: 2.18 ± 0.72 vs. 1.42 ± 0.68, P < 0.001). The mean normalized TMRS score was 1.15 ± 0.62, with TikTok showing significantly higher treatment misinformation risk (1.41 ± 0.54) than Bilibili (0.86 ± 0.53, P < 0.001). TMRS scores were positively correlated with likes (rho = 0.46, P < 0.001), shares (rho = 0.43, P < 0.001), and comments (rho = 0.39, P < 0.001), while quality scores showed no significant correlation with engagement. Sensitivity analyses confirmed that the observed platform differences were not confounded by differences in content theme distribution. Epilepsy-related content on China's major short-video platforms is of concerningly poor quality, with treatment misinformation receiving disproportionately higher user engagement. These findings highlight the urgent need for collaborative efforts among neurologists, platform operators, and health authorities to improve the quality of epilepsy health information in the digital environment.
    Keywords:  Bilibili; Content analysis; Epilepsy; Health information quality; Misinformation; Short video; Social media; TikTok
    DOI:  https://doi.org/10.1038/s41598-026-52532-5
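Entry 30 reports inter-rater reliability between its two physician raters as an intraclass correlation coefficient (ICC). A minimal sketch of one common way to compute an ICC in Python follows; the long-format layout and column names are assumptions for illustration, and pingouin is simply one library that implements ICC, not necessarily what the authors used.

```python
# Hypothetical ICC computation for two raters scoring the same videos
# (illustrative only; not the study's actual analysis).
import pandas as pd
import pingouin as pg

# Long format: one row per (video, rater) pair with the assigned GQS score.
ratings = pd.DataFrame({
    "video": ["v1", "v1", "v2", "v2", "v3", "v3", "v4", "v4", "v5", "v5", "v6", "v6"],
    "rater": ["A", "B"] * 6,
    "gqs":   [3, 3, 2, 1, 4, 4, 2, 3, 5, 4, 3, 3],
})

icc = pg.intraclass_corr(data=ratings, targets="video",
                         raters="rater", ratings="gqs")
# The output table lists several ICC variants (ICC1, ICC2, ICC3, ...);
# two-way models (ICC2/ICC3) are the usual choice for a fixed pair of raters.
print(icc[["Type", "ICC", "CI95%"]])
```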
  31. Digit Health. 2026 Jan-Dec;12: 20552076261443721
       Background: Intracerebral hemorrhage (ICH) is the most devastating stroke subtype, characterized by high mortality and disability rates. With the rapid growth of short-video platforms, TikTok and BiliBili have become important channels for the public to obtain health information. However, the quality and reliability of ICH-related videos on these platforms have not been systematically evaluated.
    Methods: On October 17-18, 2025, the top 100 comprehensively ranked videos were collected from TikTok and BiliBili separately, using the Chinese term for intracerebral hemorrhage (ICH) as the search keyword. After screening, 146 videos were included. Two independent reviewers assessed video quality and reliability using three standardized tools: the Global Quality Scale (GQS), the modified DISCERN (mDISCERN) instrument, and the JAMA benchmark. Correlation analysis was used to evaluate the relationship between video quality and engagement metrics.
    Results: Videos on BiliBili scored significantly higher than those on TikTok on both the GQS and mDISCERN (P < 0.001 and P = 0.039, respectively). Furthermore, BiliBili outperformed TikTok in terms of content completeness. However, TikTok demonstrated significantly higher engagement metrics (likes, favorites, comments, and shares) than BiliBili (P < 0.01 for all). Videos uploaded by healthcare professionals achieved the highest quality, with a median GQS score of 3 (IQR 2-4). Correlation analysis revealed a positive correlation between video length and quality scores, while the number of comments was negatively correlated with both video quality and length.
    Conclusion: The quality and reliability of ICH-related videos were superior on BiliBili compared to TikTok, whereas TikTok exhibited overwhelming advantages in user engagement. Longer video duration was associated with better quality and reliability. Although videos from healthcare professionals scored higher in quality and reliability than those from individual users, they still did not meet the standard for high-quality information. This indicates the ongoing challenge of effectively translating specialized medical knowledge into reliable, practical, and easily understandable information for the public.
    Keywords:  BiliBili; ICH; TikTok; information quality; online videos; social media
    DOI:  https://doi.org/10.1177/20552076261443721
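Entry 31's correlation analysis between video length, engagement, and quality is most likely a rank-based correlation such as Spearman's, as in the neighbouring studies; that is an assumption here. A minimal sketch under that assumption follows, with invented data and illustrative variable names.

```python
# Illustrative Spearman rank correlations between video duration,
# engagement, and quality scores (assumed analysis; toy data).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
duration_s = rng.integers(15, 600, size=50)          # video length in seconds
gqs = np.clip(np.round(duration_s / 150 + rng.normal(0, 1, 50)), 1, 5)
comments = rng.poisson(200, size=50)

rho_len, p_len = spearmanr(duration_s, gqs)
rho_com, p_com = spearmanr(comments, gqs)
print(f"duration vs GQS: rho={rho_len:.2f}, p={p_len:.3f}")
print(f"comments vs GQS: rho={rho_com:.2f}, p={p_com:.3f}")
```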
  32. Front Public Health. 2026;14: 1796766
       Background: Electroconvulsive therapy (ECT) is a well-established and effective treatment for several psychiatric disorders; however, stigma and misinformation surrounding ECT remain widespread. Social media has become a major source of health information for patients and may influence treatment perceptions and decision-making, yet the quality and reliability of ECT-related content vary substantially across platforms. This study aimed to evaluate the quality, reliability, and dissemination characteristics of ECT-related videos on TikTok, BiliBili, and YouTube, and to identify factors associated with higher informational quality.
    Methods: On December 8, 2025, the top 100 videos retrieved using a Chinese-language keyword for electroconvulsive therapy on TikTok and BiliBili, and the English term "ECT" on YouTube, were screened. Videos were independently assessed for attitude toward ECT, content completeness, and overall quality using the Global Quality Scale (GQS), modified DISCERN (mDISCERN), and the Medical Quality Video Evaluation Tool (MQ-VET). Inter-rater reliability was calculated, and non-parametric statistical tests and Spearman correlation analyses were performed.
    Results: A total of 71 TikTok videos, 75 BiliBili videos, and 86 YouTube videos were included. YouTube videos demonstrated significantly greater content completeness than those on BiliBili. Overall quality scores were higher on YouTube than on BiliBili, and YouTube also outperformed TikTok in both mDISCERN and total MQ-VET scores. Video uploader identity, presentation format, and content category were differentially associated with video quality across platforms. Engagement metrics were not correlated with video quality on TikTok or BiliBili, whereas positive correlations were observed on YouTube.
    Conclusion: Substantial platform-specific differences exist in the dissemination and quality of ECT-related health information. TikTok demonstrates strong user engagement, whereas YouTube provides more comprehensive and reliable content. These findings underscore the importance of platform-tailored, evidence-based strategies to improve the quality and public communication of ECT-related information.
    Keywords:  electroconvulsive therapy; online health information; short-video platforms; social media; video quality assessment
    DOI:  https://doi.org/10.3389/fpubh.2026.1796766
  33. J Gynecol Obstet Hum Reprod. 2026 May 02. pii: S2468-7847(26)00105-4. [Epub ahead of print] 103205
       BACKGROUND: Pregnancy-related information content on TikTok has not yet been investigated.
    OBJECTIVE: To assess the quality, reliability, and misinformation in TikTok videos regarding induction of labor (IOL).
    STUDY DESIGN: A cross-sectional analysis of TikTok videos retrieved with the keyword "Induction of Labor" was conducted on January 13, 2025. All videos retrieved under this search term were evaluated. Patient-generated and healthcare-generated TikTok content was compared using the following tools: the Patient Education Materials Assessment Tool for Audiovisual Materials (PEMAT A/V), the modified DISCERN (mDISCERN), the Global Quality Scale (GQS), and the Video Information and Quality Index (VIQI).
    RESULTS: One hundred fifty TikTok videos were examined. The content was created mainly by patients (52%; 78/150), followed by healthcare professionals (39%; 59/150) and other sources (9%; 13/150). Healthcare content showed higher median PEMAT A/V scores for actionability and understandability (81.8% and 66.7%, respectively) than patient-generated content (75.0% and 33.3%; P = 0.01 and P < 0.001). On the VIQI, healthcare videos outperformed patient content in information accuracy (4.0 vs 2.5), precision (4.0 vs 2.5), and total VIQI score (14.0 vs 10.0; all P < 0.001). Healthcare and other sources had a median mDISCERN reliability score of 2.0 (P < 0.001). The median GQS was 4.0 for healthcare content versus 2.5 for patient content (P < 0.001).
    CONCLUSION: Patient-generated TikTok content scored low on all validated assessment tools. Healthcare videos scored higher for understandability, actionability, and accuracy. These findings suggest that content from obstetric healthcare professionals on social media is probably necessary to offer evidence-based information on IOL.
    Keywords:  TikTok; healthcare professionals; induction; induction of labor; internet; misinformation; patients; quality; reliability; social media; video
    DOI:  https://doi.org/10.1016/j.jogoh.2026.103205
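The PEMAT A/V percentages reported in entry 33 are conventionally computed as the share of applicable items rated "agree", scaled to 0-100. The sketch below illustrates that scoring rule; the item ratings are invented and the helper function name is hypothetical.

```python
# Hypothetical PEMAT-style percent score: agree = 1, disagree = 0,
# not-applicable items are excluded from the denominator.
from typing import Optional, Sequence

def pemat_percent(item_ratings: Sequence[Optional[int]]) -> float:
    """Return a PEMAT-style score (0-100) for one video.

    Each entry is 1 (agree), 0 (disagree), or None (not applicable).
    """
    applicable = [r for r in item_ratings if r is not None]
    if not applicable:
        raise ValueError("no applicable items")
    return 100.0 * sum(applicable) / len(applicable)

# Example: 13 understandability items, two marked not applicable.
understandability = [1, 1, 0, 1, None, 1, 1, 0, 1, 1, None, 1, 0]
print(f"{pemat_percent(understandability):.1f}%")   # -> 72.7%
```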
  34. BMC Public Health. 2026 May 08.
       BACKGROUND: Tobacco use remains one of the greatest public health challenges worldwide. Social short-video platforms have become the primary channel through which the public obtains smoking-cessation information. Grounded in the Health Belief Model and the Theory of Planned Behavior, this study evaluates the information quality of smoking-cessation short videos on Chinese short-video platforms.
    METHODS: We analyzed 262 video samples from four major platforms: TikTok, Kwai, Bilibili, and BuzzVideo. Two researchers who received standardized training independently evaluated each video using the Medical Quality Video Evaluation Tool (MQ-VET), the Global Quality Scale (GQS), and the mDISCERN score. Finally, we performed multiple linear regression to identify factors influencing video quality and user interaction.
    RESULTS: Of the 262 videos included, only 17.6% were produced by medical experts. Overall information quality was low: median MQ-VET was 44 (41-47), median GQS was 2 (2-3), and median mDISCERN was 2 (1-2). In multivariable regression, videos produced by medical experts (β = 0.586, p < 0.001) and by public welfare organizations (β = 0.130, p = 0.001) had significantly higher quality than those produced by individual users. For user engagement, measured by the number of likes, information quality (MQ-VET) (β = 0.215, p = 0.009), TikTok as the platform (β = 0.358, p < 0.001), and Bilibili as the platform (β = 0.485, p < 0.001) were significant positive predictors. Quality scores correlated positively with user interaction (ρ = 0.14-0.35, p < 0.005), whereas video duration correlated negatively with interaction (ρ = -0.14 to -0.29, p < 0.01).
    CONCLUSION: Content about smoking cessation on mainstream Chinese short-video platforms is predominantly user-generated, and it is often fragmented, scientifically weak, and lacking elements of behavior-change psychology. Despite these shortcomings, high-quality videos still attract substantial user engagement. To harness the broad reach of these platforms, we propose constructing a four-party cooperation framework among government, platforms, experts, and users grounded in the "Healthy China 2030" initiative, establishing a quality-certification system, and incentivizing medical experts to produce rigorous, high-quality content.
    Keywords:  Bilibili; BuzzVideo; Cross-sectional study; Information quality; Kwai; Short videos; Smoking cessation; TikTok
    DOI:  https://doi.org/10.1186/s12889-026-27667-9
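Entry 34's multivariable linear regression, with creator type and platform entered as categorical predictors of video quality and engagement, can be sketched roughly as below. The data frame, column names, reference categories, and log-transform of likes are assumptions for illustration, not the authors' code.

```python
# Illustrative multivariable OLS with dummy-coded categorical predictors
# (assumed analysis layout; toy data, hypothetical column names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 120
df = pd.DataFrame({
    "creator": rng.choice(["individual", "medical_expert", "welfare_org"], size=n),
    "platform": rng.choice(["TikTok", "Kwai", "Bilibili", "BuzzVideo"], size=n),
    "mq_vet": rng.normal(44, 3, size=n),
    "log_likes": rng.normal(8, 2, size=n),
})

# Quality model: creator type and platform as categorical predictors,
# with "individual" creators as the reference level.
quality_model = smf.ols(
    "mq_vet ~ C(creator, Treatment('individual')) + C(platform)", data=df
).fit()

# Engagement model: does quality predict (log-transformed) likes?
engagement_model = smf.ols("log_likes ~ mq_vet + C(platform)", data=df).fit()

print(quality_model.params)
print(engagement_model.summary().tables[1])
```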
  35. Sci Rep. 2026 May 06.
      With the rising prevalence of diabetic retinopathy (DR), short videos have become a key source of health information for the public. This study aims to evaluate the quality and interaction performance of DR educational short videos, and to explore the impacts of content quality, creator attributes, and platform factors on user interaction. We analyzed 146 DR-themed videos from Douyin and Bilibili, assessed content quality using four international benchmarks (JAMA, mDISCERN, GQS, and PEMAT), and measured interaction via the entropy weight method (EWM). Through non-parametric tests, Spearman correlation analysis, quantile regression, mediation analysis, and moderation analysis, the study found the following: overall content quality was moderate, with user interaction indicators showing severe right-skewness; all quality dimensions were significantly positively correlated with each other, while JAMA scores were significantly negatively correlated with the weighted sum of interactions; significant differences existed in multiple indicators across different creator backgrounds, professional titles, and platforms; and the impact of quality indicators on interaction was heterogeneous and varied by platform. The authors conclude that content creation should be optimized to platform characteristics, and that information seekers should develop habits of rational screening and cross-validation.
    Keywords:  Diabetic Retinopathy; Information Quality; Patient education; Public health; Social media
    DOI:  https://doi.org/10.1038/s41598-026-51058-0
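The entropy weight method (EWM) mentioned in entry 35 derives indicator weights from the dispersion of the data themselves and then combines the indicators into a single interaction score. A compact sketch of the standard EWM steps follows; the indicator columns and toy values are hypothetical and this is not the study's code.

```python
# Entropy weight method (EWM) sketch for combining engagement indicators
# (likes, comments, shares, favourites) into one weighted score.
import numpy as np
import pandas as pd

def entropy_weights(x: pd.DataFrame) -> pd.Series:
    """Return EWM weights for positively-oriented indicator columns."""
    # Min-max normalise each column to [0, 1], avoiding division by zero.
    norm = (x - x.min()) / (x.max() - x.min()).replace(0, 1)
    # Column-wise proportions p_ij; a tiny epsilon keeps log() finite.
    p = (norm + 1e-12).div((norm + 1e-12).sum(axis=0), axis=1)
    n = len(x)
    entropy = -(p * np.log(p)).sum(axis=0) / np.log(n)
    redundancy = 1 - entropy
    return redundancy / redundancy.sum()

# Toy engagement matrix (one row per video).
eng = pd.DataFrame({
    "likes":     [120, 5400, 87, 23000, 310],
    "comments":  [14, 230, 3, 1900, 40],
    "shares":    [2, 510, 1, 3200, 25],
    "favorites": [9, 800, 4, 5100, 60],
})

w = entropy_weights(eng)
norm = (eng - eng.min()) / (eng.max() - eng.min())
eng["ewm_score"] = norm.mul(w, axis=1).sum(axis=1)
print(w.round(3))
print(eng["ewm_score"].round(3))
```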
  36. Soc Sci Med. 2026 May 04. pii: S0277-9536(26)00444-2. [Epub ahead of print] 401 119368
      Patients' experiences of obtaining information for stigmatized health concerns remain understudied. We examine abortion patients' accounts of abortion information-seeking and the emotions they brought to or developed from that process. Drawing on 41 interviews with Ohio abortion patients, we find that information work, the labor of assembling and evaluating health information, involved emotion work. Abortion stigma intensified this emotion work. Consequently, participants engaged in different strategies to minimize this labor, including avoiding information when faced with stigma and misinformation. We argue that the current abortion information environment often burdens abortion seekers and that information-seeking may not always be beneficial in stigmatized health contexts. Our findings have implications for understanding patient engagement and avoidance in abortion care, as well as for how people seek and manage information in other emotional or politicized health contexts.
    DOI:  https://doi.org/10.1016/j.socscimed.2026.119368