bims-librar Biomed News
on Biomedical librarianship
Issue of 2025-11-02
twenty-one papers selected by
Thomas Krichel, Open Library Society



  1. Cureus. 2025 Sep;17(9): e93394
      Open educational resources (OERs) such as blog posts, podcasts, infographics, and videos focusing on medical topics are frequently published online. Their objectives are variable and include the critical appraisal of individual research articles, the knowledge translation of new or under-discussed publications or guidelines, and the review and integration of knowledge on a particular topic. However, due to the ease of publishing in these new media, the quality of these resources is heterogeneous and inconsistent. It is important for medical learners, educators, and practicing physicians to critically appraise these new and easily accessible formats of medical literature and resources. This paper provides an approach to appraise and use OERs.
    Keywords:  critical appraisal; free open access medical education; online medical education; open educational resources; users guide
    DOI:  https://doi.org/10.7759/cureus.93394
  2. Front Public Health. 2025;13: 1629777
      In the era of digital intelligence, users can easily access diverse health information with varying perspectives through multiple social media channels, and often face cognitive conflict between contradictory pieces of information. However, there is still a lack of systematic research on the internal mechanisms and boundary conditions that drive users to adopt different information behavior strategies under cognitive conflict. Based on cognitive dissonance theory, this study explores the influence of users' cognitive conflict on different types of health information behavior and the underlying mechanisms. It further analyzes how health information with different characteristics can trigger information avoidance and information verification behaviors. In the first stage, a questionnaire survey was conducted and the hypotheses were tested using PLS-SEM. The results show that cognitive conflict positively influences users' health information avoidance behavior through perceived information fatigue, but its effect on information verification behavior is not significant. In the second stage, scenario-based experiments were conducted to further reveal the interaction effects of information relevance and information credibility on users' health information behaviors. The results indicate that when both information relevance and credibility are high, users are more likely to engage in active information verification. In contrast, low relevance or low credibility tends to lead to information avoidance. Perceived information curiosity and perceived information fatigue play significant mediating roles in this process. This study expands the scope of research on users' health information behaviors, deepens the understanding of cognitive dissonance theory in health information contexts, and provides theoretical support and practical guidance for the effective dissemination and utilization of health information. The research context may have certain limitations; future studies could broaden sample sources and conduct empirical tests across different cultural contexts.
    Keywords:  cognitive conflict; health information; information avoidance; information verification; social media
    DOI:  https://doi.org/10.3389/fpubh.2025.1629777
  3. J Oral Maxillofac Surg. 2025 Oct 07. pii: S0278-2391(25)00804-3. [Epub ahead of print]
       BACKGROUND: Patients with maxillofacial fractures increasingly seek information from large language models (LLMs), yet the accuracy and readability of these responses remain uncertain.
    PURPOSE: This study evaluated the performance of 5 publicly accessible LLMs in answering frequently asked questions (FAQs) about maxillomandibular fixation (MMF).
    STUDY DESIGN, SETTING, AND SAMPLE: This in-silico cross-sectional study, conducted in January 2025, evaluated 47 FAQs and yielded 235 responses from 5 open-access LLMs, excluding subscription-based models.
    PREDICTOR VARIABLE: The predictor variable was LLM architecture: decoder-only transformer models (DOT-1, DOT-2), a multimodal transformer model (MTM), a productivity-focused model (PM), and a constitutional artificial intelligence (AI)-based model (CAM).
    OUTCOME VARIABLES: The primary outcome was LLM performance, measured with the QUEST (Quality of information, Understanding and reasoning, Expression style and persona, Safety and harm, and Trust and confidence) framework. Domains assessed were accuracy (Likert ≥4), hallucination (presence/absence of fabricated content), usefulness, clarity, trust, and satisfaction (Likert 1 to 5), and readability (Flesch-Kincaid Reading Ease [FKRE] and Grade Level [FKGL]). Responses were rated independently by 7 evaluators (5 oral and maxillofacial surgeons and 2 residents) in a blinded manner.
    COVARIATES: None.
    ANALYSES: Ordinal outcomes were analyzed with the Friedman test and pairwise Wilcoxon signed-rank tests. Readability was compared with one-way ANOVA. Inter-rater reliability was measured with Fleiss' kappa. Statistical significance was set at P < .05.
    RESULTS: The sample included 235 LLM-generated responses. DOT-1 showed the highest accuracy (88.5 ± 6.2%), which was statistically significantly greater than DOT-2 (79.6 ± 10.1%) and PM (81.2 ± 9.3%) (P = .004). It also had a statistically significantly lower hallucination rate (5.2%) compared with DOT-2 (10.1%) and PM (9.4%) (P = .013). CAM performed comparably in accuracy (86.3 ± 7.1%); however, its readability was statistically significantly poorer (Flesch-Kincaid Grade Level = 22.7 ± 12.9; P < .001). The multimodal transformer model (MTM) showed intermediate performance. Inter-rater agreement was almost perfect for accuracy (κ = 0.79 to 1.00) and hallucination (κ = 0.91 to 1.00) and moderate to substantial for ordinal variables.
    CONCLUSION AND RELEVANCE: LLMs can provide accurate responses to maxillomandibular fixation queries, but readability remains limited and model-dependent. These findings underscore the need for developing more patient-friendly artificial intelligence (AI) outputs and highlight the importance of clinician oversight in guiding patients' use of LLMs.
    DOI:  https://doi.org/10.1016/j.joms.2025.09.016
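The analysis stack named in the ANALYSES paragraph of the entry above (Friedman test, pairwise Wilcoxon signed-rank tests, Fleiss' kappa) maps directly onto SciPy and statsmodels. A minimal sketch with synthetic ratings standing in for the study's data; it is illustrative only, not the authors' code:

```python
# Illustrative only: synthetic Likert-style ratings for three hypothetical models.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
dot1 = rng.integers(3, 6, size=47)   # 47 questions, scores 3-5
dot2 = rng.integers(2, 6, size=47)
pm = rng.integers(2, 6, size=47)

# Friedman test across related samples (same questions, different models).
stat, p = friedmanchisquare(dot1, dot2, pm)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.3f}")

# Pairwise Wilcoxon signed-rank test (apply a multiplicity correction in practice).
w, p_w = wilcoxon(dot1, dot2)
print(f"Wilcoxon DOT-1 vs DOT-2: W = {w:.1f}, p = {p_w:.3f}")

# Fleiss' kappa for agreement among 7 raters on a binary accuracy judgement.
ratings = rng.integers(0, 2, size=(47, 7))   # rows = responses, cols = raters
table, _ = aggregate_raters(ratings)         # per-item counts of each rating category
print(f"Fleiss' kappa = {fleiss_kappa(table):.2f}")
```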
  4. J Med Internet Res. 2025 Oct 29;27: e79379
       BACKGROUND: Large language models (LLMs) coupled with real-time web retrieval are reshaping how clinicians and patients locate medical evidence, and as major search providers fuse LLMs into their interfaces, this hybrid approach might become the new "gateway" to the internet. However, open-web retrieval exposes models to nonprofessional sources, risking hallucinations and factual errors that might jeopardize evidence-based care.
    OBJECTIVE: We aimed to quantify the impact of guideline-domain whitelisting on the answer quality of 3 publicly available Perplexity web-based retrieval-augmented generation (RAG) models and compare their performance using a purpose-built, biomedical literature RAG system (OpenEvidence).
    METHODS: We applied a validated 130-item question set derived from the American Academy of Neurology (AAN) guidelines (65 factual and 65 case based). Perplexity Sonar, Sonar-Pro, and Sonar-Reasoning-Pro were each queried 4 times per question with open-web retrieval and again with retrieval restricted to aan.com and neurology.org ("whitelisted"). OpenEvidence was queried 4 times. Two neurologists, blinded to condition, scored each response (0=wrong, 1=inaccurate, and 2=correct); any disagreements that arose were resolved by a third neurologist. Ordinal logistic models were used to assess the influence of question type and source category (AAN or neurology vs nonprofessional) on accuracy.
    RESULTS: From the 3640 LLM answers that were rated (interrater agreement: κ=0.86), correct-answer rates were as follows (open vs whitelisted, respectively): Sonar, 60% vs 78%; Sonar-Pro, 80% vs 88%; and Sonar-Reasoning-Pro, 81% vs 89%; for OpenEvidence, the correct-answer rate was 82%. A Friedman test on modal scores across the 7 configurations was significant (χ²(6)=73.7; P<.001). Whitelisting improved mean accuracy on the 0 to 2 scale by 0.23 for Sonar (95% CI 0.12-0.34), 0.08 for Sonar-Pro (95% CI 0.01-0.16), and 0.08 for Sonar-Reasoning-Pro (95% CI 0.02-0.13). Including ≥1 nonprofessional source halved the odds of a higher rating in Sonar (odds ratio [OR] 0.50, 95% CI 0.37-0.66; P<.001), whereas citing an AAN or neurology document doubled it (OR 2.18, 95% CI 1.64-2.89; P<.001). Furthermore, factual questions outperformed case vignettes across Perplexity models (ORs ranged from 1.95, 95% CI 1.28-2.98 [Sonar + whitelisting] to 4.28, 95% CI 2.59-7.09 [Sonar-Reasoning-Pro]; all P<.01) but not for OpenEvidence (OR 1.44, 95% CI 0.92-2.27; P=.11).
    CONCLUSIONS: Restricting retrieval to authoritative neurology domains yielded a clinically meaningful 8 to 18 percentage-point gain in correctness and halved output variability, upgrading a consumer search assistant to a decision-support-level tool that at least performed on par with a specialized literature engine. Lightweight source control is therefore a pragmatic safety lever for maintaining continuously updated, web-based RAG-augmented LLMs fit for evidence-based neurology.
    Keywords:  artificial intelligence; evidence-based medicine; information retrieval; large language models; medical guidelines; neurology
    DOI:  https://doi.org/10.2196/79379
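Entry 4 relates a 0-2 accuracy score to question type and source category with ordinal logistic models. A minimal sketch of that kind of model using statsmodels' OrderedModel; the column names and data below are invented for illustration and are not taken from the study:

```python
# Illustrative only: invented predictors and synthetic data.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 520
df = pd.DataFrame({
    "score": rng.integers(0, 3, size=n),                # 0=wrong, 1=inaccurate, 2=correct
    "case_based": rng.integers(0, 2, size=n),           # 0=factual, 1=case vignette
    "nonprofessional_src": rng.integers(0, 2, size=n),  # cited >=1 nonprofessional source
})

model = OrderedModel(df["score"], df[["case_based", "nonprofessional_src"]], distr="logit")
res = model.fit(method="bfgs", disp=False)
print(res.summary())
# Exponentiated coefficients give odds ratios of the kind reported in the abstract.
print(np.exp(res.params[["case_based", "nonprofessional_src"]]))
```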
  5. J Pediatr Soc North Am. 2025 Nov;13: 100273
       Background: While the American Medical Association and National Institutes of Health recommend patient educational materials (PEMs) be written at a 6th-grade reading level, studies consistently show that PEMs in orthopaedics are written at the 10th-grade level or higher. This mismatch disproportionately affects patients with limited health literacy, who are at increased risk for poor clinical outcomes. This study investigates the potential of artificial intelligence (AI) platforms, including ChatGPT and OpenEvidence, to generate PEMs in pediatric orthopaedics that meet readability standards without sacrificing clinical accuracy.
    Methods: Fifty-one of the most common pediatric orthopaedic conditions were selected using the American Academy of Orthopaedic Surgeons OrthoInfo PEM database. For each condition, PEMs were generated using two AI platforms, ChatGPT-4 and OpenEvidence, with a standardized prompt requesting a sixth-grade-level explanation that included relevant anatomy, symptoms, physical exam findings, and treatment options. Readability was assessed using eight validated readability metrics via the Python Textstat library. PEMs were scored for accuracy and completeness by four blinded pediatric orthopaedic surgeons. Interrater reliability was assessed using intraclass correlation coefficients (ICCs), and statistical comparisons were performed using paired t-tests.
    Results: ChatGPT-generated PEMs had the lowest average reading grade level (8.7) compared to OrthoInfo (10.8) and OpenEvidence (10.1). OrthoInfo PEMs were rated highest for accuracy and completeness (total accuracy: 6.95; total completeness: 6.98), compared to ChatGPT (total accuracy: 6.15; total completeness: 5.90) and OpenEvidence (total accuracy: 3.25; total completeness: 3.05), but ChatGPT approached OrthoInfo in several subdomains, including treatment descriptions, timeline, and follow-up recommendations.
    Conclusions: This study demonstrates the promise of AI platforms in generating readable, patient-friendly educational materials in pediatric orthopaedics. While OrthoInfo remains the gold standard in content accuracy and completeness, it falls short of national readability guidelines. AI tools like ChatGPT and OpenEvidence produced significantly more readable PEMs and, in some categories, approached the quality of expert-validated materials. These findings suggest a potential role for AI-assisted content creation in bridging the health literacy gap. However, concerns surrounding accuracy, hallucinations, and source transparency must be addressed before AI-generated PEMs can be safely integrated into clinical practice.
    Key Concepts: (1) Artificial intelligence (AI) refers to computer systems capable of performing tasks that typically require human intelligence, such as language processing; in this study, AI was used to revise and assess the readability of patient education materials. (2) Patient education materials (PEMs) are written or visual tools designed to inform patients and families about medical conditions, treatments, and procedures; they play a critical role in supporting shared decision-making in pediatric orthopaedics. (3) Readability refers to how easily a written text can be understood by a target audience; improving the readability of PEMs ensures that patients and caregivers can comprehend essential health information. (4) Health literacy is the ability of individuals to obtain, process, and understand basic health information needed to make informed decisions; enhancing the readability of PEMs supports improved health literacy in pediatric populations. (5) Natural language processing (NLP) is a branch of AI that enables computers to understand and generate human language; in this study, NLP was used to revise PEMs and improve their readability and accessibility.
    Level of Evidence: IV.
    Keywords:  Artificial intelligence; Health literacy; Patient education materials; Pediatric orthopaedics; Readability
    DOI:  https://doi.org/10.1016/j.jposna.2025.100273
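Entry 5 computes readability with the Python Textstat library. A minimal sketch of that step; which eight metrics the authors used is not stated, so the selection below is an assumption, and the sample text is invented:

```python
# Illustrative only: the choice of eight textstat metrics is an assumption.
import textstat

pem = (
    "A clubfoot is a foot that turns inward and downward. "
    "Doctors treat it with gentle stretching and casts, and sometimes a small procedure."
)

metrics = {
    "Flesch Reading Ease": textstat.flesch_reading_ease(pem),
    "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade(pem),
    "Gunning Fog": textstat.gunning_fog(pem),
    "SMOG": textstat.smog_index(pem),
    "Coleman-Liau": textstat.coleman_liau_index(pem),
    "Automated Readability": textstat.automated_readability_index(pem),
    "Dale-Chall": textstat.dale_chall_readability_score(pem),
    "Linsear Write": textstat.linsear_write_formula(pem),
}
for name, value in metrics.items():
    print(f"{name:>22}: {value}")
```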
  6. Curr Oncol. 2025 Oct 19;32(10): 582. [Epub ahead of print]
      Large language models (LLMs) are increasingly explored as chatbots for patient education, including applications in urooncology. Since only 12% of adults have proficient health literacy and most patient information materials exceed recommended reading levels, improving readability is crucial. Although LLMs could potentially increase the readability of medical information, evidence is mixed, underscoring the need to assess chatbot outputs in clinical settings. Therefore, this study evaluates the measured and perceived readability of chatbot responses in speech-based interactions with urological patients. Urological patients engaged in unscripted conversations with a GPT-4-based chatbot. Transcripts were analyzed using three readability indices: Flesch-Reading-Ease (FRE), Lesbarkeitsindex (LIX) and Wiener-Sachtextformel (WSF). Perceived readability was assessed using a survey covering technical language, clarity and explainability. Associations between measured and perceived readability were analyzed. Knowledge retention was not assessed in this study. A total of 231 conversations were evaluated. The most frequently addressed topics were prostate cancer (22.5%), robotic-assisted prostatectomy (19.9%) and follow-up (18.6%). Objectively, responses were classified as difficult to read (FRE 43.1 ± 9.1; LIX 52.8 ± 6.2; WSF 11.2 ± 1.6). In contrast, perceived readability was rated highly for technical language, clarity and explainability (83-90%). Correlation analyses revealed no association between objective and perceived readability. Chatbot responses were objectively written at a difficult reading level, exceeding recommendations for optimized health literacy. Nevertheless, most patients perceived the information as clear and understandable. This discrepancy suggests that perceived comprehensibility is influenced by factors beyond measurable linguistic complexity.
    Keywords:  GPT-4; artificial intelligence; chatbots; health literacy; large language models; patient education; readability; urology; urooncology
    DOI:  https://doi.org/10.3390/curroncol32100582
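For reference, two of the indices used in the entry above have simple closed forms; the formulas below are the standard definitions (the German-adapted Flesch variant and the Wiener-Sachtextformel reweight similar surface features and are not reproduced here):

```latex
% Flesch Reading Ease: higher scores mean easier text
\mathrm{FRE} = 206.835 \;-\; 1.015\,\frac{\#\text{words}}{\#\text{sentences}} \;-\; 84.6\,\frac{\#\text{syllables}}{\#\text{words}}

% LIX: "long words" are words with more than six letters
\mathrm{LIX} = \frac{\#\text{words}}{\#\text{sentences}} \;+\; 100 \cdot \frac{\#\text{long words}}{\#\text{words}}
```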
  7. Healthcare (Basel). 2025 Oct 17;13(20): 2615. [Epub ahead of print]
      Background: The aim of this study was to compare four recently introduced LLMs (ChatGPT-5, Grok 4, Gemini 2.5 Flash, and Claude Sonnet-4). Experienced endodontists evaluated the accuracy, completeness, and readability of the responses given to open-ended questions about iatrogenic events in endodontics. Methods: Twenty-five open-ended questions related to iatrogenic events in endodontics were prepared. The responses of the four LLMs were evaluated by two specialist endodontists using a Likert scale for accuracy and completeness, and the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simplified Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI) for readability. Results: The accuracy score of ChatGPT-5's responses to open-ended questions (4.56 ± 0.65) was found to be significantly higher than those of Gemini 2.5 Flash (3.64 ± 0.95) and Claude Sonnet-4 (3.44 ± 1.19) (p = 0.009 and p = 0.002, respectively). Similarly, the completeness score of ChatGPT-5 (2.88 ± 0.33) was higher than those of Claude Sonnet-4, Gemini 2.5 Flash, and Grok 4 (p < 0.001, p = 0.002, and p = 0.007, respectively). In terms of readability measures, ChatGPT-5 and Gemini 2.5 Flash achieved better FRESs than Claude Sonnet-4 (p = 0.003 and p < 0.001, respectively). Conversely, FKGL scores were higher for Claude Sonnet-4 and Grok 4 compared to ChatGPT-5 (p < 0.001 and p = 0.008, respectively). Correlation analyses revealed a strong positive association (rs = 0.77; p < 0.001) between accuracy and completeness, a weak negative correlation (rs = -0.19; p = 0.047) between completeness and FKGL, and a strong negative correlation (rs = -0.88; p < 0.001) between FKGL and FRES. Additionally, ChatGPT-5 demonstrated lower GFI and CLI scores than the other models, while its SMOG scores were lower than those of Gemini 2.5 Flash and Grok 4 (p = 0.001 and p < 0.001, respectively). Conclusions: Although differences were observed between the LLMs in terms of the accuracy and completeness of the responses, ChatGPT-5 showed the best performance. Even with high scores of accuracy (excellent) and completeness (comprehensive), it must not be forgotten that incorrect information can lead to serious outcomes in healthcare services. Therefore, the readability of responses is of critical importance, and when selecting a model, readability should be evaluated together with content quality.
    Keywords:  ChatGPT; Claude; Gemini; Grok; artificial intelligence; iatrogenic events in endodontics; large language models
    DOI:  https://doi.org/10.3390/healthcare13202615
  8. Int J Med Inform. 2025 Oct 23;206: 106164. pii: S1386-5056(25)00381-8. [Epub ahead of print]
       BACKGROUND: Brain-Computer Interfaces (BCI) are a type of life-altering neurotechnology, but their inherent complexity poses significant challenges to patient education. Large Language Models (LLMs), such as ChatGPT and Gemini, offer new possibilities to address this challenge. This study aims to conduct a multi-dimensional, rigorous comparative analysis of the performance of these two mainstream AI models in responding to common patient questions related to BCI.
    METHODS: Through a structured process combining clinical expert consensus, literature review, and online patient community analysis, we identified 13 key patient questions covering the entire BCI treatment cycle. We then obtained responses to these questions from ChatGPT and Gemini on September 1, 2025. An evaluation panel, composed of clinical experts and non-medical professionals, conducted a blinded assessment of the response quality using standardized Likert scales across three dimensions: reliability, accuracy, and comprehensibility. Concurrently, we performed an objective, quantitative analysis of the response texts using the Flesch-Kincaid readability tests.
    RESULTS: On core quality metrics such as reliability, accuracy, and comprehensibility, the performance of the two models was generally comparable, both demonstrating a high level of proficiency with only sporadic statistical differences on a few technical questions. However, a clear significant disparity emerged in the dimension of readability: for 12 of the 13 questions, the text generated by Gemini required a significantly lower reading grade level than that of ChatGPT (p < 0.05) and had significantly higher reading ease scores. This difference stemmed from Gemini's tendency to use shorter sentences and simpler vocabulary.
    CONCLUSION: AI chatbots possess immense potential in the field of BCI patient education. Although both ChatGPT and Gemini can provide high-quality information, Gemini demonstrates a clear advantage in the accessibility and approachability of information, making it a potentially more suitable tool for initial application across diverse patient populations. Nevertheless, the limitations of AI in handling highly specialized and dynamically changing knowledge underscore the indispensable role of human expert supervision and validation in any clinical application.
    Keywords:  Artificial intelligence; Brain-computer interface; Digital health; Health communication; Large language models; Patient education
    DOI:  https://doi.org/10.1016/j.ijmedinf.2025.106164
  9. Int J Gynaecol Obstet. 2025 Oct 28.
       INTRODUCTION: To evaluate the accuracy and completeness of responses across common obstetrical and gynecologic topics generated by the large language models (LLMs) ChatGPT and Google Gemini, which have become increasingly popular for patients seeking medical information before physician consultations.
    METHODS: Ten topics were identified, five obstetrical (prenatal labs, extended carrier screen, treatments for nausea and vomiting in pregnancy, gestational diabetes, and trial of labor after cesarean section) and five gynecologic (polycystic ovary syndrome, pelvic inflammatory disease, cervical smears, mammograms, and birth control). For each condition, ChatGPT generated five of the most frequently asked patient questions, which were then presented separately to ChatGPT and Google Gemini. Board-certified Obstetrics and Gynecology physicians evaluated the responses using Likert scales for accuracy (1-6) and completeness (1-3).
    RESULTS: Acceptable response criteria were defined as an accuracy score of 5 or greater ("nearly all correct") and a completeness score of 2 or greater ("adequately complete"). Most responses from both models met these thresholds. Wilcoxon signed-rank tests demonstrated statistically significant differences in accuracy and completeness between models (P < 0.05). Inter-rater agreement was measured using intraclass correlation coefficients. For obstetrical topics, ChatGPT scored -0.047 (completeness) and 0.112 (accuracy), whereas Google Gemini scored 0.367 and 0.205, respectively. For gynecologic topics, ChatGPT scored 0.328 and 0.20, compared with Google Gemini at 0.151 and -0.08.
    CONCLUSION: Both LLMs provided largely accurate and complete responses to patient questions. ChatGPT demonstrated stronger outcomes overall, suggesting potential utility in patient education; however, patients should confirm online information with physicians given the limitations of LLMs.
    Keywords:  ChatGPT; Google Gemini; artificial intelligence; gynecology; obstetrics; patient education
    DOI:  https://doi.org/10.1002/ijgo.70622
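Entry 9 summarizes inter-rater agreement with intraclass correlation coefficients. One way to compute ICCs of this kind in Python is pingouin's intraclass_corr on long-format ratings; the column names and data below are invented for the sketch and are not the study's:

```python
# Illustrative only: synthetic long-format ratings with invented column names.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(2)
n_items, n_raters = 50, 3
df = pd.DataFrame({
    "response_id": np.repeat(np.arange(n_items), n_raters),
    "rater": np.tile([f"rater_{i}" for i in range(n_raters)], n_items),
    "accuracy": rng.integers(4, 7, size=n_items * n_raters),  # 1-6 Likert, mostly high
})

# Returns all ICC variants (ICC1, ICC2, ICC3, and their average-rater forms).
icc = pg.intraclass_corr(data=df, targets="response_id", raters="rater", ratings="accuracy")
print(icc[["Type", "ICC", "CI95%"]])
```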
  10. JMIR AI. 2025 Oct 29;4: e78436
       Background: The widespread adoption of artificial intelligence (AI)-powered search engines has transformed how people access health information. Microsoft Copilot, formerly Bing Chat, offers real-time web-sourced responses to user queries, raising concerns about the reliability of its health content. This is particularly critical in the domain of dietary supplements, where scientific consensus is limited and online misinformation is prevalent. Despite the popularity of supplements in Japan, little is known about the accuracy of AI-generated advice on their effectiveness for common diseases.
    Objective: We aimed to evaluate the reliability and accuracy of Microsoft Copilot, an AI search engine, in responding to health-related queries about dietary supplements. Our findings can help consumers use large language models more safely and effectively when seeking information on dietary supplements and support developers in improving large language models' performance in this field.
    Methods: We simulated typical consumer behavior by posing 180 questions (6 per supplement × 30 supplements) to Copilot's 3 response modes (creative, balanced, and precise) in Japanese. These questions addressed the effectiveness of supplements in treating 6 common conditions (cancer, diabetes, obesity, constipation, joint pain, and hypertension). We classified the AI search engine's answers as "effective," "uncertain," or "ineffective" and evaluated for accuracy against evidence-based assessments conducted by licensed physicians. We conducted a qualitative content analysis of the response texts and systematically examined the types of sources cited in all responses.
    Results: The proportion of Copilot responses claiming supplement effectiveness was 29.4% (53/180), 47.8% (86/180), and 45% (81/180) for the creative, balanced, and precise modes, respectively, whereas overall accuracy of the responses was low across all modes: 36.1% (65/180), 31.7% (57/180), and 31.7% (57/180) for creative, balanced, and precise, respectively. No significant difference was observed among the 3 modes (P=.59). Notably, 72.7% (2240/3081) of the citations came from unverified sources such as blogs, sales websites, and social media. Of the 540 responses analyzed, 54 (10%) contained at least 1 citation in which the cited source did not include or support the claim made by Copilot, indicating hallucinated content. Only 48.5% (262/540) of the responses included a recommendation to consult health care professionals. Among disease categories, the highest accuracy was found for cancer-related questions, likely due to lower misinformation prevalence.
    Conclusions: This is the first study to assess Copilot's performance on dietary supplement information. Despite its authoritative appearance, Copilot frequently cited noncredible sources and provided ambiguous or inaccurate information. Its tendency to avoid definitive stances and align with perceived user expectations poses potential risks for health misinformation. These findings highlight the need for integrating health communication principles-such as transparency, audience empowerment, and informed choice-into the development and regulation of AI search engines to ensure safe public use.
    Keywords:  AI; AI search engine; Copilot; artificial intelligence; artificial intelligence search engine; dietary supplements; health communication; health education; large language model
    DOI:  https://doi.org/10.2196/78436
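For entry 10, per-mode accuracy is a simple aggregation of answer labels against the physicians' reference labels. The abstract does not name its between-mode test, so the chi-square comparison below is only one plausible reading; labels and data are synthetic:

```python
# Illustrative only: synthetic labels; the study's actual between-mode test is not named.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(3)
modes = ["creative", "balanced", "precise"]
labels = ["effective", "uncertain", "ineffective"]

df = pd.DataFrame({
    "mode": np.repeat(modes, 180),
    "answer": rng.choice(labels, size=540),
    "reference": rng.choice(labels, size=540),   # physician evidence-based rating
})
df["correct"] = df["answer"] == df["reference"]

print(df.groupby("mode")["correct"].mean())      # per-mode accuracy

table = pd.crosstab(df["mode"], df["correct"])   # correct vs incorrect by mode
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```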
  11. J Med Syst. 2025 Oct 30. 49(1): 148
      
    Keywords:  Health communication; Large language models; Patient education materials; Prompt engineering; Readability; Zero-shot prompting
    DOI:  https://doi.org/10.1007/s10916-025-02290-0
  12. Clin Cosmet Investig Dermatol. 2025;18: 2757-2767
       Background: Vitiligo causes significant psychological stress, creating a strong demand for accessible educational resources beyond clinical settings. This demand remains largely unmet. Large language models (LLMs) have the potential to bridge this gap by enhancing patient education. However, uncertainties exist regarding their ability to accurately address individualized patient inquiries and whether comprehension capabilities vary between LLMs.
    Purpose: This study aims to evaluate the applicability, accuracy, and potential limitations of OpenAI o1, DeepSeek-R1, and Grok 3 for vitiligo patient education.
    Methods: Three dermatology experts first developed sixteen vitiligo-related questions based on common patient concerns, which were categorized as descriptive or recommendatory with basic and advanced levels. The responses from the three LLMs were then evaluated by three vitiligo-specialized dermatologists for accuracy, comprehensibility, and relevance using a Likert scale. Additionally, three patients rated the comprehensibility of the responses, and a readability analysis was performed.
    Results: All three LLMs demonstrated satisfactory accuracy, comprehensibility, and completeness, although their performance varied. They achieved 100% accuracy in responding to basic descriptive questions but exhibited inconsistency when addressing complex recommendatory queries, particularly regarding treatment recommendations for specific populations. Pairwise comparisons indicated that DeepSeek-R1 outperformed OpenAI o1 in accuracy scores (p = 0.042), while no significant difference was observed compared to Grok 3 (p = 0.157). Readability assessments revealed elevated reading difficulty across all models, with DeepSeek-R1 exhibiting the lowest readability (mean Flesch Reading Ease score of 19.7; pairwise comparisons showed DeepSeek-R1 scores were significantly lower than those of OpenAI o1 and Grok 3, both p < 0.01), potentially reducing accessibility for diverse patient populations.
    Conclusion: Reasoning-LLMs demonstrate high accuracy in responding to simple vitiligo-related questions, but the quality of treatment recommendations declines as question complexity increases. Current models exhibit errors in providing vitiligo treatment advice, necessitating enhanced filtering mechanisms by developers and mandatory human oversight for medical decision-making.
    Keywords:  ChatGPT; DeepSeek; Grok; large language models; patient education; vitiligo
    DOI:  https://doi.org/10.2147/CCID.S552979
  13. BMJ Open. 2025 Oct 23. 15(10): e106870
       INTRODUCTION: Temporomandibular disorders (TMDs) are a prevalent group of musculoskeletal conditions affecting the temporomandibular joint, associated structures and masticatory muscles. The internet has become a primary source of health information for many patients; however, the readability, reliability, content and quality of online information on TMDs vary widely. A comprehensive synthesis of the characteristics and evaluation methods of such content is currently lacking.
    METHODS AND ANALYSIS: This scoping review will follow the Joanna Briggs Institute methodology and be reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews checklist. Peer-reviewed primary and secondary studies assessing online information on TMDs will be included if they report on readability, reliability, content or quality. Eligible information sources include publicly available websites, videos and social media; discussion forums and printed materials will be excluded. No language, date or geographical restrictions will be applied. A three-step search strategy will be implemented across PubMed, Web of Science, Embase, PsycINFO and CINAHL, followed by citation tracking. Screening will be conducted independently by two reviewers using Rayyan. Data will be extracted with a pilot-tested charting tool and synthesised narratively and descriptively in tabular and graphical formats.
    ETHICS AND DISSEMINATION: As this study will only use data from publicly available sources, ethical approval is not required. Findings will be disseminated through publication in a peer-reviewed journal, conference presentations and professional networks, with the aim of guiding the development of accessible and reliable digital resources for individuals seeking information on TMDs.
    REGISTRATION: This protocol has been prospectively registered on the Open Science Framework (OSF): https://doi.org/10.17605/OSF.IO/TAH7K.
    Keywords:  Chronic Pain; Dentistry; HEALTH SERVICES ADMINISTRATION & MANAGEMENT
    DOI:  https://doi.org/10.1136/bmjopen-2025-106870
  14. Nicotine Tob Res. 2025 Oct 28. pii: ntaf218. [Epub ahead of print]
       INTRODUCTION: Information seeking is among the most common uses of YouTube, the most popular social media site among youth. YouTube's recommendation algorithm drives approximately 70% of views but content quality varies greatly. Thus, it is important to understand how YouTube's algorithm impacts the content viewed by those seeking information on tobacco.
    METHODS: The most common YouTube queries for e-cigarettes, oral nicotine products, cigarillos, and nicotine (from Google Trends) were used to create a dataset comprising unique starting ("seed") videos paired with their recommended videos on the website sidebar, yielding N=5182 potential "journeys" from seed to recommended video. Video descriptions were coded for tobacco relevancy, pro- or anti-tobacco stance (κ=.94), and source type (e.g., organic creators, media outlets, public health institutions, or self-described medical experts) (mean κ=.88). The viewpoint and source type of journeys were then analyzed.
    RESULTS: While most journeys led to recommended videos agreeing with the seed's stance (56.1%), 10.4% of journeys starting with pro-tobacco videos led to anti-tobacco videos, while 13.3% of journeys starting with anti-tobacco videos led to pro-tobacco videos. Pro- to anti-tobacco journeys most frequently led to content created by news (29.7%) or self-described medical expert (SDME) sources (28.1%). However, among anti- to pro-tobacco journeys, 81.6% led to SDME videos.
    CONCLUSIONS: SDMEs play a key role in driving discordant journeys from anti- to pro-tobacco videos, potentially exposing health information seekers to content that may undermine their health goals. Future research is needed to understand the ways recommendation feeds influence content selection and exposure to videos that promote tobacco use.
    DOI:  https://doi.org/10.1093/ntr/ntaf218
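The journey analysis in entry 14 reduces to tabulating stance transitions between each seed video and its recommended video. A toy pandas sketch with invented column names and data, only to make the pairing and the concordance/discordance tallies concrete:

```python
# Illustrative only: invented column names and toy seed -> recommendation pairs.
import pandas as pd

journeys = pd.DataFrame({
    "seed_stance": ["pro", "pro", "anti", "anti", "anti", "pro"],
    "rec_stance":  ["pro", "anti", "anti", "pro", "anti", "pro"],
    "rec_source":  ["organic", "news", "public_health", "sdme", "media", "organic"],
})

# Share of journeys whose recommendation agrees with the seed's stance.
concordant = (journeys["seed_stance"] == journeys["rec_stance"]).mean()
print(f"concordant journeys: {concordant:.1%}")

# Source breakdown of discordant anti -> pro journeys.
anti_to_pro = journeys[(journeys["seed_stance"] == "anti") & (journeys["rec_stance"] == "pro")]
print(anti_to_pro["rec_source"].value_counts(normalize=True))
```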
  15. Breast Cancer. 2025 Oct 29.
       BACKGROUND: Breast cancer patients require a wide range of medical information, and an increasing number now use social media as a primary source. However, the quality of such information varies considerably, and its reliability often depends on the source. To address this, Denniss et al. developed PRHISM, a tool designed to evaluate the quality of information on social media platforms. In our previous study, we demonstrated validity of PRHISM. This study evaluated Japanese YouTube videos on breast cancer treatment and examined differences in quality by source type.
    METHODS: The top 60 videos displayed in order of relevance were selected on YouTube using the search terms "breast cancer," "treatment," and "chemotherapy." Six breast cancer specialists evaluated the informational quality using PRHISM. Based on National Academy of Medicine criteria, video sources were classified as credible or other sources. The quality of the videos was then compared using PRHISM scores.
    RESULTS: The overall mean PRHISM score was 60.6 (SD 11.5); 8 videos (13.3%) were rated excellent and 37 (61.7%) good. Videos from credible sources (n = 19) had a significantly higher mean score (70.9, SD 7.67) than those from other sources (n = 41; 55.8, SD 9.79; p = 0.001). The proportion of videos rated as excellent or good was also significantly higher among credible sources (p = 0.023).
    CONCLUSION: Japanese YouTube videos on breast cancer treatment were generally of relatively high quality. Videos from credible sources received significantly higher evaluations. When patients search for medical information on YouTube, it is advisable that they refer to content provided by recommended and credible sources.
    Keywords:  Breast cancer; Information quality; PRHISM; YouTube
    DOI:  https://doi.org/10.1007/s12282-025-01796-2
  16. BMC Public Health. 2025 Oct 31. 25(1): 3682
       BACKGROUND: Cervical cancer continues to pose a significant global health burden for women, especially in low-resource settings. Although HPV vaccination and screening programs are available, public awareness remains limited. Social media platforms have become major sources of health information; however, the quality of content varies considerably. This study assesses the quality, reliability, and dissemination patterns of cervical cancer-related videos on YouTube, Bilibili, and TikTok, with a focus on how uploader characteristics influence information accuracy.
    METHODS: On February 21, 2025, we retrieved the top 100 videos using the keyword "cervical cancer" on YouTube and its Chinese equivalent on Bilibili and TikTok. Two independent reviewers evaluated video quality using the Global Quality Score (GQS), Video Information and Quality Index (VIQI), modified DISCERN (mDISCERN), and Patient Education Materials Assessment Tool (PEMAT). Inter-rater agreement was assessed, and statistical analysis was performed using non-parametric tests and Spearman correlation.
    RESULTS: A total of 84 YouTube videos, 82 Bilibili videos, and 91 TikTok videos were included. TikTok videos showed significantly higher user engagement than those on other platforms (p<0.001) but scored significantly lower in GQS, VIQI, mDISCERN, and PEMAT evaluations (p<0.001). Videos uploaded by professionals consistently received higher quality scores than those from non-professionals (p<0.001). Although TikTok had higher uploader activity, its content was largely based on personal experiences, lacking scientific rigor and practical guidance. Video quality on TikTok and Bilibili was negatively correlated with interactivity.
    CONCLUSIONS: The three platforms show distinct differences in how cervical cancer-related health information is disseminated. TikTok demonstrates superior dissemination and engagement performance, whereas YouTube provides higher content quality and credibility. These findings underscore the importance of leveraging each platform's strengths to promote evidence-based health communication.
    Keywords:  Cervical cancer; Health communication; Public education; Public health; Social media
    DOI:  https://doi.org/10.1186/s12889-025-24840-4
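Entry 16 above, and the two video-quality studies that follow, share the same analytic pattern: compare per-platform quality scores with non-parametric tests and relate quality to engagement with Spearman rank correlation. A generic sketch with synthetic scores, not the studies' data:

```python
# Illustrative only: synthetic quality and engagement data for three platforms.
import numpy as np
from scipy.stats import kruskal, mannwhitneyu, spearmanr

rng = np.random.default_rng(4)
gqs_youtube = rng.integers(2, 6, size=84)    # Global Quality Score, 1-5
gqs_bilibili = rng.integers(2, 6, size=82)
gqs_tiktok = rng.integers(1, 4, size=91)

# Overall difference across the three platforms, then a pairwise follow-up.
h, p = kruskal(gqs_youtube, gqs_bilibili, gqs_tiktok)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.4f}")
u, p_pair = mannwhitneyu(gqs_youtube, gqs_tiktok)
print(f"YouTube vs TikTok: U = {u:.0f}, p = {p_pair:.4f}")

# Does quality track engagement? Spearman rank correlation with like counts.
likes_tiktok = rng.integers(100, 100_000, size=91)
rho, p_rho = spearmanr(gqs_tiktok, likes_tiktok)
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.3f}")
```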
  17. Front Public Health. 2025;13: 1611087
       Introduction: Hashimoto's thyroiditis (HT), a common autoimmune thyroid disorder, is widely discussed on video-sharing platforms. However, user-generated content about HT lacks systematic scientific validation. This study evaluates the reliability and quality of HT-related videos on three major social media platforms: YouTube, Bilibili, and TikTok.
    Methods: Between December 1 and 10, 2024, the top 200 videos meeting the inclusion criteria, retrieved under default search settings with a newly registered user account, were included for each platform. These videos came from 107 YouTube accounts, 56 Bilibili accounts, and 90 TikTok accounts. Metrics including video parameters and creator profiles were recorded. Content quality was evaluated using five validated assessment tools: PEMAT (Patient Education Materials Assessment Tool), VIQI (Video Information and Quality Index), GQS (Global Quality Score), mDISCERN (modified DISCERN), and JAMA (Journal of the American Medical Association) standards.
    Results: TikTok videos showed the highest audience engagement. YouTube had more team-based accounts (43.9%), while TikTok and Bilibili predominantly featured individual accounts, with TikTok featuring a notably higher proportion of verified individual accounts (86.7%). Solo narration was the most common video style on YouTube (62.5%) and TikTok (70.0%), whereas on Bilibili the medical scenario was the most common style. In contrast, YouTube and Bilibili offered a broader range of content, including TV programs, documentaries, and educational courses. The varying emphases of the assessment tools made it difficult to determine which platform offers the highest content quality, but video quality scores across all platforms were unsatisfactory. Additionally, we found that content produced by verified creators was of higher quality than that of unverified creators, with this trend being particularly evident among individual accounts.
    Conclusion: Social media platforms provide partial support for the dissemination of health information about HT, but the overall video quality remains suboptimal. We recommend that professional creators pursue platform certification to enhance the dissemination of high-quality HT-related videos.
    Keywords:  Hashimoto’s thyroiditis; health education; hypothyroidism; public health; social media
    DOI:  https://doi.org/10.3389/fpubh.2025.1611087
  18. Sci Rep. 2025 Oct 28. 15(1): 37705
      Stroke remains a significant global health concern. Despite numerous stroke-related videos on social media, research evaluating their information quality across platforms remains limited. This study compares information quality and content of stroke-related videos on BiliBili, Douyin and Xiaohongshu. This study analyzed 227 stroke-related videos across three platforms (58 from BiliBili, 88 from Douyin and 81 from Xiaohongshu). Information quality was assessed using adapted HONCode and PEMAT-A/V standards. Content analysis examined stroke aspects (definition, symptoms, etiology, assessment, treatment, outcome, complications, risk factors and prevention). Statistical analyses included Mann-Whitney U, Kruskal-Wallis and Spearman correlation analysis. Overall, videos showed moderate information quality (72.7% achieving medium HONCode levels). Compliance rates were 4.0% for Principle 4 (source reference) and 2.2% for Principle 5 (evidence for claims). Videos showed higher understandability (median 0.73; IQR 0.2) but suboptimal actionability (median 0.67; IQR 1.0). Content completeness was low (median 2.00, IQR 3.0), with treatment (63.0%) and symptoms (55.1%) mentioned most frequently and assessment (15.4%) and complications (15.0%) less frequently. Spearman's correlation analysis indicated that there were mostly no correlations between video information quality and user engagement (Likes, Collections, Comments, Shares) on the three platforms. Specifically, among the three platforms, Douyin had significantly higher information quality (P < 0.001), while Xiaohongshu showed lower understandability (P = 0.032) and content completeness (P < 0.001). Stroke videos' information quality on BiliBili, Douyin and Xiaohongshu was generally moderate, but commonly lacked evidence support, actionability and content completeness. Among these platforms, Douyin demonstrated relatively better performance, while Xiaohongshu showed poorer understandability and completeness. This study recommends that video publishers focus on enhancing evidence support and actionability, particularly regarding stroke assessment and complications, to help the public access more accurate and complete stroke information. For BiliBili and Xiaohongshu, increasing medical professional participation is recommended to improve information quality. Xiaohongshu needs to improve content understandability by using simpler and clearer language to explain stroke-related knowledge.
    Keywords:  BiliBili; Content completeness; Douyin; Information quality; Stroke; Xiaohongshu
    DOI:  https://doi.org/10.1038/s41598-025-21535-z
  19. Immunol Res. 2025 Oct 31. 73(1): 153
      Arthritis is a common condition that causes articular cartilage damage, joint tissue destruction, and ligament involvement, representing one of the leading causes of disability worldwide. This study aimed to analyze long-term trends and characterize seasonal patterns in global online information-seeking behavior related to arthritis and its associated terms using Google Trends data. We retrieved monthly relative search volume (RSV) data for the search terms "arthritis", "ankylosing spondylitis (AS)", "gout", "juvenile idiopathic arthritis (JIA)", "osteoarthritis (OA)", "psoriatic arthritis (PsA)", "rheumatoid arthritis (RA)", and "systemic lupus erythematosus (SLE)" from Jan 2004 to Dec 2022. Long-term trends were visualized using time-series plots, and seasonal patterns were assessed using cosinor analysis. Analysis of global RSV from 2004 to 2022 revealed distinct and divergent long-term trends across arthritis-related terms. The general term "arthritis," along with "OA" and "RA," displayed an initial decline until around 2010-2011, followed by a sustained recovery and gradual increase. In contrast, "AS," "gout," and "PsA" exhibited consistent upward trends throughout the period, while "JIA" progressively declined and "SLE" remained stable. More importantly, cosinor analysis confirmed statistically significant seasonal patterns (all P < 0.05) for "arthritis", "JIA", "OA", "PsA", and "RA", with amplitudes ranging from 2.28 to 4.22. These rhythms were characterized by a reproducible peak in late winter to early spring (acrophase: Feb 5-Apr 5) and a trough in late summer to early autumn. Thematic analysis of rising queries highlighted public focus on disease classification, clinical manifestations, and treatment-related information. Global online interest in arthritis, as measured by RSV, demonstrated significant long-term and seasonal patterns. Changes in public interest in arthritis-related terms can reflect public awareness and potential medical needs. Our findings underscore the importance of infodemiology in public health monitoring.
    Keywords:  Arthritis; Infodemiology; Information seeking behavior; Public health; Seasonal variation
    DOI:  https://doi.org/10.1007/s12026-025-09716-4
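The cosinor analysis in entry 19 fits a 12-month cosine, y(t) = MESOR + A·cos(2πt/12 − φ); because cos(ωt − φ) = cos φ·cos ωt + sin φ·sin ωt, the fit reduces to ordinary least squares on cosine and sine regressors. A minimal sketch on a synthetic monthly series (not Google Trends data):

```python
# Illustrative only: a synthetic monthly series stands in for Google Trends RSV.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
months = np.arange(228)                      # Jan 2004 - Dec 2022
period = 12.0
rsv = 60 + 3 * np.cos(2 * np.pi * months / period - 1.0) + rng.normal(0, 2, months.size)

# Cosinor model: rsv ~ MESOR + beta*cos(wt) + gamma*sin(wt)
X = sm.add_constant(np.column_stack([
    np.cos(2 * np.pi * months / period),
    np.sin(2 * np.pi * months / period),
]))
fit = sm.OLS(rsv, X).fit()
mesor, beta, gamma = fit.params
amplitude = np.hypot(beta, gamma)            # peak-to-MESOR swing
acrophase = np.arctan2(gamma, beta)          # radians; convert to months via period/(2*pi)
print(f"MESOR = {mesor:.1f}, amplitude = {amplitude:.2f}, "
      f"acrophase at month {acrophase * period / (2 * np.pi):.1f}")
```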
  20. Health Sci Rep. 2025 Nov;8(11): e71418
       Background and Aims: Information avoidance refers to deliberately evading or delaying access to freely accessible but undesired information, and eHealth literacy refers to people's capability to comprehend, assess, find, and use health knowledge derived from electronic sources. The present study aims to investigate the health information avoidance and eHealth literacy levels in patients with multiple sclerosis (MS) in Fars Province, Iran, and their correlation with each other.
    Methods: This was a cross-sectional correlational study. The studied population included all patients with MS in Fars Province, Iran, in 2023. The sampling method was convenience sampling, with a sample size of 127 people. The data collection tools consisted of the Health Information Avoidance Questionnaire and the eHEALS questionnaire. Data analysis was performed using descriptive statistics, Pearson's correlation coefficient, t-tests, ANOVA, and the Bonferroni test in SPSS version 21.
    Results: The average score for general health information avoidance was 9.06 out of 20 points, the average score for MS health information avoidance was 20.52 out of 50, and the average patient eHealth literacy score was 26.28 out of 40 points. However, the analysis found no significant correlation between general or MS-specific health information avoidance and eHealth literacy (p ≥ 0.05).
    Conclusion: Most patients showed above-average eHealth literacy and minimal health information avoidance. Nevertheless, micro- and macro-level strategies and policies are necessary to further decrease this avoidance and enhance eHealth literacy. However, because of the convenience sampling, generalization to the broader population should be approached with caution.
    Keywords:  electronic health literacy; health information avoidance; multiple sclerosis
    DOI:  https://doi.org/10.1002/hsr2.71418