bims-librar Biomed News
on Biomedical librarianship
Issue of 2026-04-19
forty-six papers selected by
Thomas Krichel, Open Library Society



  1. Nature. 2026 Apr;652(8110): 828
      
    Keywords:  Information technology; Research management; Scientific community
    DOI:  https://doi.org/10.1038/d41586-026-01217-0
  2. J Med Libr Assoc. 2026 Apr 01. 114(2): 90-92
      In this profile, Brenda M. Linares, AHIP, Medical Library Association (MLA) president 2024-25, is introduced through a discussion of her contributions and commitment to growing the next generation of Latina library leaders. As the first Latina immigrant MLA president, she partnered with colleagues to build organizational structures to strengthen diversity, equity, and inclusion in multiple regional chapters of MLA. In addition to her focus on integrating her family into her professional engagements, Linares brought a strong business orientation from her bachelor's degree in Finance and her Master of Business Administration to her MLA leadership and to her professional role as Associate Dean of Library Services, University of Missouri-Kansas City (UMKC) Libraries.
    Keywords:  Biography; Latina; Latinx; MLA President
    DOI:  https://doi.org/10.5195/jmla.2026.2449
  3. J Med Libr Assoc. 2026 Apr 01. 114(2): 164-168
       Background: Our health system library fields thousands of requests for literature searches each year in support of research, policy, evidence-based practice projects, and care for individual patients. With fewer library staff than comparable institutions and an engaged, multidisciplinary clinical workforce, we face ongoing pressures to do more with less and to demonstrate our value.
    Case Presentation: A 2021 article in the Journal of Hospital Librarianship offered an existing survey and basic project design that we used to assess our impacts. We adapted, with permission, the survey and methods of "Analysis of a Hospital Librarian Mediated Literature Search Service at a Regional Health Service in Australia," a quality improvement project authored by Siemensma et al. (2021) [1]. Throughout 2023 we sent the adapted survey to all employees and affiliated clinicians who requested literature searches. The survey included five multiple choice questions as well as a free text box for comments. Respondents were asked to provide simple demographic information and consider the impact and quality of results they received from the librarian.
    Conclusions: Our survey-based evaluation of our literature search service underscores the importance of librarian-mediated literature searches for clinical practice, policy development, and patient care. Demonstrating hospital library impacts is increasingly important and increasingly challenging for understaffed teams. Assessments using previously published surveys are feasible for non-academic libraries and serve as compelling cases for the continued and expanded integration of library resources into clinical practice and decision-making.
    Keywords:  Clinical Support; Expert Searching; Hospital libraries; Information Services; Program Evaluation; Surveys & Questionnaires
    DOI:  https://doi.org/10.5195/jmla.2026.2246
  4. J Med Libr Assoc. 2026 Apr 01. 114(2): 176-177
      In response to the 2023 NIH Data Management and Sharing (DMS) Policy, Washington University School of Medicine in St. Louis launched Digital Commons Data@Becker, a generalist institutional data repository supporting both open and restricted access to research data. Managed by Bernard Becker Medical Library's DMS Team, the repository offers a fully mediated curation workflow that guides researchers through consultation, metadata capture, documentation, and quality control (QC). Draft Digital Object Identifiers (DOIs) can be issued once the access type is determined, with final DOI publication following curation and QC. Restricted datasets require Human Research Protection Office (HRPO) review and Data Transfer and Use Agreements (DTUAs), while open access datasets are freely downloadable. The repository leverages persistent identifiers such as Open Researcher and Contributor IDs (ORCID iDs), Research Organization Registry (ROR) IDs, and DOIs, along with the DataCite metadata schema and custom metadata fields. Since its launch in 2023, Digital Commons Data@Becker has published 30 datasets spanning biomedical imaging, sequencing, quantitative assays, flow cytometry, and qualitative survey data. Across all datasets, there have been 4,409 views and 4,120 file downloads, with restricted datasets generating 13 access requests, three of which were granted through DTUAs. Researchers emphasize the value of free institutional curation, flexible access models, and rapid DOI assignment. Digital Commons Data@Becker demonstrates how a generalist institutional data repository can balance accessibility and security to support NIH compliance, while advancing FAIR (Findable, Accessible, Interoperable, Reusable) data sharing and long-term stewardship.
    Keywords:  Data Curation; Data Repository; Data Sharing; FAIR data principles; Institutional Repositories; NIH Data Management and Sharing Policy; Open Access; Restricted Access
    DOI:  https://doi.org/10.5195/jmla.2026.2339
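     To make the metadata stack described above concrete, here is a minimal sketch of a DataCite-style record for a dataset deposit, written as a Python dict. The field names follow the DataCite metadata schema; every value (DOI, ORCID iD, ROR ID, title) is a hypothetical placeholder rather than an actual Digital Commons Data@Becker record.

        # A minimal DataCite-style metadata record for a dataset deposit.
        # Field names follow the DataCite metadata schema; all values are
        # hypothetical placeholders.
        record = {
            "doi": "10.xxxx/example-dataset",  # draft DOI, finalized after curation/QC
            "titles": [{"title": "Example flow cytometry dataset"}],
            "creators": [{
                "name": "Doe, Jane",
                "nameIdentifiers": [{
                    "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
                    "nameIdentifierScheme": "ORCID",
                }],
                "affiliation": [{
                    "name": "Washington University School of Medicine",
                    "affiliationIdentifier": "https://ror.org/00xxxxx00",  # placeholder ROR ID
                    "affiliationIdentifierScheme": "ROR",
                }],
            }],
            "publisher": "Digital Commons Data@Becker",
            "publicationYear": 2023,
            "types": {"resourceTypeGeneral": "Dataset"},
        }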
  5. J Hand Ther. 2026 Apr 13. pii: S0894-1130(26)00015-3. [Epub ahead of print]
      
    DOI:  https://doi.org/10.1016/j.jht.2026.02.003
  6. J Med Libr Assoc. 2026 Apr 01. 114(2): 125-135
       Objective: Limited empirical research is available to guide hospital librarians through a healthcare system merger or acquisition. To address this knowledge gap, an e-Delphi research study was used to develop recommended tasks that librarians should consider when consolidating the delivery of library services to a newly merged, geographically distributed healthcare system.
     Methods: This e-Delphi study was conducted and reported according to the Guidance on Conducting and REporting DElphi Studies (CREDES). The expert panel, composed of 29 hospital librarians, responded to four rounds of questionnaires from April to December 2022. In Round 1, the panelists' qualitative responses were collected and analyzed via thematic analysis to identify potential recommended tasks. In Rounds 2 through 4, tasks were eliminated or prioritized based upon the panelists' rating of each task using a seven-point Likert scale. Those tasks rated as 5, 6, or 7 by ≥75% of the panelists were included in the final consensus statement.
    Results: The consensus statement identifies 330 recommended tasks. Highly prioritized tasks involve cultivating beneficial relationships with others throughout the merger, particularly newly blended library teams, finance and administrative leadership, information technology/services, and vendors. Marketing and outreach activities and physical library space management tasks were not prioritized. The panelists emphasized understanding organizational context and culture throughout any merger.
    Conclusions: The recommended tasks can be used by hospital librarians to create an action plan for consolidating and delivering library services in the event of a healthcare system merger or acquisition. Future research on the utility of the recommendations is anticipated.
    Keywords:  Consensus; Delphi method; M&A; Mergers; Organizational Change; acquisitions; administration; change management; hospital librarians; hospital librarianship; hospital libraries; leadership; management; mergers and acquisitions
    DOI:  https://doi.org/10.5195/jmla.2026.2031
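     The ≥75% consensus rule above is straightforward to operationalize. A minimal sketch in Python, assuming ratings are collected per task; the task names and scores below are hypothetical:

        # Consensus rule from the study: keep a task when >= 75% of panelists
        # rate it 5, 6, or 7 on the seven-point Likert scale.
        # Task names and ratings are hypothetical.
        ratings = {
            "Meet with finance and administrative leadership": [7, 6, 5, 7, 6, 7, 5, 6],
            "Redesign the physical library floor plan":        [3, 4, 2, 5, 3, 4, 3, 2],
        }

        def reaches_consensus(scores, threshold=0.75):
            """True if the share of ratings in {5, 6, 7} meets the threshold."""
            high = sum(1 for s in scores if s >= 5)
            return high / len(scores) >= threshold

        consensus = [task for task, scores in ratings.items() if reaches_consensus(scores)]
        print(consensus)  # ['Meet with finance and administrative leadership']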
  7. J Med Libr Assoc. 2026 Apr 01. 114(2): 178-183
      Librarians' contributions to systematic review projects receive inconsistent recognition within promotion or tenure processes. A review of thirty-six academic libraries' norms and procedures revealed only two that mentioned systematic reviews. Recognition and inclusion of systematic reviews and other evidence synthesis is further complicated by variance in recognition of interdisciplinary work. This commentary provides recommendations for academic library leadership: establish standards for documenting and evaluating systematic review work in annual reviews and promotion or tenure, explicitly recognize the value of participation in interdisciplinary scholarship, include search strategies as a scholarly output, and provide guidance for the external review process. We close with a call to action for professional organizations to establish centralized guidelines to ensure the full recognition of librarianship and scholarly participation in systematic reviews.
    Keywords:  Annual Review; Evidence Synthesis; Interdisciplinary; Promotion; Systematic Reviews; Tenure
    DOI:  https://doi.org/10.5195/jmla.2026.2202
  8. J Med Libr Assoc. 2026 Apr 01. 114(2): 159-163
       Background: Open Educational Resources (OER) are free learning materials that benefit students in higher education, including in the health sciences. As more health sciences OER materials are created, there is a need for openly licensed health sciences images. Traditional OER repositories lack specialized health sciences imagery, while PubMed, a biomedical database, has the potential to fill this gap.
     Case Presentation: A nursing faculty member partnered with a health sciences librarian to search PubMed for openly licensed images for a pathophysiology OER textbook. The librarian used existing filters in PubMed to identify articles that have Creative Commons licenses as well as images. The nursing faculty member assessed these images and added relevant ones to the pathophysiology textbook.
    Conclusions: PubMed is a free resource that health sciences librarians use on a regular basis. Utilizing the database to find openly licensed materials allows librarians to use a familiar tool for a new and exciting purpose.
    Keywords:  OER; PubMed; images; open educational resources; repositories
    DOI:  https://doi.org/10.5195/jmla.2026.2210
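     The entry does not name the exact PubMed filters used, so the sketch below only shows the general query pattern against NCBI's E-utilities ESearch endpoint; the license filter term (PMC's open access subset) is an assumption standing in for whatever Creative Commons and image filters the librarian combined:

        # Searching NCBI E-utilities for openly licensed articles on a topic.
        # The ESearch endpoint and db/term/retmode parameters are standard
        # E-utilities; the filter term below is an assumption, since the entry
        # does not name the exact filters used.
        import json
        from urllib.parse import urlencode
        from urllib.request import urlopen

        BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
        params = urlencode({
            "db": "pmc",  # PubMed Central
            "term": 'pathophysiology AND "open access"[filter]',  # assumed filter
            "retmode": "json",
            "retmax": 20,
        })
        with urlopen(f"{BASE}?{params}") as resp:
            result = json.load(resp)
        print(result["esearchresult"]["idlist"])  # candidate articles to screen for images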
  9. J Med Libr Assoc. 2026 Apr 01. 114(2): 173-175
      As part of an effort to seek sustainable support models for Open Access (OA) publishing, the University of Maryland, Baltimore (UMB), Health Sciences and Human Services Library's (HSHSL's) Scholarly Communications Committee developed an interactive dashboard to visualize university-wide OA publishing trends. Using publication data exported from Scopus and visualized in Microsoft Power BI, the dashboard displays five years of publishing trends by OA model, publisher, journal, school, and citation count. The dashboard is fully interactive, allowing users to filter results by school, OA model, and year. The design of the dashboard was iterative, with planning discussions taking place in Summer 2024, data model development and initial data collection in Fall 2024, refinement of the visualization and data model in early Spring 2025, and publication of the final dashboard to our website in April 2025. The dashboard continues to be refined and improved based on feedback from stakeholders, and the project team plans to incorporate data on publishing costs in Spring 2026. The project was designed for sustainability and adaptability, with a documented workflow that will be easy for future committees to implement. This innovative, replicable approach supports informed decision-making around OA publishing and provides a model that can be adopted by other academic health sciences libraries.
    Keywords:  Data Visualization; Interactive Dashboards; Open Access Publishing; Scholarly Communications
    DOI:  https://doi.org/10.5195/jmla.2026.2340
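     Although the dashboard itself was built in Microsoft Power BI, the underlying data model is a plain aggregation of publication records. A sketch of the equivalent transformation in pandas; the column names and rows are hypothetical stand-ins for a Scopus export:

        # Aggregate a publication export into the table a trends dashboard
        # visualizes: counts by year and open access model.
        # Column names and values are hypothetical, not the Scopus schema.
        import pandas as pd

        pubs = pd.DataFrame({
            "year":     [2021, 2021, 2022, 2022, 2022],
            "oa_model": ["gold", "hybrid", "gold", "green", "gold"],
            "school":   ["Medicine", "Nursing", "Medicine", "Pharmacy", "Medicine"],
        })
        trend = (pubs.groupby(["year", "oa_model"])
                     .size()
                     .unstack(fill_value=0))  # rows: year; columns: OA model
        print(trend)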
  10. J Med Libr Assoc. 2026 Apr 01. 114(2): 136-142
       Background: Individuals seeking health information often turn to the Internet for answers. Wikipedia is a dynamic, crowdsourced encyclopedia and one of the most accessed online sources for this content. However, the Spanish Wikipedia is not nearly as in-depth as the English version, creating a large disparity. Medical students with English and Spanish proficiency possess a distinct skill set that positions them to contribute timely, trusted, evidence-based content to the platform and reduce this inequity.
    Case Presentation: This case study presents the implementation of a credit-bearing Spanish Wikipedia translation elective by the library for fourth-year medical students at Western Michigan University Homer Stryker M.D. School of Medicine, currently the only Spanish Wikipedia elective in a medical school in the United States. The purpose of the course is to increase the quality and readability of medical articles in the English and Spanish versions of the online encyclopedia using evidence-based medicine (EBM) principles.
    Conclusions: The output from this elective demonstrates that medical students can use their medical knowledge and skills to create and improve articles in English and Spanish on Wikipedia and disseminate evidence-based information to millions of consumers worldwide seeking reputable health information. Learners can leverage their specialized training to minimize the gap between these versions and become active participants in global health. By using technology to their advantage, they provide enduring health information that impacts and reaches many more people in a virtual setting than in a traditional one-on-one clinical encounter.
    Keywords:  Spanish; Wikipedia; consumer health; crowdsourcing; curriculum; global health; health information seeking; medical education; medical school; medical students; open educational resources; public health
    DOI:  https://doi.org/10.5195/jmla.2026.2237
  11. J Med Libr Assoc. 2026 Apr 01. 114(2): 105-115
       Objectives: To retrospectively evaluate workload implications and recall performance of narrower or broader database search strategies when using active learning screening tools.
    Method: A convenience sample of 10 completed reviews was used to assess search strategy performance in ASReview LAB, an open-source systematic review software tool. For each review, a single database search strategy was selected and then revised to either broaden (n = 9) or narrow (n = 1) the scope. Results from both the more sensitive (broader) and more precise (narrower) search strategies were labeled as relevant or irrelevant based on inclusion in the completed review. The labeled result sets were uploaded into the ASReview LAB simulation module, which mimics the process of human screening. Metrics such as number of records screened to reach recall of 95% or more were recorded. The effects of three different stopping rules on workload and recall were also explored.
     Results: For quantitative systematic reviews, the difference in absolute screening time required to reach 95% recall between broader and narrower search strategies was minimal (≤35 minutes). In contrast, for qualitative systematic reviews and other review types, broader search strategies led to increased workload. With respect to stopping rules, the time-based stopping heuristic resulted in substantial workload increases when broader search strategies were employed.
    Conclusion: Time savings achieved through the use of semi-automated screening tools may not always offset additional screening time required by broader, more sensitive search strategies. Librarians and information specialists should consider a variety of factors when determining the appropriate balance between search sensitivity and specificity in the context of semi-automated screening tools.
    Keywords:  AI; Screening Tools; evidence synthesis as topic; machine learning; search strategy development; systematic review as topic
    DOI:  https://doi.org/10.5195/jmla.2026.2286
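     The study's central workload metric can be stated compactly: given the order in which a screening tool presents records, count how many must be screened before 95% of the relevant records have been seen. The sketch below is a simplified stand-in for what ASReview LAB's simulation module computes, with hypothetical labels:

        # Records screened to reach a recall target, given the ranked order in
        # which a semi-automated tool presents records (1 = relevant, 0 = not).
        def records_to_recall(ranked_labels, target=0.95):
            total_relevant = sum(ranked_labels)
            needed = target * total_relevant
            found = 0
            for screened, label in enumerate(ranked_labels, start=1):
                found += label
                if found >= needed:
                    return screened
            return len(ranked_labels)

        order = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # hypothetical screening order
        print(records_to_recall(order))  # 7: the last relevant record sits at rank 7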
  12. Cutan Ocul Toxicol. 2026 Apr 16. 1-6
       BACKGROUND: Large language models (LLMs) could accelerate clinical literature searches, but their reliability is compromised by "hallucinations" generating false references. This study compared three general-purpose LLMs using a standardized dermatology literature retrieval prompt for reference accuracy, relevance, and hallucination rates.
    METHODS: A clinical scenario on latent tuberculosis management in psoriasis patients on IL-17/23 inhibitors was defined. To establish a reference standard, references (n=74) from the two most recent and comprehensive systematic reviews on the topic were screened. These two reviews were selected as they represented the most current and complete syntheses of evidence on this clinical question; using their reference lists ensured a focused, expert-validated foundation for evaluating LLM outputs. This process yielded 16 studies directly addressing the scenario. Each LLM (ChatGPT, Gemini, Deepseek-V3.2) was prompted to list 15 recent specific references. The 45 retrieved references were manually validated as: "True and Relevant," "True but Irrelevant/General," or "False/Hallucination." Distributions were compared using Pearson's chi-square test.
     RESULTS: A significant difference was found between models (p<0.010). ChatGPT listed 80.0% (12/15) correct and relevant references with no hallucinations. Gemini produced 80.0% (12/15) hallucinations, while Deepseek-V3.2 generated 100.0% fictional references. Notably, 4 of the references ChatGPT listed were valid articles overlooked in the predefined pool; these were verified as relevant, indicating that the reference standard may not have been exhaustive.
    CONCLUSION: LLM performance varies considerably with high hallucination risk. Findings highlight caution and independent verification. Future research should test advanced query techniques and hybrid systems integrating LLMs with academic databases.
    Keywords:  ChatGPT; Deepseek-V3.2; Gemini; Large language models; artificial intelligence; dermatology; hallucinations; latent tuberculosis; literature review; psoriasis
    DOI:  https://doi.org/10.1080/15569527.2026.2656177
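     A sketch of the study's statistical comparison: Pearson's chi-square over the three reference categories for each model. ChatGPT's 12/3/0 split and the Gemini and Deepseek hallucination counts come from the abstract; the remaining two Gemini cells are assumed for illustration:

        # Pearson chi-square over reference categories, one row per model.
        # Column order: [True & Relevant, True but Irrelevant, Hallucination].
        from scipy.stats import chi2_contingency

        counts = [
            [12, 3, 0],   # ChatGPT (from the abstract)
            [1, 2, 12],   # Gemini (first two cells assumed)
            [0, 0, 15],   # Deepseek-V3.2 (from the abstract)
        ]
        chi2, p, dof, expected = chi2_contingency(counts)
        print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")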
  13. Front Digit Health. 2025;7: 1700018
      Artificial intelligence (AI)-powered large language models, such as ChatGPT, are increasingly used by the public for health information. The reliability of such novel AI tools in providing credible polycystic ovary syndrome (PCOS) information/advice requires investigation. Healthcare professionals involved in PCOS care (n = 43 from 14 countries) used a 5-point Likert scale to evaluate ChatGPT-generated responses to frequently asked questions about PCOS against the corresponding patient-orientated, evidence-based recommendations/responses available online. ChatGPT responses were rated significantly higher than the evidence-based responses for 11 of the 12 study questions, with moderate to large effect sizes (rank-biserial r = -0.46 to -1.00; all p-values <0.05), with ChatGPT answers rated on average 0.824 units higher. Scoring agreement varied (poor to fair), with seven questions showing statistically fair agreement (κ = 0.24-0.37, p < 0.05). Readability analyses found no statistically significant differences between ChatGPT and evidence-based responses. However, using ChatGPT to simplify the responses resulted in significant improvement. ChatGPT holds potential as a complementary patient self-education tool in PCOS, capable of interactive engagement and simplifying medical language. Further research is needed to identify the optimal integration of AI tools and validate their clinical applicability for PCOS self-education/management.
    Keywords:  AI; ChatGPT; PCOS; artificial intelligence; large language models; polycystic ovary syndrome
    DOI:  https://doi.org/10.3389/fdgth.2025.1700018
  14. Chron Respir Dis. 2026 Jan-Dec;23: 14799731261443321
       Background: AI-based chatbots are increasingly used as sources of health information. However, their reliability in delivering accurate and scientifically sound responses to patient questions remains uncertain, especially in chronic diseases such as chronic obstructive pulmonary disease (COPD). This study aims to compare the reliability of ChatGPT-4o and Gemini 2.5 Flash in providing patient-centered medical information on COPD.
     Methods: A total of 34 common public questions about COPD were submitted to ChatGPT-4o and Gemini 2.5 Flash. Responses were evaluated blindly by four pulmonologists across three domains: accuracy, clarity, and scientific adequacy. The mean scores and word counts were analyzed and compared via nonparametric tests.
     Results: Gemini 2.5 Flash outperformed ChatGPT-4o in terms of scientific adequacy (mean score: 4.69 ± 0.31 vs. 4.34 ± 0.45, p<0.001). No significant difference was found in accuracy or clarity. Gemini 2.5 Flash also generated significantly longer responses, particularly in the treatment and prognosis domains (p<0.001). Both models provided generally acceptable answers, but ChatGPT-4o's responses were shorter and occasionally less complete.
     Conclusions: While both models delivered largely accurate and understandable content, Gemini 2.5 Flash tended to produce more detailed responses and received higher scientific adequacy ratings; however, this difference should be interpreted in light of the substantial imbalance in response length. These tools may support patient education; however, the findings reflect a comparison between AI systems only and should be interpreted within this scope.
    Keywords:  chronic obstructive pulmonary disease; generative artificial intelligence; hallucinations; health education; telemedicine
    DOI:  https://doi.org/10.1177/14799731261443321
  15. Urogynecology (Phila). 2026 Apr 17.
       IMPORTANCE: Decision aid tools are well-utilized resources in shared decision making for the treatment of pelvic floor disorders. With the improvement of large language models like ChatGPT, prior studies have started evaluating whether artificial intelligence can be used to create patient education materials. As treatment of pelvic floor disorders can be complicated and specific to individuals, we assessed the use of ChatGPT for patient decision making.
    OBJECTIVE: Our objective was to evaluate the understandability, reliability, accuracy, and readability of ChatGPT-generated decision aid tools for urogynecology surgical procedures.
    STUDY DESIGN: In this cross-sectional study, 6 questions comparing 2 treatment options for common urogynecological conditions were developed by 6 urogynecologists and entered into ChatGPT to create decision aid tools. Patients and physicians evaluated understandability using a Patient Education Materials Assessment Tool. Physicians also evaluated reliability using a modified DISCERN (mDISCERN) instrument and accuracy using a 5-point Likert scale. Readability was evaluated using the Flesch-Kincaid Reading Ease score. Analysis of all scores was descriptive.
     RESULTS: All decision aid tools received high understandability ratings from both patients and physicians. The tools had fair reliability, with an average mDISCERN score of 26. Accuracy was <4 (unfavorable) for 2 of the decision aid tools, and the required reading level was high overall.
    CONCLUSIONS: ChatGPT-generated decision aid tools for urogynecology surgical counseling have high patient understandability, emphasizing their potential for being well-received by patients deciding on treatment. However, approximately 30% of the topics require improvements to accuracy, while all tools could benefit from improved reliability and readability.
    DOI:  https://doi.org/10.1097/SPV.0000000000001856
  16. Angle Orthod. 2026 Apr 15. pii: 082825-727.1. [Epub ahead of print]
       Objectives: To compare the performance of ChatGPT-4 Turbo and Gemini 1.5 Pro in the domain of Clear Aligner Therapy (CAT).
    Materials and Methods: A total of 36 standardized questions on CAT were created based on consent forms from aligner companies and frequently asked patient questions. Responses were generated by ChatGPT-4 Turbo and Gemini 1.5 Pro. A reference answer key was developed from current literature. Two orthodontic professors independently evaluated the responses using a six-point accuracy scale and a three-point completeness scale. Questions were also categorized by topic. Readability was assessed using Flesch-Kincaid Grade Level (FKGL) and Simplified Measure of Gobbledygook (SMOG) scores. Independent samples t-tests were used to compare readability, whereas the Mann-Whitney U test was used for accuracy and completeness. Interrater reliability was assessed with intraclass correlation coefficient (ICC). Significance was set at P < .05.
     Results: Interrater reliability was high for both models. ChatGPT showed excellent agreement for accuracy (ICC = 0.91) and good agreement for completeness (ICC = 0.89). Gemini also showed excellent agreement for accuracy (ICC = 0.93) and moderate-to-good agreement for completeness (ICC = 0.78). No significant differences were found between models in accuracy, completeness, or readability. Both produced content with FKGL scores indicating university-level reading and SMOG scores suggesting high school-level comprehension.
    Conclusions: The readability of the content may present challenges for general audiences due to its complexity. The models used in this study may assist in patient education; however, the results implied the importance of professional consultation and careful interpretation of artificial intelligence-generated information.
    Keywords:  Clear aligner therapy; Large language models; Patient education; Readability; Treatment accuracy
    DOI:  https://doi.org/10.2319/082825-727.1
  17. J Med Libr Assoc. 2026 Apr 01. 114(2): 94-104
       Objective: To compare answers to clinical questions between five publicly available large language model (LLM) chatbots and information scientists.
     Methods: Forty-five PICO (patient, intervention, comparison, outcome) questions addressing treatment, prognosis, and etiology were developed. Each question was answered by a medical information scientist and submitted to five LLM tools: ChatGPT, Gemini, Copilot, DeepSeek, and Grok-3. Key elements from the answers provided were used by pairs of information scientists to label each LLM answer as in Total Alignment, Partial Alignment, or No Alignment with the information scientist. The Partial Alignment answers were also analyzed for the inclusion of additional information.
     Results: The full set of 225 LLM answers was assessed as being in Total Alignment 20.9% of the time (n=47), in Partial Alignment 78.7% of the time (n=177), and in No Alignment 0.4% of the time (n=1). Kruskal-Wallis testing found no significant performance difference in alignment ratings between the five chatbots (p=0.46). An analysis of the partially aligned answers found a significant difference in the number of additional elements provided by the information scientists versus the chatbots per Wilcoxon rank-sum testing (p=0.02).
    Discussion: Five chatbots did not differ significantly in their alignment with information scientists' evidence summaries. The analysis of partially aligned answers found both chatbots and information scientists included additional information, with information scientists doing so significantly more often. An important next step will be to assess the additional information, both from the chatbots and the information scientists for validity and relevance.
    Keywords:  LLMs; Large Language Models; artificial intelligence; biomedical informatics; chatbots; evidence synthesis; generative AI; information science; library science
    DOI:  https://doi.org/10.5195/jmla.2026.2333
  18. Work. 2026 Apr 17. 10519815261442519
       Background: Patients with fibromyalgia require clear and reliable medical information to manage a complex chronic condition. AI-based tools may offer valuable support for patient education.
     Objective: To evaluate and compare the performance of three AI models (ChatGPT, Gemini, and DeepSeek) in providing patient-centered, accurate information about fibromyalgia syndrome (FMS). Specifically, the study focuses on medical accuracy, readability, and the use of patient-oriented language.
     Methods: Responses were collected from each AI model using a set of frequently asked questions about FMS. These questions were selected based on global search trends and expert input. A total of 10 questions were asked, and the responses were evaluated by a team of physiotherapists using a 4-point Likert scale.
     Results: Statistical analysis revealed significant differences in response quality for certain questions, with the models performing similarly for others. The evaluation indicated that ChatGPT generally provided more accessible, accurate, and patient-friendly answers than Gemini and DeepSeek. Readability analysis using the Flesch-Kincaid Grade Level showed that ChatGPT's responses generally required lower reading grade levels, making them more accessible. In contrast, Gemini produced more complex responses that required higher reading levels, while DeepSeek's responses fell in the mid-range of readability.
     Conclusions: The findings suggest that AI tools can be a valuable resource for patient education, but caution is advised, particularly in areas such as diagnosis, treatment, or medication use. AI-generated responses should always be verified by healthcare professionals to ensure their accuracy and relevance. When used properly and under appropriate supervision, AI can enhance patient understanding and improve access to reliable information about FMS.
    Keywords:  artificial intelligence; chatbot; fibromyalgia syndrome; health communication; medical accuracy; patient education
    DOI:  https://doi.org/10.1177/10519815261442519
  19. Inform Health Soc Care. 2026 Apr 16. 1-11
       OBJECTIVES: This study conducted an informatics system evaluation of two LLMs (GPT-4o and DeepSeek-V3) for patient education, combining clinician-rated quality with patient-perceived usability across thematically stratified queries.
     MATERIALS AND METHODS: In a blinded, within-subject design, 16 frequently asked questions about biologic therapies were categorized into three domains: treatment/drug selection, safety/adverse effects, and special conditions/daily life. Responses were standardized, generated without external retrieval, and anonymized as A/B pairs. Thirty physicians assessed clinical appropriateness, scientific accuracy, and comprehensiveness, while 60 patients rated readability, understandability, actionability, perceived adequacy, decision support, and trust on 5-point Likert scales. Analyses included paired t-tests, Holm/FDR corrections, and two one-sided tests (TOST) to distinguish statistical non-difference from practical equivalence.
    RESULTS: Physicians rated GPT higher across all domains (p < .002), with largest gaps in safety/side effects and treatment/drug selection. Patients favored GPT for understandability, actionability, and decision support (p < .001), while readability, adequacy, trust, and reading time were statistically and clinically equivalent.
    CONCLUSION: Findings highlight the need for topic-aware governance: guideline-dense queries suited to retrieval-augmented generation and checklist compliance, and context-sensitive queries requiring uncertainty signaling and human oversight. This layered approach advances health informatics by defining where LLMs may substitute versus where they require verification, supporting safe and auditable integration into patient education.
    Keywords:  ChatGPT; DeepSeek; digital health; medical informatics; patient education
    DOI:  https://doi.org/10.1080/17538157.2026.2654150
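     The TOST procedure mentioned above separates "no significant difference" from "practical equivalence" by testing whether the paired difference lies inside a pre-set margin. A minimal sketch with scipy; the equivalence margin and the ratings are hypothetical:

        # Two one-sided tests (TOST) for equivalence of paired ratings.
        # Equivalence is declared when BOTH one-sided tests reject, i.e.
        # max(p_lo, p_hi) < alpha. Margin and data are hypothetical.
        import numpy as np
        from scipy import stats

        gpt      = np.array([4.2, 4.5, 4.1, 4.4, 4.3, 4.6, 4.2, 4.4])
        deepseek = np.array([4.1, 4.4, 4.2, 4.3, 4.2, 4.5, 4.3, 4.3])
        margin = 0.5  # equivalence bound on the 5-point scale (assumed)

        diff = gpt - deepseek
        _, p_lo = stats.ttest_1samp(diff, -margin, alternative="greater")
        _, p_hi = stats.ttest_1samp(diff, +margin, alternative="less")
        print(f"TOST p = {max(p_lo, p_hi):.4f}")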
  20. J Med Libr Assoc. 2026 Apr 01. 114(2): 171-172
      To advance information retrieval science for producing evidence syntheses at Canada's Drug Agency, the Research Information Services team developed a replicable process to evaluate automated or artificial intelligence (AI) search tools. The team inventoried 51 tools in the fall of 2023 and built a flexible evaluation instrument to inform adoption decisions and enable comparison between tools. Building on this foundational evaluation work, the team further conducted a comparative analysis on three top-ranked tools in the fall of 2024. The investigation confirmed that these automated or AI tools have inconsistent and variable performance for the range of information retrieval tasks performed by Information Specialists at Canada's Drug Agency. Implementation recommendations from this study informed a "fit for purpose" approach where Information Specialists leverage automated or AI search tools for specific tasks or project types.
    Keywords:  Artificial Intelligence; Automation; Generative Artificial Intelligence; Information Sciences; Information Storage and Retrieval; Large Language Models; Review Literature as Topic
    DOI:  https://doi.org/10.5195/jmla.2026.2341
  21. Ann Afr Med. 2026 Apr 17.
       INTRODUCTION: Patient education is essential in the management of cardiomyopathies, including dilated, restrictive, and hypertrophic subtypes, which often involve complex diagnostic and treatment pathways. Traditional educational resources may not address varying health literacy levels. Recent advancements in artificial intelligence (AI), particularly large language models such as ChatGPT and DeepSeek AI, offer new avenues for generating accessible and scalable patient education materials. This study evaluates the performance of these tools in that context.
    METHODOLOGY: This cross-sectional study involved generating educational brochures for three cardiomyopathies using ChatGPT and DeepSeek AI. The responses were evaluated using the Flesch-Kincaid Grade Level and reading ease (for readability), QuillBot similarity checker (for originality), and the modified DISCERN instrument (for reliability). Statistical analysis included the Shapiro-Wilk test for normality, Levene's test for equality of variances, and independent samples t-tests. Pearson correlation was also performed between readability and reliability scores.
    RESULTS: There were no statistically significant differences between the two AI tools across all assessed parameters. ChatGPT generally produced content with higher grade levels and longer text, while DeepSeek AI demonstrated higher originality in restrictive cardiomyopathy and superior readability in hypertrophic cardiomyopathy. Both tools achieved moderate DISCERN scores, with DeepSeek AI slightly outperforming ChatGPT in dilated cardiomyopathy.
    CONCLUSION: This study found no significant difference in ease score, reliability score, and grade scores between patient education brochures on cardiomyopathies generated by ChatGPT and DeepSeek AI.
    Keywords:  Artificial intelligence; ChatGPT; DeepSeek artificial intelligence; dilated cardiomyopathy; educational tool; hypertrophic cardiomyopathy; patient education brochure; restrictive cardiomyopathy
    DOI:  https://doi.org/10.4103/aam.aam_740_25
  22. Cleft Palate Craniofac J. 2026 Apr 13. 10556656261438912
       Objective: To assess the accuracy, readability, and comparative quality of five large language models (LLMs) in answering frequently asked questions related to nasoalveolar molding (NAM) in cleft care.
     Design: Repeated measures study.
     Setting: This study evaluated the responses of five LLMs: Google Gemini, Microsoft Copilot, ChatGPT, Meta AI, and Claude artificial intelligence (AI), through a standardized set of 28 questions related to NAM in cleft care.
     Participants: None.
     Intervention: The accuracy of LLMs was assessed using a five-point modified Likert scale. Readability was evaluated using two validated metrics: the Flesch-Kincaid Reading Ease and the Flesch-Kincaid Grade Level.
     Main Outcome Measure: The primary outcome variable was the response generated by the five LLMs. Two investigators independently assessed the quality of responses from the five LLMs using a five-point modified Likert scale, with the highest score (5) indicating the highest quality.
     Results: Claude AI achieved the highest mean Likert score (3.71 ± 0.53), whereas Gemini had the lowest score (3.29 ± 0.60). The highest mean readability score was observed for Meta AI (79.61 ± 37.09), while Claude AI showed significantly lower scores (47.04 ± 46.29).
     Conclusion: Among the five LLMs, Claude AI achieved the highest accuracy, followed by Microsoft Copilot, ChatGPT, Meta AI, and Google Gemini in responding to NAM-related queries. The responses from Claude AI were complex and harder to read, followed by ChatGPT, Copilot, Gemini, and Meta AI, with Meta AI being the most straightforward to comprehend.
    Keywords:  artificial intelligence; cleft lip and palate; large language models; nasoalveolar molding
    DOI:  https://doi.org/10.1177/10556656261438912
  23. J Surg Res. 2026 Apr 10. pii: S0022-4804(26)00134-4. [Epub ahead of print] 322: 87-96
       INTRODUCTION: Patient education materials are essential for informed decision-making. However, their readability often exceeds the sixth-grade reading level recommended by the American Medical Association, creating barriers for patients with limited literacy. Artificial intelligence (AI) tools such as ChatGPT-4o may offer a scalable approach to improving accessibility.
    MATERIALS AND METHODS: We analyzed 34 publicly available online patient education materials across general surgery, neurosurgery, orthopedic surgery, and plastic surgery. ChatGPT-4o was prompted to simplify the text to the sixth-grade reading level. Two independent surgical experts reviewed text for accuracy and comprehensiveness. The primary outcome was the Simple Measure of Gobbledygook index. Secondary outcomes included Coleman-Liau index, Flesch-Kincaid grade level, word count, and reading time. Accuracy was measured by the number of inaccurate sentences, and comprehensiveness by coverage of six content categories.
     RESULTS: ChatGPT-4o simplification significantly decreased the mean Simple Measure of Gobbledygook score from 12.1 to 10.5 (P < 10^-9), with similar reductions in the Coleman-Liau index (-0.8, P < 10^-5) and Flesch-Kincaid grade level (-1.6, P < 10^-8). Word count was reduced by 55% on average, most notably in plastic surgery (66%, P = 0.008). Accuracy was maintained in all specialties; however, key content such as risks and benefits was often omitted.
    CONCLUSIONS: AI can improve the readability of surgical patient education materials while maintaining accuracy. Achieving the recommended sixth grade reading level may be challenging due to the inherent complexity of medical information. A hybrid approach that combines AI-assisted simplification with clinician review may represent the most effective strategy for efficiently developing patient-facing educational resources that are accessible, accurate, and comprehensive enough to support informed decision-making.
    Keywords:  Artificial intelligence; ChatGPT; Health literacy; Informed decision-making; Patient education; Readability
    DOI:  https://doi.org/10.1016/j.jss.2026.03.030
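     The readability indices reported throughout these studies are closed-form formulas over sentence, word, and syllable counts. A sketch of the Flesch-Kincaid Grade Level and SMOG formulas; the syllable counter is a crude vowel-group heuristic used only for illustration, whereas published studies rely on validated implementations:

        # Flesch-Kincaid Grade Level and SMOG from raw text. The syllable
        # counter below is a rough heuristic, not a validated tool.
        import math
        import re

        def syllables(word):
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def readability(text):
            sentences = max(1, len(re.findall(r"[.!?]+", text)))
            words = re.findall(r"[A-Za-z]+", text)
            syls = [syllables(w) for w in words]
            fkgl = 0.39 * (len(words) / sentences) \
                 + 11.8 * (sum(syls) / len(words)) - 15.59
            poly = sum(1 for s in syls if s >= 3)  # polysyllabic words
            smog = 1.0430 * math.sqrt(poly * 30 / sentences) + 3.1291
            return fkgl, smog

        fkgl, smog = readability("The operation removes the gallbladder. Recovery takes weeks.")
        print(f"FKGL={fkgl:.1f}, SMOG={smog:.1f}")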
  24. Front Public Health. 2026;14: 1804524
       Objective: To compare, across large language model (LLM) platforms, the quality, readability, and completeness of action-oriented instructions in diabetes self-management education texts, and to quantify the associations among these domains to inform model selection and risk mitigation.
    Methods: Ten LLM platforms were used to generate diabetes education texts (total n = 200), stratified by topic. Outcomes included the Global Quality Score (GQS), the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P), and EQIP-36 (Ensuring Quality Information for Patients, 36-item version). Text characteristics, including word count, sentence count, and syllable count, were recorded. Readability was assessed using the Automated Readability Index (ARI), Coleman-Liau Index (CLI), Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease Score (FRES), Gunning Fog Index (GFOG), Linsear Write (LW), and the Simple Measure of Gobbledygook (SMOG). Between-platform differences were evaluated using one-way ANOVA or the Kruskal-Wallis test, as appropriate. Associations between readability indices and GQS, PEMAT-P, and EQIP-36 were examined using correlation heat maps and exploratory stepwise multiple linear regression. Because the readability indices were highly intercorrelated, these regression analyses were considered exploratory and were used to identify candidate readability-related correlates rather than definitive independent predictors.
    Results: GQS and PEMAT-P differed significantly across platforms (both p < 0.001), whereas EQIP-36 did not (p = 0.062). Text length and readability also varied by platform (most p < 0.001). After stratification by topic, PEMAT-P understandability, PEMAT-P total score, and GQS no longer differed significantly across topics (p = 0.356, p = 0.247, and p = 0.182, respectively), whereas PEMAT-P actionability (p < 0.001), EQIP-36 (p < 0.001), and several readability metrics remained significantly different. Difficulty indices were strongly intercorrelated, and FRES was inversely associated with multiple difficulty indices. Exploratory regression analyses suggested that greater reading burden tended to co-occur with lower GQS, PEMAT-P, and EQIP-36 scores.
    Conclusion: LLM-generated diabetes education texts exhibit marked cross-platform heterogeneity, and exploratory analyses suggest a potential trade-off between readability and both information quality and the completeness of action-oriented instructions. Clinical implementation should therefore combine careful platform selection, structured prompting with templates, human-AI review, and continuous quality monitoring to support safe, readable, and actionable patient education.
    Keywords:  cross-platform evaluation; diabetes mellitus; diabetes self-management education and support (DSMES); large language model (LLM); readability; text quality
    DOI:  https://doi.org/10.3389/fpubh.2026.1804524
  25. J Pain Symptom Manage. 2026 Apr 11. pii: S0885-3924(26)00151-X. [Epub ahead of print]
       CONTEXT: Chatbots are increasingly used by the public, but their performance in answering questions about complex health topics such as cannabis is unknown.
    OBJECTIVES: To evaluate responses of three popular chatbots regarding cannabis and its use for cancer-related symptoms.
     METHODS: We asked ChatGPT, Google Gemini, and Microsoft Copilot to answer questions about cannabis derived from the Centers for Disease Control website and American Society of Clinical Oncology guidelines regarding cannabis. Responses were collected on February 6, 2025. Six physicians with expertise in this field scored responses for accuracy and comprehensiveness (0-10 scale). Reliability of references was scored separately (0-10 scale). Readability was assessed using Flesch-Kincaid Grade Level and Flesch Reading Ease scores.
     RESULTS: Mean accuracy scores (SD) for ChatGPT, Gemini, and Copilot were 9.0 (1.8), 8.8 (2.3), and 8.3 (2.3), respectively. Copilot significantly underperformed in accuracy compared to ChatGPT (mean difference -0.62, 95% CI: -1.11, -0.14; p=0.008). Mean comprehensiveness scores (SD) for ChatGPT, Gemini, and Copilot were 8.1 (2.2), 8.5 (2.2), and 7.2 (2.4), respectively. ChatGPT and Gemini performed better than Copilot in comprehensiveness (mean difference Copilot vs. ChatGPT: -0.88 [95% CI: -1.34, -0.42; p<0.001]; mean difference Copilot vs. Gemini: -1.28 [95% CI: -1.74, -0.82; p<0.001]). Inaccurate or misleading statements regarding cannabis formulations and symptom benefits were identified, with missing information on adverse effects and drug interactions. Gemini had the lowest reference reliability score (4.1). Readability was poor for all chatbots.
    CONCLUSION: Despite overall high accuracy and comprehensiveness scores, chatbots made some misleading, inaccurate statements or missed information. For now, their answers should be interpreted with caution.
    Keywords:  Artificial Intelligence; Cancer symptoms; Cannabis; Chatbots
    DOI:  https://doi.org/10.1016/j.jpainsymman.2026.04.002
  26. Int Urogynecol J. 2026 Apr 17.
       INTRODUCTION AND HYPOTHESIS: High-quality patient education materials are essential in urogynecology. We hypothesized that patient handouts generated by different large language models (LLMs) would vary in quality and readability and would differ from an established society-produced leaflet.
     METHODS: Twelve leaflets on bladder training and pelvic floor muscle therapy, from six origins (GPT-4, Gemini-2.5 Pro, Sonnet-4, Llama-4, Perplexity, and the International Urogynecological Association (IUGA)), were produced or obtained and standardized into plain text. Three blinded reviewers assessed completeness, information quality (DISCERN), and the Patient Education Materials Assessment Tool (PEMAT-A: actionability; PEMAT-U: understandability). The statistical plan included ordinary least squares fixed-effects models per metric with type II analysis of variance for source effects; estimated marginal means with Holm-adjusted pairwise comparisons; and a crossed mixed-effects model for topic groups. Inter-rater reliability was also measured. Readability and text analyses used standard indices.
     RESULTS: Origins varied in completeness (p < 0.001), DISCERN (p < 0.001), and PEMAT-A (p = 0.0018); PEMAT-U showed a trend (p = 0.063). Llama-4 scored significantly lower on completeness and DISCERN, and lower than GPT-4, IUGA, and Perplexity on PEMAT-A; Sonnet-4 outperformed Llama-4 on PEMAT-U. No single origin dominated all metrics. Readability varied greatly: GPT-4 had an average Flesch-Kincaid grade level ≈ 6.6 and Gemini ≈ 7.4, while Sonnet-4 ≈ 15 and Llama-4 ≈ 17. IUGA leaflets were the longest, with grade levels around 9-10. Bladder-training materials were modestly more complete than pelvic floor muscle materials (p = 0.045). Inter-rater reliability was high (ICC ≥ 0.87).
    CONCLUSIONS: Patient education quality varies substantially across AI tools and compared with society materials. AI-generated content can meet readability targets but requires expert review to ensure completeness and reliability before clinical use.
    Keywords:  Actionability; Artificial intelligence; Patient education; Readability; Urogynecology
    DOI:  https://doi.org/10.1007/s00192-026-06660-1
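     PEMAT scoring, used in this entry and several others above, reduces to the percentage of applicable items rated "agree," with 70% the commonly cited adequacy benchmark. A minimal sketch with hypothetical item ratings:

        # PEMAT-style domain score: Agree = 1, Disagree = 0, Not Applicable = None.
        # The score is the percentage of applicable items rated Agree.
        def pemat_score(items):
            applicable = [i for i in items if i is not None]
            return 100 * sum(applicable) / len(applicable)

        understandability = [1, 1, 0, 1, None, 1, 1, 0, 1, 1]  # hypothetical ratings
        print(f"{pemat_score(understandability):.1f}%")  # 77.8% (above the 70% benchmark)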
  27. J Voice. 2026 Apr 11. pii: S0892-1997(26)00117-7. [Epub ahead of print]
       PURPOSE: Online patient education materials (PEMs) for spasmodic dysphonia (SD) often exceed recommended readability levels, creating barriers to patient comprehension. This cross-sectional study examined (a) the current readability of SD-related PEMs, (b) whether the artificial intelligence (AI) tool ChatGPT-4o mini improves readability when instructed to write at a sixth-grade level, and (c) the extent to which essential clinical information is preserved after AI-based revision.
    METHOD: Fourteen SD-related websites were identified through a systematic Google search, yielding 38 PEMs addressing three core domains: What is SD?, What causes SD?, and How is SD treated? For each PEM, corresponding sixth-grade level responses were generated using ChatGPT-4o mini. Original and AI-revised texts were analyzed using four established readability indices (Flesch-Kincaid Grade Level, Flesch Reading Ease, Gunning Fog Index, and SMOG). Paired-samples t-tests assessed differences in readability. A structured content fidelity analysis was then performed at the statement level to evaluate specificity reduction (SR), dropped content (DC), added content (AC), and overall fidelity using F/P/X classification thresholds.
    RESULTS: Original PEMs required reading levels ranging from approximately 10th grade to early college. ChatGPT-4o mini significantly improved readability across all indices (P <0.001; large effect sizes). Despite readability gains, fidelity analysis of 407 statement pairs revealed notable information loss: 57% of changes were DC, 29% were SR, and 5% were AC. Overall, 21% of PEMs were fully preserved (F), 50% were partially preserved (P), and 29% were not preserved (X), primarily due to omission of clinical detail or reduced specificity.
    CONCLUSION: ChatGPT-4o mini substantially improves the readability of SD-related PEMs but often at the cost of reduced clinical specificity or loss of essential information. AI tools may assist in producing accessible drafts, but expert oversight remains necessary to ensure accuracy and completeness. Future research should examine how AI-revised materials affect patient comprehension and decision-making.
    Keywords:  Artificial intelligence; ChatGPT; Health literacy; Patient education; Readability; Spasmodic dysphonia
    DOI:  https://doi.org/10.1016/j.jvoice.2026.02.052
  28. Knee. 2026 Apr 11. pii: S0968-0160(26)00124-9. [Epub ahead of print] 61: 104445
       BACKGROUND: Artificial intelligence (AI) tools are increasingly used to support healthcare communication. Total knee arthroplasty (TKA) is a common orthopedic procedure, and many patients seek perioperative information online; however, the accuracy, clinical applicability, and readability of AI-generated responses remain unclear. This study compared responses generated by two large language models (ChatGPT-5, OpenAI; Gemini Advanced v2.5, Google) to frequently asked patient questions related to TKA.
    METHODS: A question pool was developed using Google Trends and major patient information portals. The 10 most frequently searched TKA-related questions were selected through expert review and submitted once to each model under standardized conditions. Responses were anonymized and independently evaluated by 10 board-certified orthopedic surgeons using a five-point Likert scale for medical accuracy and clinical applicability. Readability was assessed using six established indices. Paired comparisons were performed using the Wilcoxon signed-rank test, and inter-rater reliability was assessed using the intraclass correlation coefficient.
    RESULTS: Both models received moderate-to-high expert ratings, with no significant differences in accuracy or clinical applicability (all P > 0.05). Expert agreement varied across topics. Gemini Advanced generated slightly less complex text on several readability indices, whereas other measures were comparable. All responses fell within a secondary-school readability range.
    CONCLUSION: ChatGPT-5 and Gemini Advanced v2.5 demonstrated comparable performance in accuracy and clinical applicability for TKA-related patient questions. Although Gemini Advanced produced marginally simpler text, the differences were small and unlikely to be clinically meaningful. These tools should be used as supervised adjuncts rather than standalone sources of patient guidance.
    Keywords:  Artificial intelligence; ChatGPT-5; Gemini advanced; Patient education; Total knee arthroplasty
    DOI:  https://doi.org/10.1016/j.knee.2026.104445
  29. J Surg Res. 2026 Apr 15. pii: S0022-4804(26)00191-5. [Epub ahead of print] 322: 317-325
       INTRODUCTION: Informed consent requires patients to fully comprehend the risks, benefits, and alternatives of an intervention. The American Medical Association and the National Institutes of Health recommend that patient-facing materials be written at a sixth-grade level or lower. We evaluated the baseline readability of informed consent forms used within the endocrine surgery division of a tertiary care center and determined whether rewriting them with a large language model (LLM)-based chatbot can bring the text to the recommended level while preserving fidelity.
    METHODS: Eight consent forms (two institutional procedural forms and six prospective trial documents) underwent readability assessment. Each form was processed by the LLM in two separate, independent sessions. Pre- and postedit readability scores were compared. Three independent reviewers assessed content fidelity by calculating precision, recall, and F1 scores (harmonic mean balancing precision and recall). Inter-rater reliability was evaluated using the intraclass correlation coefficient.
    RESULTS: Original forms averaged 14.1 ± 1.3 grade levels. First LLM revision significantly improved readability to an 8.8 ± 1.2 grade level (P < 0.01), a five-grade reduction. Second LLM revision showed no further improvement (9.9 ± 1.2; P = 0.87). The mean F1 score was 0.71 ± 0.26, with high precision (0.95 ± 0.06) but lower recall (0.62 ± 0.16), indicating few hallucinations but frequent content omissions. Greater reductions in reading level were significantly associated with decreased content fidelity (r = 0.73, P < 0.01). Inter-rater agreement was excellent (K = 0.99, P < 0.01).
    CONCLUSIONS: LLM-based editing significantly improved consent form readability but resulted in substantial content omissions. These findings demonstrate LLM's potential for advancing health literacy while highlighting the critical need for human review to ensure completeness and fidelity.
    Keywords:  Health literacy; Informed consent; Large language models; Readability
    DOI:  https://doi.org/10.1016/j.jss.2026.03.079
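     The fidelity metrics in this study follow the standard precision/recall/F1 definitions applied to statements: precision penalizes added (hallucinated) content, recall penalizes omissions, and F1 is their harmonic mean. A sketch with hypothetical statement counts chosen to mirror the high-precision, lower-recall pattern reported:

        # Content fidelity as precision / recall / F1 over statements.
        # Counts are hypothetical.
        def fidelity(preserved, added, omitted):
            precision = preserved / (preserved + added)
            recall = preserved / (preserved + omitted)
            f1 = 2 * precision * recall / (precision + recall)
            return precision, recall, f1

        p, r, f1 = fidelity(preserved=31, added=1, omitted=19)
        print(f"precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")
        # precision=0.97, recall=0.62, F1=0.76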
  30. Res Synth Methods. 2026 May;17(3): 538-556
      Systematic reviews are often characterized as being inherently replicable, but several studies have challenged this claim. The objective of the study was to investigate the variation in results following independent replication of literature searches and meta-analyses of systematic reviews. We included 10 systematic reviews of the effects of health interventions published in November 2020. Two information specialists repeated the original database search strategies. Two experienced review authors screened full-text articles, extracted data, and calculated the results for the first reported meta-analysis. All replicators were initially blinded to the results of the original review. A meta-analysis was considered not 'fully replicable' if the original and replicated summary estimate or confidence interval width differed by more than 10%, and meaningfully different if there was a difference in the direction or statistical significance. The difference between the number of records retrieved by the original reviewers and the information specialists exceeded 10% in 25/43 (58%) searches for the first replicator and 21/43 (49%) searches for the second. Eight meta-analyses (80%, 95% CI: 49-96) were initially classified as not fully replicable. After screening and data discrepancies were addressed, the number of meta-analyses classified as not fully replicable decreased to five (50%, 95% CI: 24-76). Differences were classified as meaningful in one blinded replication (10%, 95% CI: 1-40) and none of the unblinded replications (0%, 95% CI: 0-28). The results of systematic review processes were not always consistent when their reported methods were repeated. However, these inconsistencies seldom affected summary estimates from meta-analyses in a meaningful way.
    Keywords:  literature search; meta-analysis; meta-research; replication; reproducibility; systematic review
    DOI:  https://doi.org/10.1017/rsm.2025.10064
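     The study's replicability criterion can be written as a small predicate: a meta-analysis fails the "fully replicable" test when the replicated summary estimate or confidence interval width differs from the original by more than 10%. A sketch assuming the 10% is taken relative to the original values:

        # 'Fully replicable' check: replicated summary estimate and CI width
        # must each be within 10% of the original. Treating the 10% as relative
        # to the original values is an assumption of this sketch.
        def fully_replicable(orig_est, rep_est, orig_width, rep_width, tol=0.10):
            est_ok = abs(rep_est - orig_est) <= tol * abs(orig_est)
            width_ok = abs(rep_width - orig_width) <= tol * orig_width
            return est_ok and width_ok

        # e.g., original OR 1.50 (CI width 0.80) vs. replication OR 1.72 (CI width 0.85)
        print(fully_replicable(1.50, 1.72, 0.80, 0.85))  # False: estimate differs by ~15%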
  31. Spine Surg Relat Res. 2026 Mar 27. 10(2): 204-210
       Introduction: Patients frequently use internet-based resources to seek information. Endoscopic spine surgery is extensively marketed on the internet, with purported benefits over traditional open techniques. Previous literature has recommended that the readability of patient education materials (PEM) should not exceed the 6th-grade reading level to optimize health literacy. This study aims to evaluate the readability of online PEMs concerning endoscopic spine surgery.
    Methods: A Google search query was performed using the term "Endoscopic spine surgery patient information." The first 25 websites meeting study inclusion criteria were analyzed for readability using Flesch-Kincaid, average reading level consensus, Gunning Fog, Coleman-Liau, SMOG, and Linsear Write indices. Descriptive statistics were reported.
     Results: The mean average reading level consensus score was 12.8 (1.68). The mean Flesch-Kincaid Reading Ease score was 37.6 (11.1). The mean Gunning Fog score was 14.7 (1.92), Flesch-Kincaid grade level 12.2 (2.56), Coleman-Liau 14.0 (1.82), SMOG 11.6 (2.07), Automated Readability Index 12.9 (3.11), and Linsear Write 11.8 (2.21). None of the 25 included PEMs was below the recommended sixth-grade reading level. Four of the PEMs were considered General Health Information, and 21 were considered Clinical Practice. No differences were found between Clinical Practice and General Health Information websites (p>0.05).
    Conclusions: Creating appropriate PEMs is integral to achieving optimal health literacy. The current readability of the most accessible PEMs related to endoscopic spine surgery is inadequate. As it stands, many patients may not appropriately comprehend the description of their anticipated surgery.
    Keywords:  endoscopic spine surgery; online health information; patient education material; readability
    DOI:  https://doi.org/10.22603/ssrr.2025-0273
  32. Front Digit Health. 2026; 8: 1699285
     Background: Gastric adenocarcinoma (gastric cancer) typically has a poor prognosis. Because patients increasingly rely on online health information, the objective of this study was to assess the quality, understandability, actionability, and comprehensiveness of online resources for patients diagnosed with gastric cancer.
    Methods: A systematic search using the term "stomach cancer" was conducted across three search engines (Google, Yahoo, and Bing) on three different browsers (Safari, Google Chrome, and Microsoft Edge) on 12/13/2024, with the top fifty websites recorded for each combination. Duplicates were removed and inclusion/exclusion criteria were applied. Quality was evaluated using the DISCERN instrument. The PEMAT-P was used to evaluate understandability and actionability. Readability was evaluated with the Flesch-Kincaid Reading Ease algorithm. Comprehensiveness was evaluated with author-generated criteria based on national guidelines. Scores for each assessed metric were determined by two independent reviewers for each website and recorded, with any inter-reviewer discrepancies resolved by consensus. Statistical analysis was performed to compare results by website affiliation (academic, foundation, or government) and search rank.
    Results: Thirty-seven websites were evaluated (N = 17 academic, N = 13 foundation, and N = 7 government). The mean quality score (DISCERN) was 3.62 (SD 1.21), with no significant differences across affiliations or search positions. Thirty-five of the 37 evaluated websites achieved an understandability (PEMAT-P) score above the recommended threshold of 70% (mean 78.38%, SD 11.86%), and 14 websites exceeded the threshold for actionability (mean 57.66%, SD 37.69%), with no significant differences across affiliations or search positions. Readability (Flesch-Kincaid) averaged a 10th-12th grade level, with a mean score of 51.88 (SD 8.93). Mean comprehensiveness was 62.98% (SD 23.23%) across all websites, without significant differences across affiliations or search positions; over 85% of websites addressed epidemiology, risk factors, and symptomatology, but under 30% included content on post-treatment complications or surveillance.
    Conclusions: While most online resources for gastric cancer provided understandable information, they lacked actionability, were written above recommended reading levels, and offered limited content on long-term management. These shortcomings reflect broader trends seen across other patient resources and highlight the need for more actionable, readable, and comprehensive online patient education materials.
    Keywords:  cancer; gastric adenocarcinoma; gastroenterology; oncology; online patient education materials; public health; screening; stomach cancer
    DOI:  https://doi.org/10.3389/fdgth.2026.1699285
  33. J Surg Res. 2026 Apr 13. pii: S0022-4804(26)00203-9. [Epub ahead of print] 322: 196-204
       INTRODUCTION: Thyroid cancer treatment is complex. Patients may turn to the internet to gain insights into their condition. The quality of this information, especially for non-English speakers, is unclear. This study examines the quality and accuracy of online resources for both English and non-English speaking patients with thyroid cancer.
    METHODS: Three search engines were queried in October 2023 using thyroid cancer care terms in English, Spanish, and Chinese. Ninety-six websites per language were identified. Duplicate and non-accessible websites were excluded. Websites were categorized by origin (U.S.-based or foreign). Quality was assessed using the JAMA criteria (0-4) and DISCERN tool (1-5) by two independent language-fluent reviewers. Scores were averaged to determine the final score. Accuracy was assessed using six management recommendations from the 2015 American Thyroid Association (ATA) differentiated thyroid cancer guidelines. Categorical variables and mean scores were compared using χ2 analysis and one-way analysis of variance, respectively.
    RESULTS: 62 English, 37 Spanish, and 52 Chinese websites were evaluated. English and Spanish websites more commonly originated from a US source (80.6% and 53.8%, respectively) compared to Chinese websites (21.2%, P < 0.001). Mean JAMA scores for English-, Spanish-, and Chinese-language websites were 2.49 ± 1.30, 2.45 ± 1.36, and 1.33 ± 0.74, respectively (P < 0.001); mean DISCERN scores were 3.58 ± 0.57, 3.39 ± 0.58, and 2.57 ± 0.41, respectively (P < 0.001). Only six English websites and one Spanish website reported all examined ATA treatment recommendations.
    CONCLUSIONS: The quality of online thyroid cancer treatment information is generally poor. Chinese-language websites had lower quality scores and fewer up-to-date recommendations compared to English and Spanish websites. Enhancing online information, especially in non-English languages, presents a significant opportunity.
    Keywords:  Low-risk thyroid cancer; Multilingual health resources; Online information quality; Patient resources; Thyroid cancer treatment
    DOI:  https://doi.org/10.1016/j.jss.2026.03.092
  34. Spinal Cord. 2026 Apr 16.
       STUDY DESIGN: Cross-sectional, observational, and descriptive study.
    OBJECTIVES: To analyze globally accessible videos and user comments retrieved via the YouTube Data API using the keyword "spinal cord injury."
    SETTING: Publicly available YouTube videos and comments.
    METHODS: A total of 588 videos uploaded to YouTube over the past 15 years were screened. After exclusion of non-English content, disabled comments, duplicates, and irrelevant material, the 100 most-viewed videos were included. Video-level metadata and 15,619 user comments were extracted. Using deductive qualitative content analysis, two independent reviewers categorized videos into seven domains: general information, personal experience, daily living activities, treatment trials, exercise demonstrations, rehabilitation center presentations, and others. For text preprocessing and analysis, Python libraries (NLTK, TextBlob, WordCloud) were applied. Sentiment analysis was conducted using the Valence Aware Dictionary and Sentiment Reasoner (VADER). Descriptive statistics and sentiment trends were evaluated using IBM SPSS Statistics, Version 29.0.
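    A minimal sketch of the VADER scoring step, assuming NLTK's bundled implementation; the sample comment is invented, and the labels follow VADER's documented compound-score convention:

        # Minimal sketch of VADER sentiment scoring as described above, via
        # NLTK (pip install nltk); the sample comment is invented.
        import nltk
        from nltk.sentiment.vader import SentimentIntensityAnalyzer

        nltk.download("vader_lexicon")  # one-time lexicon download
        analyzer = SentimentIntensityAnalyzer()

        comment = "This rehab video gave me real hope after my injury."
        scores = analyzer.polarity_scores(comment)

        # VADER's documented convention: compound >= 0.05 positive,
        # <= -0.05 negative, otherwise neutral.
        label = ("positive" if scores["compound"] >= 0.05
                 else "negative" if scores["compound"] <= -0.05
                 else "neutral")
        print(scores, label)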
    RESULTS: Most YouTube videos on spinal cord injury focused on general information and personal experiences. A significant rise in video numbers and engagement occurred between 2010 and 2020, followed by a decline after 2021. Positive sentiments predominated but declined over time, while neutral comments increased. Negative sentiments remained consistently low throughout all periods.
    CONCLUSION: User engagement with YouTube content on spinal cord injury is influenced by social and global factors, with content largely centered on narratives and general information. Future research should broaden to multiple platforms and incorporate demographic and geographic factors to guide effective digital health communication strategies.
    DOI:  https://doi.org/10.1038/s41393-026-01200-6
  35. J Parkinsons Dis. 2026 Apr 15. 1877718X261438620
     Background: Parkinson's disease (PD) is the most common movement disorder, and patients increasingly use YouTube to obtain health-related information.
    Objective: This study aimed to assess the content quality and informational reliability of YouTube videos on PD exercises.
    Methods: A total of 150 English-language YouTube videos were screened using the search terms "Parkinson exercises", "Parkinson physiotherapy exercises", and "Parkinson home exercise program". For each video, the source, upload date, and numbers of views, likes, dislikes, and comments were recorded. The Video Power Index (VPI) was assessed using the view ratio (views/day) and like ratio (likes × 100 / [likes + dislikes]). The clinical quality, reliability, and educational value of PD-specific exercise videos were assessed using the Global Quality Scale (GQS), modified DISCERN (mDISCERN), and guideline-based criteria derived from the European Physiotherapy Guideline for Parkinson's Disease (PD-GEC).
    Results: A total of 29 videos met the inclusion criteria and were analyzed. Videos explaining how and why exercises were performed demonstrated higher mDISCERN and GQS scores, while providing repetition, duration, and intensity information was associated with higher GQS scores but not mDISCERN scores (p = 0.080); no differences were observed for disease specificity, functional linkage, or safety warnings (all p > 0.05). PD-GEC scores were not significantly related to video engagement metrics.
    Conclusion: Higher-quality videos tended to provide clear explanations of exercise rationale and dosage, while guideline-based clinical features, including PD-GEC criteria, were not associated with viewer engagement.
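    The engagement ratios defined in the Methods reduce to one-line formulas; in this minimal sketch the final combination into a single VPI is an assumption (one commonly used formulation), and the input numbers are invented:

        # Sketch of the engagement ratios described above; combining them into
        # a single VPI as (like ratio x view ratio) / 100 is an assumed convention.
        def view_ratio(views: int, days_online: int) -> float:
            return views / days_online               # views per day

        def like_ratio(likes: int, dislikes: int) -> float:
            return likes * 100 / (likes + dislikes)  # percentage of likes

        def video_power_index(views, days_online, likes, dislikes) -> float:
            return like_ratio(likes, dislikes) * view_ratio(views, days_online) / 100

        print(video_power_index(views=12_000, days_online=300, likes=450, dislikes=50))
        # -> 36.0 with these invented numbers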
    Keywords:  Parkinson; exercise; health information; social media; video database
    DOI:  https://doi.org/10.1177/1877718X261438620
  36. World J Urol. 2026 Apr 11. pii: 297. [Epub ahead of print] 44(1)
       PURPOSE: To assess the quality and engagement of bladder exstrophy content on YouTube using validated assessment tools, with the goal of identifying gaps in accessible, high-quality online education. These findings also have broader implications for the interplay of online information and rare conditions.
    METHODS: We reviewed the first 300 YouTube videos relevant to the term "bladder exstrophy" published prior to March 2025. Video quality was assessed using the Patient Education Materials Assessment Tool (PEMAT) and JAMA benchmark criteria. Engagement (views, likes, comments), presenter credentials, and targeted audience were extracted. Spearman correlation coefficients were used to examine the relationship between viewer engagement and quality scores.
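    As a sketch of the correlation step, assuming SciPy and invented engagement and quality values:

        # Minimal sketch of the Spearman correlation step, with invented data
        # (pip install scipy); illustrative only.
        from scipy.stats import spearmanr

        views        = [204, 1500, 89, 3200, 410, 52]   # hypothetical view counts
        pemat_scores = [89,  75,   92, 60,   85,  70]   # hypothetical PEMAT-V scores

        rho, p_value = spearmanr(views, pemat_scores)
        print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")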
    RESULTS: 133 videos met inclusion criteria. Academic institutions produced 60% of videos. The median video length was 5 min, with a median view count of 204. Patient-created videos had the highest Video Power Index (VPI), although the difference was not statistically significant (p = 0.11). Median understandability (PEMAT-V) was high (89), while actionability (PEMAT-A) was low (0). Academic videos had significantly higher PEMAT-V (p = 0.002) and JAMA benchmark scores (p < 0.001) than patient- and layperson-created content. No significant correlations were found between video quality scores and user engagement metrics.
    CONCLUSION: Content on bladder exstrophy garners limited viewership and engagement but remains an important educational resource. Videos produced by academic institutions demonstrated the highest quality scores but lacked actionable guidance. Higher quality was not associated with increased engagement, highlighting a critical gap in the availability of educational, user-friendly videos that support both understanding and anticipatory guidance for the management of bladder exstrophy.
    Keywords:  Bladder exstrophy; Pediatric urology; Social media
    DOI:  https://doi.org/10.1007/s00345-026-06229-z
  37. Digit Health. 2026 Jan-Dec; 12: 20552076261443224
     Background: Frailty is among the most prevalent geriatric syndromes, and social media has become a pivotal channel for retrieving health information.
    Objective: This study aimed to investigate the quality of frailty-related videos on major Chinese social media platforms and to examine the correlation between video quality and user engagement.
    Methods: We collected frailty-related videos from TikTok, Bilibili, and Xiaohongshu and documented the general characteristics, uploader information, and content features of each video. The quality of each video was evaluated with the Journal of the American Medical Association (JAMA) benchmark criteria, the Global Quality Score (GQS), modified DISCERN (mDISCERN), and the Patient Education Materials Assessment Tool for Audiovisual Materials (PEMAT-A/V).
    Results: We examined 126 videos in the current study. Overall, video quality was unsatisfactory, with a mean JAMA score of 1.1 (SD=0.8), GQS of 2.8 (SD=1.0), mDISCERN of 3.0 (SD=0.8), PEMAT understandability of 76.5% (SD=15.6%), and PEMAT actionability of 49.7% (SD=40.2%). Among the platforms, Bilibili had the highest video quality and Xiaohongshu the lowest. Videos produced by organizations, non-profit groups, medical personnel, and certified authors, and videos in expert-monologue or question-and-answer formats, were of higher quality. The correlation between video quality and user engagement metrics was negligible.
    Conclusions: The quality of videos on these social media platforms remains inadequate, offering limited utility to users; viewers often cannot determine whether video content is valid. Uploaders should optimize video quality, and platform oversight should be strengthened, to improve public health literacy and raise awareness.
    Keywords:  Bilibili; TikTok; Xiaohongshu; frailty; social media; video quality
    DOI:  https://doi.org/10.1177/20552076261443224
  38. Thorac Cancer. 2026 Apr;17(8): e70273
     BACKGROUND: Short-form videos have become a common source of cancer information for Chinese patients and caregivers. We evaluated the content, quality, and reliability of lung cancer treatment videos on TikTok, Bilibili, and Kwai and generated evidence-based recommendations for cancer health education.
    METHODS: We conducted a cross-sectional study of lung cancer treatment short videos on TikTok, Bilibili, and Kwai. The top 200 most-liked videos per platform posted between January 1, 2020 and October 30, 2025, were retrieved on November 1, 2025. After screening, 300 videos (100 per platform) were analyzed. Two oncologists rated quality using GQS (1-5) and DISCERN (1-5); creator identity was classified. Comment sentiment (SnowNLP) and engagement metrics were analyzed.
    RESULTS: TikTok had the highest engagement and quality (GQS 3.0, DISCERN 3.0) versus Bilibili/Kwai (2.0) (p < 0.001). Professionals achieved the highest quality (GQS 3.0) versus institutions (1.0) (p < 0.001). However, absolute quality was low across all platforms: only 6% of videos met high-quality criteria (GQS ≥ 4), and 5% met DISCERN ≥ 4. Engagement showed a weak negative correlation with quality (ρ = -0.13 to -0.21).
    CONCLUSIONS: Overall quality is low; professional content is more reliable but less viral. Embedding quality indicators in algorithms and promoting certified creators could improve patient cancer education.
    Keywords:  TikTok; health education; lung cancer; social media; video quality
    DOI:  https://doi.org/10.1111/1759-7714.70273
  39. Transplant Proc. 2026 Apr 11. pii: S0041-1345(26)00172-7. [Epub ahead of print]
       BACKGROUND: The rise of digital media has increased public use of social platforms for health information. While short-form videos hold potential for disseminating knowledge on kidney transplantation, their variable quality is a major concern. This study aimed to analyze the content and quality of kidney transplantation-related videos on major short video sharing platforms.
    METHODS: Between April 28, 2025 and May 02, 2025, using Chinese search terms for "kidney transplantation," we collected 263 relevant videos from WeChat, TikTok, and Bilibili. Two independent researchers evaluated video content and quality using the Journal of the American Medical Association (JAMA) benchmark criteria, Global Quality Scale (GQS), modified DISCERN (mDISCERN), and Patient Education Materials Assessment Tool (PEMAT).
    RESULTS: Healthcare professionals were the primary uploaders (112/263, 42.6%), and disease knowledge constituted the predominant content focus (161/263, 61.2%). Videos sourced from WeChat exhibited higher overall quality than those from TikTok or Bilibili. Videos uploaded by patients garnered significantly more likes and comments (all P < .001). Content featuring personal patient experiences also attracted significantly more likes and comments (all P < .001), whereas disease knowledge content was shared more frequently (P < .001). Patient vlogs received significantly more likes and comments, while dialogue formats were shared more often (P < .001). Positive correlations were observed between engagement variables (likes, comments, favorites, shares) and followers (all P < .001), while a negative correlation existed with video duration (all P < .005). Shares and followers positively correlated with video quality (P < .001 and P < .01, respectively).
    CONCLUSION: Although numerous kidney transplantation-related videos are available on short video platforms, their quality and reliability vary considerably and require significant improvement.
    DOI:  https://doi.org/10.1016/j.transproceed.2026.03.008
  40. Inquiry. 2026 Jan-Dec; 63: 469580261441434
      Hepatitis B is a significant global health concern and poses a substantial burden on public health systems. Short video platforms such as TikTok and Bilibili have become important channels for health information dissemination. However, the quality and reliability of Hepatitis B-related content on these platforms remain unclear. The objective of our research is to evaluate the quality of information regarding Hepatitis B disseminated on the TikTok and Bilibili short video platforms. On April 1, 2025, we systematically collected the top 100 Hepatitis B-related short videos from TikTok and Bilibili, totaling 200 videos. Basic video information was extracted, and video quality and reliability were assessed using the Global Quality Scale (GQS), modified DISCERN (mDISCERN), and JAMA benchmarks. Spearman correlation analysis was performed to examine the relationship between engagement metrics and quality scores. TikTok videos demonstrated greater user engagement, as evidenced by higher metrics for likes, comments, and shares, and also achieved superior reliability scores compared to Bilibili. Specifically, the median reliability scores for TikTok videos were mDISCERN: 4 (3-4) and JAMA: 3 (3-3), whereas for Bilibili videos, these scores were mDISCERN: 3 (3-4) and JAMA: 2 (2-3). In terms of content quality, as assessed by the GQS, both platforms exhibited similar levels (TikTok: 4 [3-4], Bilibili: 4 [3-4]). Additionally, videos uploaded by hepatologists consistently showed higher quality and reliability. Spearman correlation analysis indicated significant but weak positive correlations between engagement metrics (likes, comments, shares, saves) and both GQS and JAMA scores; however, no significant correlation was observed with mDISCERN scores. The overall quality and reliability of Hepatitis B-related short videos were moderate, with TikTok videos outperforming Bilibili videos in reliability. Videos created by hepatologists demonstrated higher quality and reliability. We recommend that the public exercise caution when consuming health information from short videos to avoid potential misinformation.
    Keywords:  Bilibili; Hepatitis B; TikTok; health information; short videos
    DOI:  https://doi.org/10.1177/00469580261441434
  41. Br J Anaesth. 2026 Apr 15. pii: S0007-0912(26)00142-X. [Epub ahead of print]
      
    Keywords:  health information quality; information-seeking behaviour; online health information; patient education; regional anaesthesia
    DOI:  https://doi.org/10.1016/j.bja.2026.03.002
  42. Eur Arch Paediatr Dent. 2026 Apr 17.
       PURPOSE: As families increasingly rely on digital platforms to understand their child's dental diagnosis, concerns have emerged about whether online resources for molar incisor hypomineralisation (MIH) are fit for purpose. This study aimed to evaluate the readability, quality and actionability of online MIH information across different platforms using search terms generated by families.
    METHODS: A cross-sectional study was conducted using three independent search terms across multiple search engines, audiovisual platforms and social media. The first 100 results per platform were screened and analysed using validated quality and readability tools. Descriptive content analysis was used to quantify responses and identify themes, and inter-rater reliability was calculated. Descriptive and inferential statistics summarised platform differences.
    RESULTS: Of 2100 screened results, only 45 (2%) met the inclusion criteria. Written content was often professional, but exceeded recommended reading levels, with few meeting accepted quality benchmarks. Search engine results were dominated by academic articles and paywalled journals. YouTube videos showed modest quality but limited clinical depth, whilst TikTok and social media posts showed poor transparency and limited actionability.
    CONCLUSION: Despite high search volumes, online MIH resources remain fragmented, inaccessible and poorly tailored to family needs. Findings highlight an urgent need for discoverable, family-centred digital content and improved support for digital health literacy.
    Keywords:  Content analysis; Digital health literacy; MIH; Molar incisor hypomineralisation; Online health information; Patient education materials
    DOI:  https://doi.org/10.1007/s40368-026-01208-9
  43. J Med Libr Assoc. 2026 Apr 01. 114(2): 151-158
       Background: This case report details an exploratory instructional session for dental students led by librarian-instructors at the University at Buffalo. Using historical source materials from the Robert L. Brown History of Medicine Collection, an hour-long session was developed to introduce year-one dental students to the history of their profession and its ongoing collaboration with important clinical populations.
    Case Presentation: At the request of course faculty, the University at Buffalo's Dental Liaison Librarian, History of Medicine Curator, and History of Medicine Archivist were invited to develop and lead a session on the history of dentistry for a first-year course, Profession, Practice, and Community Dentistry (PDO 801). A core feature of this course is the introduction of students to eight underserved dental patient populations, referred to as "communities of focus." To supplement student learning, library staff utilized the holdings of the Robert L. Brown History of Medicine Collection to bring together stories, artifacts, and printed materials that spoke not only to the history of the profession, but also to the history of the communities of focus. Thought prompts were developed to guide students through a textual analysis activity built around representative materials.
    Conclusions: Overall, this interdisciplinary collaboration provided the opportunity to develop and implement a syllabus-informed historical instructional session that offered targeted insights into dentistry's past. Through guided discussions, hands-on exploration, and textual analysis of historic materials, instructors worked to inspire and educate participating dental students as they progress further along their path as providers of patient-forward care.
    Keywords:  Dental history; Dental students; Interdisciplinary collaboration; Medical humanities
    DOI:  https://doi.org/10.5195/jmla.2026.2287
  44. IEEE Trans Image Process. 2026 Apr 14. PP
    User-generated visual content (UGC) now occupies a significant fraction of internet traffic, and billions of UGC videos and pictures are uploaded daily. Among these, short-form video content now accounts for most of the videos consumed by online users. Given the popularity of short-form UGC content, being able to control the perceptual quality of UGC videos has emerged as an important problem. Visual UGC is subject to myriad types, severities, and combinations of distortions. While UGC video quality has been closely studied, the quality and legibility of text that is overlaid on or embedded in short-form UGC videos has received relatively little attention. However, being able to accurately predict text quality in images is important, since it affects both the overall perception of the content it is embedded in and the messages being conveyed. It is also beneficial for applications involving image or video text recognition, which can affect visual search and content identification. Analyzing the quality of text embedded in pictures or videos is a hard problem, since its perception is commingled with the surrounding visual content. Our work, which greatly extends our early report on text legibility prediction, contributes both to the psychophysics of embedded text quality and to computational models of its perception. We have created two subjective datasets: the LIVE-COCO Text Legibility (LIVE-COCO-TL) Database (a modification of COCO-Text) and the LIVE-YouTube Text-in-Video Quality (LIVE-YT-TVQ) Database. LIVE-COCO-TL contains 74,440 text patches with legibility annotations, while LIVE-YT-TVQ contains ∼19K subjective quality ratings on 405 videos and 641 text patches extracted from them. We build models that predict embedded or overlaid text legibility and text quality, as well as a multi-task model that simultaneously predicts both the overall quality of videos with embedded or overlaid text and the local text quality. We are making the databases and all models freely available at https://live.ece.utexas.edu/research/LIVE_YouTube_Text_Quality_Assessment/index.html.
    DOI:  https://doi.org/10.1109/TIP.2026.3682131
  45. J Med Libr Assoc. 2026 Apr 01. 114(2): 169-170
    Circulating Now, the history of medicine blog of the National Library of Medicine (NLM), highlights blog posts written by community contributors. To evaluate the community represented within the blog, the project team explored how XQuery, a language for querying XML data, could be used to develop a dataset of the institutions represented in the blog. The team used ChatGPT to develop the XQuery script and processed the queries through BaseX. The resulting data were transferred to Excel, where additional data elements, such as geographic location and institutional type, were added manually. From this dataset, the team created visualizations in Tableau showing the more than 400 unique institutions represented worldwide. These visualizations supplemented an internal report for the Circulating Now Editorial Board, illustrating the blog's current engagement reach and areas for possible future collaboration.
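    The extraction idea is not tied to XQuery; as a hypothetical sketch using Python's standard library (the project's actual workflow ran XQuery scripts in BaseX, and the file name and element tag here are invented):

        # Hypothetical sketch of the same extraction idea in Python's standard
        # library; the actual project used XQuery via BaseX, and the file name
        # and element tag below are invented.
        import xml.etree.ElementTree as ET
        from collections import Counter

        tree = ET.parse("circulating_now_export.xml")   # hypothetical export
        institutions = Counter(
            el.text.strip()
            for el in tree.iter("institution")          # assumed element name
            if el.text
        )
        for name, count in institutions.most_common(10):
            print(f"{count:3d}  {name}")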
    Keywords:  Data Analysis; Data Visualization; History of Medicine; National Library of Medicine
    DOI:  https://doi.org/10.5195/jmla.2026.2338