bims-librar Biomed News
on Biomedical librarianship
Issue of 2025-10-26
27 papers selected by
Thomas Krichel, Open Library Society



  1. Palliat Med. 2025 Oct 23. 2692163251381487
       BACKGROUND: Research evidence is fundamental to informing clinical decision-making and advancing palliative care practice. Although academic, peer-reviewed journals underpin evidence-based healthcare, they represent only part of the knowledge landscape. Incorporating grey literature from sources outside traditional academic publishing can: provide context, balance and diverse perspectives; address knowledge gaps; and mitigate publication bias. However, its decentralised and dispersed nature can pose challenges for researchers unfamiliar with its scope and diversity.
    AIM: To present a flexible framework comprising 12 elements to support researchers in systematically identifying and locating grey literature relevant to palliative care across a broad range of sources. The framework accommodates variation in research focus, available resources, and context. Practical guidance is also provided for reporting grey literature searches with the transparency required in systematic reviews.
    METHODS: The framework was developed through expert consensus, informed by the authors' collective experience in systematic review methodology, grey literature searching, and information retrieval. It has been iteratively refined through teaching and real-world review projects. Each included source was assessed for its depth and breadth of palliative care content.
    RESULTS: The 12-element framework supports palliative care researchers in planning and executing searches across a wide range of fit-for-purpose sources. Practical examples are provided alongside a classification of grey literature source types.
    DISCUSSION AND CONCLUSION: This framework offers structured yet adaptable guidance to support more consistent grey literature engagement. Persistent challenges include defining search boundaries, managing duplication, record-keeping, and assessing quality. Future research should explore the framework's utility across diverse review types and palliative care research priorities.
    Keywords:  grey literature; information storage and retrieval; palliative care; systematic reviews as topic
    DOI:  https://doi.org/10.1177/02692163251381487
  2. J Clin Epidemiol. 2025 Oct 17. pii: S0895-4356(25)00351-8. [Epub ahead of print] 112018
       OBJECTIVES: To assess 1) the frequency of overlapping systematic reviews (SRs) on the same topic, including overlap in outcomes, 2) whether SRs meet key methodological characteristics, and 3) discrepancies in results.
    STUDY DESIGN AND SETTING: For this research-on-research study, we gathered a random sample of SRs with meta-analysis (MA) published in 2022, identified the questions they addressed and, for each question, searched all SRs with MA published from 2018 to 2023 to assess the frequency of overlap. We assessed whether SRs met a minimum set of 6 key methodological characteristics: protocol registration, search of major electronic databases, search of trial registries, double selection and extraction, use of the Cochrane Risk-of-Bias tool and GRADE assessment.
    RESULTS: From a sample of 107 SRs with MA published in 2022, we extracted 105 different questions and identified 123 other SRs with MA published from 2018 to 2023. There were overlapping SRs for 33 questions (31.4%, 95% CI: 22.9-41.3), with a median of three overlapping SRs per question (interquartile range 2-6; range 2-19). Of the 230 SRs, 15 (6.5%) met the minimum set of 6 key methodological characteristics, and 12 (11.4%) questions had at least one SR meeting this criterion. Among the 33 questions with overlapping SRs, for 7 (21.2%), the SRs had discrepant results.
    CONCLUSIONS: One-third of the SRs published in 2022 had at least one overlapping SR published from 2018 to 2023, and most did not meet a minimum set of methodological standards. For one-fifth of the questions, overlapping SRs provided discrepant results.
    Keywords:  meta-analyses; methodological quality; redundancy; systematic reviews; waste of research
    DOI:  https://doi.org/10.1016/j.jclinepi.2025.112018
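    For readers checking figures like the 31.4% overlap proportion above (33 of 105 questions, 95% CI 22.9-41.3), the sketch below shows one common way such an interval is obtained. The abstract does not state which interval method the authors used, so an exact Clopper-Pearson interval is shown purely as an illustration and its bounds may differ slightly from the published ones.

      # Illustrative only: an exact (Clopper-Pearson) 95% CI for a proportion such as
      # the 33/105 questions with overlapping SRs reported above. The abstract does not
      # state which interval method the authors used, so the bounds may differ slightly.
      from scipy.stats import beta

      def clopper_pearson(successes: int, trials: int, alpha: float = 0.05):
          """Exact binomial confidence interval based on the beta distribution."""
          lower = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, trials - successes + 1)
          upper = 1.0 if successes == trials else beta.ppf(1 - alpha / 2, successes + 1, trials - successes)
          return lower, upper

      lo, hi = clopper_pearson(33, 105)
      print(f"33/105 = {33/105:.1%}, 95% CI {lo:.1%} to {hi:.1%}")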
  3. J Clin Transl Sci. 2025;9(1): e219
       Introduction: Portfolio-level publication tracking collects research output from related programs. Tracking publications is imperative to evaluate the scholarly impact of a program, synthesize program findings, and document impact to funders. A valid tracking protocol increases data quality for accurate impact assessment, but there is little literature on publication tracking methods appropriate for assessing impact across multiple programs.
    Methods: We tracked, managed, and evaluated publications from the National Institutes of Health-funded Rapid Acceleration of Diagnostics - Underserved Populations, which included over 137 projects and a Coordination and Data Collection Center. During the four-year project, we deployed a quarterly self-report survey to project leads and conducted twice-monthly searches for grant-related publications. Search strategies comprised a simple search of project grant numbers and an enhanced search. We evaluated the sensitivity and positive predictive value of search strategies compared to the surveys.
    Results: Compared to the survey, the simple search was 21.5% to 27.4% sensitive with a positive predictive value between 81.1% and 95.8%. The enhanced search was 62.6% to 68.0% sensitive with a positive predictive value between 76.2% and 96.9%. Response rates declined over time from a maximum of 61.3% to a minimum of 32.8%.
    Conclusions: The enhanced search increased specificity in identifying publications, but the survey was necessary to refine strategies and identify missed products. However, the enhanced search may have relieved participant burden in entering citations. These findings may be valuable for coordinating centers, academic departments, working groups, and other academic entities that must quantify the impact of their publications.
    Keywords:  Publication search strategy; coordinating center evaluation; publication evaluation; publication portfolio tracking; publication tracking
    DOI:  https://doi.org/10.1017/cts.2025.10138
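    The sensitivity and positive predictive value figures above come from comparing each search strategy against survey-reported publications as the reference set. A minimal sketch of that comparison, using entirely hypothetical PMID sets:

      # Minimal sketch, with hypothetical counts: sensitivity and positive predictive
      # value of a bibliographic search strategy evaluated against self-reported
      # publications (the survey) as the reference set, as in the study above.

      def search_performance(search_hits: set, reference: set):
          true_pos = search_hits & reference        # found by the search and confirmed by the survey
          false_neg = reference - search_hits       # survey-reported but missed by the search
          false_pos = search_hits - reference       # retrieved but not confirmed by the reference
          sensitivity = len(true_pos) / len(reference)
          ppv = len(true_pos) / len(search_hits)
          return sensitivity, ppv

      # Hypothetical PMIDs, for illustration only.
      survey_reported = {f"pmid{i}" for i in range(100)}
      enhanced_search = {f"pmid{i}" for i in range(65)} | {"extra1", "extra2", "extra3"}

      sens, ppv = search_performance(enhanced_search, survey_reported)
      print(f"sensitivity = {sens:.1%}, PPV = {ppv:.1%}")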
  4. Med Sci (Basel). 2025 Oct 01. pii: 211. [Epub ahead of print]13(4):
      Background: Artificial intelligence tools are increasingly being used to assist literature reviews, but their effectiveness compared to traditional methods is not well established. This study compares Scopus AI with PubMed keyword searches on the topic of primary prepectoral breast reconstruction after radical mastectomy.
    Methods: On 28 May 2025, two literature searches were conducted on the topic of primary prepectoral breast reconstruction after radical mastectomy: one using Scopus AI and the other using manual keyword searches in PubMed. Both searches were limited to peer-reviewed clinical studies in English, excluding case reports and studies with fewer than 10 patients. Data extracted included study design, sample size, outcomes, and key findings.
    Results: The Scopus AI search retrieved 25 articles, while the traditional method identified 4. After removing duplicates, non-English texts, and non-relevant sources, 17 articles were included in the final analysis. Scopus AI provided automatic summaries, while manual review was required for the traditional method. No overlap was found between the two methods.
    Conclusions: AI tools like Scopus AI can enhance the speed and breadth of literature reviews, but human oversight remains essential to ensure relevance and quality. Combining AI with traditional methods may offer a more balanced and effective approach for clinical research.
    Keywords:  breast implantation; breast reconstruction; generative artificial intelligence; literature review; mammaplasty
    DOI:  https://doi.org/10.3390/medsci13040211
  5. Cureus. 2025 Sep;17(9): e92590
      Objective While Large Language Models (LLMs) show great promise for various medical applications, their black-box nature and the difficulty of reproducing results have been noted as significant challenges. In contrast, conventional text mining is a well-established methodology, yet its mastery remains time-consuming. This study aimed to determine if an LLM could achieve literature analysis outcomes comparable to those from traditional text mining, thereby clarifying both its utility and inherent limitations. Methods We analyzed the abstracts of 5,112 medical papers retrieved from PubMed using the single keyword "text mining." We used Google Gemini 2.5 (Google Inc., Mountain View, CA, USA) and instructed it to extract distinctive words, concepts, trends, and co-occurrence network concepts. These results were then qualitatively compared with those obtained from conventional text mining tools, VOSviewer and KH Coder. Results Google Gemini appeared to conceptually aggregate individual words and identify research trends. The concepts for co-occurrence networks also showed visual similarity to the networks generated by the traditional tools. However, the LLM's analytical output was based on its own unique interpretation and could not be directly compared with the statistically derived co-occurrence patterns. Furthermore, since this study relied on a visual comparison of network diagrams rather than rigorous quantitative analysis, the conclusions remain qualitative. Conclusion Google Gemini indicated an ability to extract keywords, concepts, and trends. A co-occurrence network visually similar to those generated by conventional text mining tools was created. While it showed particular strengths in conceptual summarization and trend detection, its limitations - including its black-box nature, reproducibility challenges, and subjective interpretations - became apparent. With a proper understanding of these constraints, LLMs may serve as a valuable complementary tool, with the potential to accelerate literature analysis in medical research.
    Keywords:  co-occurrence network; large language model; medical literature analysis; pubmed database; text mining
    DOI:  https://doi.org/10.7759/cureus.92590
  6. Int J Dent. 2025;2025: 2677641
       Introduction: Dental implantology has seen rapid technological advancements, with artificial intelligence (AI) increasingly integrated into diagnostic, planning, and surgical processes. The release of chat-generative pretrained transformer (ChatGPT) and its subsequent updates, including the deep research function, presents opportunities for AI-assisted systematic reviews. However, its efficacy compared with traditional manual searching has not been studied.
    Materials and Methods: A systematic review was conducted on May 6, 2025, to evaluate recent innovations in dental implantology and AI. Two parallel searches were performed: one using ChatGPT 4.1's deep research tool in the PubMed database and another manual PubMed search by two independent reviewers. Both searches used identical keywords and Boolean operators targeting studies from 2020 to 2025. Inclusion criteria were peer-reviewed studies related to implant design, osseointegration, guided placement, and other predefined outcomes.
    Results: The manual search identified 124 articles, of which 23 met the inclusion criteria. ChatGPT retrieved 114 articles, selected 13 for inclusion, yet included only 11 in its synthesis. Two of the articles cited by the AI software were nonexistent, and numerous relevant studies were not retrieved, whereas the remaining articles were correct and had also been found by the manual search. ChatGPT had high specificity (98%) and low sensitivity (47.8%), with a statistically significant difference compared to manual search and selection.
    Discussion: AI tools like ChatGPT show promise in literature search, synthesis, and assistance, especially in improving readability and identifying trending topics in science. Nevertheless, the current state of deep research function lacks the reliability required for conducting systematic reviews due to issues such as made-up references and missed articles. The results highlight the need for human supervision and improved safeguards.
    Conclusions: ChatGPT's deep research function can support, but not replace, manual systematic search and selection. It offers substantial benefits in writing support and preliminary synthesis due to acceptable accuracy, but limitations in reliability and low sensitivity (47.8%) require cautious use and transparent reporting of any AI involvement in scientific research.
    Keywords:  ChatGPT; artificial intelligence; deep research; implantology
    DOI:  https://doi.org/10.1155/ijod/2677641
  7. Eur Arch Otorhinolaryngol. 2025 Oct 24.
       OBJECTIVE: This study evaluated the performance of ChatGPT-4o in responding to patient-centered questions concerning auditory brainstem implantation (ABI), with a focus on content quality and readability.
    METHODS: A total of 51 real-world patient questions related to ABI were reviewed and grouped into five thematic categories: diagnosis and candidacy, surgical procedures and complications, device function and mapping, rehabilitation and expected outcomes, and daily life and long-term concerns. Responses were independently assessed by two audiologists and one otologist across four domains (accuracy, comprehensiveness, clarity, and credibility) using a 5-point Likert scale. Readability was evaluated using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) formulas. Kruskal-Wallis and Friedman tests were used to examine statistical differences across question categories and evaluation dimensions.
    RESULTS: ChatGPT-4o achieved consistently high scores across all evaluative domains, with mean values exceeding 4.5. Clarity received the highest average score (4.72). No significant differences were found between thematic categories or across dimensions. However, readability analysis revealed that most responses required college-level reading proficiency (FKGL = 13.3 ± 2), particularly in the domains of diagnosis and surgical content, and were rated as "difficult" according to Flesch Reading Ease scores (FRE < 50).
    CONCLUSION: ChatGPT-4o shows potential as a supportive communication tool in the context of ABI patient education. However, its application in clinical practice remains limited by issues of readability and clinical specificity. Ongoing refinement and medical oversight will be essential to ensure safe and effective integration into healthcare settings.
    Keywords:  Artificial intelligence; Auditory brainstem implant; ChatGPT; Large language models
    DOI:  https://doi.org/10.1007/s00405-025-09789-9
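    Several entries in this issue report Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores, as this study does. A minimal sketch of the two formulas follows; the syllable counter is a naive vowel-group heuristic, so dedicated readability tools will give somewhat different values:

      # Minimal sketch of the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level
      # (FKGL) formulas used in several studies above. The vowel-group syllable counter
      # is a rough heuristic; dedicated tools use dictionaries and give slightly
      # different numbers.
      import re

      def count_syllables(word: str) -> int:
          return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

      def readability(text: str):
          sentences = max(1, len(re.findall(r"[.!?]+", text)))
          words = re.findall(r"[A-Za-z']+", text)
          syllables = sum(count_syllables(w) for w in words)
          wps = len(words) / sentences            # words per sentence
          spw = syllables / len(words)            # syllables per word
          fre = 206.835 - 1.015 * wps - 84.6 * spw
          fkgl = 0.39 * wps + 11.8 * spw - 15.59
          return fre, fkgl

      # Hypothetical patient-facing sample text, for illustration only.
      sample = ("Auditory brainstem implantation is considered when the cochlear nerve "
                "cannot carry sound information to the brain. Your surgical team will "
                "explain candidacy, risks, and expected outcomes.")
      print("FRE %.1f, FKGL %.1f" % readability(sample))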
  8. Br J Clin Pharmacol. 2025 Oct 25. e70321
       AIMS: To assess the utility of the artificial intelligence (AI) chatbot ChatGPT (openly available version 3.5) in responding to real-world pharmacotherapeutic queries from healthcare professionals.
    METHODS: Three independent and blinded evaluators with different levels of medical expertise and professional experience (beginner, advanced, and expert) compared AI chatbot- and physician-generated responses to 70 real-world pharmacotherapeutic queries submitted to the clinical-pharmacological drug information centre of Hannover Medical School between June and October 2023 with regard to quality of information, answer preference, answer correctness and quality of language. Inter-rater reliability was assessed with Krippendorff's alpha. Two separate investigators not otherwise involved in the conduct or analysis of the study selected the top three clinically relevant errors in chatbot- and physician-generated responses.
    RESULTS: All three evaluators rated the quality of information of physician-generated responses higher than the quality of information of AI chatbot-generated responses and, accordingly, thought that the physician-generated responses were better than the chatbot-generated responses (answer preference). All evaluators detected factually wrong information more frequently in chatbot-generated responses than in physician-generated responses. Although the beginner and expert evaluators rated the quality of language of physician-generated responses higher than the quality of language of chatbot-generated responses, there was no significant difference according to the advanced evaluator.
    CONCLUSIONS: ChatGPT's responses to real-world pharmacotherapeutic queries were substantially inferior to conventional physician-generated responses with regard to quality of information and factual correctness. Our study suggests that, to date, the use of ChatGPT in pharmacotherapy counselling must be strongly cautioned against.
    Keywords:  ChatGPT; artificial intelligence; chatbot; clinical pharmacology; drug information centre; patient safety
    DOI:  https://doi.org/10.1002/bcp.70321
  9. Surg Endosc. 2025 Oct 24.
       INTRODUCTION: The rapid uptake of large language models (LLMs) in surgery demands evidence of their reliability when guiding laparoscopic cholecystectomy (LC).
    METHODS: An analytical cross-sectional study (April-June 2025) compared five current LLMs (ChatGPT-o3, Claude-Sonnet-4, DeepSeek-V3.5, Gemini-2.5 Flash, and Grok-3) on 24 guideline-derived questions covering the pre-, intra-, and postoperative phases of LC. Four blinded hepatobiliary surgeons rated 120 answers with the eight-item modified DISCERN (mDISCERN, 8-40) and Global Quality Score (GQS, 1-5). Readability was quantified with FRES, FKGL, SMOG, Fog, CLI, and lexical density indices, and inter-rater agreement assessed by two-way ICC.
    RESULTS: Grok delivered the highest mean mDISCERN (36.3 ± 2.3) and GQS (4.76 ± 0.41), whereas Gemini scored lowest (29.0 ± 2.1; 3.58 ± 0.36). DeepSeek produced the most readable output (FRES ≈ 30.6; FKGL ≈ 12.1), while Claude generated the densest, least readable text (negative FRES; FKGL ≈ 18.3). Quality correlated positively with word count and lexical density (ρ ≈ 0.7) but not with syntactic complexity. Surgeon ratings showed good reliability (ICC(2,k) = 0.775; ICC(3,k) = 0.819).
    CONCLUSIONS: LLM performance for LC varies markedly; even the best-performing model stops short of full reliability, reinforcing the need for procedure-specific validation before clinical deployment. This multidimensional audit provides a reproducible benchmark for selecting and fine-tuning surgical decision-support LLMs and highlights that terminological richness, rather than sentence complexity, underpins high-quality guidance.
    Keywords:  Artificial intelligence in surgery; DISCERN instrument; Global quality score; Laparoscopic cholecystectomy; Large language models (LLMs); Readability metrics; Surgical decision support
    DOI:  https://doi.org/10.1007/s00464-025-12315-x
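    The inter-rater reliability figures above are two-way average-measures intraclass correlations. A minimal sketch of the Shrout-Fleiss ICC(2,k) and ICC(3,k) computations on a hypothetical ratings matrix (rows are answers, columns are raters):

      # Minimal sketch of two-way average-measures intraclass correlations, ICC(2,k)
      # and ICC(3,k) in Shrout-Fleiss notation, as reported in the study above.
      # Rows are rated items (answers), columns are raters; the data are hypothetical.
      import numpy as np

      def icc_two_way(x: np.ndarray):
          n, k = x.shape
          grand = x.mean()
          row_means = x.mean(axis=1)
          col_means = x.mean(axis=0)
          ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)        # between-items mean square
          ms_cols = n * ((col_means - grand) ** 2).sum() / (k - 1)        # between-raters mean square
          resid = x - row_means[:, None] - col_means[None, :] + grand
          ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))               # residual mean square
          icc2k = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)  # raters treated as random
          icc3k = (ms_rows - ms_err) / ms_rows                             # raters treated as fixed
          return icc2k, icc3k

      rng = np.random.default_rng(0)
      true_quality = rng.normal(4, 0.6, size=30)                     # 30 answers
      ratings = true_quality[:, None] + rng.normal(0, 0.4, (30, 4))  # 4 raters
      print("ICC(2,k) = %.3f, ICC(3,k) = %.3f" % icc_two_way(ratings))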
  10. Am J Otolaryngol. 2025 Oct 20. pii: S0196-0709(25)00147-4. [Epub ahead of print]46(6): 104744
      To evaluate whether advanced large language models (LLMs), namely ChatGPT-4o, ChatGPT o3, Microsoft Copilot, Claude Sonnet 4, Gemini 1.5 Flash, and DeepSeek V3-R1, can improve the readability of head and neck cancer patient education materials while maintaining accuracy. Eleven publicly available articles were assessed for baseline readability using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES). Each article was rewritten by six LLMs using a standardized prompt to simplify text to a sixth-grade level. Readability was reassessed post-intervention, and paired t-tests with repeated-measures ANOVA and Bonferroni correction compared performance across models. Baseline readability was poor, with a mean FKGL of 10.1 and FRES of 51.5, exceeding national recommendations. Gemini, Copilot, ChatGPT o3, and DeepSeek significantly reduced FKGL compared to originals, while ChatGPT-4o showed minimal change and Claude increased difficulty. Gemini performed best, achieving FKGL ≤7 in 81.8% of cases, followed by Copilot and DeepSeek at 72.7%. ChatGPT-4o met this threshold in 18.2% of articles, and Claude met it in none. LLMs can improve the readability of head and neck cancer patient education materials, but effectiveness varies substantially among models. Gemini, Copilot, and DeepSeek were most successful in meeting clinical readability thresholds, whereas ChatGPT-4o and Claude underperformed. Careful model selection and clinical oversight are essential when applying AI to patient education, as rewritten materials must still be reviewed for accuracy before public use.
    Keywords:  Artificial Intelligence; Chatbots; Head and neck cancer; Health literacy; Large language models; Patient education materials
    DOI:  https://doi.org/10.1016/j.amjoto.2025.104744
  11. Foot Ankle Surg. 2025 Oct 14. pii: S1268-7731(25)00228-0. [Epub ahead of print]
       BACKGROUND: This study investigates the quality, accuracy, and readability of ChatGPT's responses to common patient inquiries regarding hallux rigidus.
    METHODS: Twenty-five patient questions were directed to ChatGPT and analyzed. The DISCERN criteria assessed information quality, while the method by Mika et al. evaluated response accuracy. Questions were classified per Rothwell classification, and readability was evaluated using Flesch-Kincaid, Gunning Fog, Coleman-Liau, and SMOG indices.
    RESULTS: The mean DISCERN score was 50.26 (fair), and the Mika et al. score was 2.04 (satisfactory requiring minimal clarification). According to the Rothwell classification, 72 % of the questions were in the Fact group. The mean readability corresponded to 11.3 years of education.
    CONCLUSIONS: ChatGPT provides partially satisfactory information about hallux rigidus in general at a high reading level. More detailed content should include surgical classifications, biomechanical details, and level of evidence. With these aspects, ChatGPT might be considered a supportive tool in patient education.
    LEVELS OF EVIDENCE: None.
    Keywords:  Artificial intelligence; ChatGPT; Frequently asked questions; Hallux limitus; Hallux rigidus
    DOI:  https://doi.org/10.1016/j.fas.2025.10.006
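    This study, like several others in this issue, also relies on the SMOG, Gunning Fog, and Coleman-Liau indices. A minimal sketch of those three formulas, again with a naive syllable heuristic, so values will not match dedicated readability tools exactly:

      # Minimal sketch of the SMOG, Gunning Fog, and Coleman-Liau formulas used in the
      # study above. Syllables are counted with a rough vowel-group heuristic, so the
      # values will not match dedicated readability tools exactly.
      import math
      import re

      def syllables(word: str) -> int:
          return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

      def indices(text: str):
          n_sent = max(1, len(re.findall(r"[.!?]+", text)))
          words = re.findall(r"[A-Za-z']+", text)
          n_words = len(words)
          polysyllabic = sum(1 for w in words if syllables(w) >= 3)
          letters = sum(len(w) for w in words)
          smog = 1.0430 * math.sqrt(polysyllabic * 30 / n_sent) + 3.1291
          fog = 0.4 * (n_words / n_sent + 100 * polysyllabic / n_words)
          cli = 0.0588 * (letters / n_words * 100) - 0.296 * (n_sent / n_words * 100) - 15.8
          return smog, fog, cli

      # Hypothetical patient-facing sample text, for illustration only.
      sample = ("Hallux rigidus is arthritis of the big toe joint. Stiff-soled shoes, "
                "anti-inflammatory medication, and injections may relieve pain before "
                "surgery is considered.")
      print("SMOG %.1f, Gunning Fog %.1f, Coleman-Liau %.1f" % indices(sample))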
  12. Acta Cardiol. 2025 Oct 24. 1-6
       BACKGROUND: This study aimed to evaluate the accuracy and readability of ChatGPT-4 responses related to cardiac rehabilitation (CR) for patients with heart failure (HF), with the objective of assessing its potential as a patient education tool.
    METHODS: The study involved 16 open-ended questions related to CR, developed by two specialists (one cardiologist and one physical medicine and rehabilitation specialist). These questions were submitted to ChatGPT-4, and its responses were evaluated for accuracy and readability. Accuracy was assessed using a 6-point Likert scale, while readability was analysed using the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Coleman-Liau Index (CLI), and Gunning Fog Index (GFI). Inter-evaluator reliability was assessed by the intraclass correlation coefficient (ICC).
    RESULTS: The mean accuracy score of ChatGPT-4 responses was high (5.25 ± 0.77 and 5.38 ± 0.62 for two raters), with 81.25% of responses rated 5 or above. The readability analysis revealed a median FRE of 59.5, indicating moderate readability, with FKGL at 7.1 and CLI at 11.2. The ICC between the two evaluators was 0.854, indicating good agreement.
    CONCLUSION: ChatGPT-4 provided accurate and reliable information on CR for HF patients. Although the readability was slightly above the ideal level, its overall performance suggests potential as a supportive tool in patient education. Further improvements in language simplicity are needed to optimise its usability.
    Keywords:  Cardiac rehabilitation; ChatGPT; heart failure
    DOI:  https://doi.org/10.1080/00015385.2025.2576451
  13. Cureus. 2025 Sep;17(9): e92434
      Introduction Parental education on pediatric respiratory illnesses is essential to ensure timely care for children. With the increasing use of artificial intelligence (AI) in health communication, tools such as ChatGPT and DeepSeekAI offer potential for generating accessible and reliable educational material. Methodology A cross-sectional analysis was conducted using AI-generated educational content. Each response was assessed for word and sentence count, average words per sentence, syllables per word, readability (Flesch Reading Ease Score and Grade Level), similarity percentage (QuillBot), and reliability (modified Discern Score). Statistical analysis was performed using independent sample t-tests, with p<0.05 considered significant. Results DeepSeekAI responses were longer (mean word count: 422 vs. 333.75) and included more sentences. However, no statistically significant differences were found in any variable, including readability (Grade Level: 9.22 vs. 9.30; Ease Score: 41.17 vs. 40.97) and reliability (2.25 vs. 2.00). Similarity scores were also comparable between the two tools (36.02 vs 32.1). Conclusion Both ChatGPT and DeepSeekAI generated parent education materials of similar quality, readability, and reliability. The findings of this study suggest that either AI tool can be utilized for developing parental educational content for common pediatric respiratory conditions.
    Keywords:  artificial intelligence; chatgpt; deepseek; deepseekai; education; paediatric respiratory diseases; pediatric respiratory diseases
    DOI:  https://doi.org/10.7759/cureus.92434
  14. Mediterr J Rheumatol. 2025 Sep;36(3): 410-416
       Background: The aim of this study was to evaluate the quality, completeness, accuracy, and readability of Large Language Models (LLM) responses to 25 popular questions about Familial Mediterranean Fever (FMF).
    Methods: The readability of the responses of LLMs (ChatGPT-4, Copilot, Gemini) was assessed by Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade (FKG). The Ensuring Quality Information for Patients (EQIP) tool was used to assess the quality. To assess the completeness and accuracy of responses, 3-point and 5-point Likert scales were used, respectively.
    Results: The mean FRES scores of LLMs ranged between 29.80 and 35.66. The FKG scores ranged between 12.36 and 13.72. The mean accuracy scores of LLMs ranged between 4.88 and 4.96. No significant difference was found between the LLM groups regarding accuracy and readability scores (p>0.05). The mean completeness scores of LLMs ranged between 2.36 and 2.84. ChatGPT-4 was the leading LLM in completeness scores according to the Likert scale, and the difference between LLM groups was statistically significant (p=0.006). Gemini performed better in the quality analysis with the EQIP tool, and there was a statistically significant difference between the LLM groups (p<0.001).
    Conclusion: In this study, LLMs performed acceptably in accuracy and completeness. However, there are serious concerns about their readability and quality. To improve health information, LLM developers should include more diverse data sources in the training sets of the models. Moreover, the ability of LLMs to provide readability features that are adaptable to the level of education could be an important innovation in this field.
    Keywords:  artificial intelligence; familial mediterranean fever; health literacy; large language model; patient education
    DOI:  https://doi.org/10.31138/mjr.261224.hfm
  15. Urolithiasis. 2025 Oct 23. 53(1): 202
      This study aims to evaluate the reliability, quality and readability of ChatGPT-4o responses regarding pediatric urolithiasis. Forty frequently asked questions about pediatric urinary stones were posed to ChatGPT-4o twice, one week apart. The reliability of ChatGPT-4o's responses was assessed using the five-point DISCERN tool (mDISCERN). The overall quality of the responses was evaluated using the Global Quality Scale (GQS). To assess the readability of ChatGPT-4o's responses, multiple metrics were employed, including the Flesch Reading Ease (FRE) score, the Flesch-Kincaid Grade Level (FKGL), the Gunning Fog Index (GFI), the Coleman-Liau Index (CLI), and the Simple Measure of Gobbledygook (SMOG). The median mDISCERN score was 5 (range: 4-5), and the median GQS score was 5 (range: 3-5), indicating high reliability and quality. However, readability metrics suggested a high level of difficulty: FRE (27.98 ± 13.65), FKGL (11.46 ± 1.88), SMOG (14.96 ± 1.64), GFI (17.27 ± 2.37), and CLI (15.60 ± 1.95). Only 2.5% of responses were understandable to individuals with reading skills at a 10-12-year-old level, 37.5% were suitable for college-level readers, and 60% required professional-level comprehension. A moderate correlation was observed between mDISCERN and GQS scores (r = 0.42, p = 0.007), but neither correlated significantly with readability metrics. ChatGPT-4o may provide reliable and high-quality information about pediatric urinary stones; however, the advanced reading level of its responses presents a significant barrier to accessibility for patients and caregivers. Therefore, despite its potential utility, the readability challenge must be addressed to ensure equitable patient education.
    Keywords:  ChatGPT-4o; GQS; Pediatric; Urolithiasis; mDISCERN
    DOI:  https://doi.org/10.1007/s00240-025-01880-4
  16. World J Gastrointest Oncol. 2025 Oct 15. 17(10): 109792
       BACKGROUND: With the rising use of endoscopic submucosal dissection (ESD) and endoscopic mucosal resection (EMR), patients are increasingly questioning various aspects of these endoscopic procedures. At the same time, conversational artificial intelligence (AI) tools like chat generative pretrained transformer (ChatGPT) are rapidly emerging as sources of medical information.
    AIM: To evaluate ChatGPT's reliability and usefulness regarding ESD and EMR for patients and healthcare professionals.
    METHODS: In this study, 30 specific questions related to ESD and EMR were identified. Then, these questions were repeatedly entered into ChatGPT, with two independent answers generated for each question. A Likert scale was used to rate the accuracy, completeness, and comprehensibility of the responses. Meanwhile, a binary category (high/low) was used to evaluate each aspect of the two responses generated by ChatGPT and the response retrieved from Google.
    RESULTS: By analyzing the average scores of the three raters, our findings indicated that the responses generated by ChatGPT received high ratings for accuracy (mean score of 5.14 out of 6), completeness (mean score of 2.34 out of 3), and comprehensibility (mean score of 2.96 out of 3). Kendall's coefficients of concordance indicated good agreement among raters (all P < 0.05). For the responses generated by Google, more than half were classified by experts as having low accuracy and low completeness.
    CONCLUSION: ChatGPT provided accurate and reliable answers in response to questions about ESD and EMR. Future studies should address ChatGPT's current limitations by incorporating more detailed and up-to-date medical information. This could establish AI chatbots as a significant resource for both patients and health care professionals.
    Keywords:  Artificial intelligence; Chat generative pretrained transformer; Endoscopic mucosal dissection; Endoscopic submucosal dissection; Google; Patient education
    DOI:  https://doi.org/10.4251/wjgo.v17.i10.109792
  17. Medicine (Baltimore). 2025 Oct 24. 104(43): e45135
      This study aims to evaluate and compare the quality and comprehensibility of responses generated by 5 artificial intelligence chatbots - ChatGPT-4, Claude, Mistral, Grok, and Google PaLM - to the most frequently asked questions about uveitis. Google Trends was employed to identify significant phrases associated with uveitis. Each artificial intelligence chatbot was provided with a unique sequence of 25 frequently searched terms as input. The responses were evaluated using 3 distinct tools: The Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P), the Simple Measure of Gobbledygook (SMOG) index, and the Automated Readability Index (ARI). The 3 most frequently searched terms were "uveitis eye," "anterior uveitis," and "uveitis symptoms." Among the chatbots evaluated, GPT-4 demonstrated the lowest ARI and SMOG scores (P = .001). Regarding the PEMAT-P, Mistral scored the lowest in understandability, while Grok achieved the highest score for actionability (P < .001). All chatbots, except Mistral, exhibited high intelligibility scores. GPT-4 had the lowest SMOG and ARI score among the chatbots evaluated, making it the easiest to read. Chatbot technology holds significant potential to enhance healthcare information dissemination and facilitate better patient understanding. While chatbots can effectively provide information on health topics such as uveitis, further improvement is needed to maximize their efficacy and accessibility.
    Keywords:  ChatGPT-4; artificial intelligence; readability; understandability; uveitis
    DOI:  https://doi.org/10.1097/MD.0000000000045135
  18. J Patient Exp. 2025;12: 23743735251385918
      This study evaluates the readability and quality of online resources on steroid knee injections. Online materials were identified using Google, Bing, and Yahoo with the search terms steroid knee injection, corticosteroid knee injection, and knee injection treatment. Of 150 screened web pages, 57 met inclusion criteria. Quality was assessed using the DISCERN instrument and Journal of the American Medical Association (JAMA) benchmark, while readability was measured using the Flesch-Kincaid Grade Level (FKGL) and Simple Measure of Gobbledygook (SMOG). Health On the Net Foundation Code of Conduct certification status was recorded. The mean DISCERN score was 42.47 ± 17.06, and the Journal of the American Medical Association score was 1.58 ± 1.52, indicating low quality. Readability analysis showed an FKGL score of 9.19 ± 2.08 and an SMOG score of 8.20 ± 5.23, suggesting most materials require advanced literacy. For-profit web pages had lower quality but were easier to read, whereas nonprofit and academic sites provided higher quality but more complex content. Most web pages offer low-quality, difficult-to-understand information. Patients should seek reliable sources, and oversight is needed to improve quality and accessibility.
    Keywords:  health communication; health literacy; knee osteoarthritis; readability analysis
    DOI:  https://doi.org/10.1177/23743735251385918
  19. Int J Audiol. 2025 Oct 23. 1-11
       OBJECTIVE: To assess the validity, reliability and readability of four AI chatbots for hearing-health information.
    DESIGN AND STUDY SAMPLE: Three audiologists created 100 questions covering adult hearing loss, paediatric hearing, hearing aids, tinnitus and cochlear implants (20 each). Questions were submitted twice to ChatGPT-3.5, Bing AI, Gemini and Perplexity. Answers were scored for factual accuracy and completeness on a five-point Global Quality Score. Validity was defined using low (score = 5) and high (score ≥ 4) thresholds. Internal consistency was estimated with Cronbach's α; readability with the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL). All scoring was completed independently by two blinded reviewers; discrepancies were resolved by consensus.
    RESULTS: Under the low threshold, ChatGPT-3.5 and Perplexity were most valid (84% and 79%); high-threshold validity fell to 37% and 34%. Perplexity had the highest overall reliability (α = 0.83) yet α dropped below 0.70 for cochlear-implant, tinnitus and hearing-aid questions. 84% of outputs were "Difficult"/"Very Difficult" and 68% read at college level.
    CONCLUSIONS: AI chatbots deliver generally accurate hearing-health content, but high-threshold accuracy, domain-specific reliability and readability remain suboptimal. They should supplement, not replace, professional counselling. Continued optimisation and external validation are needed before routine clinical recommendation.
    Keywords:  Artificial Intelligence; Hearing loss; chatbots; readability; reliability; validity
    DOI:  https://doi.org/10.1080/14992027.2025.2569927
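    Internal consistency in the study above is summarised with Cronbach's alpha. A minimal sketch of the standard item-variance formula on a hypothetical score matrix (rows are observations, columns are items):

      # Minimal sketch of Cronbach's alpha, the internal-consistency statistic reported
      # in the study above: alpha = k/(k-1) * (1 - sum of item variances / variance of
      # the total score). The score matrix here is hypothetical.
      import numpy as np

      def cronbach_alpha(scores: np.ndarray) -> float:
          k = scores.shape[1]
          item_vars = scores.var(axis=0, ddof=1)
          total_var = scores.sum(axis=1).var(ddof=1)
          return k / (k - 1) * (1 - item_vars.sum() / total_var)

      rng = np.random.default_rng(1)
      latent = rng.normal(0, 1, size=(100, 1))                 # shared trait
      items = latent + rng.normal(0, 0.8, size=(100, 20))      # 20 correlated items
      print(f"alpha = {cronbach_alpha(items):.2f}")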
  20. JMIR Form Res. 2025 Oct 22. 9: e68000
       Background: As internet usage continues to rise, an increasing number of individuals rely on online resources for health-related information. However, prior research has shown that much of this information is written at a reading level exceeding national recommendations, which may hinder patient comprehension and decision-making. The American Medical Association (AMA) recommends that patient-directed health materials be written at or below a 6th-grade reading level to ensure accessibility and promote health literacy. Despite these guidelines, studies indicate that many online health resources fail to meet this standard. The exercise stress test is a widely used diagnostic tool in cardiovascular medicine, yet no prior studies have assessed the readability and quality of online health information specific to this topic.
    Objective: This study aimed to evaluate the readability and quality of online resources on exercise stress testing and compare these metrics between academic and non-academic sources.
    Methods: A cross-sectional readability and quality analysis was conducted using Google and Bing to identify web-based patient resources related to exercise stress testing. Eighteen relevant websites were categorized as academic (n=7) or nonacademic (n=11). Readability was assessed using four established readability formulas: Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease (FRE), Simple Measure of Gobbledygook (SMOG), and Gunning Fog (GF). Website quality and reliability were evaluated using the modified DISCERN (mDISCERN) tool. Statistical comparisons between academic and nonacademic sources were performed using independent samples t tests.
    Results: The average FKGL, SMOG, and GF scores for all websites were 8.36, 8.28, and 10.14, respectively, exceeding the AMA-recommended 6th-grade reading level. Academic sources had significantly higher FKGL (9.1 vs. 7.9, P=.03), SMOG (8.9 vs. 7.9, P=.04), and lower FRE scores (57.6 vs. 65.3, P=.006) than nonacademic sources, indicating greater reading difficulty. The average GF scores for academic and nonacademic sources were 10.68 and 9.81, respectively, but this difference was not statistically significant. The quality of web resources, as assessed by mDISCERN, was classified as fair overall, with an average score of 29.44 out of 40 (74%). While academic and nonacademic websites had similar mDISCERN scores, areas such as source citation, publication dates, and acknowledgment of uncertainty were consistently lacking across all resources.
    Conclusions: Online resources on exercise stress testing are, on average, written at a reading level that exceeds the AMA's 6th-grade reading guideline, potentially limiting patient comprehension. Academic sources are significantly more difficult to read than nonacademic sources, though neither category meets the recommended readability standards. The quality of web-based resources was found to be fair but could be improved by ensuring transparency in sourcing and providing clearer, more comprehensive information. These findings underscore the need for improved accessibility and readability in online health information to support patient education and informed decision-making.
    Keywords:  DISCERN; FKGL; FRE; Flesch reading ease; Flesch-Kincaid grade level; GF; Gunning Fog index; SMOG; Simple Measure of Gobbledygook; cross-sectional study; exercise; exercise stress test; health information; health literacy; mDISCERN; medical information; patient outcomes; physical activity; quality; quality analysis; readability; search engines; websites
    DOI:  https://doi.org/10.2196/68000
  21. Infect Dis Clin Microbiol. 2025 Sep;7(3): 310-319
       Objective: Human papillomavirus (HPV) infection is a major public health concern, contributing to HPV-related cancers. Although effective vaccines are available, misinformation on social media complicates public health efforts. This study aimed to evaluate the quality, educational value, understandability, actionability, transparency, reliability, and popularity of Turkish-language YouTube videos on HPV vaccination.
    Materials and Methods: A YouTube search was conducted using the Turkish keywords HPV aşısı (HPV vaccine), Gardasil aşısı (Gardasil vaccine), and serviks kanseri aşısı (cervical cancer vaccine). The first 50 videos for each keyword were screened and included. Videos were assessed using validated tools: the Patient Education Materials Assessment Tool (PEMAT) for understandability and actionability, the JAMA score for transparency and reliability, the Video Power Index (VPI) for popularity, the Global Quality Score (GQS), and the Video Information & Quality Index (VIQI) for quality. Higher VIQI and VPI scores reflect greater quality and popularity, respectively.
    Results: The median video duration was 95 seconds (interquartile range [IQR], 105 seconds). The median JAMA score was 2 (IQR, 1), indicating low transparency and reliability. The median GQS score was 3 (IQR, 2), indicating moderate quality. PEMAT scores had a median of 66% (IQR, 25). The median VIQI and VPI were 15 (IQR, 4) and 144 (IQR, 1274), respectively. No significant differences were found in quality metrics between more and less popular videos. Most videos (98.75%) were produced by health-care providers (HCPs), predominantly gynecologists (86.4%), with no representation from family physicians.
    Conclusion: Although predominantly produced by HCPs, Turkish-language YouTube videos on HPV vaccination demonstrated only moderate quality and limited capacity to promote vaccination. Greater involvement of family physicians, key providers of preventive healthcare, may enhance the public health impact of online HPV vaccination content.
    Keywords:  Content analysis; YouTube; human papillomavirus; HPV vaccination; online health information; quality assessment
    DOI:  https://doi.org/10.36519/idcm.2025.699
  22. JMIR Infodemiology. 2025 Oct 24. 5: e70756
       Background: The quality of health information on social media is a major concern, especially during the early stages of public health crises. While the quality of the results of the popular search engines related to particular diseases has been analyzed in the literature, the quality of health-related information on social media, such as X (formerly Twitter), during the early stages of a public health crisis has not been addressed.
    Objective: This study aims to evaluate the quality of health-related information on social media during the early stages of a public health crisis.
    Methods: A cross-sectional analysis was conducted on health-related tweets in the early stages of the most recent public health crisis (the COVID-19 pandemic). The study analyzed the top 100 websites that were most frequently retweeted in the early stages of the crisis, categorizing them by content type, website affiliation, and exclusivity. Quality and reliability were assessed using the DISCERN and JAMA (Journal of the American Medical Association) benchmarks.
     Results: Our analyses showed that 95% (95/100) of the websites met only 2 of the 4 JAMA quality criteria. DISCERN scores revealed that 81% (81/100) of the websites received low scores, and only 11% (11/100) received high scores. The analysis revealed significant disparities in the quality and reliability of health information across different website affiliations, content types, and exclusivity.
    Conclusions: This study highlights a significant issue with the quality, reliability, and transparency of online health-related information during a public health challenge. The extensive shortcomings observed across frequently shared websites on Twitter highlight the critical need for continuous evaluation and improvement of online health content during the early stages of future health crises. Without consistent oversight and improvement, we risk repeating the same shortcomings in future, potentially more challenging situations.
    Keywords:  DISCERN; JAMA benchmarks; Journal of the American Medical Association; health crisis; health information; infodemic; public health; quality assessment
    DOI:  https://doi.org/10.2196/70756
  23. Urol Int. 2025 Oct 21. 1-18
       AIM: Ureteropelvic junction obstruction (UPJO) is a common disease of the urinary system. Laparoscopic pyeloplasty, especially robot-assisted laparoscopic pyeloplasty, has become the primary surgical method for the treatment of UPJO. As YouTube gradually becomes a platform for young doctors to learn surgical techniques, we intend to assess these surgical videos of traditional laparoscopic pyeloplasty and robot-assisted pyeloplasty on YouTube in terms of their educational quality.
    MAIN METHODS: Two authors independently searched for "laparoscopic pyeloplasty", "robot-assisted pyeloplasty", and "robotic pyeloplasty" on YouTube (https://www.youtube.com/) on March 16, 2023. We developed the LAP-VEGaS Video Assessment Tool/LAP-VEGaS-LP scale, based on the LAP-VEGaS guidelines, to quantify the quality of the videos. We also used the JAMA (Journal of the American Medical Association) Benchmark Criteria to assess the reliability of the videos. SPSS 26.0 was used for descriptive statistics and correlation analysis.
    KEY FINDINGS: Finally, 55 videos were included. The average length of the videos was 9.30 minutes (IQR, 22.56 minutes). The mean number of subscribers was 4745 (range, 3-28,700, IQR 16684). The mean number of VPI (views per like) was defined as the percentage of the like ratio, and the ratio of the views was 2.48% (IQR, 6.79). The median JAMA score of the videos was 6 (IQR, 2). The mean LAP-VEGaS-LP score was 17.72 (SD 0.76). The video definition had a positive correlation with the number of subscribers (r=0.410, P=0.003), views ratio (r=0.431, P=0.002), and VPI (r=0.443, P=0.001). The LAP-VEGaS-LP score had a positive correlation with the number of subscribers (r=0.398, P=0.004), definition (r=0.314, P=0.026), views ratio (r=0.459, P=0.001), VPI (r=0.496, P<0.001), and the score of reliability (r=6.53, P<0.001).
    SIGNIFICANCE: The educational quality of laparoscopic pyeloplasty surgical videos is concerning. A more authoritative standard is needed to guide the uploaders and improve the educational value of the videos.
    DOI:  https://doi.org/10.1159/000549012
  24. Eur Arch Otorhinolaryngol. 2025 Oct 18.
      
    Keywords:  Endoscopic; IVORY; LAP-VEGaS; Myringoplasty; Tympanoplasty; YouTube
    DOI:  https://doi.org/10.1007/s00405-025-09777-z
  25. Am J Health Promot. 2025 Oct 21. 8901171251383874
      Purpose: American Indian and Alaska Native (AI/AN) peoples face disproportionate health risks. Understanding how AI/ANs seek out information can inform effective campaign design that can help address these risks. We investigate preferred communication sources, health information seeking behavior (HISB), self-efficacy, perceived importance of health information, and prevention orientation of American Indians and Alaska Natives (AI/ANs).
    Design: We administered a survey at 3 cultural events.
    Setting: The National Tribal Health Conference in Bellevue, and the University of Washington Winter and Spring Powwows in Seattle.
    Subjects: Participants (N = 344) included people from tribes throughout the US, particularly from northwestern tribes.
    Analysis: Independent samples t-tests and ANOVAs examined differences in HISB. Frequency analyses identified preferred health information sources. PROCESS tested the relationship between perceived importance and HISB, and moderation by prevention orientation and self-efficacy.
    Results: Preferred health information sources were doctors (M = 3.5), the internet (M = 3.32) and friends/relatives (M = 3.11). Females demonstrated more HISB than males (P < .01). Individuals with a college degree or higher showed greater HISB (P < .001). AI/ANs living on reservations (M = 2.34, SD = 1.53) preferred newspapers for health information more than those in metropolitan (M = 1.64, SD = .13) or rural areas (M = 1.45, SD = .16, P < .05). Perceived importance is a robust positive factor that predicts HISB (b = .48, t(315) = 9.67, P < .001).
    Conclusion: This study offers advice for scholars and practitioners to design messages to increase the accessibility of health information.
    Keywords:  health communication; health disparities; indigenous; risk communication
    DOI:  https://doi.org/10.1177/08901171251383874
  26. BMJ Open. 2025 Oct 23. 15(10): e097949
     OBJECTIVES: To characterise the information needs of youth with mental health concerns and their experiences of receiving COVID-19 vaccine information.
    DESIGN: Thematic analysis of semistructured interview transcripts.
    SETTING: Semistructured interviews via WebEx video conferencing or by telephone.
    PARTICIPANTS: 46 youth aged 16-29 with one or more self-reported mental health concerns and six family members of youth.
    RESULTS: Our analysis generated four main themes: (1) information content and characteristics; (2) critical appraisal; (3) modulators of information-seeking behaviour; and (4) unmet information needs.
    CONCLUSIONS: Our findings suggest that youth with mental health concerns have unique information needs and processing patterns influenced by their environments and experiences with mental health concerns. Participants identified barriers to receiving reliable health information and suggested ways to improve this process.
    Keywords:  Adolescent; COVID-19; Health Education; MENTAL HEALTH; PUBLIC HEALTH
    DOI:  https://doi.org/10.1136/bmjopen-2024-097949
  27. Front Public Health. 2025;13: 1672145
     Objective: This study aims to elucidate the correlations among health information search, health anxiety, and geriatric hypochondriasis, and to examine the mediating role of health information search behavior between health anxiety and hypochondriasis among older adults, thereby providing a theoretical basis for interventions.
    Methods: A cross-sectional survey was conducted among 251 older adult participants recruited via cluster sampling from six streets in Changshu City, Suzhou, from January to March 2024. Data were collected using validated scales, including the Short-Form Health Anxiety Scale and the Short-Form Cyberchondria Severity Scale. SPSS 26.0 was used for statistical analysis, incorporating descriptive statistics, correlation analysis, binary logistic regression, and bootstrap mediation analysis (5,000 samples). Statistical significance was set at p < 0.05.
    Results: (1) More than 60% of the participants were female; 44.22% were aged 60-65; 46.22% self-rated as healthy; 41.43% frequently searched for health information. (2) Health information search and health anxiety were positively correlated with geriatric hypochondriasis (both p < 0.01). (3) Health information search fully mediated the relationship between health anxiety and hypochondriasis (mediating effect = 0.659, 95% CI [0.41, 0.92]).
    Conclusion: This study confirms the mediating role of health information search in the pathway from health anxiety to hypochondriasis among older adults. It suggests that interventions should focus on improving digital health literacy and reducing unnecessary health information searches to mitigate hypochondriacal tendencies.
    Keywords:  digital health literacy; geriatric hypochondriasis; health anxiety; health information search; mediation analysis
    DOI:  https://doi.org/10.3389/fpubh.2025.1672145
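    The mediation result above (indirect effect 0.659, 95% CI 0.41-0.92) rests on a percentile bootstrap with 5,000 resamples. A minimal sketch of that logic with simulated data, using ordinary least squares for the a- and b-paths; the study itself used SPSS-based bootstrap mediation, so this is only an illustration of the procedure:

      # Minimal sketch of a percentile-bootstrap test of an indirect (mediation) effect,
      # in the spirit of the study above: X = health anxiety, M = health information
      # search, Y = hypochondriasis. Data here are simulated, not the study's data.
      import numpy as np

      def indirect_effect(x, m, y):
          a = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), m, rcond=None)[0][1]     # X -> M
          b = np.linalg.lstsq(np.column_stack([np.ones_like(x), x, m]), y, rcond=None)[0][2]  # M -> Y given X
          return a * b

      rng = np.random.default_rng(2)
      n = 251
      x = rng.normal(size=n)
      m = 0.6 * x + rng.normal(scale=0.8, size=n)
      y = 0.7 * m + 0.1 * x + rng.normal(scale=0.8, size=n)

      point = indirect_effect(x, m, y)
      boot = np.empty(5000)
      for i in range(5000):
          idx = rng.integers(0, n, n)                  # resample cases with replacement
          boot[i] = indirect_effect(x[idx], m[idx], y[idx])
      lo, hi = np.percentile(boot, [2.5, 97.5])
      print(f"indirect effect = {point:.3f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")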