BMC Nephrol. 2026 May 25.
BACKGROUND: Large language models (LLMs) are becoming a key resource for hemodialysis patients seeking information on disease management. However, the reliability, readability, and guideline concordance of LLM-generated educational texts on hemodialysis remain insufficiently evaluated.
METHODS: Forty-two dialysis-related questions were selected from an initial pool of 200 candidate questions drawn from Google Trends, relevant clinical guidelines, and online forums. Using a standardized single-turn, zero-shot prompting strategy with default web-interface settings, each question was submitted independently to five models (ChatGPT-4o, DeepSeek-V2.5, Gemini 2.5 Pro, Perplexity Pro, and Copilot). Two trained raters independently evaluated the outputs in a blinded review using the DISCERN, EQIP, JAMA, and GQS scales, with disagreements adjudicated by a third reviewer, a senior nephrologist. Readability was quantified using the FKGL, FRES, GFI, CLI, and SMOG metrics. Additionally, guideline concordance and potential text-level safety concerns in the generated outputs were reviewed against internationally authoritative hemodialysis-related guidelines such as KDIGO, and hallucinations in the model outputs were described qualitatively.
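For readers unfamiliar with the readability metrics, the two Flesch measures (FKGL, the Flesch-Kincaid Grade Level, and FRES, the Flesch Reading Ease Score) can be sketched from their standard published formulas. This is an illustrative sketch only, not the study's actual tooling: the function names are hypothetical, and the word, sentence, and syllable counts are assumed to come from a separate tokenizer, since counting rules vary between tools.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """FKGL: approximate US school grade level needed to read the text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """FRES: higher scores indicate easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


if __name__ == "__main__":
    # Hypothetical counts for one model response (not data from the study).
    counts = {"words": 100, "sentences": 5, "syllables": 150}
    print(f"FKGL: {flesch_kincaid_grade(**counts):.2f}")
    print(f"FRES: {flesch_reading_ease(**counts):.2f}")
```

Longer sentences and more syllables per word raise FKGL and lower FRES, so a text that scores as harder on one measure also scores as harder on the other.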
RESULTS: Significant differences were observed across the five LLMs for all four information-quality metrics (P < 0.001 for DISCERN and EQIP; P = 0.002 for GQS and JAMA). Retrieval-augmented generation (RAG)-based models, particularly Perplexity and Copilot, showed relatively higher information reliability. None of the outputs met the recommended sixth-grade readability benchmark, and greater guideline concordance was often accompanied by higher linguistic complexity. RAG-based models also showed relatively better alignment with reference guideline statements, whereas non-retrieval-based models more often omitted guideline-recommended elements or provided less specific responses. Qualitative review identified several examples of model-generated "medical hallucinations," including contraindicated self-management suggestions, potentially inappropriate dietary advice, and out-of-scope clinical instructions presented as self-care, indicating potential text-level safety concerns if the outputs are used without professional review.
CONCLUSION: RAG-based models showed relatively better evidence support, information reliability, and guideline concordance in hemodialysis-related educational text generation. However, all evaluated LLMs produced outputs with readability barriers and occasional potentially unsafe or out-of-scope recommendations at the text level. These findings do not establish the actual clinical safety or effectiveness of LLM use among hemodialysis patients, but they indicate that unsupervised patient-facing use should be approached cautiously and that expert review and plain-language adaptation are necessary before such outputs are used as educational materials.
Keywords: Artificial intelligence; Complications; Guideline concordance; Hemodialysis; Information quality; Large language models; Readability