J Am Med Inform Assoc. 2026 Apr 16. pii: ocag045. [Epub ahead of print]
OBJECTIVES: Patients with rare diseases often face long delays before receiving a diagnosis. Using electronic health records for automated phenotyping and diagnosis of rare diseases is a promising approach but can be challenging because critical information is often recorded in unstructured notes rather than structured fields. This systematic review synthesizes the current literature applying natural language processing (NLP) and large language models (LLMs) for rare disease phenotyping and diagnosis from clinical text.
MATERIALS AND METHODS: A systematic search was conducted in PubMed, ACM Digital Library, and IEEE Xplore. Two reviewers independently screened papers and extracted data. Methodological rigor and quality of the studies were evaluated using the MI-CLAIM framework.
RESULTS: The search yielded 135 studies, of which 27 met the inclusion criteria. Methods spanned rule-based systems, classical machine learning and deep learning models, transformer architectures, and LLMs. Transformer- and LLM-based approaches outperformed earlier methods in entity recognition, phenotype extraction, and diagnostic ranking. Several studies demonstrated clinical impact, such as increased genetic testing and identification of undiagnosed cases. However, most studies relied on retrospective, single-center datasets. Reporting of preprocessing, evaluation, and reproducibility was largely inconsistent, and interpretability, fairness, and privacy were rarely addressed.
DISCUSSION: Natural language processing and LLMs show strong potential to accelerate rare disease diagnosis. However, heterogeneity in methods and metrics hinders cross-study comparability. Data scarcity, lack of generalization, and limited transparency remain significant challenges.
CONCLUSIONS: Natural language processing/LLM methods can support timely diagnosis of rare diseases using unstructured clinical text. Future research should prioritize multicenter studies, standardized evaluation frameworks, transparency, and fairness safeguards to enable reliable, equitable deployment.
Keywords: electronic health records; language models; natural language processing; phenotype; rare diseases/diagnosis; statistical