Front Endocrinol (Lausanne). 2025 ;16 1626203
Background: Non-alcoholic fatty liver disease (NAFLD) is the most common chronic liver disease, seriously threatening the public health. Although the proportion of patients with lean NAFLD is lower than that of patients with obese NALFD, it should not be overlooked. This study aimed to construct interpretable machine learning models for predicting lean NAFLD risk in type 2 diabetes mellitus (T2DM) patients.
Methods: This study enrolled 1,553 T2DM individuals who received health care at the First Affiliated Hospital of Ningbo University, Ningbo, China, from November 2019 to November 2024. Feature screening was performed using the Boruta algorithm and the Least Absolute Shrinkage and Selection Operator (LASSO). Linear discriminant analysis (LDA), logistic regression (LR), Naive Bayes (NB), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGboost) were used in constructing risk prediction models for lean NAFLD in T2DM patients. The area under the receiver operating characteristic curve (AUC) was used to assess the predictive capacity of the model. Additionally, we employed SHapley Additive exPlanations (SHAP) analysis to unveil the specific contributions of individual features in the machine learning model to the prediction results.
Results: The prevalence of lean NAFLD in the study population was 20.3%. Eight variables, including age, body mass index (BMI), and alanine aminotransferase (ALT), were identified as independent risk factors for lean NAFLD. Ten predictive factors, including BMI, ALT, and aspartate aminotransferase (AST), were screened for the construction of risk prediction models. The random forest model demonstrated superior performance compared to alternative machine learning (ML) algorithms, achieving an AUC of 0.739 (95% confidence interval [CI]: 0.676-0.802) in the training set, and it also exhibited the best predictive value in the internal validation set with an AUC of 0.789 (95% CI: 0.722-0.856). In addition, the SHAP method identified TG, ALT, GGT, BMI, and UA as the top five variables influencing the predictions of the RF model.
Conclusion: The construction of lean NAFLD risk models based on the Chinese T2DM population, particularly the RF model, facilitates its early prevention and intervention, thereby reducing the risks of intrahepatic and extrahepatic adverse outcomes.
Keywords: interpretable machine learning; lean non-alcoholic fatty liver disease; predict risk; prediction model; type 2 diabetes mellitus