Clin Orthop Relat Res. 2026 May 22.
BACKGROUND: Prognostic support tools are increasingly used to guide treatment decisions in patients with metastatic long-bone disease. PathFX is a widely distributed survival prediction model that has been validated worldwide in various settings. Despite this, to our knowledge, there has been no systematic evaluation of PathFX's algorithmic fairness across clinically relevant subgroups within external evaluation studies.
QUESTIONS/PURPOSES: (1) How accurately does PathFX predict survival at 1, 3, 6, 12, 18, and 24 months in an external cohort of patients undergoing surgery for long-bone metastases? (2) Are the performance and error distribution of PathFX fair across key sociodemographic, clinical, and temporal subpopulations within an external cohort of patients undergoing surgery for long-bone metastases?
METHODS: All patients 18 years or older from a tertiary orthopaedic oncology service who underwent surgery from January 2010 to December 2022 for impending or completed metastatic long-bone fracture were retrospectively studied. Of the 1018 patients, 45% (460 of 1018) were male. Race and ethnicity were self-identified through a standardized institution-wide demographic survey and recorded in the electronic health record. Among patients with available data (n = 991), 88% (874 of 991) identified as White, 5% (51 of 991) as Black, 3% (28 of 991) as Asian, and 4% (38 of 991) as Other. Race and ethnicity data were missing or not reported for 4% (36 of 1018) of patients. The primary outcome was overall survival at prespecified time points (1, 3, 6, 12, 18, and 24 months). Data on the nine predictors required by PathFX (age, sex, primary tumor group, Eastern Cooperative Oncology Group performance status, pathologic fracture status at the index site, presence of multiple skeletal metastases, presence of organ metastases, hemoglobin level, and absolute lymphocyte count) were collected for each patient. We assessed discrimination (time-specific area under the curve [AUC]/C-index with 95% confidence intervals [CIs]), calibration (slope and intercept with CIs and graphical calibration), overall accuracy (Brier score), and decision curve analysis. Discrimination (time-specific AUC/C-index) reflects how well the model distinguishes between patients who experience the event and those who do not; it ranges from 0.5 (no better than chance) to 1.0 (perfect discrimination), with values around 0.7 generally considered acceptable and ≥ 0.8 strong. Calibration assesses whether predicted probabilities agree with observed outcomes: the calibration intercept indicates systematic overestimation or underestimation (ideal = 0), while the calibration slope reflects whether risk predictions are too extreme or too moderate (ideal = 1). 
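As an illustration of the discrimination and calibration measures defined above, the following is a minimal sketch on a hypothetical toy cohort, not the study's analysis code. It assumes that the outcome is coded 1 for patients alive at the time point and that predictions are survival probabilities. It also assumes that intercept and slope are fitted jointly by logistic regression on the logit of the prediction; some analyses instead estimate the intercept with the slope fixed at 1, and the text does not state which approach was used.

```python
import math

def auc(y_true, p_pred):
    """Probability that a random survivor receives a higher predicted
    survival probability than a random non-survivor (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def logit(p):
    return math.log(p / (1.0 - p))

def calibration(y_true, p_pred, iters=25):
    """Fit logit(P(y=1)) = a + b * logit(p_pred) by Newton-Raphson.
    Returns (intercept a, slope b); ideal values are (0, 1)."""
    x = [logit(p) for p in p_pred]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu           # score for the intercept
            g1 += (yi - mu) * xi    # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical toy cohort: three risk groups of 10 patients each.
# Predictions are deliberately too extreme: observed survival is
# 20%, 50%, and 80%, but the model predicts about 6%, 50%, and 94%.
p_pred = [1 / 17] * 10 + [0.5] * 10 + [16 / 17] * 10
y_true = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2

cohort_auc = auc(y_true, p_pred)
intercept, slope = calibration(y_true, p_pred)
print(f"AUC={cohort_auc:.2f} intercept={intercept:.2f} slope={slope:.2f}")
```

In this toy cohort the predicted log-odds are exactly twice the observed log-odds in each group, so the fitted slope is close to 0.5, which is the pattern described as too-extreme predictions. Discrimination is unaffected by this distortion because the ranking of patients is unchanged.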
Overall accuracy was quantified using the Brier score, which measures the average squared difference between predicted probabilities and actual outcomes; lower values indicate better accuracy, with 0 representing perfect prediction. Finally, decision curve analysis evaluates clinical usefulness by estimating the net benefit of using the model across a range of decision thresholds compared with default strategies (treat all or treat none). We evaluated model performance and error distribution within prespecified sociodemographic, clinical, and temporal subgroups and compared subgroup estimates using Δmetrics with 95% CIs.
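The Brier score, net benefit, and subgroup Δmetric described above can be sketched as follows. This is an illustrative example on hypothetical data, not the study's analysis code. The percentile bootstrap used for the 95% CI of the subgroup difference is an assumption; the text does not state how the CIs were derived. Here `y` is coded 1 when the event of interest occurs and `p` is the predicted probability of that event.

```python
import random

def brier(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def net_benefit(y_true, p_pred, threshold):
    """Net benefit of acting on patients with prediction >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Default strategy: act on everyone (treat none is always 0)."""
    prev = sum(y_true) / len(y_true)
    return prev - (1.0 - prev) * threshold / (1.0 - threshold)

def delta_with_ci(metric, group_a, group_b, n_boot=1000, seed=0):
    """metric(A) - metric(B) with a percentile bootstrap 95% CI
    (bootstrap is an assumption made for this sketch)."""
    rng = random.Random(seed)

    def resample(group):
        y, p = group
        idx = [rng.randrange(len(y)) for _ in y]
        return [y[i] for i in idx], [p[i] for i in idx]

    point = metric(*group_a) - metric(*group_b)
    deltas = sorted(metric(*resample(group_a)) - metric(*resample(group_b))
                    for _ in range(n_boot))
    return point, deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]

# Hypothetical subgroups: (outcomes, predicted probabilities).
group_a = ([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2])
group_b = ([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.1])

nb_model = net_benefit(*group_a, threshold=0.5)
nb_all = net_benefit_treat_all(group_a[0], threshold=0.5)
delta, ci_low, ci_high = delta_with_ci(brier, group_a, group_b)
print(f"net benefit: model={nb_model:.2f}, treat all={nb_all:.2f}")
print(f"delta Brier={delta:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

A decision curve repeats the net benefit calculation over a range of thresholds and plots the model against the treat-all and treat-none strategies. With subgroups as small as those in this toy example, the bootstrap interval is very wide and serves only to show the mechanics.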
RESULTS: In general, the accuracy and other performance parameters we observed for PathFX were inadequate for clinical use. Overall, the best-performing model was the 18-month survival model: AUC 0.63 (95% CI 0.60 to 0.67), Brier 0.22 (95% CI 0.21 to 0.23), calibration slope 0.58 (95% CI 0.33 to 0.83), and intercept 0.21 (95% CI 0.10 to 0.32). The AUC for the other models did not exceed 0.68, with worse calibration metrics. Intercepts were positive for all time points, which means that the model systematically underestimated survival in this patient population. Calibration slopes were < 1 throughout, indicating overconfident (too extreme) probabilities. Brier scores ranged from 0.07 to 0.24, which is consistent with moderate probabilistic accuracy. Because the Brier score is dependent on the baseline event incidence, variation across prediction time points partly reflects changes in outcome frequency rather than pure differences in discriminative or calibration performance. The subgroup analyses suggested heterogeneity: the model exhibited better discrimination in females and poorer performance, with flatter calibration slopes, in patients who were not White. There were no clear differences in subgroups based on treatment period.
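The dependence of the Brier score on baseline event incidence noted above can be made explicit with a short worked derivation. It is a general property of the score and is not taken from the study's data. Here $\pi$ denotes the observed proportion of patients with $y_i = 1$ at a given time point, and the reference model is a hypothetical noninformative model that assigns $\hat{p}_i = \pi$ to every patient.

```latex
\begin{align}
\text{Brier} &= \frac{1}{N}\sum_{i=1}^{N} \left(\hat{p}_i - y_i\right)^2 \\
\text{Brier}_{\text{ref}} &= \pi\,(1-\pi)^2 + (1-\pi)\,\pi^2 = \pi\,(1-\pi)
\end{align}
```

For example, the noninformative reference score is 0.09 when $\pi = 0.1$ and reaches its maximum of 0.25 when $\pi = 0.5$. A lower Brier score at a time point where few or most patients have died can therefore reflect outcome frequency rather than better prediction.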
CONCLUSION: Based on the findings of this study, PathFX in its current form is insufficient for clinical use in patients with long-bone metastases undergoing surgery, as it consistently underestimates survival. Recalibration of the model using an updated cohort, with stepwise model updating and subgroup stability checks, is warranted; however, even after recalibration, complete model redevelopment may ultimately be required before PathFX can be reliably used to guide surgical decision-making.
LEVEL OF EVIDENCE: Level III, prognostic study.