Gigascience. 2025 Jan 06. pii: giaf104. [Epub ahead of print]14
BACKGROUND: Most cancers exhibit somatic copy number alterations (SCNAs): gains and losses of variable regions of DNA. SCNAs play a key role in cancer adaptation through modulation of gene expression, deletion of tumor suppressor genes, or amplification of oncogenes. Systematic analysis of SCNAs is now a routine task in both the clinic and research and can help identify novel cancer genes, improve our understanding of cancer gene regulation, and enable us to accurately reconstruct cancer phylogenies. However, to conduct such analyses, SCNA profiles have to be integrated between samples, patients, and cohorts, which is often a nontrivial task for which dedicated toolkits are lacking.
RESULTS: To fill this gap, we developed CNSistent, a Python package for imputation, filtering, consistent segmentation, feature extraction, and visualization of cancer copy number profiles from heterogeneous datasets. We demonstrate the utility of CNSistent by applying it to the following publicly available cohorts: The Cancer Genome Atlas, Pan-Cancer Analysis of Whole Genomes, and TRAcking Cancer Evolution through therapy (Rx). We compare the effect of sample preprocessing and different segmentation and aggregation strategies on cancer type and subtype classification tasks using various classification models. We also evaluate how well a classifier trained on one cohort generalizes to another. Lastly, we introduce 2 segment-based peak and outlier scores to investigate relationships between segments, between samples, and between cancer types. Using these scores, we investigate non-small cell lung cancer samples, highlighting that SOX2 amplification is the dominant copy number alteration in lung squamous cell carcinoma and the main distinction from lung adenocarcinoma.
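To make the idea of consistent segmentation concrete, the following is a minimal sketch, not CNSistent's actual API or algorithm. It assumes one simple strategy: every sample's copy number segments on a chromosome are aggregated into the same fixed-size bins using a length-weighted mean, so that profiles with different breakpoints become directly comparable feature vectors. The function name `rebin_profile`, its parameters, and the toy coordinates are all hypothetical.

```python
def rebin_profile(segments, bin_size, chrom_length):
    """Aggregate one sample's segments into fixed-size bins (illustrative only).

    segments: list of (start, end, copy_number) tuples on a single chromosome,
              half-open [start, end) and non-overlapping.
    Returns one length-weighted mean copy number per bin, or None for a bin
    that no segment covers (a gap that would need imputation).
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    weighted = [0.0] * n_bins  # sum of copy_number * overlapping bases
    covered = [0] * n_bins     # number of covered bases per bin

    for start, end, cn in segments:
        end = min(end, chrom_length)
        first = start // bin_size
        last = (end - 1) // bin_size
        for b in range(first, last + 1):
            lo = max(start, b * bin_size)
            hi = min(end, (b + 1) * bin_size)
            overlap = hi - lo
            if overlap > 0:
                weighted[b] += cn * overlap
                covered[b] += overlap

    return [w / c if c else None for w, c in zip(weighted, covered)]


# Two toy samples with different breakpoints, mapped onto the same three bins.
sample_a = [(0, 150, 2), (150, 300, 4)]
sample_b = [(0, 100, 3)]  # leaves the last two bins uncovered

profile_a = rebin_profile(sample_a, bin_size=100, chrom_length=300)
profile_b = rebin_profile(sample_b, bin_size=100, chrom_length=300)
```

Once every sample shares the same bins, the resulting vectors can be stacked into a matrix for classification or for per-segment scoring. Bins returned as `None` mark uncovered regions, which is where an imputation or filtering step would apply.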
CONCLUSIONS: CNSistent is a general-purpose toolkit for integrated processing of SCNA profiles across many patients and cohorts. It is available at https://bitbucket.org/schwarzlab/cnsistent. The Research Resource Identifier for CNSistent is SCR_027025.
Keywords: SCNA; cancer; cancer classification; data processing; deep learning