Proc Natl Acad Sci U S A. 2025 May 06. 122(18): e2416516122
Constructing single-cell atlases requires preserving differences attributable to biological variables, such as cell types, tissue origins, and disease states, while eliminating batch effects. However, existing methods are inadequate in explicitly modeling these biological variables. Here, we introduce SIGNAL, a general framework that leverages biological variables to disentangle biological and technical effects, thereby linking these metadata to data integration. SIGNAL employs a variant of principal component analysis to align multiple batches, enabling the integration of 1 million cells in approximately 2 min. SIGNAL, despite its computational simplicity, surpasses state-of-the-art methods across multiple integration scenarios: 1) heterogeneous datasets, 2) cross-species datasets, 3) simulated datasets, 4) integration on low-quality cell annotations, and 5) reference-based integration. Furthermore, we demonstrate that SIGNAL accurately transfers knowledge from reference to query datasets. Notably, we propose a self-adjustment strategy to restore annotated cell labels potentially distorted during integration. Finally, we apply SIGNAL to multiple large-scale atlases, including a human heart cell atlas containing 2.7 million cells, identifying tissue- and developmental stage-specific subtypes, as well as condition-specific cell states. This underscores SIGNAL's exceptional capability in multiscale analysis.
Keywords: data integration; knowledge transfer; principal component analysis; single-cell data; technical variation