bioRxiv. 2026 Jan 02. pii: 2025.12.31.697247. [Epub ahead of print]
Quantitative, time-resolved 3D fluorescence microscopy can reveal complex cellular dynamics in living cells and tissues. Broader use remains limited by the difficulty of identifying, segmenting, and tracking objects of different sizes and shapes in crowded intracellular environments captured as low-contrast, anisotropic, monochromatic image volumes. Objects overlap, deform, appear and disappear, and span wide ranges of size and intensity. Classical segmentation pipelines typically require high signal-to-noise data and rely on intensity heuristics with hand-tuned postprocessing that generalize poorly. Supervised deep learning methods require extensive voxel-level annotations that are costly, inconsistent across phenotypes, and rapidly become obsolete as imaging conditions change. We introduce SpatialDINO, a fully automated self-supervised method that trains a native 3D vision transformer based on a modified version of DINOv2(1). SpatialDINO yields robust semantic feature maps from single channels of multi-channel microscopy that, irrespective of object shape, support object detection and segmentation directly from naïve 3D images across z-spacings, plane counts, and imaging modalities, without retraining or voxel annotations. We trained SpatialDINO on a small set of fluorescence volumes acquired by live-cell 3D lattice light-sheet microscopy, spanning targets of different sizes and shapes located in crowded cellular environments, from diffraction-limited clathrin-coated pits and clathrin-coated vesicles to larger structures including endosomes and lysosomes, as well as endosomes and lysosomes pharmacologically enlarged to highlight their membrane profiles. Post-processing of the features generated by SpatialDINO allows detection and unique identification of these objects in naïve 3D images. It also enables detection of markedly different, previously unseen object classes, such as plasma membranes and nuclei, and even tumors in MRI scans. Finally, we illustrate its value by tracking endosomes in 3D time series, combining SpatialDINO-derived feature similarity with spatial proximity to improve association through occlusion, abrupt appearance changes, and dense packing, all conditions that have challenged existing methods. SpatialDINO therefore lowers a major barrier to quantitative analysis of heterogeneous, monochromatic objects in crowded 3D cellular environments.
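For readers interested in the tracking step, the following is a minimal illustrative sketch of how per-object feature similarity and spatial proximity could be combined into a single assignment cost for frame-to-frame association. The preprint does not specify this implementation; the function name, cost weighting (alpha), and distance gate (max_dist) below are assumptions for illustration only.

# Illustrative sketch only: frame-to-frame object association combining
# appearance (feature) similarity with spatial proximity. All names and
# parameters are hypothetical, not taken from the SpatialDINO codebase.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def associate(feats_t, centroids_t, feats_t1, centroids_t1,
              alpha=0.5, max_dist=10.0):
    """Match objects in frame t to objects in frame t+1.

    feats_*     : (N, D) per-object feature vectors (e.g. pooled model features)
    centroids_* : (N, 3) object centroids in (z, y, x) voxel coordinates
    alpha       : relative weight of appearance vs. spatial cost
    max_dist    : spatial gate (voxels) beyond which matches are disallowed
    """
    # Appearance cost: cosine distance between per-object feature vectors.
    appearance = cdist(feats_t, feats_t1, metric="cosine")
    # Spatial cost: Euclidean centroid distance, normalized by the gate.
    spatial = cdist(centroids_t, centroids_t1, metric="euclidean")
    cost = alpha * appearance + (1.0 - alpha) * (spatial / max_dist)
    # Forbid matches outside the spatial gate with a large penalty.
    cost[spatial > max_dist] = 1e6
    # Globally optimal one-to-one assignment (Hungarian algorithm).
    rows, cols = linear_sum_assignment(cost)
    # Keep only assignments that respect the gate.
    return [(r, c) for r, c in zip(rows, cols) if spatial[r, c] <= max_dist]

In such a scheme, the appearance term helps carry identities through occlusion and dense packing where centroids alone are ambiguous, while the spatial gate prevents implausible long-range jumps.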
SUMMARY: Lavaee et al. have developed SpatialDINO to surmount the difficulty of identifying, segmenting, and tracking objects in crowded volumes. SpatialDINO is a self-supervised, native 3D vision transformer trained directly on unlabeled fluorescence volumes acquired by live-cell 3D lattice light-sheet microscopy. By learning dense volumetric representations without voxel-level supervision, SpatialDINO generates features that enable fully automated detection, segmentation, and tracking of subcellular structures across a wide range of sizes and morphologies in crowded, anisotropic 3D/4D datasets acquired with different microscopy modalities. The approach generalizes across targets and imaging conditions without further training, reducing dependence on manual annotation while maintaining performance in complex cellular environments.
SIGNIFICANCE: SpatialDINO provides a self-supervised foundation model for analyzing 3D fluorescence microscopy images by adapting DINOv2-style joint-embedding training to learn dense volumetric features directly from unlabeled 3D datasets. By exploiting true 3D context rather than slice-wise "2.5D" aggregation, it enables automated detection and segmentation in crowded, anisotropic, low-contrast volumes and tracking in 4D time-lapse data. SpatialDINO generalizes across targets and imaging conditions without voxel-level annotation or retraining.
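To make the "native 3D rather than 2.5D" distinction concrete, the sketch below shows how a 3D vision transformer can tokenize a whole volume with volumetric patches, so every token carries context along z as well as in-plane. The patch size, embedding dimension, and class name are assumptions for illustration; they are not SpatialDINO's actual configuration.

# Illustrative sketch only: 3D patch embedding for a volumetric ViT,
# in contrast to aggregating per-slice 2D tokens. Hyperparameters are assumed.
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Split a (B, C, Z, Y, X) volume into non-overlapping 3D patches and
    project each patch to a token embedding (one token per 3D patch)."""

    def __init__(self, patch_size=(4, 16, 16), in_chans=1, embed_dim=384):
        super().__init__()
        # A 3D convolution with stride equal to the kernel size performs the
        # patchify-and-linearly-project step in a single operation.
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, D, Z', Y', X')
        return x.flatten(2).transpose(1, 2)  # (B, Z'*Y'*X', D) token sequence

# Usage: a single-channel 32x128x128 volume becomes 8*8*8 = 512 volumetric tokens.
tokens = PatchEmbed3D()(torch.randn(1, 1, 32, 128, 128))
print(tokens.shape)  # torch.Size([1, 512, 384])

Because each token spans several z-planes, downstream attention operates on genuinely volumetric context, which is what allows anisotropic, low-contrast structures to be represented consistently across z-spacings.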