J Am Soc Mass Spectrom. 2026 Jun 15.
Reference MS/MS libraries remain incomplete due to the vast chemical diversity of metabolites, leaving many spectra from untargeted metabolomics experiments unannotated: the "dark matter" of metabolomics. Machine learning can extend metabolite annotation beyond direct library matches, but its success depends critically on how MS/MS spectra are converted into numerical representations that capture chemically meaningful features while reducing sparsity. Although numerous spectral representations exist, they have not been systematically compared. Using over 71,000 unique compounds with merged-energy MS/MS spectra, we benchmarked a broad set of spectral featurization methods, including fixed and adaptive binning, global-quantile variable-width bins, frequent-peaks representations, spectrum hashing, and learned embeddings such as Spec2Vec, MS2DeepScore, DreaMS, and SpecEmbedding. We further evaluated how vector dimensionality affects performance. A total of 105 neural network models were trained under 5-fold cross-validation to predict Mol2Vec molecular embeddings and retrieve correct structures from a 0.6-million-compound database. Retrieval was assessed at 0.1, 3, and 10 ppm mass tolerances, and a null ranking model was generated to determine expected Top-N accuracy under random candidate ordering. Adaptive binning, frequent-peaks, and DreaMS produced the most accurate embedding predictions. On the test data set, Top-1 retrieval reached 46%, 44%, and 38% for 0.1, 3, and 10 ppm, respectively, with Top-5 accuracies up to 77%. In the CASMI2022 data set, Top-1 performance remained similar at 0.1 ppm but dropped markedly at wider tolerances, reaching only 26% at 3 ppm and 23% at 10 ppm. To ensure reproducibility and broad community applicability, results were further validated on two fully open benchmark data sets, MassSpecGym and Spectraverse, with findings consistent across all three resources.
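As an illustration of the simplest featurization named above, fixed binning, the following minimal sketch converts a peak list into a fixed-length vector. The function name `bin_spectrum`, the m/z range, the 1 Da bin width, the base-peak normalization, and the toy peak list are all assumptions chosen for illustration; they are not the parameters used in the study.

```python
import math

def bin_spectrum(peaks, mz_min=50.0, mz_max=1000.0, bin_width=1.0):
    """Map (m/z, intensity) pairs onto fixed-width bins.

    Hypothetical defaults: the range, bin width, and normalization
    are illustrative assumptions, not values taken from the study.
    """
    n_bins = int(math.ceil((mz_max - mz_min) / bin_width))
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:  # peaks outside the range are dropped
            idx = min(int((mz - mz_min) / bin_width), n_bins - 1)
            vec[idx] += intensity  # peaks sharing a bin are summed
    top = max(vec)
    if top > 0:
        vec = [v / top for v in vec]  # scale so the largest bin equals 1
    return vec

# Toy spectrum (made-up values): two peaks fall into the same 1 Da bin,
# and one peak lies outside the binned range.
peaks = [(85.03, 120.0), (85.41, 30.0), (127.04, 300.0), (1200.0, 50.0)]
vector = bin_spectrum(peaks)
nonzero = {i: v for i, v in enumerate(vector) if v > 0}
```

Most entries of the resulting vector are zero, which is the sparsity that adaptive or variable-width schemes aim to reduce.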
These results underscore clear performance differences among featurization strategies, the strong dependence of retrieval accuracy on mass precision, and the need for evaluation metrics aligned with structure-level annotation tasks.
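The dependence of retrieval accuracy on mass precision can be made concrete with a small sketch: a wider ppm tolerance admits more candidates, and under purely random ordering of n candidates the chance that the correct structure falls in the Top-N is min(N, n)/n. The helper names, the four-entry toy database, and its masses are hypothetical, and the study's actual null ranking model may be constructed differently.

```python
def ppm_window(mass, ppm):
    """Return the (low, high) mass bounds for a given ppm tolerance."""
    delta = mass * ppm * 1e-6
    return mass - delta, mass + delta

def candidates_within(mass, ppm, database):
    """Names of database entries whose mass lies inside the ppm window."""
    low, high = ppm_window(mass, ppm)
    return [name for name, m in database if low <= m <= high]

def null_top_n(n_candidates, n):
    """Expected Top-N accuracy when candidates are ordered at random."""
    if n_candidates == 0:
        return 0.0
    return min(n, n_candidates) / n_candidates

# Hypothetical candidate database: (name, monoisotopic mass) with toy values.
database = [
    ("A", 180.06339),
    ("B", 180.06345),
    ("C", 180.06400),
    ("D", 180.06500),
]

query_mass = 180.06339
for ppm in (0.1, 3, 10):
    hits = candidates_within(query_mass, ppm, database)
    print(ppm, len(hits), null_top_n(len(hits), 1))
```

In this toy case the candidate pool grows from one to two to four entries as the tolerance widens, so the random-ordering Top-1 baseline falls accordingly. A baseline of this kind helps separate what a model contributes from what the mass filter alone already achieves.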
Keywords: machine learning; metabolite annotation; spectral featurization; structure retrieval; tandem mass spectrometry; untargeted metabolomics