bims-metlip Biomed News
on Methods and protocols in metabolomics and lipidomics
Issue of 2021-06-27
seventeen papers selected by
Sofia Costa, Cold Spring Harbor Laboratory



  1. Anal Chem. 2021 Jun 22.
      Exploratory mass spectrometry-based metabolomics generates a plethora of features in a single analysis. However, >85% of detected features are typically false positives due to inefficient elimination of chimeric signals and chemical noise not relevant for biological and clinical data interpretation. Data processing is considered a bottleneck in unraveling the translational potential of metabolomics. Here, we describe a systematic workflow to refine exploratory metabolomics data and reduce reported false positives. We applied the feature filtering workflow in a case/control study exploring common variable immunodeficiency (CVID). In the first stage, features were detected from raw liquid chromatography-mass spectrometry data by XCMS Online processing, blank subtraction, and reproducibility assessment. Detected features were annotated in metabolomics databases to produce a list of tentative identifications. We scrutinized the physicochemical properties of the tentative identifications by comparing predicted and experimental reversed-phase liquid chromatography (LC) retention times. The prediction model used a linear regression of 42 retention indices with cLogP ranging from -6 to 11. The LC retention time probes the physicochemical properties and effectively reduces the number of tentatively identified metabolites, which are then submitted to statistical analysis. We applied the retention time-based analytical feature filtering workflow to datasets from the Metabolomics Workbench (www.metabolomicsworkbench.org), demonstrating its broad applicability. A subset of tentatively identified metabolites that differed significantly in CVID patients was validated by MS/MS acquisition to confirm the structures of potential CVID biomarkers and virtually eliminate false positives. Our exploratory metabolomics data processing workflow effectively removes false positives caused by the chemical background and chimeric signals inherent to the analytical technique. It reduced the number of tentatively identified metabolites by 88%, from 6940 features initially detected in XCMS to 839 tentative identifications, and streamlined subsequent statistical analysis and data interpretation.
    DOI:  https://doi.org/10.1021/acs.analchem.1c00816
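      The retention-time filter described in this abstract can be sketched as follows. This is a minimal illustration, assuming an ordinary least-squares fit of retention time against cLogP and a fixed tolerance window. The calibration values, the 1.0 min tolerance, and the function names are hypothetical and are not taken from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def filter_by_retention_time(candidates, slope, intercept, tolerance_min):
    """Keep candidates whose observed RT lies within tolerance of the predicted RT."""
    kept = []
    for name, clogp, observed_rt in candidates:
        predicted_rt = slope * clogp + intercept
        if abs(observed_rt - predicted_rt) <= tolerance_min:
            kept.append(name)
    return kept


# Hypothetical calibration standards: (cLogP, retention time in minutes).
calibration = [(-2.0, 1.0), (0.0, 3.0), (2.0, 5.0), (4.0, 7.0)]
slope, intercept = fit_line([c for c, _ in calibration],
                            [rt for _, rt in calibration])

# Hypothetical tentative identifications: (name, cLogP, observed RT in minutes).
candidates = [
    ("candidate_A", 1.0, 4.2),  # close to the predicted RT, kept
    ("candidate_B", 3.0, 1.5),  # far from the predicted RT, rejected
]
kept = filter_by_retention_time(candidates, slope, intercept, tolerance_min=1.0)
print(kept)
```

      In practice the tolerance would be derived from the residuals of the calibration fit rather than fixed by hand.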
  2. J Pharm Biomed Anal. 2021 Jun 15. pii: S0731-7085(21)00319-8. [Epub ahead of print]203 114208
      With continuously increasing scan rates and sensitivity, high-resolution mass spectrometry (HRMS) allows both reliable targeted analysis (e.g., parallel reaction monitoring, PRM) and discovery-based untargeted profiling (e.g., data-dependent acquisition, DDA) to be performed. Building on a previous study of PRM for large-scale targeted metabolomics quantification, we developed an innovative method that merges targeted and untargeted approaches in a single run. In our workflow, scheduled PRM for targeted analysis of amino acids and derivatives was acquired together with a full scan in every sample injection by hydrophilic interaction liquid chromatography coupled to quadrupole-Orbitrap high-resolution mass spectrometry (HILIC-Q-Orbitrap HRMS). Metabolic features from the full scan were then identified by DDA on grouped quality control (QC) samples and matched against available databases. Specifically, 20 amino acids and 40 derivatives were selected for targeted analysis with optimal chromatographic separation and PRM parameters. All isomers among the selected metabolites were fully separated under the robust HILIC conditions. Thirty-six of the selected metabolites were well detected and showed good linearity and reproducibility in NIST SRM 1950 plasma. Moreover, the absolute quantification performance of the targeted PRM method was systematically validated using 10 amino acids with the corresponding stable isotope-labeled internal standards (SIL-IS). Finally, the newly developed method was successfully applied to the analysis of plasma samples from patients with benign pancreatic tumors and pancreatic cancer. The significant reduction of circulating amino acids in patients with pancreatic malignancy was confirmed by the targeted PRM method, and other amino acid modifications as well as polar metabolites were identified by untargeted profiling. We have therefore established a workflow that combines a specific and reliable targeted PRM method with broad-coverage untargeted profiling, providing an innovative strategy for basic and clinical metabolomics studies.
    Keywords:  Amino acids and derivatives; HILIC; PRM; Q-Orbitrap HRMS; Untargeted profiling
    DOI:  https://doi.org/10.1016/j.jpba.2021.114208
  3. J Pharm Biomed Anal. 2021 Jun 16. pii: S0731-7085(21)00321-6. [Epub ahead of print]203 114210
      An on-line supercritical fluid extraction coupled with supercritical fluid chromatography-quadrupole tandem mass spectrometry method was developed to determine lipids related to inflammation in brain tissues of depressed rats. The analysis of 23 lipids from extraction to separation and detection only took 15 min and required 1 mg of brain tissue powder. The matrix effect of the on-line method for endogenous lipids was systematically investigated, and targeted lipids were quantified by matrix effect corrected calibration curves in this study. The on-line method was comprehensively optimized and evaluated. All calibration curves for lipids showed good linearity (correlation coefficient >0.99). The limits of detection and the limits of quantification were in the range of 0.0261-0.396 pg and 0.0791-1.20 pg. The recoveries and the matrix effect were in the range of 85.3-117.5% and 51.9-176.6%, respectively. The relative standard deviations of precision ranged from 2.7 to 14.2%, with accuracies higher than 87.2%. Compared with liquid-liquid extraction coupled with liquid chromatography-tandem mass spectrometry method, the on-line method obtained higher recovery and sensitivity with significantly reduced analytical time, manual operations, and sample amounts. Finally, this on-line method was applied to analyses of brain tissues of depressed rats. Six pro-inflammatory lipids increased in depressed rats, while six anti-inflammatory lipids decreased. Liquiritin and fluoxetine were presumed to promote a similar synthesis of anti-inflammatory lipids. Based on the results, this on-line method showed great promise in analyzing lipids in complex biological samples.
    Keywords:  Lipids; Matrix effect; On-line technique; Supercritical fluid chromatography; Supercritical fluid extraction
    DOI:  https://doi.org/10.1016/j.jpba.2021.114210
  4. Anal Chim Acta. 2021 Aug 15. pii: S0003-2670(21)00500-6. [Epub ahead of print]1173 338674
      Liquid chromatography-mass spectrometry (LC-MS)-based lipidomics generates large datasets that need to be interpreted using high-performance data pre-processing tools such as XCMS, mzMine, and Progenesis. These pre-processing tools rely heavily on accurate peak detection, which depends on proper setting of the peak detection mass tolerance (PDMT). The PDMT is usually set to a fixed value in either ppm or Da units. However, a fixed value may result in duplicate or missed peaks and inaccurate peak quantification. To improve the accuracy of peak detection, we developed the dynamic binning method, which considers peak broadening described by the physics of ion separation and sets the PDMT dynamically as a function of m/z. In our method, the PDMT is proportional to (m/z)^2 for Fourier-transform ion cyclotron resonance (FTICR), to (m/z)^1.5 for Orbitrap, and to m/z for quadrupole time-of-flight (Q-TOF) analyzers, and is constant for quadrupole mass analyzers. The dynamic binning method was implemented in XCMS [1,2], and the adapted source code is available on GitHub at https://github.com/xiaodfeng/DynamicXCMS. We compared the performance of XCMS with dynamic binning against popular lipidomics pre-processing tools in finding differential compounds. We generated a set of samples in which 43 lipid internal standards were differentially spiked into aliquots of one human plasma lipid sample, and analyzed them by Orbitrap LC-MS/MS. The performance of various pipelines using matched parameter sets was quantified by a quality score system that reflects the ability of a pre-processing pipeline to detect differential peaks spiked at various concentrations. The quality score indicated that our dynamic binning method improves the quantification performance of XCMS (maximum p-value 9.8 × 10^-3, two-sample Wilcoxon test) over its original implementation. We also showed that XCMS with dynamic binning detected differentially spiked-in lipids as well as or better than mzMine and Progenesis.
    Keywords:  Dynamic binning; EIC construction; LC-MS pre-Processing; Lipidomics; Peak detection
    DOI:  https://doi.org/10.1016/j.aca.2021.338674
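      The m/z-dependent tolerance described in this abstract can be illustrated with a short sketch. The exponents come from the abstract; the anchoring of the tolerance at a reference m/z, the default values, and the function name are assumptions made for illustration and do not reproduce the DynamicXCMS code.

```python
# Exponent of m/z for each analyzer type, as stated in the abstract.
PDMT_EXPONENT = {"fticr": 2.0, "orbitrap": 1.5, "qtof": 1.0, "quadrupole": 0.0}


def dynamic_pdmt(mz, analyzer, ref_mz=200.0, ref_tolerance_da=0.001):
    """Return a mass tolerance in Da that scales as (m/z) ** exponent.

    ref_mz and ref_tolerance_da are hypothetical anchor values: the tolerance
    equals ref_tolerance_da at ref_mz and changes with m/z from there.
    """
    exponent = PDMT_EXPONENT[analyzer]
    return ref_tolerance_da * (mz / ref_mz) ** exponent


for analyzer in PDMT_EXPONENT:
    print(analyzer, dynamic_pdmt(800.0, analyzer))
```

      A fixed ppm setting corresponds to the exponent-1 case (tolerance in Da proportional to m/z), and a fixed Da setting to the exponent-0 case, so the sketch covers both conventional choices as special cases.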
  5. Nat Commun. 2021 06 22. 12(1): 3832
      Molecular networking connects mass spectra of molecules based on the similarity of their fragmentation patterns. However, during ionization, molecules commonly form multiple ion species with different fragmentation behavior. As a result, the fragmentation spectra of these ion species often remain unconnected in tandem mass spectrometry-based molecular networks, leading to redundant and disconnected sub-networks of the same compound classes. To overcome this bottleneck, we develop Ion Identity Molecular Networking (IIMN) that integrates chromatographic peak shape correlation analysis into molecular networks to connect and collapse different ion species of the same molecule. The new feature relationships improve network connectivity for structurally related molecules, can be used to reveal unknown ion-ligand complexes, enhance annotation within molecular networks, and facilitate the expansion of spectral reference libraries. IIMN is integrated into various open source feature finding tools and the GNPS environment. Moreover, IIMN-based spectral libraries with a broad coverage of ion species are publicly available.
    DOI:  https://doi.org/10.1038/s41467-021-23953-9
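      The idea of linking ion species through chromatographic peak shape correlation can be sketched as below. The abstract does not specify the correlation measure or cutoff, so Pearson correlation, the 0.85 threshold, and the example profiles are assumptions for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equally sampled intensity profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlated_pairs(profiles, min_r=0.85):
    """Return pairs of features whose peak shapes correlate at or above min_r."""
    return [
        (name_a, name_b)
        for (name_a, prof_a), (name_b, prof_b) in combinations(profiles.items(), 2)
        if pearson(prof_a, prof_b) >= min_r
    ]


# Hypothetical intensity profiles of three features sampled at the same scans.
profiles = {
    "[M+H]+": [0, 10, 50, 100, 50, 10, 0],
    "[M+Na]+": [0, 4, 20, 40, 20, 4, 0],   # same shape, lower intensity
    "other": [100, 50, 10, 0, 0, 0, 0],    # different elution profile
}
pairs = correlated_pairs(profiles)
print(pairs)
```

      Features linked this way are candidates for being ion species of one molecule; a real workflow would additionally check that their m/z difference matches a known adduct or in-source modification.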
  6. J Proteome Res. 2021 Jun 23.
      Large-scale untargeted LC-MS-based metabolomic profiling is a valuable source for systems biology and biomarker discovery. Data analysis and processing are major tasks due to the high complexity of generated signals and the presence of unwanted variations. In the present study, we introduce an R-based open-source collection of scripts called OUKS (Omics Untargeted Key Script), which provides comprehensive data processing. OUKS integrates various R packages and metabolomics software tools and can be easily set up and adapted to create a custom pipeline. Novel computational features relate to quality control sample-based signal processing and are implemented with gradient boosting, tree-based, and other nonlinear regression algorithms. A bladder cancer biomarker discovery study based on untargeted LC-MS profiling of urine samples was performed to demonstrate the full functionality of the developed software tool. At each processing step, a unique comparison of dozens of metabolomics-specific data curation methods was carried out. As a result, potential biomarkers were identified, statistically validated, and interpreted in terms of metabolic disorders. Our study demonstrates that OUKS makes untargeted LC-MS metabolomic profiling with the latest computational features readily accessible to the research community in a unified, ready-to-use manner.
    Keywords:  R programming; bladder cancer; data analysis; metabolomics; untargeted profiling
    DOI:  https://doi.org/10.1021/acs.jproteome.1c00392
  7. Curr Protoc. 2021 Jun;1(6): e177
      Short-chain fatty acids (SCFAs) are produced mainly by intestinal microbiota and play an important role in many host biological processes such as immune system development, glucose and energy homeostasis, and regulation of immune response and inflammation. In addition, they participate in the regulation of anorectic hormones, which have a role in appetite control, tumor suppression, and regulating the central and peripheral nervous systems. As such, there is great interest in monitoring levels of SCFAs in various biological samples. Due to the highly hydrophilic and volatile characteristics of SCFAs, optimizing extraction and sample preparation procedures is often a central component to further improve SCFA quantification. Here, we describe a rapid and highly sensitive analytical method for measuring SCFAs in human serum and feces. Briefly, SCFAs are protected by adding sodium hydroxide, followed by a one-step extraction (pH > 7). Then, SCFAs are quantified by gas chromatography coupled to mass spectrometry (GC-MS) after derivatization with N-tert-butyldimethylsilyl-N-methyltrifluoroacetamide (MTBSTFA). This method demonstrates excellent sensitivity, linearity, and derivatization efficiency for simultaneous determination of 14 different SCFAs. Further, this validated method can be successfully applied to quantify SCFAs in micro-scale biological samples. In summary, we describe efficient and advanced sample preparation and detection procedures that are critically needed for monitoring SCFA concentrations in human biological samples. © 2021 Wiley Periodicals LLC. Basic Protocol: SCFA extraction and detection from fecal and serum samples with gas chromatography-mass spectrometry.
    Keywords:  gas chromatography; mass spectrometry; metabolomics; microbiome; short-chain fatty acids
    DOI:  https://doi.org/10.1002/cpz1.177
  8. World J Gastrointest Oncol. 2021 Jun 15. 13(6): 536-549
      Metabolites are versatile bioactive molecules. They are not only the substrates and/or the products of enzymatic reactions but also act as the regulators in the systemic metabolism. Metabolomics is a high-throughput analytical strategy to qualify or quantify as many metabolites as possible in the metabolomes. It is an indispensable part of systems biology. The leading techniques in this field are mainly based on mass spectrometry and nuclear magnetic resonance spectroscopy. The metabolomic analysis has gained wide use in bioscience fields. In the tumor research arena, metabolomics can be employed to identify biomarkers for prediction, diagnosis, and prognosis. Chemotherapeutic effect evaluation and personalized medicine decision-making can also benefit from metabolomic analysis of patient biofluid or biopsy samples. Many cell-level studies can help in disease exploration. In this review, the basic features and principles of varied metabolomic analysis are introduced. The value of metabolomics in clinical and laboratory gastrointestinal cancer studies is discussed, especially for mass spectrometry applications. Besides, combined use of metabolomics and other tools to solve problems in cancer practice is briefly illustrated. In summary, metabolomics paves a new way to explore cancerous diseases in the light of small molecules.
    Keywords:  Biomarker; Diagnosis; Gastrointestinal cancer; Mass spectrometry; Metabolite; Metabolomics
    DOI:  https://doi.org/10.4251/wjgo.v13.i6.536
  9. Anal Bioanal Chem. 2021 Jun 22.
      The stability of lipids and other metabolites in human body fluids ranges from very stable over several days to very unstable within minutes after sample collection. Since the high-resolution analytics of metabolomics and lipidomics approaches comprise all these compounds, the handling of body fluid samples, and thus the pre-analytical phase, is of utmost importance to obtain valid profiling data. This phase consists of two parts, sample collection in the hospital ("bedside") and sample processing in the laboratory ("bench"). For sample quality, the apparently simple steps in the hospital are much more critical than the "bench" side handling, where (bio)analytical chemists focus on highly standardized processing for high-resolution analysis under well-controlled conditions. This review discusses the most critical pre-analytical steps for sample quality from patient preparation; collection of body fluids (blood, urine, cerebrospinal fluid) to sample handling, transport, and storage in freezers; and subsequent thawing using current literature, as well as own investigations and practical experiences in the hospital. Furthermore, it provides guidance for (bio)analytical chemists to detect and prevent potential pre-analytical pitfalls at the "bedside," and how to assess the quality of already collected body fluid samples. A knowledge base is provided allowing one to decide whether or not the sample quality is acceptable for its intended use in distinct profiling approaches and to select the most suitable samples for high-resolution metabolomics and lipidomics investigations.
    Keywords:  Blood; Cerebrospinal fluid; Lipidomics; Metabolomics; Plasma; Pre-analytic; Serum; Urine
    DOI:  https://doi.org/10.1007/s00216-021-03450-0
  10. Anal Chem. 2021 Jun 23.
      There is a current need to monitor human exposure to a large number of pesticides and other chemicals of emerging concern (CECs). This requires screening analysis with high confidence for these compounds and their metabolites in complex matrices, which is hampered by the fact that no reference standards are available for most metabolites. We address this challenge by a high-throughput workflow based on incubation of pesticides (or other CECs) with human liver S9, followed by solid-phase extraction, liquid chromatography-high-resolution mass spectrometry (LC-HRMS) analysis, and automated data processing to generate a database (retention time, precursor m/z, and MS2 spectral library) for the annotation in human samples. The metabolite prioritization consists of statistical comparisons and mass defect and m/z range filtering to obtain a subset of probable phase I metabolites, for which molecular formulas and likely metabolic transformation are retrieved. We tested the workflow on 22 pesticides, for which we could determine 91 metabolite molecular formulas which are only partly covered by the literature and/or predicted by in silico metabolization. Our workflow allows for an efficient generation of metabolite reference information, which can be used directly for annotating LC-HRMS data from human samples. A full structure elucidation of individual metabolites can be limited to those being actually present in human samples.
    DOI:  https://doi.org/10.1021/acs.analchem.1c00972
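      The mass defect and m/z range filtering step mentioned in this abstract can be sketched as follows. The window sizes, the parent m/z, and the candidate features are hypothetical; the abstract does not give the thresholds actually used.

```python
def mass_defect(mz):
    """Difference between the measured m/z and the nearest integer mass."""
    return mz - round(mz)


def prioritize(features, parent_mz, defect_window=0.05, mz_below=100.0, mz_above=50.0):
    """Keep features close to the parent in mass defect and within an m/z range.

    All three window parameters are illustrative assumptions.
    """
    parent_defect = mass_defect(parent_mz)
    kept = []
    for mz in features:
        in_range = parent_mz - mz_below <= mz <= parent_mz + mz_above
        similar_defect = abs(mass_defect(mz) - parent_defect) <= defect_window
        if in_range and similar_defect:
            kept.append(mz)
    return kept


parent_mz = 300.1000                      # hypothetical parent compound
features = [316.0949,                     # parent + O (hydroxylation), kept
            450.3500,                     # outside m/z range and defect window
            250.4000,                     # in range, mass defect too different
            120.0800]                     # similar defect, below m/z range
kept = prioritize(features, parent_mz)
print(kept)
```

      The rationale is that phase I transformations such as hydroxylation change the mass defect only slightly, so features with a very different mass defect are unlikely to derive from the parent compound.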
  11. Biomed Chromatogr. 2021 Jun 24. e5204
      To investigate the clinical pharmacokinetics of CA4P, a high-throughput high-performance liquid chromatography-tandem mass spectrometry assay using a single positive electrospray ionization (ESI) mode was developed for the simultaneous determination of CA4P, its active metabolite CA4, and CA4 glucuronide (CA4G) in human plasma. CA4P and CA4 are readily protonated in positive ESI mode, whereas CA4G has been reported to produce a deprotonated ion in negative ESI mode. Since baseline separation of CA4P and CA4G could not be achieved, MS positive/negative ion switching was not feasible. In this study, an abundant ammonium adduct ion of CA4G was observed in ESI+ and served as an ideal precursor ion. The final precursor/product transitions chosen for CA4P, CA4, and CA4G were m/z 397/350, 317/286, and 510/317, respectively. To our knowledge, this is the first report of the simultaneous quantification of CA4P, CA4, and CA4G in biological samples. The validated method showed a wide linear dynamic range, high selectivity and sensitivity, good repeatability, and a short run time. Compared with the literature, the lower limits of quantification were 5- and 2-fold lower for CA4G and CA4, respectively. The method was successfully applied to a pharmacokinetic study of CA4P in a phase I clinical trial.
    Keywords:  Ammonium adduct ion; Combretastatin; Human plasma; LC-MS/MS; Metabolite
    DOI:  https://doi.org/10.1002/bmc.5204
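      The precursor ions in this abstract can be checked with a short monoisotopic mass calculation. The molecular formulas below (CA4 as C18H20O5, CA4P as the free acid C18H21O8P, and CA4G as C24H28O11) are assumptions that are not stated in the abstract; the atomic masses are standard monoisotopic values.

```python
# Standard monoisotopic atomic masses (Da) and the electron mass.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
                "O": 15.9949146, "P": 30.9737620}
ELECTRON = 0.00054858

# Atoms added by each positive-mode adduct (one electron is removed afterwards).
ADDUCTS = {"[M+H]+": {"H": 1}, "[M+NH4]+": {"N": 1, "H": 4}}

# Assumed molecular formulas; not given in the abstract.
FORMULAS = {
    "CA4": {"C": 18, "H": 20, "O": 5},
    "CA4P": {"C": 18, "H": 21, "O": 8, "P": 1},
    "CA4G": {"C": 24, "H": 28, "O": 11},
}


def neutral_mass(formula):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def adduct_mz(formula, adduct):
    """m/z of a singly charged positive adduct ion."""
    return neutral_mass(formula) + neutral_mass(ADDUCTS[adduct]) - ELECTRON


print("CA4  [M+H]+  ", round(adduct_mz(FORMULAS["CA4"], "[M+H]+"), 3))
print("CA4P [M+H]+  ", round(adduct_mz(FORMULAS["CA4P"], "[M+H]+"), 3))
print("CA4G [M+NH4]+", round(adduct_mz(FORMULAS["CA4G"], "[M+NH4]+"), 3))
```

      Under these assumed formulas the computed values round to m/z 317, 397, and 510, which is consistent with the precursor ions listed in the abstract, including the ammonium adduct chosen for CA4G.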
  12. Anal Chem. 2021 Jun 22.
      Incorporating safety data early in the drug discovery pipeline is key to reducing costly lead candidate failures. For a single drug development project, we estimate that several thousand samples per day require screening (<10 s per acquisition). While chromatography-based metabolomics has proven value at predicting toxicity from metabolic biomarker profiles, it lacks sufficiently high sample throughput. Acoustic mist ionization mass spectrometry (AMI-MS) is an atmospheric pressure ionization approach that can measure metabolites directly from 384-well plates with unparalleled speed. We sought to implement a signal processing and data analysis workflow to produce high-quality AMI-MS metabolomics data and to demonstrate its application to drug safety screening. An existing direct infusion mass spectrometry workflow was adapted, extended, optimized, and tested, utilizing three AMI-MS data sets acquired from technical and biological replicates of metabolite standards and HepG2 cell lysates and a toxicity study. Driven by criteria to minimize variance and maximize feature counts, an algorithm to extract the pulsed scan data was designed; parameters for signal-to-noise-ratio, replicate filter, sample filter, missing value filter, and RSD filter were all optimized; normalization and batch correction strategies were adapted; and cell phenotype filtering was implemented to exclude high cytotoxicity samples. The workflow was demonstrated using a highly replicated HepG2 toxicity data set, comprising 2772 samples from exposures to 16 drugs across 9 concentrations and generated in under 5 h, revealing metabolic phenotypes and individual metabolite changes that characterize specific modes of action. This AMI-MS workflow opens the door to ultrahigh-throughput metabolomics screening, increasing the rate of sample analysis by approximately 2 orders of magnitude.
    DOI:  https://doi.org/10.1021/acs.analchem.1c01616
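      One of the filters listed in this abstract, the RSD filter, can be sketched as follows. The 30% cutoff and the example intensities are hypothetical; the abstract states only that the RSD filter parameter was optimized.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation of replicate intensities, in percent."""
    return 100.0 * stdev(values) / mean(values)


def rsd_filter(features, max_rsd=30.0):
    """Keep features whose replicate RSD does not exceed max_rsd (assumed cutoff)."""
    return [name for name, values in features.items()
            if rsd_percent(values) <= max_rsd]


# Hypothetical replicate intensities for two features.
features = {
    "feature_1": [100, 102, 98, 101],   # reproducible
    "feature_2": [100, 10, 250, 40],    # highly variable
}
kept = rsd_filter(features)
print(kept)
```

      The same pattern extends to the other filters named in the abstract (replicate, sample, and missing value filters), each of which removes features that fail a reproducibility or completeness criterion.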
  13. Rapid Commun Mass Spectrom. 2021 Jun 25. e9155
      RATIONALE: Biobanks of patient tissues have emerged as essential resources in biomedical research. Optimal cutting temperature (OCT) blends have been shown to stabilize the embedded tissue and are compatible with spectroscopic methods such as infrared (IR) and Raman spectroscopy. Data derived from omics methods are only useful if tissue damage caused by storage in OCT is minimal and well understood. In this context, we investigated the suitability of OCT storage for heart tissue destined for LC-MS/MS lipidomic studies.
    METHODS: To determine the compatibility of OCT storage with LC-MS/MS lipidomics studies, the lipid profiles of macaque heart tissue snap-frozen in liquid nitrogen or stored in OCT were evaluated.
    RESULTS: We evaluated a lipid extraction protocol suitable for OCT-embedded tissue that is compatible with LC-MS/MS. We annotated and evaluated the profiles of 306 lipid species from tissues stored in OCT or liquid nitrogen. For most of the lipid species (95.4%), the profiles were independent of the storage conditions. However, 4.6% of the lipid species, mainly plasmalogens, were affected by the storage method.
    CONCLUSION: This study shows that OCT storage is compatible with LC-MS/MS lipidomics of heart tissue, facilitating the use of biobanked tissue samples for future studies.
    DOI:  https://doi.org/10.1002/rcm.9155
  14. J Chromatogr A. 2021 Jun 08. pii: S0021-9673(21)00458-1. [Epub ahead of print]1651 462334
      An on-surface multi-purpose autosampler was built for liquid chromatography-mass spectrometry (LC-MS) based on the autoTLC-MS interface, taking advantage of open-source hard- and software developments as well as 3D printing. Termed autoTLC-LC-MS system, it is introduced for orthogonal hyphenation of normal phase high-performance thin-layer chromatography with reversed phase high-performance LC (HPLC) and high-resolution MS (HRMS). For verification of its functionality, a multi-class antibiotic mixture was applied as a calibration band pattern on an adsorbent layer and detected by the Bacillus subtilis bioassay. This effect-image was uploaded as a template in the updated TLC-MS_manager software. The clicked-on antibiotic zones were sequentially eluted without intervention from the planar counterpart (without bioassay) via a monolithic HPLC column into the HRMS system. For elution of antibiotics of 7 structural classes at 5 different calibration levels, the new on-surface autosampler achieved intra-day precisions of 2.1-14.1%, while inter-day precisions ranged 2.5-16.1% (all n = 3). The new hyphenation offers potential for planar sample clean-up prior to HPLC, concentration of liquid samples, increase of peak capacity and proof of peak purity or isomers. The integrated autoTLC-LC-MS system enabled high sample throughput, efficiency and reproducibility for the first time through fully automated TLC-LC-MS sequence operation. Its contact-closure signal functionality, versatile 3D printed planar sample holder and open-source software made it readily adjustable for new analytical tasks. Undoubtedly, any planar material can be investigated for leachables, such as textiles, foils, papers and other packagings, as well as planar biological samples for ingredients.
    Keywords:  Orbitrap high-resolution mass spectrometry; Orthogonal hyphenation; Planar chromatography; autoTLC–LC–MS
    DOI:  https://doi.org/10.1016/j.chroma.2021.462334
  15. Anal Chem. 2021 Jun 22.
      The use of quality control samples in metabolomics ensures data quality, reproducibility, and comparability between studies, analytical platforms, and laboratories. Long-term, stable, and sustainable reference materials (RMs) are a critical component of the quality assurance/quality control (QA/QC) system; however, the limited selection of currently available matrix-matched RMs reduces their applicability for widespread use. To produce an RM in any context and for any matrix that is robust to changes over time, we developed the iterative batch averaging method (IBAT). To illustrate this method, we generated 11 independently grown Escherichia coli batches and made an RM over the course of 10 IBAT iterations. We measured the variance of these materials by nuclear magnetic resonance (NMR) and showed that IBAT produces a stable and sustainable RM over time. This E. coli RM was then used as a food source to produce a Caenorhabditis elegans RM for a metabolomics experiment. The metabolite extraction of this material, alongside 41 independently grown individual C. elegans samples of the same genotype, allowed us to estimate the proportion of sample variation arising in preanalytical steps. From the NMR data, we found that 40% of the metabolite variance is due to the metabolite extraction process and analysis and 60% is due to sample-to-sample variance. The availability of RMs in untargeted metabolomics is one of the predominant needs of the metabolomics community and reaches beyond quality control practices. IBAT addresses this need by facilitating the production of biologically relevant RMs and increasing their widespread use.
    DOI:  https://doi.org/10.1021/acs.analchem.1c01294
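      The variance split reported in this abstract can be illustrated with a simple calculation for a single metabolite. This sketch assumes that replicate extractions of the reference material capture only technical variance, while individual samples capture technical plus sample-to-sample variance. The numbers are invented to reproduce a 40/60 split and are not data from the paper, whose statistical procedure is not described in the abstract.

```python
from statistics import variance

# Hypothetical intensities of one metabolite.
rm_replicates = [8, 10, 12]             # repeated extractions of the RM
individual_samples = [6, 8, 10, 12, 14]  # independently grown samples

technical_var = variance(rm_replicates)      # extraction + analysis only
total_var = variance(individual_samples)     # technical + sample-to-sample

technical_fraction = technical_var / total_var
biological_fraction = 1.0 - technical_fraction

print(f"technical: {technical_fraction:.0%}, sample-to-sample: {biological_fraction:.0%}")
```

      The decomposition relies on the technical and sample-to-sample contributions being independent, so that their variances add.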
  16. J Chromatogr A. 2021 Jun 19. pii: S0021-9673(21)00439-8. [Epub ahead of print]1651 462315
      In this work, two different acquisition approaches were used for the quantification and/or tentative identification of phenolic compounds (PCs) in plant matrices by HPLC-MS/MS. A targeted approach, based on the MRM acquisition mode, was used for the identification and quantification of a list of target analytes by comparison with standards; a semi-targeted approach based on precursor ion and neutral loss scans was also developed for the tentative identification of compounds not included in the target list. The phenolic content of three different plant matrices (curry leaves, hemp, and blueberry) was analyzed. The extraction and clean-up steps were tailored to the characteristics of each sample to minimize the interfering compounds present in such complex matrices, as shown by the low matrix effect obtained (<16%) and recovery values ranging from 45% to 98% for all the analytes. This approach provided a sensitive and robust quantitative analysis of the target compounds with LOQs between 0.0002 and 0.05 ng mg^-1, which allowed the identification and quantification of several hydroxycinnamic and hydroxybenzoic acids, in addition to numerous flavonoids, in all three matrices. Furthermore, different moieties were considered as neutral losses or as precursor ions in the semi-targeted MS/MS approach, providing the putative identification of different glycosylated forms of flavonoids, such as luteolin-galactoside and diosmin in all three matrices, while apigenin-glucuronide was detected in hemp and quercetin-glucuronide in blueberry. A further study was carried out by MS3, allowing the discrimination of compounds with similar aglycones, such as luteolin and kaempferol.
    Keywords:  HPLC-MS/MS; MS(3); Neutral loss scan; Polyphenols; Precursor ion scan
    DOI:  https://doi.org/10.1016/j.chroma.2021.462315
  17. Rapid Commun Mass Spectrom. 2021 Jun 25. e9153
      RATIONALE: Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data being generated. Here we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra.
    METHODS: falcon efficiently clusters large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively perform all pairwise spectrum comparisons within the relevant precursor mass tolerance. Finally, density-based clustering is performed to group similar spectra into clusters.
    RESULTS: Several state-of-the-art spectrum clustering tools were evaluated using a large draft human proteome dataset consisting of 25 million spectra, indicating that alternative tools produce clustering results with different characteristics. Notably, falcon generates larger highly pure clusters than alternative tools, leading to a larger reduction in data volume without the loss of relevant information for more efficient downstream processing.
    CONCLUSIONS: falcon is a highly efficient spectrum clustering tool. It is publicly available as open source under the permissive BSD license at https://github.com/bittremieux/falcon.
    DOI:  https://doi.org/10.1002/rcm.9153
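      The first step of the method, binning spectra and converting them to low-dimensional vectors by feature hashing, can be sketched as below. This is not falcon's code: the bin width, the vector size, and the use of a simple modulo as the hash function are assumptions chosen to keep the example short, and the nearest neighbor indexing and density-based clustering steps are not shown.

```python
from math import sqrt


def hash_spectrum(peaks, bin_width=0.05, dim=64):
    """Bin peaks by m/z and fold the bins into a unit-length vector of size dim.

    The modulo acts as a stand-in hash function (an assumption for this sketch).
    """
    vector = [0.0] * dim
    for mz, intensity in peaks:
        bin_index = int(mz / bin_width)
        vector[bin_index % dim] += intensity
    norm = sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical MS/MS spectra as (m/z, intensity) pairs.
spectrum_a = [(100.01, 10), (200.02, 50), (300.03, 20)]
spectrum_b = [(100.01, 12), (200.02, 45), (300.03, 25)]  # same peaks as a
spectrum_c = [(150.52, 30), (250.77, 40)]                # unrelated peaks

vec_a, vec_b, vec_c = (hash_spectrum(s) for s in (spectrum_a, spectrum_b, spectrum_c))
print(cosine(vec_a, vec_b), cosine(vec_a, vec_c))
```

      Hashing keeps the vector length fixed regardless of the m/z range, at the cost of occasional collisions in which unrelated bins share a slot; a larger vector size reduces that risk.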