BMC Methods. 2025;2(1):16
Background: The human genome contains over 3 million small open reading frames (smORFs, ≤ 150 codons). Ribosome profiling and proteogenomics have transformed our understanding of these sequences by showing that thousands are actively translated and that hundreds produce peptides detectable by mass spectrometry. However, the random arrangement of codons across the 3-gigabase human genome generates smORFs by chance, suggesting that many may represent translational noise or regulatory elements rather than functional proteins. Consistent with this, most translating smORFs are upstream open reading frames (uORFs), which typically regulate translation of canonical coding sequences rather than encode bioactive microproteins. As interest grows in uncovering biologically meaningful microproteins, a key challenge remains: distinguishing functional smORFs from non-functional or regulatory translation products. Although empirical methods such as individual microprotein studies or large-scale screens can help, these approaches are time-consuming and expensive and have technical limitations. New complementary strategies are needed.
Methods: To address this challenge, we developed ShortStop, a computational framework based on the idea that not all translating smORFs produce functional proteins, but the ones that do may resemble experimentally characterized microproteins. ShortStop classifies smORFs into two reference groups: Swiss-Prot Analog Microproteins (SAMs), which resemble known microproteins, and PRISMs (Physicochemically Resembling In Silico Microproteins), which are synthetic sequences designed to match the composition of translating smORFs but lacking sequence order or evolutionary selection, and therefore serving as a proxy for non-functional peptides. This two-class system enables machine learning to help prioritize smORFs for downstream study.
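The idea of a negative class that matches composition but lacks sequence order can be illustrated with a minimal sketch. This is not ShortStop's actual PRISM generation procedure, which the abstract does not specify; it assumes the simplest possible approach, shuffling the residues of each input sequence so that amino acid composition and length are preserved while residue order is randomized. The function names `shuffle_sequence` and `make_decoys` and the example peptides are hypothetical.

```python
import random
from collections import Counter


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Return a permutation of seq: same residues and length, random order."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def make_decoys(sequences, seed=0):
    """Build one composition-matched decoy per input sequence.

    Assumption: simple residue shuffling stands in for whatever
    procedure is used to generate real PRISMs.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [shuffle_sequence(s, rng) for s in sequences]


if __name__ == "__main__":
    # Hypothetical peptide sequences, for illustration only.
    peptides = ["MKTLLVAGGSW", "MSPQRWLE"]
    for original, decoy in zip(peptides, make_decoys(peptides, seed=1)):
        # Composition is identical by construction; only the order differs.
        assert Counter(original) == Counter(decoy)
        print(original, "->", decoy)
```

Shuffling each sequence individually keeps every decoy's composition identical to that of its source; sampling residues from pooled frequencies would instead match composition only on average.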
Results: ShortStop achieved high precision (90-94%), recall (87-96%), and F1 scores (90-93%) across all classes. When applied to a published dataset of translating smORFs, ShortStop classified about 8% as SAMs, that is, candidates with biochemical properties resembling Swiss-Prot microproteins. The remaining 92% were classified as PRISMs, resembling in silico generated sequences and representing noncanonical proteins, non-functional peptides, or regulatory translation events. SAMs showed lower C-terminal hydrophobicity (linked to reduced proteasomal degradation) and greater N-terminal hydrophilicity at neutral pH, suggesting improved solubility and intracellular stability. ShortStop also identified microproteins overlooked by other methods, including one encoded by an upstream overlapping smORF in the StAR gene, which was detectable in human cells and steroid-producing tissues. In a clinical lung cancer dataset, ShortStop uncovered differentially expressed microprotein candidates, several of which were validated by mass spectrometry.
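For readers unfamiliar with the per-class metrics reported above, the following sketch shows how precision, recall, and F1 are computed for one class from true and predicted labels. The labels below are made-up toy data, not results from the study, and the helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Compute (precision, recall, F1), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy labels, for illustration only.
    y_true = ["SAM", "SAM", "PRISM", "PRISM", "PRISM"]
    y_pred = ["SAM", "PRISM", "PRISM", "PRISM", "SAM"]
    for label in ("SAM", "PRISM"):
        p, r, f = per_class_metrics(y_true, y_pred, label)
        print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Reporting the metrics per class, as the abstract does with its ranges, matters when classes are imbalanced, because a single overall accuracy figure can hide poor performance on the minority class.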
Discussion: ShortStop addresses a key gap in microprotein research: the lack of scalable tools to characterize microproteins and of standardized negative training data for machine learning models. By providing a classification framework rooted in biochemical features, ShortStop offers a practical solution for targeting smORFs in functional studies, benchmarking new discovery tools, and advancing microprotein research.
Supplementary Information: The online version contains supplementary material available at 10.1186/s44330-025-00037-4.
Keywords: Cancer; De Novo genes; Machine learning; Microprotein; Peptides; Proteogenomics; Ribosome profiling; Small open reading frame; Steroidogenic acute regulatory protein