bims-micpro Biomed News
on Discovery and characterization of microproteins
Issue of 2025–02–09
eight papers selected by
Thomas Farid Martínez, University of California, Irvine



  1. Genomics Proteomics Bioinformatics. 2025 Feb 07. pii: qzaf004. [Epub ahead of print]
      One of the main goals of the Human Genome Project was to identify all the protein-coding genes. There are ~20,500 protein-coding genes annotated in human reference databases. However, in the last few years, proteogenomics studies have predicted thousands of novel protein-coding regions, including low molecular weight proteins encoded by small open reading frames (ORFs) in untranslated regions of messenger RNAs and non-coding RNAs. Most of these predictions are based on bioinformatics analysis and ribosome footprints. The validity of some of these small ORF (sORF) encoded proteins (SEPs) has been established following functional characterization. With the growing number of predicted novel proteins, a strategy to identify reliable candidates that warrant further studies is needed. We developed an integrated proteogenomics workflow to identify a reliable set of novel protein-coding regions in the human genome based on their recurrent observation across multiple samples. Publicly available ribosome profiling and global proteomics datasets were used to establish protein-coding evidence. We predicted protein translation from 4008 ORFs based on recurrent ribosome occupancy signals across samples. In addition, we identified 825 SEPs based on proteomics data. Some of the novel protein-coding regions identified were in genome-wide association studies (GWAS) loci associated with various traits and disease phenotypes. Peptides from SEPs are also presented by the major histocompatibility complex class I (MHC-I), similar to peptides from canonical proteins. Novel protein-coding regions reported in this study expand the current catalog of protein-coding genes and warrant experimental studies to elucidate cellular functions regulated by these proteins and their role in human diseases.
    Keywords:  Non-coding RNAs; Novel proteins; Protein-coding potential; SEPs; sORF
    DOI:  https://doi.org/10.1093/gpbjnl/qzaf004
  2. Mol Cell Proteomics. 2025 Feb 04. pii: S1535-9476(25)00012-X. [Epub ahead of print] 100914
      Noncanonical micropeptides, also called novel microproteins, i.e., polypeptides mostly under 10 kDa, are encoded by genomic sequences that were previously annotated as noncoding but are now known as small open reading frames (sORFs). The recent identification of microproteins encoded by sORFs has provided evidence that many sORFs encode functional microproteins that play crucial roles in various biological processes. T cell activation is a critical biological process for the adaptive immune response. Understanding key players in this process will allow us to decipher the complex mechanisms as well as develop immunotherapy for treating a wide range of diseases. Although there have been extensive studies on canonical proteins in T cell activation, the novel microproteins in T cells and their roles have been uncharted waters to date. Nascent proteins are defined as newly synthesized polypeptides that emerge during the translation of mRNA. In this study, we combined nascent proteomics and quantitative proteomics to identify 411 novel microproteins in primary human T cells, including 83 nascent microproteins. We activated the T cells with either PMA/Ionomycin (distal activation) or CD3/CD28 activating antibodies (proximal activation), and obtained comprehensive canonical protein and microprotein profiles to pinpoint common and distinct differentially expressed proteins under these two activation conditions. After experimental testing, three microproteins numbered T1, T2 and T3 were found to be functional in regulating T cell activation. Bioinformatic and proteomic analyses suggested that T1 was functionally related to immunity, acting as negative feedback on T cell activation. Our study not only established an integrated approach to uncover and elucidate novel microproteins but also highlights the significant role of microproteins in regulating T cell activation.
    DOI:  https://doi.org/10.1016/j.mcpro.2025.100914
  3. J Proteome Res. 2025 Feb 07. 24(2): 777-785
      Long noncoding RNAs (lncRNAs) are closely associated with tumor development, and increasing evidence suggests that small open reading frames (smORFs) within lncRNAs also have the capability to encode smORF-encoded peptides (SEPs). Here, we thoroughly uncovered the SEP expression profile of hepatocellular carcinoma (HCC) from tumor and adjacent nontumor tissues of 154 HCC patients using high-throughput mass spectrometry (MS). A total of 208 SEPs were identified, with no significant difference in abundance and stability compared with coding region proteins. Notably, the peptide encoded by LINC01007 (LINC01007-33AA) was significantly upregulated in HCC tissues (p < 0.05) and could serve as an independent risk factor affecting prognosis (HR [95% CI]: 1.31 [1.01-1.7]). This endogenous peptide was further confirmed at both the mRNA and protein levels, and its overexpression significantly enhanced the invasion and migration of HCC cells. These findings highlight the potential of MS-based methods to identify novel functional peptides, encoded by noncoding sequences, that are associated with tumor progression.
    Keywords:  hepatocellular carcinoma; invasion and migration; lncRNA-encoded peptides; proteomics; small open reading frame
    DOI:  https://doi.org/10.1021/acs.jproteome.4c00862
  4. Nat Commun. 2025 Feb 02. 16(1): 1275
      The biological process of RNA translation is fundamental to cellular life and has wide-ranging implications for human disease. Accurate delineation of RNA translation variation represents a significant challenge due to the complexity of the process and technical limitations. Here, we introduce RiboTIE, a transformer model-based approach designed to enhance the analysis of ribosome profiling data. Unlike existing methods, RiboTIE leverages raw ribosome profiling counts directly to robustly detect translated open reading frames (ORFs) with high precision and sensitivity, evaluated on a diverse set of datasets. We demonstrate that RiboTIE successfully recapitulates known findings and provides novel insights into the regulation of RNA translation in both normal brain and medulloblastoma cancer samples. Our results suggest that RiboTIE is a versatile tool that can significantly improve the accuracy and depth of Ribo-Seq data analysis, thereby advancing our understanding of protein synthesis and its implications in disease.
    DOI:  https://doi.org/10.1038/s41467-025-56543-0
  5. BMC Genomics. 2025 Feb 05. 26(1): 110
      Small proteins with fewer than 100, particularly fewer than 50, amino acids are still largely unexplored. Nonetheless, they represent an essential part of bacteria's often neglected genetic repertoire. In recent years, the development of ribosome profiling protocols has led to the detection of an increasing number of previously unknown small proteins. Despite this, they are overlooked in many cases by automated genome annotation pipelines, and often, no functional descriptions can be assigned due to a lack of known homologs. To understand and overcome these limitations, the current abundance of small proteins in existing databases was evaluated, and a new dedicated database for small proteins and their potential functions, called 'sORFdb', was created. To this end, small proteins were extracted from annotated bacterial genomes in the GenBank database. Subsequently, they were quality-filtered, compared, and complemented with proteins from Swiss-Prot, UniProt, and SmProt to ensure reliable identification and characterization of small proteins. Families of similar small proteins were created using bidirectional best BLAST hits followed by Markov clustering. Analysis of small proteins in public databases revealed that their number is still limited due to historical and technical constraints. Additionally, functional descriptions were often missing despite the presence of potential homologs. As expected, a taxonomic bias was evident in over-represented clinically relevant bacteria. This new and comprehensive database is accessible via a feature-rich website providing specialized search features for sORFs and small proteins of high quality. Additionally, small protein families with Hidden Markov Models and information on taxonomic distribution and other physicochemical properties are available. 
    In conclusion, the novel small protein database sORFdb is a specialized, taxonomy-independent database that improves the findability and classification of sORFs, small proteins, and their functions in bacteria, thereby supporting their future detection and consistent annotation. All sORFdb data is freely accessible via https://sorfdb.computational.bio.
    Keywords:  Bacteria; Database; Protein families; SORF; Short open reading frames; Small proteins
    DOI:  https://doi.org/10.1186/s12864-025-11301-w
  6. Front Cell Dev Biol. 2025; 13: 1525345
      Oncogenes are typically overexpressed in tumor tissues and often linked to poor prognosis. However, recent advancements in bioinformatics have revealed that many highly expressed genes in tumors are associated with better patient outcomes. These genes, which act as tumor suppressors, are referred to as "paradoxical genes." Analyzing The Cancer Genome Atlas (TCGA) confirmed the widespread presence of paradoxical genes, and KEGG analysis revealed their role in regulating tumor metabolism. Mechanistically, discrepancies between gene and protein expression, affected by pre- and post-transcriptional modifications, may drive this phenomenon. Mechanisms like upstream open reading frames and alternative splicing contribute to these inconsistencies. Many paradoxical genes modulate the tumor immune microenvironment, exerting tumor-suppressive effects. Further analysis shows that the stage- and tumor-specific expression of these genes, along with their environmental sensitivity, influences their dual roles in various signaling pathways. These findings highlight the importance of paradoxical genes in resisting tumor progression and maintaining cellular homeostasis, offering new avenues for targeted cancer therapy.
    Keywords:  bioinformatics; discordant gene-protein abundance; paradoxical genes; signaling pathway; tumor immune microenvironment; tumor metabolism
    DOI:  https://doi.org/10.3389/fcell.2025.1525345
  7. Immunogenetics. 2025 Feb 05. 77(1): 14
      Isoform sequencing (Iso-Seq) uses long-read technology to produce highly accurate full-length reads of mRNA transcripts. Visualization of individual mRNA molecules can reveal new details of transcript variation within understudied portions of mRNA, such as the 5' untranslated region (UTR). Differential 5' UTRs may contain motifs, upstream open reading frames (uORFs), and secondary structures that can serve to regulate translation or further indicate changes in promoter usage, where transcriptional control may impact protein expression levels. To begin to explore isoform variation during T-cell activation, we generated the first Iso-Seq reference transcriptome of activated human CD4 T cells. Within this dataset, we discovered many novel splice- and end-variant transcripts. Remarkably, one in every eight genes expressed in our dataset was found to have a notable proportion of transcripts with 5' UTRs lengthened by over 100 bp compared to the longest corresponding UTR within the Gencode dataset. Among these end-variant transcripts, two novel isoforms were identified for CXCR5, a chemokine receptor associated with T follicular helper cell (Tfh) function and differentiation. When investigated in a model cell system, these lengthened UTRs conferred reduced transcript stability and, for one of these isoforms, short uORFs introduced by the added length altered protein expression kinetics. This study highlights instances in which current reference databases are incomplete relative to the information obtained by long-read sequencing of intact mRNA. Iso-Seq is thus a promising approach to better understanding the plasticity of promoter usage, alternative splicing, and UTR sequences that influence RNA stability and translation efficiency.
    Keywords:  5′ untranslated region (UTR); Activated CD4 T cell; CXCR5; Isoform sequencing (Iso-Seq); PacBio
    DOI:  https://doi.org/10.1007/s00251-025-01371-1
  8. J Anim Sci Biotechnol. 2025 Feb 05. 16(1): 19
      BACKGROUND: Intramuscular fat is an important factor in evaluating pork quality and varies widely among different pig breeds. However, the regulatory mechanism of circular RNAs (circRNAs) in lipid metabolism remains largely unexplored.
    RESULTS: We combined circRNA-seq and Ribo-seq data to screen a total of 18 circRNA candidates with coding potential, and circANKRD17, with a length of 1,844 nucleotides, was found to be significantly elevated in the longissimus dorsi muscle of Lantang piglets. Using single-cell sequencing, we identified 477 differentially expressed genes in intramuscular fat (IMF) cells between Lantang and Landrace piglets, with enrichment in the PPAR signaling pathway. These genes included FABP4, FABP5, CPT1A, and UBC, consistent with the high levels of acylcarnitines observed in the longissimus dorsi muscles of the Lantang breed, as determined by lipidomic analysis. Further in vitro and in vivo experiments indicated that circANKRD17 can regulate lipid metabolism through various mechanisms involving the PPAR pathway, including promoting adipocyte differentiation, fatty acid transport and metabolism, triglyceride synthesis, and lipid droplet formation and maturation. In addition, we discovered that circANKRD17 has an open reading frame and can be translated into a novel 571-amino-acid protein that promotes lipid metabolism.
    CONCLUSIONS: Our research provides new insights into the role of protein-coding circANKRD17, especially concerning the metabolic characteristics of pig breeds with higher intramuscular fat content.
    Keywords:  CircRNAs; Intramuscular fat; Meat quality; PPAR pathway
    DOI:  https://doi.org/10.1186/s40104-025-01153-5