bims-micpro Biomed News
on Discovery and characterization of microproteins
Issue of 2024–07–14
seven papers selected by
Thomas Farid Martínez, University of California, Irvine



  1. Methods Mol Biol. 2024; 2836: 19-34
      Genome annotation has historically ignored small open reading frames (smORFs), which encode a class of proteins shorter than 100 amino acids, collectively referred to as microproteins. This cutoff was established to avoid thousands of false positives due to limitations of pure genomics pipelines. Proteogenomics, a computational approach that combines genomics, transcriptomics, and proteomics, makes it possible to accurately identify these short sequences by overlaying different levels of omics evidence. In this chapter, we showcase the use of μProteInS, a bioinformatics pipeline developed for the identification of unannotated microproteins encoded by smORFs in bacteria. The workflow covers all the steps from quality control and transcriptome assembly to the scoring and post-processing of mass spectrometry data. Additionally, we provide an example of how to apply the pipeline's machine learning method to identify high-confidence spectra and pinpoint the most reliable identifications from large datasets.
    Keywords:  Genome annotation; Mass spectrometry; Proteomics; RNA-seq; smORFs; μProteInS
    DOI:  https://doi.org/10.1007/978-1-0716-4007-4_2
  2. Methods Mol Biol. 2024; 2836: 3-17
      Proteogenomics has revealed the translation of unannotated open reading frames (ORFs) present in mRNAs and in noncoding RNAs (ncRNAs). OpenProt annotates all ORFs with a minimum of 30 codons in the transcriptome of several species and displays many functional features associated with the corresponding proteins. Two types of proteins are annotated: reference (canonical) proteins, which are already annotated in UniProt, RefSeq, or Ensembl, and noncanonical proteins. Noncanonical proteins form two groups: predicted novel isoforms, which display a significant level of homology with a reference protein, and alternative proteins, which are new proteins with no significant homology to known proteins. This chapter describes how to check whether a gene and/or transcript contains multiple open reading frames and how to use OpenProt databases for the detection of alternative proteins and novel isoforms by mass spectrometry-based proteomics.
    Keywords:  Alternative proteins; Database; Mass spectrometry; Multicoding; Proteogenomics
    DOI:  https://doi.org/10.1007/978-1-0716-4007-4_1
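
To make the 30-codon threshold in entry 2 concrete, the following is a minimal sketch of a length-filtered ORF scan. It is not OpenProt's pipeline: the function name `find_orfs`, the restriction to ATG starts on the forward strand, and the choice to count codons without the stop codon are all assumptions made for illustration.

```python
# Illustrative ORF scan; NOT the OpenProt implementation.
# Assumptions: ATG start only, forward strand only, and the codon
# count excludes the stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for ORFs with >= min_codons codons.

    start and end are 0-based, end-exclusive, and end includes the stop codon.
    Only the first ATG of each ORF is used, so the longest ORF per stop is kept.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a new ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and keep scanning this frame
    return orfs


# Hypothetical transcript: ATG + 29 alanine codons + stop = 30 codons.
transcript = "ATG" + "GCT" * 29 + "TAA"
print(find_orfs(transcript))
```

Because all three forward frames are scanned independently, a single transcript can yield more than one ORF, which is the situation the chapter describes as a gene or transcript containing multiple open reading frames.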
  3. bioRxiv. 2024 Jun 29. pii: 2024.06.29.601336. [Epub ahead of print]
      Over the past 15 years, hundreds of previously undiscovered bacterial small open reading frame (sORF)-encoded polypeptides (SEPs) of fewer than fifty amino acids have been identified, and biological functions have been ascribed to an increasing number of SEPs from intergenic regions and small RNAs. However, despite numbering in the dozens in Escherichia coli, and hundreds to thousands in humans, same-strand nested sORFs that overlap protein coding genes in alternative reading frames remain understudied. To provide insight into this enigmatic class of unannotated genes, we characterized GndA, a 36-amino acid, heat shock-regulated SEP encoded within the +2 reading frame of the gnd gene in E. coli K-12 MG1655. We show that GndA pulls down components of respiratory complex I (RCI) and is required for proper localization of an RCI subunit during heat shock. At high temperature, GndA deletion (ΔGndA) cells exhibit perturbations in cell growth, NADH/NAD+ ratio, and expression of a number of genes, including several associated with oxidative stress. These findings suggest that GndA may function in maintenance of homeostasis during heat shock. Characterization of GndA therefore supports the nascent but growing consensus that functional, overlapping genes occur in genomes from viruses to humans.
    Significance Statement: Same-strand overlapping, or nested, protein coding sequences optimize the information content of size-constrained viral genomes, but were previously omitted from prokaryotic and eukaryotic genome annotations. It was therefore surprising when dozens of nested sORFs were recently discovered in bacteria. Our case study of E. coli GndA supports the hypothesis that overlapping genes may exist because they encode proteins with related functions. More broadly, characterization of nested sORFs may revise our understanding of the architecture of bacterial and eukaryotic genes.
    DOI:  https://doi.org/10.1101/2024.06.29.601336
  4. Sci Adv. 2024 Jul 12. 10(28): eadn3628
      The expression of tumor-specific antigens during cancer progression can trigger an immune response against the tumor. Here, we investigate whether microproteins encoded by noncanonical open reading frames (ncORFs) are a relevant source of tumor-specific antigens. We analyze RNA sequencing data from 117 hepatocellular carcinoma (HCC) tumors and matched healthy tissue together with ribosome profiling and immunopeptidomics data. Combining human leukocyte antigen-epitope binding predictions and experimental validation, we conclude that around 40% of the tumor-specific antigens in HCC are likely to be derived from ncORFs, including two peptides that can trigger an immune response in humanized mice. We identify a subset of 33 tumor-specific long noncoding RNAs expressing novel cancer antigens shared by more than 10% of the HCC samples analyzed, which, when combined, cover a large proportion of the patients. The results of the study open avenues for extending the range of anticancer vaccines.
    DOI:  https://doi.org/10.1126/sciadv.adn3628
  5. Genome Biol. 2024 Jul 08. 25(1): 183
      BACKGROUND: Recent studies uncovered pervasive transcription and translation of thousands of noncanonical open reading frames (nORFs) outside of annotated genes. The contribution of nORFs to cellular phenotypes is difficult to infer using conventional approaches because nORFs tend to be short, of recent de novo origin, and expressed at low levels. Here we develop a dedicated coexpression analysis framework that accounts for low expression to investigate the transcriptional regulation, evolution, and potential cellular roles of nORFs in Saccharomyces cerevisiae.
    RESULTS: Our results reveal that nORFs tend to be preferentially coexpressed with genes involved in cellular transport or homeostasis but rarely with genes involved in RNA processing. Mechanistically, we discover that young de novo nORFs located downstream of conserved genes tend to leverage their neighbors' promoters through transcription readthrough, resulting in high coexpression and high expression levels. Transcriptional piggybacking also influences the coexpression profiles of young de novo nORFs located upstream of genes, but to a lesser extent and without detectable impact on expression levels. Transcriptional piggybacking influences, but does not determine, the transcription profiles of de novo nORFs emerging near genes. About 40% of nORFs are not strongly coexpressed with any gene but are transcriptionally regulated nonetheless and tend to form entirely new transcription modules. We offer a web browser interface (https://carvunislab.csb.pitt.edu/shiny/coexpression/) to efficiently query, visualize, and download our coexpression inferences.
    CONCLUSIONS: Our results suggest that nORF transcription is highly regulated. Our coexpression dataset serves as an unprecedented resource for unraveling how nORFs integrate into cellular networks, contribute to cellular phenotypes, and evolve.
    Keywords:  Coexpression networks; De novo gene birth; Noncanonical ORFs; Transcriptional regulation; Translatome; smORFs
    DOI:  https://doi.org/10.1186/s13059-024-03287-7
  6. Insect Biochem Mol Biol. 2024 Jul 05. pii: S0965-1748(24)00085-7. [Epub ahead of print] 104154
      Chagas disease affects around 8 million people globally, with Latin America bearing approximately 10,000 deaths each year. Combatting the disease relies heavily on vector control methods, necessitating the identification of new targets. Within insect genomes, genes harboring small open reading frames (smORFs; < 100 amino acids) present numerous potential candidates. In our investigation, we elucidate the pivotal role of the archetypal smORF-containing gene, mille-pattes/polished-rice/tarsalless (mlpt/pri/tal), in the post-embryonic development of the kissing bug Rhodnius prolixus. Injection of double-stranded RNA targeting mlpt (dsmlpt) during nymphal stages yields a spectrum of phenotypes hindering post-embryonic growth. Notably, fourth or fifth stage nymphs subjected to dsmlpt do not undergo molting. These dsmlpt nymphs display heightened mRNA levels of JHAMT-like and EPOX-like, enzymes putatively involved in the juvenile hormone (JH) pathway, alongside increased expression of the transcription factor Kr-h1, indicating changes in hormonal control. Histological examination reveals structural alterations in the hindgut and external cuticle of dsmlpt nymphs compared to control (dsGFP) counterparts. Furthermore, significant changes in the vector's digestive physiology were observed, with elevated hemozoin and glucose levels in the posterior midgut of dsmlpt nymphs. Importantly, dsmlpt nymphs exhibit impaired metacyclogenesis of Trypanosoma cruzi, the causative agent of Chagas disease, underscoring the crucial role of proper gut organization in parasite differentiation. Thus, our findings constitute the first evidence of a smORF-containing gene's regulatory influence on vector physiology, parasitic cycle, and disease transmission.
    Keywords:  American trypanosomiasis; Chagas disease; ecdysone; hemozoin; insect development; polished-rice/tarsalless
    DOI:  https://doi.org/10.1016/j.ibmb.2024.104154
  7. Int J Mol Sci. 2024 Jun 28. 25(13): 7166
      In recent years, interest in very small proteins (µ-proteins) has increased significantly, and they were found to fulfill important functions in all prokaryotic and eukaryotic species. The halophilic archaeon Haloferax volcanii encodes about 400 µ-proteins of fewer than 70 amino acids, 49 of which contain at least two C(P)XCG motifs and are, thus, predicted zinc finger proteins. The determination of the NMR solution structure of HVO_2753 revealed that only one of two predicted zinc fingers actually bound zinc, while the second one was metal-free. Therefore, the aim of the current study was the homologous production of additional C(P)XCG proteins and the quantification of their zinc content. Attempts to produce 31 proteins failed, underscoring the particular difficulties of working with µ-proteins. In total, 14 proteins could be produced and purified, and the zinc content was determined. Only nine proteins complexed zinc, while five proteins were zinc-free. Three of the latter could be analyzed using ESI-MS and were found to contain another metal, most likely cobalt or nickel. Therefore, at least in haloarchaea, the variability of predicted C(P)XCG zinc finger motifs is higher than anticipated, and they can be metal-free, bind zinc, or bind another metal. Notably, AlphaFold2 cannot correctly predict whether or not the four cysteines have the tetrahedral configuration that is a prerequisite for metal binding.
    Keywords:  AlphaFold2; C(P)XCG motif; ESI; Haloferax volcanii; archaea; mass spectrometry; metal-binding proteins; microproteins; small proteins; zinc finger
    DOI:  https://doi.org/10.3390/ijms25137166
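
The selection criterion in entry 7 (fewer than 70 amino acids and at least two C(P)XCG motifs) can be sketched as a simple sequence filter. This is an illustration only, not the authors' method. It assumes that the parenthesized proline marks a preferred but not required residue, so the motif is read as C-x-x-C-G; the abstract does not spell this out. The function name `is_zinc_finger_candidate` and the test sequences are hypothetical.

```python
import re

# Assumption: "C(P)XCG" is read as Cys, any residue (often Pro),
# any residue, Cys, Gly. The abstract does not define the notation.
MOTIF = re.compile(r"C..CG")


def is_zinc_finger_candidate(protein, max_len=70, min_motifs=2):
    """True if the protein is shorter than max_len residues and carries
    at least min_motifs non-overlapping C-x-x-C-G motifs."""
    protein = protein.upper()
    motifs = MOTIF.findall(protein)  # non-overlapping, left to right
    return len(protein) < max_len and len(motifs) >= min_motifs


# Hypothetical sequences, not taken from Haloferax volcanii.
print(is_zinc_finger_candidate("MKCPECGAAAAAAAACPKCGK"))  # two motifs
print(is_zinc_finger_candidate("MKCPECGAAA"))             # one motif
```

As the abstract stresses, passing such a sequence filter only predicts a zinc finger; whether a protein actually binds zinc, another metal, or nothing has to be measured.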