bims-micpro Biomed News
on Discovery and characterization of microproteins
Issue of 2024–06–30
four papers selected by
Thomas Farid Martínez, University of California, Irvine



  1. Biophys Rep (N Y). 2024 Jun 21. pii: S2667-0747(24)00026-0. [Epub ahead of print] 100167
      Significant efforts have been made to characterize the biophysical properties of proteins. Small proteins have received less attention because their annotation has historically been less reliable. However, recent improvements in sequencing, proteomics, and bioinformatics techniques have led to the high-confidence annotation of small open reading frames (smORFs) that encode for functional proteins, producing smORF-encoded proteins (SEPs). SEPs have been found to perform critical functions in several species, including humans. While significant efforts have been made to annotate SEPs, less attention has been given to the biophysical properties of these proteins. We characterized the distributions of predicted and curated biophysical properties, including sequence composition, structure, localization, function, and disease association of a conservative list of previously identified human SEPs. We found significant differences between SEPs and both larger proteins and control sets. Additionally, we provide an example of how our characterization of biophysical properties can contribute to distinguishing protein-coding smORFs from non-coding ones in otherwise ambiguous cases.
    Keywords:  Amino Acids; Computational Biology; Genome; Human; Peptides; Proteins
    DOI:  https://doi.org/10.1016/j.bpr.2024.100167
  2. Genome Biol Evol. 2024 Jun 27. pii: evae126. [Epub ahead of print]
      During evolution, new ORFs with the potential to give rise to novel proteins continuously emerge. A recent compilation of non-canonical ORFs with translation signatures in humans has identified thousands of cases with a putative de novo origin. However, it is not known which is their distribution in the population. Are they universally translated? Here we use ribosome profiling data from 65 lymphoblastoid cell lines from individuals of Yoruba origin to investigate this question. We identify 2,587 de novo ORFs translated in at least one of the cell lines. In line with their de novo origin, the encoded proteins tend to be smaller than 100 amino acids and encode positively charged proteins. We observe that the de novo ORFs are more polymorphic in the population than the set of canonical proteins, with a substantial fraction of them being translated in only some of the cell lines. Remarkably, this difference remains significant after controlling for differences in the translation levels. These results suggest that variations in the level translation of de novo ORFs could be a relevant source of intra-species phenotypic diversity in humans.
    Keywords:  LCL; de novo ORF; human; polymorphism; translation
    DOI:  https://doi.org/10.1093/gbe/evae126
  3. Genes (Basel). 2024 Jun 13. pii: 775. [Epub ahead of print]15(6):
      The high-throughput proteomics data generated by increasingly more sensible mass spectrometers greatly contribute to our better understanding of molecular and cellular mechanisms operating in live beings. Nevertheless, proteomics analyses are based on accurate genomic and protein annotations, and some information may be lost if these resources are incomplete. Here, we show that most proteomics data may be recovered by interconnecting genomics and proteomics approaches (i.e., following a proteogenomic strategy), resulting, in turn, in an improvement of gene/protein models. In this study, we generated proteomics data from Leishmania donovani (HU3 strain) promastigotes that allowed us to detect 1908 proteins in this developmental stage on the basis of the currently annotated proteins available in public databases. However, when the proteomics data were searched against all possible open reading frames existing in the L. donovani genome, twenty new protein-coding genes could be annotated. Additionally, 43 previously annotated proteins were extended at their N-terminal ends to accommodate peptides detected in the proteomics data. Also, different post-translational modifications (phosphorylation, acetylation, methylation, among others) were found to occur in a large number of Leishmania proteins. Finally, a detailed comparative analysis of the L. donovani and Leishmania major experimental proteomes served to illustrate how inaccurate conclusions can be raised if proteomes are compared solely on the basis of the listed proteins identified in each proteome. Finally, we have created data entries (based on freely available repositories) to provide and maintain updated gene/protein models. Raw data are available via ProteomeXchange with the identifier PXD051920.
    Keywords:  Leishmania donovani; experimental proteome; mass spectrometry; post-translational modifications (PTMs); proteogenomics
    DOI:  https://doi.org/10.3390/genes15060775
  4. J Mol Evol. 2024 Jun 25.
      By looking for a lack of homologs in a reference database of 27 well-annotated proteomes of primates and 52 well-annotated proteomes of other mammals, 170 putative human-specific proteins were identified. While most of them are deemed uncertain, 2 are known at the protein level and 23 at the transcript level, according to UniProt. Interestingly, 23 of these 25 proteins are found to be encoded or to have close homologs in an open reading frame of a long noncoding human RNA. However, half of them are predicted to be at least 80% globular, with a single structural domain, according to IUPred, and with at least 80% of ordered residues, according to flDPnn. Strikingly, there is a near-complete lack of structural knowledge about these proteins, with no tertiary structure presently available in the Protein Data Bank and a fair prediction for one of them in the AlphaFold Protein Structure Database. Moreover, knowledge about the function of these possibly key proteins remains scarce.
    Keywords:  AlphaFold; FlDPnn; Globularity; IUPred; RNAcentral; Tertiary structure; UniProt
    DOI:  https://doi.org/10.1007/s00239-024-10174-z