bims-micpro Biomed News
on Discovery and characterization of microproteins
Issue of 2022‒09‒18
two papers selected by
Thomas Farid Martínez
University of California, Irvine

  1. Brief Bioinform. 2022 Sep 12. pii: bbac392. [Epub ahead of print]
      Short open reading frames (sORFs) are small nucleic acid fragments, no longer than 303 nt, that may encode small peptides. To date, translatable sORFs have been found in both the untranslated regions of messenger ribonucleic acids (RNAs; mRNAs) and long non-coding RNAs (lncRNAs), where they play vital roles in a myriad of biological processes. As not all sORFs are translated or essentially translatable, it is important to develop a highly accurate computational tool for characterizing the coding potential of sORFs, thereby facilitating the discovery of novel functional peptides. In light of this, we designed a series of ensemble models integrating Efficient-CapsNet and LightGBM, collectively termed csORF-finder, to differentiate coding sORFs (csORFs) from non-coding sORFs in Homo sapiens, Mus musculus and Drosophila melanogaster, respectively. To improve the performance of csORF-finder, we introduced a novel feature encoding scheme named trinucleotide deviation from expected mean (TDE) and computed all types of in-frame sequence-based features, such as i-framed-3mer, i-framed-CKSNAP and i-framed-TDE. Benchmarking results showed that these in-frame features significantly boosted performance compared with the original 3-mer, CKSNAP and TDE features. Our performance comparisons showed that csORF-finder outperformed the state-of-the-art methods for csORF prediction on multi-species and non-ATG initiation independent test datasets. Furthermore, we applied csORF-finder to screen lncRNA datasets for potential csORFs. The resulting data serve as an important computational repository for further experimental validation. We hope that csORF-finder can be exploited as a powerful platform for high-throughput identification of csORFs and functional characterization of the peptides encoded by these csORFs.
    Keywords:  coding sORFs; deep learning; ensemble learning; feature encoding; in-frame sequence features
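      The difference between an ordinary 3-mer feature and an in-frame one can be illustrated with a minimal sketch. It assumes that "i-framed-3mer" means trinucleotide frequencies counted only at codon positions of the reading frame; the abstract does not give the exact definition. The function names and the example sequence are hypothetical and are not taken from csORF-finder; the only value from the abstract is the 303 nt length limit.

```python
from collections import Counter
from itertools import product

# All 64 trinucleotides over the DNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]

# Length limit for an sORF as stated in the abstract (303 nt).
MAX_SORF_NT = 303


def is_sorf_length(seq):
    """Return True if the sequence is no longer than the sORF limit."""
    return len(seq) <= MAX_SORF_NT


def trinucleotide_frequencies(seq, in_frame):
    """Return a dict of 64 trinucleotide frequencies.

    in_frame=True  counts windows at positions 0, 3, 6, ... (codons only).
    in_frame=False counts every overlapping window (ordinary 3-mer).
    This is an assumed reading of the feature names, not the authors' code.
    """
    seq = seq.upper()
    step = 3 if in_frame else 1
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, step))
    # Windows containing non-ACGT characters are ignored in the total.
    total = sum(counts[t] for t in TRINUCLEOTIDES)
    if total == 0:
        return {t: 0.0 for t in TRINUCLEOTIDES}
    return {t: counts[t] / total for t in TRINUCLEOTIDES}


if __name__ == "__main__":
    example = "ATGAAATGA"  # hypothetical 9-nt toy sequence
    framed = trinucleotide_frequencies(example, in_frame=True)
    sliding = trinucleotide_frequencies(example, in_frame=False)
    print(is_sorf_length(example))
    print(framed["ATG"], sliding["ATG"])
```

      In this toy sequence the in-frame count sees only the three codons, whereas the sliding count also picks up out-of-frame trinucleotides such as GAA, which is the kind of signal the in-frame variants would exclude under this assumption.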
  2. Nucleic Acids Res. 2022 Sep 13. pii: gkac776. [Epub ahead of print]
      Cancer-related epitopes can engage the immune system against tumor cells; exploring epitopes derived from non-coding regions is therefore emerging as a fascinating field in cancer immunotherapy. Here, we describe a database, IEAtlas, which aims to provide and visualize a comprehensive atlas of human leukocyte antigen (HLA)-presented immunogenic epitopes derived from non-coding regions. IEAtlas reanalyzed publicly available mass spectrometry-based HLA immunopeptidome datasets against our integrated, benchmarked non-canonical open reading frame information. The current IEAtlas identified 245 870 non-canonical epitopes binding to HLA-I/II allotypes across 15 cancer types and 30 non-cancerous tissues, greatly expanding the cancer immunopeptidome. IEAtlas further evaluates immunogenicity via several commonly used immunogenic features, including HLA binding affinity, stability and T-cell receptor recognition. In addition, IEAtlas provides the biochemical properties of epitopes as well as the clinical relevance of the corresponding genes across major cancer types and normal tissues. Several flexible tools were also developed to aid the retrieval and analysis of epitopes derived from non-coding regions. Overall, IEAtlas will serve as a valuable resource for investigating the immunogenic capacity of non-canonical epitopes and their potential as therapeutic cancer vaccines.