Interdiscip Sci. 2021 Jul 25.
Long non-coding RNA (lncRNA), a type of non-coding RNA, has been reported to contain short open reading frames (sORFs). sORF-encoded short peptides (SEPs) have been shown to play a crucial role in regulating biological processes such as growth, development, and resistance response. Identifying SEPs is vital to understanding their function, yet effective and rapid identification methods are still lacking. In this study, we develop lncPepid, a novel method for identifying lncRNA-encoded short peptides based on feature subset recombination and ensemble learning. lncPepid transforms data from Zea mays and Arabidopsis thaliana separately into hybrid features that cover two aspects: sequence composition and physicochemical properties. It optimizes the hybrid features with a novel weighted iteration-based feature selection method, which recombines them into a stable subset that characterizes SEPs effectively. Different classification models are constructed with different optimized features and tested separately. The outputs of the optimal models are then integrated for ensemble classification to improve efficiency. Experimental results show that the geometric mean of sensitivity and specificity of lncPepid is about 70% for the identification of functional SEPs derived from multiple species. lncPepid is therefore an effective and rapid method for identifying lncRNA-encoded short peptides. The approach can be extended to SEPs from other species and has important implications for further studies in functional genomics.
Keywords: Ensemble learning; Feature subset recombination; Long non-coding RNA; Short open reading frames; Short peptides
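
The evaluation metric reported in the abstract, the geometric mean of sensitivity and specificity, can be illustrated with a minimal Python sketch. The function name `g_mean` and the confusion-matrix counts in the usage note are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math


def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity and specificity.

    tp, fn: true positives and false negatives (actual SEPs)
    tn, fp: true negatives and false positives (actual non-SEPs)
    Name and signature are illustrative assumptions, not the authors' code.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sensitivity = tp / (tp + fn)  # fraction of actual positives recovered
    specificity = tn / (tn + fp)  # fraction of actual negatives recovered
    return math.sqrt(sensitivity * specificity)
```

For example, with made-up counts `g_mean(tp=70, fn=30, tn=70, fp=30)`, sensitivity and specificity are both 0.7, so their geometric mean is also approximately 0.7. Because the metric is a product, it falls to zero when either class is entirely misclassified, which makes it less forgiving than plain accuracy on imbalanced data.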