Genome Biol Evol. 2024 Nov 21. pii: evae252. [Epub ahead of print]
Studying fundamental aspects of eukaryotic biology through genetic information faces numerous challenges, including contamination and intricate biotic interactions, which are particularly pronounced when working with uncultured eukaryotes. Existing tools for predicting open reading frames (ORFs) from transcriptomes are limited in these scenarios. Here we introduce Transcript Identification and Selection (TIdeS), a framework designed to address these non-trivial challenges associated with current 'omics approaches. Using transcriptomes from 32 taxa, representing the breadth of eukaryotic diversity, TIdeS outperforms most conventional ORF-prediction methods (e.g., TransDecoder), identifying a greater proportion of complete and in-frame ORFs. Additionally, TIdeS accurately classifies ORFs using minimal input data, even in the presence of heavy contamination. This built-in flexibility extends to previously unexplored biological interactions, offering a one-stop solution for precise ORF prediction and subsequent decontamination. Beyond applications in phylogenomic studies, TIdeS provides a robust means to explore biotic interactions in eukaryotes (e.g., host-symbiont, prey-predator) and to curate datasets reproducibly from transcriptomes and genomes.
Keywords: ORF prediction; biotic interactions; contamination; machine learning; phylogenomics