DiS-TSS: An Annotation Agnostic Algorithm for TSS Identification

The spread, distribution and utilization of experimental evidence for transcription start sites (TSSs) within promoters are poorly understood. Cap Analysis of Gene Expression (CAGE) has emerged as a popular gene expression profiling protocol, able to quantify TSS usage by recognizing the 5′ end of capped RNA molecules. However, a growing number of studies suggest that CAGE can also detect 5′ capping events that are transcription byproducts. These findings highlight the need for computational methods that can effectively remove the excessive noise from CAGE samples, leading to accurate TSS annotation and promoter usage quantification. In this study, we present an annotation-agnostic computational framework, DIANA Signal-TSS (DiS-TSS), which for the first time utilizes digital signal processing-inspired features customized to the peculiarities of CAGE data. Features from the spatial and frequency domains are combined with a robustly trained Support Vector Machine (SVM) model to accurately distinguish between peaks related to real transcription initiation events and biological or protocol-induced noise. When benchmarked on experimentally derived data on active transcription marks as well as annotated TSSs, DiS-TSS was found to outperform existing implementations, providing on average ~11k positive predictions and a performance increase of ~5% in the experimental and annotation-based evaluations. © Springer Nature Switzerland AG 2020.
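
To make the idea of combining spatial and frequency-domain features more concrete, the following is a minimal sketch of how such a feature vector might be computed from a window of per-base CAGE tag counts. It is an illustration under stated assumptions, not the DiS-TSS implementation: the function names, the specific features (peak dominance, Shannon entropy, low-frequency energy ratio) and the example windows are all hypothetical, and the abstract does not specify which features the authors actually use. The resulting vectors are the kind of input that could be passed to an SVM classifier.

```python
import cmath
import math


def spatial_features(counts):
    """Shape descriptors of a per-base tag-count window (hypothetical feature set)."""
    total = sum(counts)
    peak = max(counts)
    # Fraction of all tags that fall on the single highest position.
    dominance = peak / total if total else 0.0
    # Shannon entropy of the normalized profile: low for sharp peaks, high for broad ones.
    probs = [c / total for c in counts if c > 0] if total else []
    entropy = -sum(p * math.log2(p) for p in probs)
    return {"total": total, "peak": peak, "dominance": dominance, "entropy": entropy}


def dft_magnitudes(counts):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(counts)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(counts)))
        for k in range(n)
    ]


def frequency_features(counts):
    """Share of non-DC spectral energy found in the lower frequencies."""
    mags = dft_magnitudes(counts)
    half = mags[1 : len(mags) // 2 + 1]  # skip the DC term, keep the non-redundant half
    energy = sum(m * m for m in half)
    low = sum(m * m for m in half[: len(half) // 2])
    return {"low_freq_ratio": low / energy if energy else 0.0}


def feature_vector(counts):
    """Concatenate spatial and frequency features into one vector for a classifier."""
    s = spatial_features(counts)
    f = frequency_features(counts)
    return [s["total"], s["peak"], s["dominance"], s["entropy"], f["low_freq_ratio"]]


# Made-up example windows: one sharp peak, one diffuse background-like signal.
sharp = [0, 0, 0, 1, 40, 2, 0, 0]
broad = [5, 6, 5, 6, 5, 6, 5, 5]

print(feature_vector(sharp))
print(feature_vector(broad))
```

In this toy setting the sharp window has a high peak dominance and low entropy, while the diffuse window shows the opposite pattern; a trained classifier would learn decision boundaries over features of this general kind.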

URI

http://hdl.handle.net/11615/73703

Collections

Publications in journals, conferences, book chapters, etc. [19735]