A streamlined platform for cell type-specific prediction of TF binding


Author(s): Emanuel Sonder,Mark Robinson,Pierre-Luc Germain

Affiliation(s): Institute for Neuroscience, ETH Zurich



Transcription factors (TFs) mediate transcription by binding specific sites in the genome, known as transcription factor binding sites (TFBS). These sites vary across cell types and conditions, determining cell fate and response to stimuli. The binding of a TF is influenced by its affinity for certain DNA sequences, but also by local chromatin accessibility and by the presence and activity of other TFs acting as cofactors. TFBS can be determined experimentally by techniques such as ChIP-seq; however, given the large number of combinations of cell types, conditions and TFs, these costly techniques can capture only a small number of them. This motivates computational TFBS prediction. Existing algorithms achieve fair performance on selected combinations of TFs and cell types, but they often require rich and/or carefully curated data and are difficult to extend generically to new combinations. We are developing a scalable approach to efficiently and accurately predict binding sites for a TF of interest in a cell type-specific manner, which makes use of the wealth of available data while requiring only an ATAC-seq profile from the target cell type. We focus on broad applicability across differing cellular contexts. Having first defined a search space by constructing a compendium of 3.8M putative regulatory elements, we train bagged ensembles of models that mimic the varying binding activities of factors in different cell types. A broad range of features quantifying various aspects of TF binding, such as footprints/insertion signals, cooperativity and conservation of binding patterns, is constructed from the ATAC-seq profile, motif matches and available ChIP-seq data.
At the implementation level, our pipeline makes extensive use of data containers provided by the Bioconductor project, such as MultiAssayExperiment and GenomicRanges objects, so that predictions can be explored and used alongside existing Bioconductor packages. Our approach harnesses similarity across cell types and interactions between transcription factors to achieve state-of-the-art predictions at a low computational cost. Given the broad applicability of our method, we intend to build a comprehensive compendium of cell type-specific TFBS predictions. Alongside the compendium, we plan to distribute our method and the pretrained models via the Bioconductor project, potentially as separate method and data packages, to enable users to obtain predictions for their cell types of interest and to use the provided predictions for downstream tasks such as TF activity analysis.
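
The bagging idea mentioned above can be illustrated with a minimal, generic sketch. It is not the pipeline described in this abstract: the base learner (a one-feature threshold rule), the two feature names (accessibility, motif score), the toy values and all function names are assumptions made purely for illustration, and Python is used only for brevity. Each learner is fitted on a bootstrap resample of candidate regions, and the ensemble score is the fraction of learners that vote "bound".

```python
import random

def fit_stump(rows, labels):
    """Fit a one-feature threshold rule: predict 'bound' when feature >= threshold."""
    best = None  # (errors, feature index, threshold)
    for j in range(len(rows[0])):
        # Candidate thresholds: observed values, plus +inf for "never bound".
        for t in sorted({r[j] for r in rows} | {float("inf")}):
            errors = sum((r[j] >= t) != bool(y) for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best[1], best[2]

def fit_bagged(rows, labels, n_learners=25, seed=0):
    """Train each learner on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return ensemble

def predict_bagged(ensemble, row):
    """Ensemble score: fraction of learners that vote 'bound' for this region."""
    return sum(row[j] >= t for j, t in ensemble) / len(ensemble)

# Hypothetical toy data: (accessibility, motif score) per candidate region.
rows = [(0.90, 0.8), (0.80, 0.9), (0.70, 0.7), (0.85, 0.6),
        (0.20, 0.9), (0.10, 0.3), (0.30, 0.2), (0.15, 0.8)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = bound in the (made-up) reference data

ensemble = fit_bagged(rows, labels)
score_open = predict_bagged(ensemble, (0.95, 0.10))    # accessible region, weak motif
score_closed = predict_bagged(ensemble, (0.20, 0.95))  # closed region, strong motif
```

In this toy setting only accessibility separates the classes, so the ensemble scores the accessible region high and the closed region low regardless of motif strength. A real model would combine many more features and a stronger base learner.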