Computational prediction of functional similarity of CRMs
Transcriptional regulation of genes is fundamental to all living organisms. The spatial, temporal and condition-specific expression levels of genes are in part determined by inherited regulatory codes in non-coding regions of the DNA. A large set of methods have been proposed to detect conserved regions of regulatory DNA by means of sequence alignments. However, it has become clear that some regulatory regions do not show statistically significant alignments even in the presence of functional conservation. Therefore, detecting and characterising elusive regulatory codes remains a challenging problem. \ud In this thesis we develop and validate a novel computational alignment free model for detection of functional similarity of regulatory sequences. We show that our model can detect functional links between pairs of sequences that do not align with a significant score. We apply the model to a) detect enhancers within the same genome that are likely to have similar functions and b) to detect functionally conserved enhancer regions in orthologous genomes. Our method finds regulatory codes that are common to groups of similar enhancers and consistent with previous biological knowledge. \ud The inputs for our model are two sequences that we wish to compare in terms of their functional similarity as well as a set of transcription factor motifs. The mathematical framework of our model is built on two main components: In the first model component, each sequence is mapped to a vector of estimated occupancy levels for all motifs. These vectors are representing which motifs at what multiplicity and specificity are present in each sequence.\ud In the second model component, a statistical approach is established where we first estimate a probability distribution of motif occupancy levels for sequences that function similar to the template sequence. We then compute a statistical similarity score to evaluate if the sequences are more similar to each other than to random background sequences.\ud Two applications of this model are presented: First it is applied to a set of experimentally validated non-alignable enhancers from\ud D. melanogaster. We show that:\ud • Our model can detect statistical links between these enhancers,\ud • Weak binding sites can make a strong contribution to sequence similarity,\ud • Our model treats statistically significant presence and absence of motifs symmetrically. Similarity of sequences, therefore, can be based on a combination of the two. We show examples of motifs making contributions to sequence similarity through their absence.\ud • Using our model, we can create a network of similarities among the fly enhancers. Groups of enhancers in this network show common\ud regulatory codes. One of these regulatory codes is strongly supported by existing experimental data.\ud In the second application of our model we predict functional subregions of a known D. melanogaster enhancer. To achieve this, we first show that the model can detect the orthology of this enhancer between 10 Drosophila species. We then demonstrate how this statistical link can be used to predict functional subregions within this enhancer.