Applying machine learning (deep neural networks) to genome-wide maps of molecular activity (e.g. gene expression or DNA accessibility) has emerged as a powerful tool to predict genomic activity directly from DNA sequence. While high prediction accuracy can generally be achieved when testing against non-coding elements in our genome, there are several shortcomings that limit the widespread use of “sequence-to-activity” models in clinical settings. A key challenge is deriving human-interpretable, molecular mechanisms from ‘black-box’ models. As of today, we still lack a full understanding of how our genetic code leads to epigenetic and gene expression changes, making it difficult to develop new treatments that target malignancies caused by transcriptional dysregulation.
Our lab is pioneering technologies in near-native settings that characterize the functional properties of individual transcription factors (TFs) – the master regulators of cell fate – and how they relay genetic information to the epigenome. Instead of relying on already existing regulatory sequences, we deploy large libraries of synthetic ones to generate detailed maps of how TFs interact with each other and with the nuclear environment.
The TF Combinatorial Code
The human genome codes for >1500 TFs, with many studies pointing to a complex network of TF-TF interactions controlling the expression of individual genes. Despite a decades-long effort to characterize how TF pairs and multi-TF complexes regulate gene expression, our understanding of the ‘TF combinatorial code’ remains limited. This is in large part due to the vast space of possible combinations:
1,500^2 > 2 million possible combinations of pairwise interactions, not accounting for how TFs are positioned in 3D space (orientation and spacing), different binding strengths, and multi-way interactions. This complex regulatory space stands in contrast to the ~100,000 observations of genomic activity we can measure in any given experiment, creating a challenge for mechanistic inference: how can we learn the contribution of individual TF combinations when each genomic context assembles a unique configuration and we only have one genome to learn from?
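The scale of this combinatorial explosion can be made concrete with a back-of-the-envelope calculation. The sketch below uses the numbers from the text (~1,500 TFs, ~100,000 observations per experiment); the per-pair multipliers for spacing and orientation are purely illustrative assumptions, not measured values.

```python
# Back-of-the-envelope sketch of the TF combinatorial space.
# Spacing/orientation multipliers are illustrative assumptions.

N_TFS = 1500            # approximate number of human TFs
OBSERVATIONS = 100_000  # ~genomic activity measurements per experiment

# Ordered TF pairs (A upstream of B is distinct from B upstream of A):
pairwise = N_TFS ** 2   # 2,250,000

# Hypothetical extra degrees of freedom per pair (assumed values):
spacings = 10           # e.g. 10 binned spacing distances
orientations = 4        # ++, +-, -+, --
configurations = pairwise * spacings * orientations

print(f"pairwise combinations:      {pairwise:,}")
print(f"with spacing/orientation:   {configurations:,}")
print(f"configurations per observed datapoint: {configurations / OBSERVATIONS:,.0f}")
```

Even with these conservative assumptions, the number of distinct regulatory configurations exceeds the number of observations by several orders of magnitude, which is why systematic synthetic designs (rather than the single native genome) are needed to isolate individual TF-pair contributions.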
To tackle this challenge, we are designing synthetic, genome-integrated molecular amplifiers that allow us to record the transcriptional output and cooperativity of individual TF pairs across a wide variety of cell types. By combining high-throughput experiments with machine learning, we strive to create a resource that quantitatively maps the regulatory output of thousands of TF-TF interactions.
Previous work on the topic: https://www.nature.com/articles/s41588-024-01892-7
Communication between Regulatory Elements
We previously developed a genome-integrated reporter assay called EXTRA-seq that can be targeted to native gene loci with minimal scarring. EXTRA-seq bridges the gap between in vitro screening of regulatory sequences using artificial setups (such as massively parallel reporter assays; MPRAs) and ‘genotype-to-phenotype’ studies (GWAS, eQTLs), which require simultaneous data on naturally occurring genetic variation and molecular activity across dozens of individuals, a currently limited resource. With EXTRA-seq we can already generate hundreds of mutations at will and test their function in native settings within a cell. Importantly, we can incorporate several kilobases of sequence, a distance far enough to separate canonical ‘enhancer’ function from that of promoters. Our goal is to expand the functional repertoire of EXTRA-seq and answer questions that are difficult to tackle with existing methods. The data will serve as a much-needed benchmark for DNA foundation models, whose goal is to accurately predict the effects of both beneficial and deleterious mutations in the non-coding genome.
Previous work on the topic: https://www.biorxiv.org/content/10.1101/2024.12.08.627402v1 & https://www.nature.com/articles/s41588-025-02441-6