Applying machine learning (deep neural networks) to genome-wide maps of epigenetic modifications (e.g. DNA accessibility or gene expression) has emerged as a powerful tool to predict genomic activity directly from DNA sequence. While high prediction accuracy can generally be achieved, several shortcomings limit the widespread use of “sequence-to-activity” models in clinical settings. A key challenge is deriving human-interpretable molecular mechanisms from ‘black-box’ models. As of today, we still lack a full understanding of how our genetic code leads to epigenetic changes, making it difficult to develop new treatments for epigenome-related malignancies.
Our lab is pioneering technologies that characterize, in near-native settings, the functional properties of individual transcription factors (TFs), the master regulators of cell fate, and how they relay genetic information to the epigenome. Instead of relying on already existing regulatory sequences, we deploy large libraries of synthetic ones to generate detailed maps of how TFs interact with each other and with the nuclear environment.
The TF Combinatorial Code
The human genome codes for >1500 TFs, with many studies pointing to a complex network of TF-TF interactions controlling the expression of individual genes. Despite a decades-long effort to characterize how TF pairs and multi-TF complexes regulate gene expression, our understanding of the ‘TF combinatorial code’ remains limited. This is in large part due to the vast space of possible combinations:
2^1500 possible combinations of TFs (a number with far more zeros than the total amount of money circulating worldwide) stand in contrast to ~100,000 observations of genomic activity measured in a given experiment. Even when restricted to pairs, 1,500 TFs yield more than a million possible TF-TF combinations. This creates a challenge for mechanistic inference, as each genomic context assembles a unique configuration of multiple TFs.
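To make the scale concrete, the arithmetic can be checked directly. The sketch below is only an illustration: it assumes a round figure of 1,500 TFs and 100,000 observations per experiment, both taken from the text, and the function and variable names are hypothetical. It counts the unordered TF pairs and all possible TF subsets (2^1500).

```python
import math

# Assumed round figures taken from the text above; real counts will differ.
N_TFS = 1500              # "the human genome codes for >1500 TFs"
N_OBSERVATIONS = 100_000  # "~100,000 observations ... in a given experiment"


def combination_counts(n_tfs: int) -> dict:
    """Return exact counts of TF pairs and of all possible TF subsets."""
    pairs = math.comb(n_tfs, 2)   # unordered TF-TF pairs
    subsets = 2 ** n_tfs          # every possible set of co-occurring TFs
    return {
        "pairs": pairs,
        "subsets": subsets,
        "subset_digits": len(str(subsets)),  # decimal digits of 2^n
    }


counts = combination_counts(N_TFS)
pairs_per_observation = counts["pairs"] / N_OBSERVATIONS

print(f"TF pairs: {counts['pairs']:,}")
print(f"Digits in 2^{N_TFS}: {counts['subset_digits']}")
print(f"TF pairs per observation: {pairs_per_observation:.1f}")
```

Under these assumptions, pairs alone outnumber the observations in a single experiment roughly tenfold, and the space of all possible TF sets is far beyond anything that could be measured exhaustively.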
To tackle this challenge, we are designing genome-integrated molecular amplifiers that record the transcriptional output of, and cooperativity between, individual TF pairs across a wide variety of cell types. By combining high-throughput experiments with machine learning, we strive to create a resource that quantitatively maps the regulatory output of thousands of TF-TF interactions.
Previous work on the topic: https://www.nature.com/articles/s41588-024-01892-7

Communication between Regulatory Elements
We previously developed a genome-integrated reporter assay called EXTRA-seq that can be targeted to native gene loci with minimal scarring. EXTRA-seq bridges the gap between in vitro screening of regulatory sequences using artificial setups (such as massively parallel reporter assays; MPRAs) and ‘genotype-to-phenotype’ studies (GWAS, eQTLs). The latter require simultaneous data on naturally occurring genetic variation and molecular activity across dozens of individuals, a currently limited resource. With EXTRA-seq, we can generate hundreds of mutations at will and test their function in native settings within a cell. Importantly, we can incorporate several kilobases of sequence, a distance large enough to separate canonical ‘enhancer’ function from that of promoters. Our goal is to expand the functional repertoire of EXTRA-seq and answer questions that are difficult to tackle with existing methods.
Previous work on the topic: https://www.biorxiv.org/content/10.1101/2024.12.08.627402v1
