/GraphReg

Chromatin interaction aware gene regulatory modeling with graph attention networks

Primary LanguagePython

GraphReg

Alt text

GraphReg (Chromatin interaction aware gene regulatory modeling with graph attention networks) is a graph neural network based gene regulation model which integrates DNA sequence, 1D epigenomic data (such as chromatin accessibility and histone modifications), and 3D chromatin conformation data (such as Hi-C, HiChIP, Micro-C, HiCAR) to predict gene expression in an informative way. GraphReg is a versatile model which can be used to answer interesting questions in regulatory genomics such as:

  • How well we can predict expression of a gene by using the epigenomic features of its promoter and candidate enhancers and enhancer-promoter interactions? Can this model be used in unseen cell types to predict gene expression?

  • What are the cis regulatory elements of the genes in each cell type? Which candidate enhancers are functional and play a role in gene regulation?

  • Which transcription factor (TF) motifs are important for gene regulation? How do distal TF motifs regulate their target genes?

This repository contains all the codes for training GraphReg models and all the downstream analyses for gene expression prediction, enhancer validation, and discovering regulating TF motifs.

Data preparation

1D data (epigenomic and CAGE)

We need a coverage file bigwig for each epigenomic track. We have used some useful functions from Basenji for reading and writing the bigwig files, which can be found in utils.

We can use two different approaches to generate bigwig files from alignment BAM files:

  • bam_cov.py from Basenji. This works best when we want to work with each cell type individually. The coverage tracks from different cell types are not normalized by this method. In Epi-GraphReg if we are interested in cross-cell-type generalization, the coverage tracks should be normalized by other techniques such as DESeq, otherwise there would be batch effect between cell types due to sequencing depths, which would hurt the generalization performance.

  • bamCoverage from deepTools. This is more suitable for cross-cell-type analyses, as they offer some normalization methods for bigwig files. In particular, we use 1x normalization or reads per genome coverage (RPGC), which normalizes the coverage in each bin by sequencing depth. We run bamCoverage with bin size 100 for epigenomic tracks and 5000 for CAGE-seq.

After generating the bigwig files, we use data_read.py to read the bigwig files and save the coverage signals in hdf5 format. We use pool_width = 100 (to get the coverage in 100bp bins) for epigenomic tracks and pool_width = 5000 (to get the coverage in 5Kb bins) for CAGE. The reason of using 5Kb bins for CAGE is that we use 5Kb resolution of 3D assays and want to have corresponding bins. If we use bam_cov.py to generate bigwig files, we set sum_stat = 'sum' to sum all the base-pair coverage in each bin; otherwise, if we use bamCoverage to generate bigwig files, we set sum_stat = 'max' as the coverage per bin has already been computed per bin.

3D data (chromatin conformation: Hi-C/HiChIP/Micro-C/HiCAR)

The chromatin conformation fastq data from various 3D assays such as Hi-C, HiChIP, Micro-C, HiCAR could be aligned to any genome (using packages like Juicer or HiC-Pro) to get .hic files. GraphReg needs connecivity graphs for each chromosome. As these 3D data are very noisy, we need some statistical tools to get the significant interactions for the graphs, otherwise it would be very noisy. To this end, we use HiCDCPlus which gets the .hic files and returns the significance level (FDR) for each genomic interaction (of resolution 5Kb) based on a Negative Binomial model. We filter the interactions and keep the ones with FDR <= alpha to form the graphs and adjacency matrices. We have worked with three different values of alpha = 0.1, 0.01, 0.001 and noticed that its ideal value depends on the 3D data. But, we recommend alpha = 0.1 as a default and less stringent cutoff.

The outputs of HiCDCPlus is given to hic_to_graph.py to generate the adjacency matrices for each chromosome, which are saved as sparce matrices.

TSS bins and positions

We need to have a BED file for TSS annotations. This file could be extracted from any gene annotation GTF files for any genome build. We have used GENCODE annotations which can be found here. The TSS annotation BED file is given to find_tss.py to compute the number of TSS's in each 5Kb bin. find_tss.py saves four outputs as numpy files: start position of each bin, number of TSS's in each bin, and the gene names (if existent) and their TSS positions in each bin. With 5Kb bins, the majority of them would have one TSS. However, there is a chance that a bin has 2 or 3 TSS's, in which case we save the first TSS position and all the genes (in the format gene_name_1+gene_name_2), because we want to keep track of all the genes appearing in each bin.

Writing all data (1D, 3D, TSS) to TFRecords

GraphReg has been implemented in TensorFlow. TFRecord is an efficient format to store and read data in TensorFlow. We use data_write.py to read (1) epigenomic coverage files (saved in h5 format), (2) sparse adjacency matrices (saved in npz format), and (3) TSS files (saved in np format) and save them sequentially in TFRecords in each chromosome. We start from beginning of each chromosome and write the epigenomic and CAGE coverages, adjacency matrices, and TSS annotations for the regions of 6Mb. Then we sweep the entire chromosome by steps of 2Mb. This way, there is no overlap for the middle 2Mb regions where we predict gene expression values. For each batch of 6Mb, the dimensions of data would be: 60,000 for each epigenomic track, 1200 for CAGE, and 1200 x 1200 for adjacency matrices. The predicted CAGE values in the middle 400 bins would appear in the loss function so that all the genes could see their distal enhancers up to 2Mb up- and downstream of their TSS.

The TFRecord files are slightly different for Epi-GraphReg and Seq-GraphReg models: (1) TFRecords for Seq-GraphReg also contain one-hot-coded DNA sequences of the size 6,000,000 x 4, as the DNA sequence is an input for these models, (2) The epigenomic signals for Epi-GraphReg undergo an extra log-normalization, via function log2(x+1), to reduce their dynamic ranges, as they are inputs in Epi-GraphReg models.

Now that we have generated TFRecord files, we are ready to train the models.

Training GraphReg models

Epi-GraphReg

Use Epi-GraphReg.py to train the Epi-GraphReg models. You should specify the validation and test chromosomes. The remaining autosomal chromosomes are used for training. For example:

python Epi-GraphReg.py -c K562 -p $data_path -a HiChIP -g 1 -q 0.1 -v 3,13 -t 4,14

trains Epi-GraphReg on cell line K562, using graphs extracted from HiChIP with FDR (q-value) cutoff 0.1, in generalizable mode -g 1, with Chrs 3 and 13 as the validation and Chrs 4 and 14 as the test chromosomes. $data_path is the directory where TFRecords have been stored. Training on generalizable mode means that the model uses the normalized epigenomic coverage tracks (for example the ones obtained from RPGC normalization) so that the trained model can be used in other cell types as well.

Seq-GraphReg

The Seq-GraphReg models can be trained in two ways: (1) end-to-end, (2) separate. End-to-end training means that both epigenomic and CAGE data are predicted in a multi-task learning fashion. However, separate training means that we train two tasks (epigenomic and GACE) separately: we first use CNN layers to predict the epigenomic tracks from DNA sequence (similar to Basenji) and then feed the bottleneck representations to the graph attention layers to predict the CAGE values. So, which one should you use? It depends on the amount of GPU memory the users have access to. End-to-end training requires high GPU memory as it needs to load the entire 6Mb DNA sequence to the GPU memory. One advantage of an end-to-end model is the ability to do gradient-based feature attribution from output of any gene back to the base pair level. However, if a high GPU memory is not available, we can employ separate training, where we can use smaller genomic regions of length 100Kb (instead of 6Mb) to predict the epigenomic data from DNA sequences as these are local features and no graph is used for this task. Then after predicting the entire 6Mb (60 mini batches of 100Kb), we concatenate their corresponding bottleneck representations (with the size 60,000 x 64, where 64 is the dimension of bottleneck representations) and feed that to the graph attention layers along with the corresponding graph of 6Mb region.

End-to-end training

Use Seq-GraphReg_e2e.py to train end-to-end Seq-GraphReg models. You should specify the validation and test chromosomes. The remaining autosomal chromosomes are used for training. For example:

python Seq-GraphReg_e2e.py -c K562 -p $data_path -a HiChIP -q 0.1 -v 3,13 -t 4,14

trains end-to-end Seq-GraphReg on cell line K562, using graphs extracted from HiChIP with FDR (q-value) cutoff 0.1 with Chrs 3 and 13 as the validation and Chrs 4 and 14 as the test chromosomes.

Separate training

1- Use Seq-CNN_base.py to first train a CNN model to predict the epigenomic data from DNA sequence. For example:

python Seq-CNN_base.py -c K562 -p $data_path -v 3,13 -t 4,14

trains a CNN model on cell line K562 with Chrs 3 and 13 as the validation and Chrs 4 and 14 as the test chromosomes. We have implemented four flavors of such epigenomic CNN models: with/without dilation layers, and with/without FFT (Fast Fourier Transform) loss (so overall four models), which can be found here. The idea of adding FFT attribution prior is borrowed from here to improve the interpretability of the deep learning models using DNA sequences.

2- Use Seq-GraphReg.py to train the graph attention networks of Seq-GraphReg to predict the CAGE values. For example:

python Seq-GraphReg.py -c K562 -p $data_path -a HiChIP -q 0.1 -f 1 -d 0 -v 3,13 -t 4,14

first loads the epigenomic CNNs trained in step (1) on cell line K562 without dilation layers -d 0 and with FFT loss -f 1. Then it uses the bottleneck representation as the input for the second task: predicting CAGE using the graphs extracted from HiChIP with FDR cutoff 0.1. Note that the validation (Chrs 3 and 13) and test (Chr 4 and 14) chromosomes are the same as step (1).

Testing GraphReg models

After training the GraphReg models, it is time to use them to predict gene expression (CAGE) in held-out test chromosomes or cell types. All the scripts for saving the predictions and plotting the results are in test. We first run the prediction scripts (explained below) to save the CAGE predictions and all meta data (such as number of enhancer-promoter interactions, name of the genes, TSS positions, etc.) for the TSS bins in the test chromosomes in a csv file. Then we run the plotting scripts (which call the saved csv files) to plot the results.

Feature attributions of GraphReg models

Enhancer validation

We can use feature attributions of GraphReg models to find out which regions are important for gene expression, meaning that they could be considered as the enhancers of any target gene. To validate enhancer predictions, we need experimental enhancer perturbation data. We have used two such datasets in the K562 cell line: (1) CRISPRi FlowFISH from this paper and (2) Targeted Perturb-seq (TAP-seq) from this paper. We can use the feature attributions of both Epi-GraphReg and Seq-GraphReg models. We have tried DeepSHAP and gradient-by-input for feature attribution. Note that in Epi-GraphReg we do the gradients of output with respect to the input epigenomic bins (of size 100bp), while in Seq-GraphReg we do the gradients of outputs with respect to the bottleneck representation bins (of size 100bp). This approach has been suggested by Basenji. All the scripts for these kind of analyses are in here.

In-silico TF motif knockout and ISM

We can use Seq-GraphRe models to get insights about how TF motifs regulate their target genes. To have experimental validation data, we downloaded CRISPRi TF knockout experiments in K562 cells from ENCODE. These data show which genes are down- (up) regulated after knockout of a TF. To see if we can predict such effects, we delete the motifs of that TF in the genome and then observe the difference in the predictions of each gene. If a TF motif is important for the expression of a gene, the predicted value of gene expression would go down. By doing so, we can find the direct effects of each TF on their target genes. However, this prediction is not perfect as there is no way for the model to consider the indirect effects of the TFs on their target genes. Seq-GraphReg_TF_KO.py does these kinds of analyses.

We can also perform an in-silico saturation mutagenesis (ISM) in some distal enhancers of the genes to get an idea about how single mutations get affect the expression predictions of the genes. Seq-GraphReg_ISM.py can do ISM using Seq-GraphReg models.