Process explanation:
Part 1 - Get enhancer sequences and target gene names:
- Read elite GeneHancer bed file, line by line
- For each line, extract the coordinates of the enhancer and the Ensembl ID of the target gene
- Look up the HUGO name of the target gene and its biotype/description in genes.ENSG.tbl
- Extract the enhancer sequence from referenceGenome matching the coordinates
- Write the results to an intermediate file
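
The Part 1 steps can be sketched roughly as follows. This is only a minimal illustration, not the pipeline's actual code. It assumes that the Ensembl ID sits in column 4 of the bed file and that genes.ENSG.tbl is tab-separated with the columns Ensembl ID, HUGO name, biotype. Neither layout is confirmed here. The function names, the toy genome and the gene ID are made up for the example.

```python
import csv
import io


def load_gene_table(handle):
    """Map Ensembl ID -> (HUGO name, biotype). Column order is an assumption."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3 or row[0].startswith("#"):
            continue  # skip comments and malformed lines
        table[row[0]] = (row[1], row[2])
    return table


def parse_bed_line(line):
    """Return (chrom, start, end, ensembl_id); assumes the ID is in column 4."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2]), fields[3]


def enhancer_record(line, genome, genes):
    """Combine coordinates, gene annotation and sequence for one bed line."""
    chrom, start, end, ensg = parse_bed_line(line)
    hugo, biotype = genes.get(ensg, ("NA", "NA"))
    # BED coordinates are 0-based and half-open, which matches Python slicing.
    sequence = genome[chrom][start:end]
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "ensembl": ensg,
        "hugo": hugo,
        "biotype": biotype,
        "sequence": sequence,
    }


# Toy stand-ins for the reference genome and genes.ENSG.tbl (invented data).
genome = {"chr1": "ACGTACGTAGCTAGCTAACG"}
genes = load_gene_table(io.StringIO("ENSG00000000001\tGENE1\tprotein_coding\n"))
record = enhancer_record("chr1\t4\t12\tENSG00000000001\n", genome, genes)
print(record)
```

For the real reference genome, holding every chromosome in a dict of strings uses a lot of memory. An indexed FASTA reader, or the bedtools command listed under the expected files, is the more practical route.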
Part 2 - Get TF binding site motifs and apply them to enhancers:
- Read in HOCOMOCO PFMs (position frequency matrices) one at a time
- For each matrix, generate a Biopython Motif
- For each enhancer region from Part 1, find which binding site motifs match it (on both the + and - strands)
- Calculate the average number of binding sites that match within one enhancer region - to be informed on the supernode
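
The Part 2 scan is done with Biopython motifs. The dependency-free sketch below only illustrates the underlying idea: turn a PFM into a log-odds matrix, score every window on both strands, and average the hit counts per enhancer. The pseudocount (0.5), the uniform background (0.25), the threshold (4.0), the toy motif and the toy enhancers are all assumptions chosen for the example, not values used by the pipeline.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds_matrix(pfm, pseudocount=0.5, background=0.25):
    """Convert a PFM (list of {base: count} columns) to log2-odds scores."""
    matrix = []
    for column in pfm:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return matrix


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, matrix, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[j][b] for j, b in enumerate(window))
        if score >= threshold:
            hits.append((i, score))
    return hits


def scan_both_strands(seq, matrix, threshold):
    """Return (forward-strand start, strand, score) for hits on + and -."""
    seq = seq.upper()
    width = len(matrix)
    hits = [(pos, "+", s) for pos, s in scan_strand(seq, matrix, threshold)]
    for pos, s in scan_strand(reverse_complement(seq), matrix, threshold):
        # Map the position on the reverse complement back to the + strand.
        hits.append((len(seq) - pos - width, "-", s))
    return hits


# Toy motif with consensus TGA and two toy enhancers (invented data).
pfm = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
matrix = log_odds_matrix(pfm)
enhancers = {"enh1": "CCTGACC", "enh2": "GGTCAGGTGA"}

counts = {
    name: len(scan_both_strands(seq, matrix, threshold=4.0))
    for name, seq in enhancers.items()
}
mean_sites = sum(counts.values()) / len(counts)
print(counts, mean_sites)
```

Note that a palindromic motif matches the same site on both strands, so it is counted twice by this scheme.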
Part 3 - Use co-expression data to find TFBS clusters:
- Use GTEx co-expression data to determine how TFs that match within the same enhancer region regulate transcription of the target gene - to be informed on the supernode
Files expected to be present:
- GRCh38.primary_assembly.genome.fa - GRCh38 reference genome (primary assembly)
- genes.ENSG.tbl - gene names in Ensembl and HUGO forms, along with biotype/description
- elite_ensg_enhan_fused_ensembl_prom_500b.hg38.bed - elite enhancer coordinates from GeneHancer
- elite_enhancer_sequences.fa - elite enhancer sequences (generated)
(You can generate it by running the following command:
bedtools getfasta -fi GRCh38.primary_assembly.genome.fa -bed elite_ensg_enhan_fused_ensembl_prom_500b.hg38.bed -fo elite_enhancer_sequences.fa