A sparse 3D matrix of 2,503,732 bound and open regions across 175 transcription factors and 70 cells/tissues

Primary language: Python. License: MIT.

TF binding matrix

A sparse 3D matrix of 2,503,732 bound and open regions across 175 transcription factors and 70 cell and tissue types

News

01/09/2020: We have expanded the matrix using recent data from ENCODE.

Content

  • The data folder contains scripts to download all the data necessary to build the matrices
  • The lib folder contains global functions to be used by all Python scripts
  • The matrix folder contains the Python scripts to build the matrices
  • The file environment.yml contains the conda environment used to build the matrices (see dependencies)

Dependencies

All dependencies can be easily installed through the conda package manager:

conda create -n TfBindingMatrix -c bioconda -c conda-forge python=3.7 biopython \
    coreutils numpy pandas pybedtools sparse wget

Steps

The steps below were used to generate the TF binding matrices for the transfer learning manuscript.

1. Data

1.1 DNase I hypersensitive sites

Download clustered DHS data for 95 cell and tissue types from the ENCODE DHS peak clusters at the UCSC Genome Browser. Then, extract the center of each cluster and expand it by 100 bp in each direction using bedtools slop, for a final length of 200 bp. In addition, download the names of the clustered cells and tissues and their correspondence with the different cluster IDs.

cd ./DHS/UCSC/
./get_dhs.sh
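As a rough illustration of the resizing step described above, the following Python sketch recenters an interval and pads it by 100 bp on each side. It is not the code in get_dhs.sh, which relies on bedtools slop; the function names `center_window` and `resize_bed_line` are hypothetical, and the midpoint convention (integer midpoint, half-open window) is an assumption chosen so that the result is exactly 200 bp long.

```python
def center_window(start, end, flank=100, chrom_size=None):
    """Return a (start, end) window of up to 2 * flank bp around the interval midpoint.

    Assumption: the center is the integer midpoint of a 0-based, half-open
    BED interval, and the window is [center - flank, center + flank).
    """
    center = (start + end) // 2
    new_start = max(0, center - flank)  # never run off the chromosome start
    new_end = center + flank
    if chrom_size is not None:
        new_end = min(new_end, chrom_size)  # clip at the chromosome end if known
    return new_start, new_end


def resize_bed_line(line, flank=100):
    """Resize one tab-separated BED line, keeping any extra columns unchanged."""
    chrom, start, end, *rest = line.rstrip("\n").split("\t")
    new_start, new_end = center_window(int(start), int(end), flank)
    return "\t".join([chrom, str(new_start), str(new_end), *rest])


resized = resize_bed_line("chr1\t1000\t1300\tcluster1")
```

Intervals whose center lies within 100 bp of a chromosome boundary end up shorter than 200 bp, since the window is clipped rather than shifted.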

1.2 ENCODE accessibility and TF binding

Download all human DNase-seq and TF ChIP-seq data from ENCODE. Then, resize all the DNase-seq data to a final length of 150 bp using bedtools slop and store it in a single file. Finally, store the ChIP-seq data for each TF in a separate file.

cd ./ENCODE/
./get_encode.sh

1.3 Hg38 genome sequence

Download the FASTA sequence of build 38 of the Genome Reference Consortium human genome (i.e. hg38). Discard any non-standard chromosomes.

cd ./Genomes/hg38/
./get_hg38.sh
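The filtering step can be sketched in plain Python as below. This is only an illustration, not the logic of get_hg38.sh: the helper `filter_standard` is hypothetical, and the set of "standard" chromosomes (chr1 to chr22, chrX and chrY, without chrM) is an assumption that may differ from the one used by the script.

```python
import re

# Assumption: "standard" means chr1-chr22, chrX and chrY (chrM is excluded here).
STANDARD = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y)$")


def filter_standard(lines, keep_pattern=STANDARD):
    """Yield only the FASTA records whose sequence name matches keep_pattern."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token of the header.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = bool(keep_pattern.match(name))
        if keep:
            yield line


fasta = [
    ">chr1\n", "ACGT\n",
    ">chr1_KI270706v1_random\n", "NNNN\n",
    ">chrX\n", "GG\n",
    ">chrUn_KI270302v1\n", "TT\n",
]
kept = list(filter_standard(fasta))
```

Because the function works line by line on any iterable, it can be applied to an open file handle without loading the whole genome into memory.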

1.4 ReMap TF binding

Download all human TF ChIP-seq peaks from ReMap 2018. Then, extract the peak summits and the sample names given to the different ENCODE experiments (i.e. files whose name starts with ENCSR).

cd ./ReMap/
./get_remap.sh
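A minimal sketch of the summit and sample-name extraction follows. It is not taken from get_remap.sh. It assumes that each ReMap BED line names the peak as `experiment.TF.biotype` in column 4 and stores the summit in the thickStart column (column 7), and it falls back to the interval midpoint when that column is absent. The helpers `parse_remap_line` and `is_encode_experiment` are hypothetical.

```python
def parse_remap_line(line):
    """Parse one ReMap BED line into a dict with its summit and sample name.

    Assumptions: column 4 is "experiment.TF.biotype" and column 7 (thickStart)
    holds the summit position; check both against the downloaded files.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
    parts = name.split(".")
    experiment = parts[0]
    tf = parts[1] if len(parts) > 1 else ""
    # Fall back to the interval midpoint if no thickStart column is present.
    summit = int(fields[6]) if len(fields) >= 8 else (start + end) // 2
    return {"chrom": chrom, "summit": summit, "experiment": experiment, "tf": tf}


def is_encode_experiment(experiment):
    """ENCODE experiments are identified by the ENCSR accession prefix."""
    return experiment.startswith("ENCSR")


record = parse_remap_line("chr1\t100\t300\tENCSR000ABC.CTCF.K562\t0\t.\t180\t181\t0,0,0\n")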

1.5 UniBind TFBSs

Download all human PWM-based TFBS predictions from UniBind. Then, collapse all TFBSs into a single file, and extract the names of the different TFs as well as the sample names given to the different ENCODE experiments (i.e. files whose name starts with ENCSR).

cd ./UniBind/
./get_unibind.sh 

2. Matrices

Build two TF binding matrices (i.e. data structures containing information about TF binding events, not motif models), one sparser than the other. The matrices aggregate binding data, from both ChIP-seq experiments and TFBS predictions, of 163 TFs to 1,817,918 accessible genomic regions (i.e. DHSs) in 52 cell and tissue types. The matrices are saved as 2D NumPy arrays, with rows corresponding to individual TFs and columns to DHS regions.

cd ./matrix/UCSC/
./matrix.py --dhs-file ../../data/DHS/UCSC/DHS.200bp.bed \
            --encode-dir ../../data/ENCODE/hg38/ \
            --fasta-file ../../data/Genomes/hg38/hg38.fa \
            --remap-dir ../../data/ReMap/ \
            --unibind-dir ../../data/UniBind/

The final matrices can be found here under the names matrix2d.ReMap+UniBind.sparse.npz and matrix2d.ReMap+UniBind.less-sparse.npz.
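For orientation, the sketch below shows one way such an .npz file could be loaded and summarized with NumPy. It runs on a small toy matrix rather than the real files, and several details are assumptions: the array key inside the archive is not documented here, so the code simply takes the first one; the dtype is assumed to be float; and the encoding (1 for bound, 0 for unbound, NaN for unresolved) is illustrative only. The helpers `load_first_array` and `summarize` are hypothetical.

```python
import os
import tempfile

import numpy as np


def load_first_array(npz_path):
    """Load the first array stored in an .npz archive and return (key, array)."""
    with np.load(npz_path) as archive:
        key = archive.files[0]  # the key name in the real files is not assumed
        return key, archive[key]


def summarize(matrix):
    """Per-TF (row) counts of resolved entries and of bound entries.

    Assumption: float matrix where 1 = bound, 0 = unbound, NaN = unresolved.
    """
    resolved = ~np.isnan(matrix)
    return resolved.sum(axis=1), np.nansum(matrix, axis=1)


# Toy stand-in: 2 TFs (rows) x 3 DHS regions (columns).
toy = np.array([[1.0, 0.0, np.nan],
                [np.nan, np.nan, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.npz")
    np.savez_compressed(path, toy)  # unnamed arrays are stored as "arr_0"
    key, loaded = load_first_array(path)

n_resolved, n_bound = summarize(loaded)
```

To inspect a real file, `load_first_array` can be pointed at matrix2d.ReMap+UniBind.sparse.npz, after first checking its dtype and value encoding.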