Course Project for Columbia Course in Computational Genomics CBMF W4761
Taught by Itshack Pe'er
Brian Trippe and Jeffrey Zhou
This GitHub repository describes a pipeline that processes ChIP-seq and RNA-seq data into a format that can be provided to a Random Forest classifier. In our project, we evaluate the accuracy of the learned models with 5-fold bootstrapping and use that accuracy as a metric for how well certain epigenetic marks correlate with the RNA-seq data.
Please see the individual TSS and splicing sub-directories for instructions to proceed with each pipeline.
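As a rough illustration of the evaluation idea described above, the sketch below trains a scikit-learn random forest and reports its mean held-out accuracy. It is not the project's code: it uses 5-fold cross-validation as a stand-in for the resampling scheme, the data are synthetic, and the function name `evaluate_marks`, the feature layout, and the number of trees are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def evaluate_marks(features, labels, n_folds=5, seed=0):
    """Return the mean held-out accuracy of a random forest over n_folds folds.

    Hypothetical helper: 'features' is a (regions x marks) array of
    ChIP-seq-derived signals and 'labels' is a binary expression class.
    """
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    scores = cross_val_score(clf, features, labels, cv=n_folds, scoring="accuracy")
    return scores.mean()


# Synthetic stand-in data: 200 regions x 4 hypothetical mark signals.
rng = np.random.RandomState(0)
features = rng.rand(200, 4)
# The label depends only on the first mark, so that mark is fully informative.
labels = (features[:, 0] > 0.5).astype(int)

mean_accuracy = evaluate_marks(features, labels)
print("mean accuracy: %.3f" % mean_accuracy)
```

A high mean accuracy on real data would suggest that the marks used as features carry information about the expression labels; running the same function on different subsets of marks gives a way to compare them.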
Hardware:
- A Unix-based operating system
- 8GB RAM, ~300GB hard drive space (optional)
Software:
- Python 2.7
- C++
- rapidjson
- numpy, scipy, scikit-learn
See these instructions.
See [these] instructions.
This file contains all the gene/transcript/exon annotations that the Roadmap project originally used to perform RNA quantification.
Read mappings of the ChIP-seq data used in the pipeline
RPKM matrix by exon of the RNA-seq data (protein coding genes) used in the pipeline
RPKM matrix by exon of the RNA-seq data (non-coding genes) used in the pipeline
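For readers unfamiliar with the unit, the sketch below shows the standard definition of RPKM (reads per kilobase of exon per million mapped reads). It is only meant to explain what a matrix entry represents; how the matrices above were actually normalised is not stated here, and the function name and example numbers are hypothetical.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads.

    Equivalent to read_count / (exon_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (exon_length_bp * float(total_mapped_reads))


# Hypothetical exon: 500 reads on a 2,000 bp exon in a library of 25 million mapped reads.
value = rpkm(500, 2000, 25000000)
print(value)  # 10.0
```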