/ChIPSeQ

ChIPSeQ Analysis Protocol

Primary LanguagePythonCeCILL Free Software License Agreement v2.1CECILL-2.1

ChIPSeQ


Update here


  1. Set up directories structure and config file
  2. Launch Main pipeline
  3. Multiqc (marking duplicates and plot number of mapped reads) - You don't care !
  4. ChipQC
  5. DiffBinding and Peak annotation
  6. Set up Environment - You don't care !

You will the last version of this code here :

/home/jean-philippe.villemin/disciplussimplex/TEST_CHIPSEQ/

Each time you use this pipeline, you need to copy/paste it in a new directory with a config file.

Set up directories structure and config file


Snakemake pipeline provided by Raoul Raffel @t IGH,Montpellier.

Directory aborescence need to be strictly the same.

One directory per condition. Several histone marks are inside these directories. A raw_data directory where you put fastq files.

Inside this directory, you will set up directories for input raw data and for each replicate of histone marks as followed : ChipSeqDir

You also have in this directory the script and its config file.

Raw fastq must be formated as follows file.fq.gz.

NB : Due to a bug in CTCF and K9AC files. fastq_illumina_filter doesn't work. See here To bypass that, rename the file as file.filtered.fq.gz. Or you can remove the spaces by yourself and still do this filtering (sed 's/^$/N/'

newfile.fa) .

NB1 : If you received different files for one replicate, you need to merge them into one, with a command line like :

	cat file1.gz file2.gz > Luco22.fq.gz

Then, you need to configure your yaml configuration file. See provided example.

You need to change filepaths,filenames accordingly to your data & environment.

Launch Main Pipeline

The main pipeline do all alignments for fastq files. It produces also fastqc, wig files for visualisation and finally it will gives us the peaks calling files. Other stuffs are done but you don't care. (the fastqc produces MissingOutputException...but it works don't worry)

Dry Run ( can be useful to see the set of commands that will be done):

  • n : is drymode
  • p : p is for print (...)
  • j : number of cores
	snakemake -s pipeline_chip-seq.Snakefile.py -j 16 -k -n --verbose -p

Real Run (nohup make the script runs in background. Note the ampersand at the end to go back in the shell.):

	nohup snakemake -s pipeline_chip-seq.Snakefile.py -j 16 -k --verbose -p &

Control with htop -u username and look into nohup.out file to see if job is finished. nohup.out is a file created with all the commands, errors that can occurs, it's a log file.

NB4 : You can relaunch script if you find something went wrong.It will execute only part that failed. Fastqc will be done again (whereas fastqc analysis has already be done with sucess) because of a bug I didn't manage to correct. computeMatrix_ref_point, computeMatrix_ref_point failed but you don't care because we don't use their output.

NB5 : Don't use the normalise bigwig create here. You can use bigwig for replicates normalised by RPKM but not taking account of the input.

Description of output : You should know them since 3 years.


Normalise tracks

This step is mandatory if you want to create next bigwig with mean signal without duplicates.

Mark Duplicates

#MarkDuplicate
find /pathTo/condition -name *.sorted.bam  | xargs -I{} bash -c 'java -jar /home/jean-philippe.villemin/bin/picard.2-6/picard.2-6.jar MarkDuplicates I=$0 O=${0/.sorted.bam/.sorted.marked.bam} M=${0/.bam/.marked_dup_metrics.txt} VALIDATION_STRINGENCY=LENIENT' {} \;

Remove Duplicates

find /pathTo/condition -name *.sorted.bam   | xargs -I{} bash -c 'java -jar /home/jean-philippe.villemin/bin/picard.2-6/picard.2-6.jar MarkDuplicates I=$0 O=${0/.sorted.bam/.sorted.removed.marked.bam} M=${0/.bam/.sorted.removed.marked_dup_metrics.txt} REMOVE_DUPLICATES=TRUE VALIDATION_STRINGENCY=LENIENT' {} \;

This has been added to have only one track per mark. Here we are doing one track using mean of replicates removing the input signal signal to normalise.

The script is using a json configuration file. It's also plotting profile around TSS for differentialy expressed gene. So you need to provide bed files for TSS of upregulated genes and TSS of downregulated genes. You can use the old bed I set in the config. This is not the main purpose of the script but I didn't removed this part to it's still trying to plot stuff but your are only instered in the bigwig normalised.

So here you need to create a json file. See --config parameter to retrieve an example file.

Here we are just interested in creating the mean track, you so can provide fake bed files.

Example :

python3 /home/jean-philippe.villemin/code/CHIP-SEQ/src/deeptools_countsignal.py --config=/home/jean-philippe.villemin/code/configs/HEATMAP_HISTONE/tss_by_DGE_NODUP/POL2_T7.json 

NB : Lot of things in json you don't care.
Only things you need to change are the path to bam files.( "bam" , bam-DupRemoved" and 'path_to_output' for Rep1,Rep2,Control)

ChipQC

The next step are based on ChipQC and DiffBind R packages. You need to do chipQC to do differential binding. Infos about parameters of chipqc Infos about parameters diffbind

First you need to create a samplesheet containing all path to your samples. see( ChipQC ,DiffBind,Doc )

You will find an example here :

/home/jean-philippe.villemin/disciplussimplex/TEST_CHIPSEQ/samplestest.csv /home/jean-philippe.villemin/CHIPSEQ_2017_1_ALL/samplesChipMarks.csv

Quality

You will use this file to create an R object saved on disk that will be used by the next step. This step is creating a first report with quality controls.

You can use consensus methodology where ChIPQC is done using consensus length around peak summit. With this approach, you will also have input signal plotted for interval you defined around peak summits with option -s.

Option -c is used to say you are using consenus methodology. (see docs) Option -n is used to say what is the name of the Robject saved.

Rscript chipQC.R --file /pathTO/samplesChipMarks.csv -n ALL_MarkedDup -s 500 -c

You can find it here : /home/jean-philippe.villemin/code/CHIP-SEQ/R/

NB : I loaded the library at the beginning. It call hg38. (if you work on mouse, you will have to change hard coded stuffs)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

NB : Index .bai must be created before. Very important.

DiffBinding and Peak annotation

Here you will test a variation of binding between condition using R object output (name ALL_MarkedDup in this example) from the step before.

-n is used to say the number of conditions.
-m is used to say which marks from your sample you want to analyze. -p is used to say percentage at least used to retrieve peaks in all you samples. minOverlap (peakset parameter in diffbind) only include peaks in at least this many peaksets in the main binding matrix.If minOverlap is between zero and one, (Don't change 0.99, that means peaks must be in all samples.)Peak will be included from at least this proportion of peaksets.

You will do that for each mark you want to analyse in your sampleshit.

Rscript chipDiffBind.R --file=/pathTO/ALL_MarkedDup -n 3 -m K27AC -p 0.99

NB : Don't use the .Rdata extension when you call the file.

Description of the 2 outputs :

No filter is applied on the fold or the p-value. You will find that in the 4th column of the .bed.

POL2.T1_vs_T7.bed:

chr10 73910795 73916597 2623_NearestLocation_inside_ENSG00000122861_proteincoding_PLAU_6.91_20.9030899869919_+_4519_900 0 .

chrom start end IDPeak_fromOverlappingOrNearest_insideFeature_Feature_genebiotype_symbol_Fold_log10(p-value)_strand_distancetoFeature_shortestDistance

Annotation comes from chipeakAnno

fromOverlappingOrNearest: nearest: indicates this feature's start (feature's end for features at minus strand) is closest to the peak start; insideFeature: upstream peak resides upstream of the feature; downstream: peak resides downstream of the feature; inside: peak resides inside the feature; overlapStart: peak overlaps with the start of the feature; overlapEnd: peak overlaps with the end of the feature; includeFeature: peak include the feature entirely distancetoFeature: distance to the nearest feature such as transcription start site. By default, the distance is calculated as the distance between the start of the binding site and the TSS that is the gene start for genes located on the forward strand and the gene end for genes located on the reverse stran shortestDistance: The shortest distance from either end of peak to either end the feature.

POL2.T1_vs_T7.diffPeaks.bed :

chr10 73910795 73916597 2623 0 .

2623 is just the peak ID

All the packages that are used are listed below.


Set up Environment


First you need to check that the following tools are installed on server/computer.

Deeptools : Tool to handle deepsequencing files here

Macs2 : Peak Caller here

wigToBigWig : Include in KentTools suite here

It's advised to install Conda.
It will make things easier then for installing tools. (snakemake for example)

Conda : here

snakemake : Evil framework for pipeline in python here

WiggleTools : Optionnal , can be useful when you want to merge bigwig files and apply some basic numerical operations here

SPP : Need to be there but you don't care here

MaSC : Need to be there but you don't care here

R3.3.1 : A lot of packages are used ... It's adviced to read their docs.

ChipQC : Control Quality for peaks. Object created will be used then with ChIPQC for Differential binding here

DiffBind : Differential binding analysis. Which peaks are moving in my samples ? here

ChIPpeakAnno : Anotation of the peaks in a genomic context here

ChIPseeker : Used for a derived upset peaks graph counting how much peaks are annotated in one or other features. here

rtracklayer : Very usefull to export/import bed to Grange objects. here

ComplexHeatmap : Plot heatmap. here

Other packages are mandatory. biomaRt, GenomicFeatures,stringr, reshape , ggplot2, optparse , knitr ... Look inside R scripts to know what to install.