Angela Simeone
This repository collects the bash and R scripts used to analyze PDTX models by quantitative G4-ChIP-seq (qG4-ChIP-seq). Twenty-two PDTX models were profiled by qG4-ChIP-seq, each in 4 technical replicates plus input (5 libraries each).
In our study (PID, DOI:) we process data obtained from qG4-ChIP-seq. We identify regions with differential binding (up-regulated) in each of the 22 breast cancer models and refer to these regions as ΔG4.
We then explore and characterize the ΔG4 regions of all 22 models with respect to expression, copy number alterations and transcription factor binding profiles.
Data produced using qG4-ChIP-seq have been deposited in GEO under accession GSE152216. The sample file describing the deposited data is here. Additional information about the qG4-ChIP-seq protocol is available here.
The repository contains the files used to perform the high-level analysis.
Differential regions for breast cancer can be found as bed files here.
Listed below are the steps followed to perform the processing and analysis of qG4-ChIP-seq data.
- Sequencing (Fastq) processing: alignment, peak calling and generation of consensus regions from peaks: adapter removal, alignment, species separation, duplicate removal, peak calling, downsampling of input files and repeated peak calling, extraction of multi2 regions (regions confirmed in 2 out of 4 replicates), and merging of the old consensus (first round of experiments) with the new consensus (second round of experiments).
- Analysis of PDTX samples qBG4-ChIP-seq
  Use the human consensus to estimate the human signal, use the Drosophila ROI consensus regions to estimate the Drosophila signal, and collect all stats (library sizes, coverages for human and Drosophila respectively) needed for Drosophila normalization.
  Intermediate files are generated and used in the customized R script to generate the sample file and to extract normalization factors.
- Characterization of differential regions ΔG4
  Pairwise comparison of PDTX G4 regions (Jaccard index), fold-enrichment of PDTXs at 45 cancer driver genes, G4-motif analysis, and preparation of data to analyze the association of ΔG4 with expression.
- Association of ΔG4 to expression
  R script to explore the association between the presence of ΔG4 in promoters and relative expression levels in each individual PDTX model. This analysis shows that the presence of ΔG4 is generally associated with highly expressed genes.
- Association of ΔG4 to CNA
  Explore fold-enrichment of ΔG4 at CNA regions. This analysis is done for each individual PDTX (i.e. the CNAs of each individual PDTX were identified first and ΔG4 were then compared to them).
  - R script to identify CNA given input library bam (5M subsampling)
    This script uses the R package "QDNAseq".
- Association of ΔG4 to CNA (copy number alterations) and expression
- Association of ΔG4 to TF binding sites (after downloading data from ChIP-Atlas)
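
The multi2 extraction mentioned in the sequencing step (regions confirmed in 2 out of 4 replicates) can be illustrated with a small interval sweep. The following is a minimal Python sketch, not the repository's bash code: the function names `merge_intervals` and `multi_n_regions`, the `min_support` parameter and the toy coordinates are all hypothetical, and it assumes half-open BED-style intervals on a single chromosome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED files


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def multi_n_regions(replicates: List[List[Interval]], min_support: int = 2) -> List[Interval]:
    """Return regions covered by peaks from at least `min_support` replicates."""
    events = []
    for peaks in replicates:
        # Merge within a replicate first so each replicate counts at most once.
        for start, end in merge_intervals(peaks):
            events.append((start, 1))
            events.append((end, -1))
    # At equal positions, interval ends (-1) sort before starts (+1),
    # which matches half-open coordinates.
    events.sort()

    regions: List[Interval] = []
    depth = 0
    region_start = 0
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_support <= depth:
            region_start = pos  # support threshold reached
        elif depth < min_support <= previous:
            regions.append((region_start, pos))  # support dropped below threshold
    # Join regions that touch each other.
    return merge_intervals(regions)


# Toy peaks for four replicates (made-up coordinates, for illustration only).
rep1 = [(100, 200), (500, 600)]
rep2 = [(150, 250)]
rep3 = [(900, 1000)]
rep4 = [(180, 220), (550, 560)]
print(multi_n_regions([rep1, rep2, rep3, rep4]))  # [(150, 220), (550, 560)]
```

For real data, the same function would be applied per chromosome, and each replicate's peaks would be read from its peak file.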
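
The pairwise comparison of PDTX G4 regions by Jaccard index, mentioned in the characterization step, can be sketched as follows. This is a minimal Python illustration under the assumption that the index is computed at base-pair level (intersection length divided by union length); the README does not state which definition the repository scripts use. The names `merge_intervals`, `intersect` and `jaccard`, the model labels and the coordinates are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, disjoint interval lists with a two-pointer scan."""
    i = j = 0
    out: List[Interval] = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def jaccard(a: List[Interval], b: List[Interval]) -> float:
    """Base-pair Jaccard index: intersection length / union length."""
    a, b = merge_intervals(a), merge_intervals(b)
    length = lambda regions: sum(end - start for start, end in regions)
    inter = length(intersect(a, b))
    union = length(a) + length(b) - inter
    return inter / union if union else 0.0


# Hypothetical region sets for three models (made-up coordinates).
models: Dict[str, List[Interval]] = {
    "model_A": [(0, 100), (200, 300)],
    "model_B": [(50, 150), (250, 400)],
    "model_C": [(1000, 1100)],
}
for name_1, name_2 in combinations(models, 2):
    print(name_1, name_2, round(jaccard(models[name_1], models[name_2]), 4))
```

With 22 models, the same loop yields the 231 pairwise values that fill a symmetric similarity matrix.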
Here is the list of data files used at various stages of the analysis.
Input files to generate the sample file
Input files to perform drosophila normalization
Input files (consensus regions) to extract coverages
Input files to perform differential analysis
Input files to perform the analysis of the association between DG4R and expression
- Coordinates of gene promoters hg19
- List of promoter with overlapping DG4R
- List of promoter with overlapping CG4R
- Expression values data table
Input files to perform analysis of helicases
Input files for analysis of the association between DG4R and CNA
- genome file
- occurrences of overlaps between DG4R and CNA_AMP in the actual and randomized cases
- occurrences of overlaps between DG4R and CNA_GAIN in the actual and randomized cases
- occurrences of overlaps between DG4R and CNA_NEUT in the actual and randomized cases
- occurrences of overlaps between DG4R and CNA_HETD in the actual and randomized cases
- occurrences of overlaps between DG4R and CNA_HOMD in the actual and randomized cases
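
Overlap counts for the actual and randomized cases, such as those listed above, are the usual inputs for a fold-enrichment estimate. The sketch below is a minimal Python illustration, assuming that fold-enrichment is the actual overlap count divided by the mean count over randomizations; the README does not spell out the formula used in the repository's R scripts. The function names, the empirical p-value and the numbers are hypothetical and for illustration only.

```python
from statistics import mean
from typing import Sequence


def fold_enrichment(actual: int, randomized: Sequence[int]) -> float:
    """Actual overlap count divided by the mean count over randomizations."""
    expected = mean(randomized)
    if expected == 0:
        raise ValueError("mean randomized overlap is zero; fold-enrichment is undefined")
    return actual / expected


def empirical_p_value(actual: int, randomized: Sequence[int]) -> float:
    """One-sided empirical p-value with a +1 pseudocount in numerator and denominator."""
    at_least_as_extreme = sum(1 for count in randomized if count >= actual)
    return (at_least_as_extreme + 1) / (len(randomized) + 1)


# Made-up counts: overlaps of DG4R with one CNA class, actual vs shuffled regions.
actual_overlaps = 120
randomized_overlaps = [40, 60, 50, 50]

print(fold_enrichment(actual_overlaps, randomized_overlaps))    # 2.4
print(empirical_p_value(actual_overlaps, randomized_overlaps))  # 0.2
```

The same two calls would be repeated for each CNA class (AMP, GAIN, NEUT, HETD, HOMD) and each PDTX model.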