/XR-DS-seq-Tutorial

GitHub repository tutorial for XR-Seq and Damage-Seq methodologies, providing code examples, datasets, and documentation for studying DNA damage and repair processes.

Primary LanguagePythonMIT LicenseMIT

XR-DS-seq-Tutorial

GitHub repository tutorial for XR-Seq and Damage-Seq methodologies, providing code examples, datasets, and documentation for studying DNA damage and repair processes.

Introduction to NGS (UNDER CONSTRUCTION)

This section provides an overview of Next-Generation Sequencing (NGS) technologies, their applications, and their significance in DNA damage and repair studies.

Introduction to XR-seq (UNDER CONSTRUCTION)

Introduction to Damage-seq (UNDER CONSTRUCTION)

NGS File Format

This section explains the file formats that you will encounter troughout the tutorial. You can find more detail about each of these formats together with others from the link.

  • FASTQ:

    • The FASTQ format is used to store both the sequencing reads and their corresponding quality scores.
    • It consists of four lines per sequence read:
    • Line 1: Begins with a '@' symbol followed by a unique identifier for the read.
    • Line 2: Contains the actual nucleotide sequence of the read.
    • Line 3: Starts with a '+' symbol and may optionally contain the same unique identifier as Line 1.
    • Line 4: Contains the quality scores corresponding to the nucleotides in Line 2.
    • FASTQ files are commonly generated by sequencing platforms and are widely used as input for various NGS data analysis tools.
  • FASTA:

    • The FASTA format is a simple and widely used text-based format for representing nucleotide or protein sequences.
    • Each sequence in a FASTA file consists of two parts:
      • A single-line description or identifier that starts with a '>' symbol, followed by a sequence identifier or description.
      • The sequence data itself, which can span one or more lines.
    • The sequence lines contain the actual nucleotide or amino acid symbols representing the sequence.
    • FASTA files can store multiple sequences, each with its own identifier and sequence data.
    • FASTA format is commonly used for storing and exchanging sequences, such as reference genomes, gene sequences, or protein sequences.
    • Many bioinformatics tools and databases accept FASTA files as input for various sequence analysis tasks, including alignment, motif discovery, and similarity searches.
  • SAM/BAM:

    • SAM (Sequence Alignment/Map) is a text-based format used to store the alignment information of reads to a reference genome.
    • BAM (Binary Alignment/Map) is the binary equivalent of SAM, which is more compact and allows for faster data processing.
    • Both SAM and BAM files contain the same information, including read sequences, alignment positions, mapping qualities, and optional tags for additional metadata.
    • SAM/BAM files are crucial for downstream analysis tasks such as variant calling, differential expression analysis, and visualization in genome browsers.
    • SAM/BAM files can be generated using aligners like Bowtie2 or BWA.
  • BED:

    • BED (Browser Extensible Data) is a widely used format for representing genomic intervals or features.
    • It consists of a tab-separated plain-text file with columns representing different attributes of each genomic feature.
    • The standard BED format includes columns for chromosome, start position, end position, feature name, and additional optional columns for annotations or scores.
    • BED files are commonly used for visualizing and analyzing genomic features such as gene coordinates, binding sites, peaks, or regulatory regions.
    • BED files can be generated from BAM files using tools like BEDTools or samtools.
  • BEDGRAPH:

    • BEDGRAPH is a file format used to represent continuous numerical data across the genome, such as signal intensities, coverage, or scores.
    • Similar to BED files, BEDGRAPH files are plain-text files with columns representing genomic intervals and corresponding values.
    • The standard BEDGRAPH format includes columns for chromosome, start position, end position, and a numerical value representing the data for that interval.
    • BEDGRAPH files are typically used to visualize and analyze genome-wide data, such as ChIP-seq signal, DNA methylation levels, or RNA-seq coverage.
    • The values in a BEDGRAPH file can be positive or negative, representing different aspects of the data.
    • BEDGRAPH files can be generated from BAM files using tools like BEDTools or bedGraphToBigWig.
  • BigWig:

    • BigWig is a binary file format designed for efficient storage and retrieval of large-scale numerical data across the genome.
    • BigWig files are created from BEDGRAPH files and provide a compressed and indexed representation of the data, allowing for fast random access and visualization.
    • BigWig files store the data in a binary format and include additional indexing information for quick retrieval of data within specific genomic regions.
    • They are commonly used for visualizing and analyzing genome-wide data in genome browsers or for performing quantitative analyses across different genomic regions.
    • BigWig files can be generated from BEDGRAPH files using tools like bedGraphToBigWig or through conversion from other formats like BAM or WIG.

Create conda environment

Initially, you will create and activate a conda environment to set up the necessary dependencies for running the tutorial. If you haven't downloaded conda yet, you can find the instructions in the link.

conda create --name tutorial

conda activate tutorial

Downloading reference genome and generating related files

This section provides commands to download a reference genome (GRCh38) and generate related files. In this tutorial, you will use Bowtie2 as the sequence aligner tool. Therefore, you must install Bowtie2 together with samtools, which is a versatile software package used for manipulating and analyzing SAM/BAM files.

conda install -c bioconda bowtie2

conda install -c bioconda samtools

Next, you will download the reference genome. Because you will analyze results/HeLa cells that are a type of human cancer cells (derived from cervical cancer cells), GRCh38 genome is chosen as the reference:

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/GRCh38.p13.genome.fa.gz -O ref_genome/GRCh38.fa.gz && gunzip ref_genome/GRCh38.fa.gz

After downloading the genome fasta file, you should generate the index files for bowtie2 which is crucial for efficiently reducing computational time and improving alignment quality.

bowtie2-build --threads 8 ref_genome/GRCh38.fa ref_genome/Bowtie2/genome_GRCh38

You will create another index file via samtools to retrieve sequence information based on genomic coordinates or sequence identifiers.

samtools faidx ref_genome/GRCh38.fa

Next, the index file created by samtools will be converted to a ron file (a different format of the index). This file will be useful when we simulate our samples with boquila.

python3 scripts/idx2ron.py -i ref_genome/GRCh38.fa.fai -o ref_genome/GRCh38.ron -l genome_idex2ron.log

Quality control

Here, you'll learn how to perform quality control on the NGS data using FastQC, a tool for assessing the quality of sequencing reads. A nice github page for the evaluation of FastQC results.

To run FastQC, you should download the package from bioconda:

conda install -c bioconda fastqc

Then you can create a directory for the output of FastQC with mkdir qc/ command. Lastly, you can run the command below to execute the tool for both Damage-seq and XR-seq samples:

fastqc -t 8 --outdir qc/ samples/hela_ds_cpd.fq 

fastqc -t 8 --outdir qc/ samples/hela_xr_cpd.fq 

Adaptor handling

This section demonstrates how to handle adaptors in the NGS data using Cutadapt, a tool for trimming adaptor sequences and other contaminants.

conda install -c bioconda cutadapt

At this stage, we will perform different tasks for Damage-seq and XR-seq. In Damage-seq we want to discard all the reads with adaptors since having the adaptor in Damage-seq reads mean that the read does not contain any damage. In the case of XR-seq, we will only trim the adaptors from the reads and keep the trimmed ones.

cutadapt -j 8 -g GACTGGTTCCAATTGAAAGTGCTCTTCCGATCT --discard-trimmed -o results/hela_ds_cpd_cutadapt.fq samples/hela_ds_cpd.fq > hela_ds_cpd_cutadapt.log

cutadapt -j 8 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTNNNNNNACGATCTCGTATGCCGTCTTCTGCTTG -o results/hela_xr_cpd_cutadapt.fq samples/hela_xr_cpd.fq > hela_xr_cpd_cutadapt.log

Mapping, removing duplicates, quality trimming, and converting to bed

This section covers the steps involved in mapping the preprocessed reads to the reference genome using Bowtie2, converting the mapped reads to BAM format, and extracting BED files.

Initially we will align our reads to the reference genome using the prepared index files. After that we will convert the output sam files to bam.

(bowtie2 --threads 8 --seed 1 --reorder -x ref_genome/Bowtie2/genome_GRCh38 -U results/hela_ds_cpd_cutadapt.fq -S results/hela_ds_cpd_cutadapt.sam) > hela_ds_cpd_cutadapt_align.log 2>&1

samtools view -Sbh -o results/hela_ds_cpd_cutadapt.bam results/hela_ds_cpd_cutadapt.sam

(bowtie2 --threads 8 --seed 1 --reorder -x ref_genome/Bowtie2/genome_GRCh38 -U results/hela_xr_cpd_cutadapt.fq -S results/hela_xr_cpd_cutadapt.sam) > hela_xr_cpd_cutadapt_align.log 2>&1

samtools view -Sbh -o results/hela_xr_cpd_cutadapt.bam results/hela_xr_cpd_cutadapt.sam

In the next part, we will remove the duplicate reads with picard. Let's install it.

conda install -c bioconda picard

To run MarkDplicates command of picard (this command will remove the duplicates for us), you need your files to be ordered by their header. For that purpose, you should sort the files and then use picard.

samtools sort -o results/hela_ds_cpd_cutadapt_sorted.bam results/hela_ds_cpd_cutadapt.bam -@ 8 -T results/

samtools sort -o results/hela_xr_cpd_cutadapt_sorted.bam results/hela_xr_cpd_cutadapt.bam -@ 8 -T results/

(picard MarkDuplicates --REMOVE_DUPLICATES true --INPUT results/hela_ds_cpd_cutadapt_sorted.bam --TMP_DIR results/ --OUTPUT results/hela_ds_cpd_cutadapt_sorted_dedup.bam --METRICS_FILE results/hela_ds_cpd_cutadapt_sorted_dedub.metrics.txt) > hela_ds_cpd_picard.log 2>&1

(picard MarkDuplicates --REMOVE_DUPLICATES true --INPUT results/hela_xr_cpd_cutadapt_sorted.bam --TMP_DIR results/ --OUTPUT results/hela_xr_cpd_cutadapt_sorted_dedup.bam --METRICS_FILE results/hela_xr_cpd_cutadapt_sorted_dedub.metrics.txt) > hela_xr_cpd_picard.log 2>&1

Lastly, we will remove low quality reads (MAPQ score < 20) via samtools and use bedtools to convert our bam files into bed format.

conda install -c bioconda bedtools

samtools index results/hela_ds_cpd_cutadapt_sorted_dedup.bam
samtools view -q 20 -b results/hela_ds_cpd_cutadapt_sorted_dedup.bam |& bedtools bamtobed > results/hela_ds_cpd.bed

samtools index results/hela_xr_cpd_cutadapt_sorted_dedup.bam
samtools view -q 20 -b results/hela_xr_cpd_cutadapt_sorted_dedup.bam |& bedtools bamtobed > results/hela_xr_cpd.bed

Sorting BED files

After obtaining the BED files, the reads in the BED files are sorted according to genomic coordinates.

sort -k1,1 -k2,2n -k3,3n results/hela_ds_cpd.bed > results/hela_ds_cpd_sorted.bed

sort -k1,1 -k2,2n -k3,3n results/hela_xr_cpd.bed > results/hela_xr_cpd_sorted.bed

Filtering for chromosomes

Reads that are aligned to regions other than chromosomes 1-22 and X are filtered and the rest is kept for further analysis.

grep "^chr" results/hela_ds_cpd_sorted.bed | grep -v -e "chrY" -e "chrM" > results/hela_ds_cpd_sorted_chr.bed

grep "^chr" results/hela_xr_cpd_sorted.bed | grep -v -e "chrY" -e "chrM" > results/hela_xr_cpd_sorted_chr.bed

Separating strands

The reads in the BED files are separated according to which strand they were mapped on.

    awk '{if($6=="+"){print}}' results/hela_ds_cpd_sorted_chr.bed > results/hela_ds_cpd_sorted_chr_plus.bed
    awk '{if($6=="-"){print}}' results/hela_ds_cpd_sorted_chr.bed > results/hela_ds_cpd_sorted_chr_minus.bed

    awk '{if($6=="+"){print}}' results/hela_xr_cpd_sorted_chr.bed > results/hela_xr_cpd_sorted_chr_plus.bed
    awk '{if($6=="-"){print}}' results/hela_xr_cpd_sorted_chr.bed > results/hela_xr_cpd_sorted_chr_minus.bed

Centering the damage site in for Damage-seq reads

Due to the experimental protocol of Damage-seq, the damaged dipyrimidines are located at the two bases upstream of the reads. We use bedtools flank and bedtools slop to obtain 10-nucleotide long read locations with the damaged nucleotides at 5th and 6th positions.

    cut -f1,2 ref_genome/GRCh38.fa.fai > ref_genome/sizes.chrom

    bedtools flank -i  results/hela_ds_cpd_sorted_chr_plus.bed -g ref_genome/sizes.chrom -l 6 -r 0 > results/hela_ds_cpd_sorted_chr_plus_flank.bed
    bedtools flank -i results/hela_ds_cpd_sorted_chr_minus.bed -g ref_genome/sizes.chrom -l 0 -r 6 > results/hela_ds_cpd_sorted_chr_minus_flank.bed
    bedtools slop -i results/hela_ds_cpd_sorted_chr_plus_flank.bed -g ref_genome/sizes.chrom -l 0 -r 4 > results/hela_ds_cpd_sorted_chr_plus_10.bed
    bedtools slop -i results/hela_ds_cpd_sorted_chr_minus_flank.bed -g ref_genome/sizes.chrom -l 4 -r 0 > results/hela_ds_cpd_sorted_chr_minus_10.bed
    cat results/hela_ds_cpd_sorted_chr_plus_10.bed results/hela_ds_cpd_sorted_chr_minus_10.bed > results/hela_ds_cpd_sorted_chr_10.bed

Obtaining FASTA files from BED files

To convert BED files to FASTA format:

    bedtools getfasta -fi ref_genome/GRCh38.fa -bed results/hela_ds_cpd_sorted_chr_plus_10.bed -fo results/hela_ds_cpd_sorted_plus_10.fa -s
    bedtools getfasta -fi ref_genome/GRCh38.fa -bed results/hela_ds_cpd_sorted_chr_minus_10.bed -fo results/hela_ds_cpd_sorted_minus_10.fa -s

Filtering Damage-seq data by motif

Damage-seq reads are expected to contain C and T nucleotides at certain position due to the nature of the NER and Damage-seq experimental protocol. Therefore, we use this as an additional filtering step to eliminate the reads that don’t fit this criteria.

    python3 scripts/fa2bedByChoosingReadMotifs.py -i results/hela_ds_cpd_sorted_plus_10.fa -o results/hela_ds_cpd_sorted_ds_dipyrimidines_plus.bed -r '.{4}(c|t|C|T){2}.{4}'
    python3 scripts/fa2bedByChoosingReadMotifs.py -i results/hela_ds_cpd_sorted_minus_10.fa -o results/hela_ds_cpd_sorted_ds_dipyrimidines_minus.bed -r '.{4}(c|t|C|T){2}.{4}'

    cat results/hela_ds_cpd_sorted_ds_dipyrimidines_plus.bed results/hela_ds_cpd_sorted_ds_dipyrimidines_minus.bed > results/hela_ds_cpd_sorted_ds_dipyrimidines.bed

Obtaining read length distribution and read count

In order see if our data contains the reads with lengths expected from the XR-seq or Damage-seq methods, read length distribution of the data should be extracted.

    awk '{print $3-$2}' results/hela_ds_cpd_sorted_chr.bed | sort -k1,1n | uniq -c | sed 's/\s\s*/ /g' | awk '{print $2"\t"$1}' > results/hela_ds_cpd_sorted_chr_ReadLengthDist.txt

    awk '{print $3-$2}' results/hela_xr_cpd_sorted_chr.bed | sort -k1,1n | uniq -c | sed 's/\s\s*/ /g' | awk '{print $2"\t"$1}' > results/hela_xr_cpd_sorted_chr_ReadLengthDist.txt

In addition, we count the reads in the BED files and use this count in the following steps.

    grep -c "^" results/hela_ds_cpd_sorted_ds_dipyrimidines.bed > results/hela_ds_cpd_sorted_ds_dipyrimidines_readCount.txt
    grep -c "^" results/hela_xr_cpd_sorted_chr.bed > results/hela_xr_cpd_sorted_chr_readCount.txt

Assessing nucleotide content

We check the reads for the enrichment of nucleotides and dinucleotides at specific positions.

    bedtools getfasta -fi ref_genome/GRCh38.fa -bed results/hela_ds_cpd_sorted_ds_dipyrimidines.bed -fo results/hela_ds_cpd_sorted_ds_dipyrimidines.fa -s

    bedtools getfasta -fi ref_genome/GRCh38.fa -bed results/hela_xr_cpd_sorted_chr.bed -fo results/hela_xr_cpd_sorted_chr.fa -s

    python3 scripts/fa2kmerAbundanceTable.py -i results/hela_ds_cpd_sorted_ds_dipyrimidines.fa -k 1 -o results/hela_ds_cpd_sorted_ds_dipyrimidines_nucleotideTable.txt

    python3 scripts/fa2kmerAbundanceTable.py -i results/hela_xr_cpd_sorted_chr.fa -k 1 -o results/hela_xr_cpd_sorted_chr_nucleotideTable.txt

Generating BigWig files

BigWig file format includes representation of the distribution reads in each genomic window without respect to plus and minus strands. It is a compact file format that is required for many tools in the further analysis steps.

The first step for generation of BigWig files from BED files is generating BedGraph files. Since the strands of the reads are not taken into account in the BigWig and BedGraph files

    conda install -c mvdbeek ucsc_tools

    bedtools genomecov -i results/hela_ds_cpd_sorted_ds_dipyrimidines_plus.bed -g ref_genome/GRCh38.fa.fai -bg -scale $(cat results/hela_ds_cpd_sorted_ds_dipyrimidines_readCount.txt | awk '{print 1000000/$1}') > results/hela_ds_cpd_sorted_ds_dipyrimidines_plus.bdg
    
    bedtools genomecov -i results/hela_ds_cpd_sorted_ds_dipyrimidines_minus.bed -g ref_genome/GRCh38.fa.fai -bg -scale $(cat results/hela_ds_cpd_sorted_ds_dipyrimidines_readCount.txt | awk '{print 1000000/$1}') > results/hela_ds_cpd_sorted_ds_dipyrimidines_minus.bdg

    bedtools genomecov -i results/hela_xr_cpd_sorted_chr_plus.bed -g ref_genome/GRCh38.fa.fai -bg -scale $(cat results/hela_xr_cpd_sorted_chr_readCount.txt | awk '{print 1000000/$1}') > results/hela_xr_cpd_sorted_chr_plus.bdg        
    
    bedtools genomecov -i results/hela_xr_cpd_sorted_chr_minus.bed -g ref_genome/GRCh38.fa.fai -bg -scale $(cat results/hela_xr_cpd_sorted_chr_readCount.txt | awk '{print 1000000/$1}') > results/hela_xr_cpd_sorted_chr_minus.bdg

Then, from these BedGraph files, we generate BigWig files.

    sort -k1,1 -k2,2n results/hela_ds_cpd_sorted_ds_dipyrimidines_plus.bdg > results/hela_ds_cpd_sorted_ds_dipyrimidines_plus_sorted.bdg
    sort -k1,1 -k2,2n results/hela_ds_cpd_sorted_ds_dipyrimidines_minus.bdg > results/hela_ds_cpd_sorted_ds_dipyrimidines_minus_sorted.bdg
    
    sort -k1,1 -k2,2n results/hela_xr_cpd_sorted_chr_plus.bdg > results/hela_xr_cpd_sorted_chr_plus_sorted.bdg
    sort -k1,1 -k2,2n results/hela_xr_cpd_sorted_chr_minus.bdg > results/hela_xr_cpd_sorted_chr_minus_sorted.bdg
        
    bedGraphToBigWig results/hela_ds_cpd_sorted_ds_dipyrimidines_plus_sorted.bdg ref_genome/GRCh38.fa.fai results/hela_ds_cpd_sorted_ds_dipyrimidines_plus.bw
    bedGraphToBigWig results/hela_ds_cpd_sorted_ds_dipyrimidines_minus_sorted.bdg ref_genome/GRCh38.fa.fai results/hela_ds_cpd_sorted_ds_dipyrimidines_minus.bw

    bedGraphToBigWig results/hela_xr_cpd_sorted_chr_plus_sorted.bdg ref_genome/GRCh38.fa.fai results/hela_xr_cpd_sorted_chr_plus.bw
    bedGraphToBigWig results/hela_xr_cpd_sorted_chr_minus_sorted.bdg ref_genome/GRCh38.fa.fai results/hela_xr_cpd_sorted_chr_minus.bw

Simulating the sample reads

Because CPD and (6-4)PP damage types require certain nucleotides in certain positions, the genomic locations rich in adjacent CC, TC, CT or TT dinucleotides may be prone to receiving more UV damage while other regions that are poor in these dinucleotides receive less damage. Therefore, the sequence contents may bias our analysis results while comparing the damage formation or NER efficiency of two genomic regions. In order to eliminate the effect of sequence content, we create synthetic sequencing data from the real Damage-seq and XR-seq data, which give us the expected damage counts and NER efficiencies, respectively, from the sequence content of the genomic areas of interest. We use Boquila to generate simulated data.

    conda install boquila -c bioconda 

    boquila --fasta results/hela_ds_cpd_sorted_ds_dipyrimidines.fa --bed results/hela_ds_cpd_sorted_ds_dipyrimidines_sim.bed --ref ref_genome/GRCh38.fa --regions ref_genome/GRCh38.ron --kmer 2 --seed 1 --sens 2 > results/hela_ds_cpd_sorted_ds_dipyrimidines_sim.fa

    boquila --fasta results/hela_xr_cpd_sorted_chr.fa --bed results/hela_xr_cpd_sorted_chr_sim.bed --ref ref_genome/GRCh38.fa --regions ref_genome/GRCh38.ron --kmer 2 --seed 1 --sens 2 > results/hela_xr_cpd_sorted_chr_sim.fa
    awk '{if($6=="+"){print}}' results/hela_ds_cpd_sorted_ds_dipyrimidines_sim.bed > results/hela_ds_cpd_sorted_ds_dipyrimidines_sim_plus.bed
    awk '{if($6=="-"){print}}' results/hela_ds_cpd_sorted_ds_dipyrimidines_sim.bed > results/hela_ds_cpd_sorted_ds_dipyrimidines_sim_minus.bed

    awk '{if($6=="+"){print}}' results/hela_xr_cpd_sorted_chr_sim.bed > results/hela_xr_cpd_sorted_chr_sim_plus.bed
    awk '{if($6=="-"){print}}' results/hela_xr_cpd_sorted_chr_sim.bed > results/hela_xr_cpd_sorted_chr_sim_minus.bed

The read counts from the simulated Damage-seq and XR-seq data are then used to normalize our real Damage-seq and XR-seq data to eliminate the sequence content bias.

    grep -c '^' results/hela_xr_cpd_sorted_chr_sim.bed > results/hela_xr_cpd_sorted_chr_sim_readCount.txt

    bedtools genomecov -i results/hela_xr_cpd_sorted_chr_sim_plus.bed -g ref_genome/GRCh38.fa.fai -bg -scale $(cat results/hela_xr_cpd_sorted_chr_sim_readCount.txt | awk '{print 1000000/$1}') > results/hela_xr_cpd_sorted_chr_sim_plus.bdg
    
    bedtools genomecov -i results/hela_xr_cpd_sorted_chr_sim_minus.bed -g ref_genome/GRCh38.fa.fai -bg -scale $(cat results/hela_xr_cpd_sorted_chr_sim_readCount.txt | awk '{print 1000000/$1}') > results/hela_xr_cpd_sorted_chr_sim_minus.bdg

    sort -k1,1 -k2,2n results/hela_xr_cpd_sorted_chr_sim_plus.bdg > results/hela_xr_cpd_sorted_chr_sim_plus_sorted.bdg
    sort -k1,1 -k2,2n results/hela_xr_cpd_sorted_chr_sim_minus.bdg > results/hela_xr_cpd_sorted_chr_sim_minus_sorted.bdg

    bedGraphToBigWig results/hela_xr_cpd_sorted_chr_sim_plus_sorted.bdg ref_genome/GRCh38.fa.fai results/hela_xr_cpd_sorted_chr_sim_plus.bw
    bedGraphToBigWig results/hela_xr_cpd_sorted_chr_sim_minus_sorted.bdg ref_genome/GRCh38.fa.fai results/hela_xr_cpd_sorted_chr_sim_minus.bw

Plotting length distribution, nucleotide enrichment, and bam correlations

Plotting read length distribution

We create a histogram of the read length distribution to clearly understand which read length is mostly found in our data. This information is used as a proof about the success of Damage-seq or XR-seq experiment in capturing the right reads.

    conda install r-rbokeh
    conda install -c conda-forge r-ggplot2
    conda install -c conda-forge r-tidyr```

``` R
    library(ggplot2)
    read_len <- read.table("results/hela_xr_cpd_sorted_chr_ReadLengthDist.txt")
    colnames(read_len) <- c("length", "counts")
    xrLenDistPlot <- ggplot(data = read_len, aes(x = length, y = counts)) +
        geom_bar(stat='identity') +
        xlab("Read length") +
        ylab("Read count")
    ggsave("results/xrLenDistPlot.png")

Plotting nucleotide enrichment of the reads

Since the reads from Damage-seq and XR-seq data are expected to contain C and T nucleotides in certain positions, we plot the nucleotide enrichment in each position in reads. This also is used as a proof of data quality.

    R
    library(ggplot2)
    library(dplyr)
    nucl_content <- read.table("results/hela_xr_cpd_sorted_chr_nucleotideTable.txt", header=T)
    nucl_content_gathered <- gather(nucl_content, nucl, count, -kmer)
    nucl_content_gathered$nucl <- factor(nucl_content_gathered$nucl, levels = c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11", "X12", "X13", "X14", "X15", "X16", "X17", "X18", "X19", "X20","X21", "X22", "X23", "X24", "X25", "X26", "X27"), labels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20","21", "22", "23", "24", "25", "26", "27"))

    nucl_content_gathered$kmer <- factor(nucl_content_gathered$kmer, levels = c("C", "T", "G", "A"))
    xrNuclPlot <- ggplot(data = nucl_content_gathered, aes(x = nucl, y = count, fill = kmer)) +
        geom_bar(position = "fill", stat='identity') +
        xlab("Position") +
        ylab("Count") +
        labs(fill = "") +
        scale_fill_manual(values = c("seagreen3","gray60", "steelblue2", "steelblue4"))
    ggsave("results/xrNuclPlot.png")

Bam correlations

Another way to assess the data quality is to compare samples and replicates. Replicates should be alike in terms of genomic distribution of their reads while, for example, CPD samples should be different than (6-4)PP samples.

    conda install -c bioconda deeptools

    multiBamSummary bins --bamfiles results/hela_ds_cpd_cutadapt_sorted_dedup.bam results/hela_xr_cpd_cutadapt_sorted_dedup.bam --minMappingQuality 20 --outFileName results/bamSummary.npz
    
    plotCorrelation -in results/bamSummary.npz --corMethod spearman --skipZeros --plotTitle "Spearman Correlation of Read Counts" --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o results/bamCorr.png \

Plotting read distribution on genes

deepTools is an easy way to plot read distribution in certain genomic regions. This tool uses the BigWig files to count the reads in genomic windows and requires the regions of interest in BED format. To plot a line graph of the read distribution on genes:

    bigwigCompare --bigwig1 results/hela_xr_cpd_sorted_chr_plus.bw --bigwig2 results/hela_xr_cpd_sorted_chr_sim_plus.bw --operation ratio --outFileFormat bigwig --outFileName hela_xr_cpd_norm_sim_plus.bw

    computeMatrix scale-regions -S results/hela_xr_cpd_norm_sim_plus.bw -R ref_genome/GRCh38_genes.bed --outFileName results/hela_xr_cpd_norm_sim_plus_on_GRCh_genes_1kb_scaleregions_computeMatrix.out -b 1000 -a 1000 --smartLabels

    plotHeatmap --matrixFile results/hela_xr_cpd_norm_sim_plus_on_GRCh_genes_1kb_scaleregions_computeMatrix.out --outFileName hela_xr_cpd_norm_sim_plus_on_GRCh_genes_1kb_heatmap_k2.png --xAxisLabel "Position with respect to genes" --yAxisLabel "Genes" --kmeans 2 --heatmapWidth 10 --startLabel "Start" --endLabel "End"

Running pipeline with snakemake

This section will guide you through running the XR-Seq and Damage-Seq pipeline using Snakemake, a workflow management system. Detailed instructions and code examples can be found in the provided github link: https://github.com/CompGenomeLab/xr-ds-seq-snakemake