Introduction to DNA-Seq processing

By Louie Letourneau, MSc.

Edited and Modified by Mathieu Bourgey, Ph.D and Robert Eveleigh, MSc.

In this workshop, we will present the main steps that are commonly used to process and to analyze sequencing data. We will focus only on whole genome data and provide command lines that allow detecting Single Nucleotide Variants (SNV). This workshop will show you how to launch individual steps of a complete DNA-Seq pipeline

We will be working on a 1000 genome sample, NA12878. You can find the whole raw data on the 1000 genome website

For practical reasons we subsampled the reads from the sample because running the whole dataset would take way too much time and resources.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. This means that you are able to copy, share and modify the work, as long as the result is distributed under the same license.

Cheat file

You can find all the unix command lines of this practical in the file: commands.sh

Environment setup

export PICARD_JAR=/usr/local/bin/picard.jar 
export GATK_JAR=/usr/local/bin/GenomeAnalysisTK.jar
export BVATOOLS_JAR=/usr/local/bin/bvatools-1.6-full.jar 
export TRIMMOMATIC_JAR=/usr/local/bin/trimmomatic-0.35.jar 
export REF=/home/mBourgey/kyoto_workshop_WGS_2015/references/ 

cd $HOME 
rsync -avP /home/reveleigh/cleanCopy/workshop $HOME 
cd $HOME/workshop/ 

ls

Workspace setup

The initial structure of your folders should look like this:

<ROOT>
|-- raw_reads/               # fastqs from the center (down sampled)
    `-- NA12878              # One sample directory
        |-- runERR_1         # Lane directory by run number. Contains the fastqs
        `-- runSRR_1         # Lane directory by run number. Contains the fastqs

Software requirements

These are all already installed, but here are the original links.

Inspect Data

So you've just received an email saying that your data is ready for download from the sequencing center of your choice.

What should you do? solution

Fastq files

Let's first explore the fastq file.

Try these commands

zless -S raw_reads/NA12878/runSRR_1/NA12878.SRR.33.pair1.fastq.gz

Why was it like that ? solution

Now try these commands:

zcat raw_reads/NA12878/runSRR_1/NA12878.SRR.33.pair1.fastq.gz | head -n4
zcat raw_reads/NA12878/runSRR_1/NA12878.SRR.33.pair2.fastq.gz | head -n4

What was special about the output ?

Why was it like that? Solution

You could also just count the reads

zgrep -c "^@SRR" raw_reads/NA12878/runSRR_1/NA12878.SRR.33.pair1.fastq.gz

We should obtain 15546 reads

Why shouldn't you just do ?

zgrep -c "^@" raw_reads/NA12878/runSRR_1/NA12878.SRR.33.pair1.fastq.gz

Solution

####Phred Score In the original implementation of Phred scores, information about the size and shape of peaks in sequencing trace files were used to generate an error probability and converted to a log score.

The same principles are used for base quality made by the base calling method of the NGS sequencer i.e the base quality is the probability of the base to have been called incorrectly

Phred score/Base quality  = - 10 * log10(error probability)

e.g. Base Quality of 10

    10 = -10 * log10(error probability)
    -1 = log10(error probability)
    10^-1 = error probability           *this means 10 to the -1 power
    1 / 10 = error probability

Phred Quality Score	Probability of incorrect base call	Base call accuracy
10	1 in 10	90%
20	1 in 100	99%
30	1 in 1000	99.9%
40	1 in 10,000	99.99%
50	1 in 100,000	99.999%
60	1 in 1,000,000	99.9999%

The base quality score however is respresented as a ASCII code in fastq and other file formats

Typically the phred33 (sanger format) scale is used by adding 33 to the Phred quality score and looking up the corresponding ASCII character

Example

Phred score of 31 
31+33 = 64

In Dec column 64 corresponds to '@' ASCII character
See below

Quality

We can't look at all the reads. Especially when working with whole genome 30x data. You could easily have billions of reads.

Tools like FastQC and BVATools readsqc can be used to plot many metrics from these data sets.

Let's look at the data:

Try commands:

mkdir originalQC/
java -Xmx1G -jar ${BVATOOLS_JAR} readsqc \
  --read1 raw_reads/NA12878/runSRR_1/NA12878.SRR.33.pair1.fastq.gz \
  --read2 raw_reads/NA12878/runSRR_1/NA12878.SRR.33.pair2.fastq.gz \
  --threads 2 --regionName SRR --output originalQC/

java -Xmx1G -jar ${BVATOOLS_JAR} readsqc \
  --read1 raw_reads/NA12878/runERR_1/NA12878.ERR.33.pair1.fastq.gz \
  --read2 raw_reads/NA12878/runERR_1/NA12878.ERR.33.pair2.fastq.gz \
  --threads 2 --regionName ERR --output originalQC/

Copy the images from the originalQC folder to your desktop and open the images.

scp -r <USER>@www.genome.med.kyoto-u.ac.jp:~/workshop/originalQC/ ./

Open the images

What stands out in the graphs ? Solution

All the generated graphics have their uses. Two particularly useful to get an overall picture of sequencing quality.

The quality box plots - Box whisker type plots showing the quality distribution of your data
The nucleotide content graphs - Plots the proportion of each base position in a file

General observations:

Some reads have bad 3' ends.
Some reads have adapter sequences in them.

Trimming

Trimming of the reads is required to remove the low quality bases at the 3' end, include reads of a specific minimum length and trim adapters and other technical sequences

To do that we will use Trimmomatic using these options:

ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read (2:30:15)
TRAILING: Cut bases off the end of a read, if below a threshold quality (20)
MINLEN: Drop the read if it is below a specified length (32)
TOPHRED33: Convert quality scores to Phred-33

note on trimmomatic command

The adapter file is in your work folder.

cat adapters.fa

Why are there 2 different ones ? Solution

Try command:

mkdir -p reads/NA12878/runSRR_1/
mkdir -p reads/NA12878/runERR_1/

java -Xmx2G -cp $TRIMMOMATIC_JAR org.usadellab.trimmomatic.TrimmomaticPE -threads 2 -phred33 \
  raw_reads/NA12878/runERR_1/NA12878.ERR.33.pair1.fastq.gz \
  raw_reads/NA12878/runERR_1/NA12878.ERR.33.pair2.fastq.gz \
  reads/NA12878/runERR_1/NA12878.ERR.t20l32.pair1.fastq.gz \
  reads/NA12878/runERR_1/NA12878.ERR.t20l32.single1.fastq.gz \
  reads/NA12878/runERR_1/NA12878.ERR.t20l32.pair2.fastq.gz \
  reads/NA12878/runERR_1/NA12878.ERR.t20l32.single2.fastq.gz \
  ILLUMINACLIP:adapters.fa:2:30:15 TRAILING:20 MINLEN:32 \
  2> reads/NA12878/runERR_1/NA12878.ERR.trim.out

java -Xmx2G -cp $TRIMMOMATIC_JAR org.usadellab.trimmomatic.TrimmomaticPE -threads 2 -phred33 \
  raw_reads/NA12878/runSRR_1/NA12878.SRR.33.pair1.fastq.gz \
  raw_reads/NA12878/runSRR_1/NA12878.SRR.33.pair2.fastq.gz \
  reads/NA12878/runSRR_1/NA12878.SRR.t20l32.pair1.fastq.gz \
  reads/NA12878/runSRR_1/NA12878.SRR.t20l32.single1.fastq.gz \
  reads/NA12878/runSRR_1/NA12878.SRR.t20l32.pair2.fastq.gz \
  reads/NA12878/runSRR_1/NA12878.SRR.t20l32.single2.fastq.gz \
  ILLUMINACLIP:adapters.fa:2:30:15 TRAILING:20 MINLEN:32 \
  2> reads/NA12878/runSRR_1/NA12878.SRR.trim.out

cat reads/NA12878/runERR_1/NA12878.ERR.trim.out reads/NA12878/runSRR_1/NA12878.SRR.trim.out

What does Trimmomatic says it did ? Solution

Let's look at the graphs now

Try command:

mkdir postTrimQC/
java -Xmx1G -jar ${BVATOOLS_JAR} readsqc \
  --read1 reads/NA12878/runERR_1/NA12878.ERR.t20l32.pair1.fastq.gz \
  --read2 reads/NA12878/runERR_1/NA12878.ERR.t20l32.pair2.fastq.gz \
  --threads 2 --regionName ERR --output postTrimQC/
java -Xmx1G -jar ${BVATOOLS_JAR} readsqc \
  --read1 reads/NA12878/runSRR_1/NA12878.SRR.t20l32.pair1.fastq.gz \
  --read2 reads/NA12878/runSRR_1/NA12878.SRR.t20l32.pair2.fastq.gz \
  --threads 2 --regionName SRR --output postTrimQC/

How does it look now ? Solution

Alignment

The raw reads are now cleaned up of artefacts we can align each lane separately.

Why should this be done separatly? Solution

Why is it important to set Read Group information ? Solution

##Alignment with bwa-mem

Try command:

mkdir -p alignment/NA12878/runERR_1
mkdir -p alignment/NA12878/runSRR_1

bwa mem -M -t 2 \
  -R '@RG\tID:ERR_ERR_1\tSM:NA12878\tLB:ERR\tPU:runERR_1\tCN:Broad Institute\tPL:ILLUMINA' \
  ${REF}/b37.fasta \
  reads/NA12878/runERR_1/NA12878.ERR.t20l32.pair1.fastq.gz \
  reads/NA12878/runERR_1/NA12878.ERR.t20l32.pair2.fastq.gz \
  | java -Xmx2G -jar ${PICARD_JAR} SortSam \
  INPUT=/dev/stdin \
  OUTPUT=alignment/NA12878/runERR_1/NA12878.ERR.sorted.bam \
  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT SORT_ORDER=coordinate MAX_RECORDS_IN_RAM=500000

bwa mem -M -t 2 \
  -R '@RG\tID:SRR_SRR_1\tSM:NA12878\tLB:SRR\tPU:runSRR_1\tCN:Broad Institute\tPL:ILLUMINA' \
  ${REF}/b37.fasta \
  reads/NA12878/runSRR_1/NA12878.SRR.t20l32.pair1.fastq.gz \
  reads/NA12878/runSRR_1/NA12878.SRR.t20l32.pair2.fastq.gz \
  | java -Xmx2G -jar ${PICARD_JAR} SortSam \
  INPUT=/dev/stdin \
  OUTPUT=alignment/NA12878/runSRR_1/NA12878.SRR.sorted.bam \
  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT SORT_ORDER=coordinate MAX_RECORDS_IN_RAM=500000

Why did we pipe the output of one to the other ? Solution

Could we have done it differently ? Solution

We will explore the generated BAM format in the next section.

Lane merging

The results lane-specific bam will now to merged into one bam file for further processing.

Since we tagged each read in the BAM by read group, upon merging we can still identify the origin of each read.

Try command:

java -Xmx2G -jar ${PICARD_JAR} MergeSamFiles \
  INPUT=alignment/NA12878/runERR_1/NA12878.ERR.sorted.bam \
  INPUT=alignment/NA12878/runSRR_1/NA12878.SRR.sorted.bam \
  OUTPUT=alignment/NA12878/NA12878.sorted.bam \
  VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

You should now have one BAM containing all your data.

Let's double check

ls -l alignment/NA12878/
samtools view -H alignment/NA12878/NA12878.sorted.bam | grep "^@RG"

You should have your 2 read group entries.

Why did we use the -H switch? Solution

Try without. What happens? Solution

lane merging note

Post-processing BAMs to remove technical errors

We began by running quality control tools on the raw reads. Additional steps are also required to fix biases introduced at various point of the data processing procedures.

##Indel Realignment Due to the heuristics of the alignment algorithm (i.e. each read pair is optimized independent of the full alignment context), mismatches can be introduced near regions of insertions and deletions (indels).

The Genome Analysis ToolKit (GATK) has a tool called IndelRealigner which performs two steps

Realigner Target Creator : Identifies small suspicious regions requiring realignment
Indel Realigner : Uses the full alignment context to determine if a indel exist

Try command:

java -Xmx2G  -jar ${GATK_JAR} \
  -T RealignerTargetCreator \
  -R ${REF}/b37.fasta \
  -o alignment/NA12878/realign.intervals \
  -I alignment/NA12878/NA12878.sorted.bam \
  -L 1

java -Xmx2G -jar ${GATK_JAR} \
  -T IndelRealigner \
  -R ${REF}/b37.fasta \
  -targetIntervals alignment/NA12878/realign.intervals \
  -o alignment/NA12878/NA12878.realigned.sorted.bam \
  -I alignment/NA12878/NA12878.sorted.bam

How could we make this go faster? Solution

How many regions were identified for realignment? Solution

FixMates

This step ensures that all mate-pair information is in rync between each read and its mate pair.

NOTE: This step is deprecated now since indel realignment performs this step. We shall skip this due to time constrains

Mark duplicates

This procedure removes/marks duplicate reads in your bam file. Removing/marking duplicates from an analyses can prevent skewed allele frequencies in SNP discovery, false expression profiles in RNA-seq experiments, etc.

So what are duplicate reads? Solution

What causes these type of reads? Solution

How are duplicated reads detected? Solution

Here we will use Picard approach:

Try these command:

java -Xmx2G -jar ${PICARD_JAR} MarkDuplicates \
  REMOVE_DUPLICATES=false CREATE_MD5_FILE=true VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true \
  INPUT=alignment/NA12878/NA12878.realigned.sorted.bam \
  OUTPUT=alignment/NA12878/NA12878.sorted.dup.bam \
  METRICS_FILE=alignment/NA12878/NA12878.sorted.dup.metrics

We can look in the metrics output to see what happened.

less alignment/NA12878/NA12878.sorted.dup.metrics

How many duplicates were there? Solution

We can see that it computed separate measures for each library.

Why is this important to do not combine everything? Solution

Note on Duplicate rate

Base Quality recalibration (BQSR)

GATK BaseRecalibrator: Accurately recalibrates base quality scores by correcting for various sources of systematic technical error e.g. sequence context, mismatch position within read, etc.

Why do we need to recalibrate base quality scores ? Solution

java -Xmx2G -jar ${GATK_JAR} \
  -T BaseRecalibrator \
  -nct 2 \
  -R ${REF}/b37.fasta \
  -knownSites ${REF}/dbSnp-137.vcf.gz \
  -L 1:47000000-47171000 \
  -o alignment/NA12878/NA12878.sorted.dup.recalibration_report.grp \
  -I alignment/NA12878/NA12878.sorted.dup.bam

java -Xmx2G -jar ${GATK_JAR} \
  -T PrintReads \
  -nct 2 \
  -R ${REF}/b37.fasta \
  -BQSR alignment/NA12878/NA12878.sorted.dup.recalibration_report.grp \
  -o alignment/NA12878/NA12878.sorted.dup.recal.bam \
  -I alignment/NA12878/NA12878.sorted.dup.bam

Extract BAM metrics

Now all refinement steps of the bam are complete. Now it is best to perform a quality check of the data.

Compute coverage For Exome data the capture kit regions are used to verify how well your targets were captured

Insert Size Used to validate the quality of the library preparaion

Alignment metrics Verifies the concordance between your sample and the reference genome

Compute coverage

Both GATK and BVATools have depth of coverage tools.

GATK DepthOfCoverage

Try command:

java  -Xmx2G -jar ${GATK_JAR} \
  -T DepthOfCoverage \
  --omitDepthOutputAtEachBase \
  --summaryCoverageThreshold 10 \
  --summaryCoverageThreshold 25 \
  --summaryCoverageThreshold 50 \
  --summaryCoverageThreshold 100 \
  --start 1 --stop 500 --nBins 499 -dt NONE \
  -R ${REF}/b37.fasta \
  -o alignment/NA12878/NA12878.sorted.dup.recal.coverage \
  -I alignment/NA12878/NA12878.sorted.dup.recal.bam \
  -L 1:47000000-47171000

note on DepthOfCoverage command

Coverage is the expected ~30x

Look at the coverage:

less -S alignment/NA12878/NA12878.sorted.dup.recal.coverage.sample_interval_summary

Is the coverage fit with the expectation ? solution

Insert Size

It corresponds to the size of DNA fragments sequenced.

Different from the gap size (= distance between reads) !

These metrics are computed using Picard CollectInsertSizeMetrics:

Try command:

java -Xmx2G -jar ${PICARD_JAR} CollectInsertSizeMetrics \
  VALIDATION_STRINGENCY=SILENT \
  REFERENCE_SEQUENCE=${REF}/b37.fasta \
  INPUT=alignment/NA12878/NA12878.sorted.dup.recal.bam \
  OUTPUT=alignment/NA12878/NA12878.sorted.dup.recal.metric.insertSize.tsv \
  HISTOGRAM_FILE=alignment/NA12878/NA12878.sorted.dup.recal.metric.insertSize.histo.pdf \
  METRIC_ACCUMULATION_LEVEL=LIBRARY

Look at the output:

less -S alignment/NA12878/NA12878.sorted.dup.recal.metric.insertSize.tsv

There is something interesting going on with our library ERR.

Can you tell what it is? Solution

Alignment metrics

For the alignment metrics, samtools flagstat is very fast but with bwa-mem since some reads get broken into pieces, the numbers are a bit confusing.

We prefer the Picard way of computing metrics:

java -Xmx2G -jar ${PICARD_JAR} CollectAlignmentSummaryMetrics \
  VALIDATION_STRINGENCY=SILENT \
  REFERENCE_SEQUENCE=${REF}/b37.fasta \
  INPUT=alignment/NA12878/NA12878.sorted.dup.recal.bam \
  OUTPUT=alignment/NA12878/NA12878.sorted.dup.recal.metric.alignment.tsv \
  METRIC_ACCUMULATION_LEVEL=LIBRARY

explore the results

less -S alignment/NA12878/NA12878.sorted.dup.recal.metric.alignment.tsv

Do you think the sample and the reference genome fit together ? Solution

Variant calling

Most of SNV caller use either a Baysian, a threshold or a t-test approach to do the calling

I won't go into the details of finding which variant is good or bad since this will be your next workshop.

Here we will just call and view the variants using the samtools mpileup function:

mkdir variants
samtools mpileup -L 1000 -B -q 1 -g \
	-f ${REF}/b37.fasta \
	-r 1:47000000-47171000 \
	alignment/NA12878/NA12878.sorted.dup.recal.bam | bcftools  view -vcg -  \
	> variants/mpileup.vcf

note on samtools mpileup and bcftools command

Now we have variants from all three methods. Let's compress and index the vcfs for future visualisation.

bgzip -c variants/mpileup.vcf > variants/mpileup.vcf.gz

tabix -p vcf variants/mpileup.vcf.gz

Let's look at the compressed vcf.

zless -S variants/mpileup.vcf.gz

Details on the spec can be found here: http://vcftools.sourceforge.net/specs.html

Fields vary from caller to caller.

Some values are are almost always there:

The ref vs alt alleles,
variant quality (QUAL column)
The per-sample genotype (GT) values.

note on the vcf format fields

#Add-on Additional exercice to play with sam/bam files

Aknowledgments

This tutorial is an adaptation of the one created by Louis letourneau here. I would like to thank and acknowledge Louis for this help and for sharing his material. the format of the tutorial has been inspired from Mar Gonzalez Porta of Embl-EBI. I also want to acknowledge Joel Fillon, Louis Letrouneau (again), Francois Lefebvre, Maxime Caron and Guillaume Bourque for the help in building these pipelines and working with all the various datasets.

robrollback/Kyoto_DNAseq_workshop

Introduction to DNA-Seq processing

Cheat file

Environment setup

Workspace setup

Software requirements

Inspect Data

Fastq files

Quality

Trimming

Alignment

Lane merging

Post-processing BAMs to remove technical errors

FixMates

Mark duplicates

Base Quality recalibration (BQSR)

Extract BAM metrics

Compute coverage

Insert Size

Alignment metrics

Variant calling

Aknowledgments