/exome_analysis

A pipeline for analysis of exome seq from raw fastq to annotated variants

Primary LanguageHTML

DOI

1. Data

Sample info

Type value
Sample type DNA from FFPE blocks
No. of samples 40
Capture type Whole exome
Sequencing platform Illumina HiSeq
Library type Paired end (150*2)

=======================================================================
Library preparation
SureselectXT Target Enrichment system kit

Adapters used in this study are:
Illumina Universal Adapters
5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’

Index Adapter:
5’-GATCGGAAGAGCACACGTCTGAACTCCAGTCAC [INDEX] ATCTCGTATGCCGTCTTCTGCTTG-3’

=======================================================================

Reference Genome

Reference Genome : hg19
1. Download reference genome data using ftp

    location: ftp.broadinstitute.org/bundle
    username: gsapubftp-anonymous
    password:

2. Access data via google bucket
hg19
https://storage.cloud.google.com/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta

hg38
https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/

3. Web browser
Broad Institute
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/

UCSC Genome
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

Igenome Illumina
http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Homo_sapiens/Ensembl/GRCh37/Homo_sapiens_Ensembl_GRCh37.tar.gz

=======================================================================

2. Softwares


**FASTQC**
Home Page:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Download Location:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.zip

Manual Page:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/

=======================================================================

Trim Galore
Download Location:
https://github.com/FelixKrueger/TrimGalore

Manual Page:
https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md

Cut Adapt
Download Location:
https://github.com/marcelm/cutadapt/
run in the shell pip install cutadapt

Manual Page:
https://cutadapt.readthedocs.io/en/stable/

=======================================================================

GATK
Download Location:
https://github.com/broadinstitute/gatk/releases/download/4.1.4.1/gatk-4.1.4.1.zip

Manual Page:
https://software.broadinstitute.org/gatk/documentation/

=======================================================================

BWA
Download Location:
https://liquidtelecom.dl.sourceforge.net/project/bio-bwa/bwa-0.7.17.tar.bz2

Manual page:
http://bio-bwa.sourceforge.net/
http://bio-bwa.sourceforge.net/bwa.shtml

Bowtie2
Home Page:
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

Download Page:
https://sourceforge.net/projects/bowtie-bio/files/latest/download
https://liquidtelecom.dl.sourceforge.net/project/bowtie-bio/bowtie/1.2.3/bowtie-src-x86_64.zip

=======================================================================

PICARD
Download Location:
https://github.com/broadinstitute/picard/releases/download/2.21.4/picard.jar

Manual Page:
https://broadinstitute.github.io/picard/command-line-overview.html

Home Page:
https://broadinstitute.github.io/picard/

=======================================================================

SAMTools
Download Location:
https://liquidtelecom.dl.sourceforge.net/project/samtools/samtools/1.10/samtools-1.10.tar.bz2

Manual Page:
http://www.htslib.org/doc/samtools.1.html
http://quinlanlab.org/tutorials/samtools/samtools.html

=======================================================================

SAMBAMBA
Download Location:
https://github.com/biod/sambamba
installation
git clone --recursive https://github.com/lomereiter/sambamba.git
cd sambamba
make sambamba-ldmd2-64

Manual Page:

https://lomereiter.github.io/sambamba/docs/sambamba-view.html

=======================================================================

Bedtool
Download Location:
https://github.com/arq5x/bedtools2/releases
installation
wget https://github.com/arq5x/bedtools2/releases/download/v2.29.1/bedtools-2.29.1.tar.gz
tar -zxvf bedtools-2.29.1.tar.gz
cd bedtools2
make

Manual Page:
https://bedtools.readthedocs.io/en/latest/

=======================================================================

SNPeff
Download Location:
https://github.com/pcingola/SnpEff
https://excellmedia.dl.sourceforge.net/project/snpeff/snpEff_latest_core.zip
http://snpeff.sourceforge.net/download.html

Manual Page:
http://snpeff.sourceforge.net/SnpEff_manual.html

Annovar
http://annovar.openbioinformatics.org/en/latest/
VarAFT
https://varaft.eu/PackageForInstall/varaft-2.16.deb

=======================================================================

Pipeline for variant calling

1. Indexing the reference genome


Using bwa
bwa index -a bwtsw hg19.fa


Using samtools
samtools faidx reference.fa


Create sequnce dictionary

java -jar picard.jar CreateSequenceDictionary \
R=hg19.fa \
O=hg19.dict


R : Reference Genome
O : Output dictionary

2. Checking the Quality of sequence

./fastqc sample1_R1.fastq #forward strand
./fastqc sample1_R2.fastq #reverse strand

3. Removing the adaptors using trim_galore

command to run

trim_galore \
-–paired \
-–phred33 \
-–quality 20 \
–a AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT \
–stringency 5 –e 0.1 \
–t \
–r1 35 \
–r2 35 \
sample1_R1.fastq sample1_R2.fastq


parameters

--paired : For paired end reads


-–phred33 : Instructs Cutadapt to use ASCII+33 quality scores as Phred scores (Sanger/Illumina 1.9+ encoding) for quality trimming.
The other option is --phred64 : Instructs Cutadapt to use ASCII+64 quality scores as Phred scores (Illumina 1.5 encoding) for quality trimming.


-q or --quality : Trim low-quality ends from reads in addition to adapter removal. For RRBS samples, quality trimming will be performed first, and adapter trimming is carried in a second round. Other files are quality and adapter trimmed in a single pass. The algorithm is the same as the one used by BWA (Subtract INT from all qualities; compute partial sums from all indices to the end of the sequence; cut sequence at the index at which the sum is minimal).


-a or --adapter
-a2 or adapter2

-a
Adapter sequence to be trimmed. If not specified explicitly, Trim Galore will try to auto-detect whether the Illumina universal, Nextera transposase or Illumina small RNA adapter sequence was used

-a2
Optional adapter sequence to be trimmed off read 2 of paired-end files. This option requires --paired to be specified as well. If the libraries to be trimmed are smallRNA then a2 will be set to the Illumina small RNA 5' adapter automatically (GATCGTCGGACT). A single base may also be given as e.g. -a2 A{10}, to be expanded to -a2 AAAAAAAAAA


-t or --trim1 : Trims 1 bp off every read from its 3' end. This may be needed for FastQ files that are to be aligned as paired-end data with Bowtie 1. This is because Bowtie (1) regards alignments like this as invalid (whenever a start/end coordinate is contained within the other read):


-r1 or --length_1 : Unpaired single-end read length cutoff needed for read 1 to be written to .unpaired_1.fq output file. These reads may be mapped in single-end mode. Default: 35 bp

-r2 or --length_2 : Unpaired single-end read length cutoff needed for read 2 to be written to .unpaired_2.fq output file. These reads may be mapped in single-end mode. Default: 35 bp


-s or --stringency : Overlap with adapter sequence required to trim a sequence. Defaults to a very stringent setting of 1, i.e. even a single base pair of overlapping sequence will be trimmed of the 3' end of any read.

4. Re-checking the Quality of sequence

./fastqc sample1_R1_trimmed.fastq #forward strand
./fastqc sample1_R2_trimmed.fastq #reverse strand

5. Map to Reference

bwa mem \
-M \
-t 8 \
hg19.fa \
sample1_R1_trimmed.fastq sample1_R2_trimmed.fastq > aligned_paired_reads.sam

Converting SAM to BAM
samtools view -Sb aligned_paired_reads.sam > aligned_paired_reads.bam

Both the above steps can be achieved in following single step

bwa mem \
-M \
-t 8 \
hg19.fa \
sample1_R1_trimmed.fastq sample1_R2_trimmed.fastq | samtools view -1 -bS - > aligned_paired_reads.bam


-M : Mark shorter split hits as secondary (for Picard compatibility).
-t : Number of threads
-R : Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM.         The read group ID will be attached to every read in the output. An example is ’@RG\tID:foo\tSM:bar’.
samtools view -1 -bS: to sort and compress your sam file to the bam format

Validate BAM file

java -jar picard.jar ValidateSamFile \
I=aligned_paired_reads.bam \
MODE=SUMMARY

6. MergeBamAlignments

java -jar picard.jar MergeBamAlignment \
ALIGNED=aligned_paired_reads.bam \
UNMAPPED=unmapped.bam \
O=merge_alignments.bam \
R=hg19.fasta


Merge alignment data from a SAM or BAM with data in an unmapped BAM file. This tool produces a new SAM or BAM file that includes all aligned and unaligned reads and also carries forward additional read attributes from the unmapped BAM (attributes that are otherwise lost in the process of alignment). The purpose of this tool is to use information from the unmapped BAM to fix up aligner output. The resulting file will be valid for use by other Picard tools. For simple BAM file merges, use MergeSamFiles. Note that MergeBamAlignment expects to find a sequence dictionary in the same directory as REFERENCE_SEQUENCE and expects it to have the same base name as the reference FASTA except with the extension ".dict". If the output sort order is not coordinate, then reads that are clipped due to adapters or overlapping will not contain the NM, MD, or UQ tags
Link

7. MarkDuplicates

java -jar picard.jar MarkDuplicates \
I=merge_alignments.bam \
O=marked_duplicates.bam \
M=marked_dup_metrics.txt


Link

8. SortSam

java -jar picard.jar SortSam \
I=marked_duplicates.bam \
O=sorted.bam \
SORT_ORDER=coordinate


Link

9. Get stats

Alignment Summary Matrix

java -jar picard.jar CollectAlignmentSummaryMetrics \
REFERENCE=hg19.fa \
INPUT=sorted.bam \
OUTPUT=alignment_summary_matirx.txt

Link


Insert Size Matrix

java -jar picard.jar CollectInsertSizeMetrics \
I=sorted.bam \
O=insert_size_metrics.txt \
H=insert_size_histogram.pdf \
M=0.5

Link


Depth

samtools depth -a sorted_reads.bam > depth_out.txt

Link


10. Base (Quality Score) Recalibration

java -jar gatk.jar BaseRecalibrator \
-I sorted.bam \
-R hg19.fa \
--known-sites sites_of_variation.vcf \
--known-sites another/optional/setOfSitesToMask.vcf \
-O recal_data.table

parameter --known-sites : vcf of ExAc, gnomAD, or dbSNP etc


Link

Apply Recalibration

gatk ApplyBQSR \
-R hg19.fa \
-I sorted.bam \
--bqsr-recal-file recal_data.table \
-O recalibrated.bam


Link

11. AnalyzeCovariates

Link

Generate the first pass recalibration table file

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R hg19.fa \
-I sorted.bam \
-knownSites my-trusted-snps.vcf \

-knownSites my-trusted-indels.vcf \
-o firstpass.table


Generate the second pass recalibration table file

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R hg19.fa \
-I sorted.bam \
-knownSites my-trusted-snps.vcf \
-knownSites my-trusted-indels.vcf \
-BQSR firstpass.table \
-o secondpass.table


Finally generate the plots and also keep a copy of the csv (optional)

java -jar GenomeAnalysisTK.jar \
-T AnalyzeCovariates \
-R hg19.fa \
-before firstpass.table \
-after secondpass.table \
-csv BQSR.csv \
-plots BQSR.pdf


Link

12. Build BAM Index

java -jar picard.jar \
BuildBamIndex INPUT=recalibrated.bam


Link

=======================================================================

RealignerTargetCreator

java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R hg19.fa \
-I recalibrated.bam \
--known indels.vcf \
-o forIndelRealigner.intervals


Link
This is removed in latest GATK
Link

IndelRealigner

java -jar GenomeAnalysisTK.jar \
-T IndelRealigner \
-R hg19.fa -I recalibrated.bam \
-targetIntervals realignment_targets.list \
-o realigned_reads.bam


Link

This is removed in latest GATK
Link

=======================================================================

13. Call Variants (HaplotypeCaller)

Single-sample GVCF calling (outputs intermediate GVCF)

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R hg19.fa \
-I recalibrated.bam \
-O raw_variants.g.vcf.gz \
-ERC GVCF


Single-sample GVCF calling with allele-specific annotations

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R hg19.fa \
-I recalibrated.bam \
-O raw_variants.g.vcf.gz \
-ERC GVCF \
-G Standard \
`-G AS_Standard``


Variant calling with bamout to show realigned reads

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R hg19.fa \
-I recalibrated.bam \
-O raw_variants.vcf.gz \
-bamout bamout.bam


Link
Link2
link3

14. Extract SNPs & Indels

SNP

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg19.fa \
-V raw_variants.vcf \
-selectType SNP \
-O raw_variants_snp.vcf


Indels

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg19.fa \
-V raw_variants.vcf \
-selectType INDEL \
`-O raw_variants_indel.vcf``


Link

15. Filter SNPs & Indels

SNPs

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R hg19.fa \
-V raw_variants_snp.vcf \
--filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' \
--filterName "basic_snp_filter" -o filtered_snps.vcf


Link

QD,FS,MQ Link


Indels

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R hg19.fa \
-V raw_variants_indel.vcf \
--filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' \
--filterName "basic_indel_filter" -o filtered_indels.vcf


16. Annotate using snpEff

java -jar snpEff.jar \
-v snpeff_db filtered_snps.vcf > filtered_snps_final.ann.vcf

17. Coverage statistics

bedtools genomecov -bga -ibam recal_reads.bam > genomecov.bedgraph