Type | value |
---|---|
Sample type | DNA from FFPE blocks |
No. of samples | 40 |
Capture type | Whole exome |
Sequencing platform | Illumina HiSeq |
Library type | Paired end (150*2) |
=======================================================================
Library preparation
SureselectXT Target Enrichment system kit
Adapters used in this study are:
Illumina Universal Adapters
5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’
Index Adapter:
5’-GATCGGAAGAGCACACGTCTGAACTCCAGTCAC [INDEX] ATCTCGTATGCCGTCTTCTGCTTG-3’
=======================================================================
Reference Genome : hg19
1. Download reference genome data using ftp
location: ftp.broadinstitute.org/bundle
username: gsapubftp-anonymous
password:
2. Access data via google bucket
hg19
https://storage.cloud.google.com/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta
hg38
https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
3. Web browser
Broad Institute
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/
UCSC Genome
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
Igenome Illumina
http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Homo_sapiens/Ensembl/GRCh37/Homo_sapiens_Ensembl_GRCh37.tar.gz
=======================================================================
**FASTQC**
Home Page:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Download Location:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.zip
Manual Page:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/
=======================================================================
Trim Galore
Download Location:
https://github.com/FelixKrueger/TrimGalore
Manual Page:
https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md
Cut Adapt
Download Location:
https://github.com/marcelm/cutadapt/
run in the shell pip install cutadapt
Manual Page:
https://cutadapt.readthedocs.io/en/stable/
=======================================================================
GATK
Download Location:
https://github.com/broadinstitute/gatk/releases/download/4.1.4.1/gatk-4.1.4.1.zip
Manual Page:
https://software.broadinstitute.org/gatk/documentation/
=======================================================================
BWA
Download Location:
https://liquidtelecom.dl.sourceforge.net/project/bio-bwa/bwa-0.7.17.tar.bz2
Manual page:
http://bio-bwa.sourceforge.net/
http://bio-bwa.sourceforge.net/bwa.shtml
Bowtie2
Home Page:
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Download Page:
https://sourceforge.net/projects/bowtie-bio/files/latest/download
https://liquidtelecom.dl.sourceforge.net/project/bowtie-bio/bowtie/1.2.3/bowtie-src-x86_64.zip
=======================================================================
PICARD
Download Location:
https://github.com/broadinstitute/picard/releases/download/2.21.4/picard.jar
Manual Page:
https://broadinstitute.github.io/picard/command-line-overview.html
Home Page:
https://broadinstitute.github.io/picard/
=======================================================================
SAMTools
Download Location:
https://liquidtelecom.dl.sourceforge.net/project/samtools/samtools/1.10/samtools-1.10.tar.bz2
Manual Page:
http://www.htslib.org/doc/samtools.1.html
http://quinlanlab.org/tutorials/samtools/samtools.html
=======================================================================
SAMBAMBA
Download Location:
https://github.com/biod/sambamba
installation
git clone --recursive https://github.com/lomereiter/sambamba.git
cd sambamba
make sambamba-ldmd2-64
Manual Page:
https://lomereiter.github.io/sambamba/docs/sambamba-view.html
=======================================================================
Bedtool
Download Location:
https://github.com/arq5x/bedtools2/releases
installation
wget https://github.com/arq5x/bedtools2/releases/download/v2.29.1/bedtools-2.29.1.tar.gz
tar -zxvf bedtools-2.29.1.tar.gz
cd bedtools2
make
Manual Page:
https://bedtools.readthedocs.io/en/latest/
=======================================================================
SNPeff
Download Location:
https://github.com/pcingola/SnpEff
https://excellmedia.dl.sourceforge.net/project/snpeff/snpEff_latest_core.zip
http://snpeff.sourceforge.net/download.html
Manual Page:
http://snpeff.sourceforge.net/SnpEff_manual.html
Annovar
http://annovar.openbioinformatics.org/en/latest/
VarAFT
https://varaft.eu/PackageForInstall/varaft-2.16.deb
=======================================================================
Using bwa
bwa index -a bwtsw hg19.fa
Using samtools
samtools faidx reference.fa
Create sequnce dictionary
java -jar picard.jar CreateSequenceDictionary \
R=hg19.fa \
O=hg19.dict
R
: Reference Genome
O
: Output dictionary
./fastqc sample1_R1.fastq
#forward strand
./fastqc sample1_R2.fastq
#reverse strand
command to run
trim_galore \
-–paired \
-–phred33 \
-–quality 20 \
–a AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT \
–stringency 5 –e 0.1 \
–t \
–r1 35 \
–r2 35 \
sample1_R1.fastq sample1_R2.fastq
parameters
--paired
: For paired end reads
-–phred33
: Instructs Cutadapt to use ASCII+33 quality scores as Phred scores (Sanger/Illumina 1.9+ encoding) for quality trimming.
The other option is --phred64
: Instructs Cutadapt to use ASCII+64 quality scores as Phred scores (Illumina 1.5 encoding) for quality trimming.
-q
or --quality
: Trim low-quality ends from reads in addition to adapter removal. For RRBS samples, quality trimming will be performed first, and adapter trimming is carried in a second round. Other files are quality and adapter trimmed in a single pass. The algorithm is the same as the one used by BWA (Subtract INT from all qualities; compute partial sums from all indices to the end of the sequence; cut sequence at the index at which the sum is minimal).
-a
or --adapter
-a2
or adapter2
-a
Adapter sequence to be trimmed. If not specified explicitly, Trim Galore will try to auto-detect whether the Illumina universal, Nextera transposase or Illumina small RNA adapter sequence was used
-a2
Optional adapter sequence to be trimmed off read 2 of paired-end files. This option requires --paired to be specified as well. If the libraries to be trimmed are smallRNA then a2 will be set to the Illumina small RNA 5' adapter automatically (GATCGTCGGACT
). A single base may also be given as e.g. -a2
A{10}, to be expanded to -a2
AAAAAAAAAA
-t
or --trim1
: Trims 1 bp off every read from its 3' end.
This may be needed for FastQ files that are to be aligned as paired-end data with Bowtie 1. This is because Bowtie (1) regards alignments like this as invalid (whenever a start/end coordinate is contained within the other read):
-r1
or --length_1
: Unpaired single-end read length cutoff needed for read 1 to be written to .unpaired_1.fq output file. These reads may be mapped in single-end mode.
Default: 35 bp
-r2
or --length_2
: Unpaired single-end read length cutoff needed for read 2 to be written to .unpaired_2.fq output file. These reads may be mapped in single-end mode.
Default: 35 bp
-s
or --stringency
: Overlap with adapter sequence required to trim a sequence.
Defaults to a very stringent setting of 1, i.e. even a single base pair of overlapping sequence will be trimmed of the 3' end of any read.
./fastqc sample1_R1_trimmed.fastq
#forward strand
./fastqc sample1_R2_trimmed.fastq
#reverse strand
bwa mem \
-M \
-t 8 \
hg19.fa \
sample1_R1_trimmed.fastq sample1_R2_trimmed.fastq > aligned_paired_reads.sam
Converting SAM to BAM
samtools view -Sb aligned_paired_reads.sam > aligned_paired_reads.bam
Both the above steps can be achieved in following single step
bwa mem \
-M \
-t 8 \
hg19.fa \
sample1_R1_trimmed.fastq sample1_R2_trimmed.fastq | samtools view -1 -bS - > aligned_paired_reads.bam
-M
: Mark shorter split hits as secondary (for Picard compatibility).
-t
: Number of threads
-R
: Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM.
The read group ID will be attached to every read in the output. An example is ’@RG\tID:foo\tSM:bar’.
samtools view -1 -bS
: to sort and compress your sam file to the bam format
java -jar picard.jar ValidateSamFile \
I=aligned_paired_reads.bam \
MODE=SUMMARY
java -jar picard.jar MergeBamAlignment \
ALIGNED=aligned_paired_reads.bam \
UNMAPPED=unmapped.bam \
O=merge_alignments.bam \
R=hg19.fasta
Merge alignment data from a SAM
or BAM
with data in an unmapped BAM
file. This tool produces a new SAM
or BAM
file that includes all aligned and unaligned reads and also carries forward additional read attributes from the unmapped BAM
(attributes that are otherwise lost in the process of alignment). The purpose of this tool is to use information from the unmapped BAM
to fix up aligner output. The resulting file will be valid for use by other Picard tools. For simple BAM file merges, use MergeSamFiles
. Note that MergeBamAlignment
expects to find a sequence dictionary in the same directory as REFERENCE_SEQUENCE
and expects it to have the same base name as the reference FASTA except with the extension ".dict". If the output sort order is not coordinate, then reads that are clipped due to adapters or overlapping will not contain the NM
, MD
, or UQ
tags
Link
java -jar picard.jar MarkDuplicates \
I=merge_alignments.bam \
O=marked_duplicates.bam \
M=marked_dup_metrics.txt
java -jar picard.jar SortSam \
I=marked_duplicates.bam \
O=sorted.bam \
SORT_ORDER=coordinate
java -jar picard.jar CollectAlignmentSummaryMetrics \
REFERENCE=hg19.fa \
INPUT=sorted.bam \
OUTPUT=alignment_summary_matirx.txt
java -jar picard.jar CollectInsertSizeMetrics \
I=sorted.bam \
O=insert_size_metrics.txt \
H=insert_size_histogram.pdf \
M=0.5
samtools depth -a sorted_reads.bam > depth_out.txt
java -jar gatk.jar BaseRecalibrator \
-I sorted.bam \
-R hg19.fa \
--known-sites sites_of_variation.vcf \
--known-sites another/optional/setOfSitesToMask.vcf \
-O recal_data.table
parameter
--known-sites
: vcf of ExAc
, gnomAD
, or dbSNP
etc
gatk ApplyBQSR \
-R hg19.fa \
-I sorted.bam \
--bqsr-recal-file recal_data.table \
-O recalibrated.bam
java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R hg19.fa \
-I sorted.bam \
-knownSites my-trusted-snps.vcf \
-knownSites my-trusted-indels.vcf \
-o firstpass.table
java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R hg19.fa \
-I sorted.bam \
-knownSites my-trusted-snps.vcf \
-knownSites my-trusted-indels.vcf \
-BQSR firstpass.table \
-o secondpass.table
java -jar GenomeAnalysisTK.jar \
-T AnalyzeCovariates \
-R hg19.fa \
-before firstpass.table \
-after secondpass.table \
-csv BQSR.csv \
-plots BQSR.pdf
java -jar picard.jar \
BuildBamIndex INPUT=recalibrated.bam
=======================================================================
java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R hg19.fa \
-I recalibrated.bam \
--known indels.vcf \
-o forIndelRealigner.intervals
Link
This is removed in latest GATK
Link
java -jar GenomeAnalysisTK.jar \
-T IndelRealigner \
-R hg19.fa -I recalibrated.bam \
-targetIntervals realignment_targets.list \
-o realigned_reads.bam
This is removed in latest GATK
Link
=======================================================================
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R hg19.fa \
-I recalibrated.bam \
-O raw_variants.g.vcf.gz \
-ERC GVCF
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R hg19.fa \
-I recalibrated.bam \
-O raw_variants.g.vcf.gz \
-ERC GVCF \
-G Standard \
`-G AS_Standard``
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R hg19.fa \
-I recalibrated.bam \
-O raw_variants.vcf.gz \
-bamout bamout.bam
java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg19.fa \
-V raw_variants.vcf \
-selectType SNP \
-O raw_variants_snp.vcf
java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg19.fa \
-V raw_variants.vcf \
-selectType INDEL \
`-O raw_variants_indel.vcf``
java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R hg19.fa \
-V raw_variants_snp.vcf \
--filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' \
--filterName "basic_snp_filter" -o filtered_snps.vcf
QD
,FS
,MQ
Link
java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R hg19.fa \
-V raw_variants_indel.vcf \
--filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' \
--filterName "basic_indel_filter" -o filtered_indels.vcf
java -jar snpEff.jar \
-v snpeff_db filtered_snps.vcf > filtered_snps_final.ann.vcf
bedtools genomecov -bga -ibam recal_reads.bam > genomecov.bedgraph