Author: Rostam Abdollahi-Arpanahi
Date: May 7th, 2020
Make sure all the following tools are already installed on your machine.
module load fastqc
module load bwa
module load samtools
module load picard
module load gatk
Basically the NGS company or databases provide the data in fasta or SRA format. You must be able to download the NGS to your own computer. The most common command for downloading the data in unix is:
wget "URL address"
Combine multiple *.gz files without un-compressing (Here we had to lanes of reads per each sample)
- Forward strand
zcat read1_1.fastq.gz read1_2.fastq.gz | gzip -c > forward.fastq.gz
zcat forward.fastq.gz | md5sum
- Reverse strand
zcat read2_1.fastq.gz read2_2.fastq.gz | gzip -c > reverse.fastq.gz
zcat reverse.fastq.gz | md5sum
- check the quality control of NGS data
fastqc -t 4 forward.fastq.gz reverse.fastq.gz
- If the data quality is not good enough, use a program such as trimmometric or cutadapt or trim_galore (we skipped this step in our analysis because of great data quality)
java -jar trimmomatic-0.36.jar PE LLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:6:20 MINLEN:140 CROP:140 -o output_files file1.fatq.gz file2.fastq.gz file3.fastq.gz file4.fastq.gz
- After quality control, the next step is to align the reads to a reference sequence
- It is available for download at NCBI through the following link: ARS-UCD1.2 refernce genome
gunzip ARS-UCD.1.2.fa.gz
mv ARS-UCD.1.2.fa bos.fa
bwa index -a bwtsw bos.fa
picard CreateSequenceDictionary -R bos.fa
samtools faidx bos.fa
bwa aln -t 4 bos.fa forward.fastq.gz > forward_read.sai
bwa aln -t 4 bos.fa reverse.fastq.gz > reverse_read.sai
bwa mem bos.fa forward_read.sai reverse_read.sai forward.fastq.gz reverse.fastq.gz > aln_read.sam
samtools view -bS -t bos.fa aln_read.sam -o aln_read.bam
samtools sort aln_read.bam -o sorted_reads.bam
samtools flagstat sorted_reads.bam > coverage_mapping
picard MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=output.Metrics VALIDATION_STRINGENCY=LENIENT MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=50
picard FixMateInformation I=dedup_reads.bam O=nmsrt_fixmate_sorted_reads.bam SO=queryname VALIDATION_STRINGENCY=LENIENT TMP_DIR=./tmp/temp
picard SortSam I=dedup_reads.bam O=picard_sorted.bam SORT_ORDER=coordinate TMP_DIR=./tmp/temp VALIDATION_STRINGENCY=LENIENT
picard AddOrReplaceReadGroups I=picard_sorted.bam O=sorted_reads_RG_Edited.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=chr8 VALIDATION_STRINGENCY=LENIENT
mv sorted_reads_RG_Edited.bam picard_sorted_RG.bam
picard BuildBamIndex I=picard_sorted_RG.bam VALIDATION_STRINGENCY=LENIENT
java -jar picard.jar ReorderSam I=picard_sorted_RG.bam O=reorder.bam SD=bos.fa CREATE_INDEX=TRUE TMP_DIR=./tmp/temp
mv reorder.bam picard_sorted_RG.bam
4.9 - Download annotation file from database (Ensembl or 1000 Bull Genomes Project)
wget http://www.1000bullgenomes.com/doco/ARS1.2PlusY_BQSR_v3.vcf.gz
mv ../sorted_bos.vcf.idx ../bos.vcf.idx
java -jar GenomeAnalysisTK.jar -R bos.fa -T RealignerTargetCreator -o indels_Realigner.intervals -I picard_sorted_RG.bam -known bos.vcf
java -jar GenomeAnalysisTK.jar -R bos.fa -T IndelRealigner -targetIntervals indels_Realigner.intervals -known bos.vcf -I picard_sorted_RG.bam -o sorted_reads_indel.bam
java -Xmx16g -jar GenomeAnalysisTK.jar -R bos.fa -T BaseRecalibrator -I sorted_reads_indel.bam -knownSites bos.vcf -o sorted_reads_indel_grp
java -Xmx16g -jar GenomeAnalysisTK.jar -R bos.fa -T PrintReads -BQSR sorted_reads_indel_grp -I picard_sorted_RG.bam -o sorted_reads_indel_recal.bam
5.1 - Variant Calling with gvcf format (we assume we have two different *.bam files, you may have several bam files)
java -Xmx16g -jar GenomeAnalysisTK.jar -R bos.fa -T HaplotypeCaller -I sorted_reads_indel_recal.bam -I sorted_reads_ctrl_indel_recal.bam --dbsnp bos.vcf -stand_call_conf 20 -o tothap.raw.snps.indels.g.vcf
java -jar GenomeAnalysisTK.jar -R bos.fa -T SelectVariants -selectType SNP -V tothap.raw.snps.indels.g.vcf -o SNP.vcf
java -jar GenomeAnalysisTK.jar -R bos.fa -T SelectVariants -selectType INDEL -V tothap.raw.snps.indels.g.vcf -o Indel.vcf
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R bos.fa --variant SNP.vcf --filterName "snpsfilter" --filterExpression "QD<2.0||MQ<40.0||FS>60.0" --out snps.tagged.vcf
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R bos.fa --variant Indel.vcf --filterName "my_indel_filter" --filterExpression "QD<2.0||FS>200.0||ReadPosRankSum < -20.0" --out indels.tagged.vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R bos.fa --variant snps.tagged.vcf -select 'vc.isNotFiltered()' -o snps.filtered.vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R bos.fa --variant indels.tagged.vcf -select 'vc.isNotFiltered()' -o indels.filtered.vcf
Please send your comments and suggestions to rostam7474 at gmail dot com