NGS Workflow

Links to download the data:

1. Download SRA file using SRA-toolkit the command line:

prefetch SRR24927859

2. Split SRA file into R1 and R2 fastq files

fastq-dump --split-files SRR24927859.sra

3. Get the count of reads from fastq files

grep -c "^@" SRR24927859_R1.fastq > SRR24927859_total_reads_R1.txt
grep -c "^@" SRR24927859_R2.fastq > SRR24927859_total_reads_R2.txt

4. Perform basic QC checks on both files

mkdir qc_results
fastqc SRR24927859_R1.fastq SRR24927859_R2.fastq --outdir=qc_results

5. Download human genome fasta file from UCSC

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz

6. Index the reference file

bwa index hg38.fa.gz

7. Align both sets of fastq files onto the genome. Use BWA for alignment.

bwa mem -t 8 hg38.fa SRR24927859_R1.fastq SRR24927859_R2.fastq | samtools view -h -bS - > SRR24927859_aligned_reads.bam

8. Calculate alignment statistics and store them in a file.

samtools sort SRR24927859_aligned_reads.bam -o SRR24927859_aligned_reads_sorted.bam
samtools flagstat SRR24927859_aligned_reads_sorted.bam > SRR24927859_alignment_statistics.txt

9. Perform mark duplicates and store Mark duplicate statistics.

picard MarkDuplicates INPUT=SRR24927859_aligned_reads_sorted.bam OUTPUT=SRR24927859_marked_duplicates.bam METRICS_FILE=mark_duplicate_stats.txt

10. Variant calling.

bcftools mpileup -Ou -f hg38.fa.gz SRR24927859_marked_duplicates.bam | bcftools call -mv -Ob -o SRR24927859.bcf
bcftools view SRR24927859.bcf | vcfutils.pl varFilter - > SRR24927859.vcf

11. Separate out SNPs and Indels into individual vcf files.

#For SNPs
samtools mpileup --skip-indels -d 250 -m 1 -E --BCF --output-tags DP,DV,DP4,SP -f hg38.fa.gz -o SRR24927859_snps.bcf SRR24927859_marked_duplicates.bam
bcftools index SRR24927859_snps.bcf ${file_name}_indexed_snps.bcf
bcftools call --skip-variants indels --multiallelic-caller --variants-only -O v SRR24927859_snps.bcf -o SRR24927859_snps.vcf

#For INDELs
samtools mpileup -d 250 -m 1 -E --BCF --output-tags DP,DV,DP4,SP -f hg38.fa.gz -o SRR24927859_indels.bcf SRR24927859_marked_duplicates.bam
bcftools index SRR24927859_indels.bcf SRR24927859_indexed_indels.bcf
bcftools call --skip-variants snps --multiallelic-caller --variants-only  -O v SRR24927859_indels.bcf -o SRR24927859_indels.vcf