DNA-seq:
- Create an environment for DNA: install trimmomatic, bwa, samtools and snakemake
- Use the trimming script:
$ vim sk_trim # to revise the script
$ snakemake -j 8 -s sk_trim
- Quality assessment
- Alignment against GRCh38
- Duplicate removal
- Variant calling (germline or somatic mode)
- Variant annotation
cd /tempwork173/qiyoyou/
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source /home/qiyoyou/.bashrc
conda create -n py3.7 python=3.7
conda activate py3.7
conda deactivate
conda create -n DNA python=3.7
conda activate DNA
conda install -c bioconda fastqc
conda install -c bioconda multiqc
conda install -c bioconda trimmomatic
conda install -c bioconda bwa
conda install -c bioconda samtools
conda install -c bioconda snakemake
conda install -c bioconda htslib ## for bgzip and tabix
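Before moving on, it may help to confirm that the tools installed above are visible on PATH inside the DNA environment. The following is a minimal sketch; the helper name 'check_tools' is hypothetical, and it assumes each executable is named like its package (with htslib providing bgzip and tabix).

```shell
# check_tools: print every named executable that is NOT found on PATH.
# Returns non-zero if at least one is missing. (Hypothetical helper.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, inside the activated DNA environment:
# check_tools fastqc multiqc trimmomatic bwa samtools snakemake bgzip tabix
```

If anything is reported as missing, check that 'conda activate DNA' was run in the current shell.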
2.1 Make sure the environment is activated, since 'fastqc' was installed inside it, then change directory to the folder containing the raw .fastq data.
conda activate DNA
cd /tempwork173/qiyoyou/Liu_WES_201903/RawData/
2.2 Quality check with 'fastqc'. Run fastqc on every file matching '*.gz' (this can take a long time when the sequence data are large).
fastqc *.gz
2.3 Quality check with 'multiqc'. Run multiqc on the fastqc output in this folder (./). This step combines the separate per-file reports into a single report (multiqc_report.html).
multiqc ./
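Because fastqc is slow on large files, it can be useful to see which inputs still lack a report before re-running it. This is a sketch only: the helper name 'pending_fastqc' is hypothetical, and it assumes FastQC's default output naming (sample.fastq.gz gives sample_fastqc.html in the same folder).

```shell
# pending_fastqc: list the *.gz files in a folder that have no FastQC report yet.
# Assumes default naming: sample.fastq.gz -> sample_fastqc.html
pending_fastqc() {
    dir=${1:-.}
    for fq in "$dir"/*.gz; do
        [ -e "$fq" ] || continue        # glob matched nothing
        base=$(basename "$fq" .gz)      # strip .gz
        base=${base%.fastq}             # strip .fastq if present
        base=${base%.fq}                # strip .fq if present
        [ -e "$dir/${base}_fastqc.html" ] || echo "$fq"
    done
}

# Usage: pending_fastqc /tempwork173/qiyoyou/Liu_WES_201903/RawData/
```

The listed files can then be passed to fastqc by hand, instead of repeating 'fastqc *.gz' on everything.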
[link to the search page of hg38]
Create a folder for storing the index and cd into it.
cd /tempwork173/qiyoyou/db/
mkdir grch38_bwa
cd grch38_bwa/
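Download the hg38 .fasta from the link and save it in this folder. The URL is quoted and the output name is set with -O so that the saved file is called Homo_sapiens_assembly38.fasta.
wget -O Homo_sapiens_assembly38.fasta "https://storage.cloud.google.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta"
Build the bwa index.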
bwa index Homo_sapiens_assembly38.fasta
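Indexing a whole human genome takes a long time, and an interrupted run leaves an incomplete index. The check below is a small sketch that relies on 'bwa index' writing five files next to the FASTA (.amb, .ann, .bwt, .pac, .sa); the helper name 'bwa_index_missing' is hypothetical.

```shell
# bwa_index_missing: print the index files expected from 'bwa index'
# that are absent or empty. Prints nothing when the index looks complete.
bwa_index_missing() {
    fasta=$1
    for ext in amb ann bwt pac sa; do
        [ -s "$fasta.$ext" ] || echo "$fasta.$ext"
    done
}

# Usage: bwa_index_missing Homo_sapiens_assembly38.fasta
```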
Before running, edit the script with 'vim' so that the file names and paths match your own data.
cd /tempwork173/qiyoyou/Liu_WES_201903/scripts/
vim sk_align
snakemake -j 8 -s sk_align ## use at most 8 cores in parallel
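The contents of sk_align are not shown here, so the following is only a guess at the kind of command such an alignment rule typically wraps: bwa mem piped into samtools sort. The helper name 'align_cmd', the _R1/_R2 file suffixes and the output name are all assumptions. The function prints the command and does not run it.

```shell
# align_cmd: print (not run) a bwa mem | samtools sort pipeline for one sample.
# File naming (_R1/_R2, .sorted.bam) is an assumption; adapt it to your data.
align_cmd() {
    sample=$1
    ref=$2
    threads=${3:-8}
    printf 'bwa mem -t %s %s %s_R1.fastq.gz %s_R2.fastq.gz | samtools sort -o %s.sorted.bam -\n' \
        "$threads" "$ref" "$sample" "$sample" "$sample"
}

align_cmd sample_A /tempwork173/qiyoyou/db/grch38_bwa/Homo_sapiens_assembly38.fasta
```

In the same spirit, 'snakemake -n -s sk_align' performs a dry run that lists the jobs without executing them, which is a cheap way to check the edited paths.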
The following is a fairly lenient set of values for the variant-calling parameters: --min-coverage 20 --min-var-freq 0.2 --min-reads2 4 --min-avg-qual 20 --p-value 0.05
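To make the thresholds concrete, the sketch below applies four of them to a made-up table of sites. It is a simplified illustration, not the caller's actual logic: the input layout (site, depth, variant-supporting reads, average quality) is hypothetical, the quality is treated as a per-site average, and the p-value test is left out. The helper name 'passes_thresholds' is also hypothetical.

```shell
# passes_thresholds: read tab-separated rows (site, depth, variant_reads, avg_qual)
# and print the sites that meet all four cut-offs. Column layout is hypothetical.
passes_thresholds() {
    awk -F'\t' -v min_cov=20 -v min_freq=0.2 -v min_reads2=4 -v min_qual=20 '
        {
            depth = $2 + 0; var_reads = $3 + 0; qual = $4 + 0
            freq = (depth > 0) ? var_reads / depth : 0
            if (depth >= min_cov && var_reads >= min_reads2 &&
                freq >= min_freq && qual >= min_qual)
                print $1
        }'
}

# Four made-up sites: only the first one meets every threshold.
printf 'chr1:100\t50\t12\t30\nchr1:200\t15\t10\t35\nchr2:300\t40\t3\t30\nchr2:400\t100\t10\t30\n' \
    | passes_thresholds
```

Here chr1:200 fails on depth (15 < 20), chr2:300 on supporting reads (3 < 4), and chr2:400 on variant frequency (10/100 = 0.1 < 0.2).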
cd /tempwork173/qiyoyou/Liu_WES_201903/calling/
parallel bgzip {} ::: *.vcf
parallel tabix -p vcf {} ::: *.vcf.gz
# or one file at a time, for each VCF file:
bgzip Variants_sample_A.raw.vcf
tabix -p vcf Variants_sample_A.raw.vcf.gz
cd /tempwork173/qiyoyou/
mkdir tools
cd tools/
tar -xvf vcftools_0.1.13.tar.gz ## the tarball must already be downloaded into tools/
Merge the separate .vcf.gz files into one with vcf-merge from VCFtools. After this step, 'Variants_all_samples.vcf' can be uploaded to wANNOVAR for functional annotation.
cd /tempwork173/qiyoyou/tools/vcftools_0.1.13/perl/
perl vcf-merge /tempwork173/qiyoyou/Liu_WES_201903/calling/temp/*.vcf.gz >| /tempwork173/qiyoyou/Liu_WES_201903/calling/temp/Variants_all_samples.vcf
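Remove the meta-information header lines (those starting with '##') from the merged file; the '#CHROM' column header line is kept.
cd /tempwork173/qiyoyou/Liu_WES_201903/calling/temp/
awk '!/^##/' Variants_all_samples.vcf > Variants_all_samples_no_header.vcf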
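The effect of this filter can be seen on a tiny made-up VCF. The sketch anchors the pattern at the start of the line ('^##'), so only meta-information lines are dropped and the single-'#' column header survives. The helper name 'strip_meta' and the sample records are hypothetical.

```shell
# strip_meta: drop VCF meta-information lines (those starting with '##').
# Reads the named files, or standard input when no file is given.
strip_meta() {
    awk '!/^##/' "$@"
}

# Made-up input: two '##' lines, the '#CHROM' header, and one record.
printf '##fileformat=VCFv4.1\n##source=demo\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t12345\t.\tA\tG\n' \
    | strip_meta
```

Only the '#CHROM' line and the chr1 record are printed.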
[How to Download hg38/GRCh38 FASTA Human Reference Genome]
(Analysis workflow)
- Lee CY, Chiu YC, Wang LB, et al: Common applications of next-generation sequencing technologies in genomic research. Translational Cancer Research 2013;2:33-45. http://tcr.amegroups.com/article/view/962/html