Exome-seq

A pipeline for WES (whole exome sequencing) analysis, from data download to variant analysis.
Tools preparation

Software available via conda:

conda install sra-tools fastqc multiqc fastx_toolkit trim-galore bwa samtools snpeff bcftools qualimap

Software not available via conda, such as GATK:

# GATK
wget -c -t 0 https://github.com/broadinstitute/gatk/releases/download/4.2.2.0/gatk-4.2.2.0.zip
unzip gatk-4.2.2.0.zip
cd gatk-4.2.2.0
./gatk
echo 'export PATH="/home/students/y.dong/tool/gatk-4.2.2.0/:$PATH"' >> ~/.bashrc   # adjust the path to your own install location
source ~/.bashrc

Steps 1-9 (from NCBI SRA files to VCF files, illustrated for a Mendelian single-gene inherited disease):

$ bash WES_pipeline.sh
# If the raw files are fastq.gz files and Slurm is available:
$ sbatch WES_pipeline_RAWfastq.sh

step 1: Download the SRA files (high-throughput raw sequencing data).

method one: use the prefetch command from SRA-tools. E.g. download SRR3589957-SRR3589965 (RNA-seq of the Homo sapiens 293 cell line):

for i in $(seq 57 65)
do
    prefetch SRR35899${i}
done

method two: use the FTP protocol. E.g. download SRR5422017-SRR5422019 (RNA-seq of Saccharomyces cerevisiae):

for i in $(seq 7 9)
do
    wget -c -t 0 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR542/SRR542201${i}/SRR542201${i}.sra
done

method three: use the Entrez Direct package. E.g. download all SRA files under project SRP091493 and convert them to FASTQ:

$ esearch -db sra -query SRP091493 | efetch -format runinfo | cut -d ',' -f 1 | grep SRR | xargs fastq-dump --split-files --bzip2

method four: use Aspera through prefetch. All runs listed in an accession file (e.g. SRR_Acc_List.txt) can be downloaded in one call; such a list can be exported from the "Accession List" option on the NCBI SRA Run Selector page for a project.

$ prefetch -t ascp -a "/home/user/.aspera/connect/bin/ascp|/home/user/.aspera/connect/etc/asperaweb_id_dsa.openssh" --option-file SRR_Acc_List.txt -O ./

step 2: Convert the SRA files to FASTQ and run quality control

$ bash step2_SRA-QC.sh

step 3: Remove adapters and filter out low-quality reads

$ bash step3_clipper.sh

step 4: Download the genome file and annotation file for the corresponding species

method one: find the corresponding genome on NCBI, copy the download link, and fetch it directly with wget. E.g. download the genome of Drosophila melanogaster (the GFF annotation file is downloaded the same way):

$ wget -c -t 0 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz

method two: download the genome, annotation, and prebuilt indexes from iGenomes (https://support.illumina.com/sequencing/sequencing_software/igenome.html).


step 5: Use the mapping tool to build an index of the corresponding species' genome

E.g. use BWA to build an index for the human genome. If the genome files were downloaded from iGenomes, this step can be omitted (the indexes are included).

$ bwa index -p genome ./Homo_sapiens.fasta

step 6: Map reads to the genome.

There are many sequence alignment tools: STAR, HISAT2, BWA, TopHat, Bowtie2, and others. STAR, HISAT2, and TopHat are splice-aware aligners intended mainly for RNA-seq; BWA is the standard choice for exome-seq mapping.

$ bash step6_bwa.sh

step 7: Use samtools to convert the SAM files generated by the alignment to BAM files and sort them.

Then we can use samtools to summarize the mapping results:

$ bash step7_samtools_stat.sh

Let's check the BAM file:

$ samtools mpileup SRRXXXXXXXX.bam | head -100 | tail -3

step 8: Call variants. First, we need to download the variant annotation files from the GATK resource bundle; I use lftp for this. Without root access, lftp can be installed via conda. It is best to install tools in a separate conda environment rather than the base environment, to avoid breaking conda, since the glibc (libc.so.6) version on workstations is often old.

$ lftp ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/
> mirror hg38
> exit
$ tree -h
$ gunzip Homo_sapiens_assembly38.fasta.gz

The easiest way to call variants (fast but less precise):

$ samtools mpileup -ugf /DYY/gatk/hg38/Homo_sapiens_assembly38.fasta *.bam | bcftools call -vmO z -o var_out.vcf.gz

The complete way uses GATK. First, mark PCR duplicate reads. If the library is paired-end, also fix the mate information. If read groups were not added during BWA mapping, use AddOrReplaceReadGroups to add them (run '$ bash add_read_group.sh'). Then base quality score recalibration (BQSR) is necessary.

$ bash step8.1_gatk_dup_bqsr.sh

The data sets, and therefore the analysis methods, vary with the research purpose. WES applications fall mainly into three parts: 1. analyzing mutated genes in complex diseases such as cancer; 2. analyzing pathogenic genes in Mendelian single-gene inherited diseases; 3. analyzing low-frequency mutations in populations.

Part 1. analyzing mutated genes in complex diseases such as cancer

We use GATK Mutect2. For detailed instructions on Mutect2 usage and parameters, see https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2. The script looks like this:

$ bash step8.2_p1.1_gatk_cancer_mut.sh

Then we need to filter out the false positives:

$ bash step8.2_p1.2_gatk_cancer_filter.sh

part 2. analyzing pathogenic genes in Mendelian single-gene inherited disease

We can use GATK HaplotypeCaller. For detailed instructions on HaplotypeCaller and its parameters, see https://gatk.broadinstitute.org/hc/en-us/articles/360036465912-HaplotypeCaller#. For a multi-sample experimental group, we first generate intermediate gVCF files and then perform joint genotyping.

$ bash step8.2_p2.1_gatk_men_hc.sh

Then we need to filter; here I use GATK VQSR. It trains a Gaussian mixture model on the HapMap, Omni, 1000G, and dbSNP databases, computes the log odds of each variant being true versus false (VQSLOD), and filters variants at an empirical VQSLOD cutoff.

$ bash step8.2_p2.2_gatk_men_filter.sh
$ bash step8.2_p2.3_gatk_men_filter_onlyPASS.sh  # keep only variants labeled PASS

part 3. analyzing low-frequency mutations in populations


step 9: Annotate genetic variants

We use ANNOVAR to annotate variants. After downloading ANNOVAR, add its directory to PATH (e.g. in .bashrc); the ANNOVAR scripts can then be run from anywhere.

# for part 1
$ bash step9.1_p1_annovar_convert_anno.sh
# Then merge all VCFs for inspection
$ bash step9.2_p1_merge_anno.sh

# for part 2
$ bash step9.1_p2_annovar_convert_anno.sh
$ bash step9.2_p2_merge_anno.sh

step 10: Visualize variants

This step runs an R script; we use the maftools package.

$ Rscript visu_mut.R