The LUSH pipeline reconstructs analysis tools SOAPnuke, BWA and GATK using C/C++, and employs a new parallel computing architecture. Several redundant intermediate steps of reading and writing the same file are eliminated in the LUSH pipeline, which greatly avoids unnecessary I/O throughput and improves CPU utilization. LUSH pipeline provides far superior computational speed to GATK/GATKSpark while maintaining a high level of accuracy comparable to that of GATK.
If you encounter environmental issues when running the Lush toolkit on your system, try run in docker.
## 1. git clone repository and pull docker image
git clone https://github.com/Bgi-LUSH/LUSH-DNASeq-pipeline.git
docker pull centos:centos7.6.1810
## 2. create container lush_dnaseq
docker run -it -d -h test -u root -v /YOUR_PATH/LUSH-DNASeq-pipeline/:/usr/LUSH-DNASeq-pipeline/ --name lush_dnaseq centos:centos7.6.1810
# Replace "/YOUR_PATH/LUSH-DNASeq-pipeline/" with your directory path
## 3. go into container lush_dnaseq
docker exec -it lush_dnaseq /bin/bash
## 4. run LUSH_pipeline or LUSH_toolkits
cd /usr/LUSH-DNASeq-pipeline/test
sh test_LUSH_pipeline.sh ## pipeline
sh test_LUSH-Aligner.sh
sh test_LUSH-BQSR.sh
sh test_LUSH-HC.sh
sh test_LUSH-GenotypeGVCFs.sh
you can also build the latest version of the LUSH Docker image using the Dockerfile. just run :
git clone https://github.com/Bgi-LUSH/LUSH-DNASeq-pipeline.git
docker build -t lush-toolkit:latest -f LUSH-DNASeq-pipeline/Dockerfile .
LUSH_Aligner integrates four functional modules: preprocessing, alignment to genome, sort alignments and mark duplicates, corresponding to the software: SOAPnuke, Bwa mem, Samtools sort (Danecek et al. 2021) and GATK-MarkDuplicates (Picard), respectively. LUSH_Aligner reimplemented the original software algorithm using C and C++ and optimized the algorithm architecture to improve CPU utilization while reduce IO usage.
LUSH_Aligner first needs to construct the index for the reference genome using the following command:
./bin/LUSH_toolkit-Aligner/lush_aligner index ./example_data/ref/chrM.fa
LUSH_Aligner uses the filter4mem command to perform filtering, alignment, sorting alignments and marking duplicates.
Common parameters for LUSH_Aligner filter4mem:
--adapter1/-2 STR 3' adapter sequence
--adapter2/-3 STR 5' adapter sequence [only for PE reads]
-4 Merge multiple pairs of clean FASTQ into one pair
--cleanfqOut/-6 STR Output directory. Processed fq files and statistical results would be output to here
--mixMatch/-b INT misMatch: the max mismatch number when match the adapter (default: [1])
--lowQual/-l FLOAT lowQual: low quality threshold (default: [5])
--qualRate/-J FLOAT qualRate: low quality rate (default: [0.5])
--nRate/-n FLOAT nRate: N rate threshold (default: [0.05])
--qualSys/-g INT qualSys: quality system 1:illumina, 2:sanger (default: [sanger])
--cutData/-C INT max reads number: limit the clean reads number(Mb)
-t INT number of threads [1]
-r STR Reference sequence file
-I STR config file for fq1 file and fq2 file
-o STR the Bam file path after MarkDuplicates
-Y use soft clipping for supplementary alignments
-M mark shorter split hits as secondary
-Z temporary file directory
config file for fq1 file and fq2 file should like this (tab-split):
./example_data/NA12878_l01_1.fq.gz NA12878_l01_1 @RG\tID:NA12878.1\tLB:LibA\tSM:NA12878\tPL:COMPLETE\tCN:BGI
./example_data/NA12878_l01_2.fq.gz NA12878_l01_2
./example_data/NA12878_l02_1.fq.gz NA12878_l02_1 @RG\tID:NA12878.2\tLB:LibA\tSM:NA12878\tPL:COMPLETE\tCN:BGI
./example_data/NA12878_l02_2.fq.gz NA12878_l02_2
Example:
mkdir -p ./outdir/ ./outdir/tem
./bin/LUSH_toolkit-Aligner/lush_aligner filter4mem \
-6 ./outdir/ \
-n 0.1 -J 0.5 -l 12 -g 2 -b 2 -t 20 -M \
-r hg19.fa \
-o ./outdir/NA12878.sort.dup.bam \
-Z ./outdir/tem \
-i ./example_data/lush.config
We provide a quick run script in the test/ directory. You can directly navigate to the test directory by typing "cd ./test/" and then execute "sh ./test_LUSH-Aligner.sh".
LUSH_BQSR is a C/C++ re-implementation of the GATK BaseRecalibrator and ApplyBQSR.
Common parameters:
--bam_path STR the input BAM file
--fasta STR Reference sequence file
--out_dir STR The output file to create
--plugin_path STR path for libbqsr.so
--bed STR BED file for genomic intervals over which to operate
--producer_number INT number of producer threads
--worker_number INT number of worker threads
--known_site STR One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis (should be decompression file).
--pr_one_bam INT When there are multiple producers, change the default split BAM output to merge BAM output
--writer_thread INT The number of threads used for writing
--ip INT Amount of padding (in bp) to add to each interval you are including
--bam_index If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file
--quantize_qual INT Use static quantized quality scores to a given number of levels (with -bqsr)
Example:
mkdir -p ./outdir
export LD_LIBRARY_PATH=./bin/LUSH_toolkit-BQSR:$LD_LIBRARY_PATH
./bin/LUSH_toolkit-BQSR/lush_bqsr \
--bam_path /INPUT_PATH/NA12878.sort.dup.bam \
--out_dir ./outdir/LUSH_BQSR \
--plugin_path ./bin/LUSH_toolkit-BQSR/libbqsr.so \
--producer_number 2 --worker_number 21 \
--fasta hg19.fa \
--known_site Mills_and_1000G_gold_standard.indels.hg19.vcf \
--writer_thread 5 --pr_one_bam 1
We provide a quick run script in the test/ directory. You can directly navigate to the test directory by typing "cd ./test/" and then execute "sh ./test_LUSH-BQSR.sh".
LUSH_HC is a C/C++ re-implementation of the GATK HaplotypeCaller.
Common parameters:
Usage: lush-hc <tool> [-option]
Required:
-I, --input <file1> <file2> ... input file paths
-O, --output <file> output file path
-R, --reference <file> reference file path
Options:
-H, --help display help message
-V, --version display version message
-L, --interval <file> interval file path
-P, --interval-padding <int> interval padding size, must be a non-negative integer (default: 0)
-Q, --base-quality-score-threshold <int> base qualities threshold, must be in [6, 127](default: 18)
-D, --max-reads-depth <int> maximum reads depth per alignment start position (default: 50)
must be a non-negative integer,set to 0 to disable
-G, --gvcf-gq-bands <int1> <int2>... specify GQ bands for merging non-variant sites in GVCF mode
must be in [1, 100], (default: 1, 2, 3,... 60, 70, 80, 90, 99)
--nthreads <int> number of threads to use, must be in [1, 128] (default: 30)
--pcr-indel-model <str> PCR indel model (default: CONSERVATIVE)
available options: {NONE, HOSTILE, CONSERVATIVE, AGGRESSIVE}
--emit-ref-confidence <str> emit reference confidence score mode (default: NONE)
available options: {NONE, BP_RESOLUTION, GVCF}
--nstreampool <int> iostream pool size, must be in [1, 20] (default: 10)
--inspect-reads strictly inspect input reads (default: false)
--old-pairhmm-engine use intel pairhmm engine (default: lush pairhmm engine)
--compression-level compression level, must be in [0, 9] (default: 6)
Example:
mkdir -p ./outdir
export LD_LIBRARY_PATH=./bin/LUSH_toolkit-HC:$LD_LIBRARY_PATH
./bin/LUSH_toolkit-HC/lush_hc HaplotypeCaller \
--pcr-indel-model NONE \
-I /INPUT_PATH/NA12878.sort.dup.bam \
-R hg19.fa \
-O ./outdir/NA12878.vcf.gz
We provide a quick run script in the test/ directory. You can directly navigate to the test directory by typing "cd ./test/" and then execute "sh ./test_LUSH-HC.sh".
LUSH_GenotypeGVCFs is a C/C++ re-implementation of the GATK GenotypeGVCFs.
UASGE: LUSH_GenotypeGVCF inputGvcfFile outputVcfFile stand-call-conf
inputGvcfFile input VCF file
outputGvcfFile output file name:/file/NA12878_PCR.vcf.gz
stand-call-conf The minimum phred-scaled confidence threshold at which variants should be called:10.0
Example:
mkdir -p ./outdir
export LD_LIBRARY_PATH=./bin/LUSH_toolkit-GenotypeGVCFs:$LD_LIBRARY_PATH
./bin/LUSH_toolkit-GenotypeGVCFs/lush_genotypegvcfs /INPUT_PATH/NA12878.g.vcf.gz ./outdir/NA12878.vcf.gz 10
We provide a quick run script in the test/ directory. You can directly navigate to the test directory by typing "cd ./test/" and then execute "sh ./test_LUSH-GenotypeGVCFs.sh".
Usage:
LUSH_pipeline.sh [-i FQCONF] [-t THREAD] [-o OUTDIR] [-m MODEL] [-s PREFIX]
Description:
FQCONF, the path of INPUT fastq file
THREAD, the number of thread [10]
OUTDIR, the path of outdir [./]
PREFIX, the prefix of outputfile [LUSHtest]
MODE, GVCF or not [Y/N]
Example:
LUSH_pipeline.sh -i fq.config -t 40 -o ./ -m N -s samplename
fq.config should be like this:
/PATH/MGISEQ2000_PCR-free_NA12878_30X_1.fq.gz NA12878_30X_1 @RG\tID:NA12878_30X.1\tLB:NA12878_30X\tSM:NA12878_30X\tPL:COMPLETE\tCN:BGI
/PATH/MGISEQ2000_PCR-free_NA12878_30X_2.fq.gz NA12878_30X_2
Note: Inside LUSH_pipeline.sh, the software path and reference sequence file on lines 47-58 need to be replaced with your current actual path and filename.
We provide a quick run script in the test/ directory. You can directly navigate to the test directory by typing "cd ./test/" and then execute "sh ./test_LUSH_pipeline.sh".
Usage:
GATK_pipeline.sh [-i FQFile] [-t THREAD] [-o OUTDIR] [-m MODEL] [-s PREFIX] [-p SPARK]
Description:
FQFile, the path of INPUT fastq file, should be like '/path/fastq1,/path/fatq2'
THREAD, the number of thread [10]
OUTDIR, the path of outdir [./]
PREFIX, the prefix of outputfile [GATKtest]
MODE, GVCF or not [Y/N]
SPARK, Spark or not [Y/N]
Example:
GATK_pipeline.sh -i /PATH/MGISEQ2000_PCR-free_NA12878_30X_1.fq.gz,/PATH/MGISEQ2000_PCR-free_NA12878_30X_2.fq.gz -t 40 -o ./ -m N -s samplename -p N
Usage:
Haplotype_Comparison.sh [-i VCFFile] [-t THREAD] [-o OUTDIR] [-m MODEL] [-s PREFIX]
Description:
VCFFile, the path of INPUT vcf file, should be like '/path/*1.vcf.gz,/path/*2.vcf.gz'
THREAD, the number of thread [10]
OUTDIR, the path of outdir [./]
PREFIX, the prefix of outputfile [HAPtest]
Example:
Haplotype_Comparison.sh -i LUSHtest.vcf.gz,GATKtest.vcf.gz -t 40 -o ./ -s sample
Note that only sample scripts are provided here. Actual operation requires replacing the paths of the corresponding software and configuration files as required.