This repository contains a test dataset and the instructions to run Wengan (version 0.2).
This test uses the precompiled Wengan v0.2 binaries. The Linux binaries can be downloaded and unpacked with the following commands:
wget https://github.com/adigenova/wengan/releases/download/v0.2/wengan-v0.2-bin-Linux.tar.gz
tar zxvf wengan-v0.2-bin-Linux.tar.gz
# set WG to the path of the wengan.pl script
export WG=$PWD/wengan-v0.2-bin-Linux/wengan.pl
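Before launching an assembly, it can help to confirm that `WG` actually points at a readable `wengan.pl`. The following is a minimal sketch; `check_wg` is a hypothetical helper introduced here for illustration and is not part of the Wengan distribution.

```shell
# Hypothetical helper (not shipped with Wengan): verify that a path,
# by default the value of $WG, refers to a readable file.
check_wg() {
    wg_script="${1:-${WG:-}}"
    if [ -z "$wg_script" ]; then
        echo "WG is not set" >&2
        return 1
    fi
    if [ ! -r "$wg_script" ]; then
        echo "wengan.pl not found at: $wg_script" >&2
        return 1
    fi
    echo "WG OK: $wg_script"
}
```

Calling `check_wg` with no argument checks the exported `WG` variable; passing a path checks that path instead.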
Technology | Number of reads | Genome coverage (X) | Description | Files | Source |
---|---|---|---|---|---|
Illumina | 415,598 | 50 | 2 x 300 bp | EC.50X.R1.fastq.gz EC.50X.R2.fastq.gz | Schatz Lab |
Nanopore | 3,116 | 30 | N50: 46 kb | EC.ONT.30X.fa.gz | Loman Lab |
PacBio | 9,369 | 30 | N50: 17 kb | EC.PAC.30X.fa.gz | PacBio website |
The original datasets are available at the listed sources. We subsampled the original files to the listed genome coverage.
This test was run on a node of the Leftraru cluster (NLHPC, Chile). The node has the following hardware and software:
- CPUs : Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz (44 threads).
- RAM : 188 GB.
- File System : Lustre (EXAScaler)
- Operating system : Linux 3.10.0-862.14.4.el7.x86_64
# WG should point to the wengan.pl script (found in the root installation directory)
WG=$PATH_TO/wengan-v0.2-bin-Linux/wengan.pl
# Assembling Illumina + Nanopore reads
perl ${WG} -x ontraw -a D -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.ONT.30X.fa.gz -p ec_Wd_or1 -t 10 -g 5
# Assembling Illumina + PacBio (CLR) reads
perl ${WG} -x pacraw -a D -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.PAC.30X.fa.gz -p ec_Wd_pr1 -t 10 -g 5
The FASTA file *.SPolished.asm.wengan.fasta (ec_Wd_or1.SPolished.asm.wengan.fasta and ec_Wd_pr1.SPolished.asm.wengan.fasta, respectively) contains the final genome assembly reported by Wengan. Both hybrid datasets are assembled into a single contig (genome size of ~4.6 Mb).
The expected runtime is about 10 minutes with a single core and about 2 minutes with 10 cores. The maximum RAM usage is about 9 GB.
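One way to check a result against the expectation above (a single contig of roughly 4.6 Mb) is to count the sequences and bases in the output FASTA. This is only a sketch using standard `awk`; `fasta_summary` is a hypothetical helper name, not a Wengan command.

```shell
# Hypothetical helper: print "<number of sequences> <total bases>"
# for a (uncompressed) FASTA file given as $1.
fasta_summary() {
    awk '
        /^>/ { n++; next }                 # header line: one more sequence
        {
            gsub(/[ \t\r]/, "")            # drop whitespace and carriage returns
            bases += length($0)            # sequence line: add its length
        }
        END { printf "%d %d\n", n, bases }
    ' "$1"
}

# Example call on the Illumina + Nanopore output named above:
# fasta_summary ec_Wd_or1.SPolished.asm.wengan.fasta
```

For a successful run, the first number should be 1 and the second should be close to the ~4.6 Mb genome size.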
# WG should point to the wengan.pl script (found in the root installation directory)
WG=$PATH_TO/wengan-v0.2-bin-Linux/wengan.pl
# Assembling Illumina + Nanopore reads
perl ${WG} -x ontraw -a A -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.ONT.30X.fa.gz -p ec_Wa_or1 -t 10 -g 5
# Assembling Illumina + PacBio (CLR) reads
perl ${WG} -x pacraw -a A -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.PAC.30X.fa.gz -p ec_Wa_pr1 -t 10 -g 5
The FASTA file *.SPolished.asm.wengan.fasta (ec_Wa_or1.SPolished.asm.wengan.fasta and ec_Wa_pr1.SPolished.asm.wengan.fasta, respectively) contains the final genome assembly reported by Wengan. Both hybrid datasets are assembled into a single contig (genome size of ~4.6 Mb).
The expected runtime is about 10 minutes with a single core and about 2 minutes with 10 cores. The maximum RAM usage is about 4 GB.
# WG should point to the wengan.pl script (found in the root installation directory)
WG=$PATH_TO/wengan-v0.2-bin-Linux/wengan.pl
# Assembling Illumina + Nanopore reads
perl ${WG} -x ontraw -a M -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.ONT.30X.fa.gz -p ec_Wm_or1 -t 10 -g 5
# Assembling Illumina + PacBio (CLR) reads
perl ${WG} -x pacraw -a M -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.PAC.30X.fa.gz -p ec_Wm_pr1 -t 10 -g 5
The FASTA file *.SPolished.asm.wengan.fasta (ec_Wm_or1.SPolished.asm.wengan.fasta and ec_Wm_pr1.SPolished.asm.wengan.fasta, respectively) contains the final genome assembly reported by Wengan. Both hybrid datasets are assembled into a single contig (genome size of ~4.6 Mb).
The expected runtime is about 10 minutes with a single core and about 2 minutes with 10 cores. The maximum RAM usage is about 3 GB.
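To compare your own memory usage with the figures reported for each mode, one option on Linux is to wrap a run in GNU `time` with the `-v` flag and read the peak resident set size from its report. The sketch below assumes GNU `time` is installed at `/usr/bin/time`; the log file name and the `max_rss_gib` helper are hypothetical and chosen for illustration.

```shell
# Assumed usage (GNU time; log file name is illustrative):
# /usr/bin/time -v -o ec_Wm_or1.time.log perl ${WG} -x ontraw -a M \
#     -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz \
#     -l ecoli/reads/EC.ONT.30X.fa.gz -p ec_Wm_or1 -t 10 -g 5

# Hypothetical helper: extract the peak resident set size from a
# GNU time -v report ($1) and print it in GiB with two decimals.
max_rss_gib() {
    awk -F': ' '
        /Maximum resident set size \(kbytes\)/ {
            printf "%.2f\n", $2 / 1048576   # kbytes -> GiB
        }
    ' "$1"
}
```

Note that GNU `time` reports the peak of the monitored process and its waited-for children, so the value may differ somewhat from figures obtained with a cluster scheduler.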
The supplementary material of the Wengan bioRxiv preprint describes the datasets and commands used to assemble four human genomes with Wengan.