Wengan demo

This repository contains a test dataset and the instructions to run Wengan (version 0.2).

Download the Wengan code
Running the E.coli demo
Assembling human genomes

Download the Wengan code

This test uses the precompiled binaries of Wengan (v0.2). The Linux binaries can be downloaded and unpacked with the following commands:

wget https://github.com/adigenova/wengan/releases/download/v0.2/wengan-v0.2-bin-Linux.tar.gz
tar zxvf wengan-v0.2-bin-Linux.tar.gz
# set WG to the path of the wengan.pl script
export WG=$PWD/wengan-v0.2-bin-Linux/wengan.pl
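Before launching an assembly it can help to confirm that `WG` really points to the script. The following is a minimal sketch; the function name `check_wg` is hypothetical and not part of Wengan.

```shell
# Return 0 if WG is set and points to a readable file, 1 otherwise.
# check_wg is an illustrative helper, not a Wengan command.
check_wg() {
  if [ -z "${WG:-}" ]; then
    echo "WG is not set" >&2
    return 1
  fi
  if [ ! -r "$WG" ]; then
    echo "WG does not point to a readable file: $WG" >&2
    return 1
  fi
  echo "Using Wengan script: $WG"
}
```

Calling `check_wg` after the `export` line above should print the path of `wengan.pl` if the archive was unpacked in the current directory.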

Running the E.coli demo

Dataset description

Technology	Number of reads	Genome coverage	Description	Files	Source
Illumina	415,598	50X	2 x 300 bp	EC.50X.R1.fastq.gz, EC.50X.R2.fastq.gz	Schatz Lab
Nanopore	3,116	30X	N50: 46 kb	EC.ONT.30X.fa.gz	Loman Lab
PacBio	9,369	30X	N50: 17 kb	EC.PAC.30X.fa.gz	PacBio website

The original datasets are available at the listed sources. We subsampled the original files to the listed genome coverage.
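Genome coverage is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below applies that definition to a gzipped FASTA file such as the long-read files in the table. The function name `fasta_coverage` is hypothetical, and the genome size must be supplied by the user (this README reports an assembly of about 4.6 Mb for E. coli).

```shell
# Estimate coverage of a gzipped FASTA file as total bases / genome size.
# Usage: fasta_coverage FILE.fa.gz GENOME_SIZE_BP
# fasta_coverage is an illustrative helper, not part of Wengan.
fasta_coverage() {
  gzip -dc "$1" | awk -v g="$2" '
    !/^>/ { bases += length($0) }        # sum sequence lines, skip headers
    END   { printf "%.1f\n", bases / g } # coverage with one decimal
  '
}
```

For example, `fasta_coverage ecoli/reads/EC.ONT.30X.fa.gz 4600000` prints the estimated coverage of the Nanopore subsample under the assumed genome size.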

Hardware used

This test was run on a node of the Leftraru cluster (NLHPC, Chile). The node has the following hardware and software:

CPUs : Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz (44 threads).
RAM : 188 GB.
File System : Lustre (EXAScaler)
Operating system : Linux 3.10.0-862.14.4.el7.x86_64

Wengan commands

Running WenganD

# WG should point to the wengan.pl script (found in the root installation directory)
WG=$PATH_TO/wengan-v0.2-bin-Linux/wengan.pl
# Assembling Illumina + Nanopore reads
perl ${WG} -x ontraw -a D -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.ONT.30X.fa.gz -p ec_Wd_or1 -t 10 -g 5
# Assembling Illumina + PacBio (CLR) reads
perl ${WG} -x pacraw -a D -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.PAC.30X.fa.gz -p ec_Wd_pr1 -t 10 -g 5
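The commands in this README differ only in the preset (`-x`), the assembly mode (`-a`), the long-read file (`-l`) and the output prefix (`-p`). As a sketch, a small helper can print the command for review before it is run. The function name `wengan_cmd` and the `THREADS` variable are hypothetical; only options that already appear in this README are used.

```shell
# Print (do not run) the Wengan command for one E. coli assembly.
# Usage: wengan_cmd MODE PRESET LONG_READS PREFIX
#   MODE   : D, A or M     (value passed to -a)
#   PRESET : ontraw/pacraw (value passed to -x)
# THREADS is an optional, illustrative variable; it defaults to 10.
wengan_cmd() {
  mode=$1; preset=$2; long_reads=$3; prefix=$4
  short_reads="ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz"
  echo "perl ${WG} -x ${preset} -a ${mode} -s ${short_reads} -l ${long_reads} -p ${prefix} -t ${THREADS:-10} -g 5"
}
```

For instance, `wengan_cmd D ontraw ecoli/reads/EC.ONT.30X.fa.gz ec_Wd_or1` prints the Illumina + Nanopore command; piping the output to `sh` would execute it.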

Expected results

The FASTA files named *.SPolished.asm.wengan.fasta (ec_Wd_or1.SPolished.asm.wengan.fasta and ec_Wd_pr1.SPolished.asm.wengan.fasta, respectively) contain the final genome assemblies reported by Wengan. Each hybrid dataset is assembled into a single contig (genome size of ~4.6 Mb).
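To compare a run against this description, the number of contigs and the total assembly length can be counted directly from the FASTA file. This is a minimal sketch with a hypothetical function name, `asm_stats`; it assumes an uncompressed FASTA file.

```shell
# Print "<number of contigs> <total bases>" for an uncompressed FASTA file.
# asm_stats is an illustrative helper, not part of Wengan.
asm_stats() {
  awk '
    /^>/ { contigs++; next }        # header line: one more contig
         { bases += length($0) }    # sequence line: add its length
    END  { printf "%d %d\n", contigs, bases }
  ' "$1"
}
```

Running `asm_stats ec_Wd_or1.SPolished.asm.wengan.fasta` after the demo lets you check the contig count and length against the values stated above.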

Computational resources

The expected runtime is about 10 minutes with a single core and about 2 minutes with 10 cores. The maximum RAM usage is about 9 GB.
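One way to measure these figures on your own machine is GNU time, whose `-v` report includes a "Maximum resident set size (kbytes)" line. The sketch below assumes GNU time is installed as `/usr/bin/time` (the `-v` and `-o` options are GNU-specific); the function name `peak_rss_gb` and the log file name are hypothetical.

```shell
# Convert the peak memory reported by GNU time (-v) into GB
# (computed here as kbytes / 1024 / 1024).
# Usage: peak_rss_gb TIME_LOG
# peak_rss_gb is an illustrative helper, not part of Wengan.
peak_rss_gb() {
  awk -F': ' '/Maximum resident set size \(kbytes\)/ {
    printf "%.2f\n", $2 / 1024 / 1024
  }' "$1"
}

# Example (not run here): store the GNU time report in run.time.log
#   /usr/bin/time -v -o run.time.log <wengan command>
#   peak_rss_gb run.time.log
```

The same report also contains the elapsed wall-clock time, which can be compared with the runtimes quoted above.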

Running WenganA

# WG should point to the wengan.pl script (found in the root installation directory)
WG=$PATH_TO/wengan-v0.2-bin-Linux/wengan.pl
# Assembling Illumina + Nanopore reads
perl ${WG} -x ontraw -a A -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.ONT.30X.fa.gz -p ec_Wa_or1 -t 10 -g 5
# Assembling Illumina + PacBio (CLR) reads
perl ${WG} -x pacraw -a A -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.PAC.30X.fa.gz -p ec_Wa_pr1 -t 10 -g 5

Expected results

The FASTA files named *.SPolished.asm.wengan.fasta (ec_Wa_or1.SPolished.asm.wengan.fasta and ec_Wa_pr1.SPolished.asm.wengan.fasta, respectively) contain the final genome assemblies reported by Wengan. Each hybrid dataset is assembled into a single contig (genome size of ~4.6 Mb).

Computational resources

The expected runtime is about 10 minutes with a single core and about 2 minutes with 10 cores. The maximum RAM usage is about 4 GB.

Running WenganM

# WG should point to the wengan.pl script (found in the root installation directory)
WG=$PATH_TO/wengan-v0.2-bin-Linux/wengan.pl
# Assembling Illumina + Nanopore reads
perl ${WG} -x ontraw -a M -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.ONT.30X.fa.gz -p ec_Wm_or1 -t 10 -g 5
# Assembling Illumina + PacBio (CLR) reads
perl ${WG} -x pacraw -a M -s ecoli/reads/EC.50X.R1.fastq.gz,ecoli/reads/EC.50X.R2.fastq.gz -l ecoli/reads/EC.PAC.30X.fa.gz -p ec_Wm_pr1 -t 10 -g 5

Expected results

The FASTA files named *.SPolished.asm.wengan.fasta (ec_Wm_or1.SPolished.asm.wengan.fasta and ec_Wm_pr1.SPolished.asm.wengan.fasta, respectively) contain the final genome assemblies reported by Wengan. Each hybrid dataset is assembled into a single contig (genome size of ~4.6 Mb).
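Across the three modes, the prefixes used in this README follow the pattern `ec_W{d,a,m}_{or1,pr1}`, so the whole demo yields six final assembly files. As a sketch, the helper below reports which of them are present in a directory; the function name `check_outputs` is hypothetical.

```shell
# Report which of the six expected demo assemblies (3 modes x 2 presets)
# exist and are non-empty in DIR (default: current directory).
# Returns 0 only if all six files are present.
# check_outputs is an illustrative helper, not part of Wengan.
check_outputs() {
  dir=${1:-.}
  missing=0
  for mode in d a m; do
    for tag in or1 pr1; do
      f="${dir}/ec_W${mode}_${tag}.SPolished.asm.wengan.fasta"
      if [ -s "$f" ]; then
        echo "ok      $f"
      else
        echo "missing $f"
        missing=$((missing + 1))
      fi
    done
  done
  [ "$missing" -eq 0 ]
}
```

Running `check_outputs` in the directory where the assemblies were launched gives a quick overview of which runs have finished.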

Computational resources

The expected runtime is about 10 minutes with a single core and about 2 minutes with 10 cores. The maximum RAM usage is about 3 GB.

Assembling human genomes

The supplementary material of the Wengan bioRxiv preprint describes the datasets and commands used to assemble four human genomes with Wengan.
