Finally, Valentine Murigneux published her benchmarking work in GigaScience. | Awesome | Paper
They just used available datasets. Such a nice idea!
The following assemblers were used for the benchmarking: the long-read-only assemblers Canu [12], Flye [13], Wtdbg2 [14], Miniasm [15], NextDenovo (https://github.com/Nextomics/NextDenovo), NECAT [6], Raven [16], and Shasta [7], and the hybrid assemblers MaSuRCA [17] and QuickMerge [18].
Canu was not tested on the M. coruscus genome due to the extremely intensive computing time required for this large genome.
Previous analyses have suggested that using corrected ONT reads could improve the genome assembly [19]. To check the effect this has on the assemblies, ONT reads that were corrected and/or trimmed by Canu and NECAT were also tested.
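Canu can produce such read sets with its correction-only and trimming-only modes; a minimal sketch, mirroring the Canu run listed further below (directory names and exact flags are illustrative, not the paper's settings):
canu -correct -p SFG_HA -d SFG_HA_corr genomeSize=0.3566g -nanopore-raw SFG_HAC_3Kb_BacFree.fq
canu -trim -p SFG_HA -d SFG_HA_trim genomeSize=0.3566g -nanopore-corrected SFG_HA_corr/SFG_HA.correctedReads.fasta.gz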
To check whether including shorter ONT reads could affect the assembly, the ONT reads were also sub-sampled with different cut-off lengths (see Table 1 for the lengths used).
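The paper does not name the tool used for the length cut-offs; one way to apply a minimum-length filter is with seqkit (an assumption, shown as a sketch with a 15 Kb cut-off):
seqkit seq -m 15000 SFG_HAC_BacFree.fq > SFG_HAC_15Kb_BacFree.fq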
CPU hours were calculated in the Slurm workload manager by recording each program's start and end time points. However, since the hardware configuration varied between nodes, the CPU hours presented are only an indicator of the relative trend among different assemblers.
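For a completed Slurm job, the elapsed and CPU time can be pulled from the accounting database with sacct (a generic example, not a command from the paper):
sacct -j <jobid> --format=JobID,JobName,Elapsed,CPUTime,MaxRSS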
The assembled contigs were polished at least three times with Flye, and heterozygous contigs were removed with the purge_dup pipeline [20]. The resultant genomes were polished twice using Pilon version 1.23 [21] with Illumina reads.
The genome completeness of each assembly was monitored at each step using BUSCO v4.0.6 with the metazoan odb10 dataset [22]. The genome quality of the Scaly-foot Snail assemblies was assessed with QUAST version 5.0.2 [23], comparing against the previously published version of the genome as a reference [8]. QUAST calculates genome assembly characteristics such as N50 and total size, and also assesses mis-assemblies with minimap2.
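The BUSCO command itself is not listed in the commands below; a typical BUSCO v4 invocation for this step would look like the following (output name and thread count are placeholders):
busco -i SFG_assembly.fasta -l metazoa_odb10 -m genome -c 40 -o busco_SFG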
The detailed commands and settings used for all analyses can be found in the Supplementary Information.
The commands used in this study.
## Remove bacteria-contaminated Illumina reads with Kraken2
/home/share/kraken2-2.0.8-beta/kraken2 -db /home/share/kraken2-2.0.8-beta/minikraken2_v1_8GB --threads 40 --paired SFG1.fq SFG2.fq --classified-out SFG_Bac#.fq --unclassified-out SFG_Host#.fq
#Remove the symbiont contaminated ONT reads
minimap2 -ax map-ont -t 40 ~/Data/SFG/Symbiont_genome/Bac.fasta SFG_HAC_3Kb.fq | samtools fastq -n -f 4 - > SFG_HAC_3Kb_BacFree.fq
#wtdbg2 assembly
~/App/wtdbg2/wtdbg2 -i ../Flye_5Kb_canu_trimm_ass/SFG_HA.trimmedReads.fasta.gz -t 40 -o SFG_HAC_canuTrim
~/App/wtdbg2/wtpoa-cns -t 40 -i SFG_HAC_canuTrim.ctg.lay.gz -fo SFG_HAC_canuTrim.ctg.lay.fa
# Flye assembly and polishing
flye --nano-raw ../SFG_HAC_3Kb_BacFree.fq -g 0.3566g -o flye_ONT_HAC_3kb -t 40
flye --polish-target ../SFG_10Kb_HAC.fa --nano-raw ../SFG_HAC_3Kb_BacFree.fq --iterations 3 --out-dir flye_po_R3 --threads 40
# minimap + miniasm assembly
minimap2 -X -t 20 -x ava-ont ../SFGB1_clean_3Kb.fq ../SFGB1_clean_3Kb.fq > reads.paf
miniasm -f ../SFGB1_clean_3Kb.fq reads.paf > SFG_default.gfa
awk '/^S/{print ">"$2"\n"$3}' SFG_default.gfa | fold > SFG_default.fa
# canu assembly
~/App/canu-2.0/Linux-amd64/bin/canu -fast -p SFG_HA -d SFG_HA genomeSize=0.3566g -nanopore-raw SFG_HAC_3Kb_BacFree.fq corOutCoverage=200 corMhapSensitivity=normal correctedErrorRate=0.105 minReadLength=5000 useGrid=true gridOptions=--partition=oces
#NECAT assembly
necat.pl config SFG_HAC_config.txt
necat.pl correct SFG_HAC_config.txt
necat.pl assemble SFG_HAC_config.txt
necat.pl bridge SFG_HAC_config.txt
# the configure file
PROJECT=SFG_HAC
ONT_READ_LIST=readlist.txt
GENOME_SIZE=356600000
THREADS=40
MIN_READ_LENGTH=5000
PREP_OUTPUT_COVERAGE=70
OVLP_FAST_OPTIONS=-n 500 -z 20 -b 2000 -e 0.5 -j 0 -u 1 -a 1000
OVLP_SENSITIVE_OPTIONS=-n 500 -z 10 -e 0.5 -j 0 -u 1 -a 1000
CNS_FAST_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
CNS_SENSITIVE_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
TRIM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 1 -a 400
ASM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 0 -a 400
NUM_ITER=2
CNS_OUTPUT_COVERAGE=50
CLEANUP=1
USE_GRID=false
GRID_NODE=0
GRID_OPTIONS=
SMALL_MEMORY=0
FSA_OL_FILTER_OPTIONS=
FSA_ASSEMBLE_OPTIONS=
FSA_CTG_BRIDGE_OPTIONS=
POLISH_CONTIGS=true
#NextDenovo configure file
[General]
job_type = local
job_prefix = nextDenovo
task = all # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 3
parallel_jobs = 4
input_type = raw
input_fofn = ./input.fofn
workdir = ./01_rundir
[correct_option]
read_cutoff = 1k
seed_cutoff = 11653
blocksize = 1g
pa_correction = 2
seed_cutfiles = 2
sort_options = -m 1g -t 2 -k 50
minimap2_options_raw = -x ava-ont -t 10
correction_options = -p 15
[assemble_option]
minimap2_options_cns = -x ava-ont -t 10 -k17 -w17
nextgraph_options = -a 1
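# to launch the run with the NextDenovo config above (assuming it is saved as run.cfg; this command is not recorded in the notes):
nextDenovo run.cfg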
#Raven assembly
~/App/raven/build/bin/raven ../SFG_HAC_10Kb_BacFree.fa.gz -t 40 -p 1 >Raven_10Kb.fasta 2> log.txt
#Shasta assembly
sudo ~/App/shata_v0.4.0/shasta-Linux-0.4.0 --input ../Mcor_ONT_10kb.fa --memoryMode filesystem --memoryBacking 2M
#QuickMerge
delta-filter -r -q -l 10000 SFG_hybrid.delta > SFG_hybrid.rq.delta
quickmerge -d SFG_hybrid.rq.delta -q MaSu_HAC35x_FlyeP3_PD_Pilon2.fasta -r SFG_Flye_HAC10kb_P2_PD_pilon2.fasta -hco 5.0 -c 1.5 -l 2164900 -ml 10000 -p SFG_QM
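# note: SFG_hybrid.delta used above is not generated in these notes; it presumably comes from a preceding nucmer alignment of the two assemblies, e.g. (argument order follows the -r/-q choice above):
nucmer -l 100 -p SFG_hybrid SFG_Flye_HAC10kb_P2_PD_pilon2.fasta MaSu_HAC35x_FlyeP3_PD_Pilon2.fasta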
# Purge_dup version 1.2.3
minimap2 -x map-ont SFG_FlyepolishedR3.fasta SFG_ONT_3Kb_BacFree.fq -t 40 > reads.paf
pbcstat *.paf
calcuts PB.stat -l13 -m51 -u144 > cutoffs 2>calcults.log
split_fa SFG_FlyepolishedR3.fasta > polished_3.fasta.split
minimap2 -x asm5 -DP polished_3.fasta.split polished_3.fasta.split -t 40 >split.paf
purge_dups -2 -T cutoffs -c PB.base.cov split.paf >dups.bed 2>purge_dups.log
get_seqs dups.bed SFG_FlyepolishedR3.fasta
#or with Illumina reads
bowtie2-build -f SFG_canu_flyeP3.fasta SFG_canu --threads 40
bowtie2 -p 40 --maxins 800 -x SFG_canu -1 SFG_trim_1.fq -2 SFG_trim_2.fq 1>PE.sam 2>SE.err
samtools view -bS -@ 20 PE.sam > PE.bam
~/App/purge_dups/bin/ngscstat PE.bam
~/App/purge_dups/bin/calcuts TX.stat -l13 -m51 -u144 > cutoffs 2>calcults.log
~/App/purge_dups/bin/split_fa SFG_canu_flyeP3.fasta > polished_3.fasta.split
minimap2 -x asm5 -DP polished_3.fasta.split polished_3.fasta.split -t 40 > polished_3.fasta.split.paf
~/App/purge_dups/bin/purge_dups -2 -T cutoffs -c TX.base.cov polished_3.fasta.split.paf > dups.bed 2> purge_dups.log
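# the final extraction step is not recorded for this Illumina-based run; presumably get_seqs was applied as in the ONT-based block above:
~/App/purge_dups/bin/get_seqs dups.bed SFG_canu_flyeP3.fasta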
#Pilon error correction
bowtie2-build -f canu_FlyeP3_PD2_pilon1.fa SFS --threads 40
bowtie2 -p 40 -D 20 -R 2 -N 1 -L 18 -i S,1,0.50 --maxins 1200 -x SFS -1 SFG_clean_1.fq -2 SFG_clean_2.fq 1>SFSPE500.sam 2>SFSPE500.err
grep -E "@|NM:" SFSPE500.sam | grep -v "XS:" > SFSPE500_uniq.sam
samtools view -bS -@ 40 SFSPE500_uniq.sam > SFSPE500_uniq.bam
samtools sort SFSPE500_uniq.bam -m 5G -@ 10 -o SFSPE500_uniq_sorted.bam
java -jar ~/App/picard/picard.jar MarkDuplicates I=SFSPE500_uniq_sorted.bam O=SFSPE500_uniq_sorted_dedupe.bam METRICS_FILE=metrics.txt
samtools index SFSPE500_uniq_sorted_dedupe.bam
java -Xmx180G -jar ~/App/pilon-1.23/pilon-1.23.jar --genome canu_FlyeP3_PD2_pilon1.fa --frags SFSPE500_uniq_sorted_dedupe.bam --diploid --threads 40
#QUAST analysis
quast.py ./canu/Canu.fasta ./flye/Flye.fasta ./masurca/MaSuRCA.fasta ./miniasm/Miniasm.fasta ./necat/NECAT.fasta ./nextdenovo/NextDenovo.fasta ./raven/Raven.fasta ./shasta/Shasta.fasta ./wtdbg2/Wtdbg2.fasta ./quickmerger/QuickMerger.fasta -r ./Csqv1.1/Csq_v1.1.fa -t 40
#repeatmodeler and repeatmasker
BuildDatabase -name Mcor_v2.0 ../Mcor_v2.0.fasta
~/App/RepeatModeler-2.0.1/RepeatModeler -database Mcor_v2.0 -pa 40
RepeatMasker -species all -pa 8 -div 30 Mcor_v2.0.fasta
RepeatMasker -lib /home/sunj/Data/Mcoru/Annotations/RepeatModeler/Mcor_v2.0-families.fa -pa 10 -div 30 Mcor_v2.0.fasta.masked
# Braker for Augustus training
~/App/BRAKER-2.1.5/scripts/braker.pl --genome=SFG_Flye_HAC10kb_P2_PD_pilon2.fasta.masked.masked --species=Chrysomallon --bam=all.sorted.bam --cores 40
# maker configure file
#-----Genome (these are always required)
genome=SFG_Flye_HAC10kb_P2_PD_pilon2.fasta #genome sequence (fasta file or fasta embedded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
#-----EST Evidence (for best results provide a file for at least one)
est=Trinity_all_0.97.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closely related species in GFF3 format
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=Mollu_Prot_50AA_0.95.fa #protein sequence file in fasta format (i.e. from multiple organisms)
protein_gff= #aligned protein homology evidence from an external GFF3 file
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib=./SFG_flye-families.fa #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=~/App/maker-3.01.03/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species=Chrysomallon #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
run_evm=1 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
snoscan_meth= #-O-methylation site file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)
#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file
#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=20 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=10000 #skip genome contigs below this length (under 10kb are often useless)
pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=50 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=1 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
split_hit=20000 #length for the splitting of hits (expected max intron size for evidence alignments)
min_intron=20 #minimum intron length (used for alignment polishing)
single_exon=1 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=300 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP=/home/sunj/temp #specify a directory other than the system default temporary directory for temporary files
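# launching MAKER with the control file above (the bopts/exe control files are the MAKER defaults; this command is an assumption, not recorded in the notes):
maker -base SFG maker_opts.ctl maker_bopts.ctl maker_exe.ctl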
Assembler | Input reads | Settings / post-processing | N50/NG50 (NG50 assumes genome size = 1593 Mb) | No. of scaffolds | Max. scaffold length | No. of scaffolds >1 Mb | Total length | BUSCO (metazoan, odb10) | CPU hours
---|---|---|---|---|---|---|---|---|---
shasta v0.4.0 | Raw ONT, 15Kb cutoff (65X) | default | 218.9Kb/317.6Kb | 22499 | 4.03Mb | 77 | 2.14Gb | C:69.1%[S:65.1%,D:4.0%],F:10.7%,M:20.2%,n:954 | 98
wtdbg2 v2.5 | Raw ONT, 15Kb cutoff (65X) | -x preset3 -L 15000 | 721.7Kb/1.01Mb | 16590 | 5.96Mb | 448 | 2.02Gb | C:56.1%[S:55.6%,D:0.5%],F:11.4%,M:32.5%,n:954 | 820
wtdbg2 v2.5 | NECAT corr, 15Kb cutoff (50X) | -x preset4 | 772.9Kb/1.11Mb | 16451 | 9.37Mb | 444 | 2.01Gb | C:82.8%[S:78.9%,D:3.9%],F:4.7%,M:12.5%,n:954 | 9218
wtdbg2 v2.5 | NECAT corr & trim, 15Kb cutoff (50X) | -x preset4 | 756.3Kb/1.12Mb | 15960 | 9.34Mb | 455 | 1.99Gb | C:83.4%[S:79.9%,D:3.5%],F:5.1%,M:11.5%,n:954 | 21370
flye v2.7.1-b1590 | NECAT corr, 15Kb cutoff (50X) | --nano-corr | 250.7Kb/403.8Kb | 28333 | 3.07Mb | 116 | 2.52Gb | - | 11428
flye v2.7.1-b1590 | Raw ONT, 23Kb cutoff (46X) | --nano-raw | 274.8Kb/456.2Kb | 27310 | 2.77Mb | 160 | 2.57Gb | C:87.3%[S:75.1%,D:12.2%],F:5.8%,M:6.9%,n:954 | 1440
Raven v1.1.10 | Raw ONT, 15Kb cutoff (65X) | default | 325.2Kb/475.9Kb | 10382 | 1.73Mb | 95 | 2.55Gb | C:87.5%[S:78.0%,D:9.5%],F:3.8%,M:8.7%,n:954 | 1240
NECAT 20200119_Linux-amd64 | Raw ONT, 15Kb cutoff (65X) | PREP_OUTPUT_COVERAGE=70, CNS_OUTPUT_COVERAGE=50 | 1.24Mb/1.87Mb | 5153 | 6.56Mb | 887 | 2.62Gb | C:86.2%[S:68.9%,D:17.3%],F:4.7%,M:9.1%,n:954 | 23738
 | | Flye polish three times | 1.25Mb/1.88Mb | 5076 | 6.57Mb | 895 | 2.63Gb | C:90.3%[S:66.0%,D:24.3%],F:3.1%,M:6.6%,n:954 | 
 | | Flye polish three times, purge duplicates | 1.46Mb/1.76Mb | 2841 | 6.57Mb | 711 | 1.96Gb | C:89.2%[S:87.6%,D:1.6%],F:3.8%,M:7.0%,n:954 | 
 | | Flye polish three times, purge duplicates (2nd round) | 1.46Mb/1.74Mb | 2646 | 6.57Mb | 703 | 1.92Gb | C:89.4%[S:87.9%,D:1.5%],F:3.8%,M:6.8%,n:954 | 
NextDenovo v2.3.0 | Raw ONT | seed_cutoff = 23987, 45X | 2.54Mb/3.40Mb | 1839 | 12.50Mb | 575 | 2.07Gb | C:88.6%[S:86.6%,D:2.0%],F:4.1%,M:7.3%,n:954 | 6700
 | | Flye polish three times | 2.51Mb/3.37Mb | 1839 | 12.40Mb | 573 | 2.06Gb | C:89.8%[S:86.7%,D:3.1%],F:3.9%,M:6.3%,n:954 | 
 | | Flye polish three times, purge duplicates | 2.70Mb/3.36Mb | 1591 | 12.40Mb | 548 | 1.93Gb | C:89.9%[S:88.4%,D:1.5%],F:3.6%,M:6.5%,n:954 | 
 | | Flye polish three times, purge duplicates, Pilon round 2 | 2.70Mb/3.37Mb | 1591 | 12.4Mb | 547 | 1.93Gb | C:95.8%[S:94.1%,D:1.7%],F:1.5%,M:2.7%,n:954 | 
MaSuRCA v3.4.1 | Illumina PE150, clean ONT | LHE_COVERAGE=25, FLYE_ASSEMBLY=0 | 813.6Kb/1.70Mb | 10541 | 11.02Mb | 561 | 2.67Gb | C:96.0%[S:70.9%,D:25.1%],F:1.5%,M:2.5%,n:954 | 46100
 | | purge duplicates | 1.17Mb/1.61Mb | 4570 | 11.02Mb | 515 | 1.98Gb | C:95.6%[S:93.4%,D:2.2%],F:1.8%,M:2.6%,n:954 | 
 | | purge duplicates, POLCA polishing | 1.17Mb/1.61Mb | 4570 | 11.02Mb | 515 | 1.98Gb | C:95.8%[S:93.6%,D:2.2%],F:1.7%,M:2.5%,n:954 | 
-
Nanopore-only benchmarking, NextDenovo mentioned | Benchmarking Oxford Nanopore read assemblers for high-quality molluscan genomes
-
Good for writing materials and methods | Ultracontinuous single haplotype genome assemblies for the domestic cat (Felis catus) and Asian leopard cat (Prionailurus bengalensis)
-
Very nice benchmarking paper with nice ggplot2 graphs on Adineta vaga | Overcoming uncollapsed haplotypes in long-read assemblies of non-model organisms
Fig1: Statistics of raw assemblies obtained from the full PacBio dataset (raw assemblies), with a preliminary read-filtering step (keeping only reads larger than 15 kb), or a subsequent removal of uncollapsed haplotypes with HaploMerger2, purge_dups, or purge_haplotigs. a) Assembly scores for size, N50, completeness and haploidy. b) Long-read coverage distribution over the contigs.
Fig3: Statistics of the PacBio and Nanopore assemblies depending on sequencing depth, with a) assembly size, b) N50, c) complete single-copy BUSCOs and d) haploidy. The assemblies were run on five random subsamplings of the long-read datasets.
Fig4: Computational resources (RAM and CPU time) used by the assemblers. a) Maximum RAM usage and b) mean CPU time depending on sequencing depth. Canu and NextDenovo were not included in this comparison as they were run on different machines.
-
Awesome review of long-read sequencing assembly | Piercing the dark matter: bioinformatics of long-range sequencing and mapping
Supplementary Material for “Haplotype-resolved de novo assembly with phased assembly graphs”
S1 Software commands
S1.1 Hifiasm
To produce primary assemblies of homozygous samples (M. musculus, Z. mays and CHM13), hifiasm (version 0.7) was run with the following command, which does not purge haplotig duplications:
hifiasm -o <outputPrefix> -t <nThreads> -l0 <HiFi-reads.fasta>
For heterozygous samples, hifiasm was run with the following command:
hifiasm -o <outputPrefix> -t <nThreads> <HiFi-reads.fasta>
We added ‘-D10’ for the octoploid F. × ananassa because the default k-mer cutoff seems too low:
hifiasm -o <outputPrefix> -t <nThreads> -D10 <HiFi-reads.fasta>
For trio-binning assembly, we first built the paternal trio index and the maternal trio index by yak (version r55) with the following commands:
yak count -b37 -t <nThreads> -o <pat.yak> <paternal-short-reads.fastq>
yak count -b37 -t <nThreads> -o <mat.yak> <maternal-short-reads.fastq>
and then we produced the paternal assembly and the maternal assembly with the following command:
hifiasm -o <outputPrefix> -t <nThreads> -1 <pat.yak> -2 <mat.yak> <HiFi-reads.fasta>
S1.2 Falcon-Unzip
Falcon-kit (version 1.8.1) was run with the following HiFi-specific options:
length_cutoff_pr = 8000
ovlp_daligner_option = -k24 -h1024 -e.98 -l1500 -s100
ovlp_HPCdaligner_option = -v -B128 -M24
ovlp_DBsplit_option = -s400
overlap_filtering_setting = --max-diff 200 --max-cov 200 --min-cov 2 --n-core 24 --min-idt 98 --ignore-indels
Falcon-unzip-kit (version 1.3.7) was run with default options.
S1.3 HiCanu
For primary assembly, HiCanu (version 2.0) was run with the following command line:
canu -p asm -d <outDir> genomeSize=<GSize> useGrid=false maxThreads=<nThreads> \
-pacbio-hifi <HiFi-reads.fasta>
The contigs labeled by ‘suggestedBubbles=yes’ were removed from the primary assembly. For trio-binning assembly, we ran HiCanu in two steps as recommended. We partitioned the HiFi reads by parental short reads with the following command:
canu -haplotype -p asm -d <outDir> genomeSize=<GSize> useGrid=false \
maxThreads=<nThreads> -haplotypePat <pat-reads.fq> -haplotypeMat <mat-reads.fq> \
-pacbio-raw <HiFi-reads.fasta>
Note that ‘-pacbio-raw’ was used to partition HiFi reads, following the HiCanu documentation. We then performed HiCanu assemblies on the partitioned reads.
S1.4 Peregrine
For primary assembly, Peregrine (version 0.1.6.1) was run with the following command, where 48 is the number of threads in use:
docker run -it -v <workDir>:/wd --user $(id -u):$(id -g) cschin/peregrine:0.1.6.1 asm \
/wd/Input.fnfo 48 48 48 48 48 48 48 48 48 --with-consensus --with-alt --shimmer-r 3 \
--best_n_ovlp 8 --output <outDir>
For trio-binning assembly, we first used HiCanu to partition HiFi reads by parental short reads, and then assembled each haplotype individually with Peregrine.
S1.5 Purge_dups
Purge_dups (version 1.2.3) was used to postprocess the output primary assemblies of HiCanu for all heterozygous samples. The commands are as follows:
minimap2 -I6G -xmap-pb <contigs.fa> <HiFi-reads.fasta> -t <nThreads> > <read-aln.paf>
bin/pbcstat <read-aln.paf>
bin/calcuts PB.stat > cutoffs
bin/split_fa <contigs.fa> > <split.fa>
minimap2 -I6G -xasm5 -DP <split.fa> <split.fa> -t <nThreads> > <ctg-aln.paf>
bin/purge_dups -2 -T cutoffs -c PB.base.cov <ctg-aln.paf> > <dups.bed>
bin/get_seqs <dups.bed> <contigs.fa>
Since running Purge_dups with default settings did not produce a HiCanu primary assembly of the right size for HG002, we manually adjusted the cutoff thresholds of Purge_dups to “5 7 11 30 22 42”.
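The cutoffs file written by calcuts is a single line of six thresholds, so one way to apply the manual values (an assumption about the exact mechanism used) is to overwrite that file directly:
printf "5\t7\t11\t30\t22\t42\n" > cutoffs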
S1.6 Running asmgene
We aligned the cDNAs to the reference genome and contigs by minimap2 r974 and evaluated the gene completeness with paftools.js from the minimap2 package:
minimap2 -cxsplice:hq -t <nThreads> <contigs.fa> <cDNAs.fa> > <aln.paf>
paftools.js asmgene -i.97 <ref.paf> <asm.paf>
We set the sequence identity threshold to 97% with ‘-i.97’ to tolerate the low per-base accuracy of ONT assemblies. For trio binning assemblies, we added option ‘-a’ to evaluate genes mapped to the autosomes only. When evaluating multi-copy genes retained in an assembly, we replaced ‘-i.97’ with ‘-i.99’ to increase the resolution.
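For example, the two variations described above only change the asmgene flags:
paftools.js asmgene -a -i.97 <ref.paf> <asm.paf>
paftools.js asmgene -i.99 <ref.paf> <asm.paf>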
S1.7 Computing NGA50
We used minigraph (version 0.10-dirty-r361) and paftools (version 2.17-r974-dirty) to calculate the NGA50 of each assembly:
minigraph -xasm -K1.9g --show-unmap=yes -t <nThreads> <ref.fa> <asm.fa> > <asm.paf>
paftools.js asmstat <ref.fa.fai> <asm.paf>
In comparison to minimap2, minigraph tends to generate longer alignments and is more robust to highly variable regions.
S1.8 BUSCO
BUSCO (version 3.0.2) was used with the following command:
python3 run_BUSCO.py -i <asm.fa> -m genome -o <outDir> -c <nThreads> -l <lineage_dataset>
where ‘lineage_dataset’ is set to tetrapoda for R. muscosa and set to embryophyta for F. × ananassa and S. sempervirens.
S1.9 Determining resolved BACs
The resolution of BACs for different assemblies was evaluated using the pipeline at https://github.com/skoren/bacValidation, except that we added option ‘-I6g’ to minimap2. The BAC libraries for CHM13 and HG00733 can be found at https://www.ncbi.nlm.nih.gov/nuccore/?term=VMRC59+and+complete and https://www.ncbi.nlm.nih.gov/nuccore/?term=VMRC62+and+complete, respectively.
S1.10 Running yak evaluation
We used yak (version r55) to measure the per-base consensus accuracy (QV), the switch error rate and the hamming error rate. For QV evaluation, we first built the index for the short reads coming from the same sample:
yak count -b37 -t <nThreads> -o <sr.yak> <short-reads.fastq>
yak qv -t <nThreads> <sr.yak> <contigs.fa>
To evaluate the switch error rate and the hamming error rate, we first built the indexes from the parental short reads as in section S1.1 and then estimated k-mer-based error rates as follows:
yak trioeval -t <nThreads> <pat.yak> <mat.yak> <contigs.fa>
S1.11 Dipcall
For the male sample HG002, we ran dipcall (version 0.1) as follows:
dipcall.kit/run-dipcall -x dipcall.kit/hs37d5.PAR.bed <prefix> hs37d5.fa \
<pat-asm.fa> <mat-asm.fa> > <prefix.mak>
make -j2 -f <prefix.mak>
For the female sample HG00733, we removed option ‘-x’. We used the GRCh37 variant of ‘hs37d5.fa’ here because GIAB works best with hs37d5.
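For HG00733, the invocation without ‘-x’ would presumably be:
dipcall.kit/run-dipcall <prefix> hs37d5.fa <pat-asm.fa> <mat-asm.fa> > <prefix.mak>
make -j2 -f <prefix.mak>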
S1.12 Evaluating collapsed misassemblies for inbred samples
We used scripts at https://github.com/lh3/CHM-eval/blob/master/misc/clustreg.js and https://github.com/lh3/CHM-eval/blob/master/misc/select-collapse-het.js. The commands are as follows:
minimap2 -axasm20 -t <nThreads> <contigs.fa> <HiFi-reads.fasta> | samtools sort -o <aln.bam> -
htsbox pileup -vcf <contigs.fa> -q20 -Q20 -l5000 -S5000 -s5 <aln.bam> > <var.vcf>
./select-collapse-het.js -c <readCoverage> <var.vcf> | ./clustreg.js -n10
where ‘-l’ and ‘-S’ filter out alignments shorter than 5 kb.
Comparison of HiFi and PacBio using a rice example
Contiguity of the ONT and PB assemblies. (a) Treemaps for contig length difference between the ONT
How to do tree maps
prokaryote example
metazoa Adineta vaga example
rice example
Mixed example
Angel Dominguez notes
Useful for the introduction | Opportunities and challenges in long-read sequencing data analysis
Ten steps for a genome assembly project
Twelve steps for a genome assembly project
High-Quality Genome Assembly of Fusarium oxysporum f. sp. lini | The Genome Sequence of Five Highly Pathogenic Isolates of Fusarium oxysporum f. sp. lini
First I followed the instructions from the developer:
$ wget https://github.com/xiaochuanle/NECAT/releases/download/v0.0.1_update20200803/necat_20200803_Linux-amd64.tar.gz
$ tar xzvf necat_20200803_Linux-amd64.tar.gz
$ cd NECAT/Linux-amd64/bin
$ export PATH=$PATH:$(pwd)
But when I ran it, I got this error:
syntax error at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 51, near "$cfg{"
Global symbol "%env" requires explicit package name at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 52.
Global symbol "%env" requires explicit package name at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 55.
Global symbol "%env" requires explicit package name at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 56.
syntax error at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 57, near "}"
syntax error at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 84, near "$env{"
Global symbol "$cfg" requires explicit package name at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 84.
Global symbol "@jobs" requires explicit package name at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 89.
Global symbol "$prjDir" requires explicit package name at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 95.
Global symbol "$env" requires explicit package name at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 101.
Global symbol "$env" requires explicit package name at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 115.
Global symbol "$cfg" requires explicit package name at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 115.
Global symbol "$cfg" requires explicit package name at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 121.
syntax error at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 121, near "$cfg{"
/home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm has too many errors.
Compilation failed in require at ./necat.pl line 13.
BEGIN failed--compilation aborted at ./necat.pl line 13.
First I installed Perl using
sudo yum install perl
perl -v shows that we have 5.16:
This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi
(with 41 registered patches, see perl -V for more detail)
Copyright 1987-2012, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
To update to 5.26, I found this page:
# 1. Install a package with repository for your system:
# On CentOS, install package centos-release-scl available in CentOS repository:
$ sudo yum install centos-release-scl
# On RHEL, enable RHSCL repository for you system:
# $ sudo yum-config-manager --enable rhel-server-rhscl-7-rpms
# 2. Install the collection:
$ sudo yum install rh-perl526
# 3. Start using the software collection:
$ scl enable rh-perl526 bash
and then I checked perl -v and, bingo, I got:
This is perl 5, version 26, subversion 3 (v5.26.3) built for x86_64-linux-thread-multi
(with 27 registered patches, see perl -V for more detail)
Copyright 1987-2018, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
so I ran
./necat.pl
and I got this
Smartmatch is experimental at /home/kplee/programs/NECAT/Linux-amd64/bin/Plgd/Project.pm line 263.
Usage: necat.pl correct|assemble|bridge|config cfg_fname
correct: correct rawreads
assemble: generate contigs
bridge: bridge contigs
config: generate default config file
Out.