This directory contains the pipeline used to benchmark hybracter.
I used Snaketool to make this pipeline (and also hybracter
), and would highly recommend it!
To use this repository, you will need to change the 2 csv
input files in this directory with the correct FASTQ paths of the required files on your system.
You can get the FASTQs from the Zenodo repository here or follow the instructions in get_fastqs.md.
# creates conda env
mamba create -n hybracter_benchmarking pip
conda activate hybracter_benchmarking
# download the repository
git clone "https://github.com/gbouras13/hybracter_benchmarking"
cd hybracter_benchmarking
pip install -e .
hybracter_benchmarking --help
hybracter_benchmarking install
# to disable GPU on my system for a fair bechmarking with medaka
export CUDA_VISIBLE_DEVICES=""
cd ..
hybracter_benchmarking assemble-real --input real_assemble.csv --bulk_lerminiaux_csv bulk_assemble_lerminiaux.csv --bulk_lerminiaux_config bulk_assemble_lerminiaux_config.yaml --output ../hybracter_benchmarking_results/real_results --threads 16 --cores 16
- This needs to be done after the
hybracter_benchmarking assemble-real
is complete or else it will not work - The same input csv and output directory needs to be used as
hybracter_benchmarking assemble-real
hybracter_benchmarking assess-real --input real_assemble.csv --output ../hybracter_benchmarking_results/real_results --threads 16 --cores 16
-
You should get the following output directories (as per the output you can find here):
BENCHMARKS
- contains the time etc benchmarking for each run (sample x tool)DNADIFF
- contains raw chromosome Dnadiff results for each run (sample x tool)DNADIFF_PARSED_OUTPUT
- contains parsed chromosome Dnadiff results for each sampleDNADIFF_PLASMIDS
- contains plasmid Dnadiff results for each run (sample x tool)DNADIFF_PARSED_OUTPUT_PLASMID
- contains parsed plasmid Dnadiff results for each sampleREAL
- this contains all the actual output for each assembler. The following 5 directories will contain the all the raw output with subdirectories for each sample:HYBRACTER_HYBRID_OUTPUT
HYBRACTER_LONG_OUTPUT
DRAGONFLYE_HYBRID_OUTPUT
DRAGONFLYE_LONG_OUTPUT
UNICYCLER_OUTPUT
- Additionally, it should have
HYBRACTER_HYBRID_OUTPUT_REAL_BULK
- this contains the output for the 12 Lerminiaux et al isolates assembled usinghybracter hybrid
with modified config filebulk_assemble_lerminiaux_config.yaml
.
-
It will also contain a number of other subdirectories
_SUMMARIES
,_PLASMIDS
,_CHROMOSOMES
with parsed summary outputs and parsed specific plasmids and chromosome assemblies for Unicycler and Dragonflye (this made the assessment a lot easier and automated).
genomes
directory contains the reference genomesreference_genome_chromosomes
directory contains the reference chromosomes. The old ATCC references for the 5 ATCC strains (which were replaced by the curated Trycycler references in v2 of the preprint) are labelled as_atcc.fasta
reference_genome_plasmids
directory contains the reference plasmids.parse_genomes.py
was used to generate per contig and overall genome lengths for all reference genomes (output contained in thegenome_info
directory)python parse_genomes.py -d genomes -o genome_info
Graphs_Ghais
contains R script and data used to make plots for the Hybracter manuscript.bulk_assemble_lerminiaux_config.yaml
is the modified config file used on my system to benchmarkhybracter hybrid
showing the efficiency benefits ofhybracter
when run on multiple samples.fast_analysis
contains the benchmarking scripts and input for the extra 5 fast simplex model basecalled ATCC samples added in v2 of the preprint. See theREADME.md
inside for more informationduplex_analysis
contains the benchmarking scripts and input for the extra 5 duplex super-accuracy model basecalled ATCC samples added in v2 of the preprint. See theREADME.md
inside for more information.depth_analysis
contains the benchmarking scripts and input for the depth analysis added to v2 of the preprint (on Lerminiaux Isolate B). See theREADME.md
inside for more information.atcc_trycycler_comparisons
contains the output of comparing the ATCC reference assemblies to the curated Trycycler assemblies of the same data (Supplementary Table 13 of v2 of the preprint)- Note for
ATCC_17802_atcc.fasta
, the second chromosome was reoriented to begin with therepA
gene usingdnaapler all v0.7.0
to allow for comparison withcompare_assemblies.py
. - These were run as follows:
- Note for
mamba create -n compare_assembliesENV mappy edlib
conda activate compare_assembliesENV
python compare_assemblies.py
python compare_assemblies.py reference_genome_chromosomes/ATCC_10708.fasta reference_genome_chromosomes/ATCC_10708_atcc.fasta > atcc_trycycler_comparisons/ATCC_10708_comparison.txt 2>&1
python compare_assemblies.py reference_genome_chromosomes/ATCC_17802.fasta reference_genome_chromosomes/ATCC_17802_atcc.fasta > atcc_trycycler_comparisons/ATCC_17802_comparison.txt 2>&1
python compare_assemblies.py reference_genome_chromosomes/ATCC_25922.fasta reference_genome_chromosomes/ATCC_25922_atcc.fasta > atcc_trycycler_comparisons/ATCC_25922_comparison.txt 2>&1
python compare_assemblies.py reference_genome_chromosomes/ATCC_33560.fasta reference_genome_chromosomes/ATCC_33560_atcc.fasta > atcc_trycycler_comparisons/ATCC_33560_comparison.txt 2>&1
python compare_assemblies.py reference_genome_chromosomes/ATCC_BAA_679.fasta reference_genome_chromosomes/ATCC_BAA_679_atcc.fasta > atcc_trycycler_comparisons/ATCC_BAA_679_comparison.txt 2>&1
jkd6159_reference_comparisons
contains the output comparing the old 2010 Chua et al JKD6159 reference and the perfect 2023 Wick et al reference.- The reason this is included is that a reviewer of v1 of the manuscript noticed that a couple of samples consistently had worse performance (for all assemblers) than the rest.
- It turns out these were the JKD6159 R9 and R10 samples, and it was because we erroneously used the old 2010 reference instead of the perfect 2023 one.
- When we included the correct reference (v2), the number of errors was greatly reduced
- Running
compare_assemblies.py
on the two assemblies shows 8,255 differences (!), though this is inflated by a few large insertions/deletions
python compare_assemblies.py jkd6159_reference_comparisons/JDK6159_Chua_2010.fasta jkd6159_reference_comparisons/JKD6159_Wick_2023.fasta > jkd6159_reference_comparisons/JKD6159_comparison.txt 2>&1
Usage: hybracter_benchmarking [OPTIONS] COMMAND [ARGS]...
hybracter_benchmarking For more options, run: hybracter_benchmarking
--help
Options:
-v, --version Show the version and exit.
-h, --help Show this message and exit.
Commands:
install install hybracter hybracter_benchmarking
assemble-real assemble real reads hybracter_benchmarking
assess-real asssess real assembly output in hybracter_benchmarking
config Copy the system default config file
citation Print the citation(s) for this tool