Bayesian genome assembler using de Bruijn graph and profile HMM
Requires nightly rust
cargo build --release
./target/release/dbgphmm -h
To infer the L-DBG of the read length L, first we construct the initial k0-DBG from the reads by the draft
subcommand.
% ./target/release/dbgphmm draft --help
dbgphmm-draft
Construct draft DBG from reads
USAGE:
dbgphmm draft [OPTIONS] -k <K> -M <MIN_DEADEND_COUNT> -G <GENOME_SIZE> --dbg-output <DBG_OUTPUT> <READ_FASTA>
ARGS:
<READ_FASTA> Input read FASTA filename
OPTIONS:
-d, --dbg-output <DBG_OUTPUT> Output dbg filename
-g, --gfa-output <GFA_OUTPUT> Output GFA of dbg filename
-G <GENOME_SIZE> Expected size (total number of bases) of target genome
-h, --help Print help information
-k <K> k of DBG
-m <MIN_COUNT> Minimum occurrence of k-mers in read [default: 2]
-M <MIN_DEADEND_COUNT> Minimum occurrence of deadend k-mers in read
-p <P_ERROR> Expected error rate of reads. If not specified, it will use
p=0.001 (0.1%) HiFi error rate [default: 0.001]
-P <N_HAPLOTYPES> Expected number of haplotypes in target genome if known
Then infer the k-DBG (k=k0,...,L) by infer
subcommand.
% ./target/release/dbgphmm infer --help
dbgphmm-infer
USAGE:
dbgphmm infer [OPTIONS] --dbg-input <DBG_INPUT> --output-prefix <OUTPUT_PREFIX> -K <K_MAX> -G <GENOME_SIZE> -S <GENOME_SIZE_SIGMA> <READ_FASTA>
ARGS:
<READ_FASTA> Input read FASTA filename
OPTIONS:
-c <MAX_CYCLE_SIZE>
[default: 1000]
-d, --dbg-input <DBG_INPUT>
Filename of initial DBG
-e <P_INFER>
Error rate of reads while inference [default: 0.00001]
-G <GENOME_SIZE>
Expected size (total number of bases) of target genome
-h, --help
Print help information
-I <MAX_ITER>
Maximum number of iteration of posterior sampling of single k [default: 50]
-K <K_MAX>
Target k of DBG
-o, --output-prefix <OUTPUT_PREFIX>
Prefix of output files used as a working directory
-p <P_ERROR>
Expected error rate of reads. If not specified, it will use p=0.001 (0.1%) HiFi error
rate [default: 0.001]
--p0 <P0>
If probability of copy number being zero is above p0, the k-mer will be regarded as
zero-copy and purged. In default, 0.8 will be used. Larger p0, bigger the graph
[default: 0.8]
-S <GENOME_SIZE_SIGMA>
Expected size (total number of bases) sigma (standard deviation) of target genome
To infer DBG with HiFi reads with the expected genome size G
and the std var of the genome size S
, we recommend the following settings.
export OMP_NUM_THREADS=1
./target/release/dbgphmm draft -k 40 -G $G -m 2 -M 5 -p 0.0003 -P 2 -d out.dbg reads.fa
./target/release/dbgphmm infer -K 20000 -G $G -S $S -e 0.0003 --p0 0.99 -d out.dbg -o out reads.fa
# generate synthetic genome and reads
./target/release/draft -h
# infer and evaluate by the true genome
./target/release/infer -h
To see the actual usage, see ./scripts/sim.sh
.
${prefix}.k${k}.gfa
GFA output of the inferred DBG and copy numbers${prefix}.k${k}.inspect
INSPECT file describing the sampled copy numbers${prefix}.k${k}.dbg
serialized DBG${prefix}.k${k}.post
(for internal use) dump of posterior distribution
${prefix}.final.gfa
Final L-DBG. This corresponds to unitig graphs.${prefix}.final.euler.fa
A candidate genome corresponding to an Eulerian circuit. This corresponds to pseudo-haplotype contigs.
# comment
# G section: graph property summary
10000 G n_edges_full 103992
10000 G n_edges_compact 7
10000 G n_nodes_full 103990
10000 G n_nodes_compact 5
10000 G n_emittable_edges 86006
10000 G degree_stats {(1, 1): 103986, (2, 1): 2, (1, 2): 2}
# C section: sampled copy number assignments and scores
# size of k-mer
# sample id
# P(X|R) posterior probability
# P(R|X) likelihood
# P(G) of genome size
# |G_D(X)| the number of eulerian circuits
# |G| genome size
# difference to true copy numbers (if available)
# history of applied update cycles
# copy number assignment
# Optional data in JSON
10000 C 0 0.8 -14622.5404367 -9.43613172 0.001 800 0 [S(e2-e5-)] [2,1] {}
10000 C 1 0.2 -14622.0570281 -11.4357317 0.002 900 2 [] [2,2] {}
# ...
# E section: posterior probabilities of each edges
# size of k-mer
# edge id
# true copy number (if available)
# expected copy number
# P(X(e)=X*(e)|R) i.e., probability of copy number being true
# P(X(e)=0|R) i.e., probability of copy number being zero
# marginalized posterior distribution
10000 E e0 2 2.00000 0.999 0.00000 p(1)=0.000,p(2)=1.000,p(3)=0.000
10000 E e1 1 1.00000 0.999 0.00000 p(0)=0.000,p(1)=1.000,p(2)=0.000
# ...
# #: comment
# K section: k of DBG
K 4
# N section: node = (k-1)-mer
# id sequence of k-1-mer
N 0 nnn
N 1 CAG
# E section: edge = simple path of k-mers
# id source target joined sequence of k-mers
# current copy number
# corresponding base ids (internal representation)
E 0 1 0 CAGGAAnnn 1 9,10,11,12,13,14
E 1 1 1 CAGCAG 3 6,7,8
E 2 0 1 nnnTCCCAG 1 0,1,2,3,4,5
Bayesian genome assembly of segmental duplications by inferring k-mer copy numbers in de Bruijn graphs, WIP
source .env/bin/activate
maturin build
pip install --force-reinstall target/wheels/dbgphmm-0.1.0-cp310-cp310-macosx_11_0_arm64.whl
python -c 'import dbgphmm; print(repr(dbgphmm.sum_as_string(1, 2)));'