/dbgphmm

Probabilistic genome assembler based on de Bruijn Graph and profile HMM

Primary LanguageRust

dbgphmm

Bayesian genome assembler using de Bruijn graph and profile HMM

Build

Requires nightly rust

cargo build --release
./target/release/dbgphmm -h

Infer

To infer the L-DBG of the read length L, first we construct the initial k0-DBG from the reads by the draft subcommand.

% ./target/release/dbgphmm draft --help
dbgphmm-draft
Construct draft DBG from reads

USAGE:
    dbgphmm draft [OPTIONS] -k <K> -M <MIN_DEADEND_COUNT> -G <GENOME_SIZE> --dbg-output <DBG_OUTPUT> <READ_FASTA>

ARGS:
    <READ_FASTA>    Input read FASTA filename

OPTIONS:
    -d, --dbg-output <DBG_OUTPUT>    Output dbg filename
    -g, --gfa-output <GFA_OUTPUT>    Output GFA of dbg filename
    -G <GENOME_SIZE>                 Expected size (total number of bases) of target genome
    -h, --help                       Print help information
    -k <K>                           k of DBG
    -m <MIN_COUNT>                   Minimum occurrence of k-mers in read [default: 2]
    -M <MIN_DEADEND_COUNT>           Minimum occurrence of deadend k-mers in read
    -p <P_ERROR>                     Expected error rate of reads. If not specified, it will use
                                     p=0.001 (0.1%) HiFi error rate [default: 0.001]
    -P <N_HAPLOTYPES>                Expected number of haplotypes in target genome if known

Then infer the k-DBG (k=k0,...,L) by infer subcommand.

% ./target/release/dbgphmm infer --help
dbgphmm-infer

USAGE:
    dbgphmm infer [OPTIONS] --dbg-input <DBG_INPUT> --output-prefix <OUTPUT_PREFIX> -K <K_MAX> -G <GENOME_SIZE> -S <GENOME_SIZE_SIGMA> <READ_FASTA>

ARGS:
    <READ_FASTA>    Input read FASTA filename

OPTIONS:
    -c <MAX_CYCLE_SIZE>
            [default: 1000]

    -d, --dbg-input <DBG_INPUT>
            Filename of initial DBG

    -e <P_INFER>
            Error rate of reads while inference [default: 0.00001]

    -G <GENOME_SIZE>
            Expected size (total number of bases) of target genome

    -h, --help
            Print help information

    -I <MAX_ITER>
            Maximum number of iteration of posterior sampling of single k [default: 50]

    -K <K_MAX>
            Target k of DBG

    -o, --output-prefix <OUTPUT_PREFIX>
            Prefix of output files used as a working directory

    -p <P_ERROR>
            Expected error rate of reads. If not specified, it will use p=0.001 (0.1%) HiFi error
            rate [default: 0.001]

        --p0 <P0>
            If probability of copy number being zero is above p0, the k-mer will be regarded as
            zero-copy and purged. In default, 0.8 will be used. Larger p0, bigger the graph
            [default: 0.8]

    -S <GENOME_SIZE_SIGMA>
            Expected size (total number of bases) sigma (standard deviation) of target genome

To infer DBG with HiFi reads with the expected genome size G and the std var of the genome size S, we recommend the following settings.

export OMP_NUM_THREADS=1
./target/release/dbgphmm draft -k 40 -G $G -m 2 -M 5 -p 0.0003 -P 2 -d out.dbg reads.fa
./target/release/dbgphmm infer -K 20000 -G $G -S $S -e 0.0003 --p0 0.99 -d out.dbg -o out reads.fa

Simulation experiments

# generate synthetic genome and reads
./target/release/draft -h
# infer and evaluate by the true genome
./target/release/infer -h

To see the actual usage, see ./scripts/sim.sh.

Outputs

Intermediate results

  • ${prefix}.k${k}.gfa GFA output of the inferred DBG and copy numbers
  • ${prefix}.k${k}.inspect INSPECT file describing the sampled copy numbers
  • ${prefix}.k${k}.dbg serialized DBG
  • ${prefix}.k${k}.post (for internal use) dump of posterior distribution

Final

  • ${prefix}.final.gfa Final L-DBG. This corresponds to unitig graphs.
  • ${prefix}.final.euler.fa A candidate genome corresponding to an Eulerian circuit. This corresponds to pseudo-haplotype contigs.

File formats

INSPECT

# comment
# G section: graph property summary
10000   G       n_edges_full    103992
10000   G       n_edges_compact 7
10000   G       n_nodes_full    103990
10000   G       n_nodes_compact 5
10000   G       n_emittable_edges       86006
10000   G       degree_stats    {(1, 1): 103986, (2, 1): 2, (1, 2): 2}
# C section: sampled copy number assignments and scores
# size of k-mer
#               sample id
#                       P(X|R) posterior probability
#                              P(R|X) likelihood
#                                                P(G) of genome size
#                                                                |G_D(X)| the number of eulerian circuits
#                                                                            |G| genome size
#                                                                                  difference to true copy numbers (if available)
#                                                                                      history of applied update cycles
#                                                                                                   copy number assignment
#                                                                                                          Optional data in JSON
10000   C       0       0.8    -14622.5404367    -9.43613172     0.001       800   0   [S(e2-e5-)]  [2,1]  {}
10000   C       1       0.2    -14622.0570281    -11.4357317     0.002       900   2   []           [2,2]  {}
# ...
# E section: posterior probabilities of each edges
# size of k-mer
#               edge id
#                       true copy number (if available)
#                               expected copy number
#                                       P(X(e)=X*(e)|R) i.e., probability of copy number being true
#                                               P(X(e)=0|R) i.e., probability of copy number being zero
#                                                       marginalized posterior distribution
10000   E       e0      2       2.00000 0.999   0.00000 p(1)=0.000,p(2)=1.000,p(3)=0.000
10000   E       e1      1       1.00000 0.999   0.00000 p(0)=0.000,p(1)=1.000,p(2)=0.000
# ...

DBG

# #: comment
# K section: k of DBG
K       4
# N section: node = (k-1)-mer
#       id      sequence of k-1-mer
N       0       nnn
N       1       CAG
# E section: edge = simple path of k-mers
#       id      source  target  joined sequence of k-mers
#                                               current copy number
#                                                       corresponding base ids (internal representation)
E       0       1       0       CAGGAAnnn       1       9,10,11,12,13,14
E       1       1       1       CAGCAG  3       6,7,8
E       2       0       1       nnnTCCCAG       1       0,1,2,3,4,5

Citation

Bayesian genome assembly of segmental duplications by inferring k-mer copy numbers in de Bruijn graphs, WIP

WIP: python binding

source .env/bin/activate
maturin build
pip install --force-reinstall target/wheels/dbgphmm-0.1.0-cp310-cp310-macosx_11_0_arm64.whl
python -c 'import dbgphmm; print(repr(dbgphmm.sum_as_string(1, 2)));'