PlasBin-flow is a tool aimed at extracting from the assembly graph (in GFA format) of a bacterial isolate, groups of contigs (called plasmid bins), each groups being assumed to originate from one of a plasmid of the analyzed isolate.
PlasBin-flow is based on a Mixed-Integer Linear Programming (MILP) formulation that is described in the paper PlasBin-flow: a flow-based MILP algorithm for plasmid contigs binning.
PlasBin-flow can be installed from this repository
git clone https://github.com/cchauve/PlasBin-flow.git
PlasBin-flow is composed of two main scripts:
- plasbin_flow.py is the main script used to process an assembly graph.
- plasbin_utils.py is a script used to tune PlasBin-flow parameters given a reference dataset and preprocess a set of samples prior to processing them with PlasBin-flow.
PlasBin-flow is written in python (version 3.9.11+);
- plasbin_flow.py requires
- the python module networkX (version 2.7+),
- the ILP solver gurobi (version 9.1.2+).
- plasbin_utils.py requires the python modules
plasbin_utils.py also requires BLAST+ v2.6.0 (commands makeblastdb and blastn).
We strongly recommand to run PlasBin-flow using a dedicated python virtual environment (see https://docs.python.org/3.9/library/venv.html).
PlasBin-flow uses the Gurobi Solver. To use PlasBin-flow, a Gurobi license is needed, which is free for academics.
If you plan to use PlasBin-flow on an institutional High-Performance Computing system (such as a university cluster or a cluster of the Digital Research Alliance of Canada), we recommend that you contact a system administrator as it is lilely that such a license is already available.
Alternatively, to be used on a local system or on a local computer (all experiments in the paper PlasBin-flow: A flow-based MILP algorithm for plasmid contigs binning were ran on a laptop computer), you can obtain and install a free academic license following the instructions provided at Gurobi: Always Free for Academics.
PlasBin-flow requires the following input files (described in details below) to process a bacterial isolate:
- assembly graph file,
- contigs plasmid score file,
- GC content probabilities file.
It also requires two numerical parameters defining the seed contigs, that are contigs that are more likely to belong to a plasmid (see section 2.4 of the PlasBin-flow paper).
The assembly graph of a bacterial sample must be the provided as a
file in GFA1
format similar to the GFA files generated by the assembler Unicycler (file
assembly.gfa
) with a tag dp
encoding the normalized coverage,
of contigs or to the GFA files generated by the
gfa-connector
command from the assembler SKESA
with a tag KC
encoding K-mer counts that is used to compute the
normalized coverage of contigs.
PlasBin-flow scripts require input GFA files to be gzipped.
This file is a TSV file with at least two fields, the first one being the names of contigs of the assembly graph and the second a floating number between 0 and 1 that is used as an a priori measure of plasmidness of the assembly contigs. Additional fields can be present but are not used.
The plasmid score of a contig can be obtained using a plasmid contig classifier, such as plASgraph2.
As some plasmid contigs classifiers do not provide a score, or provide an unreliable score, for short contigs, PlasBin-flow handles such cases as follow:
- contigs present in the GFA file but not in the plasmid scores
file: such contigs receive a default plasmid score (optional parameter
default_pls_score
, default value: 0.5).; - for short contigs (defined as contigs of length strictly below the
optional parameter
min_ctg_len
(default value 0)) receive the default plasmid score (as above).
Alternatively, PlasBin-flow can compute this plasmid score using the notion of plasmid gene density, described in the PlasBin-flow paper (See section Preprocessing below).
As described in section 2.5 of the PlasBin-flow paper, GC content plays an important tole in the PlasBin-flow optimization model. PlasBin-flow relies on a set of relative GC content intervals (a partition of [0,1] into intervals, each defining a range of relative GC content, called a GC interval) and requires that each contig c is given, for each GC interval I a probability that c originates from a molecule of relative GC content in the range defined by I.
By default the considered GC intervals of PlasBin-flow are [0,0.4], [0.4,0.45], [0.45,0.5], [0.5,0.55], [0.55,0.6], [0.6,1] but customs intervals can be used. Given a GC intervals file and an assembly graph, the GC content probabilities file can be generated by the preprocessing commands of plasbin_utils.py.
The seed parameters are the minimum length l and the minimum plasmid score ps: any contig of length at least l (default value = 2650) and plasmid score at least ps (default value = 0.58) is a seed contig.
To process a sample, given the input files described above, the command plasbin_flow.py can be used as follows:
python code/plasbin_flow.py
-ag assembly graph
-gc contigs gc_content_file
-ps contigs plasmid scores file
-outdir output_dir
-outfile output_file
-log_file
[-seed_len seeds length threshold]
[-seed_score seeds plasmid score threshold]
[-assembler assembler used to create the GFA file]
[-gc_intervals GC content intervals file]
where
assembly graph
is the gzipped GFA1 format assembly graph file,contigs gc content file
is the GC content probability file described abovecontigs plasmid score file
is the contigs plasmid scores file described above,output_dir
andoutput_file
are the directory and file name for the output file,log_file
is the path to a file containing detailed log information and warnings,- [optional]
assembler
can be eitherunicycler
(default value, for GFA files with adp
tag encoding normalized depth coverage) orskesa
(for GFA files with aKC
tag encoding k-mer count used to infer the normalized depth coverage), - [optional]
seeds length threshold
andseeds plasmid score threshold
are the length and plasmid score thresholds defining seeds (default values, used in the PlasBin-flow paper experiments,2650
and0.58
), - [optional]
GC content intervals file
describes the GC content intervals used to compute GC content probability scores for contigs (defaut intervals, used in the PlasBin-flow paper experiments, are the ones in the file gc_intervals.txt.
The output of PlasBin-flow is a TSV file with each line containing the following information:
Plasmid bin Number (ID) associated with the plasmid bin
Flow value Flow value associated with the plasmid bin (used as a proxy for the plasmid bin copy number)
GC bin Index of GC content interval associated with the plasmid bin
Contigs Comma-separated list of contigs associated with plasmid bin along with their multiplicities
Thus a line in an output file from PlasBin-flow is as follows:
#Pls_ID Flow GC_bin Contigs
P1 2.5 0.4-0.45 a:2,b:3,c:2,d:1
Additional optional parameters can be passed to the script plasbin_flow.py that allows to modify the optimization model of PlasBin-flow:
-rmiter_max Maximum number of iterations to remove circular components. (integer, default: 50)
-alpha1 Weight of flow term of the objective function. (float, default: 1)
-alpha2 Weight of GC content term of the objective function. (float, default: 1)
-alpha3 Weight of plasmid score term of the objective function. (float, default: 1)
-p Plasmid score offset (float in [0,1], default: 0.5)
-min_pls_len Minimum plasmid length (integer, default: 1500); plasmid bins of smaller length are not reported
-default_pls_score Default plasmid score for short contigs or contigs not present in the the plasmid score file (float, default: 0.5)
-min_ctg_len Contigs of length below min_ctg_len receive plasmid score default_pls_score (int, default: 0)
-gurobi_mip_gap MIPGap parameter of Gurobi (optimality gap, float, default value: 0.05)
-gurobi_time_limit Time limit after which Gurobi stops (integer, default: 2400)
The default values are the values used in the experiments described in the PlasBin-flow paper.
Given the set of assembly graphs of a set of samples, the script plasbin_utils.py can be used to preprocess them in order to generate the plasmid scores ad GC content probabilities files for each sample.
Also, PlasBin-flow requires parameters (for example the thresholds defining seed contigs) and auxiliary files (such as GC intervals), that can be generated using a tuning dataset of samples for which true plasmids are known, using plasbin_utils.py.
We describe below the relevant commands of plasbin_utils.py. Examples are provided in example/tuning, example/preprocessing and example/plasbin-flow.
The script plasbin_utils.py takes as main input a CSV file describing a set of samples, either to be preprocessed prior to be analyzed with plasbin_flow.py or to be used to tune PlasBin-flow parameters.
This file must contain a header and, depending on the command used, can include the following fields:
sample
: sample name;gfa
: path to the gzipped GFA file of the sample;pls_fasta
: path to a gzipped FASTA file containing the true plasmids of the sample, one per FASTA entry;chr_fasta
: path to a gzipped FASTA file containing the chromosome(s) of the sample, one per FASTA entry;pls_score
: path to a plasmid score file (as described above) for the sample;gc_probabilities
: path to a GC probability file (as described above) for the sample;genes2ctgs_mappings
: path to a BLAST format 6 containing the result of mapping a set of reference plasmid genes (see Section "Reference plasmid genes database" below) onto the assembly graph contigs;ground_truth
: path to a file describing the ground truth for a sample, in TSV format with two fields (true plasmid name, contig name).
Additionally, all commands of plasbin_utils.py take three additional requird parameters:
out_dir
: directory where final results files are written;tmp_dir
: directory where temporary files are written (deleted unless the paramater--keep_tmp_dir
is used);log_file
: optional file containing detailed log information.
The command to create the GC content probabilities files for preprocessing a set of samples is
python plasbin_utils.py gc_probabilities
--input_file input_file
--out_dir out_dir
--tmp_dir tmp_dir
[--log_file log file]
[--gc_intervals gc_intervals_file]
[--out_file out_file]
The input file input_file
must contain the fields sample
and gfa
.
The optional file gc_intervals_file
is a file that describes the GC
intervals, with one interval boundary per line (so its first line must
be 0, its last line must be 1 and in between the values must be
increasing, an example is provided in
gc_intervals.txt. If it is not
povided, the default intervals defined by boundaries
[0, 0.4, 0.45, 0.5, 0.55, 0.6, 1]
are used.
The file created for a sample that contains the GC content
probabilities is the file <out_dir>/<sample name>.gc.tsv
.
The parameter out_file
is optional; if provided, a new CSV file
out_file
is created from input_file
by adding a column
gc_probabilities
that contains the path to the computed GC
probabilities file.
Similar to the experiments described in the PlasBin-flow paper, the plasmid score of a contig can be generated using the notion of gene density: given a reference database of plasmid genes (see Section ""Reference plasmid genes database" below), the gene density of a contig is defined as the proportion of its length covered by mappings obtained by mapping the reference plasmid genes to the assembly contigs.
The command to compute the contig plasmid gene density for a set of samples is
python plasbin_utils.py gene_density
--input_file input_file
--out_dir out_dir
--tmp_dir tmp_dir
[--log_file log file]
[--out_file out_file]
[--pid_threshold p]
[--cov_threshold c]
where
input_file
must contain the fieldssample
,gfa
andgenes2ctgs_mappings
;p
andc
are mappings quality threshold: every mapping of percent identity belowp
(default=0.95) and covering less thanc
(default=0.8) of the contig it maps to is discarded;out_file
is an optional parameter; if provided, a new CSV fileout_file
is created frominput_file
by adding a columnpls_score
that contains the path to the gene density file created for each sample:<out_dir>/<sample name>.gd.tsv
.
The command to create the mapping file for each sample is
python plasbin_utils.py map_genes_to_ctgs
--input_file input_file
--out_dir out_dir
--tmp_dir tmp_dir
[--log_file log file]
--db_file pls_db_file
[--out_file out_file]
where
input_file
must contain the fieldssample
andgfa
;pls_db_file
is the path to a reference plasmid genes database file (FASTA file, see genes.fasta for the plasmid genes database used in the PlasBin-flow paper);out_file
is an optional parameter; if provided, a new CSV fileout_file
is created frominput_file
by adding a columngenes2ctgs_mappings
that contains the path to the mappings file created for each sample:<out_dir>/<sample name>.genes_mappings.txt
.
Finally, both preprocessing steps (computing GC content probabilities and gene density) can be done at once using
python plasbin_utils.py preprocessing
--input_file input_file
--out_dir out_dir
--tmp_dir tmp_dir
[--log_file log file]
--out_file out_file
--db_file pls_db_file
[--gc_intervals gc_intervals_file]
[--pid_threshold p]
[--cov_threshold c]
where
input_file
must contain the fieldssample
andgfa
;- other parameters are as described above;
out_file
is a new CSV file created frominput_file
by adding columnsgc_probabilities
andpls_score
.
PlasBin-flow requires auxiliary data files and parameters in order to process samples:
- reference plasmid genes database used to compute the mapping file and the gene density,
- GC content intervals used to compute the GC content probabilities file,
- parameters defining seed contigs.
PlasBin-flow comes with default files/values for these (see the PlasBin-flow paper for a description of how these were obtained):
- reference plasmid genes database: genes.fasta
- GC content intervals: gc_intervals.txt
- seed contigs parameters: default value, length=2650, plasmid score=0.58.
Alternatively, given a reference set of samples for which the true plasmids are known, the script plasbin_utils.py can be used to compute these parameters and auxiliary files.
The command
python plasbin_utils.py pls_genes_db
--input_file input_file
--out_dir out_dir
--tmp_dir tmp_dir
[--log_file log file]
creates a reference plasmid genes database out_dir/pls.genes.fasta
.
The file input_file
must contain the fields sample
, gfa
(so
tuning samples need to be re-assembled from reads data) and
pls_fasta
where each plasmid entry must be named using the plasmid
GenBank or RefSeq accession.
The command
python plasbin_utils.py gc_intervals
--input_file input_file
--out_dir out_dir
--tmp_dir tmp_dir
[--log_file log file]
[--n_gcints n_gcints]
computes a GC content intervals file out_dir/gc.txt
.
The file input_file
must contain the fields sample
, pls_fasta
and chr_fasta
.
The optional parameter n_gcints
(default value 6) is the number of
intervals.
The command generates three files:
out_dir/gc.csv
: details of the GC content of all plasmids and chromosomes;out_dir/gc.png
: violin plot ofout_dir/gc.csv
out_dir/gc.txt
: GC contentn_gcints
intervals boundaries, obtained by splitting into equal range intervals the GC content observed in the plasmids of the reference samples.
We encourage a user too look at the provided files out_dir/gc.csv
and out_dir/gc.png
to assess the GC intervals file and modify it if
deemed relevant.
For a set of samples for which true plasmids are known, the command
python plasbin_utils.py ground_truth
--input_file input_file
--out_dir out_dir
--tmp_dir tmp_dir
[--log_file log file]
[--out_file out_file]
[--pid_threshold p]
[--cov_threshold c]
computes, for each sample, a TSV file out_dir/<sample name>.ground_truth.tsv
that records true plasmid bins.
True plasmid bins for a sample are defined by mapping the sample
contigs to the true plasmids, and discarding any mapping with identity
below pid_threshold p
(default value = 0.95) and contig coverage below
--cov_threshold c
(default value = 0.8).
The file input_file
must contain the fields sample
, gfa
, and pls_fasta
.
The resulting ground truth file out_dir/<sample name>.ground_truth.tsv
lists contigs mapped to plasmids associated
with the sample. A line in the ground truth TSV file should contain
the name of the plasmid in the first column and a contig that
has been mapped to the plasmid in the second column. The file can
contain other information as long as the first two columns contain the
plasmid and contig names respectively.
P1 C1
P1 C2
P2 C3
...
If the optional parameter out_file
is provided it contains the path
to a new CSV file obtained from input_file
augmented by a column
ground_truth
that contains the path to the ground truth file
for each sample.
PlasBin-flow takes two parameters as input as the contig length and plasmid score thresholds to decide which contig from a given sample can be designated as seeds. Each plasmid bin output by PlasBin-flow should contain at least one seed contig. Seed parameters are decided by observing the length and plasmid score of contigs from the set of tuning samples, based on their mappings to the true plasmids.
The command
python plasbin_utils.py seeds
--input_file input_file
--out_dir out_dir
--tmp_dir tmp_dir
[--log_file log file]
computes a file out_dir/seeds.tsv
that contains the optimal seed
contigs parameters.
Each line of the out_dir/seeds.tsv
file contains 6 tab-separated fields
(plasmid score, length, seed score, number of plasmids with seeds,
number of false seeds, number of plasmid without seed):
- plasmid score and length are the thresholds values,
- seed score defined as the difference of the next 2 fields described below,
- number of reference plasmids with seeds is the number of true plasmids with at least one contig identified as a seed based on the thresholds,
- number of false seeds is the number of contigs that meets the thresholds but are not assigned to a reference plasmid in their sample,
- number of plasmid with no seed.
A file out_dir/seeds_all.tsv
in the same format contains the sam results
but for all thresholds combinations, not only the ones with the best seed score.
The file input_file
must contain the fields sample
, gfa
,
ground_truth
and pls_score
.
Finally, all tuning steps (but building the reference plasmid genes database) can be done at once with the command
python plasbin_utils.py tuning
--input_file input_file
--out_dir out_dir
--tmp_dir tmp_dir
[--log_file log file]
--db_file pls_db_file
[--out_file out_file]
[--pid_threshold p]
[--cov_threshold c]
[--n_gcints n_gcints]
The file input_file
must contain the fields sample
, gfa
, pls_fasta
and chr_fasta
.
The parameters pid_threshold,cov_threshold,n_gcints
are as described above.
The file pls_db_file
is the FASTA file containing the reference
plasmid genes database.
This command creates the files out_dir/gc.[txt,png,csv]
,
out_dir/gc_intervals.txt
, out_dir/seeds.tsv
,
out_dir/pls.genes.fasta
, and for each sample out_dir/<sample name>.genes_mappings.txt
and out_dir/<sample name>.ground_truth.tsv
.
If the optional paramater out_file
is provided it contains the path
to a new CSV file obtained from input_file
augmented by columns
ground_truth
, pls_score
and genes2ctgs_mappings
.