/PlasBin-flow-pangenome

PlasBin-flow fork to handle pangenomes

Primary LanguagePythonMIT LicenseMIT

PlasBin-flow pangenome

FORKED FROM:

PlasBin-flow: A flow-based MILP algorithm for plasmid contigs binning

PlasBin-flow is a tool aimed at extracting from the assembly graph (in GFA format) of a bacterial isolate, groups of contigs (called plasmid bins), each groups being assumed to originate from one of a plasmid of the analyzed isolate.

PlasBin-flow is based on a Mixed-Integer Linear Programming (MILP) formulation that is described in the paper PlasBin-flow: a flow-based MILP algorithm for plasmid contigs binning.

Installation

PlasBin-flow can be installed from this repository git clone https://github.com/cchauve/PlasBin-flow.git

PlasBin-flow is composed of two main scripts:

  • plasbin_flow.py is the main script used to process an assembly graph.
  • plasbin_utils.py is a script used to tune PlasBin-flow parameters given a reference dataset and preprocess a set of samples prior to processing them with PlasBin-flow.

Prerequisites

PlasBin-flow is written in python (version 3.9.11+);

plasbin_utils.py also requires BLAST+ v2.6.0 (commands makeblastdb and blastn).

We strongly recommand to run PlasBin-flow using a dedicated python virtual environment (see https://docs.python.org/3.9/library/venv.html).

Obtaining Gurobi license

PlasBin-flow uses the Gurobi Solver. To use PlasBin-flow, a Gurobi license is needed, which is free for academics.

If you plan to use PlasBin-flow on an institutional High-Performance Computing system (such as a university cluster or a cluster of the Digital Research Alliance of Canada), we recommend that you contact a system administrator as it is lilely that such a license is already available.

Alternatively, to be used on a local system or on a local computer (all experiments in the paper PlasBin-flow: A flow-based MILP algorithm for plasmid contigs binning were ran on a laptop computer), you can obtain and install a free academic license following the instructions provided at Gurobi: Always Free for Academics.

Running PlasBin-flow

Input files

PlasBin-flow requires the following input files (described in details below) to process a bacterial isolate:

  • assembly graph file,
  • contigs plasmid score file,
  • GC content probabilities file.

It also requires two numerical parameters defining the seed contigs, that are contigs that are more likely to belong to a plasmid (see section 2.4 of the PlasBin-flow paper).

Assembly graph

The assembly graph of a bacterial sample must be the provided as a file in GFA1 format similar to the GFA files generated by the assembler Unicycler (file assembly.gfa) with a tag dp encoding the normalized coverage, of contigs or to the GFA files generated by the gfa-connector command from the assembler SKESA with a tag KC encoding K-mer counts that is used to compute the normalized coverage of contigs.

PlasBin-flow scripts require input GFA files to be gzipped.

Contigs plasmid score file

This file is a TSV file with at least two fields, the first one being the names of contigs of the assembly graph and the second a floating number between 0 and 1 that is used as an a priori measure of plasmidness of the assembly contigs. Additional fields can be present but are not used.

The plasmid score of a contig can be obtained using a plasmid contig classifier, such as plASgraph2.

As some plasmid contigs classifiers do not provide a score, or provide an unreliable score, for short contigs, PlasBin-flow handles such cases as follow:

  • contigs present in the GFA file but not in the plasmid scores file: such contigs receive a default plasmid score (optional parameter default_pls_score, default value: 0.5).;
  • for short contigs (defined as contigs of length strictly below the optional parameter min_ctg_len (default value 0)) receive the default plasmid score (as above).

Alternatively, PlasBin-flow can compute this plasmid score using the notion of plasmid gene density, described in the PlasBin-flow paper (See section Preprocessing below).

GC content probabilities file

As described in section 2.5 of the PlasBin-flow paper, GC content plays an important tole in the PlasBin-flow optimization model. PlasBin-flow relies on a set of relative GC content intervals (a partition of [0,1] into intervals, each defining a range of relative GC content, called a GC interval) and requires that each contig c is given, for each GC interval I a probability that c originates from a molecule of relative GC content in the range defined by I.

By default the considered GC intervals of PlasBin-flow are [0,0.4], [0.4,0.45], [0.45,0.5], [0.5,0.55], [0.55,0.6], [0.6,1] but customs intervals can be used. Given a GC intervals file and an assembly graph, the GC content probabilities file can be generated by the preprocessing commands of plasbin_utils.py.

Seeds parameters

The seed parameters are the minimum length l and the minimum plasmid score ps: any contig of length at least l (default value = 2650) and plasmid score at least ps (default value = 0.58) is a seed contig.

Usage

To process a sample, given the input files described above, the command plasbin_flow.py can be used as follows:

python code/plasbin_flow.py
       -ag assembly graph
       -gc contigs gc_content_file
       -ps contigs plasmid scores file
       -outdir output_dir
       -outfile output_file
       -log_file
       [-seed_len seeds length threshold]
       [-seed_score seeds plasmid score threshold]
       [-assembler assembler used to create the GFA file]
       [-gc_intervals GC content intervals file]

where

  • assembly graph is the gzipped GFA1 format assembly graph file,
  • contigs gc content file is the GC content probability file described above
  • contigs plasmid score file is the contigs plasmid scores file described above,
  • output_dir and output_file are the directory and file name for the output file,
  • log_file is the path to a file containing detailed log information and warnings,
  • [optional] assembler can be either unicycler (default value, for GFA files with a dp tag encoding normalized depth coverage) or skesa (for GFA files with a KC tag encoding k-mer count used to infer the normalized depth coverage),
  • [optional] seeds length threshold and seeds plasmid score threshold are the length and plasmid score thresholds defining seeds (default values, used in the PlasBin-flow paper experiments, 2650 and 0.58),
  • [optional] GC content intervals file describes the GC content intervals used to compute GC content probability scores for contigs (defaut intervals, used in the PlasBin-flow paper experiments, are the ones in the file gc_intervals.txt.

The output of PlasBin-flow is a TSV file with each line containing the following information:

Plasmid bin			Number (ID) associated with the plasmid bin
Flow value			Flow value associated with the plasmid bin (used as a proxy for the plasmid bin copy number)
GC bin				Index of GC content interval associated with the plasmid bin
Contigs				Comma-separated list of contigs associated with plasmid bin along with their multiplicities

Thus a line in an output file from PlasBin-flow is as follows:

#Pls_ID	Flow	GC_bin		Contigs
P1	2.5	0.4-0.45	a:2,b:3,c:2,d:1

Advanced options

Additional optional parameters can be passed to the script plasbin_flow.py that allows to modify the optimization model of PlasBin-flow:

-rmiter_max        Maximum number of iterations to remove circular components. (integer, default: 50)
-alpha1            Weight of flow term of the objective function. (float, default: 1)                              
-alpha2            Weight of GC content term of the objective function. (float, default: 1)
-alpha3            Weight of plasmid score term of the objective function. (float, default: 1)
-p                 Plasmid score offset (float in [0,1], default: 0.5)
-min_pls_len       Minimum plasmid length (integer, default: 1500); plasmid bins of smaller length are not reported
-default_pls_score Default plasmid score for short contigs or contigs not present in the the plasmid score file (float, default: 0.5)
-min_ctg_len       Contigs of length below min_ctg_len receive plasmid score default_pls_score (int, default: 0)
-gurobi_mip_gap    MIPGap parameter of Gurobi (optimality gap, float, default value: 0.05)
-gurobi_time_limit Time limit after which Gurobi stops (integer, default: 2400)

The default values are the values used in the experiments described in the PlasBin-flow paper.

Preprocessing and tuning

Given the set of assembly graphs of a set of samples, the script plasbin_utils.py can be used to preprocess them in order to generate the plasmid scores ad GC content probabilities files for each sample.

Also, PlasBin-flow requires parameters (for example the thresholds defining seed contigs) and auxiliary files (such as GC intervals), that can be generated using a tuning dataset of samples for which true plasmids are known, using plasbin_utils.py.

We describe below the relevant commands of plasbin_utils.py. Examples are provided in example/tuning, example/preprocessing and example/plasbin-flow.

Dataset input file

The script plasbin_utils.py takes as main input a CSV file describing a set of samples, either to be preprocessed prior to be analyzed with plasbin_flow.py or to be used to tune PlasBin-flow parameters.

This file must contain a header and, depending on the command used, can include the following fields:

  • sample: sample name;
  • gfa: path to the gzipped GFA file of the sample;
  • pls_fasta: path to a gzipped FASTA file containing the true plasmids of the sample, one per FASTA entry;
  • chr_fasta: path to a gzipped FASTA file containing the chromosome(s) of the sample, one per FASTA entry;
  • pls_score: path to a plasmid score file (as described above) for the sample;
  • gc_probabilities: path to a GC probability file (as described above) for the sample;
  • genes2ctgs_mappings: path to a BLAST format 6 containing the result of mapping a set of reference plasmid genes (see Section "Reference plasmid genes database" below) onto the assembly graph contigs;
  • ground_truth: path to a file describing the ground truth for a sample, in TSV format with two fields (true plasmid name, contig name).

Additionally, all commands of plasbin_utils.py take three additional requird parameters:

  • out_dir: directory where final results files are written;
  • tmp_dir: directory where temporary files are written (deleted unless the paramater --keep_tmp_dir is used);
  • log_file: optional file containing detailed log information.

Preprocessing: Computing GC probabilities files

The command to create the GC content probabilities files for preprocessing a set of samples is

python plasbin_utils.py gc_probabilities
       --input_file input_file
       --out_dir out_dir
       --tmp_dir tmp_dir
       [--log_file log file]
       [--gc_intervals gc_intervals_file]
       [--out_file out_file]

The input file input_file must contain the fields sample and gfa.

The optional file gc_intervals_file is a file that describes the GC intervals, with one interval boundary per line (so its first line must be 0, its last line must be 1 and in between the values must be increasing, an example is provided in gc_intervals.txt. If it is not povided, the default intervals defined by boundaries [0, 0.4, 0.45, 0.5, 0.55, 0.6, 1] are used.

The file created for a sample that contains the GC content probabilities is the file <out_dir>/<sample name>.gc.tsv.

The parameter out_file is optional; if provided, a new CSV file out_file is created from input_file by adding a column gc_probabilities that contains the path to the computed GC probabilities file.

Preprocessing: Computing plasmid scores using gene density

Similar to the experiments described in the PlasBin-flow paper, the plasmid score of a contig can be generated using the notion of gene density: given a reference database of plasmid genes (see Section ""Reference plasmid genes database" below), the gene density of a contig is defined as the proportion of its length covered by mappings obtained by mapping the reference plasmid genes to the assembly contigs.

The command to compute the contig plasmid gene density for a set of samples is

python plasbin_utils.py gene_density
       --input_file input_file
       --out_dir out_dir
       --tmp_dir tmp_dir
       [--log_file log file]
       [--out_file out_file]
       [--pid_threshold p]
       [--cov_threshold c]

where

  • input_file must contain the fields sample, gfa and genes2ctgs_mappings;
  • p and c are mappings quality threshold: every mapping of percent identity below p (default=0.95) and covering less than c (default=0.8) of the contig it maps to is discarded;
  • out_file is an optional parameter; if provided, a new CSV file out_file is created from input_file by adding a column pls_score that contains the path to the gene density file created for each sample: <out_dir>/<sample name>.gd.tsv.

The command to create the mapping file for each sample is

python plasbin_utils.py map_genes_to_ctgs
       --input_file input_file
       --out_dir out_dir
       --tmp_dir tmp_dir
       [--log_file log file]
       --db_file pls_db_file
       [--out_file out_file]

where

  • input_file must contain the fields sample and gfa;
  • pls_db_file is the path to a reference plasmid genes database file (FASTA file, see genes.fasta for the plasmid genes database used in the PlasBin-flow paper);
  • out_file is an optional parameter; if provided, a new CSV file out_file is created from input_file by adding a column genes2ctgs_mappings that contains the path to the mappings file created for each sample: <out_dir>/<sample name>.genes_mappings.txt.

Full preprocessing

Finally, both preprocessing steps (computing GC content probabilities and gene density) can be done at once using

python plasbin_utils.py preprocessing
       --input_file input_file
       --out_dir out_dir
       --tmp_dir tmp_dir
       [--log_file log file]
       --out_file out_file 
       --db_file pls_db_file
       [--gc_intervals gc_intervals_file]
       [--pid_threshold p]
       [--cov_threshold c]

where

  • input_file must contain the fields sample and gfa;
  • other parameters are as described above;
  • out_file is a new CSV file created from input_file by adding columns gc_probabilities and pls_score.

Tuning PlasBin-flow parameters

PlasBin-flow requires auxiliary data files and parameters in order to process samples:

  • reference plasmid genes database used to compute the mapping file and the gene density,
  • GC content intervals used to compute the GC content probabilities file,
  • parameters defining seed contigs.

PlasBin-flow comes with default files/values for these (see the PlasBin-flow paper for a description of how these were obtained):

  • reference plasmid genes database: genes.fasta
  • GC content intervals: gc_intervals.txt
  • seed contigs parameters: default value, length=2650, plasmid score=0.58.

Alternatively, given a reference set of samples for which the true plasmids are known, the script plasbin_utils.py can be used to compute these parameters and auxiliary files.

Tuning: Creating a plasmid gene database

The command

python plasbin_utils.py pls_genes_db
       --input_file input_file
       --out_dir out_dir
       --tmp_dir tmp_dir
       [--log_file log file]

creates a reference plasmid genes database out_dir/pls.genes.fasta.

The file input_file must contain the fields sample, gfa (so tuning samples need to be re-assembled from reads data) and pls_fasta where each plasmid entry must be named using the plasmid GenBank or RefSeq accession.

Tuning: Computing GC content intervals

The command

python plasbin_utils.py gc_intervals
       --input_file input_file
       --out_dir out_dir
       --tmp_dir tmp_dir
       [--log_file log file]
       [--n_gcints n_gcints]

computes a GC content intervals file out_dir/gc.txt.

The file input_file must contain the fields sample, pls_fasta and chr_fasta.

The optional parameter n_gcints (default value 6) is the number of intervals.

The command generates three files:

  • out_dir/gc.csv: details of the GC content of all plasmids and chromosomes;
  • out_dir/gc.png: violin plot of out_dir/gc.csv
  • out_dir/gc.txt: GC content n_gcints intervals boundaries, obtained by splitting into equal range intervals the GC content observed in the plasmids of the reference samples.

We encourage a user too look at the provided files out_dir/gc.csv and out_dir/gc.png to assess the GC intervals file and modify it if deemed relevant.

Tuning: Computing ground truth

For a set of samples for which true plasmids are known, the command

python plasbin_utils.py ground_truth
       --input_file input_file
       --out_dir out_dir
       --tmp_dir tmp_dir
       [--log_file log file]
       [--out_file out_file]
       [--pid_threshold p]
       [--cov_threshold c]

computes, for each sample, a TSV file out_dir/<sample name>.ground_truth.tsv that records true plasmid bins.

True plasmid bins for a sample are defined by mapping the sample contigs to the true plasmids, and discarding any mapping with identity below pid_threshold p (default value = 0.95) and contig coverage below --cov_threshold c (default value = 0.8).

The file input_file must contain the fields sample, gfa, and pls_fasta.

The resulting ground truth file out_dir/<sample name>.ground_truth.tsv lists contigs mapped to plasmids associated with the sample. A line in the ground truth TSV file should contain the name of the plasmid in the first column and a contig that has been mapped to the plasmid in the second column. The file can contain other information as long as the first two columns contain the plasmid and contig names respectively.

P1 C1
P1 C2
P2 C3
...

If the optional parameter out_file is provided it contains the path to a new CSV file obtained from input_file augmented by a column ground_truth that contains the path to the ground truth file for each sample.

Tuning: Computing seed contigs parameters

PlasBin-flow takes two parameters as input as the contig length and plasmid score thresholds to decide which contig from a given sample can be designated as seeds. Each plasmid bin output by PlasBin-flow should contain at least one seed contig. Seed parameters are decided by observing the length and plasmid score of contigs from the set of tuning samples, based on their mappings to the true plasmids.

The command

python plasbin_utils.py seeds
       --input_file input_file
       --out_dir out_dir
       --tmp_dir tmp_dir
       [--log_file log file]

computes a file out_dir/seeds.tsv that contains the optimal seed contigs parameters.

Each line of the out_dir/seeds.tsv file contains 6 tab-separated fields (plasmid score, length, seed score, number of plasmids with seeds, number of false seeds, number of plasmid without seed):

  • plasmid score and length are the thresholds values,
  • seed score defined as the difference of the next 2 fields described below,
  • number of reference plasmids with seeds is the number of true plasmids with at least one contig identified as a seed based on the thresholds,
  • number of false seeds is the number of contigs that meets the thresholds but are not assigned to a reference plasmid in their sample,
  • number of plasmid with no seed.

A file out_dir/seeds_all.tsv in the same format contains the sam results but for all thresholds combinations, not only the ones with the best seed score.

The file input_file must contain the fields sample, gfa, ground_truth and pls_score.

Full tuning

Finally, all tuning steps (but building the reference plasmid genes database) can be done at once with the command

python plasbin_utils.py tuning
       --input_file input_file
       --out_dir out_dir
       --tmp_dir tmp_dir
       [--log_file log file]
       --db_file pls_db_file
       [--out_file out_file]
       [--pid_threshold p]
       [--cov_threshold c]
       [--n_gcints n_gcints]

The file input_file must contain the fields sample, gfa, pls_fasta and chr_fasta.

The parameters pid_threshold,cov_threshold,n_gcints are as described above.

The file pls_db_file is the FASTA file containing the reference plasmid genes database.

This command creates the files out_dir/gc.[txt,png,csv], out_dir/gc_intervals.txt, out_dir/seeds.tsv, out_dir/pls.genes.fasta, and for each sample out_dir/<sample name>.genes_mappings.txt and out_dir/<sample name>.ground_truth.tsv.

If the optional paramater out_file is provided it contains the path to a new CSV file obtained from input_file augmented by columns ground_truth, pls_score and genes2ctgs_mappings.