/MinYS

MineYourSymbiont : Targetted genome assembly for metagenomics pipeline.

Primary LanguagePythonGNU Affero General Public License v3.0AGPL-3.0

MinYS - MineYourSymbiont

MinYS allows targeted assembly of bacterial genomes using a reference-guided pipeline. It consists in 3 steps :

  • Mapping metagenomic reads on a reference genome using BWA. And assembling the recruited reads using Minia.
  • Gapfilling the contig set using MindTheGap in contig mode.
  • Simplifying the GFA output of MindTheGap.

MinYS was developed in the GenScale lab by :

Requirements

  • BWA (read mapping)
  • Minia (contig assembly)
  • MindTheGap (gap-filling)
  • Bandage (Optionnal, for assembly graph visualization)
  • Samtools

Installation

With Conda

conda install -c bioconda minys

From source

git clone https://github.com/cguyomar/MinYS
cd MinYS
make -C graph_simplification/nwalign/
./MinYS.py

Test run

MinYS.py -1 test_data/reads.1.fq -2 test_data/reads.2.fq -ref test_data/ref.fa -out MinYS_results
# look at the output file:
head MinYS_results/gapfilling/minia_k31_abundancemin_auto_filtered_400_gapfilling_k31_abundancemin_auto.simplified.gfa
# should contain only one sequence node of 15,722 bp, or see also the logs:
more MinYS_results/logs/simplification.log

Options

[main options]:
  -in                   (1 arg) :    Input reads file
  -1                    (1 arg) :    Input reads first file
  -2                    (1 arg) :    Input reads second file
  -fof                  (1 arg) :    Input file of read files (if paired files, 2 columns tab-separated)
  -out                  (1 arg) :    output directory for result files [Default: ./MinYS_results]

[mapping options]:
  -ref                  (1 arg) :    Bwa index
  -mask                 (1 arg) :    Bed file for region removed from mapping

[assembly options]:
  -minia-bin            (1 arg) :    Path to Minia binary (if not in $PATH
  -assembly-kmer-size   (1 arg) :    Kmer size used for Minia assembly (should be given even if bypassing minia assembly step, usefull knowledge for gap-filling) [Default: 31]
  -assembly-abundance-min
                        (1 arg) :    Minimal abundance of kmers used for assembly [Default: auto]
  -min-contig-size      (1 arg) :    Minimal size for a contig to be used in gap-filling [Default: 400]

[gapfilling options]:
  -mtg-dir              (1 arg) :    Path to MindTheGap build directory (if not in $PATH)
  -gapfilling-kmer-size
                        (1 arg) :    Kmer size used for gap-filling [Default: 31]
  -gapfilling-abundance-min
                        (1 arg) :    Minimal abundance of kmers used for gap-filling [Default: auto]
  -max-nodes            (1 arg) :    Maximum number of nodes in contig graph [Default: 300]
  -max-length           (1 arg) :    Maximum length of gap-filling (nt) [Default: 50000]

[simplification options]:
  -l                    (1 arg) :    Length of minimum prefix for node merging, default should work for most cases [Default: 100]

[continue options]:
  -contigs              (1 arg) :    Contigs in fasta format - override mapping and assembly
  -graph                (1 arg) :    Graph in h5 format - override graph creation

[core options]:
  -nb-cores             (1 arg) :    Number of cores [Default: 0]

  • If minia or MindTheGap are not in the $PATH environment variable, a path to the minia binary or to the MindTheGap build directory has to be supplied using -minia-bin or -mtg-dir options
  • -contigs and -graph may be used to bypass the mapping/assembly step, and the graph creation, respectively. In the first case, -assembly-kmer-size should be supplied as the overlap between contigs.
  • Troubleshooting:
    • If running on a Lustre filesystem you might have problems with file locking while writing HDF5 output. Make sure your output directory is located on a local tmp directory or a NFS mounted directory, or try setting the environment variable HDF5_USE_FILE_LOCKING to 'FALSE'.

Documentation

A step by step tutorial of the analysis of one sample presented in the paper is available as a Jupyter notebook (or in markdown).

Utility scripts :

Some utility scripts are supplied along with MinYS in order to facilitate the post processing of the output gfa graph :

  • graph_simplification/enumerate_paths.py in.gfa out_dir Enumerate all the paths of each connected component of a graph. Returns paths that are substantially different from one another (ANI < 99% or alignment coverage <99%)

  • graph_simplification/filter_components.py in.gfa min_size Return a sub-graph containing all the connected components larger than min_size (in total assembled nt)

  • graph_simplification/gfa2fasta.py in.gfa out.fa Return all the sequences of the graph in a multi-fasta file

Reference

MinYS: Mine Your Symbiont by targeted genome assembly in symbiotic communities. Guyomar C, Delage W, Legeai F, Mougel C, Simon JC, Lemaitre C. NAR Genomics and Bioinformatics 2020, 2(3):lqaa047, doi:10.1093/nargab/lqaa047