PanDelos plus plus

PanDelos plus plus: a dictionary-based method for pan-genome content discovery

License: MIT


Brief description

PanDelos plus plus discovers the pan-genomic content of a bacterial population using an alignment-free approach based on the analysis of k-mer multiplicity. The methodology is natively scalable, and its algorithms run in parallel with OpenMP.
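To make the idea of k-mer multiplicity concrete, here is a minimal Python sketch that counts how often each k-mer occurs in a sequence and compares two sequences through those counts. The function names, the choice of k, and the use of a generalized Jaccard similarity are assumptions made for illustration; this README does not state which measure or parameters the C++ program uses.

```python
from collections import Counter


def kmer_multiplicities(sequence: str, k: int) -> Counter:
    """Count how many times each k-mer (substring of length k) occurs."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def generalized_jaccard(a: Counter, b: Counter) -> float:
    """Illustrative similarity: sum of per-k-mer minima over sum of maxima.

    This measure is an assumption for the sketch, not taken from this README.
    """
    kmers = a.keys() | b.keys()
    shared = sum(min(a[x], b[x]) for x in kmers)
    total = sum(max(a[x], b[x]) for x in kmers)
    return shared / total if total else 0.0


if __name__ == "__main__":
    first = kmer_multiplicities("ABAB", 2)
    second = kmer_multiplicities("ABAC", 2)
    print(first, second, generalized_jaccard(first, second))
```

No alignment is computed at any point: each sequence is reduced to a dictionary of k-mer counts, and only those dictionaries are compared.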

Software architecture

The PanDelos plus plus software is organized into two Python modules and a C++ program, which are piped together by a bash script. The bash script, pandelos.sh, provides the access point to the PanDelos plus plus pipeline.

bash pandelos.sh <input.faa> <output_graph.net> <sequences_type> <log_file.txt> <clus_file.clus> <coco_file.txt>
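If the pipeline is driven from another program, the argument order above can be wrapped in a small helper. The following Python sketch only assembles the command in the documented order; the helper names and the example file names are hypothetical, and whether the script expects the file names with or without their extensions is not stated here, so the sketch follows the usage line literally.

```python
import subprocess
from typing import List


def build_pandelos_command(input_faa: str, output_graph: str, sequences_type: int,
                           log_file: str, clus_file: str, coco_file: str,
                           script: str = "pandelos.sh") -> List[str]:
    """Assemble the pandelos.sh call in the order given by the usage line."""
    # 0 = amino acid, 1 = nucleotide (see the "Sequences type" section).
    if sequences_type not in (0, 1):
        raise ValueError("sequences_type must be 0 (amino acid) or 1 (nucleotide)")
    return ["bash", script, input_faa, output_graph, str(sequences_type),
            log_file, clus_file, coco_file]


def run_pandelos(command: List[str]) -> None:
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_pandelos_command("input.faa", "output_graph.net", 0,
                                 "log_file.txt", "clus_file.clus", "coco_file.txt"))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.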

Input format

The complete set of (gene) sequences <input.faa>, belonging to any of the studied genomes, must be provided as a text file.

Each sequence occupies two lines in the file. The first is an identification line composed of three parts separated by tab characters: the genome identifier, the gene identifier and the gene product.

The identification line is followed by the complete gene sequence in FASTA amino acid format, written on a single line. No blank lines are allowed between the identification line and the sequence line, nor between genes.

A valid file is given by the following example listing 4 genes from 2 genomes:

NC_000913	b0001@NC_000913:1	thr operon leader peptide
MKRISTTITTTITITTGNGAG
NC_000913	b0024@NC_000913:1	uncharacterized protein
MCRHSLRSDGAGFYQLAGCEYSFSAIKIAAGGQFLPVICAMAMKSHFFLISVLNRRLTLTAVQGILGRFSLF
NC_002655	Z_RS03160@NC_002655:1	hok/gef family protein
MLTKYALVAVIVLCLTVPGFTLLVGDSLCEFTVKERNIEFRAVLAYEPKK
NC_002655	Z_RS03165@NC_002655:1	protein HokE
MLTKYALVAVIVLCLTVLGFTLLVGDSLCEFTVKERNIEFKAVLAYEPKK
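The format rules above can be checked with a short Python sketch like the one below. It is not part of PanDelos plus plus; the GeneRecord type and the parse_pandelos_input function are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple


class GeneRecord(NamedTuple):
    genome_id: str
    gene_id: str
    product: str
    sequence: str


def parse_pandelos_input(text: str) -> List[GeneRecord]:
    """Parse the two-lines-per-gene format and reject malformed input."""
    lines = text.rstrip("\n").split("\n")
    if len(lines) % 2 != 0:
        raise ValueError("expected an identification line plus a sequence line per gene")
    records = []
    for i in range(0, len(lines), 2):
        header, sequence = lines[i], lines[i + 1]
        if not header.strip() or not sequence.strip():
            raise ValueError(f"blank line found near line {i + 1}")
        parts = header.split("\t")
        if len(parts) != 3:
            raise ValueError(f"line {i + 1}: expected 3 tab-separated fields, got {len(parts)}")
        records.append(GeneRecord(parts[0], parts[1], parts[2], sequence.strip()))
    return records


if __name__ == "__main__":
    example = (
        "NC_000913\tb0001@NC_000913:1\tthr operon leader peptide\n"
        "MKRISTTITTTITITTGNGAG\n"
        "NC_002655\tZ_RS03165@NC_002655:1\tprotein HokE\n"
        "MLTKYALVAVIVLCLTVLGFTLLVGDSLCEFTVKERNIEFKAVLAYEPKK\n"
    )
    for record in parse_pandelos_input(example):
        print(record.genome_id, record.gene_id, len(record.sequence))
```

Running such a check before the pipeline makes it easier to catch blank lines or identification lines that use spaces instead of tabs.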

IMPORTANT: make sure the gene identifiers are unique within the input file. Commonly used file formats for sharing genome annotations do not require locus tags to be unique, so multiple copies of the same gene may share an identifier.

We suggest using the following format to build unique gene identifiers:

gene_identifier@genome_identifier:unique_integer

The fields gene_identifier and genome_identifier are the same as those reported in the input file, while unique_integer is used to distinguish multiple copies of the same gene (same gene identifier) within the same genome. The integer starts from 1 and is incremented according to the order in which genes are written in the input file.
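The numbering scheme can be sketched in a few lines of Python. The function name make_unique_gene_ids is hypothetical, and the input is assumed to be a list of (genome identifier, gene identifier) pairs in file order.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def make_unique_gene_ids(genes: Iterable[Tuple[str, str]]) -> List[str]:
    """Build gene_identifier@genome_identifier:unique_integer strings.

    The counter starts at 1 for each (genome, gene) pair and grows in the
    order in which the pairs are supplied.
    """
    copies_seen = defaultdict(int)
    unique_ids = []
    for genome_id, gene_id in genes:
        copies_seen[(genome_id, gene_id)] += 1
        unique_ids.append(f"{gene_id}@{genome_id}:{copies_seen[(genome_id, gene_id)]}")
    return unique_ids


if __name__ == "__main__":
    pairs = [("NC_000913", "b0001"), ("NC_000913", "b0001"), ("NC_002655", "b0001")]
    print(make_unique_gene_ids(pairs))
```

The counter is kept per genome, so the same gene identifier appearing in two different genomes receives the integer 1 in each of them.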

The examples provided in the examples folder generate 4 different dataset files with the .faa extension, which can be consulted for reference.


Output graph

Running the C++ module produces a graph named [output_graph].net containing the gene families, among which inconsistent families may be found.

Sequences type

This software allows the discovery of pan-genomic content for nucleotide (1) or amino acid (0) sequences.

Log file

The execution requires a log file, which will contain the main information about the execution.

Clus file

The execution of PanDelos produces an output file named [clus_file].clus, which reports the gene families retrieved by the software. Each row of the output file represents a specific gene family retrieved by PanDelos.
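A minimal Python sketch for reading such a file is given below. It assumes that the gene identifiers on each row are separated by whitespace and follow the gene_identifier@genome_identifier:unique_integer format suggested above; neither assumption is confirmed by this README. The function names are hypothetical, and the grouping in the example is made up purely to show the shape of the data.

```python
from typing import List, Set


def parse_clus(text: str) -> List[List[str]]:
    """Return one list of gene identifiers per non-empty row (one row = one family)."""
    return [line.split() for line in text.splitlines() if line.strip()]


def genomes_in_family(family: List[str]) -> Set[str]:
    """Extract genome identifiers, assuming the gene@genome:integer format."""
    return {gene.split("@", 1)[1].rsplit(":", 1)[0] for gene in family}


if __name__ == "__main__":
    # Made-up grouping, for illustration only; not an actual PanDelos result.
    example = (
        "Z_RS03160@NC_002655:1 Z_RS03165@NC_002655:1\n"
        "b0001@NC_000913:1\n"
    )
    for family in parse_clus(example):
        print(len(family), sorted(genomes_in_family(family)))
```

Counting the distinct genomes per family is one simple way to separate families shared by many genomes from those specific to a single genome.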


Installation

Requirements

Before running PanDelos plus plus, please verify that the following software is installed on your system

Setup

  • Download the software from here or clone the GitHub repository
  • Enter the PanDelos-plusplus directory and type
bash setup/setup.sh

to compile the C++ source code of PanDelos-plus-plus and create the folders that the software needs.

The script can also download, from a repository, genome files of Mycoplasma genitalium and Escherichia coli generated with PANPROVA.

These files are transformed into .faa files by the utility scripts described later.

The same genomes are also available as nucleotide sequences, for a total of 4 datasets.

Compiling the C++ source code

The script setup/compile.sh is available to compile the C++ program manually; alternatively, you can run the command:

g++ -w -std=c++17 -fopenmp -O3 src/cpp/main.cpp -o bin/pandelos_plus_plus.out

The software requires the C++17 standard and the -fopenmp flag, which enables the compiler to process the OpenMP pragma directives used to parallelize the algorithms in the software.


Running the examples

Examples are available in the examples/ folder to test PanDelos plus plus.

In general, each example script lets you run multiple separate, sequential tests involving different numbers of genomes, which can also be specified as ranges.

There are 4 examples available:

  1. 16 genomes of Mycoplasma genitalium with nucleotide sequences
  2. 16 genomes of Mycoplasma genitalium with amino acid sequences
  3. 16 genomes of Escherichia coli with nucleotide sequences
  4. 16 genomes of Escherichia coli with amino acid sequences

Utilities

  • script/panprova2gbk.sh.py: a Python script for converting a GFF file into a GBK file
  • script/panprova2nucleotides.sh: a Python script for converting FNA and GFF files into a FASTA file composed of nucleotide sequences
  • script/gbk2faa.sh gbk2faa.py: a Python script to convert one or more GBK files into a single FASTA file
  • src/python/quality.py: calculates statistics about the extracted pan-genome content and prints them

License

PanDelos plus plus is distributed under the MIT license. This means that it is free for both academic and commercial use. Note, however, that some third-party components in PanDelos plus plus require that you reference certain works in scientific publications. You are free to link or use PanDelos plus plus inside the source code of your own program. If you do so, please reference (cite) PanDelos plus plus and this website. We appreciate bug fixes and would be happy to collaborate on improvements.


Contributors

  • Vincenzo Bonnici, University of Parma, Italy.
  • Giandonato Inverso