Primary language: Shell. License: Apache License 2.0 (Apache-2.0).

Genestrip-DB - a selection of databases for Genestrip

This project contains some configuration files and two scripts for generating databases and indexes for metagenomic analysis via Genestrip.

License

Genestrip-DB is free for any kind of use. However, the associated software, Genestrip, has a more restrictive license.

Building and installing

Genestrip-DB requires Maven 2 or 3 and the JRE 1.8 or higher.

To build the databases and indexes, change to the installation directory genestrip-db. With a matching Maven and JDK installation, sh bin/makedbs.sh generates 9 databases (and indexes) of different sizes. The generation process is resource-intensive and may take several days for all databases. Generating the bacterial databases is particularly time-consuming.
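Because the build can run for days, it may help to start it detached from the terminal and keep a log. The following is a minimal sketch, not part of Genestrip-DB: the function name start_build, the variable BUILD_PID and the log file name makedbs.log are assumptions made for illustration. Only bin/makedbs.sh comes from this project.

```sh
#!/bin/sh
# start_build [DIR]: starts bin/makedbs.sh in the background from DIR
# (hypothetical helper; DIR defaults to ./genestrip-db).
start_build() {
    dir=${1:-genestrip-db}
    if [ ! -f "$dir/bin/makedbs.sh" ]; then
        echo "bin/makedbs.sh not found under $dir" >&2
        return 1
    fi
    # The subshell keeps the caller's working directory unchanged.
    # nohup lets the build survive a closed terminal; all output goes to the log.
    ( cd "$dir" && exec nohup sh bin/makedbs.sh > makedbs.log 2>&1 ) &
    BUILD_PID=$!
    echo "Build started (PID $BUILD_PID), log: $dir/makedbs.log"
}
```

A call such as start_build /path/to/genestrip-db starts the build, after which tail -f on the log file shows its progress.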

Your machine should have:

  • 1 TB of free disk space - mainly for downloading genomes from NCBI,
  • at least 8 cores - the more the better (some phases of the database generation keep 32 cores 100% busy),
  • 48 GB of main memory,
  • a high bandwidth Internet connection.
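The requirements above can be checked roughly before starting a multi-day build. The sketch below is not part of Genestrip-DB; the function names and probing commands are assumptions, and the memory probe relies on /proc/meminfo, which exists on Linux only. The thresholds are the ones listed above, in decimal GB.

```sh
#!/bin/sh
# Preflight sketch (hypothetical helper, Linux-oriented).
MIN_CORES=8
MIN_MEM_GB=48
MIN_DISK_GB=1000   # roughly 1 TB

# at_least VALUE MIN: succeeds if VALUE is a non-negative integer >= MIN.
at_least() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # unknown or non-numeric value
    esac
    [ "$1" -ge "$2" ]
}

# report LABEL VALUE MIN UNIT: prints one OK or WARN line.
report() {
    if at_least "$2" "$3"; then
        echo "OK: $1 = $2 $4 (minimum $3)"
    else
        echo "WARN: $1 = ${2:-unknown} $4 (minimum $3)"
    fi
}

preflight() {
    for tool in mvn java; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "OK: $tool found on PATH"
        else
            echo "WARN: $tool not found on PATH"
        fi
    done

    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || true)
    # MemTotal is given in kB (1024 bytes); convert to decimal GB.
    mem_gb=$(awk '/^MemTotal:/ {print int($2 * 1024 / 1000000000)}' /proc/meminfo 2>/dev/null || true)
    # df -Pk reports available space in 1024-byte blocks in column 4.
    disk_gb=$(df -Pk . 2>/dev/null | awk 'NR == 2 {print int($4 * 1024 / 1000000000)}')

    report "CPU" "$cores" "$MIN_CORES" "cores"
    report "Memory" "$mem_gb" "$MIN_MEM_GB" "GB"
    report "Free disk" "$disk_gb" "$MIN_DISK_GB" "GB"
}

preflight
```

The script only prints warnings and never aborts. The disk check covers the file system of the current directory, so it is best run from inside genestrip-db.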

The databases are based on and compatible with Genestrip v1.4.

The databases

All databases are genomic or based on total RNA.

| Name | Category | Description | Database disk size | Sources and references |
|---|---|---|---|---|
| babesia | protozoa | Babesia species from RefSeq and GenBank which are potentially pathogenic for humans | 1.1 GB | General knowledge |
| borrelia | bacteria | Borrelia species from RefSeq which are potentially pathogenic for humans | 850 MB | General knowledge |
| borrelia_plasmid | plasmid | Borrelia species from RefSeq which are potentially pathogenic for humans | 219 MB | General knowledge |
| chronicb | bacteria | Potentially tick-borne infections which are potentially pathogenic for humans and may lead to chronic diseases | 2.8 GB | Collected from Armin Labs |
| chronicb-rna | bacteria | Same as chronicb but based on total RNA | 1.1 MB | |
| human_virus2 | viral | Viruses from RefSeq and GenBank which are potentially pathogenic for humans | 89 MB | Extracted from the Viral Zone |
| parasites | invertebrate | Parasitic invertebrate animals from RefSeq which are potentially pathogenic for humans | 20 GB | Collected from the German book "Die Parasiten des Menschen" by Heinz Mehlhorn |
| protozoa | protozoa | Protozoan parasites from RefSeq which are potentially pathogenic for humans | 17 GB | Collected from the German book "Die Parasiten des Menschen" by Heinz Mehlhorn |
| protozoa-rna | protozoa | Same as protozoa but based on total RNA | 8.5 GB | |
| vineyard | fungi | Fungal infections of grapevine taken from RefSeq | 4.7 GB | Collected from the German book "Rebschutz" by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
| plasmopara | plant | Peronosporales as infections of grapevine taken from RefSeq | 5.6 GB | Collected from the German book "Rebschutz" by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |

Note that Genestrip's updateddb-phase accounts for unspecific k-mers and largely avoids false positive counts during matches. To further reduce false positives, all databases except for vineyard, chronicb-rna and protozoa-rna are built such that k-mers also occurring in the human genome are pushed to the least common ancestor.
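The least-common-ancestor step mentioned above can be illustrated with a toy example. The sketch below is not Genestrip's implementation; the taxonomy, the node names and the functions parent, lineage and lca are hypothetical and heavily simplified. It only shows why a k-mer shared with the human genome stops counting toward a specific pathogen: its label moves up to the first node that both lineages have in common.

```sh
#!/bin/sh
# Toy taxonomy (hypothetical): parent NODE prints the parent of NODE,
# or an empty line for the root and for unknown nodes.
parent() {
    case "$1" in
        borrelia_burgdorferi|borrelia_garinii) echo borrelia ;;
        borrelia) echo bacteria ;;
        babesia_microti|homo_sapiens) echo eukaryota ;;
        bacteria|eukaryota) echo cellular_organisms ;;
        *) echo "" ;;
    esac
}

# lineage NODE: prints NODE followed by all of its ancestors, one per line.
lineage() {
    node=$1
    while [ -n "$node" ]; do
        echo "$node"
        node=$(parent "$node")
    done
}

# lca A B: prints the first node on B's lineage that is also on A's lineage.
lca() {
    a_line=$(lineage "$1")
    for candidate in $(lineage "$2"); do
        if echo "$a_line" | grep -qxF "$candidate"; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# A k-mer found in both genomes would be relabeled to this node.
lca borrelia_burgdorferi homo_sapiens
```

In this toy tree, a k-mer shared by two Borrelia species stays within the genus, whereas one shared with the human genome is lifted far above it and no longer supports a Borrelia match.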

Testing the databases borrelia, borrelia_plasmid and chronicb

The script bin/matchticks.sh runs the Genestrip goal matchlr for 11 fastq files taken from this publication. To do so, the fastq files are streamed from the corresponding NCBI server. As expected, Genestrip finds DNA from Borrelia and other tick-borne pathogens.

Downloading the ready-made databases

If you don't want to generate them yourself, the databases and indexes can also be downloaded from genestrip.it.hs-heilbronn.de. The projects folder offered there corresponds to the state of this project's projects folder after the scripts bin/makedbs.sh and bin/matchticks.sh have run successfully on RefSeq Release 226.