Genestrip-DB - a selection of databases for Genestrip
This project contains some configuration files and two scripts for generating databases and indexes for metagenomic analysis via Genestrip.
Genestrip-DB is free for any kind of use. However, the associated software, Genestrip, has a more restrictive license.
Genestrip-DB requires Maven 2 or 3 and the JRE 1.8 or higher.
To build the databases and indexes, `cd` to the installation directory `genestrip-db`. Given a matching Maven and JDK installation, `sh bin/makedbs.sh` will generate 9 databases (and indexes) of different sizes. The generation process is resource intensive and may take several days for all databases.
Generating the bacterial databases is particularly time-consuming.
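Because a full build may run for several days, it can help to start it detached from the terminal and keep a log. The following is a minimal sketch and not part of Genestrip-DB: the function name `start_build`, the variables `INSTALL_DIR` and `LOG_FILE`, and the file names `makedbs.log` and `makedbs.pid` are assumptions chosen for illustration. Only `sh bin/makedbs.sh` comes from this project.

```shell
#!/bin/sh
# Sketch: start bin/makedbs.sh detached from the terminal and keep a log.
# INSTALL_DIR, LOG_FILE, makedbs.pid and start_build are hypothetical names.
INSTALL_DIR="${INSTALL_DIR:-genestrip-db}"   # assumed location of the checkout
LOG_FILE="${LOG_FILE:-makedbs.log}"          # assumed log file, relative to INSTALL_DIR

start_build() {
    (
        cd "$INSTALL_DIR" || exit 1
        # nohup lets the build survive a closed terminal or dropped SSH session
        nohup sh bin/makedbs.sh > "$LOG_FILE" 2>&1 &
        # remember the process ID so the build can be checked or stopped later
        echo "$!" > makedbs.pid
    )
}
```

After calling `start_build`, progress could be followed with `tail -f genestrip-db/makedbs.log`, assuming the default names above.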
Your machine should have:
- 1 TB of free disk space - mainly for downloading genomes from NCBI,
- at least 8 cores - the more the better (some phases of the database generation keep 32 cores 100% busy),
- 48 GB of main memory,
- a high bandwidth Internet connection.
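The requirements above can be checked before starting a multi-day build. The following is a rough, Linux-oriented sketch and not part of Genestrip-DB: the helper names (`report`, `free_disk_gb`, `core_count`, `total_mem_gb`) are hypothetical, and reading `/proc/meminfo` is an assumption about the host system. It only prints `OK` or `WARN` lines and never aborts.

```shell
#!/bin/sh
# Pre-flight check against the requirements listed above (sketch, Linux-oriented).
# The thresholds come from this README; all helper names are hypothetical.
MIN_DISK_GB=1000   # roughly 1 TB of free disk space
MIN_CORES=8
MIN_MEM_GB=48

# report NAME ACTUAL MINIMUM: print OK or WARN for one numeric requirement
report() {
    if [ "$2" -ge "$3" ] 2>/dev/null; then
        echo "OK   $1: $2 (minimum $3)"
    else
        echo "WARN $1: $2 (minimum $3)"
    fi
}

# Free space in GB on the file system that holds the current directory
free_disk_gb() {
    df -Pk . | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Number of cores currently online
core_count() {
    getconf _NPROCESSORS_ONLN
}

# Total main memory in GB from /proc/meminfo (Linux only). The kernel reports
# slightly less than the installed amount, so a 48 GB machine may show 47.
total_mem_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

report "free disk (GB)" "$(free_disk_gb)" "$MIN_DISK_GB"
report "cores" "$(core_count)" "$MIN_CORES"
report "memory (GB)" "$(total_mem_gb)" "$MIN_MEM_GB"

# Maven and Java must be on the PATH for bin/makedbs.sh to work
for tool in mvn java; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "OK   $tool found"
    else
        echo "WARN $tool not found on PATH"
    fi
done
```

Run it from the directory where the genomes will be downloaded, since the disk check looks at the file system of the current directory. Bandwidth is not checked.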
The databases are based on and compatible with Genestrip v1.4.
All databases are genomic or based on total RNA.
Name | Category | Description | Database disk size | Sources and references |
---|---|---|---|---|
`babesia` | protozoa | Babesia species from the RefSeq and GenBank which are potentially pathogenic for humans | 1.1 GB | General knowledge |
`borrelia` | bacteria | Borrelia species from the RefSeq which are potentially pathogenic for humans | 850 MB | General knowledge |
`borrelia_plasmid` | plasmid | Borrelia species from the RefSeq which are potentially pathogenic for humans | 219 MB | General knowledge |
`chronicb` | bacteria | Potentially tick-borne infections which are potentially pathogenic for humans and may lead to chronic diseases | 2.8 GB | Collected from Armin Labs |
`chronicb-rna` | bacteria | Same as `chronicb` but based on total RNA | 1.1 MB | |
`human_virus2` | viral | Viruses from the RefSeq and GenBank which are potentially pathogenic for humans | 89 MB | Extracted from the Viral Zone |
`parasites` | invertebrate | Parasitic invertebrate animals from the RefSeq which are potentially pathogenic for humans | 20 GB | Collected from the German book "Die Parasiten des Menschen" by Heinz Mehlhorn |
`protozoa` | protozoa | Protozoan parasites from the RefSeq which are potentially pathogenic for humans | 17 GB | Collected from the German book "Die Parasiten des Menschen" by Heinz Mehlhorn |
`protozoa-rna` | protozoa | Same as `protozoa` but based on total RNA | 8.5 GB | |
`vineyard` | fungi | Fungal infections of grapevine taken from the RefSeq | 4.7 GB | Collected from the German book "Rebschutz" by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
`plasmopara` | plant | Peronosporales as infections of grapevine taken from the RefSeq | 5.6 GB | Collected from the German book "Rebschutz" by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
Note that Genestrip's `updateddb` phase accounts for unspecific k-mers and largely avoids false positive counts when running `match`.
To further reduce false positives, all databases except for `vineyard`, `chronicb-rna` and `protozoa-rna` are built such that k-mers also occurring in the human genome are pushed to the least common ancestor.
The script `bin/matchticks.sh` runs the Genestrip goal `matchlr` for 11 fastq files taken from this publication.
To do so, the fastq files are streamed from the corresponding NCBI server.
As expected, Genestrip finds DNA from Borrelia and other agents of tick-borne infections.
If you don't want to generate them yourself, the databases and indexes can also be downloaded from genestrip.it.hs-heilbronn.de.
The `projects` folder provided there corresponds to the state of this project's `projects` folder after the scripts `bin/makedbs.sh` and `bin/matchticks.sh` have run successfully on the RefSeq Release 226.