/EukRep

Classification of Eukaryotic and Prokaryotic sequences from metagenomic datasets

Primary LanguagePythonMIT LicenseMIT

EukRep

Classification of Eukaryotic and Prokaryotic sequences from metagenomic datasets

Installation

  • Requires Python3
  • Install with conda (recommended)
$ conda create -y -n eukrep-env -c bioconda scikit-learn==0.19.2 eukrep
  • Install with pip (requires scikit-learn v 0.19.2)
$ pip install EukRep

Example Usage

  • Identify and output sequences predicted to be of eukaryotic origin from a fasta file:
$ EukRep -i <Sequences in Fasta format> -o <Eukaryote sequence output file>
  • Identify and output both sequences of eukaryotic and prokaryotic origin from a fasta file:
$ EukRep -i <Sequences in Fasta format> -o <Eukaryote sequence output file> --prokarya <Prokaryote sequence output file>

Obtaining Eukaryotic Bins

EukRep is intended to be used as one part of a larger pipeline. For obtaining high quality gene predictions and binning identified eukaryotic contigs as described in "Genome-reconstruction for eukaryotes from complex natural microbial communities" (West et al. in review), see methods section https://doi.org/10.1101/171355

-or-

See a provided example workflow (work in progess) https://github.com/patrickwest/EukRep_Pipeline

Adjusting Stringency

The stringency of identifying eukaryotic contigs can be adjusted with -m. The false positive rate (FPR) and false negative rate (FNR) for the strict, balanced, and lenient modes are shown below. Default is balanced. Prior to version 0.6.5, lenient was the default.

20kb

5kb

Data was obtained by running EukRep on 20kb and 5kb fragmented scaffolds from genomes from mock novel phyla.

Important Caveat

In our experience, most metagenomes do not have a eukaryotic genome present; however, EukRep has a false positive rate and you will still receive output in these cases.