leylabmpi/Struo2

TaxIDs and integration of prokaryotic and eukaryotic information


Hello,

I've been exploring Struo2 and found some pretty cool improvements over Struo1. However, since the pipeline is designed essentially for prokaryotes, I wonder whether it could also be applied straightforwardly to eukaryotic genomes. In the past, I tried to integrate prokaryotic and eukaryotic genome information for marine plankton communities, but I ended up getting a bunch of errors. Could you suggest a way to bypass the need for a GTDB taxonomy and instead run Struo2 using NCBI taxonomy information? My plan is as follows:

  1. Run Struo2 on a set of prokaryotic genomes to generate a HUMAnN3-compatible database.
  2. Update the database with eukaryotic gene sequences predicted via BUSCO+Augustus.

Any hints on how to best achieve this using Struo2?

Any feedback would be greatly appreciated!

The main challenges are integrating the gene data generated via BUSCO+Augustus and creating a hybrid taxonomy. https://github.com/nick-youngblut/gtdb_to_taxdump may help with the taxonomy. I don't have experience with BUSCO or Augustus, so I'd have to check whether their output can be formatted to conform to the existing pipeline.

Creating a hybrid Kraken database isn't so hard, since it does not require gene calling. One just needs to provide all genomes (bacteria, eukaryotes, etc.) and a complete taxonomy (taxdump) that covers every one of them.
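As a rough illustration of the "all genomes plus a complete taxdump" requirement, here is a minimal Python sketch. It is not part of Struo2. It assumes the standard NCBI taxdump layout for `nodes.dmp` (pipe-delimited fields, taxid first) and the Kraken2 convention of embedding `kraken:taxid|<taxid>` in the sequence ID of custom genomes. The helper names `read_taxids` and `tag_fasta` are hypothetical.

```python
import io


def read_taxids(nodes_dmp):
    """Collect every taxid listed in a taxdump nodes.dmp file handle."""
    taxids = set()
    for line in nodes_dmp:
        line = line.strip()
        if not line:
            continue
        # nodes.dmp fields are separated by "\t|\t"; the taxid is the first field
        taxids.add(int(line.split("|")[0].strip()))
    return taxids


def tag_fasta(fasta, taxid, known_taxids):
    """Yield FASTA lines with '|kraken:taxid|<taxid>' appended to each sequence ID.

    Raises ValueError if the taxid is missing from the taxonomy, since a genome
    without a node in the taxdump cannot be placed in the hybrid database.
    """
    if taxid not in known_taxids:
        raise ValueError(f"taxid {taxid} is not present in nodes.dmp")
    for line in fasta:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the header into sequence ID and optional description
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            desc = " " + parts[1] if len(parts) > 1 else ""
            yield f">{seq_id}|kraken:taxid|{taxid}{desc}"
        else:
            yield line


if __name__ == "__main__":
    # Toy taxonomy: root, Bacteria (2), Eukaryota (2759)
    nodes = io.StringIO(
        "1\t|\t1\t|\tno rank\t|\n"
        "2\t|\t1\t|\tsuperkingdom\t|\n"
        "2759\t|\t1\t|\tsuperkingdom\t|\n"
    )
    known = read_taxids(nodes)
    genome = io.StringIO(">contig1 example eukaryotic contig\nACGT\n")
    for out_line in tag_fasta(genome, 2759, known):
        print(out_line)
```

In practice one would run this over each genome FASTA (prokaryotic and eukaryotic) with its taxid from the hybrid taxdump before adding the files to the database build, so that missing taxonomy nodes surface early rather than during the build.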