fmalmeida/bacannot

Add bakta

fmalmeida opened this issue · 8 comments

Study the best way to implement Bakta in the pipeline.

It would be nice to let users choose Prokka or Bakta as the base annotation, depending on their needs.

Check if it will be possible to add it.

Bakta outputs are extremely similar to Prokka's; however, its annotation is more reliable. Therefore, the addition seems to be very straightforward:

  • Create a module for bakta so users can use either prokka or bakta
  • If using bakta, select the outputs that match the ones produced by prokka and used throughout the pipeline. The rest of the pipeline would then stay exactly the same, using the GFF and TSV from either bakta or prokka
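
A minimal sketch of that output selection, in Python for illustration only (the pipeline itself is not written this way). The function name `annotation_outputs` and the suffix table are hypothetical; the file extensions are assumptions about what each tool writes and should be checked against the installed versions.

```python
from pathlib import Path

# Assumed output extensions per annotator; verify against the tool versions in use.
ANNOTATION_SUFFIXES = {
    "prokka": {"gff": ".gff", "tsv": ".tsv"},
    "bakta": {"gff": ".gff3", "tsv": ".tsv"},
}


def annotation_outputs(tool: str, outdir: str, prefix: str) -> dict:
    """Return the GFF and TSV paths that downstream steps would consume."""
    try:
        suffixes = ANNOTATION_SUFFIXES[tool]
    except KeyError:
        raise ValueError(f"unknown annotator: {tool!r}") from None
    # Same keys for both tools, so downstream code does not care which one ran.
    return {kind: Path(outdir) / f"{prefix}{ext}" for kind, ext in suffixes.items()}
```

Because both tools are mapped to the same keys, later steps can ask for `outputs["gff"]` and `outputs["tsv"]` without knowing which annotator produced them.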

One thing to think about:

  • Bakta depends on a heavy database, so it would not be adequate to put it into the docker image
  • Therefore, to add bakta, the pipeline itself must be reconfigured to have a module that creates all the databases used throughout the pipeline
  • Then, make the pipeline receive a parameter setting the path to this database directory, which would make it easier for users to keep the databases up to date
  • This would also make the docker images contain only the tools, and not the database files, making them smaller and making it possible to use the pipeline with different profiles such as conda, docker or singularity

Recapitulating:

To add bakta it would be necessary to:

  • make the pipeline use tools from conda, docker or singularity with the databases being set in a custom user path
  • create a module to automatically download and format the databases for the pipeline
  • re-configure the pipeline to use the database files from this database directory provided by the user
  • add bakta
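
As a rough illustration of the "database directory provided by the user" idea, the sketch below resolves each required database under one root and fails early if any is missing. It is Python for illustration only; `resolve_databases` and the sub-directory names passed to it are hypothetical and do not reflect the pipeline's actual layout.

```python
from pathlib import Path


def resolve_databases(db_root: str, required: list) -> dict:
    """Map each required database name to its directory under db_root.

    Raises FileNotFoundError listing every missing database, so the user
    sees all problems at once instead of one per run.
    """
    root = Path(db_root)
    missing = [name for name in required if not (root / name).is_dir()]
    if missing:
        raise FileNotFoundError(
            f"missing database directories under {root}: {', '.join(missing)}"
        )
    return {name: root / name for name in required}
```

Checking the whole directory up front keeps the container images free of database files while still catching an incomplete download before any tool starts.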

Now that the pipeline has been restructured, this issue can become a reality.

Since the bakta database is huge, instead of having the pipeline download and format it, users will have to download it themselves, as each system or institute will have its own way to handle such a massive download.

Thus, if users want to annotate with bakta, they will simply have to:

  1. Download the database
  2. Set the path to the bakta database with --bakta_db

When using this parameter, the pipeline should automatically trigger bakta instead of prokka.
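
The switching rule can be sketched as follows, in Python for illustration only. The helper names `choose_annotator` and `build_command` are hypothetical, and the command lines are an assumption about a minimal invocation of each tool, not the pipeline's real module code.

```python
from typing import List, Optional


def choose_annotator(bakta_db: Optional[str]) -> str:
    """Use bakta only when a database path was given; otherwise fall back to prokka."""
    return "bakta" if bakta_db else "prokka"


def build_command(genome: str, outdir: str, prefix: str,
                  bakta_db: Optional[str] = None) -> List[str]:
    """Build an argument list for the selected annotator (assumed minimal flags)."""
    if choose_annotator(bakta_db) == "bakta":
        return ["bakta", "--db", bakta_db, "--output", outdir,
                "--prefix", prefix, genome]
    return ["prokka", "--outdir", outdir, "--prefix", prefix, genome]
```

With this shape, leaving --bakta_db unset keeps the current prokka behaviour, so existing runs are unaffected.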

Finally, after a long time, the workflow is now running properly from top to bottom when using bakta. For release, it is now required to:

  • Update the docs to explain the bakta option. How to use it? What to expect?
  • Update the version in the manifest
  • Update the automatic reports so they recognize whether the user ran prokka or bakta. Check that everything renders well.
  • When using prokka, the automatic report must recognize whether the pipeline ran with additional HMM libraries for prokka, and which ones were used (from the ones available when building the databases).
  • To think about: if using bakta, is there additional parsing of outputs that we can do to give users more information?

Almost ready.

  • requires running at least two annotations to evaluate what the final results look like, so changes can be merged
  • And make sure docs are up to date

Will try to wrap it up in the next 3 days.

Something is wrong with the bakta docker image. When running it, it complains about diamond and fails with a -9 exit code.
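
One possible reading of the -9, assuming it is a subprocess-style return code where a negative value means the process was terminated by that signal number (this is how Python's `subprocess` reports it; the thread does not confirm where the number came from). The helper `describe_exit` is hypothetical and only translates the code into a readable message.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Turn a subprocess-style return code into a short description."""
    if returncode == 0:
        return "success"
    if returncode < 0:
        # Negative values mean "terminated by signal N" on POSIX systems.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"
```

Under that assumption, -9 would correspond to SIGKILL. A common cause of SIGKILL inside containers is hitting a memory limit, but nothing in this thread confirms that this is what happened to diamond here.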

Execution tests are finished. Now building new docker images to check whether scripts and reports are properly updated so the release can be made.

Finally done 🥳