Updating Database: Error with kraken2_add_taxID
PeterCx opened this issue · 4 comments
Hi Nick,
I am still having trouble updating my database with custom MAGs. I run snakemake using the following command and config file:
config-update.yaml.txt
snakemake --use-conda --cores 30 --configfile config-update.yaml
I have attached the snakemake log. The issue is around "kraken2_add_taxID".
snakemake_log.txt
First, it reports missing output files (see the snippet from the log below), but this file is actually present. The rule simply unzips the file and places it in the genome directory.
[Sat Jan 28 16:22:48 2023]
rule kraken2_add_taxID:
input: /workspace/pot/peterc/Equine/Struo2/MAGs/ERR6929713_bin.678.fna.gz
output: tmp/db_update_tmp/peterc/Struo2_112566273/db_update/genomes/MAG_1203_Nanosyncoccus.fna
log: /workspace/pot/peterc/Equine/Struo2/Output/logs/db_update/kraken2_add_taxID/MAG_1203_Nanosyncoccus.log
jobid: 459
benchmark: /workspace/pot/peterc/Equine/Struo2/Output/benchmarks/db_update/kraken2_add_taxID/MAG_1203_Nanosyncoccus.txt
reason: Missing output files: tmp/db_update_tmp/peterc/Struo2_112566273/db_update/genomes/MAG_1203_Nanosyncoccus.fna
wildcards: sample=MAG_1203_Nanosyncoccus
resources: tmpdir=/tmp, time=59, mem_gb_pt=6
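One way to confirm whether the output named in the log above really exists where Snakemake looks for it: the output path is relative, so it is resolved against the working directory of the run. A minimal Python sketch, where `check_output` is a hypothetical helper and not part of Struo2 or Snakemake:

```python
from pathlib import Path


def check_output(rel_path, workdir="."):
    """Report whether an expected output file exists and is non-empty under workdir."""
    path = Path(workdir).resolve() / rel_path
    if not path.is_file():
        return f"missing: {path}"
    size = path.stat().st_size
    if size == 0:
        return f"empty: {path}"
    return f"ok: {path} ({size} bytes)"


if __name__ == "__main__":
    # Output path copied from the log above; run this from the directory
    # in which snakemake was started.
    print(check_output(
        "tmp/db_update_tmp/peterc/Struo2_112566273/db_update/genomes/"
        "MAG_1203_Nanosyncoccus.fna"
    ))
```

If this reports the file as missing from the run directory but the file is present somewhere else, the two paths are probably being resolved against different directories.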
Later it is unable to produce the done files as a result, so the run is doomed from the get-go. I have no idea how to solve this, so any help is appreciated. I have attached all other relevant files that may help diagnose the problem.
names.dmp.txt
Sample_Table.txt
Many thanks
Kind regards,
P
What is in the logs for the failed jobs (not the snakemake log)? You are right that the Python script for that job just uncompresses the genome and adds the taxid to the sequence header(s). Is the Python script (kraken2_rename_genome.py) failing?
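To illustrate what that step does, here is a minimal Python sketch of uncompressing a genome and adding a taxid to each header. It is not the actual kraken2_rename_genome.py: `add_taxid` is a hypothetical function, and it assumes the Kraken 2 convention of putting `kraken:taxid|<taxid>` in the sequence ID.

```python
import gzip


def add_taxid(in_gz, out_fna, taxid):
    """Uncompress a gzipped FASTA and append '|kraken:taxid|<taxid>' to each sequence ID."""
    with gzip.open(in_gz, "rt") as src, open(out_fna, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Split the header into the sequence ID and an optional description.
                seq_id, _, desc = line[1:].rstrip("\n").partition(" ")
                header = f">{seq_id}|kraken:taxid|{taxid}"
                if desc:
                    header += " " + desc
                dst.write(header + "\n")
            else:
                # Sequence lines are copied unchanged.
                dst.write(line)
```

Running something like this by hand on one MAG (with a taxid from the sample table) can show whether the input file itself is readable, independent of Snakemake.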
The logs in the directory Struo2/Output/logs/db_update/kraken2_add_taxID are all completely empty. There is one log for each job/MAG that I wanted to add to the database.
However, the file seqid2taxid.map is being generated correctly and has kraken taxids for each fasta header in each file. Also taxo.k2d.tmp is generated.
I re-ran the code and this time it worked. I am not sure why, as I did not do anything different, or at least I can't remember making a change. Thanks for all the help and for making this tool.
Hi Nick,
Sorry to bother you again, but I am encountering further problems with this. As per my last comment, the database seemed to update successfully. However, I am unable to classify my reads using the database. Here is the output from the build; everything seems to have been successful.
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [2.070s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 312273650832 bytes
Capacity estimation complete. [11m56.324s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 18 bits reserved for taxid.
Completed processing of 7746034 sequences, 202613488062 bp
Writing data to disk... complete.
Database files completed. [12h3m44.879s]
Database construction complete. [Total: 12h15m47.164s]
I have made several attempts to classify the reads, some of which have returned the error:
Loading database information..... Killed
Other attempts managed to load the database successfully and classify 50% - 90% of the reads, but failed before finishing with a similar error: "Killed".
It seems to be related to these issues, but there is no solution: https://github.com/DerrickWood/kraken2/issues/184 and https://github.com/DerrickWood/kraken2/issues/84
There must be a problem with the database. Note that I am able to classify without problems using the original database as downloaded from http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release207
It also cannot be a RAM issue. The updated database is only a small bit bigger than the original.
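As a rough cross-check of the memory question, the build log above reports an estimated hash table requirement of 312273650832 bytes. Kraken 2 normally loads the hash table into RAM for classification (unless `--memory-mapping` is used), so it may be worth comparing that figure with the memory actually available to the job. A minimal Python sketch; `available_bytes` is a hypothetical value to be filled in for your own machine or scheduler allocation:

```python
GIB = 1024 ** 3  # bytes per gibibyte


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def fits_in_memory(required_bytes, available_bytes):
    """Return True if the required size does not exceed the available memory."""
    return required_bytes <= available_bytes


# Value copied from "Estimated hash table requirement" in the build log.
hash_table_bytes = 312273650832
print(f"Estimated hash table: {to_gib(hash_table_bytes):.1f} GiB")

# Hypothetical example: a job limited to 256 GiB of RAM.
available_bytes = 256 * GIB
print("Fits in memory:", fits_in_memory(hash_table_bytes, available_bytes))
```

A bare "Killed" message is often what a process prints when the kernel or a job scheduler terminates it for exceeding a memory limit, so this comparison is only a starting point and not a diagnosis.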
Also, without making any changes, I tried to rebuild the database, but I immediately encountered the same errors as in the first comment in this thread.
I am not sure how to proceed. Many thanks
P