leylabmpi/Struo2

kraken2-build error when creating sequence ID to taxonomy ID map

Opened this issue · 9 comments

Hi,

I'm near the end of the Struo2 pipeline trying to create a custom kraken2 database using gtdb r207.

I've hit a wall though at the kraken2-build command, specifically one spot within the build_kraken2_db.sh script that the command calls. It seems that this section:

echo "Creating sequence ID to taxonomy ID map (step 1)..."
if [ -d "library/added" ]; then
  find library/added/ -name 'prelim_map_*.txt' | xargs cat > library/added/prelim_map.txt
fi
seqid2taxid_map_file=seqid2taxid.map
if [ -e "$seqid2taxid_map_file" ]; then
  echo "Sequence ID to taxonomy ID map already present, skipping map creation."
else
  step_time=$(get_current_time)
  find library/ -maxdepth 2 -name prelim_map.txt | xargs cat > taxonomy/prelim_map.txt
  if [ ! -s "taxonomy/prelim_map.txt" ]; then
    echo "No preliminary seqid/taxid mapping files found, aborting."
    exit 1
  fi
  grep "^TAXID" taxonomy/prelim_map.txt | cut -f 2- > $seqid2taxid_map_file.tmp || true
  if grep "^ACCNUM" taxonomy/prelim_map.txt | cut -f 2- > accmap_file.tmp; then
    if compgen -G "taxonomy/*.accession2taxid" > /dev/null; then
      lookup_accession_numbers accmap_file.tmp taxonomy/*.accession2taxid > seqid2taxid_acc.tmp
      cat seqid2taxid_acc.tmp >> $seqid2taxid_map_file.tmp
      rm seqid2taxid_acc.tmp
    else
      echo "Accession to taxid map files are required to build this DB."
      echo "Run 'kraken2-build --db $KRAKEN2_DB_NAME --download-taxonomy' again?"
      exit 1
    fi
  fi
  rm -f accmap_file.tmp
  finalize_file $seqid2taxid_map_file
  echo "Sequence ID to taxonomy ID map complete. [$(report_time_elapsed $step_time)]"
fi

Produces the error messages:

Accession to taxid map files are required to build this DB.
Run 'kraken2-build --db $KRAKEN2_DB_NAME --download-taxonomy again?

When I run through this section line by line myself, everything is fine until lookup_accession_numbers accmap_file.tmp taxonomy/*.accession2taxid > seqid2taxid_acc.tmp, at which point I get the error: Found 0/1363031 targets...lookup_accession_numbers: unable to open taxonomy/*.accession2taxid: No such file or directory

My ./taxonomy/ directory only contains the following:

-rw-r--r--+ 1  names.dmp
-rw-r--r--+ 1  nodes.dmp
drwxr-sr-x+ 2  .
-rw-r--r--+ 1  prelim_map.txt
drwxr-sr-x+ 5  ..

Should there be accession2taxid files in here? If so, when should they have been generated?
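A quick way to see why the script reaches the accession lookup at all is to count the record kinds in taxonomy/prelim_map.txt. This is a minimal sketch, assuming the file is tab-separated with TAXID or ACCNUM in the first column (which is what the grep and cut calls in the quoted script imply); the function name count_prelim_map_kinds is hypothetical and not part of kraken2 or Struo2.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kraken2 or Struo2).
# Counts TAXID vs ACCNUM records in a prelim_map.txt file.
# Assumption: tab-separated records with the record kind in column 1.
count_prelim_map_kinds() {
  local map_file="$1"
  awk -F'\t' '
    $1 == "TAXID"  { taxid++ }    # taxid is given directly in the record
    $1 == "ACCNUM" { accnum++ }   # taxid must be looked up by accession
    END { printf "TAXID=%d ACCNUM=%d\n", taxid, accnum }
  ' "$map_file"
}
```

Calling count_prelim_map_kinds taxonomy/prelim_map.txt prints both counts on one line. Only the ACCNUM records are passed to lookup_accession_numbers, so a non-zero ACCNUM count would suggest that some sequence headers did not carry a taxid the build could use directly.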

Happy to post on the kraken2 GitHub if that is more appropriate, but I figured this may be something that should have been generated elsewhere in the Struo2 pipeline.

Any help much appreciated, thanks!

hmm... an accession2taxid file shouldn't be needed, unless that recently changed. Can you please try just creating an empty accession2taxid file in the appropriate directory?

Thanks for the quick suggestion, no luck unfortunately.

Creating a blank file named .accession2taxid or accession2taxid gives the same error, and giving the file a name like 1.accession2taxid, test.accession2taxid, blank.accession2taxid, etc. just produces: lookup_accession_numbers: unable to mmap taxonomy/1.accession2taxid: Invalid argument

I’m on vacation this week, but I’ll have a look at the problem ASAP

No worries, thanks! Enjoy your vacation.

@joshsimcock I haven't been able to reproduce this issue. Can you provide more info, such as:

  • the version of snakemake that you are using
  • the versions of kraken2 & bracken in the conda env that is used by snakemake (in the .snakemake/conda/ directory)

FYI: I'm working on creating Kraken2 & Bracken databases for Release 207 (followed later by the humann3 databases). They should be complete by the end of the week.

@nick-youngblut sorry for the long delay in replying.

snakemake = 7.6.2
kraken2 = 2.1.2
bracken = 2.5

Thanks for uploading the 207 release! Saves me a lot of time. If you can figure out what happened here great, but there is no rush as I can use your r207 builds for now thanks!

I encountered the same problem and, after much time and effort, resolved it by running "chmod +x n*.dmp". I suspect the problem is that names.dmp and nodes.dmp cannot be read, since the files in your ./taxonomy/ directory also have "-rw-r--r--" permissions.

After solving the above problem, I encountered another problem with the same warning. I found that the headers in my library .fna files contain many spaces, which may prevent the kraken taxid from being read; the affected entries were reported in accmap_file.tmp.
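One way to look for such headers is sketched below. It assumes, based on my reading of the Kraken 2 custom-database instructions, that the taxid tag has the form kraken:taxid|<number> and must sit in the sequence ID, meaning the header text before the first whitespace. The function name list_untagged_headers is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kraken2 or Struo2).
# Prints FASTA header lines whose sequence ID (the text between ">" and the
# first whitespace) has no kraken:taxid|<number> tag.
# Assumption: a tag that appears only after a space is not picked up.
list_untagged_headers() {
  local fasta="$1"
  awk '
    /^>/ {
      seqid = substr($1, 2)   # first whitespace-delimited token, minus ">"
      if (seqid !~ /kraken:taxid\|[0-9]+/) print $0
    }
  ' "$fasta"
}
```

Running list_untagged_headers on each library .fna file lists the headers that would probably fall back to an accession lookup. An empty result means every sequence ID carries a tag in the assumed form.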