nf-core/bacass

download_reference.py script can't find the top reference

sralchemab opened this issue · 10 comments

Description of the bug

The script download_reference.py from the FIND_DOWNLOAD_REFERENCE task fails to find the top reference within the provided refseq file.

After confirming that the RefSeq ID is present in the reference file being searched, I checked the script and found the problem in the following lines:

# Construct the ref_query using assembly_accession and asm_name
assembly_accession = row[0]
asm_name = row[15]
ref_query = f"{assembly_accession}_{asm_name}"
# Check if ref_query matches the search value
if ref_query == top_reference:

The variable ref_query is made by concatenating assembly_accession and asm_name. However, top_reference only contains the assembly_accession. The issue is easily fixable by updating line 125 to:

            if assembly_accession == top_reference:
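To illustrate the proposed comparison outside the pipeline, here is a minimal, self-contained sketch. It is not the actual download_reference.py: the function name find_reference_rows and the asm_name values in the sample are made up for illustration, and only the column indices (0 and 15) and the accession GCF_001515705.1 come from this report.

```python
import csv
import io

ACCESSION_COL = 0   # assembly_accession, as in the snippet above
ASM_NAME_COL = 15   # asm_name, as in the snippet above


def find_reference_rows(handle, top_reference):
    """Return all rows whose assembly_accession equals top_reference."""
    # QUOTE_NONE: treat the file as plain tab-separated text
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    matches = []
    for row in reader:
        # Skip blank lines and '#'-prefixed header/comment lines
        if not row or row[0].startswith("#"):
            continue
        if row[ACCESSION_COL] == top_reference:
            matches.append(row)
    return matches


def _make_row(accession, asm_name):
    """Build a 16-column tab-separated line (hypothetical sample data)."""
    row = [""] * 16
    row[ACCESSION_COL] = accession
    row[ASM_NAME_COL] = asm_name
    return "\t".join(row)


# Tiny in-memory stand-in for assembly_summary_refseq.txt
sample = "\n".join([
    "# header line",
    _make_row("GCF_000000000.1", "ASM_hypothetical_A"),
    _make_row("GCF_001515705.1", "ASM_hypothetical_B"),
]) + "\n"

hits = find_reference_rows(io.StringIO(sample), "GCF_001515705.1")
print(len(hits), hits[0][ASM_NAME_COL])
```

Matching on the accession alone finds the row, whereas comparing against the concatenated accession and asm_name can never equal a bare accession such as GCF_001515705.1.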

For the --ncbi_assembly_metadata option, I used assembly_summary_refseq.txt, which I obtained by following the README mentioned in the documentation for that option.

For --kmerfinderdb, I tried the Zenodo link in the option's documentation, but the file appears to be corrupted (the MD5 checksum matches, but unpacking throws an error). Because of this, I followed the link to the KmerFinder databases from the same docs and downloaded the first link listed.
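For anyone who wants to reproduce the "checksum matches but unpacking fails" situation, the two checks can be run separately. This is a rough sketch using only the Python standard library; the function names md5sum and archive_is_readable are my own, and it assumes the database is a tar archive (optionally compressed).

```python
import hashlib
import tarfile
import zlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_readable(path):
    """Return True if every regular file in the tar archive can be read."""
    try:
        # "r:*" lets tarfile detect the compression transparently
        with tarfile.open(path, "r:*") as tar:
            for member in tar:
                if member.isfile():
                    extracted = tar.extractfile(member)
                    # Read to the end so decompression errors surface here
                    while extracted.read(1 << 20):
                        pass
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
    return True
```

A matching MD5 only shows that the download is identical to the uploaded file, so a file that passes md5sum but fails archive_is_readable was most likely already broken when it was uploaded.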

One more thing to bear in mind: the KMERFINDER task looks for the file ${kmerfinder_db}/bacteria.name, which does not exist under that name in this version of the database. To make it work, the following link has to be created in that same folder:

ln -s bacteria.tax bacteria.name

This could be addressed in the pipeline as well, although it is easy to work around without touching the pipeline.
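If the workaround needs to be scripted (for example in a setup step that may run more than once), the same link can be created idempotently. This is only a sketch: ensure_name_link is a hypothetical helper, not part of the pipeline, and it simply mirrors the ln -s command above.

```python
from pathlib import Path


def ensure_name_link(db_dir, source="bacteria.tax", link="bacteria.name"):
    """Create db_dir/link -> source if missing; return True if it was created."""
    db_dir = Path(db_dir)
    target = db_dir / source
    link_path = db_dir / link
    # Leave an existing file or link (even a dangling one) untouched
    if link_path.exists() or link_path.is_symlink():
        return False
    if not target.is_file():
        raise FileNotFoundError(f"{target} not found; cannot create {link_path}")
    # Relative target, equivalent to running `ln -s bacteria.tax bacteria.name`
    # from inside db_dir
    link_path.symlink_to(source)
    return True
```

With the layout from this report, the call would be ensure_name_link("kmerfinder_db/bacteria").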

Additionally, I would recommend slightly increasing the memory requirements for the KMERFINDER task: when running the full pipeline, it failed without any error message. After debugging, I found that it was a memory issue.

Command used and terminal output

# Download NCBI's assemblies summary
wget https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt

# Download kmerfinder's DB
wget https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz

# Unpacking the DB and creating the soft link
tar xzf kmerfinder_db.tar.gz
cd kmerfinder_db/bacteria && ln -s bacteria.tax bacteria.name && cd -

# Run the pipeline
nextflow run nf-core/bacass \
    -revision '2.3.1' \
    -profile docker,test \
    --outdir ./results \
    --assembler 'unicycler' \
    --assembly_type 'short' \
    --skip_kmerfinder false \
    --kmerfinderdb ./kmerfinder_db/bacteria \
    --ncbi_assembly_metadata ./assembly_summary_refseq.txt \
    --skip_annotation
...
-[nf-core/bacass] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_BACASS:BACASS:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE (NFCORE_BACASS:BACASS:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE)'

Caused by:
  Process `NFCORE_BACASS:BACASS:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE (NFCORE_BACASS:BACASS:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE)` terminated with an error exit status (1)


Command executed:

  ## Find the common reference genome
  find_common_reference.py \
      -d reports/ \
      -o references_found.tsv

  ## Download the winner reference genome from the ncbi database
  download_reference.py \
      -file references_found.tsv \
      -reference assembly_summary_refseq.txt \
      -out_dir .

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_BACASS:BACASS:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE":
      python: $(python --version | awk '{print $2}')
  END_VERSIONS

Command exit status:
  1

Command output:
  No assemblies responding to the top reference:  GCF_001515705.1  were found

Command error:
  No assemblies responding to the top reference:  GCF_001515705.1  were found
...

Relevant files

nextflow.log

System information

         Nextflow version : 24.04.3
                 Hardware : AWS EC2, 16 cores, 128 GB RAM
                 Executor : local
         Container engine : Docker
                       OS : Amazon Linux 2023
Version of nf-core/bacass : 2.3.1

Hi @sralchemab,

Thanks for this issue and such detailed info.

> The variable ref_query is made by concatenating assembly_accession and asm_name. However, top_reference only contains the assembly_accession. The issue is easily fixable by updating line 125 to:
>
>             if assembly_accession == top_reference:

Regarding this, there is an in-progress PR addressing this error. However, it hasn't been reviewed yet. Since you have been testing KmerFinder, it would be great if you could join the review process. :)

> For --kmerfinderdb, I tried the Zenodo link in the option's documentation, but the file appears to be corrupted (the MD5 checksum matches, but unpacking throws an error). Because of this, I followed the link to the KmerFinder databases from the same docs and downloaded the first link listed.

Oh... strange, it was working in previous releases. I'll take a look at it.

> Additionally, I would recommend slightly increasing the memory requirements for the KMERFINDER task: when running the full pipeline, it failed without any error message. After debugging, I found that it was a memory issue.

Absolutely, I have also run into a few memory issues with this step.

Issues:

  • Fix kmerfinder script download_reference.py (in progress, #154)
  • Fix corrupted kmerfinder database available on Zenodo and update params documentation.
  • Update kmerfinder db to latest version.
  • Increase base memory for kmerfinder modules.

Hi @Daniel-VM ! Thanks for looking into this so fast. I reviewed the PR and approved it. Would you like to add this as part of the documentation as well? Also, for some reason, it looks like my approval is not enough?

Oops, yes, you should request to join the nf-core community by sharing the URL to your GitHub repo in their Slack channel #github-invitations.

> Would you like to add this as part of the documentation as well? Also, for some reason, it looks like my approval is not enough?

I think it would be better to merge PR #154 and then create a new one to include all these fixes.

Now that I understand what the issue was, I have approved it with my personal GitHub account, which is a member of nf-core.

Awesome, thanks. I’m currently looking into the corrupted Zenodo file, but feel free to contribute to the develop branch by addressing the issues mentioned above. You can find the contributing guidelines here.

Issues:

  • Fix kmerfinder script download_reference.py (in progress: Fix kmerfinder scripts #154)
  • Fix corrupted kmerfinder database available on Zenodo and update params documentation.
  • Update kmerfinder db to latest version.
  • Increase base memory for kmerfinder modules.

Hi @sralchemab , I have made a PR with the fixes above. I hope this will solve the bug.

Thanks, @Daniel-VM ! I'll give it a go tonight.

Thanks, @Daniel-VM, all fixed! However, I found two other unrelated things that I'll report as different issues.