download_reference.py script can't find the top reference
sralchemab opened this issue · 10 comments
Description of the bug
The script download_reference.py from the FIND_DOWNLOAD_REFERENCE task fails to find the top reference within the provided refseq file.
After confirming that the RefSeq ID exists in the reference file being searched, I checked the script and found an issue in the following lines:
bacass/bin/download_reference.py
Lines 119 to 125 in c81202b
The variable ref_query is made by concatenating assembly_accession and asm_name. However, top_reference only contains the assembly_accession. The issue is easily fixable by updating line 125 to:
if assembly_accession == top_reference:
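The mismatch can be illustrated with a minimal sketch. This is not the code from download_reference.py: the function find_matches, the sample rows, the asm_name values, and the underscore separator used to build ref_query are all assumptions made for illustration. Only the variable names and the accession GCF_001515705.1 come from this report.

```python
import csv
import io

# Hypothetical two-column excerpt of an assembly summary table.
# The asm_name values are placeholders, not real NCBI names.
SAMPLE = (
    "assembly_accession\tasm_name\n"
    "GCF_001515705.1\tExampleAsm1\n"
    "GCF_000000000.1\tExampleAsm2\n"
)


def find_matches(rows, top_reference, use_concatenated_key):
    """Return the rows whose key equals top_reference."""
    matches = []
    for row in rows:
        assembly_accession = row["assembly_accession"]
        asm_name = row["asm_name"]
        # Assumed shape of the concatenated key; the separator is a guess.
        ref_query = f"{assembly_accession}_{asm_name}"
        key = ref_query if use_concatenated_key else assembly_accession
        if key == top_reference:
            matches.append(row)
    return matches


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
top_reference = "GCF_001515705.1"

# Comparing the concatenated key never matches a bare accession.
print(len(find_matches(rows, top_reference, use_concatenated_key=True)))   # 0
# Comparing the accession alone finds the expected row.
print(len(find_matches(rows, top_reference, use_concatenated_key=False)))  # 1
```

With the concatenated key, no row can ever equal a bare accession, which is consistent with the "No assemblies responding to the top reference" message shown in the terminal output.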
For the --ncbi_assembly_metadata option, I used assembly_summary_refseq.txt, which I obtained from the README mentioned in the documentation for the option (--ncbi_assembly_metadata).
For the --kmerfinderdb option, I tried using the Zenodo link in the option documentation, but the file appears to be corrupted (the MD5 checksum matches, but unpacking fails with an error). Because of this, I followed the link to the KmerFinder databases from the same docs and downloaded the file from the first link at the top.
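A matching MD5 only shows that the download is identical to what was uploaded, so a separate readability check on the archive can help tell a bad upload from a bad download. The following is a rough sketch using only the Python standard library; the helper names md5sum and archive_is_readable are hypothetical and not part of the pipeline.

```python
import hashlib
import tarfile
import zlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_readable(path):
    """Return True if every member header of the tar archive can be read.

    This is a basic check: it walks the archive without extracting it.
    """
    try:
        with tarfile.open(path, "r:*") as archive:
            for _member in archive:
                pass
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
    return True
```

For example, calling md5sum("kmerfinder_db.tar.gz") and archive_is_readable("kmerfinder_db.tar.gz") on the downloaded file would show whether the checksum matches while the archive itself still cannot be read.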
One extra issue to bear in mind is that the KMERFINDER task looks for the file ${kmerfinder_db}/bacteria.name, which does not exist under that name in this version of the database. To make it work, the following link has to be created in that same folder:
ln -s bacteria.tax bacteria.name
This is something that could be addressed as well. However, it's easy to work around without having to touch the pipeline.
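The same workaround can be sketched in Python, for cases where the database is prepared from a script. The helper ensure_bacteria_name is hypothetical. It assumes, as described above, that bacteria.tax is present in the database folder and can stand in for bacteria.name.

```python
from pathlib import Path


def ensure_bacteria_name(db_dir):
    """Create bacteria.name as a relative symlink to bacteria.tax if missing.

    Returns True if a link was created, False if bacteria.name already exists.
    """
    db_dir = Path(db_dir)
    name_file = db_dir / "bacteria.name"
    tax_file = db_dir / "bacteria.tax"

    # Leave any existing file or link untouched.
    if name_file.exists() or name_file.is_symlink():
        return False
    if not tax_file.exists():
        raise FileNotFoundError(f"{tax_file} not found; nothing to link to")

    # Relative target, equivalent to `ln -s bacteria.tax bacteria.name`.
    name_file.symlink_to(tax_file.name)
    return True
```

Calling ensure_bacteria_name("./kmerfinder_db/bacteria") after unpacking would have the same effect as the ln -s command, and calling it again is a no-op.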
Additionally, I would recommend slightly increasing the memory requirements for the KMERFINDER task: when running the full pipeline, it failed without any error message. After debugging, I found that it was a memory issue.
Command used and terminal output
# Download NCBI's assemblies summary
wget https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
# Download kmerfinder's DB
wget https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz
# Unpacking the DB and creating the soft link
tar xzf kmerfinder_db.tar.gz
cd kmerfinder_db/bacteria && ln -s bacteria.tax bacteria.name && cd -
# Run the pipeline
nextflow run nf-core/bacass \
-revision '2.3.1' \
-profile docker,test \
--outdir ./results \
--assembler 'unicycler' \
--assembly_type 'short' \
--skip_kmerfinder false \
--kmerfinderdb ./kmerfinder_db/bacteria \
--ncbi_assembly_metadata ./assembly_summary_refseq.txt \
--skip_annotation
...
-[nf-core/bacass] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_BACASS:BACASS:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE (NFCORE_BACASS:BACASS:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE)'
Caused by:
Process `NFCORE_BACASS:BACASS:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE (NFCORE_BACASS:BACASS:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE)` terminated with an error exit status (1)
Command executed:
## Find the common reference genome
find_common_reference.py \
-d reports/ \
-o references_found.tsv
## Download the winner reference genome from the ncbi database
download_reference.py \
-file references_found.tsv \
-reference assembly_summary_refseq.txt \
-out_dir .
cat <<-END_VERSIONS > versions.yml
"NFCORE_BACASS:BACASS:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE":
python: $(python --version | awk '{print $2}')
END_VERSIONS
Command exit status:
1
Command output:
No assemblies responding to the top reference: GCF_001515705.1 were found
Command error:
No assemblies responding to the top reference: GCF_001515705.1 were found
...
Relevant files
System information
Nextflow version : 24.04.3
Hardware : AWS EC2, 16 cores, 128 GB RAM
Executor : local
Container engine : Docker
OS : Amazon Linux 2023
Version of nf-core/bacass : 2.3.1
Hi @sralchemab,
Thanks for this issue and such detailed info.
The variable ref_query is made by concatenating assembly_accession and asm_name. However, top_reference only contains the assembly_accession. The issue is easily fixable by updating line 125 to: if assembly_accession == top_reference:
Regarding this, there is an in-progress PR, here, addressing this error. However, it hasn't been reviewed yet. Since you have been testing KmerFinder, it would be great if you could join the review process. :)
For the --kmerfinderdb, I tried using the Zenodo link on the option documentation, but the file looks like it's corrupted (the MD5 result matches but it throws an error when trying to unpack). Because of this, I followed the link to the Kmerfinder Databases from the same docs and downloaded the link from the top.
Oh... strange, it was working in previous releases. I'll take a look at it.
Additionally, I would recommend maybe increasing a bit the memory requirements for the KMERFINDER task, because when running the full pipeline it was failing without specifying any error message. After debugging it, I found that it was a memory issue.
Absolutely, I have also faced a few memory issues with this step.
Issues:
- Fix kmerfinder script download_reference.py (in progress #154)
- Fix corrupted kmerfinder database available in Zenodo and update params documentation.
- Update kmerfinder db to latest version.
- Increase base memory for Kmerfinder modules
Hi @Daniel-VM ! Thanks for looking into this so fast. I reviewed the PR and approved it. Would you like to add this as part of the documentation as well? Also, for some reason, it looks like my approval is not enough?
Oops, yes, you should request to join the nf-core community by sharing the URL to your GitHub repo in their Slack channel #github-invitations.
Would you like to add this as part of the documentation as well? Also, for some reason, it looks like my approval is not enough?
I think it would be better to merge PR #154 and then create a new one to include all these fixes.
Now that I've realised what the issue was, I approved it with my personal GitHub account, which is a member of nf-core.
Awesome, thanks. I’m currently looking into the corrupted Zenodo file, but feel free to contribute to the develop branch by addressing the issues mentioned above. You can find the contributing guidelines here.
Issues:
- Fix kmerfinder script download_reference.py (in progress: Fix kmerfinder scripts #154)
- Fix corrupted kmerfinder database available in Zenodo and update params documentation.
- Update kmerfinder db to latest version.
- Increase base memory for Kmerfinder modules
Hi @sralchemab, I have made a PR with the fixes above. I hope this will solve the bug.
Thanks, @Daniel-VM ! I'll give it a go tonight.
Thanks, @Daniel-VM, all fixed! However, I found two other unrelated things that I'll report as different issues.