shandley/hecatomb

Error downloading databases

Closed this issue · 5 comments

I followed the directions for generating the databases and ran into an error. After decompressing the database tar file, I ran the following command: `snakemake --configfile snakemake/config/sample_config.yaml --snakefile snakemake/workflow/download_databases.smk --cores 8` and got the output below. My snakemake version is 5.26.1.

Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job counts:
	count	jobs
	1	all
	1	cluster_uniprot
	1	download_id_taxonomy_mapping
	1	download_ncbi_taxonomy
	1	download_uniprot_viruses
	1	download_uniref50
	1	extract_ncbi_taxonomy
	1	line_sine_download
	1	make_bac_databases
	1	make_host_databases
	1	mmseqs_uniprot_clusters
	1	mmseqs_uniprot_taxdb
	1	mmseqs_urv
	1	mmseqs_urv_taxonomy
	1	uniprot_to_ncbi_mapping
	1	uniref_plus_viruses
	16

[Thu Oct  8 18:24:54 2020]
rule download_uniref50:
    output: databases/proteins/uniref50.fasta.gz
    jobid: 16


[Thu Oct  8 18:24:54 2020]
rule download_id_taxonomy_mapping:
    output: databases/taxonomy/idmapping.dat.gz
    jobid: 9


[Thu Oct  8 18:24:54 2020]
rule download_ncbi_taxonomy:
    output: databases/taxonomy/taxdump.tar.gz
    jobid: 14


[Thu Oct  8 18:24:54 2020]
rule make_bac_databases:
    input: databases/bac_giant_unique_species/bac_uniquespecies_giant.masked_Ns_removed.fasta
    output: databases/bac_giant_unique_species/ref
    jobid: 1
    resources: time_min=240, mem_mb=100000, cpus=16


[Thu Oct  8 18:24:54 2020]
rule download_uniprot_viruses:
    output: databases/proteins/uniprot_virus.faa
    jobid: 4


[Thu Oct  8 18:24:54 2020]
rule make_host_databases:
    input: databases/human_masked/human_virus_masked.fasta
    output: databases/human_masked/ref
    jobid: 2
    resources: time_min=240, mem_mb=100000, cpus=16


[Thu Oct  8 18:24:54 2020]
rule line_sine_download:
    output: databases/contaminants/line_sine.fasta
    jobid: 3

[Thu Oct  8 18:24:54 2020]
[Thu Oct  8 18:24:54 2020]
Error in rule download_id_taxonomy_mapping:
Error in rule download_uniprot_viruses:
    jobid: 9
    jobid: 4
    output: databases/taxonomy/idmapping.dat.gz
    output: databases/proteins/uniprot_virus.faa
    shell:
        
        cd databases/taxonomy;
        curl -LO "https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz"
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    shell:
        
        mkdir -p databases/proteins && curl -Lgo databases/proteins/uniprot_virus.faa "https://www.uniprot.org/uniprot/?query=taxonomy:%22Viruses%20[10239]%22&format=fasta&&sort=score&fil=reviewed:no"
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)


[Thu Oct  8 18:24:55 2020]
Error in rule line_sine_download:
    jobid: 3
    output: databases/contaminants/line_sine.fasta
    shell:
        
        (curl -L http://sines.eimb.ru/banks/SINEs.bnk &&                 curl -L http://sines.eimb.ru/banks/LINEs.bnk)                 | sed -e '/^>/ s/ /_/g' | seqtk rename                 > databases/contaminants/line_sine.fasta
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job line_sine_download since they might be corrupted:
databases/contaminants/line_sine.fasta
[Thu Oct  8 18:24:57 2020]
Finished job 14.
1 of 16 steps (6%) done
[Thu Oct  8 18:29:42 2020]
Finished job 2.
2 of 16 steps (12%) done
[Thu Oct  8 18:30:36 2020]
Finished job 1.
3 of 16 steps (19%) done
[Thu Oct  8 18:40:49 2020]
Finished job 16.
4 of 16 steps (25%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/pathogen1/stahan/hecatomb/.snakemake/log/2020-10-08T182453.577588.snakemake.log

I think they are all the same issue (perhaps curl is missing?).

Can you take a look in /mnt/pathogen1/stahan/hecatomb/.snakemake/log/2020-10-08T182453.577588.snakemake.log for line_sine_download and see whether it gives you more information about the error?
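For anyone following along, one way to pull a single rule's error block out of a long snakemake log is sketched below. `show_rule_error` is a hypothetical helper (not part of snakemake or hecatomb), and the 12 lines of trailing context are only a guess based on the size of the error blocks in the output above.

```shell
#!/usr/bin/env bash
# Print the error block for one rule from a snakemake log.
# show_rule_error is a hypothetical helper name used for illustration only.
show_rule_error() {
    local log_file=$1
    local rule_name=$2
    # -n prefixes line numbers; -A 12 keeps the 12 lines after each match
    # (an assumed context size, adjust as needed).
    grep -n -A 12 "Error in rule ${rule_name}:" "$log_file"
}

# Usage with the log path from this thread (adjust to your own run):
# show_rule_error /mnt/pathogen1/stahan/hecatomb/.snakemake/log/2020-10-08T182453.577588.snakemake.log line_sine_download
```

The function returns a non-zero status when the rule has no error entry in the log, because `grep` finds no match.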

Additionally, what does curl --version return?

curl --version returns:

curl 7.29.0 (x86_64-redhat-linux-gnu) libcurl/7.29.0 NSS/3.44 zlib/1.2.7 libidn/1.28 libssh2/1.8.0
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz unix-sockets

I was able to fix the error in rule line_sine_download by installing seqtk to my environment.
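This fits the "bash strict mode" note in the log: the `curl` stages of that rule can succeed while the final `seqtk rename` stage is missing, and the whole job still fails. A minimal sketch of that behaviour follows, assuming strict mode means `set -euo pipefail`; `run_strict` and `no-such-tool-xyz` are made-up names, the latter standing in for a missing `seqtk`.

```shell
#!/usr/bin/env bash
# Run a command string in a child bash with strict mode switched on.
# run_strict is a hypothetical helper; strict mode is assumed to be
# "set -euo pipefail".
run_strict() {
    bash -c 'set -euo pipefail; '"$1"
}

# printf and sed both exist, but the last stage of the pipeline does not,
# so bash reports "command not found" and the pipeline exits with status 127.
status=0
run_strict "printf '>seq 1\nACGT\n' | sed -e '/^>/ s/ /_/g' | no-such-tool-xyz > /dev/null" || status=$?
echo "pipeline with a missing last stage exited with status: $status"

# With pipefail, a failure in an earlier stage also fails the pipeline,
# even though the last stage (cat) succeeds.
early_status=0
run_strict "false | cat" || early_status=$?
echo "pipeline with a failing first stage exited with status: $early_status"
```

Snakemake only reports that one of the commands exited with a non-zero code, so the log alone does not say which stage of the pipeline was at fault.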

Fixed the issue by installing curl, cd-hit, mmseqs and seqtk to my environment.
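A quick pre-flight check along these lines can confirm that the four tools named above are on `PATH` before rerunning the workflow. This is only a sketch: `check_tools` is a hypothetical helper, and the executable names (`cd-hit`, `mmseqs`) are assumed to match how the tools are installed on your system.

```shell
#!/usr/bin/env bash
# Report which command-line tools are available on PATH.
# check_tools is a hypothetical helper; it returns 1 if any tool is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'found    %s -> %s\n' "$tool" "$(command -v "$tool")"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tool list taken from this thread.
check_tools curl cd-hit mmseqs seqtk || echo "Install the missing tools before rerunning snakemake."
```

If conda is in use, something like `conda install -c conda-forge -c bioconda curl cd-hit mmseqs2 seqtk` should provide them; the channel and package names here are my assumption and are not taken from the hecatomb documentation.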

Reopening ... Rob to add those to the conda environment.

Mostly irrelevant with the newest version.