nextstrain/ncov

one of the commands exited with non-zero exit code

Closed this issue · 6 comments

[Wed Feb 19 21:48:25 2020]
Finished job 13.
1 of 17 steps (6%) done
ERROR: Problem reading in data/sequences.fasta:
Duplicate key 'Italy/INMI1/2020'
[Wed Feb 19 21:48:26 2020]
Error in rule filter:
jobid: 17
output: results/filtered.fasta
shell:

    augur filter             --sequences data/sequences.fasta             --metadata data/metadata.tsv             --include config/include.txt             --exclude config/exclude.txt             --min-length 15000             --output results/filtered.fasta
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

When I put my sequences.fasta into the data directory, this error was raised.
I have tried many fixes, but they all failed.
I don't know whether the problem comes from my sequences.fasta, but I arranged the data as described in the README.

Hi @hackerchenzhuo, from what you've posted, it seems the problem is that you have two copies of the sequence 'Italy/INMI1/2020' in your sequences.fasta file (or two different sequences that are both named 'Italy/INMI1/2020').

If you search for this name in the sequences file you should be able to find where both of them are, and then you can remove one (if they are duplicates) or rename one (if you have accidentally put the wrong name to a sequence). This should clear the error.
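As a rough illustration of that search, here is a minimal Python sketch that lists every strain name appearing on more than one header line. The function name `duplicate_names` is hypothetical, and the sketch assumes the strain name is the header text before the first `|`; it is not how augur itself performs the check.

```python
from collections import Counter


def duplicate_names(fasta_lines):
    """Return the strain names that occur on more than one FASTA header line."""
    counts = Counter(
        # Assumption: the strain name is the header text before the first "|".
        line[1:].split("|")[0].strip()
        for line in fasta_lines
        if line.startswith(">")
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Usage against the file from the log above:
# with open("data/sequences.fasta") as handle:
#     print(duplicate_names(handle))
```

Each name it reports still needs a manual look to decide whether to remove or rename one of the records.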

I hope that helps!

Hi @emmahodcroft, while this problem is due to duplicate strain names in the data provided by GISAID, is it possible that a change in augur could help prevent it? Both of the Italy/INMI1/2020 sequences are very short, under 400 nt. If they made it past this duplicate key error, they would be dropped by augur's size filter anyway and would not make it into further processing. Would it make sense to apply the size filter when loading the sequences.fasta file, before any duplicate-checking? Alternatively, if the trailing "|EPI_ISL_[id]|datestamp" text in the header lines of the GISAID-distributed fasta file containing all sequences were kept instead of stripped out, the two Italy/INMI1/2020 sequences could both be loaded. This also happened with two French samples before they were renamed on GISAID.

@emmahodcroft thank you very much!
I got it working!

good idea

Hi @brianpardy, so while in this case the length filter would drop both of the Italy sequences, in a more general setting duplicate sequences may not be dropped, and this can cause a lot of problems downstream (we actually didn't use to have this check, so I'm unfortunately familiar with the mess this can cause!).
We never know (generally) what's causing this duplication, so it's important that the user check and figure out what's wrong.

In the context of the SARS-CoV-2 build, I recognise it's a bit frustrating when the sequences arrive this way from GISAID and we know in advance there are two with the same strain name. What I'd recommend is writing a short script that can be run on the sequences file, which will do a 'de-duplication' step before the main snakemake pipeline runs. This is essentially what we do 'behind-the-scenes' for the main Nextstrain build to generate our own 'sequences.fasta' file.
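A minimal sketch of what such a de-duplication step could look like is below. It is not the script used for the main Nextstrain build: the helper names `read_fasta` and `deduplicate` are hypothetical, and keeping the first record for each name (with the name taken as the header text before the first `|`) is an assumption chosen for simplicity.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def deduplicate(records):
    """Keep the first record for each strain name.

    Returns (kept, dropped_names) so the dropped names can be reviewed.
    """
    seen = set()
    kept, dropped = [], []
    for header, seq in records:
        # Assumption: the strain name is the header text before the first "|".
        name = header.split("|")[0].strip()
        if name in seen:
            dropped.append(name)
            continue
        seen.add(name)
        kept.append((name, seq))
    return kept, dropped


# Usage (file names are illustrative):
# with open("raw_sequences.fasta") as src:
#     kept, dropped = deduplicate(read_fasta(src))
# with open("data/sequences.fasta", "w") as dst:
#     for name, seq in kept:
#         dst.write(f">{name}\n{seq}\n")
# print("Dropped duplicate names:", dropped)
```

Returning the dropped names keeps the step auditable, since the cause of a duplicate is generally unknown and is still worth checking by hand.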

Thanks @emmahodcroft, that makes perfect sense as to why a filter-first change would not generalize. A custom de-duplication step is basically what I have been doing on my side and it works well.