nextstrain/ncov

Deduplication metadata fails

mwelkers opened this issue · 6 comments

I run sanitize_metadata.py as part of the workflow on a freshly downloaded GISAID metadata set.

When I subsequently run any augur filter command, it always fails with the following error:

ERROR: The following strains are duplicated in '/data/home/ncov/data/metadata_gisaid.tsv.gz':
Spain/AS-242164052/2021

I also removed every entry from Spain with sed, but then the same error comes up for other sequences.

python3 ${scripts_dir}/sanitize_metadata.py \
		--metadata ${data_dir}/metadata.tsv \
		--rename-fields 'Submission date=submission_date' 'Clade=nextstrain_clade' 'Virus name=strain' 'Accession ID=gisaid_epi_isl' 'Pango lineage=pango_lineage' 'Sequence length=sequence_length' 'Collection date=date' \
		--database-id-columns "Accession ID"  gisaid_epi_isl  \
		--parse-location-field Location \
		--strip-prefixes "hCoV-19/" \
		--output ${data_dir}/metadata_gisaid.tsv.gz

augur filter \
	--sequence-index ${data_dir}/sequences_gisaid_index.tsv.gz \
	--metadata ${data_dir}/metadata_gisaid.tsv.gz \
	--query "(country == 'Netherlands') & (division == 'Noord-Holland') & (Host=='Human')"  \
	--min-date ${last_three_months} \
	--max-date ${current_date} \
	--exclude-ambiguous-dates-by any \
	--subsample-max-sequences 1600 \
	--group-by year month \
	--output-strains ${data_dir}/subsampled_1600_strains_Noord-Holland_from_${last_three_months}_until_${current_date}.txt

The entries in the original GISAID metadata file are:

hCoV-19/Spain/AS-242164052/2021         Original        betacoronavirus EPI_ISL_1701313 2021-04-06      Europe / Spain / Asturias               29829  Human    48      Female  GR      P.1.15  PANGO-v1.18     VOC Gamma GR/501Y.V3 (P.1+P.1.*) first detected in Brazil/Japan (Spike_R190S,NSP6_S106del,Spike_H655Y,Spike_E484K,N_R203K,Spike_K417T,Spike_T1027I,NSP1_H110Y,Spike_D138Y,Spike_N501Y,NS3_S253P,NSP9_P83L,Spike_L18F,N_P80R,Spike_T20N,NSP3_K977Q,NSP6_G107del,NSP6_F108del,NSP3_S370L,N_G204R,NS8_E92K,Spike_P26S,NSP12_P323L,Spike_D614G,Spike_V1176F,NSP13_E341D)    2021-04-23              True    True   0.379664085286

hCoV-19/Spain/AS- 242164052/2021                Original        betacoronavirus EPI_ISL_1913066 2021-04-06      Europe / Spain / Asturias              29817    Human   48      Female  GR      P.1.15  PANGO-v1.18     VOC Gamma GR/501Y.V3 (P.1+P.1.*) first detected in Brazil/Japan (Spike_R190S,NSP6_S106del,Spike_H655Y,Spike_E484K,N_R203K,Spike_K417T,Spike_T1027I,NSP1_H110Y,Spike_D138Y,Spike_N501Y,NS3_S253P,NSP9_P83L,Spike_L18F,N_P80R,Spike_T20N,NSP3_K977Q,NSP6_G107del,NSP6_F108del,NSP3_S370L,N_G204R,NS8_E92K,Spike_P26S,NSP12_P323L,Spike_D614G,Spike_V1176F,NSP13_E341D)    2021-05-05              True   True                     0.379716269242

Notice the space in the strain name of the second entry.

Looking at sanitize_metadata.py (line 472):

        # Replace whitespaces from strain names with nothing to match Nextstrain's
        # convention since whitespaces are not allowed in FASTA record names.
        metadata[strain_field] = metadata[strain_field].str.replace(" ", "")

It removes whitespace here and then, almost directly afterwards, writes the result to the metadata file (line 484), which creates a duplicate strain in the output. So it looks like the duplicates aren't removed: from the documentation I understood that only the last (highest?) GISAID ID should be kept, but the output keeps multiple GISAID IDs for the same strain name. Or am I doing something wrong here? If this is a real bug, there should be many other people with the same issue, right?
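The order of operations described above can be reproduced in miniature. This is only a sketch in plain Python, not the actual sanitize_metadata.py code: the helper names `drop_duplicate_strains` and `clean_strain` are hypothetical, and the two rows carry just the strain names and accession IDs from the entries shown earlier.

```python
# Hypothetical miniature of the order of operations described above;
# this is NOT the real sanitize_metadata.py implementation.
records = [
    {"strain": "hCoV-19/Spain/AS-242164052/2021", "gisaid_epi_isl": "EPI_ISL_1701313"},
    {"strain": "hCoV-19/Spain/AS- 242164052/2021", "gisaid_epi_isl": "EPI_ISL_1913066"},
]


def drop_duplicate_strains(rows):
    """Keep one row per strain name (the last row seen wins)."""
    by_strain = {}
    for row in rows:
        by_strain[row["strain"]] = row
    return list(by_strain.values())


def clean_strain(name, prefix="hCoV-19/"):
    """Strip the prefix and remove spaces from a strain name."""
    if name.startswith(prefix):
        name = name[len(prefix):]
    return name.replace(" ", "")


# Step 1: deduplicate on the raw names. Both rows survive because the
# raw names differ by a single space.
deduplicated = drop_duplicate_strains(records)

# Step 2: clean the names afterwards. The two rows now share one name.
cleaned = [{**row, "strain": clean_strain(row["strain"])} for row in deduplicated]

names = [row["strain"] for row in cleaned]
print(names)
```

Both rows pass the duplicate check, and only afterwards do their names become identical, which is the collision that augur filter later reports.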

Nextstrain: nextstrain.cli 6.1.0.post1
Augur: augur 19.2.0

Thanks for this detailed and thorough bug report.

sanitize_metadata.py indeed filters out duplicates by strain name, but it's one of the first things it does. It then goes on to modify strain names in a few cases:

        # Strip prefixes from strain names.
        if args.strip_prefixes:
            metadata[strain_field] = metadata[strain_field].apply(
                lambda strain: strip_prefixes(strain, args.strip_prefixes)
            )

        # Replace whitespaces from strain names with nothing to match Nextstrain's
        # convention since whitespaces are not allowed in FASTA record names.
        metadata[strain_field] = metadata[strain_field].str.replace(" ", "")

        # Replace standard characters that are not accepted by all downstream
        # tools as valid FASTA names.
        metadata[strain_field] = metadata[strain_field].str.replace("'", "-")

These modifications, however, can produce new collisions on strain name as you've seen. I'd consider that a bug.

I'm not sure why others haven't seen/reported this issue. Can you say more about what "a freshly downloaded GISAID metadata set" means in your case? It may be that you're getting the problematic sequences, e.g.

hCoV-19/Spain/AS-242164052/2021
hCoV-19/Spain/AS- 242164052/2021

for some reason but others typically aren't.

It's just a guess, but when you download sequences from the GISAID website, you can get a box like this:
image

It gives an option to replace spaces with underscores. For me it's ticked by default and I always leave it that way, but if it has become unticked somehow, that might be the source of the problem.

Thanks for the replies. By a "fresh GISAID download" I mean the most recent full download of all GISAID sequences and metadata via the website (see image).
image
When you click on it there is no option to replace spaces with underscores, just a tick box as a reminder of the TOU. So I think that is the same situation as Emma's tick box not being ticked, which would indeed be the source of the problem.

image

Would it change (or solve) anything if the space replacement were done at the very start of the script? It appears to work fine for everybody whose download already had the spaces replaced before the sanitize_metadata.py script starts.
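A minimal sketch of that reordering, in plain Python rather than the pandas code the script uses: names are cleaned first and duplicates are resolved afterwards. The function names `accession_number` and `sanitize_then_deduplicate` are hypothetical, and keeping the row with the numerically highest accession ID is an assumption based on my reading of the documentation, not confirmed behaviour of the script.

```python
def accession_number(accession):
    """Return the numeric part of an ID such as 'EPI_ISL_1913066'."""
    return int(accession.rsplit("_", 1)[-1])


def sanitize_then_deduplicate(rows, prefix="hCoV-19/"):
    """Clean strain names first, then keep one row per cleaned name.

    Assumption: among rows sharing a cleaned name, the one with the
    highest accession number is kept.
    """
    best = {}
    for row in rows:
        name = row["strain"]
        if name.startswith(prefix):
            name = name[len(prefix):]
        # Same replacements as the sanitizing steps: drop spaces and
        # swap apostrophes for hyphens.
        name = name.replace(" ", "").replace("'", "-")
        cleaned = {**row, "strain": name}

        current = best.get(name)
        if current is None or (
            accession_number(cleaned["gisaid_epi_isl"])
            > accession_number(current["gisaid_epi_isl"])
        ):
            best[name] = cleaned
    return list(best.values())


rows = [
    {"strain": "hCoV-19/Spain/AS-242164052/2021", "gisaid_epi_isl": "EPI_ISL_1701313"},
    {"strain": "hCoV-19/Spain/AS- 242164052/2021", "gisaid_epi_isl": "EPI_ISL_1913066"},
]
result = sanitize_then_deduplicate(rows)
print(result)
```

With the cleaning done before the duplicate check, the two Spanish entries collapse into a single row instead of both reaching the output file.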

Thanks again for all the hard work you all put in maintaining the Nextstrain pipeline :-).

I found a working solution without changing any code: I just run the sanitize_metadata.py script twice. The first run creates 'new' duplicates just before writing the output, and the second run removes these 'new' duplicate entries. The subsequent augur filter commands then work again.
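To confirm that the second pass really leaves no duplicates before handing the file to augur filter, a small check along these lines could be used. It is a sketch using only the standard library; the function name `duplicated_strains` is hypothetical, and it assumes a tab-separated file (optionally gzipped) with a `strain` column, as produced by the rename in the command above.

```python
import csv
import gzip
from collections import Counter


def duplicated_strains(path, strain_column="strain"):
    """Return the sorted strain names that occur more than once in a TSV file."""
    # Pick the opener from the file extension; both accept text mode.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters in metadata fields literally.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        counts = Counter(row[strain_column] for row in reader)
    return sorted(name for name, count in counts.items() if count > 1)


# Example call (the path is the one from the error message above):
# print(duplicated_strains("/data/home/ncov/data/metadata_gisaid.tsv.gz"))
```

An empty list means every strain name is unique; anything else lists the names that would trigger the duplicate error.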

I haven't used that particular download set myself, but I suppose it must not have spaces replaced, which is causing the problem. If you have the 'Genomics epidemiology' section in the Downloads panel, then the FASTA/metadata from there may work better (however not everyone has access to this section).

I'm glad you found a workaround though!

I'm looking into this for a related discussion forum post.