CDCgov/datasets-sars-cov-2

VOI/VOC table is missing accessions for consensus genomes

proychou opened this issue · 5 comments

As I mentioned on SPHERES Slack, the VOI/VOC table at https://github.com/CDCgov/datasets-sars-cov-2/blob/master/datasets/sars-cov-2-voivoc.tsv is missing accessions for the consensus sequence.

It would be great to add these GenBank/GISAID accessions, and possibly also the PANGO lineage assigned at the time these were generated, along with the version. But the accessions would be the most crucial to add.

So, for example, in the VOI/VOC dataset we have a sample name hCoV-19_Wales_PHWC-4C8F5E_2021, which corresponds to the GISAID sample with the same name but with each _ replaced by /, i.e., hCoV-19/Wales/PHWC-4C8F5E/2021. Does that answer this issue?
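The substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name `to_gisaid_name` is hypothetical, and it assumes that every underscore in the dataset's sample name stands in for a slash, which may not hold for names that legitimately contain underscores.

```python
def to_gisaid_name(sample_name: str) -> str:
    """Convert a dataset sample name to the GISAID-style virus name.

    Assumption: every "_" in the sample name replaces a "/" in the
    GISAID name, as in the example from this thread.
    """
    return sample_name.replace("_", "/")


print(to_gisaid_name("hCoV-19_Wales_PHWC-4C8F5E_2021"))
# prints "hCoV-19/Wales/PHWC-4C8F5E/2021"
```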

That doesn't really help, because we'd still need to look these up one by one in GISAID after doing those substitutions. Alternatively, anyone with access could query the larger GISAID metadata file. These seem like unnecessary steps, though, and not everyone has access to those datasets. If accessions were provided, however, they could be entered directly into the GISAID search tool, which all users have access to.
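For those who do have the metadata file, the batch lookup mentioned above might look roughly like the sketch below. It is written under assumptions, not taken from the repository: the function `lookup_accessions` is hypothetical, the column names `Virus name` and `Accession ID` are assumed and should be checked against the actual file, and the accession in the demo is a made-up placeholder.

```python
import csv
import io


def lookup_accessions(sample_names, metadata_file,
                      name_col="Virus name", acc_col="Accession ID"):
    """Map dataset sample names to accessions found in a metadata TSV.

    Assumptions: the metadata is tab-separated, the column names match
    the defaults above, and "_" in a sample name stands for "/".
    """
    # Key by the GISAID-style name so each metadata row is one dict lookup.
    wanted = {name.replace("_", "/"): name for name in sample_names}
    found = {}
    for row in csv.DictReader(metadata_file, delimiter="\t"):
        original = wanted.get(row[name_col])
        if original is not None:
            found[original] = row[acc_col]
    return found


# In-memory stand-in for the metadata file; the accession is a placeholder.
demo = io.StringIO(
    "Virus name\tAccession ID\n"
    "hCoV-19/Wales/PHWC-4C8F5E/2021\tEPI_ISL_0000000\n"
)
print(lookup_accessions(["hCoV-19_Wales_PHWC-4C8F5E_2021"], demo))
```

Even with a script like this, the lookup depends on access to the metadata file, which is why having the accessions in the table itself would be simpler.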

It's also inconsistent between the VOC and non-VOC sets. The latter does have the GenBank accessions. It seems the best solution would be to provide GenBank/GISAID accessions for both, no?

Hi @proychou, does @daisy0223's latest address the issue?

Yes, perfect! Thank you!!

Okay great. Thank you for your feedback and helping us make these datasets better!