nextstrain/seasonal-flu

Some strain names in titer records are misspelled

huddlej opened this issue · 0 comments

Current behavior

Titer records define strain names for test and reference viruses, but we do not automatically cross-check these names against the strain names in existing GISAID/GenBank records. As a result, titer records can contain misspelled strain names, and we omit the affected measurements from analyses because the names do not match the strain name in the corresponding sequence record.

Tal from the Bloom lab has helpfully provided a list of misspellings that he detected in our Crick titers:

"Misspelled Virus Name","Correct Virus Name"
"A/Camb/925256/2020","A/Cambodia/925256/2020"
"A/Christchurch/4/1985","A/ChristChurch/4/1985"
"A/Christchurch/515/2019","A/ChristChurch/515/2019"
"A/Christchurch/515/2019-egg","A/ChristChurch/515/2019-egg"
"A/CotedIvore/544/2016","A/CoteDIvoire/544/2016"
"A/Eng/538/2018","A/England/538/2018"
"A/Greecd/4/2017","A/Greece/4/2017"
"A/Hk/5738/2014","A/HongKong/5738/2014"
"A/Hk/656/2018","A/HongKong/656/2018"
"A/Hk/675/2018","A/HongKong/675/2018"
"A/Lyon/CHU/R1811667/2018","A/Lyon/CHU-R1811667/2018"
"A/Lyon/CHU/R181282/2018","A/Lyon/CHU-R181282/2018"
"A/Lyon/CHU/R1813393/2018","A/Lyon/CHU-R1813393/2018"
"A/Lyon/CHU/R190259/2019","A/Lyon/CHU-R190259/2019"
"A/Lyon/CHU/R190377/2019","A/Lyon/CHU-R190377/2019"
"A/Lyon/CHU/R1914685/2019","A/Lyon/CHU-R1914685/2019"
"A/Lyon/CHU/R1915450/2019","A/Lyon/CHU-R1915450/2019"
"A/Lyon/EHPAD/108/2019","A/Lyon/EHPAD-108/2019"
"A/Nor/2516/2018","A/Norway/2516/2018"
"A/Nor/2620/2018","A/Norway/2620/2018"
"A/Nor/4436/2016","A/Norway/4436/2016"
"A/Norway/3806-egg","A/Norway/3806/2016-egg"
"A/Singapore/INFIMH-16-001/2016","A/Singapore/INFIMH-16-0019/2016"
"A/Singapore/INFIMH-16-001/2016-egg","A/Singapore/INFIMH-16-0019/2016-egg"
"A/Singapore/Infimh-16-0019/2016","A/Singapore/INFIMH-16-0019/2016"
"A/Singapore/Infimh-16-0019/2016-egg","A/Singapore/INFIMH-16-0019/2016-egg"
"A/StEtienne/1912/2018","A/Saint-Etienne/1912/2018"
"A/StEtienne/1998/2018","A/Saint-Etienne/1998/2018"
"A/StEtienne/2539/2020","A/Saint-Etienne/2539/2020"
"A/Stock/6/2014","A/Stockholm/6/2014"
"A/Switz/8060/2017-egg","A/Switzerland/8060/2017-egg"
"A/Switzerlandz/8060/2017-egg","A/Switzerland/8060/2017-egg"

Expected behavior

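Titer records should use the same strain names as their corresponding sequence records. Each misspelled name in the list above should be corrected to the name in the "Correct Virus Name" column so that its measurements are included in analyses.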
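
As a rough sketch of how the corrections listed above could be applied, the following Python snippet loads a two-column list like Tal's into a lookup table and rewrites the strain names in a titer record. The field names `test_strain` and `reference_strain`, the helper function names, and the in-memory CSV sample are assumptions made for illustration; they are not the actual schema of our titer database.

```python
import csv
import io

# A few rows copied from the list above. The full list would load the same way.
CORRECTIONS_CSV = '''"Misspelled Virus Name","Correct Virus Name"
"A/Camb/925256/2020","A/Cambodia/925256/2020"
"A/Greecd/4/2017","A/Greece/4/2017"
"A/Hk/656/2018","A/HongKong/656/2018"
'''


def load_corrections(handle):
    """Build a {misspelled name: correct name} mapping from a CSV handle."""
    reader = csv.DictReader(handle)
    return {
        row["Misspelled Virus Name"]: row["Correct Virus Name"]
        for row in reader
    }


def correct_titer_record(record, corrections):
    """Return a copy of a titer record with known misspellings replaced.

    The field names below are hypothetical placeholders.
    """
    fixed = dict(record)
    for field in ("test_strain", "reference_strain"):
        # Names without a known correction are left unchanged.
        fixed[field] = corrections.get(fixed[field], fixed[field])
    return fixed


corrections = load_corrections(io.StringIO(CORRECTIONS_CSV))
record = {
    "test_strain": "A/Hk/656/2018",
    "reference_strain": "A/Camb/925256/2020",
    "titer": "160",
}
fixed = correct_titer_record(record, corrections)
print(fixed["test_strain"], fixed["reference_strain"])
```

Returning a corrected copy rather than editing the record in place keeps the original values available for auditing.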

Possible solution

In addition to manually correcting these records in our database, we should also consider flagging any titer records with potential misspellings. One easy check would be to flag records whose test or reference strain names have no corresponding record in the sequence database.
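
The check described above could look something like the following Python sketch, which reports every test or reference strain name that is missing from a set of known sequence strain names. The record layout, the field names `test_strain` and `reference_strain`, and the function name are hypothetical. Suggesting near matches with `difflib.get_close_matches` from the standard library is an addition that this issue does not propose, and the cutoff of 0.8 is an arbitrary starting point.

```python
import difflib


def flag_unmatched_strains(titer_records, sequence_strains, cutoff=0.8):
    """Return (name, field, suggestions) for strain names without a sequence record.

    titer_records: iterable of dicts with hypothetical keys
        "test_strain" and "reference_strain".
    sequence_strains: iterable of strain names from the sequence database.
    """
    known = set(sequence_strains)
    candidates = sorted(known)
    flagged = []
    for record in titer_records:
        for field in ("test_strain", "reference_strain"):
            name = record[field]
            if name not in known:
                # Near matches are only hints for a human reviewer.
                suggestions = difflib.get_close_matches(
                    name, candidates, n=3, cutoff=cutoff
                )
                flagged.append((name, field, suggestions))
    return flagged


sequence_strains = ["A/Greece/4/2017", "A/Norway/2516/2018"]
titer_records = [
    {"test_strain": "A/Greecd/4/2017", "reference_strain": "A/Norway/2516/2018"},
]
flagged = flag_unmatched_strains(titer_records, sequence_strains)
for name, field, suggestions in flagged:
    print(name, field, suggestions)
```

Single-character typos such as "A/Greecd/4/2017" should score well above the cutoff, but heavily abbreviated names such as "A/Hk/656/2018" may not, so a flagged name with no suggestions would still need manual review.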