Priesemann-Group/covid19_inference

Normalize country names to ISO 3166

Closed this issue · 3 comments

To automate country analysis later on, we'll need to deal with the different ways countries are named in different sources. South Korea, for instance, is called "South Korea" (Google), "Korea, South" (JHU) and "Republic of Korea" (Apple). What about enforcing a single convention, e.g. ISO 3166 naming, during download?

Yep, but normalizing all the countries takes some time. Do you know a fast way to do it?

Not really. The best I can think of is to check whether each name is in the ISO 3166 list, implement a manual translation dict for those that are not, and give a warning if a name is in neither.

So, the way to go, if someone wants to implement it, is:

  • First, build a dictionary {'ISO 3166 name': ['alternative name 1', 'alternative name 2']}. A single dictionary shared by all datasets could save some work.
  • Second, build a function that takes a country name or a list of country names and returns the corresponding ISO 3166 name(s).
  • Third, change the data_retrieval functions/classes to return ISO 3166 country names.
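
A minimal sketch of the first two steps, in case it helps whoever picks this up. The names `ISO_ALTERNATIVES` and `to_iso_name` are hypothetical and not part of the package, the dictionary holds only a couple of illustrative entries, and the canonical spellings used as keys are an assumption that should be checked against the official ISO 3166 list. Unknown names are returned unchanged with a warning, as suggested earlier in the thread.

```python
import warnings

# Hypothetical mapping: canonical name -> alternative spellings seen in sources.
# Illustrative entries only; verify the keys against the official ISO 3166 list.
ISO_ALTERNATIVES = {
    "Korea, Republic of": ["South Korea", "Korea, South", "Republic of Korea"],
    "Germany": [],
}


def _build_lookup(alternatives):
    """Invert the mapping so that every known spelling points to its canonical name."""
    lookup = {}
    for iso_name, alt_names in alternatives.items():
        for name in [iso_name, *alt_names]:
            # casefold() makes the lookup case-insensitive
            lookup[name.casefold()] = iso_name
    return lookup


_LOOKUP = _build_lookup(ISO_ALTERNATIVES)


def _normalize_one(name):
    """Return the canonical name, or the input unchanged (with a warning) if unknown."""
    key = name.strip().casefold()
    if key in _LOOKUP:
        return _LOOKUP[key]
    warnings.warn(f"Country name {name!r} not found in the mapping; leaving it unchanged")
    return name


def to_iso_name(names):
    """Accept a single country name or an iterable of names."""
    if isinstance(names, str):
        return _normalize_one(names)
    return [_normalize_one(n) for n in names]


print(to_iso_name("Korea, South"))
print(to_iso_name(["South Korea", "Republic of Korea", "Germany"]))
```

Inverting the dictionary once at import time keeps each lookup cheap, and the data retrieval code in the third step would only need to pass its country column through a function like `to_iso_name`.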