mar-muel/local-geocode

Odd number of countries retrieved during retrieval of data

Closed this issue · 4 comments

Hey there,
I was reinitialising the Geocode class when I noticed that the number of countries returned by the file is much greater than I would expect. The screenshot below shows the data retrieved and the country count. I would expect this number to be closer to 200, in line with the countries listed on this page: https://www.geonames.org/countries/

[Screenshot: data retrieved and country count]

This is a fantastic library btw, thank you for providing it!

Hey there - If I remember correctly, it's because country names have a ton of different variants (e.g. US, USA, United States) and also have various spellings in different languages, e.g. Italy, Italia, Repubblica Italiana.

If your application only relies on English country names, there might be ways to filter for these specifically 🤔 I would need to look into how the country aliases are annotated. The thing is that the "official" country names are, interestingly, rarely used by anyone (i.e. people don't normally spell out the full name of the US), so we can't just ignore aliases.
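To illustrate why the count ends up far above 200, here is a minimal sketch that groups alias rows by country code. The sample records are made up for illustration and only mimic the shape of the dictionaries that `gc.decode()` returns; `aliases_by_country` is a hypothetical helper, not part of local-geocode.

```python
from collections import defaultdict

# Hypothetical sample rows, shaped like the dictionaries gc.decode() returns.
# The values are illustrative and were not read from the library's data file.
records = [
    {"name": "US", "official_name": "United States", "country_code": "US", "location_type": "country"},
    {"name": "USA", "official_name": "United States", "country_code": "US", "location_type": "country"},
    {"name": "United States", "official_name": "United States", "country_code": "US", "location_type": "country"},
    {"name": "Italy", "official_name": "Italy", "country_code": "IT", "location_type": "country"},
    {"name": "Italia", "official_name": "Italy", "country_code": "IT", "location_type": "country"},
    {"name": "Repubblica Italiana", "official_name": "Italy", "country_code": "IT", "location_type": "country"},
]


def aliases_by_country(rows):
    """Group country aliases under their country code (hypothetical helper)."""
    grouped = defaultdict(set)
    for row in rows:
        if row["location_type"] == "country":
            grouped[row["country_code"]].add(row["name"])
    return dict(grouped)


grouped = aliases_by_country(records)
print(len(records), "alias rows for", len(grouped), "distinct countries")
```

Counting distinct `country_code` values rather than rows should give a number much closer to the roughly 200 countries listed on the GeoNames page, while keeping every alias available for lookups.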

Hey @mar-muel, that makes sense, thanks! Yes, it would be great to be able to filter by English only. I see in the code that the featureCodes_en.txt file is downloaded; does that not filter by English only?
I'm using this as my reference for the available APIs https://www.geonames.org/export/ws-overview.html but I'm not sure if this is the correct source...

Unfortunately, it seems like the alternate names of places are not properly annotated. E.g. the alternate names of Toledo, Spain are given as a list of strings without any sort of language annotation:

'Taleda,Toledas,Tolede,Toledo,Toledo i Spania,Toledu,Toletum,Toleu,Tolède,XTJ,tlytlt,to le do,toledo,toleto,tolledo,toredo,tuo lai duo,tuo li duo,twldw,twldw  aspanya,twlydw,Τολέδο,Таледа,Толедо,Տոլեդո,טאלעדא,טולדו,تولدو، اسپانیا,توليدو,طليطلة,طلیطلہ,तोलेदो,ਤੋਲੇਦੋ,டொலேடோ,โตเลโด,ტოლედო,ቶሌዶ,トレド,托利多,托萊多,톨레도'

As I mentioned above, I cannot simply ignore these alternative names of places as they are sometimes more meaningful than the official names. Else something like this would not be possible:

>>> gc.decode("L.A.")
[{'name': 'L.A.', 'official_name': 'Los Angeles', 'country_code': 'US', 'longitude': -118.24368, 'latitude': 34.05223, 'geoname_id': '5368361', 'location_type': 'city', 'population': 3898747}]
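On the language annotation question: the GeoNames dump readme also describes a separate alternate names table that has an `isolanguage` column. Below is a rough sketch of filtering such rows by language, assuming that table is available as tab-separated text. The sample rows and IDs are made up, `names_in_language` is a hypothetical helper, and nothing in this thread establishes that local-geocode reads this table.

```python
import csv
import io

# Column order of the alternate names table as described in the GeoNames dump readme.
COLUMNS = [
    "alternateNameId", "geonameid", "isolanguage", "alternate_name",
    "isPreferredName", "isShortName", "isColloquial", "isHistoric", "from", "to",
]

# Made-up sample rows; the IDs are placeholders, not real GeoNames IDs.
SAMPLE = (
    "1\t100\ten\tToledo\t1\t\t\t\t\t\n"
    "2\t100\tfr\tTolède\t\t\t\t\t\t\n"
    "3\t100\tla\tToletum\t\t\t\t\t\t\n"
    "4\t100\t\tXTJ\t\t\t\t\t\t\n"  # a row without any language code
)


def names_in_language(lines, language):
    """Collect alternate names tagged with the given language, keyed by geonameid."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    result = {}
    for fields in reader:
        row = dict(zip(COLUMNS, fields))
        if row.get("isolanguage") == language:
            result.setdefault(row["geonameid"], []).append(row["alternate_name"])
    return result


english = names_in_language(io.StringIO(SAMPLE), "en")
print(english)
```

A filter like this drops every row that has no language code, so whether a colloquial alias such as "L.A." would survive depends on how it is tagged in the dump, which would need checking before relying on it.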

Unfortunate but understandable, thanks :)