Sensitivity to capitalization, punctuation, and places sharing a name.
khof312 opened this issue · 3 comments
Hi @elyase, this is great work, thanks. It is very fast. I am encountering a few reliability issues, however. Specifically, I am finding that the library is very sensitive to capitalization and punctuation (it ignores lowercase names, and it ignores countries that are followed by other properly capitalized words), and that it also has trouble disambiguating between multiple places with the same name. For example:
GeoText("France Is A Country").country_mentions
>> OrderedDict()
GeoText("paris France").country_mentions
>> OrderedDict([('FR', 1)])
GeoText("Paris France").country_mentions
>> OrderedDict()
GeoText("Paris, France").country_mentions
>> OrderedDict([('FR', 1), ('US', 1)])
(Presumably because there are also American cities named Paris?)
Just wanted to flag this for future updates...thanks!
Thanks for bringing up those issues. You are right that many corner cases go wrong: some can be traced back to the data, others come from limitations of the regex approach.
On my wish list is an optional machine learning approach. It should disambiguate better, but it will be somewhat slower and bring in a few more dependencies.
For now I will manually patch the cases you found and include the fixes in the next release.
Thanks! Didn't mean to make demands, this is already a great service that you are providing for free :) I am using the library regardless, thank you!!! If I have the time, I will also try to propose some fixes.
Change the regex to [A-Za-z]+[a-zà-ú](?:[ '-][A-Z]+[a-zà-ú])*
This will solve the sensitivity to capitalization. But there are some issues apart from the regex as well. For example, despite the regex detecting "LONDON" as a candidate, it does not get captured.
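For anyone who wants to check a candidate pattern before patching the library, here is a minimal sketch that runs the pattern quoted above against the inputs from this thread. It only uses Python's standard re module and does not call geotext at all. The helper name candidates and the SAMPLES list are made up for illustration, and compiling the pattern without any flags is an assumption.

```python
import re

# Pattern copied verbatim from the comment above.
# Assumption: it is compiled without flags (no re.IGNORECASE).
PATTERN = re.compile(r"[A-Za-z]+[a-zà-ú](?:[ '-][A-Z]+[a-zà-ú])*")

# Inputs taken from this thread. This harness is hypothetical and is
# not part of geotext.
SAMPLES = [
    "France Is A Country",
    "paris France",
    "Paris, France",
    "LONDON",
]


def candidates(text):
    """Return every non-overlapping substring the pattern matches, left to right."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return PATTERN.findall(text)


if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), "->", candidates(sample))
```

Looking at where each match starts and ends shows whether a place name comes out as its own candidate or gets merged with a neighbouring capitalized word, and whether an all-caps token is matched at all. That makes it easier to separate regex problems from lookup problems.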