Numerous False Negatives
GrayEye opened this issue · 4 comments
Hello Elyase, I'm very glad you created and maintain this useful Python library. I'm currently using it to parse a large amount of data from the USPTO. I noticed quite a few cases where the library didn't capture the city and/or country from the string. Below are some example strings from my source data where the city and/or country was not picked out. Hopefully these cases can help you improve the library.
INDIANAPOLIS INDIANA.
BARDSLEY, ENGLAND
ST. LOUIS, MO.
WHITING, INDIANA, AND CHICAGO, ILLINOIS.
PHILADELPHIA PA.
LEROY, N.Y.
LYNDONVILLE, VT.
AMENIA, N. Y.
COPPERHILL, TENN.
DETROIT AND JOSEPH CAMPAU AT THE RIVER,MICH.
IVORYTON, CONN.
ST. LOUIS, MO. CORPORATION OF MISSOURI.
OGDENSBURG, N.Y.
NEAR SHEFFIELD, ENGLAND
INDIANAPOLIS IND.
BASLE,
ST. LOUIS, MO. REPUBLISHED BY MONSANTO COMPANY,/ST. LOUIS, MO.
LABORATORY PARK DECATUR, ILL.
1006 OAZA KADOMA, KADOMA-CHO KITAKAWACHI-GUN, OSAKA,
3501 W. 48TH PLACE CHICAGO 32, ILL.
700 BROADWAY NEW YORK, N.Y.
811 WYANDOTTE KANSAS CITY, MO.
835 S. 8TH ST. ST. LOUIS 2, MO.
47/51 EXMOUTH MARKET, ROSEBERRY AVE. LONDON E.C.1, ENGLAND
1407 CUMMINGS DRIVE RICHMOND 20, VA.
For it to work, the input text must use capitalization: the underlying regex, and the idea behind this library, is to catch city names as capitalized named entities. Otherwise it would only be a lookup.
Ok, that makes sense. I can attempt to title case the data before I process it. However, I also have to point out that no matter what I do, certain cities like St. Louis are never recognized, even when input as just "St. Louis" or "Saint Louis".
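For anyone trying the same thing, here is a minimal sketch of title-casing the all-caps strings before passing them to the library. The helper name `normalize_case` is made up for this example, and it only uses Python's built-in `str.title()`. Note that `str.title()` capitalizes any letter that follows a non-letter, so ordinals like "48TH" come out as "48Th".

```python
def normalize_case(raw: str) -> str:
    """Trim whitespace and title-case an all-caps address string.

    Hypothetical helper for illustration; not part of the library.
    """
    return raw.strip().title()


samples = [
    "BARDSLEY, ENGLAND",
    "ST. LOUIS, MO.",
    "3501 W. 48TH PLACE CHICAGO 32, ILL.",
]

# Title-cased versions, ready to be passed to the city/country extraction.
normalized = [normalize_case(s) for s in samples]

for before, after in zip(samples, normalized):
    print(f"{before!r} -> {after!r}")
```

Title casing alone does not help with cases like "St. Louis", as noted above, since the period still interrupts the capitalized run.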
This is right. You have to understand that there are two things at work here: a regular expression tries to catch all named entities in a text and stores them in a list, and then those named entities are looked up in a table of city names. In cases like St. Louis, I would guess that the regular expression does not catch the "St." in St. Louis, which is why it is not recognized.
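The two steps described here can be sketched roughly as follows. This is not the library's actual code: the pattern `CANDIDATE_RE` and the tiny `KNOWN_CITIES` table are assumptions chosen only to show why a period breaks the match.

```python
import re

# Hypothetical pattern: runs of capitalized words separated by single spaces.
CANDIDATE_RE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

# Hypothetical stand-in for the table of city names.
KNOWN_CITIES = {"St. Louis", "Chicago", "New York"}


def extract_candidates(text: str) -> list:
    """Step 1: collect every capitalized run as a candidate entity."""
    return CANDIDATE_RE.findall(text)


def lookup_cities(candidates: list) -> list:
    """Step 2: keep only the candidates found in the city table."""
    return [c for c in candidates if c in KNOWN_CITIES]


# The period after "St" ends the capitalized run, so the full name is never
# produced as a single candidate and the lookup cannot succeed.
st_louis_candidates = extract_candidates("St. Louis, Mo.")
print(st_louis_candidates)
print(lookup_cities(st_louis_candidates))

chicago_candidates = extract_candidates("Chicago, Ill.")
print(lookup_cities(chicago_candidates))
```

With this pattern, "St. Louis, Mo." is split into separate pieces, none of which is in the table, while "Chicago" is matched whole and found.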
You can, however, take the regular expression and adapt it to your needs, or you can create multiple regular expressions, concatenate their matches into one list, and do the lookup from that.
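A minimal sketch of the multiple-expression approach, under the same caveat: the patterns, the abbreviation list (`St`, `Ft`, `Mt`), and the `KNOWN_CITIES` table are all made up for illustration and are not taken from the library.

```python
import re

# Hypothetical patterns; each one contributes candidates to a shared list.
PATTERNS = [
    # Plain runs of capitalized words, e.g. "Chicago" or "New York".
    re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"),
    # An abbreviated prefix followed by a capitalized word, e.g. "St. Louis".
    re.compile(r"(?:St|Ft|Mt)\. [A-Z][a-z]+"),
]

# Hypothetical stand-in for the table of city names.
KNOWN_CITIES = {"St. Louis", "Chicago", "New York"}


def find_cities(text: str) -> list:
    """Concatenate matches from all patterns, then look them up in the table."""
    candidates = []
    for pattern in PATTERNS:
        candidates.extend(pattern.findall(text))

    # Keep known cities only, without duplicates, in order of first appearance.
    found = []
    for candidate in candidates:
        if candidate in KNOWN_CITIES and candidate not in found:
            found.append(candidate)
    return found


print(find_cities("St. Louis, Mo."))
print(find_cities("Chicago, Ill."))
```

Because the lookup filters out anything not in the table, extra patterns can be fairly permissive: spurious candidates such as "St" or "Mo" are simply discarded.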
Hello, I've been using str.title() to capitalise strings. However, 'Malasya' is not identified as a country even though it comes up in the source data: http://www.geonames.org/search.html?q=malasya