cfpb/grasshopper

Produce list of synonyms for addresses


i.e. St = Street, Rd = Road, etc

The format should follow Elasticsearch conventions. For the above it would be:

court => court, ct
street => street, st

I'll take a first crack at this if no one has started.

Between @wpears's link and the Pub 28 site, I think I have enough. If there is something else to look at, let me know.

@awolfe76 This is already being worked on by @lesgou, please coordinate with her.

Let's bring this up at scrum. I kind of want @kgudel to do this, for the experience. Has someone proposed a format? I'm assuming JSON (like what @wpears sent: https://github.com/openaddresses/machine/blob/master/openaddr/expand.py)

@feomike @awolfe76 @kgudel There is a specific format that Elasticsearch needs for the synonyms file (synonyms.txt). I already sent that out to @lesgou; it looks something like this (contrived example):

# Abbreviations
court => court, ct
street => street, st

# States
maryland => md, maryland
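A file in this shape could be generated from a lookup table rather than typed by hand. The following is a minimal sketch, not the project's actual tooling: the tables `ABBREVIATIONS` and `STATES` and the helpers `explicit_rule` and `render_synonyms` are hypothetical names, and the entries are just the contrived ones above. It always lists the full word first on the right-hand side, so the state rule comes out as `maryland => maryland, md`.

```python
# Hypothetical lookup tables; a real list would be built from USPS Pub 28.
ABBREVIATIONS = {
    "court": ["ct"],
    "street": ["st"],
}
STATES = {
    "maryland": ["md"],
}


def explicit_rule(full, abbreviations):
    """Build one explicit-mapping rule such as 'court => court, ct'."""
    return f"{full} => {', '.join([full, *abbreviations])}"


def render_synonyms(sections):
    """Render {heading: {full_word: [abbreviation, ...]}} as synonyms-file text."""
    lines = []
    for heading, table in sections.items():
        lines.append(f"# {heading}")  # lines starting with '#' are comments
        lines.extend(
            explicit_rule(full, abbrs) for full, abbrs in sorted(table.items())
        )
        lines.append("")  # blank line between sections
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_synonyms({"Abbreviations": ABBREVIATIONS, "States": STATES}))
```

Writing the returned string to `synonyms.txt` would give a file with one rule per line, grouped under comment headings.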

ok, got it. thanks @jmarin

@jmarin I was using a slightly different format.

court, ct
street, st

maryland, md

From my understanding, this way there is no "replacement" of values; it just treats the two as the same at search time.
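To make the difference between the two formats concrete, here is a rough sketch that models how a single-token rule in each style rewrites a token stream. It is a simplified illustration, not Elasticsearch's implementation: `parse_rule` and `apply_rules` are hypothetical helpers, multi-word synonyms are ignored, and the `expand` flag is assumed to behave like the option of the same name described in the Solr synonyms format.

```python
def parse_rule(line, expand=True):
    """Turn one Solr-style synonym line into {input_token: [output_tokens]}."""
    if "=>" in line:
        # Explicit mapping: only tokens on the left are rewritten.
        left, right = line.split("=>", 1)
        inputs = [t.strip() for t in left.split(",")]
        outputs = [t.strip() for t in right.split(",")]
    else:
        # Equivalent synonyms: every listed token is an input.
        inputs = [t.strip() for t in line.split(",")]
        outputs = inputs if expand else inputs[:1]
    return {token: outputs for token in inputs}


def apply_rules(tokens, rules):
    """Rewrite each token using the rules; unknown tokens pass through."""
    out = []
    for token in tokens:
        out.extend(rules.get(token, [token]))
    return out


equivalent = parse_rule("street, st")
explicit = parse_rule("street => street, st")

print(apply_rules(["main", "st"], equivalent))    # ['main', 'street', 'st']
print(apply_rules(["main", "st"], explicit))      # ['main', 'st']
print(apply_rules(["main", "street"], explicit))  # ['main', 'street', 'st']
```

In this model the comma-only rule treats `st` and `street` symmetrically, while the `=>` rule only rewrites the token on its left-hand side.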

I took a stab at this a while back as part of the parser. It's based on the Pub 28 abbreviations.

https://github.com/hkeeler/grasshopper-parser/blob/normalize/abbreviations.yaml


Here's the ES documentation on the syntax for the synonyms file. You'll see multiple formats for the synonyms:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html#_solr_synonyms
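For context, a synonyms file is referenced from an index's analysis settings through a synonym token filter. The sketch below builds such settings as a Python dict and prints them as JSON. The filter name `address_synonyms`, the analyzer name `address_analyzer`, the helper `build_index_settings`, and the path `analysis/synonyms.txt` (relative to the Elasticsearch config directory) are all assumptions for illustration, not the project's actual configuration.

```python
import json


def build_index_settings(synonyms_path="analysis/synonyms.txt"):
    """Return index settings that apply a synonyms file at analysis time.

    All names here are hypothetical; only the 'synonym' filter type and its
    'synonyms_path' option come from the Elasticsearch documentation.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "address_synonyms": {
                        "type": "synonym",
                        "synonyms_path": synonyms_path,
                    }
                },
                "analyzer": {
                    "address_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so rules written in lowercase match.
                        "filter": ["lowercase", "address_synonyms"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_index_settings(), indent=2))
```

The resulting JSON would be sent as the body of an index-creation request; which fields use the analyzer is a separate mapping decision not covered here.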

After the post-scrum meeting, the => method is the way to go. It's better for several reasons when you know the data set has a standard. For example, in TIGER we know that Avenue is always spelled out in the data (it's never ave or av).

For data sets that don't have such a standard, another synonyms format may be better.

@jmarin @feomike - just want to make sure I have that right.

👍

That is correct