Produce list of synonyms for addresses
Closed this issue · 10 comments
e.g. St = Street, Rd = Road, etc.
The format should follow Elasticsearch conventions, for the above it would be:
court => court, ct
street => street, st
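A rough sketch of how such lines could be generated, assuming the abbreviations are already available as a mapping from the full word to its abbreviations. The function name `to_synonym_lines` and the sample data are hypothetical, not part of any existing tool:

```python
def to_synonym_lines(abbreviations):
    """Render {full_word: [abbreviation, ...]} as 'full => full, abbr' lines."""
    lines = []
    for full, abbrs in sorted(abbreviations.items()):
        # Keep the full word on the right-hand side so it still matches itself.
        targets = [full] + [a for a in abbrs if a != full]
        lines.append(f"{full} => {', '.join(targets)}")
    return lines


# Illustrative sample only; a real list would come from a source such as Pub 28.
sample = {"street": ["st"], "court": ["ct"]}
print("\n".join(to_synonym_lines(sample)))
# court => court, ct
# street => street, st
```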
I'll take a first crack at this if no one has started.
Between @wpears's link and the Pub 28 site I think I have enough. If there is something else to look at, let me know.
let's bring this up at scrum. i kind of want @kgudel to do this, for the experience. has someone proposed a format? i am assuming json (like what @wpears sent: https://github.com/openaddresses/machine/blob/master/openaddr/expand.py)
@jmarin i was using a slightly different format.
court, ct
street, st
maryland, md
From my understanding, this way there is no "replacement" of values; it just treats the two as the same at search time.
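To make the difference between the two formats concrete, here is a minimal Python sketch of how each kind of line maps tokens. It is a simplified model, not Elasticsearch's implementation: it handles single-word tokens only and assumes the comma-separated form is expanded so every term maps to the whole group. The names `parse_rules` and `expand` are made up for illustration:

```python
def parse_rules(text):
    """Parse Solr-style synonym lines into {input_token: [output_tokens]}."""
    rules = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: only the left-hand terms are rewritten.
            left, right = line.split("=>", 1)
            outputs = [t.strip() for t in right.split(",") if t.strip()]
            for src in (t.strip() for t in left.split(",")):
                if src:
                    rules[src] = outputs
        else:
            # Equivalent synonyms: every term maps to the whole group.
            group = [t.strip() for t in line.split(",") if t.strip()]
            for src in group:
                rules[src] = group
    return rules


def expand(tokens, rules):
    """Replace each lowercased token by its mapped outputs, if any."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        out.extend(rules.get(tok, [tok]))
    return out


explicit = parse_rules("court => court, ct")
equivalent = parse_rules("court, ct")

print(expand(["ct"], explicit))     # ['ct']
print(expand(["court"], explicit))  # ['court', 'ct']
print(expand(["ct"], equivalent))   # ['court', 'ct']
```

In this model the `=>` line only rewrites `court`, while the comma-only line treats `court` and `ct` as interchangeable in both directions.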
I took a stab at this a while back as part of the parser. It's based on the Pub 28 abbreviations.
https://github.com/hkeeler/grasshopper-parser/blob/normalize/abbreviations.yaml
On Fri, Aug 21, 2015, 8:59 AM Juan Marin Otero notifications@github.com wrote:
@feomike @awolfe76 There is a specific format that Elasticsearch needs for the synonyms file (synonyms.txt). I already sent that out to @lesgou, and it looks something like this (contrived example):
# Abbreviations
court => court, ct
street => street, st
# States
maryland => md, maryland
Here's the ES documentation on the syntax for synonyms file. You'll see multiple formats for the synonyms:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html#_solr_synonyms
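For reference, a hedged sketch of index settings that point a synonym token filter at a synonyms file, built here as a Python dict and printed as JSON. The filter type `synonym` and the `synonyms_path` option come from the Elasticsearch documentation linked above; the filter name `address_synonyms`, the analyzer name `address`, and the path `analysis/synonyms.txt` are assumptions chosen for illustration:

```python
import json


def build_index_settings(synonyms_path="analysis/synonyms.txt"):
    """Return index settings with a synonym filter reading from a file.

    The path is resolved by Elasticsearch relative to its config directory.
    All names below are illustrative, not taken from the project.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "address_synonyms": {
                        "type": "synonym",
                        "synonyms_path": synonyms_path,
                    }
                },
                "analyzer": {
                    "address": {
                        "tokenizer": "standard",
                        # Lowercase first so rules written in lowercase match.
                        "filter": ["lowercase", "address_synonyms"],
                    }
                },
            }
        }
    }


print(json.dumps(build_index_settings(), indent=2))
```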
After the after-scrum meeting, the => method is the way to go. It's better for several reasons when you know the data set has a standard. For example, in TIGER we know that Avenue is always spelled out in the data (it's never ave or av).
For data sets that don't have this standard, another synonyms format may be better.
@jmarin @feomike - just want to make sure I have that right.
👍
That is correct