Repository of synonyms, protected words, stop words, localizations, and other vocabularies to improve the precision, recall, and usability of search results.
Each locale's synonyms are in a separate YAML file (e.g., es.yml
, en.yml
). Here is a sample entry:
inmunización, vacuna, vacunación:
:notes: Approved (synonyms) and (stemming). AFF 11/12/14
:status: Approved
:analyzed: inmunizacion, vacun, vacunacion
The entry listing is a comma-separated list of natural language terms, probably lemmas.
The notes
field can be long and multi-line, but it still needs to be valid YAML. Notes include information on the type of synonym:
- Abbreviations
- Acronyms
- Clipped words
- Gerunds
- Irregular plurals
- Language variants
- Misspellings
- Numbers
- Spelling variants
- Stemming
- Synonyms
- Tickers (stock ticker symbols)
- Verbs
The status
is either Approved
, Rejected
, or Candidate
.
The analyzed
field is a comma-separated list of the entry terms after they have been run through an analyzer and de-duped. The analysis chain comprises 6 filters:
- standard
- asciifolding
- lowercase
- es_stop_filter
- es_protected_filter
- es_stem_filter
In the example entry above, vacuna
becomes vacun
because of the Spanish stemmer, and vacunación
and inmunización
become vacunacion
and inmunizacion
, respectively, because of ASCII folding.
Generate the text file of approved synonyms for each locale, like this:
cd synonyms
./lib/yaml_to_solr.rb es.yml > es.txt
./lib/yaml_to_solr.rb en.yml > en.txt
You can then reference these files when you define your per-locale synonym filters.
Each locale's protected words are in a separate YAML file (e.g., es.yml
, en.yml
). The keyword marker token filter keeps these words from getting treated by the stemmer.
Here is a sample entry:
irs:
:notes: Stems to ir in minimal_english
:status: Approved
The entry listing is the token after ASCII folding and lowercasing.
Generate the text file of approved protected words for each locale, like this:
cd protected_words
./lib/yaml_to_solr.rb es.yml > es.txt
./lib/yaml_to_solr.rb en.yml > en.txt
You can then reference these files when you define your per-locale keyword marker filters.
Each locale's stop words are in a separate YAML file (e.g., es.yml
, en.yml
). Here is a sample entry:
they:
:notes:
:status: Approved
The entry listing is the token after ASCII folding and lowercasing.
Generate the text file of approved stop words for each locale, like this:
cd stop_words
./lib/yaml_to_solr.rb es.yml > es.txt
./lib/yaml_to_solr.rb en.yml > en.txt
You can then reference these files when you define your per-locale stop word filters.
The Elasticsearch index mapping used to transform entries into analyzed fields is here:
{
"settings": {
"index": {
"analysis": {
"char_filter": {
"ignore_chars": {
"type": "mapping",
"mappings": [
"'=>",
"’=>",
"`=>"
]
}
},
"filter": {
"es_protected_filter": {
"type": "keyword_marker",
"keywords": [
"ronaldo"
]
},
"es_stem_filter": {
"type": "stemmer",
"name": "light_spanish"
},
"es_stop_filter": {
"type": "stop",
"stopwords": [
"a",
"al",
"ante",
"aquel",
"aquello",
"bajo",
"cabe",
"cada",
"como",
"con",
"conmigo",
"consigo",
"contigo",
"contra",
"cual",
"cuando",
"de",
"del",
"desde",
"despues",
"donde",
"durante",
"e",
"el",
"en",
"entonces",
"entre",
"es",
"esta",
"esto",
"fin",
"fue",
"ha",
"hacia",
"has",
"hasta",
"la",
"las",
"le",
"les",
"los",
"mas",
"mediante",
"menos",
"mi",
"ni",
"o",
"para",
"pero",
"por",
"que",
"quien",
"salvo",
"segun",
"ser",
"si",
"sin",
"so",
"sobre",
"solamente",
"solo",
"somos",
"son",
"soy",
"su",
"suya",
"suyo",
"suyos",
"tal",
"tambien",
"tras",
"u",
"un",
"una",
"unas",
"unos",
"via",
"y"
]
},
"en_protected_filter": {
"type": "keyword_marker",
"keywords": [
"irs"
]
},
"en_stem_filter": {
"type": "stemmer",
"name": "minimal_english"
},
"en_stop_filter": {
"type": "stop",
"stopwords": [
"a",
"an",
"and",
"are",
"as",
"at",
"be",
"but",
"by",
"for",
"if",
"in",
"into",
"is",
"no",
"not",
"of",
"on",
"or",
"s",
"such",
"t",
"that",
"the",
"their",
"then",
"there",
"these",
"they",
"this",
"to",
"was",
"with"
]
}
},
"analyzer": {
"en_analyzer": {
"type": "custom",
"char_filter": [
"ignore_chars"
],
"filter": [
"standard",
"asciifolding",
"lowercase",
"en_stop_filter",
"en_protected_filter",
"en_stem_filter"
],
"tokenizer": "standard"
},
"es_analyzer": {
"type": "custom",
"char_filter": [
"ignore_chars"
],
"filter": [
"standard",
"asciifolding",
"lowercase",
"es_stop_filter",
"es_protected_filter",
"es_stem_filter"
],
"tokenizer": "standard"
}
}
}
}
}
}
The DigitalGov Search application uses these files to provide localized translations of text strings based on the locale set for the user.