/elasticsearch-concatenate-token-filter

Elasticsearch plugin which only provides a TokenFilter that merges tokens in a token stream back into one. Taken from http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-concatenation-filter-td3711094.html

Primary LanguageJava

elasticsearch-concatenate-token-filter

Elasticsearch plugin which only provides a TokenFilter that merges tokens in a token stream back into one. Taken from http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-concatenation-filter-td3711094.html

ElasticSearch version support

This plugin targets 1.X versions of ES, and won't work for 2.X.

Support for ElasticSearch 2.2.0 was added thanks to @bomberby and can be found in the 2.2.0 branch. It may be compatible for all 2.X versions, but was only tested on 2.2.0.

Install

To install on your current ES node, use the plugin binary provided in the bin folder (on Ubuntu it should be under /usr/share/elasticsearch/bin)

For Elasticsearch 2.2.0 use

sudo bin/plugin install francesconero/elasticsearch-concatenate-token-filter/2.2.0

Usage

The plugin provides a token filter of type concatenate which has one parameter token_separator. Use it in your custom analyzers to merge tokenized strings back into one single token (usually after applying stemming or other token filters).

Arrays

When saving arrays of strings to a field, these are handled in elasticsearch as separate tokens, so this filter would collapse all the elements of the array into one, and usually you don't want that to happen. As a workaround you can set position_offset_gap on the field to a high number and pass the same number as the increment_gap parameter to the filter, which then only concatenates all tokens closer than this value.

Example

Given the custom analyzer (see https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html):

{
  "analysis" : {
    "filter" : {
      "concatenate" : {
        "type" : "concatenate",
        "token_separator" : "_"
      },
      "custom_stop" : {
        "type": "stop",
        "stopwords": ["and", "is", "the"]
      }
    },
    "analyzer" : {
      "stop_concatenate" : {
        "filter" : [
          "custom_stop",
          "concatenate"
        ],
        "tokenizer" : "standard"
      }
    }
  }
}

the string:

"the fox jumped over the fence"

would be analyzed as:

"fox_jumped_over_fence"