jprante/elasticsearch-plugin-bundle

decompound filter returns non-compound words twice


First of all: Thanks for creating this enormously helpful bundle! While fine-tuning it for our application, I've stumbled upon the following problem: The decompound filter correctly returns the subwords of compound words, but it returns every word that is not a compound word twice (i.e. it treats the non-compound word as its own single subword and emits it again).

This is the simplified version of my index settings to reproduce the problem:

settings:
    index:
        analysis:
            analyzer:
                german_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [decompounder]
            filter:
                decompounder:
                    type: decompound

Querying /_analyze with the text Grundbuchamt Anwältin returns:

tokens:
- token: "Grundbuchamt"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "Grund"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "buch"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "amt"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "Anwältin"
  start_offset: 13
  end_offset: 21
  type: "<ALPHANUM>"
  position: 1
- token: "Anwältin"
  start_offset: 13
  end_offset: 21
  type: "<ALPHANUM>"
  position: 1

As you can see, the token Anwältin is returned twice with the same offset and position.

(Incidentally, setting subwords_only to true eliminates the duplicates.)

Do you have an idea how we might fix this behaviour?

There may be a flaw. As a workaround, duplicates can be removed from the token stream with the standard "unique" filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-unique-tokenfilter.html

Thanks! I just came back to post this as well. What's important to note is that the unique filter should be used with only_on_same_position: true, because otherwise the term frequency will be heavily distorted.

As an example for others with the same problem:

settings:
    index:
        analysis:
            analyzer:
                german_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [decompounder, unique_decomp]
            filter:
                unique_decomp:
                    type: unique
                    only_on_same_position: true
                decompounder:
                    type: decompound
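To illustrate why only_on_same_position matters, here is a minimal Python sketch (an illustration, not the actual Lucene filter code) of the two dedup modes. With global dedup, a legitimate second occurrence of a term later in the stream is dropped, which is what distorts term frequency; with same-position dedup, only the duplicate emitted at the same position is removed:

```python
# Sketch of the "unique" token filter's two modes, for illustration only.
# Tokens are modeled as (term, position) tuples in stream order.
def unique(tokens, only_on_same_position=False):
    seen = set()
    out = []
    last_pos = None
    for term, pos in tokens:
        if only_on_same_position and pos != last_pos:
            seen = set()  # forget seen terms at each new position
        last_pos = pos
        if term not in seen:
            seen.add(term)
            out.append((term, pos))
    return out

# A stream like the decompound filter produces: "Anwältin" duplicated at
# position 0, plus a genuine second occurrence of the word at position 2.
stream = [("Anwältin", 0), ("Anwältin", 0), ("und", 1),
          ("Anwältin", 2), ("Anwältin", 2)]

print(unique(stream))
# → [('Anwältin', 0), ('und', 1)]  -- later occurrence lost, tf distorted
print(unique(stream, only_on_same_position=True))
# → [('Anwältin', 0), ('und', 1), ('Anwältin', 2)]  -- tf preserved
```

So with only_on_same_position: true, each position keeps exactly one copy of each term, while repeated words at different positions still count toward term frequency.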