elasticsearch7-ngrams-fuzzy-shingles-stemming-workshop

Gentle introduction to basic elasticsearch constructs boosting search: ngrams, shingles, stemmers, suggesters and fuzzy queries.

License: GPL v3

preface

  • goals of this workshop
    • introduction to handling typos and suggestions in elasticsearch
    • introduction to basic constructs boosting search
      • ngram and edge ngram (typos, prefixes)
      • shingles (phrase)
      • stemmers (operating on roots rather than words)
      • fuzzy queries (typos)
      • suggesters
  • the docker-compose setup provides elasticsearch + kibana (7.6) for local testing (see the health check below)
    • cd docker/compose
    • docker-compose up -d
  • workshop and answers are in workshop directory
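  • verify the stack is running - a quick check from Kibana Dev Tools, assuming the compose file exposes the default ports (9200 for elasticsearch, 5601 for kibana)
    GET _cluster/health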

introduction

  • searching natural language is inherently imprecise - computers can't comprehend natural language
    • they need heuristics - algorithmic approximations of true linguistic comprehension
  • Lucene is composed of many text processing tools
    • example: prefix query
      • simply matches the beginning letters of words
    • example: fuzzy queries
      • find words that need at most a certain number of 'edits' to match the query
    • example: snowball stemmer and the metaphone phonetic analyzer
      • mimic grammatical and phonetic aspects of language
    • example: autocomplete functionality (other names: search as you type, type-ahead search)

ngram

  • splitting a token into multiple subtokens
  • sliding window that moves across the word - continuous sequence of characters of the specified length
  • example
    • token: pillar
    • 1-grams: p, i, l, l, a, r
    • bigrams: pi, il, ll, la, ar
    • trigrams: pil, ill, lla, lar
  • how it can improve searching for an incorrectly spelled word
    • search: "pilar"
    • bigrams for "pillar": pi, il, ll, la, ar
    • bigrams for "pilar": pi, il, la, ar
      • 4 of 5 bigrams match the correctly spelled word (see the _analyze sketch at the end of this section)
  • drawback: more words than intended will match the original
  • use-case
    • analyze text when the language is not known
    • handle multiple languages with a single analyzer
    • analyze text when the language combines words in a different manner than most European languages
      • japanese: no spaces between words
      • german: long compound words
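  • the bigrams above can be reproduced with the _analyze API - a minimal sketch using an inline ngram filter
    GET /_analyze
    {
        "tokenizer": "keyword",
        "filter": [ { "type": "ngram", "min_gram": 2, "max_gram": 2 } ],
        "text": "pillar"
    }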

edge ngrams

  • no user will search for "Database" using the "ase" chunk of characters
    • for many applications, only ngrams that start at the beginning of words are needed
    • that’s where edge n-grams come into play
  • edge n-grams = consecutive prefixes
  • example
    • token: pillar
    • edge n-grams: p, pi, pil, pill, pilla, pillar
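  • the same prefixes can be generated with the _analyze API - a minimal sketch using an inline edge_ngram filter
    GET /_analyze
    {
        "tokenizer": "keyword",
        "filter": [ { "type": "edge_ngram", "min_gram": 1, "max_gram": 6 } ],
        "text": "pillar"
    }
    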
  • helpful for searching words with prefix
    • vs prefix query: a prefix query is much more time consuming at search time
    • but indexing edge ngrams takes longer (and indices are bigger - they contain all the prefixes)
  • index vs search analyzer
    • standard approach: same analyzer at index time and at search time
    • different advice for edge ngrams
      • index time: save all prefixes
      • search time: search for the terms that user typed in
      • good practice to set an upper limit
        • for example, if the search text is longer than 8 characters, fall back to a full-text search instead of matching edge-ngram terms
  • prepare index with edge-ngram analyzer
    PUT /index-name
    {
        "settings": {
            "analysis": {
                "filter": {
                    "autocomplete_filter": { // filter that will split tokens into edge ngrams
                        "type": "edge_ngram",
                        "min_gram": 2, // smallest ngrams to generate, default: 1
                        "max_gram": 5 // largest ngrams to generate, default: 2
                    }
                },
                "analyzer": {
                    "autocomplete_analyzer": { // custom analyzer: standard + autocomplete
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [ "lowercase", "stop", "autocomplete_filter"]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "description": { // field name with autocomplete feature
                    "type": "text", // normal mapping
                    "fields" : { // multi-field mapping
                        "autocomplete": { // mapping responsible for autocomplete feature
                            "type": "text", 
                            "analyzer": "autocomplete_analyzer", 
                            "search_analyzer": "standard" // override - by default, queries use the analyzer defined above
                        },
                        "english": { // other mappings for the same field
                            "type": "text",
                            "analyzer": "english"
                        }
                    }
                }
            }
        }
    }
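
  • checking what the autocomplete analyzer produces - a quick sanity check against the index defined above
    GET /index-name/_analyze
    {
        "analyzer": "autocomplete_analyzer",
        "text": "Database"
    }
    
    • for "Database" the response should contain the lowercased edge ngrams: da, dat, data, datab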
    
  • querying for autocomplete
    GET /index-name/_search
    {
        "query": {
            "bool": {
                "should": { // filter, must
                    "match": { "description.autocomplete": "search for descr" }
                }
            }
        }
    }
    

shingles

  • ngrams at the token level instead of the character level
  • example
    • token: "please divide this sentence into shingles"
    • shingle bigrams: "please divide", "divide this", "this sentence", "sentence into", "into shingles" (shown with _analyze below)
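  • reproducing these shingles with the _analyze API - a minimal sketch using an inline shingle filter with unigrams disabled
    GET /_analyze
    {
        "tokenizer": "standard",
        "filter": [ { "type": "shingle", "min_shingle_size": 2, "max_shingle_size": 2, "output_unigrams": false } ],
        "text": "please divide this sentence into shingles"
    }
    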
  • rare phrase matches will have a bigger score boost than more common phrases
  • upside: phrase matching is pre-baked
    • phrases are built into the index
    • avoids creating phrases at query time and saves some processing time
  • downside: larger indices
  • index-phrases option on a text field
    • two-term word combinations (shingles) are indexed into a separate field
    • allows exact phrase queries (no slop) to run more efficiently, at the expense of a larger index
      • slop = how far apart the terms are allowed to be
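  • enabling index_phrases - a minimal mapping sketch, the field name is only an example
    PUT /index-name
    {
        "mappings": {
            "properties": {
                "description": {
                    "type": "text",
                    "index_phrases": true // two-term shingles are indexed into a separate field to speed up exact phrase queries
                }
            }
        }
    }
    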
  • field type search_as_you_type
    • creates the following fields
      • my_field
        • if an analyzer is not configured, the default analyzer for the index is used
      • my_field._2gram
        • wraps the analyzer of my_field with a shingle token filter of shingle size 2
      • my_field._3gram
        • wraps the analyzer of my_field with a shingle token filter of shingle size 3
      • my_field._index_prefix
        • wraps the analyzer of my_field._3gram with an edge ngram token filter
    • params
      • max_shingle_size - largest shingle size to create (valid values: 2 to 4, default: 3)
  • the most efficient way to serve search-as-you-type is a multi_match query of type bool_prefix
    • bool_prefix - constructs a bool query from the terms
      • each term except the last is used in a term query
      • last term is used in a prefix query
    • example: please divide this sen
      • is parsed into terms: please, divide, this, sen
      • each term except last is used in exact match
      • last term is used as prefix query: sen -> sentence
  • prepare index with shingles
    PUT /index-name
    {
        "mappings": {
            "properties": {
                "description": { "type": "search_as_you_type"}
                ...
            }
        }
    }
    
  • querying
    GET /index-name/_search
    {
        "query": {
            "multi_match": {
                "query": "...",
                "type": "bool_prefix",
                "fields": [
                    "description._2gram",
                    "description._3gram"
                ]
            }
        }
    }
    

stemming

  • process of reducing a word to its root form
    • root form may not be a real word
  • extremely handy when searching
    • match words sharing the root
    • make your searches more flexible than rigid exact matching
  • example
    • word: "administrations"
    • root: "administr"
    • match: "administrator", "administration", "administrate"
  • categories
    • algorithmic stemmers
      • based on a set of rules
      • example - remove the -s and -es from the end of plural words
      • advantages
        • little setup
        • little memory
        • typically faster than dictionary stemmers
      • disadvantages
        • problem with irregular words
          • be, are, and am
          • mouse and mice
          • foot and feet
      • types
        • stemmer - porter stemming algorithm, several languages
        • kstem - algorithmic stemming + built-in dictionary
        • porter_stem - porter stemming algorithm, recommended for English
        • snowball - Snowball-based stemming rules, several languages
        • differences in how aggressive they stem
          • more aggressive = chop off more
    • dictionary stemmers
      • stem words by looking in a dictionary
      • advantages
        • more accurate
        • irregular words
        • distinguishes words that are similar in spelling but not related conceptually
          • example: organ and organization, broker and broken
      • disadvantages
        • stemmer is only as good as its dictionary
          • must include a significant number of words, be updated regularly, change with language trends
        • use a significant amount of RAM
          • can slow the stemming process significantly
      • types
        • hunspell

fuzzy

  • some typos cannot be solved with n-grams
  • Levenshtein distance between two words
    • minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other
  • Damerau-Levenshtein distance = Levenshtein + transposition
    • changing a character (box -> fox)
    • removing a character (black -> lack)
    • inserting a character (sic -> sick)
    • transposing two adjacent characters (act -> cat)
  • Levenshtein vs Damerau-Levenshtein
    • compare 'aex' and 'axe'
    • Levenshtein distance: two edits away
      • 'aex' is as far from 'axe' as 'faxes'
    • Damerau-Levenshtein: one edit away
      • makes greater intuitive sense in most cases
  • fuzzy query
    GET /index-name/_search
    {
        "query": {
            "bool": {
                "should": [
                    { "prefix": { "field-name": "..." } },
                    { "fuzzy": { "field-name": { "value": "...", "fuzziness": 2 } } }
                ]
            }
        }
    }
    
    • returns documents that contain terms similar (LD distance) to the search term
    • creates a set of all possible variations of the search term within a specified edit distance
    • does not analyze the query text first
      • exact matches for each expansion
    • parameters worth mentioning
      • fuzziness - maximum edit distance allowed for matching
      • max_expansions - maximum number of variations created, default: 50
      • prefix_length - number of beginning characters left unchanged when creating expansions, default: 0
    • often combined with prefix search
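  • fuzzy query using the parameters above - a minimal sketch, field name and search value are only placeholders
    GET /index-name/_search
    {
        "query": {
            "fuzzy": {
                "description": {
                    "value": "pilar",
                    "fuzziness": "AUTO", // maximum edit distance, AUTO derives it from the term length
                    "prefix_length": 1, // first character must match exactly, which limits the number of expansions
                    "max_expansions": 20 // cap on the number of generated variations
                }
            }
        }
    }
    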
  • stemming context
    • assume a snowball analyzer: an indexed 'running' is stemmed to 'run'
    • the misspelled 'runninga' is not recognized and stems to 'runninga'
    • a fuzzy search for 'runninga' will therefore not match documents containing 'running'
      • 'run' is more than 2 edits away from 'runninga'
    • can cause a bit of confusion
      • use the simple analyzer with fuzzy queries
      • disable synonyms as well

suggesters

  • suggests documents containing similar looking terms
  • still under development
  • types
    • term-suggester
      • suggests terms based on edit distance (see the query sketch at the end of this section)
    • completion-suggester
      • provides auto-complete/search-as-you-type functionality
      • not meant for spell correction or did-you-mean functionality
      • uses data structures that enable fast lookups, but are costly to build and are stored in-memory
      • supports fuzzy queries
  • define index
    PUT index-name
    {
        "mappings": {
            "properties": {
                "search_associations": {
                    "type": "completion"
                },
                ...
            }
        }
    }
    
  • index document
    POST index-name/_doc
    {
        "search_associations": [ ... ],
    }
    
  • query
    POST index-name/_search
    {
        "suggest": {
            "suggest-name": {
                "prefix": "...",
                "completion": {
                    "field": "search_associations"
                }
            }
        }
    }
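  • term suggester query - a minimal sketch: each term of the provided text is checked against the index and corrections are suggested based on edit distance; field name and text are only placeholders
    POST index-name/_search
    {
        "suggest": {
            "suggest-name": {
                "text": "pilar",
                "term": {
                    "field": "description"
                }
            }
        }
    }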