Smile-SA/elasticsuite

standard_edge_ngram not working fine when using capital letters

Denis2310 opened this issue · 2 comments

Preconditions

Magento Version : 2.4.6-p4

ElasticSuite Version : 2.11.6

Environment : Development

Third party modules :

Steps to reproduce

  1. Create product called "Rotational viscometer: ViscoQC 100 H"
  2. Set standard_edge_ngram as default analyzer for product title
  3. Search for "viscoq", product should be shown because "viscoq" is part of product name
  4. Search for "ViscoQ" or "viscoQ", product should be shown because both are part of product name

Expected result

  1. Create product called "Rotational viscometer: ViscoQC 100 H"
  2. Set standard_edge_ngram as default analyzer for product title
  3. Search for "viscoq", product shown in result
  4. Search for "ViscoQ" or "viscoQ", product shown in result

Actual result

  1. Create product called "Rotational viscometer: ViscoQC 100 H"
  2. Set standard_edge_ngram as default analyzer for product title
  3. Search for "viscoq", product shown in result
  4. Search for "ViscoQ" or "viscoQ", product is not shown in result

I will close this; I have created a plugin that converts the search query text to lowercase.

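use Magento\Search\Model\Query;
use Magento\Search\Model\QueryFactory;

/**
 * After-plugin on QueryFactory::get() that lowercases the search query text.
 */
class LowercaseQueries
{
    public function afterGet(QueryFactory $subject, Query $result): Query
    {
        // get() may be called several times per request; the flag avoids redoing the work.
        if (!$result->hasData('_lowercased')) {
            // Cast guards against a null query text before lowercasing.
            $result->setQueryText(mb_strtolower((string) $result->getQueryText()));
            $result->setData('_lowercased', true);
        }

        return $result;
    }
}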
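
For reference, a plugin like the one above is only picked up once it is declared in a module's `di.xml`. The fragment below is a minimal sketch, not the author's actual configuration: the `Vendor\Module\Plugin` namespace and the plugin name `vendor_module_lowercase_queries` are placeholders, since the snippet above does not show where the class lives.

```xml
<?xml version="1.0"?>
<!-- etc/di.xml of a hypothetical Vendor_Module module -->
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="urn:magento:framework:ObjectManager/etc/config.xsd">
    <type name="Magento\Search\Model\QueryFactory">
        <!-- "type" is an assumed class path; adjust it to the real namespace of LowercaseQueries -->
        <plugin name="vendor_module_lowercase_queries"
                type="Vendor\Module\Plugin\LowercaseQueries" />
    </type>
</config>
```

Note that this approach lowercases the query text for every consumer of the query object, not only for the ElasticSuite request.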

Hello @Denis2310,

Long story short, it could be related to some filter of your custom analyzer.

The main culprits I see would be:

  • either the "word_delimiter" filter (whether "word_delimiter" or "reference_word_delimiter"), which generates word parts on case transitions in addition to letter/digit transitions,
  • or the absence of the "lowercase" filter.

Here is what happens with an unmodified standard analyzer:


viscoQ becomes

  • "viscoQ" and then "viscoq" (x2)

If both word_delimiter.preserve_original and word_delimiter.catenate_all are false, then
viscoQ becomes

  • "visco Q" and then "visco q"

=> This will not match "viscoq".

If in addition the "lowercase" filter is missing, then
viscoQ becomes

  • "visco Q"

=> This will not match "viscoq" either.

If you decided to customize the "standard_edge_ngram" analyzer and replaced its "word_delimiter" filter with the "reference_word_delimiter" filter, that could be the origin of the problem, since the two filters are configured differently:

        <filter name="word_delimiter" type="word_delimiter" language="default">
            <generate_word_parts>true</generate_word_parts>
            <catenate_words>true</catenate_words>
            <catenate_numbers>true</catenate_numbers>
            <catenate_all>true</catenate_all>
            <split_on_case_change>true</split_on_case_change>
            <split_on_numerics>true</split_on_numerics>
            <preserve_original>true</preserve_original>
        </filter>

        <filter name="reference_word_delimiter" type="word_delimiter" language="default">
            <generate_word_parts>true</generate_word_parts>
            <catenate_words>false</catenate_words>
            <catenate_numbers>false</catenate_numbers>
            <catenate_all>false</catenate_all> <!-- differs from word_delimiter -->
            <split_on_case_change>true</split_on_case_change>
            <split_on_numerics>true</split_on_numerics>
            <preserve_original>false</preserve_original> <!-- differs from word_delimiter -->
        </filter>

If you didn't apply any changes, please check that you have enabled the following experimental settings in the search relevance configuration:

  • Spellchecking configuration > Terms vectors configuration > [Experimental] Use all tokens from term vectors
  • Spellchecking configuration > Terms vectors configuration > [Experimental] Use edge ngram analyzer in term vectors
  • Relevance configuration > Exact match configuration > [Experimental] Use default analyzer in exact matching filter query

Regards,