standard_edge_ngram not working fine when using capital letters

Question

Denis2310 opened this issue 6 months ago · 2 comments

Denis2310 commented 6 months ago

Preconditions

Magento Version : 2.4.6-p4

ElasticSuite Version : 2.11.6

Environment : Development

Third party modules :

Steps to reproduce

Create a product named "Rotational viscometer: ViscoQC 100 H"
Set standard_edge_ngram as the default analyzer for the product title
Search for "viscoq": the product should be shown, because "viscoq" is the beginning of a word in the product name
Search for "ViscoQ" or "viscoQ": the product should be shown as well, because both match the product name when case is ignored

Expected result

Create product called "Rotational viscometer: ViscoQC 100 H"
Set standard_edge_ngram as default analyzer for product title
Search for "viscoq", product shown in result
Search for "ViscoQ" or "viscoQ", product shown in result

Actual result

Create product called "Rotational viscometer: ViscoQC 100 H"
Set standard_edge_ngram as default analyzer for product title
Search for "viscoq", product shown in result
Search for "ViscoQ" or "viscoQ", product is not shown in result

Answer 1 · 2024-07-19T04:48:48.000Z

I will close this: I have created a plugin that converts the search query text to lowercase.

use Magento\Search\Model\Query;
use Magento\Search\Model\QueryFactory;

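/**
 * Plugin on QueryFactory::get() that lowercases the search query text.
 */
class LowercaseQueries
{
    public function afterGet(QueryFactory $subject, Query $result): Query
    {
        // The same Query object can be returned more than once;
        // the flag ensures the text is lowercased only one time.
        if (!$result->hasData('_lowercased')) {
            // mb_strtolower also handles multibyte (non-ASCII) characters.
            $result->setQueryText(mb_strtolower((string) $result->getQueryText()));
            $result->setData('_lowercased', true);
        }

        return $result;
    }
}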
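
For the plugin above to take effect, Magento needs it declared in a module's di.xml. A minimal sketch, assuming the class lives in a hypothetical module under the namespace Vendor\Module\Plugin (neither that namespace nor the plugin name lowercase_search_queries comes from this thread):

```xml
<?xml version="1.0"?>
<!-- etc/di.xml of a hypothetical Vendor_Module module -->
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="urn:magento:framework:ObjectManager/etc/config.xsd">
    <!-- Intercept QueryFactory so that afterGet() runs after QueryFactory::get() -->
    <type name="Magento\Search\Model\QueryFactory">
        <plugin name="lowercase_search_queries"
                type="Vendor\Module\Plugin\LowercaseQueries"/>
    </type>
</config>
```

After adding the declaration, the cache and generated code usually need to be refreshed before the plugin is picked up.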

Answer 2 · 2024-07-19T06:21:15.000Z

Hello @Denis2310,

Long story short, it could be related to one of the filters of your custom analyzer.

The main culprits I see would be:

either the word delimiter filter ("word_delimiter" or "reference_word_delimiter"), which generates word parts on case transitions in addition to letter/digit transitions,
or the absence of the "lowercase" filter.

Here is what happens with an unmodified standard analyzer:

viscoQ becomes

"viscoQ" and then "viscoq" (x2)

If both word_delimiter.preserve_original and word_delimiter.catenate_all are false, then
viscoQ becomes

"visco Q" and then "visco q"

=> This will not match "viscoq".

If in addition the "lowercase" filter is missing, then
viscoQ becomes

"visco Q"

=> This will not match "viscoq" either.

If you decided to customize the "standard_edge_ngram" analyzer and replaced its "word_delimiter" filter with the "reference_word_delimiter" filter, that could be the origin of the problem, since the two filters are configured differently:

        <filter name="word_delimiter" type="word_delimiter" language="default">
            <generate_word_parts>true</generate_word_parts>
            <catenate_words>true</catenate_words>
            <catenate_numbers>true</catenate_numbers>
            <catenate_all>true</catenate_all>
            <split_on_case_change>true</split_on_case_change>
            <split_on_numerics>true</split_on_numerics>
            <preserve_original>true</preserve_original>
        </filter>

        <filter name="reference_word_delimiter" type="word_delimiter" language="default">
            <generate_word_parts>true</generate_word_parts>
            <catenate_words>false</catenate_words>
            <catenate_numbers>false</catenate_numbers>
            <catenate_all>false</catenate_all> <!-- <== differs from word_delimiter -->
            <split_on_case_change>true</split_on_case_change>
            <split_on_numerics>true</split_on_numerics>
            <preserve_original>false</preserve_original> <!-- <== differs from word_delimiter -->
        </filter>

If you didn't apply any changes, please check whether you enabled the following experimental settings in the search relevance configuration:

Spellchecking configuration > Terms vectors configuration > [Experimental] Use all tokens from term vectors
Spellchecking configuration > Terms vectors configuration > [Experimental] Use edge ngram analyzer in term vectors
Relevance configuration > Exact match configuration > [Experimental] Use default analyzer in exact matching filter query

Regards,