nut-tree/nut.js

Endless loop in findPartialMatches, azure plugin code

jweingarten-rv opened this issue · 6 comments

Version

    "@nut-tree/nl-matcher": "^3.0.2",
    "@nut-tree/nut-js": "^4.3.0",
    "@nut-tree/plugin-azure": "^2.1.0",

Short overview
While upgrading plugin-azure from version 1 to 2.1.0 we are running into an endless loop that boils down to a bug in findPartialMatches,js in @nut-tree\plugin-azure\dist\ocr\stringQueryProcessing.function.js.

node version:
v20.11.0

OS type and version:
Windows / Mac

Detailed error description
Full code sample to reproduce

While upgrading plugin-azure from version 1 to 2.1.0 we are running into an endless loop that boils down to a bug in findPartialMatches in @nut-tree\plugin-azure\dist\ocr\stringQueryProcessing.function.js.

During execution the function findPartialMatches is called with the following parameters:

line.text: Edit Actions: Clear All
line.words: ['Edit', 'Actions:', 'Clear', 'All']
config: {partialMatch: true}
searchWords: ['edit', 'action']
pm: []

Note that the line has Edit Actions: and we are searching for edit actions, no colon in the search.

When it loops through the code it ends up in

 else {
   idx += selectedWords.length;
  selectedWords = [];
}

That will reset idx back to 0 and you are in an endless loop.

Here is a fix, not pretty and most likely not how you will fix it, that I am currently using so that I can stay on version 2.1.0:

function findPartialMatches(line, config, searchWords, pm) {
    let selectedWords = [];
    nut_js_1.providerRegistry.getLogProvider().debug(`Searching through ${line.words.length} words`);
    for (let idx = 0; idx < line.words.length;) {
        let cnt = -1;
        const currentWord = line.words[idx];
        const currentWordText = config?.caseSensitive ? currentWord.text : currentWord.text.toLocaleLowerCase();
        if (currentWordText === searchWords[0]) {
            selectedWords = [currentWord];
            for (let i = 1; i < searchWords.length; ++i) {
                const nextWord = line.words[idx + i];
                const nextWordText = config?.caseSensitive ? nextWord.text : nextWord.text.toLocaleLowerCase();
                if (nextWordText !== searchWords[i]) {
                    cnt = idx + i;
                    selectedWords = [];
                    break;
                }
                else {
                    selectedWords.push(nextWord);
                }
            }
            if (selectedWords.length === searchWords.length) {
                idx += selectedWords.length;
                const wordsBoundingBox = (0, determineWordBoundingBox_function_1.determineWordBoundingBoxFunction)(selectedWords);
                if (wordsBoundingBox != null) {
                    pm.push({
                        ...line,
                        words: selectedWords,
                        boundingBox: (0, mapBoundingBoxToRegion_function_1.mapRegionToBoundingBox)(wordsBoundingBox)
                    });
                    nut_js_1.providerRegistry.getLogProvider().debug(`Found match ${selectedWords.map(word => word.text).join(" ")} at ${JSON.stringify(wordsBoundingBox)}`);
                }
            }
            else {
                idx = cnt !== -1 ? cnt : selectedWords.length
                selectedWords = [];
            }
        }
        else {
            ++idx;
        }
    }
}

Additional content

Please provide any (mandatory) additional data to reproduce the error (Dockerfiles etc.)

Hi @jweingarten-rv 👋

Thank you for this report and the detailed analysis!
I can confirm this bug and I'm currently evaluating possible ways to address this.

What's your expected outcome, given your input data?

Hi @s1hofmann,

well I actually was thinking about that as well a bit. To me the partialMatch setting is a bit unclear. Especially when I see that Azure comes back with an array of strings. Should in this case it pass evenhough there is a : in the text? I honestly am not sure.

The person that wrote the test, that uncovered the issue would most likely expect that Edit Actions would partially match Edit Actions: Clear All.

But does that imply that in case of a textLine search we just make sure that indexOf is >=0 anywhere in the line found? Someone could also expect that if you search for Edit Clear in a string of value Edit Actions: Clear All it should come back as found.
Not sure if there is a good solution without breaking existing use cases.

One option could be to add more options. In conjunction with partialMatch additional settings like ignoreNonAlphanumericCharacters

In my specific use case, I can go either way and make the test pass. Specifically its selecting a menu and I just added the : to the search string to make it pass.