Bug in extractor-enum.js with original text indexes
alberchou opened this issue · 1 comments
alberchou commented
Good afternoon,
I was having an issue with repeated tokens (I want to recognize operations over a query), and I think the function extract(srcInput) in extractor-enum.js has a small bug: originalTextIndex is increased by the token length but not by the length of the separators.
For example:
- You have the following entity to be recognized: sum
- You process the following sentence: I want the sum of something1, sum of something2, sum of something3... , sum of something10
- Because the split characters (space or ,) are not counted, the originalPositionMap dictionary ends up with repeated values.
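A minimal sketch of the effect, not the actual node-nlp code. It assumes each token's position is found by searching the original text forward from the running index (which the name originalTextIndex suggests, but this is an assumption), and it uses a hypothetical tokenize helper that simply splits on spaces and commas. The sentence is a shortened version of the example above.

```javascript
// Hypothetical stand-in for the real tokenizer: split on spaces and commas.
function tokenize(text) {
  return text.split(/[\s,]+/).filter((token) => token.length > 0);
}

// Mirrors the reported behaviour: the running index grows by token length
// only, so it falls one character further behind for every separator.
function lengthOnlyPositions(text) {
  const positions = [];
  let originalTextIndex = 0;
  for (const token of tokenize(text)) {
    positions.push(text.indexOf(token, originalTextIndex));
    originalTextIndex += token.length; // separators are never counted
  }
  return positions;
}

const sentence = 'I want the sum of a, sum of b, sum of c';
const positions = lengthOnlyPositions(sentence);
console.log(positions);
// [0, 2, 7, 11, 15, 18, 21, 25, 28, 21, 25, 38]
// The third "sum" and "of" resolve to the offsets of the second pair.
console.log(new Set(positions).size < positions.length); // true
```

Once the search start has fallen far enough behind, the forward search finds an earlier occurrence of a repeated token, which is how the same position can be recorded twice.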
I'm using version 4.27.0:
npm list node-nlp
`-- node-nlp@4.27.0
It's happening in extractor-enum.js, lines 306 to 322 (async extract(srcInput)).
Best regards.
alberchou commented
I think that changing this:
originalTextIndex += tokenizeResult.tokens[i].length;
to this:
originalTextIndex = originaltextPos + tokenizeResult.tokens[i].length;
may solve the problem.
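A rough sketch of the idea behind this change, under the same caveats: this is not the library source, tokenize is a hypothetical helper that splits on spaces and commas, and originaltextPos is assumed to be the offset where the current token was found in the original text.

```javascript
// Hypothetical stand-in for the real tokenizer: split on spaces and commas.
function tokenize(text) {
  return text.split(/[\s,]+/).filter((token) => token.length > 0);
}

// Resume each search just past the token that was actually found, so the
// separators skipped by the search are accounted for implicitly.
function separatorAwarePositions(text) {
  const positions = [];
  let originalTextIndex = 0;
  for (const token of tokenize(text)) {
    const originaltextPos = text.indexOf(token, originalTextIndex);
    positions.push(originaltextPos);
    originalTextIndex = originaltextPos + token.length;
  }
  return positions;
}

const fixedPositions = separatorAwarePositions(
  'I want the sum of a, sum of b, sum of c'
);
console.log(fixedPositions);
// [0, 2, 7, 11, 15, 18, 21, 25, 28, 31, 35, 38]
```

Every token now maps to a distinct offset. The sketch does not handle indexOf returning -1, which could happen if the tokenizer normalizes a token so that it no longer appears verbatim in the input.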