Issue with longest string matching
giriannamalai opened this issue · 3 comments
When one keyword overlaps another, Flashtext does not pick the longest phrase.
For example:
```python
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('love python', 'Luv Py')
keyword_processor.add_keyword('python programming in ML', 'Luv Py2')
keyword_processor.add_keyword('Love Django', 'Django')
sentence = "I love python programming in ML"
keywords_found = keyword_processor.extract_keywords(sentence)
keywords_found  # ['Luv Py']
```
The expected result is 'Luv Py2', but we got 'Luv Py'.
Any help to fix this?
Hi @giriannamalai,
flashtext does always take the longest phrase, but only among keywords that start with the same word. In your case, one of your keywords is "love python", stored in the trie (a nested dictionary) as `EOL-l-o-v-e-[space]-p-y-t-h-o-n-EOL`. The other is "python programming in ML", stored as `EOL-p-y-t-h-o-n-[space]-p-r-o-g-r-a-m-m-i-n-g.....-EOL`. For the sentence "I love python programming in ML", flashtext reads each character from start to end. When it reaches the character `l` of `love`, it matches the phrase "love python" and extracts that keyword. In the remaining text, "programming in ML", it doesn't find any keywords.
So if you replace the keyword "python programming in ML" with "love python programming in ML", your actual result would be "Luv Py2".
Kind Regards,
Nandan Thakur
No, this won't work. The example illustrates a general scenario, so changing the input is not a good idea. We need to find another approach.
I switched to this library: https://github.com/WojciechMula/pyahocorasick/ which can handle overlapping matches.
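To illustrate what "overlapping matches" buys here, the following is a naive sketch that reports every keyword occurrence, including overlapping ones, and then picks the longest. It deliberately does not use pyahocorasick's API: `find_all_matches` is a hypothetical helper built on `str.find`, it ignores word boundaries, and it is far slower than an Aho-Corasick automaton on large keyword sets.

```python
def find_all_matches(sentence, keywords):
    """Return (start, end, clean_name) for every keyword occurrence,
    including occurrences that overlap each other."""
    text = sentence.lower()
    matches = []
    for phrase, clean_name in keywords.items():
        start = text.find(phrase)
        while start != -1:
            matches.append((start, start + len(phrase), clean_name))
            start = text.find(phrase, start + 1)
    return sorted(matches)


keywords = {
    "love python": "Luv Py",
    "python programming in ml": "Luv Py2",
    "love django": "Django",
}
matches = find_all_matches("I love python programming in ML", keywords)
# Choose the match covering the most characters.
longest = max(matches, key=lambda m: m[1] - m[0])
print(matches)
print(longest[2])
```

With all overlapping candidates available, the caller decides the selection rule (longest span here) instead of being bound to the first match found by a left-to-right scan.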