Decompound Words For German Language
In German it is common to combine nouns without whitespace.
e.g.
apple => Apfel
tree => Baum
apple tree => Apfelbaum (no white space between the two words)
That said, searching for "Baum" (tree) should also give a hit for the apple tree. If there are documents with "Baum" and "Apfelbaum", users may expect the document with "Baum" to be ranked higher, but they also expect to find "Apfelbaum" in the results.
In Elasticsearch there is a HyphenationCompoundWordTokenFilter that splits words using a hyphenation ruleset and a word list. The hyphenation ruleset helps to avoid splitting words in the wrong place and may speed up the search for words within other words.
Anyway, any simple tokenizer that uses a word list to split the words would help a lot.
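To illustrate the word-list idea, here is a minimal Python sketch of a dictionary-based decompounder. It is not how Elasticsearch or lnx do it: the names decompound and index_tokens, the min_part parameter and the tiny word list are made up for this example, and it ignores hyphenation rules and German linking elements (such as the "s" in "Arbeitszimmer").

```python
def decompound(word, dictionary, min_part=3):
    """Split `word` into parts found in `dictionary`; return None if that fails."""
    word = word.lower()
    if len(word) < min_part:
        return None
    # Try the longest head first, then recurse on the remainder.
    for end in range(len(word), min_part - 1, -1):
        head = word[:end]
        if head not in dictionary:
            continue
        if end == len(word):
            return [head]
        tail = decompound(word[end:], dictionary, min_part)
        if tail is not None:
            return [head] + tail
    return None


def index_tokens(word, dictionary):
    """Return the original word plus its parts, so both can be indexed."""
    tokens = [word.lower()]
    parts = decompound(word, dictionary)
    if parts and len(parts) > 1:
        tokens.extend(parts)
    return tokens


# Hypothetical word list, for illustration only.
WORDS = {"apfel", "baum", "nuss", "schale"}

print(index_tokens("Apfelbaum", WORDS))  # ['apfelbaum', 'apfel', 'baum']
print(index_tokens("Zierleiste", WORDS))  # ['zierleiste']
```

Indexing the parts next to the original word is what would let a query for "Baum" reach "Apfelbaum" while still allowing the exact term to be ranked separately.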
This should actually already be done via the fast-fuzzy system to some extent (although this could definitely be added to the traditional fuzzy system).
In theory the system should split "apfelbaum" into "apfel baum", provided the documents contain a high frequency count of "apfel" or "baum" and not "apfelbaum" (if it matches exactly the system won't try to segment it), unless something else is closer / occurs more commonly.
The second part of the idea you mentioned, i.e. joining parts back up, isn't currently implemented automatically, although it is partly possible via synonyms by using apfel:apfelbaum as a mapping. I can see the possible use for doing this automatically (although maybe this could be an opt-in feature and/or only done when the system has started to run out of options with the original method).
You can also manually tell the system to segment while keeping the original word by adding the synonym mapping apfelbaum:apfel,baum
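As a rough illustration of the segmentation behaviour described in this comment, here is a small Python sketch. It is only a guess at the general idea, not the fast-fuzzy implementation: the segment function, the frequencies table and the rule "both halves must be known, highest combined count wins" are assumptions made for the example.

```python
def segment(term, frequencies):
    """Split `term` in two if it is unknown but both halves are known words."""
    term = term.lower()
    # An exact match is never segmented.
    if term in frequencies:
        return [term]
    best, best_score = [term], 0
    for i in range(1, len(term)):
        left, right = term[:i], term[i:]
        if left in frequencies and right in frequencies:
            score = frequencies[left] + frequencies[right]
            if score > best_score:
                best, best_score = [left, right], score
    return best


# Hypothetical term counts taken from an index.
counts = {"apfel": 120, "baum": 95}

print(segment("apfelbaum", counts))  # ['apfel', 'baum']
print(segment("apfelbaum", {**counts, "apfelbaum": 3}))  # ['apfelbaum']
```

Note that this only goes in one direction: a query for "apfelbaum" gets split, but a query for "baum" is never joined back up to "apfelbaum", which is the gap mentioned above.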
The fast-fuzzy system without synonyms does not give good results in my test. Using the synonyms the way you suggested also didn't work for me:
Searching for "Baum" in a dataset containing "Apfelbaum" only returned the documents with the exact match "Baum".
{
  "query": [
    {
      "fuzzy": {
        "ctx": "Baum"
      },
      "occur": "must"
    }
  ]
}
with synonyms (already added before indexing)
{"apfelbaum":["baum","apfel"],"apfelschale":["schale","apfel"],"nussbaum":["baum","nuss"]}
Using synonyms the other way round improved the results a lot:
Apfel,Baum:Apfelbaum
That way searching for "Baum" or "Apfel" also returns the document "Apfelbaum" as a result.
search:
{
  "query": [
    {
      "fuzzy": {
        "ctx": "Baum"
      },
      "occur": "must"
    }
  ]
}
with synonyms:
{"schale":["apfelschale"],"apfel":["apfelschale","apfelbaum"],"baum":["apfelbaum","nussbaum"],"nuss":["nussbaum"]}
result:
with score:410.13638 title:Apfelbaum
with score:410.13638 title:Nussbaum
with score:410.13638 title:Baum
with score:23.718506 title:Zierleiste
Sadly, in that case "Apfelbaum" is listed before the exact match (all three hits get the same score), which I think could be improved in LNX, because an exact match should always rank higher than synonym matches.
But even if this could be improved, I am not sure how to create these synonyms automatically.
I guess without this feature LNX cannot be used with the German language to produce good results.
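One possible way to generate the "part -> compounds" synonyms from an existing vocabulary is sketched below in Python. This is only an illustration under assumptions: build_part_synonyms and min_part are made-up names, the vocabulary is the handful of words from this thread, and matching on prefixes and suffixes will produce false positives on real data (e.g. "Baumeister" starts with "Baum").

```python
import json


def build_part_synonyms(vocabulary, min_part=3):
    """Map each word to the longer vocabulary words that start or end with it."""
    words = {w.lower() for w in vocabulary}
    synonyms = {}
    for compound in words:
        for part in words:
            if part == compound or len(part) < min_part:
                continue
            if compound.startswith(part) or compound.endswith(part):
                synonyms.setdefault(part, []).append(compound)
    # Sort keys and values so the output is stable between runs.
    return {part: sorted(found) for part, found in sorted(synonyms.items())}


vocabulary = ["Apfel", "Baum", "Nuss", "Schale",
              "Apfelbaum", "Apfelschale", "Nussbaum"]

print(json.dumps(build_part_synonyms(vocabulary)))
```

For this vocabulary the result has the same mappings as the hand-written synonyms above, so the output could be fed to the synonyms setting in the same way, assuming the unique terms of the dataset can be exported first.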
Yeah I think we should do some sort of boost adjustment to have the system display the results better.
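Until something like that exists in the engine, a boost could be approximated on the client side. The Python sketch below re-ranks a result list so that exact title matches come first; the rerank function, the exact_boost factor and the hit shape (title and score keys, modelled on the result listing above) are assumptions for illustration, not an lnx feature.

```python
def rerank(query, hits, exact_boost=2.0):
    """Multiply the score of exact title matches, then sort by score."""
    wanted = query.lower()
    boosted = []
    for hit in hits:
        score = hit["score"]
        if hit["title"].lower() == wanted:
            score *= exact_boost
        boosted.append({**hit, "score": score})
    # sorted() is stable, so hits with equal scores keep their original order.
    return sorted(boosted, key=lambda h: h["score"], reverse=True)


hits = [
    {"title": "Apfelbaum", "score": 410.13638},
    {"title": "Nussbaum", "score": 410.13638},
    {"title": "Baum", "score": 410.13638},
    {"title": "Zierleiste", "score": 23.718506},
]

for hit in rerank("Baum", hits):
    print(hit["title"], hit["score"])
```

With the scores from the listing above, "Baum" moves to the top and the synonym matches keep their relative order.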
Searching for "Baum" in a dataset containing "Apfelbaum" did only return the documents with exact match "Baum"
This is quite a difficult problem to solve in terms of deciding how the system should match close terms or if we should try something more radical like modifying the tokenized text in such a way that it matches those sorts of things.
But for now I'm not sure how easy improving it would be realistically. I don't think you'll match "Apfelbaum" for the term "Baum" in many existing systems anyway right now, because most are prefix-based (although I'm not saying that's ideal).
I am not sure how to create this synonyms automatically.
That is ultimately the biggest issue plaguing relevancy with stuff like this: how and when should terms count as 'similar' / related without human intervention?
This is the case with some Armenian words too.
Maybe we could try using something like Hugging Face Tokenizers or Hugging Face Transformers and use NLP models (in this case stop words) distributed online, instead of creating new ones from scratch (and if they do not exist we will create them separately for each missing language of course)? There is this library that I found called rust_bert, which essentially is a port of Hugging Face's official transformer API to Rust and it seems like it is very well-maintained.
This might heavily impact the performance though. What do you think?
Hugging Face have all their tokenizers in Rust to begin with; it's what the backbone of their system is written in.
But yes, I'm slightly concerned about the performance impact it would have (which I imagine is a lot).
This is the case with some Armenian words too.
Armenian words also suffer from the system having slightly conflicting methods of normalizing the unicode right now, so it's largely trying its hardest, but it's difficult when two systems are potentially producing different conversions (for fast fuzzy, that is).
This issue also occurs with CJK languages.
I think the tokenizing system could do with being played around with to try to improve relevancy in places, especially for non-whitespace-separated languages.
I see. Let's say that .. (or something similar) is an optional whitespace specifier. So, in the stop words file, we will just need to specify apfel..baum and let the tokenizer do its thing.
What do you think about drafting a protocol for this? Or is there no need to do so?
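To make the proposal a bit more concrete, here is a Python sketch of how entries such as apfel..baum could be read and applied at tokenization time. Everything in it is hypothetical: no such file format exists in lnx, and the names load_compounds and tokenize are only for illustration.

```python
SEPARATOR = ".."  # the proposed optional whitespace specifier


def load_compounds(lines, separator=SEPARATOR):
    """Turn lines like 'apfel..baum' into {'apfelbaum': ['apfel', 'baum']}."""
    table = {}
    for line in lines:
        entry = line.strip().lower()
        if separator not in entry:
            continue  # ordinary entry without a split marker
        parts = [part for part in entry.split(separator) if part]
        table["".join(parts)] = parts
    return table


def tokenize(text, table):
    """Whitespace tokenizer that also emits the parts of listed compounds."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)
        tokens.extend(table.get(word, []))
    return tokens


table = load_compounds(["apfel..baum", "nuss..baum", "und"])
print(tokenize("Apfelbaum und Garten", table))
# ['apfelbaum', 'apfel', 'baum', 'und', 'garten']
```

Every compound still has to be listed by hand with this approach, so it only covers words someone has thought of in advance.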
That would be an idea, although I'm not sure how well that would work in practice; if you have hundreds of these words it's going to get pretty tedious.
I think we should start by trying to improve the automatic tokenization first before trying to do manual special casing (although that certainly would still be a good idea)
@keywan-ghadami-oxid @michaelgrigoryan25 I have an experimental branch relevancy-tests if you fancy trying that out and giving feedback on how the relevancy is in your respective languages. Although it's slightly out of tune atm it might be a possible solution.
This does tank the English relevancy however.
Sure!
Your change improves the relevance a lot. I did not need any synonym settings, and I have already done a lot of testing.
One downside of the change is that indexing is now a lot slower. Maybe it should be configurable which tokenizer to use.
Anyway thank you already for this great improvement.
Yeah... It is considerably slower and also affects the English relevancy quite negatively. Although it seems like a step in the right direction.
I guess indexing all n-grams is quite heavy and much more than is needed (sometimes even harmful). I was wondering if using a Rust hyphenation library to first check good positions to split the words, in combination with matching the tokens against a language-specific dictionary, could drastically reduce the number of tokens. See this:
https://github.com/uschindler/german-decompounder
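A quick Python sketch of why filtering against a dictionary could reduce the token count so much. It assumes every substring of at least three characters is indexed, which may not be what the relevancy-tests branch actually does, and it leaves out the hyphenation step entirely; the function names and the word list are made up for the example.

```python
def ngrams(word, min_n=3):
    """All distinct substrings of `word` with at least `min_n` characters."""
    word = word.lower()
    return {
        word[start:end]
        for start in range(len(word))
        for end in range(start + min_n, len(word) + 1)
    }


def dictionary_ngrams(word, dictionary, min_n=3):
    """Keep only the substrings that are real words in the dictionary."""
    return ngrams(word, min_n) & dictionary


# Hypothetical language-specific word list.
WORDS = {"apfel", "baum", "nuss", "schale"}

print(len(ngrams("Apfelbaum")))  # 28
print(sorted(dictionary_ngrams("Apfelbaum", WORDS)))  # ['apfel', 'baum']
```

For this one nine-letter word the index would hold 2 extra tokens instead of 28, and the gap grows roughly quadratically with word length.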
Another (maybe stupid or genius) idea would be to use the word list (together with some stemming and hyphen rules) to build a giant regular expression that can be used to split any string into words in linear time.
So I've worked out a system that should work about as well as we can realistically do without compromising on relevancy for other languages or indexing time.
Hopefully I should have a test system available soon, but for the most part it brings prefix searching for free, with no additional overhead to what already exists in lnx, i.e. apple will match appletree. There is also the ability to opt into suffix support. The reason this is opt-in is that it adds additional load to memory usage when processing a commit and can potentially increase commit times (although not by much). If enabled, this allows you to match appletree when you search for the query tree.
Note there's one real caveat:
Prefix and suffix search only works for words that are under 7 characters long. In theory this can be increased, but that comes at the cost of additional processing time, slower searches and less relevancy for most other words. I.e. words like apple/Apfel will match appletree/Apfelbaum respectively, but words like wonderer won't have a prefix search, potentially missing terms like wonderers. Realistically, though, most words where you would have a prefix situation probably fit in under 6 characters. The same logic applies to suffix searching.
The plus side, though, is that this is essentially free (minus some commit time for suffix search) for us to use and should hopefully drastically improve relevancy for situations like this.
Note: when I say it will increase commit time, I mean it goes from taking about 8s to process 400,000 unique words to 17s when suffix processing is enabled. (400,000 unique words is roughly a ~5 million document database of arbitrary user data.)
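The matching rules described above can be modelled in a few lines of Python. This only mimics the described behaviour and says nothing about how lnx implements it internally; the matches function and its parameters are made up, and reading "under 7 characters" as "at most 6" is an assumption.

```python
MAX_AFFIX_LEN = 6  # assumption: "under 7 characters" means at most 6


def matches(query, term, suffix_enabled=False, max_len=MAX_AFFIX_LEN):
    """Return True if `query` would match the indexed `term`."""
    query, term = query.lower(), term.lower()
    if query == term:
        return True
    if len(query) > max_len:
        return False  # long queries only match exactly
    if term.startswith(query):
        return True  # prefix search, always on
    return suffix_enabled and term.endswith(query)  # opt-in suffix search


print(matches("Apfel", "Apfelbaum"))  # True
print(matches("Baum", "Apfelbaum"))  # False
print(matches("Baum", "Apfelbaum", suffix_enabled=True))  # True
print(matches("wonderer", "wonderers"))  # False
```

For the German case in this issue the suffix option is the important one, since the head of a compound such as "Apfelbaum" is its last part.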