Ability to add score for top k results
Opened this issue · 11 comments
Hi, its a nice library ... is there a way to add data to any Node so when retriueving thousand of results for given prefix only the top k results are returned.
Adding any kinf of score to an entry which is internally sorted would be great.
Guess any mix of length and score would be great ... so if there is a key which matches the prefix it should return always as with others which are ranked higher.
This allows for early termination when searching for top k (10 or 20) ...
A custom comparer would be fine so you can customize this behaviour.
Thanks for your feedback.
Could you please provide an example?
GetByPrefix
returns IEnumerable<T>
, so you can simply use LINQ Take
method and it won't iterate through the entire collection. If I understand your question correctly...
Yes but it has to be guaranteed that is sorted ... by any Comparer I set. So T should implement IComparable or the Trie itself to hold a sorted List (or similar)
Hm... can you provide an example (mb some pseudo-code?), because I can't get the idea
And trie is a sorted data structure by nature...
Yes the texts you add / index are sorted but I mean something like this
var trie = new Trie<T>(IComparer<T> comparer); // maybe new EntityOccurenceDescComparer() so all T values are automatically sorted so we can top 10 or 10 to 20 in a fast ways (SortedList maybe supports indexer)
trie.Add(new Entry() { Occurence = 12, Anything = "Nice one" });
...add more entries
var entries = trie.GetByPrefix("ch", int topK = 10) --> gets the top kost items by comparer (could do also over Take(...) if its IEnumerable()
The point is when you have millions of entries and wanna show up 10 to the user it's not useful to return all items of the trie ...
Not quite sure if I understood this correctly, just to confirm:
Let's say you have this object
class Entry { int Rank; }
You construct a trie:
var trie = new StringTrie<Entry>();
trie.Add("ch_key1", new Entry {Rank = 1});
trie.Add("ch_key2", new Entry {Rank = 1});
...
trie.Add("ch_key100", new Entry {Rank = 50});
Then you want to select by prefix (e.g. "ch") let's say first 10 entries considering the Rank property (priority), i.e. "select first 10 entries with prefix ch
starting from the highest rank"
Yes, but in a performant way so the entries have to be presorted. So Linq is no option on the whole set.
Maybe its best to store inside a sortedlist or sorted set within the trie node.
One more example would be prefix "sch" in German language...there are 600000 words. But within my domain I manage some kind of statistics and would fetch these words based om that.
Got it. Yeah, it is not possible with the current implementation.
Right, probably the easiest way will be to extend TrieNode
to be able to swap collection of the Children
property.
I suppose another example could be: word starting with "ch" sorted by occurrence (if you were parsing some text and incrementing the occurrence of words as a property on that node), but now we're drifting into elasticsearch/Apache Solr territory.
I meant occurrences of words...every time a certain word occurs, update the occurrence of that word (in the example above, "Rank" could be "Occurrences"). I was just highlighting another use case to provide validity to the request.
@nhaberl recently I've done some work on the library, and was looking at your case. Consider this example:
Trie trie = ["star", "start", "stack", "stop", "stay", "key"];
// where 'star' has rank 1, 'start' has rank 2, 'stack' has rank 3, etc
{root}
/\
s k
/ \
t e
/ \ \
a o [y]
/ | \ \
[r][y] c [p]
/ \
[t] [k]
where [char] -- is end of word
So the trie node s
will need somehow to store these different ranks, then the node t
will also need to store all the ranks, etc.
This will enormously increase the trie memory size, and complexity of the retrieval.
For your case, I think the more valuable option will be to have multiple tries, e.g.:
Trie trieRank1 = [ ... ]; // all words with rank 1
Trie trieRank2 = [ ... ]; // all words with rank 2
...
and then, let's say you need to select 10 words which start with st
starting with rank 1, then 2, etc.
So you firstly select the required words from trieRank1
, if there are less than 10 results you select from trieRank2
, etc