How does triebeard compare to data.atble

Question

How does triebeard compare to data.atble

statquant opened this issue 8 years ago · 5 comments

Hello,
came across your package and got interested.
data.table is a package that provide enhanced data.frame structures that rely on radix sort.
Did you perform some benchmarks ?
Do you think your package could bring some features to users that are working on column oriented tables ?

Many thanks

Answer 1 · 2017-03-06T07:26:02.000Z

The packages are very, very different, I'm afraid. "Radix" is simply a term for a numerical system - in the case of sorting it refers to the fact that radix sorting is integer based, in the case of radix tries it refers to the fact that the trie structure is defined by a radix (1,2,3,4,5...) referring to how many branches a node occupies.

Radix tries are in no way related to radix sorting and the trie structure isn't tabular or culmnar in any way - it's simply a structure optimised for very, very, very efficient partial and fuzzy matching of string prefixes. So it's not really got anything to do with tabular structures, although it has quite a few use cases because of how it's optimised.

So: nope, no benchmarks (there are no comparable operations, it'd be like comparing apples to....bunyaviruses), and while it is certainly useful for specific vectorised string manipulations (as the README and vignette describe) it exists with or without data.table and similar packages.

Answer 2 · 2017-03-06T08:26:06.000Z

@Ironholds hi, actually I did not intent to suggest things were related (I shouldn't have used the radix word), just that as a data.table user I use a lot of string matching, typically like
DT[myStrColumn %like% "prefix*"] . Which are usually millions or rows.
As my question stated, I was wondering if "your package could bring some features to users that are working on column oriented tables ", benchmarks would have been useful.

Answer 3 · 2017-03-06T09:36:47.000Z

Ahh, gotcha. Possibly? In that particular example, not really: I guess the trie equivalent would be creating a trie containing the prefix and then DT[longest_match(trie, myStrColumn) == "prefix"]. The former is, it seems, actually faster (for a 10m-element vector at least) but I imagine the speed increase would decrease as the trie got larger.

Answer 4 · 2017-03-06T09:54:16.000Z

Ok, data.table advise using the fastmatch package that builds a hash table, as described here .
The "fuzziness" functionality triebeard adds seems interesting, I'll try to test and, should you be interested, come back for feedback.
Thanks

Answer 5 · 2017-03-07T02:36:46.000Z

Yeah, hash tables have a worst worst-case lookup time than tries, but are generally speedier; probably not worth spending too much time with triebeard for your specific use case.