How much memory is needed to load 1 billion tokens with an average length of 5 (range from 1 to 20)?
SeekPoint opened this issue · 9 comments
I personally hate this kind of answer, but I have to say: "it depends".
It depends on three major factors:
- distribution of words -- the underlying data structure compresses common prefixes, so if there are many unique words that don't share prefixes with others, the data structure can become huge;
- python version -- the module adheres to Python's internal representation of strings, i.e. a symbol might occupy one, two, or four bytes; this obviously affects memory consumption, as we need to store the symbols;
- python's allocator -- internal memory fragmentation likely increases the real memory consumption, because we usually allocate small memory chunks.
All in all, I'm unable to estimate it. You can obtain the memory usage of an existing automaton: https://pyahocorasick.readthedocs.io/en/latest/#get-stats-dict.
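For illustration, here is a minimal sketch of how prefix sharing shows up in get_stats; the word lists are made up, and the exact keys of the returned dict may differ between pyahocorasick versions:

import ahocorasick

def build(words):
    A = ahocorasick.Automaton()
    for i, w in enumerate(words):
        A.add_word(w, i)   # store the index as the associated value
    A.make_automaton()
    return A

# words sharing the prefix "test" vs. words with no common prefixes
shared = build(["test", "tests", "tester", "testing"])
unique = build(["alpha", "brick", "crowd", "dizzy"])

print(shared.get_stats())   # fewer trie nodes thanks to prefix compression
print(unique.get_stats())   # more nodes for the same number of words

Comparing the node counts and total sizes reported for the two automata gives a feel for how much the word distribution matters.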
It helps.
I think that since pyahocorasick is implemented in ANSI C and Python only acts as a wrapper, the 'python version' and 'python allocator' should not affect memory consumption.
I tested on 20 million Chinese tokens, and it looks like memory consumption is not a problem on a 16 GB Win10 laptop.
@lovejasmine I'm happy it worked for you. :) Would you please paste the output of the get_stats method here? It might help others.
My log shows that every 3 million tokens take about 4% of 16 GB of memory.
I use:

import psutil

def printSys():
    # report the system-wide memory usage percentage
    info = psutil.virtual_memory()
    print(u'mem percentage:', info.percent)

for the statistics.
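For example, a rough sketch of sampling memory while the automaton grows; the token generator below is just placeholder data:

import psutil
import ahocorasick

A = ahocorasick.Automaton()
tokens = ("tok%d" % i for i in range(3000000))   # placeholder token source
for i, tok in enumerate(tokens, 1):
    A.add_word(tok, i)
    if i % 1000000 == 0:
        # same measurement as printSys above: system-wide memory usage
        print(u'mem percentage:', psutil.virtual_memory().percent)
A.make_automaton()
print(u'final mem percentage:', psutil.virtual_memory().percent)

Note that psutil.virtual_memory().percent reflects system-wide usage, so other processes affect the number as well.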
I wonder whether an optimization for logographic languages like Chinese is necessary. For example, a trie works best for Western European languages like English, which have a very limited character set.
@lerner-zhang, so, do you have any ideas on how a trie works with Chinese, or anything that could replace the trie for Chinese?
@lovejasmine I have found some papers on this topic but have not read them yet. You can google using keywords like "trie" and "Chinese".