Data set is too big to be held in one machine's memory, so I should break it into small daily sets

Question

jackyhawk opened this issue 2 years ago · 2 comments

Thanks for the excellent code.

I have one question: my data set is too big (it cannot be held in one machine's memory), so I need to break it into small daily sets.
I would first generate each day's walk results (sequences) and then train word2vec on them with other code (such as Gensim).
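A minimal sketch of how the per-day walk files could be streamed back for training without loading them all into memory. The class name `WalkCorpus`, the file pattern, and the one-walk-per-line, space-separated format are assumptions for illustration, not something this repo defines.

```python
import glob


class WalkCorpus:
    """Restartable iterable over walk files (hypothetical helper).

    Each line of each file is assumed to hold one walk as
    space-separated node ids. Only one line is in memory at a time.
    """

    def __init__(self, pattern):
        # pattern is a glob such as "walks/walks_*.txt" (assumed layout)
        self.pattern = pattern

    def __iter__(self):
        # Sorted so the file order is the same on every pass.
        for path in sorted(glob.glob(self.pattern)):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    tokens = line.split()
                    if tokens:  # skip blank lines
                        yield tokens
```

Because `__iter__` reopens the files on every call, the object can be iterated several times, which is what a trainer that makes multiple passes needs. An iterable of token lists like this should be usable as the `sentences` argument of Gensim's `Word2Vec`; Gensim also ships `PathLineSentences` for reading a directory of such files, which may make the helper unnecessary.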

All I want is the random walk results.

To get the walk results, should I just return before the part listed below,
and then save dw_rw to disk for later training?

Answer 1 · 2022-05-12T13:45:00.000Z

You will need to deal with multiprocessing slightly better than I do in the training loop. One option would be to just run the random walk generation and write the walks to the file in a single thread. As for the place to return, it is correct.
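The single-threaded option could look roughly like the sketch below. It is not this repo's code: `build_adjacency`, `write_walks`, and their parameters are hypothetical names, and the walk is a plain uniform random walk over an undirected edge list. Each walk is written to disk as soon as it is generated, so only the adjacency lists stay in memory.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build undirected adjacency lists from (u, v) pairs."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def write_walks(adj, path, walks_per_node=10, walk_length=40, seed=0):
    """Write uniform random walks to `path`, one walk per line.

    Runs in a single thread and never keeps more than one walk in memory.
    Returns the number of walks written.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    n_written = 0
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(walks_per_node):
            for start in nodes:
                walk = [start]
                while len(walk) < walk_length:
                    neighbors = adj.get(walk[-1], [])
                    if not neighbors:
                        break  # dead end: stop this walk early
                    walk.append(rng.choice(neighbors))
                f.write(" ".join(map(str, walk)) + "\n")
                n_written += 1
    return n_written
```

Calling `write_walks(build_adjacency(edges), "walks_day1.txt")` once per daily edge list would produce one file per day that can be trained on later.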

Answer 2 · 2022-05-12T15:40:08.000Z

Thanks very much.
Is there any other repo available for generating random walk sequences for big data sets?
I found that when I use a data set with more than 10 million edges, the memory required exceeds my memory capacity (200G).