Long Corpus Build Times

Question

Headline opened this issue 6 years ago · 5 comments

Hello,

Thanks for making this - it has worked very well in the year or so I've been using it.

This project of mine started as a small-scale tinkering project, but now the data consumed by the Markov chain has gotten huge.

2020-02-26T02:41:08.942Z [UtilBot] debug: MarkovStore#parseFile: Parsed N/A with 106836 lines. [Queue: 0]
2020-02-26T02:48:34.196Z [UtilBot] info: MarkovStore#buildCorpus: Markov chain built for 335290997317697536 with 106836 lines

Building the chain from this data took about 7 minutes, and I'm curious about ways to reduce the time it takes to generate the Markov chain.

Thanks!

Answer 1 · 2020-02-26T19:04:45.000Z

Hello!

First, I'm glad to know that you found my module useful, thank you.

The buildCorpus() method is certainly far from optimized. I could try to optimize it, but unless I can reduce its time complexity (and I don't know if I can), you won't notice any significant improvement.

Instead, what about utility methods to import/export the built corpus? I don't know how often you're calling buildCorpus(), or if your corpus changes often, but you could build it once for the day, save the built result, and re-use it later?
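To make the build-once, save, and re-use idea concrete, here is a minimal sketch. TinyMarkov is a hypothetical stand-in written for this example, not the module's real implementation; only the method names export() and import(data) come from this thread, and the data shape they exchange is an assumption.

```typescript
// Hypothetical stand-in for a Markov chain module; it is NOT the real library.
// Only the method names export() and import(data) come from the thread.
type ExportedCorpus = [string, string[]][];

class TinyMarkov {
  private corpus = new Map<string, string[]>();

  // The expensive step: scan every line and record which word follows which.
  addData(lines: string[]): void {
    for (const line of lines) {
      const words = line.split(/\s+/).filter((w) => w.length > 0);
      for (let i = 0; i + 1 < words.length; i++) {
        const followers = this.corpus.get(words[i]) ?? [];
        followers.push(words[i + 1]);
        this.corpus.set(words[i], followers);
      }
    }
  }

  // Return a plain, JSON-friendly snapshot of the built corpus.
  export(): ExportedCorpus {
    return Array.from(this.corpus.entries());
  }

  // Restore a snapshot without re-scanning the original lines.
  import(data: ExportedCorpus): void {
    this.corpus = new Map(data);
  }

  // List the words recorded as following the given word.
  followers(word: string): string[] {
    return this.corpus.get(word) ?? [];
  }
}

// Build once (slow for large inputs), then serialize the result.
const built = new TinyMarkov();
built.addData(["the cat sat", "the dog sat down"]);
const snapshot: string = JSON.stringify(built.export());

// Later, e.g. after a restart: load the snapshot instead of rebuilding.
const restored = new TinyMarkov();
restored.import(JSON.parse(snapshot) as ExportedCorpus);
```

The snapshot string could be written to a file or a database. Loading it skips the line-by-line scan, although parsing still costs time proportional to the size of the snapshot.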

Answer 2 · 2020-02-26T21:50:03.000Z

> Instead, what about utility methods to import/export the built corpus?

Ah; I'm sure this is almost exactly what I'm looking for. Basically I only build the corpus once, and then add to it as more chat messages come in. Being able to export a corpus after every modification would dramatically help reduce load times.
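The "build once, then add to it" workflow can be sketched as follows. All names here (Transitions, addLine, buildAll, serialize) are hypothetical and written for this example; the sketch assumes a simple word-to-next-word count table, which may differ from how the module stores its corpus.

```typescript
// Illustrative transition table: word -> (next word -> count).
type Transitions = Map<string, Map<string, number>>;

// Fold one message into an existing table; cost depends only on this message.
function addLine(table: Transitions, line: string): void {
  const words = line.split(/\s+/).filter((w) => w.length > 0);
  for (let i = 0; i + 1 < words.length; i++) {
    const next = table.get(words[i]) ?? new Map<string, number>();
    next.set(words[i + 1], (next.get(words[i + 1]) ?? 0) + 1);
    table.set(words[i], next);
  }
}

// Full rebuild, for comparison: re-scans every line seen so far.
function buildAll(lines: string[]): Transitions {
  const table: Transitions = new Map();
  for (const line of lines) {
    addLine(table, line);
  }
  return table;
}

// Order entries by key so the serialized form does not depend on insertion order.
function byKey<T>(a: [string, T], b: [string, T]): number {
  return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
}

// Canonical JSON form, usable both for saving and for comparing two tables.
function serialize(table: Transitions): string {
  const rows = Array.from(table.entries())
    .map(([word, next]): [string, [string, number][]] => [
      word,
      Array.from(next.entries()).sort(byKey),
    ])
    .sort(byKey);
  return JSON.stringify(rows);
}

const history = ["hello there friend", "hello again"];
const live = buildAll(history); // one-time build at startup
addLine(live, "hello there again"); // a new chat message arrives

// Rebuilding from scratch gives the same table, only more slowly.
const rebuilt = buildAll([...history, "hello there again"]);
const sameResult = serialize(live) === serialize(rebuilt);
```

Because the incremental update matches a full rebuild, the table only needs to be serialized again after each modification rather than regenerated from every stored message.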

Answer 3 · 2020-03-28T11:19:00.000Z

Hey @Headline, sorry for the delay; I haven't had much time to dedicate to this project lately.

However, I published version 3.0.0-beta.1 with a few changes. I've also added two methods, .export() and .import(data), which are undocumented at the moment.

I hope it will help you before I can finalize this properly.

Answer 4 · 2020-09-25T08:27:09.000Z

This feature is now documented, tested and published (along with other breaking changes) under 3.0.0.

Answer 5 · 2020-09-25T20:47:57.000Z

Thank you!