scambier/markov-strings

Long Corpus Build Times

Headline opened this issue · 5 comments

Hello,

Thanks for making this - it has worked very well in the year or so I've been using it.

This project of mine started as a small-scale tinkering project, but now the data consumed by the Markov chain has gotten huge.

2020-02-26T02:41:08.942Z [UtilBot] debug: MarkovStore#parseFile: Parsed N/A with 106836 lines. [Queue: 0]
2020-02-26T02:48:34.196Z [UtilBot] info: MarkovStore#buildCorpus: Markov chain built for 335290997317697536 with 106836 lines

Building the chain from this data took about 7 minutes, and I'm curious about ways to reduce the time it takes to generate the Markov chain.

Thanks!

Hello!

First, I'm glad to know that you found my module useful, thank you.

The buildCorpus() method is certainly far from optimized. I could try to optimize it, but unless I can reduce its time complexity (and I don't know if I can), you won't notice any significant improvement.

Instead, what about utility methods to import/export the built corpus? I don't know how often you're calling buildCorpus(), or whether your corpus changes often, but you could build it once for the day, save the built result, and reuse it later.
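
The build-once, reuse-later idea can be sketched as a small cache wrapper. This is only a minimal sketch and not part of markov-strings: `loadOrBuild`, the cache path, and the `build` callback are hypothetical names, and it assumes the built corpus is plain data that survives a JSON round trip.

```javascript
const fs = require('fs');

// Hypothetical helper: return the cached corpus if the file exists,
// otherwise run the expensive build once and write the result to disk.
function loadOrBuild(cachePath, build) {
  if (fs.existsSync(cachePath)) {
    // Cache hit: skip the expensive build entirely.
    return JSON.parse(fs.readFileSync(cachePath, 'utf8'));
  }
  // Cache miss: `build` stands in for the slow corpus-building step.
  const corpus = build();
  fs.writeFileSync(cachePath, JSON.stringify(corpus));
  return corpus;
}
```

With something like this, the slow build would only run when the cache file is missing, for example after deleting it once a day.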

> Instead, what about utility methods to import/export the built corpus?

Ah, I'm sure this is almost exactly what I'm looking for. Basically, I only build the corpus once and then add to it as more chat messages come in. Being able to export the corpus after every modification would dramatically reduce load times.

Hey @Headline, sorry for the delay, but I haven't had much time to dedicate to this project lately.

However, I've published version 3.0.0-beta.1 with a few changes. I've also added two (currently undocumented) methods: .export() and .import(data).
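
A minimal sketch of how these two methods might be used to persist the chain between runs. Only `.export()` and `.import(data)` come from the comment above; `saveChain`, `restoreChain`, the file path, and the assumption that the exported value survives a JSON round trip are illustrative, so check the library's documentation before relying on it. The `markov` argument is whatever instance the application already holds.

```javascript
const fs = require('fs');

// Hypothetical helper: write the exported chain to disk.
// Assumes markov.export() returns JSON-serializable data.
function saveChain(markov, filePath) {
  const tmpPath = `${filePath}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(markov.export()));
  // Rename into place so a crash mid-write does not leave a truncated file.
  fs.renameSync(tmpPath, filePath);
}

// Hypothetical helper: load a saved chain into an existing instance.
// Returns false when no saved file exists, so the caller can rebuild.
function restoreChain(markov, filePath) {
  if (!fs.existsSync(filePath)) return false;
  markov.import(JSON.parse(fs.readFileSync(filePath, 'utf8')));
  return true;
}
```

Calling `restoreChain` at startup and `saveChain` after each batch of new messages would avoid a full rebuild on restart, at the cost of rewriting the file on every save.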

I hope it will help you before I can finalize this properly.

This feature is now documented, tested, and published (along with other breaking changes) under 3.0.0.

Thank you!