polyfact/polyfire-js

splitText function is too slow


I've noticed that the splitText function is quite slow. A single call takes about 150 to 300 milliseconds, but when it's applied to a whole list of transcripts in the frontend, the total time adds up and noticeably slows down the app.

We need the splitText function to work faster, even with a big list of transcripts, to keep the app running smoothly.

As a quick fix, I've switched to using TokenTextSplitter from the langchain library, which is much faster for my needs. But this is just a temporary workaround, and it would be great to have a proper fix in polyfire-js itself.
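For anyone hitting the same thing, the workaround looks roughly like this. The chunk size and encoding name are placeholders, not necessarily what you'd want in production, and the import path may differ depending on your langchain version:

```ts
import { TokenTextSplitter } from "langchain/text_splitter";

// Placeholder settings; tune chunkSize/encodingName to your model.
const splitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 512,
  chunkOverlap: 0,
});

async function splitTranscripts(transcripts: string[]): Promise<string[][]> {
  // splitText is async and returns the chunks for one document.
  return Promise.all(transcripts.map((t) => splitter.splitText(t)));
}
```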

I ran into similar problems in the API part a while ago.

A big optimization is to call encode once, do the splitting directly on the token array, and then decode each chunk.
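A minimal sketch of that idea, assuming a tokenizer such as js-tiktoken (just an example, not necessarily what polyfire-js uses internally):

```ts
import { getEncoding } from "js-tiktoken";

// Encode the whole text once, slice the token array into fixed-size
// windows, and decode each window back to a string.
function splitByTokens(text: string, chunkSize: number): string[] {
  const enc = getEncoding("cl100k_base"); // assumed encoding
  const tokens = enc.encode(text);
  const chunks: string[] = [];
  for (let i = 0; i < tokens.length; i += chunkSize) {
    chunks.push(enc.decode(tokens.slice(i, i + chunkSize)));
  }
  return chunks;
}
```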

But even outside of that, I'm not sure we still need an algorithm this complex. Right now it tries as much as possible to cut between paragraphs first, lines second, sentences third, and so on, while keeping the chunks as even as possible.
I feel like that's something we needed during the autodoc era but isn't really relevant anymore.

Maybe we could just do the same thing as in the API and cut at the chunkSize limit, or at least enforce a simple sentence rule: split at every full stop, encode, merge sentences until they reach the chunk size, and decode every chunk. A rough sketch of what I mean is below.
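Something like this, again assuming js-tiktoken just for illustration (the full-stop splitting is deliberately naive and ignores "?", "!", abbreviations, etc.):

```ts
import { getEncoding } from "js-tiktoken";

// Sentence rule: split at every full stop, encode each sentence,
// greedily merge token runs up to chunkSize, then decode each chunk.
function splitBySentences(text: string, chunkSize: number): string[] {
  const enc = getEncoding("cl100k_base"); // assumed encoding
  // Split after each "." but keep surrounding whitespace so decoding
  // the merged tokens reproduces the original text.
  const sentences = text.split(/(?<=\.)/);
  const chunks: string[] = [];
  let current: number[] = [];
  for (const sentence of sentences) {
    const tokens = enc.encode(sentence);
    // Flush the current chunk if adding this sentence would overflow.
    // A single sentence longer than chunkSize becomes its own chunk;
    // a real implementation would probably hard-split it instead.
    if (current.length + tokens.length > chunkSize && current.length > 0) {
      chunks.push(enc.decode(current));
      current = [];
    }
    current = current.concat(tokens);
  }
  if (current.length > 0) chunks.push(enc.decode(current));
  return chunks;
}
```

One caveat with this sketch: encoding sentences separately can tokenize slightly differently than encoding the whole text at once, but for chunk-size budgeting that shouldn't matter much.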