nlmatics/nlm-ingestor

Trivially small chunks returned

thelazydogsback opened this issue · 0 comments

llmsherpa (parsing PDFs via the docker image) seems to be good at keeping tables in single chunks; beyond that, however, it seems to return many trivially small chunks.
These include:

  • Single characters (like a copyright symbol)
  • Small runs of characters like "******************"
  • Single words
  • Single sentences

Unfortunately, each item from a bulleted or numbered list also comes across as a separate chunk, rather than all list items being combined into a single chunk.

I'd expect related items to land in a single chunk, and unrelated items to be merged into larger chunks as well (the sweet spot seems to be about 1000 tokens). I don't see a way to tell the algorithm what the average chunk size and overlap should be when no heuristics apply that would otherwise determine valid semantic chunk boundaries.
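For now I'm working around this with a post-processing pass that drops decoration-only fragments and greedily packs the remaining chunk texts into ~1000-token windows. A minimal sketch, assuming chunks are already extracted as plain strings; the `merge_small_chunks` name is mine, not part of llmsherpa's API, and token counts are approximated by whitespace word counts:

```python
import re


def merge_small_chunks(chunk_texts, target_tokens=1000):
    """Greedily pack chunk strings into chunks of roughly target_tokens.

    Token counts are approximated by whitespace-separated word counts;
    swap in a real tokenizer if exact budgets matter.
    """
    merged, buf, buf_len = [], [], 0
    for text in chunk_texts:
        text = text.strip()
        # Skip decoration-only fragments (e.g. "©", "****", empty strings)
        if not text or not re.search(r"\w", text):
            continue
        n = len(text.split())
        # Flush the buffer once adding this chunk would exceed the target
        if buf_len and buf_len + n > target_tokens:
            merged.append("\n".join(buf))
            buf, buf_len = [], 0
        buf.append(text)
        buf_len += n
    if buf:
        merged.append("\n".join(buf))
    return merged
```

This keeps table chunks intact (they simply occupy most of one window) while folding single-word and single-sentence fragments into their neighbors; it doesn't add overlap, which would need an extra sliding-window step.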