nlmatics/nlm-ingestor

Trivially small chunks returned

thelazydogsback opened this issue · 0 comments

llmsherpa (parsing PDFs via the docker image) seems to be good at keeping tables in single chunks; beyond that, however, it seems to return many trivially small chunks.
These include:

  • Single characters (like a copyright symbol)
  • Small runs of characters like "******************"
  • Single words
  • Single sentences

Unfortunately, each item from a bulleted or numbered list also comes across as a separate chunk, rather than all list items being combined into a single chunk.

I'd expect related items to land in a single chunk, and unrelated items to be merged into larger chunks as well (the sweet spot seems to be about 1000 tokens). I don't see a way to tell the algorithm what the average chunk size and overlap should be when no heuristics apply that would otherwise determine valid semantic chunk boundaries.
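For now I'm working around this with a post-processing pass that drops decoration-only fragments and greedily packs the remaining chunk texts into ~1000-token windows. A minimal sketch, assuming chunks are already extracted as plain strings; the `merge_small_chunks` name is mine, not part of llmsherpa's API, and token counts are approximated by whitespace word counts:

```python
import re


def merge_small_chunks(chunk_texts, target_tokens=1000):
    """Greedily pack chunk strings into chunks of roughly target_tokens.

    Token counts are approximated by whitespace-separated word counts;
    swap in a real tokenizer if exact budgets matter.
    """
    merged, buf, buf_len = [], [], 0
    for text in chunk_texts:
        text = text.strip()
        # Skip decoration-only fragments (e.g. "©", "****", empty strings)
        if not text or not re.search(r"\w", text):
            continue
        n = len(text.split())
        # Flush the buffer once adding this chunk would exceed the target
        if buf_len and buf_len + n > target_tokens:
            merged.append("\n".join(buf))
            buf, buf_len = [], 0
        buf.append(text)
        buf_len += n
    if buf:
        merged.append("\n".join(buf))
    return merged
```

This keeps table chunks intact (they simply occupy most of one window) while folding single-word and single-sentence fragments into their neighbors; it doesn't add overlap, which would need an extra sliding-window step.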