bug(api): Sometimes PDF contents get mangled on extraction
CollectiveUnicorn opened this issue · 0 comments
Steps to reproduce
- Download a PDF like the one attached here: aesopsfables00aeso.pdf
- Upload the PDF using the API
- Create a new vector store or update an existing one using the PDF
- Wait for the vector indexing to complete
- Observe in the DB that the `content` is missing spaces between words.
Expected result
- The `content` piece of the `vector_content` has spaces, so that the text is intelligible when given to the LLM.
Actual Result
- The `content` piece of the `vector_content` does not have spaces, so the LLM receives one run-together blob of text.
Visual Proof (screenshots, videos, text, etc)
Additional Context
- This does not happen for all PDFs, only some of them.
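
Since only some PDFs are affected, a rough check like the one below might help flag which stored `content` values came out mangled. This is a minimal sketch and not part of the API: the function name `looks_space_stripped` and both thresholds are assumptions chosen for illustration, based on the observation that ordinary English prose has a noticeable share of whitespace and fairly short words.

```python
def looks_space_stripped(
    text: str,
    min_space_ratio: float = 0.05,    # assumed threshold, tune per corpus
    max_avg_token_len: float = 12.0,  # assumed threshold, tune per corpus
) -> bool:
    """Heuristically flag text that appears to have lost its word spacing."""
    stripped = text.strip()
    if not stripped:
        # Nothing to judge; treat empty content as "not mangled".
        return False

    # Share of characters that are whitespace.
    space_ratio = sum(ch.isspace() for ch in stripped) / len(stripped)

    # Average length of whitespace-delimited tokens.
    tokens = stripped.split()
    avg_token_len = sum(len(t) for t in tokens) / len(tokens)

    return space_ratio < min_space_ratio or avg_token_len > max_avg_token_len


if __name__ == "__main__":
    print(looks_space_stripped("Thefoxandthegrapesisawellknownfable."))
    print(looks_space_stripped("The fox and the grapes is a well known fable."))
```

Running this over the `content` values for a given file could narrow down which PDFs trigger the problem. It is only a heuristic, so text such as long URLs or hashes could be flagged by mistake.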