Develop a better algorithm for splitting PDF text
Background
The gpt-3.5-turbo API can only handle 4096 tokens (roughly 3000 words) per completion, and that limit covers all messages: the system prompt, the request, and the response.
Note: it is possible to tokenize text precisely with the GPT2Tokenizer class from the transformers library.
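For example, a minimal sketch of counting tokens this way (the GPT-2 vocabulary only approximates the tokenizer gpt-3.5-turbo actually uses, so treat the count as an estimate):

```python
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer once; from_pretrained downloads the vocab files on first use.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    """Return the number of GPT-2 tokens in `text`."""
    return len(tokenizer.encode(text))
```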
So we need to split the full text into chunks and process those chunks separately (and concurrently for performance).
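A rough sketch of fanning the chunks out concurrently, assuming a hypothetical `complete_chunk` coroutine that wraps the OpenAI call (not the handler's actual code):

```python
import asyncio

async def complete_chunk(chunk: str) -> str:
    """Hypothetical wrapper around the OpenAI completion call for one chunk."""
    ...

async def complete_all(chunks: list[str]) -> list[str]:
    # Fire one request per chunk and wait for all of them together.
    return await asyncio.gather(*(complete_chunk(c) for c in chunks))
```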
Currently the text is being split with this method: lifeline/server/src/handlers/openai_api_handler.py, lines 14 to 23 (commit 914fa86).
Task
Design a new splitting algorithm that splits on logical boundaries in the full text instead of every 50 sentences. This should yield more accurate results from text completions. It could also include trimming the text to drop unimportant data (e.g. the last 2 pages of most U of C outlines).
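One possible shape for this, as a sketch rather than a final design: treat blank-line-separated paragraphs as the smallest logical unit and pack them greedily into chunks that stay under a token budget. The `max_tokens` value and the paragraph heuristic below are assumptions, not requirements.

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def split_on_paragraphs(text: str, max_tokens: int = 2500) -> list[str]:
    """Greedily pack paragraphs into chunks that stay under `max_tokens` tokens.

    Paragraph boundaries (blank lines) stand in for the "logical boundaries"
    this issue asks for; a smarter boundary detector can be swapped in later.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        para_tokens = len(tokenizer.encode(para))
        # Start a new chunk if adding this paragraph would blow the budget.
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```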
Getting started
A possible avenue for achieving this is using an NLP library. @harsweet seemed to have some ideas on this. There are established techniques for identifying logical boundaries in text (topic segmentation) that we could apply. This will be a challenging issue to take on!
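For instance, NLTK ships a TextTiling implementation for topic segmentation; a rough sketch of using it as the boundary detector (whether TextTiling is the right technique here is an open question):

```python
import nltk
from nltk.tokenize import TextTilingTokenizer

# TextTiling relies on the stopwords corpus; download it once.
nltk.download("stopwords", quiet=True)

def split_on_topics(text: str) -> list[str]:
    """Split `text` into topically coherent segments with TextTiling.

    TextTiling expects paragraph breaks (blank lines) in the input and can
    raise a ValueError on very short texts, in which case we fall back to
    returning the whole text as a single segment.
    """
    tt = TextTilingTokenizer()
    try:
        return tt.tokenize(text)
    except ValueError:
        return [text]
```

Segments produced this way could then be packed under the token budget the same way as in the paragraph-based sketch above.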