techstartucalgary/lifeline

Develop a better algorithm for splitting pdf text


Background

The gpt-3.5-turbo API can only handle 4096 tokens (~3000 words) in one completion, and that limit includes all messages (the request, the response, and the system prompt).

Note: it is possible to tokenize text precisely with the transformers.GPT2Tokenizer library.
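
For example, a token count could be checked before sending a chunk. This is a minimal sketch only: the "gpt2" model name is an assumption, and GPT-2's tokenizer only approximates gpt-3.5-turbo's actual tokenization.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    """Counts tokens with the GPT-2 tokenizer (an approximation of gpt-3.5-turbo's)."""
    return len(tokenizer.encode(text))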

So we need to split the full text into chunks and process those chunks separately (and concurrently, for performance).

Currently, the text is split with this method:

from typing import List

def split(text: str) -> List[str]:
    """Splits the text into chunks of 50 sentences each."""
    sentences = text.split(". ")
    sentences_per_chunk = 50
    chunks = [
        ". ".join(sentences[i : i + sentences_per_chunk]) + ". "
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
    return chunks

Task

Design a new splitting algorithm that, instead of splitting every 50 sentences, splits on logical boundaries in the full text. This should yield more accurate results from text completions. It could also include trimming the text to exclude unimportant data (e.g. the last 2 pages of most U of C outlines).
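
As a rough illustration of the direction (not a final design), one option is to split on paragraph breaks and pack paragraphs into chunks under a size budget. The max_words value here is an assumed stand-in for a real token budget, and the blank-line heuristic is only one possible notion of "logical boundary".

import re
from typing import List

def split_on_paragraphs(text: str, max_words: int = 2500) -> List[str]:
    """Splits on blank lines (paragraph boundaries) and packs paragraphs
    into chunks that stay under a rough word budget.

    max_words = 2500 is an assumed budget chosen to stay under the
    ~3000-word limit; a real implementation should count tokens instead.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_words = 0
    for paragraph in paragraphs:
        words = len(paragraph.split())
        if current and current_words + words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(paragraph)
        current_words += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks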

Getting started

A possible avenue for achieving this is using an NLP library; @harsweet seemed to have some ideas on this. There are established techniques for identifying logical boundaries in text that we could apply. This will be a challenging issue to take on!
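
For instance, a sentence tokenizer from an NLP library handles abbreviations and decimal points better than splitting on ". ". This is a sketch assuming NLTK; other libraries such as spaCy would work similarly.

from typing import List

import nltk

nltk.download("punkt", quiet=True)  # one-time download of the Punkt sentence model

def split_sentences(text: str) -> List[str]:
    """Returns the sentences detected by NLTK's Punkt tokenizer."""
    return nltk.sent_tokenize(text)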