trantrikien239/cetasearch

[RISK] Trim Context Algorithm Issue

toreleon opened this issue · 0 comments

Hi Kien, I am currently developing the same kind of chatbot as cetasearch. While reading your source, I noticed a risk in your context-trimming code.
You are checking the length of the context by splitting on words, but the number of words is not the same as the number of tokens. Because GPT-3 uses byte-pair encoding, a single word can map to several tokens, so your algorithm may exceed the context limit for rare languages such as Vietnamese if such text appears in list_paragraph[:2]. I recommend using GPT2TokenizerFast (https://huggingface.co/docs/transformers/model_doc/gpt2) to measure the length in tokens instead of splitting.
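To illustrate, here is a minimal sketch of token-aware trimming. `trim_context` and `count_fn` are hypothetical names (not from your code); in practice `count_fn` would wrap a real tokenizer, e.g. `lambda t: len(GPT2TokenizerFast.from_pretrained("gpt2").encode(t))` from Hugging Face transformers.

```python
def trim_context(paragraphs, max_tokens, count_fn):
    """Keep paragraphs in order until the token budget is exhausted.

    count_fn maps a string to its length in tokens. Passing
    `lambda t: len(t.split())` reproduces the word-split behavior,
    which undercounts for languages where BPE splits one word into
    several tokens (e.g. Vietnamese); a GPT2TokenizerFast-based
    counter avoids that.
    """
    kept, used = [], 0
    for paragraph in paragraphs:
        n = count_fn(paragraph)
        if used + n > max_tokens:
            break  # adding this paragraph would overflow the budget
        kept.append(paragraph)
        used += n
    return kept
```

The point is that the same trimming loop gives different (and safe) results once the counter measures actual BPE tokens rather than whitespace-separated words.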