[RISK] Trim Context Algorithm Issue
toreleon opened this issue · 0 comments
toreleon commented
Hi Kien, I am currently developing the same kind of chatbot as catesearch, and while reading your source I noticed a risk in your context-trimming code.
You check the length of the context by splitting on words, but word count is not the same as token count. Because GPT-3 uses byte-pair encoding, your algorithm may fail for some rare languages such as Vietnamese (where one word often maps to several tokens) if such text appears in list_paragraph[:2]. I recommend using GPT2TokenizerFast (https://huggingface.co/docs/transformers/model_doc/gpt2) to measure token length instead of split.