Encoding a new dataset, confused about <|endoftext|> encoding
AliceZhang2016 opened this issue · 4 comments
When encoding a new dataset and using <|endoftext|> as a delimiter, for example:
message <|endoftext|> message
The encode function in "src/encoder.py" will transform "<|endoftext|>" into [27, 91, 437, 1659, 5239, 91, 29] instead of [50256] (50256 is the index of <|endoftext|> in the encoder dict).
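This is easy to reproduce from the repository root (a minimal sketch; depending on the repo version get_encoder may also take a models_dir argument, and '117M' is just an example model):

import encoder  # src/encoder.py

enc = encoder.get_encoder('117M')     # loads encoder.json and vocab.bpe for that model
print(enc.encode('<|endoftext|>'))    # [27, 91, 437, 1659, 5239, 91, 29]
print(enc.encoder['<|endoftext|>'])   # 50256, the id stored in the vocabulary dict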
So I went to check "src/encoder.py" and found that
import regex as re

# the token-splitting pattern used in src/encoder.py
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
text = "<|endoftext|> hi."
for token in re.findall(pat, text):
    print(token)
I get:
<|
endoftext
|>
hi
.
Why does it split <|endoftext|> into three parts (which I think leads to the wrong encoding of <|endoftext|>)? Should it rather be:
<|endoftext|>
hi
.
@AliceZhang2016, I don't know if you have solved this issue already, but the enc.encode('some text goes here') function from openai/gpt-2 assumes the input to be text content only, and is not designed to detect special tokens. I assume @nshepperd (thanks for releasing this repository!) thought the encoder is able to detect the <|endoftext|> token and assign it the value 50256.
I have made a modification inside the load_dataset() function in load_dataset.py to handle it appropriately here.
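Roughly, the idea is to split the raw text on the delimiter, encode each chunk normally, and re-insert the real token id between the chunks. A minimal sketch (encode_with_eot is a hypothetical helper name, not the actual code in that commit; enc is assumed to be the Encoder object returned by encoder.get_encoder):

def encode_with_eot(enc, raw_text, delimiter='<|endoftext|>'):
    eot = enc.encoder[delimiter]              # 50256 in the released GPT-2 vocabulary
    tokens = []
    for i, chunk in enumerate(raw_text.split(delimiter)):
        if i > 0:
            tokens.append(eot)                # re-insert the special token between chunks
        tokens.extend(enc.encode(chunk))      # BPE-encode the ordinary text only
    return tokens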
That being said, I don't think it makes an overly large difference in the fine-tuning process. Running the fine-tuning training script, the network seems to learn to associate the sequence of <|endoftext|> broken down as text with <|endoftext|> itself.
I may be wrong here but let me know your thoughts! Thanks :)
@farrell236 Thanks for sharing your idea : )
Notice that at line 39 in load_dataset.py the author directly adds <|endoftext|> after raw_text, which will be encoded using enc.encode() at line 35. I don't understand your assumption that the encoder is able to detect the <|endoftext|> token, because I don't find any detection code in encoder.py.
I made similar modifications to yours, that is, I encode the plain text with enc.encode() and manually add the encoding of <|endoftext|>. I also agree that it won't make a large difference in the fine-tuning process, but I think encoding <|endoftext|> as a whole is more reasonable, just like what you modified in your code.
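In other words, instead of concatenating the literal string into the text before encoding, the text is encoded first and the delimiter id is appended afterwards, roughly like this (illustrative sketch only, where text stands for the plain file contents; not the actual patch):

ids = enc.encode(text)                    # BPE-encode the plain text only
ids.append(enc.encoder['<|endoftext|>'])  # then append the real delimiter id, 50256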
@AliceZhang2016, I may have worded it badly. I do agree that enc.encode('block before <|endoftext|> block after') does not detect <|endoftext|> as a token, and instead breaks it down into chunks.
In[2]: enc.encode('<|endoftext|>')
Out[2]: [27, 91, 437, 1659, 5239, 91, 29]
which corresponds to the BPE pieces ['<', '|', 'end', 'of', 'text', '|', '>'].
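The same can be checked against the decoder dict of the same enc object (enc.decoder is the reverse id-to-piece mapping built in src/encoder.py):

In[3]: [enc.decoder[i] for i in [27, 91, 437, 1659, 5239, 91, 29]]
Out[3]: ['<', '|', 'end', 'of', 'text', '|', '>']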
The changes were to circumvent this :)
@farrell236 Then I totally agree with you.