LUMIA-Group/rasat

Possible error in adding new tokens to t5_tokenizer

Closed this issue · 5 comments

Is the extra space supposed to be here for the <= and < tokens for the t5_tokenizer?

t5_tokenizer.add_tokens([AddedToken(" <="), AddedToken(" <")])

We add these two tokens to be consistent with the tokenizer in run_seq2seq.py:

tokenizer.add_tokens([AddedToken(" <="), AddedToken(" <")])
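For anyone wanting to check this themselves, here is a minimal sketch of what the leading space does. It assumes a stock T5 checkpoint (`t5-small` here, the repo may use a different one); after `add_tokens`, the string `" <="` (with the space) is a single known token, while the exact tokenization of surrounding text is whatever SentencePiece produces:

```python
from transformers import AutoTokenizer, AddedToken

# Load a T5 tokenizer (t5-small is just an example checkpoint)
tok = AutoTokenizer.from_pretrained("t5-small")

# Register the tokens exactly as in run_seq2seq.py, with the leading space
tok.add_tokens([AddedToken(" <="), AddedToken(" <")])

# The space-prefixed form now maps to a real vocabulary id, not <unk>
print(tok.convert_tokens_to_ids(" <="))

# Inspect how a SQL-like string is split once the tokens are registered
print(tok.tokenize("SELECT age WHERE age <= 5"))
```

Because T5's SentencePiece vocabulary marks word boundaries with a leading-space convention, adding the token with the space lets it match `<=` as it actually appears after a word in raw text; this is a sketch for inspection, not the repo's own test code.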

I see. In either case, I ran evaluation on the Spider dataset both with and without the extra spacing, and it doesn't seem to make a difference in accuracy.

Only a few examples in this dataset contain "<" or "<=", which I guess is why the accuracy was not affected.

That makes sense. I guess I was originally curious as to why the extra space is there, i.e., why the token is " <=" instead of "<=".