Possible error in adding new tokens to t5_tokenizer
Closed this issue · 5 comments
Is the extra space supposed to be here for the " <=" and " <" tokens for the t5_tokenizer?
We add these two tokens to be consistent with the tokenizer in run_seq2seq.py; see Line 155 in b912ce8.
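For illustration, here is a minimal toy sketch (not the real T5/SentencePiece tokenizer, and the vocabularies below are hypothetical) of why the leading space matters: SentencePiece-style tokenizers treat the space before a word as part of the following piece, so " <=" matches the operator as it appears in SQL text, while a bare "<=" leaves the preceding space stranded as an unknown piece.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization over a toy vocabulary;
    characters not covered by any piece become "<unk>"."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(len(text) - i, 0, -1):
            if text[i:i + size] in vocab:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            tokens.append("<unk>")
            i += 1
    return tokens


text = "WHERE age <= 18"

# Token added WITH the leading space: the operator tokenizes cleanly.
with_space = {"WHERE", " age", " <=", " 18"}
print(greedy_tokenize(text, with_space))
# → ['WHERE', ' age', ' <=', ' 18']

# Token added WITHOUT the leading space: the space before "<="
# is left over and falls out of the vocabulary.
without_space = {"WHERE", " age", "<=", " 18"}
print(greedy_tokenize(text, without_space))
# → ['WHERE', ' age', '<unk>', '<=', ' 18']
```

This is only a caricature of the real subword algorithm, but it shows the intent behind adding " <=" rather than "<=".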
I see. In either case, I ran evaluation on the Spider dataset both with and without the extra spacing, and it doesn't seem to make a difference in accuracy.
There are only a few examples containing "<" or "<=" in this dataset; I guess that is why the accuracy was not affected.
That makes sense. I was originally curious as to why the extra space is there, i.e., why it is " <=" instead of "<=".
Well, we just keep what the PICARD author does and do not change it:
https://github.com/ServiceNow/picard/blob/6a252386bed6d4233f0f13f4562d8ae8608e7445/seq2seq/run_seq2seq.py#L140