huggingface/tokenizers

BPE Decoder cleanup option

w-zygmuntowicz opened this issue · 2 comments

Hi there!

At the moment I'm not even sure if this is an issue, so I just wanted to ask. I've been using AutoTokenizer from the transformers package in Python with the HerBERT tokenizer, which uses a BPE decoder. After encoding and decoding a sample text, e.g. "How are you?", I got <s>How are you? </s>. On the other hand, when I use the tokenizers library I get a different output: How are you ? (notice the whitespace before the question mark).

This is the code I was using to reproduce the results:

tokenizers

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('ipipan/silver-retriever-base-v1')
encoding = tokenizer.encode('How are you?')
tokenizer.decode(encoding.ids) # => 'How are you ?'

transformers

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ipipan/silver-retriever-base-v1')
encoding = tokenizer.encode('How are you?')
tokenizer.decode(encoding) # => '<s>How are you? </s>'

I did a bit of digging and found that there is already code that cleans up the whitespace before punctuation marks, but I could only find it in the WordPiece decoder, not in the BPE decoder. In transformers, by contrast, the cleanup method is applied for this case here.
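For reference, the cleanup I'm referring to in transformers does roughly the following (a simplified sketch, not the exact implementation):

def clean_up_tokenization(out_string: str) -> str:
    # Collapse the extra space the decoder leaves before punctuation
    # and before common English contractions.
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

clean_up_tokenization('How are you ?')  # => 'How are you?'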

Like I said at the start, I'm not even sure this is an issue; I just wanted to ask whether that behaviour is expected.

I think transformers has clean_up_tokenization_spaces=True by default. Note that tokenizer = AutoTokenizer.from_pretrained('ipipan/silver-retriever-base-v1') returns the transformers wrapper around tokenizers, not just the bare tokenizer, so it is expected that the results are not exactly the same.
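If you want the two outputs to match, one option is to turn the cleanup off on the transformers side; decode accepts a clean_up_tokenization_spaces flag (a sketch, exact output may differ slightly by version):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ipipan/silver-retriever-base-v1')
ids = tokenizer.encode('How are you?')

# Skip the post-decode cleanup to see the raw decoder output,
# which should be closer to what the tokenizers library returns.
tokenizer.decode(ids, clean_up_tokenization_spaces=False)

# Also drop the special tokens for a closer comparison;
# this should give roughly 'How are you ?'.
tokenizer.decode(ids, skip_special_tokens=True,
                 clean_up_tokenization_spaces=False)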

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.