tensorflow/text

Issue in detokenization of token_ids in BertTokenizer

DeependraParichha1004 opened this issue · 7 comments

When detokenizing the hindi_token_ids, why does it produce bytes rather than strings?

[image: Actual]



While detokenizing the IDs in some other languages, it works perfectly fine.

[image: English_language]

You are correct. Devanagari has the Unicode range U+0900..U+097F. For specific character sets, the pre-tokenization (used before Wordpiece) performs byte splitting. The ranges are defined here and include these Unicode characters.

A bit of history: the original BERT paper broke tokenization down into a pre-tokenization step and wordpiece tokenization. When developing the BertTokenizer, there was internal discussion about whether to simply use the same pre-tokenization algorithm or to try to improve on it. We eventually settled on replicating what was used for the paper, since researchers wanted something that "just worked" with the existing vocabularies and models, so they could spend their time iterating on the core model code and not worry about performance discrepancies due to tokenization.

I agree this shouldn't be splitting into bytes, but this code will not be changed, since doing so could impact models when they update the package version. We have talked about building another "basic tokenizer" that could be substituted into the pre-tokenization step, but there haven't been many requests for it.

The BertTokenizer is really just a convenience class that wraps regex_split, normalization, and the WordpieceTokenizer. You could instead perform these steps yourself using WhitespaceTokenizer, UnicodeScriptTokenizer, or the regex_split op for splitting; normalize_utf8 or FastBertNormalizer for normalization; and WordpieceTokenizer or FastWordpieceTokenizer for subword tokenization. Review the BertTokenizer or FastBertTokenizer source for implementation details.
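For example, here is a minimal sketch of composing those pieces yourself. The vocabulary path "vocab.txt" is a placeholder, and the choice of UnicodeScriptTokenizer for pre-tokenization is just one option, not what BertTokenizer does internally:

import tensorflow as tf
import tensorflow_text as tf_text

# Build a vocab lookup table from a wordpiece vocabulary file
# ("vocab.txt" is a placeholder path).
vocab_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.TextFileInitializer(
        "vocab.txt",
        key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
        value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER),
    num_oov_buckets=1)

sentences = tf.constant(["तुम कैसे हो"])

# 1. Normalize the raw UTF-8 text.
normalized = tf_text.normalize_utf8(sentences, "NFC")

# 2. Pre-tokenize on script boundaries rather than BERT's byte-split rules.
words = tf_text.UnicodeScriptTokenizer().tokenize(normalized)

# 3. Run Wordpiece on the pre-tokenized words.
wordpiece = tf_text.WordpieceTokenizer(vocab_table, token_out_type=tf.string)
subwords = wordpiece.tokenize(words)
print(subwords)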

Yeah @broken. But is there any other model or option where I don't get byte-type output? Every time, I have to decode to get the desired output.

Apologies, but my previous response was incorrect. I was off by a hexadecimal place and mistakenly put Devanagari in one of the byte-split ranges.

Actually, this is just an issue with presentation. We generally work with Unicode strings and characters using their bytes. You have the Unicode bytes and want them to be displayed as readable Unicode strings.

You can do something like:

[s.decode('utf-8') for s in list(words_1.numpy())]

The reason the English characters display is that each of them is represented as a single byte in UTF-8, so Python renders them as readable characters.
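To illustrate (in a plain Python 3 session, unrelated to the tokenizer itself): ASCII bytes are rendered as readable characters in a bytes object's repr, while non-ASCII bytes show up as \x escapes until you decode them:

>>> b'hello'
b'hello'
>>> 'थ'.encode('utf-8')
b'\xe0\xa4\xa5'
>>> b'\xe0\xa4\xa5'.decode('utf-8')
'थ'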

So you mean to say that Hindi characters are not represented as a single byte? If yes, then how are they represented?

Right. They are multiple bytes.

The Unicode range for this character set is U+0900..U+097F. That is in hex; in decimal the codepoints are 2304 to 2431. Any codepoint above U+007F (127) requires multiple bytes in UTF-8, so these characters take multiple bytes.

Example:
character: थ
Unicode codepoint (hex): U+0925
Decimal: 2341
UTF-8 bytes (hex): E0 A4 A5
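You can check these values yourself in Python (a quick illustration, nothing tokenizer-specific):

>>> ch = 'थ'
>>> hex(ord(ch))
'0x925'
>>> ord(ch)
2341
>>> ch.encode('utf-8').hex()
'e0a4a5'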

UTF-8 is how these characters are encoded so the computer knows which character to show when reading the bytes in memory. Here's another example. I'll start by writing that character to a file from Python 3.

# Write the character to test.txt, explicitly encoding it as UTF-8.
s = 'थ'
with open("test.txt", "w", encoding="utf-8") as f:
    f.write(s)

After running that simple program, we can call cat to output the file which gives us that character:

$ python test.py 
$ cat test.txt 
थ

However, when we look at the actual bytes that make up the file, we see the same UTF-8 bytes as above (hexdump's default output groups the bytes into 16-bit little-endian words, which is why they appear swapped):

$ hexdump test.txt 
0000000 a4e0 00a5                              
0000003

Hopefully this is clear. I encourage you to read more about Unicode and UTF-8 if you want to learn more about character encodings.

Thank you @broken for your time. I'll definitely explore more about character encodings.