tensorflow/text

Issue in detokenization of token_ids in BertTokenizer

DeependraParichha1004 opened this issue · 7 comments

When detokenizing the hindi_token_ids, why does it produce bytes rather than strings?

[image: Actual]



While detokenizing the IDs in some other languages, it works perfectly fine.

[image: English_language]

You are correct. Devanagari has the Unicode range U+0900..U+097F. For specific character sets, the pre-tokenization (used before Wordpiece) performs byte splitting. The ranges are defined here and include these Unicode characters.

A bit of history: the original BERT paper broke tokenization down into a pre-tokenization step and wordpiece tokenization. When developing the BertTokenizer, there was internal discussion about whether to simply use the same pre-tokenization algorithm or to try to improve on it. We eventually settled on replicating what was used for the paper, since researchers wanted something that "just worked" with the existing vocabularies and models, so they could spend their time iterating on the core model code and not worry about performance discrepancies due to tokenization.

I agree this shouldn't be splitting into bytes, but this code will not be changed, since doing so could impact models when they update the package version. We have talked about building another "basic tokenizer" that could be substituted into the pre-tokenization step, but there haven't been many requests for it.

The BertTokenizer is really just a convenience class that wraps regex_split, normalization, and the WordpieceTokenizer. You could instead perform these steps yourself using WhitespaceTokenizer, UnicodeScriptTokenizer, or the regex_split op for splitting; normalize_utf8 or FastBertNormalizer for normalization; and WordpieceTokenizer or FastWordpieceTokenizer for subword tokenization. Review the BertTokenizer or FastBertTokenizer source for implementation details.
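For example, here is a minimal sketch of composing those pieces yourself. The vocabulary path "vocab.txt" is a placeholder, and the choice of UnicodeScriptTokenizer for pre-tokenization is just one option, not what BertTokenizer does internally:

import tensorflow as tf
import tensorflow_text as tf_text

# Build a vocab lookup table from a wordpiece vocabulary file
# ("vocab.txt" is a placeholder path).
vocab_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.TextFileInitializer(
        "vocab.txt",
        key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
        value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER),
    num_oov_buckets=1)

sentences = tf.constant(["तुम कैसे हो"])

# 1. Normalize the raw UTF-8 text.
normalized = tf_text.normalize_utf8(sentences, "NFC")

# 2. Pre-tokenize on script boundaries rather than BERT's byte-split rules.
words = tf_text.UnicodeScriptTokenizer().tokenize(normalized)

# 3. Run Wordpiece on the pre-tokenized words.
wordpiece = tf_text.WordpieceTokenizer(vocab_table, token_out_type=tf.string)
subwords = wordpiece.tokenize(words)
print(subwords)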

Yeah @broken. But is there any other model or option where I don't get byte-type output? Every time, I have to decode to get the desired output.

Apologies, but my previous response was incorrect. I was off by a hexadecimal place and mistakenly put Devanagari in one of the byte-split ranges.

Actually, this is just an issue with presentation. We generally work with Unicode strings and characters using their bytes. You have the Unicode bytes and want them to be displayed as readable Unicode strings.

You can do something like:

[s.decode('utf-8') for s in list(words_1.numpy())]

The reason the English characters display is that each of them is represented as a single byte in UTF-8, so Python renders them as readable characters.
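To illustrate (in a plain Python 3 session, unrelated to the tokenizer itself): ASCII bytes are rendered as readable characters in a bytes object's repr, while non-ASCII bytes show up as \x escapes until you decode them:

>>> b'hello'
b'hello'
>>> 'थ'.encode('utf-8')
b'\xe0\xa4\xa5'
>>> b'\xe0\xa4\xa5'.decode('utf-8')
'थ'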

So you mean to say that Hindi characters are not represented as a single byte? If yes, then how are they represented?

Right. They are multiple bytes.

The Unicode range for this character set is U+0900..U+097F. That is in hex; in decimal the codepoints are 2304 to 2431. Any codepoint above U+007F (127) requires multiple bytes in UTF-8, so these characters take multiple bytes.

Example:
character: थ
Unicode codepoint (hex): U+0925
Decimal: 2341
UTF-8 bytes (hex): E0 A4 A5
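You can check these values yourself in Python (a quick illustration, nothing tokenizer-specific):

>>> ch = 'थ'
>>> hex(ord(ch))
'0x925'
>>> ord(ch)
2341
>>> ch.encode('utf-8').hex()
'e0a4a5'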

UTF-8 is how these characters are encoded so the computer knows which character to show when reading the bytes in memory. Here's another example. I'll start by writing that character to a file from Python 3.

# Write the character to test.txt, explicitly encoding it as UTF-8.
s = 'थ'
with open("test.txt", "w", encoding="utf-8") as f:
    f.write(s)

After running that simple program, we can call cat to output the file which gives us that character:

$ python test.py 
$ cat test.txt 
थ

However, when we look at the actual bytes that make up the file, we see the same UTF-8 bytes as above (hexdump's default output groups the bytes into 16-bit little-endian words, which is why they appear swapped):

$ hexdump test.txt 
0000000 a4e0 00a5                              
0000003

Hopefully this is clear. I encourage you to read more about Unicode and UTF-8 if you want to learn more about character encodings.

Thank you @broken for your time. I'll definitely explore more about character encodings.