Exporting tokenizer state
mapmeld opened this issue · 4 comments
Hi - I am trying out your code to train a Charformer/GBST model on Thai Wikipedia. I aim to cover both ASCII and Thai.
I'm running into three concerns now and would appreciate any advice:
- I know that ByT5 maps a byte's value directly to its index in the embeddings. In Charformer, to add one non-ASCII character, would I need to expand vocab_size to 256 + 1, or would vocab_size need to fit the highest codepoint (i.e. vocab_size > 3675 to include 0x0e5b)?
- Once I pass the tokenizer many lines of text, I'd like to export and reuse the tokenizer. I ran into these errors when running `to_onnx` and `to_torchscript`:
Unsupported: ONNX export of Slice with dynamic inputs. DynamicSlice is a deprecated experimental op.
Compiled functions can't take variable number of arguments or use keyword-only arguments with defaults
- Other library code wants to pass a batch as an array of texts to the tokenizer (similar to the top-level readme example). Based on the docs, I know that Charformer expects `torch.tensor([list(line_of_text.encode('utf-8'))])`, and I need to pad this so the strings are the same length. Should I try to make this more like other tokenizers with PaddingStrategy, etc.? (My current padding workaround is sketched below.)
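For reference, this is roughly what I'm doing now to build a padded batch (just a sketch of my current workaround; padding with 0 is my own guess at a pad value, not something from your library):

import torch

# encode a batch of mixed ASCII/Thai strings to byte IDs and pad them to equal length
texts = ["Hello", "สวัสดีครับ"]
byte_ids = [list(t.encode("utf-8")) for t in texts]               # every byte value is 0..255
max_len = max(len(ids) for ids in byte_ids)
padded = [ids + [0] * (max_len - len(ids)) for ids in byte_ids]   # 0 used as pad (my assumption)
batch = torch.tensor(padded)                                      # shape: (batch_size, max_len)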
- For your first question, 256 + 1 should do it; GBST indexes byte values (plus any special tokens), not Unicode codepoints, so vocab_size does not need to cover the highest codepoint.
- For the third question, because GBST is not strictly a tokenizer but a trainable network itself, it does not work directly with text input. However, you can try something like this:
>>> import torch
>>> model = GBST(
... embed_size=128,
... max_block_size=4,
... downsampling_factor=2,
... score_calibration=True,
... vocab_size=259,
... )
>>> # use byt5 tokenizer to tokenize byte sequences
>>> from text_embeddings.byte.byt5 import BYT5Tokenizer
>>> tokenizer = BYT5Tokenizer()
>>> results = tokenizer(["Life is like a box of chocolates.", "Coding is fun."], add_special_tokens=True)
>>> assert results["input_ids"].shape == torch.Size([2, 1024, 259])
>>> ids = torch.argmax(torch.tensor(results["input_ids"]), dim=-1)
>>> assert ids.shape == torch.Size([2, 1024])
>>> assert model(ids).shape == torch.Size([2, 512, 128])
The BYT5Tokenizer comes with hardcoded special tokens, so the vocab size is fixed at 259 (256 byte values plus the padding, cls, and sep tokens). You can create your own BYT5Tokenizer-like tokenizer based on text_embeddings/byte/byt5.py.
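If you go that route, a minimal sketch of a BYT5Tokenizer-like byte tokenizer could look like the following (illustrative only; the special-token IDs 256-258, the pad-to-1024 length, and the function name are assumptions, not the library's actual API):

import torch

# minimal byte tokenizer sketch: 0..255 are raw byte values, 256..258 are special tokens
PAD_ID, CLS_ID, SEP_ID = 256, 257, 258    # assumed IDs, giving vocab_size = 259

def byte_tokenize(texts, max_length=1024):
    batch = []
    for text in texts:
        ids = [CLS_ID] + list(text.encode("utf-8")) + [SEP_ID]
        ids = ids[:max_length] + [PAD_ID] * (max_length - len(ids))   # truncate or pad
        batch.append(ids)
    return torch.tensor(batch)            # shape: (batch_size, max_length)

ids = byte_tokenize(["Life is like a box of chocolates.", "Coding is fun."])
assert ids.shape == torch.Size([2, 1024])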
- For the second question, torch.onnx added support for `repeat_interleave` in its master branch, so you can install the PyTorch nightly build to export the model:
if __name__ == "__main__":
    import torch
    import torch.onnx  # nightly torch only

    from text_embeddings.byte.byt5 import BYT5Tokenizer

    model = GBST(
        embed_size=128,
        max_block_size=4,
        downsampling_factor=2,
        score_calibration=True,
        vocab_size=259,
    )

    tokenizer = BYT5Tokenizer()
    results = tokenizer(
        ["Life is like a box of chocolates.", "Coding is fun."], add_special_tokens=True
    )
    assert results["input_ids"].shape == torch.Size([2, 1024, 259])
    ids = torch.argmax(torch.tensor(results["input_ids"]), dim=-1)
    assert ids.shape == torch.Size([2, 1024])

    # Export the model
    torch.onnx.export(
        model,                     # model being run
        ids,                       # model input (or a tuple for multiple inputs)
        "gbst.onnx",               # where to save the model (can be a file or file-like object)
        export_params=True,        # store the trained parameter weights inside the model file
        opset_version=11,          # the ONNX version to export the model to
        do_constant_folding=True,  # whether to execute constant folding for optimization
        input_names=["input"],     # the model's input names
        output_names=["output"],   # the model's output names
        dynamic_axes={
            "input": {0: "batch_size"},   # variable length axes
            "output": {0: "batch_size"},
        },
    )
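Once the export succeeds, you can do a quick sanity check on the file with onnxruntime (onnxruntime is a separate install and not part of this repo; this just continues the script above, reusing ids):

# quick sanity check on the exported model (pip install onnxruntime)
import onnxruntime as ort

session = ort.InferenceSession("gbst.onnx")
(onnx_out,) = session.run(["output"], {"input": ids.numpy()})
assert onnx_out.shape == (2, 512, 128)    # same shape as model(ids) in PyTorch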
Thanks for reaching out and bringing up these great questions! I will add a compatible tokenizer to GBST ASAP.
I have added a ByteTokenizer to GBST; you can now see a usage example in the readme file.
Thanks for looking into this! I was able to export the model with pytorch-nightly. I need to keep working on the tokenizer.
Two follow-ups:
- There's a typo in the readme code: `"input": {0: "batch_size", 1: "sequence_length"},},`
- As I understand it, I should pass all of my training text into the GBST model using `model(torch.tensor(results["input_ids"]).long())`, and then this module is trained? Or should I first build a larger Transformer model/network around the module?
@mapmeld thanks for pointing out the typo.
It should be the second case: GBST should be trained together with your downstream model/layers, end to end. Something like this:
# toy code
import torch.nn as nn

class NN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = GBST(
            embed_size=128, max_block_size=4, downsampling_factor=2,
            score_calibration=True, vocab_size=259,
        )
        # hidden_size=128 is arbitrary here; batch_first matches GBST's (batch, seq, embed) output
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)

    def forward(self, input_ids):
        embeddings = self.embedding(input_ids)   # (batch, downsampled_len, embed_size)
        hidden, _ = self.lstm(embeddings)        # nn.LSTM returns (output, (h_n, c_n))
        return hidden
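And a minimal end-to-end training step for the toy model above would look like this (sketch only; the dummy IDs and the placeholder loss are not from this repo, replace them with your real batches and task loss):

# one end-to-end training step for the toy model above (placeholder data and loss)
import torch

model = NN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # includes the GBST parameters

ids = torch.randint(0, 259, (2, 1024))      # dummy byte IDs; use your tokenizer output instead
hidden = model(ids)                         # (2, 512, 128) after downsampling
loss = hidden.pow(2).mean()                 # placeholder loss; use your task's loss here
loss.backward()                             # gradients flow back into GBST as well
optimizer.step()
optimizer.zero_grad()

The important part is that optimizer.step() updates the GBST weights together with the LSTM, so the soft tokenization is learned jointly with the downstream task.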