ChenghaoMou/embeddings

Exporting tokenizer state

mapmeld opened this issue · 4 comments

Hi - I am trying out your code to train a Charformer/GBST model on Thai Wikipedia. I aim to cover both ASCII and Thai.

I'm running into three concerns and would appreciate any advice:

  • I know that ByT5 maps a byte's value directly to its index in the embeddings. In Charformer, to add one non-ASCII character, would I need to expand vocab_size to 256 + 1, or would vocab_size need to fit the highest codepoint (i.e. vocab_size > 3675 to include 0x0e5b)?

  • Once I've passed the tokenizer many lines of text, I'd like to export and reuse it. I ran into these errors when calling to_onnx and to_torchscript:

Unsupported: ONNX export of Slice with dynamic inputs. DynamicSlice is a deprecated experimental op.
Compiled functions can't take variable number of arguments or use keyword-only arguments with defaults

  • Other library code wants to pass a batch as an array of texts to the tokenizer (similar to the top-level readme example). Based on the docs, I know that Charformer expects torch.tensor([list(line_of_text.encode('utf-8'))]), and I need to pad this for same-length strings. Should I try to make this more like other tokenizers, with PaddingStrategy, etc.? A rough sketch of my current manual approach is below.
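    Here is roughly what I'm doing by hand right now (the padding value 0 and the example strings are my own guesses, not anything from the docs):

    import torch

    # hand-rolled batching: UTF-8 encode each string, then pad every sequence to the longest one
    texts = ["Hello world", "สวัสดีชาวโลก"]
    byte_ids = [list(t.encode("utf-8")) for t in texts]
    max_len = max(len(ids) for ids in byte_ids)
    padded = [ids + [0] * (max_len - len(ids)) for ids in byte_ids]  # 0 as a guessed padding id
    batch = torch.tensor(padded)                                     # shape: (2, max_len)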

@mapmeld

  • For your first question, 256 + 1 should do it. GBST consumes UTF-8 bytes, so Thai characters are already covered by byte values 0 to 255, and the vocab never needs to reach the raw codepoints (see the quick byte check after the tokenizer example below).
  • For the third question: because GBST is not strictly a tokenizer but a trainable network itself, it does not work directly with text input. However, you can try something like this:
    >>> import torch
    >>> model = GBST(
    ...     embed_size=128,
    ...     max_block_size=4,
    ...     downsampling_factor=2,
    ...     score_calibration=True,
    ...     vocab_size=259,
    ... )
    >>> # use the ByT5 tokenizer to turn text into byte sequences
    >>> from text_embeddings.byte.byt5 import BYT5Tokenizer
    >>> tokenizer = BYT5Tokenizer()
    >>> results = tokenizer(["Life is like a box of chocolates.", "Coding is fun."], add_special_tokens=True)
    >>> assert results["input_ids"].shape == torch.Size([2, 1024, 259])
    >>> # collapse the one-hot style encodings into integer byte ids
    >>> ids = torch.argmax(torch.tensor(results["input_ids"]), dim=-1)
    >>> assert ids.shape == torch.Size([2, 1024])
    >>> assert model(ids).shape == torch.Size([2, 512, 128])

The BYT5Tokenizer comes with hardcoded special tokens, so the vocab size is fixed at 259 (256 byte values plus padding, cls, and sep tokens).
You can create your own BYT5Tokenizer-like tokenizer based on text_embeddings/byte/byt5.py.
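To expand on the first point: after UTF-8 encoding, every byte value stays below 256 even for Thai text, so the embedding table only grows to make room for special tokens, never for raw codepoints. A quick check in plain Python:

# quick check: Thai characters are multi-byte in UTF-8, but every byte value is still < 256
text = "ภาษาไทย"  # "Thai language"
byte_values = list(text.encode("utf-8"))
print(byte_values)  # starts with [224, 184, 160, ...]; all values are in range(256)
assert all(b < 256 for b in byte_values)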

  • For your second question (the to_onnx / to_torchscript errors), torch.onnx added support for repeat_interleave on its master branch, so you can install a PyTorch nightly build to export the model:
if __name__ == "__main__":

    import torch
    import torch.onnx  # repeat_interleave export requires a PyTorch nightly build

    model = GBST(
        embed_size=128,
        max_block_size=4,
        downsampling_factor=2,
        score_calibration=True,
        vocab_size=259,
    )

    from text_embeddings.byte.byt5 import BYT5Tokenizer

    tokenizer = BYT5Tokenizer()
    results = tokenizer(
        ["Life is like a box of chocolates.", "Coding is fun."], add_special_tokens=True
    )
    assert results["input_ids"].shape == torch.Size([2, 1024, 259])

    ids = torch.argmax(torch.tensor(results["input_ids"]), dim=-1)
    assert ids.shape == torch.Size([2, 1024])

    # Export the model
    torch.onnx.export(
        model,  # model being run
        ids,  # model input (or a tuple for multiple inputs)
        "gbst.onnx",  # where to save the model (can be a file or file-like object)
        export_params=True,  # store the trained parameter weights inside the model file
        opset_version=11,  # the ONNX version to export the model to
        do_constant_folding=True,  # whether to execute constant folding for optimization
        input_names=["input"],  # the model's input names
        output_names=["output"],  # the model's output names
        dynamic_axes={
            "input": {0: "batch_size"},  # variable length axes
            "output": {0: "batch_size"},
        },
    )
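If you want to sanity-check the exported graph, you could append something like this to the end of the script (a rough sketch, assuming onnxruntime is installed; the tolerance is arbitrary):

    # rough check: run the same ids through onnxruntime and compare against the eager model
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("gbst.onnx")
    onnx_out = session.run(None, {"input": ids.numpy()})[0]  # single output named "output"
    torch_out = model(ids).detach().numpy()
    assert np.allclose(onnx_out, torch_out, atol=1e-4)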

Thanks for reaching out and bringing up these great questions! I will add a GBST-compatible tokenizer ASAP.

I have added a ByteTokenizer for GBST; you can now find a usage example in the readme file.

Thanks for looking into this! I was able to export the model with pytorch-nightly. I need to keep working on the tokenizer.

Two follow-ups:

  • There's a typo in the readme code "input": {0: "batch_size", 1: "sequence_length"},},
  • As I understand it, should I pass all of my training text into the GBST module using model(torch.tensor(results["input_ids"]).long()) and train it on its own? Or should I first build a larger Transformer model/network around the module?

@mapmeld thanks for pointing out the typo.

It should be the second case: GBST should be trained together with your downstream model/layers, end to end. Something like this:

# toy code
import torch.nn as nn

class NN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = GBST(embed_size=128, max_block_size=4, downsampling_factor=2, score_calibration=True, vocab_size=259)
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)

    def forward(self, input_ids):
        embeddings = self.embedding(input_ids)  # (batch, downsampled_len, embed_size)
        output, _ = self.lstm(embeddings)
        return output
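
And a rough sketch of how such a network could then be trained end to end; the classification head, labels, and hyperparameters below are placeholders for illustration, not part of the library:

# hypothetical end-to-end training step: one backward pass updates GBST, the LSTM, and the head
import torch
import torch.nn as nn

model = NN()
head = nn.Linear(128, 2)                       # placeholder classification head (2 classes assumed)
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

input_ids = torch.randint(0, 259, (2, 1024))   # stand-in for a padded batch of byte ids
labels = torch.tensor([0, 1])                  # stand-in labels

optimizer.zero_grad()
features = model(input_ids)                    # (batch, downsampled_len, 128)
logits = head(features[:, -1, :])              # classify from the last timestep
loss = criterion(logits, labels)
loss.backward()                                # gradients flow through the head, the LSTM, and GBST
optimizer.step()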