Give reasonable recommendations or at least consistent defaults for CLMBR vocabulary size
scottfleming opened this issue · 2 comments
Is your feature request related to a problem? Please describe.
Various papers that use CLMBR-T-Base (e.g., https://arxiv.org/abs/2311.11483) use a vocabulary size of 65,536 codes.
The tutorial trains the tokenizer with a vocab_size of 128 but then sets up the CLMBR task with a clmbr_vocab_size of 64.
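For context, the relevant tutorial steps look roughly like the sketch below. This is a paraphrase, not the exact tutorial cells: the dataset variable and the precise signatures of train_tokenizer and CLMBRTask are assumed and may differ across FEMR versions.

```python
import femr.models.tokenizer
import femr.models.tasks

# `dataset` is assumed to be the Hugging Face Dataset of FEMR patients
# built earlier in the tutorial.

# The tokenizer is trained with a 128-code vocabulary...
tokenizer = femr.models.tokenizer.train_tokenizer(dataset, vocab_size=128)

# ...while the CLMBR pretraining task is configured with a different, smaller size.
clmbr_task = femr.models.tasks.CLMBRTask(clmbr_vocab_size=64)
```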
The FEMRTransformerConfig sets a default vocab_size of 32,768, but it's unclear whether that default is even reachable through the standard routes, since many of the methods that internally instantiate FEMRTransformerConfig take vocab_size as a required argument with no default.
Describe the solution you'd like
Ideally both the tutorial and documentation would (A) provide guidance on how to choose the vocabulary size and (B) offer reasonable defaults for the scale of data we'd imagine most users are working with. At the very least, the vocab size used throughout the tutorial should be consistent.
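As a rough illustration of the "consistent defaults" part of this request, a single setting could drive both values. This is only a minimal sketch, reusing the same assumed train_tokenizer and CLMBRTask entry points as above; whether the two sizes should in fact be equal (and what value suits a given dataset) is exactly the guidance the documentation should provide.

```python
import femr.models.tokenizer
import femr.models.tasks

# One shared setting instead of two hard-coded, mismatched numbers.
# 65,536 matches the CLMBR-T-Base papers; the right value for a given
# dataset is what this issue asks the documentation to clarify.
VOCAB_SIZE = 65_536

# `dataset` is assumed to be the patient dataset built earlier in the tutorial.
tokenizer = femr.models.tokenizer.train_tokenizer(dataset, vocab_size=VOCAB_SIZE)
clmbr_task = femr.models.tasks.CLMBRTask(clmbr_vocab_size=VOCAB_SIZE)
```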
I do not really have any good guidance here, but I will add a note that 128 is 100% wrong and is used purely to make the tutorial run fast.
Ok. The tutorials have been updated and I am going to close this issue. Thanks for bringing it to my attention. It was not obvious that 128 was a dummy value.