som-shahlab/femr

Give reasonable recommendations or at least consistent defaults for CLMBR vocabulary size

scottfleming opened this issue · 2 comments

Is your feature request related to a problem? Please describe.
Various papers that use CLMBR-T-Base (e.g., https://arxiv.org/abs/2311.11483) use a vocabulary size of 65,536 codes.
The tutorial trains the tokenizer with a vocab_size of 128 but then sets up the CLMBR task with a clmbr_vocab_size of 64.
The FEMRTransformerConfig sets a default vocab_size of 32,768, but it's unclear whether that default is even reachable through the standard code paths, since many of the methods that internally instantiate FEMRTransformerConfig take vocab_size as a required argument with no default.
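For concreteness, a rough sketch of where the three different values show up (module paths and the `train_tokenizer` helper are approximated from the FEMR tutorials and may not match every FEMR version exactly; `dataset` is a placeholder for the tutorial's prepared patient dataset):

```python
# Sketch only: module paths and helper names approximated from the FEMR
# tutorials; they may differ between FEMR versions.
import femr.models.tokenizer
import femr.models.tasks
import femr.models.transformer

# Placeholder for the tutorial's prepared patient dataset.
dataset = ...

# (1) The tutorial trains the tokenizer with 128 codes ...
tokenizer = femr.models.tokenizer.train_tokenizer(dataset, vocab_size=128)

# (2) ... but configures the CLMBR pretraining task with only 64 codes ...
clmbr_task = femr.models.tasks.CLMBRTask(clmbr_vocab_size=64)

# (3) ... while FEMRTransformerConfig advertises a default vocab_size of
#     32,768 that standard code paths never use, because callers end up
#     passing vocab_size explicitly (and CLMBR-T-Base papers report 65,536).
config = femr.models.transformer.FEMRTransformerConfig(vocab_size=128)
```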

Describe the solution you'd like
Ideally both the tutorial and the documentation would (A) provide guidance on how to choose the vocabulary size and (B) offer reasonable defaults for the scale of data we'd imagine most users are working with. At the very least, the vocab size used throughout the tutorial should be consistent.

I do not really have any good guidance here. I will add a note that 128 is 100% wrong, though, and is there purely to make the tutorial run fast.

Ok. The tutorials have been updated, so I am going to close this issue. Thanks for bringing it to my attention. It was not obvious that 128 was a dummy value.