Abel2076/json2binidx_tool

RWKV tokenizer should generate uint16 indices, instead of int32

desktable opened this issue · 0 comments

The RWKV tokenizer has a vocabulary size of 65525. Even after adding dummy tokens, the vocabulary size only grows to 65536. Its output indices therefore fit in the "uint16" dtype, which can represent 65536 distinct values (0 through 65535).

However, preprocess_data.py picks the "int32" dtype instead, because the function below returns uint16 only when vocab_size is below 65500, and 65536 fails that check.

def __best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    else:
        return np.int32

Source: https://github.com/Abel2076/json2binidx_tool/blob/9051dad73f9ef84c45cfe8bb0736f2edfe228619/tools/indexed_dataset.py#L29C7-L29C7
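One possible fix, as a minimal sketch: compare against the real uint16 capacity instead of the hard-coded 65500. The name best_fitting_dtype (without the leading underscores) is only for illustration and is not the repo's function. I don't know whether the 65500 margin was deliberate, so treat the new threshold as an assumption to be checked against the rest of the tool.

```python
import numpy as np


def best_fitting_dtype(vocab_size=None):
    """Illustrative variant of __best_fitting_dtype (hypothetical name).

    uint16 holds the values 0..65535, so token indices from a vocabulary
    of at most 65536 entries fit without overflow.
    """
    uint16_capacity = np.iinfo(np.uint16).max + 1  # 65536 distinct values
    if vocab_size is not None and vocab_size <= uint16_capacity:
        return np.uint16
    return np.int32


# The padded RWKV vocabulary size from this issue now maps to uint16.
dtype = best_fitting_dtype(65536)
largest_index = np.array([65535], dtype=dtype)
```

With this change a vocabulary of 65536 selects uint16 and anything larger still falls back to int32, so the token data should take half the space per index.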