microsoft/Tokenizer

TS Optimization: avoid allocations and array slicing in byte pair encoding

connor4312 opened this issue · 0 comments

Currently, bytePairEncode creates an array of arrays, byteIndicesAndRanks, from which entries are spliced out as data is deleted.

Instead, it may be faster to use and reuse two typed arrays: one for indices in byteIndicesAndRanks and one for the byteIndicesAndRanks themselves. Splicing an item from the list would instead become calling indices.set(indices.subarray(index + 1), index) (or perhaps a manual shift would be faster).
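
A minimal sketch of the two removal strategies mentioned above, assuming the data lives in an Int32Array with a separately tracked logical length. The helper names removeAt and removeAtManual are hypothetical and are not part of the Tokenizer codebase; which variant is faster would need to be benchmarked.

```typescript
// Hypothetical helpers for illustration only.
// Both remove the element at `index` from the first `length` entries of `arr`
// by shifting the tail left one slot, and return the new logical length.
// The backing buffer is never resized, so it can be reused across calls.

function removeAt(arr: Int32Array, length: number, index: number): number {
  // subarray() returns a view over the same buffer (no data copy up front);
  // set() is specified to handle overlapping source and target correctly.
  arr.set(arr.subarray(index + 1, length), index);
  return length - 1;
}

function removeAtManual(arr: Int32Array, length: number, index: number): number {
  // Manual left shift: avoids creating the temporary view object.
  for (let i = index; i < length - 1; i++) {
    arr[i] = arr[i + 1];
  }
  return length - 1;
}

// Usage: the slot past the new length keeps stale data and must be ignored.
const indices = Int32Array.from([10, 20, 30, 40, 50]);
let count = indices.length;
count = removeAt(indices, count, 1);
console.log(Array.from(indices.subarray(0, count))); // [10, 30, 40, 50]
```

With two parallel typed arrays, the same call would be applied to each array with the same index so that the entries stay aligned.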