Issues
Update exact, general, idiomatic regexes
#36 opened by dhbrojas - 0
Build database of added & suggested tokens
#35 opened by dhbrojas - 0
Add `Tokenizer::sample_encode` method
#8 opened by dhbrojas - 0
Add node bindings and web-based visualiser
#11 opened by dhbrojas - 3
Add many more suggested and added tokens
#22 opened by dhbrojas - 1
Use log frequencies on a per-document basis
#33 opened by dhbrojas - 1
Add Python methods to `PyTokenizer`
#9 opened by dhbrojas - 1
Add typing support for Python bindings
#27 opened by dhbrojas - 2
Add support for "added tokens"
#26 opened by dhbrojas - 2
Automated evaluation pipeline and vocabulary hub
#29 opened by dhbrojas - 3
Add support for vocabulary "extension packs"
#3 opened by dhbrojas - 2
Benchmark hot paths in the crate
#6 opened by dhbrojas - 2
Enable training on >5GB on a 16GB laptop
#16 opened by dhbrojas - 2
Add base CodeGeeX vocabulary to TokenGeeX
#13 opened by dhbrojas - 2
Prune low frequency tokens after training
#31 opened by dhbrojas - 1
Collect most frequent tokens from un-strict vocab and add them to strict vocab
#28 opened by dhbrojas - 0
Measure compression on production data
#4 opened by dhbrojas - 2
Tokenizer should operate on bytes
#7 opened by dhbrojas - 1
Make capcode lossless
#24 opened by dhbrojas - 0
Add support for special tokens
#17 opened by dhbrojas - 1