# tokeniz-webui

A no-code tokenizer builder: users interact with the web UI to create their own tokenizer using Hugging Face libraries.

## Common Tokenizers for LLMs

- **Byte-Pair Encoding (BPE):** Widely used in models like GPT-2 and GPT-3. Efficient for handling subword units, balancing vocabulary size against representation power. Can be trained from scratch or fine-tuned using libraries like Hugging Face's Tokenizers (see the training sketch after this list).
- **WordPiece:** Used in BERT and its variants. Similar to BPE, but selects merges by a likelihood criterion over the training data rather than raw pair frequency. Supports training on custom datasets to improve model performance on domain-specific tasks.
- **SentencePiece:** A versatile tokenizer that handles many languages and scripts, which makes it useful for non-English text or unusual characters. Can be trained from scratch on custom text data (see the second sketch below).
- **Unigram:** Another tokenizer option available in Hugging Face's library. Starts from a large candidate vocabulary and prunes it to maximize the likelihood of the training data, which can be beneficial for certain applications.
- **spaCy:** A rule-based tokenizer that is highly customizable. Useful for tasks requiring specific tokenization rules or language features.
- **Moses:** Primarily used in machine translation but can be adapted for other NLP tasks. Rule-based and allows extensive customization.
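
As a concrete illustration of training from scratch, here is a minimal sketch using Hugging Face's `tokenizers` library. The corpus path `corpus.txt`, the vocabulary size, and the special tokens are placeholder assumptions, not part of this project:

```python
# Minimal BPE training sketch with Hugging Face's `tokenizers` library.
# Assumes a plain-text corpus at "corpus.txt" (placeholder path).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty BPE model with an explicit unknown token.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size and special_tokens are illustrative choices.
trainer = BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Persist the trained tokenizer and try it out.
tokenizer.save("bpe-tokenizer.json")
print(tokenizer.encode("Hello, tokenizers!").tokens)
```

Swapping `BPE`/`BpeTrainer` for `WordPiece`/`WordPieceTrainer` or `Unigram`/`UnigramTrainer` follows the same pattern, which is what makes the library a natural backend for a no-code builder.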
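A comparable sketch for SentencePiece, using the `sentencepiece` Python package; again, the input file, model prefix, and vocabulary size are assumed placeholders. The same call trains either a unigram or a BPE model via `model_type`:

```python
# Minimal SentencePiece training sketch (assumes `pip install sentencepiece`
# and a placeholder corpus at "corpus.txt").
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",    # raw text, one sentence per line
    model_prefix="sp",     # writes sp.model and sp.vocab
    vocab_size=8000,       # illustrative size
    model_type="unigram",  # or "bpe", "char", "word"
)

# Load the trained model and tokenize a sample string.
sp = spm.SentencePieceProcessor(model_file="sp.model")
print(sp.encode("Hello, world!", out_type=str))
```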