tokenizers
There are 30 repositories under the tokenizers topic.
xebia-functional/xef
Building applications with LLMs through composability, in Kotlin, Scala, ...
jshuadvd/LongRoPE
Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"
Arunprakash-A/DL-Pytorch-Workshop
Develop DL models using PyTorch and Hugging Face
chonkie-ai/autotiktokenizer
🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨
Prismadic/magnet
the small distributed language model toolkit; fine-tune state-of-the-art LLMs anywhere, rapidly
sayakpaul/count-tokens-hf-datasets
This project shows how to derive the total number of training tokens from a large text dataset from 🤗 datasets with Apache Beam and Dataflow.
megagonlabs/ginza-transformers
Use custom tokenizers in spacy-transformers
1kkiRen/Tokenizer-Changer
Python script for manipulating an existing tokenizer.
Hugging-Face-Supporter/tftokenizers
Use Hugging Face Transformers and Tokenizers as reusable TensorFlow SavedModels
unfoldingWord/string-punctuation-tokenizer
Small library that provides functions to tokenize a string into an array of words with or without punctuation
arturom/search-analysis
A graphical user interface for the Elasticsearch Analyze API
Beomi/megatronlm_dataset_autotokenizer
Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.
Anush008/tokenizers
Multi-arch bindings for @huggingface/tokenizers.
mickymultani/LLM-Architecture
Visualize some important concepts related to LLM architectures.
sappho192/Tokenizers.DotNet
[Unofficial] Simple .NET wrapper of HuggingFace Tokenizers library
cobanov/turkish-bpe-tokenizer
Byte Pair Encoding (BPE) tokenizer tailored for the Turkish language
symanto-research/merge-tokenizers
Package to align tokens from different tokenizations.
willsaliba/LDR_Transformer
ML Model designed to learn compositional structure of LEGO assemblies
adkwn1/question-answer-app
Question and Answer web application using fine-tuned and pre-trained T5 models. The application runs on Streamlit.
Jeronymous/deep_learning_notebooks
Self-contained notebooks for experimenting with particular concepts in Deep Learning
jungsoh/transformers-question-answering
Fine-tuning pre-trained transformer models in TensorFlow and PyTorch for question answering
victoryosiobe/kingchop
Kingchop ⚔️ is an English-based JavaScript library for tokenizing text (chopping text). It uses an extensive set of tokenization rules, which you can adjust easily.
DanielPFlorian/Transformers-Github-Semantic-Search
NLP Dataset Creation and Semantic Search Demonstration
infinilabs/pizza-stemmers
🌍 Rust Snowball stemmers with stemming algorithms for 30+ languages, built for INFINI Pizza.
OmkarBorhade98/Text_Summarization
Text Summarization using NLP
s2458588/wsm-tokenizer
Bachelor Thesis Repository. Wsm-tokenizer (word shape mapping) uses vocabulary comparisons to find probable morphemes in lexemic tokens.
u84819482/Nano-transformer
Minimal encoder for text classification, decoder for text generation, ViT for image classification
helena-intel/test-prompt-generator
Create prompts with a given token length for testing LLMs and other transformers text models.
lepisma/tokenizers.el
Fast tokenizers for Emacs Lisp backed by Hugging Face's Rust library