tokenization

There are 1291 repositories under the tokenization topic.

  • sentencepiece_chinese_bpe

    Train a Chinese vocabulary with BPE in sentencepiece, then use it in transformers.

    Language: Python (119 stars)
  • charformer-pytorch

    Implementation of the GBST block from the Charformer paper, in PyTorch

    Language: Python (118 stars)
  • lima

    The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.

    Language: C++ (114 stars)
  • tkseem

    Arabic Tokenization Library. It provides many tokenization algorithms.

    Language: Jupyter Notebook (107 stars)
  • TweebankNLP

    [LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset

    Language: Python (105 stars)
  • openai-tools

    A collection of tools for working with OpenAI

    Language: C# (100 stars)
  • python-fpe

    FPE - Format Preserving Encryption with FF3 in Python

    Language: Python (100 stars)
  • WordTokenizers.jl

    High performance tokenizers for natural language processing and other related tasks

    Language: Julia (99 stars)
  • dlp-dataflow-deidentification

    Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP

    Language: Java (95 stars)
  • attacut

    A Fast and Accurate Neural Thai Word Segmenter

    Language: Python (90 stars)
  • wisesight-sentiment

    Thai social media text sentiment dataset

    Language: Jupyter Notebook (87 stars)
  • nlpcloud-python

    NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and more...

    Language: Python (85 stars)
  • klmbr

    klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs

    Language: TeX (80 stars)
  • Coursera-DeepLearning.AI-Natural-Language-Processing-Specialization

    This repository contains solutions to the assignments of the Natural Language Processing Specialization from DeepLearning.AI on Coursera, taught by Younes Bensouda Mourri, Łukasz Kaiser, and Eddy Shyu.

    Language: Jupyter Notebook (80 stars)
  • wongnai-corpus

    Collection of Wongnai's datasets

  • Real-World-Assets-RWA

    This repository covers the theoretical and technical aspects of the tokenisation of real-world assets.

    Language: Solidity (76 stars)
  • Vaaku2Vec

    Language Modeling and Text Classification in Malayalam Language using ULMFiT

    Language: Jupyter Notebook (73 stars)
  • SeTok

    Code for the ICLR 2025 paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM

    Language: Python (72 stars)
  • uax29

    A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split words, sentences and graphemes.

    Language: Go (68 stars)
  • MBTI-Personality-Classifier

    A model that uses your social media posts to predict your MBTI personality type.

    Language: Jupyter Notebook (67 stars)
  • h-net-dynamic-chunking

    Implementation of the dynamic chunking mechanism in H-net by Hwang et al. of Carnegie Mellon

    Language: Python (64 stars)
  • ling

    Natural Language Processing Toolkit in Golang

    Language: Go (64 stars)
  • CMTAT

    Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.

    Language: JavaScript (63 stars)
  • vaulty

    Tokenize, encrypt/decrypt, mask your data on the fly with Vaulty proxy

    Language: Go (62 stars)
  • wink-tokenizer

    Multilingual tokenizer that automatically tags each token with its type

    Language: JavaScript (62 stars)
  • spacy-server

    🦜 Containerized HTTP API for industrial-strength NLP via spaCy and sense2vec

    Language: Python (60 stars)
  • code_tokenize

    Fast tokenization and structural analysis of any programming language

    Language: Python (59 stars)
  • bert_tokenization_for_java

    This is a Java version of the Chinese tokenization described in BERT.

    Language: Java (59 stars)
  • contracts

    On-chain RWA Tokenization Framework

    Language: Solidity (56 stars)
  • unscanny

    Painless string scanning.

    Language: Rust (56 stars)
  • cookbook

    The Unicode Cookbook for Linguists

    Language: TeX (56 stars)
  • FastBertTokenizer

    Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.

    Language: C# (50 stars)
  • Natural-Language-Processing-Fundamentals

    Use Python and NLTK to build out your own text classifiers and solve common NLP problems

    Language: Jupyter Notebook (50 stars)
  • cashtokens

    A proposal to enable two new primitives on Bitcoin Cash: fungible tokens and non-fungible tokens.

  • nlpcloud-js

    NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and much more...

    Language: JavaScript (48 stars)
  • xontrib-output-search

    Get identifiers, paths, URLs and words from the previous command output and use them for the next command in xonsh shell.

    Language: Python (47 stars)