/cl-tokenizers

Primary LanguageCommon LispMIT LicenseMIT

cl-tokenizers

Tokenizers used for encoding/embedding and LLMs.

Status

This project is just getting started. Currently only the OpenAI Tiktoken BPE encoding is supported. It would be great to get some more implemented!

Installation

cl-tokenizers is not in quicklisp. It will need to be installed in the local-projects quicklisp directory:

> cd /$USER_HOME/quicklisp/local-projects/
> git clone https://github.com/jolby/cl-tokenizers.git

Basic Usage at the REPL

CL-USER>(ql:quickload :tokenizers)
> (:TOKENIZERS)
CL-USER>(defparameter *cl100k-encoder* (tokenizers:get-encoder :tiktoken "cl100k_base"))
> *CL100K-ENCODER*
CL-USER>(tokenizers:encode *cl100k-encoder* "hello world")
> #(15339 1917)
CL-USER>(tokenizers:decode *cl100k-encoder* #(15339 1917))
> "hello world"