/go-tokenizers-bindings

Go bindings for HuggingFace Tokenizer

Primary LanguageGoMIT LicenseMIT

Tokenizers

Go bindings for the HuggingFace Tokenizers library.

Installation

make build to build libtokenizers.a that you need to run your application that uses bindings.

Using pre-built binaries

Build your Go application using pre-built native binaries: docker build --platform=linux/amd64 -f example/Dockerfile .

Available binaries:

Getting started

TLDR: working example.

Load a tokenizer from a JSON config:

import "github.com/daulet/tokenizers"

tk, err := tokenizers.FromFile("./data/bert-base-uncased.json")
if err != nil {
    return err
}
// release native resources
defer tk.Close()

Encode text and decode tokens:

fmt.Println("Vocab size:", tk.VocabSize())
// Vocab size: 30522
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", false))
// [2829 4419 14523 2058 1996 13971 3899] [brown fox jumps over the lazy dog]
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", true))
// [101 2829 4419 14523 2058 1996 13971 3899 102] [[CLS] brown fox jumps over the lazy dog [SEP]]
fmt.Println(tk.Decode([]uint32{2829, 4419, 14523, 2058, 1996, 13971, 3899}, true))
// brown fox jumps over the lazy dog

Benchmarks

go test . -bench=. -benchmem -benchtime=10s

goos: darwin
goarch: arm64
pkg: github.com/daulet/tokenizers
BenchmarkEncodeNTimes-10     	  996556	     11851 ns/op	     116 B/op	       6 allocs/op
BenchmarkEncodeNChars-10      1000000000	     2.446 ns/op	       0 B/op	       0 allocs/op
BenchmarkDecodeNTimes-10     	 7286056	      1657 ns/op	     112 B/op	       4 allocs/op
BenchmarkDecodeNTokens-10    	65191378	     211.0 ns/op	       7 B/op	       0 allocs/op
PASS
ok  	github.com/daulet/tokenizers	126.681s

Contributing

Please refer to CONTRIBUTING.md for information on how to contribute a PR to this project.