daulet/tokenizers

Thread safety issue

RJKeevil opened this issue · 3 comments

Hi, very happy to see the encodeWithOptions in main, but I've just spotted it has made tokenizers non-threadsafe. If I call encode on multiple goroutines simultaneously I now get the following:

[signal SIGSEGV: segmentation violation code=0x2 addr=0x7fd978021000 pc=0x10545f6]

goroutine 190 [running]:
runtime.throw({0x199ae0a?, 0x4ac2000?})
/usr/local/go/src/runtime/panic.go:1077 +0x5c fp=0xc00c5e2f00 sp=0xc00c5e2ed0 pc=0x45eedc
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:875 +0x285 fp=0xc00c5e2f60 sp=0xc00c5e2f00 pc=0x4752e5
github.com/daulet/tokenizers.uintVecToSlice(...)
/go/pkg/mod/github.com/!r!j!keevil/tokenizers@v0.0.0-20231116195110-2cfc0ca34c96/tokenizer.go:79
github.com/daulet/tokenizers.(*Tokenizer).EncodeWithOptions(0x450c80?, {0xc002da8960?, 0x1?}, 0x1, {0xc00c5e3250, 0x2, 0x43f588?})
/go/pkg/mod/github.com/!r!j!keevil/tokenizers@v0.0.0-20231116195110-2cfc0ca34c96/tokenizer.go:163 +0x2f6 fp=0xc00c5e3148 sp=0xc00c5e2f60 pc=0x10545f6

daulet commented

can you please share minimal repro?

daulet commented

I wasn't able to repro just by running bunch of encodes in parallel, I suppose we need to see your repro.

Apologies, side effect of Docker caching the Cargo build step, it was seeing an old version of your main branch in git clone . When it is released ill substitute that for the download as artifact approach and it wont occur again. I tested on roughly 10m tokens on 12 threads and all is happy, looks good to me.