sugarme/tokenizer

panic: fatal error: concurrent map writes


I got the error fatal error: concurrent map writes in the BPE TokenizeWithCache func. Concurrent read and write operations on its cache map crash the process.

func (b BPE) TokenizeWithCache(sequence string) (retVal []tokenizer.Token) {

	// unsynchronized read of the shared cache map
	if hit, ok := b.Cache.cmap[sequence]; ok {
		return b.WordToTokens(hit)
	} else {
		word := b.MergeWord(sequence)
		retVal = b.WordToTokens(*word)
		if b.Cache != nil {
			// unsynchronized write to the same map via SetValues
			b.Cache.SetValues([]CacheItem{
				{sequence, *word},
			})
		}
		return retVal
	}
}
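
For reference, guarding the map with a sync.RWMutex would make the lookup-then-store pattern safe. This is only a simplified sketch with stand-in types (string values instead of the library's Word/CacheItem), not a patch against the real Cache:

package main

import (
	"fmt"
	"sync"
)

// cache is a simplified stand-in for the library's Cache type,
// with a mutex guarding every access to the underlying map.
type cache struct {
	mu   sync.RWMutex
	cmap map[string]string
}

// get takes a read lock, so lookups never race with writes.
func (c *cache) get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.cmap[key]
	return v, ok
}

// set takes the write lock for the duration of the update.
func (c *cache) set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cmap[key] = value
}

func main() {
	c := &cache{cmap: make(map[string]string)}
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			key := fmt.Sprintf("seq-%d", n%10)
			if _, ok := c.get(key); !ok {
				c.set(key, "tokens for "+key)
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(c.get("seq-0"))
}

The check-then-set is still not atomic, so two goroutines may occasionally compute the same entry, but the map itself can no longer be corrupted.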

Please check~

@ZeroYuJie,

Please share the detailed error log and an example of how to replicate it. Thanks!

@sugarme
I am using this in my multi-goroutine testing. First I use this func to init the model tokenizer and store it in a global variable.
The code looks like this:

func OfflineLLMTokenizerInit(modelName string) (*tokenizer.Tokenizer, error) {
	configFile, err := tokenizer.CachedPath(modelName, "tokenizer.json")
	if err != nil {
		return nil, err
	}
	tk, err := pretrained.FromFile(configFile)
	if err != nil {
		return nil, err
	}
	return tk, nil
}

var tk *tokenizer.Tokenizer

func main() {
	tk, _ = OfflineLLMTokenizerInit("NousResearch/Redmond-Puffin-13B")
	benchNum := 10000
	for i := 0; i < benchNum; i++ {
		go func(number int) {
			//random str len = 1000
			input := random.RandString(1000)
			encoderSingle, _ := tk.EncodeSingle(input, false)
			println(fmt.Sprintf("routine=%d,%s,len=%d", number, input, len(encoderSingle.Tokens)))
		}(i)
	}
	time.Sleep(time.Minute)
}

Then it throws the fatal error:
[screenshot: fatal error: concurrent map writes]
The stack:
[screenshot: goroutine stack trace]
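
Running the repro with the race detector enabled (go run -race) should also report the unsynchronized read/write on cmap directly, which makes the problem easy to confirm.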

Because the cache map b.Cache.cmap is read and written from many goroutines with no synchronization,
I think cmap should use sync.Map (or be guarded by a mutex), or this cache should be removed...
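
Something like the following sync.Map sketch is what I mean. It uses stand-in types and helper names (tokenize instead of the real MergeWord/WordToTokens path), so it illustrates the idea rather than the library's actual API:

package main

import (
	"fmt"
	"strings"
	"sync"
)

// cache maps an input sequence to its (stand-in) tokenization.
// sync.Map is safe for concurrent Load and Store without extra locking.
var cache sync.Map

// tokenize is a stand-in for the expensive MergeWord/WordToTokens path.
func tokenize(sequence string) []string {
	return strings.Fields(sequence)
}

// tokenizeWithCache keeps the lookup-then-store shape of the original,
// but every map access goes through sync.Map, so concurrent calls are safe.
func tokenizeWithCache(sequence string) []string {
	if hit, ok := cache.Load(sequence); ok {
		return hit.([]string)
	}
	tokens := tokenize(sequence)
	cache.Store(sequence, tokens)
	return tokens
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			tokenizeWithCache("hello world from many goroutines")
		}()
	}
	wg.Wait()
	fmt.Println(len(tokenizeWithCache("hello world from many goroutines")))
}

Either way, the unsynchronized access to cmap goes away.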