go-ego/gse

Can the segmenter keep the original case of a sentence instead of lowercasing it?

Opened this issue · 3 comments

Hello, I want to keep uppercase letters. For example:

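	func main() {
		text := "Hello world, Helloworld. Winter is coming! 你好世界."
		jieba := new(gse.Segmenter)
		// Load the default dictionary; stop if it cannot be read.
		if err := jieba.LoadDict(); err != nil {
			panic(err)
		}
		res := jieba.Cut(text)
		// ToJson is a helper defined elsewhere in this package (not shown).
		println(ToJson(res))
	}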

The result is: ["hello"," ","world",","," ","helloworld","."," ","winter"," ","is"," ","coming","!"," ","你好","世界","."]

I would like the result to be: ["Hello"," ","world",","," ","Helloworld","."," ","Winter"," ","is"," ","coming","!"," ","你好","世界","."]
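As a possible workaround until this can be configured, the original case could be restored after segmentation. The sketch below is only an illustration: `restoreCase` is a hypothetical helper, not part of gse, and it assumes the returned tokens, concatenated in order, are a lowercased copy of the input with the same number of runes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// restoreCase is a hypothetical helper (not part of gse). It walks the
// original text and, for each lowercased token, takes the same number of
// runes from the original so the original case is kept.
// Assumption: the tokens, joined in order, match the original text rune for rune
// apart from letter case.
func restoreCase(original string, tokens []string) []string {
	runes := []rune(original)
	out := make([]string, 0, len(tokens))
	pos := 0
	for _, tok := range tokens {
		n := utf8.RuneCountInString(tok)
		if pos+n > len(runes) {
			// The token stream is longer than the original text,
			// so the assumption does not hold; keep the token as-is.
			out = append(out, tok)
			pos = len(runes)
			continue
		}
		out = append(out, string(runes[pos:pos+n]))
		pos += n
	}
	return out
}

func main() {
	text := "Hello world, Helloworld. 你好世界."
	// Tokens as a lowercasing segmenter might return them.
	lowered := []string{"hello", " ", "world", ",", " ", "helloworld", ".", " ", "你好", "世界", "."}
	fmt.Printf("%q\n", restoreCase(text, lowered))
}
```

This only helps when the segmenter changes nothing but letter case; if it also drops or rewrites characters, the positions would drift and the result would be wrong.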


I have also looked at the option parameters in segmenter.go: https://github.com/go-ego/gse/blob/master/segmenter.go


I would like this to be configurable through a parameter.

@vcaesar Could you help me with the toLower option parameter? Thanks very much.

@CocaineCong Hello, could you help me with the toLower option parameter? I want to use gse to tokenize sentences and then use mmh3 to encode the tokens.

Whether a character is lowercase or uppercase is very important to me, because the mmh3 value of a word differs depending on its case.
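To illustrate why case matters here, a minimal sketch follows. It uses 32-bit FNV-1a from the Go standard library as a stand-in for mmh3 so that the example needs no third-party package; `hashToken` is a hypothetical helper name. Any byte-wise hash, MurmurHash3 included, treats "Hello" and "hello" as different inputs.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashToken hashes a token with 32-bit FNV-1a.
// This is a stand-in for mmh3, used only to keep the example stdlib-only.
func hashToken(tok string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tok)) // Write on a hash.Hash never returns an error.
	return h.Sum32()
}

func main() {
	// The same word in two cases yields two different hash values.
	for _, tok := range []string{"Hello", "hello"} {
		fmt.Printf("%-6s -> %08x\n", tok, hashToken(tok))
	}
}
```

So if the segmenter lowercases tokens, the encoded values no longer match those computed from the original text.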