Port new Tokeniser from Linguist
bzz opened this issue · 2 comments
Part of #155
Right now enry uses a regexp-based content tokenization approach taken from linguist before v5.3.2.
This issue is about enry producing the same results as the new, flex-based scanner introduced in github/linguist#3846.
This is important because it affects Bayesian classifier accuracy, and classifier tests in both projects make a strong assumption that all samples can be distinguished by the content classifier alone.
The Linguist tokenizer is defined in the flex-based tokenizer.l.
1. Generating Go code from flex grammar
Golang does have a limited port of flex in https://gitlab.com/cznic/golex, but it is missing 2 features needed to use it with the above definition:
- Trailing context (re1/re2).
- All flex % prefixed options except %s and %x.
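For readers unfamiliar with the first missing feature: a flex rule `re1/re2` matches `re1` only when `re2` follows, and leaves `re2` unconsumed. Below is a minimal sketch of emulating that for a single rule with Go's standard `regexp` package, using a capture group because RE2 syntax has no lookahead. The helper name `matchWithTrailingContext` and the sample patterns are hypothetical and are not taken from tokenizer.l; the sketch also does not reproduce flex's longest-match rule selection.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchWithTrailingContext emulates the flex pattern head/trail at the start
// of input: it succeeds only when trail immediately follows head, and it
// returns only the text matched by head, so trail stays unconsumed.
// Hypothetical helper for illustration; it panics on an invalid pattern.
func matchWithTrailingContext(head, trail, input string) (string, bool) {
	re := regexp.MustCompile(`^(` + head + `)(?:` + trail + `)`)
	m := re.FindStringSubmatch(input)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// An identifier counts as a token only when followed by "=".
	tok, ok := matchWithTrailingContext(`[a-z]+`, `=`, "key=value")
	fmt.Println(tok, ok) // prints "key true"
}
```

A real port would need this behaviour inside the generated scanner itself, which is why the missing golex support matters.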
(see logs in details for reproduction instructions)
wget https://raw.githubusercontent.com/github/linguist/master/ext/linguist/tokenizer.l
go get -u modernc.org/golex
golex -o lex.go tokenizer.l
tokenizer.l:35:1: unknown %option "never-interactive yywrap reentrant nounput warn nodefault header-file=\"lex.linguist_yy.h\" extra-type=\"struct tokenizer_extra *\" prefix=\"linguist_yy\""
tokenizer.l:87:16 - "\<[[:alnum:]_!.^/?-]+ {" - trailing context not supported
tokenizer.l:103:15 - "[[:alnum:]_.@#^/*]+ {" - trailing context not supported
At this point it's hard to estimate the effort of adding those features upstream.
2. Porting lexer grammar to Ragel
An instructive go-nuts thread on this subject suggests it is worth trying a slightly more complex solution, similar to the discussion in #167, based on Ragel, another FSM generator that can be "compiled" to Go code. That would only require porting 1 file, .l -> .rl, which is a much more manageable effort.
3. Using flex-generated native lexer through cgo
Hidden behind a compilation tag, this option uses the same native, flex-generated tokenizer from Linguist directly. This is low-hanging fruit, as it does not require much effort to port and is the simplest way to verify the classifier accuracy hypothesis from #194.
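One way this could be wired up on the Go side is sketched below, under stated assumptions: the classifier calls the tokenizer through a package-level function variable, which defaults to the pure-Go implementation and which a separate, build-tagged cgo file could override. The names `tokenizeFunc`, `tokenizeRegexp`, `tokenize`, and the tag name `flex` are all hypothetical, and the whitespace split is only a stand-in for the real regexp-based tokenizer.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenizeFunc is the signature both implementations would share
// (an assumption of this sketch, not enry's actual API).
type tokenizeFunc func(content []byte) []string

// tokenizeRegexp stands in for the current pure-Go tokenizer.
// Here it simply splits on whitespace to keep the sketch short.
func tokenizeRegexp(content []byte) []string {
	return strings.Fields(string(content))
}

// tokenize is what the classifier would call. A second file guarded by
// `//go:build flex` (hypothetical tag) could reassign it in init() to a
// cgo wrapper around the flex-generated lexer, leaving default builds
// free of any C dependency.
var tokenize tokenizeFunc = tokenizeRegexp

func main() {
	fmt.Println(tokenize([]byte("int main() { return 0; }")))
}
```

With this layout, `go build` keeps the existing behaviour, while `go build -tags flex` would select the native lexer, so the same classifier tests can be run against both.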
#193 (comment) updated to include another option of using the existing flex-based tokenizer through cgo.