Port new Tokeniser from Linguist

Question

bzz opened this issue 5 years ago · 2 comments

Part of #155

Right now enry uses a content tokenization approach based on the regexps that linguist used before v5.3.2.

This issue is about enry supporting/producing the same results as the new, flex-based scanner introduced in github/linguist#3846.

This is important because it affects Bayesian classifier accuracy, and the classifier tests in both projects make a strong assumption that all samples can be distinguished by the content classifier alone.

Answer 1 · 2019-02-08T21:34:44.000Z

Linguist's tokenizer is defined by a flex grammar, tokenizer.l.

1. Generating Go code from the flex grammar

Go does have a limited port of flex, https://gitlab.com/cznic/golex, but it is missing 2 features needed in order to be used with the above definition:

Trailing context (re1/re2).
All flex % prefixed options except %s and %x.

(reproduction instructions and the resulting logs follow)

wget https://raw.githubusercontent.com/github/linguist/master/ext/linguist/tokenizer.l
go get -u modernc.org/golex
golex -o lex.go tokenizer.l

tokenizer.l:35:1: unknown %option "never-interactive yywrap reentrant nounput warn nodefault header-file=\"lex.linguist_yy.h\" extra-type=\"struct tokenizer_extra *\" prefix=\"linguist_yy\""
tokenizer.l:87:16 - "\<[[:alnum:]_!.^/?-]+              {" - trailing context not supported
tokenizer.l:103:15 - "[[:alnum:]_.@#^/*]+                {" - trailing context not supported

At this point it's hard to estimate the effort of adding those features upstream.

2. Porting the lexer grammar to Ragel

An instructive go-nuts thread on this subject suggests it is worth trying a slightly more complex solution, similar to the discussion in #167, based on Ragel, another FSM generator that can be "compiled" to Go code. That would only require porting 1 file (.l -> .rl), which is a much more manageable effort.

3. Using the flex-generated native lexer through cgo

Hidden behind a build tag, this option uses the same native, flex-generated tokenizer from Linguist directly. This is low-hanging fruit, as it does not require much porting effort, and it is the simplest way to verify the hypothesis about classifier accuracy from #194.

Answer 2 · 2019-04-08T14:38:52.000Z

#193 (comment) was updated to include another option: using the existing flex-based tokenizer through cgo.