timtadh/lexmachine

2 Questions

nobozo opened this issue · 3 comments

  1. I like what I'm seeing. I'm interested in using lexmachine to scan some input text, but in my case, the input text is in UTF-8. It appears that your scanner is expecting the input to be a byte slice, as shown by this:

func (self *Lexer) Scanner(text []byte) (*Scanner, error)

I'm concerned about what would happen if a character in the text being scanned occupies more than one byte. From a quick look at your docs, it doesn't appear that you handle this situation. Is this correct?

  2. I'm looking at the amount of work that gets done in the initLexer function. It's probably not significant in real life, but I'm wondering whether there's a way to preprocess the results of the lexer.Add calls so that they can be read into a program without redoing the work done by lexer.Add each time the program runs.

Thanks,
Jon Forrest

Jon,

It handles UTF-8 (or any other encoding); you just need to write your patterns in such a way that the multibyte encoded characters match as expected. Given the general nature of the encoding problem, this library doesn't try to solve it for you.

For instance, the snowman (encoded in UTF-8 as 0xE2 0x98 0x83) can be matched by the pattern []byte{0xE2, 0x98, 0x83}. If you look carefully you will notice the patterns are also byte slices, which lets you specify such things.
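As a minimal sketch of what that looks like in practice (the SNOWMAN token id and error handling are illustrative; the API is the one shown in the README):

package main

import (
	"fmt"

	lex "github.com/timtadh/lexmachine"
	"github.com/timtadh/lexmachine/machines"
)

const SNOWMAN = 0 // illustrative token id

func main() {
	lexer := lex.NewLexer()
	// Patterns are byte slices too, so the UTF-8 encoding of the
	// snowman (0xE2 0x98 0x83) can be matched byte for byte.
	lexer.Add(
		[]byte{0xE2, 0x98, 0x83},
		func(s *lex.Scanner, m *machines.Match) (interface{}, error) {
			return s.Token(SNOWMAN, string(m.Bytes), m), nil
		},
	)
	if err := lexer.Compile(); err != nil {
		panic(err)
	}
	scanner, err := lexer.Scanner([]byte("☃")) // the same three bytes
	if err != nil {
		panic(err)
	}
	for tok, err, eof := scanner.Next(); !eof; tok, err, eof = scanner.Next() {
		if err != nil {
			panic(err)
		}
		fmt.Println(tok)
	}
}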

If you encounter UTF-8-encoded input that you can't lex, I consider that a bug. Please open an issue.

As for your second question, I am exploring a code generation backend. However, the design is non-trivial when it comes to hooking up the action functions. In general, as long as you build your lexer inside an init() at program startup, the cost is trivial. However, for small programs that only want to lex one thing as fast as possible and then shut down, I can see that overhead being annoying (which is why I am toying with some code-gen designs).
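A sketch of the init() approach, so the compilation cost is paid once at startup (the WORD token and its pattern are illustrative):

package mylexer

import (
	lex "github.com/timtadh/lexmachine"
	"github.com/timtadh/lexmachine/machines"
)

const WORD = 0 // illustrative token id

// Compiled once at program startup; every call to Scan reuses it.
var lexer *lex.Lexer

func init() {
	lexer = lex.NewLexer()
	// A real program would register one Add call per token pattern.
	lexer.Add([]byte(`[a-zA-Z]+`), func(s *lex.Scanner, m *machines.Match) (interface{}, error) {
		return s.Token(WORD, string(m.Bytes), m), nil
	})
	if err := lexer.Compile(); err != nil {
		panic(err) // a bad pattern is a programming error; fail at startup
	}
}

// Scan creates a scanner over text using the shared, pre-compiled lexer.
func Scan(text []byte) (*lex.Scanner, error) {
	return lexer.Scanner(text)
}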

All the best,
Tim

When handling Unicode, the column numbers of tokens are not right.

For example, "你好" should be counted as 2 columns, not 6.

Totally right: it gives the columns in bytes. That's not something I can easily fix without implementing full Unicode support. If you want the correct column, you will need to extract it via a separate analysis, which would be painful.
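A sketch of what that separate analysis could look like, assuming you keep the bytes of each line around (this runeColumn helper is hypothetical, not part of lexmachine):

package main

import "unicode/utf8"

// runeColumn converts a 1-based byte column within a line into a
// 1-based rune (character) column, so "你好" spans columns 1-2
// rather than 1-6. Hypothetical helper, not part of lexmachine.
func runeColumn(line []byte, byteCol int) int {
	if byteCol < 1 || byteCol > len(line) {
		return byteCol // out of range; return it unchanged
	}
	return utf8.RuneCount(line[:byteCol-1]) + 1
}

Given a token's byte-based StartColumn and the bytes of its line, runeColumn(lineBytes, tok.StartColumn) would yield the character-based column.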

As an aside, getting the columns correct while supporting the user's ability to change the location of the TC in a callback was complicated enough to begin with. Not being a Unicode expert, I am not sure I want to deal with all the potential bugs that trying to be correct for Unicode would introduce.