alwinb/html-lexer

Learning to write a lexer, can you help me document what the states in this project translate to?

alshdavid opened this issue · 2 comments

Hey, I am learning about writing a lexer and came across your project, which I felt was a great resource to get started with. I am currently reading through it to get an understanding of how lexers can work.

I am reading through the state types and documenting them to get an understanding of what each state means - any chance you could help out?

I have a fork here (ported to TypeScript because I was struggling to understand the flow without type information).

https://github.com/alshdavid/html-lexer/blob/annotate-and-refactor/src/states/states.ts

I have annotated STN and ETN, but some of the acronyms are beyond me haha

Oh I see, translations are at the end of the table!

alwinb commented

Hi! I see you've found your answer, but if you have more questions, feel free to ask!

I think that this way of writing a lexer, with a hand-written transition table that encodes a DFA (deterministic finite automaton), is not very common.

I quite like this article, which I used as inspiration. I've also built on things that I learnt while working on this project of mine.

It is more common to see hand-written lexers that implement the transitions directly in code, typically with an outer switch statement on the state and inner switch statements on the input character, or the other way around. This is very similar (the table can be translated into such code mechanically), but it does allow you to do some ad-hoc look-ahead, to break a state down into multiple parts, or to add other bits of logic, all of which can be handy. It also avoids having to use cryptic mnemonic names for the states.
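To make that concrete, here is a minimal sketch of the switch-on-state style for an invented mini-language of identifiers and numbers. This is illustrative only and is not html-lexer's code; the states and character classes are made up for the example:

```typescript
// Hypothetical mini-lexer: outer switch on the state, inner tests on the
// input character. Not html-lexer's actual states.
type State = "Start" | "Ident" | "Number";

function lex(input: string): { type: State; value: string }[] {
  const tokens: { type: State; value: string }[] = [];
  let state: State = "Start";
  let start = 0;
  for (let i = 0; i <= input.length; i++) {
    // Use a sentinel past the end of input so the final token gets flushed.
    const c = i < input.length ? input[i] : "\0";
    switch (state) {
      case "Start":
        if (/[A-Za-z]/.test(c)) { state = "Ident"; start = i; }
        else if (/[0-9]/.test(c)) { state = "Number"; start = i; }
        break;
      case "Ident":
        if (!/[A-Za-z0-9]/.test(c)) {      // end of identifier reached
          tokens.push({ type: "Ident", value: input.slice(start, i) });
          state = "Start"; i--;            // reprocess c in the Start state
        }
        break;
      case "Number":
        if (!/[0-9]/.test(c)) {            // end of number reached
          tokens.push({ type: "Number", value: input.slice(start, i) });
          state = "Start"; i--;
        }
        break;
    }
  }
  return tokens;
}
```

Notice how the `i--` trick gives one character of ad-hoc look-ahead, which is exactly the kind of flexibility that is harder to express in a pure table.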

I try to align my code with mathematically well-behaved formalisms as much as I can, so I see the lesser affordances of a table as an advantage rather than a drawback. The table is just an encoding of a DFA, one that also associates an output token-type with each accepting state. Personally I also like using a table because it makes it easy to cross-check each character class against each state, which helps me verify that I'm implementing the language correctly.
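The shape of such an encoding can be sketched like this. The states, character classes, and token types below are invented for illustration; they are not html-lexer's actual table:

```typescript
// Toy DFA transition table: rows are states, columns are character classes,
// each cell is the next state, and accepting states carry an output token type.
const Alpha = 0, Digit = 1, Other = 2;                 // character classes
const classify = (ch: string): number =>
  /[A-Za-z]/.test(ch) ? Alpha : /[0-9]/.test(ch) ? Digit : Other;

const Start = 0, Ident = 1, Num = 2;                   // states
const FAIL = -1;                                       // "no transition"

// transition[state][charClass] -> next state
const transition: number[][] = [
  //            Alpha  Digit  Other
  /* Start */ [ Ident, Num,   FAIL ],
  /* Ident */ [ Ident, Ident, FAIL ],
  /* Num   */ [ FAIL,  Num,   FAIL ],
];

// token type associated with each accepting state (Start accepts nothing)
const tokenType: (string | null)[] = [ null, "ident", "number" ];
```

Laying the rows and columns out like this is what makes the cross-checking easy: every (state, character-class) pair has exactly one visible cell, so gaps in the language definition show up as you write the table.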

The code that 'drives' the DFA uses some tricks to produce the longest match as a token.
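The usual technique here (my sketch of the general "maximal munch" idea, not the project's exact code) is to run the DFA as far as it will go, remember the last accepting position seen, then emit a token for that longest match and restart from there. The tiny `=` / `==` / number DFA below is invented for the example:

```typescript
// Longest-match ("maximal munch") driver over a hypothetical three-token DFA.
type Tok = { type: string; value: string };

// states: 0 start, 1 "=" (accepts assign), 2 "==" (accepts eq), 3 digits (accepts number)
const accept: Record<number, string> = { 1: "assign", 2: "eq", 3: "number" };

const step = (s: number, c: string): number => {
  if (s === 0 && c === "=") return 1;
  if (s === 1 && c === "=") return 2;
  if ((s === 0 || s === 3) && /[0-9]/.test(c)) return 3;
  return -1;                                // no transition
};

function tokenize(input: string): Tok[] {
  const tokens: Tok[] = [];
  let pos = 0;
  while (pos < input.length) {
    let s = 0, i = pos, lastEnd = -1, lastType = "";
    // Run the DFA as far as possible, noting the last accepting state seen.
    while (i < input.length && (s = step(s, input[i])) !== -1) {
      i++;
      if (accept[s]) { lastEnd = i; lastType = accept[s]; }
    }
    if (lastEnd === -1) { pos++; continue; } // skip an unmatchable character
    tokens.push({ type: lastType, value: input.slice(pos, lastEnd) });
    pos = lastEnd;                           // resume after the longest match
  }
  return tokens;
}
```

On input `"=== 42"` this yields `eq` for the first two characters rather than two `assign` tokens, which is the point of tracking the last accepting position instead of emitting a token at the first accepting state.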

If you are interested, I'd be happy to try and explain more!