PaperMonoid/javaccc

Regex character class compilation

Closed this issue · 3 comments

What should a regex character class be compiled to?

A regex character class is a set of valid character for an expression. It's basically a bunch of or expressions. We could for example have the following:

["a", "e", "i", "o", "u"] = "a" | "e" | "i" | "o" | "u"

These two expressions are equivalent and one could be replaced by the other. Although there are many things that regex character classes make more convenient. For example ranges:

["a" - "z"] = "a" | "b" | "c" | "d" | ... | "z"

Regex character classes also allow to get the complement of a set. For example the following expression matches everything BUT digits:

~["0" - "9"] = ???

In our compilers class they taught us how to make an automata for the or expression. But they never told us anything about character classes.

or-automata
These are my notes on the or automata.

Most of the character classes stuff could be implemented as a bunch of or expressions, but it's unclear what's the correct translation for the other expressions such as ranges and set complements.

What should be done?

I've read this article about email regex and it has a diagram which shows how character classes COULD BE implemented, so I'm leaving it here as a suggestion:
email-regex

It's a problem. We could easily choose, for the moment, only to work with 'or' expressions. But it is not what characterizes us. This also complicates the drawing of the diagrams, because I begin to occur cases in which the current algorithm will fail. I think it's okay to base ourselves on the work done in the article about regular expression for an email. And, I propose to leave aside drawing the diagrams, and better to take advantage of some technology that already exists, for example, generate the code for LaTex, as explained here. And so focus on generating the information that LaTex needs to draw the diagram. And maybe, later build our library for draw automatas.

I was thinking about using LaTeX too, it might be a lot easier to do it that way and it might look better.

I'm almost done writing the regex automata compiler, there's no optimization yet but that shouldn't be too hard. I'll push mi commit as soon as I'm done with it.

Well, then it is decided. We are going to pivot the idea, change to develop an algorithm that generates the LaTex code necessary to draw the diagram, to later compile it with LaTex (?). I'm going to move the previous code to an identifiable folder to save it.