BurntSushi/ucd-generate

Subcommand to generate custom tables containing a combination of properties / property values.

thomcc opened this issue · 4 comments

Essentially: a flag for compiling a table that is roughtly equivalent to a regex character class.

For example, currently the perl-word flag could be expressed as [\p{alpha}\p{joinc}\p{gc=digit}\p{gc=M}\p{gc=pc}], or something along those lines.

I also wanted this when I was experimenting with different options for getting a more accurate version of unicode-width (one that operates on strings and takes grapheme sequences into account, using heuristics to determine the actual rendered width on terminals).

When I've wanted this, I've resorted to just hacking it into a local branch of ucd-generate, or a separate project with a bunch of copypasted code. This also would make the work in #37 unnecessary, as it could be expressed as just a combination of \p{gc=...}s.

One option would be to accept regex char class syntax and use regex-syntax to parse it... The benefit here is that syntax is quite rich and supports the full spectrum of set operations (for example, the include/exclude flags we currently support can't express intersection or inversion, and doesn't nest).

But I believe it would be using the version of the UCD it was compiled with? Is this true? Does it matter? I'm on my lunch break now and haven't investigated this, so it's possible it's easy and does exactly what we want.


However we do it, we'd want to express something like:

enum ClassType {
    Range(u32, u32),
    Property(String),
    // .0 is == vs !=
    PropertyVal(bool, String, String),
    Expr(BinOp, Box<Class>, Box<Class>),
}
struct Class {
    negated: bool,
    ty: ClassType,
}
enum BinOp {
    Union,
    Intersection,
    Difference,
    SymmetricDifference,
}

Which then could be evaluated in a fairly easy manner, ignoring many concerns around performance, error handling, etc:

// This is horrifyingly inefficient, and just to prove the idea
type CharSet = BTreeSet<u32>;

fn evaluate(c: Class) -> CharSet {
    let set: CharSet = match (c.ty) {
        Range(lo, hi) => (lo..=hi).collect(),
        Property(s) => codepoints_with_property(&s, None),
        PropertyVal(true, p, v) =>
            codepoints_with_property(&p, Some(&v))?,
        PropertyVal(false, p, v) =>
            invert(codepoints_with_property(&p, Some(&v))),

        Expr(Union, lhs, rhs) =>
            evaluate(*lhs).union(evaluate(*rhs)),
        // ... etc
    };
    if c.negated {
        invert(set)
    } else {
        set
    }
}

fn invert(c: CharSet) -> CharSet {
    (0..0x110000).collect::<CharSet>().difference(&c)
}
// This assumes this function exists, which is probably not the signature
// it would actually have.
fn codepoints_with_property(name: &str, value: Option<&str>) -> CharSet;

Then, we'd use that to emit tables.

Briefly, yes, I think this is a great idea.

However, we unfortunately shouldn't use regex syntax for this. Or more specifically, we shouldn't use regex-syntax for it. In particular, ucd-generate and regex-syntax currently form a cyclic dependency: ucd-generate depends on regex-syntax for its dfa and regex sub-commands, but regex-syntax depends on ucd-generate to produce its Unicode tables.

I am currently working on moving dfa and regex out of ucd-generate and into a tool maintained as part of the regex-automata project. So we definitely shouldn't add any new dependencies on regex-syntax's Unicode tables in ucd-generate.

Okay. Do you have a preference for how the parser is written? (e.g. hand-written, nom, use some generator library...)

Also, not that it matters here, but... ucd-parse still uses regex to parse a lot of things, so the cyclic dep remains, no?

I guess I would question whether we need a parser at all, and whether we can do this via a command line interface. I guess if all we supported was a union operation, then it would be easy.

Also, not that it matters here, but... ucd-parse still uses regex to parse a lot of things, so the cyclic dep remains, no?

The linked issue is long, but I brought this up in it (I think), and apparently it's the conceptual dependency that's the problem? That is, any old regex engine could be substituted in ucd-parse for regex.

One thing I was thinking about was making ucd-parse's dependency on regex disable its Unicode features so that ucd-generate doesn't have the Unicode tables from regex compiled in at all. If we did that, then we could use the Ast parser in regex-syntax, which doesn't use the Unicode tables at all. So maybe that's the right approach? I wish I could give you a straight answer, but honestly Debian's policies in this realm are so completely opaque to me, that all I can do is give you a best guess based on my conversation in the linked issue.

Otherwise, I generally hand-roll parsers. I'd prefer not bringing in any extra dependencies. But I also don't think we should re-implement the Unicode character class parser that's in regex-syntax, because it's pretty gnarly.

So:

  1. Consider whether we can do this at the CLI level.
  2. Otherwise, just use the regex-syntax AST parser. The key downside here is that we couldn't use the HIR translator, which is what does the heavy lifting of assembling the set of codepoints. However, it needs to do it efficiently. I imagine we could just throw the codepoints in a BTreeSet and use its various set operations.

So, using regex-syntax's ast parser was all I had planned on, as using the tables it's compiled with would give wrong results if it wasn't compiled with data produced by the same version of the UCD -- e.g. this is a case where the dep cycle would cause wrong results to be produced.

I had then intended on using something only slightly less naive than the evaluate code given in the first comment to produce a BTreeSet<u32>.

Consider whether we can do this at the CLI level.

I can think of a number of use cases for more advanced operations... But perhaps they aren't worth supporting. We could always add a variant later that accepted char set syntax, or something equivalently powerful.

Otherwise, I generally hand-roll parsers. I'd prefer not bringing in any extra dependencies

This is also my preference, not that it seems like it will matter here.