open-i18n/rust-unic

Implement CharRange type

CAD97 opened this issue ยท 14 comments

CAD97 commented

[assign: @CAD97]

Replace usages of (char, char) as an inclusive range with a custom Range struct adapted for use on characters.

Filed #93 as a follow up to this, for CharProperty API accepting CharRange as input.

CAD97 commented

I found char-iter on crates.io after a little digging, but that's only an iterator and not an actual range object. A cursory search found only better-range that actually provided a range over characters, and that crate doesn't have documentation or source links on crates.io.

(Also, both crates are two years old, but then again character iteration isn't going to change.)

In conclusion, adding a CharRange type will not duplicate work already done in the ecosystem.

Because a character range is such a self-contained interest (and I'm currently working on a project that would also enjoy having one), I'm considering making it its own crate. What do you think @behnam?

Thanks for the research, @CAD97. Very helpful.

Yeah, now thinking about it again, looks like things like char_property and char_range can be standalone crates, and not just parts of unic-utils, so they can be easily everywhere.

That said, I like the idea of keeping all these things under one repository, because it allows faster development and one-step reviews, instead of going back and forth between repos. The main problem in this area is the fact that we don't have CI/devops tools to test cross-repo PRs before merging. Hopefully this will change in a near future, but for now, it's too much cost to maintain small repos. This is why there are many multi-package repositories (like https://github.com/BurntSushi/ripgrep/), which host many small packages internally, so they can be developed with all the cross-package tests.

I think eventually we can split out these kinds of packages into their own repo. But for the current stage, which is fast development and not maintaining a stable API yet, I think we better keep them all in.

About organization in UNIC, we can actually put these components under /unic/char/range/ and /unic/char/property/, and name them unic-char-range and unic-char-property. This will keep names short, but still show which repository they come from, and keep them easy to find.

What do you think?

CAD97 commented

That seems like a good path forward to me!

I think we can generally improve Range<char> in rustc, so we can drop this class later. But, for now, we need the functionality and this model is the best abstraction, and easier to replace with built-in libs later.

Some ideas for now...

We want CharRange to implement std::ops::Range, and we would need a CharRangeInclusive to implement std::ops::RangeInclusive.

Both will be implementing std::iter::ExactSizeIterator and DoubleEndedIterator.

Also, char doesn't satisfy trait std::iter::Step, which makes it harder to support for loops over CharRange directly. We may be able to fix this upstream. I'll file an issue.

What do you think?

CAD97 commented

ops::Range is a struct, so for now we need to just replace it. Hopefully, as time goes on, we can fold this into ops::Range<char>. For now I've made two structs: CharRange and CharIter. The range object is effectively a fake ops::Range<char> except instead of implementing Iterator it implements IntoIterator<CharIter>. CharIter is effectively a slightly modified version of the iterator I originally proposed in #85.

I'm just putting some polish onto it right now and then I'll send the PR.

Right, right, not traits to implement, but types to stay similar to.

How about CharRangeInclusive? Don't we need them for those data table records starting/ending at the edges of char boundaries?

CAD97 commented

I've just done an inclusive range. The reasoning is that that's how character ranges are traditionally described, such as the regex set [a-z], which includes both a and z in the set.

If we really want an exclusive range it can be an alternate constructor.

Well, I don't really care much about the exclusive range, but I want us to stay as close to rustc as possible in semantics of the names. If we only have an inclusive range, I say we should call it CharRangeInclusive.

For easy of use of the shorter syntax, we can have a chars!(start, end) macro, that returns the inclusive range. This is short, easy to ready, and not breaking consistency with rustc namings.

What do you think?

CAD97 commented

I've submitted the PR so you can see the code; it checks out but I'm currently expanding docs and writing tests. I wouldn't be unhappy actually making CharRange and CharRangeInclusive. The chars! macro idea is a good one; we can even enable for c in chars!('a'..='z') with the macro!

Now that we want to be fancy, we can even support the U+xxxx syntax in the macro, so we can have this:

for c in chars!(U+0061..=U+FFFF)

๐Ÿ˜„

CAD97 commented

I'm pretty sure that if we want to do that, we'll need a procedural macro. There's no way within macro_rules! hygiene to convert U+0061 to '\u{0061}'.

Yeah, that was mostly a joke. But, if we ever end up writing a procedural macro for building transformation pipelines, we can consider supporting the U+ format.

CAD97 commented

Closed by #112