Implement CharRange type
CAD97 opened this issue ยท 14 comments
Filed #93 as a follow up to this, for CharProperty
API accepting CharRange
as input.
I found char-iter
on crates.io after a little digging, but that's only an iterator and not an actual range object. A cursory search found only better-range
that actually provided a range over characters, and that crate doesn't have documentation or source links on crates.io.
(Also, both crates are two years old, but then again character iteration isn't going to change.)
In conclusion, adding a CharRange
type will not duplicate work already done in the ecosystem.
Because a character range is such a self-contained interest (and I'm currently working on a project that would also enjoy having one), I'm considering making it its own crate. What do you think @behnam?
Thanks for the research, @CAD97. Very helpful.
Yeah, now thinking about it again, looks like things like char_property
and char_range
can be standalone crates, and not just parts of unic-utils
, so they can be easily everywhere.
That said, I like the idea of keeping all these things under one repository, because it allows faster development and one-step reviews, instead of going back and forth between repos. The main problem in this area is the fact that we don't have CI/devops tools to test cross-repo PRs before merging. Hopefully this will change in a near future, but for now, it's too much cost to maintain small repos. This is why there are many multi-package repositories (like https://github.com/BurntSushi/ripgrep/), which host many small packages internally, so they can be developed with all the cross-package tests.
I think eventually we can split out these kinds of packages into their own repo. But for the current stage, which is fast development and not maintaining a stable API yet, I think we better keep them all in.
About organization in UNIC, we can actually put these components under /unic/char/range/
and /unic/char/property/
, and name them unic-char-range
and unic-char-property
. This will keep names short, but still show which repository they come from, and keep them easy to find.
What do you think?
That seems like a good path forward to me!
I think we can generally improve Range<char>
in rustc, so we can drop this class later. But, for now, we need the functionality and this model is the best abstraction, and easier to replace with built-in libs later.
Some ideas for now...
We want CharRange
to implement std::ops::Range
, and we would need a CharRangeInclusive
to implement std::ops::RangeInclusive
.
Both will be implementing std::iter::ExactSizeIterator
and DoubleEndedIterator
.
Also, char
doesn't satisfy trait std::iter::Step
, which makes it harder to support for
loops over CharRange
directly. We may be able to fix this upstream. I'll file an issue.
What do you think?
ops::Range
is a struct, so for now we need to just replace it. Hopefully, as time goes on, we can fold this into ops::Range<char>
. For now I've made two structs: CharRange
and CharIter
. The range object is effectively a fake ops::Range<char>
except instead of implementing Iterator
it implements IntoIterator<CharIter>
. CharIter
is effectively a slightly modified version of the iterator I originally proposed in #85.
I'm just putting some polish onto it right now and then I'll send the PR.
Right, right, not traits to implement, but types to stay similar to.
How about CharRangeInclusive
? Don't we need them for those data table records starting/ending at the edges of char boundaries?
I've just done an inclusive range. The reasoning is that that's how character ranges are traditionally described, such as the regex set [a-z]
, which includes both a
and z
in the set.
If we really want an exclusive range it can be an alternate constructor.
Well, I don't really care much about the exclusive range, but I want us to stay as close to rustc as possible in semantics of the names. If we only have an inclusive range, I say we should call it CharRangeInclusive
.
For easy of use of the shorter syntax, we can have a chars!(start, end)
macro, that returns the inclusive range. This is short, easy to ready, and not breaking consistency with rustc namings.
What do you think?
I've submitted the PR so you can see the code; it checks out but I'm currently expanding docs and writing tests. I wouldn't be unhappy actually making CharRange
and CharRangeInclusive
. The chars!
macro idea is a good one; we can even enable for c in chars!('a'..='z')
with the macro!
Now that we want to be fancy, we can even support the U+xxxx
syntax in the macro, so we can have this:
for c in chars!(U+0061..=U+FFFF)
๐
I'm pretty sure that if we want to do that, we'll need a procedural macro. There's no way within macro_rules!
hygiene to convert U+0061
to '\u{0061}'
.
Yeah, that was mostly a joke. But, if we ever end up writing a procedural macro for building transformation pipelines, we can consider supporting the U+
format.