open-i18n/rust-unic

CanonicalCombiningClass should probably be a newtype

CAD97 opened this issue · 5 comments

CAD97 commented

(Tuple struct, whatever)

Blocking on associated consts (we like that don't we), here's the proposed design:

struct CanonicalCombiningClass(u8);

impl CanonicalCombiningClass {
    pub const NOT_REORDERED = CanonicalCombiningClass(0);
    pub const OVERLAY = CanonicalCombiningClass(1);
    pub const NUKTA = CanonicalCombiningClass(7);
    // and so on
}

impl CanonicalCombiningClass {
    pub fn of(ch: char) -> CanonicalCombiningClass;
}

impl CanonicalCombiningClass {
    fn is_not_reordered(&self) -> bool;
}

Since CanonicalCombiningClass should is clearly a distinct concept from a u8, it should be a distinct type. We should use Rust's mechanisms to express that with this zero cost abstraction.

So, this is the area that we haven't spent much time yet. CCC is a Numeric-type property, meaning that it basically represents a number. But, it also is an Enumerable-type property, and that's how it's mostly used in existing code.

The main reason to have it of type Numeric in Unicode is to define ordering for the values. Since we can do that easily for enums in Rust, it's fine to convert it to enum.

I don't think we need newtype here, because enum works very well, and, more importantly, associated_consts just got a limited release and it will take time to make it to stable.

But, for a good use of newtype in the codebase, see the bidi level type here: https://github.com/behnam/rust-unic/blob/master/components/bidi/src/level.rs

I think newtype could be a good pattern for properties of type Numeric, if we don't want to use u8. Let's talk about that when we get there.

What do you think?

CAD97 commented

By my reading Canonical_Combining_Class needs to be a numeric type in UNIC, because

Do not assume that absence of a long symbolic alias implies non-use of a particular Canonical_Combining_Class.

As an example, while 10-199 are assigned as a block to the aliases Ccc10 - Ccc199,

For use in regular expression matching, fixed position classes (ccc=10 through ccc=199) which actually occur in the Unicode Character Database for any version are given predictable aliases of the form "Ccc10", "Ccc11", and so forth.

204, 208, 210, and 212 are present on the Values table but do not have an assigned long alias.

We are guaranteed that

The character property invariants regarding Canonical_Combining_Class guarantee that values, once assigned, will never change, and that all values used will be in the range 0..254.

So it does really seem like this numeric property should be exposed as u8-with-extra-meaning.

It does seem that this is blocked on associated consts, but if we really want to bring this out "unstable" beforehand, we could use a helper mod like we have for abbr_names.

The table also misses Long value for row 199, but the text clearly saying that ccc=199 has a predictable alias as Ccc199. And I believe that's also the case for rows 204, 208, 210, and 212.

I'm going to submit a feedback about this problem in the spec.

That said, I agree that the numbers are actually important here. But, maybe we can use an enum and only have the numeric values as a secondary representation, like the long/abbr names. So, we will have CanonicalCombiningClass::from_number() which gives you the enum variant equivalent to the number.

So, one question would be, are there any common functionalities that works heavily with the numeric values of CCC and therefore we better keep it numeric?

My guess is that the answer is no. There's no numeric operations being done on these values, and all that happens is matching and ordering.

WDYT?

Actually, you're right, there a few CCC that do not have a long name. Those lines are missing in PropertyValueAliases.txt.

So, yeah, let's make CCC a newtype and use long_names and abbr_names modules to define consts.

Here's the list from UCD-10.0.0:

# Canonical_Combining_Class (ccc)

ccc;   0; NR                         ; Not_Reordered
ccc;   1; OV                         ; Overlay
ccc;   7; NK                         ; Nukta
ccc;   8; KV                         ; Kana_Voicing
ccc;   9; VR                         ; Virama
ccc;  10; CCC10                      ; CCC10
ccc;  11; CCC11                      ; CCC11
ccc;  12; CCC12                      ; CCC12
ccc;  13; CCC13                      ; CCC13
ccc;  14; CCC14                      ; CCC14
ccc;  15; CCC15                      ; CCC15
ccc;  16; CCC16                      ; CCC16
ccc;  17; CCC17                      ; CCC17
ccc;  18; CCC18                      ; CCC18
ccc;  19; CCC19                      ; CCC19
ccc;  20; CCC20                      ; CCC20
ccc;  21; CCC21                      ; CCC21
ccc;  22; CCC22                      ; CCC22
ccc;  23; CCC23                      ; CCC23
ccc;  24; CCC24                      ; CCC24
ccc;  25; CCC25                      ; CCC25
ccc;  26; CCC26                      ; CCC26
ccc;  27; CCC27                      ; CCC27
ccc;  28; CCC28                      ; CCC28
ccc;  29; CCC29                      ; CCC29
ccc;  30; CCC30                      ; CCC30
ccc;  31; CCC31                      ; CCC31
ccc;  32; CCC32                      ; CCC32
ccc;  33; CCC33                      ; CCC33
ccc;  34; CCC34                      ; CCC34
ccc;  35; CCC35                      ; CCC35
ccc;  36; CCC36                      ; CCC36
ccc;  84; CCC84                      ; CCC84
ccc;  91; CCC91                      ; CCC91
ccc; 103; CCC103                     ; CCC103
ccc; 107; CCC107                     ; CCC107
ccc; 118; CCC118                     ; CCC118
ccc; 122; CCC122                     ; CCC122
ccc; 129; CCC129                     ; CCC129
ccc; 130; CCC130                     ; CCC130
ccc; 132; CCC132                     ; CCC132
ccc; 133; CCC133                     ; CCC133 # RESERVED
ccc; 200; ATBL                       ; Attached_Below_Left
ccc; 202; ATB                        ; Attached_Below
ccc; 214; ATA                        ; Attached_Above
ccc; 216; ATAR                       ; Attached_Above_Right
ccc; 218; BL                         ; Below_Left
ccc; 220; B                          ; Below
ccc; 222; BR                         ; Below_Right
ccc; 224; L                          ; Left
ccc; 226; R                          ; Right
ccc; 228; AL                         ; Above_Left
ccc; 230; A                          ; Above
ccc; 232; AR                         ; Above_Right
ccc; 233; DB                         ; Double_Below
ccc; 234; DA                         ; Double_Above

Right now we have to do these manually, but I think we can eventually have a generator script that takes PropertyValueAliases.txt and generate the type (or char_property!()) output for us, bringing in human-readable description of each field from a file we maintain locally.

CAD97 commented

You sent that in while I was writing a post! It's below.

Yes, writing the generation would not be very difficult at all on top of what I've got for the generation pipeline currently. Since many values lack names, it seems best to newtype on top of the representation, though a #[repr(u8)] would be possible (though require generating names for all theoretically valid values even without names plus a breaking change whenever a new CCC name was added to the standard).


My quick overview of the related documentation didn't find any direct usages of the numeric values.

Compromise?:

#[repr(u8)]
enum CanonicalCombiningClass {
    // ...
}

Though I would definitely generate this from PropertyValueAliases.txt. That definitely should not be difficult.