hexawyz/NetUnicodeInfo

7519 characters are not Unicode valid

Closed this issue · 2 comments

The page Unicode Character Count V13.0 shows 281,392 characters in Total Assigned.
https://www.unicode.org/versions/stats/charcountv13_0.html

However, 288,911 UnicodeCharInfo values (7,519 more) have a _unicodeCharacterDataIndex field that is >= 0, which I assumed meant the character was valid.

The file https://www.unicode.org/Public/UCD/latest/ucd/UCD.zip shows the intervals of all valid Unicode characters.

I've created a program that lists all the invalid UnicodeCharInfo values.
Demo.zip

Perhaps you could incorporate the list in your project.

Examples:

This is what I've done in my own project:

public static IReadOnlyCollection<UnicodeCharInfo> All => _all.Value;
private static readonly Lazy<IReadOnlyCollection<UnicodeCharInfo>> _all
    = new Lazy<IReadOnlyCollection<UnicodeCharInfo>>(() =>
        _list // list of all valid Unicode characters (validUnicodeCharacters)
            .Select(x => UnicodeInfo.GetCharInfo(x))
            .ToList());

Hi,

You should not look at implementation details such as _unicodeCharacterDataIndex. These are not public for a reason.

The correct way to determine if a code point is assigned is to look at its Category. Unassigned code points will report a category of UnicodeCategory.OtherNotAssigned.

See https://unicode-browser.azurewebsites.net/codepoints/2065 (this is still running an older version of the lib, but it should be correct)
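The category check described above can be sketched as a small helper. This is only an illustration: `CodePointChecks` and `IsAssigned` are hypothetical names that are not part of the library, and the sketch assumes the NetUnicodeInfo package (namespace `System.Unicode`) is referenced.

```csharp
using System.Globalization;
using System.Unicode; // NetUnicodeInfo package (assumed to be referenced)

// Hypothetical helper, not part of the library's API.
public static class CodePointChecks
{
    // A code point counts as assigned when its general category is
    // anything other than OtherNotAssigned (Cn).
    public static bool IsAssigned(int codePoint) =>
        UnicodeInfo.GetCharInfo(codePoint).Category != UnicodeCategory.OtherNotAssigned;
}
```

With this helper, `CodePointChecks.IsAssigned(0x2065)` should return false, since U+2065 (the code point linked above) is unassigned, while `CodePointChecks.IsAssigned(0x41)` should return true. It relies only on the public Category property, so it does not depend on private fields.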

That worked!

I changed the validation.

I also forgot to include some categories while reading the UCD.zip file.

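// Enumerable.Range takes (start, count): 0x110000 covers U+0000 through U+10FFFF.
// U+10FFFF is a noncharacter (OtherNotAssigned), so including it does not change the count.
Enumerable.Range(0, 0x110000)
    .Select(x => UnicodeInfo.GetCharInfo(x))
    .Count(x => x.Category != UnicodeCategory.OtherNotAssigned);
// returns 283440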

Which makes sense according to https://www.unicode.org/versions/stats/charcountv13_0.html

  • Total Designated = 283,506
  • Noncharacters = 66
  • Total Designated - Noncharacters = 283,440 valid Unicode characters
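The subtraction above can be double-checked without the library. The sketch below enumerates the 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes) and subtracts them from the Total Designated figure quoted from the Unicode 13.0 count page. `NoncharacterCheck` is a hypothetical name used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; uses no library beyond the BCL.
public static class NoncharacterCheck
{
    // Yields every Unicode noncharacter code point.
    public static IEnumerable<int> Noncharacters()
    {
        // 32 noncharacters in the contiguous block U+FDD0..U+FDEF.
        for (int cp = 0xFDD0; cp <= 0xFDEF; cp++)
            yield return cp;

        // The last two code points of each of the 17 planes (U+nFFFE, U+nFFFF).
        for (int plane = 0; plane <= 0x10; plane++)
        {
            yield return (plane << 16) | 0xFFFE;
            yield return (plane << 16) | 0xFFFF;
        }
    }

    public static void Main()
    {
        const int totalDesignated = 283506; // figure quoted from the Unicode 13.0 count page
        int noncharacters = Noncharacters().Count();

        Console.WriteLine(noncharacters);                   // 66
        Console.WriteLine(totalDesignated - noncharacters); // 283440
    }
}
```

Because noncharacters report a category of OtherNotAssigned, they are the only designated code points that the category filter excludes, which is why the two counts line up.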

Thanks a bunch.