7519 characters are not valid Unicode
The page Unicode Character Count V13.0 shows 281,392 characters in Total Assigned.
https://www.unicode.org/versions/stats/charcountv13_0.html
However, there are 288,911 (7,519 more) UnicodeCharInfo values whose _unicodeCharacterDataIndex field is >= 0 (which I thought meant the character was valid).
The file https://www.unicode.org/Public/UCD/latest/ucd/UCD.zip shows the intervals of all valid Unicode characters.
I've created a program to show all invalid UnicodeCharInfo's.
Demo.zip
Perhaps you could incorporate the list in your project.
Examples:
- https://www.fileformat.info/info/unicode/char/2065/index.htm
  - UnicodeInfo.GetCharInfo(0x2065)
  - _unicodeCharacterDataIndex = 7375
- https://www.fileformat.info/info/unicode/char/10FFFE/index.htm
  - UnicodeInfo.GetCharInfo(0x10FFFE)
  - _unicodeCharacterDataIndex = 33840
This is what I've done in my own project:
public static IReadOnlyCollection<UnicodeCharInfo> All => _all.Value;
private static readonly Lazy<IReadOnlyCollection<UnicodeCharInfo>> _all =
    new Lazy<IReadOnlyCollection<UnicodeCharInfo>>(() =>
        _list // list of all valid Unicode characters (validUnicodeCharacters)
            .Select(x => UnicodeInfo.GetCharInfo(x))
            .ToList());
Hi,
You should not look at implementation details such as _unicodeCharacterDataIndex. These are not public for a reason.
The correct way to determine if a code point is assigned is to look at its Category. Unassigned code points will report a category of UnicodeCategory.OtherNotAssigned.
See https://unicode-browser.azurewebsites.net/codepoints/2065 (this is still running an older version of the lib, but it should be correct)
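For anyone who wants to try the category check described above without pulling in the library, here is a minimal sketch. It is an assumption-laden stand-in: it uses .NET's built-in CharUnicodeInfo.GetUnicodeCategory(int) (available since .NET Core 3.0) instead of UnicodeInfo.GetCharInfo, so the Unicode version it reflects is whatever the runtime ships, not necessarily 13.0. The AssignedCheck class and IsAssigned helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Globalization;

public static class AssignedCheck
{
    // Hypothetical helper: true when the code point's general category
    // is anything other than Cn (OtherNotAssigned).
    // Assumption: uses the runtime's built-in Unicode data, not this library's.
    public static bool IsAssigned(int codePoint) =>
        CharUnicodeInfo.GetUnicodeCategory(codePoint) != UnicodeCategory.OtherNotAssigned;

    public static void Main()
    {
        // U+0041 is a letter; U+2065 and U+10FFFE are the two examples from this issue.
        foreach (int cp in new[] { 0x41, 0x2065, 0x10FFFE })
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(cp);
            Console.WriteLine($"U+{cp:X4}: {category}, assigned = {IsAssigned(cp)}");
        }
    }
}
```

With the library itself, the equivalent test would presumably be UnicodeInfo.GetCharInfo(codePoint).Category != UnicodeCategory.OtherNotAssigned, which only relies on the public Category property.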
That worked!
I changed the validation.
I also forgot to include some categories while reading the UCD.zip file.
// The second argument of Range is a count, so 0x110000 covers U+0000..U+10FFFF.
Enumerable.Range(0, 0x110000)
    .Select(x => UnicodeInfo.GetCharInfo(x))
    .Where(x => x.Category != UnicodeCategory.OtherNotAssigned)
    .Count();
// returns 283440
Which makes sense according to https://www.unicode.org/versions/stats/charcountv13_0.html
- Total Designated = 283,506
- Noncharacters = 66
- Total Designated - Noncharacters = 283,506 - 66 = 283,440 code points with a category other than OtherNotAssigned
Thanks a bunch.