Unicode Character Inspector & Library providing a subset of the Unicode data for .NET clients.

Primary language: C#. License: MIT.

.NET Unicode Information Library

Summary

This project consists of a library that provides access to some of the data contained in the Unicode Character Database.

Version of Unicode supported

Unicode 13.0, Emoji 13.0

Breaking changes from versions 1.x to 2.x

UnicodeRadicalStrokeCount.StrokeCount is now of type System.SByte instead of type System.Byte.

Using the library

Reference the NuGet package

Grab the latest version of the package on NuGet: https://www.nuget.org/packages/UnicodeInformation/. Once the library is installed in your project, you will find everything you need in the System.Unicode namespace.

Basic information

Everything provided by the library lives in the System.Unicode namespace. The XML documentation should be complete enough for you to navigate the API without getting lost.

In its current state, the project is written in C# 7.3, compilable by Roslyn, and targets both .NET Standard 2.0 and .NET Standard 1.1. The library UnicodeInformation includes a (large) subset of the official Unicode Character Database stored in a custom file format.

Example usage

The following program displays information about a few characters:

using System;
using System.Text;
using System.Unicode;

namespace Example
{
	internal static class Program
	{
		private static void Main()
		{
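			// Switch the console to UTF-16 output so that non-ASCII characters can be written.
			Console.OutputEncoding = Encoding.Unicode;
			// A char converts implicitly to int, so character literals can be passed directly.
			PrintCodePointInfo('A');
			PrintCodePointInfo('∞');
			// Code points above U+FFFF do not fit in a single char and are passed as integers.
			PrintCodePointInfo(0x1F600);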
		}

		private static void PrintCodePointInfo(int codePoint)
		{
			var charInfo = UnicodeInfo.GetCharInfo(codePoint);
			Console.WriteLine(UnicodeInfo.GetDisplayText(charInfo));
			Console.WriteLine("U+" + codePoint.ToString("X4"));
			Console.WriteLine(charInfo.Name ?? charInfo.OldName);
			Console.WriteLine(charInfo.Category);
		}
	}
}

Explanations:

  • UnicodeInfo.GetCharInfo(int) returns a UnicodeCharInfo structure that provides access to various bits of information associated with the specified code point.
  • UnicodeInfo.GetDisplayText(UnicodeCharInfo) is a helper method that computes a display text for the specified code point. Since some code points are not designed to be displayed on their own, it tries to make the specified character more displayable. The algorithm used is quite simplistic and only affects very specific code points (e.g. control characters). For most code points, it simply returns the direct string representation.
  • UnicodeCharInfo.Name returns the name of the code point as specified by the Unicode standard. Please note that some characters (e.g. control characters) have, by design, no name assigned to them in the standard. Those characters may, however, have alternate names assigned to them that you can use as fallbacks (e.g. UnicodeCharInfo.OldName).
  • UnicodeCharInfo.OldName returns the name of the character as defined in Unicode 1.0, when applicable and different from the current name.
  • UnicodeCharInfo.Category returns the category assigned to the specified code point.
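
Because UnicodeInfo.GetCharInfo takes an int code point rather than a char, text held in a .NET string (UTF-16) has to be split into code points first. The following is a minimal sketch of one way to do that using only the base class library; the CodePointDemo class and its EnumerateCodePoints and Format helpers are hypothetical names made up for this illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;

internal static class CodePointDemo
{
	// Hypothetical helper (not part of UnicodeInformation):
	// yields the code points of a UTF-16 string one by one.
	public static IEnumerable<int> EnumerateCodePoints(string text)
	{
		for (int i = 0; i < text.Length; i++)
		{
			// Combines a surrogate pair into a single code point.
			// Throws ArgumentException if the string contains a lone surrogate.
			int codePoint = char.ConvertToUtf32(text, i);
			// A code point above U+FFFF used two UTF-16 code units: skip the second one.
			if (char.IsHighSurrogate(text[i])) i++;
			yield return codePoint;
		}
	}

	// Same "U+XXXX" formatting as in the example program above.
	public static string Format(int codePoint) => "U+" + codePoint.ToString("X4");

	private static void Main()
	{
		// "A", INFINITY (U+221E) and GRINNING FACE (U+1F600), written with escapes.
		foreach (int codePoint in EnumerateCodePoints("A\u221E\U0001F600"))
		{
			// Each value here could be handed to UnicodeInfo.GetCharInfo(codePoint).
			Console.WriteLine(Format(codePoint));
		}
	}
}
```

Each value produced this way can be passed to a method such as PrintCodePointInfo from the example program.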

Included Properties

From UCD

  • Name
  • General_Category
  • Canonical_Combining_Class
  • Bidi_Class
  • Decomposition_Type
  • Decomposition_Mapping
  • Numeric_Type (See also kAccountingNumeric/kOtherNumeric/kPrimaryNumeric. Those will set Numeric_Type to Numeric.)
  • Numeric_Value
  • Bidi_Mirrored
  • Unicode_1_Name
  • Simple_Uppercase_Mapping
  • Simple_Lowercase_Mapping
  • Simple_Titlecase_Mapping
  • Name_Alias
  • Block
  • ASCII_Hex_Digit
  • Bidi_Control
  • Dash
  • Deprecated
  • Diacritic
  • Extender
  • Hex_Digit
  • Hyphen
  • Ideographic
  • IDS_Binary_Operator
  • IDS_Trinary_Operator
  • Join_Control
  • Logical_Order_Exception
  • Noncharacter_Code_Point
  • Other_Alphabetic
  • Other_Default_Ignorable_Code_Point
  • Other_Grapheme_Extend
  • Other_ID_Continue
  • Other_ID_Start
  • Other_Lowercase
  • Other_Math
  • Other_Uppercase
  • Pattern_Syntax
  • Pattern_White_Space
  • Quotation_Mark
  • Radical
  • Soft_Dotted
  • STerm
  • Terminal_Punctuation
  • Unified_Ideograph
  • Variation_Selector
  • White_Space
  • Lowercase
  • Uppercase
  • Cased
  • Case_Ignorable
  • Changes_When_Lowercased
  • Changes_When_Uppercased
  • Changes_When_Titlecased
  • Changes_When_Casefolded
  • Changes_When_Casemapped
  • Alphabetic
  • Default_Ignorable_Code_Point
  • Grapheme_Base
  • Grapheme_Extend
  • Grapheme_Link
  • Math
  • ID_Start
  • ID_Continue
  • XID_Start
  • XID_Continue
  • Unicode_Radical_Stroke (This is actually kRSUnicode from the Unihan database)
  • Code point cross references extracted from NamesList.txt

NB: The UCD property ISO_Comment will never be included, since it is empty in all new Unicode versions.

From Unicode Emoji

  • Emoji
  • Emoji_Presentation
  • Emoji_Modifier
  • Emoji_Modifier_Base
  • Emoji_Component
  • Extended_Pictographic

From Unihan

  • kAccountingNumeric
  • kOtherNumeric
  • kPrimaryNumeric
  • kRSUnicode
  • kDefinition
  • kMandarin
  • kCantonese
  • kJapaneseKun
  • kJapaneseOn
  • kKorean
  • kHangul
  • kVietnamese
  • kSimplifiedVariant
  • kTraditionalVariant

Regenerating the data

The project UnicodeInformation.Builder takes care of generating a file named ucd.dat. This file contains Unicode data compressed by .NET's deflate algorithm, and should be included in UnicodeInformation.dll at compilation.
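
To illustrate what deflate compression in .NET looks like, here is a minimal round-trip sketch based on System.IO.Compression.DeflateStream. It only demonstrates the general technique: the DeflateDemo class and the sample payload are assumptions made for this illustration, and the sketch reproduces neither the builder's actual code nor the layout of ucd.dat.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

internal static class DeflateDemo
{
	// Compresses a buffer with the deflate implementation shipped with .NET.
	public static byte[] Compress(byte[] data)
	{
		using (var output = new MemoryStream())
		{
			// Disposing the DeflateStream flushes the remaining compressed bytes.
			using (var deflate = new DeflateStream(output, CompressionLevel.Optimal))
			{
				deflate.Write(data, 0, data.Length);
			}
			// MemoryStream.ToArray still works after the stream has been closed.
			return output.ToArray();
		}
	}

	// Restores the original bytes from a deflate-compressed buffer.
	public static byte[] Decompress(byte[] compressed)
	{
		using (var input = new MemoryStream(compressed))
		using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
		using (var output = new MemoryStream())
		{
			deflate.CopyTo(output);
			return output.ToArray();
		}
	}

	private static void Main()
	{
		// Illustrative payload only; the real ucd.dat uses a custom binary format.
		byte[] original = Encoding.UTF8.GetBytes(
			"LATIN CAPITAL LETTER A;LATIN CAPITAL LETTER B;LATIN CAPITAL LETTER C");
		byte[] restored = Decompress(Compress(original));
		Console.WriteLine(Encoding.UTF8.GetString(restored));
	}
}
```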