A C# implementation of the Unicode grapheme cluster breaking algorithm.
- This library implements the Unicode 10.0 version of the grapheme cluster boundary algorithm.
- In .NET 5.0, StringInfo.GetTextElementEnumerator can enumerate graphemes correctly using the Unicode 13.0 algorithm.
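For readers on .NET 5.0 who want to try the built-in route mentioned above, here is a minimal sketch. The class and method names (`TextElements`, `Enumerate`) are hypothetical helpers introduced for illustration; only `StringInfo.GetTextElementEnumerator`, `MoveNext`, and `GetTextElement` are framework APIs.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class TextElements
{
    // Hypothetical helper: yields each text element of s in order.
    // On .NET 5.0 the text elements follow the grapheme cluster rules.
    public static IEnumerable<string> Enumerate(string s)
    {
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            yield return e.GetTextElement();
        }
    }
}
```

For example, `string.Join(", ", TextElements.Enumerate("e\u0301a"))` should keep the combining acute accent attached to the preceding `e`, giving two elements.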
https://www.nuget.org/packages/GraphemeSplitter/
Install-Package GraphemeSplitter
using GraphemeSplitter;
using static System.Console;
using static System.String;
public partial class Program
{
    // Joins the grapheme clusters of s with ", " so the boundaries are visible.
    static string Split(string s) => Join(", ", s.GetGraphemes());
    static void Main()
    {
        WriteLine(Split("👨‍👨‍👧‍👦👩‍👩‍👧‍👦👨‍👨‍👧‍👦")); // 👨‍👨‍👧‍👦, 👩‍👩‍👧‍👦, 👨‍👨‍👧‍👦
    }
}
This library basically implements http://unicode.org/reports/tr29/.
Example:
type | text | split result |
---|---|---|
diacritical marks | à̡̠́ḅ̢̂̃c̣̤̃̄d̥̦̅̆ | "à̡̠́", "ḅ̢̂̃", "c̣̤̃̄", "d̥̦̅̆" |
variation selector | 葛葛󠄀葛󠄁 | "葛", "葛󠄀", "葛󠄁" |
asian syllable | 안녕하세요 | "안", "녕", "하", "세", "요" |
family emoji | 👨‍👨‍👧‍👦👩‍👩‍👧‍👦👨‍👨‍👧‍👦 | "👨‍👨‍👧‍👦", "👩‍👩‍👧‍👦", "👨‍👨‍👧‍👦" |
emoji skin tone | 👩🏻👱🏼👧🏽👦🏾 | "👩🏻", "👱🏼", "👧🏽", "👦🏾" |
For simplicity, the implementation relaxes rules GB10, GB12, and GB13.
original:
- GB10 … (E_Base | EBG) Extend* × E_Modifier
- GB12 … sot (RI RI)* RI × RI
- GB13 … [^RI] (RI RI)* RI × RI
implemented:
- GB10 … (E_Base | EBG) × Extend
- GB10 … (E_Base | EBG | Extend) × E_Modifier
- GB12/GB13 … RI × RI
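The practical effect of the relaxed rules is that each decision needs only the two adjacent code points, with no lookbehind. The sketch below illustrates that idea and is not the library's actual source: the `Gcb` enum, the `RelaxedRules` class, and the `JoinedByRelaxedRules` method are hypothetical names, and the other TR29 rules are left out.

```csharp
// Hypothetical subset of the Grapheme_Cluster_Break property values.
public enum Gcb { Other, Extend, EBase, EBaseGaz, EModifier, RegionalIndicator }

public static class RelaxedRules
{
    // Returns true when one of the relaxed GB10/GB12/GB13 rules forbids a
    // break between prev and next. A false result only means these three
    // rules do not apply; the remaining TR29 rules are not modelled here.
    public static bool JoinedByRelaxedRules(Gcb prev, Gcb next)
    {
        bool emojiBase = prev == Gcb.EBase || prev == Gcb.EBaseGaz;

        // GB10 (relaxed): (E_Base | EBG) × Extend
        if (emojiBase && next == Gcb.Extend) return true;

        // GB10 (relaxed): (E_Base | EBG | Extend) × E_Modifier
        if ((emojiBase || prev == Gcb.Extend) && next == Gcb.EModifier) return true;

        // GB12/GB13 (relaxed): RI × RI
        return prev == Gcb.RegionalIndicator && next == Gcb.RegionalIndicator;
    }
}
```

Because the second rule accepts any preceding Extend, a modifier is joined even when no emoji base starts the sequence, which the original GB10 would not allow.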
The difference is:
sequence | original | implemented |
---|---|---|
à🏻 (U+61, U+300, U+1F3FB) | × ÷ | × × |
🇯🇵🇺🇸 (U+1F1EF, U+1F1F5, U+1F1FA, U+1F1F8) | × ÷ × | × × × |
(Here ÷ marks a boundary and × marks no boundary.)
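The regional-indicator row can be reproduced with a small sketch. It assumes a run of regional indicators that starts at the beginning of the text or after a non-RI code point. `RegionalIndicatorRuns`, `SpecMarks`, and `RelaxedMarks` are hypothetical names for illustration, not part of the library.

```csharp
using System;
using System.Linq;

public static class RegionalIndicatorRuns
{
    // Marks between adjacent code points in a run of riCount regional
    // indicators under the original GB12/GB13: no break (×) when an odd
    // number of RIs precedes the position, otherwise a break (÷).
    public static string SpecMarks(int riCount) =>
        string.Join(" ", Enumerable.Range(1, Math.Max(riCount - 1, 0))
                                   .Select(i => i % 2 == 1 ? "×" : "÷"));

    // Under the relaxed rule (RI × RI), no position in the run is a boundary.
    public static string RelaxedMarks(int riCount) =>
        string.Join(" ", Enumerable.Repeat("×", Math.Max(riCount - 1, 0)));
}
```

For the four-code-point sequence in the table, `SpecMarks(4)` gives `× ÷ ×` (two flags) and `RelaxedMarks(4)` gives `× × ×` (one cluster).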
This library is influenced by