unidump

Dump a ton of information about the Unicode codepoints in a string.

For example, let's try it against some of those fun new composed emoji:

$ unidump 🏃🏾‍♀️👨‍👨‍👧‍👧🇨🇭
'🏃🏾‍♀️'
Composed emoji: 'woman with medium-dark skin tone running'
[
	'🏃'
	RUNNER
	Unicode	U+1f3c3
	UTF-8	f0 9f 8f 83 
	Category	Other_Symbol (So)
	Block	Miscellaneous_Symbols_And_Pictographs (Misc_Pictographs)

	'🏾'
	EMOJI MODIFIER FITZPATRICK TYPE-5
	Unicode	U+1f3fe
	UTF-8	f0 9f 8f be 
	Category	Modifier_Symbol (Sk)
	Block	Miscellaneous_Symbols_And_Pictographs (Misc_Pictographs)

	<unprintable>
	ZERO WIDTH JOINER
	Aliases:
		ZWJ (abbreviation)
	Unicode	U+200d
	UTF-8	e2 80 8d 
	Category	Format (Cf)
	Block	General_Punctuation (Punctuation)

	'♀'
	FEMALE SIGN
	Unicode	U+2640
	UTF-8	e2 99 80 
	Category	Other_Symbol (So)
	Block	Miscellaneous_Symbols (Misc_Symbols)

	'◌️'
	VARIATION SELECTOR-16
	Aliases:
		VS16 (abbreviation)
	Unicode	U+fe0f
	UTF-8	ef b8 8f 
	Category	Nonspacing_Mark (Mn)
	Block	Variation_Selectors (VS)
	Combiner	Single
]

'👨‍👨‍👧‍👧'
Composed emoji: 'family with two fathers and two daughters'
[
	'👨'
	MAN
	Unicode	U+1f468
	UTF-8	f0 9f 91 a8 
	Category	Other_Symbol (So)
	Block	Miscellaneous_Symbols_And_Pictographs (Misc_Pictographs)

	<unprintable>
	ZERO WIDTH JOINER
	Aliases:
		ZWJ (abbreviation)
	Unicode	U+200d
	UTF-8	e2 80 8d 
	Category	Format (Cf)
	Block	General_Punctuation (Punctuation)

	'👨'
	MAN
	Unicode	U+1f468
	UTF-8	f0 9f 91 a8 
	Category	Other_Symbol (So)
	Block	Miscellaneous_Symbols_And_Pictographs (Misc_Pictographs)

	<unprintable>
	ZERO WIDTH JOINER
	Aliases:
		ZWJ (abbreviation)
	Unicode	U+200d
	UTF-8	e2 80 8d 
	Category	Format (Cf)
	Block	General_Punctuation (Punctuation)

	'👧'
	GIRL
	Unicode	U+1f467
	UTF-8	f0 9f 91 a7 
	Category	Other_Symbol (So)
	Block	Miscellaneous_Symbols_And_Pictographs (Misc_Pictographs)

	<unprintable>
	ZERO WIDTH JOINER
	Aliases:
		ZWJ (abbreviation)
	Unicode	U+200d
	UTF-8	e2 80 8d 
	Category	Format (Cf)
	Block	General_Punctuation (Punctuation)

	'👧'
	GIRL
	Unicode	U+1f467
	UTF-8	f0 9f 91 a7 
	Category	Other_Symbol (So)
	Block	Miscellaneous_Symbols_And_Pictographs (Misc_Pictographs)
]

'🇨🇭'
Composed emoji: 'flag of Switzerland'
[
	'🇨'
	REGIONAL INDICATOR SYMBOL LETTER C
	Unicode	U+1f1e8
	UTF-8	f0 9f 87 a8 
	Category	Other_Symbol (So)
	Block	Enclosed_Alphanumeric_Supplement (Enclosed_Alphanum_Sup)

	'🇭'
	REGIONAL INDICATOR SYMBOL LETTER H
	Unicode	U+1f1ed
	UTF-8	f0 9f 87 ad 
	Category	Other_Symbol (So)
	Block	Enclosed_Alphanumeric_Supplement (Enclosed_Alphanum_Sup)
]
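
The flag is not a single codepoint: each regional indicator symbol stands for one ASCII letter, and the pair spells the region code. A minimal Python sketch of that mapping follows. The function name region_code is made up for illustration and is not part of unidump.

```python
# REGIONAL INDICATOR SYMBOL LETTER A; the 26 letters are contiguous from here.
REGIONAL_INDICATOR_A = 0x1F1E6


def region_code(flag: str) -> str:
    """Map a string of regional indicator symbols to ASCII letters.

    Hypothetical helper for illustration only.
    """
    letters = []
    for ch in flag:
        offset = ord(ch) - REGIONAL_INDICATOR_A
        if not 0 <= offset < 26:
            raise ValueError(f"U+{ord(ch):04x} is not a regional indicator symbol")
        letters.append(chr(ord("A") + offset))
    return "".join(letters)


# U+1f1e8 and U+1f1ed are the two codepoints from the dump above.
print(region_code("\U0001F1E8\U0001F1ED"))  # CH
```

Whether a given pair actually renders as a flag depends on the font and platform; the sketch only recovers the letters.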

Or, against a string that recently crashed iOS and OSX:

$ unidump జ్ఞ‌ా
'జ్ఞ'
[
	'జ'
	TELUGU LETTER JA
	Unicode	U+0c1c
	UTF-8	e0 b0 9c 
	Category	Other_Letter (Lo)
	Block	Telugu
	Script	Telugu

	'◌్'
	TELUGU SIGN VIRAMA
	Unicode	U+0c4d
	UTF-8	e0 b1 8d 
	Category	Nonspacing_Mark (Mn)
	Block	Telugu
	Script	Telugu
	Combiner	Single

	'ఞ'
	TELUGU LETTER NYA
	Unicode	U+0c1e
	UTF-8	e0 b0 9e 
	Category	Other_Letter (Lo)
	Block	Telugu
	Script	Telugu
]

'‌ా'
[
	<unprintable>
	ZERO WIDTH NON-JOINER
	Aliases:
		ZWNJ (abbreviation)
	Unicode	U+200c
	UTF-8	e2 80 8c 
	Category	Format (Cf)
	Block	General_Punctuation (Punctuation)

	'◌ా'
	TELUGU VOWEL SIGN AA
	Unicode	U+0c3e
	UTF-8	e0 b0 be 
	Category	Nonspacing_Mark (Mn)
	Block	Telugu
	Script	Telugu
	Combiner	Single
]
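
For a rough idea of where the per-codepoint fields come from, here is a minimal Python sketch built on the standard unicodedata module. It is not how unidump itself works (unidump is Objective-C and requires icu4c); dump_codepoint and dump are hypothetical names. The sketch covers only the name, codepoint, UTF-8 bytes and short category code. It does not cover blocks, scripts, aliases, cluster grouping or composed-emoji descriptions.

```python
import unicodedata


def dump_codepoint(ch: str) -> dict:
    """Collect a few basic facts about a single codepoint.

    Hypothetical helper; field names loosely follow the output above.
    """
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "unicode": f"U+{ord(ch):04x}",
        "utf8": " ".join(f"{b:02x}" for b in ch.encode("utf-8")),
        "category": unicodedata.category(ch),  # short code, e.g. "Lo" or "Mn"
    }


def dump(text: str) -> list:
    """Dump every codepoint in text, in order, with no cluster grouping."""
    return [dump_codepoint(ch) for ch in text]


# The first cluster of the Telugu string above: JA, VIRAMA, NYA.
for info in dump("\u0c1c\u0c4d\u0c1e"):
    print(info["unicode"], info["utf8"], info["category"], info["name"], sep="\t")
```

The names and categories come from whatever Unicode version your Python build ships with, so very new codepoints may show up as unnamed.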

Building

unidump uses a private OSX framework for emoji data. It has only been tested on OSX 10.13; no promises elsewhere.

Requires icu4c. Install with Homebrew: brew install icu4c.

To build:

  • Clone the repository.
  • Run fetch-unicode-data to download the needed data files from unicode.org.
  • Run generateNameMaps.py to process the Obj-C templates.
  • Open the project in Xcode and build.