/grapheme-stats

Create simple grapheme statistcs of text files

Primary LanguagePythonGNU Affero General Public License v3.0AGPL-3.0

Simple grapheme statistics of text files

This is a simple Python program that reads a text file and outputs csv table with statistics of all graphemes encountered in that file.

It is meant for debugging different kinds of problems.

Graphemes are composed characters that can consist of one or more characters. For example the grapheme õ can be represented both as LATIN SMALL LETTER O WITH TILDE and LATIN SMALL LETTER O + COMBINING TILDE.

Output statistics

The output statistics file include the grapheme, count, number of codepoints, names of the codepoints. Here is an example output table which could be used for finding problems of double representations of the grapheme õ.

grapheme count number of codepoints codepoint names
õ 3 1 LATIN SMALL LETTER O WITH TILDE
1 2 LATIN SMALL LETTER O; COMBINING TILDE

Usage

Run the program with the command python grapheme-stats.py filename.txt.