wildebeest

Text normalization and cleaning; analysis of types of characters used, encoding issues

normalize.py

The script repairs common encoding errors, normalizes characters into their canonical form, maps digits and some punctuation to ASCII, deletes many non-printable characters, and performs other repair, normalization and cleaning steps. A few steps are specific to Pashto, Farsi, or Devanagari (Hindi etc.). The script is organized into the normalization modules listed below; the --skip argument lets users name any normalization modules they want to skip.

Usage

CLI to normalize a file: python -m wildebeest or its alias wb-norm
python -m wildebeest  [-h] [-i INPUT-FILENAME] [-o OUTPUT-FILENAME] [--lc LANGUAGE-CODE] [--skip NORM-STEPS] [-v] [--version]
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT-FILENAME, --input INPUT-FILENAME
                        (default: STDIN)
  -o OUTPUT-FILENAME, --output OUTPUT-FILENAME
                        (default: STDOUT)
  --lc LANGUAGE-CODE    ISO 639-3, e.g. 'fas' for Persian
  --skip NORM-STEPS     comma-separated list of normalization/cleaning steps to be skipped:
                        repair-encodings-errors,del-surrogate,del-ctrl-char,del-arabic-diacr,
                        del-hebrew-diacr,core-compat,pres-form,ligatures,signs-and-symbols,cjk,
                        width,font,small,vertical,enclosure,hangul,repair-combining,
                        combining-compose,combining-decompose,punct,punct-dash,punct-arabic,
                        punct-cjk,punct-greek,punct-misc-f,space,digit,arabic-char,farsi-char,
                        pashto-char,georgian-char,look-alike,repair-xml,repair-url-escapes,
                        repair-token (default: nothing skipped)
  -v, --verbose         write change log etc. to STDERR
  --version             show program's version number and exit

Example:

python -m wildebeest -i corpus-raw.txt -o corpus-wb.txt --lc eng --skip punct-dash,enclosure,del-arabic-diacr
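
The same normalization can also be run through the wb-norm alias; for example, for Persian input with a verbose change log written to STDERR (the filenames here are placeholders):

wb-norm -i corpus-fas-raw.txt -o corpus-fas-wb.txt --lc fas -v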

Note: Please make sure that your $PYTHONPATH includes the directory in which this README file resides.
Note: For robustness regarding input files that do not fully conform to UTF8, please use -i (rather than STDIN), as it includes UTF8-encoding error handling.

norm_clean_string (Python function call to normalize a string)
from wildebeest.normalize import Wildebeest
wb = Wildebeest()
ht = {}                             # dictionary sets/resets steps to be skipped (default: not skipped)
# ht['SKIP-punct-dash'] = 1         # optionally skip normalization of ndash, mdash etc. to ASCII hyphen-minus.
# ht['SKIP-enclosure'] = 1          # optionally skip 'enclosure' normalization
# ht['SKIP-del-arabic-diacr'] = 1   # optionally skip 'delete arabic diacritic' normalization
wb.load_look_alike_file()           # optional
print(wb.norm_clean_string('🄐…25km²', ht, lang_code='eng'))
print(wb.norm_clean_string('೧೯೨೩', ht, lang_code='kan'))
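# With default settings, the first call should decompose 🄐 to "(A)" and map the ellipsis … to "...",
# and the second call should map the Kannada digits ೧೯೨೩ to ASCII 1923 (expected outputs inferred
# from the normalization-step list below; exact results may vary by version).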

Note: Please make sure that your $PYTHONPATH includes the directory in which this README file resides.
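
To normalize a whole file from Python, the same call can be applied line by line. A minimal sketch, assuming the hypothetical filenames below and using errors='surrogateescape' to mirror the robustness note from the CLI section (neither is part of wildebeest itself):

from wildebeest.normalize import Wildebeest

wb = Wildebeest()
ht = {}                                  # no normalization steps skipped
wb.load_look_alike_file()                # optional
# Read tolerantly: surrogateescape keeps non-UTF8 bytes as surrogate code points,
# which the del-surrogate step can then remove; write normalized lines back out as UTF8.
with open('corpus-raw.txt', encoding='utf-8', errors='surrogateescape') as fin, \
     open('corpus-wb.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(wb.norm_clean_string(line.rstrip('\n'), ht, lang_code='eng') + '\n')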

Installation
# from PyPI (after public release)
pip install wildebeest

# Latest master branch: either https or git/ssh 
pip install git+https://github.com/uhermjakob/wildebeest.git

# For editing/development
git clone https://github.com/uhermjakob/wildebeest.git
cd wildebeest
pip install --editable .   # run from the directory containing setup.py

To call wildebeest after installation, run python -m wildebeest or its alias wb-norm.

List of Normalization Steps

repair-encodings-errors

The script generally expects input encoded in UTF8. However, it will recognize and repair some common text encoding errors:

  • (Some) text is still encoded in Windows1252 or Latin1. Any byte that is not part of a well-formed UTF8 character will be interpreted as a Windows1252 character (and mapped to UTF8). This includes printable Latin1 characters as a subset.
  • Text in Windows1252 was incorrectly converted to UTF8 by a Latin1-to-UTF8 converter. This maps Windows1252 characters \x80-\x9F to \u0080-\u009F, which is the Unicode block of C1 control characters. These C1 control characters are extremely rare, so our script interprets them as ill-converted Windows1252 characters, as do many major software applications such as Google Chrome, Microsoft Outlook, GitHub (text files) and PyCharm (where they are often displayed in a slightly different form).
  • Text in Windows1252 or Latin1 was converted twice, using some combination of a Latin1-to-UTF8 converter and a Windows1252-to-UTF8 converter; or a file already in UTF8 was incorrectly subjected to another conversion (a standard-library sketch of this round trip follows this list). Sample wildebeest repair:
    • Input: Donâ��t tell your â��fiancéâ�� â�� Schöne GrüÃ�e aus Mährenâ�¦ â�� Ma sÅ�ur trouve ça «bête». ¡Coño! â�¬50 â�¢ 25km² â�¢ ½µm
    • Output: Don’t tell your “fiancé” — Schöne Grüße aus Mähren… – Ma sœur trouve ça «bête». ¡Coño! €50 • 25km² • ½µm
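
The "converted twice" case is easy to reproduce and, as long as no bytes were lost, to reverse. Below is a minimal standard-library sketch of that round trip; it is for illustration only and is not wildebeest's internal repair code:

good = "Don’t tell your “fiancé” — Schöne Grüße"
# UTF8 bytes mistakenly decoded as Latin1 and kept as text yield the kind of mojibake shown above:
mojibake = good.encode('utf-8').decode('latin-1')
print(repr(mojibake))   # 'Donâ\x80\x99t tell your â\x80\x9cfiancÃ©â\x80\x9d â\x80\x94 SchÃ¶ne GrÃ¼Ã\x9fe'
# Undoing the mistaken conversion recovers the original text:
print(mojibake.encode('latin-1').decode('utf-8') == good)   # True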

Other normalization modules

  • del-surrogate (deletes surrogate characters, which stand in for non-UTF8 bytes in the input; an alternative/backup to the Windows1252 repair above)
  • del-ctrl-char (deletes control characters (except tab and linefeed), zero-width characters, byte order mark, directional marks, join marks, variation selectors, Arabic tatweel)
  • core-compat (normalizes Hangul Compatibility characters to Unicode standard Hangul characters)
  • arabic-char (to Arabic canonical forms, e.g. maps Farsi kaf/yeh to Arabic versions)
  • farsi-char (to Farsi canonical forms, e.g. maps Arabic yeh, kaf to Farsi versions)
  • pashto-char (to Pashto canonical forms, e.g. maps Arabic kaf to Farsi version)
  • georgian-char (to Georgian canonical forms, e.g. to standard script, map archaic characters)
  • pres-form (e.g. maps from presentation form (isolated, initial, medial, final) to standard form)
  • ligatures (e.g. decomposes non-Arabic ligatures (e.g. ĳ, ﬃ, Ǆ, ﬓ))
  • signs-and-symbols (e.g. maps symbols (e.g. kappa symbol) and signs (e.g. micro sign µ))
  • cjk (e.g. CJK square composites (e.g. ㋀㏾))
  • width (e.g. maps fullwidth and halfwidth characters to ASCII, e.g. Ａ to A)
  • font (maps font-variation characters such as ℂ, ℹ, 𝒜 to regular characters)
  • small (maps small versions of characters to normal versions, such as small ampersand ﹠ to regular &)
  • vertical (maps vertical versions of punctuation characters to their normal horizontal versions, such as vertical em-dash ︱ to horizontal em-dash —)
  • enclosure (decomposes circled, squared and parenthesized characters, e.g. 🄐 to (A))
  • hangul (combines Hangul jamo into Hangul syllables)
  • repair-combining (e.g. order of nukta/vowel-sign)
  • combining-compose (e.g. applies combining-modifiers to preceding character, e.g. ö (o + ̈) -> ö)
  • combining-decompose (e.g. for some Indian characters, splits off Nukta)
  • del-arabic-diacr (e.g. deletes optional Arabic diacritics such as fatha, damma, kasra)
  • del-hebrew-diacr (e.g. deletes Hebrew points)
  • digit (e.g. maps decimal-system digits of 54 scripts to ASCII digits)
  • punct (e.g. maps ellipsis … to periods ... and two-dot-lead ‥ to ..; a few math symbols ∭; ⒛ 🄆 )
  • punct-dash (e.g. maps various dashes, hyphens, minus signs to ASCII hyphen-minus)
  • punct-arabic (e.g. Arabic exclamation mark etc. to ASCII equivalent)
  • punct-cjk (e.g. Chinese Ideographic Full Stop etc. to ASCII equivalent)
  • punct-greek (e.g. Greek question mark etc. to ASCII equivalent)
  • punct-misc-f (e.g. Tibetan punctuation to ASCII equivalent)
  • space (e.g. maps non-zero-width space characters to the normal ASCII space)
  • look-alike (normalizes Latin/Cyrillic/Greek look-alike characters, e.g. Latin character A to Greek Α (capital alpha) in otherwise Greek word)
  • repair-xml (e.g. repairs multi-escaped tokens such as &amp;quot; or &amp;#x200C;)
  • repair-url-escapes (e.g. repairs multi-escaped url substrings such as Jo%25C3%25ABlle_Aubron)
  • repair-token (e.g. splits +/-/*/digits off Arabic words; maps not-sign inside Arabic to token-separating hyphen)
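
Many of the width, font, ligature, enclosure and digit mappings above resemble Unicode compatibility (NFKC) decompositions, which wildebeest packages as separate, individually skippable steps. The following standard-library snippet merely illustrates the character classes involved; it is not wildebeest's implementation:

import unicodedata

for s in ['Ａ１２３',   # fullwidth characters   -> 'A123'
          'ﬃ',          # ligature ffi           -> 'ffi'
          '🄐',          # parenthesized letter   -> '(A)'
          '²']:         # superscript two        -> '2'
    print(s, '->', unicodedata.normalize('NFKC', s))

# Non-ASCII decimal digits carry their numeric value in the Unicode database:
print(unicodedata.digit('೧'))   # 1 (Kannada digit one)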

wb_analysis.py

The script searches a tokenized text for a range of potential problems, such as UTF-8 encoding violations, control characters, zero-width characters, letters/numbers/punctuation/letter-modifiers from various scripts, tokens with letters from different scripts, XML tokens, tokens with certain punctuation of interest, orphan letter modifiers, and non-canonical character combinations.

usage: wb_analysis.py [-h] [-i INPUT-FILENAME] [--batch BATCH] [-s] [-o OUTPUT-FILENAME] [-j JSON-OUTPUT-FILENAME] [--file_id FILE_ID]
                      [--lc LANGUAGE-CODE] [-v] [-pb] [-n MAX_CASES] [-x MAX_EXAMPLES] [-r REF-FILENAME] [--version]

Analyzes a given text for a wide range of anomalies

options:
  -h, --help            show this help message and exit
  -i INPUT-FILENAME, --input INPUT-FILENAME
                        (default: STDIN)
  --batch BATCH_DIR     Directory with batch of input files (BATCH_DIR/*.txt)
  -s, --summary         single summary line per file
  -o OUTPUT-FILENAME, --output OUTPUT-FILENAME
                        (default: STDOUT)
  -j JSON-OUTPUT-FILENAME, --json JSON-OUTPUT-FILENAME
                        (default: None)
  --file_id FILE_ID
  --lc LANGUAGE-CODE    ISO 639-3, e.g. 'fas' for Persian
  -v, --verbose         write change log etc. to STDERR
  -pb, --progress_bar   Show progress bar
  -n MAX_CASES, --max_cases MAX_CASES
                        max number of cases per group
  -x MAX_EXAMPLES, --max_examples MAX_EXAMPLES
                        max number of examples per line
  -r REF-FILENAME, --ref_id_file REF-FILENAME
                        (optional file with sentence reference IDs)
  --version             show program's version number and exit

Sample calls:

wb_analysis.py --help
echo 'Hеllο!' | wb_analysis.py
wb_analysis.py -i test/data/hello.txt
wb_analysis.py -i test/data/wildebeest-test.txt -o test/data/wildebeest-test-out
wb_analysis.py --batch test/data/phrasebook -s -o test/data/phrasebook-dir-out
wb_analysis.py -i test/data/phrasebook/deu.txt -r test/data/phrasebook/eng.txt -o test/data/phrasebook-deu-out
wb_analysis.py -i test/data/wildebeest-test-invalid-utf8.txt

wb-analysis.pl

This older Perl script searches a tokenized text for a range of potential problems, such as UTF-8 encoding violations, control characters, non-ASCII punctuation, characters from a variety of language groups, very long tokens, unsplit 's, unsplit punctuation, script mixing; and split URLs, email addresses, filenames, and XML tokens.

It will report the number of instances in each category and give examples.

Currently available: wildebeest_analysis.pl (Perl) v2.6 (April 28, 2021)