
wildebeest

The wildebeest scripts investigate, repair and normalize text for a wide range of issues at the character level.

wb-ana (or wb_analysis.py)

This script searches a tokenized text for a range of potential problems, such as UTF-8 encoding violations, control characters, zero-width characters, letters/numbers/punctuation/letter-modifiers from various scripts (e.g. Latin and Cyrillic), tokens with letters from different scripts, XML tokens, tokens with certain punctuation of interest, orphan letter modifiers, and non-canonical character combinations.

wb-norm (or wb_normalize.py)

This script automatically corrects some of the issues raised by wb-ana. The script can repair common encoding errors, normalize characters into their UTF8-canonical form, map digits and some punctuation to ASCII, delete many non-printable characters and perform other repair, normalization and cleaning steps. A few steps are specific to Pashto, Farsi, or Devanagari (Hindi etc.). Normalization steps can be activated à la carte.

Installation

# Install from PyPI:
pip install wildebeest-nlp

# Alternatively, pip-install from GitHub master branch:
pip install git+https://github.com/uhermjakob/wildebeest.git

# Alternatively, clone GitHub, which might be useful for editing/development:
git clone https://github.com/uhermjakob/wildebeest.git
# or git clone git://github.com/uhermjakob/wildebeest.git
cd wildebeest
pip install --editable .   # run it from dir having setup.py

A pip-install will provide commands wb-norm and wb-ana as well as their alternate forms wb_normalize.py and wb_analysis.py.

After a regular git clone (without pip-install), in order to be able to call the Python scripts wb_normalize.py and wb_analysis.py, make sure that:

  1. wb_normalize.py and wb_analysis.py are executable (i.e. 'x' mode bits are set)
  2. your $PYTHONPATH includes the directory in which this README file resides ("outer wildebeest"), and
  3. your $PATH includes the directory that includes wb_normalize.py and wb_analysis.py ("inner wildebeest")

wb-norm (or wb_normalize.py)

The script repairs common encoding errors, normalizes characters into their canonical form, deletes many non-printable characters and performs other repair, normalization and cleaning steps. The script can be parameterized to include or exclude specific normalization steps (e.g. whether or not to map non-ASCII digits and punctuation to ASCII). A few steps are specific to Pashto, Farsi, or Devanagari (Hindi etc.).

Usage

CLI to normalize a file: wb-norm or wb_normalize.py
usage: wb-norm [-h] [-i INPUT-FILENAME] [-o OUTPUT-FILENAME] [--lc LANGUAGE-CODE] [--skip NORM-STEPS]
               [--add NORM-STEPS] [--all] [--all-except NORM-STEPS] [--only NORM-STEPS] [-v] [--version]
# or wb_normalize.py [-h] ...

Normalizes and cleans a given text

options:
  -h, --help            show this help message and exit
  -i INPUT-FILENAME, --input INPUT-FILENAME
                        (default: STDIN)
  -o OUTPUT-FILENAME, --output OUTPUT-FILENAME
                        (default: STDOUT)
  --lc LANGUAGE-CODE    ISO 639-3, e.g. 'fas' for Persian
  --skip NORM-STEPS     perform all default normalization/cleaning steps except those specified in comma-separated list
                        (default normalization/cleaning steps: repair-encoding-errors,del-surrogate,del-ctrl-char,
                        del-tatweel,core-compat,pres-form,hangul,repair-combining,combining-compose,combining-decompose,
                        repair-xml,repair-url-escapes)
  --add NORM-STEPS      perform all default normalization/cleaning steps plus those specified in comma-separated list 
                        (non-default normalization/cleaning steps: del-zero-width,del-arabic-diacr,del-hebrew-diacr,
                        ligatures,signs-and-symbols,cjk,width,font,small,vertical,enclosure,punct,punct-dash,punct-arabic,
                        punct-cjk,punct-greek,punct-misc-f,space,digit,arabic-char,farsi-char,pashto-char,georgian-char,
                        look-alike,repair-token)
  --all                 perform all normalization/cleaning steps, i.e. repair-encoding-errors,del-surrogate,
                        del-zero-width,del-ctrl-char,del-tatweel,del-arabic-diacr,del-hebrew-diacr,core-compat,pres-form,
                        ligatures,signs-and-symbols,cjk,width,font,small,vertical,enclosure,hangul,repair-combining,
                        combining-compose,combining-decompose,punct,punct-dash,punct-arabic,punct-cjk,punct-greek,
                        punct-misc-f,space,digit,arabic-char,farsi-char,pashto-char,georgian-char,look-alike,repair-xml,
                        repair-url-escapes,repair-token
  --all-except NORM-STEPS
                        perform all normalization/cleaning steps except those specified in comma-separated list
  --only NORM-STEPS     perform only normalization/cleaning steps specified in comma-separated list
  -v, --verbose         write change log etc. to STDERR
  --version             show program's version number and exit

Examples:

wb-norm -h  # for full usage info
wb-norm --version
cd `pip show wildebeest-nlp | grep ^Location | cut -d ' ' -f 2`  # go to directory where wildebeest-nlp is installed
cd wildebeest/test/data
wb-norm --lc fas -i wildebeest-test.txt -o wildebeest-test-norm.txt
wb-norm --lc fas --verbose --skip del-ctrl-char,del-tatweel < wildebeest-test.txt > wildebeest-test-norm-custom.txt
wb-norm --all < wildebeest-test.txt > wildebeest-test-norm-all.txt
wb-norm --all-except del-arabic-diacr,del-hebrew-diacr < wildebeest-test.txt
wb-norm --only del-arabic-diacr,del-hebrew-diacr < wildebeest-test.txt
wb-norm --add del-arabic-diacr,del-hebrew-diacr --skip del-ctrl-char,del-tatweel < wildebeest-test.txt
Same for alternate script name wb_normalize.py
wb_normalize.py -h  # for full usage info
wb_normalize.py --version
cd `pip show wildebeest-nlp | grep ^Location | cut -d ' ' -f 2`
cd wildebeest/test/data
wb_normalize.py --lc fas -i wildebeest-test.txt -o wildebeest-test-norm.txt
wb_normalize.py --lc fas --verbose --skip del-ctrl-char,del-tatweel < wildebeest-test.txt > wildebeest-test-norm-custom.txt
wb_normalize.py --all < wildebeest-test.txt > wildebeest-test-norm-all.txt
wb_normalize.py --all-except del-arabic-diacr,del-hebrew-diacr < wildebeest-test.txt
wb_normalize.py --only del-arabic-diacr,del-hebrew-diacr < wildebeest-test.txt
wb_normalize.py --add del-arabic-diacr,del-hebrew-diacr --skip del-ctrl-char,del-tatweel < wildebeest-test.txt

Note: For robustness with input files that do not fully conform to UTF8, please use -i (rather than STDIN), as it includes UTF8 encoding error handling.

norm_clean_string (Python function call to normalize a string)

Note: When working on a clone (as opposed to a pip-install), please make sure that your $PYTHONPATH includes the directory in which this README file resides.

from wildebeest.wb_normalize import Wildebeest
wb = Wildebeest()
ht = wb.build_norm_step_dict(base='ALL')  # base values: 'NONE', 'DEFAULT', 'ALL' (normalization steps)
# ht = wb.build_norm_step_dict()  # defaults: base = 'DEFAULT', skip = None, add = None
# ht = wb.build_norm_step_dict(base='NONE', add=['digit', 'enclosure'])  # normalize only digits (to ASCII) and enclosures
# ht = wb.build_norm_step_dict(base='DEFAULT', skip=['del-tatweel'], add=['digit', 'space'])
# ht = wb.build_norm_step_dict(base='ALL', skip=['punct-dash', 'enclosure', 'del-arabic-diacr'])
wb.load_look_alike_file()           # optional
print(wb.norm_clean_string('🄐…25km²', ht, lang_code='eng'))
print(wb.norm_clean_string('೧೯೨೩', ht, lang_code='kan'))
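
The same API can be applied line by line to larger inputs. A minimal sketch, assuming a UTF-8 input file corpus.txt and output file corpus-norm.txt (both hypothetical names):

from wildebeest.wb_normalize import Wildebeest

wb = Wildebeest()
ht = wb.build_norm_step_dict(base='DEFAULT', add=['digit', 'space'])
# errors='replace' guards against stray non-UTF8 bytes in the input
with open('corpus.txt', encoding='utf-8', errors='replace') as f_in, \
        open('corpus-norm.txt', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        f_out.write(wb.norm_clean_string(line.rstrip('\n'), ht, lang_code='eng') + '\n')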

Normalization Steps

The script can perform a wide variety of normalization steps; a short Python demonstration follows the step lists below.

  • 12 normalization steps are performed by default, including basic character repair and UTF8 encoding normalization. The default is generally suitable for applications that largely need to preserve the original text.
  • Another 25 normalization steps are available through options --add (list of steps), --all, --all-except (list of steps). The --all and --all-except settings are suitable for many NLP applications.
  • Default normalization steps can be disabled by option --skip (list of steps).
  • Option --only (list of steps) applies only the normalization steps listed (without default normalization steps unless explicitly listed).
  • Option --all-except (list of steps) is equivalent to --all --skip (list of steps); see the Python sketch below.
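
A minimal sketch of how these CLI selections correspond to the build_norm_step_dict arguments from the norm_clean_string section above (the step names are only examples):

from wildebeest.wb_normalize import Wildebeest

wb = Wildebeest()
ht_only = wb.build_norm_step_dict(base='NONE', add=['digit'])            # like --only digit
ht_except = wb.build_norm_step_dict(base='ALL', skip=['punct'])          # like --all-except punct
ht_skip = wb.build_norm_step_dict(base='DEFAULT', skip=['del-tatweel'])  # like --skip del-tatweel
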
List of normalization steps included by default
  • repair-encoding-errors The script generally expects input encoded in UTF8. However, it will recognize and repair some common text encoding errors:
    • (Some) text is still encoded in Windows1252 or Latin1. Any byte that is not part of a well-formed UTF8 character will be interpreted as a Windows1252 character (and mapped to UTF8). This includes printable Latin1 characters as a subset.
    • Text in Windows1252 was incorrectly converted to UTF8 by a Latin1-to-UTF8 converter. This maps Windows1252 characters \x80-\x9F to \u0080-\u009F, which is the Unicode block of C1 control characters. These C1 control characters are extremely rare, so our script will interpret them as ill-converted Windows1252 characters, as do many major software applications such as Google Chrome, Microsoft Outlook, GitHub (text files) and PyCharm (where they are often displayed in a slightly different form).
    • Text in Windows1252 or Latin1 was converted twice, using some combination of a Latin1-to-UTF8 converter and a Windows1252-to-UTF8 converter; or a file already in UTF8 was incorrectly subjected to another conversion (see the sketch after this list). Sample wildebeest repair:
    • Input: Donâ��t tell your â��fiancéâ�� â�� Schöne GrüÃ�e aus Mährenâ�¦ â�� Ma sÅ�ur trouve ça «bête». ¡Coño! â�¬50 â�¢ 25km² â�¢ ½µm
    • Output: Don’t tell your “fiancé” — Schöne Grüße aus Mähren… – Ma sœur trouve ça «bête». ¡Coño! €50 • 25km² • ½µm
  • del-surrogate deletes surrogate characters (representing non-UTF8 characters in input), alternative/backup to windows-1252
  • del-ctrl-char deletes control characters (except tab and linefeed), some variation selectors
  • del-tatweel deletes Arabic tatweel (a text alignment character that increases the distance between Arabic letters)
  • core-compat normalizes Hangul Compatibility characters to Unicode standard Hangul characters
  • pres-form e.g. maps from presentation form (isolated, initial, medial, final) to standard form
  • hangul combines Hangul jamos into Hangul syllables
  • repair-combining e.g. order of nukta/vowel-sign
  • combining-compose e.g. applies combining-modifiers to preceding character, e.g. ö (o + ̈) -> ö
  • combining-decompose e.g. for some Indian characters, splits off Nukta
  • repair-xml e.g. repairs multi-escaped tokens such as &quot; or &amp;#x200C;
  • repair-url-escapes e.g. repairs multi-escaped url substrings such as Jo%25C3%25ABlle_Aubron
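
As an aside, here is a small illustrative sketch (plain Python, not wildebeest's implementation) of how the double-conversion error described above arises and how it can be undone at the byte level:

s = 'Don’t tell your fiancé…'                  # correct UTF8 text
garbled = s.encode('utf-8').decode('cp1252')   # UTF8 bytes misread as Windows1252
print(garbled)                                 # Donâ€™t tell your fiancÃ©â€¦
repaired = garbled.encode('cp1252').decode('utf-8')  # reverse the misreading
print(repaired == s)                           # True

Note that some UTF8 byte sequences contain bytes undefined in Windows1252 (e.g. 0x9D), so robust repair of real-world data is better left to wb-norm's repair-encoding-errors step.
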
List of additional normalization steps included by --all option
  • del-zero-width deletes zero-width characters, byte order mark, directional marks, join marks
  • arabic-char to Arabic canonical forms, e.g. maps Farsi kaf/yeh to Arabic versions
  • farsi-char to Farsi canonical forms, e.g. maps Arabic yeh, kaf to Farsi versions
  • pashto-char to Pashto canonical forms, e.g. maps Arabic kaf to Farsi version
  • georgian-char to Georgian canonical forms, e.g. to standard script, map archaic characters
  • ligatures e.g. decomposes non-Arabic ligatures (e.g. ij, ffi, DŽ, ﬓ)
  • signs-and-symbols e.g. maps symbols (e.g. kappa symbol) and signs (e.g. micro sign µ)
  • cjk e.g. CJK square composites (e.g. ㋀㏾)
  • width e.g. maps fullwidth and halfwidth characters to ASCII, e.g. Ａ to A
  • font maps font-variations characters such as ℂ, ℹ, 𝒜 to regular characters
  • small maps small versions of characters to normal versions, such as small ampersand ﹠ to regular &
  • vertical maps vertical versions of punctuation characters to the normal horizontal version, such as vertical em-dash ︱ to horizontal em-dash —
  • enclosure decomposes circled, squared and parenthesized characters, e.g. 🄐 to (A)
  • del-arabic-diacr e.g. deletes optional Arabic diacritics such as fatha, damma, kasra
  • del-hebrew-diacr e.g. deletes Hebrew points
  • digit e.g. maps decimal-system digits of 54 scripts to ASCII digits
  • punct e.g. maps ellipsis … to periods ... and two-dot-lead ‥ to ..; a few math symbols ∭; ⒛ 🄆
  • punct-dash e.g. maps various dashes, hyphens, minus signs to ASCII hyphen-minus
  • punct-arabic e.g. Arabic exclamation mark etc. to ASCII equivalent
  • punct-cjk e.g. Chinese Ideographic Full Stop etc. to ASCII equivalent
  • punct-greek e.g. Greek question mark etc. to ASCII equivalent
  • punct-misc-f e.g. Tibetan punctuation to ASCII equivalent
  • space e.g. maps non-zero spaces to normal space
  • look-alike normalizes Latin/Cyrillic/Greek look-alike characters, e.g. Latin character A to Greek Α (capital alpha) in an otherwise Greek word
  • repair-token e.g. splits +/-/*/digits off Arabic words; maps not-sign inside Arabic to token-separating hyphen
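
As a quick demonstration of a few of the steps above via the Python API from the norm_clean_string section (a sketch; the expected output is approximate):

from wildebeest.wb_normalize import Wildebeest

wb = Wildebeest()
ht = wb.build_norm_step_dict(base='NONE', add=['width', 'digit', 'punct'])
print(wb.norm_clean_string('Ａ…１２３', ht, lang_code='eng'))  # expected: roughly 'A...123'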

wb-ana (or wb_analysis.py)

This script searches a tokenized text for a range of potential problems, such as UTF-8 encoding violations, control characters, zero-width characters, letters/numbers/punctuation/letter-modifiers from various scripts, tokens with letters from different scripts, XML tokens, tokens with certain punctuation of interest, orphan letter modifiers, and non-canonical character combinations.

Usage

CLI to analyze a file: wb-ana or wb_analysis.py
usage: wb-ana  [-h] [-i INPUT-FILENAME] [--batch BATCH_DIR] [-s] [-o OUTPUT-FILENAME] [-j JSON-OUTPUT-FILENAME] [--file_id FILE_ID]
               [--lc LANGUAGE-CODE] [-v] [-pb] [-n MAX_CASES] [-x MAX_EXAMPLES] [-r REF-FILENAME] [--version]
# or wb_analysis.py  [-h] ... 
  
Analyzes a given text for a wide range of anomalies

options:
  -h, --help            show this help message and exit
  -i INPUT-FILENAME, --input INPUT-FILENAME
                        (default: STDIN)
  --batch BATCH_DIR     Directory with batch of input files (BATCH_DIR/*.txt)
  -s, --summary         single summary line per file
  -o OUTPUT-FILENAME, --output OUTPUT-FILENAME
                        (default: STDOUT)
  -j JSON-OUTPUT-FILENAME, --json JSON-OUTPUT-FILENAME
                        (default: None)
  --file_id FILE_ID
  --lc LANGUAGE-CODE    ISO 639-3, e.g. 'fas' for Persian
  -v, --verbose         write change log etc. to STDERR
  -pb, --progress_bar   Show progress bar
  -n MAX_CASES, --max_cases MAX_CASES
                        max number of cases per group
  -x MAX_EXAMPLES, --max_examples MAX_EXAMPLES
                        max number of examples per line
  -r REF-FILENAME, --ref_id_file REF-FILENAME
                        (optional file with sentence reference IDs)
  --version             show program's version number and exit

Examples:

wb-ana --help
echo 'Hеllο!' | wb-ana                  # 'Hеllο!' mischievously includes a Cyrillic and a Greek character
echo 'Hеllο!' | wb-norm --all | wb-ana  # different result
cd `pip show wildebeest-nlp | grep ^Location | cut -d ' ' -f 2`  # go to directory where wildebeest-nlp is installed
cd wildebeest/test/data
wb-ana -i hello.txt
wb-ana -i wildebeest-test.txt -o wildebeest-test-out
wb-ana --batch phrasebook -s -o phrasebook-dir-out
wb-ana -i phrasebook/deu.txt -r phrasebook/eng.txt -o phrasebook-deu-out
wb-ana -i wildebeest-test-invalid-utf8.txt
Same for alternate script name wb_analysis.py
wb_analysis.py --help
echo 'Hеllο!' | wb_analysis.py
echo 'Hеllο!' | wb_normalize.py --all | wb_analysis.py
cd `pip show wildebeest-nlp | grep ^Location | cut -d ' ' -f 2`
cd wildebeest/test/data
wb_analysis.py -i hello.txt
wb_analysis.py -i wildebeest-test.txt -o wildebeest-test-out
wb_analysis.py --batch phrasebook -s -o phrasebook-dir-out
wb_analysis.py -i phrasebook/deu.txt -r phrasebook/eng.txt -o phrasebook-deu-out
wb_analysis.py -i wildebeest-test-invalid-utf8.txt
wildebeest.wb_analysis.process (Python function call to analyze a string, a list of strings, or a file)

Note: When working on a clone (as opposed to a pip-install), please make sure that your $PYTHONPATH includes the directory in which this README file resides.

import pprint
import sys
import wildebeest.wb_analysis as wb_ana
wb = wb_ana.process(string="Hеllο!")   # "Hеllο!" mischievously includes a Cyrillic and a Greek character
wb.pretty_print(sys.stdout)            # custom pretty-print with OVERVIEW and DETAIL sections to STDOUT
pprint.pprint(wb.analysis)             # generic pretty-print
import wildebeest.wb_analysis as wb_ana
wb = wb_ana.process(strings=["Hеllο!", "Tschüß"])
print(wb.analysis)  # print analysis object (nested dictionary)

Assuming an input file corpus.txt, e.g. built by:

printf 'Hеllο!\nTschüß\n' > corpus.txt
import wildebeest.wb_analysis as wb_ana
wb = wb_ana.process(in_file='corpus.txt')
print(wb.analysis)
import wildebeest.wb_analysis as wb_ana
with open('out.txt', 'w') as out, open('out.json', 'w') as json_file:
    wb_ana.process(in_file='corpus.txt', pp_output=out, json_output=json_file)
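
The JSON file written above can then be loaded back as an ordinary nested structure (a sketch; the exact keys depend on wildebeest's analysis schema):

import json

with open('out.json') as f:
    analysis = json.load(f)
print(list(analysis.keys()))  # top-level sections of the analysis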

wb-analysis.pl

This older Perl script searches a tokenized text for a range of potential problems, such as UTF-8 encoding violations, control characters, non-ASCII punctuation, characters from a variety of language groups, very long tokens, unsplit 's, unsplit punctuation, script mixing; split URLs, email addresses, filenames, XML tokens.

It reports the number of instances in each category and gives examples. Currently available: wildebeest_analysis.pl (Perl) v2.6 (April 28, 2021)