Is it safe!? Marathon Man - do not click on this if you are squeamish!
- Is this file US-ASCII (7 bit) safe?
- Is this file UTF-8 safe?
- Is this file cp1252 safe?
- Is this file ... safe?
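Each of these questions comes down to whether the raw bytes decode cleanly in the encoding you expect. A minimal sketch of that check (the `is_safe` name is hypothetical and not taken from the tools below):

```python
def is_safe(raw, encoding):
    """Return True if the bytes in `raw` decode cleanly as `encoding`."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


sample = b"Euro symbol: \x80"
for name in ("us-ascii", "utf-8", "cp1252"):
    print(name, is_safe(sample, name))
```

The byte 0x80 is not valid us-ascii, is not a valid utf-8 start byte, and is the Euro symbol in cp1252, so only the cp1252 check passes for this sample.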
Python 3.x and 2.7 (maybe 2.x?) tools:
- unicode_dentist.py - Python 3 or 2 - finds invalid bytes in a single file and reports statistics and context
- wildcard_unicode_dentist.py - same as above, but handles a list of files (potentially with wildcards, even under Microsoft Windows)
- line_check.py - displays the lines that are not in the expected encoding, with their line numbers and a final count of "bad" lines
- validate_utf8.py - Python 3 or 2 - checks whether files are encoded correctly and reports a simple pass or fail; use unicode_dentist.py or line_check.py to locate the problem(s)
- force_into_utf8.py - Python 3 ONLY - forces a file from its current encoding (possibly mojibake mixing cp1252 and utf-8) into utf-8
- asciinator.py - Python 3 or 2 - attempts to force a file into 7-bit us-ascii (potentially run force_into_utf8.py first, then asciinator.py)
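To illustrate the last two tools, here is one possible way to repair a file that mixes cp1252 and utf-8 and then reduce it to 7-bit ASCII. This is a Python 3 sketch under stated assumptions, not the code in force_into_utf8.py or asciinator.py: the `force_to_utf8` and `to_ascii` helpers and the `cp1252_fallback` handler name are hypothetical.

```python
import codecs
import unicodedata


def _cp1252_fallback(error):
    """Decode error handler: reinterpret undecodable bytes as cp1252."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Bytes undefined in Python's cp1252 codec (0x81, 0x8d, 0x8f, 0x90, 0x9d)
    # become U+FFFD instead of raising a second error.
    return bad.decode("cp1252", errors="replace"), error.end


# Hypothetical handler name, registered for use with bytes.decode().
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def force_to_utf8(raw):
    """Decode as utf-8, falling back to cp1252 for invalid bytes; re-encode as utf-8."""
    return raw.decode("utf-8", errors="cp1252_fallback").encode("utf-8")


def to_ascii(text):
    """Strip accents via NFKD decomposition and drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore")


mixed = b"Euro: \x80, copyright: \xc2\xa9"  # cp1252 Euro next to a utf-8 copyright sign
fixed = force_to_utf8(mixed)
print(fixed.decode("utf-8"))
print(to_ascii(fixed.decode("utf-8")))
```

One limitation of this approach: cp1252 bytes that happen to form a valid utf-8 sequence are read as utf-8, so the repair is a best guess. The ASCII step is lossy, since characters with no ASCII decomposition (such as the Euro and copyright symbols) are dropped.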
Running line_check.py on this README:
$ ./line_check.py us-ascii README.md
README.md:
28:'Euro symbol: \x80'
29:'Copyright symbol: \xa9'
30:'Euro symbol (again): \x80'
48:' README.md:10: 128 \x80 @ 152'
50:' Euro symbol: \x80'
52:' README.md:11: 169 \xa9 @ 171'
54:' Copyright symbol: \xa9'
56:' README.md:12: 128 \x80 @ 193'
58:' Euro symbol (again): \x80'
61:' 169 0xa9 not valid us-ascii character \xa9 occurrences 1'
62:' 128 0x80 not valid us-ascii character \x80 occurrences 2'
11 problem lines in README.md
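The report above can be approximated by decoding a file one line at a time and keeping the lines that fail. The sketch below is an assumption about the general approach, not the source of line_check.py. The `find_bad_lines` name is hypothetical, and it only suits ASCII-compatible encodings where a newline is a single byte.

```python
def find_bad_lines(raw, encoding="us-ascii"):
    """Return (line_number, line_bytes) for every line that fails to decode."""
    bad = []
    # bytes.splitlines() breaks on \n, \r and \r\n and drops the line endings.
    for number, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


sample = b"plain\nEuro symbol: \x80\nfine\nCopyright symbol: \xa9\n"
problems = find_bad_lines(sample)
for number, line in problems:
    print("%d:%r" % (number, line))
print("%d problem lines" % len(problems))
```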
Example usage:
python unicode_dentist.py unicode_dentist.py
python unicode_dentist.py README.md
Euro symbol: €
Copyright symbol: ©
Euro symbol (again): €
Sample output for a pure US-ASCII file:
C:\tmp>unicode_dentist.py unicode_dentist.py
['C:\\tmp\\unicode_dentist.py', 'unicode_dentist.py']
2
expected_encoding: us-ascii
'unicode_dentist.py' is valid us-ascii
Sample output for a file containing non-ASCII bytes:
C:\>unicode_dentist.py README.md
['C:\\unicode_dentist.py', 'README.md']
2
expected_encoding: us-ascii
us-ascii '\xa9'
us-ascii '\x80'
README.md:10: 128 € @ 152
'Euro symbol: \x80'
Euro symbol: €
README.md:11: 169 © @ 171
'Copyright symbol: \xa9'
Copyright symbol: ©
README.md:12: 128 € @ 193
'Euro symbol (again): \x80'
Euro symbol (again): €
========= character table =========
169 0xa9 not valid us-ascii character © occurrences 1
128 0x80 not valid us-ascii character € occurrences 2
=================================================================
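A character table like the one above can be built by catching each decode error, recording the offending byte, and resuming just after it. This is a rough sketch under that assumption, not the logic in unicode_dentist.py, and the `invalid_byte_counts` name is hypothetical.

```python
from collections import Counter


def invalid_byte_counts(raw, encoding="us-ascii"):
    """Count the byte values that stop `raw` from decoding as `encoding`."""
    data = bytearray(raw)  # indexing a bytearray yields ints on Python 2 and 3
    counts = Counter()
    position = 0
    while position < len(data):
        try:
            data[position:].decode(encoding)
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as error:
            bad_at = position + error.start  # error.start is relative to the slice
            counts[data[bad_at]] += 1
            position = bad_at + 1  # resume just after the offending byte
    return counts


sample = b"Euro symbol: \x80\nCopyright symbol: \xa9\nEuro symbol (again): \x80\n"
for value, occurrences in invalid_byte_counts(sample).items():
    print("%d 0x%02x occurrences %d" % (value, value, occurrences))
```

Re-decoding the remaining slice after every error keeps the sketch short but is slow on large files with many bad bytes. For multi-byte encodings such as utf-8, each byte of a malformed sequence may be counted separately.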
env FILE_ENCODING=us-ascii DO_NOT_STOP=true ./validate_utf8.py
env FILE_ENCODING=us-ascii DO_NOT_STOP=true ./validate_utf8.py README.md validate_utf8.py
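The commands above configure the check through environment variables. The sketch below shows one way such settings could be consumed. It is a guess at the behaviour rather than the code in validate_utf8.py: it assumes `FILE_ENCODING` names the expected encoding (defaulting to utf-8) and that `DO_NOT_STOP=true` means "keep checking after the first failure". The `validate_files` name is hypothetical.

```python
import os


def validate_files(paths, environ=None):
    """Return the paths whose contents do not decode in the expected encoding."""
    if environ is None:
        environ = os.environ
    # Assumed semantics, not confirmed by the tool itself:
    #   FILE_ENCODING - expected encoding, utf-8 if unset
    #   DO_NOT_STOP   - "true" means continue after the first bad file
    encoding = environ.get("FILE_ENCODING", "utf-8")
    keep_going = environ.get("DO_NOT_STOP", "").lower() == "true"
    failures = []
    for path in paths:
        with open(path, "rb") as handle:
            raw = handle.read()
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            failures.append(path)
            if not keep_going:
                break
    return failures
```

A command-line wrapper could pass the file names from `sys.argv[1:]` to this function and exit with a non-zero status when the returned list is not empty.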