/specialchars-to-html

Convert special characters to HTML

Primary LanguagePythonMIT LicenseMIT

HTML Character / Entity Converter

Overview

This tool converts special characters to their HTML versions, for example converting "œ" to the string œ. The output file(s) are saved - by default - a filename-processed.txt (keeping the original extension).

Usage

This requires only a standard installation of Python 3. For help, see the Python-Guide.com.

Once Python is installed, download the contents of this repository (either via) git clone or downloading the ZIP from the green "CODE" button dropdown.

Then, go to that directory in your command prompt / terminal; run as follows:

cd ~/Downloads/specialchars-to-html  # Or whatever the download path is
python3 convert.py ~/path/to/source-file.txt # Sometimes "python", not "python3"

Usage Example

usage: convert.py [-h] [--outpath OUTPATH] [--nosuffix] [--debug]
            [--customsuffix CUSTOMSUFFIX] [--dropextension] [--displayonly]
            [--showfilename] source_file [source_file ...]

"Special character" to HTML processor. 
    This tool converts special characters to their HTML versions, e.g. Ӓ.
    The output file(s) are saved - by default - a filename-processed.txt
    (keeping the original extension).
    

positional arguments:
  source_file           The file(s) to process

optional arguments:
  -h, --help            show this help message and exit
  --debug               Display debugging information

File Writing/Saving Options:
  These options let you customize the file saving/naming behavior.

  To save with a custom filetype, combine --dropextension and --customsuffix

  --outpath OUTPATH, -o OUTPATH
                        Write files to the specified output location
  --nosuffix, -S        Don't add "-processed" to filename. Use w/eg. --outpath)
  --customsuffix CUSTOMSUFFIX, -s CUSTOMSUFFIX
                        Set a custom suffix to add to the file when saving.
  --dropextension, -x   Drop the existing file extension

Display / stdout:
  --displayonly, -d     Display the data only; do not write output files
  --showfilename, -n    Display the filename above the output (e.g. if
                        processing multiple files)


Usage Examples:

    convert.py -S example-file.md
        Processes "example-file.md" and saves as "example-file-processed.md" in
        the same directory as the source.

    convert.py -S -o out/ some-file.txt
        Saves processed files to path "out/" w/identical names to input files)

Converting Line Endings

This may be needed in soem circumstances - determine if this is the case with the file command - you should have CRLF and UTF-8, not CR and ASCII

# Install "dos2unix" - like "brew install dos2unix" ?

# mac2unix will convert the "CR" (carriage-return) at the end of each file
# to "LF" (linefeed) - the unix standard.  This is useful even just to do
# examinations of the files, as shown around "Tracking Down Gremlins"

mac2unix < source-file.tsv > file-with-lf.tsv
unix2dos < file-with-lf.tsv > file-with-crlf.tsv

# from inside the convert util path!
mkdir out/
python3 convert.py -S file-with-lf.tsv -o out/

unix2dos < out/file-with-lf.tsv > file-with-new-chars-and-crlf.tsv

Tracking Down Gremlins

Amont other things, VSCode with the "gremins" extension can help.

To find all "upper ascii" (as in, not normal text) characters in a file. This can be run in bash (in the "Terminal" on a Mac, for example).

Note: You may want to ensure the files have LF endings; see mac2unix above. Otherwise, grep will not know where line endings are, and you'll get useless results.

# BEST: Shows all lines with upper ascii / funky character
# Note: may require use without the -P on OSX

# Just show the item name & ID
grep -P '[^\x00-\x7f]' some-source-file-lf.tsv | cut -d$'\t' -f1,3

# Long output (all fields / data) with spacing for readability
grep -P '[^\x00-\x7f]' some-source-file-lf.tsv | sed -z 's/\n/\n\n/g'

# Simple version (LONG output)
grep -P '[^\x00-\x7f]' some-source-file-lf.tsv


### Below are some older versions and versions that select specific data

# Shows all lines (prefixed with line numbers) with upper ascii
grep -anP '[\x80-\xFF]' some-source-file-lf.tsv

# Shows only the first chunks of data, to make it easier to read
# (splits on tabs) - add ",5" to the end to get the whole description
grep -anP '[\x80-\xFF]' some-source-file-lf.tsv | cut -d$'\t' -f 1-3

# Find othr stuff that may not appear above: check if your locale is UTF-8
locale

# grep -  using unicode locale
grep -avx '.*' some-source-file-lf.tsv

# ASSORTED NOTES DO NOT USE

# nonascii() { LANG=C grep -P --color=always '[^\x00-\x7f]'; }
# cat file | nonascii()

External Links & References