print-nonascii

Unix CLI that prints lines that contain non-ASCII characters.

print-nonascii: print lines that contain non-ASCII characters.

print-nonascii is a Unix CLI that locates lines in text files or
stdin input that contain non-ASCII characters, which is helpful when
diagnosing character encoding problems.

Lines can be printed as-is and/or with non-ASCII characters shown as abstract representations, in one of the following formats:

  • -v, --caret ... the same representation cat -v uses, based on caret notation.
  • --bash ... per-byte two-digit hex escape sequences such as \xc3.
  • --psh ... PowerShell Unicode escape sequences such as `u{20ac} for the euro sign (€).

Note: --psh only works correctly with properly UTF-8-encoded input.

Line numbers can be prepended on request. When there are multiple input files, the output for each file is preceded by a header identifying that file, by default.

Caveat: For now, no automated tests are run before releases.

Examples

# Create a test file with 1 line containing a non-ASCII character.
$ cat <<'EOF' > /tmp/test.txt
one
twö
three
EOF

# Print only lines that have non-ASCII characters, as-is.
$ print-nonascii /tmp/test.txt
twö

# Print only lines that have non-ASCII characters, with line numbers:
$ print-nonascii -n /tmp/test.txt
2:twö

# Print only lines that have non-ASCII characters, using PowerShell 
# Unicode escape-sequence notation (--psh), preceded by the 
# line as-is (--raw).
# The Unicode code point of character "ö" is U+00F6:
$ print-nonascii --psh --raw /tmp/test.txt
twö
tw`u{f6}

# Ditto, but with per-byte Bash escape sequences (--bash):
$ print-nonascii --bash --raw /tmp/test.txt
twö
tw\xc3\xb6
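The \xhh notation printed by --bash matches the escape syntax that Bash's printf builtin understands, so a representation can be turned back into the original bytes. The following is a sketch that assumes Bash as the shell (other shells' printf may not support \xhh); the helper name decode_bash_escapes is hypothetical, and any other backslashes in the line would be interpreted as escapes too.

```shell
# decode_bash_escapes TEXT (hypothetical helper; requires Bash):
# interpret backslash escapes such as \xc3\xb6 in TEXT and print the result.
# Using %b keeps TEXT out of the format string, so "%" characters are safe.
decode_bash_escapes() {
  printf '%b\n' "$1"
}
```

For instance, decode_bash_escapes 'tw\xc3\xb6' re-creates the line from the test file, provided the terminal uses UTF-8.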

# Simulate input from multiple files by specifying the same file
# twice, so as to show the headers identifying each input file 
# (suppress with -b).
# Note that each header line (invisibly) starts with control 
# character U+0001, so as to allow more predictable
# identification of header lines in the output.
$ print-nonascii -n /tmp/test.txt /tmp/test.txt 
###	/tmp/test.txt
2:twö
###	/tmp/test.txt
2:twö
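Because header lines start with U+0001, a script can tell them apart from matching lines reliably. The sketch below lists the files that produced output; it assumes the header format is exactly U+0001, then ###, then a tab, then the file name, as described in the example above and in the changelog. The function name files_with_hits is hypothetical.

```shell
# files_with_hits (hypothetical helper): read print-nonascii output on stdin
# and print the file name from each header line.
# Assumed header format: U+0001, "###", a tab, then the file name.
files_with_hits() {
  awk '{
    prefix = "\001###\t"
    if (index($0, prefix) == 1) print substr($0, length(prefix) + 1)
  }'
}
```

A possible use: print-nonascii -n file1 file2 | files_with_hits. A matching line that merely begins with ### is not mistaken for a header, since it lacks the leading control character.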

Installation

Prerequisites

  • When installing from the npm registry: macOS and Linux
  • When installing manually: any Unix platform with bash that also has perl installed.

Installation from the npm registry

With Node.js installed, install the package as follows:

[sudo] npm install print-nonascii -g

Note:

  • Even if you don't use Node.js, its package manager, npm, works across platforms and is easy to install; try curl -L https://git.io/n-install | bash
  • Whether you need sudo depends on how you installed Node.js / io.js and whether you've changed permissions later; if you get an EACCES error, try again with sudo.
  • The -g ensures global installation and is needed to put print-nonascii in your system's $PATH.

Manual installation

  • Download the CLI as print-nonascii.
  • Make it executable with chmod +x print-nonascii.
  • Move it or symlink it to a folder in your $PATH, such as /usr/local/bin (macOS) or /usr/bin (Linux).
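The last two steps can be sketched as a small shell function. It assumes the CLI has already been downloaded; the function name install_print_nonascii and its two arguments are hypothetical, and the symlink variant is shown rather than the move.

```shell
# install_print_nonascii SRC DEST_DIR (hypothetical helper):
# make the downloaded script SRC executable and symlink it into DEST_DIR,
# which should be a folder listed in $PATH.
install_print_nonascii() {
  src=$1
  dest_dir=$2
  chmod +x "$src" || return 1
  # Resolve SRC to an absolute path so the symlink works from any directory.
  src_abs=$(cd "$(dirname "$src")" && pwd)/$(basename "$src") || return 1
  ln -sf "$src_abs" "$dest_dir/print-nonascii"
}
```

For a system-wide folder such as /usr/local/bin, the call would typically need to be run with sufficient privileges, for example from a root shell.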

Usage

Find concise usage information below; for complete documentation, read the manual online, or, once installed, run man print-nonascii (print-nonascii --man if installed manually).

$ print-nonascii --help


Prints lines that contain non-ASCII characters.

    print-nonascii [--<mode> [-r]] [-n] [-b] [file ...]
    print-nonascii -q                        [file ...]

    --<mode> prints abstract representations of non-ASCII chars.; one of:
      --caret, -v ... use caret notation, as cat -v would.
      --bash ... represent non-ASCII bytes as \xhh 
      --psh ... (PowerShell) represent non-ASCII Unicode characters as  
                Unicode escape sequences: <backtick>u{h...}
    
    -r, --raw ... with --<mode>, print each matching line as-is too, first.

    -n, --line-number ... prefix the output lines with their line number from  
     the original file, using format "<line-number>:" - decimal line numbers,  
     no padding, no space before or after the ":"

    -b, --bare ... suppress per-input-filename headers

    -q ... quiet mode: produce no output; signal presence of non-ASCII chars.  
           with exit code 0; exit code 100 signals that there are none.

Standard options: --help, --man, --version, --home
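Quiet mode lends itself to scripting, because the result is conveyed through the exit code alone. The sketch below branches on the two documented codes (0 and 100); treating every other exit code as an error is an assumption, and the function name describe_file is hypothetical.

```shell
# describe_file FILE (hypothetical helper): classify FILE based on the
# exit code of print-nonascii -q.
describe_file() {
  print-nonascii -q "$1"
  case $? in
    0)   echo "$1: contains non-ASCII characters" ;;
    100) echo "$1: ASCII only" ;;
    # Assumption: any other exit code indicates a failure of some kind.
    *)   echo "$1: could not be checked" >&2; return 1 ;;
  esac
}
```

Note that exit code 0 means that non-ASCII characters were found, so a plain if print-nonascii -q file; then ... fi runs its body when the file is not pure ASCII.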

License

Copyright (c) 2017 Michael Klement <mklement0@gmail.com> (http://same2u.net), released under the MIT license.

Acknowledgements

This project gratefully depends on the following open-source components, according to the terms of their respective licenses.

npm dependencies below have an optional suffix denoting the type of dependency: the absence of a suffix denotes a required run-time dependency; (D) denotes a development-time-only dependency, (O) an optional dependency, and (P) a peer dependency.

npm dependencies

Changelog

Versioning complies with semantic versioning (semver).

  • v0.0.3 (2017-09-11):

    • [enhancement] Header lines are now only printed for input files that produce at least 1 output line.
  • v0.0.2 (2017-09-10):

    • [fix] Header line is no longer printed twice when --<mode> is combined with --raw.
    • Header line now uses a tab char. to separate prefix ### from the filename.
  • v0.0.1 (2017-09-10):

    • Initial release.