This provides a Haskell library and a CLI executable to determine character encoding (i.e., so-called "charset") from given HTML bytes.
The precendence order for determining the character encoding is:
- A BOM (byte order mark) before any other data in the HTML document itself.
- A
<meta>
declaration with acharset
attribute or anhttp-equiv
attribute set toContent-Type
and a value set forcharset
. Note that it looks at only first 1024 bytes. - Mozilla's Charset Detectors heuristics. To be specific, it delegates to the charsetdetect-ae package, a Haskell implementation of that.
The package is available on Hackage: html-charset.
>>> import Data.ByteString.Lazy
>>> import Text.Html.Encoding.Detection
>>> detect "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd<html><head>..."
Just "UTF-8"
>>> detect "<html><head><meta charset=latin-1>..."
Just "latin-1"
>>> detect "<html><head><title>\xbe\xee\xbc\xad\xbf\xc0\xbc\xbc\xbf\xe4..."
Just "EUC-KR"
Note that the detect
function takes a lazy bytestring, not strict.
Read the API docs for details.
We currently doesn't provide any official binaries. The CLI program can be installed using Cabal or Stack: html-charset.
$ curl https://www.haskell.org/onlinereport/ | html-charset
ASCII
$ curl http://www.bunka.go.jp/kokugo_nihongo/sisaku/joho/joho/ | html-charset
shift_jis
Although it's less likely, html-charset
may fail to determine the character
encoding, and for the case it prints nothing (only a line feed, exactly).
You can customize the string to print when it fails by configuring
-f
/--on-failure
option.
Witten by Hong Minhee. Licensed under LGPL 2.1 or higher.