Web character encodings for Perl

=head1 NAME

Web::Encoding - Web Encodings APIs

=head1 SYNOPSIS

  use Web::Encoding;
  $bytes = encode_web_utf8 $chars;
  $chars = decode_web_utf8 $bytes;

=head1 DESCRIPTION

The C<Web::Encoding> module provides a set of functions to handle
Web-compatible character encodings.

The C<perl-web-encodings> repository also contains the following
modules:

=over 4

=item L<Web::Encoding::UnivCharDet>

A Perl implementation of universalchardet (the universal detector),
which can be used to implement HTML parsers.

=item L<Web::Encoding::Normalization>

An implementation of Unicode's string normalization algorithms, i.e.
NFC, NFD, NFKC, and NFKD.

=item L<Web::Encoding::Preload>

Preloads encoding modules and data files.

=back

=head1 FUNCTIONS

Functions described in these subsections are exported by default.

=head2 Encoding labels and properties of encodings

The following functions handle encoding labels and return properties
of encodings:

=over 4

=item $key = encoding_label_to_name $label

Find the encoding identified by the specified label.  Like the "get
an encoding" steps [ENCODING], this function ignores leading and
trailing spaces and compares labels ASCII case-insensitively.  The
function returns the encoding key (not a name) if found, or C<undef>
otherwise.

=item $key = fixup_html_meta_encoding_name $key

Replace an encoding key for the purpose of HTML character encoding
declaration, as in "prescan a byte stream to determine its encoding"
and "change the encoding" algorithms [HTML].  The argument must be an
encoding key (not a name or label).  The function returns an encoding
key.

=item $key = get_output_encoding_key $key

Return the result of applying the steps to "get an output encoding"
[ENCODING].  The argument must be an encoding key (not a name or
label).  The function returns an encoding key.

=item $name = encoding_name_to_compat_name $key

Convert an encoding key to its official name, as used in e.g.
C<characterSet> or C<inputEncoding> attributes of the C<Document>
interface [ENCODING] [DOM].  The argument must be an encoding key (not
a name or label).  The function returns an encoding name.

=item $boolean = is_ascii_compat_encoding_name $key

Return whether the specified encoding is an ASCII-compatible character
encoding [ENCODING] or not.  The argument must be an encoding key (not
a name or label).

=item $boolean = is_encoding_label $label

Return whether the specified label identifies an encoding [ENCODING]
or not.  It compares labels ASCII case-insensitively.  Unlike the
C<encoding_label_to_name> function, however, this function does not
ignore spaces.

=item $key = locale_default_encoding_name $tag

Return the encoding key (not a name or label) of the default character
encoding for a locale [HTML].  If no default is known for the
specified locale, C<undef> is returned.

The argument, which identifies the locale, must be either a BCP 47
language tag or the string C<*>.  Data is available only for tags
that consist of a primary language subtag alone and for C<zh-TW> and
C<zh-CN>.  The tags are ASCII case-insensitive.  If C<*> is
specified, the function returns the global default encoding, which
can be used when the locale is unknown or has no default.

=back
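
The functions above can be combined as in the following minimal
sketch.  The variable names and the sample label are illustrative
assumptions, not part of the module; the comments only restate the
behavior documented above.

  use Web::Encoding;

  # Label lookup ignores leading and trailing spaces and ASCII case.
  # The result is an encoding key, or undef if the label is unknown.
  my $key = encoding_label_to_name ' UTF-8 ';

  if (defined $key) {
    # Name as used in e.g. the characterSet attribute of a Document.
    my $name = encoding_name_to_compat_name $key;

    # Key of the encoding to be used for output.
    my $output_key = get_output_encoding_key $key;

    # True if the encoding is ASCII-compatible.
    my $is_ascii_compat = is_ascii_compat_encoding_name $key;
  }

  # Unlike encoding_label_to_name, is_encoding_label does not ignore
  # spaces, so a padded label is not expected to match.
  my $is_label = is_encoding_label ' UTF-8 ';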

For the purpose of this module, the B<key> of an encoding is a short
string uniquely identifying the encoding.  It is a lowercased variant
of the encoding name [ENCODING].

Note that the encoding names in the Encoding Standard are not
compatible with the encoding names used by Perl's L<Encode> module.
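
Because these functions take and return keys rather than labels, a
lookup result can be passed on without conversion.  The following is
a rough sketch of choosing a fallback encoding for a locale; C<$tag>,
C<$bytes>, and their values are hypothetical.

  use Web::Encoding;

  my $tag = 'ja';        # hypothetical primary language tag
  my $bytes = 'hello';   # hypothetical input bytes

  # undef is returned when no default is known for the locale, so
  # fall back to the global default identified by "*".
  my $key = locale_default_encoding_name $tag;
  $key = locale_default_encoding_name '*' unless defined $key;

  # The key can be used directly with the decoder function.
  my $chars = decode_web_charset $key, $bytes;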

=head2 Encoders and decoders

The following functions encode and decode strings:

=over 4

=item $bytes = encode_web_utf8 $chars

Encode the character string in UTF-8 and return the encoded bytes.

This function can be used to implement the "UTF-8 encode" operation
[ENCODING].

=item $chars = decode_web_utf8 $bytes

Decode the bytes as UTF-8 and return the decoded character string.
Bad bytes are replaced by U+FFFD characters; decoding never fails.

This function can be used to implement the "UTF-8 decode" operation
[ENCODING].

=item $chars = decode_web_utf8_no_bom $bytes

Decode the bytes as UTF-8 without recognizing a BOM, and return the
decoded character string.  Bad bytes are replaced by U+FFFD
characters; decoding never fails.

This function can be used to implement the "UTF-8 decode without BOM"
operation [ENCODING].

=item $bytes = encode_web_charset $key, $chars

Encode the character string and return the encoded bytes.

The first argument must be the key of the encoding used to encode the
string.

Any character not representable in the encoding is converted to an
HTML decimal character reference for the character.

This function can be used to implement the "encode" operation with
error mode C<html> [ENCODING] [ENCODING16].

=item $chars = decode_web_charset $key, $bytes

Decode the bytes and return the decoded character string.

The first argument must be the key of the encoding used to decode the
bytes.

Bad bytes are replaced by U+FFFD characters; decoding never fails.

This function is equivalent to the following code using
L<Web::Encoding::Decoder>:

  $decoder = Web::Encoding::Decoder->new_from_encoding_key ($key);
  $decoder->ignore_bom (1);
  return $decoder->bytes ($bytes) . $decoder->eof;

=item [$name, $name, ...] = encoding_names

Return the list of the encoding keys (i.e. the lowercase variants of
the encoding names), as an array reference.

=back
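
As a rough sketch of how the encoders and decoders fit together, the
following example passes a string through UTF-8 and through a legacy
encoding.  The sample strings and the choice of the C<windows-1252>
key are assumptions made for illustration only.

  use Web::Encoding;

  my $chars = "caf\x{E9} \x{2603}";   # U+00E9 and U+2603 SNOWMAN

  # UTF-8 round trip.
  my $utf8 = encode_web_utf8 $chars;
  my $back = decode_web_utf8 $utf8;

  # U+2603 is not representable in windows-1252, so it is expected
  # to be emitted as the decimal character reference "&#9731;".
  my $legacy = encode_web_charset 'windows-1252', $chars;

  # Bad bytes become U+FFFD instead of raising an error.
  my $lossy = decode_web_charset 'utf-8', "ab\xFFcd";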

In addition to UTF-8, the following legacy encodings are supported:
IBM866, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6,
ISO-8859-7, ISO-8859-8, ISO-8859-8-I, ISO-8859-10, ISO-8859-13,
ISO-8859-14, ISO-8859-15, ISO-8859-16, KOI8-R, KOI8-U, macintosh,
windows-874, windows-1250, windows-1251, windows-1252, windows-1253,
windows-1254, windows-1255, windows-1256, windows-1257, windows-1258,
x-mac-cyrillic, gb18030, GBK, Big5, EUC-JP, ISO-2022-JP, Shift_JIS,
EUC-KR, x-user-defined, UTF-16BE, UTF-16LE, and replacement.

=head1 SPECIFICATIONS

=over 4

=item ENCODING

Encoding Standard <https://encoding.spec.whatwg.org/>.

=item ENCODING16

UTF-16 encoder
<https://github.com/whatwg/encoding/commit/8360f775c8df145f649047c7d59c5ff733ade112>.

=item HTML

HTML Standard <https://html.spec.whatwg.org/>.

=item DOM

DOM Standard <https://dom.spec.whatwg.org/>.

=item ENCVALID

Encoding Validation
<https://wiki.suikawiki.org/n/Encoding%20Validation>.

=back

=head1 DEPENDENCY

The module requires Perl 5.8 or later.

=head1 AUTHOR

Wakaba <wakaba@suikawiki.org>.

=head1 LICENSE

Copyright 2011-2018 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut