madsen/io-html

Trouble with detecting charset using function find_charset_in

Closed this issue · 3 comments

I am trying to detect the charset with your library when loading a page from this site: http://www.dunloptyres.ru/.

Your code restricts how much of the header it checks:
https://metacpan.org/source/CJM/IO-HTML-1.001/lib/IO/HTML.pm#L199

I tried changing this value to 2048 and it works fine.
Maybe this value needs to be increased, but I think that is the wrong solution to this problem.
Or perhaps add an environment variable that can change this value, or some more clever solution?

The 1024-byte limit comes right out of the HTML 5 standard, section 8.2.2.2:

The authoring conformance requirements for character encoding declarations limit them to only appearing in the first 1024 bytes. User agents are therefore encouraged to use the prescan algorithm below (as invoked by these steps) on the first 1024 bytes, but not to stall beyond that.

dunloptyres.ru is not following the standard; its <meta...charset tag doesn't come until 1798 bytes into the page.
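To check where a page's declaration actually falls, a rough sketch like the following can help. The helper name `charset_offset` is hypothetical, and the regex is only a crude approximation of the standard's prescan algorithm (it does not skip comments or validate attributes), so treat the result as an estimate.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset of the first <meta ...charset
# declaration in a raw (undecoded) page, or undef if there is none.
# This is a crude regex check, not the HTML5 prescan algorithm.
sub charset_offset {
    my ($bytes) = @_;
    return $bytes =~ /<meta\b[^>]*charset/i ? $-[0] : undef;
}

# Synthetic example: 1798 bytes of filler before the declaration.
my $page   = ('x' x 1798) . '<meta charset="utf-8">';
my $offset = charset_offset($page);

if (defined $offset && $offset >= 1024) {
    print "charset declared at byte $offset, past the 1024-byte limit\n";
}
```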

However, I have no objection to making this limit configurable. But are you using IO::HTML directly, or just using some other module that depends on it?

I use HTTP::Message, which has the method decoded_content.

Setting $IO::HTML::bytes_to_check to a larger value should solve this issue. That variable was introduced in version 1.002, and is now in the new 1.004 stable release.
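As a minimal sketch of that suggestion, assuming IO::HTML 1.002 or later is installed, the variable can be raised with `local` so the change stays scoped to one block. The page here is synthetic (padding pushes the declaration past byte 1024), and 2048 is just the value tried earlier in this thread, not a recommended setting.

```perl
use strict;
use warnings;
use IO::HTML 1.002 ();   # $bytes_to_check was introduced in 1.002

# Synthetic page: filler pushes the <meta charset> tag past byte 1024.
my $padding = '<p>' . ('x' x 1100) . '</p>';
my $html    = qq{<html><head>$padding<meta charset="windows-1251"></head></html>};

# With the default limit, the declaration should be out of reach.
my $default = IO::HTML::find_charset_in($html);

# Widen the window for this block only; the old value is restored on exit.
my $wider;
{
    local $IO::HTML::bytes_to_check = 2048;
    $wider = IO::HTML::find_charset_in($html);
}

print "default limit: ", (defined $default ? $default : "not found"), "\n";
print "2048 bytes:    ", (defined $wider   ? $wider   : "not found"), "\n";
```

When the call goes through HTTP::Message, the same variable would need to be set before calling decoded_content.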