Doesn’t handle Unicode signature/byte-order-mark
da2x opened this issue · 1 comment
da2x commented
Nokogiri::HTML5.parse returns the following unexpected errors when the input starts with a Unicode BOM (signature):
1:1: ERROR: Expected a doctype token
<!DOCTYPE html>
^
1:2: ERROR: This is not a legal doctype
<!DOCTYPE html>
^
Expected behaviour: Check the first bytes of the document and detect BOM byte sequence. Set the document encoding to the encoding indicated by the BOM sequence (e.g. UTF-8 or UTF-16 LE). Strip the BOM sequence and proceed with parsing the document as normal.
https://encoding.spec.whatwg.org/#decode
https://html.spec.whatwg.org/#writing
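A minimal sketch of the expected behaviour described above, for illustration only. `BOMS` and `sniff_bom` are hypothetical names, not part of Nokogiri or of any fix; the byte sequences are the three BOMs the Encoding Standard checks for (UTF-8, UTF-16 BE, UTF-16 LE).

```ruby
# Hypothetical helper (not Nokogiri API): detect a leading BOM, label the
# remaining bytes with the indicated encoding, and strip the BOM.
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFE\xFF".b     => Encoding::UTF_16BE,
  "\xFF\xFE".b     => Encoding::UTF_16LE
}.freeze

# Returns [document_without_bom, detected_encoding], or [input, nil] when
# the input does not start with a known BOM.
def sniff_bom(input)
  bytes = input.b # binary copy, so the prefix check is byte-wise
  BOMS.each do |bom, encoding|
    next unless bytes.start_with?(bom)

    rest = bytes.byteslice(bom.bytesize, bytes.bytesize)
    return [rest.force_encoding(encoding), encoding]
  end
  [input, nil]
end

doc, encoding = sniff_bom("\xEF\xBB\xBF<!DOCTYPE html>\n<html></html>")
puts encoding
puts doc
```

The sketch only labels the bytes that follow the BOM; it does not transcode them, and it leaves input without a BOM untouched so that other encoding detection can still run.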
Some test cases:
UTF-8 signature mark:
Nokogiri::HTML5.parse(
"\xEF\xBB\xBF<!DOCTYPE html>\n<html></html>".
force_encoding('UTF-8'),
max_errors: 10).
errors.each { |err| puts(err) }
UTF-16 (BE) byte-order-mark:
Nokogiri::HTML5.parse(
"\xFE\xFF".force_encoding('UTF-16BE') +
"<!DOCTYPE html>\n<html></html>".
encode('UTF-16BE', 'UTF-8'),
max_errors: 10).
errors.each { |err| puts(err) }
UTF-16 (LE) byte-order-mark:
Nokogiri::HTML5.parse(
"\xFF\xFE".force_encoding('UTF-16LE') +
"<!DOCTYPE html>\n<html></html>".
encode('UTF-16LE', 'UTF-8'),
max_errors: 10).
errors.each { |err| puts(err) }
stevecheckoway commented
Thank you for the bug report. I've got a fix that should land soon (assuming all the tests pass).