Doesn’t handle Unicode signature/byte-order-mark

Question

da2x opened this issue 5 years ago · 1 comment

Nokogiri::HTML5.parse reports the following unexpected errors when the input starts with a Unicode signature (byte order mark, BOM):

1:1: ERROR: Expected a doctype token
<!DOCTYPE html>
^
1:2: ERROR: This is not a legal doctype
<!DOCTYPE html>
 ^

Expected behaviour: check the first bytes of the document for a BOM byte sequence. If one is present, set the document encoding to the encoding it indicates (e.g. UTF-8 or UTF-16LE), strip the BOM, and parse the rest of the document as normal.

https://encoding.spec.whatwg.org/#decode
https://html.spec.whatwg.org/#writing
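
For illustration, the expected behaviour could be sketched in Ruby roughly as follows. This is only a sketch, not how the parser does or should implement it: `BOMS` and `sniff_bom` are hypothetical names that are not part of Nokogiri, and it covers only the three BOMs listed in the Encoding Standard (UTF-8, UTF-16BE, UTF-16LE).

```ruby
# Hypothetical helper (not a Nokogiri API): detect a leading BOM,
# report the encoding it indicates, and return the input without it.
BOMS = [
  ["\xEF\xBB\xBF".b, Encoding::UTF_8],
  ["\xFE\xFF".b,     Encoding::UTF_16BE],
  ["\xFF\xFE".b,     Encoding::UTF_16LE],
].freeze

def sniff_bom(input)
  bytes = input.b # binary copy, so the comparison is byte-wise
  BOMS.each do |bom, encoding|
    next unless bytes.start_with?(bom)

    # Drop the BOM bytes and tag the remainder with the detected encoding.
    body = bytes.byteslice(bom.bytesize..-1).force_encoding(encoding)
    return [encoding, body]
  end
  [nil, input] # no BOM found: leave the input untouched
end
```

With something like this, `encoding, body = sniff_bom(raw)` would yield the BOM-less document in `body`, which could then be transcoded to UTF-8 before parsing when `encoding` is one of the UTF-16 variants.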

Some test cases:

UTF-8 signature mark:

Nokogiri::HTML5.parse(
  "\xEF\xBB\xBF<!DOCTYPE html>\n<html></html>".
  force_encoding('UTF-8'),
  max_errors: 10).
errors.each { |err| puts(err) }

UTF-16 (BE) byte-order-mark:

Nokogiri::HTML5.parse(
    "\xFE\xFF".force_encoding('UTF-16BE') +
    "<!DOCTYPE html>\n<html></html>".
    encode('UTF-16BE', 'UTF-8'),
    max_errors: 10).
errors.each { |err| puts(err) }

UTF-16 (LE) byte-order-mark:

Nokogiri::HTML5.parse(
    "\xFF\xFE".force_encoding('UTF-16LE') +
    "<!DOCTYPE html>\n<html></html>".
    encode('UTF-16LE', 'UTF-8'),
    max_errors: 10).
errors.each { |err| puts(err) }

Answer 1 · 2020-01-09T19:34:05.000Z

Thank you for the bug report. I've got a fix that should land soon (assuming all the tests pass).