inikulin/parse5

Wrong HTML string when reading a file encoded as UTF-8 with BOM

faust21 opened this issue · 1 comments

Environment: Node.js
File: a.html
File encoding: UTF-8 with BOM
Parse and serialize code:

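const fs = require('fs');
const { parse, serialize } = require('parse5');

fs.readFile('a.html', 'utf8', (err, data) => {
    if (err) throw err;
    const dom = parse(data);
    const html = serialize(dom);
    // Write the serialized HTML back to the same file.
    fs.writeFile('a.html', Buffer.from(html, 'utf8'), (werr) => {
        if (werr) throw werr;
    });
});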

The source file's content looks like this:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <title>xxx</title>
</head>
<body>
    <div></div>
</body>
</html>

But the serialized content becomes this:

<html><head></head><body>
    <meta charset="UTF-8">
    <title>xxx</title>

    <div></div>
</body></html>

As you can see, the head's content is now inside the body, and the doctype is gone. If the source file is saved as UTF-8 without a BOM, the issue disappears.

Solved: I removed the BOM from the file.
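
If re-saving the file without a BOM is not an option, the BOM can also be dropped in code before parsing. Node's 'utf8' decoding keeps a leading BOM in the string as the character U+FEFF, so a likely explanation for the output above is that the parser sees a non-whitespace character ahead of the doctype. The sketch below is only an illustration: stripBom is a hypothetical helper, not part of parse5, and the sample string stands in for the data read from a.html.

```javascript
// Hypothetical helper (not part of parse5): drop a leading U+FEFF, if present.
function stripBom(text) {
    return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// Stand-in for what fs.readFile('a.html', 'utf8', ...) yields for a file
// saved as UTF-8 with BOM: the BOM survives as the first character.
const withBom = '\uFEFF<!DOCTYPE html><html><head><title>xxx</title></head><body></body></html>';
const cleaned = stripBom(withBom);

console.log(withBom.charCodeAt(0) === 0xFEFF); // true
console.log(cleaned.startsWith('<!DOCTYPE'));  // true
```

In the snippet from the issue, this would mean calling parse(stripBom(data)) instead of parse(data).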