Unable to parse documents with un-quoted attribute values
dreystone opened this issue · 3 comments
dreystone commented
I encountered some minified pages in which the meta and link tags in the head have no quotes around their attribute values.
According to the WHATWG HTML spec, unquoted attribute values are permitted in HTML5:
https://html.spec.whatwg.org/multipage/syntax.html#attributes-2
Here is a code example which fails:
<!DOCTYPE html>
<html lang=en-US>
<head>
<meta charset=utf-8><meta content="IE=edge" http-equiv=X-UA-Compatible>
<meta content=unsafe-url name=referrer>
<link href=/images/favicons/favicon--16x16.png rel=icon sizes=16x16 type=image/png>
</head>
<body>
page contents
</body>
</html>
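For reference, the unquoted attribute value rule from the spec linked above can be sketched as a small check. This is only an illustration under stated assumptions: `canBeUnquoted` and `forbiddenInUnquotedValue` are hypothetical names, not part of SwiftSoup, and the sketch ignores the spec's additional rules about character references and ambiguous ampersands.

```swift
import Foundation

// Scalars the HTML syntax forbids inside an unquoted attribute value:
// ASCII whitespace (space, tab, LF, FF, CR), ", ', =, <, > and `.
let forbiddenInUnquotedValue: Set<Unicode.Scalar> = [
    " ", "\t", "\n", "\u{0C}", "\r", "\"", "'", "=", "<", ">", "`"
]

/// Hypothetical helper (not a SwiftSoup API): returns true when `value`
/// could be written without surrounding quotes.
func canBeUnquoted(_ value: String) -> Bool {
    // An unquoted value must be non-empty and contain no forbidden scalar.
    // Scalars are checked instead of Characters so that "\r\n", which Swift
    // treats as a single Character, is still detected.
    return !value.isEmpty
        && !value.unicodeScalars.contains(where: { forbiddenInUnquotedValue.contains($0) })
}

// Attribute values taken from the example above.
for value in ["utf-8", "X-UA-Compatible", "IE=edge", "/images/favicons/favicon--16x16.png"] {
    print(value, canBeUnquoted(value))
}
```

Under this rule every value in the example may be left unquoted except `IE=edge`, which contains `=`. That matches the minified markup, where `content="IE=edge"` is the only attribute that keeps its quotes.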
scinfu commented
I parsed this HTML without problems.
Can you explain what does not work?
nikolaykargin commented
The latest version of the library parses this snippet correctly. The code and its output are below.
import Foundation
import SwiftSoup
let html = """
<!DOCTYPE html>
<html lang=en-US>
<head>
<meta charset=utf-8><meta content="IE=edge" http-equiv=X-UA-Compatible>
<meta content=unsafe-url name=referrer>
<link href=/images/favicons/favicon--16x16.png rel=icon sizes=16x16 type=image/png>
</head>
<body>
page contents
</body>
</html>
"""
let doc = try SwiftSoup.parse(html)
let metaElements = try doc.select("head *")
for meta in metaElements {
    if let attributes = meta.getAttributes() {
        print(meta.tagName(), attributes.compactMap { "\($0.getKey())=\($0.getValue())" })
    }
}
print(try doc.body()?.text() ?? "–")

Output:

meta ["charset=utf-8"]
meta ["content=IE=edge", "http-equiv=X-UA-Compatible"]
meta ["content=unsafe-url", "name=referrer"]
link ["href=/images/favicons/favicon--16x16.png", "rel=icon", "sizes=16x16", "type=image/png"]
page contents