UTF-8 issue when trying to create a DOM document
vaso123 opened this issue · 1 comments
I have a page fetched with cURL whose charset is windows-1250 and whose doctype is
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
I convert the string's encoding, check it, and replace the meta charset in the string:
$html = str_replace('windows-1250', 'UTF-8', mb_convert_encoding($result, 'UTF-8'));
var_dump(mb_detect_encoding($html, "UTF-8, ASCII, ISO-8859-1, windows-1250"));
$Doc = \phpQuery::newDocumentHTML($html, 'UTF-8');
echo pq($Doc)->html();
All the UTF-8 characters come out garbled. var_dump says it's UTF-8, and the content type is "text/plain; charset=UTF-8".
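One thing that may be worth ruling out is the conversion step itself: mb_convert_encoding() called without a third argument takes the source encoding from mb_internal_encoding(), not from the page. The sketch below passes the source encoding explicitly. The helper name toUtf8 is hypothetical, and the iconv fallback rests on the assumption that the local mbstring build may not list Windows-1250 among its encodings; it is an illustration, not a confirmed fix for this issue.

```php
<?php
// Hypothetical helper, not part of phpQuery or PHP itself.
function toUtf8(string $bytes, string $from = 'Windows-1250'): string
{
    // Use mbstring only when it reports the source encoding as supported.
    // Note: mb_list_encodings() returns canonical names, not aliases.
    $known = array_map('strtolower', mb_list_encodings());
    if (in_array(strtolower($from), $known, true)) {
        return mb_convert_encoding($bytes, 'UTF-8', $from);
    }

    // Otherwise fall back to iconv, which is assumed to know the encoding.
    $out = iconv($from, 'UTF-8', $bytes);
    if ($out === false) {
        throw new RuntimeException("Cannot convert from $from to UTF-8");
    }
    return $out;
}

// "\xE1\xF5" is the windows-1250 byte sequence for "áő".
$utf8 = toUtf8("\xE1\xF5");
var_dump(mb_check_encoding($utf8, 'UTF-8'));
```

With a helper like this, the first line of the snippet above would become $html = str_replace('windows-1250', 'UTF-8', toUtf8($result));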
When I var_dump($Doc); I see that the DOMDocument encoding and xmlEncoding values are null.
But if I use:
$Dom = new \DOMDocument();
$Dom->loadHTML($html);
and var_dump it, then everything is fine and the characters are OK.
I've checked createDocumentWrapper, and the $contentType is OK.
If I set the static $debug to true, I get this:
`string 'Load markup for content type text/html;charset=utf-8' (length=52)
string 'Loading HTML, content type 'text/html;charset=utf-8'' (length=52)
string 'Full markup load (HTML):
' (length=275)string 'DOC: UTF-8 REQ: UTF-8' (length=21)
string 'Full markup load (HTML), documentCreate('utf-8')' (length=48)
string 'Selecting document '52280a0c077ec7c5fb2f2350db12f22c' as default one' (length=68)`
I found a workaround, but as far as I can see it is not in the documentation:
$html = trim(str_replace('windows-1250', 'UTF-8', mb_convert_encoding($result, 'UTF-8')));
$Dom = new \DOMDocument();
$Dom->loadHTML($html);
$Doc = \phpQuery::newDocumentHTML($Dom, 'UTF-8');
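The workaround above can be sketched as a small reusable helper. The name loadUtf8Html is hypothetical, and the sample markup assumes the string is already UTF-8 and carries a matching meta charset, as it does after the str_replace step. The phpQuery call is copied from the workaround and only runs when the library is loaded.

```php
<?php
// Hypothetical helper; not part of phpQuery or PHP.
function loadUtf8Html(string $utf8Html): DOMDocument
{
    $dom = new DOMDocument();
    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(trim($utf8Html));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Sample UTF-8 markup with a matching meta charset ("áő" as UTF-8 bytes).
$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head>'
      . "<body><p>\xC3\xA1\xC5\x91</p></body></html>";

$dom  = loadUtf8Html($html);
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === "\xC3\xA1\xC5\x91");

// Hand the prepared DOMDocument to phpQuery, as in the workaround above.
if (class_exists('phpQuery')) {
    $doc = \phpQuery::newDocumentHTML($dom, 'UTF-8');
}
```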