TobiaszCudnik/phpquery

UTF-8 issue when trying to create a DOM document

vaso123 opened this issue · 1 comments

I fetched a page with cURL. Its charset is windows-1250, and its doctype is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

I convert the encoding of the string, check it, and replace the meta charset in the string:

$html = str_replace('windows-1250', 'UTF-8', mb_convert_encoding($result, 'UTF-8'));
var_dump(mb_detect_encoding($html, "UTF-8, ASCII, ISO-8859-1, windows-1250"));
$Doc = \phpQuery::newDocumentHTML($html, 'UTF-8');
echo pq($Doc)->html();
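One thing worth ruling out in the conversion step: when the third argument of mb_convert_encoding is omitted, the source encoding defaults to PHP's internal encoding, which is not necessarily windows-1250. The following is only a minimal sketch of a conversion that names the source encoding explicitly. It assumes the iconv extension is available and accepts the name WINDOWS-1250; convertPageToUtf8 is a hypothetical helper, not part of phpQuery.

```php
<?php
// Hypothetical helper (not part of phpQuery): convert a windows-1250 page
// to UTF-8 and update the charset declared in its meta tag.
function convertPageToUtf8(string $raw): string
{
    // Name the source encoding explicitly instead of relying on a default.
    // Assumption: the local iconv build knows the name "WINDOWS-1250".
    $utf8 = iconv('WINDOWS-1250', 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException('Conversion from windows-1250 failed');
    }

    // Keep the declared charset in sync with the actual bytes.
    return str_ireplace('windows-1250', 'UTF-8', $utf8);
}

// Sample input: the byte 0x9A is the letter "š" in windows-1250.
$sample = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">'
        . "<p>\x9A</p>";

$html = convertPageToUtf8($sample);

// Check that the result is valid UTF-8 before handing it to a parser.
var_dump(mb_check_encoding($html, 'UTF-8'));
```

If the output of this step is already correct, the problem lies in how the document is loaded rather than in the conversion.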

All the non-ASCII characters come out garbled. var_dump says the string is UTF-8, and the content type is "text/plain; charset=UTF-8".

When I var_dump($Doc), I see that the DOMDocument encoding and xmlEncoding values are null.

But if I use:

$Dom = new \DOMDocument();
$Dom->loadHTML($html);

and var_dump it, then everything is fine and the characters are OK.

I've checked createDocumentWrapper, and the $contentType is OK.

If I set the static $debug to true, I get this:

`string 'Load markup for content type text/html;charset=utf-8' (length=52)

string 'Loading HTML, content type 'text/html;charset=utf-8'' (length=52)

string 'Full markup load (HTML):

' (length=275)

string 'DOC: UTF-8 REQ: UTF-8' (length=21)

string 'Full markup load (HTML), documentCreate('utf-8')' (length=48)

string 'Selecting document '52280a0c077ec7c5fb2f2350db12f22c' as default one' (length=68)`

I found a workaround, but as far as I can see it is not in the documentation.

$html = trim(str_replace('windows-1250', 'UTF-8', mb_convert_encoding($result, 'UTF-8')));
$Dom = new \DOMDocument();
$Dom->loadHTML($html);
$Doc = \phpQuery::newDocumentHTML($Dom, 'UTF-8');
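To check the DOMDocument half of this workaround in isolation, something like the sketch below may help. It uses only PHP's DOM extension; loadUtf8Html is a hypothetical helper name, and the sample markup is invented for illustration. It relies on the page declaring its charset in a meta http-equiv tag, as the converted page here does.

```php
<?php
// Hypothetical helper: parse UTF-8 HTML with DOMDocument, silencing
// the parser warnings that real-world markup tends to trigger.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Invented sample: "\xC5\xA1" is the UTF-8 encoding of "š".
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
      . '</head>'
      . "<body><p>\xC5\xA1</p></body></html>";

$dom = loadUtf8Html($html);
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

// True when the character survived parsing unchanged.
var_dump($text === "\xC5\xA1");

// The parsed document can then be passed on as in the workaround:
// $Doc = \phpQuery::newDocumentHTML($dom, 'UTF-8');
```

If the comparison holds here but the output is still garbled when the raw string is passed to phpQuery, that points at phpQuery's own markup loading rather than at the string.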