PhpGt/Dom

escaped characters

kartofelek007 opened this issue · 18 comments

I have html with code:

<script>
p.append(" są ");
</script>

I read my html, add div and print output ($content is html get by ob_get_clean() and it good because i tested it in my other function):

$document = new HTMLDocument($content);
$h1 = $document->querySelector("#pageTitle");
$div = $document->createElement("div");
$div->innerHTML = "lorem";
$h1->after($div);
$html = $document->documentElement;

I got:

<script>
p.append(" s&#261; ");
</script>
g105b commented

Hi @kartofelek007 ,

Thanks for the issue report. I'll take a look at this and create a test case to investigate. I know the content of a script tag is treated differently in the underlying DOMDocument, so maybe there's something this library can do to patch this behaviour.

Will keep you up to date when I have a fix.

Greg.

g105b commented

Hi @kartofelek007,

Please can you check over the changes I've just pushed to branch 356-escaped-characters? I believe this has fixed the issue, and it's also improved the way escaping works for the textContent property.

Unit tests available here: https://github.com/PhpGt/Dom/blob/356-escaped-characters/test/phpunit/HTMLDocumentTest.php#L650-L667

I am simple man. I dont know how to use this with composer.

g105b commented

😄 I'm happy to help. If you are requiring phpgt/dom directly through composer, you can refer to a particular branch by using this syntax:

{
	"require": {
		"phpgt/dom": "dev-356-escaped-characters"
	}
}

Alternatively, you could clone this repository, use git checkout 356-escaped-characters and use it directly, like in the unit tests.

If you need any help integrating, I'm more than happy advising you - I'm really invested in helping get my open source work accessible to all.

I install it on my page.

2022-07-29_101642

In composer.lock I have something like:

{
            "name": "phpgt/dom",
            "version": "dev-356-escaped-characters",
            "source": {
                "type": "git",
                "url": "https://github.com/PhpGt/Dom.git",
                "reference": "c2f11cc299feefc7ee2faae61178de5b04a17dec"
            },
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/PhpGt/Dom/zipball/c2f11cc299feefc7ee2faae61178de5b04a17dec",
                "reference": "c2f11cc299feefc7ee2faae61178de5b04a17dec",
                "shasum": ""
            },

But it still not working.
Part of my course page:

<pre class="line-numbers"><code class="language-js">
const p = document.querySelector("#testNodeText");
const btn = document.querySelector(".btn-textNodeTest");

const word1 = document.createTextNode("Psy");
p.append(word1);

p.append(" są ");

const word2 = document.createTextNode("fajne");
p.append(word2);

btn.addEventListener("click", () =&gt; {
    console.dir(word1); //wypiszemy sobie by zobaczyć co możemy użyć
    word1.textContent = "Koty też";
    word2.textContent = "super!";
});
</code></pre>

<script>
    {
        const p = document.querySelector("#testNodeText");
        const btn = document.querySelector(".btn-textNodeTest");

        const word1 = document.createTextNode("Psy");
        p.append(word1);

        p.append(" s&#261; ");

        const word2 = document.createTextNode("fajne");
        p.append(word2);

        btn.addEventListener("click", () => {
            console.dir(word1); //wypiszemy sobie by zobaczy&#263; co mo&#380;emy u&#380;y&#263;

            word1.textContent = "Koty te&#380;";
            word2.textContent = "super!";
        });
    }
</script>

In normal html elements (pre, code etc) the letters are correct. But in script tag all non eglish letters are escaped (look for example at comment after console.dir

g105b commented

Thank you for supplying a test case. I'll add this to the unit tests to replicate in a controlled environment. If you keep your eye on the 356-escaped-characters branch you'll see the new tests appear soon.

g105b commented

Hi @kartofelek007,

I've added a test to check that the non-English characters within script tags are not escaped. The test is passing, so I think I need more information from you about your particular use-case, so I can write more tests and isolate this bug properly.

Please could you provide me the source HTML that contains the <script> tag, and the PHP code that you are using to load it into the DOM, and the PHP code that renders the DOM back to the browser?

With your help I'm sure we can figure out why this strange encoding issue is occurring.

P.S. cool DOOM terminal

g105b commented

Take a look at scratch.php - I've pushed this to the branch which can be tested within a real web browser.

The scratch.php loads your example JavaScript into an HTMLDocument, then echos the document to the page.

I serve the scratch.php locally using this command: php -S 0.0.0.0:8000 scratch.php and this allows me to view the page in my web browser at http://localhost:8000/

Please see this screen recording. I hope this helps get us to a solution:

Marcin

Hmmm. Your example work fine.
I paste my function to your code and get escape characters. Maybe my function is shity?
You can see what I do in real on my page: https://kursjs.pl/kurs/dom/dom-tworzenie-i-usuwanie#createTextNode (now i use my own regexp solution)

<?php
use Gt\Dom\HTMLDocument;

require "vendor/autoload.php";

$content = <<<HTML
<div class="page-container" id="pageContainer">

<h1 class="page-title" id="pageTitle">Tworzenie i usuwanie elementów</h1>

<pre class="line-numbers"><code class="language-js">
const p = document.querySelector("#testNodeText");
const btn = document.querySelector(".btn-textNodeTest");

const word1 = document.createTextNode("Psy"); //pierwsze słowo
p.append(word1);

p.append(" są "); //nie potrzebujemy tutaj referencji

const word2 = document.createTextNode("fajne"); //ostatnie słowo
p.append(word2);

btn.addEventListener("click", () => {
    console.dir(word1); //wypiszemy sobie by zobaczyć co możemy użyć
    word1.textContent = "Koty też";
    word2.textContent = "super!";
});
</code></pre>

<script>
    {
        const p = document.querySelector("#testNodeText");
        const btn = document.querySelector(".btn-textNodeTest");

        const word1 = document.createTextNode("Psy");
        p.append(word1);

        p.append(" są "); //nie potrzebujemy tutaj referencji

        const word2 = document.createTextNode("fajne");
        p.append(word2);

        btn.addEventListener("click", e => {
            console.dir(word1); //wypiszemy sobie by zobaczyć co możemy użyć

            word1.textContent = "Koty też";
            word2.textContent = "super!";
        });
    }
</script>
</div>
HTML;


function generateContentTable($content) {
    $document = new HTMLDocument($content);

    $h2 = $document->querySelectorAll("h2.subtitle");

    $html = '<div class="table-of-content" role="navigation">
                <h2 class="visuallyhidden">Spis treści</h2>
                    <ul class="table-of-content-list">';

    foreach($h2 as $el) {
        $html .= '<li><a href="#' . $el->getAttribute('id') . '">' . $el->innerText . '</a></li>';
    }
    $html .= '</ul></div>';

    $div = $document->createElement("div");
    $div->innerHTML = $html;

    $h1 = $document->querySelector("#pageTitle");
    $h1->after($div);
    $html = $document->documentElement;
    return $html->innerHTML;
}

$c = generateContentTable($content);
echo $c;

Edit. I test my funcion in two scanarios in your scratch.php:

function generateContentTable($content) {
    //$document = new HTMLDocument($content);
    //$html = $document->documentElement;
    return $content; //return OK
}

function generateContentTable($content) {
    $document = new HTMLDocument($content);
    $html = $document->documentElement;
    return $html->innerHTML; //return ESCAPED LETTERS
}
g105b commented

Hi! Thanks for providing more detail. I can see where your issue lies now - within the innerHTML function's implementation.

I'll see what I can do to improve this functionality, but in the meantime, I think you can resolve your issue by using the HTMLDocument's toString function.

Rather than doing this:

$html = $document->documentElement;
return $html->innerHTML;

do this instead:

return (string)$document;

You will not see any of this escaping after casting the document to a string. I will see if I can provide a fix to the innerHTML property so it behaves as expected.

There is some advice I'd like to share with you at this point, which is to try and avoid concatenating strings of HTML, and instead using the DOM functionality like createElement and appendChild. That way, there's no need to worry about valid syntax or typos. I've got myself out of a lot of trouble in the past by avoiding concatenating strings of HTML.

This escape letters outside script tag:

$document = new HTMLDocument($content);
return (string)$document;

but not in script tag

g105b commented

Hello @kartofelek007

Sorry for the delay, but I've spent some time tonight trying to get UTF-8 characters working across all tests.

Please can you check this is working for you? I believe I've fixed things for you.

Greg.

P.S. It's on the 356-escaped-characters branch!

After update this return content where every inline js script (include inline google ad script) look like this:

<script>@@@@@@@@@@@@@@@@---script-6323981498a9f---@@@@@@@@@@@@@@@@</script>

And most important - generate take long time. For example for my single page this may be 20-30s. I create simple test with 200lines html - generate may take 3-4s? If I delete all script from this html, generate is muuuuuch faster.

g105b commented

I've just added some more tests to the branch and can confirm that I've somehow added some weird bugs. Sorry to keep this open for so long, I'll revisit it and see if I can fix whatever weird bug I've introduced.

I'm very poor with php. In dont know how to run test. I do this in my own way: I download your repo, checkout do character branch, install composer, recreate test page (see abowe 29 jul) and run it. Then i run your library on my page.

I'm not suprised, because this php functionality is somehow broken and not refined

g105b commented

@kartofelek007 I made a big mistake with my previous fix. I believe I have now fixed everything in the latest commit - sorry for confusing things.

Here's how you can run the tests locally, with the latest version of the code:

git checkout 356-escaped-characters
git pull
composer install
vendor/bin/phpunit test/phpunit

I believe you should be able to see the correct characters within the <script> tags, and everything should be nice and fast without any delay.

Thank you for helping me figure this one out, I'm very interested in hearing how you get on.

First test on test page and my real page pass correct. Looks like it's working fine now. Good job!

g105b commented

Great! I'll keep the pull request open for a bit longer while I integrate the branch into a few live apps I'm running.