stevebauman/hypertext

[Question/bug?] Transformed HTML-to-text still includes "&amp;"

Closed this issue · 4 comments

Hi @stevebauman, thanks for developing this. I'm trying it out on a new project. I'm finding that the transformer does mostly what I'd like it to do, but even though it's decoding some HTML entities like &nbsp; and &ldquo;, it's leaving behind &amp;. Is there a reason this one is excluded?

example string: <p>Here's some &nbsp;text that is a bit &ldquo;rough &amp; ready&rdquo;</p>
output: Here's some text that is a bit “rough &amp; ready”

I think this is probably related to using HTMLPurifier, but since it seems the goal is to get to plain text, I'm wondering if maybe an extra step is needed in the transformation pipeline.

[To clarify: I'm using this in the context of preparing text for a Meilisearch index, within a Laravel app.]

Hey @sgilberg, thanks for trying out hypertext!

Let me give this a shot -- I think we may just need to run html_entity_decode() over the result before returning it.
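As a rough sketch of that idea (not the actual change made in the library), the snippet below runs PHP's built-in html_entity_decode() over already-extracted text. The helper name decodeRemainingEntities() is hypothetical and only for illustration; the flags and the explicit UTF-8 encoding are assumptions chosen so that quotes and HTML5 named entities are covered.

```php
<?php

// Hypothetical helper, not part of hypertext: decodes any HTML entities
// that are still present in text that has already been stripped of tags.
function decodeRemainingEntities(string $text): string
{
    // ENT_QUOTES decodes both single- and double-quote entities;
    // ENT_HTML5 selects the HTML5 named-entity table.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// The output reported in this issue, with "&amp;" still encoded.
$output = "Here's some text that is a bit “rough &amp; ready”";

echo decodeRemainingEntities($output), PHP_EOL;
// prints: Here's some text that is a bit “rough & ready”
```

Note that html_entity_decode() decodes only one level, so a double-encoded "&amp;amp;" would come out as "&amp;" rather than "&".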

I'm going to classify this as a bug 👍

Hey @sgilberg,

I've just resolved this in the latest v1.1.1 release.

I've added your example as a test case to ensure it's been covered:

it('converts html entities into their true form', function () {
    expect(
        transformer()->toText(<<<HTML
        <p>Here's some &nbsp;text that is a bit &ldquo;rough &amp; ready&rdquo;</p>
        HTML)
    )->toEqual("Here's some text that is a bit “rough & ready”");
});

Run composer update and you're all set! Thanks again for the report 🙏

Thanks @stevebauman, confirmed this now works in my application, and I can now drop my own html_entity_decode() workaround 👍

Excellent, great to hear @sgilberg. Appreciate you reporting back and confirming.