jugglerchris/rust-html2text

Strip invisible unicode characters

ayrat555 opened this issue · 4 comments

Thank you for this library. I'm using it in https://github.com/ayrat555/el_monitorro to remove HTML from data feeds.

Is it possible to also remove invisible Unicode characters from text?

For example, U+200B (zero width space): https://unicode-table.com/en/200B/

It seems rust-html2text converts some HTML codes to this character.

Hi,
Glad you're finding this library useful, I always like seeing people and projects using it!

Can you say what problem you have with this character, and are there any others that are also a problem? I've generally assumed that nearly all characters (apart from control characters) should be passed through to the output in case they're significant to the text (all sorts of formatting characters and so on). Since html2text is producing formatted text there may be an argument that U+200B should be removed, but I'd rather have a general rule than a pile of exceptions if possible!

If you have an example (small) HTML file that shows an issue, that would be great.

Thanks!

> I've generally assumed that nearly all characters (apart from control characters) should be passed through to the output in case they're significant to the text (all sorts of formatting characters and so on)

I think it makes sense.

The problem I had is that the output of rust-html2text consisted only of a single U+200B character. Now I'm removing this kind of character myself: https://github.com/ayrat555/el_monitorro/pull/130/files#diff-5f0de2e362f7c469382fd72a12a43663032d809f82792ff92cca65b97ad84e9fR41
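
For anyone who wants to do similar post-processing, here is a minimal sketch of filtering the converted text after html2text has run. It is not the code from the linked pull request and not part of html2text. The names `is_invisible` and `strip_invisible` are hypothetical, and the list of code points is an assumption about which characters count as "invisible" for a plain-text feed.

```rust
/// Hypothetical helper (not part of html2text): returns true for a small,
/// hand-picked set of zero-width characters. The set is an assumption;
/// adjust it to the characters that actually cause trouble in your data.
fn is_invisible(c: char) -> bool {
    matches!(
        c,
        '\u{200B}'   // ZERO WIDTH SPACE
        | '\u{200C}' // ZERO WIDTH NON-JOINER
        | '\u{200D}' // ZERO WIDTH JOINER
        | '\u{2060}' // WORD JOINER
        | '\u{FEFF}' // ZERO WIDTH NO-BREAK SPACE (byte order mark)
    )
}

/// Hypothetical helper: copies `text`, dropping every character for which
/// `is_invisible` returns true.
fn strip_invisible(text: &str) -> String {
    text.chars().filter(|c| !is_invisible(*c)).collect()
}

fn main() {
    // Stand-in for text already produced by the HTML-to-text conversion.
    let converted = "\u{200B}Hello\u{200B}, world";
    let cleaned = strip_invisible(converted);
    println!("{:?}", cleaned);

    // Output that consisted only of U+200B becomes an empty string,
    // which is easy to detect and skip.
    assert!(strip_invisible("\u{200B}").is_empty());
}
```

Note that U+200C and U+200D are meaningful in some scripts and in emoji sequences, so stripping them can change how text renders. That is the reason to keep a filter like this in the application rather than in the library.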

Hi,
Ok - if you have an HTML example which incorrectly comes out with just that character feel free to raise an issue.