Feature: Please add plain text output functionality

Question

yevgenpapernyk opened this issue 4 years ago · 6 comments

yevgenpapernyk commented 4 years ago

Like .summary(), but returning plain text instead of the HTML that .summary() produces.

E.g. as a new method or as an argument for the .summary() method.

That would be very useful for Natural Language Processing.

Answer 1 · 2020-07-24T15:54:41.000Z

I highly recommend using the html2text library on the .summary() output for that.

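    from html2text import HTML2Text

    def html_to_text(html):
        # html is the string returned by .summary()
        converter = HTML2Text()
        converter.ignore_links = True     # drop link markup
        converter.ignore_emphasis = True  # drop bold/italic markers
        converter.body_width = 0          # 0 disables line wrapping
        return converter.handle(html)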

Given that it's that easy, that different people need different rendering options, and that those options might change over time and would then have to be reflected in the library interface, I'd like to leave it as is.
However, I might consider adding a simple version; for that you need just .text_content() in lxml.
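
For anyone who wants to try the lxml route in the meantime, here is a rough sketch of what such a simple version could look like. It is not part of the library: the helper name summary_to_text and the sample HTML are made up for illustration, and the only lxml calls used are lxml.html.fromstring() and .text_content().

```python
import lxml.html


def summary_to_text(summary_html: str) -> str:
    """Hypothetical helper: strip tags from an HTML fragment and tidy whitespace."""
    root = lxml.html.fromstring(summary_html)
    # text_content() concatenates every text node under the element, dropping the tags.
    raw = root.text_content()
    # Collapse the runs of spaces and newlines left over from the markup.
    return " ".join(raw.split())


# Stand-in for the string that .summary() would return (invented sample).
sample = "<div><p>Hello <b>world</b>, plain   text.</p></div>"
print(summary_to_text(sample))
```

Note that text_content() simply joins text nodes, so adjacent block elements can run together when the markup has no whitespace between them. If paragraph breaks or list formatting matter, html2text as shown above is probably the better fit.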

Answer 2 · 2020-07-29T17:51:13.000Z

Shameless plug: trafilatura builds upon readability-lxml and can convert the output to TXT, XML, CSV and JSON.

Answer 3 · 2020-08-24T09:19:07.000Z

> However, I might consider adding a simple version, for that you need just .text_content() in lxml.

So I'll leave the issue open until you decide whether you want to add it, right?

Answer 4 · 2021-08-17T03:58:35.000Z

Is there a plan to support textContent, as the JS module does? See https://github.com/mozilla/readability#parse

Answer 5 · 2021-08-17T04:36:58.000Z

Yes, if many people want an easy way to get text output, I'll add it.

Answer 6 · 2021-08-20T05:08:41.000Z

Could you please add support for getting clean plain-text content?