buriy/python-readability

Feature: Please add plain text output functionality

yevgenpapernyk opened this issue · 6 comments

Like .summary(), but returning plain text instead of HTML.

E.g. as a new method or as an argument for the .summary() method.

That would be very useful for Natural Language Processing.

buriy commented

I highly recommend using the html2text library on the .summary() output for that.

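    from html2text import HTML2Text

    def html_to_text(html):
        converter = HTML2Text()
        converter.ignore_links = True     # drop link targets
        converter.ignore_emphasis = True  # drop emphasis markers
        converter.body_width = 0          # do not hard-wrap lines
        return converter.handle(html)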

Given that it's that easy, and that different people need different rendering options that might change over time and would then have to be reflected in the library interface, I'd like to leave it as is.
However, I might consider adding a simple version, for that you need just .text_content() in lxml.
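
For reference, a minimal sketch of what such a simple version could look like with lxml alone. The function name `html_to_plain_text` and the whitespace collapsing are assumptions made for illustration, not part of readability's API; only `lxml.html.fromstring()` and `.text_content()` are taken from lxml itself.

```python
import lxml.html


def html_to_plain_text(html: str) -> str:
    """Return the text of an HTML string with all tags stripped.

    Hypothetical helper, not part of python-readability.
    """
    # fromstring() parses a fragment or a full document into an lxml element.
    root = lxml.html.fromstring(html)
    # text_content() concatenates every text node below the element.
    text = root.text_content()
    # Collapse runs of whitespace left behind by the markup.
    return " ".join(text.split())


print(html_to_plain_text("<p>Hello <b>world</b>.</p>"))  # prints "Hello world."
```

The string returned by .summary() could presumably be passed to this helper directly. Note that .text_content() inserts no separators of its own, so text from adjacent block elements can run together when the markup has no whitespace between them; that is one reason a converter such as html2text may still be the better fit.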

adbar commented

Shameless plug: trafilatura builds upon readability-lxml and can convert the output to TXT, XML, CSV and JSON.

> However, I might consider adding a simple version, for that you need just .text_content() in lxml.

So I'll leave the issue open until you decide whether you want to add it, right?

Is there a plan to support textContent, as the JS module does (https://github.com/mozilla/readability#parse)?

buriy commented

Yes, if many people want an easy way to get text output, I'll add it.

Could you please add support for getting plain text content?