Feature: Please add plain text output functionality
yevgenpapernyk opened this issue · 6 comments
Like .summary(), but returning plain text instead of the HTML that .summary() produces.
E.g. as a new method or as an argument for the .summary() method.
That would be very useful for Natural Language Processing.
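I highly recommend using the html2text library on the .summary() output for that.
    from html2text import HTML2Text

    def html_to_text(html):
        # Convert the HTML returned by .summary() into plain text.
        converter = HTML2Text()
        converter.ignore_links = True      # drop link targets
        converter.ignore_emphasis = True   # drop emphasis markers
        converter.body_width = 0           # disable line wrapping
        text = converter.handle(html)
        return text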
Given that it's that easy, that different people need different rendering options, and that those options might change over time and would then have to be reflected in the library interface, I'd like to leave it as is.
However, I might consider adding a simple version; for that you need just .text_content() in lxml.
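For anyone who wants the simple version right now, here is a minimal sketch of the .text_content() approach. It is not part of the library: the helper name html_to_plain_text and the sample HTML are made up for illustration, and the sample only stands in for whatever .summary() returns.

```python
import lxml.html


def html_to_plain_text(html):
    """Return the text of an HTML fragment with all tags removed.

    Hypothetical helper, not part of readability-lxml.
    """
    root = lxml.html.fromstring(html)
    # text_content() concatenates every text node below the element.
    raw = root.text_content()
    # Collapse the whitespace runs left behind by the removed markup.
    return " ".join(raw.split())


# Assumed stand-in for the HTML string returned by .summary().
sample_html = "<div>\n<h1>Title</h1>\n<p>Hello <b>world</b>.</p>\n</div>"
plain = html_to_plain_text(sample_html)
```

Note that .text_content() adds no separators of its own, so adjacent block elements with no whitespace between them in the source will run together. That is one reason a converter with rendering options may still be the better fit for some uses.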
Shameless plug: trafilatura builds upon readability-lxml
and can convert the output to TXT, XML, CSV and JSON.
However, I might consider adding a simple version; for that you need just .text_content() in lxml.
So I'll leave the issue open until you decide whether you want to add it, right?
Is there a plan to support textContent,
like we have in the JS module https://github.com/mozilla/readability#parse?
Yes, if many people want an easy way to get text output, I'll add it.
Could you please add support for getting plain text content?