/python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!

Primary LanguagePythonApache License 2.0Apache-2.0

https://travis-ci.org/buriy/python-readability.svg?branch=master

python-readability

Given a html document, it pulls out the main body text and cleans it up.

This is a python port of a ruby port of arc90's readability project.

Installation

It's easy using pip, just run:

$ pip install readability-lxml

As an alternative, you may also use conda to install, just run:

$ conda install -c conda-forge readability-lxml

Usage

>>> import requests
>>> from readability import Document

>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'

>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""

Change Log

  • 0.8.2 Added article author(s) (thanks @mattblaha)
  • 0.8.1 Fixed processing of non-ascii HTMLs via regexps.
  • 0.8 Replaced XHTML output with HTML5 output in summary() call.
  • 0.7.1 Support for Python 3.7 . Fixed a slowdown when processing documents with lots of spaces.
  • 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.4 Added Videos loading and allowed more images per paragraph
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is under the Apache License 2.0 license.

Thanks to

BUGS

cchardet is not working, further more, it will break when cchardet is present log.debug always runs even when log level is above debug

TODO

  1. migrate mozilla/readability's functionality here - find next page - is_renderable: check renderability before parsing
  2. Add more tests - WeChat articles - Sina/Sohu/Netease articles - Articles with code
  3. Rename and upload to PyPI - Readability2 - 0.9.0
  4. Remove all classes and ids
  5. i18n support, different consts for different language.
  6. Extract title/author/publish_time and other metadata