Fast Python port of arc90's Readability tool, updated to match the latest readability.js.

python-readability

Given an HTML document, it pulls out the main body text and cleans it up.

This is a Python port of a Ruby port of arc90's Readability project.

Installation

Install it with pip:

$ pip install readability-lxml

Usage

>>> import requests
>>> from readability import Document
>>>
>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'

Change Log

  • 0.3 Added Document.encoding, positive_keywords and negative_keywords
  • 0.4 Added video loading and allowed more images per paragraph
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 and 3.4

Licensing

This code is licensed under the Apache License 2.0.

Thanks to

  • Latest readability.js
  • Ruby port by starrhorne and iterationlabs
  • Python port by gfxmonk
  • Decruft effort to move to lxml
  • "BR to P" fix from readability.js which improves quality for smaller texts
  • Contributions from GitHub users