blackfellas/python-readability

Fast Python port of arc90's Readability tool, updated to match the latest readability.js.

python-readability

Given an HTML document, it extracts the main body text and cleans it up.

This is a Python port of a Ruby port of arc90's Readability project.

Installation

Install it with pip:

$ pip install readability-lxml

Usage

>>> import requests
>>> from readability import Document
>>>
>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'

Change Log

0.3 Added Document.encoding, positive_keywords and negative_keywords
0.4 Added video loading and allowed more images per paragraph
0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 and 3.4

Licensing

This code is released under the Apache License 2.0.

Thanks to

Latest readability.js
Ruby port by starrhorne and iterationlabs
Python port by gfxmonk
Decruft effort to move to lxml
"BR to P" fix from readability.js which improves quality for smaller texts
Contributions from GitHub users