Briticle is a Python library to extract the main content of a webpage.
- BeautifulSoup 4
- html5lib / lxml
- kindlegen (required only if you want to save to mobi files)
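To see which of the dependencies above are already available, a small check like the following may help. It is only a sketch and not part of Briticle: the function name `check_dependencies` is hypothetical, it assumes Python 3.4 or later, that BeautifulSoup 4 is importable as `bs4`, and that the Kindle converter is installed as an executable named `kindlegen` on your PATH.

```python
import importlib.util
import shutil


def check_dependencies():
    """Return a dict mapping each dependency name to True (found) or False."""
    # Assumed import names for the Python packages listed above.
    modules = {
        "BeautifulSoup 4": "bs4",
        "html5lib": "html5lib",
        "lxml": "lxml",
    }
    status = {
        label: importlib.util.find_spec(name) is not None
        for label, name in modules.items()
    }
    # kindlegen is an external command-line tool, so look for it on PATH.
    status["kindlegen"] = shutil.which("kindlegen") is not None
    return status


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print("{:<16} {}".format(name, "found" if found else "missing"))
```

Only one of html5lib or lxml needs to be reported as found, and a missing kindlegen matters only for mobi output.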
Copy the single briticle.py file into your project, or install it on your Python path:
# python setup.py install
>>> from briticle import Briticle
>>> bs = Briticle()
>>> bs.open('http://example.com/blog-post-url/')
>>> print bs.text # the main content
>>> print bs.html # the main content with html tags
# To save to a mobi file
>>> save_dir = "/Users/mitnk/tmp"
>>> url = "http://mitnk.com/79/code_notes_i/"
>>> from briticle import BriticleFile
>>> bf = BriticleFile(url, save_dir)
>>> bf.save_to_mobi()
Run the following command in the directory that contains briticle.py:
# python test.py
mitnk @ twitter, whgking AT gmail DOT com