
Blog crawler for the BlogForever project.


BlogForever crawler

Install for Python 2.6:

pip install scrapy==0.18.4
pip install lxml httplib2 feedparser selenium python-Levenshtein
download PhantomJS from http://phantomjs.org/download.html and install the binary to /opt/phantomjs/bin/phantomjs
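
To check that Selenium can drive PhantomJS, a quick optional sanity check (not part of the crawler itself, and assuming the /opt/phantomjs path above):

from selenium import webdriver

# Assumes the PhantomJS binary was installed to the path above.
driver = webdriver.PhantomJS(executable_path='/opt/phantomjs/bin/phantomjs')
driver.get('http://www.quantumdiaries.org/')
print driver.title
driver.quit()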

Run:

scrapy crawl newcrawl -a startat=http://www.quantumdiaries.org/
scrapy crawl updatecrawl -a startat=http://www.quantumdiaries.org/ -a since=1388593000
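
The since argument is a Unix timestamp. A small sketch of one way to compute it, here for the last seven days (not project code):

import time

# Unix timestamp seven days ago; pass it as the since argument above.
since = int(time.time()) - 7 * 24 * 3600
print since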

Test:

pip install pytest pytest-incremental
py.test

Source tree docstrings:

bibcrawl
├── model
│   ├── commentitem.py: Blog comment Item
│   ├── objectitem.py: Superclass of the comment and post items
│   └── postitem.py: Blog post Item
├── pipelines
│   ├── backendpropagate.py: Saves the item in the back-end
│   ├── downloadfeeds.py: Downloads the comments web feed
│   ├── downloadimages.py: Downloads images
│   ├── extractcomments.py: Extracts all comments from the HTML using the comment feed
│   ├── files.py: Files pipeline back-ported to Python 2.6
│   ├── processhtml.py: Processes the HTML to extract the article, title and author
│   └── renderjavascript.py: Renders the original page with PhantomJS and takes a screenshot
├── spiders
│   ├── newcrawl.py: Entirely crawls a new blog
│   ├── rsscrawl.py: Superclass of the new and update crawls
│   └── updatecrawl.py: Partially crawls a blog for new content from the web feed
├── utils
│   ├── contentextractor.py: Extracts the content of blog posts using an RSS feed
│   ├── ohpython.py: Essential functions that should have been part of the Python core
│   ├── parsing.py: Parsing functions
│   ├── priorityheuristic.py: Priority heuristic for page downloads, favors pages with links to posts
│   ├── stringsimilarity.py: Dice's coefficient similarity function
│   └── webdriverpool.py: Pool of PhantomJS processes to parallelize page rendering
├── blogmonitor.py: Queries the database and starts new and update crawls when needed
└── settings.py: Scrapy settings
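
For reference, posts and comments are plain Scrapy Items. A minimal sketch of what bibcrawl/model/postitem.py might declare (the field names here are illustrative, not necessarily the actual ones):

from scrapy.item import Item, Field

class PostItem(Item):
    # Illustrative fields only -- see bibcrawl/model/postitem.py for the real definition.
    url = Field()
    title = Field()
    author = Field()
    content = Field()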

TODO:

Add to the DB, per blog (a possible schema is sketched after this list):

  • link to web-feed
  • latest etag of this feed
  • date of last crawl (unix format)
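
A hypothetical schema for these per-blog fields, sketched here with sqlite3 (the actual back-end and column names are not defined by this README):

import sqlite3

# Hypothetical table; the real back-end may differ.
db = sqlite3.connect('blogmonitor.db')
db.execute("""CREATE TABLE IF NOT EXISTS blogs (
    url        TEXT PRIMARY KEY,
    feed_url   TEXT,     -- link to web-feed
    feed_etag  TEXT,     -- latest etag of this feed
    last_crawl INTEGER   -- date of last crawl (unix format)
)""")
db.commit()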

Blog monitor algorithm:

if the feed is fresh (isFresh), start an updatecrawl with the date of the last crawl;
otherwise nothing needs to be done for this blog.
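
A rough sketch of that check, assuming the table sketched above and feedparser's etag support (function and column names are hypothetical):

import subprocess
import time
import feedparser

def monitor_blog(url, feed_url, feed_etag, last_crawl):
    # Conditional fetch: feedparser sends If-None-Match when an etag is given.
    feed = feedparser.parse(feed_url, etag=feed_etag)
    if getattr(feed, 'status', None) == 304:
        # Feed unchanged since the last crawl: nothing to do for this blog.
        return feed_etag, last_crawl
    # The feed is fresh: start an update crawl from the last crawl date.
    subprocess.call(['scrapy', 'crawl', 'updatecrawl',
                     '-a', 'startat=' + url,
                     '-a', 'since=%d' % last_crawl])
    return feed.get('etag', feed_etag), int(time.time())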