
Blog Comments Crawler

Modified version of louisguitton's disqus-crawler (https://github.com/louisguitton/disqus-crawler), adapted for scraping Breitbart comments.

What is it?

This project is a web crawler that fetches and stores the comments of any blog whose comment section is powered by Disqus.

How is it done?

For the crawling, this project uses Scrapy. The comments are stored in a MongoDB database via the pymongo client. A good tutorial to follow is this one.
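
For a concrete picture, a minimal Scrapy item pipeline in that spirit is sketched below; the class name, database name, and collection name are illustrative assumptions, not necessarily what this repository uses.

```python
# Illustrative sketch of a MongoDB item pipeline (class, database and
# collection names are assumptions, not this repo's actual code).
import pymongo


class CommentsPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection details from settings.py, with local defaults
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "disqus"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each scraped comment ends up as one MongoDB document
        self.db["comments"].insert_one(dict(item))
        return item
```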

When scraping the web, two kinds of problems can arise:

  • the target page is too slow to render because it relies heavily on JavaScript
  • the target page renders quickly, but the content you are interested in disappears once the page has finished rendering

To overcome these situations, one can run a small headless web browser on a local machine and have it render the pages on demand. This project uses Splash, running in a local Docker container. A good tutorial to follow is this one.
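
With the scrapy-splash package, routing a request through the local Splash instance from inside a spider typically looks like the sketch below; the spider name and target URL are placeholders, not part of this project.

```python
# Sketch of a spider that renders pages through Splash
# (spider name and URL are placeholders; assumes the scrapy-splash package).
import scrapy
from scrapy_splash import SplashRequest


class BlogSpider(scrapy.Spider):
    name = "blog_splash_example"

    def start_requests(self):
        # 'wait' gives the page time to execute its JavaScript
        # before Splash returns the rendered HTML
        yield SplashRequest(
            "https://example.com/some-blog-post",
            callback=self.parse,
            args={"wait": 2},
        )

    def parse(self, response):
        # response.text now contains the HTML as rendered by Splash
        self.logger.info("Rendered page length: %d", len(response.text))
```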

Folder structure

README.md The file you're looking at

main.sh It calls the different jobs.

get_posts.py Called from main.sh; it takes care of MongoDB.

scrapy.cfg Nothing to report

purseblog Folder created when running $ scrapy startproject purseblog

  • settings.py Here you set up Splash (see the sketch after this list)

  • pipelines.py Nothing to report

  • items.py Nothing to report

  • __init__.py Nothing to report

  • spiders The folder containing the crawlers

    • getDisqusUrl.py The crawler in charge of the first job in main.sh

    • getJson.py The crawler in charge of the second job in main.sh

    • __init__.py Nothing to report
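
As mentioned in the settings.py entry above, wiring Splash into a Scrapy project is usually done with the standard scrapy-splash settings shown below. The SPLASH_URL matches the Docker command under Installation; the rest is the package's usual boilerplate, not a dump of this repository's actual file.

```python
# settings.py -- typical scrapy-splash wiring (standard boilerplate,
# not necessarily identical to this repo's settings.py)
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```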

Installation

  • Clone the GitHub repository and cd into it
  • Open main.sh and change the URL to the blog page you want to crawl
  • Make sure a mongod instance is running on your computer $ mongod
  • Make sure a Splash instance is running (more information here) $ docker run -p 8050:8050 scrapinghub/splash
  • Run the main.sh script $ sh main.sh
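
Before launching main.sh, you can optionally verify that both services are reachable with a short Python check like the one below; it only assumes the default ports used in the commands above.

```python
# Optional sanity check that MongoDB and Splash are reachable
# (assumes the default ports from the commands above).
import urllib.request

import pymongo

# MongoDB: ping the default mongod port
client = pymongo.MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)
client.admin.command("ping")
print("MongoDB is up")

# Splash: hit the HTTP endpoint exposed by the Docker container
with urllib.request.urlopen("http://localhost:8050", timeout=5) as resp:
    print("Splash is up, HTTP status", resp.status)
```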

Contacts

The author of the original project is Louis Guitton.