Web Scraper for WordPress-based Blogs

By Julia, Kiril, Martina, and Nikolay


Usage:

Scraper and Formatter:

  • Run main.py with the name of a supported website.
  • To scrape the website and save the raw data to a JSON file, run with -s/--scrape.
  • To format scraped data and save it to a JSON file, run with -f/--format.
  • Run without -s and -f to scrape and save only the formatted data.
  • Specify the number of articles to scrape with -n NUM.
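
The options above could be parsed roughly as follows. This is a minimal sketch using argparse and is not taken from main.py: the function name build_parser, the attribute name num, and the help strings are assumptions made for illustration.

```python
import argparse

# Blog names copied from the "Supported blogs" list in this README.
SUPPORTED_BLOGS = ["travelsmart", "bozho", "igicheva", "pateshestvenik", "az_moga"]


def build_parser():
    """Hypothetical parser mirroring the flags described above."""
    parser = argparse.ArgumentParser(
        description="Scrape and format WordPress-based blogs."
    )
    parser.add_argument("website", choices=SUPPORTED_BLOGS,
                        help="name of a supported website")
    parser.add_argument("-s", "--scrape", action="store_true",
                        help="scrape the website to a JSON file")
    parser.add_argument("-f", "--format", action="store_true",
                        help="format scraped data to a JSON file")
    # "num" is an assumed attribute name; None means "no limit given".
    parser.add_argument("-n", dest="num", metavar="NUM", type=int, default=None,
                        help="number of articles to scrape")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this sketch, passing "travelsmart -s -n 5" yields scrape=True, format=False and num=5; passing only the website name leaves both flags False, which corresponds to the "scrape and save only the formatted data" case.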

Web App:

  • web_instance.py starts a debug server that serves all previously scraped data.

  • run.sh:

    • scrapes our primary supported blog (travelsmart),
    • starts a server and opens it in the default browser,
    • proceeds to scrape all supported blogs.

    Newly scraped data is automatically loaded in.
    An argument may be passed to specify the number of posts to scrape from each blog.
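
The order of steps in run.sh can be illustrated with a small Python sketch that only builds the commands, without running them. It is not a transcription of run.sh: the helper names scrape_command and plan_run are hypothetical, and the exact way run.sh invokes main.py and web_instance.py is an assumption. The browser step is left out because the server address is not stated here.

```python
import sys

# Blog names copied from the "Supported blogs" list in this README.
SUPPORTED_BLOGS = ["travelsmart", "bozho", "igicheva", "pateshestvenik", "az_moga"]
PRIMARY_BLOG = "travelsmart"


def scrape_command(blog, num=None):
    """Build the argv for one scraper run; -n is added only if a count is given."""
    command = [sys.executable, "main.py", blog]
    if num is not None:
        command += ["-n", str(num)]
    return command


def plan_run(num=None):
    """Return (step, argv) pairs in the order described above (assumed invocations)."""
    steps = [("scrape", scrape_command(PRIMARY_BLOG, num))]
    # In practice the server would keep running in the background
    # while the remaining blogs are scraped.
    steps.append(("serve", [sys.executable, "web_instance.py"]))
    steps += [("scrape", scrape_command(blog, num)) for blog in SUPPORTED_BLOGS]
    return steps


if __name__ == "__main__":
    for step, argv in plan_run(num=3):
        print(step, " ".join(argv))
```

Scraping the primary blog first means the server has data to show as soon as it starts, while the remaining blogs are picked up as they finish.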


Supported blogs:

  1. travelsmart
  2. bozho
  3. igicheva
  4. pateshestvenik
  5. az_moga

Task

Web scraper - automatically gather info from selected websites (blogs):

  1. Develop a scraper using a Test Driven Development process.
  2. Process the data for subsequent usage (storage/access/search).
  3. Present the data through a simple frontend.
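
As an illustration of the test-driven approach in step 1 applied to the processing in step 2, a test can be written first and the function filled in to satisfy it. The sketch below is hypothetical: format_post and the "title" and "content" fields are assumptions and do not describe this project's actual data format.

```python
import unittest


def format_post(raw):
    """Hypothetical formatter: trim and collapse whitespace in every field."""
    return {key: " ".join(str(value).split()) for key, value in raw.items()}


class FormatPostTest(unittest.TestCase):
    def test_collapses_whitespace(self):
        # Assumed field names, chosen only for this example.
        raw = {"title": "  Hello   world ", "content": "First\n\nline"}
        expected = {"title": "Hello world", "content": "First line"}
        self.assertEqual(format_post(raw), expected)


if __name__ == "__main__":
    unittest.main()
```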