/webreg_scrapy

A WebReg scraper via Scrapy

Primary language: Python

Webreg Scrapy

A web scraper that retrieves UCI course information from the UCI University Registrar. I built it as a tool for the UCI Course API.

Table of Contents

  1. Usage
    1. Process
  2. Requirements
  3. Development
    1. Installing Dependencies
    2. Running the Scraper
    3. Handling UCI Data Changes
    4. Roadmap
  4. Contributing

Usage

Use this scraper to grab course information and import it into a PostgreSQL database.

Process

  1. The scraper runs on Heroku
  2. It executes the department spider to fetch the updated list of departments
  3. It executes a course spider for each department in that list
  4. It uploads all the scraped information to the AWS RDS PostgreSQL database
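
The steps above amount to a two-stage crawl followed by a load. Here is a minimal sketch of that control flow, with the spiders and the database writer replaced by plain callables so it runs without network or database access. The names `run_pipeline`, `fetch_departments`, `fetch_courses`, and `store` are hypothetical and do not come from this repository.

```python
from typing import Callable, Iterable


def run_pipeline(
    fetch_departments: Callable[[], Iterable[str]],
    fetch_courses: Callable[[str], Iterable[dict]],
    store: Callable[[dict], None],
) -> int:
    """Crawl departments, then the courses of each department, storing every course.

    All three callables are stand-ins (assumptions). In the real project they
    would correspond to the department spider, the course spider, and the
    PostgreSQL pipeline.
    """
    stored = 0
    for department in fetch_departments():
        for course in fetch_courses(department):
            store(course)
            stored += 1
    return stored


if __name__ == "__main__":
    # Made-up data, only to exercise the control flow.
    fake_courses = {
        "DEPT_A": [{"dept": "DEPT_A", "number": "1"}],
        "DEPT_B": [{"dept": "DEPT_B", "number": "1"}, {"dept": "DEPT_B", "number": "2"}],
    }
    rows = []
    count = run_pipeline(lambda: list(fake_courses), lambda d: fake_courses[d], rows.append)
    print(count, "courses stored")
```

Keeping the department list as the outer loop means a failure in one department's crawl can be retried without repeating the others, though how the hosted scraper actually handles retries is not documented here.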

Requirements

  • PostgreSQL

Development

Installing Dependencies

From within the root directory:

pip install -r requirements.txt

Running the Scraper

Start a PostgreSQL server with the required relations already set up, then run one of the following:

# Crawl courses into the database
scrapy crawl course_scrapy
# Crawl courses into the database and also write them to courses.json
scrapy crawl course_scrapy -o courses.json
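
Once `courses.json` exists, a quick sanity check is to load it and count items by one of their fields. The sketch below assumes the export is a single JSON array of objects, which is what Scrapy's JSON feed format normally produces. The helper `count_by_field` and the field name you pass to it are hypothetical; use whichever keys `items.py` actually defines.

```python
import json
from collections import Counter
from pathlib import Path


def count_by_field(path: str, field: str) -> Counter:
    """Count exported items by the value of one field.

    `field` is an assumption about the item layout. Items that lack the
    field are counted under the key None.
    """
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    return Counter(item.get(field) for item in items)
```

For example, `count_by_field("courses.json", "department")` would return a per-department count if the items carry a `department` key. In recent Scrapy versions `-o` appends to an existing file, which for the JSON format can leave a file that no longer parses, so deleting an old `courses.json` before re-running is a safe habit.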

Handling UCI Data Changes

  1. Update items.py to declare any new or changed fields
  2. Update the parsing logic in course_spider.py
  3. Update models.py to reflect the new database schema
  4. Update pipelines.py to handle insertion of the new data
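
To illustrate how one change touches all four files, the sketch below follows a single new field through each layer. Everything in it is an assumption made for illustration: the `modality` field, the `CourseItem` and `CoursePipeline` names, the cell order in `parse_row`, and the table layout are hypothetical. SQLite stands in for PostgreSQL so the example runs on its own, and the repository's real models and pipeline will differ.

```python
import sqlite3
from dataclasses import asdict, dataclass
from typing import List


# items.py: declare the new field (the hypothetical `modality`).
@dataclass
class CourseItem:
    code: str
    title: str
    modality: str = ""  # hypothetical new value published by the registrar


# course_spider.py: teach the parser where the new value lives.
def parse_row(cells: List[str]) -> CourseItem:
    # Assumed cell order: code, title, modality. Older rows lack the last cell.
    modality = cells[2].strip() if len(cells) > 2 else ""
    return CourseItem(code=cells[0].strip(), title=cells[1].strip(), modality=modality)


# models.py: mirror the field in the schema (SQLite stand-in, not PostgreSQL).
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS courses "
    "(code TEXT PRIMARY KEY, title TEXT, modality TEXT)"
)


# pipelines.py: include the field in the insert.
class CoursePipeline:
    def __init__(self, connection: sqlite3.Connection):
        self.connection = connection
        self.connection.execute(SCHEMA)

    def process_item(self, item: CourseItem, spider=None) -> CourseItem:
        # INSERT OR REPLACE is SQLite syntax; PostgreSQL would use ON CONFLICT.
        self.connection.execute(
            "INSERT OR REPLACE INTO courses VALUES (:code, :title, :modality)",
            asdict(item),
        )
        self.connection.commit()
        return item
```

The `process_item(item, spider)` method name matches the hook Scrapy calls on item pipelines. Giving the new field a default value lets rows scraped in the old format still parse and insert.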

Roadmap

View the project roadmap here

Contributing

See CONTRIBUTING.md for contribution guidelines.