This is a web scraper built to scrape course information from the Syllabus Search Database at Waseda University.
- Python 3, version 3.6.2 and above.
- pip3 (package manager for Python3) 9.0.1 and above.
- MongoDB shell 3.6.0 and above.
- Google Chrome Driver
- Robo 3T (Optional but recommended)
NOTE: Currently, this guide is written for Mac users. The general procedure also applies to Windows users, but the input commands might be different.
Installations for Python 3, MongoDB shell, and Robo 3T are fairly straightforward, so they are not covered.
After installing Python 3, run the command below inside your terminal/command line to check it's available. Note that it is python3, not python.
python3 --version
Run the next command to check if pip3 is available.
pip3 --version
If you have further questions, follow the detailed guide here and replace every python and pip command with python3 and pip3.
Install virtualenv for Python 3 by running
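The same version check can be done from inside Python. This is a minimal sketch, assuming the 3.6.2 minimum listed in the requirements above; the helper name `meets_minimum` is hypothetical and not part of this project.

```python
import sys

# Minimum Python version taken from the requirements list above.
MINIMUM_PYTHON = (3, 6, 2)


def meets_minimum(version, minimum=MINIMUM_PYTHON):
    """Return True if a (major, minor, micro) version is at least the minimum."""
    return tuple(version[:3]) >= tuple(minimum)


# Report whether the interpreter running this script is recent enough.
print("Python version OK:", meets_minimum(sys.version_info))
```

Run it with python3 so that the interpreter being checked is the one the scraper will use.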
pip3 install virtualenv
Create a folder that will be used as a virtual environment for this project.
mkdir my-virtual-env
Initialize and activate the environment.
virtualenv my-virtual-env
source my-virtual-env/bin/activate
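To confirm that the environment is really active, a small check like the one below can help. It is only a sketch: the function name `in_virtual_environment` is made up for illustration, and it relies on the fact that older virtualenv releases set `sys.real_prefix` while newer ones make `sys.prefix` differ from `sys.base_prefix`.

```python
import sys


def in_virtual_environment():
    """Return True if the current interpreter runs inside a virtual environment."""
    # Older virtualenv releases expose sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv releases and the built-in venv change sys.prefix instead.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("Inside a virtual environment:", in_virtual_environment())
```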
Clone this project into the virtual environment folder, move into the cloned repository, and install the dev dependencies.
cd my-virtual-env
git clone https://github.com/wasetime/waseda-syllabus-scraper.git
cd waseda-syllabus-scraper
pip3 install -r requirements-dev.txt
Run the following command to check if MongoDB shell is available.
mongo --version
You should see output like MongoDB shell version v3.6.0.
Run the following command to start the daemon database process.
mongod
Remember that you will need to start the database before scraping so that the scraped data can be exported to MongoDB.
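One way to avoid starting a scrape without a running database is to test whether something is listening on the MongoDB port first. The sketch below uses only the standard library and assumes mongod runs locally on its default port, 27017; the helper name `is_port_open` is hypothetical.

```python
import socket

# mongod listens on port 27017 by default; adjust if you start it with --port.
MONGO_HOST = "localhost"
MONGO_PORT = 27017


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


print("mongod reachable:", is_port_open(MONGO_HOST, MONGO_PORT))
```

This only shows that a process accepts connections on that port, not that it is actually MongoDB, so treat it as a quick sanity check.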
This project automates Google Chrome to click on links and proceed to the next page of search results. Download the driver here and put it somewhere you like.
Inside search.py, you need to replace the original Chrome driver path with your own.
# Replace /Users/oscar/chromedriver with your own chrome driver path, e.g. /Users/myself/my-chrome-driver
self.driver = webdriver.Chrome('/Users/oscar/chromedriver')
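If you would rather not hard-code the path, one option is to look it up at run time. The following is a sketch, not the project's own approach: the environment variable name `CHROMEDRIVER_PATH` and the helper `resolve_chromedriver_path` are assumptions introduced for illustration.

```python
import os
import shutil


def resolve_chromedriver_path(env_var="CHROMEDRIVER_PATH"):
    """Return a chromedriver path from the environment, or from PATH, or None."""
    # Hypothetical environment variable; set it to wherever you saved the driver.
    candidate = os.environ.get(env_var)
    if candidate:
        return os.path.expanduser(candidate)
    # Fall back to searching the directories listed in PATH.
    return shutil.which("chromedriver")


print("chromedriver path:", resolve_chromedriver_path())
```

The returned value could then be passed to webdriver.Chrome in place of the literal path shown above, after checking that it is not None.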
Also, you can specify the courses of a particular semester and school you want to scrape.
# Change the target semester and school here.
target_semester = 'Fall'
target_school = 'All'
Finally, if needed, feel free to change the name of the output MongoDB collection inside settings.py.
# Change the name of the output collection here
MONGO_COLLECTION = "raw_2017F_courses"
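If you scrape several semesters, deriving the collection name from the target can keep names consistent. The sketch below assumes the pattern suggested by the example name above (raw_, the year, the first letter of the semester, then _courses); that pattern is inferred from the single example, and the helper `collection_name` is hypothetical.

```python
def collection_name(year, semester):
    """Build a collection name such as raw_2017F_courses from a year and semester."""
    cleaned = semester.strip()
    if not cleaned:
        raise ValueError("semester must not be empty")
    # Use the first letter of the semester, e.g. 'Fall' -> 'F'.
    initial = cleaned[0].upper()
    return "raw_{}{}_courses".format(year, initial)


# Example: the Fall 2017 target used in this guide.
MONGO_COLLECTION = collection_name(2017, "Fall")
print(MONGO_COLLECTION)
```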
Type the following command inside your terminal
python3 run_search.py
A new Google Chrome icon should appear. Open it, and it should display "Chrome is being controlled by automated test software." Depending on the target you selected, the scraping process may take a few minutes.
After scraping finishes, you can deactivate the virtual environment by typing the following command
deactivate
You can use mongo shell (pure CLI) or Robo 3T (provides a great GUI) to verify that the data you are interested in was scraped and stored correctly in MongoDB.
The Waseda syllabus database only provides data related to courses. To obtain classroom and building information, we have to extract and group it into separate collections. This can be done using MongoDB's Aggregation Framework.
This project contains an aggregate.js file that helps automate the entire aggregation process. However, it is necessary to change some variables inside it before starting.
Currently, there is no written guide for this section, but you can follow the comments in aggregate.js to tweak and customize your own aggregation process.
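To give a rough idea of the kind of grouping described above, here is a sketch of an aggregation pipeline written as plain Python data. It is not taken from aggregate.js: the field names `occurrences`, `building`, `classroom`, and `title`, and the output collection name `classrooms`, are assumptions about the document shape made for illustration. The stages themselves ($unwind, $group, $addToSet, $out) are standard MongoDB operators.

```python
def build_classroom_pipeline(occurrence_field="occurrences",
                             output_collection="classrooms"):
    """Return a pipeline that groups courses by building and classroom."""
    prefix = "$" + occurrence_field
    return [
        # Emit one document per element of the (assumed) occurrences array.
        {"$unwind": prefix},
        # Group by building and classroom, collecting distinct course titles.
        {"$group": {
            "_id": {
                "building": prefix + ".building",
                "classroom": prefix + ".classroom",
            },
            "courses": {"$addToSet": "$title"},
        }},
        # Write the grouped result into a separate collection.
        {"$out": output_collection},
    ]


pipeline = build_classroom_pipeline()
for stage in pipeline:
    print(stage)
```

With a driver such as pymongo, a list like this could be passed to the aggregate method of the raw courses collection. Check the real field names in your scraped documents first, since the ones used here are placeholders.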
Type the following command inside your terminal to start using mongo shell and load the aggregation script.
mongo
load("/path/to/aggregation/script.js")
It should return true if the aggregation is successful.
If you have obtained the desired results, congratulations! If you ran into trouble while scraping or aggregating the data, feel free to submit an issue. :)
- Python3 - The language used.
- Scrapy - The scraping framework used.
- Selenium - The browser automation framework used.
- MongoDB - The database used for storing results.
Submit an issue or a pull request!
- Oscar Wang - Initial work - OscarWang114
- Taihei Sato - Add a new url - tsato815
This project is licensed under the MIT License. See the LICENSE.md file for details.