
A periodic web crawler to download course data from Oscar and make it available for the GT Scheduler application


GT Schedule Crawler

A periodic web crawler to feed course data into GT Scheduler.

Sample: 202008.json

To report a bug or request a new feature, please create a new Issue in the GT Scheduler website repository.

📃 License & Copyright Notice

This work is a derivative of the original and spectacular GT Schedule Crawler project created by Jinseo Park (as a part of the overall GT Scheduler project). The original work and all modifications are licensed under the AGPL v3.0 license.

Original Work

Copyright (c) 2020 Jinseo Park (parkjs814@gmail.com)

Modifications

Copyright (c) 2020 the Bits of Good "GT Scheduler" team

🔍 Overview

The crawler is a command-line application written in TypeScript (a typed superset of JavaScript) that runs using Node.js to crawl schedule data from Oscar (Georgia Tech's registration management system).

It operates as a series of steps that run one after another (see src/index.ts) for each current "term" (a combination of year and semester, e.g. Fall 2021).
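The step-by-step flow can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code in src/index.ts: the step names (`download`, `parse`, `write`), the `TermData` shape, and the `crawlTerm`/`crawlAll` helpers are all hypothetical, and the placeholder steps only record their own names.

```typescript
// Hypothetical sketch of a step pipeline; none of these names come from src/index.ts.
type TermData = { term: string; log: string[] };
type Step = (data: TermData) => Promise<TermData>;

// Builds a placeholder step that only records its own name.
const makeStep = (name: string): Step => async (data) => ({
  ...data,
  log: [...data.log, name],
});

const steps: Step[] = [makeStep("download"), makeStep("parse"), makeStep("write")];

// Runs every step in order for a single term, feeding each result into the next step.
async function crawlTerm(term: string): Promise<TermData> {
  let data: TermData = { term, log: [] };
  for (const step of steps) {
    data = await step(data);
  }
  return data;
}

// Processes each term independently of the others.
async function crawlAll(terms: string[]): Promise<TermData[]> {
  return Promise.all(terms.map((term) => crawlTerm(term)));
}
```

Calling `crawlAll(["202008"])` would resolve to one `TermData` object whose `log` lists the steps in the order they ran.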

The prerequisites data for each course comes from Oscar as a string such as "Undergraduate Semester level CS 2340 Minimum Grade of C and Undergraduate Semester level LMC 3432 Minimum Grade of C" (and can become much more complex). To process it, the crawler uses an ANTLR grammar and generated parser to convert the string into a normalized tree structure. The grammar itself and the generated parser/lexer code can be found in the src/steps/prereqs/grammar folder.
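To illustrate what "a normalized tree structure" can mean for such a string, here is a small hand-written recursive-descent parser. It is a sketch only and is not the crawler's ANTLR-based implementation: the `Course`/`Group` tree shape, the token pattern, and the assumption that "and" binds more tightly than "or" are all made up for this example, and real Oscar strings contain forms it does not handle.

```typescript
// Hypothetical tree shape; the crawler's real output format may differ.
type Course = { id: string; grade: string };
type Group = { op: "and" | "or"; operands: Prereq[] };
type Prereq = Course | Group;

type Token =
  | { kind: "open" | "close" | "and" | "or" }
  | { kind: "course"; course: Course };

// Splits the raw text into parentheses, operators, and course clauses.
function tokenize(text: string): Token[] {
  const pattern =
    /\(|\)|\b(?:and|or)\b|level\s+([A-Z]+)\s+(\w+)\s+Minimum Grade of\s+([A-Z])/g;
  const tokens: Token[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const word = match[0];
    if (word === "(") tokens.push({ kind: "open" });
    else if (word === ")") tokens.push({ kind: "close" });
    else if (word === "and" || word === "or") tokens.push({ kind: word });
    else {
      const course = { id: `${match[1]} ${match[2]}`, grade: match[3] };
      tokens.push({ kind: "course", course });
    }
  }
  return tokens;
}

function parsePrereqs(text: string): Prereq | null {
  const tokens = tokenize(text);
  let pos = 0;

  // operand := course | "(" expr ")"
  function operand(): Prereq {
    const token = tokens[pos++];
    if (token === undefined) throw new Error("unexpected end of input");
    if (token.kind === "course") return token.course;
    if (token.kind === "open") {
      const inner = expr("or");
      if (tokens[pos++]?.kind !== "close") throw new Error("expected ')'");
      return inner;
    }
    throw new Error(`unexpected token: ${token.kind}`);
  }

  // expr("or")  := expr("and") ("or" expr("and"))*
  // expr("and") := operand ("and" operand)*
  function expr(op: "and" | "or"): Prereq {
    const next = (): Prereq => (op === "or" ? expr("and") : operand());
    const operands: Prereq[] = [next()];
    while (tokens[pos]?.kind === op) {
      pos++;
      operands.push(next());
    }
    // A single operand needs no wrapping group.
    return operands.length === 1 ? operands[0] : { op, operands };
  }

  if (tokens.length === 0) return null;
  const tree = expr("or");
  if (pos !== tokens.length) throw new Error("unexpected trailing tokens");
  return tree;
}
```

For the example string quoted above, `parsePrereqs` yields a single "and" group containing CS 2340 and LMC 3432, each with a minimum grade of C.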

The crawler runs every 30 minutes via a GitHub Action workflow, which publishes the resulting JSON to the gh-pages branch, where the frontend app can download it: https://gt-scheduler.github.io/crawler/202008.json.
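A consumer could download a term's published file along these lines. This is a hedged sketch, not the frontend's actual code: the base URL is taken from the example link above, while `termUrl`, `fetchTerm`, and the assumption that every term code is six digits (as in 202008) are illustrative. It also assumes a runtime with a global `fetch`, such as a browser or Node.js 18+.

```typescript
// Base URL taken from the example link above; the helpers below are hypothetical.
const BASE_URL = "https://gt-scheduler.github.io/crawler";

// Builds the download URL for a term code such as "202008".
// Assumption: term codes are always six digits, like the sample file name.
function termUrl(term: string): string {
  if (!/^\d{6}$/.test(term)) {
    throw new Error(`invalid term code: ${term}`);
  }
  return `${BASE_URL}/${term}.json`;
}

// Downloads and parses the JSON for one term.
// The result is typed as unknown because the file's schema is not described here.
async function fetchTerm(term: string): Promise<unknown> {
  const response = await fetch(termUrl(term));
  if (!response.ok) {
    throw new Error(`request failed with status ${response.status}`);
  }
  return response.json();
}
```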

🚀 Running Locally

Requirements

  • Node.js (any recent version will probably work)
  • The yarn package manager, version 1 (support for version 2 is untested)

Running the crawler

After cloning the repository to your local computer, run the following command in the repo folder:

yarn install

This may take a couple of minutes and will create a new folder called node_modules containing all of the installed dependencies. It only needs to be run once.

Then, to run the crawler, run:

yarn start

After the crawler runs, a series of JSON files should have been created in a new data directory in the project root.

Utilizing structured logging

By default, the crawler outputs standard log lines to the terminal in development. However, it also supports outputting structured JSON log events that can be more easily parsed and analyzed when debugging. This is turned on by default when the crawler is running in a GitHub Action (where the LOG_FORMAT environment variable is set to json), but it can also be enabled for development.
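The difference between the two output modes can be sketched as below. This is an illustration rather than the crawler's real logger: `formatLog`, `log`, the `Fields` type, and the exact text layout are assumptions; only the LOG_FORMAT variable and its `json` value come from the description above.

```typescript
// Hypothetical logger sketch; the crawler's real logging code may differ.
type Fields = Record<string, unknown>;

// Renders one log event either as a JSON object or as a plain text line.
function formatLog(message: string, fields: Fields, format: string | undefined): string {
  if (format === "json") {
    // One self-contained JSON object per line, easy to parse with tools such as jq.
    return JSON.stringify({ message, ...fields });
  }
  // Plain text for humans: the message followed by key=value pairs.
  const pairs = Object.keys(fields).map((key) => `${key}=${String(fields[key])}`);
  return [message, ...pairs].join(" ");
}

// Picks the format from the LOG_FORMAT environment variable on every call.
function log(message: string, fields: Fields = {}): void {
  console.log(formatLog(message, fields, process.env.LOG_FORMAT));
}
```

With LOG_FORMAT set to json, `log("finished term", { term: "202008" })` would print a single JSON object; otherwise it would print a plain text line with the same information.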

The utility script yarn start-logged can be used to run the crawler and output JSON log lines to a logfile in the current working directory:

yarn start-logged

To analyze the JSON log lines, I recommend using jq, a powerful tool for parsing and analyzing JSON in the shell. The following command reads all lines in the latest log file and loads them as one large array for further processing. Note: this command will probably only work on Unix-like systems (Linux and probably macOS), so your mileage may vary. If you run into issues, try running it on a Linux computer and make sure you have jq installed:

cat $(find . -type f -name "*.log" | sort -n | tail -1) | jq -cs '.'
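If jq is not available, the same "load every line into one array" idea can be sketched in TypeScript. The helpers below are hypothetical and make no assumptions about which fields the crawler's log events contain, which is why `countBy` takes the field name as a parameter.

```typescript
// Parses newline-delimited JSON (one event per line) into an array, skipping blank lines.
function parseLogLines(contents: string): unknown[] {
  return contents
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Counts how many events share each value of the given field.
// Events that are not objects, or that lack the field, are counted under "(missing)".
function countBy(events: unknown[], field: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const event of events) {
    const value =
      typeof event === "object" && event !== null && field in event
        ? String((event as Record<string, unknown>)[field])
        : "(missing)";
    counts[value] = (counts[value] ?? 0) + 1;
  }
  return counts;
}
```

Pass the contents of a log file (read with any file API) to `parseLogLines`, then summarize the resulting array with `countBy`.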

For some useful queries on the log data, see 📚 Useful queries on crawler logs.

Using the Python Finals Data Scraper

First, ensure Python 3.9 or newer is installed. Then, install the necessary Python modules with the included requirements.txt file:

pip install -r requirements.txt

Run the reviser to augment the previously scraped data with the new finals data:

python ./src/Revise.py

The JSON files in the data folder will now contain updated information regarding the finals date and time.

More information can be found here

Updating the list of finals PDFs

The Registrar publishes a PDF with the finals schedule at the start of each semester. The page with the PDF for the Fall 2022 semester can be found here.

The matrix.json file contains a mapping from each term to its PDF file.
The key is one of the terms identified by the scraper here.
The value is the direct address of the PDF file, such as this.

This mapping needs to be updated each semester when a new schedule is posted.
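As a rough illustration of the mapping's shape, the sketch below models it as a flat object from term code to PDF address and flags entries that look malformed. The real matrix.json may be structured differently; the `FinalsMatrix` type, the `invalidEntries` helper, the six-digit term-code check, and the example.com address are all assumptions made for this example.

```typescript
// Assumed shape: a flat object from term code to the PDF's direct address.
type FinalsMatrix = Record<string, string>;

// Returns the keys whose term code or address does not look valid.
// Assumptions: term codes are six digits (like 202008) and addresses are http(s) URLs.
function invalidEntries(matrix: FinalsMatrix): string[] {
  return Object.keys(matrix).filter(
    (term) => !/^\d{6}$/.test(term) || !/^https?:\/\/\S+$/.test(matrix[term])
  );
}

// Placeholder data only; this is not a real Registrar address.
const exampleMatrix: FinalsMatrix = {
  "202208": "https://example.com/finals-fall-2022.pdf",
};
```

A check like this could be run after each semester's update to catch a mistyped key or address before the scraper uses the file.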

More information can be found on the wiki.

Linting

The project uses pre-commit hooks (via Husky and lint-staged) to run linting (via ESLint) and formatting (via Prettier). Both can also be run manually from the command line to lint or format the code on demand, using the following commands:

  • yarn run lint - runs ESLint and reports all linting errors without fixing them
  • yarn run lint:fix - runs ESLint and reports all linting errors, attempting to fix any auto-fixable ones
  • yarn run format - runs Prettier and automatically formats the entire codebase
  • yarn run format:check - runs Prettier and reports formatting errors without fixing them

👩‍💻 Contributing

The GT Scheduler project welcomes (and encourages) contributions from the community. Regular development is performed by the project owners (Jason Park and Bits of Good), but we still encourage others to work on adding new features or fixing existing bugs and make the registration process better for the Georgia Tech community.

More information on how to contribute can be found in the contributing guide.