This repo allows you to do two things:
- scrape Open Music Theory to store page titles, links to assignments, and download PDFs of assignments
- create a workbook by combining the separate PDFs into one PDF
This project requires Python 3.9 or newer, and uses Poetry for dependency management. If you haven't already done so, install Poetry according to the installation instructions.
Once Poetry is installed, run the command poetry install in the directory containing the pyproject.toml file.
The Python library Scrapy does the actual scraping. The spider loads the URL specified in omt/omt/spiders/omt_spider.py (which is https://viva.pressbooks.pub/openmusictheory/part/fundamentals/) and performs the following actions:
- stores the page title and assignments section, and downloads all PDF files linked in the assignments section
- finds the link to the next page, and performs the step above on the next page if it exists
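The two steps above can be illustrated with a minimal sketch that uses only the standard library. This is not the repo's Scrapy spider: the helper names (PageLinkParser, parse_page) are hypothetical, and the sketch assumes that assignment files are plain links ending in .pdf and that the next-page link carries rel="next". It also collects every PDF link on the page rather than only those in the assignments section.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageLinkParser(HTMLParser):
    """Collect the page title, PDF links, and next-page link (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.pdf_links = []
        self.next_url = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            url = urljoin(self.base_url, attrs["href"])
            # Assumption: assignment files are links ending in .pdf.
            if url.lower().endswith(".pdf"):
                self.pdf_links.append(url)
            # Assumption: the next-page link is marked rel="next".
            if attrs.get("rel") == "next":
                self.next_url = url

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def parse_page(html, base_url):
    """Return the title, PDF links, and next-page URL found in one page."""
    parser = PageLinkParser(base_url)
    parser.feed(html)
    return {
        "title": parser.title,
        "pdf_links": parser.pdf_links,
        "next_url": parser.next_url,
    }
```

A crawl would call parse_page on each downloaded page and continue with next_url until it is None.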
To start scraping:
- make sure any PDFs in the omt/assignment_pdfs/assignment_pdfs directory are moved or deleted
- navigate to the omt directory with the terminal command cd omt
- run the command scrapy runspider omt/spiders/omt_spider.py
The actual scraping takes a while, as delays between page loads and file downloads are built into the spider to avoid overloading the server.
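One common way to get this kind of delay in Scrapy is through per-spider settings. The sketch below is an assumption about how such throttling could be configured, not a copy of what omt_spider.py does. The setting names are standard Scrapy settings, but the values and the POLITE_SETTINGS name are illustrative.

```python
# Hypothetical settings dict; in a Scrapy spider it would be assigned to the
# class attribute `custom_settings`.
POLITE_SETTINGS = {
    # Wait about this many seconds between requests to the same site.
    "DOWNLOAD_DELAY": 2,
    # Vary the delay randomly around DOWNLOAD_DELAY.
    "RANDOMIZE_DOWNLOAD_DELAY": True,
    # Send one request at a time to each domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Let Scrapy adapt the delay to the server's response times.
    "AUTOTHROTTLE_ENABLED": True,
}
```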
The scraping process creates a CSV file called assignments_{datetime}.csv containing all page data, and stores all downloaded PDFs in the omt/assignment_pdfs/assignment_pdfs directory.
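As a rough illustration of the timestamped file name, the sketch below writes page rows to a CSV with the standard library. The timestamp format, the column names, and the helper names (csv_name, write_rows) are assumptions made for the example. The repo may produce its CSV differently, for instance through Scrapy's own feed exports.

```python
import csv
from datetime import datetime
from pathlib import Path

# Hypothetical column names; the real CSV may use different ones.
FIELDNAMES = ["title", "url", "pdf_links"]


def csv_name(now=None):
    """Build a file name of the form assignments_{datetime}.csv."""
    now = now or datetime.now()
    # Assumption: this timestamp format is illustrative only.
    return f"assignments_{now:%Y-%m-%d_%H-%M-%S}.csv"


def write_rows(rows, out_dir="."):
    """Write one CSV row per scraped page and return the file path."""
    path = Path(out_dir) / csv_name()
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return path
```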
While still in the omt directory, run the command python combine_pdfs.py. This will create a PDF file titled omt_workbook.pdf in the omt/assignment_pdfs/joined directory.
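The combining step can be sketched as follows. This is not the contents of combine_pdfs.py: it assumes the pypdf library (which may not be what the repo uses), it assumes the PDFs are joined in file-name order, and the function names are hypothetical.

```python
from pathlib import Path


def pdfs_in_order(pdf_dir):
    """Return the PDFs in pdf_dir sorted by file name (assumed ordering)."""
    return sorted(Path(pdf_dir).glob("*.pdf"))


def combine_pdfs(pdf_dir, out_path):
    """Append every PDF in pdf_dir to one output file and return its path."""
    # Imported here so pdfs_in_order works without pypdf installed.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for pdf in pdfs_in_order(pdf_dir):
        # Append all pages of each source file to the output.
        writer.append(str(pdf))

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("wb") as f:
        writer.write(f)
    writer.close()
    return out_path
```

Assuming the paths are relative to the omt directory, a call might look like combine_pdfs("assignment_pdfs/assignment_pdfs", "assignment_pdfs/joined/omt_workbook.pdf").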