org-syn-scraper

A simple scraper for the website orgsyn.org

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

OrgSynScraper

This repository contains a Python script that lets you scrape all PDF links from the website http://orgsyn.org/ and download the PDF files.

Some background on the project is available in this post on my blog.

Example usage:

Dumping only the links of a specific volume, for example volume 42:

./org_syn_scraper.py dump_links --volume=42 --links-only

Dumping the links and additional information of a specific volume:

./org_syn_scraper.py dump_links --volume=60

This returns a JSON array of objects with the following keys:

Key            Description
annual_volume  The annual volume containing the document
page           The page of the document in the annual volume
name           The name of the procedure described by the document
aliases        An array of alternative names for the procedure described by the document
slug           A slug generated from the name of the procedure that can be used as a file name
url            The URL of the PDF document
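
The keys above can be consumed from Python once the dump has been captured as text. The following is only a minimal sketch: it assumes the output of dump_links is a single JSON array as described, the helper name plan_downloads and the file naming scheme are hypothetical, and the sample record is made up for illustration rather than taken from the site.

```python
import json


def plan_downloads(dump_text):
    """Map a target file name to the PDF URL for each record in the dump."""
    records = json.loads(dump_text)
    plan = {}
    for record in records:
        # Assumption: volume and page are prepended to the slug so that two
        # procedures with the same slug do not overwrite each other.
        file_name = f"{record['annual_volume']}_{record['page']}_{record['slug']}.pdf"
        plan[file_name] = record["url"]
    return plan


# Hypothetical record; every value here is a placeholder.
sample = json.dumps([
    {
        "annual_volume": 60,
        "page": 1,
        "name": "Example Procedure",
        "aliases": [],
        "slug": "example_procedure",
        "url": "http://orgsyn.org/placeholder.pdf",
    }
])

print(plan_downloads(sample))
```

In practice the text passed to plan_downloads would be the captured output of the dump_links command rather than a hand-written sample.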

Downloading all files into the directory output:

./org_syn_scraper.py download output

Downloading all files of volume 96 into the directory volume_96:

./org_syn_scraper.py download --volume=96 volume_96
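
To fetch several volumes into one directory each, the download command can be driven from a small Python wrapper. This is a sketch under stated assumptions: it uses only the download subcommand and the --volume option shown above, the helper names build_download_command and download_volumes are hypothetical, and it assumes the script is executable from the current directory.

```python
import subprocess


def build_download_command(volume, directory, script="./org_syn_scraper.py"):
    """Build the argument list for downloading one volume into a directory."""
    return [script, "download", f"--volume={volume}", directory]


def download_volumes(volumes, run=subprocess.run):
    """Run the download command once per volume, stopping on the first failure."""
    for volume in volumes:
        # Assumption: one directory per volume, named volume_<number>.
        command = build_download_command(volume, f"volume_{volume}")
        run(command, check=True)


# Example call (not executed here): download_volumes([95, 96])
```

Passing the run function as a parameter keeps the wrapper easy to try out without touching the network, since a stand-in that only records the commands can be supplied instead of subprocess.run.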