A collection of Scrapy (Python) spiders for scraping Singapore legal materials from the web.
Description | Project | Spider | Status | Features | Related Website |
---|---|---|---|---|---|
PDPC Decisions | pdpcSpider | PDPCCommissionDecisions | Works | List of all decisions in JSON; downloads PDF decisions | https://www.pdpc.gov.sg/All-Commissions-Decisions |
So you want to get scraping ASAP?
Clone this repository.
git clone https://github.com/houfu/zeekerscrapers.git
Install the dependencies in a Python virtual environment (I use Poetry).
cd zeekerscrapers
poetry install
Change directory to a project (e.g. pdpcSpider).
cd pdpcSpider
Run a spider using the scrapy command line tool, specifying an output file if desired.
scrapy crawl PDPCCommissionDecisions -o output.csv
Watch what you have wrought.
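Once the crawl finishes, a quick way to check the result is to load the exported file in Python. The snippet below is a minimal sketch, assuming the `output.csv` file name from the command above; `load_rows` is a hypothetical helper, and no particular column names are assumed because the spider's fields are not listed here.

```python
import csv
from pathlib import Path


def load_rows(path):
    """Read a CSV feed export into a list of dicts keyed by column name."""
    # newline="" lets the csv module handle quoted fields with embedded newlines.
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    output = Path("output.csv")  # assumed: the file passed to `-o` above
    if output.exists():
        rows = load_rows(output)
        print(f"{len(rows)} items scraped")
        if rows:
            # Show which fields the spider exported.
            print("columns:", ", ".join(rows[0].keys()))
    else:
        print("output.csv not found; run the spider first")
```

Reading the header row rather than hard-coding field names keeps the check usable across different spiders.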
(Scrapy has many settings you can adjust, such as download delays and concurrency limits. If you plan to make heavy use of these spiders, please scrape responsibly.)
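As an illustration of what responsible settings might look like, here is a sketch of lines that could go in a project's `settings.py`. The setting names are standard Scrapy settings, but the values are assumptions chosen for the example, not this repository's defaults.

```python
# Illustrative politeness settings for a Scrapy project's settings.py.
# The values below are assumptions for this example, not the project's defaults.

ROBOTSTXT_OBEY = True                # respect the target site's robots.txt
DOWNLOAD_DELAY = 2.0                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain

# AutoThrottle adjusts the delay based on how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0

# Cache responses locally so repeated development runs do not re-fetch pages.
HTTPCACHE_ENABLED = True
```

Settings can also be overridden for a single run with Scrapy's `-s NAME=VALUE` option, for example `scrapy crawl PDPCCommissionDecisions -s DOWNLOAD_DELAY=2`.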
MIT License, Copyright 2022 Ang Hou Fu
This project uses pytest for testing.
You can run the tests for all scrapers by running the pytest command in the root directory.
You are welcome to contribute new Spiders or new features to existing spiders. Please open an Issue in the repository and let's work something out!