Custom Web Scraper

This is a one-off project to scrape data from the web, built for a hospitality and entertainment product and documented here for posterity and discussion. Specific URLs and proprietary details are excluded from this repository.

Technologies

The scraper uses a mix of technologies, selected for expedience and utility: Make, Bash, cURL, awk, Python 3, and jq.

Overview

The scraper runs in a series of stages. Each stage takes an input and generates an output. Outputs are cached on the filesystem, and the stages are invoked through a Makefile.
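
The Makefile treats each stage's output file as a build target, so a stage is only re-run when its cached output is missing (Make additionally checks timestamps to detect stale outputs). The repository relies on Make for this rather than any helper code, but the idea can be sketched in Python roughly as follows; the function and file names are illustrative only:

```python
from pathlib import Path

def run_stage(output_path: str, action) -> Path:
    """Run a stage only if its cached output is not already on the filesystem."""
    out = Path(output_path)
    if not out.exists():
        out.write_text(action())  # the output file doubles as the stage's cache
    return out

# e.g. run_stage("stage2_urls.txt", extract_urls) would skip extract_urls()
# entirely on a second invocation, because the cached file already exists.
```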

| Stage | Input | Action | Output |
|-------|-------|--------|--------|
| 1 | secret | Bash scripts use cURL to query a list of URLs | A list of indexed Location headers |
| 2 | stage 1 output | awk extracts the URL from each Location header | A list of indexed URLs |
| 3 | stage 2 output | Python iterates through the list and caches URL content | Directory of .gz files named by index value |
| 4 | stage 3 output | Python iterates through the cached .gz files and applies regexes for fields of interest | Directory of JSON files named by index |
| 5 | stage 4 output | Bash and jq filter the JSON files according to tuned selection criteria | A file listing the indexes relevant to the search |
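
Stages 1 and 2 are Bash, cURL, and awk in the actual pipeline: cURL issues the queries and records the Location response headers, and awk pulls the redirect URL out of each header. A rough Python rendering of the same idea is below; the Authorization header and Bearer scheme are assumptions, since the real secret handling is not documented here.

```python
import urllib.request
import urllib.error

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None stops urllib from following the redirect, so the 3xx
    # response (and its Location header) is surfaced instead.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

_opener = urllib.request.build_opener(_NoRedirect)

def location_for(url: str, secret: str) -> str | None:
    """Stages 1-2 in spirit: query a URL and return its Location header."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {secret}"})
    try:
        resp = _opener.open(req)
    except urllib.error.HTTPError as err:
        resp = err  # a 3xx response lands here because redirects are disabled
    return resp.headers.get("Location")
```

In the real pipeline the results are written out as an indexed list, one entry per queried URL, which becomes the input to stage 3.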
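Stage 3 is one of the two Python stages: it walks the indexed URL list and caches each page body as a gzip file named after its index. A minimal sketch, assuming the list is already in memory as index-to-URL pairs (directory and file names are illustrative):

```python
import gzip
import urllib.request
from pathlib import Path

def cache_pages(indexed_urls: dict[int, str], cache_dir: str = "cache") -> None:
    """Stage 3: fetch each URL and store its body as <index>.gz on disk."""
    Path(cache_dir).mkdir(exist_ok=True)
    for index, url in indexed_urls.items():
        target = Path(cache_dir, f"{index}.gz")
        if target.exists():
            continue  # already cached by a previous run
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        with gzip.open(target, "wb") as out:
            out.write(body)
```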
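Stage 4 is the other Python stage: each cached .gz file is decompressed and scanned with regular expressions, and the matches are written to a JSON file with the same index. The patterns below are placeholders; the real field names and expressions are proprietary and excluded from the repository.

```python
import gzip
import json
import re
from pathlib import Path

# Placeholder patterns; the production regexes and field names are proprietary.
FIELD_PATTERNS = {
    "title": re.compile(r"<title>(.*?)</title>", re.S),
    "price": re.compile(r'data-price="([^"]+)"'),
}

def extract_fields(cache_dir: str = "cache", out_dir: str = "json") -> None:
    """Stage 4: regex each cached page and write the matches as <index>.json."""
    Path(out_dir).mkdir(exist_ok=True)
    for gz_path in Path(cache_dir).glob("*.gz"):
        with gzip.open(gz_path, "rt", errors="replace") as fh:
            page = fh.read()
        fields = {}
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.search(page)
            fields[name] = match.group(1) if match else None
        Path(out_dir, f"{gz_path.stem}.json").write_text(json.dumps(fields))
```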
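Stage 5 is Bash and jq in the actual pipeline, and the tuned selection criteria are deliberately not published. Purely to illustrate the shape of that step, a Python equivalent might filter the JSON files and collect the passing indexes like this; the price threshold is a made-up stand-in for the real criteria.

```python
import json
from pathlib import Path

def select_indexes(json_dir: str = "json", min_price: float = 50.0) -> list[str]:
    """Stage 5 in spirit: keep only the indexes whose fields pass the criteria."""
    keep = []
    for path in sorted(Path(json_dir).glob("*.json")):
        fields = json.loads(path.read_text())
        try:
            passes = float(fields.get("price")) >= min_price  # stand-in criterion
        except (TypeError, ValueError):
            passes = False
        if passes:
            keep.append(path.stem)  # the file name is the index
    return keep
```

In the real pipeline the surviving indexes end up in a single output file, which is the final artifact of the scrape.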