A simple way to scrape websites: download the HTML once, then process it as many times as you want.
To get going, just clone/fork this repo (or use it as a GitHub template):

```sh
git clone git@github.com:benwinding/scrape-reduce.git
cd scrape-reduce
npm install
```
`npm run scrape`

- Scrapes HTML from the target website
- The HTML returned is saved to the `scraped` directory
- Runs the `scrape.ts` in the `src` directory
  - Provide your own fetch method etc...
- This caches based on the ID provided for each page
- Requests are limited to `3` concurrent requests, by default
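The scrape step above (cache each page by its ID, cap the number of in-flight requests) can be sketched roughly as follows. All names here (`withConcurrency`, `scrapeAll`, the in-memory cache standing in for the `scraped` directory) are illustrative assumptions, not the repo's actual API.

```typescript
// Sketch of the scrape step: cache each page's HTML by ID and
// limit how many fetches run at once. Names are illustrative.

type Page = { id: string; url: string };

// In-memory stand-in for the scraped/ directory: id -> HTML.
const cache = new Map<string, string>();

// Run async tasks with at most `limit` in flight at a time.
async function withConcurrency<T>(
  tasks: (() => Promise<T>)[],
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  // Start `limit` workers that pull tasks off the shared queue.
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker)
  );
  return results;
}

async function scrapeAll(
  pages: Page[],
  fetchHtml: (url: string) => Promise<string>
): Promise<string[]> {
  const tasks = pages.map((p) => async () => {
    const hit = cache.get(p.id);
    if (hit !== undefined) return hit; // cached: skip the network
    const html = await fetchHtml(p.url);
    cache.set(p.id, html);
    return html;
  });
  return withConcurrency(tasks, 3); // default of 3 concurrent requests
}
```

Because the cache is keyed by ID, calling `scrapeAll` a second time returns the saved HTML without touching the network.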
`npm run reduce`

- Transforms the local HTML into whatever you need
- The text returned is saved to the `reduced` directory
- Runs the `reduce.ts` in the `src` directory
  - You can read the DOM here and find elements etc...
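A reduce step is just a pure function from saved HTML to the text you want, so it can be re-run endlessly with no network calls. The real `reduce.ts` can use a proper DOM parser to find elements; the sketch below uses a plain regex only to stay dependency-free, and `reduceTitle` is a hypothetical example, not the repo's code.

```typescript
// Sketch of a reduce step: turn already-scraped HTML into text.
// A real reduce.ts would likely parse the DOM; this regex version
// is only a dependency-free illustration.

function reduceTitle(html: string): string {
  const match = html.match(/<title>([^<]*)<\/title>/i);
  return match ? match[1].trim() : "";
}
```

For example, `reduceTitle("<title> Hello </title>")` yields `"Hello"`, and an HTML document with no title reduces to an empty string.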
- Avoids downloading too often: only scrape when you need to
- Caching means a scrape can be interrupted and resumed
- You can iterate quickly on the reduce step, with no network calls to the target site