
A Web Crawler Example in Node.js and Puppeteer for Code4Lib BC 2021


Puppeteer Web Crawler Example - Code4Lib BC 2021

Example scripts for crawling web pages using Node.js, puppeteer, and puppeteer-cluster. Built for Code4Lib BC, 2021.

If a page mentions the word 'repository,' you have to drink more coffee!
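The rule above amounts to counting word occurrences in a page's text. Here is a minimal sketch of one way to do that, assuming the page text has already been extracted as a string. The `countMentions` helper is hypothetical and is not taken from `crawler.js`, which may count differently.

```javascript
// Hypothetical helper (not from crawler.js): count whole-word,
// case-insensitive occurrences of a word in a block of text.
function countMentions(text, word) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`\\b${escaped}\\b`, 'gi');
  const matches = text.match(pattern);
  return matches ? matches.length : 0;
}

const sample = 'The Repository holds a repository of repositories.';
console.log(countMentions(sample, 'repository')); // 2 ("repositories" is not counted)
```

Matching on word boundaries means plurals such as "repositories" are skipped. Whether those should earn a coffee is a judgment call.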

Note: This is a simplified example with some issues that should be ironed out before you try to do anything serious with it. For example, it has no special handling for query strings or anchors, and it does a poor job of checking whether it has already crawled a particular URL. As a result, the same page is likely to be visited more than once, so drink counts are probably maximized.
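One possible way to address the issues noted above is to normalize each URL before checking whether it has been seen. The sketch below uses only the standard `URL` class built into Node.js. The `normalizeUrl` and `shouldCrawl` helpers and the `visited` set are hypothetical names and are not part of this repository's scripts.

```javascript
// Hypothetical helper: reduce a URL to a canonical string so that
// trivially different links to the same page compare equal.
function normalizeUrl(rawUrl, baseUrl) {
  // Resolve relative links against the page they were found on.
  const url = new URL(rawUrl, baseUrl);
  // Anchors (#section) point into the same page, so drop them.
  url.hash = '';
  // Sort query parameters so ?b=2&a=1 and ?a=1&b=2 are treated alike.
  url.searchParams.sort();
  return url.href;
}

// Normalized URLs that have already been queued or crawled.
const visited = new Set();

// Returns true the first time a URL is seen, false on repeats.
function shouldCrawl(rawUrl, baseUrl) {
  const key = normalizeUrl(rawUrl, baseUrl);
  if (visited.has(key)) return false;
  visited.add(key);
  return true;
}

console.log(normalizeUrl('https://example.org/a?b=2&a=1#top'));
// https://example.org/a?a=1&b=2
```

This assumes that query parameter order never matters and that anchors never change page content. That holds for most sites, but not for single-page apps that route on the fragment.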

Usage:

  1. Install node dependencies:

npm install

  2. Run script(s) with node:

node ./crawler.js

Documentation:

Puppeteer

Documentation: https://pptr.dev/

GitHub (with examples): https://github.com/puppeteer/puppeteer

Puppeteer-cluster

GitHub: https://github.com/thomasdondorf/puppeteer-cluster