
Basil

Puppeteer web scraper bundled with a selection of scripts for a variety of use cases.


Installation

Clone the repo by your preferred method. From the root of the repo run:

npm install
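For example, a complete setup using git might look like the following, where the repository URL is a placeholder for wherever your copy of Basil lives:

git clone https://github.com/<owner>/basil.git
cd basil
npm install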

Usage

Here's how to manage configurations, run scripts, and handle output.

Configuration file

Basil runs Puppeteer scripts, driven by a configuration file, to scrape web data. The configuration file tells Basil:

  • how many browser instances to launch
  • how it should acquire a list of URLs to scrape
  • which script to run
  • what elements specific to that script it should look for

The configuration file comprises an array of JSON objects, each of which is a configuration that can be selected and run by passing its name as a parameter.

It is stored in: ./config.json
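At the top level the file is just that array, so a config.json holding two configurations looks roughly like this (the names and values are placeholders; both scripts shown take no parameters):

[
    {
        "configName": "first-config",
        "parallel": 4,
        "input": "input/file.csv",
        "script": { "name": "redirects", "params": [] }
    },
    {
        "configName": "second-config",
        "parallel": 4,
        "urlSitemap": "https://www.example.com/sitemap",
        "script": { "name": "cookiesAll", "params": [] }
    }
]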

The format of a single configuration is:

{
    "configName": "gtm-data-layer",
    "parallel": 8,
    "output": "output/file.csv",
    "input": "input/file.csv",
    "urlSitemap": "https://example.com/sitemap",
    "pageList": {
        "startUrl": "https://www.example.com/courses",
        "linkSelector": "::-p-xpath(//a[@class='m-snippet__link'])",
        "moreItems": "(//button[contains(@aria-label, 'Go to page') and .//span[contains(@class, 'chevron--right')]])[1]"
    },
    "scrollList": {
        "startUrl": "https://blog.justinmallone.com/tag/microblog/",
        "linkSelector": "a.post-card-content-link",
        "maxScrolls": 10
    },
    "script": {
        "name": "gtmDataLayer",
        "params": 
        [
            {"key": "containerID", "value": "GTM-YOURID"},
            {"key": "gtmAttributeName", "value": "pageID"}
        ]
    }
}

This comprises:

| Attribute name | Description | Optional / mandatory |
| --- | --- | --- |
| configName | String: name for the config, used to select which configuration to run | M |
| parallel | Integer: the number of Chrome instances to launch at once. An upper limit of 8 is recommended. | M |
| output | String: file path to save results. Default is output/webscrape.csv | O |
| input | String: file path of a single-column CSV of URLs to scrape | O |
| urlSitemap | String: URL of a sitemap to use as a list of URLs to scrape | O |
| pageList | Object: details of a paginated page with links to acquire for scraping. startUrl: page to begin on. linkSelector: DOM elements containing the link values to scrape. moreItems: element to click to move to the next page. | O |
| scrollList | Object: details of a page that uses lazy-load scrolling, with links to acquire for scraping. startUrl: page to begin on. linkSelector: DOM elements containing the link values to scrape. maxScrolls: maximum number of scroll attempts for more items. | O |
| script | Object: name of the script to run and parameters specific to that script | M |

At least one of input, urlSitemap, pageList, or scrollList must be present; any combination can be included, and the resulting list of unique URLs will be scraped.
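For reference, an input CSV is simply one URL per line. A minimal sketch, assuming no header row and using placeholder URLs:

https://www.example.com/
https://www.example.com/about-us
https://www.example.com/contact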

How to run a script

Execute:

npm run basil <configName>

This will instruct Basil to select the configuration with that name from ./config.json and run a web scrape using all the input sources the configuration specifies. The output is logged to STDOUT and, by default, written to output/webscrape.csv.
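For example, to run the count-headings configuration shown in the examples below:

npm run basil count-headings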

Available scripts

All scripts are located in scripts/

| Name | Purpose |
| --- | --- |
| checkForElement | Count instances of an element per page |
| cookiesAll | List all cookies downloaded by a page |
| findTextAnywhere | Report instances of text anywhere in a page |
| getElement | Report an attribute of a chosen element, by page |
| gtmDataLayer | Report instances of a Google Tag Manager data layer attribute, by page |
| matchLinkArray | Report instances of links that match an array of links, by page |
| multiElement | Report an attribute for all instances of an element, by page |
| redirects | Record the last redirect and status code for a supplied list of URLs |
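For orientation, the sketch below shows roughly the kind of per-page Puppeteer work a script like checkForElement performs. It is a hypothetical, standalone illustration rather than the repo's implementation, and the URL and selector are placeholders:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.example.com/', { waitUntil: 'networkidle2' });

    // Count matching elements; Puppeteer accepts XPath via the ::-p-xpath() prefix,
    // the same selector syntax used in the configuration examples in this README.
    const matches = await page.$$("::-p-xpath(//li/a[contains(@class, 'pill')])");
    console.log(`Found ${matches.length} matching element(s)`);

    await browser.close();
})();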

Script parameters

| Script name | Parameter name | Type | Description | Optional / mandatory |
| --- | --- | --- | --- | --- |
| checkForElement | element | string | Element selector to check for | M |
| cookiesAll | N/A | N/A | N/A | N/A |
| findTextAnywhere | regexPattern | string | Regular expression to match against page content | M |
| getElement | element | string | Element selector to check for | M |
| getElement | attribute | string | Attribute value to report | M |
| gtmDataLayer | containerID | string | Google Tag Manager container ID | M |
| gtmDataLayer | gtmAttributeName | string | Value to retrieve from the GTM data layer | M |
| matchLinkArray | links | array | Array of links to match | M |
| multiElement | element | string | Element selector to check for | M |
| multiElement | attribute | string | Attribute value to report | O |
| redirects | N/A | N/A | N/A | N/A |
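Of the scripts above, findTextAnywhere is the only one without a full example configuration below. A minimal script block for it might look like this (the pattern value is a placeholder):

"script": {
    "name": "findTextAnywhere",
    "params": [
        {"key": "regexPattern", "value": "privacy polic(y|ies)"}
    ]
}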

Example configurations

These examples are taken from ./sample-config.json

Count headings

{
    "configName": "count-headings",
    "parallel": 8,
    "input": "input/sitemap.csv",
    "urlSitemap": "https://www.example.com/sitemap",
    "script": {
        "name": "checkForElement",
        "params": [
            {
                "key": "element",
                "value": "//li/a[contains(@class, 'pill')]"
            }
        ]
    }
}

Description: Combine the URLs from a file called input/sitemap.csv and a sitemap at https://www.example.com/sitemap and report the number of instances of //li/a[contains(@class, 'pill')] per page.

All media URLs

{
    "configName": "All media URLs",
    "parallel": 8,
    "urlSitemap": "https://www.example.com/sitemap",
    "script": {
        "name": "multiElement",
        "params": 
        [
            {"key": "element", "value": "a[href*=\"https://www.example.com/-/media/\"]"}
        ]
    }
}

Description: Using the URLs in a sitemap at https://www.example.com/sitemap, report all elements per page matching the selector a[href*="https://www.example.com/-/media/"].

Redirects from list

{
    "configName": "redirectsFromList",
    "parallel": 8,
    "input": "input/short.csv",
    "listCrawl": {
        "startUrl": "https://www.example.com/list-of-links",
        "linkSelector": "::-p-xpath(//a[@class='m-snippet__link'])",
        "moreItems": "(//button[contains(@aria-label, 'Go to page') and .//span[contains(@class, 'chevron--right')]])[1]"
    },
    "script": {
        "name": "redirects",
        "params": []
    }
}

Description: Using a list of URLs that combines the input file input/short.csv with all the links at https://www.example.com/list-of-links matching the element selector ::-p-xpath(//a[@class='m-snippet__link']), report the last redirect and HTTP status code for each URL. Basil paginates through additional pages of links for as long as the 'more items' selector (//button[contains(@aria-label, 'Go to page') and .//span[contains(@class, 'chevron--right')]])[1] is found.

Match array of links

{
    "configName": "linkArray",
    "parallel": 8,
    "input": "input/file.csv",
    "script": {
        "name": "matchLinkArray",
        "params": [
        {
            "key": "links",
            "value": [
                "https://www.example.com/about-this-website",
                "https://www.example.com/study-here/apply",
                "https://www.example.com/about-us/who-we-are",
                "https://www.example.com/about-us/our-values/sustainability"
        ]}
        ]
    }
}

Description: Using a list of URLs from the file input/file.csv report all instances per page of links matching the given array.

Get headings from lazy loading link list

{
    "configName": "eg-lazyload",
    "parallel": 8,
    "input": "",
    "scrollList": {
        "startUrl": "https://blog.justinmallone.com/tag/microblog/",
        "linkSelector": "a.post-card-content-link",
        "maxScrolls": 10
    },
    "script": {
        "name": "getElement",
        "params": 
        [
            {"key": "element", "value": "//h1"},
            {"key": "attribute", "value": "innerText"}
        ]
    }
}

Description: Using a list of URLs extracted from the URL https://blog.justinmallone.com/tag/microblog/ matching element selector a.post-card-content-link, report the h1 inner text value for each URL. Scroll down the list of links for a maximum of 10 scroll actions.