Spool: Webscraper

Primary language: TypeScript. License: MIT.

spool-scraper

📦 Scraper Spool

A Spool that makes scraping the web easy by building on Crawler.

Install

$ npm install --save @fabrix/spool-scraper

Configure

// config/main.ts
import { ScraperSpool } from '@fabrix/spool-scraper'
export const main = {
  spools: [
    // ... other spools
    ScraperSpool
  ]
}

Configuration

// config/scraper.ts
export const scraper = {
  max_connections: 10,
  rate_limit: 1000,
  encoding: null,
  jQuery: true,
  force_UTF8: true,
  retries: 3,
  retry_timeout: 10000,
  incoming_encoding: null,
  skip_duplicates: false,
  // Boolean. If true, user_agent should be an array and the crawler rotates through it (default: false)
  rotate_UA: false,
  // String|Array. If rotate_UA is false but user_agent is an array, the crawler uses the first entry.
  user_agent: [],
  // String. If truthy, sets the HTTP Referer header
  referer: null,
  // Object. Raw key-value pairs of HTTP headers
  headers: null,
  pre_request: (opts, done) => {
    // 'opts' here is not the 'options' you pass when queueing a scrape;
    // it is the options object that will be passed to the 'request' module
    console.log(opts)
    // when done is called, the request will start
    done()
  }
}
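
As a rough illustration of the pre_request hook, the sketch below adds a header to the outgoing request options and then calls done so the request can start. Only the (opts, done) callback shape comes from the config above; the RequestOptions interface, the addTraceHeader name, and the X-Scrape-Run header are assumptions made for this example and are not part of the spool's API.

```typescript
// Hypothetical shape of the options handed to pre_request (an assumption for this sketch).
interface RequestOptions {
  uri?: string
  headers?: Record<string, string>
}

// A pre_request-style hook: adjust the outgoing options, then call done() to let the request start.
function addTraceHeader(opts: RequestOptions, done: () => void): void {
  // Keep any headers that are already set and add one illustrative header.
  opts.headers = { ...(opts.headers ?? {}), 'X-Scrape-Run': 'example' }
  done()
}
```

Under these assumptions, the hook could be wired in as pre_request: addTraceHeader in config/scraper.ts.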

For more information about these options, please see the scraper documentation.
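
The interaction between rotate_UA and user_agent described in the config comments can be pictured with a small sketch. This is not the code used by the spool or by Crawler; makeUserAgentPicker is a hypothetical helper, and the round-robin order is an assumption about what "rotate" means here.

```typescript
// Illustrative only: mirrors the behaviour described in the config comments.
function makeUserAgentPicker(
  userAgent: string | string[],
  rotateUA: boolean
): () => string | undefined {
  const agents = Array.isArray(userAgent) ? userAgent : [userAgent]
  let next = 0
  return () => {
    if (agents.length === 0) return undefined // nothing configured
    if (!rotateUA) return agents[0] // no rotation: always the first entry
    const agent = agents[next]
    next = (next + 1) % agents.length // assumed round-robin rotation
    return agent
  }
}

const pick = makeUserAgentPicker(['agent-a', 'agent-b'], true)
console.log(pick(), pick(), pick()) // prints "agent-a agent-b agent-a"
```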

Usage

For the best results, create a Scrape class and override its default process method.

  import { Scrape } from '@fabrix/spool-scraper'
  
  export class AmazonScrape extends Scrape {
    process(res): Promise<any> {
      const $ = res.$
      const amazon = $('.nav-logo-base').text()
      return Promise.resolve(amazon)
    }
  }
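
One possible way to keep a process override easy to test is to move the extraction into a plain function that only depends on res.$. The sketch below assumes res.$ behaves like a selector function whose result exposes text(), as in the example above; the Selection and ScrapeResponse interfaces, extractLogoText, and the fake response are all stand-ins invented for this illustration.

```typescript
// Minimal stand-ins for the parts of the response used below (assumptions for this sketch).
interface Selection { text(): string }
interface ScrapeResponse { $: (selector: string) => Selection }

// Same lookup as the process() example, as a plain function that needs no network access.
function extractLogoText(res: ScrapeResponse): string {
  const $ = res.$
  return $('.nav-logo-base').text().trim()
}

// A fake response that answers a single selector, standing in for a real page.
const fakeRes: ScrapeResponse = {
  $: (selector) => ({ text: () => (selector === '.nav-logo-base' ? ' Amazon ' : '') })
}

console.log(extractLogoText(fakeRes)) // prints "Amazon"
```

Inside a Scrape subclass, process could then return Promise.resolve(extractLogoText(res)).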

Then you can either queue your scrape or run it directly:

// Return a result immediately (see config for options)
const direct = this.app.scrapes.AmazonScrape.direct('https://amazon.com', options, preRequest)

// Add this to the queue (see config for options)
this.app.scrapes.AmazonScrape.queue('https://amazon.com', options, preRequest)