tiny-scraper

A simple web scraper that is easy to use.

Features

  • Requests are queued, with configurable request frequency and delay.
  • Page parsing logic can be customized based on the URL route.
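
The queuing behaviour in the first bullet can be pictured with a small sketch. This is not tiny-scraper's actual implementation (the library builds on flyd streams); `createQueue` and its `maxRequest` argument are hypothetical names used only for illustration, and the sketch covers the concurrency limit only.

```javascript
// Hypothetical sketch: run at most `maxRequest` tasks at once and queue the rest.
function createQueue(maxRequest) {
  const pending = [];   // tasks waiting for a free slot
  let running = 0;      // tasks currently in flight

  function next() {
    if (running >= maxRequest || pending.length === 0) return;
    const { task, resolve, reject } = pending.shift();
    running += 1;
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        running -= 1;
        next();         // a slot is free: start the next queued task
      });
  }

  return {
    // Enqueue a function that returns a promise; resolves with that task's result.
    push(task) {
      return new Promise((resolve, reject) => {
        pending.push({ task, resolve, reject });
        next();
      });
    },
    get running() { return running; },
    get pending() { return pending.length; }
  };
}
```

With `createQueue(1)`, tasks run strictly one after another, which roughly corresponds to `maxRequest: 1` in the scraper options.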

Dependencies

  • flyd
  • transducers.js
  • path-to-regexp
  • co

Install

npm install tiny-scraper

API

createRouter

Returns a router used to parse specified pages.

const { createRouter } = require('tiny-scraper');
const router = createRouter();

router.match

Matches a site's base URI and returns a function that registers route handlers for URLs on that site. See the path-to-regexp documentation for the route expression format.

Parameters

  • baseUri the base URI of the site to match.

const matchGithub = router.match('https://github.com');

matchGithub(
  '/zhangmq/tiny-scraper',                // route expression
  function* (req, res, params, query) {
    yield storage(res.data);              // storage is a demo placeholder; implement your own.
    return [/* parsed urls */];
  }
);
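
To illustrate how a route expression can select a handler and produce the `params` and `query` arguments, here is a simplified matcher. It is only a stand-in: the library itself relies on path-to-regexp, while this sketch supports just literal and `:name` segments. `compileRoute` and `matchRoute` are hypothetical names, not part of tiny-scraper.

```javascript
// Simplified stand-in for path-to-regexp: supports only literal and ":name" segments.
function compileRoute(expression) {
  const keys = [];
  const pattern = expression
    .split('/')
    .map((segment) => {
      if (segment.startsWith(':')) {
        keys.push(segment.slice(1));
        return '([^/]+)';
      }
      // Escape regex metacharacters in literal segments.
      return segment.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    })
    .join('/');
  return { regexp: new RegExp('^' + pattern + '/?$'), keys };
}

// Returns { params, query } when `url` is on the same origin as `baseUri`
// and its path matches the route expression; otherwise returns null.
function matchRoute(baseUri, expression, url) {
  const target = new URL(url, baseUri);
  if (target.origin !== new URL(baseUri).origin) return null;
  const { regexp, keys } = compileRoute(expression);
  const found = regexp.exec(target.pathname);
  if (!found) return null;
  const params = {};
  keys.forEach((key, i) => { params[key] = decodeURIComponent(found[i + 1]); });
  return { params, query: Object.fromEntries(target.searchParams) };
}
```

For instance, `matchRoute('https://github.com', '/:owner/:repo', '/zhangmq/tiny-scraper')` returns params with `owner` set to `'zhangmq'` and `repo` set to `'tiny-scraper'`, plus an empty `query` object.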

createScraper

Creates a scraper.

Parameters

  • options an object containing the config fields.
    • maxRequest maximum number of requests running in parallel.
    • requestDuration minimum duration of each request; if a request completes early, the scraper waits until the specified duration has elapsed.
    • router the router you implemented.
    • downloader function used to request a page, with the signature config => responsePromise. Example: axios.request.

const axios = require('axios');
const { createScraper } = require('tiny-scraper');
const scraper = createScraper({
  maxRequest: 1,
  requestDuration: 2000,
  router,
  downloader: axios.request
});

scraper.task$([/* seed tasks */]);
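
The `requestDuration` option describes a minimum time per request. A minimal sketch of that idea follows, assuming a downloader of the form config => responsePromise and a duration in milliseconds. `withMinDuration` and the stub `fakeDownloader` are hypothetical and are not part of tiny-scraper; the library may implement the delay differently.

```javascript
// Hypothetical helper: wrap a downloader so each call takes at least `requestDuration` ms.
function withMinDuration(downloader, requestDuration) {
  const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  return async function throttled(config) {
    const started = Date.now();
    try {
      return await downloader(config);
    } finally {
      // If the request finished early, hold on for the remaining time.
      const remaining = requestDuration - (Date.now() - started);
      if (remaining > 0) await wait(remaining);
    }
  };
}

// Stub downloader with the same shape as axios.request: config => Promise<response>.
const fakeDownloader = (config) => Promise.resolve({ config, data: '<html></html>' });
const politeDownloader = withMinDuration(fakeDownloader, 50);
```

Because the wait sits in a `finally` block, a failed request is also held for the minimum duration before its error propagates.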

scraper.task$

Task input stream. You can send seed URLs or resend failed requests into this stream.

Parameters

  • input an array of request configs. See the axios documentation.
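
A seed task array might be built as in the sketch below. `method` and `url` are standard axios request config fields; `toSeedTasks` is a hypothetical helper, and the commented-out call assumes the stream accepts the array directly.

```javascript
// Hypothetical helper: turn plain URLs into axios-style request configs.
function toSeedTasks(urls) {
  return urls.map((url) => ({ method: 'get', url }));
}

const seedTasks = toSeedTasks([
  'https://github.com/zhangmq/tiny-scraper',
]);

// With a scraper created via createScraper, the seeds would then be pushed in
// (assumed call shape):
// scraper.task$(seedTasks);
```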

scraper.running$

Stream of the currently running tasks.

scraper.requestError$

Stream of failed requests.

scraper.routeError$

Stream of route execution errors. You can use this stream to debug your route code.

Demo

example