A simple web scraper with a friendly API.
- Requests are queued, with configurable request frequency and delay.
- Page parsing logic can be customized based on the URL route.
- flyd
- transducers.js
- path-to-regexp
- co
npm install tiny-scraper
Returns a router used to parse specified pages.
const { createRouter } = require('tiny-scraper');
const router = createRouter();
Matches a site's base URI and returns a function for registering routes for URLs on that site. Please refer to the path-to-regexp documentation for the route expression format.
- baseUri the base URI of the site to match.
const matchGithub = router.match('https://github.com');
matchGithub(
  '/zhangmq/tiny-scraper', // route expression
  function* (req, res, params, query) {
    yield storage(res.data); // storage is just a demo; implement it yourself.
    return [/* parsed urls */];
  }
);
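A route handler typically extracts links from the downloaded page and returns them so the scraper can queue follow-up requests. A minimal, self-contained sketch of such link extraction (`extractLinks` is a hypothetical helper, regex-based for brevity; a real handler would use a proper HTML parser on `res.data`):

```javascript
// Hypothetical helper a route handler might call: pull href values
// out of an HTML string. Regex-based for illustration only.
function extractLinks(html) {
  const links = [];
  const re = /href="([^"]+)"/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    links.push(m[1]);
  }
  return links;
}

const page = '<a href="/zhangmq/tiny-scraper/issues">issues</a> <a href="/zhangmq">profile</a>';
console.log(extractLinks(page)); // → ['/zhangmq/tiny-scraper/issues', '/zhangmq']
```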
Creates a scraper.
- options an object containing the config fields below.
- maxRequest maximum number of parallel requests.
- requestDuration minimum request duration in milliseconds; if a request completes early, the scraper waits until the specified duration has elapsed.
- router the router you implemented.
- downloader method used to request a page, with signature config => responsePromise. Example: axios.request.
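The requestDuration throttle can be pictured as follows: each request holds its slot for at least the configured duration, even when the HTTP response arrives earlier. A self-contained sketch of that idea (not the library's actual implementation; `withMinDuration` and `fakeDownload` are made-up names for illustration):

```javascript
// Sketch: run a request, then pad the elapsed time up to `duration` ms,
// so consecutive requests never fire faster than once per `duration`.
function withMinDuration(requestFn, duration) {
  return async function (config) {
    const start = Date.now();
    const response = await requestFn(config);
    const elapsed = Date.now() - start;
    if (elapsed < duration) {
      await new Promise(resolve => setTimeout(resolve, duration - elapsed));
    }
    return response;
  };
}

// Usage with a fake downloader that resolves immediately:
const fakeDownload = async config => ({ data: 'page for ' + config.url });
const throttled = withMinDuration(fakeDownload, 100);
throttled({ url: 'https://example.com' }).then(res => console.log(res.data));
```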
const axios = require('axios');
const { createScraper } = require('tiny-scraper');
const scraper = createScraper({
  maxRequest: 1,
  requestDuration: 2000,
  router,
  downloader: axios.request
});
scraper.tasks$([/* seed tasks */]);
Task input stream. You can send seed URLs or resend failed requests into this stream.
- input an array of request configs. Please refer to the axios documentation.
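A seed task is just an axios-style request config object. A minimal example (the URL is the repository from the routing example above; the fields shown are standard axios config, not something this package adds):

```javascript
// Hypothetical seed task: a plain axios-style request config.
const seedTask = {
  url: 'https://github.com/zhangmq/tiny-scraper',
  method: 'get'
};
// It would then be queued with: scraper.tasks$([seedTask]);
console.log(seedTask.url);
```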
Currently running tasks.
Failed request stream.
Route execution error stream. You can debug your route code via this stream.