Tool for easily scraping data from websites
- Clean and simple API
- Persistent error-proof crawling
- State saving for continuous crawling
- jQuery-like server-side DOM parsing with Cheerio
- Parallel requests
- Proxy list and user-agent list support
- HTTP headers and cookies setup
- Automatic charset detection and conversion
- Console progress indicator
- Node.js 0.10 to 6.0 support
```
npm install icrawler
```

```js
icrawler(startData, opts, parse, done);
```
- `startData` - task (or array of tasks) to start crawling with. A single icrawler task can be a URL (of a page or an API resource) or an object with a `url` field holding the URL. Optionally you can add a `data` field with an object for a `POST` request (the default method is `GET`). You can use any other fields for custom data: for example, you can mark different types of tasks to parse them in different ways, or store partial data in a task when one result record needs more than one request.
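  For example, a start array might mix a plain URL, a `POST` task and a task with a custom field (`type` below is our own marker, not part of the API):

```js
var startData = [
    'http://example.com/news',                                  // plain URL task
    { url: 'http://example.com/api/list', data: { page: 1 } },  // POST request
    { url: 'http://example.com/item/1', type: 'item' }          // custom field
];
```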
- `opts` (optional) - options:
  - `concurrency` - positive number of parallel requests, or negative number of milliseconds to wait between requests with no parallelism. Defaults to `1`.
  - `delay` - time in milliseconds to wait after an error before trying to crawl again. Defaults to `10000` (10 secs).
  - `errorsFirst` - if `true`, failed requests are retried before all others; if `false`, they are pushed to the tail of the queue. Defaults to `false`.
  - `allowedStatuses` - number or array of numbers of HTTP response codes that are not treated as errors. Defaults to `[200]`.
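  As a sketch (all values are illustrative), these opts crawl with no parallelism, 500 ms between requests, retry failures first, and accept 404 responses as valid:

```js
var opts = {
    concurrency: -500,           // negative: sequential, 500 ms between requests
    delay: 30000,                // wait 30 secs after an error before retrying
    errorsFirst: true,           // retry failed tasks before the rest of the queue
    allowedStatuses: [200, 404]  // don't treat 404 responses as errors
};
```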
  - `skipDuplicates` - if `true`, parse every URL only once. Defaults to `true`.
  - `objectTaskParse` - if `true`, the task object is passed to `parse` instead of the URL string. Defaults to `false`.
  - `decode_response` (or `decode`) - whether to decode text responses to UTF-8 if the Content-Type header shows a different charset. Defaults to `true`.
  - `noJquery` - if `true`, the response body string is passed to the `parse` function (as the `$` parameter) as is, without jQuery-like parsing. Defaults to `false`.
  - `noResults` - if `true`, parsed items are not saved to the results array (no `save` field in the `_` parameter of the `parse` function). Defaults to `false`.
  - `quiet` - if `true`, nothing is written to the console (no `log` and `step` fields in the `_` parameter of the `parse` function). Defaults to `false`.
  - `open_timeout` (or `timeout`) - returns an error if the connection takes longer than X milliseconds to establish. Defaults to `10000` (10 secs); `0` means no timeout.
  - `read_timeout` - returns an error if the data transfer takes longer than X milliseconds after the connection is established. Defaults to `10000` milliseconds (unlike needle's default).
  - `proxy` - forwards the request through an HTTP(S) proxy, e.g. `proxy: 'http://user:pass@proxy.server.com:3128'`. If it is an array of strings, proxies are taken from the list (see the sketch after this option group).
  - `proxyRandom` - if `true`, a random proxy from the list is used for every request; if `false`, a new proxy from the list is taken after each error. Defaults to `true`. If `proxy` is not an array, the `proxyRandom` option is ignored.
  - `reverseProxy` - replaces part of the URL before the request, for using a reverse proxy. If `reverseProxy` is a string, it is used in place of the protocol and domain of the original URL. If `reverseProxy` is an object, the substring `reverseProxy.to` in the original URL is replaced by `reverseProxy.from`. If it is an array of strings or objects, reverse proxies are taken from the list.
  - `reverseProxyRandom` - if `true`, a random reverse proxy from the list is used for every request; if `false`, a new reverse proxy from the list is taken after each error. Defaults to `true`. If `reverseProxy` is not an array, the `reverseProxyRandom` option is ignored.
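  A sketch of a proxy list plus a reverse proxy in its string form (the hosts are placeholders):

```js
var opts = {
    // with proxyRandom: false the crawler moves to the next proxy
    // in the list only after an error
    proxy: [
        'http://user:pass@proxy1.example.com:3128',
        'http://user:pass@proxy2.example.com:3128'
    ],
    proxyRandom: false,
    // string form: replaces the protocol and domain of the original URL
    reverseProxy: 'http://mirror.example.com'
};
```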
  - `headers` - object containing custom HTTP headers for the request. Overrides the defaults described below.
  - `cookies` - sets a `{key: 'val'}` object as the 'Cookie' header.
  - `connection` - sets the 'Connection' HTTP header. Defaults to `close`.
  - `compressed` - if `true`, sets the 'Accept-Encoding' HTTP header to `'gzip, deflate'`. Defaults to `false`.
  - `agent` - sets a custom http.Agent.
  - `user_agent` - sets the 'User-Agent' HTTP header. If it is an array of strings, 'User-Agent' values are taken from the list. Defaults to `Needle/{version} (Node.js {nodeVersion})`.
  - `agentRandom` - if `true`, a random 'User-Agent' from the list is used for every request; if `false`, a new 'User-Agent' from the list is taken after each error. Defaults to `true`. If `user_agent` is not an array, the `agentRandom` option is ignored.
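  Headers, cookies and a rotating user-agent list can be combined like this (all values are illustrative):

```js
var opts = {
    headers: { 'Accept-Language': 'en-US' },
    cookies: { session: 'abc123' },   // sent as the 'Cookie' header
    compressed: true,                 // sets 'Accept-Encoding: gzip, deflate'
    user_agent: [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (X11; Linux x86_64)'
    ],
    agentRandom: true                 // random 'User-Agent' per request
};
```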
  - `onError` - `function (err, task)` for doing something on the first error, before the pause.
  - `init` - `function (needle, log, callback)` for preparing cookies and headers before crawling. It must call `callback(err)` on error or `callback(null, cookies, headers)` on success.
  - `initOnError` - if `true`, `init` is run on every resume after errors; if `false`, `init` is run only on start. If `init` is not set, the `initOnError` option is ignored. Defaults to `true`.
  - `cleanCookiesOnInit` - if `true`, old cookies are cleaned on every `init` run. Defaults to `false`.
  - `cleanHeadersOnInit` - if `true`, old headers are cleaned on every `init` run. Defaults to `false`.
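  A minimal `init` sketch, assuming a hypothetical login endpoint (the URL and form fields are placeholders); it uses the `needle` instance passed in and hands cookies and headers back through the callback:

```js
var opts = {
    init: function(needle, log, callback){
        // hypothetical login request; URL and form fields are placeholders
        needle.post('http://example.com/login',
                { user: 'bob', pass: 'secret' },
                function(err, res){
            if(err) return callback(err);
            log('logged in');
            // pass the session cookies and extra headers to the crawler
            callback(null, res.cookies, { 'X-Requested-With': 'XMLHttpRequest' });
        });
    },
    initOnError: true // re-login on every resume after errors
};
```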
  - `save` - `function (tasks, results)` for saving the crawler state. `tasks` is an object containing the arrays `waiting`, `finished` and `failed` with tasks from the queue; `results` is an array of already fetched data. Ignored if `file` is set.
  - `results` - results previously saved by `save`, for continuing crawling after a crash or manual break. Ignored if `file` is set.
  - `tasks` - tasks previously saved by `save`, for continuing crawling after a crash or manual break. `tasks` is an object containing the arrays `waiting`, `finished` and `failed`. Ignored if `file` is set.
  - `file` - name of a file for saving the crawler state, so crawling can continue after a crash or manual break. Use it instead of `save`, `tasks` and `results` for automatic saving.
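  State saving can be done either with a custom `save` function or, more simply, with the `file` option; a sketch (the file name is arbitrary):

```js
var fs = require('fs');

var opts = {
    save: function(tasks, results){
        // tasks = { waiting: [...], finished: [...], failed: [...] }
        fs.writeFileSync('state.json',
            JSON.stringify({ tasks: tasks, results: results }));
    }
};

// ...or let icrawler handle saving and restoring by itself:
var optsAuto = { file: 'state.json' };
```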
  - `saveOnError` - if `true`, `save` is run every time the crawler pauses on an error. Defaults to `true`.
  - `saveOnFinish` - if `true`, `save` is run when crawling finishes. Defaults to `true`.
  - `saveOnExit` - if `true`, `save` is run when the user aborts the script with `Ctrl+C`. Defaults to `true`.
  - `saveOnCount` - if set to a number, `save` is run every `saveOnCount` requests.
  - `asyncParse` - if `true`, `parse` is run in asynchronous mode (see the sketch after the example below). Defaults to `false`.
- `parse` - page-parsing `function(task, $, _, res)` that runs for every crawled page and receives these parameters:
  - `task` - URL of the parsed page. If `objectTaskParse` is set, `task` is an object with a `url` property.
  - `$` - jQuery-like (`cheerio`-powered) object for an HTML page, a parsed object for `json`, or the raw response body if `noJquery` is `true`.
  - `_` - object with these helpers:
    - `_.push(task)` - adds a new task (or array of tasks) to the crawler queue (to be parsed later). Every task can be a URL string or an object with a `url` property.
    - `_.save(item)` - adds a parsed item to the results array.
    - `_.step()` - increments the progress indicator.
    - `_.log(message /*, ... */)` - safe logging (use it instead of `console.log`).
    - `_.cb` - callback function for asynchronous mode; `undefined` if `asyncParse` is `false`.
  - `res` (optional) - full response object (`needle`-powered).
- `done` (optional) - `function(result)` that runs once with the result of the crawling/parsing.
```js
var icrawler = require('icrawler');

var opts = {
    concurrency: 10,
    errorsFirst: true
};

icrawler('http://example.com/', opts, function(url, $, _){
    if($('#next').length > 0){
        _.push($('#next').attr('href'));
        _.log('PAGE');
    }
    $('.news>a').each(function() {
        _.step();
        _.save({
            title: $(this).text(),
            href: $(this).attr('href')
        });
    });
}, function(result){
    console.log(result);
});
```
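With `asyncParse: true`, `parse` may finish asynchronously and signals completion through `_.cb`; a minimal sketch (the timeout stands in for any asynchronous work, and calling `_.cb()` with no arguments is an assumption):

```js
icrawler('http://example.com/', { asyncParse: true }, function(url, $, _){
    setTimeout(function(){            // any asynchronous work
        $('.news>a').each(function(){
            _.save({ title: $(this).text() });
        });
        _.cb();                       // tell icrawler this page is done
    }, 100);
}, function(result){
    console.log(result.length + ' items');
});
```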
MIT