CrawlerMQ

HTTP Crawler written in Perl with ActiveMQ and Redis


MQCrawler is a distributed Perl platform for fast crawling of web sites.



APACHE ACTIVEMQ SERVER
-----------------------

You need an ActiveMQ server with the STOMP protocol enabled.
You also need to specify your ActiveMQ server address
in YSpider.conf.


REDIS SERVER
-----------------------

A Redis server is needed to store the URLs that have already been crawled
(in the "url" set).
On Ubuntu, you can install it with "apt-get install redis-server".
Specify your Redis server address in YSpider.conf.

PERL PROGRAMS
-----------------------

Please specify as many patterns as you need
in YSpider.conf.


Then start the programs in this order:

> perl master.pl &
> perl analyzer.pl &
> perl crawler.pl &


master.pl   :
------------------
 Retrieves the links to crawl from the "links" queue and sends the authorized URLs to the "crawl" queue.
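
 The filtering step described above can be sketched as follows. This is only an illustration, not the code of master.pl: it assumes that a URL is "authorized" when it matches one of the configured patterns and has not been seen before. The patterns, the authorized_urls helper, and the in-memory hash standing in for the Redis "url" set are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical patterns; the real ones would come from YSpider.conf.
my @patterns = (
    qr{^https?://example\.com/},
    qr{^https?://blog\.example\.com/},
);

# In-memory stand-in for the Redis "url" set (assumption for this sketch).
# With a real Redis server, SADD returns 1 for a new member and 0 for an
# existing one, which allows the same "first time seen" test.
my %seen;

# Return the links that match at least one pattern and were not seen before.
sub authorized_urls {
    my @links = @_;
    my @out;
    for my $url (@links) {
        next unless grep { $url =~ $_ } @patterns;   # no pattern matches
        next if $seen{$url}++;                       # already known
        push @out, $url;
    }
    return @out;
}

my @to_crawl = authorized_urls(
    'http://example.com/a',
    'http://other.org/b',
    'http://example.com/a',
    'https://blog.example.com/post',
);
print "$_\n" for @to_crawl;
```

 In this sketch the duplicate and the non-matching URL are dropped, so only two URLs would go on to the "crawl" queue.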

analyzer.pl : 
------------------
 Retrieves the HTML source from the "source" topic, extracts the links it contains, and sends them to the "links" queue.

 You can write as many analyzer.pl "clones" as you need for your own purposes, because the source is published to the "source" topic as persistent messages ;)
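
 As a starting point for such a clone, link extraction could look like the sketch below. It is an assumption-laden example rather than the code of analyzer.pl: the extract_links helper is hypothetical, and it uses a simple regular expression that only handles quoted href attributes on <a> tags. A real HTML parser would be more robust.

```perl
use strict;
use warnings;

# Extract the values of quoted href attributes found in <a> tags.
# Regex-based sketch: unquoted attributes and relative URL resolution
# are deliberately not handled here.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gis) {
        push @links, $2;
    }
    return @links;
}

# Sample page standing in for a message body read from the "source" topic.
my $html = <<'HTML';
<html><body>
<a href="http://example.com/page1">one</a>
<A class="x" HREF='http://example.com/page2'>two</A>
<p>no link here</p>
</body></html>
HTML

my @links = extract_links($html);
print "$_\n" for @links;
```

 Each extracted link would then be sent as one message to the "links" queue.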

crawler.pl  : 
------------------

 Retrieves the URLs to crawl from the "crawl" queue, fetches their source content, and sends it to the "source" topic.
 This script forks n times in order to parallelize the HTTP GETs.

 You can launch crawler.pl on as many servers as you want or need.

 Be careful: it may DDoS the web sites you want to crawl.
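
 The "forks n times" behaviour can be sketched with Perl's built-in fork and wait. This is a minimal illustration, not the code of crawler.pl: the worker count, the worker function, and its empty body are hypothetical placeholders for the real queue and HTTP handling.

```perl
use strict;
use warnings;

my $n = 4;    # hypothetical number of worker processes

# Placeholder for the real work: read a URL from the "crawl" queue,
# fetch it over HTTP, and publish the body to the "source" topic.
sub worker {
    my ($id) = @_;
    return 0;    # exit status of the child
}

my %children;
for my $id (1 .. $n) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        exit worker($id);    # child process: do the work, then exit
    }
    $children{$pid} = $id;   # parent process: remember the child
}

# The parent waits until every child has exited.
my $reaped = 0;
while ((my $pid = wait()) != -1) {
    delete $children{$pid};
    $reaped++;
}
print "reaped $reaped workers\n";
```

 Since every worker issues requests independently, the load on the target site grows with n and with the number of servers running the script.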


When everything is running, send the first URL to crawl to the "crawl" queue, with the starting URL as the message body.
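
For illustration, the seed message is an ordinary STOMP SEND frame. The sketch below only builds the frame as a string and does not talk to a broker. The stomp_send_frame helper, the "/queue/crawl" destination name (ActiveMQ's usual "/queue/" prefix is assumed), and the example URL are assumptions, not values taken from this project.

```perl
use strict;
use warnings;

# Build a STOMP SEND frame: command line, headers, blank line, body,
# and a terminating NUL byte.
sub stomp_send_frame {
    my ($destination, $body) = @_;
    return "SEND\n"
         . "destination:$destination\n"
         . "\n"
         . $body
         . "\0";
}

# Hypothetical destination and starting URL.
my $frame = stomp_send_frame('/queue/crawl', 'http://example.com/');

# Show the frame with the NUL terminator made visible.
(my $printable = $frame) =~ s/\0/\\0/;
print $printable, "\n";
```

A real client must first open a connection to the broker and exchange CONNECT/CONNECTED frames before sending this frame. A STOMP client library such as Net::Stomp normally takes care of the connection and the framing.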