crawler: A Python repository from petterw

This is a quick and dirty /simple/ web crawler, and a work in progress. 
If you want it to do anything besides discover new links, you will need to 
write some code. It tries to respect robots.txt, and can be scaled fairly well 
across several machines thanks to the magic of celery and rabbitmq.

Please acquire more beer first if you're planning to use this for something 
serious, it will make it a more tolerable experience..

Libraries: 
	- urllib2 for document fetching
	- celery for task distribution
	- rabbitmq as the default message broker
	- beautifulsoup to extract links from html, even if the html is badly 
	  written
	- urlparse to make urls absolute, and to extract domain names

To use the crawler:

1) Start rabbitmq

2) If you're planning to run workers outside the local computer:
	- edit celeryconfig.py to point to the ip/hostname where you are 
	  running the rabbitmq server.
	- copy the program to the worker nodes

3) Start celeryd in the crawler folder on the worker nodes.

4) run example.py --seed url1 url2 ur3 ...


Examples
----------------------------------

Output of a test-run with a (too) low crawl-delay:

	Suspended crawl to 1298964969.suspended_crawl
	Downloaded bytes: 2603580
	Discovered links: 1449
	Discovered domains: 102
	Runtime: 40.9467639923 seconds
	Most prolific referrer was www. --- .com with an average of
	50.0 outgoing links per page.

To explore the results, fire up an interactive python shell and run:

	>>> from crawler import resume
	>>> crawler = resume("1298964969.suspended_crawl")

For example, these might be interesting:

	>>> print crawler.domains.keys()
	>>> print crawler.urls.values()

To resume the crawling process:

	>>> crawler.crawl()


To resume a killed crawler outside of the python shell, call the script 
with --resume filename

Assumptions, limitations, improvements
--------------------------------------

Assumptions:
- Web consists only of html (!)
- We don't need to get every single document, there's always others..
- Only http results of interest (optional https without cert checking available
  by passing schemes = ["http","https"] to crawler
- We're crawling a small subset of the internet so keeping some stuff in memory is ok.
- The goal was not to reimplement Googlebot in a few days :)

Good stuff:
- The crawler can add and remove workers while running
- RabbitMQ can be clustered, live (thanks Erlang..)
- Respects robots.txt (optionally..) including craw-delay, fetches this asynchronosly
  and delays further crawling of domain until robots.txt is parsed or found missing
- Can be restricted to crawling specific domains
- Accepts custom stopping criteria in the form of a function which can inspect the 
  crawl state
- In case of a crash, kill or completion, crawl state is saved so it can either be 
  resumed or inspected to find any problems.
- Crawls pretty fast after ramp-up

Limitations:
- The document ranking function is probably quite bad (it's no PageRank)
- The crawl oversight process is not distributed (but has been found capable of
  keeping up with hundreds of workers)
- Too much info is kept in memory for this to be usable on a grand scale,
  but it lets us get statistics easily
- Doesn't reuse connections

Suggested improvements:
- Crawl oversight should be made distributed
- Decouple link structure data storage from crawling
- Change to urllib3 to reuse connections (then again, maybe not, this would make us have to
  track which worker is talking to which server)
- Make your own crawler.rank(document) and termination criteria
- WARNING: uses pickle, which is unsafe on data from untrusted sources. Write some other method of saving
  state!
petterw/crawler