## Copyright (C) 2010, Paco Nathan. This work is licensed under
## the BSD License. To view a copy of this license, visit:
##     http://creativecommons.org/licenses/BSD/
## or send a letter to:
##     Creative Commons, 171 Second Street, Suite 300
##     San Francisco, California, 94105, USA
##
## @author Paco Nathan <ceteri@gmail.com>


Slinky provides an open source, high-performance Web Crawler,
plus common Text Analytics, implemented in Python.

 * uses the Redis key/value store for both the CrawlQueue and the PageStore
 * uses SQLite to persist crawled URI content
 * uses Neo4j to persist and analyze URI metadata
 * uses Hadoop, R, and Gephi for Text Analytics and Link Analytics

This leverages a "Particle Cluster" design pattern. In contrast to
MapReduce, a Particle Cluster is particularly well-suited for
combining highly reliable servers with low-cost, unreliable VMs.
In other words, you can take advantage of the CPU + memory + I/O of
available but relatively ephemeral resources -- ones which might get
taken away without notice. For example, in AWS the key/value store
could run on a large EC2 node, while the distributed tasks run on
Spot Instances, selected based on pricing and availability. This
pattern helps maximize throughput and reliability while minimizing
the cost of scale-out for long-running jobs. Illustrative sketches
of several of these pieces appear at the end of this file.

Required installs for worker nodes:

   http://github.com/andymccurdy/redis-py
   http://www.crummy.com/software/BeautifulSoup/
   http://components.neo4j.org/neo4j.py/
   http://jpype.sourceforge.net/
   http://henry.precheur.org/python/rfc3339 (already included)

Additional required installs for server nodes:

   http://code.google.com/p/redis/downloads
   http://www.sqlite.org/download.html
   http://neo4j.org/download/

Usage:

 # initialize Redis; run on the server node...
 cd PATH_TO_REDIS
 nohup ./redis-server &
 # you probably want to configure Redis so that it runs "BGSAVE"

 # edit "config.tsv" for your settings...
 # e.g., Slinky handles ~100 crawler threads/node, but not in the default config

 # initialize the CrawlQueue and PageStore; run from any node...
 ./src/slinky.py redis_host:port:db flush
 ./src/slinky.py redis_host:port:db config < config.tsv
 ./src/slinky.py redis_host:port:db whitelist < whitelist.tsv
 ./src/slinky.py redis_host:port:db seed < urls.tsv

 # perform a crawl; run this on each worker node...
 nohup ./src/slinky.py redis_host:port:db perform &
 # will poll/sleep indefinitely; use "kill -9 PID" to terminate

 # persist the crawled URI content; run from any reliable node with attached storage...
 nohup ./src/slinky.py redis_host:port:db persist &
 # will poll/sleep indefinitely; use "kill -2 PID" to close with no data loss

 # analyze the crawled URI metadata; run from any reliable node with attached storage...
 nohup ./src/slinky.py redis_host:port:db analyze &
 # will poll/sleep indefinitely; use "kill -2 PID" to close with no data loss
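
Sketches:

 The following sketches illustrate the moving parts described above.
 They are rough approximations, not Slinky's actual code: the key
 names, schemas, and helper functions are hypothetical, and the real
 implementation lives in ./src/slinky.py.

 The CrawlQueue and PageStore both live in Redis. A minimal version
 of the queue, using the redis-py client listed above (the key name
 "crawl_queue" is made up for illustration):

    import redis

    # connect to the Redis node named by "redis_host:port:db"
    r = redis.Redis(host="localhost", port=6379, db=0)

    def seed(urls):
        # enqueue seed URIs onto a Redis list serving as the CrawlQueue
        for url in urls:
            r.rpush("crawl_queue", url)

    def next_uri():
        # pop the next URI to fetch; returns None once the queue drains
        return r.lpop("crawl_queue")

 As for the "BGSAVE" note in the Usage section: stock Redis triggers
 background snapshots via the "save" directives in redis.conf (e.g.,
 "save 900 1" snapshots after 900 seconds if at least one key has
 changed), and redis-py also exposes an explicit r.bgsave() call.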
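
 Because all shared state lives in Redis on the reliable server node,
 a worker on a Spot Instance can vanish without corrupting that state.
 The flip side is that workers should expect transient connection
 failures. A hypothetical retry wrapper in that spirit:

    import time
    import redis

    def with_retry(call, *args, retries=5):
        # spot-instance workers hold no state of their own, so on a
        # dropped connection we simply back off and retry against the
        # reliable Redis node
        for attempt in range(retries):
            try:
                return call(*args)
            except redis.exceptions.ConnectionError:
                time.sleep(2 ** attempt)
        raise RuntimeError("Redis unreachable; giving up")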
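
 A single "perform" iteration boils down to: pop a URI, fetch it,
 stash the raw content into the PageStore, and enqueue outbound
 links. A sketch under the same assumptions, written against Python 3
 and the current bs4 package for brevity (the 2010 code targets the
 older BeautifulSoup and urllib libraries):

    import urllib.request
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def crawl_one(r):
        url = r.lpop("crawl_queue")
        if url is None:
            return False
        url = url.decode("utf-8")
        html = urllib.request.urlopen(url, timeout=10).read()
        # stash raw content for the "persist" task to drain later
        r.hset("page_store", url, html)
        # enqueue outbound links; the real crawler would apply the
        # whitelist and de-duplicate before enqueueing
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            r.rpush("crawl_queue", urljoin(url, a["href"]))
        return True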
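
 The "persist" task drains crawled content out of Redis into SQLite
 on a node with attached storage. The table name and columns below
 are assumptions for illustration, not Slinky's actual schema:

    import sqlite3

    def persist_pages(r, db_path="crawl.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS page "
                     "(uri TEXT PRIMARY KEY, content BLOB)")
        # move each page from the PageStore hash into durable storage
        for uri, content in r.hgetall("page_store").items():
            conn.execute("INSERT OR REPLACE INTO page VALUES (?, ?)",
                         (uri.decode("utf-8"), content))
            r.hdel("page_store", uri)
        conn.commit()
        conn.close()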
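
 Finally, the Usage section stops the persist and analyze daemons
 with "kill -2 PID" (SIGINT) rather than "kill -9", so that no data
 is lost. Trapping the signal lets the poll/sleep loop finish its
 current batch before exiting; one common way to wire that up:

    import signal
    import time

    running = True

    def request_stop(signum, frame):
        global running
        running = False  # finish the current batch, then exit the loop

    signal.signal(signal.SIGINT, request_stop)

    while running:
        # ... drain and persist one batch here ...
        time.sleep(5)  # poll/sleep until SIGINT arrives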