A simple web crawler that creates a sitemap of a given website.
python main.py -u http://flask.pocoo.org/docs/0.12/index.html
# Generate a sitemap of the http://flask.pocoo.org/docs/0.12/ directory
python main.py -u http://flask.pocoo.org/docs/0.12/index.html -b http://flask.pocoo.org/docs/
# Generate a sitemap of the http://flask.pocoo.org/docs/ directory, starting from /docs/0.12/index.html
python main.py -u ... -vvv
# Set logging to very verbose
python main.py -u ... -o sitemap.xml
# Write the generated sitemap to the file sitemap.xml
from main import WebCrawler  # module path assumed from the CLI examples above

url = "http://flask.pocoo.org/docs/0.12/index.html"
crawler = WebCrawler(is_master=True)
crawler.crawl(url)
result = crawler.dump()
You can use your own backends:
class SuperFastCsvWebCrawler(WebCrawler):
    # Custom backend classes
    storage_class = SuperFastUrlStorage
    http_client_class = SuperFastHttpClient
    encoder_class = CSVEncoder

    def get_to_visit_queue(self):
        # Custom queue initialization
        return RedisQueue(self.opts, host="127.0.0.1", port=6379, db=2)
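For example, a custom CSV encoder could look like the sketch below. The `encode` method name and the iterable-of-URLs input are assumptions made here for illustration, not the project's actual encoder interface; match whatever contract the default encoder exposes:

import csv
import io

class CSVEncoder:
    """Hypothetical CSV encoder; the encode() interface is assumed for illustration."""

    def encode(self, urls):
        # Write one URL per row under a single "url" column.
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["url"])
        for url in urls:
            writer.writerow([url])
        return buf.getvalue()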
- All links are stored and visited as absolute URLs in order to prevent duplicates.
- Helper classes are pluggable; for instance, you can plug in your own CSV encoder.
- The default UrlStorage is a dict, so registering, finding, and unregistering URLs are all O(1).
- BFS was preferred over DFS because: 1. the page order is more natural, 2. recursive graph traversal uses a lot of memory, and 3. it supports multiple workers. (See the sketch after this list.)
- The final XML is built by joining XML tag strings instead of using a real encoder, to keep things simple (as mentioned above, plugging in a more complete encoder is straightforward).
- The project first crawls everything and then writes the result to a file. For crawling very large sites, possible optimizations include:
  - Writing to the file as the crawl proceeds, to keep memory usage bounded.
  - Creating separate sub-sitemaps for different subdirectories so that several workers can run in parallel.
  - Using an external queue such as Redis or RabbitMQ to coordinate multiple workers.
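To make the BFS and absolute-URL deduplication concrete, here is a minimal standalone sketch of the same ideas. It is not the project's actual code: `requests` and `BeautifulSoup` are stand-ins for whatever HTTP client and parser the crawler really uses, and the sitemap is assembled by joining tag strings, as described above.

import collections
import urllib.parse

import requests
from bs4 import BeautifulSoup


def crawl(start_url, base_url):
    """Breadth-first crawl, deduplicating on absolute URLs."""
    visited = {}                      # dict storage: O(1) register and lookup
    to_visit = collections.deque([start_url])
    while to_visit:
        url = to_visit.popleft()      # FIFO queue gives BFS order
        if url in visited:
            continue
        visited[url] = True
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            # Resolve relative hrefs to absolute URLs before queueing,
            # so the same page is never queued under two different names.
            link = urllib.parse.urljoin(url, a["href"]).split("#")[0]
            if link.startswith(base_url) and link not in visited:
                to_visit.append(link)
    return list(visited)


def to_sitemap(urls):
    """Build the sitemap by joining XML tag strings, as the notes describe."""
    entries = "\n".join("  <url><loc>%s</loc></url>" % u for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            "%s\n</urlset>" % entries)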
Unit tests are included and can be run with:
pytest test.py