browsercrawler

Crawl websites from your browser and save them in S3

WHAT

The BrowserCrawler plugin for Safari will pull all the linked pages under a particular root on a
website and upload them directly to your S3 bucket. There is no limit to the depth it will crawl.
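
Conceptually this is a queue-driven crawl: fetch a page, save it, collect its links, and repeat for every link that stays under the seed's root. The sketch below only illustrates that idea; the names (crawl, uploadToS3) and the use of jQuery's $.get are assumptions, not the plugin's actual code.

    // Illustrative sketch of a queue-driven crawl; not the plugin's actual code.
    // uploadToS3 is a hypothetical helper standing in for the bundled S3 library.
    function crawl(seedUrl) {
      var root = seedUrl.slice(0, seedUrl.lastIndexOf('/') + 1); // crawl root
      var queue = [seedUrl];
      var seen = {};

      function next() {
        if (queue.length === 0) return;            // queue empty: crawl finished
        var url = queue.shift();
        if (seen[url]) { next(); return; }
        seen[url] = true;

        $.get(url, function (html) {
          uploadToS3(url, html);                   // hypothetical upload helper
          // Collect links and queue the ones that stay under the crawl root.
          // (For brevity, relative links here resolve against the current page;
          // a real crawler resolves them against the fetched page's URL.)
          $('<div>').html(html).find('a[href]').each(function () {
            if (this.href.indexOf(root) === 0 && !seen[this.href]) {
              queue.push(this.href);
            }
          });
          next();                                  // move on to the next queued page
        });
      }

      next();
    }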

WHY

Sometimes you want a permanent copy of something you find on the web. For example, I used it to grab a copy of the
Mozilla JavaScript documentation to keep offline. Use it to back up your own data on sites that require authentication
but don't have a great data portability solution.

HOW

1. Install the plugin either by downloading it or building it in Safari with the built-in Extension Builder
2. Configure your AWS S3 settings in the preferences (see the settings sketch after this list)
3. When you are on the page where you want to start the crawl, either click the spider button or
   right-click and start the crawl.
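
The preferences boil down to enough information to address and authenticate with your bucket. The exact fields in the settings pane aren't listed here, so the shape below is only a hypothetical illustration:

    // Hypothetical shape of the S3 settings the crawl needs; the actual
    // preference names in the extension may differ.
    var s3Settings = {
      accessKeyId:     'AKIAEXAMPLEEXAMPLE',    // AWS access key ID
      secretAccessKey: 'your-secret-key-here',  // AWS secret access key
      bucket:          'my-crawl-archive',      // bucket the pages are uploaded into
      keyPrefix:       'crawls/'                // optional path prefix inside the bucket
    };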

The crawl continues as long as the page is open, updating a progress box with the URL it is currently crawling.
It will only crawl pages at the seed page's level or below. You can cancel at any time
by clicking "Cancel".
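
To make the scoping rule concrete, one way to express "same level or below" is a prefix check on the seed page's directory (an assumption about the exact rule, not taken from the plugin's source):

    // Hypothetical scope check: a page is in scope if its URL starts with the
    // seed page's directory. The plugin's real rule may differ in its details.
    function inScope(seedUrl, candidateUrl) {
      var root = seedUrl.slice(0, seedUrl.lastIndexOf('/') + 1);
      return candidateUrl.indexOf(root) === 0;
    }

    inScope('http://example.com/docs/js/', 'http://example.com/docs/js/Array.html'); // true
    inScope('http://example.com/docs/js/', 'http://example.com/docs/css/');          // false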

WARNING

BrowserCrawler will crawl the site as YOU, and so will capture any data that you would normally have access to using
your cookies. It will not run JavaScript, but it does download images (though it doesn't upload them), which can use
a tremendous amount of bandwidth if you happen to crawl a big website. Also, be careful with small sites: they might
not have the capacity to endure a full-speed crawl, even from a single machine.

CREDITS

Thanks to l.m.orchard at pobox.com for the S3 library, and to Paul Johnston, Greg Holt, Andrew Kepert, Ydnar, and Lostinet
for the SHA1 implementation that it uses. jQuery 1.5.2 isn't included, but the plugin uses it to do the actual crawl.
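
For context on the SHA1 dependency: the classic S3 REST API ("signature version 2") authenticates each request with an HMAC-SHA1 of a canonical string, keyed by your secret access key, which is presumably what the bundled SHA1 code is for. A rough sketch of that signing step, assuming the b64_hmac_sha1 function from the Paul Johnston library:

    // Rough sketch of S3 signature-version-2 request signing (why HMAC-SHA1 is
    // needed). The string-to-sign below is simplified: the Content-MD5 and
    // Content-Type slots are left empty and any x-amz-* headers are omitted.
    function signS3Request(secretAccessKey, method, canonicalResource, dateHeader) {
      var stringToSign = method + '\n' +
                         '\n' +                  // Content-MD5 (empty here)
                         '\n' +                  // Content-Type (empty here)
                         dateHeader + '\n' +
                         canonicalResource;      // e.g. '/my-crawl-archive/crawls/page.html'
      // b64_hmac_sha1 is provided by the bundled SHA1 library.
      return b64_hmac_sha1(secretAccessKey, stringToSign);
    }

    // The result goes into the request's Authorization header:
    //   Authorization: AWS <accessKeyId>:<signature>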