A single-threaded web crawler that extracts the URLs of static assets (those linked to the domain) from the pages it visits.
The output is printed to the terminal in the following JSON format:
[
  {
    "url": "http://www.example.org",
    "assets": [
      "http://www.example.org/image.jpg",
      "http://www.example.org/script.js"
    ]
  },
  {
    "url": "http://www.example.org/about",
    "assets": [
      "http://www.example.org/company_photo.jpg",
      "http://www.example.org/script.js"
    ]
  }
]
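To illustrate how one entry of this output could be produced, here is a minimal sketch using only the Ruby standard library. It is not the code in crawl.rb: the helper name extract_assets, the ASSET_EXTENSIONS list, and the use of a regular expression in place of a real HTML parser are all assumptions made for the example.

```ruby
require "json"
require "uri"

# Hypothetical list of extensions treated as static assets (an assumption).
ASSET_EXTENSIONS = %w[.jpg .jpeg .png .gif .css .js].freeze

# Hypothetical helper: returns absolute URLs of static assets referenced by
# src/href attributes in html, keeping only those on the same host as page_url.
# A regex stands in for a proper HTML parser to keep the sketch dependency-free.
def extract_assets(page_url, html)
  base = URI.parse(page_url)
  refs = html.scan(/(?:src|href)\s*=\s*["']([^"']+)["']/i).flatten
  assets = refs.map do |ref|
    begin
      uri = URI.join(page_url, ref)
    rescue URI::Error
      next # skip references that cannot be parsed or resolved
    end
    next unless uri.host == base.host
    next unless ASSET_EXTENSIONS.include?(File.extname(uri.path.to_s).downcase)
    uri.to_s
  end
  assets.compact.uniq
end

html = <<HTML
<img src="/image.jpg">
<script src="/script.js"></script>
<a href="/about">About</a>
<img src="http://cdn.example.com/other.png">
HTML

page_url = "http://www.example.org"
result = [{ "url" => page_url, "assets" => extract_assets(page_url, html) }]
puts JSON.pretty_generate(result)
```

In this sketch the link to /about is skipped because it has no asset extension, and the image on cdn.example.com is skipped because its host differs from the page's host. A real crawler would likely rely on an HTML parsing gem rather than a regular expression.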
Installation instructions
- Install Bundler (skip this step if it is already installed)
gem install bundler
- Run bundle install in the root folder (web-crawler) to install all Ruby dependencies
bundle install
- Run the program crawl.rb (requires Ruby >= 2)
ruby crawl.rb [url]
Running the tests
bundle exec rspec