Apify Kit is a toolkit for centralized HTML parsing (scraping and crawling).
- Hash-based syntax to parse HTML DOM elements.
- Scraping of websites with or without JavaScript enabled.
- Configurable number of scraping processes.
- Configurable delay between requests.
- Handy filters for further processing of HTML nodes.
- Web server for crawling websites from separate nodes.
- Built-in local web-server.
- Handy web admin interface with queues, scheduling, and history.
- Redis
- Foreman
- Clone this repo: $ git clone git@github.com:victorvsk/apify-kit.git
- Bundle gems: $ cd apify-kit && bundle install
- Run Foreman: $ gem install foreman && foreman start
- Open http://localhost:5000 and use the demo credentials (login: apify@gmail.com, password: password, apify_secret: secret).
While the web admin is protected with Devise, the embedded web server, which is mounted at /apify in config/routes.rb, has to be publicly accessible. That is why it is protected with the APIFY_SECRET environment variable. It can be set, for example, at the top of config/application.rb:
ENV["APIFY_SECRET"] ||= 'secret'
Note that it should be set before the Rails app is loaded, because the server is in fact a Sinatra application that is loaded once on startup.
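The load-order note can be illustrated with a minimal sketch. This is not Apify Kit's actual code: the `SecretGate` class and the `X-Apify-Secret` header are hypothetical, and how the real server receives and checks the secret is not specified here. The sketch only shows why a value read once at load time ignores later changes to the environment.

```ruby
# Hypothetical illustration only; SecretGate and the X-Apify-Secret header
# are assumptions, not part of Apify Kit.
ENV["APIFY_SECRET"] ||= "secret" # must run before the class body below

class SecretGate
  # Read once, when the class is defined; later ENV changes are not seen.
  SECRET = ENV.fetch("APIFY_SECRET")

  # Rack-style entry point: returns [status, headers, body].
  def call(env)
    given = env["HTTP_X_APIFY_SECRET"].to_s
    if !given.empty? && given == SECRET
      [200, { "content-type" => "text/plain" }, ["ok"]]
    else
      [401, { "content-type" => "text/plain" }, ["unauthorized"]]
    end
  end
end

# Too late: the constant above was already captured at class definition.
ENV["APIFY_SECRET"] = "changed-too-late"
```

In this sketch, a request carrying the originally configured secret is still accepted after the last line runs, which is the same reason the variable has to be set before the application is loaded.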
- Deployment recipes.
- Cover the core part with RSpec.
- More configurable server node.
- Distribute every part as a gem for easier scaling.
- Cover the server part with RSpec.
- Cover the scheduler part with RSpec.