A spider class that crawls pages starting from a URL, follows the links on those pages that match rules you define, and then crawls the linked pages in turn.
This is heavily inspired by the Ruby gem spider: https://github.com/johnnagro/spider
- Add the dependency to your shard.yml:
  dependencies:
    spider:
      github: confact/spider.cr
- Run shards install
require "spider"
And then set up the spider config like this:
Spider.start("https://google.com") do |s|
  s.amount_workers = 30
  # Called for every URL found on a page; queue only the ones you want to visit.
  # The dots in the host names are escaped so they match a literal ".".
  s.every_page_urls = ->(url : URI) {
    if /^https:\/\/news\.ycombinator\.com\/.*/ =~ url.to_s
      s.add_link_to_visit(url)
    end
    if /^https:\/\/indiehackers\.com\/.*/ =~ url.to_s
      s.add_link_to_visit(url)
    end
  }
  s.every_page = ->(data : Lexbor::Parser, url : URI) {
    # Either run the whole data processing here, or move it to another class and call it here.
    # You get the Lexbor::Parser instance directly so you can use it freely,
    # and the url so you can route to the correct processing depending on the page.
  }
end
This runs the spider. The call blocks, so any code below it does not execute while the spider is running.
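The URL rules in the example above can also be factored into a small predicate, which keeps the start block short as the list of sites grows. This is only a sketch: `ALLOWED_HOSTS` and `follow?` are hypothetical names, not part of the shard, and the check compares the parsed host instead of matching a regex, so it also accepts a URL with no path after the host.

```crystal
require "uri"

# Hypothetical allow-list; the hosts are the ones used in the example above.
ALLOWED_HOSTS = ["news.ycombinator.com", "indiehackers.com"]

# Hypothetical helper: true when the URL is HTTPS and its host is allowed.
def follow?(url : URI) : Bool
  url.scheme == "https" && ALLOWED_HOSTS.includes?(url.host)
end

puts follow?(URI.parse("https://news.ycombinator.com/item?id=1"))
puts follow?(URI.parse("https://example.com/"))

# Inside the start block it could then be used like this:
#   s.every_page_urls = ->(url : URI) {
#     s.add_link_to_visit(url) if follow?(url)
#   }
```

Comparing hosts avoids having to escape dots in a regex and makes it harder to accidentally match a look-alike domain.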
If you use a proxy API, you can set it in the start block.
A proxy API is usually a URL that takes the URL you want to visit as a query parameter, so the prefix ends with that parameter.
For example:
s.prefix_url = "https://app.scrapingbee.com/api/v1/?api_key={api_key}&render_js=true&url="
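To illustrate how such a prefix and a target URL fit together, here is a minimal sketch. `proxied_url` is a hypothetical helper, not part of the shard, and the proxy address is a placeholder. It assumes the target is percent-encoded before being appended; whether the spider encodes the URL itself is not stated here, so check that against your proxy's requirements.

```crystal
require "uri"

# Hypothetical helper: builds the URL that would actually be requested
# when a proxy prefix is combined with a target URL.
def proxied_url(prefix : String, target : URI) : String
  # Percent-encode the target so its own "?" and "&" do not get mixed
  # into the proxy's query string.
  prefix + URI.encode_www_form(target.to_s)
end

# Placeholder proxy endpoint and key, for illustration only.
prefix = "https://proxy.example.com/api/v1/?api_key=KEY&url="
puts proxied_url(prefix, URI.parse("https://news.ycombinator.com/item?id=1"))
```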
We plan to support different ways of storing the visited URLs and the queued URLs. Right now storage is hardcoded to text files.
Ideas for future storage:
- Redis
- Memcached
- Database
- some custom API
This works and is performing well on some production systems, but a few things could be better:
- Failure handling, with a custom way to handle failures in the start block.
- More storage options, and a way to choose one in the start block.
- The spider keeps running even after it is done, as the while check does not seem to work fully.
- robots.txt support, to respect websites' wishes.
Contributions are very welcome, for example on the concurrency support, as I am new to that.
- Fork it (https://github.com/confact/spider.cr/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
- Håkan Nylén - creator and maintainer