postmodern/spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Ruby · MIT

Issues

Is there a way to set Accept-Encoding headers?
#43 opened 3 years ago by robfuller
9
Switch to `Addressable::URI` for URI parsing
#86 opened 2 years ago by postmodern
0
Thank you
#71 opened a year ago by thegreyfellow
1
Switch to using `require_relative` to improve load-times
#88 opened a year ago by postmodern
0
Add `# frozen_string_literal: true` comments to all files
#89 opened a year ago by postmodern
0
Add Logging
#72 opened 5 years ago by postmodern
0
Add ignore_paths and ignore_paths_like
#61 opened 8 years ago by postmodern
0
How to control the depth of crawling?
#82 opened 3 years ago by masterbo98
1
Support passing a URI as a proxy setting
#81 opened 3 years ago by postmodern
0
Write specs for Agent.domain
#80 opened 3 years ago by postmodern
1
Add spec for `Spidr::Agent.host`
#79 opened 3 years ago by postmodern
0
Add spec for `Spidr::Agent.site`
#78 opened 3 years ago by postmodern
0
Add spec for `Spidr::Agent.start_at`
#77 opened 3 years ago by postmodern
0
Switch to using async-http
#76 opened 3 years ago by postmodern
1
#<NoMethodError: undefined method `closed?' for nil:NilClass>
#27 opened 3 years ago by ethicalhack3r
1
Rewrite spec/helpers/wsoc.rb as a shared_example with 100% less eval
#33 opened 3 years ago by postmodern
1
Infinite path loop
#31 opened 13 years ago by ethicalhack3r
2
Use webmock and to_rack in specs
#44 opened 3 years ago by postmodern
1
fetch_titles not following 301
#37 opened 3 years ago by audy
2
Following redirects
#56 opened 8 years ago by ZackMattor
4
Switch to Ruby 2.x keyword arguments
#75 opened 3 years ago by postmodern
0
Figure out why specs are failing only on JRuby?
#74 opened 4 years ago by postmodern
1
Page#to_absolute raises URI::InvalidURIError: path conflicts with opaque
#57 opened 7 years ago by buren
7
path conflicts with opaque (URI::InvalidURIError)
#66 opened 7 years ago by mustiikhalil
3
Automatically detect and parse sitemap.xml
#19 opened 14 years ago by postmodern
7
`ignore_links` not working.
#64 opened 7 years ago by vwochnik
4
Multithreading
#26 opened 14 years ago by ethicalhack3r
13
unable to ignore links
#60 opened 8 years ago by vanegomez
4
Add low-level HTTP request methods
#62 opened 8 years ago by postmodern
0
Limit crawl to links matching pattern
#59 opened 8 years ago by bricemaurin
3
Skip processing of pages
#49 opened 8 years ago by darkcode85
1
Session handling
#53 opened 8 years ago by heavysixer
1
Crawling a specific page
#46 opened 9 years ago by justaj
2
/../foo expands to just "foo"
#45 opened 9 years ago by postmodern
0
How can I 'ignore everything except' a set of links
#42 opened 9 years ago by DHarls17
4
How to log in by submitting a form
#38 opened 9 years ago by loyalpartner
1
Is it possible to display only part of a spidered URL?
#41 opened 9 years ago by DHarls17
3
Any way to limit the total number of pages crawled or shut down the crawler after some criteria?
#40 opened 9 years ago by samur-vonq
2
SSL session reuse may fail
#30 opened 13 years ago by nirvdrum
1
HTTP Basic auth problem
#34 opened 12 years ago by tit
4
Spidering pages with no content-type header
#32 opened 13 years ago by bcobb
4
Store history queue in Hash of host:port and paths.
#18 opened 14 years ago by postmodern
0
catching SSLErrors
#28 opened 13 years ago by lucasluitjes
2
uninitialized constant Spidr::Headers::Set
#25 opened 14 years ago by scottsampson
3
Link depth?
#23 opened 14 years ago by ethicalhack3r
6
Improve network connection to HTTPS server via HTTPSProxy.
#20 opened 14 years ago by falaise
2
expected absolute path component: sites/ftp.apache.org/
#14 opened 14 years ago by ethicalhack3r
4
99% of cpu usage while crawling bigger websites...
#21 opened 14 years ago by nu7hatch
1
every_html_page shouldn't process javascripts
#17 opened 14 years ago by nu7hatch
7
trailing slashes
#16 opened 14 years ago by nu7hatch
1