postmodern/spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Ruby · MIT
Issues
- 9
Is there a way to set Accept-Encoding headers?
#43 opened by robfuller - 0
Switch to `Addressable::URI` for URI parsing
#86 opened by postmodern - 1
Thank you
#71 opened by thegreyfellow - 0
- 0
Add Logging
#72 opened by postmodern - 0
Add ignore_paths and ignore_paths_like
#61 opened by postmodern - 1
How to control the depth of crawling?
#82 opened by masterbo98 - 0
Support passing a URI as a proxy setting
#81 opened by postmodern - 1
Write specs for Agent.domain
#80 opened by postmodern - 0
Add spec for `Spidr::Agent.host`
#79 opened by postmodern - 0
Add spec for `Spidr::Agent.site`
#78 opened by postmodern - 0
Add spec for `Spidr::Agent.start_at`
#77 opened by postmodern - 1
Switch to using async-http
#76 opened by postmodern - 1
- 2
Infinite path loop
#31 opened by ethicalhack3r - 1
Use webmock and to_rack in specs
#44 opened by postmodern - 2
fetch_titles not following 301
#37 opened by audy - 4
Following redirects
#56 opened by ZackMattor - 0
Switch to Ruby 2.x keyword arguments
#75 opened by postmodern - 1
- 7
Automatically detect and parse sitemap.xml
#19 opened by postmodern - 4
`ignore_links` not working.
#64 opened by vwochnik - 13
Multithreading
#26 opened by ethicalhack3r - 4
unable to ignore links
#60 opened by vanegomez - 0
Add low-level HTTP request methods
#62 opened by postmodern - 3
Limit crawl to links matching pattern
#59 opened by bricemaurin - 1
Skip processing of pages
#49 opened by darkcode85 - 1
Session handling
#53 opened by heavysixer - 2
Crawling a specific page
#46 opened by justaj - 0
/../foo expands to just "foo"
#45 opened by postmodern - 4
- 1
how to login via submit a form
#38 opened by loyalpartner - 3
- 2
Anyway to limit the total number of pages crawled or shutdown the crawler after some criteria?
#40 opened by samur-vonq - 1
SSL session reuse may fail
#30 opened by nirvdrum - 4
HTTP Basic auth problem
#34 opened by tit - 4
Spidering pages with no content-type header
#32 opened by bcobb - 0
- 2
catching SSLErrors
#28 opened by lucasluitjes - 3
uninitialized constant Spidr::Headers::Set
#25 opened by scottsampson - 6
Link depth?
#23 opened by ethicalhack3r - 2
- 7
every_html_page shouldn't process javascripts
#17 opened by nu7hatch - 1
trailing slashes
#16 opened by nu7hatch