Recluse
Recluse is a web crawler meant to ease quality assurance. Currently, it has three crawling tests:
- Status—checks the HTTP status codes of links on the site. Good for detecting broken links.
- Find—finds pages with links matching the pattern. Good for ensuring that references to a page are removed or renamed.
- Assert—checks pages for the existence of HTML elements. Good for asserting that things are consistent across pages.
Installation
Add this line to your application's Gemfile:
gem 'recluse'
And then execute:
$ bundle
Or install it yourself as:
$ gem install recluse
Profiles
Recluse depends on creating profiles for your sites. This way, the configuration can be reusable for frequent quality assurance checks. Profiles are saved as YAML files (.yaml) in ~/.recluse/
and have the following format:
---
name: profile_name
roots:
- http://example.com/
- http://anotherroot.biz/subdir
email: email@domain.com
blacklist:
- http://example.com/dontgohere/*
whitelist:
- http://example.com/dontgohere/unlessitshere/*
internal_only: false
scheme_squash: false
redirect: false
Profile options
Name | Required | Type | Default | Description |
---|---|---|---|---|
name | Yes | String | The name of your profile for identification. Should also match the filename (i.e., site has filename site.yaml ). |
|
roots | Yes | Array of URLs | The roots to start from for spidering. Will spider all subdirectories and files. | |
Yes | String | Your email. This is for identification of who is crawling a web page in case a system administrator has issues with it. | ||
blacklist | No | Array of globs | Empty array | Glob patterns of sites not to spider. Useful to keep Recluse focused only on the important stuff. |
whitelist | No | Array of globs | Empty array | Glob patterns of sites to spider, even if they are blacklisted. |
internal_only | No | Boolean | false |
If true, Recluse will not follow external links. If false, it will follow for the status mode. |
scheme_squash | No | Boolean | false |
Treats "http" URLs the same as "https". This way, Recluse will not redundantly spider secure and nonsecure duplicates of the same page. |
redirect | No | Boolean | false |
Follow the redirect to the resulting page if true. |
Use
After installation, the recluse
executable should be available for your command line.
Tests
Status
Spiders through the profile and reports the HTTP status codes of the links. If the profile is not internal only, external links will also have their statuses checked.
$ recluse status csv_path profile1 [profile2] ... [options]
Argument | Alias | Required | Type | Default | Description |
---|---|---|---|---|---|
csv_path | Yes | String | The path of where to save results. Results are saved as CSV (comma-separated values). | ||
profiles | Yes | Array of profile names | List of profiles to check. More than one profile can be checked in one run. | ||
group_by | --group-by -g |
No | One of none or url |
none |
What to group by in the result output. If none , there will be a row for each pair of checked URL and the page it was found on. If url , there will be one row for each URL, and the page cell will have a list of every page the URL was found on. |
include | --include -i |
No | Array of status codes | Include all | Include these status codes in the results. Can be a specific number (ex: 200 ) or a wildcard (ex: 2xx ). You can also include idk for pages that result in errors that prevent status code detection. |
exclude | --exclude -x |
No | Array of status codes | Exclude none | Exclude these status codes from the results. Same format as including. |
Output format
Status code,URL,On page,With error
Find
Spiders through the profiles and checks if a link matching one of the provided patterns is found. Will only go over internal pages.
$ recluse find csv_path profile1 [profile2] ... --globs pattern1 [pattern2] ... [options]
Argument | Alias | Required | Type | Default | Description |
---|---|---|---|---|---|
csv_path | Yes | String | The path of where to save results. Results are saved as CSV (comma-separated values). | ||
profiles | Yes | Array of profile names | List of profiles to check. More than one profile can be checked in one run. | ||
globs | --globs -G |
Yes | Array of globs | Glob patterns to find as URLs of links on the page. | |
group_by | --group-by -g |
No | One of none , url , or page |
none |
What to group by in the result output. If none , there will be a row for each pair of checked URL and the page it was found on. If url , there will be one row for each URL, and the page cell will have a list of every page the URL was found on. If page , there will be one row for each page, and the URL cell will list every matching URL found on the page. |
Output format
none
or url
Group by Matching URLs,Pages
page
Group by Page,Matching URLs
Assert
Asserts the existence of an HTML element using CSS-style selectors. Will only check internal pages.
$ recluse assert csv_path profile1 [profile2] ... --exists selector1 [selector2] ...
Argument | Alias | Required | Type | Default | Description |
---|---|---|---|---|---|
csv_path | Yes | String | The path of where to save results. Results are saved as CSV (comma-separated values). | ||
profiles | Yes | Array of profile names | List of profiles to check. More than one profile can be checked in one run. | ||
true | --true --report-true-only |
No | Boolean | false |
Report only true assertions. Reports both true and false assertions by default. |
false | --false --report-false-only |
No | Boolean | false |
Report only false assertions. Reports both true and false assertions by default. |
exists | --exists -e |
Yes | Array of CSS selectors | CSS selectors to assert the existence of on each spidered page. |
Output format
Selector,Exists,On page
Profile management
Where
Path where the profiles are stored for manual edits.
$ recluse where
Creation
Create a profile.
$ recluse profile create [options] name email root1 [root2] ...
For further description of the arguments, check the Profile options section.
Argument | Alias | Required | Type | Default |
---|---|---|---|---|
name | Yes | String | ||
Yes | String | |||
roots | Yes | Array of strings | ||
blacklist | --blacklist |
No | Array of globs | Empty array |
whitelist | --whitelist |
No | Array of globs | Empty array |
internal_only | --internal_only --no-internal-only |
No | Boolean | false |
scheme_squash | --scheme-squash --no-scheme-squash |
No | Boolean | false |
redirect | --redirect --no-redirect |
No | Boolean | false |
Edit
Edit profile options. Any option not provided will stay as it was.
$ recluse profile edit name [options]
Argument | Alias | Required | Type |
---|---|---|---|
name | Yes | String | |
--email |
No | String | |
roots | --roots |
No | Array of strings |
blacklist | --blacklist |
No | Array of globs |
whitelist | --whitelist |
No | Array of globs |
internal_only | --internal_only --no-internal-only |
No | Boolean |
scheme_squash | --scheme-squash --no-scheme-squash |
No | Boolean |
redirect | --redirect --no-redirect |
No | Boolean |
Blacklist, whitelist, and roots
More powerful blacklist and whitelist editing. All examples are interchangeable between the three list types. However, if the profile has no roots, it will not run.
Add
Add patterns/roots to the profile's list.
$ recluse profile blacklist add name new_thing1 [new_thing2] ...
Remove
Remove patterns/roots from the profile's list.
$ recluse profile blacklist remove name thing1 [thing2] ...
Clear
Remove all patterns/roots from the profile's list.
$ recluse profile blacklist clear name
List
List the patterns/roots in the profile's list.
$ recluse profile blacklist list name
Remove
Delete a profile.
$ recluse profile remove name
Rename
Rename a profile.
$ recluse profile rename old_name new_name
List
List all profiles.
$ recluse profile list
Info
List the YAML info of the profile.
$ recluse profile info name
Contributing
Bug reports and pull requests are welcome on GitHub.
Extending
Recluse is modular so you can add tasks if you want. Below is an example of adding your own task to Recluse.
require 'recluse'
module MyModule
##
# Create a task object
class MyTask < Recluse::Tasks::Task
##
# First argument must be the profile. The rest are hash arguments specific for the task.
def initialize(profile, option1: false, option2: true, results: nil)
# Sets up everything based on the profile, queue-specific options, and can also prepopulate results.
super(profile, queue_options, results: results)
@queue.run_if do |link|
# Run a link if this function returns true.
# Link is a Recluse::Link object.
end
@queue.on_complete do |link, response|
# Run this function after the page has either successfully been retrieved, or failed to be retrieved.
# Link is a Recluse::Link object.
# Response is a Recluse::Response object.
end
end
end
end
# Add your task to the task list under the key 'my_task'.
Recluse::Tasks.add_task(:my_task, MyModule::MyTask)
# You can now access 'my_task' like you would the default Recluse tasks.
my_profile = Recluse::Profile.load('my_profile')
my_profile.test(:my_task, option1: true, option2: true)
results = my_profile.results[:my_task]
License
The gem is available as open source under the terms of the MIT License.