This project uses Node.js to implement a spider test suite, and outputs a list of warnings, errors, and 404s.
$ npm install spidersuite --save-dev
-
Start the app that you want to crawl.
-
Open another Terminal window in the same directory as your project. Run node for your localhost:
./node_modules/.bin/spidersuite https://localhost:8443/ [--config <PATH_TO_CONFIG_FILE>]
The spidersuite results appear in this Terminal window.
After running, spidersuite outputs a report on broken links, broken hashes, and other checks over the crawled site. To limit output, it reports only the first five broken links and redirects for a specific link.
If spidersuite finds broken links, the not found count
value in the response is greater than 0. The response also includes not found
information:
not found count: 5
not found: [
{
"url": "https://localhost:8443/brokenurl/",
"linkedFrom": [
"https://localhost:8443/"
]
},
...
]
The linkedFrom
information shows where spidersuite found this file.
If spidersuite finds a link that redirects to a missing page, the report contains the redirectFrom
property, which lists the URLs for a chain of redirects. Any URL in that chain might be present in the content and contribute to the overall error. For example:
not found count: 1
not found: [
{
"url": "https://localhost:8443/brokenurlredirect2",
"redirectFrom": [
"https://localhost:8443/brokenurlredirect1",
"https://localhost:8443/brokenurlredirect0"
]
}
]
In this example, the original URL in the content is https://localhost:8443/brokenurlredirect0
. This URL redirected several times and ended on https://localhost:8443/brokenurlredirect2
, which returns the HTTP 404 Not Found
status code.
Important: The
redirectFrom
property can have multiple heads to the redirect chain. For example, the list might show URLs A, B, C, and D, but A might redirect to B, C might redirect to D, and both B and D redirect to the offending URL. When spidersuite detects an error that results from a chain of redirects, more than one chain can redirect to the same faulty page so you must check the content for all parts of all chains specified by theredirectFrom
list.
To meet your crawling and reporting needs, set one or more options in the spidersuite configuration file.
The spidersuite configuration file resembles the eslint configuration file.
Use the extends
property to find either a referenced file or the default configuration if you specify spidersuite:default
.
Supports only .json
or .js
files.
The following table describes the full set of configuration options.
For any pattern, spidersuite replaces #{ROOT_URL}
with the extracted root URL, such as https://domain.example.com:5522
, which lets you treat URLs that intentionally, or unintentionally, link to other hosts differently.
Option | Description |
---|---|
additionalPaths |
A list of paths to crawl, in addition to the initial URL provided. Use this option for hidden pages, which are pages that you cannot navigate from the main URL. |
excludePatterns |
A list of patterns that specify which URLs to not attempt. Default is spidersuite includes all URLs it finds. |
includePatterns |
A list of patterns that specify which URLs to attempt. Default is spidersuite includes all URLs it finds. If specified, spidersuite fetches pages that match at least one of the patterns. |
titlePattern |
A regular expression pattern that indicates what the HTML title of the crawled pages on the same domain should contain. |
<ERROR>WarnOnlyPatterns |
Reports the specified <ERROR> as a warning rather than a failure. Value is either hashNotFound or http<XXX> . http<XXX>WarnOnlyPatterns reports the specified <XXX> errors as warnings. The <XXX> value is a number from 400 to 510 . |
reportSpoolInterval |
A number that is greater than zero. Indicates the interval with which to report the current spool. The spool comprises the pages that are currently being fetched. Useful for debugging. |
strictCiphers |
If false , the cipher list is relaxed. If true , a more strict version of ciphers is used over TLS. |
simplecrawlerConfig |
Spidersuite is based on simplecrawler . This module has many configuration options. Use the simplecrawlerConfig option to set simplecrawler options. |
Note: For more details about these options, see the configuration file examples in the
examples
directory.
See License.
See Contributing.
This project was inspired by - and heavily influenced by - eslint
. The configuration parsing code was modified for usage in this project.