Find broken links, missing images, etc. in your HTML.
Features:
- Parses local and remote HTML documents
- Supports various HTML tags/attributes, not just <a href>
- Supports redirects, absolute URLs, relative URLs and <base>
- Provides detailed information about each link (HTTP and HTML)
- URL keyword filtering with wildcards
- Pause/Resume at any time
Node.js >= 0.10 is required. There are two ways to use it:
To install it globally for command-line use, type this at the command line:
npm install broken-link-checker -g
After that, check out the help for available options:
blc -?
To install it as a development dependency for use through the API, type this at the command line:
npm install broken-link-checker --save-dev
The rest of this document explains how to use the API.
blc.HtmlChecker(options, handlers) scans an HTML string to find broken links.
- handlers.complete is fired after the last result or zero results.
- handlers.junk is fired with data on each skipped link, as configured in options.
- handlers.link is fired with the result of each discovered link (broken or not).
- .clearCache() will remove any cached URL responses. This is only relevant if the cacheResponses option is enabled.
- .numActive() returns the number of active requests.
- .pause() will pause the internal link queue, but will not pause any active requests.
- .resume() will resume the internal link queue.
- .scan(htmlString, baseUrl) parses & scans a single string. Returns false when there is a previously incomplete scan (and true otherwise). baseUrl is the address to which all relative URLs will be made absolute. Without a value, links to relative URLs will output an "Invalid URL" error.
var htmlChecker = new blc.HtmlChecker(options, {
junk: function(result){},
link: function(result){},
complete: function(){}
});
htmlChecker.scan(htmlString, baseUrl);
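As a rough illustration of how the handlers might be put to work, the sketch below builds a handler object that tallies results. The names createTallyHandlers and tally are hypothetical and not part of the library; the sketch relies only on the result.error key described later in this document. The handlers are driven here with simulated results, and in real use the handlers object would presumably be passed as the second argument to new blc.HtmlChecker(options, handlers).

```javascript
// Hypothetical helper (not part of broken-link-checker): builds handlers
// that count how many links were checked, skipped, or reported an error.
function createTallyHandlers(onComplete) {
  var tally = { checked: 0, errored: 0, skipped: 0 };
  return {
    tally: tally,
    handlers: {
      // Called once per skipped link.
      junk: function (result) { tally.skipped += 1; },
      // Called once per checked link; a non-null error marks a failure.
      link: function (result) {
        tally.checked += 1;
        if (result.error !== null) tally.errored += 1;
      },
      // Called after the last result.
      complete: function () {
        if (typeof onComplete === "function") onComplete(tally);
      }
    }
  };
}

// Simulated results stand in for real scan output (assumption for the demo).
var tallied = createTallyHandlers(function (tally) {
  console.log("checked:", tally.checked, "errored:", tally.errored, "skipped:", tally.skipped);
});
tallied.handlers.link({ error: null });
tallied.handlers.link({ error: { code: "ENOTFOUND" } });
tallied.handlers.junk({});
tallied.handlers.complete();
```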
blc.HtmlUrlChecker(options, handlers) scans the HTML content at each queued URL to find broken links.
- handlers.end is fired when the end of the queue has been reached.
- handlers.junk is fired with data on each skipped link, as configured in options.
- handlers.link is fired with the result of each discovered link (broken or not) within the current page.
- handlers.page is fired after a page's last result, on zero results, or if the HTML could not be retrieved.
- .clearCache() will remove any cached URL responses. This is only relevant if the cacheResponses option is enabled.
- .dequeue(id) removes a page from the queue. Returns true on success or an Error on failure.
- .enqueue(pageUrl, customData) adds a page to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success or an Error on failure. customData is optional data that is stored in the queue item for the page.
- .length() returns the number of pages in the queue.
- .numActive() returns the number of active link requests.
- .pause() will pause the queue, but will not pause any active requests.
- .resume() will resume the queue.
var htmlUrlChecker = new blc.HtmlUrlChecker(options, {
junk: function(result, customData){},
link: function(result, customData){},
page: function(error, pageUrl, customData){},
end: function(){}
});
htmlUrlChecker.enqueue(pageUrl, customData);
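To make the queue semantics described above more concrete, here is a toy in-memory queue that mimics the documented return values: enqueue returns an ID or an Error, dequeue returns true or an Error, and length returns the item count. ToyQueue is a hypothetical stand-in written for illustration only; it performs no requests and is not how the library is implemented.

```javascript
// Hypothetical toy queue illustrating the documented return values only.
function ToyQueue() {
  this.items = Object.create(null); // id -> { url, customData }
  this.order = [];                  // ids in insertion order
  this.nextId = 0;
}

// Returns a queue ID on success or an Error on failure.
ToyQueue.prototype.enqueue = function (pageUrl, customData) {
  if (typeof pageUrl !== "string" || pageUrl === "") {
    return new Error("Invalid URL");
  }
  var id = this.nextId++;
  this.items[id] = { url: pageUrl, customData: customData };
  this.order.push(id);
  return id;
};

// Returns true on success or an Error on failure.
ToyQueue.prototype.dequeue = function (id) {
  var index = this.order.indexOf(id);
  if (index === -1) return new Error("ID not found");
  this.order.splice(index, 1);
  delete this.items[id];
  return true;
};

// Returns the number of items still queued.
ToyQueue.prototype.length = function () {
  return this.order.length;
};

var toyQueue = new ToyQueue();
var firstId = toyQueue.enqueue("http://example.com/", { label: "home" });
var lengthAfterEnqueue = toyQueue.length();
var firstDequeue = toyQueue.dequeue(firstId);
var secondDequeue = toyQueue.dequeue(firstId);
```

Because failures are returned rather than thrown, callers would check the return value with instanceof Error instead of wrapping the call in try/catch.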
blc.SiteChecker(options, handlers) recursively scans (crawls) the HTML content at each queued URL to find broken links, while honoring robots.txt exclusions.
- handlers.end is fired when the end of the queue has been reached.
- handlers.junk is fired with data on each skipped link, as configured in options.
- handlers.link is fired with the result of each discovered link (broken or not) within the current page.
- handlers.page is fired after a page's last result, on zero results, or if the HTML could not be retrieved.
- handlers.site is fired after a site's last result, on zero results, or if the initial HTML could not be retrieved.
- .clearCache() will remove any cached URL responses. This is only relevant if the cacheResponses option is enabled.
- .dequeue(id) removes a site from the queue. Returns true on success or an Error on failure.
- .enqueue(siteUrl, customData) adds [the first page of] a site to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success or an Error on failure. customData is optional data that is stored in the queue item for the site.
- .length() returns the number of sites in the queue.
- .numActive() returns the number of active link requests.
- .pause() will pause the queue, but will not pause any active requests.
- .resume() will resume the queue.
var siteChecker = new blc.SiteChecker(options, {
junk: function(result, customData){},
link: function(result, customData){},
page: function(error, pageUrl, customData){},
site: function(error, siteUrl, customData){},
end: function(){}
});
siteChecker.enqueue(siteUrl, customData);
blc.UrlChecker(options, handlers) requests each queued URL to determine whether it is broken.
- handlers.end is fired when the end of the queue has been reached.
- handlers.link is fired for each result (broken or not).
- .clearCache() will remove any cached URL responses. This is only relevant if the cacheResponses option is enabled.
- .dequeue(id) removes a URL from the queue. Returns true on success or an Error on failure.
- .enqueue(url, baseUrl, customData) adds a URL to the queue. Queue items are auto-dequeued when their requests are completed. Returns a queue ID on success or an Error on failure. baseUrl is the address to which all relative URLs will be made absolute. Without a value, links to relative URLs will output an "Invalid URL" error. customData is optional data that is stored in the queue item for the URL.
- .length() returns the number of URLs in the queue.
- .numActive() returns the number of active requests.
- .pause() will pause the queue, but will not pause any active requests.
- .resume() will resume the queue.
var urlChecker = new blc.UrlChecker(options, {
link: function(result, customData){},
end: function(){}
});
urlChecker.enqueue(url, baseUrl, customData);
Type: Array
Default value: ["http","https"]
Will only check links with schemes/protocols mentioned in this list. Any others (except those in excludedSchemes) will output an "Invalid URL" error.
Type: Number
Default value: 3600000 (1 hour)
The number of milliseconds in which a cached response should be considered valid. This is only relevant if the cacheResponses option is enabled.
Type: Boolean
Default value: true
URL request results will be cached when true. This will ensure that each unique URL will only be checked once.
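The interplay between response caching and the cache lifetime can be pictured with a small time-limited cache. This is only a sketch under assumptions: ResponseCache, its methods, and the injectable clock are hypothetical names invented here, and the library's real cache may behave differently (for example at the exact expiry boundary).

```javascript
// Hypothetical time-limited cache keyed by URL (not the library's own cache).
// maxAge is in milliseconds; now is an optional clock function for testing.
function ResponseCache(maxAge, now) {
  this.maxAge = maxAge;
  this.now = now || Date.now;
  this.entries = Object.create(null);
}

// Store a response together with the time it was cached.
ResponseCache.prototype.set = function (url, response) {
  this.entries[url] = { response: response, storedAt: this.now() };
};

// Return the cached response, or undefined if missing or older than maxAge.
ResponseCache.prototype.get = function (url) {
  var entry = this.entries[url];
  if (entry === undefined) return undefined;
  if (this.now() - entry.storedAt > this.maxAge) {
    delete this.entries[url]; // expired, so forget it
    return undefined;
  }
  return entry.response;
};

// Remove every cached response.
ResponseCache.prototype.clear = function () {
  this.entries = Object.create(null);
};

// Demo with a fake clock so the expiry can be observed deterministically.
var fakeTime = 0;
var responseCache = new ResponseCache(3600000, function () { return fakeTime; });
responseCache.set("http://example.com/", { statusCode: 200 });
```

With a cache like this, a URL that appears on many pages triggers one request per lifetime window rather than one per occurrence.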
Type: Array
Default value: []
Will not check or output links that match the keywords and glob patterns in this list. The only wildcard supported is "*".
This option only applies to HtmlChecker and HtmlUrlChecker.
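One plausible reading of keyword filtering with a "*" wildcard is sketched below. It assumes a keyword matches when it occurs anywhere in the URL and that "*" stands for any run of characters; both are assumptions, and the helpers keywordToRegExp and isExcluded are hypothetical names, so the library's actual matching rules may differ.

```javascript
// Hypothetical helper: turn a keyword or glob pattern into a RegExp.
function keywordToRegExp(pattern) {
  // Escape every regex metacharacter except "*".
  var escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  // "*" is the only wildcard: it matches any run of characters.
  return new RegExp(escaped.replace(/\*/g, ".*"));
}

// Hypothetical helper: true when the URL matches at least one keyword.
function isExcluded(url, keywords) {
  return keywords.some(function (keyword) {
    return keywordToRegExp(keyword).test(url);
  });
}

var exampleKeywords = ["example.com/blog/*", "*/private/*.pdf"];
var blogExcluded = isExcluded("http://example.com/blog/post-1", exampleKeywords);
var pdfExcluded = isExcluded("http://site.org/private/report.pdf", exampleKeywords);
var homeExcluded = isExcluded("http://example.com/", exampleKeywords);
```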
Type: Array
Default value: ["data","geo","javascript","mailto","sms","tel"]
Will not check or output links with schemes/protocols mentioned in this list. This avoids the output of "Invalid URL" errors with links that cannot be checked.
This option only applies to HtmlChecker and HtmlUrlChecker.
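The relationship between acceptedSchemes and excludedSchemes described above can be summarized as a three-way decision. The sketch below is an assumption-laden illustration: classifyScheme and its return labels are hypothetical, and URLs without a scheme are simply reported as "relative" because resolving them would need a base URL.

```javascript
// Defaults as documented above.
var acceptedSchemes = ["http", "https"];
var excludedSchemes = ["data", "geo", "javascript", "mailto", "sms", "tel"];

// Hypothetical helper: decide what would happen to a link based on its scheme.
// Returns "check", "junk", "invalid", or "relative" (labels invented here).
function classifyScheme(url, accepted, excluded) {
  // A scheme is letters/digits/+/./- followed by ":" at the start of the URL.
  var match = /^([a-z][a-z0-9+.-]*):/i.exec(url);
  if (match === null) return "relative";                  // no scheme present
  var scheme = match[1].toLowerCase();
  if (excluded.indexOf(scheme) !== -1) return "junk";     // skipped silently
  if (accepted.indexOf(scheme) !== -1) return "check";    // will be requested
  return "invalid";                                       // "Invalid URL" error
}

var httpsVerdict = classifyScheme("https://example.com/", acceptedSchemes, excludedSchemes);
var mailtoVerdict = classifyScheme("mailto:someone@example.com", acceptedSchemes, excludedSchemes);
var ftpVerdict = classifyScheme("ftp://example.com/file.zip", acceptedSchemes, excludedSchemes);
var pathVerdict = classifyScheme("/about.html", acceptedSchemes, excludedSchemes);
```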
Type: Boolean
Default value: false
Will not check or output external links when true; relative links with a remote <base> included.
This option only applies to HtmlChecker and HtmlUrlChecker.
Type: Boolean
Default value: false
Will not check or output internal links when true.
This option only applies to HtmlChecker and HtmlUrlChecker.
Type: Boolean
Default value: true
Will not check or output links to the same page; relative and absolute fragments/hashes included.
This option only applies to HtmlChecker and HtmlUrlChecker.
Type: Boolean
Default value: true
Each link's lengthy response data will be omitted from the output when true.
Type: Number
Default value: 1
The tags and attributes that are considered links for checking, split into the following levels:
- 0: clickable links
- 1: clickable links, media
- 2: clickable links, media, stylesheets, scripts, forms
- 3: clickable links, media, stylesheets, scripts, forms, meta
To see the exact breakdown, check out the tag map. <base> is not listed because it is not a link, though it is always parsed.
This option only applies to HtmlChecker and HtmlUrlChecker.
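Since each level includes everything from the levels below it, the breakdown can be expressed as a cumulative lookup. The sketch below only restates the list above in code; FILTER_LEVEL_ADDITIONS and categoriesForLevel are hypothetical names, and the exact tags and attributes behind each category are defined by the tag map, not by this example.

```javascript
// What each level adds on top of the previous one (taken from the list above).
var FILTER_LEVEL_ADDITIONS = [
  ["clickable links"],                  // level 0
  ["media"],                            // level 1
  ["stylesheets", "scripts", "forms"],  // level 2
  ["meta"]                              // level 3
];

// Hypothetical helper: all link categories considered at a given level.
function categoriesForLevel(level) {
  var categories = [];
  for (var i = 0; i <= level && i < FILTER_LEVEL_ADDITIONS.length; i++) {
    categories = categories.concat(FILTER_LEVEL_ADDITIONS[i]);
  }
  return categories;
}

var defaultLevelCategories = categoriesForLevel(1);
```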
Type: Number
Default value: Infinity
The maximum number of links to check at any given time.
Type: Number
Default value: 1
The maximum number of links per host/port to check at any given time. This avoids overloading a single target host with too many concurrent requests. This will not limit concurrent requests to other hosts.
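The two concurrency limits (overall and per host/port) can be thought of as a pair of counters consulted before each request starts. The following is a minimal sketch, not the library's implementation: SocketLimiter, hostKey, maxTotal, and maxPerHost are hypothetical names, and the host key is taken verbatim from the URL authority without normalizing default ports.

```javascript
// Hypothetical helper: extract "host[:port]" from an absolute URL.
function hostKey(url) {
  var match = /^[a-z][a-z0-9+.-]*:\/\/([^\/?#]+)/i.exec(url);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical limiter tracking active requests overall and per host/port.
function SocketLimiter(maxTotal, maxPerHost) {
  this.maxTotal = maxTotal;
  this.maxPerHost = maxPerHost;
  this.total = 0;
  this.perHost = Object.create(null);
}

// Returns true and reserves a slot if both limits allow another request.
SocketLimiter.prototype.tryAcquire = function (url) {
  var key = hostKey(url);
  var active = this.perHost[key] || 0;
  if (this.total >= this.maxTotal || active >= this.maxPerHost) return false;
  this.perHost[key] = active + 1;
  this.total += 1;
  return true;
};

// Frees the slot held for a finished request to the same host/port.
SocketLimiter.prototype.release = function (url) {
  var key = hostKey(url);
  if (this.perHost[key] > 0) {
    this.perHost[key] -= 1;
    this.total -= 1;
  }
};

// Using the documented defaults: no overall limit, one request per host/port.
var socketLimiter = new SocketLimiter(Infinity, 1);
var firstToA = socketLimiter.tryAcquire("http://a.example/1");
var secondToA = socketLimiter.tryAcquire("http://a.example/2");
var firstToB = socketLimiter.tryAcquire("http://b.example/");
```

In this model a second request to the same host waits until the first is released, while requests to other hosts proceed immediately.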
Type: Number
Default value: 0
The number of milliseconds to wait before each request.
Type: Boolean
Default value: false
Will enqueue discovered links of the same domain for HTML scanning; also known as performing a site "crawl".
This option only applies to HtmlUrlChecker.
Type: String
Default value: "head"
The HTTP request method used in checking links. Some sites do not respond correctly to "head", while "get" can provide more consistent and accurate results, albeit slower.
Each result has its own error key, which you can compare against the following:
if (result.error !== null) {
    // Server denied access
    if (result.error.code === "ECONNREFUSED"){}
    // Server could not be reached
    if (result.error.code === "ENOTFOUND"){}
    // Connection timed out
    if (result.error.code === "ETIMEDOUT"){}
    // Duh.
    if (result.error.message === "Invalid URL"){}
}
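The comparisons above can be folded into a small helper that turns a result into a readable message. This is a sketch under assumptions: describeError is a hypothetical function, the messages simply reuse the comments above, and any error not listed in this document falls through to a generic label.

```javascript
// Hypothetical helper: map a result's error key to a short description.
function describeError(result) {
  if (result.error === null) return "no error";
  switch (result.error.code) {
    case "ECONNREFUSED": return "server denied access";
    case "ENOTFOUND":    return "server could not be reached";
    case "ETIMEDOUT":    return "connection timed out";
  }
  if (result.error.message === "Invalid URL") return "invalid URL";
  return "unrecognized error"; // anything not covered above
}

// Simulated results stand in for real ones here.
var timeoutDescription = describeError({ error: { code: "ETIMEDOUT" } });
var invalidDescription = describeError({ error: new Error("Invalid URL") });
var cleanDescription = describeError({ error: null });
```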
- start/end string locations for URL attribute values (parse5#43)
- remove options.excludedSchemes and handle schemes not in options.acceptedSchemes as junk?
- change order of checking to: tcp error, 4xx code (broken), 5xx code (undetermined), 200
- option to retry broken links a number of times (default=0)
- option to scrape response.body for erroneous sounding text (since an error page could be presented but still have code 200)
- option to check broken link on archive.org for archived version (using this lib)
- option to include iframe HTML source in checking?
- option to run HtmlUrlChecker checks on page load (using jsdom) to include links added with JavaScript
- option to check if hashes exist in target URL document?
- option to parse Markdown in HtmlChecker for links
- add throttle profiles (0–9, -1 for "custom") for easy configuring
- check ftp:, sftp: (for downloadable files)
- check mailto:, news:, nntp:, telnet:?
- check local files if URL is relative and has no base URL?
- full-site checker/crawler (honoring robots.txt, same-domain and optional same-subdomain)
- cli json mode -- streamed or not?
- cli non-tty mode -- change nesting ASCII artwork to time stamps?
- use ineed?
- stream HTML files (parse5#26)
- MarkdownChecker, MarkdownUrlChecker, HtmlMarkdownChecker, HtmlMarkdownUrlChecker