# itemize

A lazy, fluent web crawler with an async/await API.
## Install

```shell
$ yarn add itemize
```
## Usage

Itemize lists all of the linked files and pages underneath the specified root URL.
```js
const itemize = require('itemize')

// Get a quick Hacker News sitemap
const urls = itemize('https://news.ycombinator.com', { depth: 2 })

while (!urls.done()) {
  console.log(await urls.next())
}
```
This is useful for writing mirrors, monitoring a page for new content, and so on. Itemize starts at the root URL you provide and automatically spiders through it to find connecting pages. It takes a lazy approach to I/O, only making requests when you ask it for more content with `next()`.
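Because requests only happen when you call `next()`, you can stop consuming at any point and the rest of the site is never fetched. A minimal sketch (the five-URL cutoff is purely illustrative):

```js
const urls = itemize('https://news.ycombinator.com', { depth: 2 })

// Only the requests needed to produce these five URLs are made;
// abandoning the loop early leaves the rest of the crawl untouched.
for (let i = 0; i < 5 && !urls.done(); i++) {
  console.log(await urls.next())
}
```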
## API

### itemize(url, options)

Returns an `Itemize` instance.
- `url`: String, the root URL from which to crawl
- `options`: Object
  - `depth`: Number, crawl this many layers deep (default: `0`)
```js
const items = itemize('https://nodejs.org/download/release/', { depth: 1 })
```
### .next()

Returns a Promise for a String: the next linked URL. If no URLs remain, returns a Promise for `undefined`.

```js
const url = await items.next()
```
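Since `next()` eventually resolves to `undefined`, you can also drain the crawler with that sentinel instead of polling `done()`; a small sketch:

```js
// Consume URLs until the crawl is exhausted.
let url
while ((url = await items.next()) !== undefined) {
  console.log(url)
}
```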
### .done()

Returns a Boolean indicating whether all spidering routes have been exhausted.

```js
if (items.done()) console.log('crawl complete')
```
### .all()

Returns a Promise for an Array of Strings: all of the previously traversed URLs.

```js
const all = await items.all()
```
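This pairs naturally with a completed crawl, for example to persist a sitemap once spidering finishes (the `sitemap.json` file name is just for illustration):

```js
const fs = require('fs')

// Drain the crawl, then dump every traversed URL to disk.
while (!items.done()) await items.next()
fs.writeFileSync('sitemap.json', JSON.stringify(await items.all(), null, 2))
```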
### .close()

Itemize uses a keepalive HTTP/HTTPS agent. Use `close()` to destroy the existing underlying socket and create a new Agent with no existing connections. You should use this to clean up after Itemize instances that haven't completed their crawls.

```js
items.close()
```
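If you only consume part of a crawl, a `try`/`finally` block keeps the cleanup guaranteed; a sketch, with the URL purely illustrative:

```js
const items = itemize('https://example.com', { depth: 1 })

try {
  // Take just the first result, abandoning the rest of the crawl.
  console.log(await items.next())
} finally {
  // Release the keepalive socket held by the unfinished crawl.
  items.close()
}
```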
## Tests

```shell
$ yarn test
```
## Examples

```shell
$ node --harmony examples/hackernews.js
$ node --harmony examples/nodes.js
```