
Parses sitemaps for Node.js


Sitemap-parser

NOTE: This is a fork of the original sitemapper package, fully migrated to ESM and TypeScript. The original package can be found here

Parse a sitemap's XML to get all the URLs for your crawler.

Installation

npm install @yeskiy/sitemapper --save

Simple Example

import Sitemapper from '@yeskiy/sitemapper';

const sitemap = new Sitemapper();

sitemap.fetch('https://www.google.com/work/sitemap.xml').then((sites) => {
    console.log(sites);
});

Options

You can pass options to the Sitemapper constructor when instantiating it.

  • requestHeaders: (Object) - Additional request headers to send with each request (e.g. User-Agent)
  • timeout: (Number) - Maximum timeout in ms for a single URL. Default: 15000 (15 seconds)
  • url: (String) - Sitemap URL to crawl
  • debug: (Boolean) - Enables/disables debug console logging. Default: false
  • concurrency: (Number) - Maximum number of sitemaps crawled concurrently. Default: 10
  • retries: (Number) - Maximum number of retries to attempt on an error response (e.g. 404 or timeout). Default: 0
  • rejectUnauthorized: (Boolean) - If true, throws on invalid certificates, such as expired or self-signed ones. Default: true
  • lastmod: (Number) - Timestamp of the minimum lastmod value allowed for returned URLs
  • gotParams: (GotOptions) - Additional options to pass to the got library. See Got Options

License

MIT