/express-xml-scraper

Express (as in Node.js Express) XML scrapper that will save remote XML's from other sitemaps into its own Node server and become HTTP accessible.

Primary LanguageJavaScript

Express Node XML Scraper

A XML Scraper with a front-end facing Express server that scrapes for remote XML files while at the same time offering the saved sitemaps on the Node Express server.

Set-up

  • Run npm install to install all the dependancies including the simple-xml-scrapper app that will crawl the site.
node server.js --sitemap_index_url=http://www.nytimes-se.com/nytse/sitemap.xml

Express Server

  • After everything is successfully created, a local directory /rendered_sitemaps will be created.
  • Visit http://localhost:3001/rendered_sitemaps for additional information on how generated sitemap directories can be accessed.
  • Sitemaps will be have a directory created per domain. http://www.nytimes-se.com/nytse/sitemap.xml will create a nytimes-se.com/ directory and contain all related sitemaps from within that domain.

Cron

  • This node application features a running internal cron job that can be set using a regular cron expression and using the cron_schedule= flag within the CLI command for this job. The following command will scrap on the 30 minute marker the specified sitemap.
node server.js --sitemap_index_url=http://www.nytimes-se.com/nytse/sitemap.xml --cron_schedule="30 * * * *"

Saving

  • Files by default are saved into an XML format. Plans for CSV formatting will be made available.