/simple-node-scraper

DO NOT USE - Proof of concept and playground to test threading and max concurrency with scraping and indexing sites

Primary language: JavaScript. License: GNU General Public License v3.0 (GPL-3.0).

Link Checker

Simple proof of concept that scans a URL and writes the links it extracts to standard output.

Notes & Usage

  • Requires Node 10+ for the global URL class
  • This will fail against a large production site like reddit.com, whose firewall will block the burst of requests. To test, use a smaller site.
  • This should not be an on-demand service, but a scheduled service that continually updates a sitemap, with better controls
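
The Node 10+ requirement comes from relying on URL as a global. As a rough illustration of why that matters for link extraction, here is a minimal sketch that resolves each href against the page address. The extractLinks helper and its regex are assumptions made for this example, not the code behind visit(), and a real HTML parser would be more robust than a regex.

```javascript
// Hypothetical helper (not part of this repository): collect absolute
// http(s) links from an HTML string. URL is a global from Node 10 onward,
// so no require() is needed.
function extractLinks(html, baseUrl) {
  // Naive pattern for <a ... href="..."> tags; good enough for a sketch.
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  const links = new Set(); // a Set drops duplicate links
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      // Resolve relative hrefs such as "/about" against the page URL.
      const resolved = new URL(match[1], baseUrl);
      if (resolved.protocol === 'http:' || resolved.protocol === 'https:') {
        resolved.hash = ''; // "#fragment" variants point at the same page
        links.add(resolved.href);
      }
    } catch (e) {
      // new URL() throws on malformed input; skip that href.
    }
  }
  return [...links];
}
```

Non-web schemes such as mailto: are skipped, since they cannot be visited as pages.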

Module Usage

yarn install
npm test

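const visit = require('./visit');

// Invoke the async function immediately so that await is valid here.
(async () => {
    try {
        const url = 'https://partnercomm.net/trending/'; // page to scan
        const links = await visit(url);
        // loop/reduce your links here
    } catch (e) {
        // Handle any errors
    }
})();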

CLI Usage

nvm use 10
yarn install
node cli.js -u https://partnercomm.net/trending/

TODOS

  • Convert to TypeScript
  • Offload child promises to a queue server
  • Add throttling and concurrency
  • Handle files other than pages
  • Come up with error handling for modified date
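
For the throttling and concurrency item, one possible shape is a small worker pool that keeps at most a fixed number of requests in flight. This is only a sketch under assumptions: mapWithConcurrency is a hypothetical name, nothing like it exists in this repository yet, and it limits parallelism but does not add delays between requests.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// calls in flight at once. Results keep the order of `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all runners

  // Each runner claims items one at a time until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Start no more runners than there are items, and at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, runner));
  return results;
}
```

Assuming visit(url) resolves to a list of links, as in the module usage above, child pages could then be crawled five at a time with mapWithConcurrency(links, 5, (link) => visit(link)). If any worker rejects, the returned promise rejects with that error, so per-link error handling belongs inside the worker.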

Examples