/scrape

Distributed Scraper

This is a scraper function that automatically extracts metadata from a page and supports simple HTML querying with cheerio.

It's built on top of stdlib, which makes it distributed and scalable.

Usage

You can either use the hosted service that's already deployed on stdlib here, or fork this repository and deploy your own version to stdlib.

Example

For example, a simple scrape that picks up my own email address from GitHub (along with a bunch of extra metadata):

lib nemo.scrape --url https://github.com/nemo --query "li[itemprop='email'] a"
{ metadata:
   { general:
      { description: 'nemo has 36 repositories available. Follow their code on GitHub.',
        title: 'nemo (Nima Gardideh) · GitHub',
        lang: 'en' },
     openGraph:
      { app_id: '1401488693436528',
        image: [Object],
        site_name: 'GitHub',
        type: 'profile',
        title: 'nemo (Nima Gardideh)',
        url: 'https://github.com/nemo',
        description: 'nemo has 36 repositories available. Follow their code on GitHub.',
        username: 'nemo' },
     schemaOrg: { items: [Object] },
     twitter:
      { image: [Object],
        site: '@github',
        card: 'summary',
        title: 'nemo (Nima Gardideh)',
        description: 'nemo has 36 repositories available. Follow their code on GitHub.' } },
  url: 'https://github.com/nemo',
  query: 'li[itemprop=\'email\'] a',
  query_value: 'nima@halfmoon.ws'
}

You can view the function specification here.
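
The response in the example above is a plain object, so pulling values out of it needs no extra tooling. Here is a minimal sketch, assuming the response shape shown in that example. `summarize` is a hypothetical helper, not part of the service, and the fallback order (Open Graph, then Twitter card, then general) is just one reasonable choice:

```javascript
// Hypothetical helper: not part of the scraper service itself.
// Assumes the response shape shown in the example output.
function summarize(result) {
  const r = result || {};
  const meta = r.metadata || {};
  const general = meta.general || {};
  const openGraph = meta.openGraph || {};
  const twitter = meta.twitter || {};

  return {
    url: r.url || null,
    // Fallback order is an assumption: Open Graph, then Twitter card, then general.
    title: openGraph.title || twitter.title || general.title || null,
    description:
      openGraph.description || twitter.description || general.description || null,
    // Value matched by the query, or null when the field is absent.
    match: r.query_value !== undefined ? r.query_value : null
  };
}

// Trimmed copy of the example response.
const example = {
  metadata: {
    general: { title: 'nemo (Nima Gardideh) · GitHub', lang: 'en' },
    openGraph: { site_name: 'GitHub', title: 'nemo (Nima Gardideh)' },
    twitter: { site: '@github', title: 'nemo (Nima Gardideh)' }
  },
  url: 'https://github.com/nemo',
  query: "li[itemprop='email'] a",
  query_value: 'nima@halfmoon.ws'
};

console.log(summarize(example));
```

Missing metadata sections fall through to `null` instead of throwing, which helps when a page exposes only some of the tags.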

Notes

Note that this scraper does not support sites that are single-page JavaScript applications. You should also follow robots.txt rules when you're scraping websites. Use responsibly.
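
One way to respect robots.txt is to check a URL against the site's rules before scraping it. The sketch below is a deliberately simplified illustration, not something the service does for you: `disallowedPaths` and `isAllowed` are hypothetical names, only `User-agent` and `Disallow` lines with plain path prefixes are handled, and rules for `*` are merged with rules for the named agent, which is stricter than the real precedence rules. For anything serious, a dedicated robots.txt parser is a better choice.

```javascript
// Simplified robots.txt check (illustrative only).
// Honours only User-agent and Disallow lines with plain path prefixes;
// Allow rules, wildcards and Crawl-delay are not honoured.
function disallowedPaths(robotsTxt, userAgent = '*') {
  const agent = userAgent.toLowerCase();
  const paths = [];
  let groupAgents = [];
  let inRules = false; // true once the current group has started listing rules

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // A User-agent line that follows rules starts a new group.
      if (inRules) {
        groupAgents = [];
        inRules = false;
      }
      groupAgents.push(value.toLowerCase());
    } else if (field === 'disallow' || field === 'allow') {
      inRules = true;
      const applies = groupAgents.includes('*') || groupAgents.includes(agent);
      // An empty Disallow value means "nothing is disallowed".
      if (field === 'disallow' && applies && value !== '') paths.push(value);
    }
  }
  return paths;
}

function isAllowed(robotsTxt, url, userAgent = '*') {
  const path = new URL(url).pathname;
  return !disallowedPaths(robotsTxt, userAgent).some((p) => path.startsWith(p));
}

// Made-up rules for illustration.
const robots = [
  'User-agent: *',
  'Disallow: /private/',
  '',
  'User-agent: examplebot',
  'Disallow: /'
].join('\n');

console.log(isAllowed(robots, 'https://example.com/nemo'));               // true
console.log(isAllowed(robots, 'https://example.com/private/data'));       // false
console.log(isAllowed(robots, 'https://example.com/nemo', 'ExampleBot')); // false
```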

License

MIT