pres-node-web-scraping

A presentation about web scraping techniques using Node.js for the Node.jSTL meetup.

Overview

The presentation is built with Reveal.js and is hosted on GitHub Pages: http://paulzerkel.github.io/pres-node-web-scraping/

Edit the root index.html file to change the presentation. All of the files for Reveal.js are in /reveal. The master branch contains the live site.

Examples

All examples are in /examples. Run npm install in each directory to install the required dependencies.

  1. trivial - A simple example of finding keywords in HTTP responses with regex
  2. dom - Demonstrates the advantage of having a DOM
  3. automate - Full browser automation

The examples target a site that was made specifically for this presentation. The target site is also available; see its repo for more information.

Notes

Notes for the presentation follow.

Title

  • Web Scraping in Node.js
  • 8/11/2016
  • This presentation will cover what web scraping is and three approaches using Node.js.

Who Am I?

  • My name is Paul Zerkel and I run the tech side of the local startup Manufacturers’ Inventory. We use Node.js for parts of our backend systems.
  • I also work as a software consultant where my focus is development and system architecture.

A Quick Background on Scraping

  • Scraping is the act of programmatically consuming a user interface and extracting unstructured information for the purpose of creating structured data.
  • It effectively builds a read-only API on top of a user interface as opposed to a database, an existing API, etc.
  • Before web scraping there was screen scraping.
    • Intercept terminal communication and extract relevant data.
    • Listen for window messages in a GUI application and extract data from fields or controls.
    • Still used as a legacy application interface strategy.
  • Document scraping involves extracting structured data from PDFs, Word documents, etc. Tabula is a PDF extraction tool created by ProPublica.
    • Investigative journalists use this technology to extract data from disclosure reports in order to make interactive datasets.
    • Financial documents, such as reports, can contain data that is more valuable if it can be stored and manipulated independent from the document.

Not Covering

  • Web crawling - outside the scope of pure scraping and requires way more infrastructure
  • Text analysis - NLP, mining, etc. is not always the domain you want to use JavaScript for

Web Scraping

  • Process the responses from HTTP requests in order to extract and use the data instead of just displaying it to users.
  • May involve handling the responses in a fashion similar to a browser (or exactly like a browser).
  • At a minimum it requires parsing data out of raw web server responses.
  • At the other extreme it can mean programmatically driving a web browser in order to have a dynamic DOM and JavaScript support.

Common Examples

  • If a website doesn't have a publicly available API, scraping may be an answer.
  • Build applications that require 3rd party web data
    • Alerts - BYO Google Alerts, Price changes
    • Display a page in another format - Pocket, Instapaper, etc
    • Data aggregation - Padmapper, crime stats
    • Financial data - Yodlee
    • Niche search engine - Scrape social media and blog posts and analyze them

Pros of Web Scraping

  • In best-case scenarios, web scraping is significantly more straightforward than traditional screen scraping.
  • You don't need to deal with binary protocols or operating system level messaging.
  • HTML provides some structure to the data.
  • It is generally easy to reason about using tools we're already familiar with.
    • Chrome developer tools
    • As simple as basic text processing
    • Selectors for specifying what data to read
    • Curl for reading raw responses on the commandline

Potential Issues

  • Web scraping is still not foolproof and comes with its own set of challenges
    • If you have the option to just use an API - DO IT
  • Performance isn't going to be great
    • Extra layers of complexity
    • More logic required
    • More likely to be throttled
  • UI changes can render existing web scraping scripts useless.
    • Ideally, UI changes would be done via CSS only and the HTML structure wouldn't change, but that rarely happens.
    • Structured schemas exist but adoption is infrequent and inconsistent.
    • The data may simply no longer be available.
  • Client heavy web applications will require a full DOM and JavaScript interpreter.
  • A site's terms of service and robots.txt may prevent you from scraping a site.
  • Rate limiting, session tracking, IP bans, etc can all disrupt your ability to access a web site in a timely manner.

Trivial Example

  • Simplest use case is checking for the presence of text in the response body.
  • You don't need anything other than the standard Node http module (see the sketch after this list).
  • Easy to implement and can be used for alerts or the foundation for more complicated processing.
  • It's also easy to outgrow this. You'll find yourself spending more and more time implementing logic around the request, response, and body processing. Still - start easy!
  • Consider helper libs such as got
    • got extends the built-in HTTP functionality and lets you use a more concise promise-based syntax.
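
A minimal sketch of the trivial approach, using only the built-in http module; the URL and keyword are placeholders, not values from the examples directory:

```javascript
// Fetch a page and test the raw body for a keyword with a regex.
const http = require('http');

http.get('http://example.com/', (res) => {
  let body = '';
  res.setEncoding('utf8');
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    // The keyword pattern is illustrative.
    if (/widget/i.test(body)) {
      console.log('Keyword found');
    } else {
      console.log('Keyword not found');
    }
  });
}).on('error', (err) => console.error(err));
```

With got, the same fetch collapses to a promise chain, e.g. `got('http://example.com/').then((res) => /widget/i.test(res.body))`.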

Build a DOM

  • Using a regex is fine if you know exactly what you're looking for.
  • What if you don't know the value of what you're looking for? Do you want to try to parse HTML etc with a regex? Probably not!
  • cheerio provides a DOM, selectors, and a jQuery-style API for manipulating the DOM.

More on Cheerio

  • JSDOM replacement
  • jQuery style API
  • .load(body), .find('selector'), .each((index, element) => …), etc. (see the sketch after this list)
  • Test in the browser with jQuery, or just test your selectors with document.querySelectorAll
  • Minor gotcha - the DOM reflects the actual HTML and won't automatically insert elements such as tbody.
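
A minimal sketch of the Cheerio workflow; the HTML string stands in for a fetched response body:

```javascript
const cheerio = require('cheerio');

// In practice this body would come from an HTTP response.
const body = '<table><tr><td>Widget</td><td>$9.99</td></tr></table>';

// load() parses the HTML and returns a jQuery-like $ function.
const $ = cheerio.load(body);

// Note the gotcha above: no tbody is inserted, the DOM mirrors the markup.
$('tr').each((index, element) => {
  const cells = $(element).find('td');
  console.log(cells.first().text(), cells.last().text());
});
```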

Browser Automation

  • Great for dynamic content, logging in, navigation, etc
    • Any example where you may need to have session state
    • Some sites build a profile around whether they think you're a "real" user (Google, for example)
  • Includes or drives a fully functional browser
    • This means that we can now include more sophisticated logic into our scraping
  • Browser plugins (macros), Selenium, PhantomJS/Casper, Nightmare, and more
    • This can be a huge pain in the butt to integrate with your application

Nightmare

  • Nightmare is a high-level browser automation library
  • Based on the Electron project (which lets you run Chromium via Node)
    • This means that you can include full browser automation with an npm install command

Driving Nightmare

  • Provides functions such as goto, wait, screenshot, click, refresh, etc that drive the browser
  • Inject your own .js file in order to run your own JavaScript in the context of the browser
  • The evaluate function takes JavaScript and will also evaluate it in the context of the page. It is used as a bridge BACK to your Node process (see the sketch below).
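
A minimal sketch of driving Nightmare; the URL and selector are placeholders:

```javascript
const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: false });

nightmare
  .goto('http://example.com/')
  .wait('h1')            // wait until the selector appears on the page
  .evaluate(() => {
    // Runs inside the page; the return value crosses back to Node.
    return document.querySelector('h1').textContent;
  })
  .end()
  .then((heading) => console.log(heading))
  .catch((err) => console.error(err));
```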

High Level Architecture

  • Probably overkill!
  • Designed to be fairly decomposed which means there is management overhead
  • Split up the process of gathering raw data and scraping the data you really want (see the sketch after this list)
  • Storage, transport, and what sections can be combined depend on the situation
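
One hedged way to structure that split, with illustrative file paths: a gather step that only persists raw responses, and a separate scrape step that parses them later.

```javascript
const fs = require('fs');
const http = require('http');
const cheerio = require('cheerio');

// Stage 1: gather raw pages and store them untouched.
function gather(url, file) {
  http.get(url, (res) => {
    let body = '';
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => fs.writeFileSync(file, body));
  });
}

// Stage 2: scrape stored pages on its own schedule, so parser
// changes never require re-fetching the source data.
function scrape(file) {
  const $ = cheerio.load(fs.readFileSync(file, 'utf8'));
  return $('title').text();
}

// e.g. gather('http://example.com/', 'raw/example.html'),
// then later: scrape('raw/example.html')
```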

Gotchas

  • Set a User-Agent on your client (see the sketch after this list)
  • Be nice to your target sites
    • You can and will get cut off, especially from sites like Google
  • Schedule scraping for off hours
  • Throttle requests
  • Obey robots.txt
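
A hedged sketch combining two of these points: identifying your client with a User-Agent header and spacing requests out instead of firing them in a burst. The header value, paths, and five-second delay are illustrative.

```javascript
const http = require('http');

const paths = ['/page-1', '/page-2', '/page-3'];

function fetchPage(path) {
  http.get({
    host: 'example.com',
    path: path,
    // Identify yourself so the site owner can see who is scraping.
    headers: { 'User-Agent': 'my-scraper/1.0 (me@example.com)' }
  }, (res) => {
    let body = '';
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => console.log(path, body.length));
  });
}

// Throttle: one request every five seconds instead of all at once.
paths.forEach((path, i) => {
  setTimeout(() => fetchPage(path), i * 5000);
});
```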