Semantic data extraction for the web, primarily targeting the browser, with no external runtime dependencies

Primary language: HTML. License: ISC.

Woven

A very, very immature tool for extracting semantic data from web pages, primarily targeting the browser, with no external runtime dependencies.

Include it on your page:

<script src="woven.min.js"></script>

Or require it in Node/io.js:

var woven = require("woven")

All methods take an HTMLDocument or HTMLElement as the first argument. You can get ahold of one in the browser:

// as the main document global variable
var doc = document

// with any `getElement-*` method
var elem = document.getElementById("some-element")

// by parsing some HTML
var parser = new DOMParser()
var html = "<meta name='a' content='b'>"
var docFragment = parser.parseFromString(html, "text/html")

// with some help from a library such as jQuery
var elem = $("#some-element")[0]

In Node/io.js, you can use jsdom (or similar):

var jsdom = require("jsdom")
var html = "<meta name='a' content='b'>"
var docFragment = jsdom.jsdom(html)

Data Sources

Given a docFragment:

<div itemscope itemtype="http://data-vocabulary.org/Person">
   My name is <span itemprop="name">Bob Smith</span>,
   but people call me <span itemprop="nickname">Smithy</span>.
   Here is my homepage:
   <a href="http://www.example.com" itemprop="url">www.example.com</a>.
   I live in
   <span itemprop="address" itemscope
      itemtype="http://data-vocabulary.org/Address">
      <span itemprop="locality">Albuquerque</span>,
      <span itemprop="region">NM</span>
   </span>
   and work as an <span itemprop="title">engineer</span>
   at <span itemprop="affiliation">ACME Corp</span>.
</div>

woven.extractSchemaItems(docFragment) // =>
[ { itemtype: 'http://data-vocabulary.org/Person',
    name: 'Bob Smith',
    nickname: 'Smithy',
    url: 'www.example.com',
    address:
     { itemtype: 'http://data-vocabulary.org/Address',
       locality: 'Albuquerque',
       region: 'NM' },
    title: 'engineer',
    affiliation: 'ACME Corp' } ]
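
For readers curious how this kind of traversal can work, here is a minimal sketch. It is not woven's actual implementation: the names `sketchExtractItems`, `sketchReadItem` and `collectProps` are hypothetical, and it assumes every property value is the element's trimmed text, which matches the output above (where `url` is the link text rather than the `href`).

```javascript
// Hypothetical sketch of microdata traversal; not woven's actual source.
function sketchReadItem(scope) {
  var item = { itemtype: scope.getAttribute("itemtype") }
  collectProps(scope, item)
  return item
}

function collectProps(parent, item) {
  for (var i = 0; i < parent.children.length; i++) {
    var child = parent.children[i]
    var prop = child.getAttribute("itemprop")
    var isScope = child.hasAttribute("itemscope")
    if (prop && isScope) {
      // A property that opens its own scope becomes a nested item.
      item[prop] = sketchReadItem(child)
    } else if (prop) {
      // Assumption: plain properties take the element's trimmed text.
      item[prop] = child.textContent.trim()
    } else if (!isScope) {
      // Unmarked wrappers may still contain properties of this item.
      collectProps(child, item)
    }
  }
}

function sketchExtractItems(root) {
  var items = []
  var scopes = root.querySelectorAll("[itemscope]")
  for (var i = 0; i < scopes.length; i++) {
    // Nested scopes are reached through their parent item instead.
    if (!scopes[i].hasAttribute("itemprop")) {
      items.push(sketchReadItem(scopes[i]))
    }
  }
  return items
}
```

The sketch relies only on `querySelectorAll`, `children`, `getAttribute`, `hasAttribute` and `textContent`, so it should run against a browser document or a jsdom one. Note that `querySelectorAll` searches descendants only, so the root passed in must contain the `itemscope` element rather than be it.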

Page <meta> Data

Given a page:

<html>
  <head>
    <meta name="title" content="I Am a Teapot">
    <meta name="keywords" content="self being vessel">
    <meta property="og:title" content="I Am a Teapot">
    <meta property="not-real-property" content="418">
  </head>
  <body>
    <h1>I Am a Teapot</h1>
  </body>
</html>

woven.extractDocumentMeta(document) // =>
{ title: 'I Am a Teapot',
  keywords: 'self being vessel',
  'og:title': 'I Am a Teapot',
  'not-real-property': '418' }
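
As a rough illustration of what such an extraction involves, the sketch below collects every `<meta>` tag that has a `name` or `property` attribute into a flat object. `sketchExtractMeta` is a hypothetical name and this is not woven's actual source; it is only one plausible way to arrive at output shaped like the example above.

```javascript
// Hypothetical re-implementation for illustration; not woven's actual source.
function sketchExtractMeta(doc) {
  var result = {}
  var tags = doc.querySelectorAll("meta")
  for (var i = 0; i < tags.length; i++) {
    var tag = tags[i]
    // Standard tags use `name`; Open Graph style tags use `property`.
    var key = tag.getAttribute("name") || tag.getAttribute("property")
    var value = tag.getAttribute("content")
    // Skip tags without a key or content, such as <meta charset="utf-8">.
    if (key && value !== null) {
      result[key] = value
    }
  }
  return result
}
```

Called as `sketchExtractMeta(document)` on the page above, this would be expected to yield the same four keys. If two tags share a key, the later one wins in this sketch.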

Microformats

Work in progress.

hAudio

(Look, it's the one I needed.)

Given a docFragment:

<div class="haudio">
   <span class="fn">Start Wearing Purple</span> by
   <span class="contributor">
        <span class="vcard">
            <span class="fn org">Gogol Bordello</span>
        </span>
    </span>
   found on
   <span class="album">Underdog World Strike</span>
</div>

woven.extractHAudio(docFragment) // =>
[ { fn: 'Start Wearing Purple',
    contributor: 'Gogol Bordello',
    album: 'Underdog World Strike' } ]

Development

Tests

They're in Mocha.

$ mocha test/*

Or, automatically on file changes:

$ mocha --watch test/*

Building

$ gulp

Builds the Browserified version, a minified version of it, and the corresponding source map.

TODO

  • extractAll method
  • extractMicroformats method
  • More individual microformats
  • Meaningful breakdown of common meta tags
  • Interface for page meta with fallthrough for values
  • More real-world example tests
  • Browser-based tests?
  • Visualizer?
  • Commandline tool?
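
To make the "fallthrough for values" item more concrete, here is one possible shape for it, sketched against the flat object that `extractDocumentMeta` returns. `firstMetaValue` is a hypothetical helper, not part of woven, and the key order used in the example is only an assumption about what a caller might prefer.

```javascript
// Hypothetical helper; `firstMetaValue` is not part of woven's API.
function firstMetaValue(meta, keys) {
  for (var i = 0; i < keys.length; i++) {
    var value = meta[keys[i]]
    // Treat missing and empty values alike so the next key gets a chance.
    if (typeof value === "string" && value.trim() !== "") {
      return value
    }
  }
  return undefined
}

var meta = { title: "I Am a Teapot", "og:title": "", keywords: "self being vessel" }
var title = firstMetaValue(meta, ["og:title", "twitter:title", "title"])
// title is "I Am a Teapot": "og:title" is empty and "twitter:title" is absent
```

Returning `undefined` when nothing matches leaves the choice of default to the caller.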