Woven

A very, very immature tool for extracting semantic data from web pages, primarily targeting the browser, with no external runtime dependencies.

Include it on your page:

<script src="woven.min.js"></script>

Or require it in Node/io.js:

var woven = require("woven")

All methods take an HTMLDocument or HTMLElement as the first argument. You can get ahold of one in the browser:

// as the main document global variable
var doc = document

// with any `getElement-*` method
var elem = document.getElementById("some-element")

// by parsing some HTML
var parser = new DOMParser()
var html = "<meta name='a' content='b'>"
var docFragment = parser.parseFromString(html, "text/html")

// with some help from jQuery (or similar)
var elem = $("#some-element")[0]

In Node/io.js, you can use jsdom (or similar):

var jsdom = require("jsdom")
var html = "<meta name='a' content='b'>"
var docFragment = jsdom.jsdom(html)

Data Sources

Schema.org Data

Given a docFragment:

<div itemscope itemtype="http://data-vocabulary.org/Person">
   My name is <span itemprop="name">Bob Smith</span>,
   but people call me <span itemprop="nickname">Smithy</span>.
   Here is my homepage:
   <a href="http://www.example.com" itemprop="url">www.example.com</a>.
   I live in
   <span itemprop="address" itemscope
      itemtype="http://data-vocabulary.org/Address">
      <span itemprop="locality">Albuquerque</span>,
      <span itemprop="region">NM</span>
   </span>
   and work as an <span itemprop="title">engineer</span>
   at <span itemprop="affiliation">ACME Corp</span>.
</div>

woven.extractSchemaItems(docFragment) // =>
[ { itemtype: 'http://data-vocabulary.org/Person',
    name: 'Bob Smith',
    nickname: 'Smithy',
    url: 'www.example.com',
    address:
     { itemtype: 'http://data-vocabulary.org/Address',
       locality: 'Albuquerque',
       region: 'NM' },
    title: 'engineer',
    affiliation: 'ACME Corp' } ]
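
Because nested items appear as plain nested objects, the result can be walked with ordinary JavaScript. A minimal sketch, assuming the output shape shown above; `itemsOfType` is a hypothetical helper and not part of woven:

```javascript
// Hypothetical helper (not part of woven): collect every item, including
// nested ones, whose `itemtype` matches `type`.
function itemsOfType(items, type) {
  var found = []
  function visit(value) {
    if (Array.isArray(value)) {
      value.forEach(visit)
    } else if (value && typeof value === "object") {
      if (value.itemtype === type) found.push(value)
      Object.keys(value).forEach(function (key) { visit(value[key]) })
    }
  }
  visit(items)
  return found
}

// The result shape from the example above, abbreviated and written as a literal
var items = [ {
  itemtype: "http://data-vocabulary.org/Person",
  name: "Bob Smith",
  address: {
    itemtype: "http://data-vocabulary.org/Address",
    locality: "Albuquerque",
    region: "NM"
  }
} ]

var addresses = itemsOfType(items, "http://data-vocabulary.org/Address")
console.log(addresses[0].locality) // => Albuquerque
```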

Page `<meta>` Data

Given a page:

<html>
  <head>
    <meta name="title" content="I Am a Teapot">
    <meta name="keywords" content="self being vessel">
    <meta property="og:title" content="I Am a Teapot">
    <meta property="not-real-property" content="418">
  </head>
  <body>
    <h1>I Am a Teapot</h1>
  </body>
</html>

woven.extractDocumentMeta(document) // =>
{ title: 'I Am a Teapot',
  keywords: 'self being vessel',
  'og:title': 'I Am a Teapot',
  'not-real-property': '418' }
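
To give a feel for what this kind of extraction involves, here is a rough sketch that produces the same shape of output. It is an illustration only and may not match how woven does it; `sketchExtractMeta` is a hypothetical name, and the code assumes only that its argument supports `querySelectorAll`, as HTMLDocument and HTMLElement do:

```javascript
// Illustrative sketch only; woven's real implementation may differ.
// Assumes `doc` supports querySelectorAll and its results support getAttribute.
function sketchExtractMeta(doc) {
  var result = {}
  var tags = doc.querySelectorAll("meta")
  for (var i = 0; i < tags.length; i++) {
    var tag = tags[i]
    // Both <meta name="..."> and <meta property="..."> supply the key
    var key = tag.getAttribute("name") || tag.getAttribute("property")
    var content = tag.getAttribute("content")
    // Skip tags without a key or content, e.g. <meta charset="utf-8">
    if (key && content !== null) result[key] = content
  }
  return result
}

// In the browser: sketchExtractMeta(document)
```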

Microformats

Work in progress.

hAudio

(Look, it's the one I needed.)

Given a docFragment:

<div class="haudio">
   <span class="fn">Start Wearing Purple</span> by
   <span class="contributor">
        <span class="vcard">
            <span class="fn org">Gogol Bordello</span>
        </span>
    </span>
   found on
   <span class="album">Underdog World Strike</span>
</div>

woven.extractHAudio(docFragment) // =>
[ { fn: 'Start Wearing Purple',
    contributor: 'Gogol Bordello',
    album: 'Underdog World Strike' } ]

Development

Tests

They're in Mocha.

$ mocha test/*

Or, automatically on file changes:

$ mocha --watch test/*

Building

$ gulp

Builds the Browserified version, a minified version of that, and the corresponding source map.

TODO

extractAll method
extractMicroformats method
More individual microformats
Meaningful breakdown of common meta tags
Interface for page meta with fallthrough for values
More real-world example tests
Browser-based tests?
Visualizer?
Commandline tool?
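
One possible shape for the "fallthrough for values" item, sketched against the object that `extractDocumentMeta` is shown returning above. Nothing here exists in woven yet; `firstMeta` and the key order are assumptions made for illustration:

```javascript
// Hypothetical helper (not part of woven): return the first non-empty
// string found under any of `keys`, tried in order.
function firstMeta(meta, keys) {
  for (var i = 0; i < keys.length; i++) {
    var value = meta[keys[i]]
    if (typeof value === "string" && value !== "") return value
  }
  return undefined
}

// Same shape as the extractDocumentMeta example, written as a literal
var meta = {
  title: "I Am a Teapot",
  keywords: "self being vessel",
  "og:title": "I Am a Teapot"
}

// Prefer the Open Graph title, then fall through to the plain title
var title = firstMeta(meta, ["og:title", "title"])
console.log(title) // => I Am a Teapot
```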

dluxemburg/woven