/sawtooth

Provides an interface on top of nokogiri to parse XML files like Apache Digester.

                                                               __
                                                   _____....--' .'
                                      _..___...---'._ o      -`(
                       _             | |  _         \   .--.  `\
    ___  __ ___      _| |_ ___   ___ | |_| |__      |   \   \ `|
   / __|/ _` \ \ /\ / / __/ _ \ / _ \| __| '_ \     |o o |  |  |
   \__ \ (_| |\ V  V /| || (_) | (_) | |_| | | |     \___'.-`.  '.
   |___/\__,_| \_/\_/  \__\___/ \___/ \__|_| |_|          |   `---'
  '^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^'

A companion for nokogiri to parse XML files by rules, similar to Apache Commons Digester.

Converting XML structures into Ruby is often an unsatisfying task: you have to choose between implementing a SAX parser (for speed) and using nokogiri features like CSS selectors (for ease of use). At its base, sawtooth parses documents using SAX, but it provides an interface for specifying rules that handle the document.

require 'open-uri'
require 'sawtooth'

rules = Sawtooth.rules do
  before { |doc| doc << [] }                  # 1. create an array for all news items

  on 'rss/channel/item' do
    on_start  { |doc| doc << Hash.new }       # 2. on an item create hash
    on_finish { |doc| doc.parent << doc.pop } # 3. when closing an item, pop from stack and
  end                                         #    append to parent array (from step 1.)

  on_text 'rss/channel/item/*'                # 4. add contents to hash
end

# URI.open comes from open-uri; plain Kernel#open no longer accepts URLs on Ruby 3.0+
result = rules.parse(URI.open('http://rss.cnn.com/rss/edition.rss')).root
p result #=> [{ 'title' => 'Some CNN News...', 'guid' => '...', ...}, ...]

This sample shows the DSL for defining XML parsing rules, here applied to an RSS feed.
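
To make the four numbered steps in the sample easier to follow, here is a minimal plain-Ruby sketch of the stack mechanics they describe. It is not sawtooth's implementation and does not use sawtooth or nokogiri: `SketchStack`, `EVENTS` and `run_sketch` are hypothetical names invented for this illustration, the SAX events are written out by hand instead of coming from a parser, and storing text under the element name is an assumption based on the sample's expected output.

```ruby
# Hypothetical stand-in for the `doc` object passed to the rule blocks.
# It only models the calls used in the sample: <<, pop, parent and root.
class SketchStack
  def initialize
    @items = []
  end

  # Push an object onto the stack.
  def <<(obj)
    @items.push(obj)
    self
  end

  # Remove and return the top of the stack.
  def pop
    @items.pop
  end

  # Top of the stack.
  def current
    @items.last
  end

  # Element directly below the top of the stack.
  def parent
    @items[-2]
  end

  # Bottom of the stack.
  def root
    @items.first
  end
end

# Hand-written SAX-style events for a tiny RSS document (assumed input).
EVENTS = [
  [:start, 'rss'], [:start, 'channel'],
  [:start, 'item'],
  [:start, 'title'], [:text, 'First'], [:finish, 'title'],
  [:start, 'guid'], [:text, 'a1'], [:finish, 'guid'],
  [:finish, 'item'],
  [:start, 'item'],
  [:start, 'title'], [:text, 'Second'], [:finish, 'title'],
  [:finish, 'item'],
  [:finish, 'channel'], [:finish, 'rss']
].freeze

ITEM_PATH = 'rss/channel/item'

def run_sketch(events)
  doc = SketchStack.new
  path = []
  doc << [] # step 1: array that collects all items

  events.each do |type, value|
    case type
    when :start
      path.push(value)
      doc << {} if path.join('/') == ITEM_PATH # step 2: new hash per item
    when :text
      # step 4: text of a direct child of an item goes into the current hash
      inside_item_child = path.length == 4 && path[0, 3].join('/') == ITEM_PATH
      doc.current[path.last] = value if inside_item_child
    when :finish
      # step 3: the receiver (parent array) is looked up before the pop happens
      doc.parent << doc.pop if path.join('/') == ITEM_PATH
      path.pop
    end
  end

  doc.root
end

result = run_sketch(EVENTS)
p result
```

The sketch hard-codes the path checks that the DSL expresses declaratively with `on 'rss/channel/item'` and `on_text 'rss/channel/item/*'`; the point is only to show why the final `root` is an array of hashes.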