/ariel

Fork of A Ruby Information Extraction Library

Primary LanguageRubyMIT LicenseMIT

= Ariel release 0.1.0

== About - Ariel: A Ruby Information Extraction Library
Ariel is a library that allows you to extract information from semi-structured
documents (such as websites). It is different to existing tools because rather
than expecting the developer to write rules to extract the desired information,
Ariel will use a small number of labeled examples to generate and learn
effective extraction rules. It is developed by Alex Bradbury and released under
the MIT license. Ariel was started as a Google Summer of Code project mentored
by Austin Ziegler in 2006.

== Install
gem install ariel

== Announcement

I'm happy to announce the release of Ariel 0.1.0, the result of my Summer of
Code work. This release should be easy to use, very functional, and hopefully
useful - so it's worth trying out. I've put a lot of effort in to writing clear
and straightforward documentation to get your started, so take a look at the
docs available at http://ariel.rubyforge.org. In particular, flick through the
tutorial and quick start guide. If you're interested, you may also want to take
a look at the theory page where I've made a good start on describing the method
Ariel uses to learn extraction rules. If you have any problems or find any bugs,
just send me an email or add it to the issue tracker (see link below). Enjoy.
See the FAQ for a vim snippet to make labeling examples a little easier.

== Quickstart/Basic usage

* @require 'ariel'@
* Define a structure for the information you wish to extract: 
    structure = Ariel::Node::Structure.new do |r|
      r.item :title
      r.item :body
      r.list :comments do |c|
        c.list_item :comment do |d|
          d.item :author
          d.item :body
        end
      end
     end
* Collect a few examples of the sort of document you wish to extract information
  from (pages from the same website for instance).
* Label each example with tags such as <l:title>, <l:comment> and so on in the
  relevant places.
*  Ariel.learn structure, labeled_file1, labeled_file2, labeled_file3
* Find the documents you want to extract information from.
*  extractions = Ariel.extract structure, unlabeled_file1,
  unlabeled_file2
*  extractions[0].search('comments/*/body').each {|e| puts e.extracted_text} =>
  "Great stuff, loving it", "I love life", .....
*  extractions[0].at('comments/34') => nil</tt> (there is no 34th comment, #at
  returns the first result rather than an array of matches).


== Credits
Ariel is developed by Alex Bradbury as a Google Summer of Code project under the
mentoring of Austin Ziegler.

== Links
SVN Repository: http://rubyforge.org/projects/ariel
Issue tracker: http://code.google.com/p/ariel/issues/
Documentation/homepage: http://ariel.rubyforge.org
RDoc: http://ariel.rubyforge.org/rdoc/