= big_file_parser

Parser for HUGE XML or CSV files

== DESCRIPTION:

Parses large XML or CSV files into plain Ruby objects, using a class you
define to map elements or columns to properties.


== REQUIREMENTS:

libxml-ruby
fastercsv


== INSTALL:

$ gem build big_file_parser.gemspec
$ sudo gem install big_file_parser-x.x.x.gem


== USAGE:

Say you have an XML document like the one below and want to extract the
monkey elements from it.

XML::PARSER

<xml>
  <animals>
    <elephant id="3">
      <name>elephant1</name>
      <trunk>large</trunk>
    </elephant>
    <monkey id="1">
      <name>monkey1</name>
      <yrsold>23</yrsold>
      <personality hilarious="true">quiet</personality>
      <street>300 Monkey St</street>
      <city>Cincinnati</city>
      <state>Oh</state>
      <zip>45219</zip>
    </monkey>
    <monkey id="2">
      <name>monkey2</name>
      <yrsold>33</yrsold>
      <personality hilarious="false">loud</personality>
      <street>301 Monkey St</street>
      <city>Cincinnati</city>
      <state>Oh</state>
      <zip>45219</zip>
    </monkey>
  </animals>
</xml>

class Monkey
  include BigFileParser::XML::Properties
  element :monkey
  property :id, :attribute, :element => :monkey

  property :name, :text

  property :age, :text, :element => :yrsold
        
  property :funny, :attribute, :element => :personality, 
          :attribute => :hilarious
        
  property :loud, :text, :element => :personality
        
  property :street, :text

  property :city, :text

  property :state, :text

  property :zip, :text

  def address
    [street, city, state].compact.join(", ") + " #{zip}"
  end
  
  def age
    @age.to_i
  end

  def funny=(value)
    @funny = (value == "true")
  end
end

# Load the parser with the Monkey class and an XML file
m = BigFileParser::XML::Parser.new(Monkey,'examples/monkeys.xml')

# Run the parser
m.run
=> true

# Inspect the first element
m.elements.first.class
=> Monkey
m.elements.first.id
=> "1"
m.elements.first.name
=> "monkey1"
m.elements.first.age
=> 23
m.elements.first.funny
=> true
m.elements.first.loud
=> "quiet"
m.elements.first.address
=> "300 Monkey St, Cincinnati, Oh 45219"

# Inspect the last element
m.elements.last.class
=> Monkey
m.elements.last.id
=> "2"
m.elements.last.name
=> "monkey2"
m.elements.last.age
=> 33
m.elements.last.funny
=> false
m.elements.last.loud
=> "loud"
m.elements.last.address
=> "301 Monkey St, Cincinnati, Oh 45219"


CSV::PARSER

CSV files are handled similarly.

id,name,age,funny,loud,street,city,state,zip
1,monkey1,23,true,quiet,300 Monkey St,Cincinnati,Oh,45219
2,monkey2,33,false,loud,301 Monkey st,Cincinnati,Oh,45219

m = BigFileParser::CSV::Parser.new(Monkey,'examples/monkeys.csv')

# Run the parser
m.run
=> true

# Inspect the first element
m.elements.first.class
=> Monkey
m.elements.first.address
=> "300 Monkey St, Cincinnati, Oh 45219"

pp m.elements.first
#<Monkey:0x141a54
 @age="23",
 @city="Cincinnati",
 @funny=true,
 @id="1",
 @loud="quiet",
 @name="monkey1",
 @state="Oh",
 @street="300 Monkey St",
 @zip="45219">
=> nil

pp m.elements.last
#<Monkey:0x112538
 @age="33",
 @city="Cincinnati",
 @funny=false,
 @id="2",
 @loud="loud",
 @name="monkey2",
 @state="Oh",
 @street="301 Monkey st",
 @zip="45219">



XML::INSPECTOR

In addition, there is an Inspector class that inspects an XML document
and reports the properties found in it.  This is especially helpful if
you do not have an XML schema and are trying to understand a large
document.

The Inspector runs through the entire document to find every element
and attribute that ever appears inside the element you name.

# Load the inspector, passing the element name you are interested in
# and the xml document
m = BigFileParser::XML::Inspector.new(:monkey,'examples/monkeys.xml')

# Run the inspector and take a look at the output
m.run

element :monkey
property :hilarious, :attribute, :element => :personality
property :id, :attribute, :element => :monkey
property :city, :text
property :name, :text
property :personality, :text
property :state, :text
property :street, :text
property :yrsold, :text
property :zip, :text
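
The full-document pass described above can be sketched with REXML's
stream listener.  This is only an illustration of the idea, not the gem's
implementation (which depends on libxml-ruby); PropertyCollector and
inspect_properties are hypothetical names, and every child element is
simply reported as a :text property.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "set"

# Hypothetical sketch, not part of big_file_parser: collects the child
# elements and attributes seen inside every occurrence of one element.
class PropertyCollector
  include REXML::StreamListener

  attr_reader :text_elements, :attributes

  def initialize(target)
    @target = target.to_s
    @depth = 0                 # greater than 0 while inside a target element
    @text_elements = Set.new   # names of child elements
    @attributes = Set.new      # [attribute name, element name] pairs
  end

  def tag_start(name, attrs)
    if name == @target
      @depth += 1
    elsif @depth > 0
      @text_elements << name
    end
    return unless @depth > 0

    attrs.each { |attr, _value| @attributes << [attr, name] }
  end

  def tag_end(name)
    @depth -= 1 if name == @target
  end
end

# xml may be a String or an open IO; the document is streamed, not
# loaded into a tree.
def inspect_properties(xml, target)
  collector = PropertyCollector.new(target)
  REXML::Document.parse_stream(xml, collector)

  lines = ["element :#{target}"]
  collector.attributes.sort.each do |attr, element|
    lines << "property :#{attr}, :attribute, :element => :#{element}"
  end
  collector.text_elements.sort.each do |element|
    lines << "property :#{element}, :text"
  end
  lines
end
```

Calling puts inspect_properties(File.open('examples/monkeys.xml'), :monkey)
would print one line per discovered property, in the same style as the
Inspector output above.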


CSV::INSPECTOR

The CSV inspector requires that the CSV file has a header row.  If it
does, the inspector simply reads the first line of the file and outputs
the possible properties, similar to the XML::Inspector.

m = BigFileParser::CSV::Inspector.new('examples/monkeys.csv')
m.run

property :id, 0
property :name, 1
property :age, 2
property :funny, 3
property :loud, 4
property :street, 5
property :city, 6
property :state, 7
property :zip, 8
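
The header-reading step can be approximated with Ruby's csv library.
This is a sketch under that assumption rather than the gem's own code
(which lists fastercsv as a requirement); header_properties is a
hypothetical helper name.

```ruby
require "csv"

# Hypothetical helper, not part of big_file_parser: reads only the header
# row and returns one "property :name, index" line per column.
def header_properties(path)
  header = File.open(path, &:readline)
  CSV.parse_line(header).each_with_index.map do |name, index|
    "property :#{name.to_s.strip}, #{index}"
  end
end
```

Because only the first line is read, the cost does not depend on the
size of the file.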



== FEATURES/PROBLEMS:

Each class can map to only one element right now.

Only UTF-8 input is supported.  If parsing an XML document fails, try
stripping the invalid byte sequences with iconv:
$ iconv -f UTF-8 -t UTF-8 -c infile.xml > outfile.xml

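The same clean-up can be scripted as a fallback when parsing fails:

require 'shell_command'

# Listing is a properties class defined like Monkey above.
source = '/tmp/20090303_Listings.xml'
begin
  puts "Parsing XML file"
  m = BigFileParser::XML::Parser.new(Listing, source)
  m.run
rescue LibXML::XML::Error
  # Parsing failed, most likely on invalid bytes: write a cleaned copy
  # with iconv and parse that instead.
  puts "Exception caught, fixing encoding"
  command = ShellCommand.new('iconv -f UTF-8 -t UTF-8 -c')
  file = "/tmp/#{Time.now.to_i}_listings.xml"
  command.run("#{source} > #{file}")
  m = BigFileParser::XML::Parser.new(Listing, file)
  puts "Re-parsing XML file"
  m.run
end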
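
If shelling out to iconv is not an option, the invalid bytes can also be
dropped in Ruby itself.  The sketch below assumes Ruby 2.1 or later (for
String#scrub); strip_invalid_utf8 is a hypothetical helper, not part of
big_file_parser.

```ruby
# Hypothetical helper: copies infile to outfile line by line, dropping
# any byte sequences that are not valid UTF-8 (like `iconv -c`).
def strip_invalid_utf8(infile, outfile)
  File.open(outfile, "wb") do |out|
    File.open(infile, "rb") do |src|
      src.each_line do |line|
        # Lines are read as raw bytes; tag them as UTF-8, then scrub.
        out.write(line.force_encoding(Encoding::UTF_8).scrub(""))
      end
    end
  end
  outfile
end
```

The file is processed one line at a time, so memory use stays small as
long as the document contains line breaks.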


== LICENSE:

(The MIT License)

Copyright (c) 2008 FIX

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
'Software'), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.