Simple HTML data harvester with XML scheme

Primary language: Java

HTML Harvester

#Dependencies

This library needs HtmlUnit to work.
I currently use version 2.21.

#XML scheme file

An XML document describes which elements you need from a web page.
The XML document should follow this DTD.

You can find sample XML documents here.

A few things to note:

  1. A "page" element represents a web page scheme.
  2. A "data" element represents a single element to extract from the web page.
  3. A "form" element represents an HTML form to fill in.
  4. When XPaths are constructed, each xargs id in the XPath is replaced by one of the arg values (every combination of XPaths is built).
  5. All names and ids are unique and totally arbitrary (user defined).
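
The XPath expansion described in item 4 can be sketched roughly as follows. This is only an illustration, not the library's actual code: the class name `XPathExpander`, the method `expand`, and the `{id}` placeholder syntax are all assumptions made for the example, and the real scheme file may mark xargs ids differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class XPathExpander {

    /**
     * Builds every XPath obtained by replacing each placeholder with each
     * of its values. Placeholders are assumed to be written as {id};
     * this syntax is hypothetical and only used for the illustration.
     */
    public static List<String> expand(String template, Map<String, List<String>> xargs) {
        List<String> results = new ArrayList<>();
        results.add(template);
        for (Map.Entry<String, List<String>> entry : xargs.entrySet()) {
            String placeholder = "{" + entry.getKey() + "}";
            List<String> next = new ArrayList<>();
            for (String partial : results) {
                if (!partial.contains(placeholder)) {
                    next.add(partial); // id not used in this XPath: keep it unchanged
                    continue;
                }
                for (String value : entry.getValue()) {
                    next.add(partial.replace(placeholder, value));
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the ids in insertion order, so the output order is stable.
        Map<String, List<String>> xargs = new LinkedHashMap<>();
        xargs.put("row", Arrays.asList("1", "2"));
        xargs.put("col", Arrays.asList("1", "3"));
        for (String xpath : expand("//table/tr[{row}]/td[{col}]", xargs)) {
            System.out.println(xpath);
        }
    }
}
```

With two values for `row` and two for `col`, the sketch produces four XPaths, one per combination.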

#Harvesting data

Once you have the XML scheme file, it is very easy to get data from a web page. You can find a sample here.
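
To give an idea of what extracting a "data" element through an XPath amounts to, here is a minimal sketch. It does not use this library's API or HtmlUnit: it relies on the JDK's own `javax.xml.xpath` package and a small, well-formed inline page as a stand-in, and the class name `XPathExtractDemo` and method `extract` are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtractDemo {

    /**
     * Returns the trimmed text of every node matched by the XPath expression.
     * The input must be well-formed XML (for example XHTML); real-world HTML
     * usually needs a tolerant parser such as HtmlUnit.
     */
    public static List<String> extract(String xhtml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Hypothetical page content, inlined so the example runs offline.
        String page = "<html><body><table>"
                + "<tr><td>Alice</td><td>42</td></tr>"
                + "<tr><td>Bob</td><td>17</td></tr>"
                + "</table></body></html>";
        System.out.println(extract(page, "//tr/td[1]")); // prints [Alice, Bob]
    }
}
```

In the harvester itself the page is loaded through HtmlUnit and the XPaths come from the XML scheme file, so this only mirrors the extraction step.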