Simple HTML data harvester with XML scheme
#Dependencies
This library needs HtmlUnit to work.
I currently use version 2.21.
#XML scheme file
An XML document describes which elements you want to extract from a web page.
The XML document should follow this DTD.
You can find sample XML documents here.
A few things to know:
- "page" elements represent the scheme of a web page.
- "data" elements represent a single element to extract from the web page.
- "form" elements represent an HTML form to fill in.
- When XPaths are built, each xargs id found in the XPath is replaced by one of the arg values (all combinations of XPaths are built).
- All names and ids are unique and totally arbitrary (user defined).
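
The xargs substitution described above can be illustrated with a minimal sketch. It is not the library's actual code: the class name `XPathExpander`, the method `expand`, and the assumption that an xargs id appears verbatim in the XPath and is replaced by plain string substitution are all hypothetical, chosen only to show how "all combinations" are produced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the library: it only illustrates
// how every combination of arg values can yield one XPath each.
public class XPathExpander {

    // template: an XPath containing xargs ids (assumed to appear verbatim).
    // xargs: maps each xargs id to the list of its arg values.
    public static List<String> expand(String template, Map<String, List<String>> xargs) {
        List<String> results = new ArrayList<>();
        results.add(template);
        for (Map.Entry<String, List<String>> entry : xargs.entrySet()) {
            String id = entry.getKey();
            List<String> next = new ArrayList<>();
            for (String partial : results) {
                if (!partial.contains(id)) {
                    // This id is not used in the XPath: keep it unchanged.
                    next.add(partial);
                    continue;
                }
                // One new XPath per arg value of this id.
                for (String value : entry.getValue()) {
                    next.add(partial.replace(id, value));
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, List<String>> xargs = new LinkedHashMap<>();
        xargs.put("ROW", Arrays.asList("1", "2"));
        xargs.put("COL", Arrays.asList("1", "3"));
        // Two ids with two values each give four XPaths.
        for (String xpath : expand("//table[@id='prices']//tr[ROW]/td[COL]", xargs)) {
            System.out.println(xpath);
        }
    }
}
```

With two xargs of two values each, the sketch builds four XPaths, one per (ROW, COL) pair. Because the substitution here is a plain string replace, the ids in this sketch must not collide with other text in the XPath.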
#Harvesting data
Once you have the XML scheme file, it is very easy to get data from a web page. You can find a sample here.