A Wikipedia parser that generates a Darwin Core Archive for species pages that use the taxobox or speciesbox templates and their derivatives. The parser currently focuses on the English, German, Spanish and French Wikipedias and works on the article XML dumps.
Multimedia, vernacular names and textual descriptions are extracted. Every section of a wiki page will become a distinct description record with the section title becoming the description "type".
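
The section-to-record mapping described above can be illustrated with a minimal sketch. This is not the parser's actual code: the class name `SectionSplitter`, the method `split`, and the `lead` label for text before the first heading are all assumptions made for illustration. It only shows how wikitext headings such as `== Description ==` could be used to cut a page into (type, text) pairs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the actual parser.
class SectionSplitter {
    // Matches a wikitext heading on its own line, e.g. "== Description ==".
    // Group 1 is the run of '=' signs, group 2 is the section title.
    private static final Pattern HEADING =
        Pattern.compile("^(={2,6})[ \\t]*(.+?)[ \\t]*\\1[ \\t]*$", Pattern.MULTILINE);

    /** Returns section title -> section text, in document order. */
    static Map<String, String> split(String wikitext) {
        Map<String, String> sections = new LinkedHashMap<>();
        Matcher m = HEADING.matcher(wikitext);
        String title = "lead"; // assumed label for text before the first heading
        int start = 0;
        while (m.find()) {
            put(sections, title, wikitext.substring(start, m.start()));
            title = m.group(2);
            start = m.end();
        }
        put(sections, title, wikitext.substring(start));
        return sections;
    }

    // Stores a section only if it has non-blank text.
    private static void put(Map<String, String> sections, String title, String body) {
        String text = body.trim();
        if (!text.isEmpty()) {
            sections.put(title, text);
        }
    }

    public static void main(String[] args) {
        String page = "The lion is a big cat.\n"
            + "== Description ==\nLarge and tawny.\n"
            + "=== Mane ===\nMales have manes.\n";
        // Each entry would correspond to one description record:
        // the key is the description "type", the value is its text.
        split(page).forEach((type, text) -> System.out.println(type + " -> " + text));
    }
}
```

In this sketch, subsections are treated like top-level sections, so `=== Mane ===` yields its own record. Whether the real parser flattens nested sections this way is not stated in this document.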
To run the parser:

    java -jar wikipedia.jar

Downloading and processing the entire English Wikipedia takes a long time. Depending on your network and CPU, expect the program to run for several days.
- http://en.wikipedia.org/wiki/Template:Taxobox
- http://de.wikipedia.org/wiki/Wikipedia:Taxoboxen
- http://de.wikipedia.org/wiki/Wikipedia:Pal%C3%A4oboxen
- http://en.wikipedia.org/wiki/Template:Automatic_taxobox/doc
- http://en.wikipedia.org/wiki/Template:Speciesbox/doc
- http://en.wikipedia.org/wiki/Template:Subspeciesbox/doc
- http://en.wikipedia.org/wiki/Template:Infraspeciesbox/doc
For automatic taxoboxes, the classification is scraped from the Taxonomy templates.
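
As a rough illustration of what scraping a classification from Taxonomy templates can involve, the sketch below reads `|name=value` parameters from template text and follows the `parent` parameter upward. It is an assumption-laden sketch, not the parser's implementation: the class `TaxonomyChain`, its methods, and the in-memory map of template texts are hypothetical, and the sample template text is simplified.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the actual parser.
class TaxonomyChain {
    // Matches "|name=value"; a value ends at the next pipe, closing brace or newline.
    private static final Pattern PARAM =
        Pattern.compile("\\|\\s*(\\w+)\\s*=\\s*([^|}\\n]*)");

    /** Extracts the named parameters of a single template call. */
    static Map<String, String> parseParams(String templateText) {
        Map<String, String> params = new LinkedHashMap<>();
        Matcher m = PARAM.matcher(templateText);
        while (m.find()) {
            params.put(m.group(1), m.group(2).trim());
        }
        return params;
    }

    /**
     * Follows the "parent" parameter from the given taxon upward and returns
     * "rank: taxon" entries, lowest rank first. The templates map (taxon name
     * to template text) stands in for pages read from the dump.
     */
    static List<String> classification(String start, Map<String, String> templates) {
        List<String> chain = new ArrayList<>();
        String taxon = start;
        // The size limit guards against cyclic parent links.
        while (taxon != null && templates.containsKey(taxon) && chain.size() < 100) {
            Map<String, String> p = parseParams(templates.get(taxon));
            chain.add(p.getOrDefault("rank", "unranked") + ": " + taxon);
            taxon = p.get("parent");
        }
        return chain;
    }

    public static void main(String[] args) {
        Map<String, String> templates = new LinkedHashMap<>();
        // Simplified sample texts, for illustration only.
        templates.put("Panthera", "{{Taxonomy\n|rank=genus\n|link=Panthera\n|parent=Pantherinae\n}}");
        templates.put("Pantherinae", "{{Taxonomy\n|rank=subfamilia\n|link=Pantherinae\n|parent=Felidae\n}}");
        classification("Panthera", templates).forEach(System.out::println);
    }
}
```

The walk stops as soon as a parent has no template in the map, so a partial dump yields a partial classification instead of an error.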
- http://simple.wikipedia.org/wiki/Template:Fossil_range/doc
- http://en.wikipedia.org/wiki/Template:Long_fossil_range
- http://en.wikipedia.org/wiki/Template:Geological_range
- http://en.wikipedia.org/wiki/Template:Species_list/doc
- http://en.wikipedia.org/wiki/Template:Taxon_list
- http://en.wikipedia.org/wiki/Template:Plainlist
- http://en.wikipedia.org/wiki/Template:Flatlist
- http://en.wikipedia.org/wiki/Template:Collapsible_list
- http://en.wikipedia.org/wiki/Template:Listen