/docstruct

Primary language: Python
License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause)

DocStruct - A Document Structure Parser

A tool to create Document Structure[1] (DS) trees from XHTML websites.  It was created as a term project for
CSI 5386 at the University of Ottawa in Fall 2009.  More detailed information on the project can be
found in the paper located at http://cloud.github.com/downloads/cfournie/docstruct/paper.pdf
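
As a rough illustration of the general idea of turning XHTML markup into a tree of structural nodes, here is a
minimal sketch using only the Python standard library.  It is not the parser found in this repository: the names
build_tree, parse_xhtml, LABELS and SAMPLE are hypothetical, and the tag-to-label mapping is an assumption made
for the example, not the set of categories defined in [1] or in the DS XML Schema.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical mapping from XHTML tags to coarse structural labels;
# illustrative only, not the categories used by the DS schema.
LABELS = {"body": "document", "div": "section", "p": "paragraph",
          "ul": "list", "ol": "list", "li": "list-item"}


def build_tree(element):
    """Return a list of nested dict nodes for `element` and its descendants."""
    tag = element.tag.replace(XHTML_NS, "")
    children = []
    for child in element:
        children.extend(build_tree(child))
    if tag not in LABELS:
        # Unlabelled elements are transparent: pass their children upwards.
        return children
    node = {"label": LABELS[tag], "children": children}
    if not children:
        # Leaf node: keep the flattened text, including inline markup text.
        node["text"] = "".join(element.itertext()).strip()
    return [node]


def parse_xhtml(source):
    """Parse a well-formed XHTML string and return its structural nodes."""
    return build_tree(ET.fromstring(source))


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Demo</title></head>
<body><div><p>First <em>paragraph</em>.</p>
<ul><li>One</li><li>Two</li></ul></div></body>
</html>"""

if __name__ == "__main__":
    import pprint
    pprint.pprint(parse_xhtml(SAMPLE))
```

The sketch relies on the input being well-formed XML, which is why XHTML (rather than arbitrary HTML) is a
convenient source format for this kind of tree building.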


Directories
  \module\  - Contains the Python parser tool
  \spec\    - Contains example DS trees and the DS XML Schema
  

References

[1] R. Power, D. Scott, and N. Bouayad-Agha, "Document structure," Comput. Linguist., vol. 29, no. 2,
pp. 211-260, 2003. Accessible at http://www.mitpressjournals.org/doi/abs/10.1162/089120103322145315