- tipitaka from the offline site worldtipitaka.org (http://archive.is/Ugwv)
- categorized raw html pages
- all in one file, marshalled with ruby (2.0.0-preview2) and compressed with zlib
- [WIP] scripts for maximum data extraction
- to remove all garbage
- to get pure xml or even db
- [#] just categorized html files
- [~] remove garbage?
- [~] extract schema?
- [ ] be sure that no data is lost? (generate same pages from db?)
- [ ] output data file with ruby interface to it
git clone git://github.com/sowcow/canon.git; cd canon
- read/run canon.rb
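
The data file is described above only as "marshalled by ruby with use of zlib", so the following is a minimal sketch of that general technique and not necessarily how canon.rb does it. The hash layout (category => page name => raw html) and the helper names `pack_pages` and `unpack_pages` are assumptions made for illustration.

```ruby
require 'zlib'

# Hypothetical stand-in for the categorized pages:
# category => { page name => raw html }. The real layout may differ.
pages = {
  'category_a' => { 'page1.html' => '<html><body>sample one</body></html>' },
  'category_b' => { 'page2.html' => '<html><body>sample two</body></html>' }
}

# Serialize the Ruby object with Marshal, then compress the bytes with zlib.
def pack_pages(pages)
  Zlib::Deflate.deflate(Marshal.dump(pages))
end

# Reverse the steps: decompress first, then rebuild the Ruby object.
def unpack_pages(blob)
  Marshal.load(Zlib::Inflate.inflate(blob))
end

blob = pack_pages(pages)
restored = unpack_pages(blob)

# A round trip that returns an equal object is a cheap check that no data was lost.
puts(restored == pages ? 'round trip ok' : 'round trip FAILED')
```

When such a blob is stored on disk it should be written and read in binary mode (for example with `File.binwrite` and `File.binread`), and `Marshal.load` should only be used on files you trust.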