sowcow/canon

Processing the Pali Canon from worldtipitaka.org

Ruby

What

Tipitaka from an offline copy of worldtipitaka.org (http://archive.is/Ugwv)
raw HTML pages, sorted into categories
everything in one file, marshalled by Ruby (2.0.0-preview2) and compressed with zlib
[WIP] scripts for maximum data extraction
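
A minimal sketch of how a single marshalled, zlib-compressed file can be written and read back. The hash layout, the helper names `pack`, `unpack`, and `load_archive`, and the choice of raw deflate (rather than gzip) are assumptions for illustration; see canon.rb for what the project actually does.

```ruby
require 'zlib'

# Serialize any Ruby object into a zlib-compressed Marshal dump.
def pack(object)
  Zlib::Deflate.deflate(Marshal.dump(object))
end

# Reverse of pack: inflate the bytes, then rebuild the object.
# Marshal.load should only be used on data from a trusted source.
def unpack(bytes)
  Marshal.load(Zlib::Inflate.inflate(bytes))
end

# Read a packed archive from disk (the path is supplied by the caller).
def load_archive(path)
  unpack(File.binread(path))
end

# Hypothetical layout: category name => { page name => raw HTML }.
pages = { 'vinaya' => { '1.html' => '<html>...</html>' } }

restored = unpack(pack(pages))
```

Keeping everything in one packed object means a script can load the whole collection with a single call and then walk it as ordinary Ruby hashes and strings.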

Why

to remove all garbage from the pages
to get pure XML, or even a database

Progress

- [#] just categorized html files
- [~] remove garbage?
- [~] extract schema?
- [ ] be sure that no data is lost? (generate same pages from db?)
- [ ] output a data file with a Ruby interface to it
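
The "no data is lost" item could be checked roughly like this: regenerate every page from the extracted data and compare it with the original HTML. This is only a sketch of the idea, not something the repository implements; `mismatched_pages` and the block that regenerates a page are hypothetical.

```ruby
require 'digest'

# Returns the names of pages whose regenerated HTML differs from the original.
# `originals` maps a page name to its raw HTML; the block takes a page name
# and returns the HTML rebuilt from the extracted data (hypothetical step).
def mismatched_pages(originals)
  originals.reject do |name, html|
    Digest::SHA256.hexdigest(yield(name)) == Digest::SHA256.hexdigest(html)
  end.keys
end

# Toy data standing in for the real pages and the real database.
originals = { 'a.html' => '<p>x</p>', 'b.html' => '<p>y</p>' }
rebuilt   = originals.merge('b.html' => '<p>z</p>')

broken = mismatched_pages(originals) { |name| rebuilt[name] }
```

An empty result would mean every page can be reproduced from the extracted data, which is the strict form of the check; a looser comparison that ignores whitespace may be more practical.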

Usage

git clone git://github.com/sowcow/canon.git; cd canon
then read or run canon.rb