/canon

Processing the Pali Canon from worldtipitaka.org

Primary language: Ruby

What

  • the Tipitaka from the now-offline worldtipitaka.org (http://archive.is/Ugwv)
  • raw HTML pages, sorted into categories
  • everything in a single file, serialized with Ruby's Marshal (2.0.0-preview2) and compressed with zlib
  • [WIP] scripts for maximum data extraction
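
As a rough illustration of the "single file, Marshal plus zlib" idea above, here is a minimal sketch. The helper names `dump_pages` and `load_pages`, the Hash-of-pages layout, and the use of raw `Zlib::Deflate` (rather than gzip) are assumptions for the example, not necessarily what canon.rb does.

```ruby
require 'zlib'

# Hypothetical helper: serialize a Ruby object (assumed here to be a
# Hash of path => HTML string) and write it compressed to one file.
def dump_pages(pages, path)
  File.binwrite(path, Zlib::Deflate.deflate(Marshal.dump(pages)))
end

# Hypothetical helper: read the file back and rebuild the object.
# Marshal.load should only be used on data you trust.
def load_pages(path)
  Marshal.load(Zlib::Inflate.inflate(File.binread(path)))
end
```

With these, `dump_pages(pages, 'pages.bin')` followed by `load_pages('pages.bin')` should return an equal Hash (`pages.bin` is a made-up file name).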

Why

  • to strip out everything that is not canon text
  • to get clean XML, or even a database

Progress

- [#] categorized HTML files only
- [~] remove garbage?
- [~] extract schema?
- [ ] verify that no data is lost? (regenerate the same pages from the db?)
- [ ] output a data file with a Ruby interface to it
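
The "no data is lost" item could be checked along these lines. This is only a sketch under assumptions: `originals` and `regenerated` are hypothetical Hashes of path => HTML string, and pages are compared after collapsing whitespace, which may be too loose or too strict for the real data.

```ruby
# Collapse runs of whitespace so that formatting-only differences are ignored.
def normalize(html)
  html.gsub(/\s+/, ' ').strip
end

# Return the paths whose regenerated page is missing or differs
# from the original after normalization.
def lossy_pages(originals, regenerated)
  originals.keys.reject do |path|
    regenerated.key?(path) &&
      normalize(originals[path]) == normalize(regenerated[path])
  end
end
```

An empty result would suggest the round trip preserved every page, at least up to whitespace.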

Usage

  • git clone git://github.com/sowcow/canon.git; cd canon
  • read canon.rb, or run it with ruby canon.rb