Digital Pāḷi Database

Building the DB

  1. Download this repo
  2. Fetch the tipitaka-xml submodule with git submodule init && git submodule update
  3. Install nodejs
  4. Install poetry
  5. poetry install
  6. poetry run bash bash/initial_setup_run_once.sh
  7. poetry run bash bash/build_db.sh
  8. To be able to run database tests you may need to install some of these packages.

That should create an SQLite database, ./dpd.db, which can be accessed with DB Browser, DBeaver, SQLAlchemy, or your preferred method.

For a quick tutorial on how to access any information in the db with SQLAlchemy, see scripts/db_search_example.py.
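As a quick check that the build worked, the database can also be inspected with Python's built-in sqlite3 module. The following is a minimal sketch and not part of the repo: the helper name list_tables is hypothetical, and it assumes dpd.db sits in the current directory.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in the SQLite file at db_path."""
    # Open read-only so an inspection can never modify the database;
    # mode=ro also raises an error instead of creating a missing file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Assumption: the build left dpd.db in the current directory.
    db = Path("dpd.db")
    if db.exists():
        for table in list_tables(str(db)):
            print(table)
    else:
        print("dpd.db not found; run the build first")
```

Opening the file read-only is a deliberate choice here: a typo in the path fails loudly instead of silently creating an empty database.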

Build a complete database locally and extract all dictionaries

⚠️ WARNING: When sandhi/sandhi_splitter.py runs with the config option deconstructor.all_texts = yes, it will take several hours to complete.
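Before starting a long build it can help to check that option programmatically. The sketch below assumes the setting lives in an INI-style file with a [deconstructor] section and an all_texts key, as the dotted name suggests; the sample contents and the helper name all_texts_enabled are hypothetical, so adapt them to the actual config file in your checkout.

```python
import configparser

# Hypothetical config contents; the real file name and layout are assumptions.
SAMPLE_CONFIG = """
[deconstructor]
all_texts = yes
"""


def all_texts_enabled(config_text: str) -> bool:
    """Return True if deconstructor.all_texts is switched on in config_text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # getboolean understands yes/no, true/false, on/off and 1/0;
    # a missing section or key is treated as "off".
    return parser.getboolean("deconstructor", "all_texts", fallback=False)


if __name__ == "__main__":
    if all_texts_enabled(SAMPLE_CONFIG):
        print("deconstructor.all_texts is on: expect a run of several hours")
    else:
        print("deconstructor.all_texts is off")
```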

Starting with a fresh clone of the tip:

git clone --depth=1 https://github.com/digitalpalidictionary/dpd-db.git
cd dpd-db
git submodule init && git submodule update
poetry install
poetry run bash bash/build_and_make_all.sh

This creates the dpd.db SQLite database and also extracts all dictionaries; see the exporter/share folder.

Code Structure

There are four parts to the code:

  1. Create the database and build up the tables of derived data.
  2. Add new words, edit and update the db with a GUI.
  3. Run data integrity tests on the db.
  4. Compile all the parts and export into various dictionary formats.

About the database

  • DpdHeadwords and DpdRoots tables are the heart of the db; everything else is derived from them.
  • They are linked by the relationship DpdHeadwords.rt, which gives access to any root information, for example DpdHeadwords.rt.root_meaning.
  • There are also lots of @properties in db/models.py to access useful derived information.
  • DpdHeadwords table also contains lists of inflections of every word in multiple scripts, as well as html inflection tables.
  • FamilyCompound table contains the html of all the compound words which contain a specific word.
  • FamilyRoot table contains the html of all the words with the same prefix and root.
  • FamilySet table contains the html of all the words which belong to the same set, e.g. names of monks.
  • FamilyWord table contains the html of all the words which are derived from a common word without a root.
  • InflectionTemplates table contains the templates from which all the inflection tables are derived.
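The access pattern behind the rt relationship and the @properties can be sketched without a database. The example below deliberately uses plain dataclasses rather than the SQLAlchemy models in db/models.py, so it only mimics how the attributes are reached. Apart from rt and root_meaning, every name (Root, Headword, lemma, root, root_summary) is hypothetical, and the sample values are for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Root:
    root: str           # hypothetical column name
    root_meaning: str   # attribute named in the list above


@dataclass
class Headword:
    lemma: str                  # hypothetical column name
    rt: Optional[Root] = None   # stands in for the DpdHeadwords.rt relationship

    @property
    def root_summary(self) -> str:
        """Hypothetical derived value, computed on access rather than stored."""
        if self.rt is None:
            return f"{self.lemma} (no root)"
        return f"{self.lemma} < {self.rt.root} ({self.rt.root_meaning})"


if __name__ == "__main__":
    # Illustrative sample values, not read from dpd.db.
    gam = Root(root="√gam", root_meaning="go")
    word = Headword(lemma="gacchati", rt=gam)
    print(word.rt.root_meaning)   # root information reached through rt
    print(word.root_summary)      # derived information via a property
```

In the real models the rt attribute would be populated by the ORM from the roots table, but the calling code reads the same way: follow rt for root data, and use properties for anything derived.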