
data, metadata, tools, and LDA experiments on a corpus of Sanskrit philosophy texts


pramana-nlp

A corpus of Sanskrit pramāṇa texts ready for use in NLP applications, along with data and results for an experiment in LDA topic modeling. See also the corresponding paper presented at the 6th ISCLS, for which the citable repo snapshot is at: DOI

Update (Oct 2021): The repo has been updated to reflect the latest text input and modeling output, and the code now runs on Python 3. See the above Zenodo link for the archived version corresponding to the 2019 paper.

More importantly, you can now see this data at work in the intertextuality search interface vatayana.info.

Repo Overview:

  1. text_original: downloaded source files (.htm from GRETIL, .xml from SARIT, .doc etc. from private collections)
  2. data_prep: metadata, xls transforms, validation scripts, cleaned texts, segmentation scripts
  3. text_doc_and_word_segmented: topic modeling-ready input data, spreadsheet overview thereof
  4. lda_topic_modeling: topic modeling inputs and outputs, analysis scripts, results

Overview of which data could be shared freely here:

| Data Source | 1_text_original | 2.1_text_metadata | 2.4_text_cleaned | 3_text_doc_and_word_segmented |
| --- | --- | --- | --- | --- |
| GRETIL | y | y | y | y |
| SARIT | y | y | y | y |
| private collections | (some) | y | (some) | y |

Tools Used:

Micro-Tools Created:

  • transform.py - daisy-chains XSL transforms, visualizes progress
  • validate_text.py - checks textual structure (use of brackets) and character content for troublesome patterns, warns about issues
  • explore_topic_top_words.py - adjusts topic modeling phi values for lambda relevance L, filters out unwanted words, sets limits on how many words to consider and on how many words to show for each topic
  • explore_topic_domination_by_text.py - shows which topics are dominated by a small number of individual texts, as determined from identifiers
  • format_doc_similarity_table.py - formats document similarity results as a table with one column per text, as determined from identifiers; optionally prioritizes a set of preferred texts
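
To illustrate the kind of check that validate_text.py is described as performing, here is a minimal sketch of bracket and character validation. It is not the repo's script: the function names, the warning format, and the `ALLOWED` character set (lowercase IAST plus basic punctuation) are all assumptions made for this example.

```python
import unicodedata

# Closing bracket -> matching opening bracket (assumed set of bracket types).
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())

# Hypothetical whitelist: lowercase IAST letters, digits, basic punctuation.
ALLOWED = set(unicodedata.normalize(
    "NFC",
    "abcdefghijklmnopqrstuvwxyz"
    "āīūṛṝḷḹṃḥṅñṭḍṇśṣ"
    "0123456789 \t.,;:|-'()[]{}",
))


def check_brackets(text):
    """Return warnings for closing brackets without a match and for unclosed ones."""
    warnings = []
    stack = []  # (opening character, line number) for brackets still open
    for line_no, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch in OPENERS:
                stack.append((ch, line_no))
            elif ch in PAIRS:
                if stack and stack[-1][0] == PAIRS[ch]:
                    stack.pop()
                else:
                    warnings.append(f"line {line_no}: unexpected '{ch}'")
    for ch, line_no in stack:
        warnings.append(f"line {line_no}: unclosed '{ch}'")
    return warnings


def check_characters(text):
    """Return one warning per line that contains characters outside ALLOWED."""
    warnings = []
    normalized = unicodedata.normalize("NFC", text)
    for line_no, line in enumerate(normalized.splitlines(), start=1):
        bad = sorted({ch for ch in line if ch not in ALLOWED})
        if bad:
            warnings.append(f"line {line_no}: unexpected characters {bad}")
    return warnings
```

Both functions only report problems and leave the text unchanged, matching the "warns about issues" behavior described above. Normalizing to NFC first means that precomposed and decomposed diacritics are treated alike.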
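
The lambda-based reranking mentioned for explore_topic_top_words.py can be sketched as follows. This assumes the relevance measure is the one popularized by LDAvis, relevance = λ·log(φ) + (1 − λ)·log(φ / p(w)); the script in the repo may compute it differently. The function name, parameter names, and defaults below are hypothetical.

```python
import math


def top_words_by_relevance(phi, p_w, lam=0.6, unwanted=(), n_consider=50, n_show=10):
    """Rank one topic's words by relevance.

    phi: dict mapping word -> P(word | topic)
    p_w: dict mapping word -> marginal P(word) over the whole corpus
    lam: weight between raw probability (1.0) and lift (0.0)
    """
    # Only the n_consider most probable words in the topic are candidates.
    candidates = sorted(phi, key=phi.get, reverse=True)[:n_consider]
    scored = []
    for word in candidates:
        # Skip filtered words and anything whose logarithm is undefined.
        if word in unwanted or phi[word] <= 0 or p_w.get(word, 0) <= 0:
            continue
        relevance = (lam * math.log(phi[word])
                     + (1 - lam) * math.log(phi[word] / p_w[word]))
        scored.append((word, relevance))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:n_show]]
```

With `lam=1.0` the ranking is by plain topic probability, so frequent function words tend to rise to the top; lowering `lam` favors words that are distinctive for the topic relative to the corpus as a whole.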

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.