pramana-nlp

A corpus of Sanskrit pramāṇa texts ready for use in NLP applications, along with data and results for an experiment in LDA topic modeling. See also the corresponding paper presented at the 6th ISCLS, for which the citable repo snapshot is at:

Update (Oct 2021): The repo has been updated to reflect the latest text input and modeling output, and the code now runs on Python 3. See the above Zenodo link for the archived version corresponding to the 2019 paper.

More importantly, now you can see this data hard at work in the intertextuality search interface vatayana.info.

Repo Overview:

text_original: downloaded source files (.htm from GRETIL, .xml from SARIT, .doc etc. from private collections)
data_prep: metadata, XSL transforms, validation scripts, cleaned texts, segmentation scripts
text_doc_and_word_segmented: topic modeling-ready input data, spreadsheet overview thereof
lda_topic_modeling: topic modeling inputs and outputs, analysis scripts, results

Overview of which data could be shared freely here:

Data Source	1_text_original	2.1_text_metadata	2.4_text_cleaned	3_text_doc_and_word_segmented
GRETIL	y	y	y	y
SARIT	y	y	y	y
private collections	(some)	y	(some)	y

Tools Used:

Python 3.8
XSL Transforms: lxml library
Word Segmentation: Sanskrit Sandhi and Compound Splitter, based on DCS
Transliteration: skrutable
Topic Modeling: ToPān, based on R packages lda and LDAvis
Topic Model Exploration: fork of Metallō
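
As a rough illustration of the XSL transform step with lxml, the sketch below chains two stylesheets so that the output of one becomes the input of the next. The stylesheets, the sample document, and the function name `chain_transforms` are all hypothetical and are not taken from this repo; only the lxml calls (`etree.XML`, `etree.XSLT`) are real.

```python
from lxml import etree

# Hypothetical stylesheet 1: copy everything except <note> elements.
STRIP_NOTES_XSL = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="note"/>
</xsl:stylesheet>"""

# Hypothetical stylesheet 2: emit each <p> as one line of plain text.
TO_TEXT_XSL = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="p">
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>"""

# Made-up sample input, kept ASCII for simplicity.
SAMPLE = "<text><p>pramana <note>editorial note</note>laksana</p><p>second</p></text>"


def chain_transforms(doc, stylesheets):
    """Apply each stylesheet in order, feeding each result into the next."""
    for step, xsl in enumerate(stylesheets, start=1):
        transform = etree.XSLT(etree.XML(xsl))
        doc = transform(doc)
        print(f"step {step}/{len(stylesheets)} done")  # simple progress report
    return doc


result = chain_transforms(etree.XML(SAMPLE), [STRIP_NOTES_XSL, TO_TEXT_XSL])
lines = str(result).splitlines()
print(lines)
```

Only the last stylesheet in such a chain should produce text output, since every intermediate result must still be a well-formed XML tree for the next transform to read.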

Micro-Tools Created:

transform.py - daisy-chains XSL transforms, visualizes progress
validate_text.py - checks textual structure (use of brackets) and character content for troublesome patterns, warns about issues
explore_topic_top_words.py - re-weights topic modeling phi values by the relevance parameter lambda, filters out unwanted words, sets limits on how many words to consider and on how many words to show for each topic
explore_topic_domination_by_text.py - shows which topics are dominated by a small number of individual texts as determined from identifiers
format_doc_similarity_table.py - formats document similarity results as a table with one column per text as determined from identifiers, optionally prioritizes a set of preferred texts
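
To make the relevance adjustment mentioned for explore_topic_top_words.py more concrete, here is a minimal sketch. It assumes the relevance definition used by LDAvis, relevance(w | k) = lambda * log(phi_kw) + (1 - lambda) * log(phi_kw / p_w), where p_w is the overall probability of word w in the corpus. The function name, its parameters, and the toy numbers are hypothetical and do not reproduce the actual script.

```python
import math


def top_relevant_words(phi_k, p_w, lam=0.6, unwanted=(), consider=1000, show=10):
    """Rank one topic's words by LDAvis-style relevance.

    phi_k:    dict word -> probability of the word within the topic
    p_w:      dict word -> overall probability of the word in the corpus
    lam:      relevance weight between 0 and 1 (1.0 ranks by phi alone)
    unwanted: words to filter out before ranking
    consider: how many of the topic's most probable words to look at
    show:     how many words to return
    """
    # Start from the most probable words in the topic.
    candidates = sorted(phi_k, key=phi_k.get, reverse=True)[:consider]
    scored = []
    for word in candidates:
        if word in unwanted:
            continue
        relevance = (lam * math.log(phi_k[word])
                     + (1 - lam) * math.log(phi_k[word] / p_w[word]))
        scored.append((word, relevance))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:show]]


# Toy values, invented for illustration only.
phi_k = {"ca": 0.30, "eva": 0.25, "pramana": 0.20, "hetu": 0.15, "vyapti": 0.10}
p_w = {"ca": 0.30, "eva": 0.25, "pramana": 0.05, "hetu": 0.03, "vyapti": 0.01}

by_relevance = top_relevant_words(phi_k, p_w, lam=0.6, unwanted={"ca"}, show=3)
by_phi_only = top_relevant_words(phi_k, p_w, lam=1.0, show=2)
print(by_relevance)
print(by_phi_only)
```

With a lower lambda, words that are frequent everywhere (such as particles) drop in the ranking, while words that are distinctive for the topic rise.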

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
