A corpus of Sanskrit pramāṇa texts ready for use in NLP applications, along with data and results for an experiment in LDA topic modeling. See also the corresponding paper presented at the 6th ISCLS, for which the citable repo snapshot is at:
Update (Oct 2021): The repo has been updated to reflect the latest text input and modeling output, and the code now runs on Python 3. See the above Zenodo link for the archived version corresponding to the 2019 paper.
More importantly, now you can see this data hard at work in the intertextuality search interface vatayana.info.
Repo Overview:
- text_original: downloaded source files (.htm from GRETIL, .xml from SARIT, .doc etc. from private collections)
- data_prep: metadata, XSL transforms, validation scripts, cleaned texts, segmentation scripts
- text_doc_and_word_segmented: topic modeling-ready input data, spreadsheet overview thereof
- lda_topic_modeling: topic modeling inputs and outputs, analysis scripts, results
Overview of which data could be shared freely here:
Data Source | 1_text_original | 2.1_text_metadata | 2.4_text_cleaned | 3_text_doc_and_word_segmented |
---|---|---|---|---|
GRETIL | y | y | y | y |
SARIT | y | y | y | y |
private collections | (some) | y | (some) | y |
Tools Used:
- Python 3.8
- XSL Transforms: lxml library
- Word Segmentation: Sanskrit Sandhi and Compound Splitter, based on DCS
- Transliteration: skrutable
- Topic Modeling: ToPān, based on R packages lda and LDAvis
- Topic Model Exploration: fork of Metallō
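As a rough illustration of how XSL transforms can be chained with lxml, the sketch below applies two stylesheets in sequence and reports progress after each step. The stylesheets, element names (note, p, para), and the chain_transforms helper are hypothetical and chosen only for this example; they are not taken from the repo's data_prep scripts.

```python
from lxml import etree

# Hypothetical stylesheet 1: identity copy that drops <note> elements.
STRIP_NOTES_XSL = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="note"/>
</xsl:stylesheet>"""

# Hypothetical stylesheet 2: identity copy that renames <p> to <para>.
RENAME_P_XSL = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="p">
    <para><xsl:apply-templates select="@*|node()"/></para>
  </xsl:template>
</xsl:stylesheet>"""


def chain_transforms(doc, stylesheets):
    """Apply each stylesheet to the output of the previous one."""
    for step, xsl_source in enumerate(stylesheets, start=1):
        transform = etree.XSLT(etree.XML(xsl_source))
        doc = transform(doc)  # the result tree feeds the next transform
        print(f"transform {step}/{len(stylesheets)} done")
    return doc


doc = etree.XML("<text><p>pramana<note>editorial</note></p></text>")
result = chain_transforms(doc, [STRIP_NOTES_XSL, RENAME_P_XSL])
print(etree.tostring(result.getroot()).decode())
```

Keeping each stylesheet small and single-purpose makes it easier to see which step introduced a problem when the output looks wrong.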
Micro-Tools Created:
- transform.py - daisy-chains XSL transforms and visualizes progress
- validate_text.py - checks textual structure (use of brackets) and character content for troublesome patterns, warns about issues
- explore_topic_top_words.py - adjusts topic modeling phi values by relevance (parameter lambda), filters out unwanted words, and limits how many words are considered and how many are shown for each topic
- explore_topic_domination_by_text.py - shows which topics are dominated by a small number of individual texts, as determined from identifiers
- format_doc_similarity_table.py - formats document similarity results as a table with one column per text, as determined from identifiers; optionally prioritizes a set of preferred texts
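The relevance adjustment of top words mentioned above can be sketched as follows. This assumes the relevance measure used by LDAvis (Sievert and Shirley), lambda * log(phi) + (1 - lambda) * log(phi / p), where p is a word's overall corpus probability. The function names, parameters, and toy numbers are hypothetical and are not taken from explore_topic_top_words.py.

```python
import math


def relevance(phi_tw, p_w, lam=0.6):
    """LDAvis-style relevance of a word for a topic (assumed formula)."""
    return lam * math.log(phi_tw) + (1 - lam) * math.log(phi_tw / p_w)


def top_words(phi_t, p, lam=0.6, unwanted=(), consider=100, show=10):
    """Rank one topic's words by relevance.

    phi_t    -- dict: word -> probability of the word within the topic
    p        -- dict: word -> overall probability of the word in the corpus
    unwanted -- words to filter out before ranking
    consider -- how many of the most probable words to look at
    show     -- how many ranked words to return
    """
    candidates = sorted(phi_t, key=phi_t.get, reverse=True)[:consider]
    kept = [w for w in candidates if w not in unwanted and phi_t[w] > 0]
    ranked = sorted(
        kept, key=lambda w: relevance(phi_t[w], p[w], lam), reverse=True
    )
    return ranked[:show]


# Toy values for illustration only (not real model output).
phi_t = {"ca": 0.30, "pramāṇa": 0.20, "vyāpti": 0.10}
p = {"ca": 0.30, "pramāṇa": 0.05, "vyāpti": 0.01}

print(top_words(phi_t, p, lam=1.0))  # ranks by raw topic probability
print(top_words(phi_t, p, lam=0.0))  # ranks by lift (phi / p)
print(top_words(phi_t, p, lam=1.0, unwanted={"ca"}, show=1))
```

Lower lambda values favor words that are distinctive for a topic over words that are merely frequent everywhere, which is why a common particle like "ca" drops in rank as lambda decreases.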
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.