Official code for the paper "SanskritShala: A Neural Sanskrit NLP Toolkit with Web-Based Interface for Pedagogical and Annotation Purposes". If you use this code, please cite our paper.
You can interact with SanskritShala's web-based platform: Link. We encourage you to watch our demo video to get familiar with the platform.
You can find more details of the codebases for the word segmentation, morphological tagging, dependency parsing, and compound type identification tasks in the Neural Modules folder.
First, you need to install the individual modules on your machine as instructed in the above section. You do not need a GPU to make these pretrained systems work on your local machine. You can find more details on how to deploy the toolkit on your local machine in the SanShala-Web folder.
SanEval is a toolkit for evaluating the quality of Sanskrit word embeddings. We assess their generalization power by using them as features on a broad and diverse set of tasks. We include a suite of 4 intrinsic tasks that evaluate which linguistic properties are encoded in word embeddings. Our goal is to ease the study and development of general-purpose, fixed-size word representations for Sanskrit. You can find more details of the codebase in the EvalSan folder.
- SanEval includes a series of intrinsic tasks to evaluate which linguistic properties are encoded in your word embeddings.
- We use the SLP1 transliteration scheme for our data. You can change it to another scheme using this code (see the sketch after the table below).
| Task | Metric | #dev | #test |
|---|---|---|---|
| Relatedness | F-score | 4.5k | 9k |
| Similarity | Accuracy | n/a | 3k |
| Categorization (Syntactic) | Purity | n/a | 1.1k |
| Categorization (Semantic) | Purity | n/a | 150 |
| Analogy (Syntactic) | Accuracy | n/a | 10k |
| Analogy (Semantic) | Accuracy | n/a | 6.4k |
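
The snippet below is a minimal sketch of such a scheme conversion. It assumes the `indic_transliteration` Python package rather than the repo's own conversion script, and the example sentence is a hypothetical placeholder.

```python
# Minimal sketch (not the repo's own script): converting SLP1-encoded text
# to another transliteration scheme with the `indic_transliteration` package
# (install with `pip install indic_transliteration`).
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

slp1_text = "rAmaH vanaM gacCati"  # hypothetical SLP1-encoded example sentence

# Convert SLP1 -> IAST (other targets: sanscript.DEVANAGARI, sanscript.HK, ...).
iast_text = transliterate(slp1_text, sanscript.SLP1, sanscript.IAST)
print(iast_text)  # expected: rāmaḥ vanaṃ gacchati
```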
- You can download the pretrained models from this link. A `README.md` is provided for each model.
- Place the `models` folder in the parent directory path.
- Pretrained vectors can be downloaded from this link. Place this folder at the `EvalSan/evaluations/Intrinsic/` path. These vectors are used by the evaluation script (a minimal loading sketch follows this list).
- Our proposed LCM pretraining is available in the `EvalSan/LCM` folder. For more details, please visit this link.
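
If the pretrained vectors follow the standard word2vec text format (an assumption; check the model's `README.md`), a quick sanity check like the one below can load them and score a word pair by cosine similarity before running the full evaluation scripts. The file path and the SLP1-encoded word pair are hypothetical placeholders.

```python
# Minimal sanity-check sketch, assuming the pretrained vectors are stored in
# the standard word2vec text format (verify against the model's README.md).
from gensim.models import KeyedVectors

# Hypothetical file name inside the Intrinsic evaluation folder.
vectors = KeyedVectors.load_word2vec_format(
    "EvalSan/evaluations/Intrinsic/vectors/word2vec.txt",
    binary=False,
)

# Cosine similarity between two SLP1-encoded words, in the spirit of the
# relatedness/similarity intrinsic tasks listed in the table above.
print(vectors.similarity("rAma", "kfzRa"))
```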
If you use our tool, we'd appreciate it if you cite our paper:
@misc{Sandhan_SanskritShala,
doi = {10.48550/ARXIV.2302.09527},
url = {https://arxiv.org/abs/2302.09527},
author = {Sandhan, Jivnesh and Agarwal, Anshul and Behera, Laxmidhar and Sandhan, Tushar and Goyal, Pawan},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {SanskritShala: A Neural Sanskrit NLP Toolkit with Web-Based Interface for Pedagogical and Annotation Purposes},
publisher = {arXiv},
year = {2023},
copyright = {Creative Commons Attribution 4.0 International}
}
This project is licensed under the terms of the Apache License 2.0.
We would like to thank everyone who helped us build the different neural models for SanskritShala.