Official code for the paper DepNeCTI: Dependency-based Nested Compound Type Identification for Sanskrit
- Python 3.7
- cuda 11.7
- torch 1.13.0
- torchaudio 0.13.0
- torchvision 0.14.0
- The remaining dependencies can be installed by creating a new conda environment from the matching environment.yml file.
We assume that you have installed conda beforehand.
conda env create -f DepNeCTI-LSTM_environment.yml
conda env create -f DepNeCTI-XLMR_environment.yml
Then activate the environment you created and you are ready to go.
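Once the environment is active, a quick check can confirm that the pinned versions listed above were actually installed. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and it assumes the packages are registered under the distribution names `torch`, `torchaudio`, and `torchvision`.

```python
try:  # Python 3.8+
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

# Versions taken from the requirements list in this README.
EXPECTED = {"torch": "1.13.0", "torchaudio": "0.13.0", "torchvision": "0.14.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(expected=EXPECTED, lookup=installed_version):
    """Return {package: found_version} for every package that does not match."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Local build tags such as "+cu117" are ignored in the comparison.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = found
    return mismatches
```

Calling `find_mismatches()` inside the activated environment should return an empty dictionary when all three packages match the versions above.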
- Datasets are provided in the Datasets folder.
- They come in two variants, with context and without context, each with fine-grained and coarse labels.
- Transfer the .csv files from the respective Datasets folder to Datasets/data_format
- Use the .ipynb file in the Datasets/data_format folder and follow the instructions given there to generate the required data format.
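The file transfer described above can also be scripted. This is a minimal sketch under stated assumptions: `copy_csv_files` is a hypothetical helper, not something shipped with the repository, and the source folder has to be whichever dataset variant you want to convert.

```python
import shutil
from pathlib import Path


def copy_csv_files(source_dir, target_dir="Datasets/data_format"):
    """Copy every .csv file in source_dir into target_dir and return the new paths."""
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for csv_path in sorted(source.glob("*.csv")):
        # copy2 preserves timestamps; the original files stay in place.
        copied.append(Path(shutil.copy2(str(csv_path), str(target))))
    return copied
```

Run it from the repository root, passing the path of the chosen dataset variant as `source_dir`. Only files directly inside that folder are copied; subfolders are not searched.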
- Pretrained FastText embeddings for DepNeCTI can be obtained from here.
- Make sure the cc.NeCTIS.300.txt file is placed in data/, and place the rest of the files in the word_vectors folder.
- The main results are reported for systems trained on the combined train and dev splits.
- First, place all the files in the word_vectors folder as mentioned above.
- Use the .ipynb file in the word_vectors folder to generate your own FastText embeddings.
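Before training, it can help to confirm that the embedding file in data/ is readable and has the expected vector size. The sketch below is an illustration only: `embedding_dimension` is a hypothetical helper, and it assumes a whitespace-separated text format in which each line holds a token followed by its vector, optionally preceded by a `<count> <dim>` header line as some fastText-style text files have.

```python
from pathlib import Path


def embedding_dimension(path):
    """Return the vector length of the first entry in a text embedding file."""
    with Path(path).open(encoding="utf-8") as handle:
        first = handle.readline().split()
        # Skip an optional "<count> <dim>" header line if one is present.
        if len(first) == 2 and all(token.isdigit() for token in first):
            first = handle.readline().split()
    if not first:
        raise ValueError("no vectors found in {}".format(path))
    # Each entry is a token followed by its vector components.
    return len(first) - 1
```

For example, `embedding_dimension("data/cc.NeCTIS.300.txt")` would be expected to return 300 if the number in the file name reflects the vector size.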
- To run the proposed system, run the bash script run_DepNeCTI_LSTM.sh or run_DepNeCTI_XLMR.sh after placing the respective dataset in the data folder, in the same format as the files already there. These scripts reproduce the results reported for the proposed model in Table 2.
- To run the system, use:
bash run_DepNeCTI_LSTM.sh
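If you prefer to launch either variant from Python, for instance inside a larger experiment script, the command above can be wrapped as follows. This is a minimal sketch: `build_command` and `run_variant` are hypothetical helpers, and it assumes the two scripts sit in the repository root and that `bash` is on the PATH.

```python
import subprocess
from pathlib import Path

# Script names as given in this README.
SCRIPTS = {"LSTM": "run_DepNeCTI_LSTM.sh", "XLMR": "run_DepNeCTI_XLMR.sh"}


def build_command(variant):
    """Return the command line for the chosen variant ("LSTM" or "XLMR")."""
    if variant not in SCRIPTS:
        raise ValueError("unknown variant: {}".format(variant))
    return ["bash", SCRIPTS[variant]]


def run_variant(variant, repo_root="."):
    """Run the script from repo_root; raises CalledProcessError if it fails."""
    command = build_command(variant)
    if not (Path(repo_root) / command[1]).is_file():
        raise FileNotFoundError(command[1])
    subprocess.run(command, cwd=str(repo_root), check=True)
```

Calling `run_variant("XLMR")` from the repository root is then equivalent to running `bash run_DepNeCTI_XLMR.sh` by hand.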
- Use the script (eval_f1.py) provided in the Evaluation folder to get the F1 scores.
- Use the script (eval_USS_LSS.py) provided in the Evaluation folder to get the USS and LSS scores.
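The provided scripts are the reference for all reported numbers. For readers unfamiliar with the metric family, the sketch below shows only the generic idea of precision, recall, and F1 over sets of gold and predicted items. It does not reproduce eval_f1.py or eval_USS_LSS.py; the function name and the `(start, end, label)` tuple layout are assumptions made for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 between two collections of hashable items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical items: (start, end, label) tuples with placeholder labels.
gold = [(0, 2, "A"), (0, 3, "B")]
predicted = [(0, 2, "A"), (1, 3, "B")]
print(precision_recall_f1(gold, predicted))  # (0.5, 0.5, 0.5)
```

Dropping the label from each tuple before the comparison gives an unlabeled variant of the same calculation.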
- Download the datasets from this link; they are already in the format required by each baseline.
- Go to the respective folders in the Baselines folder and follow the readme files given there.
- If you have trouble using the datasets from this link, you can generate the required data format yourself using data_format.ipynb in Datasets/data_format
- Note: when using any baseline, the data files keep names such as "genia" and "GENIA", but they contain DepNeCTI data only. The names were left unchanged to avoid problems when running the baseline models.
If you use DepNeCTI in your research, please consider citing our work:
@misc{sandhan2023depnecti,
title={DepNeCTI: Dependency-based Nested Compound Type Identification for Sanskrit},
author={Jivnesh Sandhan and Yaswanth Narsupalli and Sreevatsa Muppirala and Sriram Krishnan and Pavankumar Satuluri and Amba Kulkarni and Pawan Goyal},
year={2023},
eprint={2310.09501},
archivePrefix={arXiv},
primaryClass={cs.CL}
}