Part-of-speech (POS) tagging, within computational linguistics, aims to use computational methods to discern what part of speech (e.g. noun, verb, adjective) each word in a document corresponds to. This process, while nontrivial, has been extensively researched using statistical and rule-based tagging systems within the context of English and other European languages. However, less research exists on highly morphological languages such as Sanskrit, and applying English-based POS-tagging methods to Sanskrit presents unique difficulties.
This research project aims to address a foremost issue with POS tagging for Sanskrit: a small amount of labeled data, combined with the need for a larger tagset than English requires, results in relatively low word- and sentence-level tagging accuracy. Building on tagged data and a tagset developed by researchers R. Chandrashekar and Girish Nath Jha at JNU [1], this project trains a Hidden Markov Model in a semi-supervised fashion, making use of both the JNU tagged corpus and the large quantities of digitised Sanskrit text available from GRETIL.
What follows in this README is a cursory User's Guide of sorts for this project. For detailed documentation on the research (background, model, notation, implementation, methods, results), please see `doc/`. Much of this documentation is still forthcoming.
The `master` branch of this repository represents the most stable iteration of the model. As of the latest update to this README, this is a (relatively) stable implementation of an HMM trained solely from visible data.
The `devel` branch is less stable and contains possibly breaking changes that will be merged into `master` once they are functional.
```
.
├── data: corpora, tagged output files
│   ├── en: English corpora
│   └── sans: Sanskrit corpora
├── debug: impermanent output logs
├── doc: documentation
├── perfstats: performance statistics of the model under various parameters
├── pos: Python modules
│   └── hmm: modules specific to HMM implementation
└── tools: various utility modules and/or scripts
```
The required dependencies to build and use the `pos` module are:
- Python >= 2.6. Tested on 2.7.11. Not currently compatible with Python 3.
- NumPy >= 1.10. Tested on 1.11.0.
- Cython >= 0.24.
The process to install the above requirements varies based on your system. Most should be available through your package manager's repositories.
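If they are not available from your package manager, one option is to install the Python packages with pip. This is a sketch, assuming a Python 2.7 environment where `pip` points at that interpreter; the version bounds mirror the list above:

```
$ pip install "numpy>=1.10" "cython>=0.24"
```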
This project currently expects a Unix-like system. It has been tested to run, with the required dependencies correctly configured, on OS X 10.10, Arch Linux, Debian, and Fedora, and should run on other Linux distributions without much hassle (perhaps a tweak to the makefile at most). If for some reason the `hmm` module cannot be built, or the tagger and related tools do not run, consider that a bug. No support is planned for Windows, although there may be some success running within the new WSL (not tested).
As distributed by this repository, the `hmm.HiddenDataHMM` module must first be built before use. From within the `pos/hmm/` directory, run `make` (`make -f makefile-osx` on OS X).
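Concretely, from the repository root:

```
$ cd pos/hmm/
$ make                  # on OS X: make -f makefile-osx
```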
It might be necessary, in order for certain bash scripts to work out-of-the-box, to allow execute permissions on `tagger.py`, `score.py`, and `tools/corpus.py`, as shown below. Naturally, the bash scripts themselves should be executable as well.
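For example:

```
$ chmod +x tagger.py score.py tools/corpus.py
$ chmod +x crossvalidate eval   # the bash scripts described later in this guide
```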
Labeled and unlabeled data in English and Sanskrit are found under `data/en/` and `data/sans/`, respectively. A large (~39k line) corpus of tagged English sentences exists in the form of `wsj2-21.txt` and `wsj22.txt` from the Penn Treebank; its unlabeled equivalent is `wsj2-21-notags.txt`. For training an HMM from labeled English data, `wsj2-21.txt` can be used for training and `wsj22.txt` for testing. A (substantially) smaller corpus of 174 labeled English sentences exists as `TaggedCorpus.txt` (under `data/en/`, along with all other English corpora). This corpus is kept small to provide a point of comparison with the tagger's performance on Sanskrit.
Sanskrit labeled data exists as `TaggedCorpus.txt` (under `data/sans/`). This is a corpus of 174 lines of labeled Sanskrit from Dr. Chandrashekar's JNU corpus (with modifications to the tags; see `doc/tagset_guidelines.md`). The file `GRETILNoTagsTrain.txt` is a corpus of ~3.1k lines of unlabeled padapāṭha (sandhi-separated) Sanskrit sentences, concatenated from the following texts:
- Śivamahimnastava (`sivamahinastava.txt`), text sourced from GRETIL
- Stavacintamāṇi of Bhaṭṭa Nārāyaṇa (`bhatta_narayana.txt`), text sourced from GRETIL
- Kirātārjunīya (`kiratarjuniya.txt`), text sourced from GRETIL
- Rāmacaritamahākavya (`ramacaritamahakavya.txt`), text sourced from GRETIL
The `corpus.py` utility (under `tools/`) is used to split a labeled corpus into training and testing files. The user can select a single line for use as a test corpus, with the remainder for training (useful for cross-validation); specify a ratio of test to train sizes; or specify a number of lines to use as a test corpus, with the remainder for training. If a ratio or line count is specified, the test lines are selected at random.
`tools/corpus.py --help` should provide a good sense of how this utility works. For example, `tools/corpus.py --ratio 0.2 data/sans/TaggedCorpus.txt data/sans` will generate `data/sans/train.txt` and `data/sans/test.txt`, with `test.txt` containing approx. 20% of the lines of `TaggedCorpus.txt` and `train.txt` comprising the remaining ~80%.
Use `tagger.py` to train a Hidden Markov Model and decode a document using the Viterbi method. Make sure `tagger.py` is set as executable.
Tagging can be done in three ways: on labeled data, on unlabeled data, or on both. These are specified via the `--model` option. Pass to the tagger a training corpus, which can be a list of files; it should be a labeled corpus for the supervised and semi-supervised models, and unlabeled for the unsupervised model. Pass also a test corpus, which can be a list of files, all of which must be labeled. If using the semi-supervised model, an extra unlabeled corpus must be specified via the `--extra` flag. Specify a location for tagged output with `--output`. Lastly, specify what language the tagger should run on (currently either English or Sanskrit).
For more information on how to run the tagger, including additional flag options, run `./tagger.py --help`.
Tagging in English, on labeled data:
```
$ ./tagger.py --lang EN --model super --train data/en/wsj2-21.txt --test data/en/wsj22.txt --output data/output.txt
```
Tagging in Sanskrit, on unlabeled data, for 184 tag classes and 10 iterations of Baum-Welch:
```
$ ./tagger.py --lang SANS --model unsuper --iter 10 -n 184 --train data/sans/GRETILNoTagsTrain.txt \
    --test data/sans/test.txt --output data/output.txt
```
Tagging in Sanskrit on both labeled and unlabeled data:
```
$ ./tagger.py --lang SANS --model semisuper --iter 5 --train data/sans/train.txt --test data/sans/test.txt \
    --extra data/sans/GRETILNoTagsTrain.txt --output data/output.txt
```
To generate the files `data/sans/train.txt` and `data/sans/test.txt`, see `tools/corpus.py --help` and/or the previous section.
Once the tagger completes, there should be a file (specified by `--output`, e.g. `data/output.txt`) containing its tagged output sentences. Use `score.py` to compare this against the "gold" file, which is usually what was passed as `--test` to the tagger.
Run `./score.py --help` for a detailed overview of the options. Note that, in lieu of a tagset file, an empty file (e.g. `/dev/null`) can be passed if balanced accuracy or a printout of the confusion matrix is not desired. Also pass an empty file if the confusion matrix should cover only the classes present in the gold and tagged files, rather than all valid classes.
When evaluating performance of the model on Sanskrit (or other small labeled corpora, cf. `data/en/TaggedCorpus.txt`), cross-validation is necessary to properly assess how well the model might generalise to unknown data without compromising accuracy by further reducing the available training data. The `crossvalidate` script automates this process.
```
Usage: ./crossvalidate tagger_args [...]

  tagger_args: the set of arguments passed to ./tagger.py, excluding --test, --train, --output and --lang,
               which are specified by this script
```
The script defaults to working on Sanskrit, but this behaviour can be changed by setting `MODEL_LANG=EN` as an environment variable, or by editing it in the script.
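For example, a sketch of cross-validating the semi-supervised Sanskrit model, using the tagger flags documented in the tagging section above:

```
$ ./crossvalidate --model semisuper --iter 5 --extra data/sans/GRETILNoTagsTrain.txt

# Or cross-validate the supervised model on the small English corpus instead:
$ MODEL_LANG=EN ./crossvalidate --model super
```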
In order to assess the performance of the semi-supervised model under increasing amounts of unlabeled data, the `eval` script automates the process of creating unlabeled corpora of increasing size and training the tagger on each.
```
usage: eval STEP_SIZE EM_ITER OUTPUT (EN|SANS)

Evaluate accuracy of semi-supervised HMM as a function of unlabeled corpus size.

Arguments:
  STEP_SIZE: Num. of lines data is increased by on each iteration.
  EM_ITER:   Num. of iterations of EM
  OUTPUT:    Path to output table of corpus size vs. accuracy
  (EN|SANS): Either EN (english) or SANS (sanskrit). Defaults to English.
```
Run `./eval` without any parameters to print the above help.
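For instance, the following sketch (the output filename is illustrative) grows the unlabeled Sanskrit corpus in 500-line steps, running 5 iterations of EM at each size:

```
$ ./eval 500 5 perfstats/semisuper_sans.txt SANS
```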