Error-repair Dependency Parsing for Ungrammatical Texts (ACL 2017)


Instructions

  • N.B. Due to license restrictions, we do not provide the original PTB (Penn Treebank) in this repository.
  1. Prerequisites

    • The code depends on Python 2.7 (compiled with unicode=ucs2).

    • Check whether your Python is compatible with the code:

      $ python --version
      Python 2.7.17
      $ python -c "import sys; print(sys.maxunicode)"
      65535 (If this is 1114111, then your python uses unicode=ucs4)
      
    • If your Python is not compatible, you may want to build Python from source.

      (for example)
      cd $HOME
      mkdir local
      mkdir temp
      cd ./temp
      wget https://www.python.org/ftp/python/2.7.17/Python-2.7.17.tgz
      tar zxvf Python-2.7.17.tgz
      cd Python-2.7.17
      ./configure --prefix=$HOME/local --enable-unicode=ucs2 --enable-loadable-sqlite-extensions
      make && make install
      export PATH=$HOME/local/bin:$PATH
      cd $HOME/temp
      curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
      python get-pip.py
      
    • Once you have a compatible Python, install the prerequisite modules:

      pip install -r requirements.txt
      

      (You also need to install the system packages libmysqlclient-dev and libsqlite3-dev, e.g., sudo apt-get install libmysqlclient-dev libsqlite3-dev.)

    • Download the KenLM language model to ./data:

      cd ./data
      wget http://cs.jhu.edu/~keisuke/shared/gigaword.kenlm
      
    • If you use the parser with pre-trained models, download the model weights and put them under ./easyfirst/models/ so that the directory looks as follows:

      easyfirst/models/
      ├── E05.model
      ├── E05.weights.FINAL
      ├── E10.model
      ├── E10.weights.FINAL
      ├── E15.model
      ├── E15.weights.FINAL
      ├── E20.model
      └── E20.weights.FINAL
      
  2. Put the Penn Treebank under the data directory. If you only use the parser with pre-trained models, go to step 6.

     cd ./data
     ln -s PATH_TO_YOUR_PTB treebank_3
    
  3. Download and install CRFsuite for preprocessing.

     [example for linux]
     cd ./data
     wget https://github.com/downloads/chokkan/crfsuite/crfsuite-0.12-x86_64.tar.gz
     wget https://github.com/downloads/chokkan/crfsuite/crfsuite-0.12.tar.gz
     tar zxvf crfsuite-0.12-x86_64.tar.gz
     tar zxvf crfsuite-0.12.tar.gz
    
  4. Set CRFSUITE_UTIL and crfsuite paths in preproc.sh and run the script.

     sh ./preproc.sh
    

    This creates ./data/[train|dev|test].E00 (i.e., Error rate = 0%)

  5. Add noise by running errgent; see the readme file in that directory. (A toy illustration of what an error rate means is sketched after the commands below.)

     cd ./errgent
     sh ./generate_train_dev_test.sh (for generating all the files needed)
    
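    errgent implements the error generation described in the paper; see its readme for details. Purely as a toy illustration of what a corpus error rate such as E05 (5%) means, and not errgent's actual procedure, a noiser of this general shape corrupts tokens with a fixed probability (the corruption operations below are made up for illustration):

      import random

      # Toy illustration only (NOT errgent): corrupt roughly `rate` of the tokens,
      # e.g. rate=0.05 loosely corresponds to an E05-style 5% error rate.
      def add_noise(tokens, rate=0.05, seed=0):
          rng = random.Random(seed)
          noisy = []
          for tok in tokens:
              if rng.random() < rate:
                  if rng.random() < 0.5:
                      continue                  # drop the token
                  noisy.extend([tok, tok])      # or duplicate it
              else:
                  noisy.append(tok)
          return noisy

      # e.g., add_noise("The luxury auto maker sold cars".split(), rate=0.2)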

    We assume the generated files are named ./data/[train|dev|test].[E00|E05|E10|E15|E20]. Each file should look like the following (a minimal reading sketch follows the example):

         1       Ms.     B-NP    NNP     _       _       2       TITLE   _       _
         2       Haag    I-NP    NNP     _       _       3       SBJ     _       _
         3       plays   B-VP    VBZ     _       _       0       ROOT    _       _
         4       Elianti B-NP    NNP     _       _       3       OBJ     _       _
         5       .       O       .       _       _       3       P       _       _
         
         1       The     B-NP    DT      _       _       4       NMOD    _       _
         2       luxury  I-NP    NN      _       _       4       NMOD    _       _
         3       auto    I-NP    NN      _       _       4       NMOD    _       _
         4       maker   I-NP    NN      _       _       7       SBJ     _       _
         5       last    B-NP    JJ      _       _       6       NMOD    _       _
         6       year    I-NP    NN      _       _       7       TMP     _       _
         7       sold    B-VP    VBD     _       _       0       ROOT    _       _
         8       1,214   B-NP    CD      _       _       9       NMOD    _       _
         9       cars    I-NP    NNS     _       _       7       OBJ     _       _
         10      in      B-PP    IN      _       _       7       LOC     _       _
         11      the     B-NP    DT      _       _       12      NMOD    _       _
         12      U.S.    I-NP    NNP     _       _       10      PMOD    _       _
         
         ...
    
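    The format above is CoNLL-style: one token per line with whitespace-separated columns (ID, FORM, chunk tag, POS, two unused columns, HEAD, DEPREL, ...), and sentences separated by blank lines. As a minimal reading sketch (a hypothetical helper, not code from this repository; the column interpretation is inferred from the example above):

      # read_conll is a hypothetical helper, not part of this repository.
      def read_conll(path):
          """Read a ./data/[train|dev|test].EXX file into a list of sentences."""
          sentences, tokens = [], []
          with open(path) as f:
              for line in f:
                  if not line.strip():          # blank line ends a sentence
                      if tokens:
                          sentences.append(tokens)
                          tokens = []
                      continue
                  cols = line.split()           # ID FORM CHUNK POS _ _ HEAD DEPREL ...
                  tokens.append({
                      "id": int(cols[0]),
                      "form": cols[1],
                      "chunk": cols[2],
                      "pos": cols[3],
                      "head": int(cols[6]),     # 0 = ROOT
                      "deprel": cols[7],
                  })
          if tokens:                            # flush the last sentence
              sentences.append(tokens)
          return sentences

      # e.g., sentences = read_conll("./data/dev.E05")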
  6. Training an error-repair parser

     cd easyfirst
     (e.g.,) sh sample_train.sh E05 (training a model with the 5% error-injected corpus)
    
  7. Parsing sentences with the trained model

     (e.g.,) sh sample_parse.sh dev E05 E10 (parse the 10% error-injected dev set with a model trained on the 5% error corpus)
    
  8. Evaluation of parsing performance

     cd ./eval
     wget https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/srleval/source-archive.zip -O srleval.zip
     unzip srleval.zip
     cd ./srleval/trunk/align
     make
     
      Modify line 231 in ./eval/srleval/trunk/eval.py:
      (from) for item in alignment.align(ref_words, hyp_words, command=os.path.dirname(__file__) + "/align/align"):
      (to)   for item in alignment.align(ref_words, hyp_words):

      Run the evaluation script (see the scoring sketch after this step):
      cd ./eval
      (e.g.,) sh evaluate.sh dev E05 E10 (evaluate the 10% error-injected dev set with a model trained on the 5% error corpus)
    
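    Because an error-repair parser may insert, delete, or substitute tokens, its output cannot be scored against the reference token by token; the evaluation first aligns reference and hypothesis words (the alignment.align call above) and then scores attachments over the aligned pairs. A conceptual sketch only (the function and data layout below are assumptions, not the repository's evaluation code):

      # Hypothetical sketch of alignment-based attachment scoring.
      # ref / hyp: dicts keyed by 1-based token id -> (head_id, deprel); head 0 = ROOT.
      # alignment: list of (ref_id, hyp_id) pairs for corresponding tokens.
      def attachment_scores(ref, hyp, alignment):
          hyp2ref = dict((h, r) for r, h in alignment)  # hypothesis id -> reference id
          uas = las = 0
          for r, h in alignment:
              ref_head, ref_rel = ref[r]
              hyp_head, hyp_rel = hyp[h]
              # the predicted head is correct if it maps to the reference head
              head_ok = (hyp_head == ref_head == 0) or hyp2ref.get(hyp_head) == ref_head
              if head_ok:
                  uas += 1
                  if hyp_rel == ref_rel:
                      las += 1
          n = len(ref)                                  # score over all reference tokens
          return uas / float(n), las / float(n)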
  9. Evaluation of grammaticality improvement

Questions

  • Please e-mail Keisuke Sakaguchi (keisuke[at]cs.jhu.edu).