iesl/dilated-cnn-ner

preprocessing before triggering 'preprocess.sh' for ontonotes

marc88 opened this issue · 2 comments

Hello,
Can anyone suggest what data processing needs to be done on CoNLL-2012 before calling the following?

./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf
Currently, simply calling the preprocess.sh script as above writes nothing to the file below and then, I suspect, goes into an infinite loop.
data/vocabs/ontonotes_cutoff_4.txt

I've downloaded the train v4, dev v4 and test v9 tarballs from
http://conll.cemantix.org/2012/data.html

Edit:
I was able to convert the OntoNotes files to CoNLL format, but I'm not sure of the directory structure the preprocessing script expects. Can you help?
The following is my directory structure:

$DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0

*Structure for $DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0 (this directory holds all the _gold_conll files; take the path below as an example:
/home/ss06886910/Strubel_IDCNN/data/conll-formatted-ontonotes-5.0/data/train/data/english/annotations/wb/c2e/00/c2e_0028.v4_gold_conll)

conll-formatted-ontonotes-5.0
├── data
│   ├── development
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   ├── test
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   └── train
│       └── data
│           ├── arabic
│           │   └── annotations
│           ├── chinese
│           │   └── annotations
│           └── english
│               └── annotations
└── scripts

I tried running with the following parameter in ontonotes.conf:
export raw_data_dir="$DATA_DIR/conll-formatted-ontonotes-5.0/data"
($DATA_DIR = $DILATED_CNN_NER_ROOT/data)

And I get the following error:

Processing file: data/conll-formatted-ontonotes-5.0/data/development
python /home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py --in_file data/conll-formatted-ontonotes-5.0/data/development --out_dir /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/development --window_size 3 --update_maps False --dataset ontonotes --update_vocab /home/ss06886910/Strubel_IDCNN/data/vocabs/ontonotes_cutoff_4.txt --vocab /home/ss06886910/Strubel_IDCNN/data/embeddings/lample-embeddings-pre.txt --labels /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/label.txt --shapes /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/shape.txt --chars /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/char.txt
Embeddings coverage: 98.67%
Processing file: data/conll-formatted-ontonotes-5.0/data/test
python /home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py --in_file data/conll-formatted-ontonotes-5.0/data/test --out_dir /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/test --window_size 3 --update_maps False --dataset ontonotes --update_vocab /home/ss06886910/Strubel_IDCNN/data/vocabs/ontonotes_cutoff_4.txt --vocab /home/ss06886910/Strubel_IDCNN/data/embeddings/lample-embeddings-pre.txt --labels /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/label.txt --shapes /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/shape.txt --chars /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/char.txt
Traceback (most recent call last):
  File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 498, in <module>
    tf.app.run()
  File "/home/ss06886910/IDCNN/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 494, in main
    tsv_to_examples()
  File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 487, in tsv_to_examples
    print("Embeddings coverage: %2.2f%%" % ((1-(num_oov/num_tokens)) * 100))
ZeroDivisionError: division by zero
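For reference, the division by zero means num_tokens was 0, i.e. the script read no tokens at all for that split, which points to its input glob matching no files. A hypothetical guard around the coverage print (embeddings_coverage is my own illustrative function, not part of the repo) could surface that more clearly:

```python
def embeddings_coverage(num_oov, num_tokens):
    # Mirrors the print at tsv_to_tfrecords.py line 487, but returns None
    # instead of dividing by zero when no tokens were read.
    if num_tokens == 0:
        return None
    return (1 - float(num_oov) / num_tokens) * 100

coverage = embeddings_coverage(0, 0)
if coverage is None:
    print("No tokens read -- check that the input files for this split "
          "match the expected *_gold_conll naming.")
else:
    print("Embeddings coverage: %2.2f%%" % coverage)
```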

Regards

@marc88 I have the same issue here. Did you fix it successfully?

I figured it out.

In the test set, the CoNLL file names end with _gold_parse_conll instead of _gold_conll. So you need to change the line

[y for x in os.walk(FLAGS.in_file) for y in glob(os.path.join(x[0], '*_gold_conll'))\
                         if "/"+data_type+"/" in y and "/english/" in y]

to

[y for x in os.walk(FLAGS.in_file) for y in glob(os.path.join(x[0], '*_gold_parse_conll'))\
                         if "/"+data_type+"/" in y and "/english/" in y]

when processing the test set.
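Alternatively, a single pass can accept both suffixes, so the same code path works for train/dev (*_gold_conll) and test (*_gold_parse_conll). This is a hedged sketch mirroring the comprehension above; collect_conll_files is a hypothetical helper, not a function in the repo:

```python
import os
import tempfile
from glob import glob

def collect_conll_files(in_file, data_type):
    # Walk the tree and keep English files for the requested split,
    # accepting both the train/dev suffix and the test suffix.
    # A file matches at most one of the two patterns, so no duplicates.
    return [y
            for x in os.walk(in_file)
            for pattern in ('*_gold_conll', '*_gold_parse_conll')
            for y in glob(os.path.join(x[0], pattern))
            if "/" + data_type + "/" in y and "/english/" in y]

# Tiny demo on a throwaway tree mimicking the layout shown earlier.
root = tempfile.mkdtemp()
for rel in ("test/data/english/annotations/a.v9_gold_parse_conll",
            "train/data/english/annotations/b.v4_gold_conll",
            "test/data/arabic/annotations/c.v9_gold_parse_conll"):
    path = os.path.join(root, "data", rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

test_files = collect_conll_files(root, "test")    # English test file only
train_files = collect_conll_files(root, "train")  # English train file only
```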