preprocessing before triggering 'preprocess.sh' for ontonotes
marc88 opened this issue · 2 comments
Hello,
Can anyone suggest on the data processing to be done on conll2012 before calling the following?
./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf
Currently, simply calling the preprocess.sh script as above, does not write anything to the file mentioned below and goes into an infinite loop I suppose.
data/vocabs/ontonotes_cutoff_4.txt
I've downloaded the train v4, dev v4 and test v9 tarballs from
http://conll.cemantix.org/2012/data.html
Edit:
I could convert the ontonotes files successfully to conll format but not sure of the directory structure to trigger the preprocessing script. Can you help?
The following is my directory structure:
$DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0
*structure for $DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0 ( this directory has all the _gold_conll files. Please take a direcotry below as an example:
/home/ss06886910/Strubel_IDCNN/data/conll-formatted-ontonotes-5.0/data/train/data/english/annotations/wb/c2e/00/c2e_0028.v4_gold_conll)
conll-formatted-ontonotes-5.0
├── data
│ ├── development
│ │ └── data
│ │ ├── arabic
│ │ │ └── annotations
│ │ ├── chinese
│ │ │ └── annotations
│ │ └── english
│ │ └── annotations
│ ├── test
│ │ └── data
│ │ ├── arabic
│ │ │ └── annotations
│ │ ├── chinese
│ │ │ └── annotations
│ │ └── english
│ │ └── annotations
│ └── train
│ └── data
│ ├── arabic
│ │ └── annotations
│ ├── chinese
│ │ └── annotations
│ └── english
│ └── annotations
└── scripts
Tried running with the following parameter in ontonotes.conf ;
export raw_data_dir="$DATA_DIR/conll-formatted-ontonotes-5.0/data"
($DATA_DIR = $DILATED_CNN_NER_ROOT/data)
And, I get the following error:
Processing file: data/conll-formatted-ontonotes-5.0/data/development
python /home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py --in_file data/conll-formatted-ontonotes-5.0/data/development --out_dir /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/development --window_size 3 --update_maps False --dataset ontonotes --update_vocab /home/ss06886910/Strubel_IDCNN/data/vocabs/ontonotes_cutoff_4.txt --vocab /home/ss06886910/Strubel_IDCNN/data/embeddings/lample-embeddings-pre.txt --labels /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/label.txt --shapes /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/shape.txt --chars /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/char.txt
Embeddings coverage: 98.67%
Processing file: data/conll-formatted-ontonotes-5.0/data/test
python /home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py --in_file data/conll-formatted-ontonotes-5.0/data/test --out_dir /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/test --window_size 3 --update_maps False --dataset ontonotes --update_vocab /home/ss06886910/Strubel_IDCNN/data/vocabs/ontonotes_cutoff_4.txt --vocab /home/ss06886910/Strubel_IDCNN/data/embeddings/lample-embeddings-pre.txt --labels /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/label.txt --shapes /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/shape.txt --chars /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/char.txt
Traceback (most recent call last):
File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 498, in <module>
tf.app.run()
File "/home/ss06886910/IDCNN/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 494, in main
tsv_to_examples()
File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 487, in tsv_to_examples
print("Embeddings coverage: %2.2f%%" % ((1-(num_oov/num_tokens)) * 100))
ZeroDivisionError: division by zero
Regards
@marc88 I have same issue here. Did you fix the successfully ?
I figure it out.
In the test set, the conll file name ends with gold_parse_conll
instead of gold_conll
. So you need to change the line
[y for x in os.walk(FLAGS.in_file) for y in glob(os.path.join(x[0], '*_gold_conll'))\
if "/"+data_type+"/" in y and "/english/" in y]
with
[y for x in os.walk(FLAGS.in_file) for y in glob(os.path.join(x[0], '*_gold_parse_conll'))\
if "/"+data_type+"/" in y and "/english/" in y]
for test set.