iesl/dilated-cnn-ner

Issue in preprocessing ontonotes

ghaddarAbs opened this issue · 7 comments

Hi,
I am having an issue with preprocessing OntoNotes.

First, I used the script from the CoNLL-2012 shared task to generate the *_gold_conll files, which contain the annotations. For example:
conll-2012/v4/data/train/data/english/annotations/bn/cnn/03/cnn_0301.v4_gold_conll
In the conll-2012/v4/data directory I have 3 subdirectories: train/dev/test

My question is: how should I group the *_gold_conll files to fit the format required by preprocess.py? Should I concatenate all *_gold_conll files into one file each for train/dev/test?

Regards

Okay, I've pushed a fix. You should be able to simply run e.g. ./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf as in the readme. Let me know if it works for you.

Hi,

Thank you for taking the time to consider my request.
The new code has solved part of the problem, but it still needs some modifications.

Here are some suggestions to make the code compatible with the directory structure produced by the skeleton2conll.sh script:

  1. conf/ontonotes/ontonotes.conf line 5 -> change "dev" to "development" (Minor).

  2. change

    cat "$raw_data_dir/${data_files[0]}"/*/*_gold_conll \

    to
    cat `find $raw_data_dir/${data_files[0]} -type f -name \*_gold_conll | grep -v "/pt/nt" | grep "english"` \

  • The old command failed to gather all *_gold_conll files.
  • grep -v "/pt/nt" -> Skip the New Testament portion (Optional).
  • grep "english" -> Avoid Chinese and Arabic files in the data_file directory (e.g. data/train/data/chinese/annotations/wb/e2c/00).
  3. change

for this_data_file in `find $raw_data_dir/$filename -type d | tail -n +2`; do

to
for this_data_file in `find $raw_data_dir/$filename -type d -links 2 | tail -n +2 | grep -v "/pt/nt" | grep "english"`; do

  • -links 2 -> Get leaf directories only (otherwise it will raise an error in src/tsv_to_tfrecords.py)
  4. change

    if fname.endswith("_conll"):

    to

    if fname.endswith("_gold_conll"):

    otherwise it will also read the _auto_conll files.

  5. I was wondering: why tail -n +2?
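
For illustration, the directory selection done by the modified find loop above (leaf directories only, English only, New Testament skipped) can be sketched in Python. This is not code from the repository: the helper names is_wanted and leaf_dirs are hypothetical, and a leaf is detected here as "a directory with no subdirectories", which is what -links 2 approximates on common Unix filesystems.

```python
import os


def is_wanted(path):
    """Mimic `grep "english" | grep -v "/pt/nt"` on a single path."""
    norm = path.replace(os.sep, "/")
    return "english" in norm and "/pt/nt" not in norm


def leaf_dirs(root):
    """Return the directories under root that have no subdirectories
    and pass the language / New Testament filter (hypothetical helper)."""
    return sorted(
        dirpath
        for dirpath, dirnames, _filenames in os.walk(root)
        if not dirnames and is_wanted(dirpath)
    )
```

In this sketch the root directory is excluded automatically whenever it has subdirectories, so no equivalent of tail -n +2 is needed.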

Even with these modifications there are still some issues (mainly in tsv_to_tfrecords.py):

  1. With the current directory structure, and because of the line
    data_type = FLAGS.in_file.strip().split("/")[-1]
    each time tsv_to_tfrecords.py is called it writes its output under the name of the last directory. Here are some sample directory names:
conll-2012/data/train/data/english/annotations/nw/wsj/19
conll-2012/data/train/data/english/annotations/nw/wsj/05
conll-2012/data/train/data/english/annotations/nw/wsj/13 

Consequently, tsv_to_tfrecords.py will write 19_sizes.txt, 05_sizes.txt, ... rather than nw_sizes.txt
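
A small sketch of the naming problem, using the same expression on two of the paths mentioned in this thread. The wrapper function data_type_from_path is hypothetical, and appending "_sizes.txt" is only my assumption about how the output name is built, based on the file names seen above.

```python
def data_type_from_path(in_file):
    # Same expression as the line quoted from tsv_to_tfrecords.py:
    # the last "/"-separated component of the input path.
    return in_file.strip().split("/")[-1]


leaf_dir = "conll-2012/data/train/data/english/annotations/nw/wsj/19"
genre_dir = "conll-2012/data/train/data/english/annotations/nw"

# Assumed naming scheme: <data_type>_sizes.txt
print(data_type_from_path(leaf_dir) + "_sizes.txt")   # 19_sizes.txt
print(data_type_from_path(genre_dir) + "_sizes.txt")  # nw_sizes.txt
```

Note that a trailing slash on the path would make the last component an empty string.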

I suggest making in_file of src/tsv_to_tfrecords.py the document genre path (e.g. conll-2012/data/train/data/english/annotations/nw) and replacing

for fname in os.listdir(FLAGS.in_file):

by

# requires: from glob import glob
file_list = [y for x in os.walk(FLAGS.in_file) for y in glob(os.path.join(x[0], '*_gold_conll'))]
for fname in file_list:
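
As a self-contained sketch of the recursive collection suggested above (the function name collect_gold_conll is hypothetical and not part of the repository), the idea is to walk the genre directory and glob for *_gold_conll at every level:

```python
import os
from glob import glob


def collect_gold_conll(in_file):
    """Return all *_gold_conll paths found at any depth under in_file."""
    return sorted(
        path
        for dirpath, _dirnames, _filenames in os.walk(in_file)
        for path in glob(os.path.join(dirpath, "*_gold_conll"))
    )


if __name__ == "__main__":
    # Example genre path from this thread; adjust to your local layout.
    genre = "conll-2012/data/train/data/english/annotations/nw"
    for fname in collect_gold_conll(genre):
        print(fname)
```

The returned entries are full paths rather than bare file names, so if the loop body joins fname with in_file again, that join would need to be dropped.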

What is the folder structure of the conll-2012 dataset?

@ghaddarAbs thanks for sharing these changes! I forgot that the folder I'm using removed some of the directory structure. Could you submit these as a PR?

@strubell done. The code is now compatible with the directory structure produced by skeleton2conll.sh.

Each of the train|dev|test /protos directories now contains 6 files: bc|bn|mz|nw|tc|wb_examples.proto

Hello,
Can anyone suggest what data processing needs to be done on conll2012 before calling the following?

./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf
Currently, simply calling the preprocess.sh script as above does not write anything to the file mentioned below and, I suppose, goes into an infinite loop.
data/vocabs/ontonotes_cutoff_4.txt

I've downloaded the train v4, dev v4 and test v9 tarballs from
http://conll.cemantix.org/2012/data.html

Edit:
I was able to convert the OntoNotes files to CoNLL format successfully, but I am not sure what directory structure is needed to trigger the preprocessing script. Can you help?
The following is my directory structure:

$DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0
structure for conll-formatted-ontonotes-5.0:

conll-formatted-ontonotes-5.0
├── data
│   ├── development
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   ├── test
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   └── train
│       └── data
│           ├── arabic
│           │   └── annotations
│           ├── chinese
│           │   └── annotations
│           └── english
│               └── annotations
└── scripts

Regards