Issue in preprocessing ontonotes

Hi,
I am having an issue with preprocessing OntoNotes.

First, I used the CoNLL-2012 shared task script to generate the *_gold_conll files, which contain the annotations. For example:
conll-2012/v4/data/train/data/english/annotations/bn/cnn/03/cnn_0301.v4_gold_conll
In the conll-2012/v4/data directory I have three subdirectories: train/dev/test

My question is: how should I group the *_gold_conll files to fit the format that preprocess.py requires? Should I concatenate all *_gold_conll files into a single file each for train/dev/test?

Regards

Thanks for pointing this out! It looks like not all of the OntoNotes preprocessing got properly ported to this repo. I'll try to push a fix later today.


Okay, I've pushed a fix. You should be able to simply run e.g. ./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf as in the readme. Let me know if it works for you.

Hi,

Thank you for taking the time to consider my request.
The new code has solved part of the problem, but it still needs some modifications.

Here are some suggestions to make the code compatible with the directory structure produced by the skeleton2conll.sh script:

conf/ontonotes/ontonotes.conf line 5 -> change "dev" to "development" (Minor).
change

dilated-cnn-ner/bin/preprocess.sh

Line 45 in 0b4955a

cat "$raw_data_dir/${data_files[0]}"/*/*_gold_conll \

to
cat `find $raw_data_dir/${data_files[0]} -type f -name \*_gold_conll | grep -v "/pt/nt" | grep "english"` \

The old command failed to gather all *_gold_conll files.
grep -v "/pt/nt" -> Skip the New Testament portion (optional).
grep "english" -> Avoid the Chinese and Arabic files in the data_file directory (e.g. data/train/data/chinese/annotations/wb/e2c/00).
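For checking which files that pipeline would pick up, the same selection can be sketched in Python. This is only an illustration: the helper name `collect_gold_conll` and its parameters are hypothetical and not part of the repo, and the two substring checks simply mirror the two greps above.

```python
import os


def collect_gold_conll(root, skip_new_testament=True):
    """Return sorted paths of English *_gold_conll files under root.

    Hypothetical helper mirroring:
      find root -type f -name '*_gold_conll' | grep -v "/pt/nt" | grep "english"
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith("_gold_conll"):
                continue
            # Use forward slashes so the substring checks behave like grep on POSIX paths.
            path = os.path.join(dirpath, name).replace(os.sep, "/")
            if "english" not in path:
                continue  # like grep "english": drops the chinese/arabic files
            if skip_new_testament and "/pt/nt" in path:
                continue  # like grep -v "/pt/nt": optional
            matches.append(path)
    return sorted(matches)
```

Like the greps, the checks apply to the whole path, so a root directory whose own name contains "english" would let every file through.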

change

dilated-cnn-ner/bin/preprocess.sh

Line 72 in 0b4955a

for this_data_file in `find $raw_data_dir/$filename -type d | tail -n +2`; do

to
for this_data_file in `find $raw_data_dir/$filename -type d -links 2 | tail -n +2 | grep -v "/pt/nt" | grep "english"`; do

-links 2 -> Select leaf directories only (otherwise src/tsv_to_tfrecords.py raises an error).
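The `-links 2` test relies on the hard-link count of a directory, which, as far as I know, not every filesystem reports the same way. A sketch of the same idea that only looks at directory contents is below; `leaf_directories` is a hypothetical name, not something in the repo.

```python
import os


def leaf_directories(root):
    """Return sorted directories under root that have no subdirectories.

    Hypothetical helper; root itself is included if it has no subdirectories.
    """
    leaves = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if not dirnames:  # no child directories, so this is a leaf
            leaves.append(dirpath)
    return sorted(leaves)
```

The English and /pt/nt filters from the shell version would still need to be applied to the returned paths.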

change

dilated-cnn-ner/src/tsv_to_tfrecords.py

Line 421 in 0b4955a

if fname.endswith("_conll"):

to if fname.endswith("_gold_conll"): otherwise it will also read the _auto_conll files.
I was wondering: why tail -n +2?

Even with these modifications there are still some issues (mainly in tsv_to_tfrecords.py):

With the current directory structure, and because of the line

dilated-cnn-ner/src/tsv_to_tfrecords.py

Line 418 in 0b4955a

data_type = FLAGS.in_file.strip().split("/")[-1]

each time tsv_to_tfrecords.py is called it writes its output under the name of the last directory in the path. Here are some sample directory names:

conll-2012/data/train/data/english/annotations/nw/wsj/19
conll-2012/data/train/data/english/annotations/nw/wsj/05
conll-2012/data/train/data/english/annotations/nw/wsj/13

Consequently, tsv_to_tfrecords.py will write 19_sizes.txt, 05_sizes.txt, ... rather than nw_sizes.txt.
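The naming behaviour can be reproduced in isolation. The sketch below reuses the expression from the quoted line 418 on the sample paths above; the wrapper function `data_type` and the variable names are only for illustration.

```python
def data_type(in_file):
    # Same expression as the quoted line 418 of src/tsv_to_tfrecords.py.
    return in_file.strip().split("/")[-1]


leaf_dirs = [
    "conll-2012/data/train/data/english/annotations/nw/wsj/19",
    "conll-2012/data/train/data/english/annotations/nw/wsj/05",
    "conll-2012/data/train/data/english/annotations/nw/wsj/13",
]
genre_dir = "conll-2012/data/train/data/english/annotations/nw"

print([data_type(d) for d in leaf_dirs])  # ['19', '05', '13']
print(data_type(genre_dir))  # nw
```

Passing a leaf directory yields the numeric folder name, while passing the genre directory yields the genre itself.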

I suggest that the in_file of src/tsv_to_tfrecords.py be the document genre path (e.g. conll-2012/data/train/data/english/annotations/nw), and that the following line be replaced

dilated-cnn-ner/src/tsv_to_tfrecords.py

Line 420 in 0b4955a

for fname in os.listdir(FLAGS.in_file):

by

# needs: from glob import glob
file_list = [y for x in os.walk(FLAGS.in_file) for y in glob(os.path.join(x[0], '*_gold_conll'))]
for fname in file_list:  # each fname is now a path, not a bare file name

What is the folder structure of the conll-2012 dataset?

@ghaddarAbs thanks for sharing these changes! I forgot that the folder I'm using removed some of the directory structure. Could you submit these as a PR?

@strubell done. The code is now compatible with the directory structure produced by skeleton2conll.sh.

Each of the train|dev|test /protos directories now contains 6 files: bc|bn|mz|nw|tc|wb_examples.proto
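A quick way to confirm that outcome is to list which of the six files are absent from a protos directory. This is a minimal sketch: `missing_proto_files` and `GENRES` are hypothetical names, and the file names are assumed to follow the `<genre>_examples.proto` pattern described above.

```python
import os

# Genre prefixes taken from the file names listed above.
GENRES = ("bc", "bn", "mz", "nw", "tc", "wb")


def missing_proto_files(protos_dir, genres=GENRES):
    """Return the expected <genre>_examples.proto names not found in protos_dir."""
    present = set(os.listdir(protos_dir))
    expected = [f"{genre}_examples.proto" for genre in genres]
    return [name for name in expected if name not in present]
```

An empty list means all six files are present in that directory.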

Hello,
Can anyone suggest what data processing needs to be done on conll2012 before calling the following?

./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf
Currently, simply calling the preprocess.sh script as above does not write anything to the file mentioned below, and I suspect it goes into an infinite loop.
data/vocabs/ontonotes_cutoff_4.txt

I've downloaded the train v4, dev v4 and test v9 tarballs from
http://conll.cemantix.org/2012/data.html

Edit:
I was able to convert the OntoNotes files to CoNLL format successfully, but I am not sure what directory structure the preprocessing script expects. Can you help?
The following is my directory structure:

$DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0
structure for conll-formatted-ontonotes-5.0:

conll-formatted-ontonotes-5.0
├── data
│   ├── development
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   ├── test
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   └── train
│       └── data
│           ├── arabic
│           │   └── annotations
│           ├── chinese
│           │   └── annotations
│           └── english
│               └── annotations
└── scripts

Regards