Issue in preprocessing ontonotes
ghaddarAbs opened this issue · 7 comments
Hi,
I am having trouble preprocessing OntoNotes.
First, I used the CoNLL-2012 shared task script to generate the *_gold_conll
files, which contain the annotations. For example:
conll-2012/v4/data/train/data/english/annotations/bn/cnn/03/cnn_0301.v4_gold_conll
In the conll-2012/v4/data
directory I have 3 subdirectories: train/dev/test
My question is: how should I group the *_gold_conll
files to fit the format that preprocess.py requires? Should I concatenate all the *_gold_conll
files into one file each for train/dev/test?
Regards
Okay, I've pushed a fix. You should be able to simply run e.g. ./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf
as in the readme. Let me know if it works for you.
Hi,
Thank you for taking the time to consider my request.
The new code has solved part of the problem, but it still needs some modifications.
Here are some suggestions to make the code compatible with the directory structure produced by the skeleton2conll.sh script:
- In conf/ontonotes/ontonotes.conf, line 5, change "dev" to "development" (minor).
- Change line 45 of dilated-cnn-ner/bin/preprocess.sh (at 0b4955a) to
cat `find $raw_data_dir/${data_files[0]} -type f -name \*_gold_conll | grep -v "/pt/nt" | grep "english"` \
The old command failed to gather all *_gold_conll files.
grep -v "/pt/nt" -> skips the New Testament portion (optional).
grep "english" -> avoids the Chinese and Arabic files in the data_file directory (e.g. data/train/data/chinese/annotations/wb/e2c/00).
- Change line 72 of dilated-cnn-ner/bin/preprocess.sh (at 0b4955a) to
for this_data_file in `find $raw_data_dir/$filename -type d -links 2 | tail -n +2 | grep -v "/pt/nt" | grep "english"`; do
-links 2 -> gets leaf directories only (otherwise src/tsv_to_tfrecords.py raises an error).
- Change line 421 of dilated-cnn-ner/src/tsv_to_tfrecords.py (at 0b4955a) to
if fname.endswith("_gold_conll"):
Otherwise it will also read the _auto_conll files.
- I was wondering: why tail -n +2?
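For anyone who wants to check the selection logic outside the shell, the filters in the list above can be sketched in Python. This is only an illustration under assumptions: gold_conll_files and leaf_dirs are hypothetical names that do not exist in the repository, and a "leaf" directory is taken to mean one with no subdirectories, which is what -links 2 matches on typical Linux filesystems.

```python
import os


def gold_conll_files(root, language="english", skip="/pt/nt"):
    """Collect *_gold_conll files under root, mirroring the find/grep pipeline.

    Hypothetical helper: keeps paths that contain `language` and drops
    paths that contain `skip` (the New Testament portion).
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith("_gold_conll"):
                continue
            # Normalise separators so the substring tests behave like grep.
            path = os.path.join(dirpath, name).replace(os.sep, "/")
            if language not in path or skip in path:
                continue
            matches.append(path)
    return sorted(matches)


def leaf_dirs(root):
    """Return the directories under root that have no subdirectories."""
    return sorted(
        dirpath.replace(os.sep, "/")
        for dirpath, dirnames, _filenames in os.walk(root)
        if not dirnames
    )
```

Like the grep filters, the substring tests apply to the whole path, so a root directory whose own name contains "english" would match every file beneath it.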
Even with these modifications, there are still some issues (mainly in tsv_to_tfrecords.py):
- With the current directory structure, and because of line 418 of dilated-cnn-ner/src/tsv_to_tfrecords.py (at 0b4955a), each call to tsv_to_tfrecords.py writes its output under the name of the last directory. Here are some sample directory names:
conll-2012/data/train/data/english/annotations/nw/wsj/19
conll-2012/data/train/data/english/annotations/nw/wsj/05
conll-2012/data/train/data/english/annotations/nw/wsj/13
Consequently, tsv_to_tfrecords.py
will write 19_sizes.txt, 05_sizes.txt, ... rather than nw_sizes.txt.
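To see why the numbered names appear, here is a small sketch. It assumes, as described above, that the output prefix is taken from the last component of the input directory. output_prefix and genre_of are hypothetical helpers for illustration, not functions from tsv_to_tfrecords.py.

```python
import os


def output_prefix(in_dir):
    """Last path component, which is what ends up in names like 19_sizes.txt."""
    return os.path.basename(os.path.normpath(in_dir))


def genre_of(in_dir):
    """Path component right after 'annotations', e.g. 'nw' (assumed layout)."""
    parts = os.path.normpath(in_dir).split(os.sep)
    return parts[parts.index("annotations") + 1]


leaf = "conll-2012/data/train/data/english/annotations/nw/wsj/19"
print(output_prefix(leaf) + "_sizes.txt")  # 19_sizes.txt
print(genre_of(leaf) + "_sizes.txt")       # nw_sizes.txt
```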
I suggest that in_file of src/tsv_to_tfrecords.py
be the document genre path (e.g. conll-2012/data/train/data/english/annotations/nw
) and that line 420 of dilated-cnn-ner/src/tsv_to_tfrecords.py (at 0b4955a) be replaced
by
file_list = [y for x in os.walk(in_file) for y in glob(os.path.join(x[0], '*_gold_conll'))]
for fname in file_list:  # entries are already full paths, so no os.listdir here
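A self-contained version of this suggestion might look like the sketch below. It assumes in_file is a genre directory such as .../annotations/nw, and collect_gold_files is a hypothetical name, not part of the repository. Because glob returns full paths, the loop iterates over the list directly rather than passing it to os.listdir.

```python
import os
from glob import glob


def collect_gold_files(in_file):
    """Return every *_gold_conll file anywhere below the directory in_file."""
    file_list = [
        path
        for dirpath, _dirnames, _filenames in os.walk(in_file)
        for path in glob(os.path.join(dirpath, "*_gold_conll"))
    ]
    return sorted(file_list)


if __name__ == "__main__":
    # Hypothetical genre directory; adjust to the local layout.
    genre_dir = "conll-2012/data/train/data/english/annotations/nw"
    for fname in collect_gold_files(genre_dir):
        print(fname)
```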
What is folder structure of conll-2012 dataset
@ghaddarAbs thanks for sharing these changes! I forgot that the folder I'm using removed some of the directory structure. Could you submit these as a PR?
@strubell done. The code is now compatible with the directory structure produced by skeleton2conll.sh.
Each of the train|dev|test /protos directories now contains 6 files: bc|bn|mz|nw|tc|wb_examples.proto
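One way to confirm this result is a quick check like the one below. It is a sketch only: the six genre prefixes come from the comment above, while missing_protos is a hypothetical helper and the location of each protos directory depends on the local configuration.

```python
import os

# Genre prefixes as listed in the comment above.
GENRES = ["bc", "bn", "mz", "nw", "tc", "wb"]


def missing_protos(protos_dir):
    """List the expected <genre>_examples.proto files absent from protos_dir."""
    expected = [genre + "_examples.proto" for genre in GENRES]
    present = set(os.listdir(protos_dir)) if os.path.isdir(protos_dir) else set()
    return [name for name in expected if name not in present]
```

An empty list means all six genre files are present in that directory.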
Hello,
Can anyone suggest what data processing needs to be done on CoNLL-2012 before calling the following?
./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf
Currently, simply calling the preprocess.sh script as above does not write anything to the file mentioned below and, I suppose, goes into an infinite loop.
data/vocabs/ontonotes_cutoff_4.txt
I've downloaded the train v4, dev v4 and test v9 tarballs from
http://conll.cemantix.org/2012/data.html
Edit:
I could convert the OntoNotes files to CoNLL format successfully, but I am not sure what directory structure the preprocessing script expects. Can you help?
The following is my directory structure:
$DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0
structure for conll-formatted-ontonotes-5.0:
conll-formatted-ontonotes-5.0
├── data
│   ├── development
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   ├── test
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   └── train
│       └── data
│           ├── arabic
│           │   └── annotations
│           ├── chinese
│           │   └── annotations
│           └── english
│               └── annotations
└── scripts
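With a layout like this, it may help to confirm that each split actually contains English *_gold_conll files before running the preprocessing script. The sketch below counts them. The split names are taken from the tree above, and count_gold_files is a hypothetical helper. If a split reports zero files, note that a shell cat given no file arguments reads from standard input, which can look like a hang. Whether that is what happens here is only a guess.

```python
import os


def count_gold_files(root, splits=("train", "development", "test")):
    """Count English *_gold_conll files under root/data/<split>/data/english."""
    counts = {}
    for split in splits:
        base = os.path.join(root, "data", split, "data", "english")
        counts[split] = sum(
            1
            for _dirpath, _dirnames, filenames in os.walk(base)
            for name in filenames
            if name.endswith("_gold_conll")
        )
    return counts
```

For example, count_gold_files("data/conll-formatted-ontonotes-5.0") returns a dictionary with one count per split; a zero for any split suggests the conversion step did not produce files where the script looks for them.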
Regards