Can't train CoNLL formatted file
lechatpito opened this issue · 6 comments
I have been struggling to find freely available CoNLL training data for Redshift. I finally found that http://www.anc.org:8080/ANC2Go/ lets you export the Treebank in CoNLL format. However, the trainer fails with the following error:
Traceback (most recent call last):
File "./scripts/train.py", line 54, in <module>
plac.call(main)
File "/home/3TOP/fscharf/virt_env/3top_dev/lib/python2.6/site-packages/plac_core.py", line 309, in call
cmd, result = parser_from(obj).consume(arglist)
File "/home/3TOP/fscharf/virt_env/3top_dev/lib/python2.6/site-packages/plac_core.py", line 195, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "./scripts/train.py", line 48, in main
train_data = redshift.io_parse.read_conll(train_str, unlabelled=unlabelled)
File "io_parse.pyx", line 129, in redshift.io_parse.read_conll (redshift/io_parse.cpp:2860)
ValueError: too many values to unpack (expected 4)
It looks like a format problem...
Also, is there a way to pass a folder as an argument to the trainer so that all the files in it are used?
Hi,
The develop branch uses CoNLL-formatted files. On the master branch, I was using a reduced format that lists only the word, tag, head and label. You can get the reduced format easily from the CoNLL files using the cut tool.
I don't have the facility to read in a directory at the moment, but it'll be easy enough for you to write a Python function that loads the data that way. See scripts/train.py, or just use the "cat" utility to concatenate your data before you give it to the parser.
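As a rough illustration of the directory-loading suggestion, here is a minimal sketch. The function name `read_conll_dir` and the `.conll` suffix are assumptions made for this example and are not part of Redshift. It assumes sentences are separated by a blank line, which is how the develop branch's scripts/train.py splits its input.

```python
import os


def read_conll_dir(dir_path, suffix=".conll"):
    """Concatenate every file in dir_path ending in suffix into one string.

    Hypothetical helper: the name and the suffix filter are illustrative only.
    """
    chunks = []
    for name in sorted(os.listdir(dir_path)):
        path = os.path.join(dir_path, name)
        # Skip sub-directories and files with a different extension.
        if not name.endswith(suffix) or not os.path.isfile(path):
            continue
        with open(path) as f:
            text = f.read().strip()
        if text:
            chunks.append(text)
    # Join files with a blank line so sentence boundaries are preserved.
    return "\n\n".join(chunks) + "\n"
```

The returned string could then be handed to whatever reader the branch in use expects, in place of the contents of a single training file.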
Thanks for the answer!
Using the dev version I get the following error:
File "sentence.pyx", line 139, in redshift.sentence.Input.from_conll (redshift/sentence.cpp:3329)
ValueError: invalid literal for int() with base 10: 'B-NP'
Here is the first sentence of the CoNLL file:
1 Electronic _ JJ _ _ B-NP _ _ _ _ _ _ _ _ _ _ _ _ _ _
2 theft _ NN _ _ I-NP _ _ _ _ _ _ _ _ _ _ _ _ _ _
3 by _ IN _ _ I-NP B-PP _ _ _ _ _ _ _ _ _ _ _ _ _
4 foreign _ JJ _ _ B-NP I-PP B-ADJP _ _ _ _ _ _ _ _ _ _ _ _
5 and _ CC _ _ I-NP I-PP I-ADJP _ _ _ _ _ _ _ _ _ _ _ _
6 industrial _ JJ _ _ I-NP I-PP I-ADJP _ _ _ _ _ _ _ _ _ _ _ _
7 spies _ NNS _ _ I-NP I-PP _ _ _ _ _ _ _ _ _ _ _ _ _
8 and _ CC _ _ I-NP I-PP _ _ _ _ _ _ _ _ _ _ _ _ _
9 disgruntled _ JJ _ _ I-NP I-PP _ _ _ _ _ _ _ _ _ _ _ _ _
10 employees _ NNS _ _ I-NP I-PP _ _ _ _ _ _ _ _ _ _ _ _ _
11 is _ VBZ _ _ _ _ _ B-VP _ _ _ _ _ _ _ _ _ _ _
12 costing _ VBG _ _ _ _ _ B-VP _ _ _ _ _ _ _ _ _ _ _
13 U. _ NNP _ _ B-NP _ _ I-VP B-NML _ _ _ _ _ _ _ _ _ _
14 S. _ NNP _ _ I-NP _ _ I-VP I-NML _ _ _ _ _ _ _ _ _ _
15 companies _ NNS _ _ I-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
16 billions _ NNS _ _ B-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
17 and _ CC _ _ _ _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
18 eroding _ VBG _ _ _ _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
19 their _ PRP$ _ _ B-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
20 international _ JJ _ _ I-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
21 competitive _ JJ _ _ I-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
22 advantage _ NN _ _ I-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
23 . _ . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
That file is chunked, not parsed.
The best way to obtain training data for the parser is from a dependency converter, given PTB-formatted trees. You'll need to ensure that the converter produces projective trees. I usually use the Stanford converter set to "basic" mode, but I've also used the output of the Penn2Malt converter.
Here's an example of a CoNLL parsed sentence:
1 Ms. _ NNP NNP _ 2 nn _ _
2 Haag _ NNP NNP _ 3 nsubj _ _
3 plays _ VBZ VBZ _ 0 ROOT _ _
4 Elianti _ NNP NNP _ 3 dobj _ _
5 . _ . . _ 3 P _ _
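To check whether converter output is projective, something like the following sketch may help. It is not Redshift code: `heads_from_conll` and `is_projective` are hypothetical helpers, and they assume whitespace-separated columns with the head index in the seventh column, as in the example above. A tree is treated as projective when no two arcs cross, with ROOT placed at position 0.

```python
def heads_from_conll(sentence):
    """Return the 1-based head index of each token (0 means ROOT).

    Assumes the head is the 7th whitespace-separated column.
    """
    return [int(line.split()[6]) for line in sentence.strip().splitlines()]


def is_projective(heads):
    """Return True if no two dependency arcs cross."""
    # Represent each arc as a (left, right) span over token positions.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, e in arcs:
            # The arcs cross if the second starts strictly inside the first
            # and ends strictly outside it.
            if a < c < b < e:
                return False
    return True


example = """1 Ms. _ NNP NNP _ 2 nn _ _
2 Haag _ NNP NNP _ 3 nsubj _ _
3 plays _ VBZ VBZ _ 0 ROOT _ _
4 Elianti _ NNP NNP _ 3 dobj _ _
5 . _ . . _ 3 P _ _"""

print(is_projective(heads_from_conll(example)))  # True
```

The pairwise comparison is quadratic in sentence length, which is usually acceptable for a one-off sanity check over a treebank.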
Dear Matthew,
just for clarification: would the example sentence you just gave look like this in the reduced format you use in your master branch?
Ms. NNP 2 nn
Haag NNP 3 nsubj
plays VBZ 0 ROOT
Elianti NNP 3 dobj
. . 3 P
Dear Matthew,
I can't get scripts/train.py (from your develop branch) to run on the example sentence you gave.
~/bin/redshift $ scripts/train.py -k 16 input_conll.txt output.txt
Loading vocab from /home/arne/bin/redshift/index/bllip-clusters
Traceback (most recent call last):
File "scripts/train.py", line 43, in <module>
plac.call(main)
File "/usr/local/lib/python2.7/dist-packages/plac_core.py", line 309, in call
cmd, result = parser_from(obj).consume(arglist)
File "/usr/local/lib/python2.7/dist-packages/plac_core.py", line 195, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "scripts/train.py", line 31, in main
sents = [Input.from_conll(s) for s in train_str.split('\n\n') if s.strip()]
File "sentence.pyx", line 128, in redshift.sentence.Input.from_conll (redshift/sentence.cpp:3133)
IndexError: list index out of range
$ cat input_conll.txt
1 Ms. _ NNP NNP _ 2 nn _ _
2 Haag _ NNP NNP _ 3 nsubj _ _
3 plays _ VBZ VBZ _ 0 ROOT _ _
4 Elianti _ NNP NNP _ 3 dobj _ _
5 . _ . . _ 3 P _ _
Hi,
Sorry for the delay in replying. I'm travelling and have reduced internet access. In the reduced format, the sentence would be:
Ms. NNP 1 nn
Haag NNP 2 nsubj
plays VBZ -1 ROOT
Elianti NNP 2 dobj
. . 2 P
So: in the reduced format, index from 0 for the first word, and denote ROOT with the index -1. The CoNLL format, by contrast, takes the first word as having index 1, with the ROOT symbol at position 0.
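The re-indexing described above can be sketched in a few lines of Python. This is only an illustration: `conll_to_reduced` is a hypothetical name, the column positions (word in column 2, tag in column 4, head in column 7, label in column 8) are read off the example sentence in this thread, and the output separator is an assumption, since the thread does not say whether the reduced format uses spaces or tabs.

```python
def conll_to_reduced(sentence, sep=" "):
    """Convert one CoNLL sentence to the reduced word/tag/head/label format.

    Hypothetical helper; column positions and separator are assumptions.
    """
    rows = []
    for line in sentence.strip().splitlines():
        fields = line.split()
        word, tag, label = fields[1], fields[3], fields[7]
        # CoNLL heads are 1-based with ROOT at 0; the reduced format is
        # 0-based with ROOT at -1, so subtracting 1 covers both cases.
        head = int(fields[6]) - 1
        rows.append(sep.join([word, tag, str(head), label]))
    return "\n".join(rows)


conll = """1 Ms. _ NNP NNP _ 2 nn _ _
2 Haag _ NNP NNP _ 3 nsubj _ _
3 plays _ VBZ VBZ _ 0 ROOT _ _
4 Elianti _ NNP NNP _ 3 dobj _ _
5 . _ . . _ 3 P _ _"""

print(conll_to_reduced(conll))
```

Run on the example sentence, this prints the same five rows as the reduced-format listing above. Because the head indices have to be shifted, selecting columns alone does not yield valid master-branch input.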