syllog1sm/redshift

Can't train CoNLL formatted file

lechatpito opened this issue · 6 comments

I have been struggling to find freely available CoNLL training data for Redshift. I finally found that http://www.anc.org:8080/ANC2Go/ can export a treebank in CoNLL format. However, the trainer fails with the following error:
Traceback (most recent call last):
File "./scripts/train.py", line 54, in <module>
plac.call(main)
File "/home/3TOP/fscharf/virt_env/3top_dev/lib/python2.6/site-packages/plac_core.py", line 309, in call
cmd, result = parser_from(obj).consume(arglist)
File "/home/3TOP/fscharf/virt_env/3top_dev/lib/python2.6/site-packages/plac_core.py", line 195, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "./scripts/train.py", line 48, in main
train_data = redshift.io_parse.read_conll(train_str, unlabelled=unlabelled)
File "io_parse.pyx", line 129, in redshift.io_parse.read_conll (redshift/io_parse.cpp:2860)
ValueError: too many values to unpack (expected 4)

It looks like a format problem...
Also, is there a way to pass a folder as an argument to the trainer so that all the files in it are used?

Hi,

The develop branch uses CoNLL-formatted files. On the master branch, I was using a reduced format that lists only the word, tag, head and label. You can get the reduced format easily from the CoNLL files using the cut tool.

I don't have the facility to read in a directory at the moment, but it'll be easy enough for you to write a Python function that loads the data that way. See scripts/train.py, or just use the "cat" utility to concatenate your data before you give it to the parser.
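A minimal sketch of such a loader, in case it helps. The function name `read_conll_dir` and the `*.conll` file pattern are made up for illustration and are not part of Redshift; it also assumes sentences are separated by blank lines, so files are joined with a blank line between them.

```python
import glob
import os


def read_conll_dir(directory, pattern="*.conll"):
    """Concatenate every matching file in `directory` into one string.

    Hypothetical helper, not part of Redshift. Files are read in sorted
    order and joined with a blank line, so the last sentence of one file
    stays separate from the first sentence of the next.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    chunks = []
    for path in paths:
        with open(path) as handle:
            text = handle.read().strip()
        if text:  # skip empty files
            chunks.append(text)
    return "\n\n".join(chunks)
```

The returned string could then be passed to whatever reader scripts/train.py already calls on the contents of a single training file.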

Thanks for the answer!
Using the develop branch, I get the following error:
File "sentence.pyx", line 139, in redshift.sentence.Input.from_conll (redshift/sentence.cpp:3329)
ValueError: invalid literal for int() with base 10: 'B-NP'

Here is the first sentence of the CoNLL file:

1 Electronic _ JJ _ _ B-NP _ _ _ _ _ _ _ _ _ _ _ _ _ _
2 theft _ NN _ _ I-NP _ _ _ _ _ _ _ _ _ _ _ _ _ _
3 by _ IN _ _ I-NP B-PP _ _ _ _ _ _ _ _ _ _ _ _ _
4 foreign _ JJ _ _ B-NP I-PP B-ADJP _ _ _ _ _ _ _ _ _ _ _ _
5 and _ CC _ _ I-NP I-PP I-ADJP _ _ _ _ _ _ _ _ _ _ _ _
6 industrial _ JJ _ _ I-NP I-PP I-ADJP _ _ _ _ _ _ _ _ _ _ _ _
7 spies _ NNS _ _ I-NP I-PP _ _ _ _ _ _ _ _ _ _ _ _ _
8 and _ CC _ _ I-NP I-PP _ _ _ _ _ _ _ _ _ _ _ _ _
9 disgruntled _ JJ _ _ I-NP I-PP _ _ _ _ _ _ _ _ _ _ _ _ _
10 employees _ NNS _ _ I-NP I-PP _ _ _ _ _ _ _ _ _ _ _ _ _
11 is _ VBZ _ _ _ _ _ B-VP _ _ _ _ _ _ _ _ _ _ _
12 costing _ VBG _ _ _ _ _ B-VP _ _ _ _ _ _ _ _ _ _ _
13 U. _ NNP _ _ B-NP _ _ I-VP B-NML _ _ _ _ _ _ _ _ _ _
14 S. _ NNP _ _ I-NP _ _ I-VP I-NML _ _ _ _ _ _ _ _ _ _
15 companies _ NNS _ _ I-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
16 billions _ NNS _ _ B-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
17 and _ CC _ _ _ _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
18 eroding _ VBG _ _ _ _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
19 their _ PRP$ _ _ B-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
20 international _ JJ _ _ I-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
21 competitive _ JJ _ _ I-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
22 advantage _ NN _ _ I-NP _ _ I-VP _ _ _ _ _ _ _ _ _ _ _
23 . _ . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

That file is chunked, not parsed.

The best way to obtain training data for the parser is from a dependency converter, given PTB-formatted trees. You'll need to ensure that the converter produces projective trees. I usually use the Stanford converter set to "basic" mode, but I've also used the output of the Penn2Malt converter.
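One way to check the projectivity requirement on converter output is to look for crossing arcs. The sketch below is not taken from Redshift: `is_projective` is a hypothetical helper that takes the head column of one sentence in CoNLL numbering (first token is 1, ROOT is 0) and treats ROOT as position 0 when testing for crossings.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    `heads` uses CoNLL numbering: heads[i] is the head of token i + 1,
    and 0 denotes ROOT (treated here as position 0).
    """
    # Store each arc as (left end, right end), ignoring its direction.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            # Two arcs cross when exactly one end of the second arc
            # lies strictly inside the first.
            if left1 < left2 < right1 < right2:
                return False
    return True
```

The check is quadratic in sentence length, which should be acceptable for a one-off pass over training data.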

Here's an example of a CoNLL parsed sentence:

1 Ms. _ NNP NNP _ 2 nn _ _
2 Haag _ NNP NNP _ 3 nsubj _ _
3 plays _ VBZ VBZ _ 0 ROOT _ _
4 Elianti _ NNP NNP _ 3 dobj _ _
5 . _ . . _ 3 P _ _
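A file that has been chunked rather than parsed can be spotted before training by checking the head column. This is only a sketch under assumptions: `find_head_problems` is a hypothetical helper, fields are assumed to be separated by whitespace and to contain no spaces, and the head is assumed to sit in the 7th column as in the example above.

```python
def find_head_problems(sentence):
    """Return (line_number, message) pairs for rows whose HEAD is unusable.

    Hypothetical helper, not part of Redshift. `sentence` is the text of
    one sentence, one token per line.
    """
    rows = [line.split() for line in sentence.strip().split("\n") if line.strip()]
    problems = []
    for number, row in enumerate(rows, start=1):
        if len(row) < 8:
            problems.append((number, "expected at least 8 columns, got %d" % len(row)))
        elif not row[6].isdigit():
            problems.append((number, "HEAD column is %r, not an integer" % row[6]))
        elif int(row[6]) > len(rows):
            problems.append((number, "HEAD %s points past the end of the sentence" % row[6]))
    return problems
```

An empty result means every row has an integer head that refers to ROOT or to a token in the sentence; a row with a chunk tag such as B-NP in the 7th column is reported instead.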

Dear Matthew,

just for clarification: would the example sentence you just gave look like this in the reduced format you use in your master branch?

Ms. NNP 2 nn
Haag NNP 3 nsubj
plays VBZ 0 ROOT
Elianti NNP 3 dobj
. . 3 P

Dear Matthew,

I can't get scripts/train.py (from your develop branch) to run on the example sentence you gave.

~/bin/redshift $ scripts/train.py -k 16 input_conll.txt output.txt
Loading vocab from  /home/arne/bin/redshift/index/bllip-clusters
Traceback (most recent call last):
  File "scripts/train.py", line 43, in <module>
    plac.call(main)
  File "/usr/local/lib/python2.7/dist-packages/plac_core.py", line 309, in call
    cmd, result = parser_from(obj).consume(arglist)
  File "/usr/local/lib/python2.7/dist-packages/plac_core.py", line 195, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "scripts/train.py", line 31, in main
    sents = [Input.from_conll(s) for s in train_str.split('\n\n') if s.strip()]
  File "sentence.pyx", line 128, in redshift.sentence.Input.from_conll (redshift/sentence.cpp:3133)
IndexError: list index out of range
$ cat input_conll.txt

1   Ms. _   NNP NNP _   2   nn  _   _
2   Haag    _   NNP NNP _   3   nsubj   _   _
3   plays   _   VBZ VBZ _   0   ROOT    _   _
4   Elianti _   NNP NNP _   3   dobj    _   _
5   .   _   .   .   _   3   P   _   _

Hi,

Sorry for the delay replying. I'm travelling, and have reduced internet access.

Ms.     NNP     1       nn
Haag    NNP     2       nsubj
plays   VBZ     -1      ROOT
Elianti NNP     2       dobj
.       .       2       P

So: heads are indexed from 0 for the first word, and ROOT is denoted by the head index -1. The CoNLL format instead gives the first word index 1 and puts the ROOT symbol at position 0.
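The conversion from CoNLL rows to this reduced format can be sketched as follows. `conll_to_reduced` is a hypothetical helper, not part of Redshift. It assumes whitespace-separated CoNLL fields, takes the tag from the 4th column (the 4th and 5th are identical in the example sentence), and writes tab-separated output, which is a guess at the separator the master branch expects.

```python
def conll_to_reduced(sentence):
    """Convert one CoNLL sentence to word/tag/head/label rows.

    Hypothetical helper, not part of Redshift. Heads are shifted down
    by one, so the first word gets index 0 and ROOT becomes -1.
    """
    lines = []
    for row in sentence.strip().split("\n"):
        fields = row.split()
        word, tag, label = fields[1], fields[3], fields[7]
        head = int(fields[6]) - 1  # CoNLL 0 (ROOT) -> -1, 1 -> 0, ...
        lines.append("\t".join([word, tag, str(head), label]))
    return "\n".join(lines)
```

Applied to the five-token example sentence, this yields the heads 1, 2, -1, 2, 2 shown above.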