Error using POS tagger
Closed this issue · 6 comments
Hi there. I'm trying to use your POS tagger and I'm getting the following error when I attempt to train on a very small sample (10 sentences) from the Penn Treebank WSJ dataset. Any thoughts as to what I'm doing wrong?
In [2]: from redshift.tagger import train
In [3]: train(open('wsj.10.txt', 'r').read(), 'redshift_model')
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-4-16d6fd520844> in <module>()
----> 1 train(open('wsj.10.txt', 'r').read(), 'redshift_model')
/Library/Python/2.7/site-packages/redshift/tagger.so in redshift.tagger.train (redshift/tagger.cpp:2391)()
/Library/Python/2.7/site-packages/redshift/tagger.so in redshift.tagger.Tagger.train_sent (redshift/tagger.cpp:4013)()
/Library/Python/2.7/site-packages/thinc/learner.so in thinc.learner.LinearModel.update (thinc/learner.cpp:2395)()
AssertionError:
I think I've tracked that assertion down to here:
https://github.com/honnibal/thinc/blob/master/thinc/learner.pyx#L99
But I'm unclear as to why my class label is negative.
Hi,
Thanks for your patience and persistence! Sorry I haven't had much time to help yet.
How is the data in wsj.10.txt formatted? Are the tests passing for you?
This test shows passing a single training example to the train function: https://github.com/syllog1sm/redshift/blob/develop/tests/test_tagger.py
wsj.10.txt is PTB-formatted:
Why/WRB is/VBZ the/DT stock/NN market/NN suddenly/RB so/RB volatile/JJ ?/.
This seems to be the expected format for the Input.from_pos constructor.
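For anyone checking their own data against this format, here is a minimal sketch of splitting one such line into (word, tag) pairs. The helper name parse_tagged_line is hypothetical and this is not redshift's Input.from_pos implementation; it only assumes whitespace-separated tokens with the tag after the last slash.

```python
def parse_tagged_line(line):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs.

    Hypothetical helper for sanity-checking data; not part of redshift.
    """
    pairs = []
    for token in line.split():
        # Split on the LAST slash so words that contain '/' stay intact.
        word, sep, tag = token.rpartition('/')
        if not sep or not word or not tag:
            raise ValueError('token is not in word/TAG form: %r' % token)
        pairs.append((word, tag))
    return pairs


example = 'Why/WRB is/VBZ the/DT stock/NN market/NN suddenly/RB so/RB volatile/JJ ?/.'
for word, tag in parse_tagged_line(example):
    print('%s\t%s' % (word, tag))
```

A token with no slash, or with an empty word or tag, raises ValueError, which makes malformed lines easy to spot before training.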
I tried running the tests and two of them fail. As you can see from the snippet below, both failures come from the same AssertionError that I mentioned above:
➜ redshift git:(develop) ✗ py.test
========================================================= test session starts ==========================================================
platform darwin -- Python 2.7.6 -- py-1.4.26 -- pytest-2.6.4
collected 24 items
tests/test_ae.py .............
tests/test_edit_ae.py .....
tests/test_lexicon.py .
tests/test_parser.py E
tests/test_tagger.py ...E
================================================================ ERRORS ================================================================
_____________________________________________________ ERROR at setup of test_parse _____________________________________________________
    @pytest.fixture
    def train_dir():
        import redshift.parser
>       redshift.parser.train(train_str, model_dir)
tests/test_parser.py:20:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
redshift/parser.pyx:111: in redshift.parser.train (redshift/parser.cpp:3039)
parser.tagger.train_sent(py_sent)
redshift/tagger.pyx:122: in redshift.tagger.Tagger.train_sent (redshift/tagger.cpp:4013)
self.guide.update(counts)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E AssertionError
thinc/learner.pyx:81: AssertionError
______________________________________________________ ERROR at setup of test_tag ______________________________________________________
    @pytest.fixture
    def train_dir():
        import redshift.tagger
        sent_strs = []
        for sent_str in train_str.strip().split('\n\n'):
            sent = []
            for tok_str in sent_str.strip().split('\n'):
                fields = tok_str.split()
                sent.append('%s/%s' % (fields[1], fields[3]))
            sent_strs.append(' '.join(sent))
        train_pos = '\n'.join(sent_strs)
>       redshift.tagger.train(train_pos, model_dir)
tests/test_tagger.py:27:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
redshift/tagger.pyx:43: in redshift.tagger.train (redshift/tagger.cpp:2391)
tagger.train_sent(sent)
redshift/tagger.pyx:122: in redshift.tagger.Tagger.train_sent (redshift/tagger.cpp:4013)
self.guide.update(counts)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E AssertionError
thinc/learner.pyx:81: AssertionError
================================================== 22 passed, 2 error in 1.71 seconds ==================================================
Are you able to reproduce this? I'm running on OS X 10.10, using Python 2.7.6.
Okay, I think I've fixed this.
The underlying problem is that I've broken the perceptron code out into its own module, thinc, and I'd been building redshift against my local version of that library instead of the one on pip.
Try pulling the new version and running "pip install -r requirements.txt" to get thinc 1.50. Then run "fab clean make test".
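If it is unclear which copy of thinc is being picked up, a quick check along these lines may help. This is only a sketch: describe_install is a hypothetical helper, and it relies on importlib.metadata, which assumes Python 3.8 or later rather than the Python 2.7 setup used in this thread.

```python
import importlib
from importlib import metadata


def describe_install(dist_name, module_name=None):
    """Return (version, location) for a package; None where unknown.

    Hypothetical helper: version comes from the installed distribution
    metadata, location from the module that actually gets imported.
    """
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    try:
        module = importlib.import_module(module_name or dist_name)
        location = getattr(module, '__file__', None)
    except ImportError:
        location = None
    return version, location


# Reports the pip-visible version and the path the import resolves to.
print(describe_install('thinc'))
```

A location that points into a local checkout rather than site-packages would suggest the same kind of mismatch described above.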
Yay. Tests pass and I've trained a tagger. Thanks!
Great! Thanks for the bug reports. Let me know if you have any other problems.