erickrf/nlpnet

Training new SRL model: Unexpected role data

GraphGrailAi opened this issue · 1 comment

I am trying to train a new SRL model:
root@engine:/var/www/engine/nlpnet-master/bin# nlpnet-train.py srl pred --gold train_google_ru.txt --data srl-model/
using a text file that contains two sentences, each on its own line:

Его уверенная поступь – предмет зависти топ-менеджеров и разработчиков по всему миру.
Для техно-евангелистов Google – это самая крупная жемчужина сокровищницы.

Running the command produces this error:


Reading training data...
Traceback (most recent call last):
  File "/usr/local/bin/nlpnet-train.py", line 248, in <module>
    text_reader = create_reader(args, md)
  File "/usr/local/bin/nlpnet-train.py", line 61, in create_reader
    only_predicates=args.predicates)
  File "/usr/local/lib/python3.4/dist-packages/nlpnet/srl/srl_reader.py", line 70, in __init__
    self._read_conll(filename)
  File "/usr/local/lib/python3.4/dist-packages/nlpnet/srl/srl_reader.py", line 130, in _read_conll
    tag, expected_role = self._read_role(tag, 'O', True)
  File "/usr/local/lib/python3.4/dist-packages/nlpnet/srl/srl_reader.py", line 185, in _read_role
    raise ValueError('Unexpected role data: %s' % role)
ValueError: Unexpected role data: по

The error is strange: the reader seems unable to handle certain words, for example "по", a Russian preposition that here corresponds to "over" in English (as in "all over the world"). If I remove the offending words, training runs as follows:

root@engine:/var/www/engine/nlpnet-master/bin# nlpnet-train.py srl pred --gold train_google_ru.txt --data srl-model/

Reading training data...
Loading vocabulary
Creating new network...
Generating word type features...
Created new network with the following layer sizes: 250, 50, 2

Training for up to 1 epochs
1 epochs   Error: 0.000000   Accuracy: 1.000000   0 corrections skipped   learning rate: 0.010000
Finished training

So, what is your advice for solving this? And how many times do I need to train the model to achieve good results?

Sorry for the very late response.

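This error message means that your training data is not formatted correctly. The traceback shows the reader trying to interpret the word "по" as a role tag, which suggests it is reading each line of plain text as a row of annotation columns. A description of the expected format can be found in the nlpnet documentation for SRL; it is basically the CoNLL format.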
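
As a rough illustration of what "CoNLL format" implies, here is a minimal sanity check you could run on a training file before passing it to `nlpnet-train.py`. It is only a sketch under stated assumptions: it checks the generic properties of CoNLL-style files (one token per line, whitespace-separated columns, a blank line between sentences, the same number of columns for every token in a sentence). It does not know which columns nlpnet actually expects, so the documentation remains the reference. The function name `check_conll_like` and the `min_columns` parameter are hypothetical and not part of nlpnet.

```python
def check_conll_like(lines, min_columns=2):
    """Return (line_number, message) pairs for lines that do not look CoNLL-like.

    Assumption: tokens of one sentence all have the same number of
    whitespace-separated columns, and sentences are separated by blank lines.
    The exact column layout nlpnet requires is NOT checked here.
    """
    problems = []
    expected = None  # column count seen on the first token of the current sentence
    for number, raw in enumerate(lines, start=1):
        fields = raw.split()
        if not fields:
            expected = None  # blank line marks a sentence boundary
            continue
        if len(fields) < min_columns:
            problems.append((number, "only %d column(s)" % len(fields)))
        if expected is None:
            expected = len(fields)
        elif len(fields) != expected:
            problems.append(
                (number, "%d columns, expected %d" % (len(fields), expected))
            )
    return problems


# Plain running text, one sentence per line, is flagged because the
# "column" count simply follows the number of words on each line.
for line_number, message in check_conll_like(["one two three", "four five"]):
    print("line %d: %s" % (line_number, message))
```

A file with one raw sentence per line, like the one in the question, will usually fail this check, because each line has as many "columns" as it has words. A file that passes may still be rejected by nlpnet if the columns themselves hold the wrong kind of data.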