erickrf/nlpnet

English models

tek opened this issue · 21 comments

tek commented

Hey there, any chance there are models for the English language available, or is this only usable with Portuguese so far?

Hi,
currently, it only works with Portuguese. For English, you should see SENNA
(http://ronan.collobert.com/senna/), which has a very similar architecture.

2014/1/17 tek notifications@github.com


tek commented

I've been using SENNA; I just wanted to try your implementation for comparison, and native Python provides a little more comfort. Also, SENNA seems to have problems with sentences whose verb is a form of "to be"…

Sorry for the delayed answer; only recently was I able to look carefully at the SRL modules. You can actually train an nlpnet model for SRL in any language. I just don't have any trained models so far.

The only (probably innocuous) catch is that the class SRLReader will try to make word contractions wherever they exist in Portuguese. In English, none would ever trigger, but you might want to comment that code out. I intend to make it optional in later versions.

@erickrf do you have a quick example of how I can train the SRL for a different language?

Use the "simplified-data" branch, I have started making it easier to use. You'll have to do some things manually, but I plan to simplify it further when I have the time.

You should use the nlpnet-train script, as described in http://nilc.icmc.usp.br/nlpnet/scripts.html#nlpnet-train

It will search for the following files in the data directory (the argument for --data):

  1. If you provide pre-trained embeddings, they must be a numpy 2d array (num_words x num_features) saved to a file. If you train a network for argument boundary identification, it must be named types-features-id.npy. For argument classification, types-features-class.npy. For the 2-in-1 pass, it must be types-features-1step.npy. For predicate detection, types-features-preds.npy.

  2. You must provide a vocabulary.txt file in the data directory with one token per line (including punctuation tokens). If you provided the embeddings file, the number of the line corresponds to the number of the row in the matrix. If you generate random features with nlpnet, all words not in this file are mapped to the rare (or unknown) word.

  3. You must provide an srl-tags.txt file with one tag per line. Tags shouldn't have IOBES prefixes; nlpnet takes care of it (e.g., A0, A1, AM-LOC, etc.)
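The vocabulary and tag files described above can be sketched as follows. This is a minimal illustration, not part of nlpnet: the helper names `strip_iobes` and `write_data_files` and the toy sentence are hypothetical, and dropping the bare outside tag `O` is an assumption that the thread does not confirm. It only writes vocabulary.txt and srl-tags.txt; a pre-trained embeddings matrix, if used, would have to be saved separately with its rows in the same order as the vocabulary lines.

```python
import os
import tempfile

# Prefixes used by the IOBES scheme; nlpnet is said to add them itself.
IOBES_PREFIXES = ("I-", "B-", "E-", "S-")


def strip_iobes(tag):
    """Return the bare role label, e.g. 'B-A0' -> 'A0'."""
    if tag[:2] in IOBES_PREFIXES:
        return tag[2:]
    return tag


def write_data_files(sentences, data_dir):
    """Write vocabulary.txt and srl-tags.txt, one entry per line.

    sentences: list of sentences, each a list of (token, tag) pairs.
    Returns the vocabulary and tag lists in the order they were written.
    """
    vocabulary, tags = [], []
    seen_tokens, seen_tags = set(), set()
    for sentence in sentences:
        for token, tag in sentence:
            if token not in seen_tokens:
                seen_tokens.add(token)
                vocabulary.append(token)
            label = strip_iobes(tag)
            # Assumption: the outside tag "O" is not listed in srl-tags.txt.
            if label != "O" and label not in seen_tags:
                seen_tags.add(label)
                tags.append(label)
    with open(os.path.join(data_dir, "vocabulary.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(vocabulary) + "\n")
    with open(os.path.join(data_dir, "srl-tags.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(tags) + "\n")
    return vocabulary, tags


if __name__ == "__main__":
    # Toy data for illustration only; punctuation is kept as a token.
    toy = [[("The", "B-A0"), ("cat", "E-A0"), ("sat", "O"),
            ("there", "S-AM-LOC"), (".", "O")]]
    with tempfile.TemporaryDirectory() as data_dir:
        print(write_data_files(toy, data_dir))
```

Because the line number in vocabulary.txt is what ties a token to a row of the embeddings matrix, the sketch preserves first-seen order instead of sorting.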

General observations:

  • If you train separate networks for argument identification and classification, note that both are trained independently.
  • I had good results with all learning rates set to the same value: a couple of iterations at 0.01, followed by a few tens at 0.001.

I am having trouble finding a CoNLL 2009 compatible English dataset to test nlpnet-train with... any ideas where I can download such data?

Getting English SRL data involves licensing issues. The PropBank annotation layer is freely available at http://www.lsi.upc.edu/~srlconll/soft.html, but that annotation doesn't include the actual tokens, because the Penn Treebank is only available through the LDC.

IIRC, NLTK includes a small sample of the Penn Treebank with semantic roles. I'm not sure whether you'd need to do some preprocessing.

Hi,
I tried to run your training script on some sample data I got, starting with POS just for the beginning. I tried a few parameter combinations without much success. Could you show sample input parameters for training on your data? I tried POS with ~800 tagged sentences; for vocabulary.txt I pulled all the words from this set, and for pos-tags.txt all the tags from the training set. I guessed some of the parameters, but was not able to train the model. Could you give me some advice on how to create a model using your trainer? Here is the output I am getting:

nlpnet-train.py -l 0.01 --caps -n 200 --task pos --data model -v --gold pos.txt
Reading text...
Loading vocabulary
Done. Dictionary size is 3379 types
Generating word type features...
Generated 3382 feature vectors with 50 features each.
Generating capitalization features...
Generated 4 feature vectors with 5 features each.
Creating new network...
Created new network with the following layer sizes: 275, 200, 49
Starting training with 787 sentences
Training for up to 100 epochs
Traceback (most recent call last):
  File "C:\projects\nlpnet\bin\nlpnet-train.py", line 209, in <module>
    train(text_reader, args)
  File "C:\projects\nlpnet\bin\nlpnet-train.py", line 167, in train
    args.iterations, intervals, args.accuracy)
  File "network.pyx", line 421, in nlpnet.network.Network.train (nlpnet/network.c:7657)
  File "network.pyx", line 479, in nlpnet.network.Network._train_epoch (nlpnet/network.c:8488)
  File "network.pyx", line 203, in nlpnet.network.Network._tag_sentence (nlpnet/network.c:4550)
  File "network.pyx", line 127, in nlpnet.network.Network.run (nlpnet/network.c:3543)
IndexError: index 4019 is out of bounds for axis 0 with size 3382

@lazymachinist Apparently, nlpnet tried to access word number 4019 in your feature table, but the table only had 3382 rows. This is strange; I have never seen this error. Can I see your input data (vocabulary.txt, pos-tags.txt and training data)?
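The cause of this error was never established in the thread. One cheap sanity check, sketched below, is to compare the number of lines in vocabulary.txt with the number of distinct entries, on the assumption (not confirmed here) that blank or duplicate lines could make word indices and the feature table disagree. The function name `vocabulary_report` is hypothetical and not part of nlpnet.

```python
from collections import Counter


def vocabulary_report(lines):
    """Summarize an iterable of vocabulary lines (e.g. an open file)."""
    entries = [line.rstrip("\r\n") for line in lines]
    counts = Counter(entries)
    return {
        "lines": len(entries),
        "distinct": len(counts),
        "blank": sum(1 for entry in entries if not entry.strip()),
        "duplicates": sorted(e for e, n in counts.items() if n > 1),
    }


if __name__ == "__main__":
    # In-memory stand-in for a file; with real data one would pass
    # open("vocabulary.txt", encoding="utf-8") instead.
    sample = ["the\n", "cat\n", "the\n", "\n"]
    print(vocabulary_report(sample))
```

If "lines" and "distinct" differ, or "blank" is non-zero, the file is worth cleaning before training again.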

For anyone reading this issue: I merged the "simplified-data" branch into the master branch and deleted it.

By the way: you can add the lines rare, left and right to the vocabulary.txt file to represent, respectively, the rare/unknown tokens, left padding and right padding.

Hi Erick,
I don't know if you got the email I sent you with the training data, vocabulary and pos-tags files that caused the error. It would be great if you could look at them and point me to the source of the problem.

No, I didn't get it. Are you sure you sent it to the right Gmail address?

Hi Erick,
I used the email displayed on your GitHub profile, i.e. Gmail's erickrfonseca. I have re-sent it.

Hi Erick,

I'm facing the following issue:
Reading text...
Loading vocabulary
Done. Dictionary size is 129998 types
Creating new network...
Loading word vectors...
Generating capitalization features...
Generated 5 feature vectors with 5 features each.
Generating suffix features...
Generated 457 feature vectors with 5 features each.
Generating gazetteer features...
Generated 2 feature vectors with 5 features each.
Generated 2 feature vectors with 5 features each.
Generated 2 feature vectors with 5 features each.
Generated 2 feature vectors with 5 features each.
Created new network with the following layer sizes: 400, 300, 17
Starting training with 22138 sentences
Network weights learning rate: 0.000100
Feature vectors learning rate: 0.010000
Tag transition matrix learning rate: 0.010000
Training for up to 40 epochs
Traceback (most recent call last):
  File "bin/nlpnet-train.py", line 267, in <module>
    train(text_reader, args)
  File "bin/nlpnet-train.py", line 210, in train
    args.iterations, report_intervals, args.accuracy)
  File "network.pyx", line 574, in nlpnet.network.Network.train (nlpnet/network.c:8601)
  File "network.pyx", line 636, in nlpnet.network.Network._train_epoch (nlpnet/network.c:9546)
  File "network.pyx", line 310, in nlpnet.network.Network._tag_sentence (nlpnet/network.c:5898)
ValueError: all the input arrays must have same number of dimensions

Hi @kiran-surya,

Your log output mentions "gazetteer features", which are not part of my original nlpnet. I suppose you are using some fork with NER functionality?

Hi @erickrf,

Thanks for your prompt reply. Yes, I'm using NER functionality, but that should not affect the vector dimensions, correct?

No, it shouldn't, but I can't help much if the code is from a fork. The problem you describe (at least judging by the line numbers in the stack trace) doesn't match the current nlpnet version.

Thank you.

Hi Erick
I have installed nlpnet successfully, but when I try to set the path (the folder containing the models), it always gives the error ": No such file or directory".
I tried both from the shell:
$ nlpnet-tag.py srl --data /media/backup/NLP_tool/nlpnet/data --lang pt

and also from the Python command line:

nlpnet.set_data_dir('/media/backup/NLP_tool/nlpnet/data')

Both commands give the same error.
The /media/backup/NLP_tool/nlpnet/data folder contains all the .zip model files and also the extracted files. I also tried giving the path of an extracted folder, like '/media/backup/NLP_tool/nlpnet/data/pos-pt'.

Could you please tell me where I went wrong?

Thanks in advance.
Shrestha

Hi Shrestha,

The data parameter value should be the path to the uncompressed folder, i.e. the folder that directly contains vocabulary.txt, the .npz files etc.
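A quick way to check a candidate folder before passing it to --data or nlpnet.set_data_dir is sketched below. The function `check_data_dir` is hypothetical and not part of nlpnet, and the two conditions it tests (a vocabulary.txt file and at least one .npz file directly inside the folder) are only a heuristic reading of the description above, not a complete list of what nlpnet needs.

```python
import os
import tempfile


def check_data_dir(path):
    """Return a list of human-readable problems; empty if none were found."""
    if not os.path.isdir(path):
        return ["not a directory: " + path]
    names = os.listdir(path)
    problems = []
    if "vocabulary.txt" not in names:
        problems.append("vocabulary.txt is missing")
    if not any(name.endswith(".npz") for name in names):
        problems.append("no .npz files found")
    return problems


if __name__ == "__main__":
    # An empty temporary folder stands in for a wrongly chosen path.
    with tempfile.TemporaryDirectory() as folder:
        print(check_data_dir(folder))
```

A folder that holds only .zip archives or sub-folders would report both problems, which suggests pointing one level deeper, at the extracted model folder itself.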

If you still get the error, please copy the full error message from nlpnet. Also, check if you are using the most recent version of code and data (they haven't been updated in some months, though).

@nshresthan I use Linux and have had this problem. It was solved by converting the scripts from DOS to Unix line endings (with the command sudo fromdos /usr/local/bin/nlpnet-*).
I converted all the nlpnet-*.py files.

Thanks to http://stackoverflow.com/questions/19764710/python-script-gives-no-such-file-or-directory
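Where fromdos is not installed, the same conversion can be sketched in Python. The reason it helps: if a script's first line ends in a carriage return, the system looks for an interpreter whose name includes that stray character, which produces the bare ": No such file or directory" message. The function names and the script name in the usage note are hypothetical; the file paths are taken from the command line.

```python
import sys


def crlf_to_lf(data):
    """Replace DOS (CRLF) line endings with Unix (LF) in a bytes object."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite the file at path in place; return True if it was changed."""
    with open(path, "rb") as f:
        original = f.read()
    converted = crlf_to_lf(original)
    if converted != original:
        with open(path, "wb") as f:
            f.write(converted)
    return converted != original


if __name__ == "__main__":
    # Files to convert are given as command-line arguments.
    for script in sys.argv[1:]:
        print(script, "converted" if convert_file(script) else "unchanged")
```

Saved as, say, fix_endings.py, it could be run as `sudo python fix_endings.py /usr/local/bin/nlpnet-*`, mirroring the fromdos command above. Files are read and written in binary mode so that nothing other than the line endings is touched.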