kafkasl/contextualLSTM

Error in running ./run_pipeline.sh ./data/enwiki 500

ankiosa opened this issue · 9 comments

Traceback (most recent call last):
File "preprocess.py", line 4, in
from preprocess.cleaner import clean_data
File "../src/preprocess/cleaner.py", line 2, in
from pattern.en import tokenize
ImportError: No module named pattern.en
Embeddings path: ../models/idWordVec_500.pklz
Traceback (most recent call last):
File "../src/lstm/lstm.py", line 423, in
tf.app.run()
File "/home/ankit/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "../src/lstm/lstm.py", line 360, in main
embeddings = VectorManager.read_vector(FLAGS.embeddings)
File "../src/utils/vector_manager.py", line 74, in read_vector
with open(filename, 'rb') as f:
IOError: [Errno 2] No such file or directory: '../models/idWordVec_500.pklz'
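The first traceback is a missing dependency (the pattern package). A minimal check, assuming the required packages are the ones named in the tracebacks of this thread (pattern, gensim, tensorflow) rather than taken from the repo's requirements:

```python
# Quick check of the Python packages the pipeline imports; the list below
# is inferred from the tracebacks in this thread, not from the repo's
# requirements, so adjust it to match your environment.
import importlib

for name in ("pattern.en", "gensim", "tensorflow"):
    try:
        importlib.import_module(name)
        print("OK      %s" % name)
    except ImportError as err:
        print("MISSING %s (%s)" % (name, err))
```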

@ankiosa I've run the script and it does generate the idWordVec_500.pklz file. Can you check if the file has been generated? Otherwise, there was probably an earlier error during the block that outputs:

[BLOCK] Created embeddings of size 500
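A quick way to verify whether that file exists and can be read. This sketch assumes the .pklz extension means a gzip-compressed pickle, which is only a guess from the extension; the repo's VectorManager.read_vector is the authoritative loader.

```python
# Check whether the embeddings file was produced and is readable.
# Assumes .pklz is a gzip-compressed pickle (a guess from the extension);
# the repo's VectorManager.read_vector is the real loader.
import gzip
import os
import pickle

path = "../models/idWordVec_500.pklz"
if not os.path.exists(path):
    print("Not generated yet: %s" % path)
else:
    with gzip.open(path, "rb") as f:
        embeddings = pickle.load(f)
    print("Loaded %d entries from %s" % (len(embeddings), path))
```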

@kafkasl I have run only these 2 commands

  1. ./wiki_extractor_launch.sh path_to_wikipedia_dump
  2. ./run_pipeline.sh ../data/enwiki 500

Yes, as far as I can see, no idWordVec_500.pklz file has been generated by the above commands. Can you please let me know how to generate that file if it has not been generated?

@ankiosa did you actually download a Wikipedia dump and replace path_to_wikipedia_dump with the actual path to the dump?
run_pipeline.sh should generate it; can you copy-paste the full output of the command here?

@kafkasl
Sorry, it's my mistake; I found that I don't have gensim installed.

Thanks for your response. It is running now.
Currently the output is:
Starting Preprocess pipeline
* Data path: ./data/enwiki
* Embedding size: 500
* Min count: 1
[BLOCK] Transforming sentences to 4-dimensional lists
Starting 16 processes to clean 0 files
Starting 16 processes to clean 100 files

Can you please let me know how much time it will take? From what I have seen in your reports, it takes more than a day or so.
Thanks,

@kafkasl I am getting this error when running ./run_pipeline.sh ./data/enwiki 500 from the bin folder.

Can you please see what changes I should make to resolve it?

Starting 16 processes to clean 100 files
[BLOCK] Done transforming data
Time cleaning data: 49615.1150281
Creating embeddings from cleaned data...
[BLOCK] Initializing MySentences from ./data/enwiki
Got 12176 files to turn into sentences
[BLOCK] Creating embeddings model
Traceback (most recent call last):
File "preprocess.py", line 50, in
model = create_embeddings(data_path, emb_size, min_count)
File "../src/preprocess/embeddings.py", line 62, in create_embeddings
workers=mp.cpu_count())
File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 527, in init
fast_version=FAST_VERSION)
File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 335, in init
self.build_vocab(sentences, trim_rule=trim_rule)
File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 486, in build_vocab
self.trainables.prepare_weights(self.hs, self.negative, self.wv, update=update, vocabulary=self.vocabulary)
File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1402, in prepare_weights
self.reset_weights(hs, negative, wv)
File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1415, in reset_weights
wv.vectors = empty((len(wv.vocab), wv.vector_size), dtype=REAL)
MemoryError
/home/ankit/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Embeddings path: ../models/idWordVec_500.pklz
Traceback (most recent call last):
File "../src/lstm/lstm.py", line 423, in
tf.app.run()
File "/home/ankit/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "../src/lstm/lstm.py", line 360, in main
embeddings = VectorManager.read_vector(FLAGS.embeddings)
File "../src/utils/vector_manager.py", line 74, in read_vector
with open(filename, 'rb') as f:
IOError: [Errno 2] No such file or directory: '../models/idWordVec_500.pklz'

@ankiosa you are running out of memory. I am not sure how much RAM is required, but I think no less than 10 GB. If you can't train the embeddings, I can send you mine.
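If retraining locally is still an option, one possible workaround is to bound the vocabulary that gensim keeps in memory, since the MemoryError comes from allocating the (vocabulary size x embedding size) weight matrix. A rough sketch, assuming gensim 3.x parameter names (size rather than vector_size) and a placeholder corpus in place of the repo's MySentences iterator:

```python
# Sketch of retraining with a bounded vocabulary so the weight matrix fits
# in limited RAM. Parameter names follow gensim 3.x (size=...); gensim 4.x
# uses vector_size=. The corpus is a tiny repeated placeholder so the
# example runs; the repo builds its real corpus with MySentences.
import multiprocessing as mp
from gensim.models import Word2Vec

sentences = [["a", "placeholder", "sentence"], ["another", "short", "one"]] * 10

model = Word2Vec(
    sentences,
    size=500,               # embedding dimensionality
    min_count=5,             # drop rare words instead of keeping min_count=1
    max_vocab_size=500000,   # hard cap on the vocabulary kept in memory
    workers=mp.cpu_count(),
)
model.save("word2vec_500.model")
```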

@kafkasl

Thanks. Yes, I am running it on an 8 GB Linux machine. Please send me the trained embeddings if possible, along with instructions for using them.
I am also running another process on a Mac laptop with 16 GB; it has been running for the last 2 days, so let's see if I can get output there.

Please send me the trained embeddings.

@ankiosa here are the pretrained embeddings (you should put them inside the models/ directory). Then run the new script run_short_pipeline.sh. Its parameters are documented in the README.

https://www.dropbox.com/s/ws6d8l6h6jp3ldc/embeddings.tar.gz?dl=0
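The archive can then be unpacked into models/ before running run_short_pipeline.sh. The sketch below assumes the extracted files belong directly under models/, which is a guess about the archive layout, so inspect the member names first:

```python
# Unpack the downloaded archive into models/ before running
# run_short_pipeline.sh. The archive layout is an assumption; list the
# member names first and adjust the target path if needed.
import tarfile

with tarfile.open("embeddings.tar.gz", "r:gz") as tar:
    print(tar.getnames())            # inspect contents before extracting
    tar.extractall(path="models/")   # unpack into the models/ directory
```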