Error in running ./run_pipeline.sh ./data/enwiki 500
ankiosa opened this issue · 9 comments
Traceback (most recent call last):
  File "preprocess.py", line 4, in <module>
    from preprocess.cleaner import clean_data
  File "../src/preprocess/cleaner.py", line 2, in <module>
    from pattern.en import tokenize
ImportError: No module named pattern.en
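The ImportError above means the `pattern` package is not installed in the Python environment running the pipeline (`pip install pattern` should fix it). A minimal sketch of how the import could degrade gracefully during testing; the whitespace fallback tokenizer is my own stand-in, not part of the repo:

```python
# Hedged sketch: the pipeline imports `tokenize` from pattern.en.
# If `pattern` is missing, fall back to a naive tokenizer so the
# rest of the preprocessing can at least be smoke-tested.
try:
    from pattern.en import tokenize  # requires `pip install pattern`
except ImportError:
    def tokenize(text):
        # naive stand-in: collapse whitespace, return one "sentence"
        return [" ".join(text.split())]

print(tokenize("Hello world"))
```

For real runs you still want the actual `pattern` package, since its tokenizer handles punctuation and sentence boundaries.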
Embeddings path: ../models/idWordVec_500.pklz
Traceback (most recent call last):
  File "../src/lstm/lstm.py", line 423, in <module>
    tf.app.run()
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "../src/lstm/lstm.py", line 360, in main
    embeddings = VectorManager.read_vector(FLAGS.embeddings)
  File "../src/utils/vector_manager.py", line 74, in read_vector
    with open(filename, 'rb') as f:
IOError: [Errno 2] No such file or directory: '../models/idWordVec_500.pklz'
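The IOError just means the embeddings file was never written, because preprocessing died on the earlier ImportError. A small guard could surface that with a clearer message than a bare IOError deep in lstm.py; `check_embeddings` is a hypothetical helper, not part of the repo:

```python
import os

# Hedged sketch: fail early with a hint when the embeddings file is
# missing, instead of raising IOError inside VectorManager.read_vector.
def check_embeddings(path):
    """Return True if the embeddings file exists, else print a hint."""
    if os.path.isfile(path):
        return True
    print("Embeddings file not found: %s" % path)
    print("Run ./run_pipeline.sh first and check for earlier errors "
          "in the preprocessing output.")
    return False

check_embeddings("../models/idWordVec_500.pklz")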
@ankiosa I've run the script and it does generate the idWordVec_500.pklz file. Can you check whether the file has been generated? Otherwise, there was probably an earlier error during the block that outputs:
[BLOCK] Created embeddings of size 500
@kafkasl I have run only these 2 commands
- ./wiki_extractor_launch.sh path_to_wikipedia_dump
- ./run_pipeline.sh ../data/enwiki 500
Yes, as far as I can see, no idWordVec_500.pklz file has been generated by the above commands. Can you please let me know how to generate that file if it has not been generated?
@ankiosa did you actually download a Wikipedia dump and replace path_to_wikipedia_dump with the actual path to the dump?
The run_pipeline.sh script should generate it. Can you copy-paste the full output of the command here?
@kafkasl
Sorry, it's my mistake; I have found that I don't have gensim installed.
Thanks for your response. It is running now.
The current output is:
Starting Preprocess pipeline
* Data path: ./data/enwiki
* Embedding size: 500
* Min count: 1
[BLOCK] Transforming sentences to 4-dimensional lists
Starting 16 processes to clean 0 files
Starting 16 processes to clean 100 files
Can you please let me know how long it will take? As I have seen in your reports, it takes a day or more.
Thanks,
@kafkasl I am getting this error when running ./run_pipeline.sh ./data/enwiki 500 from the bin folder.
Can you please tell me what changes I should make to resolve it?
Starting 16 processes to clean 100 files
[BLOCK] Done transforming data
Time cleaning data: 49615.1150281
Creating embeddings from cleaned data...
[BLOCK] Initializing MySentences from ./data/enwiki
Got 12176 files to turn into sentences
[BLOCK] Creating embeddings model
Traceback (most recent call last):
  File "preprocess.py", line 50, in <module>
    model = create_embeddings(data_path, emb_size, min_count)
  File "../src/preprocess/embeddings.py", line 62, in create_embeddings
    workers=mp.cpu_count())
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 527, in __init__
    fast_version=FAST_VERSION)
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 335, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 486, in build_vocab
    self.trainables.prepare_weights(self.hs, self.negative, self.wv, update=update, vocabulary=self.vocabulary)
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1402, in prepare_weights
    self.reset_weights(hs, negative, wv)
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1415, in reset_weights
    wv.vectors = empty((len(wv.vocab), wv.vector_size), dtype=REAL)
MemoryError
/home/ankit/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Embeddings path: ../models/idWordVec_500.pklz
Traceback (most recent call last):
  File "../src/lstm/lstm.py", line 423, in <module>
    tf.app.run()
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "../src/lstm/lstm.py", line 360, in main
    embeddings = VectorManager.read_vector(FLAGS.embeddings)
  File "../src/utils/vector_manager.py", line 74, in read_vector
    with open(filename, 'rb') as f:
IOError: [Errno 2] No such file or directory: '../models/idWordVec_500.pklz'
@ankiosa you are running out of memory. I am not sure how much RAM is required, but no less than 10 GB, I think. If you can't train the embeddings, I can send you mine.
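The MemoryError comes from gensim allocating the full (vocab_size x vector_size) float32 matrix up front in `reset_weights`. A quick back-of-the-envelope sketch of why 8 GB is not enough; the vocabulary size below is illustrative, not measured from this run:

```python
import numpy as np

# Hedged sketch: estimate the single word-vector matrix gensim allocates.
# With min_count=1 (as in the log above), every word in the Wikipedia dump
# enters the vocabulary, which can easily reach millions of entries.
def embedding_matrix_bytes(vocab_size, vector_size, dtype=np.float32):
    """Bytes needed just for the (vocab_size x vector_size) matrix."""
    return vocab_size * vector_size * np.dtype(dtype).itemsize

# e.g. a 5M-word vocabulary with 500-dim float32 vectors:
gib = embedding_matrix_bytes(5000000, 500) / float(2 ** 30)
print("%.1f GiB" % gib)  # ~9.3 GiB for that one matrix alone
```

Raising min_count (the pipeline ran with Min count: 1) or passing gensim's max_vocab_size option would shrink the vocabulary and bound this allocation.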
Thanks. Yes, I am running it on an 8 GB Linux machine; please send me the trained embeddings if possible, along with instructions for using them.
I am running another process on a 16 GB Mac laptop; it has been running for the last 2 days, so let's see if I can get output there.
Please send me the trained embeddings.
@ankiosa here are the pretrained embeddings (you should put them inside the models/ directory). Then run the new script run_short_pipeline.sh. Its parameters are described in the Readme.
https://www.dropbox.com/s/ws6d8l6h6jp3ldc/embeddings.tar.gz?dl=0
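Once the archive is extracted into models/, the file should load the same way the pipeline's own output does. A minimal sketch, assuming .pklz is a gzip-compressed pickle (which matches the 'rb' open in vector_manager.py, but verify against VectorManager.read_vector before relying on it):

```python
import gzip
import pickle

# Hedged guess: .pklz as gzip-compressed pickle. Check
# src/utils/vector_manager.py for the exact format the repo uses.
def read_pklz(path):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

def write_pklz(obj, path):
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f)
```

If loading fails with an unpickling error, the file may use a different compression or serialization, in which case the repo's own VectorManager is the authoritative reader.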