Error in running ./run_pipeline.sh ./data/enwiki 500
ankiosa opened this issue · 9 comments
Traceback (most recent call last):
  File "preprocess.py", line 4, in <module>
    from preprocess.cleaner import clean_data
  File "../src/preprocess/cleaner.py", line 2, in <module>
    from pattern.en import tokenize
ImportError: No module named pattern.en
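The ImportError above means the `pattern` package is not installed in the Python environment running the pipeline (`pip install pattern` should fix it). A minimal sketch of how the import could degrade gracefully during testing; the whitespace fallback tokenizer is my own stand-in, not part of the repo:

```python
# Hedged sketch: the pipeline imports `tokenize` from pattern.en.
# If `pattern` is missing, fall back to a naive tokenizer so the
# rest of the preprocessing can at least be smoke-tested.
try:
    from pattern.en import tokenize  # requires `pip install pattern`
except ImportError:
    def tokenize(text):
        # naive stand-in: collapse whitespace, return one "sentence"
        return [" ".join(text.split())]

print(tokenize("Hello world"))
```

For real runs you still want the actual `pattern` package, since its tokenizer handles punctuation and sentence boundaries.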
Embeddings path: ../models/idWordVec_500.pklz
Traceback (most recent call last):
  File "../src/lstm/lstm.py", line 423, in <module>
    tf.app.run()
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "../src/lstm/lstm.py", line 360, in main
    embeddings = VectorManager.read_vector(FLAGS.embeddings)
  File "../src/utils/vector_manager.py", line 74, in read_vector
    with open(filename, 'rb') as f:
IOError: [Errno 2] No such file or directory: '../models/idWordVec_500.pklz'
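The IOError just means the embeddings file was never written, because preprocessing died on the earlier ImportError. A small guard could surface that with a clearer message than a bare IOError deep in lstm.py; `check_embeddings` is a hypothetical helper, not part of the repo:

```python
import os

# Hedged sketch: fail early with a hint when the embeddings file is
# missing, instead of raising IOError inside VectorManager.read_vector.
def check_embeddings(path):
    """Return True if the embeddings file exists, else print a hint."""
    if os.path.isfile(path):
        return True
    print("Embeddings file not found: %s" % path)
    print("Run ./run_pipeline.sh first and check for earlier errors "
          "in the preprocessing output.")
    return False

check_embeddings("../models/idWordVec_500.pklz")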
@ankiosa I've run the script and it does generate the idWordVec_500.pklz file. Can you check whether the file has been generated? Otherwise, there was probably an earlier error during the block that outputs:
[BLOCK] Created embeddings of size 500
@kafkasl I have run only these 2 commands
- ./wiki_extractor_launch.sh path_to_wikipedia_dump
- ./run_pipeline.sh ../data/enwiki 500
Yes, as far as I can see, no idWordVec_500.pklz file has been generated by the above commands. Can you please let me know how to generate that file if it has not been generated?
@ankiosa did you actually download a Wikipedia dump and replace path_to_wikipedia_dump with the actual path to the dump?
The run_pipeline.sh script should generate it. Can you copy-paste the full output of the command here?
@kafkasl
Sorry, it's my mistake; I have found that I don't have gensim installed.
Thanks for your response. It is running now.
The current output is:
Starting Preprocess pipeline
* Data path: ./data/enwiki
* Embedding size: 500
* Min count: 1
[BLOCK] Transforming sentences to 4-dimensional lists
Starting 16 processes to clean 0 files
Starting 16 processes to clean 100 files
Can you please let me know how long it will take? As I have seen in your reports, it takes a day or more.
Thanks,
@kafkasl I am getting this error when running ./run_pipeline.sh ./data/enwiki 500 from the bin folder.
Can you please tell me what changes I should make to resolve it?
Starting 16 processes to clean 100 files
[BLOCK] Done transforming data
Time cleaning data: 49615.1150281
Creating embeddings from cleaned data...
[BLOCK] Initializing MySentences from ./data/enwiki
Got 12176 files to turn into sentences
[BLOCK] Creating embeddings model
Traceback (most recent call last):
  File "preprocess.py", line 50, in <module>
    model = create_embeddings(data_path, emb_size, min_count)
  File "../src/preprocess/embeddings.py", line 62, in create_embeddings
    workers=mp.cpu_count())
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 527, in __init__
    fast_version=FAST_VERSION)
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 335, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 486, in build_vocab
    self.trainables.prepare_weights(self.hs, self.negative, self.wv, update=update, vocabulary=self.vocabulary)
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1402, in prepare_weights
    self.reset_weights(hs, negative, wv)
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1415, in reset_weights
    wv.vectors = empty((len(wv.vocab), wv.vector_size), dtype=REAL)
MemoryError
/home/ankit/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Embeddings path: ../models/idWordVec_500.pklz
Traceback (most recent call last):
  File "../src/lstm/lstm.py", line 423, in <module>
    tf.app.run()
  File "/home/ankit/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "../src/lstm/lstm.py", line 360, in main
    embeddings = VectorManager.read_vector(FLAGS.embeddings)
  File "../src/utils/vector_manager.py", line 74, in read_vector
    with open(filename, 'rb') as f:
IOError: [Errno 2] No such file or directory: '../models/idWordVec_500.pklz'
@ankiosa you are running out of memory. I am not sure how much RAM is required, but no less than 10 GB, I think. If you can't train the embeddings, I can send you mine.
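The MemoryError comes from gensim allocating the full (vocab_size x vector_size) float32 matrix up front in `reset_weights`. A quick back-of-the-envelope sketch of why 8 GB is not enough; the vocabulary size below is illustrative, not measured from this run:

```python
import numpy as np

# Hedged sketch: estimate the single word-vector matrix gensim allocates.
# With min_count=1 (as in the log above), every word in the Wikipedia dump
# enters the vocabulary, which can easily reach millions of entries.
def embedding_matrix_bytes(vocab_size, vector_size, dtype=np.float32):
    """Bytes needed just for the (vocab_size x vector_size) matrix."""
    return vocab_size * vector_size * np.dtype(dtype).itemsize

# e.g. a 5M-word vocabulary with 500-dim float32 vectors:
gib = embedding_matrix_bytes(5000000, 500) / float(2 ** 30)
print("%.1f GiB" % gib)  # ~9.3 GiB for that one matrix alone
```

Raising min_count (the pipeline ran with Min count: 1) or passing gensim's max_vocab_size option would shrink the vocabulary and bound this allocation.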
Thanks. Yes, I am running it on an 8 GB Linux machine; please send me the trained embeddings if possible, along with instructions for using them.
I am running another process on a 16 GB Mac laptop; it has been running for the last 2 days, so let's see if I can get output there.
Please send me the trained embeddings.
@ankiosa here are the pretrained embeddings (you should put them inside the models/ directory). Then run the new script run_short_pipeline.sh. Its parameters are described in the Readme.
https://www.dropbox.com/s/ws6d8l6h6jp3ldc/embeddings.tar.gz?dl=0
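Once the archive is extracted into models/, the file should load the same way the pipeline's own output does. A minimal sketch, assuming .pklz is a gzip-compressed pickle (which matches the 'rb' open in vector_manager.py, but verify against VectorManager.read_vector before relying on it):

```python
import gzip
import pickle

# Hedged guess: .pklz as gzip-compressed pickle. Check
# src/utils/vector_manager.py for the exact format the repo uses.
def read_pklz(path):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

def write_pklz(obj, path):
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f)
```

If loading fails with an unpickling error, the file may use a different compression or serialization, in which case the repo's own VectorManager is the authoritative reader.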