ray1007/pytorch-word2vec

Skipgram mode has bugs

Closed this issue · 1 comments

Hi Tzu-Ray,

I tested your code with --cbow 0, and it gives this error:

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/yvx5085/Github/pytorch-word2vec/main.py", line 212, in train_process_worker
    data_queue.put(data)
NameError: name 'data' is not defined

So I went to line 212 of main.py, and noticed that 'data' might need to be replaced by 'chunk'.
Then I made the change and ran the code again, and got another error:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/yvx5085/Github/pytorch-word2vec/main.py", line 212, in train_process_worker
    data_queue.put(chunk)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/queues.py", line 341, in put
    obj = _ForkingPickler.dumps(obj)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "stringsource", line 2, in View.MemoryView._memoryviewslice.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__

So I guess there is a bug in the sg_producer function.
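If it helps: a multiprocessing queue pickles every object you put on it, and memoryview objects (including Cython typed memoryviews, which is what the `__reduce_cython__` frame hints at) cannot be pickled. A minimal sketch of the issue and one possible workaround (converting the view to a concrete NumPy array before queuing it — hypothetical, I haven't checked what sg_producer actually produces):

```python
import pickle
import numpy as np

buf = np.arange(8, dtype=np.int64)
view = memoryview(buf)  # stand-in for a Cython typed memoryview

# Pickling a plain memoryview fails, just like data_queue.put(chunk) did,
# because multiprocessing pickles the object before sending it.
try:
    pickle.dumps(view)
    picklable = True
except TypeError:
    picklable = False

# Converting the view into a concrete ndarray makes it picklable again,
# so the queue can ship it to the consumer process.
arr = np.asarray(view)
roundtrip = pickle.loads(pickle.dumps(arr))
```

Here `picklable` ends up False, while `roundtrip` carries the same data as `buf`.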

BTW, I have tried --cbow 1 on a small English dataset, and it worked fine.
I like this project a lot: it combines the flexibility of PyTorch with Cython, so I get very good training speed without having to deal with, for example, custom ops in TensorFlow.

I will run more tests on Chinese data and let you know the results.

Best of luck!
Yang

Hi Yang,

You're right, this is a bug. I fixed it, ran some tests, and pushed a new commit. Please check out the changes and let me know if there are still any problems. 😃

Thanks for the kind words about this repo! FYI, I am using the Chinese Gigaword corpus, and I plan to use text8 for benchmarking (in the README).

Tzu-Ray