anhaidgroup/deepmatcher

Error in running dm.data.process

Opened this issue · 7 comments

I am getting this error while running the following code:

train, validation, test = dm.data.process(
path='sample_data/itunes-amazon',
train='train.csv',
validation='validation.csv',
test='test.csv')

Error:

Reading and processing data from "sample_data/itunes-amazon/train.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/validation.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/test.csv"
0% [############################# ] 100% | ETA: 00:00:00

ValueError Traceback (most recent call last)
in <module>()
3 train='train.csv',
4 validation='validation.csv',
----> 5 test='test.csv')

7 frames
/usr/local/lib/python3.6/dist-packages/fastText/FastText.py in __init__(self, model)
35 self.f = fasttext.fasttext()
36 if model is not None:
---> 37 self.f.loadModel(model)
38
39 def is_quantized(self):

ValueError: /root/.vector_cache/wiki.en.bin has wrong file format!

I am getting the same issue -- was just about to post this!

It appears that the fastText file format may have recently changed. For now, could you try using an earlier version of fastText (https://pypi.org/project/fasttext/#history), perhaps 0.9.1?
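For reference, pinning that version from the PyPI history above would look like this (a sketch assuming the PyPI fasttext package is the one deepmatcher ends up loading; the traceback path points at the fastText package from the Facebook repo, so the exact package name may differ in your environment):

pip uninstall -y fasttext
pip install fasttext==0.9.1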

I tried fasttext 0.9.1, but I get the same error with it. With the earlier version 0.8.4, deepmatcher failed to install.

Maybe just try reinstalling deepmatcher with:
pip install git+https://github.com/anhaidgroup/deepmatcher.git
The PyPI package has not yet been updated with the recent modifications.

I'm having the same error, and reinstalling deepmatcher didn't help. Was anyone able to solve this?

Apparently this happens because the word embedding model fails to download correctly from Google Drive. Let me look into this. For now, you can get it working by adding these lines before dm.data.process, as in this Colab: https://colab.research.google.com/drive/1Qqx4FCj3JKt1oGHslsO3M8BXgyhXLWfp#scrollTo=os5kG_92eMwT

!wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.zip --directory-prefix=/root/.vector_cache
!unzip /root/.vector_cache/wiki.en.zip -d /root/.vector_cache/
!rm /root/.vector_cache/wiki.en.vec

This fetches the model zip directly from Facebook AI, but it is slower and takes more disk space since the zip contains additional data.
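As an optional sanity check after the download (a sketch, not part of deepmatcher; fasttext.load_model is the loader in the PyPI fasttext package, and the path below assumes the default vector cache location from the traceback), you can confirm the binary opens without the format error before re-running dm.data.process:

import fasttext

# Load the freshly unzipped binary from the default cache path
model = fasttext.load_model('/root/.vector_cache/wiki.en.bin')
print(model.get_dimension())  # wiki.en.bin embeddings are 300-dimensional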

Excellent, thanks @sidharthms! In the meantime, my friend got this working by increasing the Colab RAM from 12 GB to 25 GB, but this workaround is good to know as well.
Thanks again!