facebookresearch/SentAugment

Facing multiple issues while running src/flat_retrieve.py

karthickpgunasekaran opened this issue · 1 comment

Hello All,
I am trying to run SentAugment as part of my project for clustering purposes, but I am running into multiple issues. I am using a portion of the CommonCrawl data for this purpose.

Issue 1:
File "src/flat_retrieve.py", line 37, in
_, indices = torch.topk(scores, params.k, dim=0) # K x Q
NameError: name 'params' is not defined

File "SentAugment/src/flat_retrieve.py", line 42, in
for k in range(K):
NameError: name 'K' is not defined

Proposed Solution:

Is this a bug? Should it be args.K instead of both params.k and the bare K?

Issue 2:
File "src/flat_retrieve.py", line 43, in
print(IndexTextQuery(txt_mmap, ref_mmap, indices[k][qeury_idx]))
File "/home/username/FAIRCluster/SentAugment/src/indexing.py", line 95, in IndexTextQuery
return b[0:i].decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 8: invalid continuation byte

I followed all the steps mentioned, but I get this error when running step 3. Can somebody help with that?

Issue 3:
After removing the decode call and trying to run it again, I get the following:

File "src/flat_retrieve.py", line 43, in
print(IndexTextQuery(txt_mmap, ref_mmap, indices[k][qeury_idx]))
File "/home/username/FAIRCluster/SentAugment/src/indexing.py", line 92, in IndexTextQuery
while txt_mmap[p+i] != 10 and i < dim:
File "/home/username/anaconda3/envs/envConda6/lib/python3.7/site-packages/numpy/core/memmap.py", line 331, in getitem
res = super(memmap, self).getitem(index)
IndexError: index 25580 is out of bounds for axis 0 with size 8000

Any guess on what's wrong here?

Thanks in advance. Any help appreciated!

dahrs commented

I have the exact same problems, at least for Issues 1 and 3.
I solved Issue 1 by replacing params.k with args.K and range(K) with range(args.K).
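After that change, the relevant part of flat_retrieve.py looks roughly like this (a minimal sketch to show the substitution; the argparse setup and the random scores tensor are just stand-ins, the real script computes scores from the bank and query embeddings):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--K', type=int, default=5)  # number of nearest neighbors to retrieve
args = parser.parse_args()

# stand-in for the (bank size) x (number of queries) similarity matrix
# that the real script computes from the embeddings
scores = torch.rand(100, 3)

_, indices = torch.topk(scores, args.K, dim=0)  # K x Q; was params.k, which is undefined
for k in range(args.K):                         # was range(K), also undefined
    print(indices[k])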
As for Issue 3, the argument passed to the --bank flag should be the path to the .txt file instead of the path to the .ref.bin64 file.
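That is also what causes the IndexError: the .ref.bin64 file only stores byte offsets into the text bank, and IndexTextQuery walks the --bank file byte by byte from one of those offsets, so pointing --bank at the small .ref.bin64 file itself makes an offset like 25580 fall outside its 8000 bytes. The lookup works roughly like this (a simplified sketch pieced together from the tracebacks, not the exact code in indexing.py; file names and dtypes in the usage comment are assumptions):

def index_text_query(txt_mmap, ref_mmap, sentence_id, max_len=1000):
    # ref_mmap holds, for each sentence id, the byte offset of that sentence
    # inside the memory-mapped text bank (txt_mmap)
    p = int(ref_mmap[sentence_id])
    i = 0
    # scan forward until the newline byte (10) that ends the sentence
    while txt_mmap[p + i] != 10 and i < max_len:
        i += 1
    # decode that line; a non-UTF-8 byte in the bank file would raise
    # the UnicodeDecodeError from Issue 2 here
    return bytes(txt_mmap[p:p + i]).decode('utf-8')

# example usage with memory-mapped files:
# import numpy as np
# txt_mmap = np.memmap('sentences.txt', dtype=np.uint8, mode='r')
# ref_mmap = np.memmap('sentences.ref.bin64', dtype=np.int64, mode='r')
# print(index_text_query(txt_mmap, ref_mmap, 0))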
Issue 2 seems to be a problem with your sentences.txt file. You should make sure that the encoding you are using when writing/saving your data is UTF-8 and that there is no character written using a different encoding.
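A quick way to find the offending lines before running retrieval is something like this (just a sketch; sentences.txt is a placeholder for your bank file):

# print every line of the bank file that is not valid UTF-8
with open('sentences.txt', 'rb') as f:
    for line_no, raw in enumerate(f, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as err:
            print(f'line {line_no}: {err}')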
If you change all that, it should work. It did for me.