facebookresearch/LASER

mine_bitexts.py: Zero division RuntimeWarning with margin "ratio"

OrianeN opened this issue · 2 comments

This is a bug report.

When using the margin strategy "ratio" on a small corpus, I get the following warning/error:

LASER: tool to search, score or mine bitexts
 - knn will run on all available GPUs (recommended)

[project]/LASER/source/mine_bitexts.py:243: RuntimeWarning: divide by zero encountered in true_divide
  margin = lambda a, b: a / b

This causes the script to stop early.
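For reference, the warning itself can be reproduced outside LASER with a few lines of NumPy. This is a minimal sketch, not code from mine_bitexts.py: the function name `ratio_margin` and the sample values are made up for illustration, and it assumes the denominator (the averaged neighbour similarity) is exactly zero for at least one candidate pair. By default NumPy only emits the RuntimeWarning and returns `inf` instead of raising.

```python
import warnings

import numpy as np


def ratio_margin(a, b):
    """Same expression as the lambda quoted in the warning: a / b."""
    return a / b


# Hypothetical values: cosine similarity of two candidate pairs and the
# averaged similarity to their nearest neighbours (zero for the second pair).
sim = np.array([0.8, 0.5], dtype=np.float32)
neighbourhood_mean = np.array([0.4, 0.0], dtype=np.float32)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scores = ratio_margin(sim, neighbourhood_mean)

messages = [str(w.message) for w in caught]

# With the default NumPy error settings the division does not raise:
# the second score is inf and a RuntimeWarning is recorded.
print(scores)
print(messages)
```

Wrapping the division in `np.errstate(divide="raise")` would turn the warning into a `FloatingPointError`, which can help locate the offending pair.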

Command to reproduce:

import os
import subprocess

# src_file, tgt_file, emb_out_src and emb_out_tgt are the paths to the
# text files and embeddings attached below.
cmd = (
    f'python {os.environ["LASER"]}/source/mine_bitexts.py '
    f'{src_file} '
    f'{tgt_file} '
    f'--src-lang en '
    f'--trg-lang fr '
    f'--output out.tsv '
    f'--threshold 0 '
    f'--mode mine '
    f'--neighborhood 4 '
    f'--src-embeddings {emb_out_src} '
    f'--trg-embeddings {emb_out_tgt} '
    f'--retrieval intersect '
    f'--margin ratio '
    f'--gpu'
    # f' --verbose'
)

subprocess.run(cmd, shell=True, check=True)

Here are the input files (TXT src/tgt + numpy embeddings (.bin) created with embed.py):
LASER_inputs.zip

I can replace it with the margin strategy "distance", but the paper https://arxiv.org/pdf/1811.01136.pdf shows that "ratio" is a better strategy.
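For context, the difference between the margin strategies can be sketched as follows. This is a rough illustration based on the definitions in the cited paper (absolute: a, distance: a - b, ratio: a / b, where a is the cosine similarity of a pair and b is the average similarity of both sentences to their k nearest neighbours). It is not LASER's implementation, which uses FAISS for the nearest-neighbour search; the names `MARGINS` and `margin_scores` are hypothetical, and the embeddings are assumed to be L2-normalized.

```python
import numpy as np

# Margin functions as defined in the paper (a: pair similarity,
# b: averaged neighbourhood similarity).
MARGINS = {
    "absolute": lambda a, b: a,
    "distance": lambda a, b: a - b,
    "ratio": lambda a, b: a / b,
}


def margin_scores(x, y, k, margin):
    """Score every (x_i, y_j) pair by brute force.

    x has shape (n, d) and y has shape (m, d); both are assumed to be
    L2-normalized so that the dot product is the cosine similarity.
    """
    sim = x @ y.T
    # Mean similarity of each source sentence to its k nearest targets.
    x_mean = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    # Mean similarity of each target sentence to its k nearest sources.
    y_mean = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
    b = (x_mean[:, None] + y_mean[None, :]) / 2
    return MARGINS[margin](sim, b)


# Toy example: two orthogonal unit vectors on each side.
x = np.eye(2, dtype=np.float32)
y = np.eye(2, dtype=np.float32)
ratio = margin_scores(x, y, k=2, margin="ratio")
distance = margin_scores(x, y, k=2, margin="distance")
```

Only the "ratio" variant divides by b, so it is the only one that can trigger a division by zero when the averaged neighbourhood similarity happens to be 0.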

Hi @OrianeN! Thank you for providing the script along with your data! I just ran it myself using your arguments and data but didn't run into any division-by-zero errors: I got an output file (out.tsv) with 81 English-French alignments.

Have you tried running this command on CPU only (i.e. dropping --gpu)? I wonder if it's a GPU-related issue.

Closing due to inactivity. Please re-open if needed!