Lyonk71/pandas-dedupe

KeyError

Opened this issue · 3 comments

df_final = pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'])

gives me an error:

deduper = _active_learning(data, sample_size, deduper, training_file, settings_file)
  File "C:\...\site-packages\pandas_dedupe\dedupe_dataframe.py", line 41, in _active_learning
    deduper.prepare_training(data, sample_size=sample_num)
  File "C:\...\site-packages\dedupe\api.py", line 1274, in prepare_training
    self._sample(data, sample_size, blocked_proportion)
  File "C:\...\site-packages\dedupe\api.py", line 1304, in _sample
    index_include=examples)
  File "C:\...\site-packages\dedupe\labeler.py", line 421, in __init__
    self.candidates = self._sample(data, blocked_proportion, sample_size)
  File "C:\...\site-packages\dedupe\labeler.py", line 61, in _sample
    in blocked_sample_keys | random_sample_keys]
  File "C:\...\site-packages\dedupe\labeler.py", line 60, in <listcomp>
    for k1, k2
KeyError: 12885043641

I cannot figure out why. The DataFrame has over 500,000 rows.

Using Python 3.7.7 and a Windows machine.

On a Mac everything works fine.

Apparently fixed by dedupeio/dedupe#945.

Hi @FakieKickflip, many thanks for this. Very helpful.
Did dedupe#945 require you to do anything (e.g. update requirements)? I'd be grateful if you could post it here in case anyone else encounters the same issue.

Thank you again for spotting this and finding the solution.

@ieriii In #945 you can see that it should be addressed by dedupeio/dedupe@7317798
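
For anyone landing here later, a rough sketch of why the row count and the platform might both matter. This is an assumption, not something confirmed in this thread: if the failure involves indices over record pairs being held in 32-bit integers (NumPy's default integer was 32-bit on Windows before NumPy 2.0), then a DataFrame of this size would overflow them. The helper `n_pairs` is hypothetical and only does the arithmetic; it is not part of dedupe or pandas-dedupe.

```python
# Hypothetical helper (not from dedupe): number of distinct record pairs
# among n records, i.e. n * (n - 1) / 2.
def n_pairs(n_records: int) -> int:
    return n_records * (n_records - 1) // 2


INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold

# Lower bound taken from the report ("over 500,000 rows").
n_records = 500_000
pairs = n_pairs(n_records)

# Assumption: pair indices are stored in a 32-bit integer on the failing
# platform. If so, this many pairs cannot be indexed without overflow.
fits_in_int32 = pairs <= INT32_MAX

print(f"candidate pairs: {pairs:,}")
print(f"fits in int32:   {fits_in_int32}")

# The key from the KeyError above is itself larger than a 32-bit integer.
print(f"reported key exceeds int32: {12885043641 > INT32_MAX}")
```

If that reading is right, it would also be consistent with the same code working on a Mac, where the default integer is 64-bit. The linked commit in dedupeio/dedupe remains the authoritative description of the fix.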