KeyError
Opened this issue · 3 comments
FakieKickflip commented
df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name'])
gives me an error:
deduper = _active_learning(data, sample_size, deduper, training_file, settings_file)
File "C:\...\site-packages\pandas_dedupe\dedupe_dataframe.py", line 41, in _active_learning
deduper.prepare_training(data, sample_size=sample_num)
File "C:\...\site-packages\dedupe\api.py", line 1274, in prepare_training
self._sample(data, sample_size, blocked_proportion)
File "C:\...\site-packages\dedupe\api.py", line 1304, in _sample
index_include=examples)
File "C:\...\site-packages\dedupe\labeler.py", line 421, in __init__
self.candidates = self._sample(data, blocked_proportion, sample_size)
File "C:\...\site-packages\dedupe\labeler.py", line 61, in _sample
in blocked_sample_keys | random_sample_keys]
File "C:\...\site-packages\dedupe\labeler.py", line 60, in <listcomp>
for k1, k2
KeyError: 12885043641
I cannot figure out why. The df has over 500,000 rows.
Using Python 3.7.7 and a Windows machine.
On a Mac everything works fine.
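A quick sanity check on the numbers above. It assumes, and nothing in this thread confirms it, that the Windows/Mac difference comes down to integer width (NumPy builds before 2.0 default to a 32-bit integer on Windows). The helpers `pair_count` and `low_32_bits` are hypothetical names for illustration, not part of pandas_dedupe or dedupe, and 500,000 is used as a lower bound for the row count.

```python
# Hypothetical diagnostic sketch; not part of pandas_dedupe or dedupe.
# Assumption: the failure is related to 32-bit integer limits on Windows.

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def pair_count(n_rows: int) -> int:
    """Number of unordered record pairs among n_rows records."""
    return n_rows * (n_rows - 1) // 2


def low_32_bits(value: int) -> int:
    """Keep only the lowest 32 bits of a non-negative integer."""
    return value & 0xFFFFFFFF


n_rows = 500_000            # lower bound taken from the report
failing_key = 12885043641   # the key from the KeyError above

print("candidate pairs:", pair_count(n_rows))
print("pairs exceed int32:", pair_count(n_rows) > INT32_MAX)
print("failing key exceeds int32:", failing_key > INT32_MAX)
print("low 32 bits of failing key:", low_32_bits(failing_key))
```

Both the pair count and the failing key are outside the signed 32-bit range, and the low 32 bits of the key fall within the range of valid row positions for a frame of this size. That is consistent with an integer-width problem but does not prove it.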
FakieKickflip commented
Apparently fixed by: dedupeio/dedupe#945
ieriii commented
Hi @FakieKickflip, many thanks for this. Very helpful.
Did dedupe#945 require you to do anything (e.g. update requirements)? I'd be grateful if you could post it here in case anyone else encounters the same issue.
Thank you again for spotting this and finding the solution.
FakieKickflip commented
@ieriii In #945 you can see that it should be addressed by dedupeio/dedupe@7317798