seomoz/simhash-py

Is simhash-cpp 100x faster than simhash-py?

Closed this issue · 3 comments

Hello:

Thanks very much for sharing this great code! Your work is wonderful.
I have a question about the efficiency of simhash-cpp and simhash-py.

I installed simhash-cpp (https://github.com/seomoz/simhash-cpp) and simhash-py (https://github.com/seomoz/simhash-py) and ran the benchmark for each. I got the following results:
(1) simhash-cpp:
../simhash-cpp/src$ ./bench 1000000
blocks=6, bits=3
Inserting 1000000 hashes...
Running 4000000 queries...
Queries complete with 0 errors
Running time: total=0.705171s, avg=0.17629275us
There are 9999999 items in the table

(2) simhash-py:
../simhash-py/bench.py --random 1000000 --blocks 6 --bits 3
Generating 1000000 hashes
Generating 1000000 queries
Starting Bulk Insertion
Ran Bulk Insertion in 7.518402s, avg: 7.518402us
Starting Bulk Find First
Ran Bulk Find First in 13.021438s, avg: 13.021438us
Starting Bulk Find All
Ran Bulk Find All in 14.687295s, avg: 14.687295us
Starting Bulk Removal
Ran Bulk Removal in 8.982185s, avg: 8.982185us

Based on the above results, with 1000000 hashes the average time per query is:
0.17629275us for simhash-cpp and 13.021438us for simhash-py (Bulk Find First).
So simhash-cpp is almost 100x faster than simhash-py. However, I checked the simhash-py code and found that it is actually built on simhash-cpp. In my view, simhash-py is just a Python wrapper around simhash-cpp. So I would expect simhash-py to be slower than simhash-cpp, but the difference should not be anywhere near 100x.
My question is: why is simhash-cpp about 100x faster than simhash-py?
I don't know whether my understanding is right or whether I missed something. If I did something wrong, please correct me!
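
For reference, the comparison can be recomputed from the two logs with a few lines of Python. This is only a sketch: the numbers are copied from the benchmark output above, the variable names are illustrative, and it assumes the simhash-cpp "total" covers only the 4000000 queries (which matches the reported average).

```python
# Figures copied from the two benchmark logs above.
cpp_total_s = 0.705171    # simhash-cpp: reported total running time
cpp_queries = 4_000_000   # simhash-cpp: "Running 4000000 queries..."
py_total_s = 13.021438    # simhash-py: "Bulk Find First" total time
py_queries = 1_000_000    # simhash-py: "Generating 1000000 queries"

# Convert each total to microseconds per query.
cpp_avg_us = cpp_total_s / cpp_queries * 1e6
py_avg_us = py_total_s / py_queries * 1e6

# How many times slower the Python benchmark is per query.
ratio = py_avg_us / cpp_avg_us

print(f"simhash-cpp: {cpp_avg_us:.4f} us/query")
print(f"simhash-py:  {py_avg_us:.4f} us/query")
print(f"slowdown:    {ratio:.1f}x")
```

With these figures the ratio comes out to roughly 74x. Note that the two benchmarks run different numbers of queries (4000000 vs. 1000000), so the per-query averages are the fair basis for comparison.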

Thanks!

You're right -- this seems odd. I would not have expected the difference to be nearly that substantial. I could understand a difference of 2 or 3x, but not 100x. I'll see if I can reproduce it.

Thanks @dlecocq! Looking forward to your feedback!