seomoz/simhash-py

Infinite Loop When Querying (Fix Detailed)

Closed this issue · 2 comments

We've found that, for certain inputs and queries, some installations hit a bug where querying a corpus never terminates. There's a gist containing some JSON files with data that reproduces the bug.

The root cause turned out to be libJudy. It relies on undefined behavior, and when it is built with newer versions of gcc (4.8 is confirmed to be unsafe), the J1N API call does not behave as documented. In particular, in certain cases the call does not advance the scanned index at all, so a scan that depends on it never makes progress. We've not tracked down exactly which cases trigger this, nor do we plan to. We do, however, have a fix.
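To check whether a given installation is affected without leaving a process spinning forever, the query can be run under a time limit. The following is a minimal sketch, assuming GNU coreutils `timeout` is available; the helper name `finishes_within` and the script name `query_corpus.py` are hypothetical and not part of simhash-py.

```shell
# Hypothetical helper: run a command with a time limit and report whether it finished.
# Usage: finishes_within SECONDS COMMAND [ARGS...]
finishes_within() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    # GNU timeout exits with status 124 when it had to stop the command.
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s (possible infinite loop)"
        return 1
    fi
    echo "finished with exit status $status"
    return 0
}

# Example (placeholder script that would load the gist data and run a query):
# finishes_within 60 python query_corpus.py
```

A timeout only suggests a hang; a slow but correct query on a large corpus would also trip it, so the limit should be generous.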

Installing libjudy-dev from apt

Some libjudy-dev packages from apt (Ubuntu 12.04's, for instance) are known to work well, but others do not. Notably, Ubuntu 14.04's package does not work, and on that release libJudy must be built from source using gcc-4.6. The process is relatively straightforward:

# With libJudy-1.0.5 unpacked
apt-get install -y gcc-4.6
# These are the compiler flags used in the 12.04 build that works
export CFLAGS='-Wall -O2'
export CC=`which gcc-4.6`
# These are the configure flags used in the 12.04 build that works
./configure --prefix=/usr --mandir=/usr/share/man
make
make install
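Before running `./configure`, it can help to confirm which compiler `CC` actually points at. The sketch below classifies a gcc version string using only what this issue reports (4.6 builds a working libJudy, 4.8 does not); the helper name `judy_compiler_status` is hypothetical, and any other version is reported as untested rather than guessed at.

```shell
# Hypothetical helper: classify a gcc version string based on this issue's findings.
# 4.6.x built a working libJudy; 4.8.x is confirmed to produce the broken build.
judy_compiler_status() {
    case "$1" in
        4.6|4.6.*) echo "known-good" ;;
        4.8|4.8.*) echo "known-bad" ;;
        *)         echo "untested" ;;
    esac
}

# Example: check the compiler that CC points at before building.
# judy_compiler_status "$("$CC" -dumpversion)"
```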

I'm not sure if the previous comment has a typo, but at least on my system (Ubuntu 14.04) I must use
export CC=`which gcc-4.6`

Yes, that was a typo. I just edited the original comment to fix it.