dblalock/bolt

Python unit test is broken?

XiaoConstantine opened this issue · 5 comments

OS: MacOS Monterey 12.5 (Intel chip)
Python: 3.10.5

❯ pytest tests
============================================================================== test session starts ===============================================================================
platform darwin -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/xiao/development/github.com/XiaoConstantine/bolt-1
collected 4 items

tests/test_encoder.py ..F.                                                                                                                                                 [100%]

==================================================================================== FAILURES ====================================================================================
________________________________________________________________________________ test_unquantize _________________________________________________________________________________

    def test_unquantize():
        X, Q = _load_digits_X_Q(nqueries=20)
>       enc = bolt.Encoder('dot', accuracy='high').fit(X)

tests/test_encoder.py:151:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:466: in fit
    centroids = _learn_centroids(X, ncentroids=ncentroids,
../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:142: in _learn_centroids
    centroids, labels = kmeans(X_in, ncentroids)
../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:106: in kmeans
    seeds = kmc2.kmc2(X, k).astype(np.float32)
kmc2.pyx:97: in kmc2.kmc2
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   ValueError: probabilities contain NaN

mtrand.pyx:935: ValueError

I've made a PR to the kmc2 code and added a hack to the bolt API, with a PR here: #38

The issue is that each row is padded with 0's. Since there are 16 rows but we only get 15 values per codebook from Python, I have zeroed out the last row in all columns here: #29 (comment).
We pass in the columns one at a time to get centroids for each column, and the first column is all 0's. The kmc2 code errors when its input has only 1 unique row: it updates points with the normalized distances of every row from each other, and this is NaN if all the rows are the same, since the sum of the distances is 0.

This is mentioned in the thread where the external KMC2 package is included: #4 (comment).
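
For anyone wanting to reproduce the failure in isolation, here is a minimal sketch. It assumes the seeding step turns squared distances into a probability vector and hands it to a NumPy sampler, which is what the mtrand.pyx frame in the traceback suggests. The function names are hypothetical and this is not the actual kmc2 code; the uniform fallback is only one possible guard and is not necessarily what the PR does.

```python
import numpy as np


def sampling_probabilities(X, seed_row=0):
    """Squared distance of every row from a seed row, normalized to sum to 1.

    Hypothetical stand-in for the normalization step described above.
    """
    d2 = ((X - X[seed_row]) ** 2).sum(axis=1)
    # With identical rows, d2 is all zeros and this is 0 / 0 = NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return d2 / d2.sum()


def safe_sampling_probabilities(X, seed_row=0):
    """Same as above, but fall back to uniform sampling when all rows match."""
    d2 = ((X - X[seed_row]) ** 2).sum(axis=1)
    total = d2.sum()
    if total == 0:
        # Assumption: any row is as good as any other when they are identical.
        return np.full(len(X), 1.0 / len(X))
    return d2 / total


# A column of all zeros, like the padded first column described above.
X = np.zeros((16, 1), dtype=np.float32)

p = sampling_probabilities(X)
try:
    np.random.choice(len(X), p=p)
    message = None
except ValueError as err:
    message = str(err)  # same error text as in the pytest output

p_safe = safe_sampling_probabilities(X)
idx = np.random.choice(len(X), p=p_safe)  # succeeds with the guard in place
```

The unguarded version fails for any input with a single unique row, not just zeros, because every distance from the seed row is 0.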

Makes sense to me 👍 will wait for @dblalock to take a look when he gets time

I'm using Python 3.10.0 on my Intel Mac. I couldn't pip install kmc2 because the Cython interface has changed. I did clone the kmc2 repository and ran cython kmc2, which then built. However, I still got the NaN error reported above.

Did you run python setup.py install inside the bolt repo after checking out the branch with the updated python/bolt/bolt_api.py?

Following the steps here: #4 (comment).

I just tested this on python 3.7 and ran python setup.py install in both repos; I've not tried with cython.

Here are the commands that pass the tests on macOS 12.5:

git clone https://github.com/dblalock/bolt.git
cd bolt/
pip install -r requirements.txt
python setup.py install
cd ..
git clone git@github.com:clark-hive/kmc2.git
cd kmc2/
git checkout clark/allow_duplicated_inputs
python setup.py install
cd ../bolt/
pytest tests/