danielhomola/mifs

Fitting time too long

andersonspy opened this issue · 6 comments

I am trying to fit data of dimensions around 600*11 with a continuous y (a regression problem). However, the fitting seems to take forever. I don't know what is going wrong, since I think I have set up the code correctly.

Hi,

Sorry for the late response. It should work just fine with that amount of data. Have you set the categorical flag to False?
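
For reference, a minimal setup sketch for a regression problem, assuming the MutualInformationFeatureSelector constructor takes the method, categorical and verbose parameters shown in the project README:

import numpy as np
import mifs

# toy regression data of roughly the shape described above
X = np.random.rand(600, 11)
y = np.random.rand(600)

# categorical=False tells the estimator that y is continuous
feat_selector = mifs.MutualInformationFeatureSelector(method='JMI', categorical=False, verbose=2)
feat_selector.fit(X, y)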

Yes, and I finally found that it was a Windows problem... Now it's working fine. But what does it mean if JMI is NaN?

Not sure... could you paste in the output you get when you run the algorithm with verbose=2?

It keeps reporting the same warning, as follows:

Warning (from warnings module):
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\hashing.py", line 197
    obj_bytes_view = obj.view(self.np.uint8)
DeprecationWarning: Changing the shape of non-C contiguous array by descriptor assignment is deprecated. To maintain the Fortran contiguity of a multidimensional Fortran array, use 'a.T.view(...).T' instead

And it finally fails with the following error:

Traceback (most recent call last):
  File "E:\PythonProject_3\IO_for_EP\Retrofit_Tool.py", line 139, in <module>
    main()
  File "E:\PythonProject_3\IO_for_EP\Retrofit_Tool.py", line 82, in main
    print Helper.JMISelector(data_prepro)
  File "E:\PythonProject_3\IO_for_EP\Helper.py", line 351, in JMISelector
    MIFS.fit(X, total)
  File "E:\PythonProject_3\IO_for_EP\mifs.py", line 137, in fit
    return self._fit(X, y)
  File "E:\PythonProject_3\IO_for_EP\mifs.py", line 211, in _fit
    S, F = self._add_remove(S, F, bn.nanargmax(xy_MI))
  File "reduce.pyx", line 2907, in reduce.nanargmax (bottleneck/src/auto_pyx/reduce.c:25633)
  File "reduce.pyx", line 3552, in reduce.reducer (bottleneck/src/auto_pyx/reduce.c:31009)
  File "reduce.pyx", line 2943, in reduce.nanargmax_all_float64 (bottleneck/src/auto_pyx/reduce.c:25949)
ValueError: All-NaN slice encountered

and these are the dimensions of X and y:
(23328L, 11L) (23328L, 1L)

Hey, massively sorry for the late reply.
So first of all, 24k data points is a lot, because MI is calculated with nearest neighbours under the hood. I'm not saying it needs 24k^2 calculations, because scikit-learn uses some clever tricks internally to speed up the search, but it will always be slow on datasets of that size.
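
As an illustration of a common workaround (not something suggested in the thread itself), fitting on a random subsample can make the kNN-based MI estimation tractable. A minimal sketch, assuming arrays of the shapes reported above:

import numpy as np

# stand-ins for the (23328, 11) X and (23328,) y from the thread
X = np.random.rand(23328, 11)
y = np.random.rand(23328)

# draw a 5000-row subsample to cut the cost of the kNN-based MI estimation
rng = np.random.RandomState(0)
idx = rng.choice(X.shape[0], size=5000, replace=False)
X_sub, y_sub = X[idx], y[idx]

The feature subset selected on the subsample can then be sanity-checked against the full data.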
Regarding the errors you get, make sure your input arrays are all NumPy arrays:

import numpy as np
X = np.array(X)
y = np.array(y)
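
A further guess, not stated in the thread: the "All-NaN slice encountered" error means every entry of the internal xy_MI vector ended up NaN, so it is also worth checking the inputs for NaNs and flattening the (23328, 1) column vector y before fitting:

import numpy as np

# sanity checks before calling fit (assumption: NaNs in the inputs or a
# 2-D y are what produce the all-NaN MI scores)
assert not np.isnan(X).any() and not np.isnan(y).any()
y = np.asarray(y).ravel()    # (23328, 1) -> (23328,)
X = np.ascontiguousarray(X)  # may also silence the Fortran-contiguity warning above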