jlsuarezdiaz/pyDML

DMLMJ error related to neighbors number

pamikem opened this issue · 4 comments

Hi!

I get an error when I pass a number of neighbors other than the default (3) to the DMLMJ constructor; I tried 2 and 4. The traceback is the following:

dml/dmlmj.pyx in dml.dmlmj.DMLMJ.fit()

dml/dmlmj.pyx in dml.dmlmj.DMLMJ._compute_matrices()

IndexError: index 139900304611984 is out of bounds for axis 0 with size 11678

Here, 11678 is the size of my training set. I ran _compute_neighborhoods outside of fit and checked the output: no index was out of bounds, which makes the error really surprising. Please tell me if there is something I didn't do correctly.
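A check of that kind can be sketched as follows. This is only an illustration: find_out_of_bounds is a hypothetical helper, and the small array stands in for the neighbor indices, since the exact output format of _compute_neighborhoods is not shown in this thread.

```python
import numpy as np


def find_out_of_bounds(indices, n_samples):
    """Return the positions and values of entries outside [0, n_samples)."""
    indices = np.asarray(indices)
    mask = (indices < 0) | (indices >= n_samples)
    # np.argwhere and boolean indexing both walk the array in row-major
    # order, so positions[i] is the location of values[i].
    return np.argwhere(mask), indices[mask]


# Stand-in neighbor table for 3 samples with one corrupted entry
# (the value is the one reported in the traceback).
neighbors = np.array([[1, 2], [0, 139900304611984], [0, 1]], dtype=np.int64)
positions, values = find_out_of_bounds(neighbors, n_samples=3)
print(positions, values)
```

An empty result means every index can safely be used to address the training set.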

Hi, I wasn't able to reproduce this error. Do you have a minimal working example that I could test?

Yes, sure. Please find below the data I'm working with. It comes from flow cytometry and contains cell features.

A script for importing and preprocessing the data:

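import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Load the data and separate the class labels from the features.
data = pd.read_csv("flowcyto.csv", sep=" ", na_values="NA", index_col=0)
labels = data["Label"].to_numpy()
data.drop(columns=['Label'], inplace=True)
X = data.values

# Scale every feature to the [0, 1] range.
mm_scaler = MinMaxScaler()
mm_scaler.fit(X)
X = mm_scaler.transform(X)

# Encode the class labels as integers.
le = LabelEncoder()
le.fit(labels)
y = le.transform(labels)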


Hi, thank you for pointing this issue out and for providing the example. It seems that the current implementation can't handle large datasets because the pairwise distance calculation requires too much memory. I'm trying to reimplement this part to make it memory-efficient enough, but I may need some time.
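For a sense of scale, a dense float64 matrix of pairwise distances for 11678 samples takes 11678 × 11678 × 8 bytes, roughly 1.09 GB, before counting any temporaries. One common way to avoid holding it all at once is to compute the neighbors block by block. The following is a minimal NumPy sketch of that idea, not pyDML's actual implementation; the names dense_distance_bytes and chunked_knn and the chunk_size parameter are made up for illustration.

```python
import numpy as np


def dense_distance_bytes(n_samples, itemsize=8):
    """Bytes needed for a full n x n distance matrix (float64 by default)."""
    return n_samples * n_samples * itemsize


def chunked_knn(X, k, chunk_size=1024):
    """Indices of the k nearest neighbors of each row of X, self excluded.

    Only a (chunk_size x n) block of squared distances is held in memory
    at any time. Requires k < number of samples.
    """
    X = np.asarray(X, dtype=np.float64)
    n = X.shape[0]
    sq_norms = np.einsum("ij,ij->i", X, X)
    neighbors = np.empty((n, k), dtype=np.intp)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Squared Euclidean distances from this block of rows to all samples.
        d2 = sq_norms[start:stop, None] - 2.0 * (X[start:stop] @ X.T) + sq_norms[None, :]
        # A sample must not be listed as its own neighbor.
        d2[np.arange(stop - start), np.arange(start, stop)] = np.inf
        # Pick the k smallest per row (unordered), then sort just those k.
        part = np.argpartition(d2, k - 1, axis=1)[:, :k]
        order = np.argsort(np.take_along_axis(d2, part, axis=1), axis=1)
        neighbors[start:stop] = np.take_along_axis(part, order, axis=1)
    return neighbors


print(dense_distance_bytes(11678))  # 1091005472 bytes, about 1.09 GB
```

With chunk_size=1024 and 11678 samples, each block of distances is about 96 MB instead of about 1.09 GB, at the cost of a Python-level loop over blocks.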

Hi! I'll use smaller datasets for the moment then. Thank you for your answer.