Custom Dataset All-NaN Slice encountered
Hi,
I am trying to use your algorithm with another dataset. After a few epochs, when I tried to update the pseudo-labels, I got this message:
Traceback (most recent call last):
File "PseudoLabelling.py", line 391, in <module>
L, cfg.PS = update_pseudoLabels(model, cfg, inferloader, L)
File "PseudoLabelling.py", line 314, in update_pseudoLabels
L, PS = cpu_sk(model, pseudo_loader, cfg.hc, cfg.outs, L, cfg.presize, cfg.K, cfg.dtype, cfg.device, cfg.lamb)
File "PseudoLabelling.py", line 251, in cpu_sk
L, PS = optimize_L_sk(L, PS, outs, lamb, "cpu", dtype, nh=0)
File "PseudoLabelling.py", line 306, in optimize_L_sk
argmaxes = np.nanargmax(PS, 0) # size N
File "/.conda/envs/torch/lib/python3.8/site-packages/numpy/lib/nanfunctions.py", line 551, in nanargmax
raise ValueError("All-NaN slice encountered")
ValueError: All-NaN slice encountered
Why did I get this error? Is it because all the pseudo-labels are the same as in the previous epochs?
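(For reference, np.nanargmax raises this error whenever an entire slice along the reduction axis is NaN. A minimal reproduction, independent of the repo:)

```python
import numpy as np

PS = np.full((20, 5), np.nan)   # every score is NaN for every sample
np.nanargmax(PS, 0)             # ValueError: All-NaN slice encountered
```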
Hi there,
It's a bit hard to know exactly why this is happening, as I don't know the details about how you're using the algorithm. What's your number of clusters?
A few things to try:
a) instead of having a while loop in the SK optimization, run just a fixed number of matrix multiplies, e.g. 10 (see the sketch after this list).
b) use more clusters.
c) make sure you're using double precision for the matrix multiplies (should be the default).
d) play around with lambda; maybe a lower one works better.
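A minimal sketch of what (a), (c), and (d) could look like together, assuming the scores form a (K, N) matrix of softmax probabilities; this is an illustrative re-implementation, not the repo's optimize_L_sk:

```python
import numpy as np

def sk_pseudo_labels(P, lamb, n_iters=10):
    """Illustrative fixed-iteration Sinkhorn-Knopp, NOT the repo's optimize_L_sk.

    P    : (K, N) matrix of softmax probabilities (K clusters, N samples).
    lamb : regularisation parameter; here it enters as an exponent on P,
           which is one common choice (an assumption about the repo's usage).
    """
    P = P.astype(np.float64)             # (c) double precision
    P = np.clip(P ** lamb, 1e-12, None)  # (d) apply lambda; clip so no column underflows to all zeros
    K, N = P.shape
    r = np.full(K, 1.0 / K)              # target row marginal: equipartition over clusters
    c = np.full(N, 1.0 / N)              # target column marginal: one unit of mass per sample
    u = np.ones(K)
    for _ in range(n_iters):             # (a) fixed number of matrix multiplies instead of a while loop
        v = c / (P.T @ u)
        u = r / (P @ v)
    Q = (P * u[:, None]) * v[None, :]    # balanced assignment matrix
    return np.argmax(Q, axis=0)          # hard pseudo-labels, size N
```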
Let me know if and what works! Cheers
Hi,
Thank you for your answer.
I have replaced the while loop with a fixed number of iterations instead of an "error" threshold. It seems to help the training run for more epochs, but I still hit the issue. I think I may have found why, although I am not sure. When I look at the last checkpoint before the crash, it seems that most of the items are predicted as a single class (there are 20 classes, and 99% of the training data are assigned to one of them). Moreover, the predicted probability is equal to one, so whatever the lambda value is, it will not change the probability. I suspect the issue comes from that: the epoch that crashes is probably one where all items are predicted as one class with a probability of 1.0.
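A rough sketch of the kind of check described above, assuming PS is the (K, N) score matrix that gets passed to np.nanargmax in the traceback:

```python
import numpy as np

def collapse_report(PS):
    """Report how concentrated the current hard assignments are. PS: (K, N) scores."""
    safe = np.nan_to_num(PS, nan=-np.inf)              # sidestep the All-NaN error for diagnostics
    hard = np.argmax(safe, axis=0)                     # hard label per sample
    counts = np.bincount(hard, minlength=PS.shape[0])  # samples per cluster
    print(f"largest cluster holds {counts.max() / counts.sum():.1%} of the samples")
    return counts
```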
Hi, ok that's interesting. I'm glad it has helped a bit already. Maybe a systematic analysis of which parameter made it train longer would be good. A lower lambda during training should help because it will force the equipartitioning to be more strict.
Worst case, you can always just train with K=200 instead of K=20 and then merge clusters back to 20 at the end.
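One possible way to do that merge at the end (an assumption, not something prescribed by the repo) is to cluster the 200 learned prototypes, e.g. the rows of the classification head's weight matrix, into 20 groups and remap the pseudo-labels accordingly:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def merge_clusters(prototypes, labels, n_final=20):
    """prototypes: (200, D) array of cluster prototypes (hypothetical source:
    the final linear layer's weights); labels: (N,) pseudo-labels in [0, 200)."""
    grouping = AgglomerativeClustering(n_clusters=n_final).fit_predict(prototypes)
    return grouping[labels]  # (N,) labels remapped to [0, n_final)
```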