IdanAchituve/pFedGP

Error of Torch.Cholesky

Closed this issue · 6 comments

Firstly, thanks for your awesome work and code. When I run pFedGP_IP_compute on CIFAR-100 in the 'noisy input' setting, it raises an error: 'RuntimeError: cholesky_cuda: For batch 1: U(993,993) is zero, singular U.' The same problem also occurs in other settings. How can I solve this issue?

Hi,
Can you provide the full trace of the error?
Where exactly does it fail?

Cheers,
Idan

The input line:

(per) [quyuxun@Moon pFedGP-main]$ python ./trainer_ip.py --data-name cifar100 --data-path experiments/datafolder/noisy_cifar100/data_dictionary.pkl --method pFedGP-compute --lr 0.01 --save-path experiments/noisy_input/output/pFedGP/pFedGP-IP

The Traceback

Step: 25, client: 90, Inner Step: 0, Loss: 6.2183685302734375:   2%|#5                                                              | 24/1000 [03:07<2:07:12,  7.82s/it]
Traceback (most recent call last):
  File "./trainer_ip.py", line 317, in <module>
    val_results, labels_vs_preds_val = eval_model(net, GPs, X_bar, clients, split="val")
  File "/home/quyuxun/anaconda3/envs/per/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "./trainer_ip.py", line 135, in eval_model
    loss, pred = GPs[client_id].forward_eval(X_train, Y_train, X_test, Y_test, client_X_bar, is_first_iter)
  File "/home/quyuxun/pFedGP-main/pFedGP/Learner.py", line 84, in forward_eval
    preds = self.tree.eval_tree_full_path(X_train, Y_train, X_test, X_bar, self.n_output, is_first_iter)
  File "/home/quyuxun/pFedGP-main/pFedGP/tree.py", line 446, in eval_tree_full_path
    is_first_iter=is_first_iter)
  File "/home/quyuxun/pFedGP-main/pFedGP/pFedGP_compute.py", line 126, in predictive_posterior
    gibbs_state = self.gibbs_sample(model_state)
  File "/home/quyuxun/pFedGP-main/pFedGP/pFedGP_compute.py", line 182, in gibbs_sample
    gibbs_state = self.next_gibbs_state(model_state, gibbs_state)
  File "/home/quyuxun/pFedGP-main/pFedGP/pFedGP_compute.py", line 215, in next_gibbs_state
    f_new = self.sample_f(gibbs_state.omega, model_state)
  File "/home/quyuxun/pFedGP-main/pFedGP/pFedGP_compute.py", line 239, in sample_f
    dist = self.gaussian_posterior(omega, model_state)
  File "/home/quyuxun/pFedGP-main/pFedGP/pFedGP_compute.py", line 280, in gaussian_posterior
    L_Q = psd_safe_cholesky(Q)  # ND x M x M
  File "/home/quyuxun/pFedGP-main/utils.py", line 241, in psd_safe_cholesky
    raise e
  File "/home/quyuxun/pFedGP-main/utils.py", line 215, in psd_safe_cholesky
    L = torch.cholesky(A, upper=upper, out=out)
RuntimeError: cholesky_cuda: For batch 1: U(993,993) is zero, singular U.

It seems that the matrix A at line 215 of 'utils.py' is singular (not invertible) at step 25 / client 90 / inner step 0. Is the issue reproducible on your machine?

Indeed, that is the issue. I wasn't able to reproduce this error on my machine, although the optimization process seems to be similar to yours (e.g., client 90 is sampled at step 25).
Perhaps you can check the following:

  1. PyTorch changed the function that performs the Cholesky decomposition in recent versions (torch.cholesky was deprecated in favor of torch.linalg.cholesky). So, if you are working with a recent version, perhaps you should change it in the code as well (link).
  2. Try working with double precision and see if the issue is resolved.
  3. Add a larger jitter to the diagonal when the decomposition fails.
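For reference, suggestion 3 is the idea behind the repo's `psd_safe_cholesky` in `utils.py`: retry the decomposition with progressively larger jitter added to the diagonal. Below is a minimal NumPy sketch of that jitter-retry pattern; the function name, jitter schedule, and `max_tries` here are illustrative assumptions, not the repo's exact implementation:

```python
import numpy as np

def safe_cholesky(A, max_tries=6, initial_jitter=1e-8):
    """Cholesky decomposition with escalating diagonal jitter on failure."""
    try:
        # First attempt: no jitter, matrix may already be positive definite.
        return np.linalg.cholesky(A)
    except np.linalg.LinAlgError:
        pass
    # Scale the jitter relative to the magnitude of the diagonal.
    jitter = initial_jitter * float(np.mean(np.diag(A)))
    eye = np.eye(A.shape[-1], dtype=A.dtype)
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(A + jitter * eye)
        except np.linalg.LinAlgError:
            jitter *= 10  # escalate and retry
    raise np.linalg.LinAlgError("matrix is not PSD even with added jitter")
```

For example, `safe_cholesky(np.array([[1.0, 1.0], [1.0, 1.0]]))` succeeds even though the matrix is singular (plain `np.linalg.cholesky` raises `LinAlgError` on it), at the cost of a tiny perturbation on the diagonal. In the PyTorch code the same pattern applies with `torch.linalg.cholesky` and `torch.LinAlgError`.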

Idan

Thanks for your response; I will try what you suggest. My PyTorch version is the same as the one in the 'requirements' file, and I am using CUDA 11.0 on an RTX 3090, so perhaps the issue is closely related to the machine. Lastly, sorry to disturb you, and thanks for your detailed response.

No problem. Feel free to reach out about anything.
BTW, I am working on an RTX 2080 Ti with CUDA 11.4.

Good luck,
Idan