JAYATEJAK/GVAlign

run error

Closed this issue · 4 comments

Hi, thanks for your great work. I ran your code on another dataset, but it raised the following error. Could you give me some help?

Traceback (most recent call last):
File "/media/project/yanan/eccv2024/gvalign/src/main_incremental.py", line 338, in <module>
main()
File "/media/project/yanan/eccv2024/gvalign/src/main_incremental.py", line 286, in main
appr.train(t, trn_loader[t], tst_loader[t], args, tst_loader)
File "/media/project/yanan/eccv2024/gvalign/src/approach/incremental_learning.py", line 57, in train
self.train_loop(t, trn_loader, val_loader, args, tst_loader)
File "/media/project/yanan/eccv2024/gvalign/src/approach/lucir_gvalign_2stage.py", line 288, in train_loop
distrib = MultivariateNormal(loc=mean, covariance_matrix=cov_cls_ms)
File "/media/project/yanan/anaconda3/envs/ahead/lib/python3.10/site-packages/torch/distributions/multivariate_normal.py", line 150, in __init__
super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
File "/media/project/yanan/anaconda3/envs/ahead/lib/python3.10/site-packages/torch/distributions/distribution.py", line 56, in __init__
raise ValueError(
ValueError: Expected parameter covariance_matrix (Tensor of shape (50, 512, 512)) of distribution MultivariateNormal(loc: torch.Size([50, 512]), covariance_matrix: torch.Size([50, 512, 512])) to satisfy the constraint PositiveDefinite(), but found invalid values:
tensor([[[ 2.7521e-03, 1.8102e-03, -9.6236e-04, ..., -1.2610e-03,
5.4316e-04, -6.7986e-04],
[ 1.8102e-03, 1.5385e-03, -9.5112e-04, ..., -7.8666e-04,
5.5435e-04, -5.4887e-04],
[-9.6236e-04, -9.5112e-04, 1.2840e-03, ..., 2.5209e-04,
-5.3971e-04, 6.2342e-04],
...,
[-1.2610e-03, -7.8666e-04, 2.5209e-04, ..., 1.1643e-03,
3.7204e-05, 9.5319e-05],
[ 5.4316e-04, 5.5435e-04, -5.3971e-04, ..., 3.7204e-05,
4.4342e-04, -3.3423e-04],
[-6.7986e-04, -5.4887e-04, 6.2342e-04, ..., 9.5319e-05,
-3.3423e-04, 4.5726e-04]],

    ... (remaining covariance blocks, identical to the first, truncated) ...

Hi @YananGu, for how many epochs did you train?

@JAYATEJAK I trained for 90 epochs on the DomainNet dataset, which spans multiple domains.

@YananGu, I think this issue arises when a proper variance isn't learned from the data. With CIFAR or ImageNet, you won't encounter it.

Maybe you can try adding a small delta to the covariance matrix, something like:
class_cov = torch.cov(torch.tensor(vectors, dtype=torch.float64).T) + torch.eye(class_mean.shape[-1]) * 1e-5
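As a minimal sketch of that fix (sample count and feature dimension here are illustrative, not the repository's): adding a small diagonal "delta" (jitter) guarantees the matrix passes the PositiveDefinite check that `MultivariateNormal` enforces.

```python
import torch
from torch.distributions import MultivariateNormal

torch.manual_seed(0)

# Toy class features: 20 samples, 8-dim (illustrative sizes).
vectors = torch.randn(20, 8, dtype=torch.float64)
class_mean = vectors.mean(dim=0)

# A rank-deficient or noisy covariance estimate can fail the
# PositiveDefinite constraint; the diagonal jitter fixes that.
class_cov = torch.cov(vectors.T) + torch.eye(class_mean.shape[-1]) * 1e-5

# This construction raised ValueError before; with jitter it succeeds.
distrib = MultivariateNormal(loc=class_mean, covariance_matrix=class_cov)
samples = distrib.sample((5,))
print(samples.shape)
```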

In common practice, people prefer this because the covariance matrix must be positive definite. In our approach, we computed it from the class with the largest number of samples, so we didn't encounter this issue.

My suggestion: if you have few samples for every class, try estimating a variance from all classes combined and use it to tune the classifier. It would be an interesting experiment to see what happens.
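One way to realize that suggestion (all names and sizes below are illustrative, not the repository's API) is to center each class's features by its own mean and pool the residuals into a single shared covariance estimated from every class's samples at once:

```python
import torch

torch.manual_seed(0)

# Hypothetical per-class feature tensors: only 5 samples each,
# too few for a reliable per-class covariance in 8 dimensions.
features_per_class = {c: torch.randn(5, 8, dtype=torch.float64) for c in range(10)}

# Remove each class's mean, then pool the centered residuals so the
# shared covariance is estimated from all 50 samples together.
centered = torch.cat(
    [feats - feats.mean(dim=0) for feats in features_per_class.values()], dim=0
)
shared_cov = torch.cov(centered.T) + torch.eye(centered.shape[-1]) * 1e-5

print(shared_cov.shape)
```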

Thanks, I added a delta to the covariance matrix and the problem is fixed. The likely cause is that in DomainNet, samples of the same class belong to different domains, which makes learning the variance harder.