xiaxin1998/DHCN

Abnormal results and a divide-by-zero warning

Closed this issue · 8 comments

DHCN/util.py

Line 40 in b798a7b

DH = H.T.multiply(1.0/H.sum(axis=1).reshape(1, -1))

A warning will appear on this line of code:

RuntimeWarning: divide by zero encountered in true_divide
  DH = H.T.multiply(1.0/H.sum(axis=1).reshape(1, -1))

H.sum(axis=1) contains elements equal to 0, so the division eventually produces inf values. Doesn't this affect the model?

I only tested the model on the diginetica dataset, using the original code and the preprocessed data provided via Dropbox. I assume the preprocessing code is the same as SR-GNN's, but the results I got are abnormal. I hope you can provide some suggestions.

Thanks for your interest.
This is just a warning and it won't affect the result: when a row sum is zero, the corresponding entries are zero anyway.
Besides, we use the same preprocessing method as SR-GNN on Diginetica. Could you share the details of your abnormal results?
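
(For reference: a zero row sum means the node appears in no hyperedge, so the inf values never multiply a stored nonzero entry, which is why the output is unaffected. If you want to avoid the warning entirely, one option is to replace the zero sums before dividing. A minimal sketch, using a hypothetical toy H rather than the repository's data:)

import numpy as np
from scipy import sparse

# Hypothetical toy incidence matrix H (rows = nodes, columns = hyperedges);
# the all-zero second row is what triggers the warning.
H = sparse.csr_matrix(np.array([[1., 0.], [0., 0.], [1., 1.]]))

row_sum = np.asarray(H.sum(axis=1)).reshape(1, -1)  # node degrees
inv = np.divide(1.0, row_sum,
                out=np.zeros_like(row_sum),  # leave 0 where the degree is 0
                where=row_sum != 0)
DH = H.T.multiply(inv)  # same values as the original line, no warning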

This is the result of the first epoch on my computer:

Namespace(batchSize=100, beta=0.01, dataset='diginetica', embSize=100, epoch=30, filter=False, l2=1e-05, layer=3, lr=0.001)
/home/nishikata/Downloads/DHCN-main/util.py:40: RuntimeWarning: divide by zero encountered in true_divide
  DH = H.T.multiply(1.0/H.sum(axis=1).reshape(1, -1))
-------------------------------------------------------
epoch:  0
start training:  2021-10-06 12:26:53.144526
	Loss:	72368.703
start predicting:  2021-10-06 15:43:56.240143
{'hit5': 13.520525451559934, 'mrr5': 10.005665024630542, 'hit10': 15.440065681444992, 'mrr10': 10.266280918497667, 'hit20': 17.185550082101805, 'mrr20': 10.388375823707687}
train_loss:	72368.7031	Recall@5: 13.5205	MRR5: 10.0057	Epoch: 0,  0
train_loss:	72368.7031	Recall@10: 15.4401	MRR10: 10.2663	Epoch: 0,  0
train_loss:	72368.7031	Recall@20: 17.1856	MRR20: 10.3884	Epoch: 0,  0

In the following epochs, the best results are never updated.
Also, the training time of each epoch is very long, about three hours. This looks like a problem with the data loading; is there a multi-threaded version of the DataLoader?
My numpy version is 1.19.2 and my pytorch version is 1.7.1.
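
(On the multi-threading question: PyTorch's built-in DataLoader already supports multi-process loading via num_workers. A minimal sketch; SessionDataset and the train_sessions/train_targets placeholders are hypothetical, not the repository's own data pipeline:)

from torch.utils.data import Dataset, DataLoader

class SessionDataset(Dataset):
    # Hypothetical map-style wrapper around preprocessed session data.
    def __init__(self, sessions, targets):
        self.sessions = sessions
        self.targets = targets

    def __len__(self):
        return len(self.sessions)

    def __getitem__(self, idx):
        return self.sessions[idx], self.targets[idx]

# num_workers > 0 spawns worker processes that prepare batches in parallel.
loader = DataLoader(SessionDataset(train_sessions, train_targets),
                    batch_size=100, shuffle=True, num_workers=4)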

Hi, I checked our code and found the reason.
A few days ago, I made some changes to line 51 and line 75 because some people said they encountered a cudaError when running our code. That error is caused by numpy and pytorch versions different from ours.
In our model we use numpy 1.18.1. When I changed the code slightly to work around the version problem, I didn't realize that those changes would significantly affect the results and training time on Diginetica. Sorry about that.

To resolve your problem, I suggest you try the current code again (I have reverted the changes); the training time and results on Diginetica should then be normal. In our environment, training one epoch on Diginetica takes about 30 minutes, and the first-epoch result is roughly as follows:
epoch: 0
start training: 2021-10-07 00:19:21.094170
Loss: 44228.219
start predicting: 2021-10-07 00:51:05.679698
{'hit5': 25.328407224958948, 'mrr5': 14.150602079912423, 'hit10': 36.45812807881774, 'mrr10': 15.619508953006486, 'hit20': 49.64203612479475, 'mrr20': 16.533229130333517}
train_loss: 44228.2188 Recall@5: 25.3284 MRR5: 14.1506 Epoch: 0, 0
train_loss: 44228.2188 Recall@10: 36.4581 MRR10: 15.6195 Epoch: 0, 0
train_loss: 44228.2188 Recall@20: 49.6420 MRR20: 16.5332 Epoch: 0, 0
Using the current code, if you encounter a cudaError at line 51 or line 75 of the model.py file, as in the previously closed issues, I suggest you change your numpy version to 1.18.1 and your pytorch version to 1.6.0 to run our code successfully.
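
(If it helps, pinning those versions with pip looks like this; whether prebuilt wheels are available depends on your Python and CUDA setup:)

pip install numpy==1.18.1 torch==1.6.0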
And I am now trying to work out how to fix the abnormal training that appears once the version problem is worked around.
Thank you!

Thank you for your detailed answer. With the environment versions you provided (pytorch 1.6.0, numpy 1.18.1), I got normal results.

Hi, I set up the environment versions you provided (pytorch 1.6.0, numpy 1.18.1), but the warning still appears:
RuntimeWarning: divide by zero encountered in true_divide
DH = H.T.multiply(1.0/H.sum(axis=1).reshape(1, -1))
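
(The warning is independent of the library versions; NumPy emits it whenever 1.0/0.0 is evaluated, and as noted above it does not change the result. If you only want to silence it, one option is np.errstate, sketched here around the original util.py line:)

import numpy as np

with np.errstate(divide='ignore'):
    DH = H.T.multiply(1.0 / H.sum(axis=1).reshape(1, -1))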

Hi, I got the very same results as Nishikata97. The results improve only slightly each epoch; I wonder whether this is due to the numpy or pytorch version. Why would the version cause this much difference?
Namespace(batchSize=100, beta=0.01, dataset='diginetica', embSize=100, epoch=30, filter=False, gpu=2, l2=1e-05, layer=3, lr=0.001)
/util.py:43: RuntimeWarning: divide by zero encountered in true_divide
DH = H.T.multiply(1.0/H.sum(axis=1).reshape(1, -1))
epoch: 0
start training: 2023-05-31 15:58:37.380634
100%|███████████████████████████████████████████████████████████| 7195/7195 [26:33<00:00, 4.51it/s]
train time: 1593.9640152454376
Loss: 73193.414
start predicting: 2023-05-31 16:25:11.344766
100%|█████████████████████████████████████████████████████████████| 609/609 [02:31<00:00, 4.03it/s]
test time: 151.2500503063202
{'hit1': 7.916256157635468, 'mrr1': 7.916256157635468, 'hit5': 13.285714285714286, 'mrr5': 9.901149425287356, 'hit10': 15.259441707717569, 'mrr10': 10.166651028227381, 'hit20': 17.03448275862069, 'mrr20': 10.289920811295623}
train_loss: 73193.4141 Recall@1: 7.9163 MRR1: 7.9163 Epoch: 0, 0
train_loss: 73193.4141 Recall@5: 13.2857 MRR5: 9.9011 Epoch: 0, 0
train_loss: 73193.4141 Recall@10: 15.2594 MRR10: 10.1667 Epoch: 0, 0
train_loss: 73193.4141 Recall@20: 17.0345 MRR20: 10.2899 Epoch: 0, 0
epoch: 1
start training: 2023-05-31 16:27:42.806487
100%|███████████████████████████████████████████████████████████| 7195/7195 [28:36<00:00, 4.19it/s]
train time: 1716.6106851100922
Loss: 72292.641
start predicting: 2023-05-31 16:56:19.417264
100%|█████████████████████████████████████████████████████████████| 609/609 [02:40<00:00, 3.81it/s]
test time: 160.02362656593323
{'hit1': 8.019704433497537, 'mrr1': 8.019704433497537, 'hit5': 13.602627257799671, 'mrr5': 10.120799124247398, 'hit10': 15.49096880131363, 'mrr10': 10.377290379753433, 'hit20': 17.136288998357962, 'mrr20': 10.491275465254022}
train_loss: 72292.6406 Recall@1: 8.0197 MRR1: 8.0197 Epoch: 1, 1
train_loss: 72292.6406 Recall@5: 13.6026 MRR5: 10.1208 Epoch: 1, 1
train_loss: 72292.6406 Recall@10: 15.4910 MRR10: 10.3773 Epoch: 1, 1
train_loss: 72292.6406 Recall@20: 17.1363 MRR20: 10.4913 Epoch: 1, 1

And this doesn't only happen on the Diginetica dataset; the same situation occurs on the Tmall dataset. The difference is that a good result is reached in the first epoch, although Recall@20 is smaller than reported in the paper while MRR@20 is greater, which is quite strange.
epoch: 0
start training: 2023-05-31 15:49:33.102401
100%|███████████████████████████████████████████████████████████| 3513/3513 [06:25<00:00, 9.12it/s]
train time: 385.20891666412354
Loss: 33937.809
start predicting: 2023-05-31 15:55:58.311425
100%|█████████████████████████████████████████████████████████████| 259/259 [00:37<00:00, 6.87it/s]
test time: 37.72077751159668
{'hit1': 12.737451737451739, 'mrr1': 12.737451737451739, 'hit5': 20.027027027027028, 'mrr5': 15.6011583011583, 'hit10': 22.27027027027027, 'mrr10': 15.904483054483054, 'hit20': 24.328185328185327, 'mrr20': 16.046886213112437}
train_loss: 33937.8086 Recall@1: 12.7375 MRR1: 12.7375 Epoch: 0, 0
train_loss: 33937.8086 Recall@5: 20.0270 MRR5: 15.6012 Epoch: 0, 0
train_loss: 33937.8086 Recall@10: 22.2703 MRR10: 15.9045 Epoch: 0, 0
train_loss: 33937.8086 Recall@20: 24.3282 MRR20: 16.0469 Epoch: 0, 0
epoch: 1
start training: 2023-05-31 15:56:36.095035
100%|███████████████████████████████████████████████████████████| 3513/3513 [06:52<00:00, 8.52it/s]
train time: 412.4000954627991
Loss: 33636.328
start predicting: 2023-05-31 16:03:28.495259
100%|█████████████████████████████████████████████████████████████| 259/259 [00:37<00:00, 6.97it/s]
test time: 37.182899713516235
{'hit1': 12.741312741312742, 'mrr1': 12.741312741312742, 'hit5': 19.996138996138995, 'mrr5': 15.604054054054053, 'hit10': 22.235521235521237, 'mrr10': 15.907994729423303, 'hit20': 24.362934362934364, 'mrr20': 16.05535687916272}
train_loss: 33636.3281 Recall@1: 12.7413 MRR1: 12.7413 Epoch: 1, 1
train_loss: 33636.3281 Recall@5: 20.0270 MRR5: 15.6041 Epoch: 0, 1
train_loss: 33636.3281 Recall@10: 22.2703 MRR10: 15.9080 Epoch: 0, 1
train_loss: 33636.3281 Recall@20: 24.3629 MRR20: 16.0554 Epoch: 1, 1

By the way, my numpy version is 1.22.1 and my pytorch version is 1.10.1.