aigc-apps/AMFormer

Loss and accuracy barely change during training

Closed this issue · 10 comments

Hello author, this is an excerpt of my results. They barely change from start to finish. Could you tell me what might be wrong?
Training - Step: 8720 - Acc: 0.9219 - GPU: 7
Training - Step: 8740 - Acc: 0.9297 - GPU: 7
Training - Step: 8760 - Acc: 0.9375 - GPU: 7
Training - Step: 8780 - Acc: 0.9141 - GPU: 7
Training - Step: 8800 - Acc: 0.9062 - GPU: 7
============Begin Validation============:step:8800
Valid - Step: 8800
Loss: 0.0024
{'loss': {Tensor:()} tensor(0.0024, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9189, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5053, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9189, device='cuda:0')}
Training - Step: 8820 - Acc: 0.8984 - GPU: 7
Training - Step: 8840 - Acc: 0.9531 - GPU: 7
Training - Step: 8860 - Acc: 0.9219 - GPU: 7
Training - Step: 8880 - Acc: 0.9141 - GPU: 7
Training - Step: 8900 - Acc: 0.9375 - GPU: 7
============Begin Validation============:step:8900
Valid - Step: 8900
Loss: 0.0025
{'loss': {Tensor:()} tensor(0.0025, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9176, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5050, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9176, device='cuda:0')}
Training - Step: 8920 - Acc: 0.9141 - GPU: 7
Training - Step: 8940 - Acc: 0.9297 - GPU: 7
Training - Step: 8960 - Acc: 0.9141 - GPU: 7
Training - Step: 8980 - Acc: 0.8828 - GPU: 7
Training - Step: 9000 - Acc: 0.9141 - GPU: 7
============Begin Validation============:step:9000
Valid - Step: 9000
Loss: 0.0024
{'loss': {Tensor:()} tensor(0.0024, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9189, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5052, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9189, device='cuda:0')}
Training - Step: 9020 - Acc: 0.8906 - GPU: 7

Which metric exactly do you mean by "not changing"?
From the looks of it, the optimization has already settled into a local minimum.

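I followed your code and downloaded application_train.csv for the Credit dataset from Kaggle to train on. During training the loss never changed and the AUC showed no upward trend, as if the model had converged from the very start. The paper reports an AUC of 0.75 on the HC dataset, but I only reach 0.5 here, and I can't tell where the problem is.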
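
One way to read these numbers, offered as an interpretation rather than a diagnosis: accuracy pinned near 0.92 together with an AUC near 0.5 is what you get when a model gives every row the same score on a heavily imbalanced dataset. The sketch below shows this on toy data. The 8% positive rate and the constant score of 0.1 are assumptions chosen for illustration, and both helper functions are written here from scratch, not taken from the AMFormer code.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """AUC as P(random positive scores above random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy labels: 8 positives out of 100 rows.
labels = [1] * 8 + [0] * 92
# A model that ignores its input and emits one low score for every row.
constant_scores = [0.1] * 100

print(accuracy(labels, constant_scores))  # 0.92
print(roc_auc(labels, constant_scores))   # 0.5
```

Accuracy alone cannot distinguish such a model from a useful one here, which is why the AUC column is the more informative one to watch in these logs.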

I see the same problem when I apply this model to my own dataset for regression.

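I ran it locally, and the AUC reached about 0.74 at around 1000 steps. Could the problem be that you trained for too many steps? Could you share the results from the first few epochs?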

$ CUDA_VISIBLE_DEVICES="6" python main.py --config config/run/ours_fttrans-hcdr.yaml
================ Loading Config ================
================ ver2 ================
================ cost 1.576 seconds ================
================ train set loaded ================
================ cost 1.361 seconds ================
================ valid set loaded ================
LR = 0.001
Epoch: 0 ----- step:0 - train_epoch size:783
Training - Step: 0 - Acc: 0.0703 - GPU: 7
Training - Step: 20 - Acc: 0.9102 - GPU: 7
Training - Step: 40 - Acc: 0.9180 - GPU: 7
Training - Step: 60 - Acc: 0.9570 - GPU: 7
Training - Step: 80 - Acc: 0.9258 - GPU: 7
Training - Step: 100 - Acc: 0.9453 - GPU: 7
============Begin Validation============:step:100
Valid - Step: 100
Loss: 0.0011
{'loss': {Tensor:()} tensor(0.0011, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5585, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
best_loss Model Saved
Training - Step: 120 - Acc: 0.9102 - GPU: 7
Training - Step: 140 - Acc: 0.9180 - GPU: 7
Training - Step: 160 - Acc: 0.9375 - GPU: 7
Training - Step: 180 - Acc: 0.8711 - GPU: 7
Training - Step: 200 - Acc: 0.9141 - GPU: 7
============Begin Validation============:step:200
Valid - Step: 200
Loss: 0.0011
{'loss': {Tensor:()} tensor(0.0011, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.6331, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
best_loss Model Saved
Training - Step: 220 - Acc: 0.8984 - GPU: 7
Training - Step: 240 - Acc: 0.9297 - GPU: 7
Training - Step: 260 - Acc: 0.9141 - GPU: 7
Training - Step: 280 - Acc: 0.9180 - GPU: 7
Training - Step: 300 - Acc: 0.9102 - GPU: 7
============Begin Validation============:step:300
Valid - Step: 300
Loss: 0.0010
{'loss': {Tensor:()} tensor(0.0010, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.7270, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
best_loss Model Saved
Training - Step: 320 - Acc: 0.9219 - GPU: 7
Training - Step: 340 - Acc: 0.8945 - GPU: 7
Training - Step: 360 - Acc: 0.9023 - GPU: 7
Training - Step: 380 - Acc: 0.9180 - GPU: 7
Training - Step: 400 - Acc: 0.9219 - GPU: 7
============Begin Validation============:step:400
Valid - Step: 400
Loss: 0.0010
{'loss': {Tensor:()} tensor(0.0010, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.7336, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
best_loss Model Saved
Training - Step: 420 - Acc: 0.8906 - GPU: 7
Training - Step: 440 - Acc: 0.9297 - GPU: 7
Training - Step: 460 - Acc: 0.9336 - GPU: 7
Training - Step: 480 - Acc: 0.9180 - GPU: 7
Training - Step: 500 - Acc: 0.9219 - GPU: 7
============Begin Validation============:step:500
Valid - Step: 500
Loss: 0.0010
{'loss': {Tensor:()} tensor(0.0010, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.7335, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
best_loss Model Saved
Training - Step: 520 - Acc: 0.9219 - GPU: 7
Training - Step: 540 - Acc: 0.9219 - GPU: 7
Training - Step: 560 - Acc: 0.9180 - GPU: 7
Training - Step: 580 - Acc: 0.8750 - GPU: 7
Training - Step: 600 - Acc: 0.9219 - GPU: 7
============Begin Validation============:step:600
Valid - Step: 600
Loss: 0.0010
{'loss': {Tensor:()} tensor(0.0010, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.7372, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
Training - Step: 620 - Acc: 0.8984 - GPU: 7
Training - Step: 640 - Acc: 0.9180 - GPU: 7
Training - Step: 660 - Acc: 0.9219 - GPU: 7
Training - Step: 680 - Acc: 0.9258 - GPU: 7
Training - Step: 700 - Acc: 0.9141 - GPU: 7
============Begin Validation============:step:700
Valid - Step: 700
Loss: 0.0010
{'loss': {Tensor:()} tensor(0.0010, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.7433, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
best_loss Model Saved
Training - Step: 720 - Acc: 0.9023 - GPU: 7
Training - Step: 740 - Acc: 0.9492 - GPU: 7
Training - Step: 760 - Acc: 0.9219 - GPU: 7
Training - Step: 780 - Acc: 0.9180 - GPU: 7
Epoch: 1 ----- step:783 - train_epoch size:783
Training - Step: 800 - Acc: 0.9141 - GPU: 7
============Begin Validation============:step:800
Valid - Step: 800
Loss: 0.0010
{'loss': {Tensor:()} tensor(0.0010, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.7388, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
Training - Step: 820 - Acc: 0.9453 - GPU: 7
Training - Step: 840 - Acc: 0.9336 - GPU: 7
Training - Step: 860 - Acc: 0.9180 - GPU: 7
Training - Step: 880 - Acc: 0.9219 - GPU: 7
Training - Step: 900 - Acc: 0.8789 - GPU: 7

These are the results from my first 1000 steps. The AUC does not go up.
================ Loading Config ================
True
================ ver6 ================
================ cost 6.450 seconds ================
================ train set loaded ================
================ cost 6.358 seconds ================
================ valid set loaded ================
LR = 0.001
Epoch: 0 ----- step:0 - train_epoch size:1566
Training - Step: 0 - Acc: 0.0703 - GPU: 7
Training - Step: 20 - Acc: 0.4062 - GPU: 7
Training - Step: 40 - Acc: 0.9062 - GPU: 7
Training - Step: 60 - Acc: 0.9219 - GPU: 7
Training - Step: 80 - Acc: 0.9766 - GPU: 7
Training - Step: 100 - Acc: 0.9453 - GPU: 7
============Begin Validation============:step:100
Valid - Step: 100
Loss: 0.0022
{'loss': {Tensor:()} tensor(0.0022, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.4955, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
best_loss Model Saved
Training - Step: 120 - Acc: 0.9375 - GPU: 7
Training - Step: 140 - Acc: 0.9453 - GPU: 7
Training - Step: 160 - Acc: 0.9453 - GPU: 7
Training - Step: 180 - Acc: 0.9531 - GPU: 7
Training - Step: 200 - Acc: 0.9375 - GPU: 7
============Begin Validation============:step:200
Valid - Step: 200
Loss: 0.0023
{'loss': {Tensor:()} tensor(0.0023, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5009, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
Training - Step: 220 - Acc: 0.9453 - GPU: 7
Training - Step: 240 - Acc: 0.9141 - GPU: 7
Training - Step: 260 - Acc: 0.8984 - GPU: 7
Training - Step: 280 - Acc: 0.9141 - GPU: 7
Training - Step: 300 - Acc: 0.8984 - GPU: 7
============Begin Validation============:step:300
Valid - Step: 300
Loss: 0.0024
{'loss': {Tensor:()} tensor(0.0024, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5063, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
Training - Step: 320 - Acc: 0.9531 - GPU: 7
Training - Step: 340 - Acc: 0.9062 - GPU: 7
Training - Step: 360 - Acc: 0.8984 - GPU: 7
Training - Step: 380 - Acc: 0.9141 - GPU: 7
Training - Step: 400 - Acc: 0.9609 - GPU: 7
============Begin Validation============:step:400
Valid - Step: 400
Loss: 0.0023
{'loss': {Tensor:()} tensor(0.0023, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5065, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
Training - Step: 420 - Acc: 0.8984 - GPU: 7
Training - Step: 440 - Acc: 0.8828 - GPU: 7
Training - Step: 460 - Acc: 0.9219 - GPU: 7
Training - Step: 480 - Acc: 0.9531 - GPU: 7
Training - Step: 500 - Acc: 0.9219 - GPU: 7
============Begin Validation============:step:500
Valid - Step: 500
Loss: 0.0024
{'loss': {Tensor:()} tensor(0.0024, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5058, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
Training - Step: 520 - Acc: 0.9141 - GPU: 7
Training - Step: 540 - Acc: 0.9141 - GPU: 7
Training - Step: 560 - Acc: 0.8750 - GPU: 7
Training - Step: 580 - Acc: 0.9453 - GPU: 7
Training - Step: 600 - Acc: 0.9531 - GPU: 7
============Begin Validation============:step:600
Valid - Step: 600
Loss: 0.0024
{'loss': {Tensor:()} tensor(0.0024, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5055, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
Training - Step: 620 - Acc: 0.9219 - GPU: 7
Training - Step: 640 - Acc: 0.8906 - GPU: 7
Training - Step: 660 - Acc: 0.9375 - GPU: 7
Training - Step: 680 - Acc: 0.8750 - GPU: 7
Training - Step: 700 - Acc: 0.9531 - GPU: 7
============Begin Validation============:step:700
Valid - Step: 700
Loss: 0.0026
{'loss': {Tensor:()} tensor(0.0026, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9150, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5077, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9150, device='cuda:0')}
Training - Step: 720 - Acc: 0.9453 - GPU: 7
Training - Step: 740 - Acc: 0.9219 - GPU: 7
Training - Step: 760 - Acc: 0.9375 - GPU: 7
Training - Step: 780 - Acc: 0.9141 - GPU: 7
Training - Step: 800 - Acc: 0.8984 - GPU: 7
============Begin Validation============:step:800
Valid - Step: 800
Loss: 0.0023
{'loss': {Tensor:()} tensor(0.0023, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5058, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
Training - Step: 820 - Acc: 0.8672 - GPU: 7
Training - Step: 840 - Acc: 0.9141 - GPU: 7
Training - Step: 860 - Acc: 0.9219 - GPU: 7
Training - Step: 880 - Acc: 0.9062 - GPU: 7
Training - Step: 900 - Acc: 0.9453 - GPU: 7
============Begin Validation============:step:900
Valid - Step: 900
Loss: 0.0023
{'loss': {Tensor:()} tensor(0.0023, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5061, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
Training - Step: 920 - Acc: 0.9297 - GPU: 7
Training - Step: 940 - Acc: 0.9141 - GPU: 7
Training - Step: 960 - Acc: 0.8984 - GPU: 7
Training - Step: 980 - Acc: 0.9297 - GPU: 7
Training - Step: 1000 - Acc: 0.9531 - GPU: 7
============Begin Validation============:step:1000
Valid - Step: 1000
Loss: 0.0024
{'loss': {Tensor:()} tensor(0.0024, device='cuda:0'), 'acc': {Tensor:()} tensor(0.9195, device='cuda:0'), 'auc': {Tensor:()} tensor(0.5061, device='cuda:0'), 'mse': {Tensor:()} tensor(0.9195, device='cuda:0')}
Training - Step: 1020 - Acc: 0.9141 - GPU: 7
Training - Step: 1040 - Acc: 0.8984 - GPU: 7
Training - Step: 1060 - Acc: 0.9062 - GPU: 7
Training - Step: 1080 - Acc: 0.9141 - GPU: 7
Training - Step: 1100 - Acc: 0.8906 - GPU: 7
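
A side observation on the two logs, hedged because it rests on assumptions: the author's run reports `train_epoch size:783` while this run reports `train_epoch size:1566`, so the two runs may not share the same batch size. The sketch below assumes that each logged `Acc` is a per-batch fraction k/b printed to four decimals, and checks which power-of-two batch sizes b are consistent with the first few values in each log. The function name and the candidate list are made up for this illustration.

```python
def consistent_batch_sizes(accs, candidates=(32, 64, 128, 256, 512)):
    """Return candidate batch sizes b for which every logged accuracy
    could be some k/b rounded to 4 decimals."""
    tol = 5.1e-5  # half of the last printed digit, plus slack for float error
    return [b for b in candidates
            if all(abs(round(a * b) / b - a) <= tol for a in accs)]


# First training accuracies copied from the two logs above.
author_accs = [0.0703, 0.9102, 0.9180, 0.9570, 0.9258, 0.9453]
user_accs = [0.0703, 0.4062, 0.9062, 0.9219, 0.9766, 0.9453]

# Smallest batch size consistent with each log (an inference, not a logged value).
author_batch = min(consistent_batch_sizes(author_accs))
user_batch = min(consistent_batch_sizes(user_accs))

# Steps per epoch as printed in the logs ("train_epoch size").
author_rows = 783 * author_batch
user_rows = 1566 * user_batch

print(author_batch, user_batch)  # 256 128
print(author_rows, user_rows)    # 200448 200448
```

Under these assumptions both runs cover the same number of rows per epoch, and the smallest consistent batch size is 256 for the author's log and 128 for this one. That would point to a configuration difference between the two runs, although it does not by itself explain the flat AUC.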

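Really sorry about that. The code I had cleaned up earlier contained a bug, which is now fixed. Please pull the latest code and test again.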

Hello, could you provide the environment configuration file?

Hello, how did you reproduce this code? When I run main directly, I get the following error:
Traceback (most recent call last):
  File "E:\PYTHON_PROGRAMME\tabular learn\AMFormer-main\main.py", line 77, in <module>
    main(args)
  File "E:\PYTHON_PROGRAMME\tabular learn\AMFormer-main\main.py", line 44, in main
    dataset = getattr(data_load, args.data_name.lower())
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'utils.data_load' has no attribute 'pretrain'
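It seems to be caused by data_load, but when I opened it, there appeared to be nothing in it. Before this, the error was about a missing Namespace, so I added the following to cfg.py:
[image]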

Following the author's instructions, I ran python main.py --config config/run/ours_fttrans-hcdr.yaml, but it still fails:
Traceback (most recent call last):
  File "E:\PYTHON_PROGRAMME\tabular learn\AMFormer-main\main.py", line 71, in <module>
    args = args.initialize()
           ^^^^^^^^^^^^^^^^^
  File "E:\PYTHON_PROGRAMME\tabular learn\AMFormer-main\config\cfg.py", line 66, in initialize
    config = self.load_base(derived_config, config)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\PYTHON_PROGRAMME\tabular learn\AMFormer-main\config\cfg.py", line 36, in load_base
    with open(derived_config, 'r', encoding='utf-8') as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not dict
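
For readers hitting the same message: this `TypeError` is what the built-in `open()` raises when it receives a dict instead of a file path, which suggests that `derived_config` was already a parsed config by the time `load_base` tried to open it. The sketch below only reproduces the error and shows a small guard that fails with a more descriptive message. `as_config_path` and the sample dict are hypothetical and are not part of AMFormer's `cfg.py`.

```python
import os


def as_config_path(value):
    """Return value unchanged if open() can accept it as a path;
    otherwise raise a TypeError with a more descriptive message."""
    if isinstance(value, (str, bytes, os.PathLike)):
        return value
    raise TypeError(
        f"expected a config file path, got {type(value).__name__}; "
        "was an already-parsed config passed where a path is expected?"
    )


# Reproduce the error from the traceback: open() rejects a dict.
# The dict contents are made up for illustration.
try:
    open({"lr": 0.001})
except TypeError as exc:
    builtin_message = str(exc)

print(builtin_message)
print(as_config_path("config/run/ours_fttrans-hcdr.yaml"))
```

A check like this at the top of a loader makes it easier to tell whether the caller or the loader is holding the wrong type.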

May I ask what problems you ran into when reproducing it? And how do I switch to my own dataset?

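I fixed the bug and pushed a new version of the code a while ago. Are you using the new code? The config I ran is the one provided in the repository. If getattr fails here, perhaps your working directory is set incorrectly? For a quick workaround, you can edit the line dataset = getattr(data_load, args.data_name.lower()) in main.py so that it does not use getattr.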
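
As a rough illustration of the lookup discussed here: `getattr(module, name)` raises `AttributeError` whenever the module has no attribute with that name, which is what happens when `args.data_name` resolves to `pretrain` but `utils.data_load` defines no `pretrain`. The sketch below uses a `SimpleNamespace` as a hypothetical stand-in for that module. The loader name `hcdr`, its return value, and the helper `resolve_dataset` are assumptions for this example only.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the `utils.data_load` module with one loader.
data_load = SimpleNamespace(hcdr=lambda: "hcdr dataset")


def resolve_dataset(module, data_name):
    """Look up a dataset loader by name and list the valid names on failure."""
    name = data_name.lower()
    loader = getattr(module, name, None)
    if loader is None:
        available = sorted(k for k in vars(module) if not k.startswith("_"))
        raise ValueError(
            f"unknown data_name {data_name!r}; available loaders: {available}"
        )
    return loader


print(resolve_dataset(data_load, "HCDR")())  # hcdr dataset

try:
    resolve_dataset(data_load, "pretrain")
except ValueError as exc:
    print(exc)
```

Listing the available names in the error makes it quicker to see whether the config supplied an unexpected `data_name` or the module was not the one you expected to import.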

OK, thank you. I'll take another look.