Not saving ckpt.tar.gz checkpoint
IsauraMaria96 opened this issue · 8 comments
Hi,
Thanks for the great tool. I recently installed CellBender on an Ubuntu server, and I've been having a problem in which the ckpt checkpoint is not saved, so the tool is unable to complete the process. Has anyone else had this problem? Thanks a lot.
Full log is attached: Error.log
System description:
- Model: Dell Inc. Precision 5860.
- RAM: 128.0 GiB.
- CPU: Intel Xeon w5-2465X x32
- OS: Ubuntu 22.04.4 LTS
Log:
cellbender:remove-background: Command:
cellbender remove-background --cuda --input /home/neurofisiologia/SRR19792156/outs/raw_feature_bc_matrix.h5 --output /home/neurofisiologia/DatosRefinados.h5
cellbender:remove-background: CellBender 0.3.2
cellbender:remove-background: (Workflow hash 346ca8efb8)
cellbender:remove-background: 2024-07-03 09:51:30
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from /home/neurofisiologia/SRR19792156/outs/raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Features in dataset: 38606 Gene Expression
cellbender:remove-background: Trimming features for inference.
cellbender:remove-background: 33572 features have nonzero counts.
cellbender:remove-background: Prior on counts for cells is 3741
cellbender:remove-background: Prior on counts for empty droplets is 295
cellbender:remove-background: Excluding 8942 features that are estimated to have <= 0.1 background counts in cells.
cellbender:remove-background: Including 24630 features in the analysis.
cellbender:remove-background: Trimming barcodes for inference.
cellbender:remove-background: Excluding barcodes with counts below 147
cellbender:remove-background: Using 2575 probable cell barcodes, plus an additional 10272 barcodes, and 71346 empty droplets.
cellbender:remove-background: Largest surely-empty droplet has 343 UMI counts.
cellbender:remove-background: Attempting to unpack tarball "ckpt.tar.gz" to /tmp/tmprc18nole
cellbender:remove-background: No saved checkpoint.
cellbender:remove-background: No checkpoint loaded.
cellbender:remove-background: Running inference...
cellbender:remove-background: [epoch 001] average training loss: 6661.9639
cellbender:remove-background: [epoch 002] average training loss: 6034.6147 (3.7 seconds per epoch)
cellbender:remove-background: Will checkpoint every 114 epochs
cellbender:remove-background: [epoch 003] average training loss: 5427.5748
cellbender:remove-background: [epoch 004] average training loss: 5108.2541
cellbender:remove-background: [epoch 005] average training loss: 4923.2049
cellbender:remove-background: [epoch 005] average test loss: 4961.6450
cellbender:remove-background: [epoch 006] average training loss: 4714.6100
cellbender:remove-background: [epoch 007] average training loss: 4658.0975
cellbender:remove-background: [epoch 008] average training loss: 4686.5494
cellbender:remove-background: [epoch 009] average training loss: 4636.7682
cellbender:remove-background: [epoch 010] average training loss: 4599.6784
cellbender:remove-background: [epoch 010] average test loss: 4674.5524
cellbender:remove-background: [epoch 011] average training loss: 4629.7057
cellbender:remove-background: [epoch 012] average training loss: 4552.1350
cellbender:remove-background: [epoch 013] average training loss: 4496.8647
cellbender:remove-background: [epoch 014] average training loss: 4308.9377
cellbender:remove-background: [epoch 015] average training loss: 4275.1747
cellbender:remove-background: [epoch 015] average test loss: 4324.3197
cellbender:remove-background: [epoch 016] average training loss: 4261.2428
cellbender:remove-background: [epoch 017] average training loss: 4251.0613
cellbender:remove-background: [epoch 018] average training loss: 4228.1749
cellbender:remove-background: [epoch 019] average training loss: 4206.0814
cellbender:remove-background: [epoch 020] average training loss: 4197.3849
cellbender:remove-background: [epoch 020] average test loss: 4191.5520
cellbender:remove-background: [epoch 021] average training loss: 4190.3577
cellbender:remove-background: [epoch 022] average training loss: 4154.5904
cellbender:remove-background: [epoch 023] average training loss: 4119.1000
cellbender:remove-background: [epoch 024] average training loss: 4101.0069
cellbender:remove-background: [epoch 025] average training loss: 4077.4471
cellbender:remove-background: [epoch 025] average test loss: 4076.8579
cellbender:remove-background: [epoch 026] average training loss: 4079.1548
cellbender:remove-background: [epoch 027] average training loss: 4060.0420
cellbender:remove-background: [epoch 028] average training loss: 4041.2950
cellbender:remove-background: [epoch 029] average training loss: 4023.0368
cellbender:remove-background: [epoch 030] average training loss: 4001.7430
cellbender:remove-background: [epoch 030] average test loss: 3975.9369
cellbender:remove-background: [epoch 031] average training loss: 3994.5689
cellbender:remove-background: [epoch 032] average training loss: 3992.0950
cellbender:remove-background: [epoch 033] average training loss: 3986.7607
cellbender:remove-background: [epoch 034] average training loss: 3997.4167
cellbender:remove-background: [epoch 035] average training loss: 3991.3141
cellbender:remove-background: [epoch 035] average test loss: 3993.9262
cellbender:remove-background: [epoch 036] average training loss: 3998.2393
cellbender:remove-background: [epoch 037] average training loss: 3989.8854
cellbender:remove-background: [epoch 038] average training loss: 3982.2416
cellbender:remove-background: [epoch 039] average training loss: 3980.3234
cellbender:remove-background: [epoch 040] average training loss: 3984.4739
cellbender:remove-background: [epoch 040] average test loss: 3973.4658
cellbender:remove-background: [epoch 041] average training loss: 3974.9065
cellbender:remove-background: [epoch 042] average training loss: 3984.2641
cellbender:remove-background: [epoch 043] average training loss: 3975.1879
cellbender:remove-background: [epoch 044] average training loss: 3971.4374
cellbender:remove-background: [epoch 045] average training loss: 3974.2532
cellbender:remove-background: [epoch 045] average test loss: 3950.0547
cellbender:remove-background: [epoch 046] average training loss: 3970.9828
cellbender:remove-background: [epoch 047] average training loss: 3964.1729
cellbender:remove-background: [epoch 048] average training loss: 3962.0764
cellbender:remove-background: [epoch 049] average training loss: 3971.4048
cellbender:remove-background: [epoch 050] average training loss: 3970.0651
cellbender:remove-background: [epoch 050] average test loss: 3958.7704
cellbender:remove-background: [epoch 051] average training loss: 3973.9497
cellbender:remove-background: [epoch 052] average training loss: 3970.4156
cellbender:remove-background: [epoch 053] average training loss: 3965.1261
cellbender:remove-background: [epoch 054] average training loss: 3975.3828
cellbender:remove-background: [epoch 055] average training loss: 3969.5423
cellbender:remove-background: [epoch 055] average test loss: 3932.6834
cellbender:remove-background: [epoch 056] average training loss: 3964.7342
cellbender:remove-background: [epoch 057] average training loss: 3967.4058
cellbender:remove-background: [epoch 058] average training loss: 3971.9959
cellbender:remove-background: [epoch 059] average training loss: 3960.5551
cellbender:remove-background: [epoch 060] average training loss: 3964.4331
cellbender:remove-background: [epoch 060] average test loss: 3967.9076
cellbender:remove-background: [epoch 061] average training loss: 3965.4153
cellbender:remove-background: [epoch 062] average training loss: 3962.5914
cellbender:remove-background: [epoch 063] average training loss: 3965.0319
cellbender:remove-background: [epoch 064] average training loss: 3965.6907
cellbender:remove-background: [epoch 065] average training loss: 3960.0795
cellbender:remove-background: [epoch 065] average test loss: 3945.7927
cellbender:remove-background: [epoch 066] average training loss: 3964.4541
cellbender:remove-background: [epoch 067] average training loss: 3968.9065
cellbender:remove-background: [epoch 068] average training loss: 3958.4191
cellbender:remove-background: [epoch 069] average training loss: 3963.3575
cellbender:remove-background: [epoch 070] average training loss: 3954.3709
cellbender:remove-background: [epoch 070] average test loss: 4007.2453
cellbender:remove-background: [epoch 071] average training loss: 3958.2268
cellbender:remove-background: [epoch 072] average training loss: 3961.9567
cellbender:remove-background: [epoch 073] average training loss: 3968.9788
cellbender:remove-background: [epoch 074] average training loss: 3962.2250
cellbender:remove-background: [epoch 075] average training loss: 3967.0552
cellbender:remove-background: [epoch 075] average test loss: 3997.2249
cellbender:remove-background: [epoch 076] average training loss: 3955.0682
cellbender:remove-background: [epoch 077] average training loss: 3960.1321
cellbender:remove-background: [epoch 078] average training loss: 3966.0317
cellbender:remove-background: [epoch 079] average training loss: 3953.0031
cellbender:remove-background: [epoch 080] average training loss: 3957.0243
cellbender:remove-background: [epoch 080] average test loss: 4002.7144
cellbender:remove-background: [epoch 081] average training loss: 3963.4742
cellbender:remove-background: [epoch 082] average training loss: 3964.5696
cellbender:remove-background: [epoch 083] average training loss: 3967.0997
cellbender:remove-background: [epoch 084] average training loss: 3967.0555
cellbender:remove-background: [epoch 085] average training loss: 3969.6566
cellbender:remove-background: [epoch 085] average test loss: 4005.2764
cellbender:remove-background: [epoch 086] average training loss: 3979.3970
cellbender:remove-background: [epoch 087] average training loss: 3971.2706
cellbender:remove-background: [epoch 088] average training loss: 3979.9692
cellbender:remove-background: [epoch 089] average training loss: 3991.1880
cellbender:remove-background: [epoch 090] average training loss: 3984.4977
cellbender:remove-background: [epoch 090] average test loss: 4012.9531
cellbender:remove-background: [epoch 091] average training loss: 3979.9608
cellbender:remove-background: [epoch 092] average training loss: 3980.1110
cellbender:remove-background: [epoch 093] average training loss: 3990.6269
cellbender:remove-background: [epoch 094] average training loss: 3987.8105
cellbender:remove-background: [epoch 095] average training loss: 4003.3267
cellbender:remove-background: [epoch 095] average test loss: 4026.3201
cellbender:remove-background: [epoch 096] average training loss: 4011.9168
cellbender:remove-background: [epoch 097] average training loss: 4001.7220
cellbender:remove-background: [epoch 098] average training loss: 4002.4815
cellbender:remove-background: [epoch 099] average training loss: 4014.8439
cellbender:remove-background: [epoch 100] average training loss: 4009.8107
cellbender:remove-background: [epoch 100] average test loss: 4034.6981
cellbender:remove-background: [epoch 101] average training loss: 4001.8132
cellbender:remove-background: [epoch 102] average training loss: 4000.4273
cellbender:remove-background: [epoch 103] average training loss: 4000.4040
cellbender:remove-background: [epoch 104] average training loss: 3996.6345
cellbender:remove-background: [epoch 105] average training loss: 4007.3502
cellbender:remove-background: [epoch 105] average test loss: 4046.1299
cellbender:remove-background: [epoch 106] average training loss: 3994.2900
cellbender:remove-background: [epoch 107] average training loss: 4018.2631
cellbender:remove-background: [epoch 108] average training loss: 3995.7133
cellbender:remove-background: [epoch 109] average training loss: 3984.8872
cellbender:remove-background: [epoch 110] average training loss: 4008.2703
cellbender:remove-background: [epoch 110] average test loss: 4043.1757
cellbender:remove-background: [epoch 111] average training loss: 4017.7784
cellbender:remove-background: [epoch 112] average training loss: 4017.0501
cellbender:remove-background: [epoch 113] average training loss: 4021.3158
cellbender:remove-background: [epoch 114] average training loss: 3994.4110
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
File "/home/neurofisiologia/CellBender/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
torch.save(model_obj, filebase + '_model.torch')
File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 628, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 840, in _save
pickler.dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object
cellbender:remove-background: [epoch 115] average training loss: 4016.8244
cellbender:remove-background: [epoch 115] average test loss: 4036.3020
cellbender:remove-background: [epoch 116] average training loss: 4017.2557
cellbender:remove-background: [epoch 117] average training loss: 3996.7196
cellbender:remove-background: [epoch 118] average training loss: 4004.9664
cellbender:remove-background: [epoch 119] average training loss: 4022.4710
cellbender:remove-background: [epoch 120] average training loss: 4019.5331
cellbender:remove-background: [epoch 120] average test loss: 4067.2432
cellbender:remove-background: [epoch 121] average training loss: 4008.7457
cellbender:remove-background: [epoch 122] average training loss: 4001.0307
cellbender:remove-background: [epoch 123] average training loss: 3998.2867
cellbender:remove-background: [epoch 124] average training loss: 4001.8232
cellbender:remove-background: [epoch 125] average training loss: 4055.3543
cellbender:remove-background: [epoch 125] average test loss: 4058.0449
cellbender:remove-background: [epoch 126] average training loss: 4003.1687
cellbender:remove-background: [epoch 127] average training loss: 4017.3536
cellbender:remove-background: [epoch 128] average training loss: 4019.2687
cellbender:remove-background: [epoch 129] average training loss: 4028.9802
cellbender:remove-background: [epoch 130] average training loss: 4018.2229
cellbender:remove-background: [epoch 130] average test loss: 4026.8101
cellbender:remove-background: [epoch 131] average training loss: 4018.8546
cellbender:remove-background: [epoch 132] average training loss: 4002.1382
cellbender:remove-background: [epoch 133] average training loss: 4011.3291
cellbender:remove-background: [epoch 134] average training loss: 4009.5174
cellbender:remove-background: [epoch 135] average training loss: 3999.1352
cellbender:remove-background: [epoch 135] average test loss: 4015.5564
cellbender:remove-background: [epoch 136] average training loss: 3996.2076
cellbender:remove-background: [epoch 137] average training loss: 3995.8721
cellbender:remove-background: [epoch 138] average training loss: 4017.0538
cellbender:remove-background: [epoch 139] average training loss: 4017.7493
cellbender:remove-background: [epoch 140] average training loss: 3998.2958
cellbender:remove-background: [epoch 140] average test loss: 4049.0232
cellbender:remove-background: [epoch 141] average training loss: 3991.3952
cellbender:remove-background: [epoch 142] average training loss: 4022.6591
cellbender:remove-background: [epoch 143] average training loss: 3992.5597
cellbender:remove-background: [epoch 144] average training loss: 4008.8651
cellbender:remove-background: [epoch 145] average training loss: 3992.5097
cellbender:remove-background: [epoch 145] average test loss: 4121.4365
cellbender:remove-background: [epoch 146] average training loss: 4005.6093
cellbender:remove-background: [epoch 147] average training loss: 4021.3828
cellbender:remove-background: [epoch 148] average training loss: 3995.0772
cellbender:remove-background: [epoch 149] average training loss: 3985.9057
cellbender:remove-background: [epoch 150] average training loss: 4004.1677
cellbender:remove-background: [epoch 150] average test loss: 4030.2060
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
File "/home/neurofisiologia/CellBender/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
torch.save(model_obj, filebase + '_model.torch')
File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 628, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
File "/home/neurofisiologia/anaconda3/envs/cellbender/lib/python3.11/site-packages/torch/serialization.py", line 840, in _save
pickler.dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object
cellbender:remove-background: 2024-07-03 10:01:02
cellbender:remove-background: Inference procedure complete.
Same problem here; I am not able to save checkpoints:
Traceback (most recent call last):
File "/home2/s225139/.conda/envs/CellBender/lib/python3.8/site-packages/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
torch.save(model_obj, filebase + '_model.torch')
File "/home2/s225139/.conda/envs/CellBender/lib/python3.8/site-packages/torch/serialization.py", line 628, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
File "/home2/s225139/.conda/envs/CellBender/lib/python3.8/site-packages/torch/serialization.py", line 840, in _save
pickler.dump(obj)
TypeError: cannot pickle 'weakref' object
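For anyone trying to understand the traceback: the error itself can be reproduced without CellBender or torch. The sketch below is only an illustration of this class of failure, not CellBender's code and not a confirmed diagnosis of which object inside the model holds the weak reference. The names `state`, `target` and `try_pickle` are made up for the example. Note that the exact wording of the message varies between Python versions, as the two tracebacks in this thread show.

```python
import pickle
import weakref


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the TypeError message."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


# Hypothetical stand-in for a model object: plain data plus a weak reference.
target = {1, 2, 3}  # sets can be weakly referenced
state = {"weights": [0.1, 0.2], "owner": weakref.ref(target)}

# Pickling fails as soon as the pickler reaches the weak reference.
error_message = try_pickle(state)
print(error_message)

# Dropping the weak reference and keeping only plain data pickles fine.
plain_state = {
    key: value
    for key, value in state.items()
    if not isinstance(value, weakref.ReferenceType)
}
ok_message = try_pickle(plain_state)
print(ok_message)
```

Since `torch.save` pickles the object it is given (the traceback ends in `pickler.dump(obj)`), any weak reference reachable from `model_obj` would trigger the same `TypeError`.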
No solution, but I am encountering the same error. I have tested on v0.3.2, v0.3.0 and v0.2.2. Version 0.2.2 produces expected outputs, while the more recent versions produce the errors seen above.
I am also experiencing the same issue.
Same error.
I got it working; it's a version issue. I am using Python 3.7.12, CellBender 0.3.0, and torch 1.13.1.
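If it helps, here is a minimal sketch for comparing an environment against that combination. It is not part of CellBender. The versions in `REPORTED_WORKING` are simply the ones reported in this thread, not an official requirement, and the helper names (`installed_version`, `base_version`, `find_mismatches`) are made up for the example.

```python
import sys

# Combination reported as working in this thread (not an official requirement).
REPORTED_WORKING = {"python": "3.7.12", "cellbender": "0.3.0", "torch": "1.13.1"}


def installed_version(package):
    """Return the installed version string of a package, or None if missing."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:  # Python 3.7: fall back to setuptools' pkg_resources
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def base_version(version_string):
    """Strip a local build suffix such as '+cu117'; pass None through."""
    return None if version_string is None else version_string.split("+")[0]


def find_mismatches(installed, expected):
    """List (name, found, wanted) for every entry that differs from expected."""
    return [
        (name, installed.get(name), wanted)
        for name, wanted in expected.items()
        if base_version(installed.get(name)) != wanted
    ]


installed = {
    "python": "{}.{}.{}".format(*sys.version_info[:3]),
    "cellbender": installed_version("cellbender"),
    "torch": installed_version("torch"),
}
for name, found, wanted in find_mismatches(installed, REPORTED_WORKING):
    print(f"{name}: found {found}, this thread reports {wanted} as working")
```

Running it inside the conda environment prints one line per package that differs from the reported combination, and nothing if everything matches.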
Thank you, I will try it.
This combination works for me. Thanks!
Does anyone who got it working mind sharing what scipy version they are using? After using these three at the versions listed, I get the error 'ValueError: row index exceeds matrix dimensions', which I'm hoping will be a quick fix once I switch to the correct scipy version. Thanks!