clinicalml/sc-foundation-eval

out of memory

Closed this issue · 5 comments

Hello, sorry to bother you, but when I run dist_pretrain.py I get a CUDA out-of-memory error right after the first epoch is printed, even though I set batch_size=1. I would really appreciate your help. Thanks for your time!
== Epoch: 1 | Training Loss: 0.013374 | Accuracy: 73.3304% ==
RuntimeError: CUDA out of memory. Tried to allocate 8.55 GiB (GPU 3; 31.75 GiB total capacity; 21.48 GiB already allocated; 5.57 GiB free; 24.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Hi, I'm surprised to hear you are running out of memory with batch size 1 on your GPU, as I was able to fit batch size 8 on an 80GB GPU for pretraining. Could you please copy and paste the exact function call you are using? Maybe I will notice something not right there.

Did you preprocess your data to 16906 genes, as the scBERT authors did?
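
If it helps, here is a quick sanity check (a sketch, assuming your data is stored as an .h5ad file and that anndata is installed; the file name is just a placeholder) to confirm the matrix already has the expected gene dimension:

```python
# Quick sanity check (not code from this repo): confirm the .h5ad used for
# pretraining already has the 16906-gene panel that scBERT expects.
# The file name below is a placeholder for your own data path.
import anndata as ad

adata = ad.read_h5ad("panglao_human.h5ad")
print(adata.shape)  # expected: (n_cells, 16906) after preprocessing

if adata.n_vars != 16906:
    print("Gene dimension mismatch: run the scBERT preprocessing step to "
          "align the counts to the 16906-gene reference before pretraining.")
```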

Wow, you are right, thank you. I used the scBERT panglao_human dataset, but I assumed it was already pre-processed. I will go preprocess it now and test again. Thanks again.

Hello! I'm sorry to bother you again. I have verified that the data has been pre-processed, and the error still occurs when the first epoch ends and the evaluation is computed. The script I ran is your dist_pretrain.py. My GPU is a 40G V100, and I used 6 GPUs with a world_size of 6 and a batch_size of 1. The complete error message is as follows. Thanks for your time!
64889it [15:35:35, 1.27it/s]
64889it [15:35:35, 1.16it/s]
64889it [15:35:35, 1.23it/s]
64889it [15:35:35, 1.16it/s]
64889it [15:35:36, 1.17it/s]
64889it [15:35:36, 1.16it/s]
== Epoch: 1 | Training Loss: 0.017446 | Accuracy: 58.1475% ==
Traceback (most recent call last):
File "/mnt/zzh/scbertgpt/scBERT/dist_pretrain_early.py", line 464, in
main()
File "/mnt/zzh/scbertgpt/scBERT/dist_pretrain_early.py", line 89, in main
mp.spawn(
File "/mnt/zzh/anaconda3/envs/rna-pretrain/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/mnt/zzh/anaconda3/envs/rna-pretrain/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/mnt/zzh/anaconda3/envs/rna-pretrain/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/mnt/zzh/anaconda3/envs/rna-pretrain/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/mnt/zzh/scbertgpt/scBERT/dist_pretrain_early.py", line 394, in distributed_pretrain
truths = distributed_concat(torch.cat(truths, dim=0), len(val_sampler.dataset), world_size)
File "/mnt/zzh/scbertgpt/scBERT/utils.py", line 211, in distributed_concat
torch.distributed.all_gather(output_tensors, tensor)
File "/mnt/zzh/anaconda3/envs/rna-pretrain/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: CUDA out of memory. Tried to allocate 7.75 GiB (GPU 0; 31.75 GiB total capacity; 18.17 GiB already allocated; 5.51 GiB free; 19.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

/mnt/zzh/anaconda3/envs/rna-pretrain/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 5 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

I'm not exactly sure what's going wrong here, but as a possible tip to point you in the right direction: the out-of-memory error is raised inside the call to "distributed_concat," which gathers the validation tensors from all of the GPUs. Maybe try using fewer GPUs (a smaller world size) as a debugging step and check whether it works? Then you can look into the distributed_concat function, which was written by the original scBERT authors, and see if you can find a solution. Best of luck! Let me know if I can help any more.
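
For reference, here is a rough sketch of the gather pattern that distributed_concat-style code typically follows, plus one lower-memory alternative. The function names are illustrative (they are not from this repo's utils.py), and reducing scalar metrics instead of gathering full prediction tensors is only a suggestion, not what dist_pretrain.py currently does:

```python
# Sketch of why an all_gather over validation predictions can run out of
# GPU memory, and one possible workaround. Illustrative names only.
import torch
import torch.distributed as dist

def gather_all_predictions(local_preds: torch.Tensor, world_size: int) -> torch.Tensor:
    # Typical distributed_concat pattern: every rank allocates world_size
    # full-size buffers on its own GPU before the gather, so peak memory
    # scales with world_size * (validation predictions per rank),
    # independent of batch_size.
    buffers = [torch.empty_like(local_preds) for _ in range(world_size)]
    dist.all_gather(buffers, local_preds)
    return torch.cat(buffers, dim=0)

def reduce_accuracy(correct: int, total: int, device: torch.device) -> float:
    # Lower-memory alternative: all_reduce scalar counts instead of
    # gathering full prediction tensors, so no rank ever holds the whole
    # validation set's predictions on GPU.
    stats = torch.tensor([correct, total], dtype=torch.float64, device=device)
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    return (stats[0] / stats[1]).item()
```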