loading checkpoints onto cpu machine
bardia01 opened this issue · 3 comments
Question:
When loading a trained model on a CPU-only machine, an error occurs in mase/machop/chop/tools/checkpoint_load.py because the checkpoint's tensors were saved on a CUDA device and torch.load tries to restore them there.
Commit hash: e98079f
Command to reproduce:
./ch search --accelerator cpu --config configs/examples/jsc_bardia_by_type.toml --load /content/mase/mase_output/jsc_bardia_e_50_b_128_l_001/software/training_ckpts/best.ckpt
Error log:
Traceback (most recent call last):
File "/content/mase/machop/./ch", line 6, in <module>
ChopCLI().run()
File "/content/mase/machop/chop/cli.py", line 270, in run
run_action_fn()
File "/content/mase/machop/chop/cli.py", line 395, in _run_search
search(**search_params)
File "/content/mase/machop/chop/actions/search/search.py", line 58, in search
model = load_model(load_name=load_name, load_type=load_type, model=model)
File "/content/mase/machop/chop/tools/checkpoint_load.py", line 84, in load_model
model = load_lightning_ckpt_to_unwrapped_model(
File "/content/mase/machop/chop/tools/checkpoint_load.py", line 15, in load_lightning_ckpt_to_unwrapped_model
src_state_dict = torch.load(checkpoint)["state_dict"]
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1014, in load
return _load(opened_zipfile,
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1422, in _load
result = unpickler.load()
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1392, in persistent_load
typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1366, in load_tensor
wrap_storage=restore_location(storage, location),
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 381, in default_restore_location
result = fn(storage, location)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 274, in _cuda_deserialize
device = validate_cuda_device(location)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 258, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Comments:
Please would you consider changing `src_state_dict = torch.load(checkpoint)["state_dict"]` to something like:

```python
if torch.cuda.is_available():
    src_state_dict = torch.load(checkpoint)["state_dict"]
else:
    src_state_dict = torch.load(checkpoint, map_location=torch.device("cpu"))["state_dict"]
```

so that this doesn't break when running on CPU.
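For reference, a branch-free variant of the same fix would also work, since `map_location="cpu"` is harmless on GPU machines (the state dict is moved to the target device after loading anyway). This is only a minimal sketch; the helper name below is hypothetical and not part of MASE's actual API:

```python
# Sketch: load a (possibly CUDA-saved) checkpoint's state_dict onto the CPU.
# map_location="cpu" remaps CUDA storages to CPU, avoiding the RuntimeError
# on machines where torch.cuda.is_available() is False.
import tempfile

import torch


def load_state_dict_cpu_safe(checkpoint_path: str) -> dict:
    """Return the checkpoint's state_dict with all tensors on the CPU."""
    return torch.load(checkpoint_path, map_location="cpu")["state_dict"]


# Minimal usage: save a toy checkpoint and reload it CPU-side.
with tempfile.NamedTemporaryFile(suffix=".ckpt") as f:
    torch.save({"state_dict": {"layer.weight": torch.zeros(2, 2)}}, f.name)
    sd = load_state_dict_cpu_safe(f.name)
    assert sd["layer.weight"].device.type == "cpu"
```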
The accelerator flag doesn't seem to help - the print below shows that the accelerator is correctly overridden to cpu
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
| Name | Default | Config. File | Manual Override | Effective |
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
| task | classification | cls | | cls |
| load_name | None | | /content/mase/mase_outpu | /content/mase/mase_outpu |
| | | | t/jsc_bardia_e_50_b_128_ | t/jsc_bardia_e_50_b_128_ |
| | | | l_001/software/training_ | l_001/software/training_ |
| | | | ckpts/best.ckpt | ckpts/best.ckpt |
| load_type | mz | pl | | pl |
| batch_size | 128 | 512 | | 512 |
| to_debug | False | | | False |
| log_level | info | | | info |
| report_to | tensorboard | | | tensorboard |
| seed | 0 | 42 | | 42 |
| quant_config | None | | | None |
| training_optimizer | adam | | | adam |
| trainer_precision | 16-mixed | | | 16-mixed |
| learning_rate | 1e-05 | 0.01 | | 0.01 |
| weight_decay | 0 | | | 0 |
| max_epochs | 20 | 5 | | 5 |
| max_steps | -1 | | | -1 |
| accumulate_grad_batches | 1 | | | 1 |
| log_every_n_steps | 50 | 5 | | 5 |
| num_workers | 2 | | | 2 |
| num_devices | 1 | | | 1 |
| num_nodes | 1 | | | 1 |
| accelerator | auto | cpu | cpu | cpu |
| strategy | auto | | | auto |
| is_to_auto_requeue | False | | | False |
| github_ci | False | | | False |
| disable_dataset_cache | False | | | False |
| target | xcu250-figd2104-2L-e | | | xcu250-figd2104-2L-e |
| num_targets | 100 | | | 100 |
| is_pretrained | False | | | False |
| max_token_len | 512 | | | 512 |
| project_dir | /content/mase/mase_outpu | | | /content/mase/mase_outpu |
| | t | | | t |
| project | None | jsc-tiny | | jsc-tiny |
| model | None | jsc-bardia | | jsc-bardia |
| dataset | None | jsc | | jsc |
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
Can you please follow the template for filing an issue?
Additionally, the proposal presented here doesn't seem feasible to me at the moment. Perhaps exploring the `--accelerator cpu` option might yield something different for you? Troubleshooting would be more straightforward if you included your command, as shown in the provided template.
Hi, I changed the format. Using the accelerator flag doesn't seem to help; however, I made the change I suggested locally and it does fix the issue.
Hi bardia,
It seems the proposed change doesn't follow MASE's coding conventions.
You can try the modification in commit 6947cc3 to check whether it solves your problem.