pytorch/opacus

Error for capture_activations_hook in grad_sample_module.py

conjurer-Fan-Wu opened this issue · 13 comments

๐Ÿ› Bug

The error seems to be in the Opacus library:
File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:288 in capture_activations_hook
p._forward_counter += 1
AttributeError: 'Parameter' object has no attribute '_forward_counter'

Please reproduce using our template Colab and post here the link

I use Google Drive to store the files. federated_main.py is the main file, which I run with Spyder. All .py files are in the src_v3 folder.

https://drive.google.com/drive/folders/1inWFXO0fPoKygi8rJSzUcJLr-jFVoLxb?usp=sharing

To Reproduce

โš ๏ธ We cannot help you without you sharing reproducible code. Do not ignore this part :)
Steps to reproduce the behavior:

  1. Run federated_main directly

Traceback (most recent call last):

File /usr/local/lib/python3.10/dist-packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)

File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:237
model0, optimizer0, train_loader = privacy_engine.make_private(

TypeError: PrivacyEngine.make_private() missing 1 required keyword-only argument: 'data_loader'

runfile('/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py', wdir='/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3')
Reloaded modules: options, update, models, sampling, utils

Experimental details:
Model : cnn
Optimizer : sgd
Learning : 0.01
Global Rounds : 2

Federated parameters:
IID
Fraction of users  : 0.9
Local Batch size   : 64
Local Epochs       : 5

global model: CNNMnist(
(conv1): Conv2d(1, 16, kernel_size=(8, 8), stride=(2, 2), padding=(3, 3))
(conv2): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2))
(fc1): Linear(in_features=512, out_features=32, bias=True)
(fc2): Linear(in_features=32, out_features=10, bias=True)
)
global model: CNNMnist(
(conv1): Conv2d(1, 16, kernel_size=(8, 8), stride=(2, 2), padding=(3, 3))
(conv2): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2))
(fc1): Linear(in_features=512, out_features=32, bias=True)
(fc2): Linear(in_features=32, out_features=10, bias=True)
)
0%| | 0/2 [00:00<?, ?it/s]
| Global Training Round : 1 |

/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/update.py:25: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
return torch.tensor(image), torch.tensor(label)
0%| | 0/2 [00:01<?, ?it/s]
Traceback (most recent call last):

File /usr/local/lib/python3.10/dist-packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)

File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:245
w, loss, epsilon_idx = local_model.update_weights(args=args,

File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:79 in update_weights
log_probs = model(images)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1568 in _call_impl
result = forward_call(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:148 in forward
return self._module(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1527 in _call_impl
return forward_call(*args, **kwargs)

File ~/work/pyproject/basictest/FL_testmine/src_v3/models.py:49 in forward
x = F.relu(self.conv1(x)) # -> [B, 16, 14, 14]

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1581 in _call_impl
hook_result = hook(self, args, result)

File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:288 in capture_activations_hook
p._forward_counter += 1

AttributeError: 'Parameter' object has no attribute '_forward_counter'

Expected behavior

The program should at least run without errors.

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
By the way, I use Ubuntu 22.04 with Python 3.10.12 and Opacus 1.4.0.

[pip3] flake8==6.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] numpydoc==1.5.0
[pip3] torch==2.1.0
[pip3] torchinfo==1.8.0
[pip3] torchvision==0.15.2
[pip3] triton==2.1.0
[conda] Could not collect


Additional context

It seems your model was not successfully wrapped by "make_private", so the "_forward_counter" attribute was never initialized (Opacus sets p._forward_counter = 0 on each parameter when it adds its hooks). Furthermore, the traceback shows "TypeError: PrivacyEngine.make_private() missing 1 required keyword-only argument: 'data_loader'", which is likely why the wrapping failed. Could you please fix that part first? Thanks!
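
For reference, a minimal sketch of how make_private is usually called, with module, optimizer and data_loader all passed as keyword arguments (the variable names and noise/clipping values below are placeholders, not taken from federated_main.py):

    from opacus import PrivacyEngine

    privacy_engine = PrivacyEngine()
    # module, optimizer and data_loader are keyword-only; omitting data_loader
    # raises exactly the TypeError shown in the traceback above.
    model, optimizer, train_loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        noise_multiplier=1.0,  # example value
        max_grad_norm=1.0,     # example value
    )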

I have tested the code again and fixed the data_loader problem, but the error above still exists.

##################################

runfile('/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py', wdir='/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3')

Experimental details:
Model : cnn
Optimizer : sgd
Learning : 0.01
Global Rounds : 2

Federated parameters:
IID
Fraction of users  : 0.9
Local Batch size   : 64
Local Epochs       : 5

global model: CNNMnist(
(conv1): Conv2d(1, 16, kernel_size=(8, 8), stride=(2, 2), padding=(3, 3))
(conv2): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2))
(fc1): Linear(in_features=512, out_features=32, bias=True)
(fc2): Linear(in_features=32, out_features=10, bias=True)
)
global model: CNNMnist(
(conv1): Conv2d(1, 16, kernel_size=(8, 8), stride=(2, 2), padding=(3, 3))
(conv2): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2))
(fc1): Linear(in_features=512, out_features=32, bias=True)
(fc2): Linear(in_features=32, out_features=10, bias=True)
)
0%| | 0/2 [00:00<?, ?it/s]
| Global Training Round : 2 |

/home/fanwu/.local/lib/python3.10/site-packages/opacus/privacy_engine.py:142: UserWarning: Secure RNG turned off. This is perfectly fine for experimentation as it allows for much faster training performance, but remember to turn it on and retrain one last time before production with secure_mode turned on.
warnings.warn(
/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/update.py:25: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
return torch.tensor(image), torch.tensor(label)
0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):

File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)

File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:244
w, loss, epsilon_idx = local_model.update_weights(args=args,

File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:79 in update_weights
log_probs = model(images)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1568 in _call_impl
result = forward_call(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:148 in forward
return self._module(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1527 in _call_impl
return forward_call(*args, **kwargs)

File ~/work/pyproject/basictest/FL_testmine/src_v3/models.py:49 in forward
x = F.relu(self.conv1(x)) # -> [B, 16, 14, 14]

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1581 in _call_impl
hook_result = hook(self, args, result)

File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:288 in capture_activations_hook
p._forward_counter += 1

AttributeError: 'Parameter' object has no attribute '_forward_counter'

This thread (https://discuss.pytorch.org/t/error-when-trying-federated-learning-with-opacus/153049/2) should solve this issue. Please let me know whether it works :)

No, that does not work.

Traceback (most recent call last):

File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)

File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:246
w, loss, epsilon_idx = local_model.update_weights(args=args,

File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:76 in update_weights
model = GradSampleModule(model)

File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:141 in init
self.add_hooks(

File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:191 in add_hooks
raise ValueError("Trying to add hooks twice to the same model")

ValueError: Trying to add hooks twice to the same model

Could you link the latest code (I did not find it in your Drive)? From the error, it seems you are somehow trying to privatize a model that has already been privatized. Possibly you forgot to revert the model to a non-private module at the end of client training (self.model = model.to_standard_module()).

Sorry for the late file update. Now the files are in the google drive:
https://drive.google.com/drive/folders/1hxmZZzZtKZ78ohYmHx41OC_DugFm0Zv1

I changed update.py by adding model = GradSampleModule(model) in the update_weights function, before training begins, and that is when this error happens; at least that change should have made some difference.
To tell the truth, my code is based on FedAvg (https://github.com/AshwinRJ/Federated-Learning-PyTorch), and I feel its structure is very different from the Opacus examples. I have tried for several days and all of my changes have failed.

Any reason not to use "model.to_standard_module()", as suggested in (https://discuss.pytorch.org/t/error-when-trying-federated-learning-with-opacus/153049/2)? Note that this call reverts the privatized model to a non-private model, which avoids privatizing the same model twice.

As I mentioned, the reason you see this hook error is that you are privatizing the same model twice, and therefore adding the same hooks twice.

My suggestion is as follows:

  1. Remove privacy_engine.make_private from federated_main.py and move it to update.py.
  2. Remove GradSampleModule in update.py.
  3. In update.py, instead of "return model.state_dict()", have "return model.to_standard_module().state_dict()"

Generally speaking, what you need to do is:

  1. On the server side, keep only non-private models. This gives you the freedom to change model parameters by aggregation.
  2. On the client side, first receive the non-private model, then call the privacy engine to privatize it and run DP-SGD. Finally, return the parameters of the (non-private) model; see the sketch after this list.
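
For illustration, here is a minimal sketch of the client-side structure described above (function name, argument names and hyperparameters are placeholders, not taken from the actual update.py):

    import torch
    from opacus import PrivacyEngine

    def update_weights(model, train_loader, args):
        # Build a fresh optimizer from the (still non-private) model that was
        # just received from the server, then privatize everything locally.
        optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)
        privacy_engine = PrivacyEngine()
        model, optimizer, train_loader = privacy_engine.make_private(
            module=model,
            optimizer=optimizer,
            data_loader=train_loader,
            noise_multiplier=args.noise_multiplier,
            max_grad_norm=args.max_grad_norm,
        )

        criterion = torch.nn.NLLLoss()
        model.train()
        for _ in range(args.local_ep):
            for images, labels in train_loader:
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()

        epsilon = privacy_engine.accountant.get_epsilon(delta=args.delta)
        # Return the parameters of the underlying non-private module, so the
        # server never aggregates or re-privatizes a GradSampleModule wrapper.
        return model.to_standard_module().state_dict(), loss.item(), epsilon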

Thanks for your kind response. I think I understand the architecture a little better now. I modified the code according to your help.
(https://drive.google.com/drive/folders/1hxmZZzZtKZ78ohYmHx41OC_DugFm0Zv1)
However, a new problem appears. I could not find any difference from the example on GitHub; the module and optimizer construction is the same as in the example, but the error still occurs.

Traceback (most recent call last):

File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)

File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:235
w, loss, epsilon_idx = local_model.update_weights(args=args,

File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:67 in update_weights
model, optimizer, train_loader = privacy_engine.make_private(

File ~/.local/lib/python3.10/site-packages/opacus/privacy_engine.py:393 in make_private
raise ValueError(

ValueError: Module parameters are different than optimizer Parameters

Maybe you can define a new optimizer in "update.py", instead of re-using the existing one.
One example is "optimizer = torch.optim.SGD(model.parameters(),lr=0.01,momentum=0,weight_decay=0)" in "FederatedLearningClient.py" in https://discuss.pytorch.org/t/error-when-trying-federated-learning-with-opacus/153049
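
One quick way to confirm the mismatch before calling make_private (a rough check written for this thread, not Opacus's exact internal validation):

    # If this prints False, the optimizer was built from a different model
    # instance than the one being passed to make_private.
    module_params = {id(p) for p in model.parameters()}
    optim_params = {id(p) for g in optimizer.param_groups for p in g["params"]}
    print(optim_params.issubset(module_params))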

Thanks for your response. I have changed the code as you suggested. However, there is still an error coming from the Opacus library:

Traceback (most recent call last):

File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)

File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:235
w, loss, epsilon_idx = local_model.update_weights(args=args,

File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:80 in update_weights
epsilon = privacy_engine.accountant.get_epsilon(delta=args.delta)

File ~/.local/lib/python3.10/site-packages/opacus/accountants/prv.py:97 in get_epsilon
dprv = self._get_dprv(eps_error=eps_error, delta_error=delta_error)

File ~/.local/lib/python3.10/site-packages/opacus/accountants/prv.py:114 in _get_dprv
domain = self._get_domain(

File ~/.local/lib/python3.10/site-packages/opacus/accountants/prv.py:150 in _get_domain
return Domain.create_aligned(-L, L, mesh_size)

File ~/.local/lib/python3.10/site-packages/opacus/accountants/analysis/prv/domain.py:31 in create_aligned
size = int(np.round((t_max - t_min) / dt)) + 1

ValueError: cannot convert float NaN to integer

What is the delta value you are using? It is possible that the delta value is too small; for the PRV accountant, we only support delta > 1e-6.

Another potential fix is to move "privacy_engine.accountant.get_epsilon" to the end of the loop. This avoids the case where, in the first iteration, the accountant fetches epsilon before the model has been updated.
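
A short sketch of the suggested ordering (loop and variable names are placeholders):

    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    # Query the accountant only after at least one optimizer step has been
    # recorded, and keep delta above 1e-6 for the PRV accountant.
    epsilon = privacy_engine.accountant.get_epsilon(delta=args.delta)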

Thanks for your patient help. I modified the code according to your suggestion and moved "privacy_engine.accountant.get_epsilon" to the end of the loop.
(https://drive.google.com/drive/folders/1hxmZZzZtKZ78ohYmHx41OC_DugFm0Zv1)

All the parameter values are the same as in the Opacus MNIST example. But when the program runs, the loss in each epoch quickly becomes negative, without any convergence. I checked the whole process again, but I do not know why this happens. I tried changing lr to 0.05 or 0.01, but it did not help.

There are many possibilities for a loss to be negative. For example, NLLLoss expects log-probabilities as input (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html), i.e. the output of log_softmax; feeding it raw probabilities or logits can make the loss negative. However, that might not be the case here, judging from your model setup.
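
A minimal illustration of the point (standalone example, not taken from models.py):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 10)               # raw network outputs
    targets = torch.randint(0, 10, (4,))

    # NLLLoss expects log-probabilities; feeding it probabilities or raw
    # logits can produce negative loss values.
    log_probs = F.log_softmax(logits, dim=1)
    loss = F.nll_loss(log_probs, targets)      # always >= 0

    # Equivalent alternative: cross_entropy applies log_softmax internally.
    loss_ce = F.cross_entropy(logits, targets)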

Since the original error was not a bug and the discussion has moved away from Opacus itself, I am closing the issue.