MCZhi/DIPP

RuntimeError when training: variable control_variables is inconsistent with the objective's expected device

Closed this issue · 5 comments

Hi, MCZhi,
Many thanks for your nice work.

When training reached epoch 6, I encountered this error: RuntimeError: Attempted to update variable control_variables with a (cuda:0,torch.float32) tensor, which is inconsistent with objective's expected (cuda,torch.float32).

Could you kindly help me to find the reason?
The full log is shown below:

Epoch 5/20
Train Progress: [ 72977/ 72977] Loss: 6.0672 0.0077s/sample
plannerADE: 1.1334, plannerFDE: 2.6881, predictorADE: 0.8775, predictorFDE: 1.8742
Valid Progress: [ 18189/ 18189] Loss: 5.7455 0.0019s/sample
val-plannerADE: 1.0462, val-plannerFDE: 2.4850, val-predictorADE: 0.7919, val-predictorFDE: 1.7697
Model saved in training_log/DIPP

Epoch 6/20
Traceback (most recent call last):
File "train.py", line 254, in
model_training()
File "train.py", line 210, in model_training
train_loss, train_metrics = train_epoch(train_loader, predictor, planner, optimizer, args.use_planning)
File "train.py", line 54, in train_epoch
final_values, info = planner.layer.forward(planner_inputs)
File "/home/hu/anaconda3/envs/DIPP/lib/python3.8/site-packages/theseus/theseus_layer.py", line 93, in forward
vars, info = _forward(
File "/home/hu/anaconda3/envs/DIPP/lib/python3.8/site-packages/theseus/theseus_layer.py", line 169, in _forward
objective.update(input_tensors)
File "/home/hu/anaconda3/envs/DIPP/lib/python3.8/site-packages/theseus/core/objective.py", line 776, in update
raise ValueError(
ValueError: Attempted to update variable control_variables with a (cuda:0,torch.float32) tensor, which is inconsistent with objective's expected (cuda,torch.float32).

Hi, @Donghu1876, thank you for reaching out about the issue. It appears that there is a device mismatch between the "control_variables" tensor and the device the Theseus layer expects. Please make sure the Theseus layer is on the same device as the network output, or move the network output tensor to the device on which the Theseus layer resides.
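For example, a minimal sketch of those two options, assuming the variable names from the traceback (planner_inputs, planner.layer); the actual code in train.py may differ:

```python
import torch

# Option 1: build the Theseus objective/layer on an explicitly indexed device.
# CUDA tensors always report an indexed device (e.g. cuda:0), so an objective
# created with plain "cuda" will reject them even though it is the same GPU.
device = torch.device("cuda:0")

# Option 2: move every tensor fed into the Theseus layer onto that device
# before calling it (planner_inputs is assumed to be the dict passed in train.py).
planner_inputs = {
    name: value.to(device) if torch.is_tensor(value) else value
    for name, value in planner_inputs.items()
}
final_values, info = planner.layer.forward(planner_inputs)
```

Either way, the point is that the two device strings have to match exactly.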

Many thanks for your reply. I found that this issue disappears when using the CPU.

But I found another issue: when I launch the open-loop test, the following error occurs in some scenarios:

Traceback (most recent call last):
File "open_loop_test.py", line 227, in
open_loop_test()
File "open_loop_test.py", line 79, in open_loop_test
plans, predictions, scores, cost_function_weights = predictor(ego, neighbors, lanes, crosswalks)
File "/home/hu/anaconda3/envs/DIPP/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/hu/dong/03dataset/DIPP/model/predictor.py", line 242, in forward
output = self.agent_map(agent_agent[:, i], lane_feature[:, i], crosswalk_feature[:, i], map_mask[:, i])
IndexError: index 10 is out of bounds for dimension 1 with size 10

Thank you very much for your help despite your busy schedule.

Hi, @MCZhi,

I found that this error only occurs in certain scenarios, because in the Predictor class in predictor.py, actors.shape[1] is always 11, while agent_agent[:, i].shape may sometimes be torch.Size([1, 7, 256]), when its expected shape should be torch.Size([1, 11, 256]).

May I ask what is the reason for this? Thank you very much for your help and your great work.

Hi, @hudong24. Which version of PyTorch are you using? If you are using a higher version of PyTorch (>=2.0), you may need to set enable_nested_tensor=False in self.interaction_net = nn.TransformerEncoder(encoder_layer, num_layers=2) (inside class Agent2Agent in predictor.py).
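For reference, a minimal sketch of that change; the layer hyperparameters (d_model, nhead, dim_feedforward) are illustrative guesses, and only the enable_nested_tensor flag is the relevant part:

```python
import torch.nn as nn

class Agent2Agent(nn.Module):
    def __init__(self):
        super().__init__()
        # Illustrative hyperparameters; keep whatever the original code uses.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=256, nhead=8, dim_feedforward=1024, batch_first=True
        )
        # On PyTorch >= 2.0 the nested-tensor fast path can drop fully padded
        # agents when a src_key_padding_mask is supplied, shrinking the agent
        # dimension (e.g. from 11 to 7). Disabling it keeps the padded entries,
        # so the output shape stays [batch, 11, 256].
        self.interaction_net = nn.TransformerEncoder(
            encoder_layer, num_layers=2, enable_nested_tensor=False
        )
```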

Thank you, @MCZhi. Your answer was very helpful to me, and I would like to express my gratitude to you again. I wish you all the best!