MCZhi/DIPP

RuntimeError at epoch 13 during training

Closed this issue · 2 comments

First, thank you for your amazing work!

When training reached epoch 13, I encountered this error: RuntimeError: There was an error while running the linear optimizer. Original error message: torch.linalg_cholesky: (Batch element 8): The factorization could not be completed because the input is not positive-definite (the leading minor of order 18 is not positive-definite).. Backward pass will not work. To obtain the best solution seen before the error, run with torch.no_grad()

Could you kindly help me find the reason?
The full log is shown below:

Epoch 13/20
Train Progress: [ 29984/ 36111] Loss: 5.0243 0.2171s/sample
Traceback (most recent call last):
File "/home/moovita/theseus/theseus/optimizer/nonlinear/nonlinear_optimizer.py", line 274, in _optimize_loop
delta = self.compute_delta(**kwargs)
File "/home/moovita/theseus/theseus/optimizer/nonlinear/gauss_newton.py", line 47, in compute_delta
return self.linear_solver.solve()
File "/home/moovita/theseus/theseus/optimizer/linear/dense_solver.py", line 113, in solve
return self._apply_damping_and_solve(
File "/home/moovita/theseus/theseus/optimizer/linear/dense_solver.py", line 75, in _apply_damping_and_solve
return self._solve_sytem(Atb, AtA)
File "/home/moovita/theseus/theseus/optimizer/linear/dense_solver.py", line 157, in _solve_sytem
lower = torch.linalg.cholesky(AtA)
torch._C._LinAlgError: torch.linalg_cholesky: (Batch element 8): The factorization could not be completed because the input is not positive-definite (the leading minor of order 18 is not positive-definite).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train.py", line 251, in <module>
model_training()
File "train.py", line 207, in model_training
train_loss, train_metrics = train_epoch(train_loader, predictor, planner, optimizer, args.use_planning)
File "train.py", line 53, in train_epoch
final_values, info = planner.layer.forward(planner_inputs)
File "/home/moovita/theseus/theseus/theseus_layer.py", line 88, in forward
vars, info = _forward(
File "/home/moovita/theseus/theseus/theseus_layer.py", line 148, in _forward
info = optimizer.optimize(**optimizer_kwargs)
File "/home/moovita/theseus/theseus/optimizer/optimizer.py", line 43, in optimize
return self._optimize_impl(**kwargs)
File "/home/moovita/theseus/theseus/optimizer/nonlinear/nonlinear_optimizer.py", line 357, in _optimize_impl
self._optimize_loop(
File "/home/moovita/theseus/theseus/optimizer/nonlinear/nonlinear_optimizer.py", line 281, in _optimize_loop
raise RuntimeError(
RuntimeError: There was an error while running the linear optimizer. Original error message: torch.linalg_cholesky: (Batch element 8): The factorization could not be completed because the input is not positive-definite (the leading minor of order 18 is not positive-definite).. Backward pass will not work. To obtain the best solution seen before the error, run with torch.no_grad()
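For reference, the underlying failure can be reproduced in isolation, independent of the DIPP code: torch.linalg.cholesky rejects any symmetric matrix that is not positive-definite, which is what happens to the Gauss-Newton system matrix AtA in batch element 8. A minimal sketch (the matrix below is illustrative, not taken from the training run):

```python
import torch

# A symmetric matrix that is NOT positive-definite (eigenvalues 3 and -1),
# so the Cholesky factorization must fail, just as AtA does above.
A = torch.tensor([[1.0, 2.0],
                  [2.0, 1.0]])

try:
    torch.linalg.cholesky(A)
except RuntimeError as e:  # torch.linalg.LinAlgError subclasses RuntimeError
    print("Cholesky failed:", e)

# Adding a damping term to the diagonal (Levenberg-Marquardt style)
# restores positive-definiteness, and the factorization succeeds.
damped = A + 5.0 * torch.eye(2)
L = torch.linalg.cholesky(damped)
assert torch.allclose(L @ L.T, damped)
```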

MCZhi commented

Thank you for your interest in our work. This problem originates in the Theseus solver: torch.linalg_cholesky cannot handle this case. I have no fix for the root cause, but a workaround is to change linear_solver_cls from th.CholeskyDenseSolver to th.CholmodSparseSolver in the MotionPlanner class in planner.py; the sparse solver is more numerically stable.
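The suggested change, as a hedged sketch (the surrounding constructor arguments are assumptions and the actual code in DIPP's planner.py may differ; note that th.CholmodSparseSolver additionally requires the scikit-sparse package to be installed):

```python
# In the MotionPlanner class in planner.py (sketch): swap the
# linear solver class passed to the Theseus optimizer.
self.optimizer = th.GaussNewton(
    self.objective,
    linear_solver_cls=th.CholmodSparseSolver,  # was: th.CholeskyDenseSolver
)
```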

It really works for me!
Thank you for your kind help and your great work.