cvxgrp/diffcp

Inaccurate gradient estimation for log penalization

AxelBreuer opened this issue · 1 comment

Hi,

I have reported an equivalent issue on cvxpylayers (cvxgrp/cvxpylayers#135), but since cvxpylayers uses diffcp, I wanted to report it here as well.

I have built an even simpler example that exhibits inaccurate gradient estimation:

Let's suppose we want to maximize $J(x) = x + \lambda (\log(1-x) + \log(1+x))$ where $\lambda \in \mathbb{R}^+$.
The solution is $x^* = -\lambda + \sqrt{\lambda^2+1}$, and its sensitivity is $\frac{dx^*}{d\lambda} = -1 + \frac{\lambda}{\sqrt{\lambda^2+1}}$.
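(For completeness, this closed form follows from the first-order condition $J'(x^*) = 0$:

$1 - \frac{2\lambda x^*}{1-(x^*)^2} = 0 \;\Rightarrow\; (x^*)^2 + 2\lambda x^* - 1 = 0 \;\Rightarrow\; x^* = -\lambda + \sqrt{\lambda^2+1}$,

keeping the root that lies in the domain $(-1, 1)$ of the logarithms.)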
Now, let's calculate the sensitivity of $x^{*}$ with respect to $\lambda$, at $\lambda=1$ and $\delta\lambda=10^{-6}$, with diffcp:

import numpy as np
import cvxpy as cp
import diffcp.cone_program as cone_prog
import diffcp.utils as utils

# variables
x_ = cp.Variable()

# parameters
_lam = cp.Parameter(1, nonneg=True)
_lam.value = np.ones(1)

# objective
objective = cp.Maximize(x_ + _lam*cp.log(1+x_) + _lam*cp.log(1-x_))

problem = cp.Problem(objective)

# extract the cone program data (A, b, c, cone_dims) in SCS form
A, b, c, cone_dims = utils.scs_data_from_cvxpy_problem(problem)

# solve and get the derivative map D and its adjoint DT
x, y, s, D, DT = cone_prog.solve_and_derivative(A,
                                                b,
                                                c,
                                                cone_dims,
                                                solve_method="ECOS",
                                                mode="dense",
                                                feastol=1e-10,
                                                abstol=1e-10,
                                                reltol=1e-10)

# perturbation of the problem data: lambda enters only through the objective vector c here
dlam = 1e-6
dA = utils.get_random_like(A, lambda n: np.zeros(n))
db = np.zeros(b.size)
dc = np.zeros(c.size)
dc[1] = dc[1] - dlam  # the minus sign stems from the fact that  c =  [-1., -1., -1.]
dc[2] = dc[2] - dlam  # the minus sign stems from the fact that  c =  [-1., -1., -1.]

# forward derivative: predicted first-order change in (x, y, s)
dx, dy, ds = D(dA, db, dc)

# re-solve the perturbed problem for a finite-difference reference
x_pert, y_pert, s_pert, _, _ = cone_prog.solve_and_derivative(A + dA,
                                                              b + db,
                                                              c + dc,
                                                              cone_dims,
                                                              solve_method="ECOS",
                                                              mode="dense",
                                                              feastol=1e-10,
                                                              abstol=1e-10,
                                                              reltol=1e-10)

print(f"x_pert-x   = {x_pert-x}")
print(f"      dx   = {dx}")
analytical = -1+_lam.value/np.sqrt(_lam.value**2+1)
print(f"analytical = {analytical*dlam}")

and you get

x_pert-x   = [-2.93091918e-07 -2.07319449e-07  5.00378592e-07]
      dx   = [-2.61203874e-07 -1.84699034e-07  4.45902905e-07]
analytical = [-2.92893219e-07]

Here we see that the finite-difference approximation of $\frac{dx^*}{d\lambda}\,\delta\lambda$ (-2.931e-7) is much closer to the analytical gradient (-2.929e-7) than the automatic-differentiation gradient (-2.612e-7), which is very surprising, especially for such a simple (unconstrained) problem.
Do you have an idea why AD performs so poorly on this example?
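
As a sanity check independent of diffcp and cvxpy, the following sketch (the helper names are mine, chosen for illustration) confirms that a finite difference of the closed-form solution agrees with the analytical sensitivity:

import numpy as np

def x_star(lam):
    # closed-form maximizer of J(x) = x + lam*(log(1-x) + log(1+x))
    return -lam + np.sqrt(lam**2 + 1.0)

def dx_star_dlam(lam):
    # analytical sensitivity dx*/dlam
    return -1.0 + lam / np.sqrt(lam**2 + 1.0)

lam, dlam = 1.0, 1e-6

# central finite difference of the closed-form solution
fd = (x_star(lam + dlam) - x_star(lam - dlam)) / (2.0 * dlam)

print(f"finite difference : {fd:.9f}")                  # ~ -0.292893219
print(f"analytical        : {dx_star_dlam(lam):.9f}")   # ~ -0.292893219

which supports the reading that the discrepancy above comes from the AD estimate rather than from the finite-difference reference.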

closed by #59