tianyic/only_train_once

Confused about Fig 3a in OTO v1 paper

Closed this issue · 3 comments

Hello,

image

From the definition of $\mathcal{S}(\mathbf{x})$, and assuming $\epsilon = 0,\ \mathcal{I}^{0} = \{1\}$, the half-space should be the whole region on the right, and the projection should be straight onto $[x]_1 = 0$ or $[x]_2 = 0$.

What am I missing?

@iamanigeeit

Thanks for reaching out. HSPG is a sparse optimizer that yields group sparsity. In the figure, we consider the union of $[x]_1$ and $[x]_2$ as a single variable group, so they need to be projected to zero at the same time. The half-space is then formed whose boundary is orthogonal to the projection direction $-[x]$.

image

You might be thinking of the case that produces unstructured sparsity. For that case, please look into our OBProxSG paper. In fact, the half-space projector with $\epsilon = 0$ is equivalent to the orthant projector in OBProxSG when applied to singleton groups.
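As a concrete check of that equivalence (my own toy numbers, not from the paper), with $\epsilon = 0$ and singleton groups $\{1\}, \{2\}$, the half-space test per coordinate reduces to the orthant test: zero out any coordinate whose trial value crossed zero relative to $x_k$.

```python
import numpy as np

# Hypothetical 2D example: x_k is the previous iterate, x_trial the point
# after a gradient step that flipped the sign of the first coordinate.
x_k = np.array([0.8, -0.3])
x_trial = np.array([-0.1, -0.2])

# Half-space test on singleton groups with eps = 0:
# zero out coordinate i when x_trial[i] * x_k[i] < 0.
halfspace = np.where(x_trial * x_k < 0.0, 0.0, x_trial)

# Orthant projector: zero out coordinates whose sign changed.
orthant = np.where(np.sign(x_trial) != np.sign(x_k), 0.0, x_trial)

print(halfspace)  # [ 0.  -0.2]
print(orthant)    # [ 0.  -0.2]
```

Both projectors zero the first coordinate (sign flip) and keep the second, so they agree on singleton groups.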

@tianyic Thanks for the reply!

My understanding of the algorithm is this:

  1. Do a subgradient descent update. Call the new point $\boldsymbol{x'_{k+1}}$.
  2. Zero out the param groups that were previously zeroed. Call the new point $\boldsymbol{\tilde{x}_{k+1}}$.
  3. Now, we need to figure out which new param groups to zero.
  4. We try setting each group of variables to zero. Let $\boldsymbol{[\tilde{x}_{k+1}]_g}$ denote the values of param group $g$ at the trial point.
  5. If $\boldsymbol{[\tilde{x}_{k+1}]_g \cdot [x_k]_g} < 0$, the trial point is on the opposite side of the origin from the previous point $\boldsymbol{x_k}$. Then we can set the params in group $g$ to 0, because setting them to 0 moves the update in the correct direction.
  6. If $\boldsymbol{[\tilde{x}_{k+1}]_g \cdot [x_k]_g} < \epsilon\ \boldsymbol{[x_k]_g \cdot [x_k]_g}$ and $\epsilon < 1$, the trial point is on the same side as $[x_k]_g$ but still closer to the origin. As we increase $\epsilon$ from 0 to 1, we widen the region of points for which we can set the params in $g$ to 0.
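The steps above can be sketched in NumPy; this is my own minimal reading of the procedure (function name, signature, and hyper-parameters are mine, not the repo's API):

```python
import numpy as np

def half_space_step(x_k, grad, groups, lr=0.1, eps=0.0):
    """One half-space projection step, per the understanding above (a sketch).

    groups: list of index arrays, each one a variable group.
    """
    # Step 1: subgradient descent update -> trial point x'_{k+1}.
    x_trial = x_k - lr * grad
    x_next = x_trial.copy()
    for g in groups:
        # Step 2: groups previously zeroed stay zeroed.
        if np.all(x_k[g] == 0):
            x_next[g] = 0.0
            continue
        # Steps 5-6: project group g to zero unless the trial point stays on
        # x_k's side with margin eps * ||[x_k]_g||^2.
        if np.dot(x_trial[g], x_k[g]) < eps * np.dot(x_k[g], x_k[g]):
            x_next[g] = 0.0
    return x_next
```

For example, with `x_k = [1, 1, 0.5, 0.5]`, groups `{0,1}` and `{2,3}`, and a gradient that pushes the first group past the origin, the first group gets projected to zero while the second is kept.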

If this understanding is correct, then I know why I was confused. Maybe this could help other people with similar problems.

  1. I interpreted $\mathcal{G} = \{\{1,2\}\}$ as $g=\{1\}$ and $g=\{2\}$ when it should be $g=\{1,2\}$.
  2. I thought $\mathcal{S}_k$ indicates the region where $\boldsymbol{x_{k+1}}$ would be set to zero. Now I realize it is the important region, i.e. we retain $\boldsymbol{x_{k+1}}$ if it's inside $\mathcal{S}_k$.

But I still have questions:

  1. In Figure 3b, it seems like the $\epsilon=0$ line should pass through $\mathcal{O}$ and $\epsilon=1$ should pass through $\boldsymbol{x_k}$?
  2. According to Algorithm 2, param groups that are zeroed can never be activated again. Is that correct?
  3. All of the following look like the same thing. Should they be written as $\boldsymbol{x'_{k+1}}$?

Just before Fig 3a:
image

Fig 3a:
image

Algorithm 2:
image

@iamanigeeit

Happy to hear you figured out the previous confusion. Your understanding is largely correct.

Regarding your remaining questions:

  1. Yes, $\epsilon$ is a hyper-parameter that controls the half-space region; its feasible range is $[0, 1)$. Setting it to $0$ makes the boundary pass through the origin, and letting it approach $1$ makes the boundary pass through $x_k$.

  2. Yes, in the HSPG and DHSPG schemes, the identified redundant groups are kept redundant (at zero). In AdaHSPG+, the theoretical paper, we discuss how to adaptively zero groups in and out to achieve better redundancy identification.

  3. Your suggestion makes sense, but I think it is better to write line 10 as $[x_{k+1}]_{\mathcal{I}^{\neq 0}}$ for clarity.

Since we typically denote the final next iterate as $x_{k+1}$, other notations such as $x_{k+1}'$ or $\hat{x}_{k+1}$ refer to trial iterates that need further adjustment to obtain the final next iterate.
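The $\epsilon$ geometry in answer 1 can be sanity-checked numerically. Assuming the half-space boundary is $\{x : \langle x, x_k \rangle = \epsilon \|x_k\|^2\}$ (my notation, consistent with the test in step 6 above):

```python
import numpy as np

# x_k is an arbitrary previous iterate; check which points lie on the
# half-space boundary <x, x_k> = eps * ||x_k||^2 for eps = 0 and eps = 1.
x_k = np.array([0.6, 0.8])

def on_boundary(x, eps):
    return bool(np.isclose(np.dot(x, x_k), eps * np.dot(x_k, x_k)))

print(on_boundary(np.zeros(2), eps=0.0))  # True: eps=0 boundary passes through the origin
print(on_boundary(x_k, eps=1.0))          # True: eps=1 boundary passes through x_k itself
```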