2 Linear Models
Some questions and suggestions came to mind when I read about the gradient descent method:
- In section Gradient Descent, I find the formulation of the exponential decay of the learning rate a little bit odd. I would suggest expressing \eta_s in terms of \eta_0 instead.
- In section Stochastic Gradient Descent (SGD), I believe it would be better to start the index at i=0; otherwise it would make more sense to divide by n+1 when averaging the individual losses. The same goes for the other two sums in that part.
- Furthermore, the "incremental gradient" method looks a lot like the SAG method described here rather than the incremental aggregated gradient (IAG) method from this paper, which I found confusing. I also found the SAGA algorithm. Maybe adding some of these references would be helpful to other students.
- Another suggestion would be to change "random i" to "if i = i_s" and to add "with i_s randomly chosen per iteration".
- Good catch! It seems I switched them: exponential decay should use eta_0, and inverse-time decay should use eta_s (see the sketch after this list).
- You mean starting the index at 1, right? Also updated :)
- I added the links to the papers. There is a whole range of incremental gradient techniques; I'm trying to bring across the main idea behind all of them (see the SAG-style sketch below).
- OK, added.
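For concreteness, here is a minimal NumPy sketch of how I read those pieces together. This is not the notes' actual code: the decay constant k, the closed-form parametrizations of the schedules, and the grad_i callback are just illustrative choices, and the exact form in the notes may look slightly different.

```python
import numpy as np

def eta_exponential(eta_0, k, s):
    # Exponential decay, written in closed form: eta_s = eta_0 * exp(-k * s).
    return eta_0 * np.exp(-k * s)

def eta_inverse_time(eta_0, k, s):
    # A common inverse-time form: eta_s = eta_0 / (1 + k * s).
    return eta_0 / (1 + k * s)

def sgd(X, y, grad_i, eta_0=0.1, k=0.01, n_epochs=10, seed=None):
    """Plain SGD sketch: per iteration, pick one random index i_s and take a
    step on that example's loss only. grad_i(w, x_i, y_i) is assumed to
    return the gradient of the individual loss L_i at w."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    s = 0
    for _ in range(n_epochs):
        for _ in range(n):
            i_s = rng.integers(n)                 # i = i_s, chosen at random per iteration
            eta_s = eta_exponential(eta_0, k, s)  # or eta_inverse_time(eta_0, k, s)
            w = w - eta_s * grad_i(w, X[i_s], y[i_s])
            s += 1
    return w
```

For example, for squared loss L_i = (w · x_i - y_i)^2 you could pass grad_i = lambda w, x, yi: 2 * (w @ x - yi) * x.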
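And here is a rough SAG-style sketch of the shared idea behind the incremental/aggregated methods, following the SAG paper linked above as I understand it: keep the last gradient seen for each example, refresh one random entry per iteration, and step with the average of the whole table (SAGA adds a correction term to make the step unbiased; the step size and storage details here are simplified).

```python
import numpy as np

def sag(X, y, grad_i, eta=0.01, n_iter=1000, seed=None):
    """SAG-style sketch: store the most recent gradient of each individual
    loss L_i, refresh one randomly chosen entry per iteration, and step
    with the average of all stored gradients."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    G = np.zeros((n, p))   # last-seen gradient per example (initialized to zero)
    G_sum = np.zeros(p)    # running sum of the table, so the average stays O(p)
    for _ in range(n_iter):
        i_s = rng.integers(n)                # i = i_s, random per iteration
        g_new = grad_i(w, X[i_s], y[i_s])
        G_sum += g_new - G[i_s]              # swap the old entry out of the sum
        G[i_s] = g_new                       # ...and remember the new one
        w = w - eta * G_sum / n              # step with the averaged gradient
    return w
```

Plain SGD uses only g_new at each step; SAG reuses the stale gradients of all other examples, which is what makes it "aggregated".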
Thanks!
Some other things that popped up:
- In Solving SVMs with Lagrange Multipliers, I do not understand the expression for the distance of support vectors to the boundary. I don't see how this "y - w_0" term reduces to 1, which would be the correct distance (for example using this derivation for the whole margin).
- I find it hard to see that the plot in Geometric interpretation corresponds to the function f, because the contours do not look like concentric circles. Furthermore, it is not clear that there is a pole at the origin. I think a plot of w1^2+w2^2 like the one in this article is clearer (note that they are minimizing ||w||^2 instead).
- In the video I explain this more intuitively. If you imagine that the hyperplane is defined by one positive and one negative support vector, and we call the positive support vector x_1, you know that y = w x_1 + w_0. That means that w x_1 = y - w_0, and we know that (by definition) w x_1 should be 1 there, hence y - w_0 = 1. Would you agree with this, or am I simplifying it too much? Do you think that the much longer derivation in the link is easier to understand?
(In the video I also put the origin at the intersection point. Since this is just an affine hyperplane I can do that if I subtract the offset w_0, but in hindsight that might be less clear :). The simpler explanation above might be better; the short standard derivation is also sketched below for comparison.)
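Assuming the usual canonical scaling y_i (w · x_i + w_0) = 1 for support vectors (which may not match the notes' notation exactly), the standard argument compresses to:

```latex
% Compact version of the standard margin argument, assuming the
% canonical scaling y_i (w . x_i + w_0) = 1 for support vectors.
\begin{align*}
d(x_i) &= \frac{\lvert w \cdot x_i + w_0 \rvert}{\lVert w \rVert}
  && \text{distance of $x_i$ to the hyperplane $w \cdot x + w_0 = 0$} \\
\lvert w \cdot x_i + w_0 \rvert &= 1
  && \text{for a support vector, since $y_i (w \cdot x_i + w_0) = 1$ and $\lvert y_i \rvert = 1$} \\
d(x_i) &= \frac{1}{\lVert w \rVert}
  && \text{so the full margin is $\tfrac{2}{\lVert w \rVert}$}
\end{align*}
```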
- Hmm, that figure was meant to show the more general case of a convex function with linear constraints, but you're probably right that it's easier to follow if I demonstrate this specific case. I'll try to replace that figure with a plot along the lines of the sketch below.
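A minimal matplotlib sketch of that kind of figure: contours of f(w1, w2) = w1^2 + w2^2 (concentric circles around the origin) with a linear constraint. The constraint w_1 + w_2 = 2, the contour levels, and the axis ranges are just illustrative choices, not the ones from the notes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Contours of f(w1, w2) = w1^2 + w2^2 with an example linear constraint.
w1, w2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
f = w1**2 + w2**2

fig, ax = plt.subplots(figsize=(5, 5))
ax.contour(w1, w2, f, levels=[0.25, 0.5, 1, 2, 4, 8], colors="steelblue")
ax.plot([-1, 3], [3, -1], color="crimson", label="constraint: $w_1 + w_2 = 2$")
ax.plot(1, 1, "ko", label="constrained minimum")  # closest point on the line to the origin
ax.set_xlabel("$w_1$")
ax.set_ylabel("$w_2$")
ax.set_aspect("equal")
ax.legend()
plt.show()
```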
- Yes, using the same standardized data, the coefficients of the overfitted model would be larger than those of a well-fitted model. (That would actually still hold if the data were not standardized.) I don't know an absolute definition of 'large' in this case; that completely depends on the data... A small toy illustration is sketched below.
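For instance (arbitrary toy data and model choices, nothing from the notes): fit the same standardized polynomial features with and without regularization and compare the coefficient magnitudes; the unregularized, overfitted fit typically ends up with much larger coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Toy data: a noisy sine, fit with degree-15 polynomial features.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, size=(30, 1)), axis=0)
y = np.sin(3 * X).ravel() + rng.normal(scale=0.1, size=30)

def poly_model(regressor):
    # Degree-15 polynomial features, standardized, then the given regressor.
    return make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), regressor)

overfit = poly_model(LinearRegression()).fit(X, y)  # unregularized, tends to overfit
ridge = poly_model(Ridge(alpha=1.0)).fit(X, y)      # regularized

print("max |coef|, unregularized:", np.abs(overfit[-1].coef_).max())
print("max |coef|, ridge        :", np.abs(ridge[-1].coef_).max())
```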