mosicr opened this issue 6 years ago · 1 comment
Shouldn't `w2.t()` below be `grad_w2` instead? Thanks.

```python
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.t().mm(grad_y_pred)
grad_h_relu = grad_y_pred.mm(w2.t())
```
No, the current implementation is correct. Since `y_pred = h_relu.mm(w2)`, the chain rule gives `grad_h_relu = grad_y_pred.mm(w2.t())`: it is the transpose of `w2` (not `grad_w2`) that maps the output gradient back to the hidden layer, while `grad_w2` is the gradient with respect to the weights themselves. See my derivation of backprop through linear layers here:
http://cs231n.stanford.edu/handouts/linear-backprop.pdf
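For anyone who wants to convince themselves numerically, here is a minimal sketch that checks the manual gradients from the snippet above against autograd. The tensor names mirror the snippet; the shapes `N, H, D_out` are illustrative assumptions, not values from the original example.

```python
import torch

# Illustrative shapes (assumptions, not from the original example).
N, H, D_out = 4, 5, 3

h_relu = torch.randn(N, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)
y = torch.randn(N, D_out)

# Forward pass and squared-error loss, as in the example.
y_pred = h_relu.mm(w2)
loss = (y_pred - y).pow(2).sum()
loss.backward()

with torch.no_grad():
    # The manual gradients from the snippet in question.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)   # dL/dw2 = h_relu^T @ dL/dy_pred
    grad_h_relu = grad_y_pred.mm(w2.t())   # dL/dh_relu = dL/dy_pred @ w2^T

# Both match autograd, confirming w2.t() (not grad_w2) is correct here.
print(torch.allclose(grad_w2, w2.grad))          # True
print(torch.allclose(grad_h_relu, h_relu.grad))  # True
```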