Bug: `successor.local_gradient_for_argument(self)` could be corrupted
Banyc opened this issue · 4 comments
Issue
[local_gradient_for_argument](https://github.com/pim-book/programmers-introduction-to-mathematics/blob/b8867dd443f2e4350e8b257d22bdc95de2c008d5/neural_network/neural_network.py#L125) might read a corrupted float value from the successor.

Say we have the computation graph:

```text
h
|
|
f
```

- $h$ has parameters $w$
- $f$ has parameters $v$
- $h : \mathbb{R}^m \to \mathbb{R}$
- $f : \mathbb{R}^n \to \mathbb{R}$
- $h$ is the successor of $f$
- $E$ is the error of the graph
- $\frac{\partial E}{\partial w} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w}$
- $\frac{\partial E}{\partial f} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial f}$
- $\frac{\partial E}{\partial v} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial v}$
- When doing backpropagation, the steps will be:

1. $h$ computes $\frac{\partial E}{\partial w}$ and caches $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial w}$
2. $h$ updates $w$ to $w'$
3. $f$ computes $\frac{\partial E}{\partial f}$, and $\frac{\partial h}{\partial f}$ gets cached along the way:
4. $\frac{\partial h}{\partial f}$ is not yet in the cache, so $h$ has to compute it now
5. $\frac{\partial h}{\partial f}$ is computed based on the new parameter $w'$
   - **This is the problem!** (see the sketch after this list)
   - $\frac{\partial h}{\partial f}$ is **corrupted**
6. $\frac{\partial h}{\partial f}$ is now in the cache
7. $\frac{\partial E}{\partial f}$ is computed by looking up both $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial f}$ in the cache
8. $\frac{\partial E}{\partial f}$ is now in the cache
9. $f$ computes $\frac{\partial f}{\partial v}$ and caches it
10. $f$ computes $\frac{\partial E}{\partial v}$ from $\frac{\partial f}{\partial v}$ and the **corrupted** $\frac{\partial h}{\partial f}$
11. $f$ updates $v$ based on the **corrupted** $\frac{\partial E}{\partial v}$
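To make the stale read concrete, here is a minimal, self-contained sketch. It is not the code from neural_network.py; the `ToySuccessor` class and the numbers in it are made up purely for illustration of a lazily cached local gradient.

```python
# Minimal illustration of the stale-cache read described above.
# This is NOT the code from neural_network.py; names and numbers are invented.

class ToySuccessor:
    """h(x) = w * x with a lazily cached local gradient dh/dx = w."""

    def __init__(self, w):
        self.w = w
        self._cached_local_gradient = None  # filled in on first access

    def local_gradient(self):
        # Lazily computed from the *current* value of w.
        if self._cached_local_gradient is None:
            self._cached_local_gradient = self.w
        return self._cached_local_gradient

    def update_parameter(self, step):
        self.w -= step  # w -> w'


# Correct order: read dh/dx while w is still the value used in the forward pass.
h = ToySuccessor(w=2.0)
correct = h.local_gradient()          # 2.0

# Buggy order: update w first, then let the predecessor trigger the lazy computation.
h_buggy = ToySuccessor(w=2.0)
h_buggy.update_parameter(step=0.5)    # w' = 1.5
stale = h_buggy.local_gradient()      # 1.5, computed from w': the "corrupted" value

print(correct, stale)                 # 2.0 1.5
```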
Solutions
I can come up with two solutions:
- compute `local_gradient` ($\frac{\partial h}{\partial f}$) at the beginning of [do_gradient_descent_step](https://github.com/pim-book/programmers-introduction-to-mathematics/blob/b8867dd443f2e4350e8b257d22bdc95de2c008d5/neural_network/neural_network.py#L107), before the parameters ($w$) are modified (see the sketch after this list)
- the successor $h$ distributes `local_gradient` ($\frac{\partial h}{\partial f}$) and `global_gradient` ($\frac{\partial E}{\partial h}$) to $f$ before the parameters ($w$) are modified
  - I have also coded a neural network based on yours; the distribution happens at [distribute_global_gradient_entries_to_operands](https://github.com/Banyc/neural_network/blob/c6970d2712c8c01286b0a10434723a073fb99ad7/src/nodes/node.rs#L116)
PS
Thank you for your book and the sample code, which helped me understand neural networks more deeply!
I have been slow to respond to this comment because it is quite detailed and I want to make sure I give it the time it deserves.
For starters, one cannot compose `f` and `h`, because the output of `f` is a single number but the input of `h` is a vector of length `m`. But I think the rest of your statement does not differ if `h` and `f` are single-input functions, so that can be ignored.
Next, I believe the structure of the code ensures that all gradients are computed before any update steps occur.
I believe you can see this by applying the following patch (attached as a file as well) and running any of the integration tests in `neural_network_integration_test.py`, e.g., with `pytest -s -k test_learn_xor_sigmoid neural_network_integration_test.py`:
corrupted_patch.txt
```diff
diff --git a/neural_network/neural_network.py b/neural_network/neural_network.py
index cee6a4a..db639e1 100644
--- a/neural_network/neural_network.py
+++ b/neural_network/neural_network.py
@@ -107,9 +107,11 @@ class Node:
     def do_gradient_descent_step(self, step_size):
         '''The core gradient step subroutine: compute the gradient for each of this node's
         tunable parameters, step away from the gradient.'''
+        print("Starting one gradient descent step")
         if self.has_parameters:
             for i, gradient_entry in enumerate(self.global_parameter_gradient):
                 # step away from the gradient
+                print("Doing gradient update for parameter %s" % i)
                 self.parameters[i] -= step_size * gradient_entry

     '''Gradient computations which don't depend on the node's definition.'''
@@ -147,6 +149,7 @@ class Node:
     @property
     def local_gradient(self):
         if self.cache.local_gradient is None:
+            print("Computing local gradient for %s" % (self, ))
             self.cache.local_gradient = self.compute_local_gradient()
         return self.cache.local_gradient

@@ -159,6 +162,7 @@ class Node:
     @property
     def local_parameter_gradient(self):
         if self.cache.local_parameter_gradient is None:
+            print("Computing local parameter gradient for %s" % (self, ))
             self.cache.local_parameter_gradient = self.compute_local_parameter_gradient()
         return self.cache.local_parameter_gradient

@@ -391,6 +395,7 @@ class NeuralNetwork:
         return self.error_node.compute_error(inputs, label)

     def backpropagation_step(self, inputs, label, step_size=None):
+        print("Doing one backpropagation step")
         self.compute_error(inputs, label)
         self.for_each(lambda node: node.do_gradient_descent_step(step_size))
```
The test output should show that all gradient computations occur before all parameter updates. E.g., here is what I see in one gradient descent step of the xor example:
```text
Doing one backpropagation step
Starting one gradient descent step
Starting one gradient descent step
Starting one gradient descent step
Computing local gradient for <neural_network.L2ErrorNode object at 0x7fb0b9e35748>
Computing local gradient for <neural_network.SigmoidNode object at 0x7fb0b9e356d8>
Computing local parameter gradient for <neural_network.LinearNode object at 0x7fb0b9e355c0>
Doing gradient update for parameter 0
Doing gradient update for parameter 1
Doing gradient update for parameter 2
Doing gradient update for parameter 3
```
The mnist example looks similar to me.
It's possible this issue is still occurring, and I'm too dense to see it. Perhaps you could try to formulate your issue as a failing test case? Or else, just an example network in the code with print statements that demonstrate the issue?
@j2kun I have come up with a test case: https://github.com/Banyc/programmers-introduction-to-mathematics/blob/banyc/issue_52/neural_network/neural_network_test.py#L257-L316
The bug is reproduced in that test case.
Also, thanks for the instructions on `pytest`! This is the first time I have used that tool and the guide helped me a lot! Your print method is also great, but I think a test case and `pytest` are enough to reproduce the bug, so I didn't print out the checkpoints.
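For reference, here is a rough pytest-style sketch of the kind of check such a test performs. It is not the linked test; it reuses the made-up `ToySuccessor` from the earlier sketch rather than the real `Node`/`NeuralNetwork` API.

```python
# Rough sketch of the kind of regression test that catches this bug.
# NOT the linked test; ToySuccessor is the made-up class from the earlier sketch.

def test_local_gradient_is_cached_before_parameter_update():
    w = 2.0

    # Expected: the predecessor should see dh/dx computed from the original w.
    expected = w

    # Buggy ordering: the successor updates w before the predecessor reads dh/dx.
    h = ToySuccessor(w=w)
    h.update_parameter(step=0.5)
    observed = h.local_gradient()

    # This assertion fails for the toy above because it reproduces the buggy
    # ordering; with a fix that warms the cache before the update, it passes.
    assert observed == expected
```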