pim-book/programmers-introduction-to-mathematics

Bug: `successor.local_gradient_for_argument(self)` could be corrupted

Banyc opened this issue · 4 comments

Banyc commented

Issue

local_gradient_for_argument might read a corrupted float value from the successor.

Say we have the computation graph:

    h
    |
    |
    f
  • $h$ has parameters $w$
  • $f$ has parameters $v$
  • $h : \mathbb{R}^m \to \mathbb{R}$
  • $f : \mathbb{R}^n \to \mathbb{R}$
  • $h$ is the successor of $f$
  • $E$ is the output (error) of the whole graph
  • $\frac{\partial E}{\partial w} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w}$
  • $\frac{\partial E}{\partial f} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial f}$
  • $\frac{\partial E}{\partial v} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial v}$
  • when doing backpropagation, the steps will be
    1. $h$ computes $\frac{\partial E}{\partial w}$ and caches $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial w}$
    2. $h$ updates $w$ to $w'$
    3. $f$ computes $\frac{\partial E}{\partial f}$; this requires $\frac{\partial h}{\partial f}$, which gets cached
      1. $\frac{\partial h}{\partial f}$ is not yet in cache, so $h$ will have to compute it now
      2. $\frac{\partial h}{\partial f}$ is computed based on the new parameter $w'$
        • This is the problem!
        • $\frac{\partial h}{\partial f}$ is corrupted
      3. $\frac{\partial h}{\partial f}$ is in cache now
      4. $\frac{\partial E}{\partial f}$ is computed by looking up both $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial f}$ in the cache
      5. $\frac{\partial E}{\partial f}$ is in cache now
    4. $f$ computes $\frac{\partial f}{\partial v}$ and caches it
    5. $f$ computes $\frac{\partial E}{\partial v}$ with $\frac{\partial f}{\partial v}$ and the corrupted $\frac{\partial h}{\partial f}$
    6. $f$ updates $v$ based on the corrupted $\frac{\partial E}{\partial v}$
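The race in the steps above can be reproduced with a toy lazily-cached node (a minimal sketch with hypothetical names, not the book's classes):

```python
# Minimal sketch of the bug: h's local gradient is cached lazily, so if
# h's parameter is updated before f asks for dh/df, the cache is filled
# from the already-updated parameter. (Hypothetical toy class, not the
# repo's API.)

class H:
    def __init__(self):
        self.w = 2.0              # parameter of h; h(x) = w * x
        self._local_grad = None   # lazy cache for dh/df

    def local_gradient_for_argument(self):
        # dh/df = w, but the cache is only filled on first access
        if self._local_grad is None:
            self._local_grad = self.w
        return self._local_grad

h = H()
dE_dh = 1.0

# Step 2 of the walkthrough: h updates w BEFORE f has asked for dh/df.
h.w -= 0.5  # w' = 1.5

# Step 3: f computes dE/df = dE/dh * dh/df, but dh/df is now derived
# from the updated w', not the w used in the forward pass.
dE_df = dE_dh * h.local_gradient_for_argument()
print(dE_df)  # 1.5, whereas the correct value (with the original w=2.0) is 2.0
```

With the original $w = 2$ the correct $\frac{\partial E}{\partial f}$ would be $2.0$; the stale cache yields $1.5$ instead.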

Solutions

I can come up with two solutions:

  • compute local_gradient ($\frac{\partial h}{\partial f}$) at the beginning of do_gradient_descent_step, before the parameters ($w$) are modified
  • the successor $h$ distributes local_gradient ($\frac{\partial h}{\partial f}$) and global_gradient ($\frac{\partial E}{\partial h}$) to $f$ before the parameters ($w$) are modified
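The first solution could be sketched like this (a standalone toy; the names mirror the repo's `local_gradient` and `do_gradient_descent_step`, but the class itself is hypothetical):

```python
# Sketch of solution 1: force the lazily cached local gradient to be
# computed from the CURRENT parameters before any parameter is modified.
# (Toy class, not the book's Node hierarchy.)

class Node:
    def __init__(self, w):
        self.w = w                        # node computes w * x
        self.cached_local_gradient = None

    @property
    def local_gradient(self):
        if self.cached_local_gradient is None:
            self.cached_local_gradient = self.w   # d(output)/d(input) = w
        return self.cached_local_gradient

    def do_gradient_descent_step(self, step_size, global_gradient):
        # The fix: touch local_gradient first, so the cache is filled
        # with the pre-update value of w.
        _ = self.local_gradient
        self.w -= step_size * global_gradient

node = Node(w=2.0)
node.do_gradient_descent_step(step_size=0.5, global_gradient=1.0)
print(node.w)               # 1.5 after the update
print(node.local_gradient)  # still 2.0: the pre-update gradient survives
```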

PS

Thank you for your book and the sample code; they helped me understand neural networks much more deeply!

Banyc commented

The math notation is rendering weirdly, so I am posting the raw markdown here:

# Issue

[local_gradient_for_argument](https://github.com/pim-book/programmers-introduction-to-mathematics/blob/b8867dd443f2e4350e8b257d22bdc95de2c008d5/neural_network/neural_network.py#L125) might read a corrupted float value from the successor.

Say we have the computation graph:

```text
    h
    |
    |
    f
```

- $h$ has parameters $w$
- $f$ has parameters $v$
- $h : \mathbb{R}^m \to \mathbb{R}$
- $f : \mathbb{R}^n \to \mathbb{R}$
- $h$ is the successor of $f$
- $E$ is the output (error) of the whole graph
- $\frac{\partial E}{\partial w} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w}$
- $\frac{\partial E}{\partial f} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial f}$
- $\frac{\partial E}{\partial v} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial v}$
- when doing backpropagation, the steps will be
  1. $h$ computes $\frac{\partial E}{\partial w}$ and caches $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial w}$
  1. $h$ updates $w$ to $w'$
  1. $f$ computes $\frac{\partial E}{\partial f}$; this requires $\frac{\partial h}{\partial f}$, which gets cached
     1. $\frac{\partial h}{\partial f}$ is not yet in cache, so $h$ will have to compute it now
     1. $\frac{\partial h}{\partial f}$ is computed based on the new parameter $w'$
        - **This is the problem!**
        - $\frac{\partial h}{\partial f}$ is **corrupted**
     1. $\frac{\partial h}{\partial f}$ is in cache now
     1. $\frac{\partial E}{\partial f}$ is computed by looking up both $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial f}$ in the cache
     1. $\frac{\partial E}{\partial f}$ is in cache now
  1. $f$ computes $\frac{\partial f}{\partial v}$ and caches it
  1. $f$ computes $\frac{\partial E}{\partial v}$ with $\frac{\partial f}{\partial v}$ and the **corrupted** $\frac{\partial h}{\partial f}$
  1. $f$ updates $v$ based on the **corrupted** $\frac{\partial E}{\partial v}$

# Solutions

I can come up with two solutions:

- compute `local_gradient` ($\frac{\partial h}{\partial f}$) at the beginning of [do_gradient_descent_step](https://github.com/pim-book/programmers-introduction-to-mathematics/blob/b8867dd443f2e4350e8b257d22bdc95de2c008d5/neural_network/neural_network.py#L107), before the parameters ($w$) are modified
- the successor $h$ distributes `local_gradient` ($\frac{\partial h}{\partial f}$) and `global_gradient` ($\frac{\partial E}{\partial h}$) to $f$ before the parameters ($w$) are modified
  - I have also coded a neural network based on yours, the distribution happens at [distribute_global_gradient_entries_to_operands](https://github.com/Banyc/neural_network/blob/c6970d2712c8c01286b0a10434723a073fb99ad7/src/nodes/node.rs#L116)
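The second (push-style) solution could look roughly like this in Python (toy classes for illustration, not the repo's API or my Rust crate's):

```python
# Sketch of solution 2: the successor h pushes dE/dh * dh/df down to f
# BEFORE h's parameters change, so f never reads h's state after the
# update. (Hypothetical classes, not the book's Node hierarchy.)

class F:
    def __init__(self):
        self.received_global_gradient = None

    def receive_global_gradient(self, value):
        self.received_global_gradient = value

class H:
    def __init__(self, operand):
        self.w = 2.0              # h(f) = w * f, so dh/df = w
        self.operand = operand

    def backward_and_update(self, dE_dh, step_size, dh_dw):
        # Distribute BEFORE mutating w: dE/df = dE/dh * dh/df.
        self.operand.receive_global_gradient(dE_dh * self.w)
        self.w -= step_size * dE_dh * dh_dw

f = F()
h = H(f)
h.backward_and_update(dE_dh=1.0, step_size=0.5, dh_dw=3.0)
print(f.received_global_gradient)  # 2.0, computed from the pre-update w
print(h.w)                         # 0.5 after the update
```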

j2kun commented

I have been slow to respond to this comment because it is quite detailed and I want to make sure I give it the time it deserves.

For starters, one cannot compose f and h because the output of f is a single number but the input of h is a vector of length m. But I think the rest of your statement does not differ if h and f are single-input functions, so that can be ignored.

Next, I believe the structure of the code ensures that all gradients are computed before any update steps occur.

I believe you can see this by applying the following patch (attached as a file as well), and running any of the integration tests in neural_network_integration_test.py, e.g., with pytest -s -k test_learn_xor_sigmoid neural_network_integration_test.py
corrupted_patch.txt

diff --git a/neural_network/neural_network.py b/neural_network/neural_network.py
index cee6a4a..db639e1 100644
--- a/neural_network/neural_network.py
+++ b/neural_network/neural_network.py
@@ -107,9 +107,11 @@ class Node:
     def do_gradient_descent_step(self, step_size):
         '''The core gradient step subroutine: compute the gradient for each of this node's
         tunable parameters, step away from the gradient.'''
+        print("Starting one gradient descent step")
         if self.has_parameters:
             for i, gradient_entry in enumerate(self.global_parameter_gradient):
                 # step away from the gradient
+                print("Doing gradient update for parameter %s" % i)
                 self.parameters[i] -= step_size * gradient_entry

     '''Gradient computations which don't depend on the node's definition.'''
@@ -147,6 +149,7 @@ class Node:
     @property
     def local_gradient(self):
         if self.cache.local_gradient is None:
+            print("Computing local gradient for %s" % (self, ))
             self.cache.local_gradient = self.compute_local_gradient()
         return self.cache.local_gradient

@@ -159,6 +162,7 @@ class Node:
     @property
     def local_parameter_gradient(self):
         if self.cache.local_parameter_gradient is None:
+            print("Computing local parameter gradient for %s" % (self, ))
             self.cache.local_parameter_gradient = self.compute_local_parameter_gradient()
         return self.cache.local_parameter_gradient

@@ -391,6 +395,7 @@ class NeuralNetwork:
         return self.error_node.compute_error(inputs, label)

     def backpropagation_step(self, inputs, label, step_size=None):
+        print("Doing one backpropagation step")
         self.compute_error(inputs, label)
         self.for_each(lambda node: node.do_gradient_descent_step(step_size))

The test output should show that all gradient computations occur before all parameter updates. E.g., here is what I see in one gradient descent step of the xor example:

Doing one backpropagation step
Starting one gradient descent step
Starting one gradient descent step
Starting one gradient descent step
Computing local gradient for <neural_network.L2ErrorNode object at 0x7fb0b9e35748>
Computing local gradient for <neural_network.SigmoidNode object at 0x7fb0b9e356d8>
Computing local parameter gradient for <neural_network.LinearNode object at 0x7fb0b9e355c0>
Doing gradient update for parameter 0
Doing gradient update for parameter 1
Doing gradient update for parameter 2
Doing gradient update for parameter 3

The mnist example looks similar to me.

It's possible this issue is still occurring, and I'm too dense to see it. Perhaps you could try to formulate your issue as a failing test case? Or else, just an example network in the code with print statements that demonstrate the issue?

Banyc commented

Also, thanks for the instructions on pytest! This is the first time I have used that tool, and the guide helped me a lot! Your print method is also great! But I think a test case run with pytest is enough to reproduce the bug, so I didn't print out the checkpoints.