pim-book/programmers-introduction-to-mathematics

Bug: `successor.local_gradient_for_argument(self)` could be corrupted

Banyc opened this issue · 4 comments

Banyc commented

Issue

local_gradient_for_argument might read a corrupted float value from the successor.

Say we have the computation graph:

    h
    |
    |
    f
  • $h$ has parameters $w$
  • $f$ has parameters $v$
  • $h : \mathbb{R}^m \to \mathbb{R}$
  • $f : \mathbb{R}^n \to \mathbb{R}$
  • $h$ is the successor of $f$
  • $E$ is the output (error) of the whole graph
  • $\frac{\partial E}{\partial w} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w}$
  • $\frac{\partial E}{\partial f} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial f}$
  • $\frac{\partial E}{\partial v} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial v}$
  • when doing backpropagation, the steps will be
    1. $h$ computes $\frac{\partial E}{\partial w}$ and caches $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial w}$
    2. $h$ updates $w$ to $w'$
    3. $f$ computes $\frac{\partial E}{\partial f}$; this requires $\frac{\partial h}{\partial f}$, which gets cached
      1. $\frac{\partial h}{\partial f}$ is not yet in cache, so $h$ will have to compute it now
      2. $\frac{\partial h}{\partial f}$ is computed based on the new parameter $w'$
        • This is the problem!
        • $\frac{\partial h}{\partial f}$ is corrupted
      3. $\frac{\partial h}{\partial f}$ is in cache now
      4. $\frac{\partial E}{\partial f}$ is computed by looking up both $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial f}$ in the cache
      5. $\frac{\partial E}{\partial f}$ is in cache now
    4. $f$ computes $\frac{\partial f}{\partial v}$ and caches it
    5. $f$ computes $\frac{\partial E}{\partial v}$ with $\frac{\partial f}{\partial v}$ and the corrupted $\frac{\partial h}{\partial f}$
    6. $f$ updates $v$ based on the corrupted $\frac{\partial E}{\partial v}$
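The race in the steps above can be reproduced with a toy lazily-cached node (a minimal sketch with hypothetical names, not the book's classes):

```python
# Minimal sketch of the bug: h's local gradient is cached lazily, so if
# h's parameter is updated before f asks for dh/df, the cache is filled
# from the already-updated parameter. (Hypothetical toy class, not the
# repo's API.)

class H:
    def __init__(self):
        self.w = 2.0              # parameter of h; h(x) = w * x
        self._local_grad = None   # lazy cache for dh/df

    def local_gradient_for_argument(self):
        # dh/df = w, but the cache is only filled on first access
        if self._local_grad is None:
            self._local_grad = self.w
        return self._local_grad

h = H()
dE_dh = 1.0

# Step 2 of the walkthrough: h updates w BEFORE f has asked for dh/df.
h.w -= 0.5  # w' = 1.5

# Step 3: f computes dE/df = dE/dh * dh/df, but dh/df is now derived
# from the updated w', not the w used in the forward pass.
dE_df = dE_dh * h.local_gradient_for_argument()
print(dE_df)  # 1.5, whereas the correct value (with the original w=2.0) is 2.0
```

With the original $w = 2$ the correct $\frac{\partial E}{\partial f}$ would be $2.0$; the stale cache yields $1.5$ instead.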

Solutions

I can come up with two solutions:

  • compute local_gradient ($\frac{\partial h}{\partial f}$) at the beginning of do_gradient_descent_step, before the parameters ($w$) are modified
  • the successor $h$ distributes local_gradient ($\frac{\partial h}{\partial f}$) and global_gradient ($\frac{\partial E}{\partial h}$) to $f$ before the parameters ($w$) are modified
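The first solution could be sketched like this (a standalone toy; the names mirror the repo's `local_gradient` and `do_gradient_descent_step`, but the class itself is hypothetical):

```python
# Sketch of solution 1: force the lazily cached local gradient to be
# computed from the CURRENT parameters before any parameter is modified.
# (Toy class, not the book's Node hierarchy.)

class Node:
    def __init__(self, w):
        self.w = w                        # node computes w * x
        self.cached_local_gradient = None

    @property
    def local_gradient(self):
        if self.cached_local_gradient is None:
            self.cached_local_gradient = self.w   # d(output)/d(input) = w
        return self.cached_local_gradient

    def do_gradient_descent_step(self, step_size, global_gradient):
        # The fix: touch local_gradient first, so the cache is filled
        # with the pre-update value of w.
        _ = self.local_gradient
        self.w -= step_size * global_gradient

node = Node(w=2.0)
node.do_gradient_descent_step(step_size=0.5, global_gradient=1.0)
print(node.w)               # 1.5 after the update
print(node.local_gradient)  # still 2.0: the pre-update gradient survives
```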

PS

Thank you for your book and the sample code; they helped me understand neural networks much more deeply!

Banyc commented

The math notation is rendering weirdly, so I am posting the raw markdown here:

# Issue

[local_gradient_for_argument](https://github.com/pim-book/programmers-introduction-to-mathematics/blob/b8867dd443f2e4350e8b257d22bdc95de2c008d5/neural_network/neural_network.py#L125) might read a corrupted float value from the successor.

Say we have the computation graph:

```text
    h
    |
    |
    f
```

- $h$ has parameters $w$
- $f$ has parameters $v$
- $h : \mathbb{R}^m \to \mathbb{R}$
- $f : \mathbb{R}^n \to \mathbb{R}$
- $h$ is the successor of $f$
- $E$ is the output (error) of the whole graph
- $\frac{\partial E}{\partial w} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w}$
- $\frac{\partial E}{\partial f} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial f}$
- $\frac{\partial E}{\partial v} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial v}$
- when doing backpropagation, the steps will be
  1. $h$ computes $\frac{\partial E}{\partial w}$ and caches $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial w}$
  1. $h$ updates $w$ to $w'$
  1. $f$ computes $\frac{\partial E}{\partial f}$; this requires $\frac{\partial h}{\partial f}$, which gets cached
     1. $\frac{\partial h}{\partial f}$ is not yet in cache, so $h$ will have to compute it now
     1. $\frac{\partial h}{\partial f}$ is computed based on the new parameter $w'$
        - **This is the problem!**
        - $\frac{\partial h}{\partial f}$ is **corrupted**
     1. $\frac{\partial h}{\partial f}$ is in cache now
     1. $\frac{\partial E}{\partial f}$ is computed by looking up both $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial f}$ in the cache
     1. $\frac{\partial E}{\partial f}$ is in cache now
  1. $f$ computes $\frac{\partial f}{\partial v}$ and caches it
  1. $f$ computes $\frac{\partial E}{\partial v}$ with $\frac{\partial f}{\partial v}$ and the **corrupted** $\frac{\partial h}{\partial f}$
  1. $f$ updates $v$ based on the **corrupted** $\frac{\partial E}{\partial v}$

# Solutions

I can come up with two solutions:

- compute `local_gradient` ($\frac{\partial h}{\partial f}$) at the beginning of [do_gradient_descent_step](https://github.com/pim-book/programmers-introduction-to-mathematics/blob/b8867dd443f2e4350e8b257d22bdc95de2c008d5/neural_network/neural_network.py#L107), before the parameters ($w$) are modified
- the successor $h$ distributes `local_gradient` ($\frac{\partial h}{\partial f}$) and `global_gradient` ($\frac{\partial E}{\partial h}$) to $f$ before the parameters ($w$) are modified
  - I have also coded a neural network based on yours, the distribution happens at [distribute_global_gradient_entries_to_operands](https://github.com/Banyc/neural_network/blob/c6970d2712c8c01286b0a10434723a073fb99ad7/src/nodes/node.rs#L116)
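The second (push-style) solution could look roughly like this in Python (toy classes for illustration, not the repo's API or my Rust crate's):

```python
# Sketch of solution 2: the successor h pushes dE/dh * dh/df down to f
# BEFORE h's parameters change, so f never reads h's state after the
# update. (Hypothetical classes, not the book's Node hierarchy.)

class F:
    def __init__(self):
        self.received_global_gradient = None

    def receive_global_gradient(self, value):
        self.received_global_gradient = value

class H:
    def __init__(self, operand):
        self.w = 2.0              # h(f) = w * f, so dh/df = w
        self.operand = operand

    def backward_and_update(self, dE_dh, step_size, dh_dw):
        # Distribute BEFORE mutating w: dE/df = dE/dh * dh/df.
        self.operand.receive_global_gradient(dE_dh * self.w)
        self.w -= step_size * dE_dh * dh_dw

f = F()
h = H(f)
h.backward_and_update(dE_dh=1.0, step_size=0.5, dh_dw=3.0)
print(f.received_global_gradient)  # 2.0, computed from the pre-update w
print(h.w)                         # 0.5 after the update
```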

j2kun commented

I have been slow to respond to this comment because it is quite detailed and I want to make sure I give it the time it deserves.

For starters, one cannot compose f and h because the output of f is a single number but the input of h is a vector of length m. But I think the rest of your statement does not differ if h and f are single-input functions, so that can be ignored.

Next, I believe the structure of the code ensures that all gradients are computed before any update steps occur.

I believe you can see this by applying the following patch (attached as a file as well), and running any of the integration tests in neural_network_integration_test.py, e.g., with pytest -s -k test_learn_xor_sigmoid neural_network_integration_test.py
corrupted_patch.txt

diff --git a/neural_network/neural_network.py b/neural_network/neural_network.py
index cee6a4a..db639e1 100644
--- a/neural_network/neural_network.py
+++ b/neural_network/neural_network.py
@@ -107,9 +107,11 @@ class Node:
     def do_gradient_descent_step(self, step_size):
         '''The core gradient step subroutine: compute the gradient for each of this node's
         tunable parameters, step away from the gradient.'''
+        print("Starting one gradient descent step")
         if self.has_parameters:
             for i, gradient_entry in enumerate(self.global_parameter_gradient):
                 # step away from the gradient
+                print("Doing gradient update for parameter %s" % i)
                 self.parameters[i] -= step_size * gradient_entry

     '''Gradient computations which don't depend on the node's definition.'''
@@ -147,6 +149,7 @@ class Node:
     @property
     def local_gradient(self):
         if self.cache.local_gradient is None:
+            print("Computing local gradient for %s" % (self, ))
             self.cache.local_gradient = self.compute_local_gradient()
         return self.cache.local_gradient

@@ -159,6 +162,7 @@ class Node:
     @property
     def local_parameter_gradient(self):
         if self.cache.local_parameter_gradient is None:
+            print("Computing local parameter gradient for %s" % (self, ))
             self.cache.local_parameter_gradient = self.compute_local_parameter_gradient()
         return self.cache.local_parameter_gradient

@@ -391,6 +395,7 @@ class NeuralNetwork:
         return self.error_node.compute_error(inputs, label)

     def backpropagation_step(self, inputs, label, step_size=None):
+        print("Doing one backpropagation step")
         self.compute_error(inputs, label)
         self.for_each(lambda node: node.do_gradient_descent_step(step_size))

The test output should show that all gradient computations occur before all parameter updates. E.g., here is what I see in one gradient descent step of the xor example:

Doing one backpropagation step
Starting one gradient descent step
Starting one gradient descent step
Starting one gradient descent step
Computing local gradient for <neural_network.L2ErrorNode object at 0x7fb0b9e35748>
Computing local gradient for <neural_network.SigmoidNode object at 0x7fb0b9e356d8>
Computing local parameter gradient for <neural_network.LinearNode object at 0x7fb0b9e355c0>
Doing gradient update for parameter 0
Doing gradient update for parameter 1
Doing gradient update for parameter 2
Doing gradient update for parameter 3

The mnist example looks similar to me.

It's possible this issue is still occurring, and I'm too dense to see it. Perhaps you could try to formulate your issue as a failing test case? Or else, just an example network in the code with print statements that demonstrate the issue?

Banyc commented

Also, thanks for the instructions on pytest! This is the first time I have used that tool, and the guide helped me a lot! Your print method is also great! But I think a test case run with pytest is enough to reproduce the bug, so I didn't print out the checkpoints.