A Numpy Implementation of Neural Networks from A Theoretical Perspective
This repository implements a simple two-layer neural network in pure NumPy. Doing so requires a firm grasp of the chain rule for derivatives, especially in matrix notation, so this repository first elaborates on the derivation from a theoretical perspective and then gives its NumPy implementation.
Prerequisites
Before reading the derivation, install the GitHub with MathJax extension for Chrome so that the equations render properly. After installation, refresh this page.
Derivation
Suppose we have a two-layer neural network, illustrated as follows.
Then we can write its forward pass as

$$u^T = x^T W, \qquad h^T = \max(0,\, u^T), \qquad y^T = h^T S, \qquad \mathcal{L} = \lVert y^T - t^T \rVert_F^2,$$

where $x^T \in \mathbb{R}^{B \times n_x}$ is a mini-batch input, $t^T \in \mathbb{R}^{B \times n_y}$ is the corresponding target, $W \in \mathbb{R}^{n_x \times n_h}$ and $S \in \mathbb{R}^{n_h \times n_y}$ are the learnable parameters of the network, $\max(0, \cdot)$ is the element-wise ReLU activation, and $\mathcal{L}$ is the (sum-of-squares) loss function.
Now, the objective is to derive the gradients with respect to $S$ and $W$, respectively.
Let's first examine $\nabla_S \mathcal{L}$.
By rewriting $y^T = h^T S$ in terms of the columns of $S$, we obtain
$$\begin{equation}
\begin{split}
(y_1, y_2, \dots, y_{n_y})
& = (h_1, h_2, \dots, h_{n_h})\, S \\
& = (h_1, h_2, \dots, h_{n_h})
\begin{bmatrix}
| & | & & | \\
s_1 & s_2 & \cdots & s_{n_y} \\
| & | & & | \\
\end{bmatrix}
\end{split}
\end{equation}$$
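Reading this product column by column, $y_j = h^T s_j$ depends on $S$ only through its $j$-th column $s_j$, so the chain rule gives

$$\nabla_{s_j} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial y_j}\, h
\quad\Longrightarrow\quad
\nabla_S \mathcal{L} = h \, (\nabla_y \mathcal{L})^T \in \mathbb{R}^{n_h \times n_y},$$

which is an outer product. Summing this outer product over the $B$ examples of a mini-batch gives exactly the batched expression used in the code below, `grad_S = h.T.dot(grad_y)`. The gradient with respect to $W$ follows from the same column-wise argument applied to $u^T = x^T W$, with the ReLU contributing an element-wise mask that zeroes the gradient wherever $u < 0$ (the `grad_u[u < 0] = 0` line in the code).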
import numpy as np

# Dimensions: mini-batch size, input, hidden, and output sizes
batch_size, n_x, n_h, n_y = 64, 1000, 100, 10

# Random input x and target t
x = np.random.randn(batch_size, n_x)
t = np.random.randn(batch_size, n_y)

# Randomly initialize the weights
W = np.random.randn(n_x, n_h)
S = np.random.randn(n_h, n_y)

learning_rate = 1e-6
for i in range(500):
    # Forward pass: compute predicted y
    u = x.dot(W)          # hidden pre-activation
    h = np.maximum(u, 0)  # ReLU activation
    y = h.dot(S)

    # Compute and print the sum-of-squares loss
    loss = np.square(y - t).sum()
    print("Iter: {}\tLoss: {}".format(i, loss))

    # Backprop: gradients of the loss with respect to S and W
    grad_y = 2.0 * (y - t)
    grad_S = h.T.dot(grad_y)
    grad_h = grad_y.dot(S.T)
    grad_u = grad_h.copy()
    grad_u[u < 0] = 0     # ReLU gradient: zero wherever u < 0
    grad_W = x.T.dot(grad_u)

    # Update weights by gradient descent
    W -= learning_rate * grad_W
    S -= learning_rate * grad_S
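As a quick sanity check on the derivation (not part of the repository's code), the analytical gradients above can be compared against central finite differences on a tiny network. The sketch below is illustrative; `numerical_grad` and `eps` are names introduced here, not taken from the original implementation.

import numpy as np

def numerical_grad(loss_fn, P, eps=1e-5):
    # Estimate d(loss)/dP entry by entry with central finite differences
    g = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        old = P[idx]
        P[idx] = old + eps
        loss_plus = loss_fn()
        P[idx] = old - eps
        loss_minus = loss_fn()
        P[idx] = old
        g[idx] = (loss_plus - loss_minus) / (2 * eps)
    return g

# A tiny network keeps the element-wise loop above fast
np.random.seed(0)
x = np.random.randn(4, 5)
t = np.random.randn(4, 3)
W = np.random.randn(5, 6)
S = np.random.randn(6, 3)

def loss_fn():
    h = np.maximum(x.dot(W), 0)
    return np.square(h.dot(S) - t).sum()

# Analytical gradients, computed exactly as in the training loop above
u = x.dot(W)
h = np.maximum(u, 0)
y = h.dot(S)
grad_y = 2.0 * (y - t)
grad_S = h.T.dot(grad_y)
grad_u = grad_y.dot(S.T)
grad_u[u < 0] = 0
grad_W = x.T.dot(grad_u)

# Both differences should be very small (roughly 1e-6 or below)
print(np.max(np.abs(grad_S - numerical_grad(loss_fn, S))))
print(np.max(np.abs(grad_W - numerical_grad(loss_fn, W))))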
Postscript
This had been confusing me: I thought the whole derivation would involve tensor products. Surprisingly, it turns out to be just an application of the fourth definition of the matrix product.
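Assuming the "fourth definition" refers to the outer-product (column-times-row) view of matrix multiplication, $AB = \sum_k a_k b_k^T$ with $a_k$ the columns of $A$ and $b_k^T$ the rows of $B$, that is precisely how `grad_S = h.T.dot(grad_y)` accumulates one outer product per example over the mini-batch. A small check with illustrative names:

import numpy as np

H = np.random.randn(5, 3)   # rows play the role of h^T for a batch of 5 examples
G = np.random.randn(5, 2)   # rows play the role of grad_y for the same batch

# Outer-product view of the product: H^T G = sum over examples of outer(h_b, g_b)
outer_sum = sum(np.outer(H[b], G[b]) for b in range(H.shape[0]))
assert np.allclose(H.T.dot(G), outer_sum)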