Input data is fed into the neural network - from the left - and gets forward propagated through it to produce the output on the other end - the right!...
propagated: added, multiplied, divided, blah blahed.. we'll go over it!
Mathematically, a neural network is simply a pure function with many, many parameters.
Let's discuss functions for a bit!
It's common knowledge that a function is a relationship between a set of possible inputs and a set of possible outputs.
In theory this mapping can be totally random and useless, but all practical and useful functions embody a pattern.
For instance, when we look at the plot of a function, we can see the pattern it embodies at a glance.
Similarly, the infamous equation of a line, $y = mx + b$, tells us that once the parameters $m$ and $b$ are accurately known, $x$ and $y$ have a linear relationship.
And that's essentially what linear regression is: finding the values of $m$ and $b$ that best capture the pattern hidden in a set of data points.
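To make that concrete, here's a tiny sketch (the data and the noise level are made up for illustration): we scatter points around a known line and ask NumPy to recover $m$ and $b$ from them.

```python
import numpy as np

# Hypothetical data: points scattered around the line y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Linear regression: find the m and b that best fit the data
m, b = np.polyfit(x, y, deg=1)
print(f"recovered m = {m:.2f}, b = {b:.2f}")  # close to 2 and 1
```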
No surprise now when we think about conversational AIs like ChatGPT. It's just a very complex - 175 billion parameter - function that has pretty accurately captured the pattern of human language. And not just human language: it has gone one step meta and captured the patterns built on top of human language - human knowledge.
It's a heartless, mindless function that receives a body of text - your question, and the conversation before it - and predicts the text that should come next.
Also, if you've noticed, when you're interacting with ChatGPT it feels like it's typing one word after another. That's not a UX thing, it's actually the model predicting one word after another - for nerds: WebSocket. It's given the existing conversation and it produces the next word; then it's given the conversation with the new word added and it generates the next word, and so on until it predicts an "end of text" token, which is weirdly similar to how we decide when to shut up!
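As a rough sketch of that loop - `predict_next_word` and the end-of-text marker below are hypothetical stand-ins, not a real API - generation looks something like this:

```python
# A sketch of autoregressive generation. `predict_next_word` is a
# hypothetical stand-in for the model, not a real API.
def generate(conversation: str, predict_next_word) -> str:
    END_OF_TEXT = "<|endoftext|>"
    while True:
        word = predict_next_word(conversation)  # model sees everything so far
        if word == END_OF_TEXT:                 # the model decides to "shut up"
            break
        conversation += word                    # append the new word and go again
    return conversation
```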
It's easy to understand how neural networks produce useful output once they have been assigned correct parameters - analogues for $m$ and $b$ in the line equation. The hard part is figuring out how those millions (or billions) of parameters get their correct values in the first place. It's a really cool mathematical technique called backpropagation.
The landscape of AI is ruled by gods of Calculus and Linear Algebra. This text is especially for people who love understanding things from a first principles perspective, and are willing to love math because that's the only way this thing will ever make sense. Remember thinking "when will I ever use this?" in Cal and LA classes?? You can use this now.
Here goes another attempt at slaying the BackProp dragon!
Before we can start understanding BackProp we must express forward propagation mathematically - the process of taking a neural network's input and producing an output.
From here onwards the discussion will get highly technical; readers are expected to have a general idea about weights, biases, and artificial neurons.
For a single neuron with inputs $x_1, x_2, \ldots, x_n$, weights $w_1, w_2, \ldots, w_n$ and bias $b$:

The output of this neuron is given by

$$a = \sigma(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)$$

where $\sigma$ is the activation function - sigmoid in our case.
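As a minimal sketch in NumPy (the input, weight and bias values below are made up), a single neuron's output is just a weighted sum pushed through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up values for a neuron with 3 inputs
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # one weight per input
b = 0.2                          # bias

a = sigmoid(np.dot(w, x) + b)    # output (activation) of the neuron
print(a)
```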
Now zooming out to the whole network:

- Imagine it laid out from left to right, with a number of layers.
- Each layer is a vertical column of neurons.
- The leftmost layer is the input layer. Each neuron in this layer connects to all the neurons in the next layer through weights.
- The rightmost layer is the output layer. It is fully connected with the layer on its left, and its activations are the output of the network.
- All the layers in the middle are called hidden layers. These layers are fully connected on both sides: they receive input from the layer on their left and produce activations which get forwarded to the layer on their right.
Each layer stores what is called a weight matrix. This matrix stores all the weights that connect this layer to the previous layer - the layer on the left. A weight is a floating point value that decides the strength of the connection between two neurons. We formally define a weight as:
$w_{jk}^{L}$ represents a weight that connects the $j^{th}$ neuron in layer $L$ to the $k^{th}$ neuron in layer $L-1$.

Example: $w_{27}^4$ represents the weight that connects the $2^{nd}$ neuron in layer $4$ to the $7^{th}$ neuron in layer $3$.
The weight matrix for layer $L$ is denoted by $w^L$ and holds all the weights connecting layer $L$ to layer $L-1$.

The number of rows in $w^L$ equals the number of neurons in layer $L$, and the number of columns equals the number of neurons in layer $L-1$.

In summary, following is the layout of a weight matrix for layer $L$ with $j$ neurons, whose previous layer has $k$ neurons:

$$
w^L =
\begin{bmatrix}
w_{11}^L & w_{12}^L & \cdots & w_{1k}^L \\
w_{21}^L & w_{22}^L & \cdots & w_{2k}^L \\
\vdots & \vdots & \ddots & \vdots \\
w_{j1}^L & w_{j2}^L & \cdots & w_{jk}^L
\end{bmatrix}
$$
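To make the indexing concrete, here's a small sketch (the layer sizes are arbitrary): a weight matrix for a layer of 4 neurons whose previous layer has 7 neurons.

```python
import numpy as np

n_prev = 7   # neurons in layer L-1
n_curr = 4   # neurons in layer L

# w[j-1, k-1] holds w_jk: the weight connecting neuron j in layer L
# to neuron k in layer L-1 (small random initial values)
w = np.random.randn(n_curr, n_prev)

print(w.shape)   # (4, 7): rows = this layer, columns = previous layer
print(w[1, 6])   # w_27: 2nd neuron here <- 7th neuron in the previous layer
```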
Similarly, $b_{j}^{L}$ denotes the bias, $z_{j}^{L}$ the weighted input, and $a_{j}^{L}$ the activation of the $j^{th}$ neuron in layer $L$.

Example: $b_{4}^5$ represents the bias for the $4^{th}$ neuron in layer $5$.

Example: $z_{4}^5$ represents the weighted input for the $4^{th}$ neuron in layer $5$.

Example: $a_{4}^5$ represents the activation for the $4^{th}$ neuron in layer $5$.
The weighted input for a neuron $j$ in layer $L$ is given by:

$$z_{j}^{L} = \sum_{k} w_{jk}^{L} \, a_{k}^{L-1} + b_{j}^{L}$$

The activation for a neuron $j$ in layer $L$ is given by:

$$a_{j}^{L} = \sigma(z_{j}^{L})$$
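Written out as plain loops (a sketch with made-up layer sizes and random parameters), these two per-neuron equations look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up layer: 7 neurons feeding into 4 neurons
a_prev = np.random.rand(7)    # activations of layer L-1
w = np.random.randn(4, 7)     # weight matrix of layer L
b = np.random.randn(4)        # biases of layer L

z = np.zeros(4)
a = np.zeros(4)
for j in range(4):            # for every neuron j in layer L
    z[j] = sum(w[j, k] * a_prev[k] for k in range(7)) + b[j]
    a[j] = sigmoid(z[j])
```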
When we're implementing neural networks we rely on vector/matrix operations to take advantage of SIMD operations available on modern CPUs and GPUs. Therefore we should express equations for forward/backward propagation in vectorized forms.
We've already defined the vectorized form of the weights as the weight matrix.
For a layer $L$, the bias vector $b^L$ is simply the column vector of all the biases in that layer:

$$b^L = \begin{bmatrix} b_{1}^{L} \\ b_{2}^{L} \\ \vdots \\ b_{j}^{L} \end{bmatrix}$$

For a layer $L$, the activation vector $a^L$ is the column vector of all the activations in that layer:

$$a^L = \begin{bmatrix} a_{1}^{L} \\ a_{2}^{L} \\ \vdots \\ a_{j}^{L} \end{bmatrix}$$

where $j$ is the number of neurons in layer $L$.
For a layer $L$, the weighted input vector is given by:

$$z^{L} = w^{L} a^{L-1} + b^{L}$$

where:

- $w^L$ is the weight matrix of layer $L$
- $a^{L-1}$ is the activation vector of layer $L-1$
- $b^L$ is the bias vector of layer $L$
For a layer $L$, the activation vector is given by:

$$a^{L} = \sigma(z^{L})$$

where:

- $\sigma$ is the activation function - sigmoid in this case
- $z^L$ is the weighted input vector for layer $L$
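In NumPy both vectorized equations collapse into two lines (again a sketch with made-up sizes and random parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a_prev = np.random.rand(7)    # a^(L-1): activation vector of layer L-1
w = np.random.randn(4, 7)     # w^L: weight matrix of layer L
b = np.random.randn(4)        # b^L: bias vector of layer L

z = w @ a_prev + b            # z^L = w^L a^(L-1) + b^L, all neurons at once
a = sigmoid(z)                # a^L = sigma(z^L)
```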
Given $a^{L-1}$ - the activation vector of the previous layer - we forward propagate by first computing

$$z^{L} = w^{L} a^{L-1} + b^{L}$$

and then we calculate

$$a^{L} = \sigma(z^{L})$$

This allows us to compute the activations of the next layer, and so on, until we reach the output layer.
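Putting it together, a full forward pass is just that pair of equations applied layer by layer. A minimal sketch (the network shape is made up and the parameters are random, not trained):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward propagate input x through the network."""
    a = x                                  # activations of the input layer
    for w, b in zip(weights, biases):      # one step per layer
        z = w @ a + b                      # z^L = w^L a^(L-1) + b^L
        a = sigmoid(z)                     # a^L = sigma(z^L)
    return a                               # activations of the output layer

# Made-up network: 3 inputs -> 5 hidden neurons -> 2 outputs
sizes = [3, 5, 2]
weights = [np.random.randn(n, m) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(n) for n in sizes[1:]]

output = forward(np.array([0.1, 0.7, -0.3]), weights, biases)
print(output)
```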