This repo contains my learnings for PyTorch.
The bare minimum hello world for PyTorch:

```sh
poetry run python src/pytorch-hello.py
```
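As a sketch, such a hello world might look roughly like this (the actual contents of `src/pytorch-hello.py` may differ; the tensor shapes below are arbitrary):

```python
# Minimal PyTorch check: build two tensors and multiply them.
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.rand(3, 3, device=device)  # random 3x3 tensor
y = torch.ones(3, 3, device=device)  # 3x3 tensor of ones

print(f"Hello from PyTorch on {device}")
print(x @ y)  # matrix product, another 3x3 tensor
```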
Train a simple feed-forward neural network on FashionMNIST:

```sh
poetry run python src/fashion-mnist.py
```
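A condensed sketch of such a training loop (the layer sizes, optimizer and hyperparameters are assumptions, not necessarily what `src/fashion-mnist.py` uses):

```python
# Train a small feed-forward classifier on FashionMNIST.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# FashionMNIST: 28x28 grayscale images, 10 clothing classes.
train_data = datasets.FashionMNIST(
    root="data", train=True, download=True, transform=transforms.ToTensor()
)
loader = DataLoader(train_data, batch_size=64, shuffle=True)

model = nn.Sequential(
    nn.Flatten(),            # 28x28 image -> 784-dim vector
    nn.Linear(28 * 28, 512),
    nn.ReLU(),
    nn.Linear(512, 10),      # one logit per class
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(5):
    for images, labels in loader:
        loss = loss_fn(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
```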
Discovering the Transformer paper
Formula | Description |
---|---|
$\mathbf{Q}$ | The Query |
$\mathbf{K}$ | The Key |
$\mathbf{V}$ | The Value |
$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{\mathbf{QK}^T}{\sqrt{d_k}} \right) \mathbf{V}$ | The self-attention function, known as scaled dot-product attention |
$\mathbf{X}$ | An embedded input |
$\mathbf{W}^Q$ | The query weight matrix, learned during training |
$n$ | The number of embedded input vectors |
$d_k$ | The length of a single word embedding vector |
$\mathbf{q}_i, \mathbf{k}_i$ | Vectors of dimension $d_k$ |
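To make the symbols concrete, a small sketch of how $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ come out of an embedded input $\mathbf{X}$ via learned weight matrices (the sizes and the names `W_Q`, `W_K`, `W_V` are illustrative assumptions):

```python
import torch

n, d_k = 4, 8            # n embedded input vectors, each of length d_k
X = torch.rand(n, d_k)   # the embedded input

# Learned projection weights; in a real model these are nn.Linear layers
# whose parameters are updated during training.
W_Q = torch.rand(d_k, d_k)
W_K = torch.rand(d_k, d_k)
W_V = torch.rand(d_k, d_k)

Q = X @ W_Q  # the queries
K = X @ W_K  # the keys
V = X @ W_V  # the values
print(Q.shape, K.shape, V.shape)  # each (n, d_k)
```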
$$ \mathbf{QK}^T =
\begin{bmatrix}
e_{11} & e_{12} & \dots & e_{1n} \\
e_{21} & e_{22} & \dots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn}
\end{bmatrix} $$
$$ \frac{\mathbf{QK}^T}{\sqrt{d_k}} =
\begin{bmatrix}
\tfrac{e_{11}}{\sqrt{d_k}} & \tfrac{e_{12}}{\sqrt{d_k}} & \dots & \tfrac{e_{1n}}{\sqrt{d_k}} \\
\tfrac{e_{21}}{\sqrt{d_k}} & \tfrac{e_{22}}{\sqrt{d_k}} & \dots & \tfrac{e_{2n}}{\sqrt{d_k}} \\
\vdots & \vdots & \ddots & \vdots \\
\tfrac{e_{m1}}{\sqrt{d_k}} & \tfrac{e_{m2}}{\sqrt{d_k}} & \dots & \tfrac{e_{mn}}{\sqrt{d_k}}
\end{bmatrix} $$
$$ \text{softmax} \left( \frac{\mathbf{QK}^T}{\sqrt{d_k}} \right) =
\begin{bmatrix}
\text{softmax} ( \tfrac{e_{11}}{\sqrt{d_k}} & \tfrac{e_{12}}{\sqrt{d_k}} & \dots & \tfrac{e_{1n}}{\sqrt{d_k}} ) \\
\text{softmax} ( \tfrac{e_{21}}{\sqrt{d_k}} & \tfrac{e_{22}}{\sqrt{d_k}} & \dots & \tfrac{e_{2n}}{\sqrt{d_k}} ) \\
\vdots & \vdots & \ddots & \vdots \\
\text{softmax} ( \tfrac{e_{m1}}{\sqrt{d_k}} & \tfrac{e_{m2}}{\sqrt{d_k}} & \dots & \tfrac{e_{mn}}{\sqrt{d_k}} )
\end{bmatrix} $$
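Putting the three steps together, a sketch of scaled dot-product attention in plain PyTorch (shapes and values are illustrative):

```python
import math

import torch
import torch.nn.functional as F

n, d_k = 4, 8          # n input vectors of dimension d_k
Q = torch.rand(n, d_k)
K = torch.rand(n, d_k)
V = torch.rand(n, d_k)

scores = Q @ K.T                     # QK^T: the raw score matrix e_ij, shape (n, n)
scaled = scores / math.sqrt(d_k)     # divide every entry by sqrt(d_k)
weights = F.softmax(scaled, dim=-1)  # softmax over each row, so rows sum to 1
output = weights @ V                 # weighted sum of the values, shape (n, d_k)
print(output.shape)
```

Recent PyTorch releases (2.0+) also bundle this whole computation as `torch.nn.functional.scaled_dot_product_attention`, with fused kernels under the hood.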