Issues
No Field in torchtext.data
#35 opened by dtdo90 - 1 comment
Question about k × hk weight matrices
#30 opened by ramesaliyev - 1 comment
Why the use of log_softmax?
#36 opened by dsantiago - 0 comments
How to tokenize a testing phrase
#37 opened by edgarmg91 - 7 comments
Einsum to avoid transpose and reshape
#4 opened by azzever - 2 comments
Improve performance in deeper networks using your multi-head attention code
#29 opened by fatemehniknezhad - 1 comment
Module import error
#25 opened by Cybernetic1 - 1 comment
conda installation failed
#27 opened by bsun0802 - 0 comments
Understanding multiheaded attention.
#23 opened by anorak94 - 1 comment
Blog is down
#26 opened by JJGO - 8 comments
data.Field no longer supported in torchtext
#20 opened by jvdburgt - 5 comments
ModuleNotFoundError: No module named 'past'
#9 opened by jjong2ya - 1 comment
High compute time: what is a reasonable generator model to get somewhat good results to play with?
#22 opened by jplasser - 8 comments
Is narrow attention implemented correctly?
#13 opened by TheGrayFrost - 1 comment
Slide 50: v and q mixed up
#19 opened by ulf1 - 6 comments
token_embedding for non-text sequences
#18 opened by StolkArjen - 1 comment
Using the trained model
#14 opened by ShivanshuPurohit - 1 comment
How can we visualize the self-attention map?
#15 opened by amiltonwong - 4 comments
Accuracies on the examples
#12 opened by sf-wind - 1 comment
Issue with masking
#10 opened by mc-robinson - 2 comments
Comparing SelfAttention classes
#8 opened by sidneyaraujomelo - 5 comments
Masking done for the upper or lower triangle?
#6 opened by esvhd - 2 comments
Weight scaling should be for keys, not values?
#2 opened by esvhd