lucidrains/block-recurrent-transformer-pytorch

Question

Closed this issue · 10 comments

YHL04 commented

Is it supposed to detach?

Inside block_recurrent_transformer_pytorch.py, line 815:

if exists(layer_next_states):
    next_states.append(layer_next_states.detach())

How would the gradients flow through the states?

[Screenshot attached: Screen Shot 2023-03-31 at 3.46.29 PM]
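For concreteness, here is a minimal sketch of what the detach does to the gradient graph. This is not the repo's actual forward pass and the names are illustrative: the carried state still feeds the current segment's computation, but no gradient flows back into the segments that produced it.

import torch

dim = 4
proj = torch.nn.Linear(dim, dim)

state = torch.zeros(1, dim)              # recurrent state carried across segments
for segment in torch.randn(3, 1, dim):
    out = proj(segment + state)          # the state participates in this segment
    out.sum().backward()                 # gradients reach `proj` only through the
                                         # current segment, because ...
    state = out.detach()                 # ... the carried state is cut from the graph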

lucidrains commented

yea, i had the same question, but i don't think they are propagating the gradients through the cache. would be interesting to port over some of the ideas from the memformer paper and try a differentiable cache though
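For contrast, a differentiable cache in the memformer spirit would simply skip the detach, so a loss computed at a later segment backpropagates through the states of all earlier segments. This is an illustrative sketch only, not the Memformer architecture:

import torch

dim = 4
cell = torch.nn.GRUCell(dim, dim)

state = torch.zeros(1, dim)
for segment in torch.randn(3, 1, dim):
    state = cell(segment, state)         # the state is never detached

state.sum().backward()                   # gradients flow through every segment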

lucidrains commented

@YHL04 if they were doing TBPTT, what is their cutoff number of steps? i don't see that in the paper anywhere
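For reference, if they were doing TBPTT, a cutoff would look something like the hypothetical sketch below: the state stays differentiable for k segments, then the graph is truncated. Neither the repo nor the paper specifies this; k is an assumed parameter.

import torch

dim, k = 4, 2                            # k = hypothetical TBPTT cutoff
cell = torch.nn.GRUCell(dim, dim)

state = torch.zeros(1, dim)
for step, segment in enumerate(torch.randn(6, 1, dim)):
    state = cell(segment, state)         # the state accumulates graph history
    if (step + 1) % k == 0:
        state.sum().backward()           # backprop through the last k segments ...
        state = state.detach()           # ... then truncate the graph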

lucidrains commented

@YHL04 the way i interpreted the N and W is as the max_seq_len and block_width in the code here

@YHL04 maybe it will be faster to just email the first author 😄
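Under that reading (N as max_seq_len, W as block_width), a segment of N tokens is consumed W tokens at a time, with the state carried from block to block. A rough sketch with illustrative names, not the repo's code:

import torch

max_seq_len, block_width, dim = 1024, 512, 8      # N, W, model width
tokens = torch.randn(1, max_seq_len, dim)

state = torch.zeros(1, dim)
for block in tokens.split(block_width, dim = 1):  # N // W blocks per segment
    state = state + block.mean(dim = 1)           # stand-in for the real state update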

YHL04 commented

@lucidrains Pls let me know what they say

lucidrains commented

@YHL04 hey! i sent the author an email, and verified that the cache is not differentiable, so the detach is supposed to be there.

however, during the exchange, i realized the source of the confusion is that i was updating the state only once, attending to the entire input segment, rather than one block at a time. can you let me know if the new code changes look more reasonable?
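To make the change concrete, a rough contrast of the two behaviors being discussed, illustrative only and not the actual attention code:

import torch

dim, block_width = 8, 4
segment = torch.randn(1, 12, dim)                 # one full input segment
update = torch.nn.Linear(dim, dim)                # stand-in for the state update

# previous behavior: a single state update attending over the whole segment
state = update(segment.mean(dim = 1))

# revised behavior: walk the segment one block at a time, updating the state each block
state = torch.zeros(1, dim)
for block in segment.split(block_width, dim = 1):
    state = update(block.mean(dim = 1) + state)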

@YHL04 just confirmed with the author that the new changes are correct

thank you again for pressing on this!

YHL04 commented

@lucidrains Sounds good, I will look into it and make some changes myself.