Issues
Parallel inference is expected to be faster than recurrent inference, but it turns out not to be in the play file
#39 opened by wac81 - 0
Question about inference
#38 opened by wac81 - 2
Can't train 3B model on a single 48GB card
#33 opened by wac81 - 0
Integration with transformers library
#37 opened by kiucho - 2
HuggingFace checkpoint
#36 opened by xtwigs - 2
Can you support streaming when generating?
#32 opened by wac81 - 1
How to load the model with device_map="auto"
#35 opened by wac81 - 1
Initialize word embedding layer
#31 opened by hyunwoongko - 1
Added description for torch.compile
#29 opened by ce-lery - 5
Changelog of official implementation
#10 opened by donglixp - 5
Info/Documentation on chunkwise training
#30 opened by pkpro - 1
Would it be possible to integrate an attention sink https://arxiv.org/pdf/2309.17453.pdf into RetNet?
#27 opened by pkpro - 1
Tokenizer Choice?
#26 opened by risedangel - 10
Encountered NaN while trying to train
#6 opened by liujuncn - 2
Add Hidden Size for DeepSpeed integration
#23 opened by infosechoudini - 5
A full multi-card parallel training scheme that pools GPU memory across cards is really missing; if it could be done, that would count as a success!
#5 opened by gg22mm - 2
Question about verifying the Inference Latency
#8 opened by LiZeng001 - 4
Comments on the model
#14 opened by okpatil4u - 1
Can't Resume Training from Checkpoint
#17 opened by infosechoudini - 2
How to load my own model
#12 opened by zhihui-shao - 3
Can you provide a LICENSE file?
#13 opened by Shubhankar-Aidetic - 1
Training using HF Transformers
#3 opened by nebulatgs - 2
Errors when running your examples
#4 opened by houghtonweihu - 1
Somewhere that needs to be modified
#1 opened by liujuncn