microsoft/torchscale

Foundation Architecture for (M)LLMs

Python · MIT

Issues

torchscale 0.3.0 requires fairscale==0.4.0, but you have fairscale 0.4.13 which is incompatible.
#110 opened a month ago by pandayummy
0
Minecraft
#109 opened 3 months ago by Pelaez99
0
Question about LongNet attention map overlap
#108 opened 4 months ago by RmZeta2718
0
Different batch sizes lead to different evaluation results for LongVIT
#107 opened 5 months ago by HHHedo
0
How to test the model
#106 opened 5 months ago by ReloJeffrey
0
pip error
#105 opened 5 months ago by wanghaoran-ucas
0
can't use longvit
#103 opened 5 months ago by abebe9849
0
Where is the offset implemented in Multi-head dilated attention?
#104 opened 5 months ago by AshStuff
0
Question about learnable segment lengths and dilation rates
#102 opened 6 months ago by benrousePUC
0
How to use retention in RetNet for cross-attention?
#101 opened 6 months ago by yxchng
0
Checkpoint for RetNet
#99 opened 6 months ago by macsz
0
What WSI level was used for pretraining LongVit?
#98 opened 7 months ago by jpfeil
1
about attention mask
#97 opened 8 months ago by hichoe95
0
about the longnet's ppl
#95 opened 9 months ago by robotzheng
2
Question regarding the configuration of decoder_retention_heads
#84 opened 10 months ago by Kratos-Wen
2
Training RetNet on A100 GPUs
#83 opened 10 months ago by Antoine-Bergerault
1
Wrong Rnm Normalization.
#86 opened 10 months ago by pdradx
1
Introducing padding_mask to RetNet
#85 opened 10 months ago by xtwigs
2
Question about the normalization in attention
#81 opened 10 months ago by Cranial-XIX
2
Question about RetNetRelPos
#80 opened 10 months ago by hyunwoongko
2
initialization of qkv
#68 opened a year ago by XintianHan
3
typo in normalization denominator in parallel retention?
#78 opened 10 months ago by XintianHan
1
[Minor issue] Discrepancy inside arxiv paper
#82 opened 10 months ago by radarFudan
0
about gamma/decay in RetNet
#79 opened 10 months ago by rouniuyizu
2
Chunk recurrent representation incorrect results
#77 opened 10 months ago by N0r9st
7
embed_tokens
#59 opened a year ago by CodeMiningCZW
4
Compatibility with torchsummary
#71 opened a year ago by lzqlzzq
1
About training memory
#75 opened a year ago by HoraceXIaoyiBao
2
RuntimeError: The size of tensor a (5) must match the size of tensor b (2) at non-singleton dimension 0
#72 opened a year ago by codinglover0111
3
Query about Retentive Network's Recurrent Representation
#76 opened a year ago by gopi-erabati
1
retnet training config
#64 opened a year ago by hanlinxuy
6
There's a confusion in torchscale
#65 opened a year ago by lovekang3344
3
pip package does not contain RetNet
#67 opened a year ago by fabienGenhealth
2
AttributeError: 'EncoderDecoderConfig' object has no attribute 'normalize_output'
#73 opened a year ago by Yuki2L0ve
3
BEiT3 Vision-Language Expert question
#74 opened a year ago by andreapdr
4
RetNet : Check consistency of each forward mode
#54 opened a year ago by mmorinag127
9
RetNet: relative position
#49 opened a year ago by fkodom
5
Question on decay factor for attention with xPos
#66 opened a year ago by mvbakulin
1
Can Torchscale be applied in point cloud tasks?
#61 opened a year ago by huiyang0613
2
Could you please explain the reason behind defining TEMPERATURE_FOR_L_UAX in the code without actually using it?
#63 opened a year ago by Ruiyuan-Zhang
1
`get_moe_group`'s return is None when building `class MOELayer(Base)` using one gpu
#60 opened a year ago by Ruiyuan-Zhang
4
Question about the recurrent forward of MultiScaleRetention
#62 opened a year ago by LEECHOONGHO
2
Training & Inference examples for RetNet
#52 opened a year ago by jhl-Det
1
Retnet training is slow
#55 opened a year ago by Zth9730
2
Question about is_first_step and Retnet
#58 opened a year ago by tdomhan
2
Retnet parameter dimension
#57 opened a year ago by allanj
2
"sentencepiece.bpe.model" and "dict.txt" in page below seem not available
#56 opened a year ago by HuXinjing
2
Multi-Scale Retention: Why include position embeddings explicitly?
#48 opened a year ago by fkodom
3
Is there some example of the paper? e.g., comparison of the inference latency
#53 opened a year ago by LiZeng001
1
scale.sqrt() in the recurrent_forward function of the multiscale_retention module
#47 opened a year ago by wangmengzhi
6