Issues
Checkpoint saves failing for eager mode training
#168 opened by chauhang - 0
add compiled RMSNorm into the norm config
#374 opened by tianyu-l - 0
Add torchdata to requirements after release
#351 opened by gokulavasan - 6
reload existing llama checkpoints
#305 opened by tianyu-l - 9
Make dataloader stateful?
#291 opened by XinDongol - 7
RoPE implementation differences
#335 opened by rlrs - 1
Code change that changes the model semantics
#347 opened by kwen2501 - 5
Loss curve spikes on amalgamated datasets - need full-scale shuffler in dataloader
#128 opened by lessw2020 - 1
Converting to checkpoint.pd is not working
#307 opened by viai957 - 7
Question on Model Init
#312 opened by XinDongol - 6
profiler issue when training with 64 or more GPUs
#266 opened by tianyu-l - 1
Make fused RMSNorm a registered op
#199 opened by lessw2020 - 0
numerical difference for SDPA between non-DTensor and DTensor, when math attention and fp16 are used
#317 opened by tianyu-l - 0
add doc for adding custom dataset
#311 opened by lessw2020 - 1
Verify that we can do eval / inference
#192 opened by gnadathur - 1
[Feature] Add fineweb dataset
#309 opened by viai957 - 2
Custom dataset for llama 3 finetuning
#310 opened by rshah918 - 0
freezing some parts of the model
#306 opened by tianyu-l - 0
metrics - add L1 gradient norm tracking
#119 opened by lessw2020 - 3
Grad scaler not in train state
#146 opened by BadrYoubiIdrissi - 4
Starting off with different models across ranks, FSDP doesn't synchronise them
#166 opened by BadrYoubiIdrissi - 0
Add HSDP + TP/SP support
#176 opened by gnadathur - 0
FSDP2 based HSDP support
#177 opened by gnadathur - 1
Add support for MoE model architecture
#184 opened by gnadathur - 0
numerical issue when running SDPA with DTensor
#267 opened by tianyu-l - 1
[Feature] Plan to add `model_register`
#282 opened by XinDongol - 5
[Feature] Add gradient accumulation
#292 opened by XinDongol - 4
Wrong mesh order
#286 opened by ad8e - 17
Question: is TP able to run a model that cannot fit a single batch on one GPU?
#276 opened by lucasjinreal - 4
Question: parallelising convolutional layers?
#277 opened by jvwilliams23 - 3
update metric title to 'tokens per second' (TPS) rather than 'words per second' (WPS)
#263 opened by lessw2020 - 2
simplify meta_init (rope embeddings)
#110 opened by lessw2020 - 4
E2E training numbers for 13B/70B
#118 opened by wanchaol - 0
TorchTrain: Release blocking Issues master tracker
#186 opened by gnadathur - 2
FSDP2 incurs higher CPU memory usage in 2D compared to FSDP1
#208 opened by awgu - 3
Validate DCP load and save for 1D and 2D w/ FSDP2
#108 opened by gnadathur - 2
Implement fast Layer norm to get decent MFU
#196 opened by gnadathur - 1
Validate FSDP2 + SP parity with FSDP1 + SP
#107 opened by gnadathur - 0
Integration test for torchtrain
#109 opened by gnadathur
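One of the feature requests above, #292 "[Feature] Add gradient accumulation", names a well-known technique: average gradients over several micro-batches before applying a single optimizer step, emulating a larger batch at the same peak memory. A minimal pure-Python sketch of the idea follows; the toy scalar model, `grad`, and `train` are illustrative stand-ins, not torchtitan's trainer or API.

```python
def grad(w, x, y):
    # Gradient of the squared-error loss 0.5 * (w*x - y)^2 w.r.t. w.
    return (w * x - y) * x

def train(data, w=0.0, lr=0.1, accum_steps=4):
    """Toy training loop with gradient accumulation.

    Accumulates gradients over `accum_steps` micro-batches, then takes
    one update, so the effective batch is `accum_steps` times larger.
    """
    acc, updates = 0.0, 0
    for step, (x, y) in enumerate(data):
        # Scale each micro-batch gradient so the sum is the large-batch average.
        acc += grad(w, x, y) / accum_steps
        if (step + 1) % accum_steps == 0:
            w -= lr * acc          # one optimizer step per accum_steps micro-batches
            acc, updates = 0.0, updates + 1
    return w, updates
```

With 8 identical micro-batches and `accum_steps=4`, the loop performs exactly 2 parameter updates, matching what 2 steps on the corresponding large batches would do.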
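Issue #291 "Make dataloader stateful?" asks for a loader whose position can be checkpointed and restored, so training resumes mid-epoch rather than replaying or skipping data. A hypothetical sketch of that contract is below; the class name and `state_dict`/`load_state_dict` methods mirror the common PyTorch checkpointing convention but are illustrative, not the project's actual implementation.

```python
class StatefulLoader:
    """Toy data loader that can save and restore its iteration position."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.index = 0  # next item to yield; this is the checkpointed state

    def __iter__(self):
        # Resume from self.index, advancing it as items are consumed.
        while self.index < len(self.dataset):
            item = self.dataset[self.index]
            self.index += 1
            yield item

    def state_dict(self):
        # Minimal state: just the cursor into the dataset.
        return {"index": self.index}

    def load_state_dict(self, state):
        self.index = state["index"]
```

A fresh loader that loads a saved `state_dict` continues from the exact item where the original stopped, which is the behavior the issue requests for checkpoint/resume.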