lucidrains/x-transformers

Random lack of gradients

Baran-phys opened this issue · 1 comment

@lucidrains While training a model, I monitored my gradients, and at random steps some parameters get no gradients:
Can dropout cause this?

```
 0%| | 44/41765 [00:06<1:29:57, 7.73it/s]
 0%| | 45/41765 [00:06<1:31:31, 7.60it/s]
 0%| | 46/41765 [00:06<1:32:11, 7.54it/s]
All modules and their parameters have gradients.
All modules and their parameters have gradients.
All modules and their parameters have gradients.
All modules and their parameters have gradients.
Modules with no gradients:
Module: trans_1.attn_layers.layers.0.1.to_out
  Parameter: weight
Module: trans_1.attn_layers.rel_pos.mlp.0.0
  Parameter: weight
  Parameter: bias
Module: trans_1.attn_layers.layers.0.1.to_q
  Parameter: weight
Module: trans_1.attn_layers.layers.0.1.to_k
  Parameter: weight
Module: trans_1.attn_layers.layers.0.1.to_v_gate
  Parameter: weight
  Parameter: bias
Module: trans_1.attn_layers.rel_pos.mlp.2
  Parameter: weight
  Parameter: bias
Module: trans_1.attn_layers.rel_pos.mlp.1.0
  Parameter: weight
  Parameter: bias
Module: trans_1.attn_layers.layers.0.1.to_v
  Parameter: weight
Module: trans_1.attn_layers.layers.0.0.0
  Parameter: g
All modules and their parameters have gradients.
Modules with no gradients:
Module: trans_1.attn_layers.layers.0.1.to_out
  Parameter: weight
Module: trans_1.attn_layers.rel_pos.mlp.0.0
  Parameter: weight
  Parameter: bias
```

For example, this is the x-transformer part of my code:
```python
self.trans_1 = ContinuousTransformerWrapper(
    dim_in = 64,
    dim_out = 64,
    max_seq_len = 1500,
    emb_dropout = 0.1,
    use_abs_pos_emb = False,
    num_memory_tokens = 1,
    attn_layers = Encoder(
        dim = 256,
        depth = 1,
        heads = 4,
        rel_pos_bias = True,
        attn_gate_values = True,
        use_rmsnorm = True,
        layer_dropout = 0.1,
        attn_dropout = 0.1,
        ff_glu = True,
        ff_dropout = 0.1,
    )
)
```

I monitored the loss at different stages; there are no NaNs or infs in it. This is the function that checks the gradients:

```python
import torch.nn as nn

def find_no_grad_modules(m: nn.Module) -> None:
    # collect, per module, the parameters whose .grad is still None
    no_grad_params = {n: [] for n, _ in m.named_modules()}
    no_grad_modules = set()

    for module_name, module in m.named_modules():
        for param_name, param in module.named_parameters(recurse=False):
            full_name = f"{module_name}.{param_name}" if module_name else param_name
            if param.grad is None:
                no_grad_params[module_name].append(param_name)
                no_grad_modules.add(module_name)

    if no_grad_modules:
        print("Modules with no gradients:")
        for module_name in no_grad_modules:
            print(f"Module: {module_name}")
            for param_name in no_grad_params[module_name]:
                print(f"  Parameter: {param_name}")
    else:
        print("All modules and their parameters have gradients.")
```

Is this behaviour healthy? If not, what do you think is causing it?

Yup, that is caused by the layer dropout, from the stochastic depth technique.
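For intuition, here is a minimal, self-contained sketch of the stochastic depth idea (an illustration only, not the x-transformers implementation): when a residual block is randomly skipped for a training step, its weights never enter the computation graph, so their `.grad` stays `None` for that step.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Toy residual block with layer dropout (stochastic depth)."""
    def __init__(self, dim, layer_dropout = 0.1):
        super().__init__()
        self.fn = nn.Linear(dim, dim)
        self.layer_dropout = layer_dropout

    def forward(self, x):
        # during training, skip the entire block with probability `layer_dropout`
        if self.training and torch.rand(()).item() < self.layer_dropout:
            return x                     # block bypassed -> its weights get no gradient this step
        return x + self.fn(x)            # normal residual path

block = StochasticDepthBlock(8, layer_dropout = 1.0)   # force a skip, just to illustrate
x = torch.randn(2, 8, requires_grad = True)
block(x).sum().backward()
print(block.fn.weight.grad)              # None, because the block never ran
```

On any step where a layer is skipped, the checker above will report that layer's parameters as having no gradients, which matches the modules listed in the log. Setting `layer_dropout = 0` in the `Encoder` config should remove this behaviour if every parameter needs a gradient on every step.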