state-spaces/s4

Unable to save S4 decoder with mode=nplr

gregogiudici opened this issue · 3 comments

Hi, I'm implementing a decoder for audio generation (DDSP-style) using standalone S4 (V3).
I'd like to save checkpoints during the training and eventually the final model.
When training the model with S4D configuration (mode=diag) everything works well.

However, when training the model with the standard S4 configuration (mode=nplr), I get the following error:
RuntimeError: Cannot save multiple tensors or storages that view the same data as different types.

Using the CUDA extension for Cauchy and/or pykeops doesn't make a difference.

I'm searching for a solution. Thanks in advance.
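For context on what this error means in general: torch.save deduplicates storages, so it refuses to serialize two tensors that view the same underlying memory with different dtypes. A minimal sketch of one common trigger (a complex tensor alongside a real view of it; this pairing is an assumption, not something confirmed to be the cause inside S4):

```python
import torch

# Two tensors sharing one storage under different dtypes: a complex
# tensor and a float32 view of the same memory.
c = torch.randn(4, dtype=torch.cfloat)
r = torch.view_as_real(c)  # same storage, dtype float32

# Saving both in one checkpoint can trigger the same RuntimeError,
# since torch.save cannot represent one storage under two dtypes.
try:
    torch.save({"complex": c, "real_view": r}, "repro_ckpt.pt")
    print("saved without error")
except RuntimeError as e:
    print(f"RuntimeError: {e}")
```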

I'm on Ubuntu 18.04.4 LTS and this is my environment:

python = 3.9.16
torch = 2.0.1
torchaudio = 0.13.1
pytorch-cuda = 11.6
pytorch-lightning = 1.9.3
lightning = 2.0.2
hydra-core = 1.3.2

And this is the train.log I obtained:
[screenshot of train.log]

Can you be more specific about how you're saving the checkpoints? Do you have a custom train loop, or are you running our training script? If the latter, can you provide more details about the config you're using?

One thing that stands out is that your torch and torchaudio versions don't seem compatible: torchaudio=0.13.1 should be used with PyTorch 1.13, not 2.0: https://pytorch.org/get-started/previous-versions/

I'm not running your training script. I'm new to PyTorch Lightning, so I'm using this template to learn it (modified for my model and the generative task).

I use the default Lightning callback ModelCheckpoint to save checkpoints during evaluation, with the following config:

 model_checkpoint:
  _target_: lightning.pytorch.callbacks.ModelCheckpoint
  dirpath: ${paths.output_dir}/checkpoints
  filename: "epoch_{epoch:03d}"
  monitor: "val/loss"
  save_last: True
  save_top_k: 1 
  mode: "min" 
  auto_insert_metric_name: False 
  save_on_train_epoch_end: False 

I've also tried different environment configurations, such as the following:

python = 3.9.16
torch = 1.13.1
torchaudio = 0.13.1
pytorch-cuda = 11.6
pytorch-lightning = 1.5.10
lightning = 2.0.2
hydra-core = 1.3.2

obtaining the same RuntimeError every time.

Unfortunately I haven't seen this problem in a while, and it's hard for me to debug without more details. I do think I've seen related things before; IIRC there may be something going on in the DPLR kernel, because constructing it involves several linear algebra conversions that can cause issues in edge cases (e.g. more advanced usages that convert the model to different forms and do something different at inference time). In vanilla training settings it should be fine, though.
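If aliased views in the state dict are indeed the culprit, one possible workaround (a hypothetical sketch, not part of this repo) is to clone every tensor before saving, so that no two saved tensors share storage:

```python
import torch
import torch.nn as nn

def save_state_dict_cloned(model: nn.Module, path: str) -> None:
    """Save a checkpoint in which every tensor owns its own storage.

    Cloning breaks any aliasing between saved tensors, which sidesteps
    "Cannot save multiple tensors or storages that view the same data
    as different types". Hypothetical helper, not part of the repo.
    """
    state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    torch.save(state, path)
```

The resulting file loads with plain torch.load / model.load_state_dict as usual. With Lightning, the same idea can be applied in a LightningModule.on_save_checkpoint hook by cloning the tensors in checkpoint["state_dict"].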

This is the best advice I can give for now:

  • Double-check that my train loop works; e.g. python -m train wandb=null should save checkpoints every epoch
  • See if there are any discrepancies between your ModelCheckpoint and the way it's done in this repo
  • More broadly, I would just recommend using S4D. The default version (mode=diag init=diag-legs) should be very close to the full S4-DPLR model in general, especially if you're using adequate learning rate warmup. The S4D-Lin variant (mode=diag init=diag-lin) also usually works well.