
Model can not converge on the LRA Pathfinder

violet-zct opened this issue · 19 comments


Thanks for the great work! When I ran your code on the LRA Pathfinder dataset (using your config), the model had still not converged by the end of the 200th epoch, as shown in the following log: loss=0.693, val/accuracy=0.499, val/loss=0.693, test/accuracy=0.495, test/loss=0.693, train/accuracy=0.501, train/loss=0.693. The loss stays at 0.693 throughout training.
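For reference, 0.693 is approximately ln 2, the cross-entropy of a binary classifier that assigns probability 0.5 to each class, which is consistent with the roughly 50% accuracies in the log. A minimal sketch of that arithmetic (the helper name `binary_cross_entropy` is illustrative and not part of the repository):

```python
import math


def binary_cross_entropy(p_true: float) -> float:
    """Cross-entropy loss when the model assigns probability p_true to the correct class."""
    return -math.log(p_true)


# A model that always outputs 0.5 for both classes is guessing at chance level.
chance_loss = binary_cross_entropy(0.5)
print(f"{chance_loss:.3f}")  # 0.693
```

A loss pinned at this value from the first epoch to the last suggests the model is not learning anything beyond chance, rather than learning slowly.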

Do you have any thoughts on this? Thanks!

The model changed a bit since the initial release. We will release a branch or tag marking the initial release where those configs should reproduce the original results (you should also be able to find it in the commit history). We have also been working on re-creating those results with the newest version of the model, which should involve only minor changes to the configs.

Thanks! Would you mind pointing me to the commit that I can use to reproduce the results?

Actually it looks like we already tagged it here:

Great, thanks so much!

Hi, I used this commit and ran on two datasets, pathfinder-32 and cifar, with two random seeds each, including your default seed 1112.
For pathfinder, the best validation accuracies are 77.59 and 78.14.
For cifar, the best validation accuracies are 79.72 and 79.8.

For both datasets, the validation accuracies are much worse than the test results reported in the paper. I used an A40 to run these experiments. Do you know why this happens, or is there a different commit that can be used for reproduction?

I'm not sure why this is happening. Many other people have been able to reproduce the experiments. What versions of pytorch and pytorch-lightning are you running? Which Cauchy kernel do you have installed? Can you paste the command lines you're using?

Here are my specifications:
pytorch 1.11.0
pytorch-lightning 1.6.1
Cauchy kernel: cauchy_conj_slow(v, z, w) in state-spaces/src/models/functional/, following the issue here #9 (comment).

The command line I used is exactly the same as in your README:

python -m train wandb=null experiment=s4-lra-cifar
python -m train wandb=null experiment=s4-lra-pathfinder


Did you have any issue installing either of the two faster Cauchy kernels? It is conceivable that they might have subtle numerical differences. We tested using the custom CUDA kernel.

I just checked out that commit and ran the CIFAR command and am getting to 80% val in 15 epochs and currently 82% at 30 epochs. So I think it's working properly.

It is also possible (although less likely) that pytorch-lightning changed something. If possible, I would suggest:

pip install pytorch-lightning==1.5.10
git checkout main
cd extensions/cauchy
python setup.py install
cd ../..
git checkout v1

and try running the command from there. The pykeops kernel installed with pip install pykeops==1.5 should also work.
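Before launching a run, it may help to confirm that the environment actually matches the versions pinned above. The following is a minimal sketch using only the standard library; the `EXPECTED` table and the helper names `installed_version` and `find_mismatches` are assumptions for illustration and are not part of the repository.

```python
from importlib import metadata

# Pins taken from the thread. pykeops only matters if you use that kernel
# instead of the custom CUDA extension; remove it from the table otherwise.
EXPECTED = {"pytorch-lightning": "1.5.10", "pykeops": "1.5"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (wanted, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment; an empty result means every pinned package matched exactly.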

Thanks so much for the response and instructions! I will try what you suggested.

The job I launched ended up getting to around 86% val accuracy. Let me know if you figure out the issue; if it ends up being a problem with cauchy_conj_slow or a package version I'll update the README.

Thanks! I was handling something else yesterday and will get back to you asap.

Hi Albert, sorry for the delay. I just created a new environment with pytorch 1.11.0 and pytorch_lightning 1.5.10 installed, and I also successfully compiled the custom CUDA Cauchy kernel. I ran the CIFAR experiments on both an A40 and an A100. However, I still could not reproduce the results, and I got something similar to my previous run:


I have no idea why this happens, since I didn't modify anything in your code.

Hi Albert, to confirm: my friend and I each independently failed to reproduce the results with v1. But I can reproduce your v2 results.

Thanks for reporting back! I'll leave this issue open for longer because some other people are still trying to reproduce V1.

Just to check more variables: Is your friend using the same computing resources (e.g. same cluster or machine types) as you?

I definitely checked these results on an A100 before the V1 release, and as I reported above a fresh version of the repo still gets to high 80's on CIFAR for me on a P100, so I am really confused as well.

We are using the same cluster but different machine types.


Could you downgrade to PyTorch 1.10 and try again when you have time? We just discovered a bug in PyTorch 1.11 (pytorch/pytorch#77081) with Dropout2d, which is causing a noticeable difference on small sCIFAR models and will probably cause a difference for large models too.
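One way to avoid silently training on the affected release is a small guard at the start of a run. This is a sketch only: it assumes the 1.11 series is the one to avoid, as reported in this thread, and the helper names `torch_release` and `is_torch_1_11` are hypothetical, not part of the repository or of PyTorch.

```python
import re


def torch_release(version):
    """Extract (major, minor) from a version string such as '1.11.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_torch_1_11(version):
    """True for any 1.11.x build, the series in which the Dropout2d bug was reported."""
    return torch_release(version) == (1, 11)


# Intended use in a training script (requires torch to be installed):
#
#   import warnings
#   import torch
#   if is_torch_1_11(torch.__version__):
#       warnings.warn("PyTorch 1.11 detected; see pytorch/pytorch#77081 (Dropout2d).")
```

The check deliberately looks only at the major and minor numbers, so local build suffixes such as `+cu113` do not affect the result.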

Thanks! Did you use PyTorch 1.10 for your version 1? I can downgrade and see later whether that reproduces the results.

Yeah, we were on torch 1.10 for a long time. The run I did above was also on 1.10.

Closing this issue as the original problems were confirmed to be a PyTorch bug and have since been resolved.