Problem executing `git submodule update --init --recursive` and errors during the maze task
timenotwait opened this issue · 9 comments
Thank you for your excellent work on this project.
I encountered an issue while trying to execute the command `git submodule update --init --recursive`. As a workaround, I manually downloaded the corresponding version from the picoclvr link provided in the repository.
However, despite following these steps, I still encounter errors when executing the maze task. Here is the error message I receive:
How can I fix it? I would appreciate any guidance or suggestions on how to resolve this issue.
Hey! Thanks for your interest in our work.
I've just checked again with a fresh install and it should work.
For `git submodule update --init --recursive`: did you try running it at the root of the repository (in the same folder as the `.gitmodules` file)?
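A quick sketch to check, from that folder, that the submodules are registered (the entry names here are assumed to match the paths shown further down):

```sh
# run at the repository root: lists the submodule entries from .gitmodules
git config --file .gitmodules --list
# expect entries like submodule.non-nlp/picoclvr.path=non-nlp/picoclvr
```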
For the command, I know that François Fleuret updated his picoclvr repository with new stuff, which might have broken a few things. Maybe try checking out commit cf94b49 from inside the picoclvr repository (it is the default commit pinned by the submodule):

```sh
git checkout cf94b49
```

to be run inside the picoclvr repository.
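Putting it together, the full sequence from the repository root would look something like this (a sketch; it assumes the submodule path is `non-nlp/picoclvr`, as in the status output further down):

```sh
# from the root of the repository (next to .gitmodules)
git submodule update --init --recursive
# pin picoclvr to the expected commit
cd non-nlp/picoclvr
git checkout cf94b49
cd ../..
```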
Then, what command are you running? A command that works on my side is:

```sh
python main.py --task maze --training_strategy="shuffle" --nb_train_samples 100 --nb_test_samples 100
```

(I decreased the number of samples for faster debugging.)
I executed the command you mentioned, but at the end I still get the same error...
It could still be a problem with the submodules or with your conda version.
Did you create a new env?
What are the results of the following commands?

```sh
git submodule status
```

should print:

```
cf94b49d085ec05e1053b49b7e796afa3f593a28 non-nlp/picoclvr (heads/master)
9755682b981a45507f6eb9b11eadef8cb83cebd5 text/nanoGPT (heads/master)
```
Alternatively,

```sh
cd non-nlp/picoclvr
git log -1
```

should show the same commit.
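If the status shows a different commit, this sketch (standard git commands, nothing project-specific) forces the submodules back to the recorded state:

```sh
# re-read the URLs from .gitmodules, then force-checkout the recorded commits
git submodule sync --recursive
git submodule update --init --recursive --force
```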
Or, if it's an environment problem, what is the output of `conda list` or `pip freeze`?
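To make that easier to compare, you could dump the environment to a file (the file names here are just examples):

```sh
# export the active conda environment for comparison with the repo's yml
conda env export > my_env.yml
# or, for a pip-based setup
pip freeze > my_requirements.txt
```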
I ask because it seems to be an error related to the batch-size dimension, which is weird, so I would be really interested in being able to replicate it.
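For context, and purely as an illustration of the failure mode (not our actual code): batch-dimension errors that only appear after a while are often caused by a final, smaller batch that some code assumes has a fixed size. A minimal Python sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 samples with batch_size=32 leaves a final batch of only 4 samples
data = TensorDataset(torch.randn(100, 8))
loader = DataLoader(data, batch_size=32)  # drop_last defaults to False

fixed = torch.ones(32, 8)  # wrongly assumes every batch has 32 rows
for (x,) in loader:
    y = x * fixed  # RuntimeError on the last batch: 4 vs 32 in dim 0

# Passing drop_last=True to the DataLoader hides the symptom; the real
# fix is to make the code shape-agnostic (use x.shape[0] instead of 32).
```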
Thank you for your enthusiastic response.
The result of the command looks fine:
My CUDA driver version is 12.2.
By the way, the bug did not occur at the beginning, but only after running for about 10 minutes.
I haven't made any changes to the code. The conda list is a bit long, so here is part of it (I only used the yml file you provided to create the environment):
Thanks a lot for your extensive answer. It turns out that you were right: there was a bug late in the pipeline that I introduced during refactoring. Sorry for that. I've corrected it and am running the whole pipeline again tonight and over the coming day(s) to be sure that everything is in order. I'll let you know once it's done.
It seems to work correctly with the fix. Let me know if it now works on your side. Thanks a lot for spotting that mistake!
The program ran smoothly this time 😄. Thanks again for your helpful response!
Thanks for noticing the bug and following through!