Problem executing `git submodule update --init --recursive` and errors during the maze task
timenotwait opened this issue · 9 comments
Thank you for your excellent work on this project.
I encountered an issue while trying to execute the command `git submodule update --init --recursive`. As a workaround, I manually downloaded the corresponding version from the picoclvr link provided in the repository.
However, despite following these steps, I still encounter errors when executing the maze task. Here is the error message I receive:
How can I fix it? I would appreciate any guidance or suggestions on how to resolve this issue.
Hey! Thanks for your interest in our work.
I've just checked again with a fresh install and it should work.
For `git submodule update --init --recursive`: did you try running it at the root of the repository (in the same folder as the `.gitmodules` file)?
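A quick sketch to check, from that folder, that the submodules are registered (the entry names here are assumed to match the paths shown further down):

```sh
# run at the repository root: lists the submodule entries from .gitmodules
git config --file .gitmodules --list
# expect entries like submodule.non-nlp/picoclvr.path=non-nlp/picoclvr
```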
For the command, I know that François Fleuret updated his picoclvr repository with new stuff, which might have broken a few things. Maybe try checking out commit cf94b49 from inside the picoclvr repository (it is the default commit pinned by the submodule):

```sh
git checkout cf94b49
```

to be run inside the picoclvr repository.
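Putting it together, the full sequence from the repository root would look something like this (a sketch; it assumes the submodule path is `non-nlp/picoclvr`, as in the status output further down):

```sh
# from the root of the repository (next to .gitmodules)
git submodule update --init --recursive
# pin picoclvr to the expected commit
cd non-nlp/picoclvr
git checkout cf94b49
cd ../..
```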
Then, what command are you running? A command that works on my side is:

```sh
python main.py --task maze --training_strategy="shuffle" --nb_train_samples 100 --nb_test_samples 100
```

(I decreased the number of samples for faster debugging.)
I executed the command you mentioned, but at the end I still get the same error...
It could still be a problem with the submodules or with your conda version.
Did you create a new env?
What are the results of the following commands?

```sh
git submodule status
```

should print:

```
cf94b49d085ec05e1053b49b7e796afa3f593a28 non-nlp/picoclvr (heads/master)
9755682b981a45507f6eb9b11eadef8cb83cebd5 text/nanoGPT (heads/master)
```
Alternatively,

```sh
cd non-nlp/picoclvr
git log -1
```

should show the same commit.
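If the status shows a different commit, this sketch (standard git commands, nothing project-specific) forces the submodules back to the recorded state:

```sh
# re-read the URLs from .gitmodules, then force-checkout the recorded commits
git submodule sync --recursive
git submodule update --init --recursive --force
```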
Or, if it's an environment problem, what is the output of `conda list` or `pip freeze`?
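To make that easier to compare, you could dump the environment to a file (the file names here are just examples):

```sh
# export the active conda environment for comparison with the repo's yml
conda env export > my_env.yml
# or, for a pip-based setup
pip freeze > my_requirements.txt
```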
I ask because it seems to be an error related to the batch-size dimension, which is weird, so I would be really interested in being able to replicate it.
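For context, and purely as an illustration of the failure mode (not our actual code): batch-dimension errors that only appear after a while are often caused by a final, smaller batch that some code assumes has a fixed size. A minimal Python sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 samples with batch_size=32 leaves a final batch of only 4 samples
data = TensorDataset(torch.randn(100, 8))
loader = DataLoader(data, batch_size=32)  # drop_last defaults to False

fixed = torch.ones(32, 8)  # wrongly assumes every batch has 32 rows
for (x,) in loader:
    y = x * fixed  # RuntimeError on the last batch: 4 vs 32 in dim 0

# Passing drop_last=True to the DataLoader hides the symptom; the real
# fix is to make the code shape-agnostic (use x.shape[0] instead of 32).
```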
Thank you for your enthusiastic response.
The result of the command looks fine:
My CUDA driver version is 12.2.
By the way, the bug did not occur at the beginning, but only after running for about 10 minutes.
I haven't made any changes to the code. The conda list is a bit long, so here is part of it (I only used the yml file you provided to create the environment):
Thanks a lot for your extensive answer. It turns out that you were right: there was a bug late in the pipeline that I introduced during refactoring. Sorry for that. I've corrected it and am running the whole pipeline again tonight and over the coming day(s) to be sure that everything is in order. I'll let you know once it's done.
It seems to work correctly with the fix. Let me know if it now works on your side. Thanks a lot for spotting that mistake!
The program ran smoothly this time 😄. Thanks again for your helpful response!
Thanks for noticing the bug and following through!