vc1492a/tidd

CUDA out of memory for Tesla P4

hamlinliu17 opened this issue · 4 comments

@vc1492a Yesterday I tried rerunning the code with the new commits you added on Wednesday and yesterday, and I am running into an error where CUDA runs out of memory. I was wondering what changed, since I was able to run it on the P4 earlier in the week. In the meantime, I will try to request a larger GPU to accommodate the greater memory usage.

Hey @hamlinliu17! I did end up changing the model I am training / using to one that is slightly larger than 8GB in memory (I think). That may be the issue. If you want, specify a smaller model in the model architectures and see how things go (e.g., a model with fewer layers or fewer nodes per layer).

Also try commenting out the portion of the code that loads the learner again after model training. Reloading doesn't seem to flush the old learner from memory; it just loads a new one alongside it, which doubles the memory footprint and causes an out-of-memory error. I ran into this issue myself earlier this week.
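For reference, here is a minimal sketch of releasing the old learner before loading it again, as an alternative to commenting the reload out. `DummyLearner` is a hypothetical stand-in for the real learner object, and the `torch.cuda.empty_cache()` call assumes the project runs on PyTorch; neither name is taken from this repository's code.

```python
import gc
import weakref


def free_gpu_cache():
    """Return cached, unused CUDA blocks to the driver if PyTorch with CUDA is present."""
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; nothing to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


class DummyLearner:
    """Hypothetical stand-in for the learner created during training."""

    def __init__(self, name):
        self.name = name


learner = DummyLearner("after-training")
old_ref = weakref.ref(learner)  # lets us check that the old object is really gone

# Drop the only reference first; cached GPU memory cannot be released
# while a live Python object still holds the tensors.
del learner
gc.collect()
free_gpu_cache()

# Only now load the learner again, so the two copies never coexist.
learner = DummyLearner("reloaded")
```

The order matters: `empty_cache()` only frees memory that no tensor is using, so the old learner has to be dereferenced and collected before the call has any effect.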

Let me know if it ends up being one of the two things above - I'd try the latter first. If it isn't, I can look into it further.

@vc1492a It seems that the trouble is in running the training code itself. I tried some things out and went with the architecture[4] preset instead of the architecture[5] one, since it has fewer layers. This one does seem to fit on my GPU. I will let you know the results.

Got it, thanks for the clarification - I'd say that's likely down to model 5 not fitting in memory on the GPU. This isn't so much a bug in the code as a mismatch between the desired model architecture and what can fit into the GPU's memory.
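As a rough way to compare presets before launching a run, here is a back-of-the-envelope sketch. It assumes fully connected layers, 32-bit floats, and an Adam-style optimizer (weights, gradients, and two moment buffers, so four copies per parameter). The layer widths below are hypothetical and are not the actual architecture[4] or architecture[5] definitions; activation memory, which grows with batch size, is not counted.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def training_bytes(n_params, bytes_per_value=4, copies=4):
    """Parameter-related memory during training.

    copies=4 assumes weights, gradients, and two optimizer moment buffers.
    """
    return n_params * bytes_per_value * copies


# Hypothetical layer widths, for illustration only.
smaller_preset = [1024, 512, 256, 2]
larger_preset = [1024, 2048, 2048, 1024, 2]

for name, sizes in [("smaller", smaller_preset), ("larger", larger_preset)]:
    params = dense_param_count(sizes)
    mib = training_bytes(params) / 2**20
    print(f"{name}: {params} parameters, about {mib:.1f} MiB before activations")
```

If the estimate for a preset is already close to the card's capacity, it is unlikely to train once activations are added on top.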

I suspect with more training data, we can reduce the complexity / size of the model but we will have to wait until more training data is available to make this assessment more formally.

Feel free to close this issue once you are able to run the training code successfully with one of the model architectures provided!

Side note: I added some notes to the project board after thinking through a few items earlier today but haven't converted them to issues.

I can run it now, but the results don't seem that promising. I will play around with the numbers to see if anything changes and then close this issue. In the meantime, here are snapshots of the predicted day vs. the day of the seismic event and of the absolute error graph.
[Image: test_result]

[Image: absolute_error]