tom-doerr/TecoGAN-Docker

Hello. I'm getting noise results.


Hello. Thank you for all you've done.

I'm trying to run this on a brand new 3090 with a brand new nvidia setup. The CUDA steps so far appear to run with expected outputs.

TecoGAN under your docker file sure seems to think it's working.

However, here's frame 22 of calendar.

[image: frame 22 of the calendar output]

Any thoughts what I might be doing wrong, configuration I might need, etc?

Congratulations on getting a 3090. :)
I have never seen anything like this.
Could it be that you are running relatively old drivers with your 3090 that might cause issues when running inference?
Another thing that comes to mind is the new TensorFloat (TF32) mode of the tensor cores, which uses less precision and might be enabled by default on the 3090.

Another thing could be that you trained for a small number of iterations and in the process overwrote the default TecoGAN weights, but didn't train for long enough to deliver usable results.

Could it be that you are running relatively old drivers with your 3090 that might cause issues when running inference?

I have, honestly, no idea. How would I go about checking?


Another thing could be that you trained for a small number of iterations and in the process overwrote the default TecoGAN weights, but didn't train for long enough to deliver usable results.

I just copy pasta-ed the instructions in your readme

root@hex:/TecoGAN# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  460.32.03  Sun Dec 27 19:00:34 UTC 2020
GCC version:  gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

Not sure if this is what you're checking, but it appears there's a slightly newer .39 release, and the one I'm running is two weeks old.
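For anyone repeating this check, the version string can be pulled out of that file programmatically. The following is a minimal Python sketch, not part of this repository. The helper names `parse_driver_version`, `read_driver_version`, and `version_tuple` are made up for illustration, and the regular expression assumes the `Kernel Module  <version>` layout shown in the output above.

```python
import re


def parse_driver_version(text):
    """Extract the driver version from /proc/driver/nvidia/version contents.

    Assumes the 'Kernel Module  <version>' layout seen in this thread.
    Returns None when no version is found.
    """
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


def read_driver_version(path="/proc/driver/nvidia/version"):
    """Read and parse the version file; None if it is missing or unreadable."""
    try:
        with open(path) as handle:
            return parse_driver_version(handle.read())
    except OSError:
        return None


def version_tuple(version):
    """Turn '460.32.03' into (460, 32, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))
```

Calling `read_driver_version()` on the machine in question should report the same 460.32.03 seen above, and `version_tuple` makes it easy to compare that against a newer release number.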

Another thing that comes to mind is the new TensorFloat (TF32) mode of the tensor cores, which uses less precision and might be enabled by default on the 3090.

Is it possible to turn that off to test?

A different TecoGAN implementation, in PyTorch, is also giving me nonsense results. Therefore I suspect either the system or the system's configuration.

I have no idea how to go about debugging something like this

This discussion concerns an Ubuntu boot, but there's also Windows on the machine, and prepackaged CUDA stuff on Windows works fine.

Another thing that comes to mind is the new TensorFloat (TF32) mode of the tensor cores, which uses less precision and might be enabled by default on the 3090.

Is it possible to turn that off to test?

Don't know that.
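One thing that may be worth trying: NVIDIA documents an environment variable, `NVIDIA_TF32_OVERRIDE`, that asks its CUDA libraries not to use TF32 math when set to `0`. Whether the TensorFlow build in this container is affected is an assumption. Builds linked against CUDA libraries older than CUDA 11 do not use TF32 at all, in which case the variable changes nothing and TF32 can be ruled out as the cause. The sketch below is hypothetical (the helper names are invented here) and simply launches the inference command from this thread with the variable set.

```python
import os
import subprocess


def tf32_disabled_env(base_env=None):
    """Return a copy of the environment with NVIDIA_TF32_OVERRIDE set to 0.

    The original mapping is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Documented by NVIDIA: 0 asks its libraries not to use TF32 math.
    env["NVIDIA_TF32_OVERRIDE"] = "0"
    return env


def run_inference(command=("python3", "runGan.py", "1")):
    """Run the inference command with TF32 disabled and return its exit code.

    The default command is the one used in this thread; it has to be run
    from the TecoGAN directory inside the container.
    """
    return subprocess.call(list(command), env=tf32_disabled_env())
```

If the output is still noise after `run_inference()`, TF32 is probably not the culprit.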

Could it be that you are looking at the wrong output image?

I suppose that's a possibility, yes. Where am I meant to be looking?

Just checked, it should be results/calendar.

Yeah that's where that comes from ☹️

Does this mean the machine I bought won't be able to do the job?

No, that's not an acceptable outcome.
Could you try to run the following inside the container and then check the generated images?

pip uninstall -y tensorflow-gpu
pip install tensorflow==1.8.0
python3 runGan.py 1

sorry, i previously missed the email from this response

i just started a run. we'll find out soon.

thank you for the help, new friend

Are the images still noisy?

That seems to have fixed it 😄

Thank you.

Sorry for the slow response; I torched the docker image in other unrelated ways trying to fix it before you told me what to do, then last week was a lot of stuff at work; I've only just now had the chance to restart from scratch and follow your instructions.

No problem. Just so you know: those commands replaced tensorflow-gpu with the CPU-only tensorflow package, so the inference is now running on your CPU.
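To confirm which devices TensorFlow actually sees, something like the following sketch could be run inside the container. It relies on `device_lib.list_local_devices()`, which exists in TensorFlow 1.x; the helper names `gpu_device_names` and `list_tf_devices` are invented for this example, and the filter assumes device names of the form `/device:GPU:0`.

```python
def gpu_device_names(device_names):
    """Keep only names that look like GPU devices, e.g. '/device:GPU:0'."""
    return [name for name in device_names if ":GPU:" in name.upper()]


def list_tf_devices():
    """Return the names of all devices TensorFlow can see.

    TensorFlow is imported lazily so gpu_device_names works without it.
    """
    from tensorflow.python.client import device_lib

    return [device.name for device in device_lib.list_local_devices()]
```

With the CPU-only package installed, `gpu_device_names(list_tf_devices())` would be expected to return an empty list.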

oh. no wonder it's so slow :(

is there a way to get this working on the gpu proper?

Maybe using a newer tensorflow version could solve the issue, but upgrading tensorflow probably isn't that easy. Some users did some work upgrading tensorflow: thunil/TecoGAN#97

lol, brutal

no worries; i just ordered a bunch of ram. there's more than one way to peel an orange.

we're gonna try artillery. that orange is never gonna know what hit it