Hello. I'm getting noise results.

Question

Closed this issue 4 years ago · 19 comments

Hello. Thank you for all you've done.

I'm trying to run this on a brand new 3090 with a brand new nvidia setup. The CUDA steps so far appear to run with expected outputs.

TecoGAN under your docker file sure seems to think it's working.

However, here's frame 22 of calendar.

Any thoughts what I might be doing wrong, configuration I might need, etc?

Answer 1 · 2021-02-07T19:40:42.000Z

Congratulations on getting a 3090. :)
I have never seen anything like this.
Could it be that you are running relatively old drivers with your 3090 that might cause issues when running inference?
Another possibility is the new TensorFloat (TF32) mode on the tensor cores, which uses less precision and might be enabled by default on the 3090.

Answer 2 · 2021-02-07T19:47:36.000Z

Another thing could be that you trained for a small number of iterations and in the process overwrote the default TecoGAN weights, but didn't train for long enough to deliver usable results.

Answer 3 · 2021-02-07T22:03:52.000Z

Could it be that you are running relatively old drivers with your 3090 that might cause issues when running inference?

I have, honestly, no idea. How would I go about checking?

.

Another thing could be that you trained for a small number of iterations and in the process overwrote the default TecoGAN weights, but didn't train for long enough to deliver usable results.

I just copy pasta-ed the instructions in your readme

Answer 4 · 2021-02-07T22:08:32.000Z

root@hex:/TecoGAN# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  460.32.03  Sun Dec 27 19:00:34 UTC 2020
GCC version:  gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

Not sure if this is what you're checking, but, it appears that there is a slightly newer .39, but that I'm running one that's two weeks old
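For comparing the installed driver against a newer release, the version string in that file can be pulled out programmatically. This is a minimal sketch, not part of TecoGAN: the function names are hypothetical, and the regular expression assumes the `NVRM version` line has the same layout as the output pasted above.

```python
import re
from pathlib import Path

# Path taken from the command shown in this thread.
VERSION_FILE = Path("/proc/driver/nvidia/version")


def parse_driver_version(text):
    """Return the driver version as a tuple of ints, or None if not found."""
    # Assumes a line like: "... Kernel Module  460.32.03  Sun Dec 27 ..."
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def read_driver_version(path=VERSION_FILE):
    """Read and parse the version file; None if it cannot be read."""
    try:
        return parse_driver_version(path.read_text())
    except OSError:
        return None


sample = (
    "NVRM version: NVIDIA UNIX x86_64 Kernel Module  460.32.03  "
    "Sun Dec 27 19:00:34 UTC 2020"
)
print(parse_driver_version(sample))  # (460, 32, 3)
```

Tuples compare element by element, so `parse_driver_version(sample) < (460, 39)` is true for the driver above.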

Answer 5 · 2021-02-07T22:10:12.000Z

Another possibility is the new TensorFloat (TF32) mode on the tensor cores, which uses less precision and might be enabled by default on the 3090.

Is it possible to turn that off to test?

Answer 6 · 2021-02-07T22:39:42.000Z

A different TecoGAN implementation, in PyTorch, is also giving me nonsense results. Therefore I suspect either the system or the system's configuration.

I have no idea how to go about debugging something like this

This discussion concerns an Ubuntu boot, but there's also Windows on the machine, and prepackaged CUDA stuff on Windows works fine.

Answer 7 · 2021-02-08T05:26:21.000Z

Another possibility is the new TensorFloat (TF32) mode on the tensor cores, which uses less precision and might be enabled by default on the 3090.

Is it possible to turn that off to test?

Don't know that.
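One way to test the TF32 theory is the `NVIDIA_TF32_OVERRIDE` environment variable, which NVIDIA documents as disabling TF32 in its libraries when set to `0`. The sketch below only builds such an environment; the helper name is hypothetical, and whether the CUDA libraries inside this Docker image honor the variable (or use TF32 at all) is an assumption that has not been verified here.

```python
import os


def env_without_tf32(base_env=None):
    """Return a copy of the environment with TF32 disabled for NVIDIA libraries."""
    env = dict(os.environ if base_env is None else base_env)
    # "0" asks NVIDIA libraries that support the override to skip TF32 math.
    env["NVIDIA_TF32_OVERRIDE"] = "0"
    return env


env = env_without_tf32({"PATH": "/usr/bin"})
print(env["NVIDIA_TF32_OVERRIDE"])  # 0
```

The result could be passed to a child process, for example `subprocess.run(["python3", "runGan.py", "1"], env=env_without_tf32())`. If the output is still noise, TF32 is probably not the cause.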

Answer 8 · 2021-02-08T05:27:34.000Z

Could it be that you are looking at the wrong output image?

Answer 9 · 2021-02-08T05:49:22.000Z

I suppose that's a possibility, yes. Where am I meant to be looking?

Answer 10 · 2021-02-08T07:54:05.000Z

Just checked, it should be results/calendar.

Answer 11 · 2021-02-08T13:30:57.000Z

Yeah that's where that comes from ☹️

Does this mean the machine I bought won't be able to do the job?

Answer 12 · 2021-02-08T22:16:13.000Z

No, that's not an acceptable outcome.
Could you try to run the following inside the container and then check the generated images?

pip uninstall -y tensorflow-gpu
pip install tensorflow==1.8.0
python3 runGan.py 1

Answer 13 · 2021-02-11T17:43:34.000Z

sorry, i previously missed the email from this response

i just started a run. we'll find out soon.

thank you for the help, new friend

Answer 14 · 2021-02-17T23:58:55.000Z

Are the images still noisy?

Answer 15 · 2021-02-21T20:27:46.000Z

That seems to have fixed it 😄

Thank you.

Sorry for the slow response; I torched the docker image in other unrelated ways trying to fix it before you told me what to do, then last week was a lot of stuff at work; I've only just now had the chance to restart from scratch and follow your instructions.

Answer 16 · 2021-02-21T23:35:12.000Z

No problem. Just so you know: The commands just removed GPU support so the inference is running on your CPU.
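Whether TensorFlow can see a GPU at all can be checked from inside the container. This is a rough sketch with hypothetical helper names; it relies on `device_lib.list_local_devices()` from TensorFlow and falls back to an empty list when TensorFlow is not importable.

```python
from collections import Counter


def count_device_types(devices):
    """Count devices by their device_type attribute (e.g. "CPU", "GPU")."""
    return Counter(d.device_type for d in devices)


def visible_devices():
    """Ask TensorFlow for its devices; empty list if TensorFlow is missing."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return []
    return device_lib.list_local_devices()


counts = count_device_types(visible_devices())
if counts.get("GPU", 0):
    print("TensorFlow sees a GPU")
else:
    print("no GPU visible to TensorFlow")
```

With the CPU-only `tensorflow` package installed, this is expected to report no GPU, which matches the slow inference described here.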

Answer 17 · 2021-02-22T04:18:41.000Z

oh. no wonder it's so slow :(

is there a way to get this working on the gpu proper?

Answer 18 · 2021-02-22T22:22:17.000Z

Maybe using a newer tensorflow version could solve the issue, but upgrading tensorflow probably isn't that easy. Some users did some work upgrading tensorflow: thunil/TecoGAN#97

Answer 19 · 2021-02-23T19:37:12.000Z

lol, brutal

no worries; i just ordered a bunch of ram. there's more than one way to peel an orange.

we're gonna try artillery. that orange is never gonna know what hit it