Kreyren/kreyren

ZOTAC GTX 970 AMP Omega


image

Was using this in dreamon when I suddenly got kicked from X11. It stopped working on nvidia and on nouveau, and on llvmpipe I get these green dots all over the screen:

image

At the time of failure the GPU's fans were powered from an external 12 V DC PSU, keeping them at 100%. These fans are not connected to the PCB.

VRAM

GPU Core

Hypothesis

Sudden death of a VRAM

Known issues

VRAM overheating

This GPU has had issues with VRAM overheating before, which caused it to fail-safe, as the VRAM is only passively cooled by a 1mm aluminium plate.

Diagnostics

[CONFIRMED] Conductive dust?

The GPU had more dust than I would like, so it's possible that the dust was causing a short.


TODO

  • Handle the VRAM overheating issue

Help-wanted: Suggestions to handle the VRAM overheating appreciated.

DISCLAIMER: Don't do this, not to be used for training.

SOLVED: Conductive dust was causing a signal to hang up

I put it in a sonic cleaner, which didn't fix the issue but seemed to reduce the amount of fragmentation on the screen. Did two more passes without any major improvement.

So I did this:

image

Then it went back into the sonic cleaner filled with 99.6% isopropyl alcohol, got dried with hot air, and now it works.

I believe the dust was probably trapped below the VRAM or some other component where the sonic cleaner couldn't easily reach, but pressurized water from a shower head set to jet could.

FWIW the water is reverse osmosis, going through a filter and de-gasser. I wouldn't do this with regular tap water due to the risk of minerals sticking to the components.

Also did a VRAM stress test:

kreyren@dreamon:~/Downloads/memtestG80$ sudo ./memtestG80 3500 5
     -------------------------------------------------------------
     |                      MemtestG80 v1.00                     |
     |                                                           |
     | Usage: memtestG80 [flags] [MB GPU RAM to test] [# iters]  |
     |                                                           |
     | Defaults: GPU 0, 128MB RAM, 50 test iterations            |
     | Amount of tested RAM will be rounded up to nearest 2MB    |
     -------------------------------------------------------------

      Available flags:
        --gpu N ,-g N : run test on the Nth (from 0) CUDA GPU
        --license ,-l : show license terms for this build

Running 5 iterations of tests over 3500 MB of GPU memory on card 0: GeForce GTX 970

Running memory bandwidth test over 20 iterations of 1750 MB transfers...
	Estimated bandwidth 89171.97 MB/s

Test iteration 1 (GPU 0, 3500 MiB): 0 errors so far
	Moving Inversions (ones and zeros): 0 errors (161 ms)
	Memtest86 Walking 8-bit: 0 errors (1275 ms)
	True Walking zeros (8-bit): 0 errors (640 ms)
	True Walking ones (8-bit): 0 errors (638 ms)
	Moving Inversions (random): 0 errors (160 ms)
	Memtest86 Walking zeros (32-bit): 0 errors (2550 ms)
	Memtest86 Walking ones (32-bit): 0 errors (2554 ms)
	Random blocks: 0 errors (288 ms)
	Memtest86 Modulo-20: 0 errors (5342 ms)
	Logic (one iteration): 0 errors (82 ms)
	Logic (4 iterations): 0 errors (87 ms)
	Logic (shared memory, one iteration): 0 errors (82 ms)
	Logic (shared-memory, 4 iterations): 0 errors (87 ms)

Test iteration 2 (GPU 0, 3500 MiB): 0 errors so far
	Moving Inversions (ones and zeros): 0 errors (161 ms)
	Memtest86 Walking 8-bit: 0 errors (1273 ms)
	True Walking zeros (8-bit): 0 errors (622 ms)
	True Walking ones (8-bit): 0 errors (611 ms)
	Moving Inversions (random): 0 errors (156 ms)
	Memtest86 Walking zeros (32-bit): 0 errors (2456 ms)
	Memtest86 Walking ones (32-bit): 0 errors (2450 ms)
	Random blocks: 0 errors (285 ms)
	Memtest86 Modulo-20: 0 errors (5061 ms)
	Logic (one iteration): 0 errors (79 ms)
	Logic (4 iterations): 0 errors (81 ms)
	Logic (shared memory, one iteration): 0 errors (79 ms)
	Logic (shared-memory, 4 iterations): 0 errors (81 ms)

Test iteration 3 (GPU 0, 3500 MiB): 0 errors so far
	Moving Inversions (ones and zeros): 0 errors (155 ms)
	Memtest86 Walking 8-bit: 0 errors (1228 ms)
	True Walking zeros (8-bit): 0 errors (610 ms)
	True Walking ones (8-bit): 0 errors (611 ms)
	Moving Inversions (random): 0 errors (154 ms)
	Memtest86 Walking zeros (32-bit): 0 errors (2440 ms)
	Memtest86 Walking ones (32-bit): 0 errors (2438 ms)
	Random blocks: 0 errors (285 ms)
	Memtest86 Modulo-20: 0 errors (5030 ms)
	Logic (one iteration): 0 errors (80 ms)
	Logic (4 iterations): 0 errors (81 ms)
	Logic (shared memory, one iteration): 0 errors (78 ms)
	Logic (shared-memory, 4 iterations): 0 errors (82 ms)

Test iteration 4 (GPU 0, 3500 MiB): 0 errors so far
	Moving Inversions (ones and zeros): 0 errors (154 ms)
	Memtest86 Walking 8-bit: 0 errors (1228 ms)
	True Walking zeros (8-bit): 0 errors (610 ms)
	True Walking ones (8-bit): 0 errors (609 ms)
	Moving Inversions (random): 0 errors (155 ms)
	Memtest86 Walking zeros (32-bit): 0 errors (2436 ms)
	Memtest86 Walking ones (32-bit): 0 errors (2442 ms)
	Random blocks: 0 errors (281 ms)
	Memtest86 Modulo-20: 0 errors (5040 ms)
	Logic (one iteration): 0 errors (79 ms)
	Logic (4 iterations): 0 errors (81 ms)
	Logic (shared memory, one iteration): 0 errors (78 ms)
	Logic (shared-memory, 4 iterations): 0 errors (81 ms)

Test iteration 5 (GPU 0, 3500 MiB): 0 errors so far
	Moving Inversions (ones and zeros): 0 errors (155 ms)
	Memtest86 Walking 8-bit: 0 errors (1231 ms)
	True Walking zeros (8-bit): 0 errors (611 ms)
	True Walking ones (8-bit): 0 errors (612 ms)
	Moving Inversions (random): 0 errors (155 ms)
	Memtest86 Walking zeros (32-bit): 0 errors (2439 ms)
	Memtest86 Walking ones (32-bit): 0 errors (2435 ms)
	Random blocks: 0 errors (282 ms)
	Memtest86 Modulo-20: 0 errors (5045 ms)
	Logic (one iteration): 0 errors (79 ms)
	Logic (4 iterations): 0 errors (81 ms)
	Logic (shared memory, one iteration): 0 errors (78 ms)
	Logic (shared-memory, 4 iterations): 0 errors (81 ms)

Final error count after 5 iterations over 3500 MiB of GPU memory: 0 errors
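
For comparing runs like the one above without scrolling through every iteration, here is a minimal sketch that totals the error counts per test from a saved MemtestG80 log. The names `summarize`, `TEST_LINE` and `FINAL_LINE` are made up for this example, and it assumes that failing runs print their per-test lines in the same "N errors (T ms)" shape as the clean run above, which I have not verified against a failing log.

```python
import re
from collections import defaultdict

# Per-test lines, e.g. "\tRandom blocks: 0 errors (288 ms)".
# Assumption: failing runs use the same line shape as the clean run.
TEST_LINE = re.compile(
    r"^\s*(?P<name>.+?): (?P<errors>\d+) errors \((?P<ms>\d+) ms\)\s*$"
)

# Closing line, e.g.
# "Final error count after 5 iterations over 3500 MiB of GPU memory: 0 errors".
FINAL_LINE = re.compile(
    r"^Final error count after (\d+) iterations over (\d+) MiB of GPU memory: (\d+) errors"
)


def summarize(log_text):
    """Return (errors summed per test name, final error count or None)."""
    per_test = defaultdict(int)
    final = None
    for line in log_text.splitlines():
        final_match = FINAL_LINE.match(line)
        if final_match:
            final = int(final_match.group(3))
            continue
        test_match = TEST_LINE.match(line)
        if test_match:
            # Sum the same test across all iterations.
            per_test[test_match.group("name")] += int(test_match.group("errors"))
    return dict(per_test), final
```

Usage would be something like saving the output with `sudo ./memtestG80 3500 5 | tee memtest.log` and then calling `summarize(open("memtest.log").read())`. Which tests fail can hint at the kind of fault, but this does not tell you which physical chip is bad.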

Aaaaannnddd... it's fucked again, and this time it seems to be a new issue.

Was playing Assassin's Creed Black Flag and other games for about 8 hours without any problem when it suddenly bricked the system, and since then it's refusing to give any display output after the driver loads.

Running the same MemtestG80 test as above gave me more than 1 000 000 errors this time, so I assume it's a VRAM failure, which is supported by the blue stripes on nouveau.

Help-wanted: How do you figure out which VRAM chip is faulty?

Blocked by #92