taichi-dev/difftaichi

CUDA extremely slow on example where CPU is fast

juliusbierk opened this issue · 4 comments

First of all: cool library. I am trying to familiarize myself with it.

I tried to make a simple example. This code creates an image with a black-to-white gradient
and uses a loss function to darken the image.
It runs fast on the CPU, but cannot even render the first frame on the GPU (an RTX 2080 Ti). The GPU stays at 100% utilization, but nothing happens. I can run other examples just fine on the GPU.

Is there anything obvious that I have misunderstood?

import taichi as ti

# ti.init(arch=ti.x86_64, debug=False)  # works
ti.init(arch=ti.cuda, debug=False)  # extremely slow

n = 320
pixels = ti.var(dt=ti.f32, shape=(n * 2, n), needs_grad=True)
loss = ti.var(dt=ti.f32, shape=(), needs_grad=True)

@ti.kernel
def paint(t: ti.f32):
    for i, j in pixels:
        loss[None] += ti.sqr(pixels[i, j])

@ti.kernel
def init():
    for i, j in pixels:
        pixels[i, j] = i/500. + j/500.

@ti.kernel
def apply_grad():
    for i, j in pixels:
        pixels[i, j] -= learning_rate * pixels.grad[i, j]

gui = ti.GUI("Tester", (n * 2, n))
init()

learning_rate = 0.01

for i in range(1000000):
    print(i)
    with ti.Tape(loss):
        paint(i * 0.1)
    apply_grad()
    print(pixels.grad[5, 5])

    gui.set_image(pixels)
    gui.show()

Thank you in advance for your help.

By the way, CUDA seems to get stuck in the __exit__ part of the with ti.Tape block, i.e. when calculating the gradients of paint().
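For anyone checking the numbers printed by print(pixels.grad[5, 5]), here is a minimal plain-Python sketch of what the gradient should be, assuming the tape computes d(loss)/d(pixels) for loss = sum of pixels[i, j]^2. The helper names (init_pixel, grad_pixel, step) are made up for illustration and are not part of Taichi.

```python
# Analytic reference for the example above (plain Python, no Taichi needed).
# Assumption: loss = sum over all pixels of p**2, so d(loss)/dp = 2 * p.

learning_rate = 0.01


def init_pixel(i, j):
    # Same initial value as the init() kernel.
    return i / 500.0 + j / 500.0


def grad_pixel(p):
    # Derivative of p**2 with respect to p.
    return 2.0 * p


def step(p, lr=learning_rate):
    # One gradient-descent update, mirroring apply_grad().
    return p - lr * grad_pixel(p)


p0 = init_pixel(5, 5)   # initial value of pixels[5, 5]
g0 = grad_pixel(p0)     # expected first value of pixels.grad[5, 5]
p1 = step(p0)           # value of pixels[5, 5] after one update

print("initial pixel:", p0)
print("expected first gradient:", g0)
print("pixel after one step:", p1)
```

Each update multiplies every pixel by (1 - 2 * learning_rate), so the image should darken slowly and uniformly. The first gradient at [5, 5] should be close to 0.04, up to float32 rounding on the Taichi side.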

Aha, found the problem.

Apparently gradients do not support the "smart indexing" used in the for loops.
Replacing paint with

@ti.kernel
def paint(t: ti.f32):
    for i in range(n * 2):
        for j in range(n):
            loss[None] += pixels[i, j] * pixels[i, j]

allows it to run on the GPU.

This is strange. The example in the documentation suggests that smart indexing is the way to go:
https://taichi.readthedocs.io/en/stable/hello.html

Also, according to the documentation, the second version of paint should be slower, because only the outermost loop (in your case, for i in range(n * 2)) is parallelized: https://taichi.readthedocs.io/en/stable/hello.html#parallel-for-loops

A small observation: You are using ti.var, the example uses ti.field. I cannot find anything about ti.var in the documentation. What is ti.var?

I am still trying to get my head around the examples, so I cannot help much more, but I hope this points you in the right direction.

Hi @robertour. Thanks for your reply. I opened this issue back in February, and I'm sure many things have changed since then (e.g. ti.var no longer being used). Perhaps it also just works now.