Real-Time Rendering - Frame to Screen Data Transfer
jakobtroidl opened this issue · 5 comments
Inspired by this amazing new paper, I am trying to use drjit to build a forward-only renderer that runs in real time.
I wonder how best to transfer frames rendered by a kernel to the screen as fast as possible. I am first converting the data into a numpy array and then rendering it with the PyGame window library - this runs at ~20 FPS on my LLVM-accelerated Intel Mac Pro (see code below). This seems inefficient because there is a lot of unnecessary data transfer (GPU->CPU->GPU for the CUDA backend) happening for each frame. Do you have recommendations on more clever design choices when using drjit for real-time rendering?
pip install drjit numpy
python -m pip install -U pygame==2.5.2 --user
import pygame
import numpy as np
import drjit as dr
from drjit.llvm import Float, UInt32, Array3f, Array2f, TensorXf, Texture3f, PCG32, Loop
def sdf(p: Array3f) -> Float:
    return dr.norm(p) - 1

def trace(o: Array3f, d: Array3f) -> Array3f:
    for i in range(10):
        o = dr.fma(d, sdf(o), o)
    return o

def shade(p: Array3f, l: Array3f, eps: float = 1e-3) -> Float:
    n = Array3f(
        sdf(p + [eps, 0, 0]) - sdf(p - [eps, 0, 0]),
        sdf(p + [0, eps, 0]) - sdf(p - [0, eps, 0]),
        sdf(p + [0, 0, eps]) - sdf(p - [0, 0, eps])
    ) / (2 * eps)
    return dr.maximum(0, dr.dot(n, l))
def render_sphere():
    x = dr.linspace(Float, -1, 1, 1000)
    x, y = dr.meshgrid(x, x)
    p = trace(o=Array3f(0, 0, -2), d=dr.normalize(Array3f(x, y, 1)))
    sh = shade(p, l=Array3f(0, -1, -1))
    sh[sdf(p) > .1] = 0
    img = Array3f(.1, .1, .2) + Array3f(.4, .4, .2) * sh
    img_flat = dr.ravel(img)
    return TensorXf(img_flat, shape=(1000, 1000, 3))
# pygame setup
pygame.init()
screen = pygame.display.set_mode((1280, 720))
clock = pygame.time.Clock()
running = True
dt = 0
font = pygame.font.Font(None, 36)
while running:
    # Poll for events; pygame.QUIT means the user clicked X to close the window
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    screen.fill("purple")

    # Render a frame with drjit, copy it to the host, and create a Pygame surface from the array
    array = render_sphere()
    data = np.array(array)
    data = data * 255
    data = data.astype(np.uint8)
    surface = pygame.surfarray.make_surface(data)
    screen.blit(pygame.transform.scale(surface, (1280, 720)), (0, 0))

    # Calculate and display FPS
    fps = clock.get_fps()
    fps_text = font.render(f"FPS: {fps:.2f}", True, pygame.Color('white'))
    screen.blit(fps_text, (50, 50))

    pygame.display.flip()
    dt = clock.tick(150) / 1000

pygame.quit()
Hi @jakobtroidl
Out of the box, I'd say that there are a few things missing in Dr.Jit to really squeeze out every bit of performance for something like this.
As you suggested, there is some overhead from the data transfer in your current approach. Fundamentally, this problem lies with the display tool/framework you want to use: find one that will accept CUDA arrays through the DLPack interface. (There are some ongoing discussions about the necessity to synchronize when using that interface, see #198.) Alternatively, the Texture2f/Texture3f classes use CUDA textures, and I'd expect some frameworks might accept these directly. However, their respective handles aren't exposed through Python, so you'd need to add that yourself.
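For illustration, here is a minimal sketch of the DLPack route. It assumes the consuming framework can ingest objects exposing __dlpack__ (PyTorch is used here only as a stand-in) and that your Dr.Jit tensor type supports that protocol in your version, so treat it as an assumption rather than a confirmed API:

import torch
import drjit as dr
from drjit.cuda import Float, TensorXf

# Evaluated CUDA tensor, used as a stand-in for a rendered frame
img = TensorXf(dr.zeros(Float, 1000 * 1000 * 3), shape=(1000, 1000, 3))
dr.eval(img)

# Zero-copy hand-off via DLPack; mind the synchronization caveats discussed in #198
img_torch = torch.from_dlpack(img)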
Another big overhead is the tracing cost of executing your render() function through the Python interpreter. Although Dr.Jit will cache its kernels and re-use them, it still has to "read" through your code entirely to realize that it is executing a piece of code it has already seen. There isn't much you can do to alleviate this.
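To get a feeling for how much of each frame is spent on tracing versus on the kernel itself, you can time the two stages separately. A rough sketch, reusing the render_sphere() function from the snippet above:

import time
import drjit as dr

t0 = time.perf_counter()
img = render_sphere()   # Python-side tracing: the function body is re-executed every frame
t1 = time.perf_counter()
dr.eval(img)            # launch the (possibly cached) kernel
dr.sync_thread()        # wait for the backend to finish before stopping the timer
t2 = time.perf_counter()
print(f"trace: {(t1 - t0) * 1e3:.2f} ms, kernel: {(t2 - t1) * 1e3:.2f} ms")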
I think it should still be possible to access the underlying evaluated Dr.Jit CUDA buffers directly using the data_() call. E.g.,
a = dr.linspace(dr.cuda.Float32, 0, 1, 1024)
print(a.data_())
If I recall correctly, this will be a simple pointer directly to the CUDA memory. So if you then have a CUDA kernel that draws that to the screen or to a texture, that might be quite fast. I agree with Nicolas that the tracing is likely a significant overhead.
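As an illustration of what consuming that pointer could look like, here is a sketch that wraps the evaluated buffer in a CuPy array without copying. The exact semantics of data_() (ownership, synchronization) are an assumption here and would need to be double-checked:

import cupy as cp
import drjit as dr

a = dr.linspace(dr.cuda.Float32, 0, 1, 1024)
dr.eval(a)                      # make sure the buffer is materialized on the GPU
ptr = a.data_()                 # raw device pointer (assumed)
n = len(a)
mem = cp.cuda.UnownedMemory(ptr, n * 4, owner=a)   # 4 bytes per float32
view = cp.ndarray((n,), dtype=cp.float32,
                  memptr=cp.cuda.MemoryPointer(mem, 0))
# 'view' aliases Dr.Jit's memory: keep 'a' alive for as long as 'view' is in use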
Thank you so much for your answers.
Another big overhead is the tracing runtime of executing the Python interpreter through your render() function.
I am wondering how I could work around this issue. Would it make sense to compile CUDA kernels once from a Python implementation and then invoke them from a C++ based rendering loop? There must be a way around this issue since the paper mentioned above is so incredibly fast and it seems like they implemented their forward pass in drjit.
This paper uses a custom Dr.Jit version with many project-specific modifications. It's our goal to make something like this possible in mainline Dr.Jit in the future. Right now it is not possible due to the tracing overheads mentioned above. This will take a long time, so you may want to pursue other options if your goal is to do this right now.
ok, thanks for the heads-up.