This is a progress bar whose completion is controlled from a running CUDA kernel, compiled with Numba (https://numba.pydata.org).
python kernel_progress.py
- The latest Numba master branch
- A Compute Capability 6.0 or greater CUDA GPU (e.g. GTX 1xxx, RTX 2/3xxx, Pascal / Volta / Turing / Ampere Quadro GPUs).
- Linux (Managed Memory support in Numba is experimental on Windows, and this use of it seems to crash there).
- ... Maybe other things I forgot to mention.
A managed array visible to both the host and device holds an integer representing current progress:
progress = cuda.managed_array(1, dtype=np.uint64)
The running kernel increments this in a loop (with some sleeping, so it doesn't run too fast):
for i in range(MAX_VAL):
    cuda.atomic.inc(progress, 0, MAX_VAL)
    cuda.nanosleep(KERNEL_SLEEP)
Simultaneously, the host reads the progress value and updates the progress bar in a loop:
while val < MAX_VAL:
    sleep(0.001)
    val = progress[0]
    bar.goto(val)
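Running the snippets above requires a CUDA GPU, but the polling pattern itself can be sketched on the CPU: below, a background thread stands in for the kernel and a shared one-element list stands in for the managed array. This is a minimal analogue for illustration only; the names `worker` and the `time.sleep` calls are stand-ins, not code from this repo.

```python
import threading
import time

MAX_VAL = 100    # stands in for the kernel's iteration count
progress = [0]   # stands in for cuda.managed_array(1, dtype=np.uint64)

def worker():
    # Stands in for the CUDA kernel: bump the shared counter,
    # with some sleeping so it doesn't finish instantly.
    for i in range(MAX_VAL):
        progress[0] += 1   # the real code uses cuda.atomic.inc here
        time.sleep(0.001)  # the real code uses cuda.nanosleep

t = threading.Thread(target=worker)
t.start()

# Host-side loop: poll the shared value until the "kernel" finishes.
val = 0
while val < MAX_VAL:
    time.sleep(0.001)
    val = progress[0]
    # the real code calls bar.goto(val) here to redraw the bar

t.join()
print(val)  # MAX_VAL once the worker completes
```

The structure mirrors the real code: one side writes to shared state, the other polls it in a sleep loop. On the GPU the shared state is managed memory and the write is atomic; here the GIL papers over the same concern.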
Maybe...
- I have not thought hard about synchronization issues between the host and device.
- I have not tested this extensively.
- YMMV!
Maybe not...
- If you're going to use it, you should probably read and understand Section M.2.2 of the CUDA Programming Guide, which discusses coherency and concurrency of managed memory.
- You should also examine whether this technique has a performance impact, and think about whether breaking a long-running kernel launch into multiple shorter kernels would provide a better way of synchronizing progress with the host.