Array.get is very slow

Question

Array.get is very slow

flavianh opened this issue 9 years ago · 5 comments

Hey there, I have noticed that transfering an array from the device to the host is slow in a case I was able to reproduce. Have a look at the following snippet:

from reikna.cluda import any_api
from reikna.linalg import MatrixMul
import numpy as np
import time
api = any_api()
thr = api.Thread.create()

def minimal_example(shape):
    matrix_a = thr.array(shape, dtype=np.float32)
    matrix_b = thr.array(shape, dtype=np.float32)
    matrix_output = thr.array(shape, dtype=np.float32)
    dot = MatrixMul(matrix_b, matrix_a, out_arr=matrix_output)
    dotc = dot.compile(thr)
    dotc(matrix_output, matrix_a, matrix_b)
    matrix_output.get()

import cProfile

cProfile.run('minimal_example((10000, 10000))', sort='time')

This yields:

         94297 function calls (92282 primitive calls) in 13.481 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   13.353   13.353   13.353   13.353 gpuarray.py:248(get)
     6388    0.007    0.000    0.009    0.000 sre_parse.py:182(__next)
   234/38    0.006    0.000    0.022    0.001 sre_parse.py:379(_parse)
      271    0.003    0.000    0.009    0.000 version.py:208(__init__)
      148    0.003    0.000    0.004    0.000 sre_compile.py:207(_optimize_charset)
      313    0.003    0.000    0.030    0.000 __init__.py:2088(find_on_path)
     5893    0.003    0.000    0.011    0.000 sre_parse.py:201(get)
   431/36    0.003    0.000    0.009    0.000 sre_compile.py:32(_compile)
15989/15770    0.003    0.000    0.003    0.000 {len}
        2    0.003    0.001    0.003    0.001 {posix.read}
      965    0.002    0.000    0.004    0.000 posixpath.py:60(join)
        1    0.002    0.002    0.002    0.002 {posix.fork}
      265    0.002    0.000    0.002    0.000 version.py:351(_cmpkey)
        4    0.002    0.000    0.002    0.001 collections.py:237(namedtuple)
        1    0.002    0.002    0.091    0.091 __init__.py:15(<module>)
      372    0.002    0.000    0.002    0.000 {posix.stat}
     6302    0.002    0.000    0.002    0.000 {method 'endswith' of 'str' objects}
  601/210    0.002    0.000    0.002    0.000 sre_parse.py:140(getwidth)
      276    0.002    0.000    0.019    0.000 __init__.py:2460(from_location)
4263/4262    0.001    0.000    0.002    0.000 {isinstance}

I have checked my GPU card bandwidth (I am using Amazon's GPU machine, the small one with a GRID K520) and it gives me


 Device 0: GRID K520
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     9390.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     8814.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     119122.4

My guess is that get takes a little too much time to copy a 380MB array. I suspected this was due to the thread synchronization step, but when I do thr.from_device(matrix_output, async=True) I don't get better results. So is the copy really slow or is there something I don't get?

Answer 1 · 2015-05-06T18:14:30.000Z

I think that dotc is non blocking, so get is waiting for the computation to finish before getting the data. You should be able to see this by first waiting for the computation, then getting the data.

Answer 2 · 2015-05-06T18:37:20.000Z

How can I wait for completion? The doc is not very explicit about that, see here:

__call__(*args, **kwds)
Execute the computation.

Answer 3 · 2015-05-06T18:51:14.000Z

I don't know the reikna api very well. Try to pass async=False to thr = api.Thread.create() ? (of course, this is for profiling only)

Answer 4 · 2015-05-07T00:37:55.000Z

@Nodd is correct, computations are executed asynchronously by default. You can either use async=False in the Thread constructor (or in Thread.create()), or call thr.synchronize() where you need it. There is currently no option for doing that in the computation call itself.

Answer 5 · 2015-05-11T05:54:30.000Z

Seems to be resolved then. Closing.