Fast transfer of small tensors
juanmed opened this issue · 5 comments
Hello again,
After reviewing your example benchmark script, I was doing some measurements on CPU->GPU->CPU transfer times, comparing a PyTorch CPU pinned tensor versus SpeedTorch's DataGadget. My tensors are actually very small compared to your test cases, at most 20x3, but I need to transfer them fast enough to allow me to make other computations at >100 Hz.
So far, if I understood how to use SpeedTorch correctly, it seems that PyTorch's pinned CPU tensors have faster transfer times than SpeedTorch's CPU-pinned DataGadget object. See the graph below, where both histograms correspond to CPU->GPU->CPU transfer of a 6x3 matrix. The pink histogram corresponds to PyTorch's pinned CPU tensor and the turquoise one to SpeedTorch's DataGadget operations.
For my use case, it seems PyTorch's pinned CPU tensor has better performance. I would like to ask whether this makes sense in your experience, and what recommendations you could provide for using SpeedTorch to achieve better performance. My use case involves receiving data on the CPU, transferring it to the GPU, performing various linear algebra operations, and finally getting the result back to the CPU. All of this must run at a minimum of 100 Hz. So far I have only achieved 70 Hz and would like to speed up every operation as much as possible.
You can find the code I used to get this graph here, and this was run on a Jetson Nano (ARMv8 CPU, NVIDIA Tegra X1 GPU).
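Roughly, the comparison I'm timing looks like this (a simplified sketch of the linked script; warm-up and repetitions are omitted and the sizes are illustrative):

import time
import numpy as np
import torch
import SpeedTorch

x = np.random.rand(6, 3).astype(np.float32)
gpu = torch.zeros(6, 3, device='cuda')

# PyTorch pinned CPU tensor path
cpu_pinned = torch.from_numpy(x).pin_memory()
torch.cuda.synchronize()
t0 = time.perf_counter()
gpu.copy_(cpu_pinned, non_blocking=True)   # CPU -> GPU
back = gpu.to('cpu')                       # GPU -> CPU
torch.cuda.synchronize()
print('torch pinned round trip:', time.perf_counter() - t0)

# SpeedTorch DataGadget path
np.save('x.npy', x)
gadget = SpeedTorch.DataGadget('x.npy', CPUPinn=True)
gadget.gadgetInit()
torch.cuda.synchronize()
t0 = time.perf_counter()
gpu[:] = gadget.getData(indexes=(slice(0, 6), slice(0, 3)))             # CPU -> GPU
gadget.insertData(dataObject=gpu, indexes=(slice(0, 6), slice(0, 3)))   # GPU -> CPU
torch.cuda.synchronize()
print('DataGadget round trip:', time.perf_counter() - t0)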
Thank you very much!
This makes sense. I initially didn't know why pinned CuPy tensors were getting faster performance, but a PyTorch engineer pointed it out and I updated the 'How it works?' section: the pinned CuPy tensors aren't copying faster, they're using a different indexing kernel, which works better for CPUs with a lower number of cores.
In either case, for smaller tensors, there's probably not that much indexing going on, so it would make sense that Pytorch's pinned CPU tensors are faster.
More details here: https://discuss.pytorch.org/t/introducing-speedtorch-4x-speed-cpu-gpu-transfer-110x-gpu-cpu-transfer/56147/2
In your case, I would imagine the number of CPU cores wouldn't make too much of a difference, but out of curiosity, how many CPU cores are you working with?
@Santosh-Gupta, thanks for your reply.
The Jetson Nano platform has a quad-core ARM A57 64-bit CPU.
the pinned CuPy tensors aren't copying faster, they're using a different indexing kernel, which works better for CPUs with a lower number of cores.
In either case, for smaller tensors, there's probably not that much indexing going on, so it would make sense that Pytorch's pinned CPU tensors are faster.
More details here: https://discuss.pytorch.org/t/introducing-speedtorch-4x-speed-cpu-gpu-transfer-110x-gpu-cpu-transfer/56147/2
So it seems the best use case for SpeedTorch is CPU<->GPU transfer of slices of big tensors? After I transfer the tensor to the GPU, I create one variable per row, like row1 = xgpu[1], and end up with 20 variables (20 rows in the tensor), roughly as in the snippet below.
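A small illustration of that pattern (xgpu stands for my already-transferred tensor; torch.unbind is just the single-call equivalent of the manual slicing):

import torch

xgpu = torch.zeros(20, 3, device='cuda')   # tensor already transferred to the GPU

# one view per row, created by hand
row0 = xgpu[0]
row1 = xgpu[1]
# ... up to row19 = xgpu[19]

# equivalent in a single call: a tuple of 20 row views, no copies
rows = torch.unbind(xgpu, dim=0)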
Does this mean that SpeedTorch's advantage depends on both the number of cores and the size of the matrices, for all operations (both transfer and indexing), and that the underlying reason is a bug in PyTorch's implementation? I read in the 'How it works?' section that PyTorch's performance gets better with the number of cores, but I understood from the benchmark tables that the speed-ups, most importantly for SpeedTorch pinned CPU <-> PyTorch CUDA (for which 124x is indicated), were to be expected. Sorry if the question is already answered in your readme and links, I would just like to make sure I am getting it right.
In this comment though, talking about a DataGadget (cpu-pinned??) object they mention:
CuPy is only intended to hold CUDA data, but in this case it’s actually holding CPU data (pinned memory).
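If I'm reading that right, the trick is a custom allocator that hands CuPy page-locked host memory instead of device memory. This is only my rough sketch of what my_pinned_allocator might be doing, not SpeedTorch's actual code:

import cupy as cp
from cupy.cuda import memory

class PinnedHostMemory(memory.BaseMemory):
    # page-locked host memory, handed to CuPy as if it were device memory
    def __init__(self, size):
        self.size = size
        self.device_id = cp.cuda.device.get_device_id()
        self.ptr = cp.cuda.runtime.hostAlloc(size, 0) if size > 0 else 0

    def __del__(self):
        if self.ptr:
            cp.cuda.runtime.freeHost(self.ptr)

def my_pinned_allocator_sketch(nbytes):
    # CuPy asks its allocator for "device" memory; this hands back pinned CPU memory instead
    return memory.MemoryPointer(PinnedHostMemory(nbytes), 0)

cp.cuda.set_allocator(my_pinned_allocator_sketch)
buf = cp.zeros((18, 3), dtype=cp.float32)   # a CuPy "CUDA" array actually backed by CPU RAM
cp.cuda.set_allocator(None)                 # restore the normal device allocator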
So I am left wondering if I could somehow still use SpeedTorch to exploit this particular characteristic. I am not sure if you were able to look at my code, but is the following script a sensible way of getting the best performance? In __init__() I initialize all the tensors that will transfer data, and control_law is the callback that should execute at more than 100 Hz.
# imports used below (the enclosing class definition is omitted):
import numpy as np
import cupy as cp
import torch
import SpeedTorch

def __init__(self):
    # initialize CPU and GPU tensors
    # this will receive data in GPU
    self.data_gpu = torch.zeros(18, 3).to('cuda')
    data = np.zeros((18, 3))
    # this will receive all CPU data and transfer to GPU
    self.data_cpu = SpeedTorch.DataGadget(data, CPUPinn = True ) # is this using CPU memory?
    self.data_cpu.gadgetInit()
    # this will receive GPU data and transfer to CPU
    data2 = np.zeros((6, 3))
    self.return_cpu = SpeedTorch.DataGadget(data2, CPUPinn = True ) # is this using CPU memory?
    self.return_cpu.gadgetInit()

# inside a callback @ 100Hz
def control_law(self, new_data): # new_data is an 18x3 np.array
    # update data in pinned CPU memory, would this be still using pinned memory?
    self.data_cpu.CUPYcorpus = cp.asarray(new_data)
    # transfer from CPU to GPU
    self.data_gpu[:] = self.data_cpu.getData(indexes = (slice(0,18), slice(0,3)))
    # slice all GPU data
    a = self.data_gpu[0]
    b = self.data_gpu[1]
    c = self.data_gpu[2:5]
    # etc etc etc
    h = self.data_gpu[17]
    # do various linear algebra operations in GPU with torch cuda tensors
    res = self.do_linear_algebra(a, b, ..., h)  # intermediate arguments elided
    # transfer back to pinned CPU memory
    self.return_cpu.insertData(dataObject = res, indexes = (slice(0,6), slice(0,3)))
    # continue processing
    return cp.asnumpy(self.return_cpu.CUPYcorpus)
Prior to this, I modified SpeedTorch's CUPYLive.py so that gadgetInit accepts a numpy array as input when initializing:
def gadgetInit(self):
    if self.CPUPinn == True:
        cupy.cuda.set_allocator(my_pinned_allocator)

    if isinstance(self.fileName, np.ndarray):
        # new: accept an in-memory numpy array directly
        self.CUPYcorpus = cupy.asarray(self.fileName)
    else:
        self.CUPYcorpus = cupy.load(self.fileName)

    if self.CPUPinn == True:
        cupy.cuda.set_allocator(None)
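With that change, initializing a gadget straight from an in-memory array looks like this (illustrative sizes):

import numpy as np
import SpeedTorch

data = np.zeros((18, 3), dtype=np.float32)
gadget = SpeedTorch.DataGadget(data, CPUPinn=True)  # numpy array instead of a .npy file name
gadget.gadgetInit()                                 # goes through the new np.ndarray branch above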
Thank you very much for your comments! Getting to know SpeedTorch has allowed me to better understand the interaction between CPU and GPU :)
quad-core ARM A57 64-bit CPU
So this would have 4 cores?
So it seems the best use case for SpeedTorch is CPU<->GPU transfer of slices of big tensors?
Yes, but I am also wondering whether, even when you copy the whole tensor, it still has to go through the indexing operations, and would thus see a speedup. This is something I'll need to test.
In this comment though, talking about a DataGadget (cpu-pinned??)
Yeah, they're talking about a CPU-pinned DataGadget.
self.data_cpu = SpeedTorch.DataGadget(data, CPUPinn = True ) # is this using CPU memory?
With the new modification, I imagine this would still be using pinned CPU memory; I would be surprised if it wasn't. But I haven't explicitly tested this, so I can't be 100% sure.
# update data in pinned CPU memory, would this be still using pinned memory? self.data_cpu.CUPYcorpus = cp.asarray(new_data)
I haven't tested this, but I don't believe this would be pinned memory. If cp.asarray(new_data) were already pinned, then yes, it would stay pinned. Perhaps a data-pinning method is something I should include in the next update. But I'll need to test this to be 100% sure, by creating a large array with cp.asarray(new_data) and checking whether the CPU or the GPU memory increases.
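Something along these lines is what I have in mind (an untested sketch):

import cupy as cp
import numpy as np

new_data = np.zeros((4096, 4096), dtype=np.float32)   # ~64 MB, large enough to notice

free_before, _ = cp.cuda.runtime.memGetInfo()
arr = cp.asarray(new_data)
free_after, _ = cp.cuda.runtime.memGetInfo()

# If free device memory drops by roughly the array size, cp.asarray allocated
# GPU memory, i.e. the replaced corpus is no longer in pinned CPU memory.
print((free_before - free_after) / 1e6, "MB of device memory consumed")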
Hmmm, since the data dimensions are 18x3 and 6x3, I imagine that the indexing will not be too heavy, and there may not be that much of a speedup, particularly when the slices are only a few rows. But perhaps the Facebook engineer in that one link can give better insight.
Either way, it's worth a test. If you do, would love to hear the results.
@Santosh-Gupta Thanks for your reply back. Yes, that would be 4 cores in the cpu. I will try to confirm if the self.data_cpu.CUPYcorpus = cp.asarray(new_data)
would still be using cpu pinned memory and come back with the results. Thanks!
Another approach you might want to consider is using the PyCUDA and Numba indexing kernels with a similar approach, disguising CPU-pinned tensors as GPU tensors. I haven't had a chance to try this approach.
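For example, I haven't tried it on the Nano, but even without the "disguising" part, the plain pinned-buffer path with Numba would look roughly like this (an untested sketch):

import numpy as np
from numba import cuda

# allocate a page-locked host buffer once, up front
pinned_buf = cuda.pinned_array((18, 3), dtype=np.float32)

def round_trip(new_data):
    pinned_buf[:] = new_data               # host-side copy into pinned memory
    d_arr = cuda.to_device(pinned_buf)     # CPU -> GPU from pinned memory
    # ... run kernels / linear algebra on d_arr ...
    return d_arr.copy_to_host()            # GPU -> CPU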