dotnet/TorchSharp

No way to copy a tensor from gpu to cpu to pre allocated array.

LukePoga opened this issue · 6 comments

There doesn't appear to be any way to transfer a result tensor into an existing CPU float array. The code below requires a new memory allocation.

    var cpuResult = gpuResult.cpu();
    float[] result = cpuResult.data<float>().ToArray();

If this is part of a loop, this is a lot of wasted memory allocation and time! Below is how libraries normally do things, e.g. CUDA.

float[] cpuResult ..... (pre allocated further up)
gpuResult.CopyToHost(cpuResult);

Maybe I missed this CopyTo, because it's kind of essential for any GPU-type library (?!)

Is this project maintained?

One way to make a "fast" copy without overloading memory is to make the tensor contiguous and add this to the ToArray() function in TensorAccessor.cs:

if (_tensor.is_contiguous()) {
    // This is very fast and works very well.
    var shps = _tensor.shape;
    long TempCount = 1;
    for (int i = 0; i < shps.Length; i++)
        TempCount *= shps[i]; // In theory, numel is simply the product of the shape's dimensions
    unsafe {
        return new Span<T>(_tensor_data_ptr.ToPointer(), Convert.ToInt32(TempCount)).ToArray();
    }
}
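
The fast path above still ends in ToArray(), which allocates a new array on every call. As a rough sketch of the same bulk copy writing into a caller-supplied buffer instead, the example below uses only the .NET standard library. It does not use TorchSharp: `PreallocatedCopy`, `CopyToHost`, and the `source` pointer are hypothetical names, with `source` standing in for the data pointer of a contiguous CPU tensor.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PreallocatedCopy
{
    // Hypothetical helper: copies `count` floats from native memory at `source`
    // into `destination` in one bulk operation, without allocating a new array.
    public static void CopyToHost(IntPtr source, long count, float[] destination)
    {
        if (count > destination.Length)
            throw new ArgumentException("Destination array is too small.", nameof(destination));
        Marshal.Copy(source, destination, 0, checked((int)count));
    }

    public static void Main()
    {
        const int n = 6;
        float[] result = new float[n];   // allocated once, reused on every iteration
        float[] staging = new float[n];  // only used to fill the stand-in native buffer
        IntPtr native = Marshal.AllocHGlobal(n * sizeof(float)); // stand-in for tensor storage
        try
        {
            for (int iter = 0; iter < 3; iter++)
            {
                // Simulate a compute step producing new values in native memory.
                for (int i = 0; i < n; i++) staging[i] = iter * 10 + i;
                Marshal.Copy(staging, 0, native, n);

                // Bulk copy into the preallocated array; no per-iteration allocation.
                CopyToHost(native, n, result);
                Console.WriteLine(string.Join(", ", result));
            }
        }
        finally
        {
            Marshal.FreeHGlobal(native);
        }
    }
}
```

Whether TensorAccessor can expose its pointer this way is an assumption; the point is only that a single bulk copy into an existing array avoids both the per-element loop and the allocation.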

I added this in one commit of my pull request, Autocast. I am trying to figure out how to apply the same idea when the tensor is not contiguous, because with this approach I always need to make the tensor contiguous to get the faster copy.

torch.Tensor te /*blablabla*/;
float[] values = te.contiguous().data<float>().ToArray();

I noticed that if the tensor is not contiguous, the numel method is always called, so it is always recomputed.

Edit: Oh sorry, I misunderstood what you meant. I think CopyTo will work. You mean like this?

float[] data = new float[h*w*3]; // Preallocated at the top of the function, for example

//Intense functions and process blablabla

tenGPU.data<float>().CopyTo(data); //`tenGPU` is a variable of torch.Tensor that is allocated in GPU

I will test this. If it does not work, I will soon investigate how to do it.

I recently tested this and it works well.

Great, thanks. I don't know why I didn't see CopyTo before.

tenGPU.data<float>().CopyTo(data); 

But it's not faster. This takes 340ms for 12,000,000 floats. That is 48MB in 0.34s, or roughly 140MB/s, which is extremely slow for PCIe bandwidth. Why is it so slow?

@LukePoga
Because in TensorAccessor.cs (L41) you can see that it calls _tensor.numel(), and inside the loop in GetSubsequentIndices, for example, it calls numel on every iteration. That uses a lot of CPU and can be slow: the function is called many times, and the code also iterates over the pointer array one element at a time, assigning each value into the preallocated array. So my solution was to modify TensorAccessor for a fast copy, but it only works if the tensor is contiguous, so before CopyTo or ToArray() you should create a contiguous tensor like this:

torch.Tensor tenGPU;
//Blablabla
tenGPU = tenGPU.contiguous();
//After that you can call tenGPU.data<T>().ToArray() or a CopyTo.
tenGPU.data<T>().ToArray() //Or CopyTo

My fast TensorAccessor precomputes numel once (that is, it multiplies all the shape dimensions together) and then creates a complete copy without looping over the pointer array.

//From my branch of TorchSharp/Utils/TensorAccessor.cs
unsafe {
    return new Span<T>(_tensor_data_ptr.ToPointer(), Convert.ToInt32(TempCount)).ToArray();
}

This does not iterate over the array and assign a value at each index. It creates a complete copy in one operation.

Soon I will make a PR for a fast TensorAccessor, but remember that it will only be fast if the tensor is contiguous.
For the non-contiguous case I still need to figure something out; maybe it can be a bit quicker with a precomputed numel.
The non-contiguous case is more complex because of the strides.
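
To illustrate why strides complicate the non-contiguous case, here is a minimal sketch in plain C# of a strided copy into a preallocated array. It is not the TorchSharp implementation: `StridedCopy` and `CopyStrided` are hypothetical names, a managed `float[]` stands in for the tensor storage, and strides are assumed to be in elements rather than bytes. The element count is computed once, and the storage offset is updated incrementally instead of being recomputed for every element.

```csharp
using System;

public static class StridedCopy
{
    // Hypothetical helper: copies a strided N-dimensional view of `storage`
    // into `destination` in row-major order. Strides are in elements.
    public static void CopyStrided(float[] storage, long[] shape, long[] strides, float[] destination)
    {
        if (shape.Length != strides.Length)
            throw new ArgumentException("shape and strides must have the same rank.");

        long numel = 1; // computed once, not on every iteration
        foreach (long dim in shape) numel *= dim;
        if (numel > destination.Length)
            throw new ArgumentException("Destination array is too small.", nameof(destination));

        long[] index = new long[shape.Length]; // current multi-dimensional index
        long offset = 0;                       // storage offset of that index
        for (long i = 0; i < numel; i++)
        {
            destination[i] = storage[offset];
            // Advance the index like an odometer, adjusting the offset as we go.
            for (int d = shape.Length - 1; d >= 0; d--)
            {
                index[d]++;
                offset += strides[d];
                if (index[d] < shape[d]) break;
                offset -= strides[d] * shape[d]; // wrap this dimension back to 0
                index[d] = 0;
            }
        }
    }

    public static void Main()
    {
        // A 2x3 row-major matrix stored contiguously...
        float[] storage = { 1, 2, 3, 4, 5, 6 };
        // ...viewed as its 3x2 transpose: shape (3, 2), strides (1, 3).
        float[] result = new float[6];
        CopyStrided(storage, new long[] { 3, 2 }, new long[] { 1, 3 }, result);
        Console.WriteLine(string.Join(", ", result)); // prints "1, 4, 2, 5, 3, 6"
    }
}
```

Unlike the contiguous case, this cannot be done as one bulk copy, since neighbouring output elements are not neighbours in storage. A real implementation could still bulk-copy the innermost dimension whenever its stride is 1.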

Thanks for working on the PR. Do you know who can approve it?

#1396

Working on it. Changes remain to be made before it can be approved.