NVIDIA/FasterTransformer

[bug] CustomAllReduceComm swapInternalBuffer is not safe (modifying const pointer).

rkindi opened this issue · 5 comments

Branch/Tag/Commit

main

Docker Image Version

N/A

GPU name

A100

CUDA Driver

N/A

Reproduced Steps

The pointer swap in CustomAllReduceComm::swapInternalBuffer is not safe because it writes to a Tensor struct's data field, which is a const void* (modifying a const value is undefined behavior). When I run a standalone script to test the custom all reduce, I can print the tensor's data attribute before and after the call to swapInternalBuffer and see that no change is made. My script is attached below:

repro_issue671_fastertransformer.zip

Instructions:

- python3 make_npy_tensors.py
- Compile and run main.cu
- python3 validate_npy_tensors.py

Output from my machine for main.cu:

ar_out_buffer.data (before): 0x7f65d5002400.
ar_out_buffer.data (after): 0x7f65d5002400.
DONE

We can see that the Tensor's data pointer is unchanged. This prevents us from using the custom all reduce at all, since there is no way to write to the all reduce input buffers.

I don't have a complete high-level overview of FasterTransformer's design, so I might be missing something, but maybe something like the proposal below could work as an alternative to the current CustomAllReduceComm interface.

Here is an example of the current usage:

https://github.com/NVIDIA/FasterTransformer/blob/c6e8f60ec40da218804a60e6aa986903e7fa8594/src/fastertransformer/layers/TensorParallelSiluFfnLayer.cc#L31C1-L61

template<typename T>
void TensorParallelSiluFfnLayer<T>::forward(TensorMap*          output_tensors,
                                            TensorMap*          input_tensors,
                                            const FfnWeight<T>* ffn_weights)
{
    FT_LOG_DEBUG("%s start", __PRETTY_FUNCTION__);
    Tensor       out_tensor   = output_tensors->at("ffn_output");
    const size_t token_num    = out_tensor.shape[0];
    const size_t hidden_units = out_tensor.shape[1];

    std::vector<Tensor> swap_tensors = {out_tensor};

    bool use_custom_all_reduce_kernel = false;
    if (enable_custom_all_reduce_ && custom_all_reduce_comm_ != nullptr) {
        use_custom_all_reduce_kernel =
            custom_all_reduce_comm_->swapInternalBuffer(&swap_tensors, token_num * hidden_units);
    }

    SiluFfnLayer<T>::forward(output_tensors, input_tensors, ffn_weights);

    T* ffn_out = out_tensor.getPtr<T>();
    if (do_all_reduce_ && tensor_para_.world_size_ > 1) {
        if (!use_custom_all_reduce_kernel) {
            ftNcclAllReduceSum(ffn_out, ffn_out, token_num * hidden_units, tensor_para_, SiluFfnLayer<T>::stream_);
        }
        else {
            custom_all_reduce_comm_->customAllReduce(token_num * hidden_units, SiluFfnLayer<T>::stream_);
        }
        sync_check_cuda_error();
    }
}

Maybe we could instead provide a getter in CustomAllReduceComm that returns a Tensor with the same shape, dtype, etc. as ffn_output, but pointing to the CustomAllReduceComm's peer_comm_buffer_ptrs. For the call to SiluFfnLayer<T>::forward(output_tensors, input_tensors, ffn_weights), we would swap the Tensor mapped to the ffn_output key in the output_tensors TensorMap for the Tensor obtained via that getter, so that the SiluFfnLayer writes its output to the correct destination. Finally, we swap ffn_output back.
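
A rough sketch of how that might look inside TensorParallelSiluFfnLayer<T>::forward is below; getPeerCommBufferTensor and TensorMap::replace are hypothetical names used for illustration, not existing FasterTransformer API:

template<typename T>
void TensorParallelSiluFfnLayer<T>::forward(TensorMap*          output_tensors,
                                            TensorMap*          input_tensors,
                                            const FfnWeight<T>* ffn_weights)
{
    Tensor       out_tensor   = output_tensors->at("ffn_output");
    const size_t token_num    = out_tensor.shape[0];
    const size_t hidden_units = out_tensor.shape[1];

    bool use_custom_all_reduce_kernel = false;
    if (enable_custom_all_reduce_ && custom_all_reduce_comm_ != nullptr) {
        // Hypothetical getter: a Tensor with the same shape/dtype as ffn_output,
        // but whose data points at this rank's peer comm buffer.
        Tensor comm_out = custom_all_reduce_comm_->getPeerCommBufferTensor(out_tensor);
        // Hypothetical TensorMap::replace: the FFN below then writes straight into the comm buffer.
        output_tensors->replace("ffn_output", comm_out);
        use_custom_all_reduce_kernel = true;
    }

    SiluFfnLayer<T>::forward(output_tensors, input_tensors, ffn_weights);

    if (do_all_reduce_ && tensor_para_.world_size_ > 1) {
        if (!use_custom_all_reduce_kernel) {
            T* ffn_out = out_tensor.getPtr<T>();
            ftNcclAllReduceSum(ffn_out, ffn_out, token_num * hidden_units, tensor_para_, SiluFfnLayer<T>::stream_);
        }
        else {
            custom_all_reduce_comm_->customAllReduce(token_num * hidden_units, SiluFfnLayer<T>::stream_);
            // Swap the original ffn_output entry back so callers see the expected buffer.
            output_tensors->replace("ffn_output", out_tensor);
        }
        sync_check_cuda_error();
    }
}

One open question with this shape of interface is how customAllReduce would learn the original destination buffer, since its current signature takes only an element count and a stream; the getter (or a companion setter) would probably have to record it.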

I don't think TensorMap currently has a way to delete keys or replace entries, though. Is that intentional, or would it be fine to add some functions for that?

The problem here is not the const void*; it simply means "pointer to something that is const". Notice that the const-ness is not on the pointer but on the data it points to, so reassigning the pointer itself is fine; it is writing through it to the buffer contents that would be problematic.

What we actually want to modify is the data member of output_tensors->at("ffn_output"); however, swapInternalBuffer is operating on a copy of a copy of it (out_tensor is a copy of the map entry, and swap_tensors holds another copy of out_tensor).

The interface of swapInternalBuffer cannot operate directly on output_tensors->at("ffn_output") because it takes a std::vector<Tensor>*, and you have to copy the Tensor header in order to put it into a std::vector<Tensor> anyway.
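
As a minimal standalone illustration of that copy-of-a-copy behavior (using simplified stand-ins for Tensor and swapInternalBuffer, not the real FasterTransformer types):

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in for fastertransformer::Tensor; only the data member matters here.
struct Tensor {
    const void* data;
};

// Simplified stand-in for swapInternalBuffer: repoints the tensor at an internal buffer.
void swapInternalBuffer(std::vector<Tensor>* tensors, const void* internal_buffer)
{
    (*tensors)[0].data = internal_buffer;  // only the copy inside the vector is modified
}

int main()
{
    int original_buffer = 0;
    int comm_buffer     = 0;

    std::unordered_map<std::string, Tensor> output_tensors;
    output_tensors.emplace("ffn_output", Tensor{&original_buffer});

    Tensor              out_tensor   = output_tensors.at("ffn_output");  // copy #1
    std::vector<Tensor> swap_tensors = {out_tensor};                     // copy #2

    swapInternalBuffer(&swap_tensors, &comm_buffer);

    // The map entry still points at original_buffer; only swap_tensors[0] was repointed.
    std::cout << "map entry:    " << output_tensors.at("ffn_output").data << "\n"
              << "swapped copy: " << swap_tensors[0].data << "\n";
    return 0;
}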

What we could do, short of changing the interface of CustomAllReduceComm, is to sync the data pointer manually after "swapping" on the copy:

// right after `swapInternalBuffer`
output_tensors->at("ffn_output").data = swap_tensors[0].data;
// ...
// right after `customAllReduce`
output_tensors->at("ffn_output").data = swap_tensors[0].data;
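
In the context of TensorParallelSiluFfnLayer<T>::forward above, that would look roughly like this (untested sketch; it assumes TensorMap::at returns a reference and that the Tensor data member is assignable, as the snippet above already does):

    bool use_custom_all_reduce_kernel = false;
    if (enable_custom_all_reduce_ && custom_all_reduce_comm_ != nullptr) {
        use_custom_all_reduce_kernel =
            custom_all_reduce_comm_->swapInternalBuffer(&swap_tensors, token_num * hidden_units);
        // First sync: propagate the swapped pointer to the real map entry,
        // so the FFN below writes into the comm buffer (no-op if the swap did not happen).
        output_tensors->at("ffn_output").data = swap_tensors[0].data;
    }

    SiluFfnLayer<T>::forward(output_tensors, input_tensors, ffn_weights);

    T* ffn_out = out_tensor.getPtr<T>();
    if (do_all_reduce_ && tensor_para_.world_size_ > 1) {
        if (!use_custom_all_reduce_kernel) {
            ftNcclAllReduceSum(ffn_out, ffn_out, token_num * hidden_units, tensor_para_, SiluFfnLayer<T>::stream_);
        }
        else {
            custom_all_reduce_comm_->customAllReduce(token_num * hidden_units, SiluFfnLayer<T>::stream_);
            // Second sync from the suggestion above: if customAllReduce swaps the copy's
            // pointer back to the original buffer, this restores the map entry as well.
            output_tensors->at("ffn_output").data = swap_tensors[0].data;
        }
        sync_check_cuda_error();
    }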

@lzhangzz, I misread the data type part, thanks for the correction! Your explanation makes sense. I'll play around with what you suggested and let you know if it fixes it.

Thanks @lzhangzz, it makes sense to me.

@PerkzZheng and @byshiue, can you please take a look at it? We need to:

1/ validate that we can reproduce the issue (but based on the code, I'd be surprised if it was not the case),
2/ validate that @lzhangzz 's proposal fixes the issue (I think it does),
3/ integrate the PR.

Thanks!
Julien

rkindi commented

Thank you for the fix!