[bug] CustomAllReduceComm swapInternalBuffer is not safe (modifying const pointer).
rkindi opened this issue · 5 comments
Branch/Tag/Commit: main
Docker Image Version: N/A
GPU name: A100
CUDA Driver: N/A
Reproduced Steps
This line is not safe because it writes to a Tensor struct's data field, which is a const void* (modifying a constant value is undefined behavior). When I run a standalone script to test the custom all reduce, I can print the tensor's data attribute before and after the call to swapInternalBuffer and see that no change is made. I include my script below:
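For context, the relevant part of the Tensor header looks roughly like this (a simplified sketch; the real struct carries more members):

    struct Tensor {
        // ... memory type, data type, etc. omitted ...
        std::vector<size_t> shape;   // e.g. {token_num, hidden_units}
        const void*         data;    // the field swapInternalBuffer attempts to update
    };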
repro_issue671_fastertransformer.zip
Instructions:
- python3 make_npy_tensors.py
- Run main.cu
- python3 validate_npy_tensors.py
Output from my machine for main.cu:
ar_out_buffer.data (before): 0x7f65d5002400.
ar_out_buffer.data (after): 0x7f65d5002400.
DONE
We can see that the Tensor's data pointer is not changed. This prevents us from using the custom all reduce, as there is no way to write to the all-reduce input buffers.
I don't have a complete high-level overview of FasterTransformer's design, so I might be missing something, but maybe something like the following could work as an alternative to the current CustomAllReduceComm interface.
Here is an example of the current usage:
template<typename T>
void TensorParallelSiluFfnLayer<T>::forward(TensorMap*          output_tensors,
                                            TensorMap*          input_tensors,
                                            const FfnWeight<T>* ffn_weights)
{
    FT_LOG_DEBUG("%s start", __PRETTY_FUNCTION__);
    Tensor       out_tensor   = output_tensors->at("ffn_output");
    const size_t token_num    = out_tensor.shape[0];
    const size_t hidden_units = out_tensor.shape[1];

    std::vector<Tensor> swap_tensors = {out_tensor};

    bool use_custom_all_reduce_kernel = false;
    if (enable_custom_all_reduce_ && custom_all_reduce_comm_ != nullptr) {
        use_custom_all_reduce_kernel =
            custom_all_reduce_comm_->swapInternalBuffer(&swap_tensors, token_num * hidden_units);
    }

    SiluFfnLayer<T>::forward(output_tensors, input_tensors, ffn_weights);

    T* ffn_out = out_tensor.getPtr<T>();
    if (do_all_reduce_ && tensor_para_.world_size_ > 1) {
        if (!use_custom_all_reduce_kernel) {
            ftNcclAllReduceSum(ffn_out, ffn_out, token_num * hidden_units, tensor_para_, SiluFfnLayer<T>::stream_);
        }
        else {
            custom_all_reduce_comm_->customAllReduce(token_num * hidden_units, SiluFfnLayer<T>::stream_);
        }
        sync_check_cuda_error();
    }
}
Maybe we could instead provide a getter in CustomAllReduceComm which returns a Tensor with the same shape, dtype, etc. as ffn_output but pointing to the CustomAllReduceComm's peer_comm_buffer_ptrs. We would then swap the existing Tensor mapped to the ffn_output key in the output_tensors TensorMap for the Tensor obtained via the getter, for the duration of the call to SiluFfnLayer<T>::forward(output_tensors, input_tensors, ffn_weights), so that SiluFfnLayer writes its output to the correct destination. Finally, we swap ffn_output back.
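A rough sketch of what that could look like; getPeerCommBufferTensor and TensorMap::replace are hypothetical names, not existing APIs (the latter is exactly the kind of helper asked about below):

    Tensor original_out = output_tensors->at("ffn_output");

    bool use_custom_all_reduce_kernel = false;
    if (enable_custom_all_reduce_ && custom_all_reduce_comm_ != nullptr) {
        // hypothetical getter: same shape/dtype as original_out, but data points into peer_comm_buffer_ptrs
        Tensor comm_tensor = custom_all_reduce_comm_->getPeerCommBufferTensor(original_out);
        output_tensors->replace("ffn_output", comm_tensor);   // hypothetical TensorMap helper
        use_custom_all_reduce_kernel = true;
    }

    // SiluFfnLayer now writes its output directly into the custom all-reduce buffer.
    SiluFfnLayer<T>::forward(output_tensors, input_tensors, ffn_weights);

    if (use_custom_all_reduce_kernel) {
        custom_all_reduce_comm_->customAllReduce(token_num * hidden_units, SiluFfnLayer<T>::stream_);
        output_tensors->replace("ffn_output", original_out);  // swap ffn_output back
    }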
I don't think TensorMap currently has a way to delete keys or replace entries, though. Is that intentional, or would it be fine to add some functions for that?
The problem here is not the const void*: it simply means "pointer to something that is const". Notice that the const-ness is not on the pointer but on the data it points to (so it is modifying the buffer contents that would be problematic, not reassigning the pointer itself). What we actually want to modify is the data member of output_tensors->at("ffn_output"); however, swapInternalBuffer is operating on a copy of a copy of output_tensors->at("ffn_output").
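A minimal standalone illustration of that distinction (plain C++, no FasterTransformer types):

    int main()
    {
        float buffer_a[4] = {0.f};
        float buffer_b[4] = {0.f};

        const void* data = buffer_a;   // "pointer to const": the bytes it points to are const
        data = buffer_b;               // OK: the pointer variable itself can be reassigned

        // static_cast<float*>(data)[0] = 1.f;   // error: static_cast cannot cast away const

        void* const fixed = buffer_a;  // "const pointer": the pointer itself cannot change
        // fixed = buffer_b;           // error: assignment of read-only variable
        static_cast<float*>(fixed)[0] = 1.f;     // OK: the pointed-to data is not const

        return 0;
    }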
The interface of swapInternalBuffer cannot operate directly on output_tensors->at("ffn_output") because it takes a vector<Tensor>*. You have to make a copy of a Tensor header in order to put it into a std::vector<Tensor> anyway.
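A small self-contained sketch of that copy-of-a-copy problem, using a stand-in Tensor struct rather than FasterTransformer's: reassigning data on the vector element leaves the Tensor stored in the map untouched, which is why the repro above prints the same pointer twice.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Stand-in for the Tensor header: a small value type with a pointer-to-const data member.
    struct Tensor {
        const void* data = nullptr;
    };

    int main()
    {
        int original_buffer = 0;
        int comm_buffer     = 0;

        std::map<std::string, Tensor> output_tensors;
        output_tensors["ffn_output"] = Tensor{&original_buffer};

        Tensor out_tensor = output_tensors.at("ffn_output");  // copy #1 of the map entry
        std::vector<Tensor> swap_tensors = {out_tensor};       // copy #2, placed in the vector

        swap_tensors[0].data = &comm_buffer;  // effectively all swapInternalBuffer can reach

        // The Tensor stored in the map is unaffected; it still points at original_buffer.
        std::cout << "map entry:    " << output_tensors.at("ffn_output").data << "\n";
        std::cout << "vector entry: " << swap_tensors[0].data << "\n";
        return 0;
    }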
What we could do, besides changing the interface of CustomAllReduceComm, is to sync the data pointer manually after "swapping" on the copy:
// right after `swapInternalBuffer`
output_tensors->at("ffn_output").data = swap_tensors[0].data;
// ...
// right after `customAllReduce`
output_tensors->at("ffn_output").data = swap_tensors[0].data;
@lzhangzz, I misread the data type part, thanks for the correction! Your explanation makes sense. I'll play around with what you suggested and let you know if it fixes it.
Thanks @lzhangzz, it makes sense to me.
@PerkzZheng and @byshiue, can you please take a look at it? We have to
1/ validate that we can reproduce the issue (but based on the code, I'd be surprised if that were not the case),
2/ validate that @lzhangzz's proposal fixes the issue (I think it does),
3/ integrate the PR.
Thanks!
Julien
Thank you for the fix!