Your implementation is not the same as the original paper described in section 2.1. Offset output channel dimension should be 2 x N, where N is the conv kernel size k*k in 2D case. That is, the offset is shared across the channel dimension. You could refer to the implementation of https://github.com/ChunhuanLin/deform_conv_pytorch