With a kernel height of 5, a stride of 2, and 16 output channels, I tested a range of input channel counts and input heights and found that performance is poor when input_channel % 4 == 1 or input_channel % 4 == 2. Is this a problem with my usage, or a problem with the operator itself?
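
For anyone who wants to reproduce this, a sweep like the one described above could be sketched roughly as follows. This is only a minimal sketch: `run_op` is a hypothetical placeholder for whatever call invokes the operator (it is not a real library function), and the keyword names are assumptions made for illustration. It times each (input channel, input height) pair and groups the median timings by input_channel % 4, so any gap between the residue classes shows up directly.

```python
import time
from collections import defaultdict
from statistics import median


def sweep(run_op, input_channels, input_heights, repeats=5):
    """Time run_op for every (input_channel, input_height) pair.

    run_op is a hypothetical callable standing in for the operator under
    test; the keyword names passed to it are assumptions for illustration.
    Returns a dict mapping input_channel % 4 to the median time in seconds.
    """
    by_residue = defaultdict(list)
    for c in input_channels:
        for h in input_heights:
            samples = []
            for _ in range(repeats):
                start = time.perf_counter()
                # Fixed settings taken from the question: kernel height 5,
                # stride 2, 16 output channels.
                run_op(
                    input_channel=c,
                    input_height=h,
                    kernel_height=5,
                    stride=2,
                    output_channel=16,
                )
                samples.append(time.perf_counter() - start)
            # Keep one robust timing per configuration.
            by_residue[c % 4].append(median(samples))
    # Summarise each residue class by the median over its configurations.
    return {r: median(times) for r, times in sorted(by_residue.items())}
```

Calling `sweep(my_op, range(1, 65), [32, 64, 128])` with a real operator wrapper in place of the placeholder `my_op` would give one number per residue class, which makes it easier to tell a consistent % 4 effect from measurement noise.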