What factors determine if a model or a layer behaves like a low- or high-pass filter?
waitingcheung opened this issue · 1 comment
Your paper reports that, in general, MSAs behave like low-pass filters (shape-biased) and Convs behave like high-pass filters (texture-biased). Recently I came across papers whose findings also report shape bias, and I would like to hear your thoughts on them.
Low-pass filters (shape-biased)
- MSAs (global or local)
- Large kernel Convs (ConvNeXt or RepLKNet)
- ResNet or DeiT trained on stylized ImageNet
- Masked Image Modeling
High-pass filters (texture-biased)
- 3x3 Convs in ResNet
These findings suggest that the factors affecting this behavior include spatial aggregation, kernel size, training data, and training procedure. It seems that only 3x3 Convs behave like high-pass filters, unless I am missing something. In another thread you mentioned that group size also makes a difference. I wonder how ResNet and ResNeXt differ in this respect; I suppose ResNeXt is also texture-biased.
I would appreciate your insights on what factors determine whether a model or a layer behaves like a low- or high-pass filter.
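For context on what I mean by low- vs high-pass behavior, here is a toy numpy sketch of my own (not from the paper): spatial averaging, loosely analogous to MSA-style aggregation, suppresses high-frequency energy, while a small difference kernel, loosely analogous to an edge-sensitive 3x3 Conv, suppresses low-frequency energy.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)  # white-noise probe: roughly flat spectrum

# Spatial averaging (analogous to MSA-style aggregation) vs. a
# difference kernel (analogous to an edge-sensitive small Conv).
low_pass = np.convolve(x, np.ones(9) / 9, mode="same")      # 9-tap box blur
high_pass = np.convolve(x, [-1.0, 2.0, -1.0], mode="same")  # discrete Laplacian

def high_freq_ratio(signal):
    """Fraction of spectral energy above half the Nyquist frequency."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    cut = len(power) // 2
    return power[cut:].sum() / power.sum()
```

Here `high_freq_ratio(low_pass)` falls below that of the raw input, while `high_freq_ratio(high_pass)` rises above it, which is the diagnostic I have in mind for each model/layer.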
Hi @waitingcheung,
The question you raise is certainly interesting and important. Unfortunately, however, it is difficult to answer in a single sentence. I believe spatial aggregation, depthwise separable operations, large kernel sizes, and shape-biased training datasets all increase the shape bias of neural nets. I would also expect vanilla CNNs and vanilla ViTs to be among the most texture-biased and shape-biased models, respectively. Thus, we can place vanilla CNNs, hybrid models, and vanilla ViTs on a spectrum of shape bias (instead of dichotomizing them).