Title: 0.5M Parameters Suffice: Uniting CNNs and Transformers for Ultra-Lightweight 3D Multimodal Brain Tumor Segmentation
Transformers have a natural advantage in global feature modeling due to their self-attention mechanism, while Convolutional Neural Networks (CNNs) rely on strong spatial inductive biases to capture local features efficiently with few parameters. For 3D brain tumor segmentation, both local and global features are crucial, and balancing high accuracy with low computational cost remains a critical challenge. To address this, we design an ultra-lightweight 3D brain tumor segmentation model, TransLiteUNet, a hybrid CNN-Transformer architecture that achieves accurate segmentation without pre-training. To improve parameter efficiency, we propose a 3D axial depthwise-separable convolutional residual structure (3DRes-ADS Block). We also introduce a lightweight LiteViT block that enhances global feature modeling at low computational cost. The complexity of TransLiteUNet (0.43M parameters, 14.98G FLOPs) and of its simplified variant, TransLiteUNet-s (0.23M parameters, 11.84G FLOPs), is far lower than that of state-of-the-art models such as CKD-TransBTS (81.60M parameters, 462.60G FLOPs) and 3D UX-Net (53.05M parameters, 1518.81G FLOPs). On the BraTS2020 dataset, TransLiteUNet achieves an average Dice score of 0.843 (ET: 0.769; TC: 0.850; WT: 0.911). Under the same conditions, its performance is comparable to that of mainstream models while requiring tens to hundreds of times fewer parameters and FLOPs.
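The abstract names the 3DRes-ADS Block but does not spell out its internals. The PyTorch sketch below illustrates one plausible reading, assuming "axial depthwise-separable" means factorizing a k×k×k depthwise kernel into three 1D depthwise convolutions (one per spatial axis) followed by a 1×1×1 pointwise convolution, wrapped in a residual connection. The class names (`AxialDWSepConv3d`, `Res3DADSBlock`) and layer choices (InstanceNorm3d, GELU) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class AxialDWSepConv3d(nn.Module):
    """Hypothetical axial depthwise-separable 3D convolution: a full
    k*k*k depthwise kernel is factorized into three 1D depthwise
    convolutions along D, H, and W, then a 1x1x1 pointwise convolution
    mixes channels. This cuts parameters from roughly C*k^3 to C*3k."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        p = k // 2
        # Depthwise 1D convolutions along the depth, height, and width axes.
        self.dw_d = nn.Conv3d(in_ch, in_ch, (k, 1, 1), padding=(p, 0, 0), groups=in_ch)
        self.dw_h = nn.Conv3d(in_ch, in_ch, (1, k, 1), padding=(0, p, 0), groups=in_ch)
        self.dw_w = nn.Conv3d(in_ch, in_ch, (1, 1, k), padding=(0, 0, p), groups=in_ch)
        # Pointwise convolution for cross-channel mixing.
        self.pw = nn.Conv3d(in_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw(self.dw_w(self.dw_h(self.dw_d(x))))


class Res3DADSBlock(nn.Module):
    """Hypothetical 3DRes-ADS block: axial depthwise-separable conv with
    normalization, activation, and an identity skip connection."""

    def __init__(self, ch: int, k: int = 3):
        super().__init__()
        self.conv = AxialDWSepConv3d(ch, ch, k)
        self.norm = nn.InstanceNorm3d(ch)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.norm(self.conv(x)))


if __name__ == "__main__":
    block = Res3DADSBlock(16)
    vol = torch.randn(1, 16, 32, 32, 32)  # (batch, channels, D, H, W)
    print(block(vol).shape)               # torch.Size([1, 16, 32, 32, 32])
    print(sum(p.numel() for p in block.parameters()))  # a few hundred params
```

Under these assumptions, the factorization is what makes sub-million parameter budgets plausible: a dense 3×3×3 depthwise kernel costs 27 weights per channel, whereas three axial 1D kernels cost only 9, with the pointwise 1×1×1 convolution carrying the channel mixing.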