SRCBTFusion-Net: An efficient Fusion Architecture via Stacked Residual Convolution Blocks and Transformer for Remote Sensing Image Semantic Segmentation
Convolutional neural networks (CNNs) and Transformer-based self-attention models have complementary strengths: the former excel at extracting local information, the latter at capturing global semantic information, so combining stacked residual convolution blocks (SRCB) with a Transformer is a natural design direction. How to integrate the two mechanisms efficiently to improve the segmentation of remote sensing (RS) images remains an open problem. We propose SRCBTFusion-Net, an efficient fusion of SRCB and Transformer, as a new semantic segmentation architecture for RS images. SRCBTFusion-Net adopts an encoder-decoder structure: the Transformer is embedded into the SRCB to form a two-stage encoder, and the encoded features are upsampled and fused with the multi-scale SRCB features to form the decoder. First, a semantic information enhancement module (SIEM) obtains global clues to enrich deep semantic information. Second, a relationship guidance module (RGM) re-encodes the decoder's upsampled feature maps to improve edge segmentation. Third, a multipath atrous self-attention module (MASM) strengthens the selection and weighting of low-level features, effectively reducing the confusion introduced by skip connections between low-level and high-level features. Finally, a multi-scale feature aggregation module (MFAM) enhances the extraction of semantic and contextual information, alleviating the loss of image feature information and improving the ability to distinguish similar categories. On the Vaihingen and Potsdam datasets, SRCBTFusion-Net outperforms state-of-the-art methods.
Our split Vaihingen and Potsdam datasets are available at https://www.aliyundrive.com/s/VjRwXPLYedt (extraction code: l2x4).
Then organize the datasets in the following structure so the code can find them:

```
datasets
├── Postdam
│   ├── origin
│   ├── train
│   │   ├── images
│   │   ├── labels
│   │   └── train_org.txt
│   └── val
│       ├── images
│       ├── labels
│       └── val_org.txt
└── Vaihingen
    ├── origin
    ├── train
    │   ├── images
    │   ├── labels
    │   └── train_org.txt
    └── val
        ├── images
        ├── labels
        └── val_org.txt
```
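The empty directory tree above can be scaffolded with a few lines of Python. This is a convenience sketch, not part of the repository; the `scaffold` function name is ours, and the directory names (including the repository's "Postdam" spelling) follow the layout shown above.

```python
from pathlib import Path

# Listing files that name the images used by each split, per the layout above.
SPLITS = {"train": "train_org.txt", "val": "val_org.txt"}

def scaffold(root: Path) -> None:
    """Create the empty datasets/ directory tree expected by the training code."""
    for name in ("Postdam", "Vaihingen"):
        ds = root / "datasets" / name
        (ds / "origin").mkdir(parents=True, exist_ok=True)
        for split, listing in SPLITS.items():
            for sub in ("images", "labels"):
                (ds / split / sub).mkdir(parents=True, exist_ok=True)
            (ds / split / listing).touch()  # empty placeholder listing file

if __name__ == "__main__":
    scaffold(Path("."))
```

After running it, copy the image tiles into `images/`, the label maps into `labels/`, and fill the `*_org.txt` listing files accordingly.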
If you don't want to train from scratch, you can use the weights we trained on the two datasets: https://pan.baidu.com/s/1VRXZ4uFhGcOZMmexmre4BA (extraction code: cfks).
Train with:

```
python transformerCNN/train.py
```
Comparison of different methods on the Potsdam and Vaihingen datasets:

| Method | Params (M) | Speed (FPS) | Flops (G) | Potsdam MIoU (%) | Vaihingen MIoU (%) |
| --- | --- | --- | --- | --- | --- |
| TransUNet | 76.77 | 19 | 15.51 | 76.86 | 74.30 |
| ABCNet | 28.57 | 26 | 7.24 | 74.89 | 70.55 |
| Deeplabv3+ | 39.76 | 32 | 43.30 | 77.31 | 74.70 |
| Swin-Unet | 41.42 | 22 | 0.02 | 59.72 | 57.19 |
| UNetformer | 24.19 | 23 | 6.03 | 77.73 | 74.95 |
| Segformer | 84.59 | 18 | 11.65 | 77.54 | 75.23 |
| SRCBTFusion-Net | 86.30 | 28 | 22.58 | 78.62 | 76.27 |
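MIoU in the table is the intersection-over-union averaged over classes. A minimal NumPy sketch of how it can be computed from flattened prediction and label maps (the `mean_iou` function name is ours, not from this repository):

```python
import numpy as np

def mean_iou(pred: np.ndarray, label: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union over classes, from integer class maps."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    idx = label.astype(int) * num_classes + pred.astype(int)
    cm = np.bincount(idx.ravel(), minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)
    inter = np.diag(cm)                    # true positives per class
    union = cm.sum(0) + cm.sum(1) - inter  # TP + FP + FN per class
    iou = inter / np.maximum(union, 1)     # guard against empty classes
    return float(iou.mean())

# Toy example: 2 classes, 4 pixels, one pixel misclassified.
pred = np.array([0, 1, 1, 0])
label = np.array([0, 1, 0, 0])
print(mean_iou(pred, label, num_classes=2))
```

In practice the confusion matrix is accumulated over all validation tiles before averaging, so rare classes are not skewed by per-image statistics.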
Fig. 1. Examples of semantic segmentation results of different models on the Potsdam dataset; the last column shows the predictions of our SRCBTFusion-Net, and GT denotes the ground-truth label.
Fig. 2. Examples of semantic segmentation results of different models on the Vaihingen dataset; the last column shows the predictions of our SRCBTFusion-Net, and GT denotes the ground-truth label.
```
@ARTICLE{10328787,
  author={Chen, Junsong and Yi, Jizheng and Chen, Aibin and Lin, Hui},
  journal={IEEE Transactions on Geoscience and Remote Sensing},
  title={SRCBTFusion-Net: An Efficient Fusion Architecture via Stacked Residual Convolution Blocks and Transformer for Remote Sensing Image Semantic Segmentation},
  year={2023},
  volume={61},
  number={},
  pages={1-16},
  doi={10.1109/TGRS.2023.3336689}
}
```
- Python 3.7.0+
- PyTorch 1.8.2
- CUDA 12.2
- tqdm 4.63.0
- numpy 1.21.6
- ml-collections
- scipy
- collections (Python standard library)
- logging (Python standard library)