Transformer-Based Model with Dynamic Attention Pyramid Head for Semantic Segmentation of VHR Remote Sensing Imagery
Convolutional neural networks have long dominated semantic segmentation of very-high-resolution (VHR) remote sensing (RS) images. However, restricted by the fixed receptive field of the convolution operation, convolution-based models cannot directly capture long-range contextual information. Meanwhile, Swin Transformer shows great potential for modeling long-range dependencies.
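To make the receptive-field limitation concrete, here is a minimal sketch that computes the theoretical receptive field of a stack of convolution layers with the standard recurrence. The function name `receptive_field` and the example layer stack are hypothetical illustrations, not part of SUD-Net.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of stacked conv layers.

    layers: list of (kernel_size, stride) tuples, ordered input to output.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # spacing, in input pixels, between adjacent positions at this depth
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride             # striding spreads later positions further apart
    return rf


# Hypothetical stack: five 3x3 convolutions with stride 1
print(receptive_field([(3, 1)] * 5))
```

Each additional 3x3, stride-1 layer widens the field by only two pixels, so a purely convolutional model needs many layers or aggressive downsampling before one output position can see distant parts of a large VHR tile.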
Nevertheless, Swin Transformer breaks images into patches that are treated as one-dimensional sequences, without considering the loss of position information inside each patch. Therefore, inspired by Swin Transformer and U-Net, we propose SUD-Net (Swin transformer-based Unet-like with Dynamic attention pyramid head Network), a new U-shaped architecture that combines Swin Transformer blocks and convolution layers through a dual encoder and an upsampling decoder, with a Dynamic Attention Pyramid Head (DAPH) attached to the backbone. First, we propose a dual encoder structure combining Swin Transformer blocks and reslayers in reverse order to complement global semantics with detailed representations.
Second, to address the spatial loss problem inside each patch, we design a Multi-Path Fusion Model (MPFM) with a specially devised Patch Attention (PA) to encode the position information of patches and adaptively fuse features of different scales through attention mechanisms. Third, a Dynamic Attention Pyramid Head is constructed with deformable convolution to dynamically aggregate effective and important semantic information. SUD-Net achieves 92.51% mF1, 86.4% mIoU, and 92.98% OA on the ISPRS Potsdam dataset and 89.49% mF1, 81.26% mIoU, and 90.95% OA on the ISPRS Vaihingen dataset.
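For readers unfamiliar with the three reported metrics, the sketch below shows how mean F1 (mF1), mean intersection over union (mIoU), and overall accuracy (OA) are commonly derived from a per-pixel confusion matrix. It is an illustration under assumptions: the function name `segmentation_metrics` and the toy matrix are hypothetical, it averages over every class in the matrix, and the text above does not state which classes or evaluation protocol the reported figures use.

```python
import numpy as np


def segmentation_metrics(conf):
    """Compute mF1, mIoU and OA from a confusion matrix.

    conf[i, j] = number of pixels whose true class is i and predicted class is j.
    """
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)            # correctly labelled pixels per class
    fp = conf.sum(axis=0) - tp    # predicted as the class but belonging to another
    fn = conf.sum(axis=1) - tp    # belonging to the class but predicted as another
    eps = 1e-12                   # guards against division by zero for empty classes

    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    oa = tp.sum() / conf.sum()

    # Assumption: unweighted mean over all classes in the matrix
    return {"mF1": f1.mean(), "mIoU": iou.mean(), "OA": oa}


# Hypothetical two-class example
toy = [[8, 2],
       [1, 9]]
print(segmentation_metrics(toy))
```

Because IoU penalises false positives and false negatives more heavily than F1 does, mIoU is typically the lowest of the three scores for the same predictions, which matches the ordering of the figures reported above.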