Convolutional neural networks (CNNs), such as U-Net, have become the de facto
standard for medical image segmentation and have achieved considerable success.
However, CNN-based methods fail to model long-range dependencies and global
context because of the limited receptive field that is inherent to the
convolution operation. Hence, recent works have applied Transformer variants to
medical image segmentation, exploiting their innate ability to capture
long-range correlations through the attention mechanism. Although well
designed, most of these methods perform poorly at capturing local information,
which results in imprecise delineation of boundary areas. In this paper, we
propose a
contextual attention network to tackle the aforementioned limitations. The
proposed method uses the strength of the Transformer module to model the
long-range contextual dependency. Simultaneously, it utilizes the CNN encoder
to capture local semantic information. In addition, an object-level
representation is included to model the regional interaction map. The extracted
hierarchical features are then fed to the contextual attention module to
adaptively recalibrate the representation space using the local information.
The module then emphasizes the informative regions while taking into account
the long-range contextual dependency derived by the Transformer module. We
validate
our method on several large-scale public medical image segmentation datasets
and achieve state-of-the-art performance. The implementation code is available
at https://github.com/rezazad68/TMUnet.
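
To make the recalibration idea concrete, the following is a minimal sketch and
not the module implemented in the repository above. It assumes a simple
squeeze-and-excitation-style channel gate: each channel of a CNN feature map is
summarized by its mean (the local cue), a per-channel scalar stands in for the
Transformer output (the global cue), and a sigmoid of their sum rescales the
channel. The function name `contextual_recalibration`, its inputs, and the
gating formula are all illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def contextual_recalibration(local_features, global_context):
    """Rescale each channel using a local descriptor and a global cue.

    local_features: list of channels, each a flat list of floats
        (hypothetical stand-in for a CNN encoder feature map).
    global_context: one float per channel
        (hypothetical stand-in for a Transformer-derived summary).
    """
    if len(local_features) != len(global_context):
        raise ValueError("one global context value is needed per channel")
    recalibrated = []
    for channel, context in zip(local_features, global_context):
        # Local descriptor: average of the channel (global average pooling).
        local_descriptor = sum(channel) / len(channel)
        # Assumed gate: combine the local and global cues, squash to (0, 1).
        gate = sigmoid(local_descriptor + context)
        recalibrated.append([gate * value for value in channel])
    return recalibrated


# Toy input: two channels with four spatial positions each.
features = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
context = [-2.5, 3.0]
output = contextual_recalibration(features, context)
```

In this toy input the first channel is damped by its negative global cue, while
the second channel, which has the larger global cue, passes through almost
unchanged. The actual module may combine the two sources differently.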