UNet++: A Nested U-Net Architecture for Medical Image Segmentation


Published in final edited form as: Deep Learn Med Image Anal Multimodal Learn Clin Decis Support (2018), 11045: 3–11. Published online 2018 Sep 20. doi: 10.1007/978-3-030-00889-5_1. Author manuscript; available in PMC 2020 Jul 1.

PMCID: PMC7329239 · NIHMSID: NIHMS1600717 · PMID: 32613207

Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang


Abstract

In this paper, we present UNet++, a new, more powerful architecture for medical image segmentation. Our architecture is essentially a deeply-supervised encoder-decoder network where the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways. The re-designed skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks. We argue that the optimizer faces an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. We have evaluated UNet++ against the U-Net and wide U-Net architectures across multiple medical image segmentation tasks: nodule segmentation in low-dose CT scans of the chest, nuclei segmentation in microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos. Our experiments demonstrate that UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.

1. Introduction

The state-of-the-art models for image segmentation are variants of the encoder-decoder architecture, such as U-Net [9] and the fully convolutional network (FCN) [8]. These encoder-decoder networks share a key feature: skip connections, which combine deep, semantic, coarse-grained feature maps from the decoder sub-network with shallow, low-level, fine-grained feature maps from the encoder sub-network. Skip connections have proved effective in recovering fine-grained details of the target objects, generating segmentation masks with fine details even against complex backgrounds. Skip connections are also fundamental to the success of instance-level segmentation models such as Mask R-CNN, enabling the segmentation of occluded objects. Arguably, image segmentation in natural images has reached a satisfactory level of performance, but do these models meet the strict segmentation requirements of medical images?

Segmenting lesions or abnormalities in medical images demands a higher level of accuracy than is required in natural images. While a precise segmentation mask may not be critical in natural images, even marginal segmentation errors in medical images can lead to a poor user experience in clinical settings. For instance, the subtle spiculation patterns around a nodule may indicate nodule malignancy; their exclusion from the segmentation masks would therefore lower the credibility of the model from the clinical perspective. Furthermore, inaccurate segmentation may also lead to a major change in the subsequent computer-generated diagnosis. For example, an erroneous measurement of nodule growth in longitudinal studies can result in the assignment of an incorrect Lung-RADS category to a screening patient. It is therefore desirable to devise more effective image segmentation architectures that can recover the fine details of the target objects in medical images.

To address the need for more accurate segmentation in medical images, we present UNet++, a new segmentation architecture based on nested and dense skip connections. The underlying hypothesis behind our architecture is that the model can more effectively capture fine-grained details of the foreground objects when high-resolution feature maps from the encoder network are gradually enriched prior to fusion with the corresponding semantically rich feature maps from the decoder network. We argue that the network faces an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. This is in contrast to the plain skip connections commonly used in U-Net, which directly fast-forward high-resolution feature maps from the encoder to the decoder network, resulting in the fusion of semantically dissimilar feature maps. According to our experiments, the suggested architecture is effective, yielding a significant performance gain over U-Net and wide U-Net.

2. Related Work

Long et al. [8] first introduced the fully convolutional network (FCN), while U-Net was introduced by Ronneberger et al. [9]. The two share a key idea: skip connections. In FCN, up-sampled feature maps are summed with feature maps skipped from the encoder, while U-Net concatenates them and adds convolutions and non-linearities between each up-sampling step. Skip connections have been shown to help recover the full spatial resolution at the network output, making fully convolutional methods suitable for semantic segmentation. Inspired by the DenseNet architecture [5], Li et al. [7] proposed H-DenseUNet for liver and liver tumor segmentation. In the same spirit, Drozdzal et al. [2] systematically investigated the importance of skip connections and introduced short skip connections within the encoder. Despite the minor differences between the above architectures, they all tend to fuse semantically dissimilar feature maps from the encoder and decoder sub-networks, which, according to our experiments, can degrade segmentation performance.

Two other recent related works are GridNet [3] and Mask R-CNN [4]. GridNet is an encoder-decoder architecture wherein the feature maps are wired in a grid fashion, generalizing several classical segmentation architectures. GridNet, however, lacks up-sampling layers between skip connections, and thus it does not represent UNet++. Mask R-CNN is perhaps the most important meta-framework for object detection, classification, and segmentation. We would like to note that UNet++ can be readily deployed as the backbone architecture in Mask R-CNN by simply replacing the plain skip connections with the suggested nested dense skip pathways. Due to limited space, we were not able to include results of Mask R-CNN with UNet++ as the backbone architecture; interested readers can refer to the supplementary material for further details.

3. Proposed Network Architecture: UNet++

Fig. 1a shows a high-level overview of the suggested architecture. As seen, UNet++ starts with an encoder sub-network or backbone followed by a decoder sub-network. What distinguishes UNet++ from U-Net (the black components in Fig. 1a) is the re-designed skip pathways (shown in green and blue) that connect the two sub-networks and the use of deep supervision (shown in red).


Fig. 1:

(a) UNet++ consists of an encoder and decoder that are connected through a series of nested dense convolutional blocks. The main idea behind UNet++ is to bridge the semantic gap between the feature maps of the encoder and decoder prior to fusion. For example, the semantic gap between X^{0,0} and X^{1,3} is bridged using a dense convolution block with three convolution layers. In the graphical abstract, black indicates the original U-Net; green and blue show dense convolution blocks on the skip pathways; and red indicates deep supervision. The red, green, and blue components distinguish UNet++ from U-Net. (b) Detailed analysis of the first skip pathway of UNet++. (c) UNet++ can be pruned at inference time if trained with deep supervision.

3.1. Re-designed skip pathways

Re-designed skip pathways transform the connectivity of the encoder and decoder sub-networks. In U-Net, the feature maps of the encoder are received directly in the decoder; in UNet++, by contrast, they undergo a dense convolution block whose number of convolution layers depends on the pyramid level. For example, the skip pathway between nodes X^{0,0} and X^{1,3} consists of a dense convolution block with three convolution layers, where each convolution layer is preceded by a concatenation layer that fuses the output of the previous convolution layer in the same dense block with the corresponding up-sampled output of the lower dense block. Essentially, the dense convolution block brings the semantic level of the encoder feature maps closer to that of the feature maps awaiting in the decoder. The hypothesis is that the optimizer faces an easier optimization problem when the received encoder feature maps and the corresponding decoder feature maps are semantically similar.

Formally, we formulate the skip pathway as follows: let x^{i,j} denote the output of node X^{i,j}, where i indexes the down-sampling layer along the encoder and j indexes the convolution layer of the dense block along the skip pathway. The stack of feature maps represented by x^{i,j} is computed as

$$
x^{i,j} =
\begin{cases}
\mathcal{H}\left(x^{i-1,j}\right), & j = 0 \\[4pt]
\mathcal{H}\left(\left[\left[x^{i,k}\right]_{k=0}^{j-1},\; \mathcal{U}\left(x^{i+1,j-1}\right)\right]\right), & j > 0
\end{cases}
\tag{1}
$$

where the function H(·) is a convolution operation followed by an activation function, U(·) denotes an up-sampling layer, and [·] denotes the concatenation layer. Basically, nodes at level j = 0 receive only one input, from the previous layer of the encoder; nodes at level j = 1 receive two inputs, both from the encoder sub-network but at two consecutive levels; and nodes at level j > 1 receive j + 1 inputs, of which j inputs are the outputs of the previous j nodes in the same skip pathway and the last is the up-sampled output from the lower skip pathway. All prior feature maps accumulate and arrive at the current node because we use a dense convolution block along each skip pathway. Fig. 1b further clarifies Eq. 1 by showing how the feature maps travel through the top skip pathway of UNet++.
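To make Eq. 1 concrete, below is a minimal PyTorch sketch of a single node computation for j > 0. The names (ConvBlock, node_forward) and the nearest-neighbor up-sampling are illustrative assumptions rather than the authors' released API:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """H(.) in Eq. 1: a 3x3 convolution followed by an activation."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.op = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.op(x)

def node_forward(h, same_pathway_outputs, below_output):
    """Compute x^{i,j} for j > 0 per Eq. 1.

    h                     -- the ConvBlock H(.) belonging to node X^{i,j}
    same_pathway_outputs  -- [x^{i,0}, ..., x^{i,j-1}] from the same skip pathway
    below_output          -- x^{i+1,j-1} from the lower skip pathway
    """
    up = nn.functional.interpolate(below_output, scale_factor=2)  # U(.)
    fused = torch.cat(same_pathway_outputs + [up], dim=1)         # [.] concatenation
    return h(fused)
```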

3.2. Deep supervision

We propose to use deep supervision [6] in UNet++, enabling the model to operate in two modes: 1) an accurate mode, wherein the outputs from all segmentation branches are averaged; and 2) a fast mode, wherein the final segmentation map is selected from only one of the segmentation branches, the choice of which determines the extent of model pruning and the speed gain. Fig. 1c shows how the choice of segmentation branch in fast mode results in architectures of varying complexity.
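The two modes differ only in how the branch outputs are combined at inference time. The following sketch assumes the model returns the four sigmoid probability maps x^{0,1}, ..., x^{0,4} as a list; the function name and signature are ours, not from the released code:

```python
import torch

def final_map(branch_outputs, mode="accurate", branch=4):
    """branch_outputs: list of 4 probability maps [x^{0,1}, ..., x^{0,4}]."""
    if mode == "accurate":
        # Accurate mode: average the outputs of all segmentation branches.
        return torch.stack(branch_outputs).mean(dim=0)
    # Fast mode: keep only one branch; nodes feeding deeper branches
    # can then be pruned away entirely (Fig. 1c).
    return branch_outputs[branch - 1]
```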

Owing to the nested skip pathways, UNet++ generates full-resolution feature maps at multiple semantic levels, {x^{0,j}, j ∈ {1, 2, 3, 4}}, which are amenable to deep supervision. We add a loss combining binary cross-entropy and the Dice coefficient to each of the above four semantic levels, defined as:

$$
\mathcal{L}(Y, \hat{Y}) = -\frac{1}{N} \sum_{b=1}^{N} \left( \frac{1}{2} \cdot Y_b \cdot \log \hat{Y}_b + \frac{2 \cdot Y_b \cdot \hat{Y}_b}{Y_b + \hat{Y}_b} \right)
\tag{2}
$$

where Ŷ_b and Y_b denote the flattened predicted probabilities and the flattened ground truths of the b-th image, respectively, and N indicates the batch size.
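As a concrete reference, here is one possible PyTorch reading of Eq. 2. The paper does not spell out how the per-pixel products are reduced, so we assume a per-image mean for the cross-entropy term and the usual sum-based soft Dice term, with a small epsilon (our addition) for numerical stability:

```python
import torch

def bce_dice_loss(y_pred, y_true, eps=1e-7):
    """One reading of Eq. 2. y_pred holds sigmoid probabilities and
    y_true binary masks, both of shape (N, ...) with batch size N."""
    n = y_true.shape[0]
    y_pred = y_pred.reshape(n, -1).clamp(eps, 1.0 - eps)  # flatten per image
    y_true = y_true.reshape(n, -1)
    # (1/2) * Y_b * log(Y_hat_b), reduced with a per-pixel mean (assumption)
    bce = 0.5 * (y_true * torch.log(y_pred)).mean(dim=1)
    # 2 * Y_b * Y_hat_b / (Y_b + Y_hat_b): the soft Dice coefficient
    dice = 2.0 * (y_true * y_pred).sum(dim=1) / ((y_true + y_pred).sum(dim=1) + eps)
    return -(bce + dice).mean()  # negate and average over the batch, per Eq. 2
```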

In summary, as depicted in Fig. 1a, UNet++ differs from the original U-Net in three ways: 1) it has convolution layers on the skip pathways (shown in green), which bridge the semantic gap between encoder and decoder feature maps; 2) it has dense skip connections on the skip pathways (shown in blue), which improve gradient flow; and 3) it uses deep supervision (shown in red), which, as shown in Section 4, enables model pruning and improves performance or, in the worst case, achieves performance comparable to using only one loss layer.

4. Experiments

Datasets:

As shown in Table 1, we use four medical imaging datasets for model evaluation, covering lesions/organs from different medical imaging modalities. For further details about datasets and the corresponding data pre-processing, we refer the readers to the supplementary material.

Table 1:

The image segmentation datasets used in our experiments.

Dataset      | Images | Input Size | Modality   | Provider
cell nuclei  | 670    | 96×96      | microscopy | Data Science Bowl 2018
colon polyp  | 7,379  | 224×224    | RGB video  | ASU-Mayo [10,11]
liver        | 331    | 512×512    | CT         | MICCAI 2018 LiTS Challenge
lung nodule  | 1,012  | 64×64×64   | CT         | LIDC-IDRI [1]

Baseline models:

For comparison, we used the original U-Net and a customized wide U-Net architecture. We chose U-Net because it is a common performance baseline for image segmentation. We also designed a wide U-Net with a number of parameters similar to that of our suggested architecture, to ensure that the performance gain yielded by our architecture is not simply due to an increased number of parameters. Table 2 details the U-Net and wide U-Net architectures.

Table 2:

Number of convolutional kernels in U-Net and wide U-Net.

encoder / decoder | X^{0,0}/X^{0,4} | X^{1,0}/X^{1,3} | X^{2,0}/X^{2,2} | X^{3,0}/X^{3,1} | X^{4,0}/X^{4,0}
U-Net             | 32              | 64              | 128             | 256             | 512
wide U-Net        | 35              | 70              | 140             | 280             | 560

Implementation details:

We monitored the Dice coefficient and Intersection over Union (IoU), and used an early-stopping mechanism on the validation set. We used the Adam optimizer with a learning rate of 3e-4. Architecture details for U-Net and wide U-Net are shown in Table 2. UNet++ is constructed from the original U-Net architecture. All convolutional layers along a skip pathway (X^{i,j}) use k kernels of size 3×3 (or 3×3×3 for 3D lung nodule segmentation), where k = 32 × 2^i. To enable deep supervision, a 1×1 convolutional layer followed by a sigmoid activation function was appended to each of the target nodes {x^{0,j} | j ∈ {1, 2, 3, 4}}. As a result, UNet++ generates four segmentation maps for a given input image, which are then averaged to generate the final segmentation map. More details can be found at github.com/Nested-UNet.
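The per-level widths and the deep-supervision heads described above can be written down in a few lines. This PyTorch sketch mirrors the stated hyper-parameters (k = 32 × 2^i, and a 1×1 convolution plus sigmoid per target node); the variable names are ours:

```python
import torch.nn as nn

DEPTH = 5
widths = [32 * 2**i for i in range(DEPTH)]  # k = 32 * 2^i -> [32, 64, 128, 256, 512]

# One deep-supervision head per target node x^{0,j}, j in {1, 2, 3, 4}:
# a 1x1 convolution mapping the 32 top-level channels to one probability map.
heads = nn.ModuleList(
    nn.Sequential(nn.Conv2d(widths[0], 1, kernel_size=1), nn.Sigmoid())
    for _ in range(4)
)
```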

Results:

Table 3 compares U-Net, wide U-Net, and UNet++ in terms of the number of parameters and segmentation accuracy for the tasks of lung nodule segmentation, colon polyp segmentation, liver segmentation, and cell nuclei segmentation. As seen, wide U-Net consistently outperforms U-Net, except for liver segmentation, where the two architectures perform comparably. This improvement is attributed to the larger number of parameters in wide U-Net. UNet++ without deep supervision achieves a significant performance gain over both U-Net and wide U-Net, yielding average improvements of 2.8 and 3.3 IoU points, respectively. UNet++ with deep supervision exhibits an average improvement of 0.6 points over UNet++ without deep supervision. Specifically, the use of deep supervision leads to marked improvement for liver and lung nodule segmentation, but such improvement vanishes for cell nuclei and colon polyp segmentation. This is because polyps and the liver appear at varying scales in video frames and CT slices, so a multi-scale approach using all segmentation branches (deep supervision) is essential for accurate segmentation. Fig. 2 shows a qualitative comparison of the results of U-Net, wide U-Net, and UNet++.


Fig. 2:

Qualitative comparison between U-Net, wide U-Net, and UNet++, showing segmentation results for polyp, liver, and cell nuclei datasets (2D-only for a distinct visualization).

Table 3:

Segmentation results (IoU: %) for U-Net, wide U-Net and our suggested architecture UNet++ with and without deep supervision (DS).

Architecture   | Params | cell nuclei | colon polyp | liver | lung nodule
U-Net [9]      | 7.76M  | 90.77       | 30.08       | 76.62 | 71.47
wide U-Net     | 9.13M  | 90.92       | 30.14       | 76.58 | 73.38
UNet++ w/o DS  | 9.04M  | 92.63       | 33.45       | 79.70 | 76.44
UNet++ w/ DS   | 9.04M  | 92.52       | 32.12       | 82.90 | 77.21

Model pruning:

Fig. 3 shows the segmentation performance of UNet++ after applying different levels of pruning. We use UNet++ L^i to denote UNet++ pruned at level i (see Fig. 1c for further details). As seen, UNet++ L^3 achieves on average a 32.2% reduction in inference time while degrading IoU by only 0.6 points. More aggressive pruning further reduces the inference time, but at the cost of significant accuracy degradation.
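Under our reading of Fig. 1c, pruning at level i retains exactly the nodes X^{a,b} with a + b ≤ i, since the fast-mode output x^{0,i} recursively depends on nothing deeper. A small sketch of that index set (the function name is ours, for illustration):

```python
def retained_nodes(level):
    """Nodes X^{a,b} kept in UNet++ L^level: the output x^{0,level}
    recursively depends only on nodes with a + b <= level, so all
    deeper nodes can be dropped at inference time."""
    return [(a, b) for a in range(level + 1) for b in range(level + 1 - a)]

# Example: UNet++ L^1 keeps X^{0,0}, X^{0,1}, and X^{1,0}.
print(retained_nodes(1))  # [(0, 0), (0, 1), (1, 0)]
```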


Fig. 3:

Complexity, speed, and accuracy of UNet++ after pruning on the (a) cell nuclei, (b) colon polyp, (c) liver, and (d) lung nodule segmentation tasks. The inference time is the time taken to process 10k test images using one NVIDIA TITAN X (Pascal) GPU with 12 GB memory.

5. Conclusion

To address the need for more accurate medical image segmentation, we proposed UNet++. The suggested architecture takes advantage of re-designed skip pathways and deep supervision. The re-designed skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks, resulting in a possibly simpler optimization problem for the optimizer to solve. Deep supervision also enables more accurate segmentation particularly for lesions that appear at multiple scales such as polyps in colonoscopy videos. We evaluated UNet++ using four medical imaging datasets covering lung nodule segmentation, colon polyp segmentation, cell nuclei segmentation, and liver segmentation. Our experiments demonstrated that UNet++ with deep supervision achieved an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.

Acknowledgments

This research has been partially supported by the NIH under Award Number R01HL128785, and by ASU and Mayo Clinic through a Seed Grant and an Innovation Grant. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

References

1. Armato SG, McLennan G, Bidaut L, McNitt-Gray MF, Meyer CR, Reeves AP, Zhao B, Aberle DR, Henschke CI, Hoffman EA, et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics, 38(2):915–931, 2011.

2. Drozdzal M, Vorontsov E, Chartrand G, Kadoury S, and Pal C. The importance of skip connections in biomedical image segmentation. In Deep Learning and Data Labeling for Medical Applications, pages 179–187. Springer, 2016.

3. Fourure D, Emonet R, Fromont E, Muselet D, Tremeau A, and Wolf C. Residual conv-deconv grid network for semantic segmentation. arXiv preprint arXiv:1707.07958, 2017.

4. He K, Gkioxari G, Dollár P, and Girshick R. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.

5. Huang G, Liu Z, Weinberger KQ, and van der Maaten L. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

6. Lee C-Y, Xie S, Gallagher P, Zhang Z, and Tu Z. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.

7. Li X, Chen H, Qi X, Dou Q, Fu C-W, and Heng PA. H-DenseUNet: Hybrid densely connected UNet for liver and liver tumor segmentation from CT volumes. arXiv preprint arXiv:1709.07330, 2017.

8. Long J, Shelhamer E, and Darrell T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

9. Ronneberger O, Fischer P, and Brox T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

10. Tajbakhsh N, Shin JY, Gurudu SR, Hurst RT, Kendall CB, Gotway MB, and Liang J. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299–1312, 2016.

11. Zhou Z, Shin J, Zhang L, Gurudu S, Gotway M, and Liang J. Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7340–7351, 2017.
