U-Net Architecture for Biomedical
Image Segmentation
1. Introduction and Motivation
Biomedical image segmentation requires assigning a class label to each pixel in a medical
image, enabling precise localization of anatomical structures (e.g., tumors, organs). Unlike
traditional image classification, which outputs a single label per image, segmentation
demands dense prediction. To meet these unique requirements, the U-Net architecture—
originally proposed by Ronneberger et al. in 2015—was developed to deliver high accuracy
on limited datasets while maintaining full-resolution outputs.
2. Architectural Overview
The U-Net follows a symmetric encoder-decoder architecture with skip connections and is
specially designed to work with very few annotated images—common in biomedical
settings.
Encoder (Contracting Path) – Left Half
- Structure: The encoder consists of repeated blocks of two 3×3 convolutions (each followed by a ReLU), followed by a 2×2 max pooling operation with stride 2 for downsampling.
- Function: It captures the contextual and semantic information from the input image. This
is akin to the feature extraction path in standard classification CNNs (e.g., VGG).
- Channel Depth: As we move deeper, the number of feature channels increases (e.g., 64 →
128 → 256 → 512), enabling learning of complex representations.
Decoder (Expanding Path) – Right Half
- Structure: Each step in the decoder consists of an upsampling of the feature map (via 2×2
transposed convolution or 'up-conv'), a concatenation with the corresponding high-
resolution features from the encoder (skip connection), followed by two 3×3 convolutions.
- Function: The decoder enables precise localization by combining contextual information
with spatial information (preserved via skip connections).
- Final Layer: A 1×1 convolution is used to map each 64-component feature vector to the
desired number of classes (e.g., background, organ, tumor).
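To make the structure above concrete, here is a minimal, illustrative PyTorch sketch of a two-level U-Net (the paper's network has five levels and up to 1024 channels at the bottleneck); the class and variable names are chosen here for illustration only.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by ReLU (padding=1 keeps the spatial size)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Illustrative two-level U-Net: one downsampling step, one upsampling step."""
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.enc1 = double_conv(in_channels, 64)       # encoder level 1
        self.pool = nn.MaxPool2d(2)                    # 2x2 max pooling, stride 2
        self.enc2 = double_conv(64, 128)               # bottleneck
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)  # up-conv
        self.dec1 = double_conv(128, 64)               # after concatenation: 64 + 64 channels
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)           # 1x1 conv to class scores

    def forward(self, x):
        e1 = self.enc1(x)                 # high-resolution features (kept for the skip connection)
        b = self.enc2(self.pool(e1))      # contextual features at lower resolution
        d1 = self.up(b)                   # upsample back to the encoder resolution
        d1 = torch.cat([e1, d1], dim=1)   # skip connection: concatenate along the channel axis
        d1 = self.dec1(d1)
        return self.head(d1)              # per-pixel class scores (logits)

logits = TinyUNet()(torch.randn(1, 1, 128, 128))   # -> shape (1, 2, 128, 128)
```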
3. Key Design Components
🔵 Conv 3x3, ReLU: Preserves local spatial context and allows deep feature representation.
🔴 Max Pool 2x2: Reduces spatial dimensions to extract abstract features (context).
🟢 Up-conv 2x2: Upsamples lower-resolution features during decoding.
🔷 Skip Connections: Feature maps from encoder layers are cropped and concatenated with
decoder layers to preserve spatial accuracy, crucial in segmentation.
4. Output and Segmentation Objective
The output is a segmentation map with the same spatial dimensions as the input image (assuming padded convolutions; the original U-Net used unpadded convolutions, so its output was slightly smaller than the input and predictions were stitched together with an overlap-tile strategy). Each pixel in this map is assigned a semantic class by applying a softmax across the channels of the final layer, yielding pixel-level class probabilities. This pixel-wise classification enables dense semantic labeling, which is fundamental in medical diagnostics.
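For a multi-class network, the per-pixel probabilities and the final label map can be obtained as in the short sketch below (the tensor shapes and the three-class setup are illustrative):

```python
import torch

# Logits from the final 1x1 convolution: (batch, num_classes, H, W)
logits = torch.randn(1, 3, 256, 256)      # e.g., background / organ / tumor

probs = torch.softmax(logits, dim=1)      # per-pixel class probabilities (sum to 1 over dim 1)
label_map = torch.argmax(probs, dim=1)    # (batch, H, W); each entry is a class index
```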
5. Why U-Net Works for Biomedical Applications
The following summarizes key challenges in biomedical segmentation and how U-Net addresses them:
- Limited labeled data: strong data augmentation and full use of context through overlapping tiles.
- Need for pixel-level precision: skip connections preserve high-resolution spatial details.
- Small structures (e.g., tumors): multi-scale feature extraction and fusion capture both fine and global structures.
- Segmentation over classification: the network outputs pixel-level probabilities, not whole-image probabilities.
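As a minimal illustration of the augmentation idea, simple geometric transforms can be applied identically to an image and its mask so the labels stay aligned (the original paper additionally relied on elastic deformations); the helper below is purely illustrative:

```python
import random
import torch

def augment(image: torch.Tensor, mask: torch.Tensor):
    """Apply the same random flip/rotation to image and mask (both shaped (C, H, W))."""
    if random.random() < 0.5:
        image, mask = torch.flip(image, dims=[-1]), torch.flip(mask, dims=[-1])  # horizontal flip
    k = random.randint(0, 3)
    image = torch.rot90(image, k, dims=[-2, -1])   # rotate both tensors by k * 90 degrees
    mask = torch.rot90(mask, k, dims=[-2, -1])
    return image, mask
```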
6. Mechanism of Pixel-wise Probability Prediction
Unlike classification models, which produce a single image-level softmax, U-Net treats each pixel as a separate classification task. By combining contextual features from the deep encoder layers with high-resolution spatial features carried over by the skip connections, the decoder assigns each pixel a class label. Thus, we shift from image-level predictions to dense predictions, enabling the high-resolution segmentations essential for clinical use.
7. Visual Summary of the U-Net Architecture
Left Side (Encoder): Blue blocks (3×3 convolutions + ReLU), red arrows (max pooling).
Right Side (Decoder): Green arrows (up-conv), gray horizontal arrows (skip connections).
Middle Bridge: Deepest layer with maximum feature channels; acts as the bottleneck of the
network.
8. The Outputs of a U-Net
A typical output visualization contains four panels, described from left to right:
1. Input Mask (far left): This is the ground-truth label mask. It shows the manually annotated regions of interest (in red), likely representing a tumor or lesion within the body. This serves as the reference for model training and evaluation.
2. Predicted Channel for Background: This is the probability heatmap from the output channel corresponding to the background class. Brighter regions represent higher confidence that a pixel belongs to the background.
3. Predicted Channel for Tumor: This channel shows the model's confidence per pixel for the tumor class. Red areas suggest high-probability predictions, ideally matching the red ring in the ground truth.
4. Final Prediction (far right): This is the final binary segmentation mask after applying argmax across channels. It assigns each pixel to the class with the highest predicted probability. This map ideally resembles the input mask.
🧠 Explanation of the Process
1. Multi-Channel Output:
o U-Net ends with multiple channels in its final layer — one for each class (e.g.,
background, tumor).
o Each channel gives a pixel-wise probability map.
2. Softmax or Sigmoid Activation:
o For binary classification (background vs. tumor), sigmoid is used.
o For multi-class segmentation, softmax is applied across channels.
3. Inference & Decision:
o In binary tasks: A threshold (like 0.5) converts probabilities into a binary
mask.
o In multi-class: Use argmax to select the most probable class per pixel.
4. Output Mask Utility:
o The final prediction mask (Panel 4) is used for:
- Overlaying on medical scans.
- Calculating metrics like Dice Score and Jaccard Index (IoU).
- Informing diagnosis or surgical planning.
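The decision step and the Dice score can be sketched directly in plain PyTorch; the helper functions below are illustrative (the smoothing term eps and the tensor shapes are assumptions, not part of the original method):

```python
import torch

def binary_mask(prob: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Binary task: sigmoid probabilities -> 0/1 mask via thresholding."""
    return (prob > threshold).float()

def multiclass_mask(logits: torch.Tensor) -> torch.Tensor:
    """Multi-class task: argmax over the channel dimension -> class index per pixel."""
    return torch.argmax(logits, dim=1)

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice = 2*|A intersect B| / (|A| + |B|) for binary masks of the same shape."""
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```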
9. Applications in Biomedical Imaging
U-Net has become the de facto standard for various biomedical image segmentation tasks,
including:
- Tumor detection in MRI/CT scans
- Cell nucleus segmentation in microscopy
- Organ boundary segmentation for surgical planning
- Histopathological tissue segmentation
10. Conclusion
The U-Net architecture's fully convolutional, symmetric design with skip connections makes
it highly effective for biomedical image segmentation. Its ability to predict class labels for
each pixel while preserving both global context and fine detail ensures it meets the high
precision demands of medical diagnostics. The output visualization confirms that U-Net not only learns to classify each pixel accurately but does so while preserving spatial and structural integrity.
MONAI Framework
🧠 What is MONAI?
MONAI (Medical Open Network for AI) is an open-source, PyTorch-based framework
specifically designed for deep learning in healthcare imaging. Developed by NVIDIA
and the community, it bridges the gap between academic research and clinical deployment
by offering specialized tools for medical imaging.
🎯 Purpose and Key Goals
- Standardization: provide a consistent framework for training and evaluating models in medical imaging.
- Accessibility: offer researchers and clinicians reusable, modular components built on top of PyTorch.
- Performance: optimize training using mixed precision and GPU acceleration (via NVIDIA technologies).
- Extensibility: integrate easily with PyTorch, Ignite, and third-party libraries for data loading, model training, and evaluation.
🧱 Core Components of MONAI
1. Transforms
- A rich set of preprocessing and postprocessing operations tailored for medical images.
- Handles 3D/4D data and medical formats such as NIfTI and DICOM.
- Includes spatial transforms (resizing, flipping, affine), intensity normalization, and data augmentation.
- Example: monai.transforms.RandAffined, ScaleIntensity, CropForeground
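A typical dictionary-based preprocessing pipeline might be assembled as in the sketch below (the keys and parameter values are illustrative, and exact transform names can vary across MONAI versions):

```python
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd, ScaleIntensityd, RandAffined
)

# Dictionary-based transforms (the trailing "d") apply the same operation to "image" and "label".
train_transforms = Compose([
    LoadImaged(keys=["image", "label"]),            # read NIfTI/DICOM volumes from disk
    EnsureChannelFirstd(keys=["image", "label"]),   # put the channel dimension first
    ScaleIntensityd(keys=["image"]),                # normalize intensities to [0, 1]
    RandAffined(keys=["image", "label"], prob=0.5,  # random affine augmentation
                rotate_range=(0.1, 0.1, 0.1)),
])
```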
2. Datasets & DataLoaders
- Specialized classes like CacheDataset, PersistentDataset, and SmartCacheDataset optimize I/O for large volumetric datasets.
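Usage follows the familiar PyTorch pattern; a brief sketch, reusing the train_transforms pipeline from above (the file paths and cache settings are illustrative):

```python
from monai.data import CacheDataset, DataLoader

# Each item is a dictionary matching the keys used by the transforms above (illustrative paths).
data_dicts = [{"image": "img_001.nii.gz", "label": "seg_001.nii.gz"}]

train_ds = CacheDataset(data=data_dicts, transform=train_transforms, cache_rate=1.0)
train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=4)
```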
3. Network Architectures
- Implements healthcare-optimized models:
o UNet, AttentionUNet, SegResNet, DenseNet121, DynUNet, etc.
- Plug-and-play model configurations with parameters fine-tuned for medical tasks.
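For example, a 3D U-Net for a two-class (background/tumor) task can be instantiated roughly as follows (the channel and stride choices are illustrative, and argument names may differ slightly between MONAI versions):

```python
from monai.networks.nets import UNet

model = UNet(
    spatial_dims=3,                     # 3D volumes
    in_channels=1,                      # e.g., a single MRI/CT modality
    out_channels=2,                     # background + tumor
    channels=(16, 32, 64, 128, 256),    # feature channels per encoder level
    strides=(2, 2, 2, 2),               # downsampling factor between levels
    num_res_units=2,
)
```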
4. Engines & Handlers (via Ignite)
- Simplifies training loops and evaluation.
- SupervisedTrainer and SupervisedEvaluator abstract model training and validation.
- Built-in handlers for logging, checkpointing, and early stopping.
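A training run can then be assembled with little boilerplate; the sketch below reuses the loader and model from the previous examples (argument names follow recent MONAI versions, and the hyperparameters are illustrative):

```python
import torch
from monai.engines import SupervisedTrainer
from monai.losses import DiceLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

trainer = SupervisedTrainer(
    device=device,
    max_epochs=50,
    train_data_loader=train_loader,       # from the DataLoader example above
    network=model.to(device),             # from the UNet example above
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-4),
    loss_function=DiceLoss(to_onehot_y=True, softmax=True),
)
trainer.run()
```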
5. Metrics & Evaluation
- Provides medically relevant metrics such as:
o Dice Coefficient
o Hausdorff Distance
o Surface Distance
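For example, the Dice coefficient can be accumulated over validation batches as in the sketch below (the tensors here are synthetic one-hot masks standing in for real predictions and labels):

```python
import torch
from monai.metrics import DiceMetric

dice_metric = DiceMetric(include_background=False, reduction="mean")

# Illustrative one-hot tensors of shape (batch, classes, H, W); in practice these
# come from discretized model outputs and ground-truth labels in a validation loop.
y_pred = torch.zeros(1, 2, 64, 64)
y_pred[:, 1, 20:40, 20:40] = 1
y_pred[:, 0] = 1 - y_pred[:, 1]
y = torch.zeros(1, 2, 64, 64)
y[:, 1, 22:42, 22:42] = 1
y[:, 0] = 1 - y[:, 1]

dice_metric(y_pred=y_pred, y=y)              # accumulate per-batch results
mean_dice = dice_metric.aggregate().item()   # mean Dice over all accumulated batches
dice_metric.reset()
```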
6. Inferencing Utilities
- Sliding window inference for large 3D images.
- DeepEdit and DeepGrow for interactive segmentation tasks.
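Sliding-window inference tiles a large volume into overlapping patches, runs the network on each patch, and stitches the predictions back together; a rough sketch (the ROI size, overlap, and input volume are illustrative, and model refers to the UNet example above):

```python
import torch
from monai.inferers import sliding_window_inference

# A full 3D volume that is too large to feed through the network in one pass.
volume = torch.randn(1, 1, 240, 240, 160)   # (batch, channel, H, W, D), illustrative size

with torch.no_grad():
    logits = sliding_window_inference(
        inputs=volume,
        roi_size=(96, 96, 96),      # patch size seen by the network
        sw_batch_size=4,            # how many patches to run per forward pass
        predictor=model,            # e.g., the MONAI UNet defined earlier
        overlap=0.25,               # fraction of overlap between neighbouring patches
    )
```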
7. MONAI Label
- An AI-assisted labeling and annotation tool that integrates with annotation platforms like 3D Slicer or OHIF Viewer.
- Supports active learning loops.