[course site]
Verónica Vilaplana
veronica.vilaplana@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia
Segmentation
Day 3 Lecture 1
#DLUPC
Outline
● What is object segmentation?
○ Applications
● Semantic segmentation
○ From image classification to semantic segmentation
■ Fully convolutional networks
■ Learnable Upsampling
■ Skip connections
○ FCN8s, Dilated Convolutions, U-Net, ...
● Instance segmentation
○ Simultaneous Detection and Segmentation
○ Mask R-CNN
2
Image and object segmentation
● Image Segmentation
○ Group pixels into regions that share some similar properties
● Segmenting images into meaningful objects
○ Object-level segmentation: accurate localization and recognition
Superpixels
(Ren, ICCV 2003)
3
Object segmentation: applications
Image editing and composition (Xu, 2016) Robotics
Autonomous driving
(Cordts, 2016)
Medical image analysis
(Casamitjana, 2017)
4
Semantic segmentation
● Label every pixel: recognize the class of every pixel
● Do not differentiate instances
5
Mottaghi et al, “The role of context for object detection and semantic segmentation in the wild”, CVPR 2014
Instance segmentation
● Detect instances, categorize and label every pixel
● Labels are class-aware and instance-aware
6
Arnab,Torr “Pixelwise instance segmentation with a dynamically instantiated network”, CVPR 2017
Object detection Semantic Segm. Instance segm. Ground truth
Datasets for semantic/instance segmentation
7
Pascal Visual Object Classes:
● 20 categories
● +10,000 images
● Semantic segmentation GT
● Instance segmentation GT
Pascal Context:
● Real indoor & outdoor scenes
● 540 categories
● +10,000 images
● Dense annotations
● Semantic segmentation GT
● Objects + stuff
Datasets for semantic/instance segmentation
8
SUN RGB-D:
● Real indoor scenes
● 10,000 images
● 58,658 3D bounding boxes
● Dense annotations
● Instances GT
● Semantic segmentation GT
● Objects + stuff
COCO (Common Objects in Context):
● Real indoor & outdoor scenes
● 80 categories
● +300,000 images
● 2M instances
● Partial annotations
● Semantic segmentation GT
● Instance segmentation GT
● Objects, but no stuff
Datasets for semantic/instance segmentation
9
CityScapes:
● Real driving scenes
● 30 categories
● +25,000 images
● 20,000 partial annotations
● 5,000 dense annotations
● Semantic segmentation GT
● Instance segmentation GT
● Depth, GPS and other metadata
● Objects and stuff
ADE20K:
● Real general scenes
● +150 categories
● +22,000 images
● Semantic segmentation GT
● Instance + parts segmentation GT
● Objects and stuff
From classification to semantic segmentation
10
● Naive approach: extract a patch around a pixel, run it through a CNN trained for image classification (e.g. DOG vs. CAT), classify the center pixel, and repeat for every pixel
From classification to semantic segmentation
● A classification network becoming fully convolutional
○ Fully connected layers can also be viewed as convolutions with kernels that cover the
entire input region
11
Shelhamer, Long, Darrell, Fully Convolutional Networks for Semantic Segmentation, 2014-2016
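The FC-as-convolution equivalence above can be checked numerically; a minimal numpy sketch (shapes and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, K = 3, 4, 4, 5                  # channels, height, width, output classes
x = rng.standard_normal((C, H, W))

# Fully connected layer: flatten the feature map, then matrix multiply.
w_fc = rng.standard_normal((K, C * H * W))
fc_out = w_fc @ x.ravel()                # shape (K,)

# Same weights viewed as K convolution kernels covering the entire input:
# a "valid" convolution with a kernel the size of the input produces a
# 1x1 spatial output per kernel -- here just a dot product per kernel.
kernels = w_fc.reshape(K, C, H, W)
conv_out = np.array([(k * x).sum() for k in kernels])

assert np.allclose(fc_out, conv_out)
```

On a larger input, the convolutional view simply slides and produces a coarse map of class scores instead of a single vector.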
From classification to semantic segmentation
● Dense prediction: fully convolutional, end to end, pixel-to-pixel network
● Problems
○ Output is smaller than input → Add upsampling layers
○ Output is very coarse → Add fine details from previous layers
12
Final layer is a 1×1 conv with #channels = #classes
Pixelwise loss function:
Credit: Shelhamer, Long
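The pixelwise loss is typically a per-pixel cross-entropy averaged over all pixels; a minimal numpy sketch (assuming softmax + cross-entropy, which is the common choice for FCNs — shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes, H, W = 3, 2, 2
scores = rng.standard_normal((num_classes, H, W))   # output of the final 1x1 conv
labels = rng.integers(0, num_classes, size=(H, W))  # ground-truth class per pixel

# Softmax over the class dimension, independently at every pixel.
e = np.exp(scores - scores.max(axis=0, keepdims=True))
probs = e / e.sum(axis=0, keepdims=True)

# Pixelwise loss: average the per-pixel cross-entropy over all pixels.
loss = -np.mean(np.log(probs[labels, np.arange(H)[:, None], np.arange(W)]))
```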
● Dense prediction: fully convolutional, end to end, pixel-to-pixel network
From classification to semantic segmentation
13
Final layer is a 1×1 conv with #channels = #classes
Pixelwise loss function:
Conv, pool, non linearity
Learnable Upsampling Pixelwise Output + loss
Credit: Shelhamer, Long
Learnable upsampling: recovering spatial shape
Upsampling: transposed convolution, also called fractionally strided convolution or
‘deconvolution’
14
Convolution
More info:
Dumoulin et al, A guide to convolution arithmetic for deep learning, 2016
https://github.com/vdumoulin/conv_arithmetic
Convolution as a matrix operation:
● I: input image, 4×4, vectorized to 16×1
● O: output image, 4×1 (later reshaped to 2×2)
● h: 3×3 kernel
● C: 4×16 sparse matrix of the kernel weights, so that O = C·I
The backward pass is obtained by multiplying by the transpose, Cᵀ.
Transposed convolution (also called fractionally strided convolution or ‘deconvolution’): swaps
the forward and backward passes of a convolution
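The matrix view can be made concrete; a small numpy sketch of O = C·I and of the transposed pass (cross-correlation convention, as in deep learning frameworks; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
h = rng.standard_normal((3, 3))        # 3x3 kernel
x = rng.standard_normal((4, 4))        # 4x4 input image

# Build the sparse convolution matrix C (4x16): row r holds the kernel
# weights placed at the input positions covered by output position r.
C = np.zeros((4, 16))
for r, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    for a in range(3):
        for b in range(3):
            C[r, (i + a) * 4 + (j + b)] = h[a, b]

# Forward pass: convolution as a matrix product, O = C @ I.
out = (C @ x.ravel()).reshape(2, 2)

# Direct sliding-window computation for comparison.
direct = np.array([[(h * x[i:i + 3, j:j + 3]).sum() for j in range(2)]
                   for i in range(2)])
assert np.allclose(out, direct)

# Transposed convolution: multiply by C.T, mapping a 2x2 map back up
# to a 4x4 map -- exactly the backward pass of the convolution.
y = rng.standard_normal((2, 2))
up = (C.T @ y.ravel()).reshape(4, 4)
```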
Learnable upsampling: recovering spatial shape
It is always possible to emulate a transposed convolution with a direct convolution (fractional
stride)
15
More info:
Dumoulin et al, A guide to convolution arithmetic for deep learning, 2016
https://github.com/vdumoulin/conv_arithmetic
Learnable upsampling: recovering spatial shape
16
1D example:
● 1D convolution with stride 2
● 1D transposed convolution with stride 2
● 1D subpixel convolution with stride 1/2
The two operators can achieve the same result if the filters are learned.
Shi, Is the deconvolution layer the same as a convolutional layer?, 2016
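The equivalence between a transposed convolution and a subpixel (“fractional stride”) convolution can be illustrated in 1D; a minimal numpy sketch (assuming zero insertion between samples, sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(4)     # 1D input
k = rng.standard_normal(3)     # 1D kernel
stride = 2

# Transposed convolution with stride 2: scatter each input sample,
# multiplied by the kernel, into the upsampled output.
out = np.zeros((len(x) - 1) * stride + len(k))
for i, xi in enumerate(x):
    out[i * stride:i * stride + len(k)] += xi * k

# Same result as inserting zeros between samples ("fractional stride")
# and then applying an ordinary convolution.
z = np.zeros((len(x) - 1) * stride + 1)
z[::stride] = x
assert np.allclose(out, np.convolve(z, k, mode="full"))
```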
DeconvNet:
VGG-16 (conv + ReLU + max pool) + mirrored VGG (unpooling + ‘deconv’ + ReLU)
More than one upsampling layer
17
Normal VGG “Upside down” VGG
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Resolution: spectrum of deep features
Problem: coarse output
● Combine where (local, shallow) with what (global, deep)
18
fuse features into deep jet
(cf. Hariharan et al. CVPR15 “hypercolumn”)
Credit: Shelhamer, Long
Fine details: skip connections
19
Add a 1×1 conv classifying layer on top of pool4, upsample the conv7 prediction ×2
(initialized to bilinear interpolation, then learned), sum both, and upsample ×16 for the output
End-to-end, joint learning of semantics and location: skip connections to fuse layers
Credit: Shelhamer, Long
Skip connections
● A multi-stream network that fuses features/predictions across layers
20
Input image | stride 32 (no skips) | stride 16 (1 skip) | stride 8 (2 skips) | ground truth
Credit: Shelhamer, Long
Transfer learning
● Cast ILSVRC classifiers (AlexNet, VGG, GoogLeNet) into FCNs and augment them for
dense prediction: discard the classifier layer, transform FC layers to convolutions, add a 1×1
conv with 21 filters for scoring at each output location, and upsample
● Add skip connections
● Train for segmentation by fine-tuning all layers with PASCAL VOC 2011 with a pixelwise
loss.
● Metrics: pixel accuracy, mean accuracy, mean pixel intersection over union
21
Mean IU: per-class evaluation: the intersection of the predicted and true sets of pixels for a
given class, divided by their union, averaged over classes
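Mean IU as defined above, as a small numpy sketch (the convention of skipping classes absent from both prediction and ground truth is an assumption; benchmarks differ in such details):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean of per-class intersection-over-union."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both prediction and GT; skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

# Toy 2x2 example with two classes:
pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
# class 0: intersection 1, union 2 -> 0.5 ; class 1: intersection 2, union 3 -> 2/3
print(mean_iou(pred, gt, 2))  # ~0.583 (mean of 0.5 and 2/3)
```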
Pascal Test Set
Pascal Validation Set
Based on VGG
FCN-32s = FCN VGG
FCN-8s results on Pascal
22
SDS: Simultaneous Detection and
Segmentation Hariharan et al. ECCV14
Semantic segmentation
Typical architecture
● Downsampling path: extracts coarse features
● Upsampling path: recovers input image resolution
● Skip connections: recover detailed information
● Post-processing (optional): refines predictions (CRF)
Other architectures:
● DeepLab: ‘atrous’ convolutions + spatial pyramid + CRF (Chen, ICLR 2015)
● CRF-RNN: FCN + CRF as Recurrent NN (Zheng, ICCV 2015)
● U-Net (Ronneberger, 2015)
● Fully Convolutional DenseNets (Jégou, 2016)
● Dilated convolutions (Yu, 2016) 23
U-Net
● A contracting path and an expansive path
● Adds convolutions in the upsampling path (“symmetric” net)
● Skip connections: concatenation of feature maps
24
Ronneberger et al, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, arXiv 2015
Winner of
CAD Caries challenge ISBI 2015
Cell tracking challenge ISBI 2015
Fully convolutional DenseNets
● Adds feed-forward connections between layers
● Based on U-Nets:
○ connections between downsampling – upsampling paths
● Based on DenseNets* (for image classification):
○ each layer directly connected to every other layer
○ alleviate the vanishing-gradient problem
○ strengthen feature propagation
○ encourage feature reuse
○ substantially reduce the number of parameters.
25
Jégou et al, “The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation”, Dec. 2016
* Huang et al, “Densely connected convolutional networks”, arxiv Aug 2016
Dense block: Complete architecture:
Dilated convolutions
● Systematically aggregate multiscale contextual information without losing resolution
○ Usual convolution
○ Dilated convolution
26
Yu, Koltun, Multi-scale context aggregation by dilated convolutions, 2016
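A minimal 1D numpy sketch of a dilated convolution (the function name is ours): the dilation rate widens the receptive field without adding weights and without downsampling.

```python
import numpy as np

def dilated_conv1d(x, k, dilation):
    """1D 'valid' cross-correlation with a dilated kernel (zeros between taps)."""
    span = (len(k) - 1) * dilation + 1          # effective receptive field
    n = len(x) - span + 1
    return np.array([sum(k[j] * x[i + j * dilation] for j in range(len(k)))
                     for i in range(n)])

x = np.arange(8, dtype=float)
k = np.array([1.0, 1.0, 1.0])

print(dilated_conv1d(x, k, 1))  # usual convolution: receptive field 3
print(dilated_conv1d(x, k, 2))  # dilation 2: receptive field 5, same 3 weights
```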
Instance segmentation
● Detect instances, categorize and label every pixel
● Labels are class-aware and instance-aware
27
Arnab,Torr “Pixelwise instance segmentation with a dynamically instantiated network”, CVPR 2017
Object detection Semantic Segm. Instance segm. Ground truth
Instance segmentation: Multi-task cascades
28
Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
Won COCO 2015 challenge (with ResNet)
Learn entire model end to end
Instance segmentation: Multi-task cascades
Results on Pascal VOC 2012 and MS COCO:
29
Instance segmentation: Mask R-CNN
● Extension of Faster R-CNN to instance segmentation
● A Fully Convolutional Network (FCN) is added on top of the CNN features of Faster R-CNN to
generate a mask (segmentation output).
● This is in parallel to the classification and bounding box regression network of Faster R-CNN
● RoIAlign instead of RoIPool to properly align extracted features with the input
30
He et al, Mask R-CNN, 2017
Instance segmentation: Mask R-CNN
● Classification and bounding box detection losses like Faster R-CNN
● A new loss term for mask prediction
● Output: C × m × m volume for mask prediction (C classes, m: side of the square mask)
31
He et al, Mask R-CNN, 2017
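The mask loss can be sketched as a per-pixel sigmoid binary cross-entropy applied only to the mask channel of the ground-truth class (a numpy sketch under that reading of the paper; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
C, m = 3, 4                                     # classes, mask resolution
mask_logits = rng.standard_normal((C, m, m))    # C x m x m mask prediction for one RoI
gt_class = 1                                    # class label of this RoI
gt_mask = rng.integers(0, 2, size=(m, m)).astype(float)

# Per-pixel sigmoid on the channel of the ground-truth class only;
# the other C-1 channels do not contribute to the loss, so classes
# do not compete within the mask (unlike a softmax over classes).
p = 1.0 / (1.0 + np.exp(-mask_logits[gt_class]))
mask_loss = -np.mean(gt_mask * np.log(p) + (1 - gt_mask) * np.log(1 - p))
```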
Instance segmentation: Mask R-CNN
● Masks are combined with classifications and bounding boxes from Faster R-CNN
32
He et al, Mask R-CNN, 2017
Instance segmentation: Mask R-CNN
● Results on COCO dataset
● MNC and FCIS were the winners of COCO 2015 and 2016, respectively
33
MNC: Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
FCIS: Li et al., “Fully convolutional instance-aware semantic segmentation”, CVPR 2017
Summary
● Semantic segmentation
○ Fully convolutional networks
○ Learnable Upsampling
○ Skip connections
○ Models:
■ FCN8s, Dilated Convolutions, U-Net, FC Densenets
● Instance segmentation
○ Based on object/segments proposals
■ Simultaneous Detection and Segmentation (R-CNN)
■ Multi-task cascade (Faster R-CNN)
■ Mask R-CNN (Faster R-CNN)
○ Others
■ Recurrent instance segmentation
34
Questions?
35
FCN-8s vs DeepLab vs Dilated Convolutions
36
Input image FCN-8s DeepLab DilConv Ground truth
Pascal VOC 2012 test set
Mean IoU:
FCN-8s = 62.2
DeepLab = 62.1
DilConv = 67.6
FCN-8s: Fully Convolutional Networks for Semantic Segmentation, Long, Shelhamer, Darrell, 2014-2016
DeepLab: Semantic Image Segm. with Deep Conv. Nets, Atrous Convolution, and Fully Connected CRFs, Chen, et al, 2015
Dilated convolutions: Multi-scale context aggregation by dilated convolutions, Yu, Koltun, 2016

Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
