Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya
Deep Learning Workshop
Dublin City University
27-28 April 2017
Object Segmentation
Day 2 Lecture 7
Object Segmentation
Define the accurate boundaries of all objects in an image
Semantic Segmentation
● Label every pixel!
● Don’t differentiate instances (cows)
● Classic computer vision problem
Slide Credit: CS231n
Instance Segmentation
● Detect instances, give category, label pixels
● “Simultaneous detection and segmentation” (SDS)
Slide Credit: CS231n
Object Segmentation: Datasets
● Pascal Visual Object Classes: 20 classes, ~5,000 images
● Pascal Context: 540 classes, ~10,000 images
● SUN RGB-D: 19 classes, ~10,000 images
● Microsoft COCO: 80 classes, ~300,000 images
● CityScapes: 30 classes, ~25,000 images
● ADE20K: >150 classes, ~22,000 images
Semantic Segmentation
Slide Credit: CS231n
● Extract patch
● Run through a CNN
● Classify center pixel (e.g. COW)
● Repeat for every pixel
Run “fully convolutional” network to get all pixels at once
Problem 1: Smaller output due to pooling
Learnable upsampling
Long et al. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015
Slide Credit: CS231n
Reminder: Convolutional Layer
Slide Credit: CS231n
Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4 Output: 4 x 4
Dot product between filter and input
Typical 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4 Output: 2 x 2
Dot product between filter and input
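Both cases above follow from the usual output-size rule, floor((n + 2 * pad - kernel) / stride) + 1. A minimal pure-Python sketch, assuming square inputs and filters; `conv_output_size` and `conv2d` are illustrative names that do not come from the slides:

```python
def conv_output_size(n, kernel, stride, pad):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1


def conv2d(image, kernel, stride=1, pad=0):
    """Naive 2D convolution (cross-correlation) with zero padding."""
    n, k = len(image), len(kernel)
    size = n + 2 * pad
    # Copy the image into the middle of a zero-filled canvas.
    padded = [[0.0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            padded[r + pad][c + pad] = image[r][c]
    out_size = conv_output_size(n, k, stride, pad)
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Dot product between the filter and the input window.
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += kernel[a][b] * padded[i * stride + a][j * stride + b]
            row.append(acc)
        out.append(row)
    return out


ones_image = [[1.0] * 4 for _ in range(4)]
ones_kernel = [[1.0] * 3 for _ in range(3)]
same_size = conv2d(ones_image, ones_kernel, stride=1, pad=1)   # 4 x 4 output
half_size = conv2d(ones_image, ones_kernel, stride=2, pad=1)   # 2 x 2 output
```

With an all-ones filter each output simply counts how many input pixels fall under the window, which makes the effect of zero padding at the borders easy to see.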
Learnable Upsample: Deconvolutional Layer
Slide Credit: CS231n
3 x 3 “deconvolution”, stride 2 pad 1
Input: 2 x 2 Output: 4 x 4
Input gives weight for filter values
Sum where output overlaps
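The "input gives weight for filter values, sum where output overlaps" rule can be sketched in pure Python. One assumption is needed to reproduce the 4 x 4 output: with stride 2, pad 1 and a 3 x 3 kernel the plain size formula gives 3 x 3, so the sketch adds one extra output row and column at the bottom/right (the `output_padding` option offered by frameworks such as PyTorch). The function name is illustrative.

```python
def conv_transpose2d(inp, kernel, stride=2, pad=1, output_padding=1):
    """Naive transposed convolution on square nested lists."""
    n, k = len(inp), len(kernel)
    out_size = (n - 1) * stride - 2 * pad + k + output_padding
    full = (n - 1) * stride + k + output_padding
    canvas = [[0.0] * full for _ in range(full)]
    for i in range(n):
        for j in range(n):
            # Each input value scales a copy of the filter ...
            for a in range(k):
                for b in range(k):
                    # ... and overlapping copies are summed in the output.
                    canvas[i * stride + a][j * stride + b] += inp[i][j] * kernel[a][b]
    # Cropping `pad` rows/columns from the top-left undoes the padding.
    return [row[pad:pad + out_size] for row in canvas[pad:pad + out_size]]


upsampled = conv_transpose2d([[1.0, 2.0], [3.0, 4.0]],
                             [[1.0] * 3 for _ in range(3)])
```

In a network the filter values are learned, which is what makes this upsampling "learnable" rather than fixed like bilinear interpolation.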
Learnable Upsample: Deconvolutional Layer
Warning: Checkerboard effect when kernel size is not divisible by the stride (e.g. stride = 2, kernel_size = 3)
Source: distill.pub
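One way to see where the checkerboard comes from is to count, in 1D, how many filter copies land on each output position. This is a small illustrative sketch (the helper name is hypothetical), not code from the cited article:

```python
def overlap_counts(n_inputs, kernel_size, stride):
    """Number of filter copies covering each output position (1D)."""
    length = (n_inputs - 1) * stride + kernel_size
    counts = [0] * length
    for i in range(n_inputs):
        for a in range(kernel_size):
            counts[i * stride + a] += 1
    return counts


uneven = overlap_counts(4, kernel_size=3, stride=2)  # kernel not divisible by stride
even = overlap_counts(4, kernel_size=4, stride=2)    # kernel divisible by stride
```

With kernel size 3 and stride 2 the interior counts alternate between 1 and 2, so some output pixels receive systematically more contributions than their neighbours. With kernel size 4 the interior coverage is uniform.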
Semantic Segmentation
Slide Credit: CS231n
Noh et al. Learning Deconvolution Network for Semantic Segmentation. ICCV 2015
[Figure: “Regular” VGG followed by “Upside down” VGG]
Better Upsampling: Subpixel
Rearrange features from the previous convolutional layer to form a higher-resolution output
Shi et al. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. CVPR 2016
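The rearrangement can be sketched as follows: a feature volume with C * r * r channels of size H x W becomes C channels of size (H * r) x (W * r). The channel ordering used here is one common convention and is an assumption of the sketch; the function name is illustrative.

```python
def pixel_shuffle(features, r):
    """Rearrange (C*r*r, H, W) nested lists into (C, H*r, W*r)."""
    channels = len(features)
    assert channels % (r * r) == 0, "channel count must be a multiple of r*r"
    h, w = len(features[0]), len(features[0][0])
    out_channels = channels // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(out_channels)]
    for c in range(out_channels):
        for i in range(r):
            for j in range(r):
                # Each group of r*r channels fills one r x r block per pixel.
                src = features[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Four 1 x 1 channels become a single 2 x 2 channel.
shuffled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], r=2)
```

No arithmetic happens in this step: the upsampling is learned entirely by the convolution that produces the r * r channel groups.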
Semantic Segmentation
Problem 2: Blobby-like segmentations
High-level features (e.g. conv5 layer) from a pretrained classification network are the input for the segmentation branch
Skip Connections
Slide Credit: CS231n
Skip connections = Better results
Recovering low-level features from early layers
Long et al. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015
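A rough sketch of the fusion idea: a coarse score map from a deep layer is upsampled and added to a finer score map from an earlier layer. Nearest-neighbour upsampling is used here only to keep the example short; it is a stand-in for the learnable upsampling discussed earlier, and both function names are illustrative.

```python
def upsample_nearest(score_map, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def fuse_skip(coarse, fine, factor=2):
    """Add an upsampled coarse score map to a finer one (element-wise)."""
    up = upsample_nearest(coarse, factor)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse_scores = [[1, 2]]                       # 1 x 2, from a deep layer
fine_scores = [[0, 1, 0, 1], [1, 0, 1, 0]]     # 2 x 4, from an early layer
fused = fuse_skip(coarse_scores, fine_scores)
```

The coarse map contributes the semantic decision and the fine map contributes the spatial detail that pooling removed.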
Dilated Convolutions
Yu & Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. ICLR 2016
Structural change in convolutional layers for dense prediction problems (e.g. image segmentation)
● The receptive field grows exponentially as you add more layers (when the dilation factor doubles at each layer) → more context information in deeper layers than with regular convolutions
● The number of parameters increases only linearly as you add more layers
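The two bullets can be checked with a little arithmetic. The sketch below assumes stride-1 layers with 3 x 3 kernels and a single channel, and dilations that double per layer (1, 2, 4, 8); the function names are illustrative.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (one dimension) of stacked stride-1 convolutions."""
    rf = 1
    for d in dilations:
        # A dilated kernel spans (kernel_size - 1) * d + 1 input positions.
        rf += (kernel_size - 1) * d
    return rf


def parameter_count(num_layers, kernel_size=3):
    """Weights of single-channel layers without bias: linear in depth."""
    return num_layers * kernel_size * kernel_size


dilated_rf = receptive_field([1, 2, 4, 8])   # dilation doubles per layer
regular_rf = receptive_field([1, 1, 1, 1])   # regular convolutions
weights = parameter_count(4)                 # identical in both cases
```

Four dilated layers cover 31 input positions per dimension against 9 for four regular layers, while both stacks use the same number of weights.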
Instance Segmentation
More challenging than Semantic Segmentation
● Number of objects is variable
● No unique match between predicted and ground truth objects (cannot use instance IDs)
Several lines of attack:
● Proposal-based methods
● Recurrent Neural Networks
Proposal-based
Slide Credit: CS231n. Hariharan et al. Simultaneous Detection and Segmentation. ECCV 2014
● External segment proposals
● Mask out background with mean image
Similar to R-CNN, but with segment proposals
Proposal-based
Slide Credit: CS231n. Hariharan et al. Hypercolumns for Object Segmentation and Fine-grained Localization. CVPR 2015
Proposal-based Instance Segmentation: MNC
Dai et al. Instance-aware Semantic Segmentation via Multi-task Network Cascades. CVPR 2016
Won COCO 2015 challenge (with ResNet)
● Region proposal network (RPN)
● Reshape boxes to fixed size, figure / ground logistic regression
● Mask out background, predict object class
● Learn entire model end-to-end!
Faster R-CNN for pixel-level segmentation in a multi-stage cascade strategy
[Figure: MNC predictions vs. ground truth]
Proposal-based Instance Segmentation: Mask R-CNN
He et al. Mask R-CNN. arXiv Mar 2017
Faster R-CNN for pixel-level segmentation as a parallel prediction of masks and class labels
He et al. Mask R-CNN. arXiv Mar 2017
Mask R-CNN
● Classification & box detection losses are identical to those in Faster R-CNN
● Addition of a new loss term for mask prediction:
The network outputs a K x m x m volume for mask prediction, where K is the
number of categories and m is the size of the mask (square)
35
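A sketch of how a mask loss can be computed from the K x m x m output. It follows the description in the Mask R-CNN paper (per-pixel sigmoid, average binary cross-entropy, only on the channel of the ground-truth class); the function name and the nested-list representation are assumptions made for illustration.

```python
import math


def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel binary cross-entropy on the ground-truth class mask.

    mask_logits: K x m x m nested lists of raw scores (one mask per class)
    gt_mask:     m x m nested lists of 0/1 values
    gt_class:    index of the ground-truth category
    """
    logits = mask_logits[gt_class]  # the other K - 1 masks are ignored
    total, count = 0.0, 0
    for logit_row, gt_row in zip(logits, gt_mask):
        for z, y in zip(logit_row, gt_row):
            p = 1.0 / (1.0 + math.exp(-z))  # per-pixel sigmoid
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    return total / count


# K = 2 categories, m = 2: only channel 1 contributes to the loss.
example_logits = [[[9.0, -9.0], [9.0, -9.0]],
                  [[0.0, 0.0], [0.0, 0.0]]]
example_loss = mask_loss(example_logits, [[1, 0], [0, 1]], gt_class=1)
```

Because each class has its own mask channel and only the ground-truth channel is penalised, classes do not compete with each other at the pixel level.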
Mask R-CNN: RoI Align
He et al. Mask R-CNN. arXiv Mar 2017
Reminder: RoI Pool from Fast R-CNN
● Hi-res input image: 3 x 800 x 600, with region proposal
● Convolution and pooling → hi-res conv features: C x H x W, with region proposal
● Max-pool within each grid cell → RoI conv features: C x h x w for the region proposal
● Fully-connected layers expect low-res conv features: C x h x w
Problem: x/16 and rounding → misalignment, and not differentiable
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Mask R-CNN: RoI Align
Use bilinear interpolation instead of cropping + maxpool
37
Mapping given by box coordinates (θ12 and θ21 = 0 → translation + scale)
38
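A simplified sketch of the idea: box coordinates are divided by the feature stride without rounding, and each output bin is filled by bilinear interpolation. It takes a single sample at the center of each bin (the paper samples several points per bin and aggregates them) and treats feature index (r, c) as sitting at coordinate (r, c); both choices, and the function names, are assumptions of the sketch.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly interpolate a 2D nested list at fractional (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feat, box, out_size, stride=16):
    """Sample an out_size x out_size grid inside `box` (image coordinates)."""
    # Divide by the stride but do NOT round: this avoids the misalignment.
    x_min, y_min, x_max, y_max = [v / stride for v in box]
    bin_h = (y_max - y_min) / out_size
    bin_w = (x_max - x_min) / out_size
    return [[bilinear(feat, y_min + (i + 0.5) * bin_h, x_min + (j + 0.5) * bin_w)
             for j in range(out_size)]
            for i in range(out_size)]


ramp = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]   # feature value equals column index
aligned = roi_align(ramp, box=(8, 8, 40, 40), out_size=2)
```

Since the output is a smooth function of the box coordinates, small shifts of the box change the sampled values gradually instead of jumping between grid cells.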
Mask R-CNN
He et al. Mask R-CNN. arXiv Mar 2017
[Figure: example results for object detection and instance segmentation]
Recurrent Instance Segmentation
Romera-Paredes & H.S. Torr. Recurrent Instance Segmentation. ECCV 2016
Sequential mask generation
Mapping between ground truth and predicted masks?
Recurrent Instance Segmentation: Coverage Loss
Romera-Paredes & H.S. Torr. Recurrent Instance Segmentation. ECCV 2016. Slide Credit: M. Baradad, ReadCV@UPC
1. Compute the IoU for all pairs of predicted (Ŷt) and ground-truth (Yt) masks
[Figure: IoU matrix with example entries 0.9, 0, 0, 0.1, 0.8, 0.1, ...]
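Step 1 can be sketched as follows, assuming binary masks stored as nested lists of 0/1 values (a network would typically use a soft, differentiable IoU on real-valued predictions). The function names are illustrative.

```python
def mask_iou(pred, gt):
    """Intersection over union of two binary masks of equal size."""
    inter, union = 0, 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


def iou_matrix(preds, gts):
    """Entry [i][j] is the IoU between prediction i and ground truth j."""
    return [[mask_iou(p, g) for g in gts] for p in preds]


pred_masks = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]
gt_masks = [[[1, 0], [0, 0]], [[0, 0], [1, 1]]]
ious = iou_matrix(pred_masks, gt_masks)
```

Rows index predicted masks and columns index ground-truth masks, so the matrix does not have to be square when the two counts differ.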
2. Find the best matching
Loss: negative sum of the IoUs over the best matching
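A brute-force sketch of step 2, assuming at least as many predictions as ground-truth masks. It tries every assignment, which is only practical for a handful of instances; the Hungarian algorithm is the usual efficient choice. The function name and the example matrix are hypothetical.

```python
from itertools import permutations


def best_matching(iou):
    """Return (assignment, loss) maximising the summed IoU.

    iou[i][j] is the IoU between prediction i and ground truth j.
    assignment[j] is the index of the prediction matched to ground truth j.
    """
    n_pred, n_gt = len(iou), len(iou[0])
    assert n_pred >= n_gt, "sketch assumes at least one prediction per ground truth"
    best_total, best_assignment = -1.0, None
    for pred_idx in permutations(range(n_pred), n_gt):
        total = sum(iou[p][g] for g, p in enumerate(pred_idx))
        if total > best_total:
            best_total, best_assignment = total, pred_idx
    # The loss is the negated sum, so a better coverage gives a lower loss.
    return best_assignment, -best_total


# Hypothetical IoU matrix: 3 predictions (rows) x 2 ground-truth masks (columns).
example_iou = [[0.9, 0.0], [0.1, 0.8], [0.0, 0.1]]
assignment, matching_loss = best_matching(example_iou)
```

The matching makes the loss independent of the order in which the network emits the instances.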
3. Also take into account the scores
s1 = 0.93, s2 = 0.73, s3 = 0.86, s4 = 0.63, s5 = 0.56
Where:
● the score term uses the binary cross entropy
● [·] is the Iverson bracket, which is 1 if the condition is true and 0 otherwise
4. Add everything together
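A sketch of how the pieces could be combined. It assumes the score target is 1 for the first n predictions and 0 afterwards (n being the number of ground-truth instances, expressed with the Iverson bracket), and a weighting factor `lam` between the two terms. The names `bce`, `score_loss`, `total_loss` and `lam` are illustrative, not taken from the slides.

```python
import math


def bce(target, score, eps=1e-7):
    """Binary cross entropy between a 0/1 target and a score in (0, 1)."""
    s = min(max(score, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(s) + (1 - target) * math.log(1 - s))


def score_loss(scores, n_gt):
    """Sum of BCE terms; the target for step t is the Iverson bracket [t < n_gt]."""
    return sum(bce(1.0 if t < n_gt else 0.0, s) for t, s in enumerate(scores))


def total_loss(matched_ious, scores, n_gt, lam=1.0):
    """Negative matched IoU plus the weighted score term."""
    return -sum(matched_ious) + lam * score_loss(scores, n_gt)


# Scores from the slide; the matched IoUs and n_gt = 3 are hypothetical.
slide_scores = [0.93, 0.73, 0.86, 0.63, 0.56]
example_total = total_loss([0.9, 0.8, 0.7], slide_scores, n_gt=3)
```

The score term teaches the network when to stop: once all instances have been emitted, later steps should produce low scores.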
Summary
Segmentation Datasets
Semantic Segmentation Methods
● Deconvolution
● Dilated Convolution
● Skip Connections
Instance Segmentation Methods
● Proposal-Based
● Recurrent
Questions?

Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
