SELF SUPERVISED
LEARNING
Prof. Biplab Banerjee
GNR 650
Slides Overview
• Motivation
• Introduction to self-supervision
• Strategies of self-supervision
• Non-Contrastive Strategies
• Contrastive Strategies
• Use cases
Motivation
• Supervised learning has shown great promise in deep
learning.
• Deep learning models are data hungry in nature.
• Manual labelling of data is a costly and time-consuming
affair.
• E.g., ImageNet dataset curation started in 2006 and
continued till 2010 (~5 years of collecting and
labelling).
• The Internet contains a vast amount of unlabeled data
that has not yet been effectively utilised.
Motivation
• Can we learn a label-agnostic feature representation from
large unlabeled data that generalises to multiple
tasks?
• Unsupervised learning is hard, so we instead carry out
self-supervision to learn from data without labels.
Unlabeled Data → Neural Network → Representation
What is Self-Supervision?
• Generate labelled data from unlabeled data by
some means of automation, without much
human supervision.
• Train a neural network to predict the "generated"
labels to learn a better representation.
• Finally, fine-tune on the downstream
task with few samples.
What is Self-Supervision?
• We train a neural network by "generating" labels from
unlabeled data and then fine-tune on the downstream
task of interest.
Strategy for Self-Supervision
• From unlabeled data, generate tasks known as pretext
tasks for learning a better representation of the data via
a neural network encoder.
• Then transfer the learnt encoder to the downstream
task by replacing the head and training it.
• Two primary ways to do it:
o Non-contrastive method
o Contrastive method
How to create pretext task?
• Given a data input, we
would like to generate the
pretext task in such a way
that the model tries to predict or
reconstruct some part of the
data (or the entire data) itself.
• E.g., using one part of an
image to generate some
other part of the image.
• Then train a neural network in a
contrastive or non-contrastive
way for self-supervision.
• Hope that the learnt encoder
"generalizes" to the downstream
task.
SOURCE: Yann LeCun @EPFL - "Self-supervised learning: could machines learn like humans?" ([Link])
Non-Contrastive way
• Involves taking an input image, distorting it in some
way, and then trying to predict the distortion or the
missing content.
• Creates a supervised learning task by automatically
generating labels.
• Some common tasks are:
o Rotation
o Inpainting
o Colorization
o Jigsaw puzzle
o Counting objects
o Relative patch position
Rotation
• Rotate images and try to
predict the rotation applied.
• Idea: a model can predict the
rotation only if it has visual
common sense of how the object
looks without distortion.
• Rotate each image into 4 directions
(0°, 90°, 180°, 270°) and
try to predict the rotation.
• Treated as a classification problem.
Source: Unsupervised Representation
Learning by Predicting Image Rotations by
Gidaris et al., 2018
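The rotation pretext task is simple to sketch in PyTorch. This is a minimal illustration, not the paper's code; the helper name `make_rotation_batch` is mine. Training would pair the returned batch with a 4-way cross-entropy classification head.

```python
import torch

def make_rotation_batch(images):
    """Given a batch of images (B, C, H, W), return the 4 rotated copies
    (0°, 90°, 180°, 270°) and the rotation class labels 0..3."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    # labels: B zeros, then B ones, etc., matching the concatenation order
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels
```

Any classifier trained on `(rotated, labels)` with cross-entropy implements the pretext task; the labels come for free from the rotation itself.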
Relative patch position
• Given two patches, predict their relative position to each
other.
• The model tries to learn how different parts of images are
relatively placed, and thus how objects are arranged in the
real world.
• Relative to a single patch, we take the
surrounding 8 patches, number them,
and try to predict the index.
• Treated as a classification problem.
Source: Unsupervised Visual Representation Learning by Context
Prediction by Doersch et al., 2015
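The patch-pair sampling can be sketched as follows; this is an illustrative simplification (function name mine) that carves a 3×3 grid and labels one random neighbour of the center patch with its position index 0..7.

```python
import torch

def relative_patch_pair(image, patch=8):
    """Sample the center patch and one random neighbour from a 3x3 grid.
    image: (C, 3*patch, 3*patch). Returns (center, neighbour, label),
    where label in 0..7 indexes the neighbour's relative position."""
    grid = [(r, c) for r in range(3) for c in range(3)]   # index 4 = center
    center = image[:, patch:2*patch, patch:2*patch]
    label = torch.randint(8, (1,)).item()                 # which neighbour
    r, c = [g for i, g in enumerate(grid) if i != 4][label]
    neighbour = image[:, r*patch:(r+1)*patch, c*patch:(c+1)*patch]
    return center, neighbour, label
```

A Siamese encoder embeds both patches, and a classifier on the concatenated embeddings predicts the 8-way label. (The paper also jitters the patch locations to prevent trivial boundary-matching shortcuts, which is omitted here.)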
Jigsaw Puzzle
• Given 9 patches, jumble them up and then try to
predict which permutation was applied.
• Treated as a classification problem by indexing into a fixed
set of permutations.
Source: Unsupervised Learning of Visual Representations by Solving
Jigsaw Puzzles by Noroozi & Favaro, 2016
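A minimal sketch of the jigsaw labelling (names mine): shuffle the 9 patches by a permutation drawn from a fixed set, and use the permutation's index as the class label. Note the paper selects a subset of maximally distant permutations; here the first 64 lexicographic ones merely stand in for that set.

```python
import itertools
import random
import torch

# Stand-in for the paper's curated permutation set (selection heuristic omitted)
PERMUTATIONS = list(itertools.islice(itertools.permutations(range(9)), 64))

def jigsaw_example(patches):
    """patches: (9, C, h, w). Shuffle them by a random permutation and
    return the shuffled patches plus the permutation index to predict."""
    idx = random.randrange(len(PERMUTATIONS))
    perm = torch.tensor(PERMUTATIONS[idx])
    return patches[perm], idx
```

Each shuffled patch is encoded separately, and a head over the 9 concatenated embeddings classifies the permutation index.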
Inpainting
• Fill in a missing patch via inpainting.
• Can be treated as a regression problem for reconstructing
the missing pixels.
• Can also be treated as a
generative problem to
generate the missing pixels.
• So, it utilizes both
reconstruction and
adversarial losses for good results.
Source: Context Encoders: Feature Learning by Inpainting by
Pathak et al., 2016
Inpainting (Losses)
• The reconstruction loss tries to reconstruct the missing pixels.
• The adversarial loss tries to tell whether the image passed
is real or an inpainted one.
Source: Context Encoders: Feature Learning by Inpainting by Pathak
et al., 2016
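The joint objective can be sketched as a weighted sum of the two losses, as in Pathak et al. (who weight the reconstruction term at 0.999). The function name and the discriminator-logit interface are my simplifications of the generator-side loss.

```python
import torch
import torch.nn.functional as F

def context_encoder_loss(pred, target, disc_out_fake, lambda_rec=0.999):
    """Generator-side loss sketch: L2 reconstruction on the missing
    region plus an adversarial term that asks the discriminator to
    call the inpainted region real. lambda_rec weights the two."""
    rec = F.mse_loss(pred, target)
    adv = F.binary_cross_entropy_with_logits(
        disc_out_fake, torch.ones_like(disc_out_fake))
    return lambda_rec * rec + (1 - lambda_rec) * adv
```

The discriminator itself is trained with the usual real-vs-fake objective; only the generator sees this combined loss.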
Inpainting (comparison of loss
results)
Source: Context Encoders: Feature Learning by Inpainting by Pathak
et al., 2016
Masking possibilities
Results on different tasks
Colorization
• Tries to predict the colors from a black-and-white image.
• This is done in the Lab color space as opposed to the
RGB color space.
• L denotes perceptual
lightness; a and b denote the
color-opponent dimensions
(green-red and blue-yellow).
• Can be treated as a
reconstruction or a
generative problem.
Source: Colorful Image Colorization by Zhang et al.
2016a
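At its simplest, colorization is a network mapping the L channel to the a,b channels; the tiny model below (my naming, not the paper's architecture) shows the input/output shapes. Zhang et al. instead classify over quantized ab bins, which avoids the desaturated averages that plain regression tends to produce.

```python
import torch
import torch.nn as nn

class ColorizerSketch(nn.Module):
    """Minimal sketch: regress the a,b color channels from the L channel."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1),   # predict a and b channels
        )

    def forward(self, L):                      # L: (B, 1, H, W)
        return self.net(L)                     # ab: (B, 2, H, W)
```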
Colorization using split brain
autoencoder
• Divide the input into different channels, then use one
channel to predict another via separate encoders.
• One common way of splitting: separate color from
lightness, and use color to predict lightness and vice
versa.
• Another way is to split the image into depth and color
and use either of them to predict the other.
Source: Split-Brain Autoencoders: Unsupervised Learning by
Cross-Channel Prediction by Zhang et al. 2016b
Colorization using split brain
autoencoder
• Colorization and depth prediction examples are shown.
Source: Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction by
Zhang et al. 2016b
Alternate aggregation technique
Results
• Pre-trained using ImageNet labels and fine-tuned on
PASCAL data.
• Methods compared: inpainting, relative position,
colorization, jigsaw solver, split-brain autoencoder,
and image rotation.
• Baselines: fully supervised pre-training on ImageNet,
and no pre-training (with weight rescaling).
Source: Unsupervised Representation Learning by Predicting
Image Rotations by Gidaris et. Al. 2018
Contrastive self supervised learning
multi-modal contrastive learning
SimCLR
Data augmentations
Results
Effects of the projection head
Effects on the batch size
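SimCLR's NT-Xent objective can be sketched as below (function name mine): two augmented views of each image are projected, each view's positive is its counterpart, and all other 2B − 2 in-batch samples serve as negatives, which is why batch size matters so much.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss sketch. z1, z2: (B, d) projections of two augmented
    views of the same B images. tau is the temperature."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)    # (2B, d) unit vectors
    sim = z @ z.t() / tau                          # cosine similarities
    sim.fill_diagonal_(float('-inf'))              # exclude self-pairs
    B = z1.size(0)
    # positive for row i is row i+B (and vice versa)
    target = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, target)
```

Treating each row of the similarity matrix as logits over the other 2B − 1 samples reduces the contrastive objective to a standard cross-entropy.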
Momentum contrast
• MoCo decouples batch size from
the number of negatives.
• Uses a queue of past key embeddings as negative samples.
• Tackles SimCLR's need for very large batches.
Gradient update in MoCo
Results
Barlow twins
Advantages of BT
• Redundancy reduction in the features
• Avoids collapse in the embedding space
• It does not require large batches, or momentum
encoder
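The Barlow Twins objective from the bullets above can be sketched directly (function name mine): push the cross-correlation matrix of the two views' batch-normalised embeddings towards the identity, where the on-diagonal term enforces invariance and the off-diagonal term reduces redundancy.

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins loss sketch. z1, z2: (B, d) embeddings of two views."""
    B = z1.size(0)
    # standardise each feature dimension over the batch
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-5)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-5)
    c = z1.t() @ z2 / B                       # (d, d) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()    # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy
    return on_diag + lam * off_diag
```

Since the loss is computed per feature dimension rather than per sample pair, no negatives, large batches, or momentum encoder are needed, matching the advantages listed above.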
Results
DINO
Some visualization
Segmentation task using DINO
backbone
Pretext invariant SSL
PIRL with memory bank
Results
References
• lecture_12.pdf ([Link])
• Self-Supervised Representation Learning | Lil'Log ([Link])
• Week 10 · Deep Learning ([Link])
• Yann LeCun @EPFL - "Self-supervised learning: could machines learn like humans?" ([Link])