learning with few data
[Link]/2023-nldl-tutorial
Marcus Liwicki, Machine Learning
Luleå University of Technology
are you working on your PhD or finished recently?
did you ever
feel insignificant
doubt your skills
or
feel unchallenged?
You are not alone!
Marcus Liwicki, Machine Learning
Luleå University of Technology
[Link]/2023-nldl-tutorial
ELLIS member, WASP member
IEEE senior member, IAPR award winner, …
agenda
motivation
prior
approaches
end to end learning
transfer learning
clustering
representation learning
auto-encoding
contrastive learning
comparative summary
remarks on contrastive learning
and some spices in-between:
what I have learned during my life as a presenter
machine learning needs data
machine learning (ideal): Data, Labels, Priors
reality: Data, Priors, Labels (in practice, labels are the limiting factor)
goal: minimize human supervision with Data and Priors, relying on fewer Labels
how?
1. adding more unlabeled data or synthetic data
2. incorporating more prior (knowledge)
there are so many priors hidden in structure
including priors: 92.15 % (SotA: 88.2 %), better than Google
prior
experience (from earlier experiments)
proven architectures, meta parameters, …
knowledge (human reasoning)
correlating the given input details and identifying discriminative features
data (intrinsic or human induced)
sequential correlation, local correlation
filenames, folder structures, taxonomies
[Link]
[Link]
time to learn something about presentations ;)
should we use a dark background?
or white?
ok, enough of the torture
but why did so many of you torture each other?
Contrast is important
equity in the machine learning group
Marcus, Gustav, Pedro, Konstantina, Fotini, Christian, Kanjar, Vibha, Fredrik, Priyamvada, Saleha, György, Rajkumar, Oluwatosin, Homam, Mattias, Nosheen, Sana, Ali, András, Richa, Karl, Carl, Prakash, Lama, Elisa
Notice something? Almost 40 % are women.
machine learning for the welfare of society
thanks to previous and current PhDs
Michele Alberti, Vinay Pondenkandath, Gustav G. Pihlgren, Prakash Ch. Chhipa
overview of approaches
end to end learning
• transfer learning (A Survey on Deep Transfer Learning - 2018)
• Utilizes pretrained models, fine-tuned on application-specific data
• Requires less data for fine-tuning than training from scratch
• clustering – (Deep Clustering for Unsupervised Learning of Visual Features - 2018)
• Labelled data not required
representation learning
• auto-encoding – (Variational Autoencoder for Deep Learning of Images, Labels and Captions, 2016)
• Questionable if this is a good way to go – (A Pitfall of Unsupervised Pre-Training, 2017)
• contrastive learning (SimCLR - July 2020, SwAV – October 2020)
• Pre-training mechanism that utilizes application-specific unlabeled data
• Also compute-intensive, but can be scaled down
transfer learning
Sources: [Link], [Link]
remarks
• successful, but only the initial layers with low-level features are common and useful across applications
• cannot make use of unlabeled data
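A minimal sketch of this fine-tuning recipe, assuming PyTorch/torchvision (the ResNet-18 backbone, the 5-class head, and the optimizer settings are illustrative choices, not from the slides):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (torchvision >= 0.13 weights API).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained layers that carry generic low-level features.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the application-specific task
# (an assumed 5-class problem).
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head is trained, so far less labeled data is needed
# than when training from scratch.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```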
ImageNet pretraining works outside of natural images
footsteps for person identification
(88 % for 13 persons, previous SotA 77 %)
MS Singh, V Pondenkandath, B Zhou, P Lukowicz, M Liwicki
Transforming sensor data to the image domain for deep learning—An application to footstep detection, IJCNN 2017
ImageNet pre-training works often well
Linda Studer, Michele Alberti, Vinaychandran Pondenkandath, Pinar Goktepe, Thomas Kolonko, Andreas Fischer, Marcus Liwicki, Rolf Ingold:
A Comprehensive Study of ImageNet Pre-Training for Historical Document Image Analysis, ICDAR, 2019
shortcomings – ImageNet transfer learning
ImageNet-trained CNNs are biased towards texture
– Strongly biased towards recognizing textures rather than shapes
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2018, September). ImageNet-trained CNNs are biased towards texture; increasing shape bias
improves accuracy and robustness. In International Conference on Learning Representations.
ImageNet transfer learning in medical images
(figure: ImageNet → transfer learning → medical image domain, Retina DR and CheXpert datasets)
ImageNet transfer learning does not significantly affect performance on medical imaging tasks
– Ref: Raghu, M., Zhang, C., Kleinberg, J., & Bengio, S. (2019). Transfusion: Understanding transfer learning for medical imaging. Advances in Neural Information Processing Systems, 32.
– Task-specific learning: only the initial layers with low-level features are useful
Adapted from [Link]
ImageNet transfer learning in histopathology
Sharma, Y., Ehsan, L., Syed, S., & Brown, D. E. (2021). HistoTransfer: Understanding Transfer Learning for Histopathology. In 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI) (pp. 1-4). IEEE.
Gastrointestinal, breast cancer
ImageNet vs. SSL
Why is ImageNet supervised transfer learning sub-optimal?
Possibly, the ImageNet-trained model is overfitted to natural scenes and optimized for dataset-specific characteristics
clustering
group features with k-means and update the weights to optimize for these assignments
Source: [Link]
remarks
• Compute-intensive when applied to images
• Non-robust feature representations when features are extracted with pretrained models
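A rough sketch of the deep-clustering loop described above, assuming PyTorch and scikit-learn; `backbone`, `head`, `unlabeled_loader`, `optimizer`, and `k` are placeholders, and keeping all batches in memory is a simplification:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deep_cluster_epoch(backbone, head, unlabeled_loader, k, optimizer):
    criterion = nn.CrossEntropyLoss()

    # 1) extract features for all unlabeled images with the current backbone
    backbone.eval()
    with torch.no_grad():
        batches = [images for images in unlabeled_loader]
        features = torch.cat([backbone(images) for images in batches])

    # 2) group the features with k-means; cluster ids become pseudo-labels
    pseudo_labels = torch.as_tensor(
        KMeans(n_clusters=k, n_init=10).fit_predict(features.numpy())
    )

    # 3) update the weights so the network predicts its own cluster assignments
    backbone.train()
    offset = 0
    for images in batches:
        targets = pseudo_labels[offset:offset + len(images)]
        offset += len(images)
        loss = criterion(head(backbone(images)), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```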
agenda
motivation
prior
approaches
end to end learning
transfer learning
clustering
representation learning
auto-encoding – and alternatives
contrastive learning
comparative summary
remarks on contrastive learning
and some spices in-between:
what I have learned during my life as a presenter
Auto-Encoding – pre-training
INPUT → ENCODER → FEATURES → DECODER → OUTPUT
Auto-Encoding – classification
INPUT → ENCODER → FEATURES → CLASSIFIER → OUTPUT
“cat”
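A minimal sketch of the two stages above; the layer sizes, the 10-class head, and the MNIST-like 784-dimensional input are illustrative:

```python
import torch.nn as nn

# Stage 1: pre-train encoder + decoder on reconstruction (no labels needed).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

autoencoder = nn.Sequential(encoder, decoder)
reconstruction_loss = nn.MSELoss()   # minimize || decoder(encoder(x)) - x ||

# Stage 2: keep the pre-trained encoder and attach a classifier on its features,
# fine-tuned on the few labeled examples.
classifier = nn.Sequential(encoder, nn.Linear(32, 10))
classification_loss = nn.CrossEntropyLoss()
```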
a pitfall of unsupervised pre-training, 2017
a good auto-encoder (low reconstruction error) does not necessarily lead to better accuracy
alternative: use PCA or LDA for initialization
Will they converge? No! Better local minima?
Michele Alberti, Mathias Seuret, Vinaychandran Pondenkandath, Rolf Ingold, Marcus Liwicki
Historical Document Image Segmentation with LDA-Initialized Deep Neural Networks. ICDAR 2017
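A small sketch of the PCA-initialization alternative, assuming scikit-learn: the principal components of (unlabeled) input data become the initial weights of the first layer instead of random values. Layer sizes are illustrative, and the random array stands in for real inputs:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

def pca_init_linear(layer: nn.Linear, data: np.ndarray) -> None:
    # data has shape (num_samples, in_features); one component per output unit
    pca = PCA(n_components=layer.out_features)
    pca.fit(data)
    with torch.no_grad():
        # components_ has shape (out_features, in_features), matching the weight
        layer.weight.copy_(torch.from_numpy(pca.components_).float())
        layer.bias.zero_()

layer = nn.Linear(784, 64)
pca_init_linear(layer, np.random.rand(1000, 784))  # replace with real unlabeled inputs
```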
auto-encoding limitation
what we want vs. what we might get
variational auto-encoders
X → Encoder → N(μ, σ²) → z → Decoder → X′
Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational Bayes." 2013
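A compact sketch of the VAE forward pass and loss with the reparameterization trick (Kingma & Welling, 2013); the encoder/decoder architectures and dimensions are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 256)
        self.mu = nn.Linear(256, latent_dim)
        self.log_var = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        h = F.relu(self.encoder(x))
        mu, log_var = self.mu(h), self.log_var(h)
        # reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z), mu, log_var

def vae_loss(x, x_recon, mu, log_var):
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # KL divergence between N(mu, sigma^2) and the prior N(0, I)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```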
perceptual loss
X → Encoder → z → Decoder → X′ → another neural network → y′
X → another neural network → y
(the loss compares y′ with y)
Thorough investigation: Improving Image Autoencoder Embeddings with Perceptual Loss, 2020
And Oskar Sjögren (yesterday)
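A sketch of a perceptual loss as in the figure: the reconstruction is compared with the input in the feature space of another, frozen network rather than pixel-wise. The VGG-16 feature extractor is an illustrative choice, not the specific network from the cited work:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Frozen, pretrained feature extractor ("another neural network" in the figure).
perceptual_net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in perceptual_net.parameters():
    p.requires_grad = False

def perceptual_loss(x, x_reconstructed):
    y = perceptual_net(x)                       # features of the original image X
    y_prime = perceptual_net(x_reconstructed)   # features of the reconstruction X'
    return F.mse_loss(y_prime, y)
```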
try it out …
[Link]/2023-nldl-tutorial
[Link]
[Link]
[Link]
Contrastive Learning (CL)
Self-Supervised Method: allows the model to learn generic representations on unlabeled data
Method:
learn similarity between augmented representations of the same image
learn dissimilarity otherwise
Source: [Link]
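A minimal sketch of an NT-Xent-style contrastive loss (the form used in SimCLR); `z1` and `z2` are assumed to be embeddings of two augmented views of the same batch of images:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2N, d) unit-norm embeddings
    sim = z @ z.t() / temperature                 # pairwise cosine similarities
    n = z1.shape[0]
    # exclude self-similarity so it never counts as a positive or a negative
    sim.fill_diagonal_(float("-inf"))
    # positives: the i-th view in z1 matches the i-th view in z2, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```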
(not so) recent work in Contrastive Learning
Simple Framework for Contrastive Learning (SimCLR)
A Simple Framework for Contrastive Learning of Visual Representations (SimCLR v1), ICML 2020
Big Self-Supervised Models are Strong Semi-Supervised Learners (SimCLR v2), NeurIPS 2020
Momentum Contrast Learning (MoCo)
Momentum Contrast for Unsupervised Visual Representation Learning (MoCo v1), CVPR 2020
Improved Baselines with Momentum Contrastive Learning (MoCo v2), arXiv 2020
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (BYOL), NeurIPS 2020
Contrastive Learning with Clustering
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (SwAV), arXiv 2020
Comparative Summary on SOTA
Contrastive Learning
Clustering + Self-supervised
Self-Labelling
Source (IARAI): [Link]
• Remarks
• Priors (the augmentation mechanism) are more important than the learning method
• Obtains performance approximately equal to supervised methods with 10 % labelled data
it’s easy on natural images
distorted (augmented) views of the input visual
Human prior for visuals → Relevant augmentation
Size → Resize
Shape → Crop, Flip
Foreground-Background → Blur, Noise, Color schemes, Filtering
Angle → Flip, Rotation
Color spectrum → Contrast, Saturation
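The human priors in the table above roughly translate into a standard augmentation pipeline; a sketch with torchvision, where the parameter values are illustrative:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),              # size / shape priors
    transforms.RandomHorizontalFlip(),              # shape / angle priors
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),     # color spectrum prior
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),        # foreground-background prior
    transforms.ToTensor(),
])

# Two independently augmented views of the same image form a positive pair:
# view1, view2 = augment(image), augment(image)
```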
but does not work in other domains
(the same human-prior augmentations as above)
medical images, remote sensing imagery, non-obvious visual concepts
insufficiency of the human prior for distorted views
use two views of the same patient
Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., ... & Norouzi, M. (2021). Big self-supervised models advance medical image classification. In
Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3478-3488).
but wait … did we just use labels?
our approach: shifting focus from human prior to data prior
(figure: three setups compared)
supervised approach: Data, Labels, Human Priors (minimize human supervision)
self-supervised approach (on natural visual concepts): Data, Human Priors (augmentation), Labels
adapting the self-supervised approach to a specialized domain: reduce the human prior (augmentation) and incorporate the data prior (Data, Data Priors, Labels)
let us use the data prior
data prior: magnification levels (in the BreakHis data) are utilized to generate both views for the SSL input
the only human prior used is in the magnification sampling
achieves state-of-the-art classification results with only 20 % of the labels
Chhipa, P. C., Upadhyay, R., Pihlgren, G. G., Saini, R., Uchida, S., & Liwicki, M. (2022). Magnification Prior: A Self-Supervised Method for Learning Representations on
Breast Cancer Histopathological Images. arXiv preprint arXiv:2203.07707.
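A sketch of this view-pairing idea: the data prior (magnification levels of the same region) replaces the human augmentation prior. `images_by_magnification` is an assumed structure, not the paper's actual code:

```python
import random

# BreakHis provides the same tissue region at four magnifications.
MAGNIFICATIONS = [40, 100, 200, 400]

def magnification_pair(images_by_magnification):
    # the only human prior: how to sample the two magnification levels
    m1, m2 = random.sample(MAGNIFICATIONS, 2)
    # two magnifications of the same sample form the positive pair for SSL
    return images_by_magnification[m1], images_by_magnification[m2]
```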
ideas for data prior
temporal proximity
spatial proximity
sequential co-occurrence (BERT)
different modalities
more?
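One of the ideas listed above, temporal proximity, as a tiny sketch; the `frames` list and the gap parameter are illustrative:

```python
import random

def temporal_positive_pair(frames, max_gap=5):
    # frames: list of video frames ordered by time, len(frames) > max_gap;
    # two frames that are close in time are treated as two views of the same
    # content, i.e. a positive pair for contrastive learning.
    i = random.randrange(len(frames) - max_gap)
    j = i + random.randint(1, max_gap)
    return frames[i], frames[j]
```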
curious what more we can learn about presentation techniques?
btw., should we use slide numbers?
typical issues I observe at scientific conferences:
unconfident posture
filler sounds
angle and interaction
agenda
motivation
prior
approaches
end to end learning
transfer learning
clustering
representation learning
auto-encoding
contrastive learning
comparative summary
remarks on contrastive learning
and some spices in-between:
what I have learned during my life as a presenter
97’123’452
summary
end to end learning
• transfer learning
• clustering
representation learning
• auto-encoding
• PCA, LDA
• perceptual loss
• contrastive learning
meta learning (not covered today)
remarks on contrastive learning
SimCLR v1.0
• Key factor: K1 (similarity learning for positive pairs) + K2 (dissimilarity learning for negative pairs)
• Contribution: established benchmark performance in unsupervised contrastive learning
• Limitations: (1) large batch size due to positive + negative pairs; (2) mass gradient computation and backprop issue due to all (+ve and -ve) pairs

SimCLR v2.0
• Key factor: K1 + K2 on a task-agnostic big network, which is used in distillation for a task-specific small network
• Contribution: added enablement of semi-supervised learning through distillation
• Limitations: same as SimCLR v1.0, plus the usage of bigger networks

MoCo v1.0
• Key factor: K1 + K2 over a momentum encoder, where CL works as a dynamic dictionary lookup
• Contribution: showed unsupervised contrastive learning with a smaller batch size and less backpropagation of gradients
• Limitations: (1) mass gradient computation and backprop issue due to all (+ve and -ve) pairs (same as SimCLR, because the q-encoder backpropagates); (2) overhead of the dynamic dictionary queue

MoCo v2.0
• Key factor: MoCo v1.0 + a 2-layer MLP projection head
• Contribution: stronger baseline; outperformed SimCLR and MoCo v1.0
• Limitations: (1) mass gradient computation and backprop issue due to all (+ve and -ve) pairs (same as SimCLR, because the q-encoder backpropagates); (2) overhead of the dynamic dictionary queue

BYOL
• Key factor: K1 + momentum encoding + two separate networks (online and target)
• Contribution: achieves self-supervised CL without negative pairs; establishes benchmarks in the semi-supervised approach; robust to smaller batch sizes
• Limitation: complex pipeline with a large number of components, which makes the concept challenging to utilize

SwAV
• Key factor: K1 + "swapped" prediction mechanism + cluster assignment
• Contribution: achieves self-supervised CL without negative pairs; claims state of the art in unsupervised image clustering
• Limitations: (1) relatively complex loss computation due to the swapped prediction; (2) additional online cluster-assignment swapping

DINO
• Key factor: distillation with transformers
• Contribution: self-attention without supervision; moderate computation power
• Limitations: (1) more research required; (2) authors are not self-critical

Barlow Twins
• Key factor: redundancy reduction
• Contribution: minimize covariance across embedding dimensions; maximize invariance across samples
remarks on contrastive learning
CL is leading self-supervision and is a potential push for semi-supervised learning
CL in its current state is compute-intensive
batch size is huge
SimCLR: performance increases with a batch size of 2048
reason: a large number of negative pairs
requires an array of GPUs and sophisticated parallel processing
knowledge-distillation methods (BYOL 2020, SimSiam 2020) do not use negative pairs
batch size 512
however, embedding output size in the range of 4096
for non-natural images, a smaller batch size (128) is already good
reason: not RGB images, but simpler
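For reference, a sketch of the negative-free objective used by the SimSiam-style methods mentioned above: the predictor output for one view is pulled towards the stop-gradient projection of the other view, so no large batch of negative pairs is needed. `p1`, `p2` are assumed predictor outputs and `z1`, `z2` projector outputs:

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    # detach() implements the stop-gradient on the target branch
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    # symmetric loss over the two view assignments
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
```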
Remarks on Contrastive Learning
CL is leading self-supervision and is a potential push for semi-supervised learning.
CL in its current state is compute-intensive (batch size, negative pairs, and gradients), which makes direct (as-is) application challenging; it needs to be tailored (research potential) to custom and small-scale application requirements.
Contrastive methods are sensitive to the choice of image/data augmentation.
CL leverages application-specific but unlabeled data.
CL can be a benchmarking framework (different methods for different applications) for semi-supervised and even supervised tasks.
thanks to my colleagues
there is so much more I could share: [Link]/2023-nldl-tutorial
[Link]