
CSCI-GA 2271

Computer Vision
Lecture 2
Saining Xie
Courant Institute, NYU
[email protected]
Recap

Goals of Computer Vision


Goals of Computer Vision
Get a computer to understand
Goals of Computer Vision
Goal: Naming

Building

Person
Road
Goals of Computer Vision
Goal: Naming

Silver Center

Saining Xie
Asphalt concrete
Goals of Computer Vision
Goal: Naming

The image depicts a street corner view of a building with a distinctive urban architectural style, likely located in a city. The building has a stone facade with tall windows and a classic design, suggesting it is an institutional or educational structure. Several purple flags are visible on the building, each featuring the logo of New York University (NYU), indicating that this building is part of the NYU campus. The street appears active with pedestrians, and a few cars are present, further confirming the city setting…
Goals of Computer Vision
Goal: 3D Structure
Goals of Computer Vision
Goal: Actions

What can I do here?
Goals of Computer Vision
Goal: Matching
phylogeny of intelligence
538.8 million years ago
Cambrian era
“biological explosion”

“The evolution of the eye is likely to have been a catalyst for the explosion, initiating an arms race between organisms that were increasingly aware of their surroundings.”
https://www.nhm.ac.uk/discover/eyes-on-the-prize-evolution-of-vision.html
phylogeny of intelligence
50,000 – 150,000 years ago
Behavioral modernity

- Language
- Abstract thinking
- Symbolic behavior
“Who won the game?”
Language vs Visual Intelligence

“Where can I buy this mug?” “Which direction leads home?”

“what does this remind you of?”

[V-IRL – Ji et al. ECCV 2024]

Tasks requiring more visual intelligence ←→ Tasks requiring strong language capability

Language
- Abstract knowledge
- High information density
- Small gap between pre-training and testing
- Homogeneous data format on web
- Easy to distill
- Easy to impress humans! (“hard problems are easy”)
- Next token prediction

Vision vs. Language
Vision: Natural signals / observations               Language: Abstract knowledge
Low information density / redundant                  High information density
Huge gap between pre-training and testing            Small gap between pre-training and testing
Many moving parts, data hard to acquire              Homogeneous data format on web
Hard to distill, distribution shift                  Easy to distill
Hard to impress humans! (“easy problems are hard”)   Easy to impress humans! (“hard problems are easy”)
?                                                    Next token prediction

But vision is really important
(even for language understanding: grounding, TruthGPT, experiential computing, embodiment, …)

Self-Supervised Learning vs. Supervised Learning

NLP:
- Self-supervised: learning from “unlabeled*” text corpora via next word prediction (GPT-1/2/3)
- Supervised: fine-tuning with Reinforcement Learning from Human Feedback (RLHF) and instruction following (ChatGPT)

Vision:
- Self-supervised: learning from unlabeled natural images (MoCo, MAE)
- Supervised: fine-tuning with ImageNet classification labels, bounding boxes, etc.
Generative Models vs. Discriminative Models

Generative – Natural Language Generation (GPT):
- Autoregressive models
- Variational autoencoders
- GANs
- Diffusion models

Discriminative – Natural Language Understanding (BERT + finetuning):
- Current vision SSL
- Image classification
- Object recognition
- …

GPT unifies the tasks in NLP, but in vision, synthesis and analysis are still very much independent.
Tentative Syllabus (check website for updated schedule)
Week 1: Why Computer Vision Matters (Sept 5)
Week 2: Filtering, Detectors, Descriptors + Why Representation Learning Matters (Sept 12)
Assignment 0 due
Week 3: Deep Learning Basics, Backpropagation, AutoDiff (Sept 19)
Week 4: Training Deep Neural Networks: optimization, initialization, regularization, normalization (Sept 26)
Week 5: ECCV Conference field report; Remote Lecture: Detection and Segmentation (Oct 3)
Assignment 1 due, Proposal due
Week 6: Attention and Transformer Deep Dive (Oct 10)
Week 7: Self-supervised Learning and Multi-modal Learning (Oct 17)
Week 8: Generative Models 1 – GANs (Oct 24)
Assignment 2 due
Week 9: Generative Models 2 – VAEs, Diffusion Models, Flow-based Models (Oct 31)
Week 10: Visualizing and Understanding Neural Networks (Nov 7)
Week 11: Motions + Deep Learning on Spatiotemporal Data (Nov 14)
Week 12: 3D Vision: Cameras + Meshes, Point Cloud, NeRF, Gaussian Splatting (Nov 21)
Assignment 3 due
Week 13: Thanksgiving Recess, no class (Nov 28)
Week 14: Guest Lecture: Topics TBD (Dec 5)
Week 15: Project Final Presentation (Dec 12)
Final Project report / code due (Dec 20)
“Do you realize how lucky you are working
in AI / CV in 2024?”
Design vs. Learning
Hand-designed features
Typical CV pipeline back then (e.g., 2011)
And still doesn’t really work…

99%

Visualizing Object Detection Features, Vondrick et al., 2015


Every domain is different
Image Filtering
Let’s Take An Image
Let’s Fix Things
• We have noise in our image
• Let’s replace each pixel with a weighted average of its neighborhood
• The weights are the filter kernel, e.g., a 3×3 box filter:

1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9
Slide Credit: D. Lowe


1D Case

Signal: 10 12 9 11 10 11 12
Filter: 1/3 1/3 1/3

What’s the average of 9, 10, 12?
(a) 9   (b) 11.5   (c) 10.33   (d) 11.66

Output: 10.33
1D Case

Signal: 10 12 9 11 10 11 12
Filter: 1/3 1/3 1/3
Output: 10.33

Done! Next?
(a) 10.66   (b) 9.33   (c) 14.2   (d) 11.33

Output: 10.33 10.66

Next?
(a) 10.33   (b) 11.33   (c) 10   (d) 9.1

Output: 10.33 10.66 10


1D Case

Signal: 10 12 9 11 10 11 12
Filter: 1/3 1/3 1/3
Output: 10.33 10.66 10 10.66

1D Case

Signal: 10 12 9 11 10 11 12
Filter: 1/3 1/3 1/3
Output: 10.33 10.66 10 10.66 11

1D Case

  10 12 9 11 10 11 12
⁎ 1/3 1/3 1/3
= 10.33 10.66 10 10.66 11

You lose pixels (maybe…)


Filter “sees” only a few pixels at a time
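A minimal numpy sketch of this 1D box filtering (my own example, not from the slides; the 'valid' mode reproduces the lost-pixels behavior above):

import numpy as np

signal = np.array([10, 12, 9, 11, 10, 11, 12], dtype=float)
kernel = np.ones(3) / 3  # the 1/3 1/3 1/3 box filter

# 'valid' keeps only positions where the filter fully overlaps the signal,
# so a length-7 signal and a length-3 filter give 7 - 3 + 1 = 5 outputs.
# (np.convolve flips the kernel, but a symmetric box filter is unaffected.)
out = np.convolve(signal, kernel, mode='valid')
print(np.round(out, 2))  # [10.33 10.67 10.   10.67 11.  ]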
Applying a Linear Filter

Input (5×6):              Filter (3×3):    Output (3×4):
I11 I12 I13 I14 I15 I16   F11 F12 F13      O11 O12 O13 O14
I21 I22 I23 I24 I25 I26   F21 F22 F23      O21 O22 O23 O24
I31 I32 I33 I34 I35 I36   F31 F32 F33      O31 O32 O33 O34
I41 I42 I43 I44 I45 I46
I51 I52 I53 I54 I55 I56


Applying a Linear Filter

Place the filter over the top-left 3×3 window of the input (I11…I33):

O11 = I11*F11 + I12*F12 + … + I33*F33


Applying a Linear Filter

Slide the filter one pixel to the right (over I12…I34):

O12 = I12*F11 + I13*F12 + … + I34*F33


Applying a Linear Filter

Input (5×6):              Filter (3×3):
I11 I12 I13 I14 I15 I16   F11 F12 F13
I21 I22 I23 I24 I25 I26   F21 F22 F23
I31 I32 I33 I34 I35 I36   F31 F32 F33
I41 I42 I43 I44 I45 I46
I51 I52 I53 I54 I55 I56

How many times can we apply a 3×3 filter to a 5×6 image?
Applying a Linear Filter

Input (5×6):              Filter (3×3):    Output (3×4):
I11 I12 I13 I14 I15 I16   F11 F12 F13      O11 O12 O13 O14
I21 I22 I23 I24 I25 I26   F21 F22 F23      O21 O22 O23 O24
I31 I32 I33 I34 I35 I36   F21 F22 F23      O31 O32 O33 O34
I41 I42 I43 I44 I45 I46
I51 I52 I53 I54 I55 I56

Twelve times: the 3×4 output has one entry per valid filter placement.

Oij = Iij*F11 + Ii(j+1)*F12 + … + I(i+2)(j+2)*F33
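A direct nested-loop sketch of this “valid” cross-correlation, following the Oij formula (a hypothetical helper; scipy.ndimage.correlate is the library version):

import numpy as np

def cross_correlate_valid(image, filt):
    # O[i, j] = sum over (k, l) of I[i+k, j+l] * F[k, l]
    H, W = image.shape
    Mh, Mw = filt.shape
    out = np.zeros((H - Mh + 1, W - Mw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + Mh, j:j + Mw] * filt)
    return out

# A 5x6 image and a 3x3 filter give the 3x4 output shown above.
image = np.arange(30, dtype=float).reshape(5, 6)
filt = np.ones((3, 3)) / 9
print(cross_correlate_valid(image, filt).shape)  # (3, 4)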


Painful Details – Edge Cases

Convolution doesn’t keep the whole image. Suppose f is the image and g the filter.
Full – output wherever any part of g touches f.
Same – output the same size as f.
Valid – output only where g doesn’t fall off the edge of f.

f/g Diagram Credit: D. Lowe
Painful Details – Edge Cases

What to do about the “?” region beyond the border of f?

Symm: fold the sides over (reflect)
Circular/Wrap: wrap around to the opposite side
Pad/fill: pad with a constant value, often 0

f/g Diagram Credit: D. Lowe
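These three border strategies map directly onto numpy’s padding modes; a small sketch (the mode names are numpy’s, not the slides’):

import numpy as np

f = np.array([1, 2, 3, 4])

print(np.pad(f, 2, mode='symmetric'))  # [2 1 1 2 3 4 4 3]  fold sides over
print(np.pad(f, 2, mode='wrap'))       # [3 4 1 2 3 4 1 2]  wrap around
print(np.pad(f, 2, mode='constant'))   # [0 0 1 2 3 4 0 0]  pad/fill with 0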


Painful Details – Does it Matter?
(I’ve applied the filter per color channel)
Which padding did I use and why?
Input Image | Box Filtered (???) | Box Filtered (???)
Note – this is a zoom of the filtered image, not a filter of the zoomed image.

Painful Details – Does it Matter?
(I’ve applied the filter per color channel)
Input Image | Box Filtered (Symm Pad) | Box Filtered (Zero Pad)
Note – this is a zoom of the filtered image, not a filter of the zoomed image.


Practice with Linear Filters

0 0 0
0 1 0 ?
0 0 0

Original

Slide Credit: D. Lowe


Practice with Linear Filters

0 0 0
0 1 0
0 0 0

Original The Same!

Slide Credit: D. Lowe


Practice with Linear Filters

0 0 0
0 0 1 ?
0 0 0

Original

Slide Credit: D. Lowe


Practice with Linear Filters

0 0 0
0 0 1
0 0 0

Original Shifted
LEFT
1 pixel

Slide Credit: D. Lowe


Practice with Linear Filters

0 1 0
0 0 0 ?
0 0 0

Original

Slide Credit: D. Lowe


Practice with Linear Filters

0 1 0
0 0 0
0 0 0

Original Shifted
DOWN
1 pixel

Slide Credit: D. Lowe


Practice with Linear Filters

1/9 1/9 1/9

1/9 1/9 1/9 ?


1/9 1/9 1/9

Original

Slide Credit: D. Lowe


Practice with Linear Filters

1/9 1/9 1/9

1/9 1/9 1/9

1/9 1/9 1/9

Original Blur
(Box Filter)

Slide Credit: D. Lowe
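A quick numerical check of the practice filters so far (a sketch; scipy.ndimage.correlate applies the kernel without flipping it, matching the slides):

import numpy as np
from scipy import ndimage

image = np.random.rand(8, 8)

identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
shift    = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
box      = np.ones((3, 3)) / 9

same = ndimage.correlate(image, identity)  # the same!
left = ndimage.correlate(image, shift)     # shifted LEFT 1 pixel
blur = ndimage.correlate(image, box)       # blur (box filter)

assert np.allclose(same, image)
assert np.allclose(left[:, :-1], image[:, 1:])  # output pixel j reads input pixel j+1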


Practice with Linear Filters

0 0 0     1/9 1/9 1/9
0 2 0  −  1/9 1/9 1/9     ?
0 0 0     1/9 1/9 1/9

Original

Slide Credit: D. Lowe


Practice with Linear Filters

0 0 0     1/9 1/9 1/9
0 2 0  −  1/9 1/9 1/9
0 0 0     1/9 1/9 1/9

Original → Sharpened
(Accentuates difference from local average)

Slide Credit: D. Lowe


Sharpening

Slide Credit: D. Lowe


Properties – Linear

Assume: I is an image; f1, f2 are filters.
Linear: apply(I, f1 + f2) = apply(I, f1) + apply(I, f2)
Example: I is a white box on black, and f1, f2 are box filters of different sizes; filtering with their sum equals the sum of the two filtered images.

Note: I am showing filters un-normalized and blown up. They’re a smaller box filter (i.e., each entry is 1/(size^2)).
Properties – Shift-Invariant

Assume: I is an image, f is a filter.
Shift-invariant: shift(apply(I, f)) = apply(shift(I), f)
Intuitively: the output depends only on the filter’s neighborhood, not on absolute position.
Painful Details – Signal Processing

What we’ve been doing is often called “convolution” but is actually cross-correlation; a source of terrible confusion.

Cross-Correlation: filter in its original orientation. Convolution: filter flipped in x and y.
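A sketch making the distinction concrete: convolution equals cross-correlation with the kernel flipped in x and y, so the two differ only for asymmetric kernels:

import numpy as np
from scipy import ndimage

image = np.random.rand(6, 6)
k = np.arange(9, dtype=float).reshape(3, 3)  # deliberately asymmetric

xcorr = ndimage.correlate(image, k)  # original orientation
conv  = ndimage.convolve(image, k)   # flipped in x and y
assert np.allclose(conv, ndimage.correlate(image, k[::-1, ::-1]))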
Properties of Convolution
• Any shift-invariant, linear operation is a convolution (⁎)
• Commutative: f ⁎ g = g ⁎ f
• Associative: (f ⁎ g) ⁎ h = f ⁎ (g ⁎ h)
• Distributes over +: f ⁎ (g + h) = f ⁎ g + f ⁎ h
• Scalars factor out: kf ⁎ g = f ⁎ kg = k (f ⁎ g)
• Identity: convolving with the unit impulse e (a single one with all zeros) returns the image: f ⁎ e = f
Property List: K. Grauman
Questions?

• Nearly everything onwards is a convolution.


• This is important to get right.
Smoothing With A Box
Intuition: if the filter touches it, it gets a contribution.
Input Filter Output

1/9 1/9 1/9

1/9 1/9 1/9

1/9 1/9 1/9


Solution – Weighted Combination

Intuition: weight contributions according to closeness to center.

Box filter: Filter_ij ∝ 1

What’s this?
Filter_ij ∝ exp( −(x² + y²) / (2σ²) )
Recognize the Filter?
It’s a Gaussian!

Filter_ij ∝ (1 / (2πσ²)) · exp( −(x² + y²) / (2σ²) )

0.003 0.013 0.022 0.013 0.003
0.013 0.060 0.098 0.060 0.013
0.022 0.098 0.162 0.098 0.022
0.013 0.060 0.098 0.060 0.013
0.003 0.013 0.022 0.013 0.003
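A sketch that rebuilds this table by sampling the Gaussian on an integer grid and normalizing (σ = 1 is an assumption here; it reproduces the 5×5 values above):

import numpy as np

def gaussian_kernel(size, sigma):
    # Sample exp(-(x^2 + y^2) / (2 sigma^2)) on a grid centered at 0,
    # then normalize to sum to 1 (the 1/(2 pi sigma^2) constant cancels).
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

print(np.round(gaussian_kernel(5, sigma=1), 3))  # center 0.162, corners 0.003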
Smoothing With A Box & Gauss
Still have some speckles, but it’s not a big box
Input Box Filter Gauss. Filter
Gaussian Filters

σ=1 σ=2 σ=4 σ=8


filter = 21x21 filter = 21x21 filter = 21x21 filter = 21x21

Note: filter visualizations are independently normalized throughout the


slides so you can see them better
Applying Gaussian Filters
Applying Gaussian Filters
Input Image
(no filter)
Applying Gaussian Filters
σ=1
Applying Gaussian Filters
σ=2
Applying Gaussian Filters
σ=4
Applying Gaussian Filters
σ=8
Picking a Filter Size
Too small a filter → bad approximation of the Gaussian
Want size ≈ 6σ (covers 99.7% of the energy)
Left: far too small; right: slightly too small!
σ = 8, size = 21        σ = 8, size = 43
Runtime Complexity

Image size = N×N = 6×6
Filter size = M×M = 3×3

for ImageY in range(N):
    for ImageX in range(N):
        for FilterY in range(M):
            for FilterX in range(M):
                # multiply-accumulate

Time: O(N²M²)
Separability

Filter_ij ∝ (1 / (2πσ²)) · exp( −(x² + y²) / (2σ²) )

Filter_ij ∝ [ (1 / (√(2π)·σ)) · exp( −x² / (2σ²) ) ] · [ (1 / (√(2π)·σ)) · exp( −y² / (2σ²) ) ]
Separability
1D Gaussian ⁎ 1D Gaussian = 2D Gaussian
Image ⁎ 2D Gauss = Image ⁎ (1D Gauss ⁎ 1D Gauss)
                 = (Image ⁎ 1D Gauss) ⁎ 1D Gauss
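A sketch verifying separability numerically: two 1D passes (down the columns, then along the rows) match one pass with the full 2D kernel. For a 13×13 filter this is 2·13 = 26 multiplies per pixel instead of 169, about a 6.5× saving:

import numpy as np
from scipy import ndimage

sigma, size = 2.0, 13
image = np.random.rand(32, 32)

ax = np.arange(size) - size // 2
g1 = np.exp(-ax**2 / (2 * sigma**2))
g1 /= g1.sum()
g2 = np.outer(g1, g1)  # 2D Gaussian = outer product of two 1D Gaussians

full = ndimage.correlate(image, g2)
sep  = ndimage.correlate1d(ndimage.correlate1d(image, g1, axis=0), g1, axis=1)
assert np.allclose(full, sep)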
Runtime Complexity

Image size = N×N = 6×6
Filter size = M×1 = 3×1

for ImageY in range(N):
    for ImageX in range(N):
        for FilterY in range(M):
            # multiply-accumulate (vertical pass)

for ImageY in range(N):
    for ImageX in range(N):
        for FilterX in range(M):
            # multiply-accumulate (horizontal pass)

What are my compute savings for a 13×13 filter?

Time: O(N²M)
Why Gaussian?

Gaussian filtering removes parts of the signal above a certain frequency. Often noise is high frequency and signal is low frequency.
Where Gaussian Fails
Applying Gaussian Filters
σ=1
Why Does This Fail?
Means can be arbitrarily distorted by outliers

Signal 10 12 9 8 1000 11 10 12

Filter 0.1 0.8 0.1

Output 11.5 9.2 107.3 801.9 109.8 10.3

What else is an “average” other than a mean?


Non-linear Filters (2D)

Input (4×4):
 40  81  13  22
125 830  76  80
144  92 108  95
132 102 106  87

First 3×3 window: [040, 081, 013, 125, 830, 076, 144, 092, 108]
Sorted: [013, 040, 076, 081, 092, 108, 125, 144, 830] → median = 92

Next 3×3 window (one row down): [830, 076, 080, 092, 108, 095, 102, 106, 087]
Sorted: [076, 080, 087, 092, 095, 102, 106, 108, 830] → median = 95
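A sketch of the same computation with numpy/scipy (median_filter is the sliding-window version used on the next slides):

import numpy as np
from scipy import ndimage

window = np.array([[ 40,  81,  13],
                   [125, 830,  76],
                   [144,  92, 108]])
print(np.median(window))  # 92.0 -- the 830 outlier can't drag the result up

noisy = np.random.rand(16, 16)
denoised = ndimage.median_filter(noisy, size=3)  # 3x3 window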
Applying Median Filter
Median
Filter
(size=3)
Applying Median Filter
Median
Filter
(size = 7)
Is Median Filtering Linear?

1 1 1     0 0 0     1 1 1
1 1 2  +  0 1 0  =  1 2 2
2 2 2     0 0 0     2 2 2

Median of each: 1 + 0 ≠ 2
The median of the sum (2) is not the sum of the medians (1), so median filtering is not linear.

Example from (I believe): Kristen Grauman

Some Examples of Filtering
Filtering – Sharpening

Image − Smoothed = Details
Filtering – Sharpening
Image + α·Details = “Sharpened”   (α = 1)

Filtering – Sharpening
Image + α·Details = “Sharpened”   (α = 0)

Filtering – Sharpening
Image + α·Details = “Sharpened”   (α = 2)

Filtering – Sharpening
Image + α·Details = “Sharpened”   (α = 0)

Filtering – Extreme Sharpening
Image + α·Details = “Sharpened”   (α = 10)
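A sketch of this sharpening recipe, often called unsharp masking (the α and σ values here are arbitrary choices, not from the slides):

import numpy as np
from scipy import ndimage

def sharpen(image, alpha=1.0, sigma=2.0):
    smoothed = ndimage.gaussian_filter(image, sigma)
    details = image - smoothed       # Image - Smoothed = Details
    return image + alpha * details   # alpha = 0 returns the image unchanged

image = np.random.rand(32, 32)
assert np.allclose(sharpen(image, alpha=0), image)
extreme = sharpen(image, alpha=10)   # exaggerated details, as above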
Filtering

What’s this Filter?


[-1 0 1]          [-1 0 1]ᵀ
Derivative Dx     Derivative Dy
Images as Functions or Points
Key idea: we can treat an image as a point in R^(H×W) or as a function of (x, y).

∇I(x, y) = ( ∂I/∂x (x, y), ∂I/∂y (x, y) )

∂I/∂x (x, y): how much the intensity of the image changes as you go horizontally at (x, y). (Often called Ix; likewise ∂I/∂y is called Iy.)
Images as Functions
Image is a function f(x, y).

Remember:     ∂f(x,y)/∂x = lim_{ε→0} [ f(x+ε, y) − f(x, y) ] / ε

Approximate:  ∂f(x,y)/∂x ≈ [ f(x+1, y) − f(x, y) ] / 1      → filter [-1 1]

Another one:  ∂f(x,y)/∂x ≈ [ f(x+1, y) − f(x−1, y) ] / 2    → filter [-1 0 1]
Other Differentiation Operators

Prewitt (horizontal):     Prewitt (vertical):
−1 0 1                     1  1  1
−1 0 1                     0  0  0
−1 0 1                    −1 −1 −1

Sobel (horizontal):       Sobel (vertical):
−1 0 1                     1  2  1
−2 0 2                     0  0  0
−1 0 1                    −1 −2 −1

Why not just use [-1,1] or [-1,0,1]?
- A single-row derivative is sensitive to noise. Applying it over 3 rows instead of one averages the gradient across those rows, which softens the noise.
- But plain averaging (Prewitt) smooths a little too much: applied to one specific row, we lose much of what makes the detail of that row, which is why Sobel weights the center row more heavily.
Image Gradient
Compute derivatives Ix and Iy with filters

Ix Iy
Image Gradient
Compute derivatives Ix and Iy with filters

Ix Iy
Image Gradient Magnitude
Gradient Magnitude: (Ix² + Iy²)^(1/2)
Gives rate of change at each pixel
Image Gradient Magnitude
Gradient Magnitude: (Ix² + Iy²)^(1/2)
Gives rate of change at each pixel
Image Gradient Direction
Gradient Direction: atan2(Iy, Ix)
Gives direction of change at each pixel

∇f = (∂f/∂x, 0)        ∇f = (0, ∂f/∂y)        ∇f = (∂f/∂x, ∂f/∂y)
Figure Credit: S. Seitz
Image Gradient Direction
Gradient Direction: atan2(Iy, Ix)
Gives direction of change at each pixel
Image Gradient Direction
Gradient Direction: atan2(Iy, Ix)
Gives direction of change at each pixel
Image Gradient Direction
Gradient Direction: atan2(Iy, Ix)
Gives direction of change at each pixel

I’m making the lightness equal to gradient magnitude
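A sketch computing Ix, Iy, the gradient magnitude, and the gradient direction with Sobel filters (scipy’s sobel differentiates along the given axis):

import numpy as np
from scipy import ndimage

image = np.random.rand(32, 32)

Ix = ndimage.sobel(image, axis=1)   # horizontal derivative
Iy = ndimage.sobel(image, axis=0)   # vertical derivative

magnitude = np.sqrt(Ix**2 + Iy**2)  # rate of change at each pixel
direction = np.arctan2(Iy, Ix)      # direction of change, in radians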


Filtering – Bonus
Filtering – Missing Data
Oh no! Missing data!
(and we know where)

Common with many non-normal cameras (e.g., depth cameras)


Filtering – Missing Data

Filter the image (missing pixels set to 0):  Image ⁎ G
Filter the binary mask the same way:         Mask ⁎ G
Per-element divide:                          (Image ⁎ G) / (Mask ⁎ G)
Filtering – Missing Data

Before
Filtering – Missing Data

After
Filtering – Missing Data

After (without missing data)
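A sketch of this trick, often called normalized convolution: blur the zero-filled image and the binary mask with the same Gaussian, then divide per-element:

import numpy as np
from scipy import ndimage

image = np.random.rand(32, 32)
mask = (np.random.rand(32, 32) > 0.3).astype(float)  # 1 = valid, 0 = missing
observed = image * mask                               # missing pixels zeroed

num = ndimage.gaussian_filter(observed, sigma=2)
den = ndimage.gaussian_filter(mask, sigma=2)
filled = num / np.maximum(den, 1e-8)  # per-element divide; avoids 0/0 in holes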


Edges
Where do Edges Come From?
Depth / Distance
Discontinuity

Why?
Where do Edges Come From?
Surface Normal / Orientation
Discontinuity

Why?
Where do Edges Come From?
Surface Color / Reflectance Properties
Discontinuity
Illumination
Discontinuity
Recap: Image Gradient
Compute derivatives Ix and Iy with filters

Ix Iy
Compute derivatives Ix and Iy with filters

Ix Iy
Gradient Magnitude: (Ix² + Iy²)^(1/2)
Gives rate of change at each pixel
Gradient Direction: atan2(Iy, Ix)
Gives direction of change at each pixel

Showing the gradient direction at every pixel


Gradient Direction
atan2(Iy,Ix): orientation
Why is there structure at 1 and not at 2?

1
2
Gaussian Derivative Filter
1 pixel 3 pixels 7 pixels

Removes noise, but blurs edge


Slide Credit: D. Forsyth
Filters We’ve Seen

           Smoothing       Derivative
Example    Gaussian        Derivative of Gaussian
Goal       Remove noise    Find edges

Problems: edge finding is ill-defined
Image | human segmentation | gradient magnitude
Hand-designed features
Typical CV pipeline back then (e.g., 2011)
And still doesn’t really work…

99%

Visualizing Object Detection Features, Vondrick et al., 2015


“Do you realize how lucky you are working
in AI / CV in 2024?”

So, what changed?


Representation learning!
Why hand design was doomed to fail…

3^361 possible states?
(>> number of atoms in the universe)

Figure: AlphaGo vs Lee Sedol G4 move 78

256^(3×600×800) possible states?

Figure: detectron2
• In some cases, we can directly “use” the useful representations.

• In many applications, we need to “transfer” the useful


representations to some downstream tasks.
Architecture: what to train? (inductive biases)
Objective: how to train?
Data: where to train?

Pre-training data: a huge, diverse corpus; general domain; loosely organized; not labeled.
Downstream data: not huge, or not diverse, or both; task-specific; carefully curated domain; task-specific annotations.
Large pre-trained language models

Zhao, Wayne Xin, et al. "A Survey of Large Language Models." arXiv preprint arXiv:2303.18223 (2023).
Large pre-trained language models

Data Architecture Objective


Visual Pre-training has a longer history…

[Deng et al., 2009]

Data Architecture Objective


Visual Pre-training has a longer history…
Arguably the first major success of “pre-training”…

Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic
segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
Visual Pre-training has a longer history…

End-to-end Object Detection – Faster/Mask-RCNN

Offline CNN features → Backbone pre-training

Ren, He, Girshick, Sun, Faster R-CNN: Towards Real-Time Object


Detection with Region Proposal Networks, NeurIPS ’15
Architecture / Objective / Data
1. How to design neural network architectures

PDP LeNet AlexNet DSN ResNet ResNeXt ViT ConvNeXt v1/v2

1986 1989 2012 2014 2015 2017 2019 2020 2022


Architecture / Objective / Data
2. Training objectives beyond supervised classification:
Are labels necessary?

Various MoCo v3 SLIP, DINO


RBM Pretext tasks BERT / GPT MoCo SimCLR BYOL MAE CLIP

2006 2012 - 2018 2019 2020 2021 2022


Architecture / Objective / Data
3. What data to use for pre-training

2021 2022
Architecture / Objective / Data
1. How to design neural network architectures

PDP LeNet AlexNet DSN ResNet ResNeXt ViT ConvNeXt v1/v2

1986 1989 2012 2014 2015 2017 2019 2020 2022


Connectionism
[A general framework for parallel distributed processing. Rumelhart et al., 1986]

(PDP group is now at )

1986
Convolutional Neural Networks
[Learning Internal Representations by Error Propagation. Rumelhart et al., 1986]

ConvNet using BP

- Receptive field
- Translation equivariance
- Trained by error propagation

1986
LeNet
[Backpropagation Applied to Handwritten Zip Code Recognition, LeCun et al., 1989]

LeNet

1989
AlexNet
[Krizhevsky, Sutskever and Hinton, 2012]

[Deng et al., 2009]


[Russakovsky et al., 2015]

AlexNet

2012
