Computer Vision
Lecture 2
Saining Xie
Courant Institute, NYU
[email protected]
Recap
Goals of Computer Vision
Goal: Naming
[Figure: street scene labeled Building / Person / Road, and more specifically Silver Center / Saining Xie / Asphalt concrete]
"What can I do here?"
Goals of Computer Vision
Goal: Matching
Phylogeny of intelligence
538.8 million years ago: the Cambrian era, the "biological explosion"
- Language
- Abstract thinking
- Symbolic behavior
“Who won the game?”
Language vs Visual Intelligence
Self-Supervised Learning vs. Supervised Learning
GPT unifies the tasks in NLP, but in vision, synthesis and analysis are still very much independent.
Tentative Syllabus (check website for updated schedule)
Week 1: Why Computer Vision Matters (Sept 5)
Week 2: Filtering, Detectors, Descriptors + Why Representation Learning Matters (Sept 12)
Assignment 0 due
Week 3: Deep Learning Basics, Backpropagation, AutoDiff (Sept 19)
Week 4: Training Deep Neural Networks: optimization, initialization, regularization, normalization (Sept 26)
Week 5: ECCV Conference field report; Remote Lecture: Detection and Segmentation (Oct 3)
Assignment 1 due, Proposal due
Week 6: Attention and Transformer Deep Dive (Oct 10)
Week 7: Self-supervised Learning and Multi-modal Learning (Oct 17)
Week 8: Generative Models 1 – GANs (Oct 24)
Assignment 2 due
Week 9: Generative Models 2 – VAEs, Diffusion Models, Flow-based Models (Oct 31)
Week 10: Visualizing and Understanding Neural Networks (Nov 7)
Week 11: Motions + Deep Learning on Spatiotemporal Data (Nov 14)
Week 12: 3D Vision: Cameras + Meshes, Point Cloud, NeRF, Gaussian Splatting (Nov 21)
Assignment 3 due
Week 13: Thanksgiving Recess, no class (Nov 28)
Week 14: Guest Lecture: Topics TBD (Dec 5)
Week 15: Project Final Presentation (Dec 12)
Final Project report / code due (Dec 20)
“Do you realize how lucky you are working in AI / CV in 2024?”
Design vs. Learning
Hand-designed features
Typical CV pipeline back then (e.g., 2011)
And still doesn’t really work…
1D Case
Signal: 10 12 9 11 10 11 12
Output: 10.33
Done! Next? (a) 10.66 (b) 9.33 (c) 14.2 (d) 11.33
Signal ⁎ [1/3 1/3 1/3] = 10.33 10.66 10 10.66 11
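As a quick sanity check of the 1D case, a minimal sketch (mine, not from the handout), assuming NumPy is available:

import numpy as np

signal = np.array([10, 12, 9, 11, 10, 11, 12], dtype=float)
box = np.ones(3) / 3.0                     # the [1/3 1/3 1/3] filter

# For a symmetric filter, correlation and convolution coincide;
# 'valid' keeps only positions where the filter fully overlaps the signal.
print(np.convolve(signal, box, mode='valid'))
# [10.333 10.667 10.    10.667 11.   ]

The second output is (12 + 9 + 11) / 3 ≈ 10.67, i.e., option (a) above.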
[Figure: 2D cross-correlation as a sliding window. A 3x3 filter F11..F33 slides across the image I11..I46; each output value O11..O34 is the sum of the filter weights times the pixels it covers.]
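The sliding-window picture maps directly onto two loops over output positions plus a sum over the filter window. A minimal sketch (mine, assuming NumPy; 'valid' output size only):

import numpy as np

def cross_correlate_2d(image, filt):
    # Naive sliding-window cross-correlation, 'valid' region only.
    M = filt.shape[0]                      # assume a square MxM filter
    H, W = image.shape
    out = np.zeros((H - M + 1, W - M + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # weighted sum of the MxM window under the filter
            out[y, x] = np.sum(image[y:y + M, x:x + M] * filt)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # stand-in for I11..I66
box3 = np.full((3, 3), 1 / 9)
print(cross_correlate_2d(image, box3).shape)       # (4, 4)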
[Diagram: signals f and g and their correlation. Diagram Credit: D. Lowe]
Painful Details – Edge Cases
What to do about the “?” region?
Filter (identity):
0 0 0
0 1 0
0 0 0
→ Original (unchanged)

Filter:
0 0 0
0 0 1
0 0 0
→ Original shifted LEFT by 1 pixel

Filter:
0 1 0
0 0 0
0 0 0
→ Original shifted DOWN by 1 pixel
[Figure: Original vs. Blur (Box Filter)]
Filter:
0 0 0
0 2 0
0 0 0
minus the box filter (every entry 1/9), applied to the Original → ?

Properties – Linear
A(f + g) = A(f) + A(g)
Note: I am showing the filters un-normalized and blown up; the actual filter is a smaller box filter (i.e., each entry is 1/size²).
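A small numerical check of the construction above (twice an impulse minus a box filter): because filtering is linear, one pass with the combined kernel equals 2·image minus the box blur. A minimal sketch (mine, not course code), assuming SciPy's ndimage:

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((32, 32))               # stand-in grayscale image

impulse = np.zeros((3, 3))
impulse[1, 1] = 1.0
box = np.full((3, 3), 1 / 9)
sharpen = 2 * impulse - box                # "2 in the middle" minus the box filter

# mode='constant' zero-pads the border (one answer to the "?" region question).
combined = ndimage.correlate(image, sharpen, mode='constant')
two_step = 2 * image - ndimage.correlate(image, box, mode='constant')
print(np.allclose(combined, two_step))     # True, by linearity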
Properties – Shift-Invariant
A(shift(f)) = shift(A(f)): filtering a shifted image gives the same result as shifting the filtered image.
Painful Details – Signal Processing
Cross-Correlation (original orientation) vs. Convolution (filter flipped in x and y)
Properties of Convolution
• Any shift-invariant, linear operation is a convolution (⁎)
• Commutative: f ⁎ g = g ⁎ f
• Associative: (f ⁎ g) ⁎ h = f ⁎ (g ⁎ h)
• Distributes over +: f ⁎ (g + h) = f ⁎ g + f ⁎ h
• Scalars factor out: kf ⁎ g = f ⁎ kg = k (f ⁎ g)
• Identity: convolving with the unit impulse (a single one surrounded by zeros) returns the original: f ⁎ δ = f
Property List: K. Grauman
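These properties are easy to verify numerically on 1D signals. A minimal sketch (mine), assuming NumPy; the signal lengths are arbitrary:

import numpy as np

rng = np.random.default_rng(1)
f, g, h = rng.random(8), rng.random(5), rng.random(5)
delta = np.array([1.0])                    # identity filter: a single one

conv = np.convolve                         # full convolution
print(np.allclose(conv(f, g), conv(g, f)))                       # commutative
print(np.allclose(conv(conv(f, g), h), conv(f, conv(g, h))))     # associative
print(np.allclose(conv(f, g + h), conv(f, g) + conv(f, h)))      # distributes over +
print(np.allclose(conv(3 * f, g), 3 * conv(f, g)))               # scalars factor out
print(np.allclose(conv(f, delta), f))                            # identity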
Questions?
What’s this?
Filter_ij ∝ exp( −(x² + y²) / (2σ²) )
Recognize the Filter? It’s a Gaussian!
Filter_ij ∝ (1 / (2πσ²)) · exp( −(x² + y²) / (2σ²) )
→ it factors into a product of 1D Gaussians:
Filter_ij ∝ [ (1 / (√(2π)·σ)) · exp( −x² / (2σ²) ) ] · [ (1 / (√(2π)·σ)) · exp( −y² / (2σ²) ) ]
Separability
1D Gaussian ⁎ 1D Gaussian = 2D Gaussian
Image ⁎ 2D Gauss = Image ⁎ (1D Gauss ⁎ 1D Gauss )
= (Image ⁎ 1D Gauss) ⁎ 1D Gauss
[Figure: 1D Gaussian ⁎ 1D Gaussian = 2D Gaussian]
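A minimal sketch of the separability claim (mine, assuming NumPy): a sampled 2D Gaussian equals the outer product of two sampled 1D Gaussians.

import numpy as np

def gaussian_1d(sigma, radius):
    # Samples of a 1D Gaussian, normalized to sum to 1.
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

g1 = gaussian_1d(sigma=1.0, radius=3)      # a 7-tap 1D Gaussian
g2 = np.outer(g1, g1)                      # 2D Gaussian as an outer product

# Direct 2D construction: exp(-(x^2 + y^2) / (2 sigma^2)), then normalize.
x = np.arange(-3, 4)
xx, yy = np.meshgrid(x, x)
direct = np.exp(-(xx**2 + yy**2) / 2.0)
direct /= direct.sum()
print(np.allclose(g2, direct))             # True: the 2D Gaussian separates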
Runtime Complexity
Image size = NxN = 6x6
Filter size = Mx1 = 3x1
[Figure: a 3x1 filter F1 F2 F3 sliding over the 6x6 image I11..I66]

for ImageY in range(N):
    for ImageX in range(N):
        for FilterY in range(M):
            …

for ImageY in range(N):
    for ImageX in range(N):
        for FilterX in range(M):
            …
What are my compute savings for a 13x13 filter?
Two 1D passes cost 2·13 = 26 multiply-adds per pixel, versus 13·13 = 169 for the full 2D filter: roughly 6.5× fewer.
Time: O(N²M)
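To see the savings in code: filtering with the full M×M Gaussian and with two 1D passes gives the same image (up to floating-point error), at M² vs. 2M multiply-adds per pixel. A sketch (mine), assuming SciPy's ndimage:

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((256, 256))

M, sigma = 13, 2.0
x = np.arange(M) - M // 2
g1 = np.exp(-x**2 / (2 * sigma**2))
g1 /= g1.sum()                                           # 13-tap 1D Gaussian
g2 = np.outer(g1, g1)                                    # full 13x13 kernel

full = ndimage.correlate(image, g2, mode='nearest')
sep = ndimage.correlate1d(image, g1, axis=0, mode='nearest')
sep = ndimage.correlate1d(sep, g1, axis=1, mode='nearest')

print(np.allclose(full, sep))                            # True
print(M * M, 'vs', 2 * M, 'multiply-adds per pixel')     # 169 vs 26, ~6.5x fewer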
Why Gaussian?
Signal 10 12 9 8 1000 11 10 12
Sort the 3x3 neighborhood: [076, 080, 087, 092, 095, 102, 106, 108, 830] → median = 095
Applying Median Filter (size = 3)
Applying Median Filter (size = 7)
Is Median Filtering Linear?
1 1 1     0 0 0     1 1 1
1 1 2  +  0 1 0  =  1 2 2
2 2 2     0 0 0     2 2 2
Median of each patch: 1 + 0 ≠ 2, so median filtering is not linear.
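A minimal sketch of both points (mine, assuming SciPy's ndimage): the median removes the outlier that a box filter would only smear out, and the 3x3 counterexample above confirms it is not linear.

import numpy as np
from scipy import ndimage

signal = np.array([10, 12, 9, 8, 1000, 11, 10, 12], dtype=float)
print(ndimage.median_filter(signal, size=3))           # the 1000 spike is gone
print(np.convolve(signal, np.ones(3) / 3, mode='same'))  # a box filter just smears it

A = np.array([[1, 1, 1], [1, 1, 2], [2, 2, 2]])
B = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
print(np.median(A), np.median(B), np.median(A + B))    # 1.0 0.0 2.0, and 1 + 0 != 2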
Image − Blur = Details
Filtering – Sharpening
Image + α · Details = "Sharpened"
[Figures: the result for α = 0 (unchanged), α = 1, α = 2, and α = 10 (extreme sharpening)]
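A minimal sketch of the whole sharpening recipe (mine, assuming SciPy's ndimage; the image is a random stand-in): blur, subtract to get the details, then add α times the details back.

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((64, 64))                       # stand-in grayscale image

blur = ndimage.uniform_filter(image, size=9)       # box-filter blur
details = image - blur                             # what the blur removed

for alpha in (0, 1, 2, 10):
    sharpened = image + alpha * details            # alpha = 0 returns the original
    print(alpha, float(sharpened.min()), float(sharpened.max()))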
Filtering
Derivative Dx Derivative Dy
Images as Functions or Points
Key idea: we can treat an image as a point in R^(H×W) or as a function of x, y.
Approximate: ∂f(x, y)/∂x ≈ ( f(x+1, y) − f(x, y) ) / 1   → filter [-1 1]
Another one: ∂f(x, y)/∂x ≈ ( f(x+1, y) − f(x−1, y) ) / 2   → filter [-1 0 1]
Other Differentiation Operators
Prewitt
  Horizontal:        Vertical:
  −1 0 1              1  1  1
  −1 0 1              0  0  0
  −1 0 1             −1 −1 −1
Sobel
  Horizontal:        Vertical:
  −1 0 1              1  2  1
  −2 0 2              0  0  0
  −1 0 1             −1 −2 −1
Why not just use [-1, 1] or [-1, 0, 1]?
- A single row of pixels is sensitive to noise, so we do not apply the derivative to one row alone but across 3 rows: this gives an average gradient over those rows, which softens possible noise.
- But averaging can go too far: when we care about one specific row, we lose much of what makes the detail of that specific row.
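A minimal sketch (mine, assuming SciPy's ndimage) of applying the Sobel kernels from the table above; the horizontal kernel is the central difference [-1 0 1] combined with [1 2 1] smoothing across rows, which is what makes it less noise-sensitive than a bare [-1 0 1].

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((64, 64))                       # stand-in grayscale image

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)      # horizontal derivative
sobel_y = np.array([[ 1,  2,  1],
                    [ 0,  0,  0],
                    [-1, -2, -1]], dtype=float)    # vertical derivative

# Sobel separates: [1 2 1]^T (smoothing) times [-1 0 1] (derivative).
print(np.allclose(np.outer([1, 2, 1], [-1, 0, 1]), sobel_x))   # True

Ix = ndimage.correlate(image, sobel_x, mode='nearest')
Iy = ndimage.correlate(image, sobel_y, mode='nearest')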
Image Gradient
Compute derivatives Ix and Iy with filters
Ix Iy
Image Gradient Magnitude
Gradient Magnitude: (Ix² + Iy²)^(1/2)
Gives rate of change at each pixel
Image Gradient Direction
Gradient Direction: atan2(Iy, Ix)
Gives direction of change at each pixel
∇f = (∂f/∂x, 0)    ∇f = (0, ∂f/∂y)    ∇f = (∂f/∂x, ∂f/∂y)
Figure Credit: S. Seitz
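From the two derivative images, magnitude and direction are per-pixel operations. A minimal sketch (mine, assuming SciPy's ndimage, whose sobel takes axis=0 for y and axis=1 for x):

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((64, 64))

Ix = ndimage.sobel(image, axis=1, mode='nearest')  # derivative along x
Iy = ndimage.sobel(image, axis=0, mode='nearest')  # derivative along y

magnitude = np.sqrt(Ix**2 + Iy**2)                 # rate of change at each pixel
direction = np.arctan2(Iy, Ix)                     # direction of change, in radians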
Filtering – Missing Data
(Image ⁎ filter), divided per-element by (Binary Mask ⁎ filter)
Filtering – Missing Data
[Figures: before and after]
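One way to read the per-element divide above: blur the masked image and blur the mask with the same filter, then divide, so each output pixel is a weighted average over only the valid pixels. A minimal sketch under that reading (mine, assuming SciPy's ndimage):

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((64, 64))
mask = (rng.random((64, 64)) > 0.3).astype(float)  # 1 = valid pixel, 0 = missing

x = np.arange(-3, 4)
g1 = np.exp(-x**2 / 4.0)
g1 /= g1.sum()
g = np.outer(g1, g1)                               # small 2D Gaussian

num = ndimage.correlate(image * mask, g, mode='nearest')   # Image * Mask, filtered
den = ndimage.correlate(mask, g, mode='nearest')           # Binary Mask, filtered
filled = np.where(den > 1e-6, num / np.maximum(den, 1e-6), 0.0)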
Why? Where do Edges Come From?
- Surface Normal / Orientation Discontinuity
- Surface Color / Reflectance Discontinuity
- Illumination Discontinuity
Recap: Image Gradient
Compute derivatives Ix and Iy with filters
Gradient Magnitude: (Ix² + Iy²)^(1/2) gives the rate of change at each pixel
Gradient Direction: atan2(Iy, Ix) gives the direction of change at each pixel
Gaussian Derivative Filter (1 pixel, 3 pixels, 7 pixels)
Figure: detectron2
• In some cases, we can directly “use” the useful representations.
Inductive biases
Objective: how to train?
Data: what to train on? A huge, diverse, general-domain corpus, loosely organized and not labeled, vs. task-specific annotations.
Large pre-trained language models
Zhao, Wayne Xin, et al. "A Survey of Large Language Models." arXiv preprint arXiv:2303.18223 (2023).
Visual Pre-training has a longer history…
Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
Architecture / Objective / Data
1. How to design neural network architectures
Convolutional Neural Networks (1986)
[Learning Internal Representations by Error Propagation. Rumelhart et al., 1986]
ConvNet using BP:
- Receptive field
- Translation equivariance
- Trained by error propagation

LeNet (1989)
[Backpropagation Applied to Handwritten Zip Code Recognition, LeCun et al., 1989]

AlexNet (2012)
[Krizhevsky, Sutskever and Hinton, 2012]