01 Intro
Deep Learning
Course Instructor Information
Required:
Dive Into Deep Learning
By Aston Zhang, Zachary C. Lipton, Mu Li,
Alexander J. Smola · 2023
Link to PDF: [https://d2l.ai/d2l-en.pdf]
• We will cover many topics in this textbook
• We will also include special topics on recent
progress in image processing
• There will also be other reference books.
Requirements for the Final Project
Written report
• Report format: the same as an IEEE conference paper
• Executable code must be submitted with clear comments
except for a survey study
Academic integrity (avoiding plagiarism)
• don’t copy another person’s work
• describe using your own words
• complete citation and acknowledgement whenever you use
any other work (either published or online)
Requirements for the Final Project
Evaluation
• written report (be clear, complete, correct, etc.)
• code (be clear, complete, correct, well documented, etc.)
• oral presentation
• discussion with the instructor
• quality: publication-level project – extra credits
Paper Reading and Presentation
The McCulloch-Pitts Binary Neuron
y = sign(∑_{i=1}^{N} w_i x_i + b)
Perceptron: weights are motorized potentiometers
https://youtu.be/X1G2g3SiCwU
More History
1970s: statistical pattern recognition (Duda & Hart 1973)
1979: Kunihiko Fukushima, Neocognitron
1982: Hopfield Networks
1983: Hinton & Sejnowski, Boltzmann Machines
1985/1986: Practical Backpropagation for neural net training
1989: Convolutional Networks
1991: Bottou & Gallinari, module-based automatic differentiation
1995: Hochreiter & Schmidhuber, LSTM recurrent net.
1996: structured prediction with neural nets, graph transformer nets
…..
2003: Yoshua Bengio, neural language model
2006: Layer-wise unsupervised pre-training of deep networks
2010: Collobert & Weston, self-supervised neural nets in NLP
More History
2012: AlexNet / convnet on GPU / object classification
2015: I. Sutskever, neural machine translation with multilayer LSTM
2015: Weston, Chopra, Bordes: Memory Networks
2016: Bahdanau, Cho, Bengio: GRU, attention mechanism
2016: Kaiming He, ResNet
The Standard Paradigm of Pattern Recognition
[Diagram: traditional pattern recognition: Feature Extractor → Trainable Classifier; Deep Learning: both the feature extractor and the classifier are trainable]
Parameterized deterministic function G(x,w)
• x: input, y: desired output, w: implicit parameter
• Example: nearest neighbor
• Computing the function G may involve complicated algorithms
Block diagram notations for computation graphs
Deterministic function
• block: x → G(x,w) → y
• multiple inputs and outputs (tensors, scalars, …)
• implicit parameter variable (here: w)
Average
• computes the average loss over the set: G(x,w) is applied to each sample x[0], x[1], x[2], x[3] with desired outputs y[0], y[1], y[2], y[3], and the resulting losses are averaged
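As a concrete counterpart to this notation, here is a minimal PyTorch sketch (the linear G, the toy data, and the squared-error loss are illustrative assumptions, not from the slides):

import torch

# G(x, w): a deterministic function with an implicit parameter w
# (a simple linear map, chosen only for illustration)
w = torch.tensor([2.0, -1.0], requires_grad=True)

def G(x, w):
    return x @ w   # more inputs/outputs would just be more tensors

# a small set of samples x[0]..x[3] with desired outputs y[0]..y[3]
x = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = torch.tensor([1.0, -1.0, 1.0, 3.0])

# average loss over the set (squared error per sample)
L = ((G(x, w) - y) ** 2).mean()
L.backward()        # backpropagates through the computation graph
print(L.item(), w.grad)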
Supervised Machine Learning = Function Optimization
[Diagram: a function with adjustable parameters feeds an objective function that measures the error between its output and the desired output (e.g. "traffic light: -1")]
It's like walking in the mountains in a fog and following the direction of steepest
descent to reach the village in the valley.
But each sample gives us a noisy estimate of the direction, so our path is a bit random:
W_i ← W_i − η ∂L(W, X) / ∂W_i
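A minimal sketch of this noisy, one-sample-at-a-time update W_i ← W_i − η ∂L(W,X)/∂W_i (the linear model, the toy data, and the learning rate are illustrative assumptions):

import torch

eta = 0.1                                  # learning rate (assumed)
W = torch.zeros(2, requires_grad=True)     # adjustable parameters

X = torch.tensor([[1.0, 2.0], [2.0, 1.0], [1.0, -1.0]])
Y = torch.tensor([1.0, -1.0, 1.0])         # desired outputs

for x, y in zip(X, Y):                     # one (noisy) sample at a time
    loss = (x @ W - y) ** 2                # L(W, X) for this sample
    loss.backward()                        # gradient dL/dW
    with torch.no_grad():
        W -= eta * W.grad                  # W_i <- W_i - eta * dL/dW_i
    W.grad.zero_()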
Gradient Descent
Full (batch) gradient
[Diagram: gradient descent trajectories in weight space w; each step moves along -g, the negative of the gradient g]
Traditional Neural Net
[Diagram: layers of units connected by weights w; each unit computes a weighted sum s[j] of the previous layer's outputs, followed by a non-linearity producing z[j]]
Backprop through a non-linear function
Chain rule: (g(h(s)))' = g'(h(s)) · h'(s)
dc/ds = dc/dz · dz/ds = dc/dz · h'(s)
Perturbations:
• Perturbing s by ds perturbs z by dz = ds · h'(s)
• This perturbs c by dc = dz · dc/dz = ds · h'(s) · dc/dz
• Hence: dc/ds = dc/dz · h'(s)
[Diagram: the forward network computes z = h(s) and feeds it into the cost c; the derivative network runs alongside it, multiplying dc/dz by h'(s) to obtain dc/ds, and passing dc/dx and dc/dy further down]
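A small sketch checking this rule against autograd (the tanh non-linearity and the squared cost are assumptions chosen for illustration):

import torch

s = torch.tensor(0.7, requires_grad=True)
z = torch.tanh(s)          # z = h(s), a non-linear function
c = (z - 1.0) ** 2         # c = cost(z)

c.backward()               # autograd's dc/ds
dc_dz = 2 * (torch.tanh(s.detach()) - 1.0)    # hand-computed dc/dz
h_prime = 1 - torch.tanh(s.detach()) ** 2     # h'(s) for tanh
print(s.grad, dc_dz * h_prime)                # both equal dc/ds = dc/dz * h'(s)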
Backprop through a weighted sum
Perturbations:
• Perturbing z by dz perturbs s[0], s[1], s[2] by ds[0] = w[0]·dz, ds[1] = w[1]·dz, ds[2] = w[2]·dz
• This perturbs c by dc = ds[0]·dc/ds[0] + ds[1]·dc/ds[1] + ds[2]·dc/ds[2]
• Hence: dc/dz = dc/ds[0]·w[0] + dc/ds[1]·w[1] + dc/ds[2]·w[2]
[Diagram: the forward network fans z out through the weights w[0], w[1], w[2] into s[0], s[1], s[2] and on to the cost c; the derivative network combines dc/ds[0], dc/ds[1], dc/ds[2] with the same weights to obtain dc/dz, and passes dc/dx and dc/dy further down]
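A matching sketch for the weighted-sum rule (three branches and a squared cost, chosen for illustration):

import torch

w = torch.tensor([0.5, -1.0, 2.0])
z = torch.tensor(0.3, requires_grad=True)
target = torch.tensor([1.0, 0.0, -1.0])

s = w * z                          # z fans out: s[i] = w[i] * z
c = ((s - target) ** 2).sum()      # c = cost(s)

c.backward()                       # autograd's dc/dz
dc_ds = 2 * (w * z.detach() - target)   # hand-computed dc/ds[i]
print(z.grad, (dc_ds * w).sum())   # both equal dc/dz = sum_i dc/ds[i] * w[i]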
(Deep) Multi-Layer Neural Nets
[Diagram: an input image flows through a stack of weight matrices and hidden layers to the output "This is a car"]
Block Diagram of a Traditional Neural Net
linear blocks
Non-linear blocks
PyTorch definition
• Object-oriented version
• Uses the predefined nn.Linear class (which includes a bias vector)
• Uses the torch.relu function
• State variables are temporary
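The code itself is not reproduced here; a minimal sketch of the kind of object-oriented definition described above (the layer sizes are assumptions) could look like:

import torch
from torch import nn

class TwoLayerNet(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=100, out_dim=10):
        super().__init__()
        # predefined nn.Linear blocks (each includes a bias vector)
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        # state variables s, z are temporary: they exist only during forward
        s = self.fc1(x)            # linear block
        z = torch.relu(s)          # non-linear block
        return self.fc2(z)         # linear block

model = TwoLayerNet()
y = model(torch.randn(32, 784))    # batch of 32 inputs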
Linear Classifiers and their limitations
Partitions the space into two half-spaces separated by the hyperplane:
∑_{i=1}^{N} w_i x_i + b = 0
[Diagram: the hyperplane in the (x1, x2) plane with normal vector W, crossing the x1 axis at -b/w1; a second panel shows a dataset that is not linearly separable]
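A small sketch of the decision rule, using the classic XOR points as a not-linearly-separable example (the points and weights are illustrative, not from the slide):

import torch

def linear_classifier(x, w, b):
    # sign of the signed distance to the hyperplane sum_i w_i x_i + b = 0
    return torch.sign(x @ w + b)

w, b = torch.tensor([1.0, 1.0]), torch.tensor(-0.5)
x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
print(linear_classifier(x, w, b))   # XOR labels (-1, 1, 1, -1) cannot be produced
                                    # by any choice of w and b: not linearly separable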
Number of linearly separable dichotomies
[Diagram: Feature Extractor → Trainable Classifier; the output of the feature extractor is the representation / features]
Ideas for “generic” feature extraction
Basic principle:
expanding the dimension of the representation so that things are more
likely to become linearly separable.
- space tiling
- random projections
- polynomial classifier (feature cross-products)
- radial basis functions
- kernel machines
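As one concrete instance of this principle, a minimal sketch of a random-projection expansion (the output dimension and the ReLU non-linearity are assumptions):

import torch

def random_expansion(x, out_dim=256, seed=0):
    # expand an N-dim input into a higher-dimensional representation
    # via a fixed random projection followed by a point-wise non-linearity
    g = torch.Generator().manual_seed(seed)
    R = torch.randn(x.shape[-1], out_dim, generator=g)
    return torch.relu(x @ R)

x = torch.randn(10, 16)       # 10 samples, 16 features
phi = random_expansion(x)     # 10 samples, 256 features
print(phi.shape)              # data is more likely to be linearly separable here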
Example: monomial features
• generalizable to degree d
• Unfortunately impractical for large d: the number of degree-d monomial features is about (N choose d), which grows like N^d
• But d=2 is used a lot in “attention” circuits.
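A small sketch of degree-2 monomial features (feature cross-products); the N^d growth with d=2 is visible directly in the output size (the toy input is an assumption):

import torch

def degree2_features(x):
    # all pairwise products x_i * x_j: N inputs -> N*N features (grows like N^d with d=2)
    return torch.outer(x, x).reshape(-1)

x = torch.tensor([1.0, 2.0, 3.0])
phi = degree2_features(x)
print(phi.shape)   # torch.Size([9]) for N = 3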
Shallow networks are universal approximators!
Non-Linear Expansion → Pooling
[Diagram: input → non-linear expansion → high-dimensional features (unstable/non-smooth) → pooling / non-linear aggregation / projection / dimension reduction → stable/invariant features]
The perfect representation of a face image:
• Face / not face
• Its coordinates on the face manifold (pose, lighting, expression, …)
• Its coordinates away from the manifold
e.g. a vector such as [1.2, −3, 0.2, −2, …]
We do not have good and general methods to learn functions that turn an image into this
kind of representation.
Disentangling factors of variation
[Diagram: an ideal feature extractor maps the data manifold in pixel space (Pixel 1, Pixel 2, …, Pixel n) to disentangled factors of variation such as view and expression]
[Hadsell et al. CVPR 2006]
Deep Learning = Learning Hierarchical Representations
[Diagram: traditional approach: Feature Extractor → Trainable Classifier; Deep Learning: a stack of trainable feature transforms feeding a trainable classifier, trained end to end]
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
Multilayer Architecture == Hierarchical representation
Hierarchy of representations with increasing level of abstraction
Each stage is a kind of trainable feature transform
Image recognition
Pixel → edge → texton → motif → part → object
Text
Character → word → word group → clause → sentence → story
Speech
Sample → spectral band → sound → … → phone → phoneme → word
Why would deep architectures be more efficient?
[Bengio & LeCun 2007, “Scaling Learning Algorithms Towards AI”]
A deep architecture trades space for time (or breadth for depth)
more layers (more sequential computation),
but less hardware (less parallel computation).
Example 1: N-bit parity
requires N−1 XOR gates in a tree of depth log(N).
Even easier if we use threshold gates.
requires an exponential number of gates if we restrict ourselves to 2 layers (DNF
formula with an exponential number of minterms).
Example 2: circuit for addition of 2 N-bit binary numbers
Requires O(N) gates, and O(N) layers using N one-bit adders with ripple carry
propagation.
Requires lots of gates (some polynomial in N) if we restrict ourselves to two layers (e.g.
Disjunctive Normal Form).
Bad news: almost all boolean functions have a DNF formula with an exponential
number of minterms O(2^N).....
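A small sketch (not from the slides) of Example 1: computing N-bit parity with a balanced tree of XORs uses N−1 gates and depth about log2(N):

def parity(bits):
    # reduce pairwise with XOR: N-1 gates arranged in a tree of depth ~log2(N)
    depth = 0
    while len(bits) > 1:
        bits = [bits[i] ^ bits[i + 1] for i in range(0, len(bits) - 1, 2)] + \
               (bits[-1:] if len(bits) % 2 else [])
        depth += 1
    return bits[0], depth

print(parity([1, 0, 1, 1, 0, 1, 0, 1]))   # (1, 3): parity 1, depth log2(8) = 3
# a 2-layer (DNF) circuit for the same function needs exponentially many minterms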