11-Nonlinear Models (Neural Networks)
Linear models
In previous topics, we mainly dealt with linear models
• Regression: $h(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x} + b$
• Classification:
$h(\boldsymbol{x}) = \arg\max_i \; \boldsymbol{w}_i^\top \boldsymbol{x} + b_i$
Category $i$ is preferred over Category $j$ iff
$\boldsymbol{w}_i^\top \boldsymbol{x} + b_i \ge \boldsymbol{w}_j^\top \boldsymbol{x} + b_j$, i.e., $(\boldsymbol{w}_i - \boldsymbol{w}_j)^\top \boldsymbol{x} + (b_i - b_j) \ge 0$
Binary classification with logistic regression is a special case of the above.
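For concreteness, a minimal NumPy sketch of the multiclass linear classifier above; the weight matrix W, bias vector b, and the test point are made-up illustrations, not values from the notes.

import numpy as np

# Hypothetical 3-class linear classifier on 2-D inputs:
# one row w_i and one bias b_i per category.
W = np.array([[ 1.0, -0.5],
              [-0.3,  0.8],
              [ 0.2,  0.2]])
b = np.array([0.1, -0.2, 0.0])

def predict(x):
    """Return argmax_i (w_i^T x + b_i)."""
    scores = W @ x + b
    return int(np.argmax(scores))

print(predict(np.array([2.0, 1.0])))  # index of the predicted category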
Nonlinear models
• Non-linear features $\boldsymbol{\phi}(\boldsymbol{x})$
○ E.g., Gaussian discriminant analysis with different covariance
matrices per class, in which case the decision function involves quadratic features of $\boldsymbol{x}$.
• Non-linear kernel $k(\boldsymbol{x}_i, \boldsymbol{x}_j)$
○ A kernel is the inner product of two data samples after they are
mapped into a certain vector space. That vector space could be very
high-dimensional (even infinite-dimensional). A linear
classifier in such a high-dimensional space can be non-linear in
the original low-dimensional space (see the worked example after this list).
• Learnable non-linear mapping
○ We can probably stack a few layers of learnable non-linear functions
(e.g., logistic functions) to learn the non-linear feature $\boldsymbol{\phi}(\boldsymbol{x})$ or a
non-linear kernel that is appropriate to the task at hand.
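As a worked example (illustrative, not part of the original notes), the quadratic kernel on $\mathbb{R}^2$ is exactly the inner product of hand-crafted quadratic features, which connects the first two bullets above:
$k(\boldsymbol{x}, \boldsymbol{z}) = (\boldsymbol{x}^\top \boldsymbol{z})^2 = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \boldsymbol{\phi}(\boldsymbol{x})^\top \boldsymbol{\phi}(\boldsymbol{z})$,
where $\boldsymbol{\phi}(\boldsymbol{x}) = \big(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\big)^\top$.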
Motivation: XOR, a nonlinear classification problem
• XOR is not linearly separable, so the problem cannot be solved by logistic regression
• What if we stack multiple logistic regression classifiers?
• The XOR problem is solvable by three linear classifiers
○ One built upon the other two (see the sketch after this list)
○ But this is programming [HW]
§ Some machinery that allows you to specify certain things
§ Programming means you specify these things (usually
heuristically by human intelligence) that can be input to the
machinery
§ The machinery accomplishes a certain task according to your
input (program).
○ Programming is very tedious and only feasible for simple tasks
○ We want to learn the weights.
• Can we learn the weights?
○ Yes, still by gradient descent.
• Can we compute the gradient?
○ Yes, it's still a differentiable function
○ Again, brute-force computation of gradient is very tedious.
○ We need a systematic way of
§ Defining a deep architecture, and
§ Computing its gradient.
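A minimal sketch of the construction above: three linear threshold units with weights picked by hand (the "programming" route) rather than learned.

import numpy as np

def step(z):
    """Binary thresholding activation."""
    return float(z >= 0)

def xor_net(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Two linear classifiers on the raw inputs (hand-picked weights):
    h_or   = step(np.array([ 1.0,  1.0]) @ x - 0.5)  # fires if x1 OR x2
    h_nand = step(np.array([-1.0, -1.0]) @ x + 1.5)  # fires unless x1 AND x2
    # A third linear classifier built upon the other two: AND of (OR, NAND) = XOR.
    return step(np.array([1.0, 1.0]) @ np.array([h_or, h_nand]) - 1.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_net(a, b)))  # prints the XOR truth table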
Artificial Neural Network
• A perceptron [Rosenblatt, 1958]
$z = \boldsymbol{w}^\top \boldsymbol{x} + b$
$y = f(z)$
where $f$ is a binary thresholding function
• A perceptron-like neuron, unit, or node
$z = \boldsymbol{w}^\top \boldsymbol{x} + b$
$y = f(z)$
$f$ is an activation function, e.g., sigmoid, tanh, ReLU
○ Usually we use nonlinear activation
○ Linear activation may be used for regression
• A multi-layer neural network, or a multi-layer perceptron
A common structure is layer-wise fully connected
For each node $j$ at layer $L$:
$z_j^{(L)} = \sum_i w_{ji}^{(L)} \, y_i^{(L-1)} + b_j^{(L)}, \qquad y_j^{(L)} = f\big(z_j^{(L)}\big)$
To simplify notation, we omit the layer index $L$ and call the output of the
current layer $y$ and the input of the current layer $x$, which is the output of the
lower layer. In the simplified notation, $z_j = \sum_i w_{ji} x_i + b_j$ and $y_j = f(z_j)$.
Since we have multiple layers, we need a recursive algorithm that
computes the activation of all nodes automatically.
Forward propagation (FP)
○ Initialization
○ Recursion
○ Termination
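A minimal NumPy sketch of forward propagation for a layer-wise fully connected network; the sigmoid activation, layer sizes, and random weights are illustrative assumptions, not prescribed by the notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward propagation.
    Initialization: the lowest layer's input is the data x.
    Recursion: each layer computes z = W x + b, y = f(z) from the layer below.
    Termination: the top layer's output is the network's prediction."""
    activations = [x]        # y^(0) = x
    pre_activations = []     # store each layer's z for later use in backprop
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b
        pre_activations.append(z)
        activations.append(sigmoid(z))
    return activations, pre_activations

# Tiny example: a 2 -> 3 -> 1 network with random weights (illustrative only).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases  = [np.zeros(3), np.zeros(1)]
ys, zs = forward(np.array([1.0, -2.0]), weights, biases)
print(ys[-1])  # network output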
Gradient of multi-layer neural networks
Main idea: if we can compute the gradient for one layer, we may use
chain rule to compute the gradient for all layers.
Recursion on what?
We consider a local layer
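The layer-local gradients are not reproduced in these notes; a standard derivation (a sketch assuming the simplified notation $z_j = \sum_i w_{ji} x_i + b_j$, $y_j = f(z_j)$, and an upstream gradient $\partial E / \partial y_j$) is:
$\dfrac{\partial E}{\partial z_j} = \dfrac{\partial E}{\partial y_j}\, f'(z_j), \qquad \dfrac{\partial E}{\partial w_{ji}} = \dfrac{\partial E}{\partial z_j}\, x_i, \qquad \dfrac{\partial E}{\partial b_j} = \dfrac{\partial E}{\partial z_j}, \qquad \dfrac{\partial E}{\partial x_i} = \sum_j w_{ji}\, \dfrac{\partial E}{\partial z_j}$
The last quantity, $\partial E / \partial x_i$, is exactly the upstream gradient for the layer below; passing it down is the recursion.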
Backpropagation (BP)
○ Initialization
○ Recursion
○ Termination
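Continuing the forward-propagation sketch above (and reusing its numpy import, sigmoid(), forward(), weights, and biases), a minimal backpropagation pass; the squared-error loss is an illustrative assumption.

def backward(x, target, weights, biases):
    """Backpropagation for the network in the forward() sketch above.
    Initialization: the output-layer gradient comes from the loss.
    Recursion: each layer turns dE/dy into dE/dW, dE/db, and dE/dx for the layer below.
    Termination: stop after the lowest layer; collect all gradients."""
    ys, zs = forward(x, weights, biases)
    dE_dy = ys[-1] - target                      # d/dy of 0.5 * ||y - target||^2
    grads_W, grads_b = [], []
    for W, b, z, y_in in reversed(list(zip(weights, biases, zs, ys[:-1]))):
        s = sigmoid(z)
        dE_dz = dE_dy * s * (1.0 - s)            # chain rule through the sigmoid
        grads_W.append(np.outer(dE_dz, y_in))    # dE/dW for this layer
        grads_b.append(dE_dz)                    # dE/db for this layer
        dE_dy = W.T @ dE_dz                      # dE/dx, passed to the layer below
    return list(reversed(grads_W)), list(reversed(grads_b))

gW, gb = backward(np.array([1.0, -2.0]), np.array([1.0]), weights, biases)
print(gW[0].shape, gW[1].shape)  # (3, 2) (1, 3)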
• A few more thoughts
○ Non-layerwise connections: visit the nodes in topological order (topological sort)
○ Multiple losses: BP is a linear system, so the gradients contributed by different losses simply add
○ Tied weights: the total derivative is the summation of the partial derivatives over all uses of the shared weight
Auto-differentiation in general
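As a small illustration of general-purpose auto-differentiation, a sketch using PyTorch's reverse-mode autograd (PyTorch is not mentioned in the notes; the expression is arbitrary):

import torch

# Reverse-mode auto-differentiation of an arbitrary differentiable expression.
x = torch.tensor([1.0, -2.0], requires_grad=True)
W = torch.tensor([[0.5, -1.0], [2.0, 0.3]], requires_grad=True)
y = torch.sigmoid(W @ x).sum()   # any composition of differentiable ops
y.backward()                     # builds and traverses the computation graph
print(x.grad)                    # dE/dx
print(W.grad)                    # dE/dW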
Numerical gradient checking
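A minimal sketch of numerical gradient checking with central differences, $\big(E(\boldsymbol{\theta} + \epsilon \boldsymbol{e}_i) - E(\boldsymbol{\theta} - \epsilon \boldsymbol{e}_i)\big) / (2\epsilon)$; the test function and tolerance are illustrative.

import numpy as np

def numerical_gradient(f, theta, eps=1e-5):
    """Central-difference estimate of df/dtheta, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (f(theta + e) - f(theta - e)) / (2.0 * eps)
    return grad

# Compare an analytic gradient against the numerical one.
f = lambda th: np.sum(th ** 2)      # E(theta) = ||theta||^2
analytic = lambda th: 2.0 * th      # dE/dtheta = 2 * theta
theta = np.array([0.3, -1.2, 2.0])
print(np.allclose(numerical_gradient(f, theta), analytic(theta), atol=1e-6))  # True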