
CSD456

Deep Learning
Course Instructor Information

Instructor: Dr. Saurabh Shigwan


Email: [email protected]
Office: C219F
Office Hours: By appointment
PhD TA: TBD
Prerequisites of This Course

This is a computer science course


• It will involve a fair amount of math
– calculus, linear algebra, geometry
– probability
– analog/digital signal processing

• It will involve the modeling and design of a real system: one final course project
– Programming skills with Python and PyTorch
Text Book

Required:
Dive Into Deep Learning
By Aston Zhang, Zachary C. Lipton, Mu Li,
Alexander J. Smola · 2023
Link to PDF: [https://d2l.ai/d2l-en.pdf]
• We will cover many topics in this text book
• We will also include special topics on recent
progress in image processing
• There will also be other reference books.
Requirement for Final Project

A complete research project


• Introduction (problem formulation/definition)
• Literature review
• The proposed method and analysis
• Experiments
• Conclusion
• References
Requirement for Final Project

• Select a topic and write a one-page proposal (31st August)


• Progress report (discuss with the instructor)
• Research work and report writing
• Oral presentation
• Final project report
Requirement for Final Project

Teamwork is acceptable for a research project (Option 1)


• <=3 people
• Get permission from the instructor first
• Under a single topic, each member must have their own
specific tasks
• One combined report with each member clearly stating
their own contributions
• One combined presentation
Requirement for Final Project

Written report
• Report format: the same as an IEEE conference paper
• Executable code must be submitted with clear comments
except for a survey study
Academic integrity (avoiding plagiarism)
• don’t copy another person’s work
• describe using your own words
• complete citation and acknowledgement whenever you use
any other work (either published or online)
Requirement for Final Project

Evaluation
• written report (be clear, complete, correct, etc.)
• code (be clear, complete, correct, well documented, etc.)
• oral presentation
• discussion with the instructor
• quality: publication-level project – extra credits
Paper Reading and Presentation

• A paper picked by yourself and approved by the instructor


• Suggested paper sources: IEEE TPAMI, IEEE TIP, IEEE TMI, IJCV,
CVIU, Elsevier Pattern Recognition, NeurIPS, ICCV, ECCV, CVPR,
WACV, ICASSP, ICIP, ICML, ECML, MICCAI, ISBI, IPMI

• Thorough understanding of the paper


• Prepare PPT slides
• Clearly explain the main contributions in the selected
paper
• Critical comments and discussions
• About a 15-minute oral presentation for each group
Assessment scheme

Evaluation Instrument | Weightage | Learning Outcomes
Mid Term Test         | 25%       | Understanding half-semester concepts
Laboratory/Assign.    | 20%       | Testing implementation skill
End Term Exam         | 25%       | Understanding full-semester concepts
Group Project         | 30%       | Testing project building
History of Deep Learning

• Early concepts date back to the 1940s and 1950s (e.g., the Perceptron).
• Major breakthroughs in the 1980s with backpropagation.
• Resurgence in the 2000s with the availability of large datasets
and powerful GPUs.
• Deep learning is transforming industries with its powerful
predictive capabilities.
• Continuous research is needed to overcome current challenges.
• The field is rapidly evolving with new techniques and applications
emerging regularly.
Inspiration for Deep Learning: The Brain!
1943: McCulloch & Pitts, networks of binary neurons can do logic
1947: Donald Hebb, Hebbian synaptic plasticity
1948: Norbert Wiener, cybernetics, optimal filter,
feedback, autopoïesis, auto-organization.
1957: Frank Rosenblatt, Perceptron
1961: Bernie Widrow, Adaline
1962: Hubel & Wiesel, visual cortex architecture
1969: Minsky & Papert, limits of the Perceptron
Supervised Learning goes back to the Perceptron & Adaline:

$y = \mathrm{sign}\left(\sum_{i=1}^{N} w_i x_i + b\right)$

The McCulloch-Pitts binary neuron.
Perceptron: weights are motorized potentiometers.
Adaline: weights are electrochemical “memistors”.

https://youtu.be/X1G2g3SiCwU
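As an illustration (not from the slides), the decision rule above is a one-liner in PyTorch; the weights, bias, and input below are made-up values:

    import torch

    def perceptron(x, w, b):
        # McCulloch-Pitts-style binary decision: y = sign(sum_i w_i * x_i + b)
        return torch.sign(w @ x + b)

    # Hypothetical 3-input neuron with hand-picked weights.
    w = torch.tensor([0.5, -1.0, 0.25])
    b = torch.tensor(0.1)
    x = torch.tensor([1.0, 0.0, 1.0])
    print(perceptron(x, w, b))  # tensor(1.)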
More History
1970s: statistical pattern recognition (Duda & Hart 1973)
1979: Kunihiko Fukushima, Neocognitron
1982: Hopfield Networks
1983: Hinton & Sejnowski, Boltzmann Machines
1985/1986: Practical Backpropagation for neural net training
1989: Convolutional Networks
1991: Bottou & Gallinari, module-based automatic differentiation
1995: Hochreiter & Schmidhuber, LSTM recurrent net.
1996: structured prediction with neural nets, graph transformer nets
…..
2003: Yoshua Bengio, neural language model
2006: Layer-wise unsupervised pre-training of deep networks
2010: Collobert & Weston, self-supervised neural nets in NLP
More History
2012: AlexNet / convnet on GPU / object classification
2015: I. Sutskever, neural machine translation with multilayer LSTM
2015: Weston, Chopra, Bordes: Memory Networks
2016: Bahdanau, Cho, Bengio: GRU, attention mechanism
2016: Kaiming He, ResNet
The Standard Paradigm of Pattern Recognition

...since the 1960s


...and “traditional” Machine Learning
until the “Deep Learning Revolution” (circa 2012)

[Diagram: Feature Extractor (hand engineered) → Trainable Classifier]

What is Deep Learning?

• Deep learning is a subset of machine learning.


• It uses neural networks with many layers (deep architectures).
• Models learn from large amounts of data to make predictions or
decisions.
• Deep learning has achieved state-of-the-art results in various
fields.
• Key applications include computer vision, natural language
processing, and reinforcement learning.
• It powers technologies like autonomous vehicles, medical
diagnostics, and personal assistants.
Multilayer Neural Nets and Deep Learning
Traditional Machine Learning:
[Diagram: Feature Extractor (hand engineered) → Trainable Classifier]

Deep Learning (everything is trainable):
[Diagram: Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier]
Parameterized Model
A parameterized, deterministic function $\bar{y} = G(x, w)$ maps an input $x$ to a computed output $\bar{y}$; $w$ is an implicit parameter variable.
A cost function $C(y, \bar{y})$ compares the desired output $y$ with the computed output and produces an implicit scalar output.
Example: linear regression.
Example: nearest neighbor.
Computing the function G may involve complicated algorithms.
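A minimal sketch of this block diagram, assuming linear regression as G and squared error as C (neither is fixed by the slide):

    import torch

    def G(x, w):
        # Parameterized, deterministic function: here a linear model.
        return x @ w

    def C(y, y_hat):
        # Scalar-valued cost comparing desired output y with computed output y_hat.
        return (y - y_hat) ** 2

    x = torch.tensor([1.0, 2.0])   # input
    y = torch.tensor(3.0)          # desired output
    w = torch.tensor([0.5, 1.0])   # implicit parameter
    y_hat = G(x, w)                # computed output: 2.5
    print(C(y, y_hat))             # tensor(0.2500)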
Block diagram notations for computation graphs

Variables (tensor, scalar, continuous, discrete, ...):
• x: observed variable (input, desired output, ...)
• ȳ: computed variable (output of a deterministic function)

Deterministic function x → G(x, w) → ȳ:
• multiple inputs and outputs (tensors, scalars, ...)
• implicit parameter variable (here: w)

Scalar-valued function C(y, ȳ):
• single scalar output (implicit)
• used mostly for cost functions
Loss function, average loss.

Simple per-sample loss function:

$L(x, y, w) = C(y, G(x, w))$

Average loss over a set of samples {(x[0], y[0]), ..., (x[P-1], y[P-1])}:

$L(S, w) = \frac{1}{P} \sum_{p=0}^{P-1} C(y[p], G(x[p], w))$

[Diagram: each sample x[p] passes through G(x, w), each output is compared with y[p] by C, and the per-sample losses are averaged.]
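A small sketch of the average loss, reusing the hypothetical linear G and squared-error C from the previous example, vectorized over a set of four samples:

    import torch

    def average_loss(xs, ys, w):
        # L(S, w) = (1/P) * sum_p C(y[p], G(x[p], w))
        per_sample = (ys - xs @ w) ** 2   # C applied to every sample at once
        return per_sample.mean()

    xs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])  # x[0]..x[3]
    ys = torch.tensor([1.0, 2.0, 3.0, 4.0])                              # y[0]..y[3]
    w = torch.tensor([1.0, 2.0])
    print(average_loss(xs, ys, w))  # tensor(0.); this w fits the samples exactly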
Supervised Machine Learning = Function Optimization

A function with adjustable parameters is tuned to minimize an objective function (the error).

[Figure: a training sample, labeled e.g. “traffic light: -1”, is fed to the function; the objective function measures the error.]

It's like walking in the mountains in a fog and following the direction of steepest descent to reach the village in the valley. But each sample gives us a noisy estimate of the direction, so our path is a bit random:

$W_i \leftarrow W_i - \eta \frac{\partial L(W, X)}{\partial W_i}$
Gradient Descent
Full (batch) gradient: update w using the gradient of the average loss over all samples.

Stochastic Gradient (SGD): pick a p in 0...P-1, then update w using the gradient of the loss on that single sample:

$w \leftarrow w - \eta \frac{\partial L(x[p], y[p], w)}{\partial w}$

SGD exploits the redundancy in the samples: it goes faster than full gradient in most cases.
In practice, we use mini-batches for parallelization.
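A minimal sketch contrasting the two update rules, using torch autograd for the gradient; the data, model, and step size eta are made up:

    import torch

    xs = torch.randn(100, 2)                 # P = 100 samples
    ys = xs @ torch.tensor([1.0, -2.0])      # targets from a hidden linear rule
    w = torch.zeros(2, requires_grad=True)
    eta = 0.1

    def loss(x, y, w):
        return ((x @ w - y) ** 2).mean()

    # Full (batch) gradient: one update uses all P samples.
    loss(xs, ys, w).backward()
    with torch.no_grad():
        w -= eta * w.grad
    w.grad.zero_()

    # Stochastic gradient (SGD): pick a p in 0...P-1, then update w.
    p = torch.randint(len(xs), (1,)).item()
    loss(xs[p:p+1], ys[p:p+1], w).backward()
    with torch.no_grad():
        w -= eta * w.grad
    w.grad.zero_()

    # In practice: mini-batches, e.g. loss(xs[p:p+32], ys[p:p+32], w).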
Traditional Neural Net

Stacked linear and non-linear functional blocks:
• Weighted sums (matrix-vector products)
• Point-wise non-linearities (e.g. ReLU, tanh, ...)

[Figure: a fully-connected multi-layer network; each connection carries a weight w]
Traditional Neural Net

Stacked linear and non-linear functional blocks. Each unit computes a weighted sum of the outputs of the layer below, then applies a point-wise non-linearity:

$s[i] = \sum_j w[i,j]\, z[j], \qquad z[i] = h(s[i])$
Backprop through a non-linear function

Chain rule: $(g \circ h)'(s) = g'(h(s)) \cdot h'(s)$, so with $z = h(s)$:

$\frac{dc}{ds} = \frac{dc}{dz} \cdot \frac{dz}{ds} = \frac{dc}{dz} \cdot h'(s)$

Perturbation view: perturbing s by ds will perturb z by $dz = h'(s)\,ds$. This will perturb c by $dc = \frac{dc}{dz}\,dz = \frac{dc}{dz}\,h'(s)\,ds$. Hence $dc/ds = (dc/dz) \cdot h'(s)$.

[Diagram: the forward network computes z = h(s); the backward (derivative) network multiplies dc/dz by h'(s) to produce dc/ds.]
Backprop through a weighted sum

Perturbation view: perturbing z by dz will perturb s[0], s[1], s[2] by ds[0] = w[0] dz, ds[1] = w[1] dz, ds[2] = w[2] dz. This will perturb c by

$dc = \frac{dc}{ds[0]}\,ds[0] + \frac{dc}{ds[1]}\,ds[1] + \frac{dc}{ds[2]}\,ds[2]$

Hence:

$\frac{dc}{dz} = \frac{dc}{ds[0]}\,w[0] + \frac{dc}{ds[1]}\,w[1] + \frac{dc}{ds[2]}\,w[2]$

[Diagram: in the forward network, z fans out through weights w[0], w[1], w[2]; the backward (derivative) network sums the gradients dc/ds[i] weighted by the same w[i].]
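These two rules are exactly what automatic differentiation applies. A small sketch checking them numerically with torch autograd, using tanh for h and made-up values:

    import torch

    # Non-linear function: dc/ds = dc/dz * h'(s), with h = tanh.
    s = torch.tensor(0.5, requires_grad=True)
    z = torch.tanh(s)                 # z = h(s)
    c = 3.0 * z                       # downstream cost with dc/dz = 3
    c.backward()
    manual = 3.0 * (1 - torch.tanh(torch.tensor(0.5)) ** 2)  # dc/dz * h'(s)
    print(s.grad, manual)             # both approx. 2.36

    # Weighted sum: z fans out into s[i] = w[i] * z, so dc/dz = sum_i dc/ds[i] * w[i].
    z = torch.tensor(2.0, requires_grad=True)
    w = torch.tensor([1.0, -1.0, 0.5])
    c = (w * z).sum()                 # here dc/ds[i] = 1 for every i
    c.backward()
    print(z.grad)                     # tensor(0.5000) = 1.0 - 1.0 + 0.5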
(Deep) Multi-Layer Neural Nets

Multiple layers of simple units, with $\mathrm{ReLU}(x) = \max(x, 0)$:
• Each unit computes a weighted sum of its inputs
• The weighted sum is passed through a non-linear function
• The learning algorithm changes the weights

[Figure: an image passes through weight matrices and hidden layers; the output reads “This is a car”]
Block Diagram of a Traditional Neural Net

[Diagram: alternating linear blocks and non-linear blocks]
PyTorch definition

Object-oriented version (see the sketch below):
• Uses the predefined nn.Linear class (which includes a bias vector)
• Uses the torch.relu function
• State variables are temporary
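The slide's code itself did not survive the export; below is a plausible reconstruction under the stated assumptions (nn.Linear, torch.relu, temporary state variables), with made-up layer sizes:

    import torch
    from torch import nn

    class Net(nn.Module):
        # Object-oriented definition of a traditional two-layer neural net.
        def __init__(self, d_in=784, d_hidden=100, d_out=10):
            super().__init__()
            self.layer1 = nn.Linear(d_in, d_hidden)   # includes a bias vector
            self.layer2 = nn.Linear(d_hidden, d_out)

        def forward(self, x):
            s = self.layer1(x)     # weighted sums (matrix-vector product)
            z = torch.relu(s)      # point-wise non-linearity; s and z are temporary
            return self.layer2(z)

    model = Net()
    y = model(torch.randn(32, 784))  # forward pass on a batch of 32 inputs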
Linear Classifiers and their limitations
Linear classifier: $\bar{y} = \mathrm{sign}\left(\sum_{i=1}^{N} w_i x_i + b\right)$

It partitions the space into two half-spaces separated by the hyperplane:

$\sum_{i=1}^{N} w_i x_i + b = 0$

[Figure: left, a linearly separable dataset in the (x1, x2) plane, split by a hyperplane with intercept -b/w1; right, a dataset that is not linearly separable]
Number of linearly separable dichotomies

The probability that a dichotomy over P points in N dimensions is linearly separable goes to zero as P gets larger than N [Cover’s theorem 1966].
Solution: representations (a.k.a. features)

Extracting relevant features from the raw input; computing good representations of the input.
The feature extractor must be non-linear.
Simple solution: expand the dimension non-linearly. But how?

[Diagram: Feature Extractor → representation / features → Trainable Classifier]
Ideas for “generic” feature extraction

Basic principle:
expanding the dimension of the representation so that things are more
likely to become linearly separable.

- space tiling
- random projections
- polynomial classifier (feature cross-products)
- radial basis functions
- kernel machines
Example: monomial features

The feature extractor computes cross products of input variables; a linear classifier on top then computes a polynomial of the input variables.

This generalizes to degree d, but is unfortunately impractical for large d: the number of features is about N choose d, which grows like N^d.

But d = 2 is used a lot in “attention” circuits.
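A sketch of a degree-2 monomial feature extractor followed by a linear classifier; the helper name and values are illustrative only:

    import torch

    def monomial_features(x):
        # All degree-2 cross products x_i * x_j with i <= j: N(N+1)/2 features.
        n = x.shape[-1]
        i, j = torch.triu_indices(n, n)
        return x[..., i] * x[..., j]

    x = torch.tensor([1.0, 2.0, 3.0])
    phi = monomial_features(x)      # [x1*x1, x1*x2, x1*x3, x2*x2, x2*x3, x3*x3]
    w = torch.randn(phi.shape[-1])  # linear classifier on top of the features
    y = torch.sign(w @ phi)         # overall: a degree-2 polynomial of x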
Shallow networks are universal approximators!

SVMs and kernel methods:
• Layer 1: kernels; layer 2: linear
• The first layer is “trained” with the simplest unsupervised method ever devised: using the samples as templates for the kernel functions.

2-layer neural nets:
• Layer 1: dot products + non-linear function; layer 2: linear

But few useful functions can be efficiently represented with only two layers of reasonable size.
Do we really need deep architectures?
Theoretician's dilemma: “We can approximate any function as closely as we want with a shallow architecture. Why would we need deep ones?”

kernel machines (and 2-layer neural nets) are “universal”.


Deep learning machines

Deep machines are more efficient for representing certain classes of


functions, particularly those involved in visual recognition
they can represent more complex functions with less “hardware”
We need an efficient parameterization of the class of functions that are useful
for “AI” tasks (vision, audition, NLP...)
Basic Idea for Invariant Feature Learning

Embed the input non-linearly into a high(er)-dimensional space: in the new space, things that were non-separable may become separable.
Pool regions of the new space together, bringing together things that are semantically similar.

[Diagram: Input → non-linear function → high-dim features (unstable/non-smooth) → pooling / aggregation / projection / dim reduction → stable/invariant features]
Non-Linear Expansion → Pooling

[Diagram: entangled data manifolds → non-linear dim expansion / disentangling → pooling / aggregation]
Sparse Non-Linear Expansion → Pooling
Use a non-linear function to break things apart, then pool together similar things.

[Diagram: clustering, quantization, sparse coding, or Linear+ReLU for the expansion → pooling / aggregation]
Discovering the Hidden Structure in High-Dimensional Data:
The manifold hypothesis
Learning Representations of Data:
Discovering & disentangling the independent explanatory factors
The Manifold Hypothesis: natural data lives in a low-dimensional (non-linear) manifold, because variables in natural data are mutually dependent.
Discovering the Hidden Structure in High-Dimensional Data
Example: all face images of a person.
1000×1000 pixels = 1,000,000 dimensions.
But the face has 3 Cartesian coordinates and 3 Euler angles, and humans have fewer than about 50 muscles in the face. Hence the manifold of face images for a person has <56 dimensions.

The perfect representation of a face image:
• face / not face
• its coordinates on the face manifold (pose, lighting, expression)
• its coordinates away from the manifold

[Diagram: an ideal feature extractor maps the image to a vector such as [1.2, −3, 0.2, −2, ...]]

We do not have good and general methods to learn functions that turn an image into this kind of representation.
Disentangling factors of variation

The Ideal Disentangling Feature Extractor

[Figure: points in pixel space (Pixel 1, Pixel 2, ..., Pixel n) lie on a data manifold; an ideal feature extractor maps them to disentangled axes such as view and expression. From Hadsell et al., CVPR 2006]
Deep Learning = Learning Hierarchical Representations

Traditional Machine Learning:
[Diagram: Feature Extractor (hand engineered) → Trainable Classifier]

Deep Learning (everything is trainable):
[Diagram: Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier]
Multilayer Architectures == Compositional Structure of Data
Natural data is compositional, so it is efficiently representable hierarchically.

[Diagram: Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier]

Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013]
Multilayer Architecture == Hierarchical representation
Hierarchy of representations with increasing level of abstraction
Each stage is a kind of trainable feature transform
Image recognition: pixel → edge → texton → motif → part → object
Text: character → word → word group → clause → sentence → story
Speech: sample → spectral band → sound → ... → phone → phoneme → word
Why would deep architectures be more efficient?
[Bengio & LeCun 2007, “Scaling Learning Algorithms Towards AI”]
A deep architecture trades space for time (or breadth for depth):
more layers (more sequential computation),
but less hardware (less parallel computation).
Example 1: N-bit parity (see the sketch after this list)
• requires N-1 XOR gates in a tree of depth log(N); even easier if we use threshold gates
• requires an exponential number of gates if we restrict ourselves to 2 layers (a DNF formula with an exponential number of minterms).
Example 2: circuit for addition of two N-bit binary numbers
• requires O(N) gates and O(N) layers using N one-bit adders with ripple carry propagation
• requires lots of gates (some polynomial in N) if we restrict ourselves to two layers (e.g. Disjunctive Normal Form).
Bad news: almost all Boolean functions have a DNF formula with an exponential number of minterms, O(2^N)...
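A small Python sketch of Example 1: a balanced XOR tree computes N-bit parity with N-1 gates in O(log N) depth, whereas a two-layer (DNF) circuit needs 2^(N-1) minterms:

    def parity(bits):
        # XOR tree: pair up inputs and recurse; depth grows like log2(N).
        if len(bits) == 1:
            return bits[0]
        paired = [a ^ b for a, b in zip(bits[0::2], bits[1::2])]
        if len(bits) % 2:           # an odd leftover bit passes through
            paired.append(bits[-1])
        return parity(paired)

    print(parity([1, 0, 1, 1]))     # 1: an odd number of ones
    # A 2-layer DNF for N-bit parity needs one minterm per odd-parity input,
    # i.e. 2**(N-1) minterms, which is exponential in N.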
