
Introduction to Deep Learning:
a two-week lecture
Part 2

Presented by: Dr. Jeaneth Machicao, October 2020


[email protected]
Course overview: STAT 453: Deep Learning, Spring 2020
by Prof. Sebastian Raschka

Part 1: Introduction
● Introduction to deep learning
● The brief history of deep learning
● Single-layer neural networks: the perceptron
● Motivation: use cases
● Hands-on

Part 2: Mathematical and computational foundations
● Linear algebra and calculus for deep learning
● Parameter optimization with gradient descent
● Automatic differentiation & PyTorch

Part 3: Introduction to neural networks
● Multinomial logistic regression
● Multilayer perceptrons
● Regularization
● Input normalization and weight initialization
● Learning rates and advanced optimization algorithms

Part 4: DL for computer vision and language modeling
● Introduction to convolutional neural networks 1-2
○ CNN architectures illustrated
● Introduction to recurrent neural networks 1-2

Part 5: Deep generative models
● Autoencoders
● Autoregressive models
● Variational autoencoders
● Normalizing flow models
● Generative adversarial networks
● Evaluating generative models

http://stat.wisc.edu/~sraschka/teaching/stat453-ss2020/
https://github.com/rasbt/stat453-deep-learning-ss20

• Course playlists on YouTube:
Prof. Dalcimar Casanova: https://www.youtube.com/watch?v=0VD_2t6EdS4&list=PL9At2PVRU0ZqVArhU9QMyI3jSe113_m2-
Prof. Sebastian Raschka: https://www.youtube.com/watch?v=e_I0q3mmfw4&list=PLTKMiZHVd_2JkR6QtQEnml7swCnFBtq4P
Overview of our two-week lecture!

1: Introduction
● Introduction to deep learning
● The brief history of deep learning
● Single-layer neural networks: the perceptron
● Motivation: use cases
● Hands-on (report)

2: Mathematical and computational foundations
● Linear algebra and calculus for deep learning
● Parameter optimization with gradient descent
● Automatic differentiation & PyTorch

3: Introduction to neural networks
● Multinomial logistic regression
● Multilayer perceptrons
● Regularization
● Input normalization and weight initialization
● Learning rates and advanced optimization algorithms

4: DL for computer vision and language modeling
● Introduction to convolutional neural networks 1-2
○ CNN architectures illustrated
● Introduction to recurrent neural networks 1-2
● Deliver report of the hands-on

http://stat.wisc.edu/~sraschka/teaching/stat453-ss2020/
https://github.com/rasbt/stat453-deep-learning-ss20

• Course playlists on YouTube: Prof. Dalcimar Casanova, Prof. Sebastian Raschka
Lecture 12

Introduction to
Convolutional Neural Networks
Part 1
STAT 453: Deep Learning, Spring 2020
Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat453-ss2020/

https://github.com/rasbt/stat453-deep-learning-ss20/tree/master/L12-cnns

CNNs for Image Classification

An image goes in; the network outputs class probabilities, e.g., p(y = cat).

Image sources: twitter.com; https://www.pinterest.com/pin/244742560974520446
Object Detection

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779-788).

Object Segmentation

[Figure 2 from the paper: Mask R-CNN results on the COCO test set (crowds, sheep, street scenes, a dinner table, etc., each instance labeled with its class and confidence, e.g., person 1.00, sheep .99). These results are based on ResNet-101, achieving a mask AP of 35.7 and running at 5 fps. Masks are shown in color; bounding boxes, categories, and confidences are also shown.]

He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. "Mask R-CNN." In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961-2969. 2017.
Face Recognition

A Siamese neural network: two input images x^[1] and x^[2] are passed through the same (weight-shared) network, and a similarity/distance score between their embeddings is computed.
Lecture Overview

1. Image Classification
2. Convolutional Neural Network Basics
3. CNN Architectures
4. What a CNN Can See
5. CNNs in PyTorch

Why Image Classification is Hard

Different lighting, contrast, viewpoints, etc. Or even simple translation.

This is hard for traditional methods like multilayer perceptrons, because the prediction is essentially based on a weighted sum of raw pixel intensities: a shifted image produces a very different input vector.

Image sources: twitter.com; https://www.123rf.com/photo_76714328_side-view-of-tabby-cat-face-over-white.html
Traditional Approaches

a) Use hand-engineered features

Sasaki, K., Hashimoto, M., & Nagata, N. (2016). Person Invariant Classification of Subtle Facial Expressions Using Coded Movement Direction of
Keypoints. In Video Analytics. Face and Facial Expression Recognition and Audience Measurement (pp. 61-72). Springer, Cham.

Traditional Approaches

b) Preprocess images (centering, cropping, etc.)

Image Source: https://www.tokkoro.com/2827328-cat-animals-nature-feline-park-green-trees-grass.html

Lecture Overview

1. Image Classification
2. Convolutional Neural Network Basics
3. CNN Architectures
4. What a CNN Can See
5. CNNs in PyTorch

Main Concepts Behind Convolutional Neural Networks

• Sparse connectivity: A single element in the feature map is connected to only a small patch of pixels. (This is very different from connecting to the whole input image, as in multilayer perceptrons.)

• Parameter sharing: The same weights are used for different patches of the input image.

• Many layers: Local patterns extracted in early layers are combined into global patterns in deeper layers.
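To make sparse connectivity and parameter sharing concrete, here is a minimal PyTorch sketch (the layer sizes are my own illustration, not from the slides) comparing the parameter count of a small convolutional layer with that of a fully connected layer on the same input:

```python
import torch.nn as nn

# One 5x5 filter on a 1-channel 28x28 image: the same 25 weights (+1 bias)
# are reused at every spatial position -> 26 parameters total.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5)
print(sum(p.numel() for p in conv.parameters()))  # 26

# A fully connected layer mapping the same image to a 24x24 output
# connects every pixel to every output unit -> 452,160 parameters.
fc = nn.Linear(28 * 28, 24 * 24)
print(sum(p.numel() for p in fc.parameters()))  # 452160
```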
Convolutional Neural Networks

[LeNet-5 architecture: INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: feature maps 6@14x14 (subsampling, i.e., pooling) → C3: feature maps 16@10x10 (convolutions) → S4: feature maps 16@5x5 (subsampling) → C5: layer, 120 units (full connection) → F6: layer, 84 units (full connection) → OUTPUT, 10 units (Gaussian connections).]

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, 1998.
Hidden Layers

[Same LeNet-5 diagram: the convolution/subsampling stages act as an "automatic feature extractor"; the fully connected stages form a "regular classifier".]
Hidden Layers

Each "bunch" of feature maps represents one hidden layer in the neural network. Counting the fully connected layers, this network has 5 layers.
Convolutional Neural Networks

Annotations on the LeNet-5 diagram:
● Labels like "6@28x28" give the number of feature detectors and the size of the resulting layers.
● "Feature detectors" (weight matrices) are reused across the image ("weight sharing"); they are also called "kernels" or "filters".
● "Subsampling" is nowadays called "pooling".
● The final stage is essentially a multilayer perceptron. The original output was basically a fully-connected layer + MSE loss; nowadays it is better to use a fully-connected layer + softmax + cross-entropy.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, 1998.
Weight Sharing

A "feature detector" (filter, kernel) slides over the inputs to generate a feature map:

$\sum_{j=1}^{9} w_j x_j$

The patch of input pixels that a feature-map value is computed from is referred to as its "receptive field".

Rationale: a feature detector that works well in one region may also work well in another region. Plus, weight sharing is a nice reduction in the number of parameters to fit.
Multiple "feature detectors" (kernels) are used to create multiple feature maps, one per kernel:

$\sum_{j=1}^{9} w_j^{(1)} x_j, \qquad \sum_{j=1}^{9} w_j^{(2)} x_j, \qquad \sum_{j=1}^{9} w_j^{(3)} x_j$
Size Before and After Convolutions

Feature map size:

$O = \dfrac{W - K + 2P}{S} + 1$

where $O$ = output width, $W$ = input width, $K$ = kernel width, $P$ = padding, $S$ = stride.
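As a quick sanity check, a small sketch (with example values of my choosing) that evaluates the formula and verifies it against an actual PyTorch convolution:

```python
import torch
import torch.nn as nn

def conv_output_size(W, K, P=0, S=1):
    """O = (W - K + 2P) / S + 1"""
    return (W - K + 2 * P) // S + 1

# 32x32 input, 5x5 kernel, no padding, stride 1 -> 28x28
# (this matches LeNet-5's INPUT 32x32 -> C1 6@28x28)
print(conv_output_size(32, 5))  # 28

conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
print(conv(torch.zeros(1, 1, 32, 32)).shape)  # torch.Size([1, 6, 28, 28])
```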
Kernel Dimensions and Trainable Parameters

For a grayscale image with a 5x5 feature detector (kernel), we have the following dimensions (number of parameters to learn).

What do you think is the output size for this 28x28 image?
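For reference, a worked answer under the usual assumptions (no padding, stride 1): the formula above gives O = (28 − 5 + 2·0)/1 + 1 = 24, i.e., a 24x24 feature map, and the 5x5 kernel on a 1-channel grayscale image contributes 5 · 5 · 1 + 1 = 26 trainable parameters (weights plus bias) per output feature map.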
Backpropagation in CNNs

Same overall concept as before: the multivariable chain rule, but now with an additional weight-sharing constraint.
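Making the constraint explicit (a standard derivation, not spelled out on the slide): if a 1D feature map is computed as $z_i = \sum_j w_j x_{i+j}$, the shared weight $w_j$ contributes to every output position $i$, so its gradient is a sum over all positions:

```latex
\frac{\partial \mathcal{L}}{\partial w_j}
  = \sum_i \frac{\partial \mathcal{L}}{\partial z_i}\,
           \frac{\partial z_i}{\partial w_j}
  = \sum_i \frac{\partial \mathcal{L}}{\partial z_i}\, x_{i+j}
```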
Pooling Layers Can Help With Local Invariance

Sebastian Raschka, Vahid Mirjalili. Python Machine Learning. 3rd Edition. Birmingham, UK: Packt Publishing, 2019. ISBN: 978-1789955750

Downside: information is lost. This may not matter for classification, but it does for applications where relative position is important (like face recognition).

In practice for CNNs: some image preprocessing is still recommended.
Note that typical pooling layers do not have any learnable parameters.
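A minimal PyTorch sketch (toy input of my own) showing 2x2 max pooling and confirming that the layer has no learnable parameters:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.tensor([[[[1., 2., 0., 1.],
                    [3., 4., 1., 0.],
                    [0., 1., 5., 6.],
                    [1., 0., 7., 8.]]]])
print(pool(x).squeeze())
# tensor([[4., 1.],
#         [1., 8.]])
print(list(pool.parameters()))  # [] -- nothing to learn
```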
Lecture Overview

1. Image Classification
2. Convolutional Neural Network Basics
3. CNN Architectures
4. What a CNN Can See
5. CNNs in PyTorch

What a CNN Can See

Simple examples: vertical and horizontal edge detectors (from classical computer vision research).

A CNN can learn whatever filters it finds best for optimizing the objective (e.g., minimizing a particular loss to achieve good classification accuracy).
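To illustrate, a small sketch applying a classical Sobel kernel (my choice of edge detector; the slides do not specify the exact filter) as a fixed, unlearned convolution:

```python
import torch
import torch.nn.functional as F

# Sobel kernel for vertical edges: responds to left/right intensity changes
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)

# Toy image: dark left half, bright right half -> one vertical edge
img = torch.cat([torch.zeros(1, 1, 6, 3), torch.ones(1, 1, 6, 3)], dim=3)
print(F.conv2d(img, sobel_x).squeeze())
# large values in the two center columns (the edge), zeros elsewhere
```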
What a CNN Can See

Which patterns from the training set activate the feature map?

[Fig. 4 from the paper: evolution of a randomly chosen subset of model features through training. Each layer's features (layers 1-5) are displayed in a different block; within each block, a randomly chosen subset of features is shown at epochs 1, 2, 5, 10, 20, 30, 40, 64. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using the deconvnet approach.]

Method: backpropagate strong activation signals in hidden layers to the input images, then apply "unpooling" to map the values to the original pixel space for visualization.

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833). Springer, Cham.
What a CNN Can See

Which patterns from the training set activate the feature map?

[Visualizations for Layers 1-5; Fig. 2 of the paper shows, for layers 2-5, the top activations of features in a fully trained model.]

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833). Springer, Cham.
Lecture Overview

1. Padding (control output size in addition to stride)
2. Spatial Dropout and BatchNorm
3. Considerations for CNNs on GPUs
4. Common Architectures
○ LeNet-5
○ AlexNet
○ VGG-16
○ ResNet-50
○ Inception-v1
5. Transfer learning
Padding

[Animation: convolution with no padding, no strides.]

Adapted from: Sebastian Raschka. STAT 453: Intro to Deep Learning and Generative Models. SS 2020
Padding jargon
• "valid" convolution: no padding (the feature map may shrink)
• "same" convolution: padding chosen so that the output size equals the input size (see the sketch below)
• Common kernel size conventions: 3x3, 5x5, 7x7 (sometimes 1x1 in later layers to reduce channels)

Adapted from: Sebastian Raschka. STAT 453: Intro to Deep Learning and Generative Models. SS 2020
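A minimal sketch of the two conventions (hypothetical layer sizes):

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 28, 28)

# "valid" convolution: no padding, the feature map shrinks
valid = nn.Conv2d(1, 8, kernel_size=3, padding=0)
print(valid(x).shape)  # torch.Size([1, 8, 26, 26])

# "same" convolution: padding keeps output size == input size (stride 1)
same = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # recent PyTorch also accepts padding='same'
print(same(x).shape)   # torch.Size([1, 8, 28, 28])
```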
Lecture Overview

1. Padding (control output size in addition to stride)
2. Spatial Dropout and BatchNorm
3. Considerations for CNNs on GPUs
4. Common Architectures
○ LeNet-5
○ AlexNet
○ VGG-16
○ ResNet-50
○ Inception-v1
5. Transfer learning
Common Architectures Revisited

We will discuss some additional common CNN architectures, since the field has evolved quite a bit since 2012 ...

Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.
CNN Architectures Illustrated

1. LeNet-5
2. AlexNet
3. VGG-16
4. ResNet-50
5. Inception-v1
6. Inception-v3
7. Xception
8. Inception-v4
9. Inception-ResNets
10. ResNeXt-50

[Architecture diagrams for LeNet-5, AlexNet, VGG-16, ResNet-50, and Inception-v1, with a legend.]

Source: https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d
1. LeNet-5 (1998)

● ~60,000 parameters.
● One of the simplest architectures: ("5" layers) 2 convolutional and 3 fully-connected layers.
● Sub-sampling layers with trainable weights (aka average-pooling layers); trainable pooling weights are not current practice in designing CNNs.
● This architecture has become the standard "template": stacking convolutions and pooling layers, and ending the network with one or more fully-connected layers (see the sketch below).

Paper: Gradient-Based Learning Applied to Document Recognition. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Proceedings of the IEEE (1998).
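A compact PyTorch sketch of the LeNet-5 layer stack (my own rendering: max pooling and ReLU stand in for the original trainable subsampling and tanh/Gaussian output, in line with current practice):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 6@28x28 (C1)
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> 6@14x14 (S2)
            nn.Conv2d(6, 16, kernel_size=5),  # -> 16@10x10 (C3)
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> 16@5x5 (S4)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),       # C5
            nn.ReLU(),
            nn.Linear(120, 84),               # F6
            nn.ReLU(),
            nn.Linear(84, num_classes),       # OUTPUT
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(model(torch.zeros(1, 1, 32, 32)).shape)            # torch.Size([1, 10])
print(sum(p.numel() for p in model.parameters()))        # ~61,700 (the "~60,000 parameters" above)
```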
2. AlexNet (2012)

● Input: 227x227x3.
● ~60M parameters.
● 8 layers: 5 convolutional and 3 fully-connected.
● AlexNet essentially stacked a few more layers onto LeNet-5.
● Among the first to use Rectified Linear Units (ReLUs) as activation functions.
● Trained on two GTX 580 GPUs for 5 to 6 days.

Paper: ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton. University of Toronto, Canada. NeurIPS 2012.
3. VGG-16 (2014)

Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.

● CNNs were starting to get deeper and deeper.
● Creators: Visual Geometry Group (VGG), University of Oxford, UK.
● 13 convolutional and 3 fully-connected layers, carrying over the ReLU tradition from AlexNet.
● Stacks more layers onto AlexNet and uses smaller filter sizes (2x2 and 3x3).
● ~138M parameters; ~500MB of storage space.
● The contribution of this paper is the design of deeper networks (roughly twice as deep as AlexNet).

Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition. Karen Simonyan, Andrew Zisserman. University of Oxford, UK. arXiv preprint arXiv:1409.1556 (2014).

PyTorch implementation:
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/vgg16.ipynb
4. ResNet-50 (2015)

Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.

"with the network depth increasing, accuracy gets saturated and then degrades rapidly."
● Microsoft Research addressed this degradation problem using skip connections.
● ResNet variants with 34, 50, 101, and up to 152 layers, without compromising generalisation power.
● ~26M parameters (ResNet-50).
● Among the first to use batch normalisation.
● The 152-layer model was trained on a cluster of 8 GPUs for 2 to 3 weeks.
● The basic building blocks of ResNets are the convolutional and identity blocks (see the sketch below).

Paper: Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Microsoft. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). GitHub code from keras-team.

PyTorch implementations of the residual blocks and full networks:
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/resnet-blocks.ipynb
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/resnet-34.ipynb
https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/resnet-152.ipynb
(Can be substantially improved with more hyperparameter tuning.)
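A minimal sketch of the identity-block idea behind the skip connections (a simplified version of my own; the blocks in the linked notebooks are more elaborate):

```python
import torch
import torch.nn as nn

class IdentityBlock(nn.Module):
    """Residual block: output = ReLU(F(x) + x), with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)  # the "+ x" is the skip connection

block = IdentityBlock(64)
print(block(torch.zeros(2, 64, 8, 8)).shape)  # torch.Size([2, 64, 8, 8])
```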
5. Inception-v1 / GoogLeNet (2014)

Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.

"In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous 'we need to go deeper' internet meme."

[Full architecture diagram.]

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9. 2015.
5. Inception-v1 (2014)

"The main hallmark of this architecture is the improved utilisation of the computing resources inside the network."
● 22-layer architecture with 5M parameters.
● Uses the Network In Network approach via "Inception modules": dense modules/blocks instead of simply stacking convolutional layers.
● Each module combines 3 ideas (see the sketch below):
1) Parallel towers of convolutions with different filters, followed by concatenation; this captures different features at 1x1, 3x3 and 5x5, "clustering" them.
2) 1x1 convolutions for dimensionality reduction (to avoid computational bottlenecks).
3) Two auxiliary classifiers, discarded at inference time.
● Curiosity: the name Inception comes from the movie.

Paper: Going Deeper with Convolutions. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. Google, University of Michigan, University of North Carolina. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
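A minimal sketch of idea 1, the parallel towers (simplified: real Inception modules also place 1x1 reductions inside the 3x3 and 5x5 towers, per idea 2):

```python
import torch
import torch.nn as nn

class MiniInception(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 towers, concatenated along the channel axis."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.tower1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.tower3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.tower5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

    def forward(self, x):
        return torch.cat([self.tower1(x), self.tower3(x), self.tower5(x)], dim=1)

m = MiniInception(64, 32)
print(m(torch.zeros(1, 64, 28, 28)).shape)  # torch.Size([1, 96, 28, 28])
```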
Appendix: Network In Network (2014)

● Recalling from convolution: an output pixel is a linear combination of the weights in a filter and the current sliding window.
● NiN proposes a mini neural network with 1 hidden layer instead of this linear combination.
● This one-hidden-layer network inside a CNN, a.k.a. MLPconv, is equivalent to 1x1 convolutions; MLP convolutional layers are the main feature adopted by the Inception architectures (see the sketch below).
● Global average pooling: take the average of each feature map and feed the resulting vector directly into the softmax layer.

Paper: Network In Network. Min Lin, Qiang Chen, Shuicheng Yan. National University of Singapore. arXiv preprint, 2013.
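A small sketch of both NiN ideas, with illustrative shapes:

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 256, 7, 7)

# 1x1 convolution: a per-pixel MLP across channels (256 -> 64), no spatial mixing
reduce = nn.Conv2d(256, 64, kernel_size=1)
print(reduce(x).shape)          # torch.Size([1, 64, 7, 7])

# Global average pooling: one number per feature map, fed to the classifier
gap = nn.AdaptiveAvgPool2d(1)
print(gap(x).flatten(1).shape)  # torch.Size([1, 256])
```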
[Recap: architecture diagrams for LeNet-5, AlexNet, VGG-16, ResNet-50, and Inception-v1.]

Source: https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d
Lecture Overview

1. Padding (control output size in addition to stride)
2. Spatial Dropout and BatchNorm
3. Considerations for CNNs on GPUs
4. Common Architectures
○ LeNet-5
○ AlexNet
○ VGG-16
○ ResNet-50
○ Inception-v1
5. Transfer learning
Transfer Learning

Key idea:
✦ Feature extraction layers may be generally useful.
✦ Use a pre-trained model (e.g., pre-trained on ImageNet).
✦ Freeze the weights: only train the last layer (or last few layers). (See the sketch below.)

Related approach: fine-tuning, i.e., training a pre-trained network on your smaller dataset.

Adapted from: Sebastian Raschka. STAT 453: Intro to Deep Learning and Generative Models. SS 2020
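A minimal sketch of the freeze-and-replace recipe (using torchvision's VGG16 for illustration, with the pretrained=True API of the course's era; newer torchvision versions use a weights= argument instead):

```python
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on ImageNet
model = models.vgg16(pretrained=True)

# Freeze all weights
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier layer with a fresh one for our task
num_classes = 10  # hypothetical target dataset
model.classifier[6] = nn.Linear(4096, num_classes)

# Only the new layer's parameters will be updated during training
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['classifier.6.weight', 'classifier.6.bias']
```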
Example 3 - Feature Extractor

● Pre-trained VGG19 model:

Image 224x224 → Two Conv3-64 + Max-pool → Two Conv3-128 + Max-pool → Four Conv3-256 + Max-pool → Four Conv3-512 + Max-pool → Four Conv3-512 + Max-pool → 3 FC

[Slides I-IV repeat this diagram, stepping through the pipeline stage by stage.]

[Slide V: the network truncated after the Conv3-256 block feeds a small fully-connected network (Layer 1: inputs x1, x2 plus bias +1; Layer 2: hidden units a1(2), a2(2) plus bias +1; Layer 3: output a1(3)).]
Which Layers to Replace & Train

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1725-1732).

Adapted from: Sebastian Raschka. STAT 453: Intro to Deep Learning and Generative Models. SS 2020
Transfer Learning

PyTorch implementation: https://github.com/rasbt/stat453-deep-learning-ss20/blob/master/L13-cnns-part2/code/vgg16-transferlearning.ipynb

[In the notebook, the feature-extraction layers are frozen ("Freeze") and the final classifier layers are replaced ("Replace").]

Adapted from: Sebastian Raschka. STAT 453: Intro to Deep Learning and Generative Models. SS 2020
Extra: Useful tools to visualize DL architectures
● Netron
● Tensorboard
● PyTorchViz
● plot_model API by Keras

Lecture 13

Introduction to
Convolutional Neural Networks
Part 3
STAT 479: Deep Learning, Spring 2019
Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat479-ss2019/



Additional Concepts to Wrap Up the
Intro to Convolutional Neural Networks



ConvNets and 3D Inputs

[Figure 1 from the paper: Temporal 3D ConvNet (T3D). The Temporal Transition Layer (TTL) is applied to DenseNet3D. T3D uses video clips as input; the 3D feature maps from the clips are densely propagated throughout the network. The TTL operates on different temporal depths, allowing the model to capture appearance and temporal information at short, mid, and long ranges. The output of the network is a video-level prediction.]

Diba, Ali, Mohsen Fayyaz, Vivek Sharma, Amir Hossein Karami, Mohammad Mahdi Arzani, Rahman Yousefzadeh, and Luc Van Gool. "Temporal 3D ConvNets: New architecture and transfer learning for video classification." arXiv preprint arXiv:1711.08200 (2017).

Also very popular for medical imaging (MRI, CT scans, ...).
ConvNets and 3D Inputs

Same concept as before, except that we now have 3D images and kernels:

$X \in \mathbb{R}^{n_1 \times n_2 \times c_{in}}, \qquad W \in \mathbb{R}^{m_1 \times m_2 \times c_{in} \times c_{out}}, \qquad b \in \mathbb{R}^{c_{out}}$
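A minimal 3D-convolution sketch over a video-like input (illustrative shapes: batch, channels, frames, height, width):

```python
import torch
import torch.nn as nn

# Batch of 2 RGB clips, 16 frames of 64x64 each
clips = torch.zeros(2, 3, 16, 64, 64)

# The 3D kernel slides over time as well as height and width
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=(3, 5, 5))
print(conv3d(clips).shape)  # torch.Size([2, 8, 14, 60, 60])
```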


ConvNets for Text with 1D Convolutions

We can think of text as an image with width 1 (concatenated word embeddings), e.g., "This is my great sentence".

https://pytorch.org/docs/stable/nn.html#conv1d
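A minimal Conv1d sketch over a toy embedded sentence (the dimensions are my own illustration):

```python
import torch
import torch.nn as nn

# A 5-word sentence, each word as a 50-dimensional embedding.
# Conv1d expects (batch, channels = embedding_dim, sequence_length).
sentence = torch.zeros(1, 50, 5)

# A kernel of width 3 slides over word positions (a "trigram detector")
conv1d = nn.Conv1d(in_channels=50, out_channels=16, kernel_size=3)
print(conv1d(sentence).shape)  # torch.Size([1, 16, 3])
```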


CNNs for Text (with 2D Convolutions)

Good results have also been achieved by representing a sentence as a matrix of word vectors and applying 2D convolutions (where each filter uses a different kernel size).

[Figure 1 from the paper: model architecture with two channels for an example sentence ("wait for the video and do n't rent it"): an n x k representation of the sentence with static and non-static channels, a convolutional layer with multiple filter widths and feature maps, max-over-time pooling, and a fully connected layer with dropout and softmax output. The sentence is represented as the concatenation of its word vectors, $x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$.]

Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Pre-Trained Models for Text

https://modelzoo.co/model/pytorch-nlp



Lecture 14

Introduction to
Recurrent Neural Networks
STAT 453: Deep Learning, Spring 2020
Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat453-ss2020/

Lecture Slides:
https://github.com/rasbt/stat453-deep-learning-ss20/tree/master/L14-rnns

A Classic Approach for Text Classification: Bag-of-Words Model

"Raw" training dataset:
x^[1] = "The sun is shining"
x^[2] = "The weather is sweet"
x^[3] = "The sun is shining, the weather is sweet, and one and one is two"

vocabulary = {
    'and': 0,
    'is': 1,
    'one': 2,
    'shining': 3,
    'sun': 4,
    'sweet': 5,
    'the': 6,
    'two': 7,
    'weather': 8,
}

Training set as design matrix (word counts per document):

X = [[0 1 0 1 1 0 1 0 0],
     [0 1 0 0 0 1 1 0 1],
     [2 3 2 1 1 1 2 1 1]]

training class labels: y = [0, 1, 0]

These features are then fed into a classifier (e.g., logistic regression, MLP, ...).
Ex.: https://github.com/rasbt/python-machine-learning-book-3rd-edition/tree/master/ch08
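A minimal sketch reproducing the vocabulary and design matrix with scikit-learn's CountVectorizer (the course example repository linked above takes the same approach):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The sun is shining",
        "The weather is sweet",
        "The sun is shining, the weather is sweet, and one and one is two"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)  # {'and': 0, 'is': 1, ..., 'weather': 8}
print(X.toarray())
# [[0 1 0 1 1 0 1 0 0]
#  [0 1 0 0 0 1 1 0 1]
#  [2 3 2 1 1 1 2 1 1]]
```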
1D CNNs for Text (and Other Sequence Data)

[Illustration: the characters of "The sun is shining ..." laid out as a 1D sequence, with a 1D kernel sliding along it.]
Lecture Overview

RNNs and Sequence Modeling Tasks
Backpropagation Through Time
Long Short-Term Memory (LSTM)
Many-to-One Word RNNs
Generating Text with Character RNNs
Attention Mechanisms and Transformers
Sequential data is not i.i.d.

Figure: Sebastian Raschka, Vahid Mirjalili. Python Machine Learning. 3rd Edition. Birmingham, UK: Packt Publishing, 2019.
Applications: Working with Sequential Data

• Text classification
• Speech recognition (acoustic modeling)
• Language translation
• Stock market predictions
• DNA or amino acid/protein sequence modeling
• ...

[Figure: the KEGRU model of Shen et al., which splits a DNA sequence into k-mers, learns k-mer embeddings with pre-trained word2vec, feeds them to a GRU network to capture long-range dependencies, and produces predictions via a dense layer.]
Shen, Zhen, Wenzheng Bao, and De-Shuang Huang. "Recurrent Neural Network for Predicting Transcription Factor Binding Sites." Scientific Reports 8, no. 1 (2018): 15270.

[Figure: actual vs. predicted data from four models for each stock index, Year 1, 2010.10.01 to 2011.09.30.]
Bao, Wei, Jun Yue, and Yulei Rao. "A deep learning framework for financial time series using stacked autoencoders and long-short term memory." PLoS ONE 12, no. 7 (2017): e0180944.
Overview

Networks we used previously are also called feedforward neural networks. A Recurrent Neural Network (RNN) adds a recurrent edge: the hidden state at time step t is fed back into the network at the next time step.

Figures: Sebastian Raschka, Vahid Mirjalili. Python Machine Learning. 3rd Edition. Birmingham, UK: Packt Publishing, 2019.
Different Types of Sequence Modeling Tasks

Figure: Sebastian Raschka, Vahid Mirjalili. Python Machine Learning. 3rd Edition. Birmingham, UK: Packt Publishing, 2019. Based on: The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy (http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
Different Types of Sequence Modeling Tasks

Many-to-one: the input data is a sequence, but the output is a fixed-size vector, not a sequence.

Ex.: sentiment analysis; the input is some text, and the output is a class label. (A minimal sketch follows below.)

Image: https://www.kdnuggets.com/images/sentiment-fig-1-689.jpg
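A minimal many-to-one sketch in PyTorch (toy dimensions of my choosing): the final hidden state of an LSTM summarizes the sequence and feeds a classifier:

```python
import torch
import torch.nn as nn

class ManyToOneRNN(nn.Module):
    def __init__(self, emb_dim=50, hidden=64, num_classes=2):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):               # x: (batch, seq_len, emb_dim)
        _, (h_n, _) = self.rnn(x)       # h_n: (1, batch, hidden), last time step
        return self.fc(h_n.squeeze(0))  # class logits: (batch, num_classes)

model = ManyToOneRNN()
print(model(torch.zeros(4, 12, 50)).shape)  # torch.Size([4, 2])
```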
Different Types of Sequence Modeling Tasks

One-to-many: the input data is in a standard format (not a sequence); the output is a sequence.

Ex.: image captioning, where the input is an image and the output is a text description of that image.

Image: https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2
Different Types of Sequence Modeling Tasks

Many-to-many: both inputs and outputs are sequences; the mapping can be direct or delayed.

Ex.: video captioning, i.e., describing a sequence of images via text (direct); translating one language into another (delayed).

Image: https://static-01.hindawi.com/articles/mpe/volume-2018/3125879/figures/3125879.fig.001.svgz
