
(BRIEF INTRODUCTION TO)


DEEP LEARNING
Readings:
• G. James, D. Witten, T. Hastie, R. Tibshirani, J. Taylor. An Introduction to Statistical Learning with Applications in Python. Springer, July 2023.
• François Chollet. Deep Learning with Python. 2nd Edition, Manning, 2021.
• Ian Goodfellow, Yoshua Bengio, Aaron Courville. Deep Learning. MIT Press, 2016.

AI, ML and Deep Learning


• AI is: “the effort to automate intellectual tasks normally
performed by humans.”


Analytical engine
• In the 1830s and 1840s, Charles Babbage invented the Analytical Engine.
• In 1843, Ada Lovelace remarked on the invention of the
Analytical Engine:
“The Analytical Engine has no pretensions whatever to originate
anything. It can do whatever we know how to order it to
perform... Its province is to assist us in making available what
we’re already acquainted with.”

AI in the 50’s
• AI pioneer Alan Turing cited this remark as “Lady Lovelace’s objection” in his landmark 1950 paper “Computing Machinery and Intelligence,”¹ which introduced the Turing test.
• Turing was of the opinion that computers could in principle be
made to emulate all aspects of human intelligence.

1 A.M. Turing, “Computing Machinery and Intelligence,” Mind 59, no. 236 (1950): 433–460.


Programming and ML
• The usual way to make a computer do useful work is to have a
human programmer write down rules — a computer program.
• Machine learning turns this around: the machine looks at the
input data and the corresponding answers, and figures out what
the rules should be.

Machine Learning
• A machine learning system is trained rather than explicitly
programmed.


Machine Learning
• Machine learning, and especially deep learning, exhibits some
mathematical theory, but is fundamentally an engineering
discipline.
• Unlike theoretical physics or mathematics, machine learning is a
very hands-on field driven by empirical findings and deeply
reliant on advances in software and hardware.

Machine Learning
• To do machine learning we need:
• Input data points—For instance, if the task is speech recognition,
these data points could be sound files of people speaking.
• Examples of the expected output—In a speech-recognition task, these
could be human-generated transcripts of sound files.
• A way to measure whether the algorithm is doing a good job: a way to determine the distance between the algorithm’s current output and the expected output. The measurement is used as a feedback signal to adjust the way the algorithm works. This adjustment step is what we call learning.

ML and DL
• The central problem in machine learning and deep learning
is to meaningfully transform data: in other words, to learn
useful representations of the input data at hand —
representations that get us closer to the expected output.
• Machine learning models are all about finding appropriate
representations for their input data — transformations of the
data that make it more amenable to the task at hand.

Representations of data
• Machine learning is about finding appropriate representations for the input data, i.e., transformations of the data.


How to learn?
• Coming up with such a representation by hand is fine for an extremely simple problem!
• Could you do the same if the task were to classify images of
handwritten digits?
• Could you write down explicit, computer-executable image
transformations that would illuminate the difference
between a 6 and an 8, between a 1 and a 7, across all kinds
of different handwriting?


How to learn?
• Hardly!
• It will be much easier to automate the process!
• Learning, in the context of machine learning, describes an
automatic search process for data transformations that
produce useful representations of some data, guided by
some feedback signal.


Machine Learning basics


• Machine learning/Computational Intelligence is a field of computer science that
gives computers the ability to learn without being explicitly programmed.
• Methods that can learn from and make predictions on data.

Machine Learning
(Diagram) Training: labeled data → machine learning algorithm → learned model. Prediction: new data → learned model → prediction.
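As a minimal illustration of this train-then-predict workflow (not code from the slides; it assumes scikit-learn is installed and uses its small digits dataset as a stand-in for real labeled data):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                   # labeled data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)              # machine learning algorithm
model.fit(X_train, y_train)                            # training -> learned model
print(model.score(X_test, y_test))                     # prediction on held-out data
```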


Types of learning
• Supervised: learning with a labeled training set
• Ex.: email classification with labeled emails
• Unsupervised: discover patterns in unlabeled data
• Ex.: cluster similar documents based on text
• Reinforcement: learn to act based on feedback/reward
• Ex.: learn to play Go; reward: win or lose
• Self-supervised: labels generated from the input data
• Ex.: image and text classification
• Other applications: anomaly detection, sequence labeling, etc.
(Figures: illustrations of classification, clustering and regression.)

Self-Supervised Learning
• Supervised learning without human-annotated labels; supervised learning without any humans in the loop.
• Labels are generated from the input data, typically using a heuristic algorithm.
• Autoencoders are a well-known example, where the generated targets are the inputs themselves, unmodified.
• Other examples: predict the next frame in a video, given past frames; predict the next word in a text, given previous words.
• Self-supervised learning can be regarded as either supervised or unsupervised learning, depending on whether you pay attention to the learning mechanism or to the context of its application.
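A minimal sketch of the autoencoder idea mentioned above (illustrative only; it assumes Keras/TensorFlow and uses random numbers as a stand-in for real inputs). The key point is that the target passed to fit is the input itself, so no human labels are involved.

```python
import numpy as np
from tensorflow import keras

x = np.random.rand(1000, 20).astype("float32")        # stand-in for 1,000 real inputs

# Compress each input to 4 numbers, then try to reconstruct the original 20.
autoencoder = keras.Sequential([
    keras.layers.Dense(4, activation="relu", input_shape=(20,)),   # encoder
    keras.layers.Dense(20, activation="sigmoid"),                  # decoder
])
autoencoder.compile(optimizer="adam", loss="mse")

autoencoder.fit(x, x, epochs=5, batch_size=32, verbose=0)          # target = input
```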

Deep learning
• Deep learning is a very active area of research in the machine learning and artificial intelligence communities. The cornerstone of deep learning is the neural network.
• The term deep learning became widespread around 2010. Roughly speaking, it consists of new, deeper and more elaborate architectures that build on the earlier shallow neural networks.


Activation functions

• The rectified linear unit (ReLU) activation function is the piecewise-linear function

g(z) = max(0, z),

which does not squash its output into a bounded range.


Characteristics of ReLU
• All negative inputs are mapped to zero, so the output contains no negative values.
• The output is unbounded above (there is no saturation threshold), so the vanishing-gradient problem that slows down learning with saturating activations is avoided.
• It is fast to compute compared to other activation functions, such as the sigmoid.
• In general, networks built from ReLUs may need more neurons because each unit is piecewise linear.
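A tiny NumPy sketch of ReLU next to the sigmoid, for comparison (illustrative only):

```python
import numpy as np

def relu(z):
    # Piecewise linear: negative inputs become 0, positive inputs pass through unchanged.
    return np.maximum(0.0, z)

def sigmoid(z):
    # Squashes any input into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))      # [0.  0.  0.  0.5 2. ]
print(sigmoid(z))   # values strictly between 0 and 1
```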


Example: handwritten digits


• MNIST database
• See LeCun, Cortes, and Burges (2010) “The MNIST database of handwritten
digits”, available at http://yann.lecun.com/exdb/mnist.
• Each grayscale image has 28 × 28 pixels, each of which is an eight-bit number (0–255) that represents how dark that pixel is.
• The digits 3, 5, and 8 are enlarged to show their 784 individual pixel values, which
are the inputs.
• The outputs are the 10 digits, Y = (Y0, Y1, ... , Y9).
• There are 60,000 training images, and 10,000 test images.




MLP for handwritten digit recognition


• Number of inputs: p = 784.
• Two hidden layers: L1 with 256 neurons and L2 with 128 neurons.
• Activation functions: logistic or softmax (values between 0 and 1).
• Number of output variables: 10.
• Number of weights between X and L1 (784 inputs plus a bias unit): 785 × 256 = 200,960.
• Number of weights between L1 and L2 (256 units plus a bias unit): 257 × 128 = 32,896.
• Number of weights between L2 and Y (128 units plus a bias unit): 129 × 10 = 1,290.
• Total number of weights: 235,146, with only 60,000 training samples.
• To avoid overfitting, regularization is needed.
• Here two forms of regularization are used: ridge regularization (similar to ridge regression) and dropout regularization.
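A sketch of how such a network could be set up in Keras (the slides give no code; ReLU hidden units, the dropout rates and the penalty strength below are assumptions, but the layer sizes reproduce the weight counts above):

```python
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

l2 = keras.regularizers.l2(1e-4)   # ridge penalty on the hidden-layer weights (assumed value)
model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", kernel_regularizer=l2, input_shape=(784,)),
    keras.layers.Dropout(0.4),
    keras.layers.Dense(128, activation="relu", kernel_regularizer=l2),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()   # total parameter count: 235,146, matching the count above
```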


Handwritten recognition
• Softmax activation function for the output layer:

f_m(X) = Pr(Y = m | X) = e^{Z_m} / (e^{Z_0} + e^{Z_1} + ... + e^{Z_9}),

for m = 0, 1, ..., 9. This ensures that the 10 numbers behave like probabilities (non-negative and sum to one).
• When the response is quantitative, we minimize the squared-error loss.
• When the response is qualitative, as here, we minimize the cross-entropy loss:

− Σ_i Σ_{m=0}^{9} y_{im} log(f_m(x_i)),

where y_{im} = 1 if the true class of observation i is m, and 0 otherwise.
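A small NumPy check of these two formulas (illustrative, not from the slides):

```python
import numpy as np

def softmax(z):
    # Subtracting the max improves numerical stability; the probabilities are unchanged.
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, true_class):
    # Negative log-probability assigned to the correct class.
    return -np.log(probs[true_class])

z = np.array([2.0, 1.0, 0.1])           # scores Z_m produced by the output layer
p = softmax(z)
print(p, p.sum())                        # non-negative and sums to one
print(cross_entropy(p, true_class=0))    # small when the correct class gets high probability
```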


Results


Regularization and SGD


• Gradient descent usually takes many steps to reach a local minimum.
• In practice, there are a number of approaches for accelerating the process.
• When the number of observations is large, one can sample a small fraction
or mini-batch to compute a gradient step.
• This process is known as stochastic gradient descent (SGD) and is the state
of the art for learning deep neural networks.
• Recall the digit recognition problem. The network has over 235,000
weights, which is around four times the number of training examples.
Regularization is essential here to avoid overfitting.


Stochastic Gradient Descent


• Stochastic gradient descent (SGD): compute the gradient at a single sample chosen randomly at each step.
• Mini-batch gradient descent: compute the gradient at a mini-batch of points, also chosen randomly at each step.
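A minimal NumPy sketch of mini-batch SGD on a least-squares problem (illustrative; the batch size and learning rate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # 1,000 observations, 5 features
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                                     # parameters to learn
lr, batch_size = 0.1, 32

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)    # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size                # gradient of the batch loss
    w -= lr * grad                                              # one gradient step
print(w)   # close to w_true
```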


Ridge regularization
• Ridge regularization is achieved by augmenting the objective function with a penalty term:

R(θ; λ) = − Σ_i Σ_m y_{im} log(f_m(x_i)) + λ Σ_j θ_j²,

where the second sum runs over the weights being penalized.
• The parameter λ is often preset at a small value. In the MNIST example, the weights in the two hidden layers are penalized and the output layer is not penalized at all.
• Lasso regularization (a penalty proportional to Σ_j |θ_j|) is also popular, either as an alternative to ridge or as an additional form of regularization.



Application to MNIST database


• Mini-batch size: 128 observations per gradient update.
• 20% of the 60,000 training observations were used as a validation set in
order to determine when training should stop.
• So, 48,000 observations were used for training, and hence there are
48,000/128 = 375 mini-batch gradient updates per epoch.
• The validation objective starts to increase by 30 epochs, so early stopping
can also be used as an additional form of regularization.
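This training recipe could be written in Keras roughly as follows (a sketch continuing the earlier MNIST model; `model`, `x_train` and `y_train` are assumed to exist as in that sketch):

```python
from tensorflow import keras

# Stop once the validation loss stops improving and keep the best weights
# (early stopping as an additional form of regularization).
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                           patience=5,
                                           restore_best_weights=True)

history = model.fit(x_train, y_train,
                    batch_size=128,         # 128 observations per gradient update
                    epochs=50,
                    validation_split=0.2,   # 20% of 60,000 images held out for validation
                    callbacks=[early_stop])
# 48,000 training images / 128 per batch = 375 gradient updates per epoch
```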


Dropout regularization
• Inspired by random forests, the idea is to randomly remove a fraction f
of the units in a layer when fitting the model.


Dropout regularization
• The random removal of a fraction f of the units is done separately each time a training observation is processed.
• The surviving units stand in for those that are missing, and their weights are scaled up by a factor of 1/(1 − f) to compensate.
• This prevents nodes from becoming over-specialized and can be seen as a form of regularization.
• In practice, dropout is achieved by randomly setting the activations of the “dropped out” units to zero, while keeping the architecture intact.
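A sketch of this mechanism in NumPy (illustrative “inverted dropout”, where the 1/(1 − f) scaling is applied during training):

```python
import numpy as np

def dropout(activations, f, rng):
    # Zero out a random fraction f of the units for this observation and
    # scale the survivors by 1/(1 - f) so the expected activation is unchanged.
    keep_mask = rng.random(activations.shape) >= f
    return activations * keep_mask / (1.0 - f)

rng = np.random.default_rng(0)
a = np.ones(10)                      # activations of a hidden layer
print(dropout(a, f=0.4, rng=rng))    # roughly 40% zeros, survivors scaled to 1/0.6
```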


Images from CIFAR100 database


• Neural networks rebounded around 2010 with big successes in image
classification.


CIFAR100 database
• CIFAR100 database:
• See Krizhevsky (2009) “Learning multiple layers of features from tiny images”,
available at https://www.cs.toronto.edu/~kriz/.
• It has 60,000 images labeled according to 20 superclasses (e.g. aquatic
mammals), with 5 classes per superclass (beaver, dolphin, otter, seal, whale).
• Each image has a resolution of 32 × 32 pixels, with three 8-bit numbers per pixel
representing red, green and blue.
• The feature map is a 3-dimensional array: 2D space and the three colors.
• There is a designated training set of 50,000 images, and a test set of 10,000
images.


Convolutional filter
• Example of a 4 × 3 image:

a b c
d e f
g h i
j k l

• Consider a 2 × 2 filter of the form:

α β
γ δ

• The convolved image is obtained by sliding the filter over the image and, at each position, taking the sum of the element-wise products:

aα + bβ + dγ + eδ   bα + cβ + eγ + fδ
dα + eβ + gγ + hδ   eα + fβ + hγ + iδ
gα + hβ + jγ + kδ   hα + iβ + kγ + lδ
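The same sliding-window computation can be checked directly in NumPy (illustrative; the image and filter values below are arbitrary):

```python
import numpy as np

def convolve2d_valid(image, filt):
    # Slide the filter over the image; at each position take the
    # sum of the element-wise products, as in the example above.
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * filt)
    return out

image = np.arange(12, dtype=float).reshape(4, 3)   # stand-in for the 4 x 3 image
filt = np.array([[1.0, 0.0],
                 [0.0, 1.0]])                      # an arbitrary 2 x 2 filter
print(convolve2d_valid(image, filt))               # 3 x 2 convolved image
```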


Example of convolution filter


• Image of a tiger with 192 × 179 pixels.
• The convolution filters are 15 × 15 images containing zeros (black), with a narrow strip of ones (white) oriented either vertically or horizontally.


Pooling layer
• A pooling (subsampling) layer provides a way to condense a large image into a smaller summary image.
• The max pooling operation takes the maximum value of each non-overlapping 2 × 2 block of pixels in an image.
• Simple example of max pooling: the 4 × 4 image

1 2 5 3
3 0 1 2
2 1 3 4
1 1 2 0

is condensed into the 2 × 2 summary

3 5
2 4
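The same operation in NumPy (illustrative sketch for images whose sides are multiples of 2):

```python
import numpy as np

def max_pool_2x2(image):
    # Maximum over each non-overlapping 2 x 2 block.
    H, W = image.shape
    return image.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

image = np.array([[1, 2, 5, 3],
                  [3, 0, 1, 2],
                  [2, 1, 3, 4],
                  [1, 1, 2, 0]])
print(max_pool_2x2(image))   # [[3 5]
                             #  [2 4]]
```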


CNN for CIFAR100 image classification task


• For the CIFAR100 test set, the best accuracy was just above 75% in 2016; today it is close to 100%!


Data augmentation
• An additional important trick used with image modeling is data
augmentation.
• Each training image is replicated many times, with each replicate
randomly distorted in a natural way such that human recognition is
unaffected.
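One common way to express such random, label-preserving distortions is with Keras preprocessing layers (a sketch; the particular transformations and their ranges are assumptions, not taken from the slides):

```python
from tensorflow import keras

# Random distortions applied to each training image on the fly.
data_augmentation = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),      # mirror left-right
    keras.layers.RandomRotation(0.05),          # small random rotation
    keras.layers.RandomZoom(0.1),               # zoom in or out by up to 10%
    keras.layers.RandomTranslation(0.1, 0.1),   # shift by up to 10% in each direction
])

# Typically placed at the start of the model:
# model = keras.Sequential([data_augmentation, <convolutional layers>, ...])
```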


Example
• Original image (leftmost) is distorted in natural ways to produce different
images with the same class label.
• These distortions do not fool humans, and act as a form of regularization
when fitting the CNN.


Document classification
• New example that has important applications in industry and science: predicting
attributes of documents.
• Example: IMDb (Internet Movie Database) ratings:
• Maas et al. (2011) “Learning word vectors for sentiment analysis”, in Proc. 49th Annual
Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.
142–150.
• Extract of a negative review:
This has to be one of the worst films of the 1990s. When my friends & I were watching
this film (being the target audience it was aimed at) we just sat & watched the first half
an hour with our jaws touching the floor at how bad it really was. The rest of the time,
everyone else in the theater just started talking to each other, leaving or generally
crying into their popcorn...

Document classification
• Each review can have a different length, include slang or non-words, have spelling
errors, etc. It is necessary to featurize such a document, i.e., define a set of
predictors.
• The simplest and most common featurization is the bag-of-words model.
• Each document is scored for the presence or absence of each of the words in a
language dictionary. Usually, only the M most frequent words are used.
• For IMDb M = 10,000. IMDb uses 25,000 reviews for training and also 25,000 for
test. The size of the validation set is 2,000.
• The resulting training feature matrix X has dimension 25,000×10,000, but only
1.3% of the binary entries are nonzero.
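A binary bag-of-words featurization of this kind can be sketched with scikit-learn (illustrative; the two toy reviews below are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "one of the worst films ever, truly terrible",
    "a wonderful movie, I loved every minute of it",
]

# Score presence/absence of each word, keeping only the M most frequent words
# (M = 10,000 for IMDb, far more than appear in this toy corpus).
vectorizer = CountVectorizer(binary=True, max_features=10000)
X = vectorizer.fit_transform(reviews)     # sparse matrix: documents x words
print(X.shape)
print(X.toarray())
```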


Example: IMDb classification


• The response is the sentiment of the review, classified as positive or negative.
• Two modeling techniques are compared:
• Lasso-penalized logistic regression.
• A neural network with two hidden layers, each with 16 ReLU units.
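The neural network variant could be sketched in Keras as follows (illustrative; `X` is assumed to be the 25,000 × 10,000 bag-of-words matrix, densified if necessary, and `y` the 0/1 sentiment labels):

```python
from tensorflow import keras

# Two hidden layers with 16 ReLU units each and a sigmoid output giving
# the probability that a review is positive.
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(10000,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X, y, epochs=10, batch_size=512, validation_split=0.08)  # 2,000 validation reviews
```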


Accuracy of the models


• Both models have test
accuracy of 88%.


Recurrent Neural Networks


• Many data sources are sequential in nature, calling for special treatment when building predictive models. Examples include:
• Time series of temperature, rainfall, wind speed, air quality, finance data, etc., used to forecast weather, climate or markets.
• Documents such as book reviews, newspaper articles, tweets and handwriting (e.g., doctors' notes). The sequence and relative positions of words in a document capture the narrative, theme and tone; they can be used for topic classification, sentiment analysis and language translation.
• Recorded speech, musical recordings and other sound recordings, used for speech transcription, language translation, assessing the quality of a piece of music, etc.


Time series
• Time series: any data obtained via measurements at regular intervals.
• Examples:
• daily price of a stock,
• hourly electricity consumption,
• weekly sales of a store.
• Time series are everywhere: seismic activity, evolution of fish population,
weather at a location, visitors to a website, country’s GDP, credit card
transactions, etc., etc.
• Working with timeseries involves understanding the dynamics of a system:
periodic cycles, trends over time, regular regime and sudden spikes.


Timeseries for what?


• Forecasting: predicting what will happen next in a series. Ex.: Forecast
electricity consumption a few hours in advance to anticipate demand.
• Classification: assign one or more categorical labels to a timeseries. Ex.:
activity of a visitor on a website, classify whether the visitor is a bot or a
human, classify marbles by color and veins.
• Event detection: Identify the occurrence of a specific expected event within a
continuous data stream. Ex.: “hot word detection,” such as “Ok Google”, “Hey
Siri” or “Hey Alexa”.
• Anomaly detection: Detect anything unusual happening within a continuous
data stream. Ex.: Unusual activity on your corporate network? Might be an
attacker. Unusual readings on a manufacturing line? Time for a human to
check. Anomaly detection is typically done via unsupervised learning.

Example: temperature forecasting


• Predicting the temperature 24 hours in the future, given a timeseries of 14 quantities measured hourly by a set of sensors on the roof of a building.
• Variables: temperature, atmospheric pressure, humidity, wind direction, etc.
• The variables have been recorded every 10 minutes since 2003. The data below consider the subset 2009–2016.
https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
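Loading and inspecting the data might look like this (a sketch, assuming the CSV has been downloaded and unzipped into the working directory; the temperature column name is taken to be "T (degC)", as in the Keras-hosted file):

```python
import pandas as pd

# One row every 10 minutes: a timestamp plus 14 weather variables.
df = pd.read_csv("jena_climate_2009_2016.csv")
print(df.shape)
print(df.columns.tolist())

temperature = df["T (degC)"]     # assumed column name for air temperature
print(temperature.describe())
```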


Visualizing the data!


• Temperature over the full temporal range of the dataset (◦C).


Visualizing the data


• Temperature over the first 10 days of the dataset (◦C).


Recurrent Neural Networks


• In a recurrent neural network (RNN), the input object X is a sequence.
• In the IMDb movie reviews, each document can be represented as a sequence of L words. The order of the words, and the closeness of certain words in a sentence, convey semantic meaning.
• Sequence(s) can be time series. Example: forecasting using historical trading
data from the New York stock exchange.


Recurrent Neural Networks


• Example with an input sequence and a single output.


Time series forecasting

• Historical trading statistics from the New York Stock Exchange.


New York stock exchange


• Three daily time series covering the period December 3, 1962 to December 31, 1986 (LeBaron and Weigend, 1998):
• Log trading volume, v_t: the fraction of all outstanding shares traded on that day, relative to a 100-day moving average of past turnover (log scale).
• Dow Jones return, r_t: the difference between the log of the Dow Jones Industrial Index on consecutive trading days.
• Log volatility, z_t: based on the absolute values of daily price movements.
• Predicting stock prices is a notoriously hard problem!
• Predicting trading volume based on recent past history is more manageable.


New York stock exchange


• An observation consists of the measurements (v_t, r_t, z_t) on day t. There are a total of T = 6,051 triples (see the figure of the three series shown earlier).
• The day-to-day observations are not independent of each other; the series exhibit autocorrelation.
• Consider pairs of observations (v_t, v_{t−ℓ}), a lag of ℓ days apart. If we take all such pairs in the v_t series and compute their correlation coefficient, this gives the autocorrelation at lag ℓ.
• The figure on the next slide shows the autocorrelation function for all lags up to 37, and we see considerable correlation.
• The log volume v_t is itself a useful predictor, so we will use past values of log volume to predict values in the future.
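Computing the lag-ℓ autocorrelation described here is a one-liner with pandas (a sketch on a synthetic stand-in for the v_t series):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, smoothed noise as a stand-in for log trading volume (autocorrelated by construction).
log_volume = pd.Series(rng.normal(size=6051)).ewm(alpha=0.1).mean()

# Correlation between v_t and v_{t-lag} over all available pairs.
for lag in (1, 5, 20):
    print(lag, log_volume.autocorr(lag=lag))
```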


Log trading volume autocorrelation


RNN forecaster
• We wish to predict a value v_t from past values v_{t−1}, v_{t−2}, ..., and also to make use of past values of the other series, r_{t−1}, r_{t−2}, ... and z_{t−1}, z_{t−2}, ... .
• Form of the input and output data:

X = { (v_{t−L}, r_{t−L}, z_{t−L}), (v_{t−L+1}, r_{t−L+1}, z_{t−L+1}), ..., (v_{t−1}, r_{t−1}, z_{t−1}) },   Y = v_t.

• Each value of t makes a separate (X, Y) pair, for t running from L + 1 to T.
• For the NYSE data we will use the last L = 5 trading days to predict the next day's trading volume. Clearly L is a parameter that should be chosen with care, perhaps using validation data.
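Building these (X, Y) pairs can be sketched as follows (illustrative; `series` stands in for a T × 3 array whose rows are (v_t, r_t, z_t)):

```python
import numpy as np

def make_sequences(series, L=5):
    # series: array of shape (T, 3) with rows (v_t, r_t, z_t).
    # X[i] holds the L previous days; Y[i] is the log volume on the following day.
    X = np.stack([series[t - L:t] for t in range(L, len(series))])
    Y = series[L:, 0]                 # column 0 holds log trading volume v_t
    return X, Y

series = np.random.rand(6051, 3)      # stand-in for the real NYSE data
X, Y = make_sequences(series, L=5)
print(X.shape, Y.shape)               # (6046, 5, 3) (6046,)
```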


RNN structure
• The model was fit with K = 12 hidden units.
• The 4,281 training sequences from the data before January 2, 1980, were used to fit the model; the remaining 1,770 sequences, after that date, were used to forecast the log volume.
• The model achieved R² = 0.42 on the test data.
• As a baseline for comparison, using yesterday's value of log volume as the prediction for today gives R² = 0.18.
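Such a network could be sketched in Keras as follows (illustrative; `X_train` and `Y_train` are assumed to be the lagged sequences built above, restricted to dates before January 2, 1980):

```python
from tensorflow import keras

# A simple RNN with K = 12 hidden units reading sequences of
# L = 5 days x 3 variables, followed by a linear output for v_t.
model = keras.Sequential([
    keras.layers.SimpleRNN(12, input_shape=(5, 3)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, Y_train, epochs=200, batch_size=64, validation_split=0.1)
```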


RNN forecast of log trading volume

• Black lines are the true volumes, and the orange the forecasts.
• The forecasted series accounts for 42% of the variance of log trading volume.


Further improvement
• An autoregressive (AR) linear model was fit to the data and obtained R² = 0.41.
• A feedforward neural network (NN) achieved R² = 0.42, the same as the RNN.
• The models can be improved by including the variable day-of-week corresponding to the day t of the target v_t (which can be learned from the calendar dates supplied with the data).
• Trading volume is often higher on Mondays and Fridays.
• The performance of the AR model improved to R² = 0.46, as did the RNN's, and the NN model improved to R² = 0.47.
• Some of the big breakthroughs in language modeling and translation, e.g. Google Translate, resulted from recent improvements in RNNs.


What to choose?
• We have a number of very powerful tools at our disposal, including neural networks, random forests, boosting, and support vector machines, to name a few.
• But we also have linear models and simple variants of these.
• When faced with new data modeling and prediction problems, it is tempting to always reach for the trendy new methods.
• However, simpler models often perform just as well; they are easier to fit and to understand, and they are less fragile than the more complex approaches. The choice should be based on the performance/complexity trade-off.
• Typically, deep learning is an attractive choice when the sample size of the training set is extremely large and when interpretability of the model is not a high priority.
