
Module-1

Introducing Deep Learning


Biological and Machine Vision
Biological Vision:
• Biological vision is the process by which animals see and process visual
information.
• It is a complex system that involves the eyes, the brain, and the nervous system.
• The eyes collect light from the environment and focus it onto the retina, a layer
of light-sensitive cells at the back of the eye.
• The retina contains two types of photoreceptor cells: rods and cones.
• Rods are more sensitive to light than cones, but they do not provide color vision.
• Cones are responsible for color vision, but they are less sensitive to light than
rods.
• The photoreceptor cells in the retina convert light into electrical signals. These
signals are then transmitted to the brain via the optic nerve. The brain processes
the signals from the retina to create a visual image.
• The visual system is not a simple camera. It is a complex system
that is constantly adapting to the environment.
• The brain uses a variety of techniques to extract information from
the visual image, including:
• Edge detection: The brain identifies edges in the visual image. Edges
are important because they help to define objects.
• Color detection: The brain identifies the colors in the visual image.
Color is important for recognizing objects and for understanding the
environment.
• Motion detection: The brain identifies the motion of objects in the
visual image. Motion is important for tracking objects and for
understanding the environment.
• Depth perception: The brain uses a variety of cues to determine the
depth of objects in the visual image. Depth perception is important
for navigation and for interacting with the environment.
• Here are some of the key differences between biological
and machine vision:
• Biological vision is analog, while machine vision is digital: biological vision works with continuous signals, whereas machine vision works with discrete signals.
• Biological vision is adaptive, while machine vision is comparatively static: the biological visual system continually adjusts its response to the environment, whereas a machine vision system typically behaves the same way once it has been built and trained.
• Biological vision is robust to noise, while machine vision is comparatively fragile: biological vision keeps functioning in the presence of noise, whereas machine vision systems can be fooled by it.
Machine Vision
• The study of biological vision has helped to
advance the field of machine vision.
• Machine vision systems are now able to perform
many of the same tasks as biological vision, but
they still have a long way to go before they can
match the capabilities of biological vision.
• We've been talking about the biological visual system
because it's the inspiration for modern machine vision
techniques called deep learning.
• Figure 1.8 shows a timeline of vision in biological
organisms and machines.
• The top timeline shows the development of vision in
trilobites and Hubel and Wiesel's 1959 discovery of the
hierarchical nature of the primary visual cortex.
• The machine vision timeline is split into two tracks:
deep learning (pink) and traditional machine learning
(purple).
• Deep learning is our focus, and it's more powerful and
revolutionary than traditional machine learning.
The Neocognitron
• The neocognitron is a hierarchical, multilayered artificial neural network.
It has been used for Japanese handwritten character recognition and other
pattern recognition tasks, and served as the inspiration for convolutional
neural networks.
• The neocognitron is inspired by the model proposed by Hubel & Wiesel in
1959. They found two types of cells in the primary visual cortex, called
simple cells and complex cells, and also proposed a cascading model of these
two cell types for use in pattern recognition tasks.
• The neocognitron is a natural extension of these cascading models.
• The neocognitron consists of multiple layers of cells, each of which
performs a specific function.
• The first layer of cells, called S-cells, detects edges and other local
features in the input image.
• The second layer of cells, called C-cells, integrates the outputs of the S-
cells to detect more complex features.
• The third layer of cells, called M-cells, integrates the outputs of the C-
cells to detect objects.
• The neocognitron is trained using a process called self-organization. In
self-organization, the weights of the connections between the cells are
adjusted so that the network learns to recognize a specific set of patterns.
• The neocognitron can be trained to recognize a variety of patterns,
including handwritten characters, faces, and objects.
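As a rough modern analogy (a sketch, not Fukushima's original formulation or training rule), one S-cell/C-cell stage can be approximated in Keras by a convolution followed by pooling; the input size and filter count below are illustrative assumptions.

# Minimal sketch: one S-cell stage modeled as a convolution, one C-cell stage
# modeled as pooling. Layer sizes here are illustrative assumptions only.
import tensorflow as tf

neocognitron_stage = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    # "S-cells": respond to local features such as edges
    tf.keras.layers.Conv2D(8, kernel_size=5, activation="relu"),
    # "C-cells": pool S-cell responses, giving tolerance to small shifts
    tf.keras.layers.MaxPooling2D(pool_size=2),
])
neocognitron_stage.summary()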
LeNet-5
• LeNet-5 is a convolutional neural network (CNN) architecture
proposed by Yann LeCun et al. in 1998.
• It was one of the first CNNs to achieve state-of-the-art results on the
MNIST handwritten digit recognition dataset.
• LeNet-5 is a relatively simple CNN, but it is still a powerful
architecture that can be used for a variety of image recognition tasks.
LeNet-5 consists of seven layers (not counting the input):
• 2 convolutional layers with 6 and 16 filters (5×5), respectively
• 2 subsampling (pooling) layers with a 2×2 pool size
• 2 fully connected layers with 120 and 84 neurons, respectively
• An output layer with 10 units (a softmax in modern implementations), one for each digit
• The convolutional layers extract features from the input image. The
pooling layers reduce the size of the feature maps while preserving the
most important features. The fully connected layers classify the features
into one of the 10 digits.
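As a concrete illustration, a LeNet-5-style network can be sketched in Keras as below. This is a minimal sketch using modern conveniences (ReLU activations, max pooling, a softmax output), not a line-for-line reproduction of the 1998 model.

import tensorflow as tf

lenet5 = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),            # 32x32 grayscale digit
    tf.keras.layers.Conv2D(6, 5, activation="relu"),      # C1: 6 feature maps
    tf.keras.layers.MaxPooling2D(2),                      # S2: subsampling
    tf.keras.layers.Conv2D(16, 5, activation="relu"),     # C3: 16 feature maps
    tf.keras.layers.MaxPooling2D(2),                      # S4: subsampling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),        # C5
    tf.keras.layers.Dense(84, activation="relu"),         # F6
    tf.keras.layers.Dense(10, activation="softmax"),      # one output per digit
])
lenet5.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])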
This diagram shows the LeNet-5 architecture, one of the earliest and most influential Convolutional Neural Networks (CNNs),
designed by Yann LeCun for handwritten digit recognition (like the MNIST dataset). Let’s break it down step-by-step:

1. Input Image
• On the far left, we see a handwritten digit “2” (black and white image).
• This is the input layer — in the original LeNet-5, it’s a 32×32 pixel grayscale image.
2. Large Simple Features
• The first few layers extract basic features such as:
– Edges
– Lines
– Simple shapes
• These features are detected using convolutional layers.
• The blue cubes represent feature maps (multiple channels of
filtered images).
• After convolution, a subsampling (pooling) layer reduces the
image size while keeping the most important information.
3. Smaller, More Complex Features
• As we move right in the diagram:
– Layers detect more abstract and complex patterns, like curves,
intersections, and specific shapes of the digit.
– Higher layers combine earlier features to form hierarchical feature
representations (e.g., a loop, a horizontal line with a curve).
4. Fully Connected Layers
• Toward the right, the 3D feature maps are flattened into a 1D vector.
• Fully connected layers combine all learned features to make a decision.
5. Probability Outputs
• The final layer outputs 10 probabilities, one for each digit (0–9).
• In this case:
– The output neuron for “2” is highlighted in green because the
network predicts “2” with the highest probability.
Traditional machine learning vs deep
learning
An Example of Traditional ML
• Paul Viola and Michael Jones in the early 2000s employed
rectangular filters such as vertical or horizontal black-and-white
bars (Haar-like features).
• Features generated by passing these filters over an image can
be fed into machine learning algorithms to reliably detect the
presence of a face.
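For illustration (a sketch of the general idea, not the actual Viola-Jones detector), the snippet below computes one two-rectangle "bar" feature over a toy image window using an integral image; the window size and rectangle placement are arbitrary assumptions.

import numpy as np

def integral_image(img):
    # Cumulative sum over rows and columns, so any rectangle sum is O(1).
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    # Sum of pixels in img[r0:r1, c0:c1] using the integral image.
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

img = np.random.rand(24, 24)          # toy grayscale window
ii = integral_image(img)
# Vertical two-rectangle feature: left half minus right half of the window.
feature = rect_sum(ii, 0, 0, 24, 12) - rect_sum(ii, 0, 12, 24, 24)
print(feature)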
ImageNet
• Following LeNet-5, research into artificial neural networks, including
deep learning, fell out of favor.
• The consensus became that the approach’s automated feature
generation was not pragmatic—that even though it worked well for
handwritten character recognition, the feature-free ideology was
perceived to have limited breadth of applicability.
• The next breakthrough in neural networks was facilitated by a high-
quality public dataset, ImageNet.
• The handwritten digit data used to train LeNet-5 contained tens of
thousands of images. ImageNet, in contrast, contains tens of millions.
• The 14 million images in the ImageNet dataset are spread across
22,000 categories. These categories are as diverse as container ships,
leopards, starfish, and elderberries.
ILSVRC and AlexNet
• ILSVRC (the ImageNet Large Scale Visual Recognition
Challenge) is an open challenge started in 2010 on a
subset of the ImageNet data.
• ILSVRC became a premier challenge for assessing the
world’s state-of-the-art machine vision algorithms.
• The ILSVRC subset consists of 1.4 million images across
1,000 categories.
• AlexNet, a deep learning model developed by Alex
Krizhevsky and Ilya Sutskever, won the challenge in
2012.
– This is considered a watershed moment for deep learning.
• 1990s: LeNet-5
• 2000s: TML methods (Viola–Jones)
• 2010: ILSVRC starts
• 2012: AlexNet’s breakthrough
AlexNet and future models in ILSVRC
Architecture of AlexNet
• On the left, start with an RGB image (e.g., picture of a cat).
• The original AlexNet input size: 227 × 227 × 3 pixels (width × height × depth).
• Convolutional Layers (CONV)
• The network begins with multiple convolutional layers (shown as blue
boxes).
• These layers learn hierarchical features:
– Early layers: Detect simple edges, colors, and textures.
– Middle layers: Capture shapes, corners, and object parts.
– Deeper layers: Identify more abstract patterns, like faces, paws, or flower petals.
• Each CONV layer is followed by:
– ReLU activation (faster training, non-linear transformation).
– Pooling (reduces the spatial size, i.e., the width and height, while keeping
the important information).
– Normalization (in the original AlexNet: Local Response Normalization).
• Deep architecture: 8 layers (5 CONV + 3 FC).
• ReLU activation → faster convergence.
• Dropout in FC layers → reduced overfitting.
• GPU training → huge speedup.
• Data augmentation → improved generalization.
• A GPU (Graphics Processing Unit) is a special type of
processor originally designed for rendering graphics in
computers (like in gaming), but it’s now widely used for
deep learning because of its ability to perform
massively parallel computations.
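The eight-layer stack described above (5 CONV + 3 FC) can be sketched in Keras roughly as follows; this is a simplified single-GPU sketch rather than the exact 2012 implementation, which used Local Response Normalization and split computation across two GPUs.

import tensorflow as tf

alexnet = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(227, 227, 3)),               # RGB input
    tf.keras.layers.Conv2D(96, 11, strides=4, activation="relu"),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.Conv2D(256, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.Conv2D(384, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(384, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),                              # dropout in FC layers
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1000, activation="softmax"),         # 1,000 ILSVRC classes
])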
TensorFlow Playground
[Link]#activation=relu&batchSize=10&dataset=spiral&regDataset=regplane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=8,8,4,2&seed=0.32263&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false
Module 1
Introducing Deep Learning: Biological and Machine Vision:
Biological Vision, Machine Vision: The Neocognitron, LeNet-5, The
Traditional Machine Learning Approach, ImageNet and the ILSVRC,
AlexNet, TensorFlow Playground.

Human and Machine Language: Deep Learning for Natural Language Processing: Deep Learning Networks Learn Representations Automatically, Natural Language Processing, A Brief History of Deep Learning for NLP, Computational Representations of Language: One-Hot Representations of Words, Word Vectors, Word-Vector Arithmetic, word2viz, Localist Versus Distributed Representations, Elements of Natural Human Language.
Deep Learning for NLP
• The Austro-British philosopher Ludwig Wittgenstein said: "The meaning of a word is its use in the language."
– Natural language processing with deep learning relies heavily on this premise.
Deep Learning Networks Learn
Representations Automatically
Representation Learning
(techniques that learn the features automatically from data).
Artificial Neural Networks / Deep Learning are a subset of
representation learning.
Natural Language Processing
• Natural language processing is a field of research that sits at the
intersection of computer science, linguistics, and artificial
intelligence.

• NLP involves taking the naturally spoken or naturally written


language of humans—such as this sentence you’re reading right
now—and processing it with machines to automatically
complete some task or to make a task easier for a human to do.
Natural Language Processing
• Humans communicate through some form of language
either by text or speech.
• To make interactions between computers and humans,
computers need to understand natural languages used
by humans.
• Natural language processing is all about making
computers learn, understand, analyse, manipulate and
interpret natural (human) languages.
• NLP stands for Natural Language Processing, which draws on
computer science, human language, and artificial intelligence.
Examples of NLP
• Classifying documents
• Machine translation
• Automated question-answering
• Chatbots
• Search engines: autocompleting users’
searches and predicting what information or
website they’re seeking.
History of Deep Learning for NLP
Computational Representations of
Language
• One hot representation
• Word vectors
One Hot Representation
• One-hot representation of words is a way to encode words as vectors that
are all zero except for a single one in the position corresponding to that
word in the vocabulary.
• However, the simplicity and sparsity of one-hot representations are
limiting when incorporated into a natural language application
Advantages
• Simple to implement.
• Unambiguous — each word has a unique vector.
• Works well for small vocabularies.
Disadvantages
• High dimensionality and sparsity → if your vocabulary has 50,000 words,
each word becomes a 50,000-length vector with only one "1" and the rest
"0"s. That's a lot of wasted space.
• Word embeddings solve these issues by:
• Representing words in a dense vector (e.g.,
50–300 dimensions instead of 50,000).
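A minimal sketch of one-hot encoding over a toy five-word vocabulary (the vocabulary itself is an illustrative assumption):

import numpy as np

vocab = ["king", "queen", "man", "woman", "cat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary index.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("queen"))                    # [0. 1. 0. 0. 0.]
# Every pair of distinct one-hot vectors is equally dissimilar (dot product 0),
# so these vectors carry no information about word meaning.
print(one_hot("king") @ one_hot("queen"))  # 0.0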
Word Vectors
• Vector representations of words are the information-
dense alternative to one-hot encodings of words.
• Whereas one-hot representations capture
information about word location only, word vectors
(also known as word embeddings or vector-space
embeddings) capture information about word
meaning as well as location.
• Assigns each word within a corpus to a particular,
meaningful location within a multidimensional space
called the vector space.
Word vectors
• Two of the most popular techniques for
converting natural language into word vectors
are word2vec and GloVe.
• With either technique, our objective while
considering any given target word is to
accurately predict the target word given its
context words.
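For concreteness, here is a minimal sketch of learning word vectors with the gensim library's word2vec implementation (assumes gensim ≥ 4.0 is installed); the toy corpus is an illustrative assumption, and real word vectors are trained on corpora containing millions of words.

from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "the", "dog"],
    ["the", "woman", "walks", "the", "dog"],
]

# sg=1 selects the skip-gram training objective
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["queen"])                      # a dense 50-dimensional vector
print(model.wv.most_similar("king", topn=3))  # nearest words in vector space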
Word vectors – An example in 3 dimensions

vking = [-0.9, 1.9, 2.2]
vman = [-1.1, 2.4, 3.0]
vwoman = [-3.2, 2.5, 2.6]
Characteristics of word vectors
• The closer two words are within vector space, the closer their meaning,
as determined by the similarity of the context words appearing near them
in natural language.
• Synonyms and common misspellings of a given word—
because they share an identical meaning—would be
expected to have nearly identical context words and
therefore nearly identical locations in vector space.
• Words that are used in similar contexts, such as those
that denote time, tend to occur near each other in
vector space
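Closeness in vector space is usually measured with cosine similarity; here is a minimal numpy sketch (the two toy vectors are illustrative assumptions, not learned from data):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_cat = np.array([0.9, 1.2, -0.3])   # toy vectors for illustration only
v_dog = np.array([0.8, 1.0, -0.2])
print(cosine_similarity(v_cat, v_dog))   # close to 1.0: similar contexts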
Word-vector Arithmetic
Word-vector Arithmetic example
If the vector for king (vking) is [-0.9, 1.9, 2.2], the vector for man (vman) is
[-1.1, 2.4, 3.0], and that for woman (vwoman) is [-3.2, 2.5, 2.6], calculate the
vector for queen (vqueen).
• Subtract vman from vking:
[−0.9−(−1.1), 1.9−2.4, 2.2−3.0] = [−0.9+1.1, 1.9−2.4, 2.2−3.0] = [0.2, −0.5, −0.8]
• Add vwoman:
[0.2+(−3.2), −0.5+2.5, −0.8+2.6] = [−3.0, 2.0, 1.8]
• Final vector for queen:
vqueen = [−3.0, 2.0, 1.8]
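The same calculation can be checked with a few lines of numpy (using the three example vectors above):

import numpy as np

v_king  = np.array([-0.9, 1.9, 2.2])
v_man   = np.array([-1.1, 2.4, 3.0])
v_woman = np.array([-3.2, 2.5, 2.6])

v_queen = v_king - v_man + v_woman   # king - man + woman
print(v_queen)                        # approximately [-3.0, 2.0, 1.8]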
word2viz
• word2viz is a tool for exploring word vectors
interactively.
• It can be accessed at [Link]/word2viz
• The developer of the word2viz tool, Julia
Bazinska, compressed a 50-dimensional word-
vector space down to two dimensions in order
to visualize the vectors on an xy-coordinate
system
Default Screen for word2viz
Localist vs Distributed Representations
• One-hot representations are localist. They store information
on a given word discretely, within a single row of a typically
extremely sparse matrix.
• Word vectors store the meaning of words in a distributed
representation across n-dimensional space.
Localist (one-hot)
• Characteristics:
• Not subtle → Every word is equally far apart (no
similarity).
• Manual taxonomies → Needs explicit definitions, no
natural grouping.
• Handles new words poorly → A new word means adding a
new dimension.
• Subjective → Human intervention needed to define
relationships.
• Word similarity not represented → "Cat" and "Dog" are
just as unrelated as "Cat" and "Car".
Distributed (vector)
• Very nuanced → Captures subtle meaning
differences.
• Automatic → Learns from large data, no manual
definition.
• Handles new words better (especially with subword
embeddings).
• Driven by natural language data → Based on
context in texts.
• Word similarity represented → Semantic similarity
= closeness in vector space.
Elements of Natural Human Language
• Phonemes → Smallest units of sound (like “k”, “a”).
• Morphemes → Smallest meaningful units (like “un-”, “happy”).
• Words → Built from phonemes & morphemes (basic units of language).
• Syntax → Rules for arranging words into meaningful sentences
(grammar).
• Semantics → Meaning of words and sentences.
• 👉 As we move rightward:
• Elements become more abstract (from sounds → meaning).
• They also become more complex to represent in NLP.
• In short:
Phonemes & morphemes (simple, concrete) → words → syntax →
semantics (abstract, complex).
Phonology and Morphology
• Phonology is concerned with the way that language sounds when
it is spoken.
– The traditional ML approach is to encode segments of auditory input
as specific phonemes from the language’s range of available
phonemes.
– With deep learning, we train a model to predict phonemes from
features automatically learned from auditory input and then represent
those phonemes in a vector space.
• Morphology is concerned with the forms of words.
– For example, the three morphemes out, go, and ing combine to form
the word outgoing.
– The traditional ML approach is to identify morphemes in text from a
list of all the morphemes in a given language.
– With deep learning, we train a model to predict the occurrence of
particular morphemes. Hierarchically deeper layers of artificial neurons
can then combine multiple vectors (e.g., the three representing out,
go, and ing) into a single vector representing a word.
Representations: Traditional ML vs Deep
Learning
