Deep Learning for Natural Language Processing
Viet-Trung Tran
1	
  
Some of the challenges in Language Understanding
• Language is ambiguous:
– Every sentence has many possible interpretations.
• Language is productive:
– We will always encounter new words or new constructions.
• Language is culturally specific




2	
  
Example: “fruit flies like a banana” admits several part-of-speech taggings:
NN NN VB DT NN
NN VB P DT NN
NN NN P DT NN
NN VB VB DT NN
ML: Traditional Approach
• For each new problem/question
– Gather as much LABELED data as you can get
– Throw some algorithms at it (mainly put in an SVM and keep it at that)
– If you have actually tried more algorithms: pick the best
– Spend hours hand-engineering some features / feature selection / dimensionality reduction (PCA, SVD, etc.)
– Repeat…




3	
  
Deep learning vs the rest
4	
  
Deep Learning: Why for NLP?
• Beats the state of the art in:
– Language Modeling (Mikolov et al. 2011) [WSJ AR task]
– Speech Recognition (Dahl et al. 2012, Seide et al. 2011; following Mohamed et al. 2011)
– Sentiment Classification (Socher et al. 2011)
– MNIST hand-written digit recognition (Ciresan et al. 2010)
– Image Recognition (Krizhevsky et al. 2012) [ImageNet]




5	
  
Language semantics
• What is the meaning of a word?
(Lexical semantics)
• What is the meaning of a sentence?
([Compositional] semantics)
• What is the meaning of a longer piece of text?
(Discourse semantics)




6	
  
One-hot encoding
•  Form a vocabulary that maps lemmatized words to a unique ID (the word's position in the vocabulary)
•  Typical vocabulary sizes vary between 10,000 and 250,000
7	
  
One-hot encoding
•  The one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID
–  for vocabulary size D=10, the one-hot vector of word ID w=4 is e(w) = [ 0 0 0 1 0 0 0 0 0 0 ]
•  A one-hot encoding makes no assumption about word similarity
•  All words are equally different from each other (see the sketch below)
8	
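A minimal sketch of this encoding in Python; the five-word vocabulary is just for illustration:

# Build a toy vocabulary: word -> unique ID (position in the vocabulary).
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
D = len(vocab)  # vocabulary size (10,000-250,000 in practice, 5 here)

def one_hot(word):
    """Return the one-hot vector e(w): all zeros except a 1 at the word's ID."""
    v = [0] * D
    v[vocab[word]] = 1
    return v

print(one_hot("sat"))   # [0, 0, 1, 0, 0]
# Any two distinct one-hot vectors are equally far apart,
# so this encoding says nothing about word similarity.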
  
Word representation
•  Standard
–  Bag of Words
–  A one-hot encoding
–  20k to 50k dimensions
–  Can be improved by factoring in document frequency
•  Word embedding
–  Neural word embeddings
–  Uses a vector space that attempts to predict a word given a context window
–  200-400 dimensions
Word embeddings make semantic similarity and synonyms possible
   9	
  
Distributional representations
•  “You shall know a word by the company it keeps” (J. R. Firth, 1957)
•  One of the most successful ideas of modern statistical NLP!
10	
  
•  Word Embeddings (Bengio et al. 2001; Bengio et al. 2003), based on the idea of distributed representations for symbols (Hinton 1986)

•  Neural word embeddings (Mnih and Hinton 2007; Collobert & Weston 2008; Turian et al. 2010; Collobert et al. 2011; Mikolov et al. 2011)
11	
  
Neural distributional representations
•  Neural word embeddings
•  Combine vector space semantics with the predictions of probabilistic models
•  Words are represented as dense vectors
•  e.g., “Human” = a dense vector of real values
12	
  
Vector space model
13	
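To make the vector-space picture concrete, here is a minimal cosine-similarity sketch; the vectors are made up for illustration, not taken from a trained model:

import math

def cosine(u, v):
    """Cosine similarity: the usual closeness measure in a word vector space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up 4-dimensional embeddings, only to show the mechanics.
cat = [0.2, -0.4, 0.7, 0.1]
dog = [0.3, -0.3, 0.6, 0.2]
car = [-0.5, 0.8, 0.0, -0.1]

print(cosine(cat, dog))  # high: semantically related words end up close
print(cosine(cat, car))  # low: unrelated words end up far apart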
  
Word embeddings
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
14	
  
15	
  
•  What words have embeddings closest to a given word?
From Collobert et al. (2011)

16	
  
Word Embeddings for MT: Mikolov (2013)
17	
  
Word Embeddings
•  One of the most exciting areas of research in deep learning
•  Introduced by Bengio et al. (2001, 2003)
•  W: words → Rⁿ is a parameterized function mapping words in some language to high-dimensional vectors (200 to 500 dimensions)
–  W(‘‘cat") = (0.2, -0.4, 0.7, ...)
–  W(‘‘mat") = (0.0, 0.6, -0.1, ...)
•  Typically, the function is a lookup table, parameterized by a matrix θ with a row for each word: Wθ(wn) = θn (see the sketch below)
•  W is initialized with random vectors for each word
•  W learns to hold meaningful vectors in order to perform some task


 18	
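A minimal sketch of W as a lookup table: a matrix θ with one row per word, initialized at random. The sizes and the NumPy setup are illustrative assumptions; in a real system θ would then be trained:

import numpy as np

vocab = {"cat": 0, "mat": 1, "sat": 2, "on": 3, "the": 4}
n = 5          # embedding size (200-500 in practice, 5 here to keep it small)
rng = np.random.default_rng(0)

# theta has one row per word; W(w) is simply a row lookup: W_theta(w_n) = theta_n
theta = rng.normal(scale=0.1, size=(len(vocab), n))

def W(word):
    return theta[vocab[word]]

print(W("cat"))   # after training this might look like (0.2, -0.4, 0.7, ...)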
  
Learning word vectors (Collobert et al., JMLR 2011)
•  Idea: a word and its context form a positive training example; a random word in the same context gives a negative training example


19	
  
Example
•  Train a network to predict whether a 5-gram (sequence of five words) is ‘valid’
•  Source
– any text corpus (e.g., Wikipedia)
•  Corrupt half of the 5-grams to obtain negative training examples
– Make the 5-gram nonsensical
– "cat sat song the mat”


 20	
  
Neural network to determine if a 5-gram is 'valid' (Bottou 2011)
•  Look up each word in the 5-gram through W
•  Feed those vectors into a network R (see the sketch below)
•  R tries to predict if the 5-gram is 'valid' or 'invalid'
–  R(W(‘‘cat"), W(‘‘sat"), W(‘‘on"), W(‘‘the"), W(‘‘mat")) = 1
–  R(W(‘‘cat"), W(‘‘sat"), W(‘‘song"), W(‘‘the"), W(‘‘mat")) = 0
•  The network needs to learn good parameters for both W and R
21	
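A minimal sketch of such a scoring network R: the five word vectors are concatenated and passed through one hidden layer to a validity score. The layer sizes and the single-hidden-layer choice are assumptions for illustration, not the exact architecture of Collobert et al.:

import numpy as np

rng = np.random.default_rng(0)
n = 5                      # embedding size per word
hidden = 16

# Parameters of R: concatenated 5-gram embedding -> hidden layer -> validity score.
W1 = rng.normal(scale=0.1, size=(hidden, 5 * n))
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=hidden)

def R(vectors):
    """Score a 5-gram given its 5 word vectors; ~1 for 'valid', ~0 for 'invalid'."""
    x = np.concatenate(vectors)          # stack W("cat"), W("sat"), ...
    h = np.tanh(W1 @ x + b1)
    return 1 / (1 + np.exp(-(w2 @ h)))   # sigmoid output

# Usage with the lookup table from the previous sketch:
# R([W("cat"), W("sat"), W("on"), W("the"), W("mat")])
# Training adjusts both R's parameters and the rows of theta (i.e., W itself).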
  
22	
  
Idea
•  “a few people sing well” → “a couple people sing well”
•  the validity of the sentence doesn’t change
•  if W maps synonyms (like “few” and “couple”) close together
– then, from R’s perspective, little changes


23	
  
Bingo
•  The number of possible 5-grams is massive
•  But there is only a small number of data points to learn from
•  Similar classes of words
– “the wall is blue” → “the wall is red”
•  Multiple words
– “the wall is blue” → “the ceiling is red”
•  Shifting “red” closer to “blue” makes the network R perform better
24	
  
Word embedding property
•  Analogies between words are encoded in the difference vectors between words (see the sketch below)
– W(‘‘woman") − W(‘‘man") ≃ W(‘‘aunt") − W(‘‘uncle")
– W(‘‘woman") − W(‘‘man") ≃ W(‘‘queen") − W(‘‘king")
25	
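A minimal sketch of how such an analogy is queried by vector arithmetic; emb and vocab stand for a trained embedding matrix and its word-to-row index (placeholders here, not actual trained values):

import numpy as np

def analogy(emb, vocab, a, b, c):
    """Return the word d whose vector is closest to emb[b] - emb[a] + emb[c]."""
    target = emb[vocab[b]] - emb[vocab[a]] + emb[vocab[c]]
    # Cosine similarity against every word, excluding the query words themselves.
    sims = emb @ target / (np.linalg.norm(emb, axis=1) * np.linalg.norm(target) + 1e-9)
    for w in (a, b, c):
        sims[vocab[w]] = -np.inf
    return max(vocab, key=lambda w: sims[vocab[w]])

# With trained embeddings, analogy(emb, vocab, "man", "king", "woman")
# tends to return "queen".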
  
Linguistic Regularities: Mikolov (2013)
26	
  
Word embedding property: Shared representations
•  The use of word representations… has become a key “secret sauce” for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling. (Luong et al. 2013)

27	
  
•  W and F learn to perform task A. Later, G can learn to perform task B based on W

28	
  
Bilingual word-embedding
29	
  
English – Chinese word mapping
30	
  
Embed images and words in a single representation
31	
  
Feedforward neural net language model (NNLM), Bengio et al. 2003
• Long training time
32	
  
Recurrent neural network based language model (Mikolov et al., 2010)
• Elman network
33	
  
Simple RNN training
• Input vector: 1-of-N encoding (one-hot)
• Training over repeated epochs (see the sketch below)
– s(0): vector of small values (e.g., 0.1)
– Hidden layer: 30 – 500 units
– All training data from the corpus are presented sequentially
– Initial learning rate: 0.1
– Error function
– Standard backpropagation with stochastic gradient descent
• Convergence is achieved after 10 – 20 epochs
34	
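A minimal sketch of a single step of this Elman network, s(t) = sigmoid(U·x(t) + W·s(t−1)) and y(t) = softmax(V·s(t)); the sizes are illustrative:

import numpy as np

Vsize, H = 10, 30         # vocabulary size and hidden-layer size (30-500 units)
rng = np.random.default_rng(0)
U  = rng.normal(scale=0.1, size=(H, Vsize))   # input -> hidden
Wr = rng.normal(scale=0.1, size=(H, H))       # hidden(t-1) -> hidden(t), the recurrence
Vo = rng.normal(scale=0.1, size=(Vsize, H))   # hidden -> output

def step(word_id, s_prev):
    """One RNN step: returns the new hidden state and next-word distribution."""
    x = np.zeros(Vsize); x[word_id] = 1.0      # 1-of-N (one-hot) input
    s = 1 / (1 + np.exp(-(U @ x + Wr @ s_prev)))
    z = Vo @ s
    y = np.exp(z - z.max()); y /= y.sum()      # softmax: probability of next word
    return s, y

s = np.full(H, 0.1)        # s(0): vector of small values
s, y = step(3, s)          # feed the word with ID 3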
  
Word2vec (Mikolov et al., 2013)
• Log-linear model
• Previous models: a non-linear hidden layer → complexity
• Continuous word vectors are learned using a simple model
35	
  
Continuous BoW (CBOW) Model
• Similar to the feed-forward NNLM, but
– the non-linear hidden layer is removed
• Called CBOW (continuous BoW) because the order of the words is lost
CBOW Model
Continuous Skip-gram Model
• Similar to CBOW, but
– tries to maximize classification of a word based on another word in the same sentence
• Predicts words within a certain window
• Observations
– Larger window size => better quality of the resulting word vectors, but higher training time
– More distant words are usually less related to the current word than those close to it
– Give less weight to distant words by sampling them less often in the training examples (see the sketch after this list)
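A minimal sketch of generating skip-gram training pairs with that distance-based down-weighting: the effective window is drawn uniformly from 1..C, so distant context words are sampled less often. The sentence is a placeholder corpus:

import random

def skipgram_pairs(words, C=5, seed=0):
    """Yield (center, context) pairs; distant words are sampled less often."""
    rng = random.Random(seed)
    for i, center in enumerate(words):
        window = rng.randint(1, C)               # shrink the window at random
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                yield center, words[j]

sentence = "the cat sat on the mat".split()      # placeholder corpus
for center, context in skipgram_pairs(sentence, C=2):
    print(center, "->", context)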
Continuous Skip-gram Model
RECURSIVE NEURAL
NETWORKS
40	
  
Modular Network that learns word embeddings
•  Fixed number of inputs

41	
  
Recursive neural networks
•  The output of a module goes into a module of the same type (see the sketch below)
•  Tree-structured neural networks
•  No fixed number of inputs
42	
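A minimal sketch of the composition module such a tree-structured network applies recursively, using the standard form p = tanh(W[c1; c2] + b); the dimensionality and the example tree are illustrative:

import numpy as np

n = 4                                         # dimensionality of every node vector
rng = np.random.default_rng(0)
Wc = rng.normal(scale=0.1, size=(n, 2 * n))   # composition matrix, shared at every node
b = np.zeros(n)

def compose(c1, c2):
    """Merge two child vectors into a parent vector of the same size."""
    return np.tanh(Wc @ np.concatenate([c1, c2]) + b)

# The same module is applied at every node of the parse tree, so the input size
# is not fixed: ((the cat) (sat ...)) is just nested calls to compose().
the, cat = rng.normal(size=n), rng.normal(size=n)
phrase = compose(the, cat)                    # vector for the phrase "the cat"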
  
Building on Word Vector Space Models
• But how can we represent the meaning of longer phrases?
• By mapping them into the same vector space!








43	
  
How should we map phrases into a vector space?
44	
  
Sentence Parsing: What we want
45	
  
Learn Structure and Representation
46	
  
Recursive Neural Networks for Structure Prediction
47	
  
Recursive Neural Network Definition
48	
  
Recursive Application of Relational Operators
49	
  
Parsing a sentence with an RNN
50	
  
Parsing a sentence
51	
  
Parsing a sentence
52	
  
Parsing a sentence
53	
  
Labeling in Recursive Neural Networks
54	
  
Recursive matrix-vector model
55	
  
Recursive neural tensor network 

56	
  
Socher et al. 2013: Sentence sentiment analysis
57	
  
Neural tensor network
58	
  
Reversible sentence representation (Bottou 2011)
•  Bilingual sentence representation
59	
  
Cho et al. (2014)
60	
  
Credits
•  Richard Socher, Christopher Manning
– Stanford University
– nlp.stanford.edu/courses/NAACL2013/
•  Roelof Pieters, PhD candidate, KTH/CSC
•  http://colah.github.io/
•  Bengio, GSS 2012
61	
  
Language Modeling
•  A language model is a probabilistic model that assigns a probability to any sequence of words: p(w1, ..., wT)
•  Language modeling is the task of learning a language model that assigns high probabilities to well-formed sentences
•  Plays a crucial role in speech recognition and machine translation systems
62	
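Concretely, the joint probability factorizes by the chain rule; the n-gram models on the next slide (and the neural models above) differ only in how they approximate each conditional:

p(w_1, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1})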
  
N-gram models
•  An n-gram is a sequence of n words
–  unigrams (n=1): ‘‘is’’, ‘‘a’’, ‘‘sequence’’, etc.
–  bigrams (n=2): [‘‘is’’, ‘‘a’’], [‘‘a’’, ‘‘sequence’’], etc.
–  trigrams (n=3): [‘‘is’’, ‘‘a’’, ‘‘sequence’’], [‘‘a’’, ‘‘sequence’’, ‘‘of’’], etc.
•  n-gram models estimate the conditional from n-gram counts, e.g. p(wt | wt−n+1, ..., wt−1) ≈ count(wt−n+1, ..., wt) / count(wt−n+1, ..., wt−1) (see the sketch below)
•  The counts are obtained from a training corpus (a dataset of text)
63	
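A minimal sketch of the count-based estimate for the bigram case; the corpus string is a placeholder:

from collections import Counter

corpus = "the cat sat on the mat . the cat lay on the mat .".split()  # placeholder

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    """Estimate p(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("cat", "the"))   # 2 occurrences of "the cat" out of 4 "the" -> 0.5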
  
