NLP Basic - YL
Tokenization for language chunks
Common solution: Use WordNet, a thesaurus containing lists of synonym sets and
hypernyms (“is a” relationships)
Problems with resources like WordNet
● Great as a resource but missing nuance.
○ E.g. “proficient” is listed as a synonym for “good”.
○ This is only correct in some contexts (see the NLTK sketch after this list)
● Subjective
● Requires human effort to create and adapt
● Can’t compute accurate word similarities.
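A minimal sketch of querying WordNet through NLTK (illustrative, not from the slides; assumes the nltk package is installed and nltk.download("wordnet") has been run):

# Minimal sketch: querying WordNet via NLTK.
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") for "good"; one adjective sense lists
# "proficient" as a lemma, illustrating the missing-nuance problem above.
for syn in wn.synsets("good"):
    print(syn.name(), [lemma.name() for lemma in syn.lemmas()])

# Hypernyms ("is a" relationships), e.g. for the first noun sense of "dog".
print(wn.synset("dog.n.01").hypernyms())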
Legacy Techniques: counting is everything
One-hot Vector
[Figure: two one-hot vectors over an 8-word vocabulary:
[0 0 1 0 0 0 0 0] and [0 0 0 0 0 0 1 0]]
One-hot Vector
● Pros
○ Simple
○ Easily computed and suitable for parallel computing
● Cons
○ Dimensionality is the size of vocabulary
○ Out-of-Vocabulary (OOV) problem
○ All word vectors are orthogonal, so no similarity between words is captured (see the sketch below)
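A minimal one-hot encoding sketch in Python (the toy vocabulary is made up for illustration; not from the slides):

# Minimal one-hot encoding sketch (hypothetical toy vocabulary).
vocab = ["a", "cat", "dog", "hotel", "motel", "sat", "the", "walked"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Dimensionality equals vocabulary size; OOV words raise KeyError.
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("dog"))    # [0, 0, 1, 0, 0, 0, 0, 0]
print(one_hot("motel"))  # [0, 0, 0, 0, 1, 0, 0, 0]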
Bag-of-Words
● Steps
○ Build the vocabulary, i.e., the set of all words in the corpus
○ Count the occurrences of each word in each document (sketch below)
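A minimal bag-of-words sketch in Python (the two-document toy corpus is made up for illustration):

# Minimal bag-of-words sketch over a toy two-document corpus.
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({word for doc in corpus for word in doc.split()})

def bow(doc):
    # Count vector over the shared vocabulary; word order is discarded.
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(vocab)        # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for doc in corpus:
    print(bow(doc))  # [1, 0, 1, 1, 1, 2] then [0, 1, 0, 0, 1, 1]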
Bag-of-Words
● Pros
○ Simple
○ Surprisingly effective
○ Fast
● Cons
○ Order of words does not matter
○ Cannot capture syntactic/semantic information
N-gram model
● Steps
○ Build the vocabulary, i.e., the set of all n-grams in the corpus
○ Count the occurrences of each n-gram in each document (sketch below)
[Figure: count matrix for two documents over a nine-n-gram vocabulary:
[1 1 1 1 1 0 0 0 0] and [1 0 0 0 0 1 1 1 1]]
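A minimal n-gram counting sketch in Python (bigrams over the same made-up toy corpus; illustrative only):

# Minimal n-gram (here: bigram) count sketch over a toy corpus.
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = ["the cat sat on the mat", "the dog sat"]
docs = [ngrams(doc.split(), 2) for doc in corpus]
vocab = sorted({g for doc in docs for g in doc})

for doc in docs:
    counts = Counter(doc)
    print([counts[g] for g in vocab])
# Output: [1, 0, 1, 1, 1, 0, 1] then [0, 1, 0, 0, 0, 1, 0]
# Note the bigram vocabulary (7 entries) is already larger than the
# word vocabulary (6 entries), illustrating the vocabulary-size problem.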
N-gram model
● Pros
○ Word order is considered
● Cons
○ Vocabulary size grows very large
○ Cannot capture syntactic/semantic information
○ Only incorporates limited (local) word-order information
Term Frequency-Inverse Document Frequency (TF-IDF)
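For reference, one standard formulation (the slides' exact weighting variant may differ):

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}

where tf(t, d) is the count of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.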