
Introduction to Natural Language Processing & Applications

Huan Vu, Faculty of DS&AI, NEU


Week 1 - Foundations and Modern Approaches
A comprehensive exploration of how computers understand, interpret, and generate human language — and why it matters for your career in data science.
Course Agenda
1 Fundamentals & Evolution: Understanding NLP's journey from rule-based systems to modern transformers
2 Real-World Applications: Exploring how NLP powers technologies we use daily
3 Practical Tools & Implementation: Hands-on experience with Python libraries that make NLP accessible
4 Building Your First NLP Pipeline: From text preprocessing to implementing your first NLP model
What is Natural Language Processing?
NLP is the branch of artificial intelligence focused on giving computers the ability to understand, interpret, and generate human language in a way that is both meaningful and useful. Its core capabilities span text understanding, contextual interpretation, and language generation.

It sits at the intersection of:

• Linguistics
• Computer Science
• Artificial Intelligence
• Cognitive Science
Why NLP is Challenging
Ambiguity
"I saw a man on a hill with a telescope."
• Who has the telescope?
• Is the telescope being used to see the man?
• Multiple valid interpretations

Context Dependency
"The bank is closed."
• Financial institution?
• River bank?
• Meaning depends on context
• Cultural references

Structural Complexity
Languages have complex grammars and exceptions.
• Irregular verbs
• Nested clauses
Real-World NLP Applications
NLP powers many technologies we interact with daily, often without realizing it.

Conversational AI
Virtual assistants (Siri, Alexa), customer service chatbots, and interactive voice response systems that understand and respond to human requests.

Sentiment Analysis
Tools that analyze customer reviews, social media posts, and survey responses to determine positive, negative, or neutral sentiment.

Machine Translation
Systems like Google Translate that convert text from one language to another while preserving meaning and context.
More NLP Applications

Information Retrieval
Search engines that understand queries in natural language and return relevant results, even with spelling errors or synonyms.

Text Summarization
Tools that condense long documents into shorter versions while retaining key information and main points.

Content Generation
Systems that create human-like text for emails, reports, articles, and more based on prompts or templates.

Healthcare NLP
Applications that extract information from medical records, identify trends in patient data, and assist with clinical documentation.
The Evolution of NLP
1950s-1980s: Rule-Based Era
Hand-crafted linguistic rules and dictionaries
• ELIZA (1966) - early chatbot using pattern matching
• Focus on syntax and grammar rules
• Limited by rigid structures

1980s-2010s: Statistical Era
Probability and machine learning
• Hidden Markov Models
• Conditional Random Fields
• N-gram language models

2010-2017: Neural Era
Deep learning revolution
• Word2Vec embeddings (2013)
• Recurrent Neural Networks
• Sequence-to-sequence models

2017-Present: Transformer Era
Attention mechanisms
• BERT, GPT, T5
• Few-shot and zero-shot learning
• Multimodal models (text + vision)
Rule-Based NLP (1950s-1980s)
Key Characteristics

• Hand-crafted linguistic rules
• Pattern matching
• Dictionaries and thesauri
• Syntax parsing based on grammar rules

Limitations

• Couldn't handle exceptions well
• Required extensive linguistic expertise
• Difficult to maintain and scale
• Struggled with ambiguity

ELIZA (1966)

One of the earliest NLP systems, ELIZA simulated conversation by pattern matching and substitution. It could mimic a psychotherapist by turning statements into questions:

Human: "I am feeling sad."
ELIZA: "Why do you feel sad?"

Despite its simplicity, ELIZA created a surprising illusion of understanding.
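
To make the pattern-matching idea concrete, here is a toy ELIZA-style responder in Python. It is a sketch: the rules and the respond function are invented for illustration, not taken from ELIZA's actual script.

import re

# Each rule pairs a regex pattern with a response template that
# reuses the captured text (illustrative rules only).
RULES = [
    (re.compile(r"\bi am feeling (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
]

def respond(utterance):
    """Return the first matching rule's response, or a generic fallback."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1))
    return "Please go on."

print(respond("I am feeling sad."))     # Why do you feel sad?
print(respond("I need a break."))       # Why do you need a break?
print(respond("The weather is nice."))  # Please go on.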
Statistical NLP (1980s-2010s)
Statistical approaches shifted focus from rigid rules to probabilities derived from large text corpora.

Key Technologies

• N-gram language models
• Hidden Markov Models (HMMs)
• Maximum Entropy Models
• Conditional Random Fields (CRFs)
• Support Vector Machines (SVMs)

Advantages

• Data-driven rather than rule-based
• Better handling of ambiguity
• Could learn from examples
• More robust to unexpected inputs
• Captured patterns humans might miss

Limitations

• Limited by feature engineering
• Struggled with long-term dependencies
• Required large amounts of training data
• Poor semantic understanding
• Context window limitations
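
To see what "probabilities derived from large text corpora" means in practice, here is a toy bigram language model that estimates P(word | previous word) by counting adjacent pairs. The corpus and function names are invented for the example.

from collections import Counter, defaultdict

# A tiny, illustrative corpus (real systems use millions of sentences)
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word
bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(prob("cat", "the"))  # 0.25: "the" is followed by cat/mat/dog/rug
print(prob("sat", "cat"))  # 1.0: "cat" is always followed by "sat"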
Neural NLP (2010-2017)
The neural revolution began with word embeddings that captured semantic relationships between words as vectors in a high-dimensional space.

Word2Vec (2013): Represented words as dense vectors where similar words appear close together. The famous example: king - man + woman ≈ queen.

Neural Architectures

• Recurrent Neural Networks (RNNs)
• Long Short-Term Memory (LSTM)
• Gated Recurrent Units (GRU)
• Sequence-to-sequence models

Major Improvements

• Better handling of semantics
• Improved language generation
• Ability to capture longer dependencies
• Reduced need for feature engineering
• Transfer learning capabilities

Limitations

• Vanishing gradient problem in long sequences
• Sequential processing (slow)
• Limited context window
• High computational requirements
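
You can reproduce the king - man + woman analogy with pretrained embeddings. A minimal sketch using gensim, assuming the library is installed; it uses GloVe vectors as a lightweight stand-in for the original Word2Vec ones, downloaded on first use.

import gensim.downloader as api

# Pretrained 100-dimensional GloVe vectors (sizeable one-time download)
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman -> nearest neighbours in embedding space
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is typically the top result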
Transformer Revolution (2017-Present)
The paper "Attention Is All You Need" (2017) introduced the Transformer architecture, fundamentally changing NLP.

1 Attention Mechanism
Unlike RNNs, Transformers process entire sequences at once through self-attention, weighing the importance of each word relative to all others. This parallelization enables training on massive datasets.

2 Pretraining & Fine-tuning
Models are first pretrained on vast amounts of text (billions of words) and then fine-tuned for specific tasks with much smaller datasets. This transfer learning approach dramatically improved performance across all NLP tasks.

3 Massive Scale
Transformer models have grown from BERT's 340M parameters to GPT-4's reported trillion+ parameters, capturing increasingly subtle patterns in language and demonstrating emergent capabilities not explicitly trained for.
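
To show what self-attention actually computes, here is scaled dot-product attention in plain NumPy, with random toy matrices standing in for the learned query/key/value projections.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 4, 8                  # 4 tokens, 8-dimensional queries/keys
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))  # queries
K = rng.normal(size=(seq_len, d_k))  # keys
V = rng.normal(size=(seq_len, d_k))  # values

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)   # similarity of every token to every other
weights = softmax(scores)         # each row sums to 1
output = weights @ V              # each token: weighted mix of all values

print(weights.round(2))  # row i = token i's attention over the whole sequence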
Key Transformer Models

2018: BERT
Bidirectional Encoder Representations from Transformers, by Google. Revolutionized NLP by considering context from both directions. Excels at understanding tasks like classification and named entity recognition.

2019: GPT-2
Generative Pretrained Transformer 2, by OpenAI. Auto-regressive model trained to predict next words. Notable for high-quality text generation capabilities that raised ethical concerns.

2020: T5
Text-to-Text Transfer Transformer, by Google. Unified all NLP tasks into a text-to-text format. Demonstrated how a single model architecture could handle multiple tasks with state-of-the-art results.

2022+: Modern LLMs
GPT-4, Claude, Llama 2, etc. Exhibit emergent abilities like reasoning, code generation, and multi-step problem solving not present in smaller models.
The NLP Pipeline
Despite advances in end-to-end learning, most NLP applications still follow a structured pipeline.

Text Acquisition
Gathering raw text from sources like websites, documents, databases, or APIs

Preprocessing
Cleaning text, handling encoding issues, removing HTML tags, normalizing text

Tokenization
Breaking text into words, subwords, characters, or other meaningful units

Feature Extraction
Converting tokens to numerical representations (embeddings, TF-IDF, etc.)

Modeling
Applying algorithms to perform specific NLP tasks
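
A minimal end-to-end sketch of these five stages, assuming scikit-learn is installed; the toy review dataset and labels are purely illustrative.

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 1. Text acquisition (here: a hard-coded toy corpus of labeled reviews)
texts = ["Great product, works well!", "Terrible, broke after a day.",
         "Really love it, highly recommend.", "Awful experience, do not buy."]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# 2. Preprocessing: lowercase, keep only letters and spaces
clean = [re.sub(r"[^a-z\s]", "", t.lower()) for t in texts]

# 3-4. Tokenization + feature extraction: TfidfVectorizer does both
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(clean)

# 5. Modeling: a simple classifier over the TF-IDF features
model = LogisticRegression().fit(X, labels)
print(model.predict(vectorizer.transform(["love this great product"])))  # expect [1]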
Essential NLP Tasks

Tokenization
Breaking text into tokens (words, subwords, characters)

Part-of-Speech Tagging
Identifying word types (noun, verb, adjective, etc.)

Named Entity Recognition
Finding and classifying named entities (person, organization, location)

Dependency Parsing
Analyzing grammatical structure and word relationships

Sentiment Analysis
Determining emotional tone (positive, negative, neutral)

Machine Translation
Converting text between languages
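
NER and dependency parsing are demonstrated with spaCy later in this deck; as a quick taste here, part-of-speech tagging with NLTK. This is a sketch that assumes the tokenizer and tagger resources download successfully (resource names can vary across NLTK versions).

import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The striped cats are sleeping on the mat")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('striped', 'JJ'), ('cats', 'NNS'), ('are', 'VBP'), ...]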
Course Tools: Python for NLP
Why Python?

• Clear, readable syntax
• Rich ecosystem of NLP libraries
• Strong academic and industry adoption
• Excellent for prototyping and production
• Extensive documentation and community support

Python has become the de facto standard for NLP and machine learning work, with an estimated 70% of practitioners using it as their primary language.

Setting Up Your Environment

We recommend using Anaconda for this course. Create a dedicated environment with:

conda create -n nlp_course python=3.10
conda activate nlp_course
pip install nltk spacy transformers torch datasets
python -m spacy download en_core_web_sm  # English model used in the spaCy examples
NLTK: Natural Language Toolkit
Overview

NLTK is one of the oldest and most comprehensive Python libraries for NLP, developed primarily for education and research.

Key Features

• Extensive corpus access (50+ corpora and lexical resources)
• Complete text processing pipeline tools
• Support for classification, tokenization, stemming, tagging, and parsing
• Detailed documentation and accompanying book

Sample Code

import nltk
nltk.download('punkt')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

text = "The quick brown foxes jumped over the lazy dogs"
tokens = word_tokenize(text)
print(tokens)

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmas)  # 'foxes' -> 'fox', 'dogs' -> 'dog'
spaCy: Industrial-Strength NLP
Overview

spaCy is designed for production use, focusing on efficiency and ease of use.

Key Features

• Built for speed and production environments
• Pre-trained models for multiple languages
• End-to-end pipeline with a single API
• Integrated with deep learning frameworks
• Visualization tools

Sample Code

import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("Apple is looking to buy U.K. startup for $1 billion")

# Named Entity Recognition
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

# Dependency parsing
for token in doc:
    print(f"{token.text}: {token.dep_} -> {token.head.text}")
Hugging Face: Transformers Made Easy
Overview

Hugging Face has become the central hub for state-of-the-art NLP models and tools.

Key Components

• Transformers: Library for pre-trained models
• Datasets: Standardized access to NLP datasets
• Tokenizers: Fast tokenization implementations
• Model Hub: Community platform for sharing models
• Spaces: Interactive demos for models

Sample Code

from transformers import pipeline

# Sentiment analysis
sentiment_analyzer = pipeline("sentiment-analysis")
result = sentiment_analyzer("I love this course, it's amazing!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation
generator = pipeline("text-generation")
text = generator("Natural language processing is", max_length=30)
print(text[0]['generated_text'])
Course Project Preview: Building an NLP Pipeline
Throughout the course, you'll build a complete NLP system piece by piece, applying what you learn each week.

1 Week 1-2: Data Collection & Preprocessing
Gather text data from various sources and build a robust preprocessing pipeline including cleaning, normalization, and tokenization.

2 Week 3-4: Feature Engineering
Implement different text representation techniques, from TF-IDF to modern embeddings, and analyze their effectiveness.

3 Week 5-7: Model Development
Train and evaluate models for specific NLP tasks, starting with classical approaches and progressing to transformer-based solutions.

4 Week 8-10: Integration & Deployment
Combine components into a complete application solving a real-world NLP problem and prepare it for deployment.
Key Takeaways

NLP is Transforming Industries
From healthcare to customer service, NLP is fundamentally changing how businesses operate and how humans interact with technology.

Rapid Evolution
The field has progressed from simple rule-based systems to sophisticated transformer models in just a few decades, with the pace of innovation accelerating.

Accessibility
Modern tools and libraries have democratized NLP, making powerful techniques accessible to developers without specialized linguistics knowledge.

Practical Skills Matter
This course will equip you with both theoretical understanding and hands-on experience using industry-standard tools like NLTK, spaCy, and Hugging Face.

Next week: We'll dive into text preprocessing techniques and build our first NLP components!
