2 Marks
Text wrangling involves further manipulation of the text data to prepare it for analysis or
machine learning tasks.
Feature engineering for text representation involves converting raw text data into a format
suitable for machine learning models. This process transforms text into numerical features that
can be used to train and evaluate algorithms.
TF-IDF adjusts the word counts by considering the importance of a word in a document relative to its frequency in the entire corpus. It helps emphasize important words and de-emphasize common words.
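For illustration, a minimal sketch of TF-IDF weighting using scikit-learn (assuming it is installed; the three-sentence corpus below is invented for the example):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are common pets",
]

# Words that appear in every document (like "the") receive low weights,
# while words that are distinctive for one document receive high weights.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(tfidf_matrix.toarray().round(2))      # one TF-IDF vector per document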
Tokenization output:
['sample', 'sentence', '.']
Lemmatization is the process of reducing a word to its base or dictionary form (known as the lemma). Unlike stemming, which simply cuts off word endings, lemmatization considers the context and returns valid words.
For example:
running -> run
eats -> eat
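As a rough sketch, the same mappings can be reproduced with NLTK's WordNet lemmatizer (assuming NLTK and its WordNet data are installed):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # dictionary data used by the lemmatizer

lemmatizer = WordNetLemmatizer()
# The part-of-speech tag matters: pos="v" tells the lemmatizer these are verbs
print(lemmatizer.lemmatize("running", pos="v"))   # run
print(lemmatizer.lemmatize("eats", pos="v"))      # eat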
Bag of Words (BoW) is a simple and commonly used model in natural language processing
(NLP) for representing text data. The main idea behind BoW is to treat text (like a document,
sentence, or paragraph) as a collection of individual words, disregarding the order and context in
which they appear.
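A minimal Bag of Words sketch with scikit-learn's CountVectorizer (assuming scikit-learn is installed; the two sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
]

# Each sentence becomes a vector of raw word counts;
# word order and context are discarded, only frequencies remain.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())   # the shared vocabulary
print(bow.toarray())                        # one count vector per sentence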
2. What are embeddings?
Embeddings are dense, low-dimensional vector representations of words, phrases, or documents,
learned from large text corpora. They are designed to capture the semantic relationships between
words, where words with similar meanings are represented by similar vectors.
3. Define BERT.
BERT (Bidirectional Encoder Representations from Transformers): BERT generates
contextual embeddings, meaning that the representation of a word changes depending on the
words around it. It is based on the Transformer architecture and has achieved state-of-the-art
results in many NLP tasks.
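A small sketch of obtaining contextual embeddings from a pretrained BERT checkpoint with the Hugging Face transformers library (assuming transformers and PyTorch are installed; the checkpoint name bert-base-uncased and the two sentences are just one possible choice):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# "bank" appears in two different contexts, so its embedding differs per sentence
sentences = ["He sat by the river bank.", "She deposited cash at the bank."]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# (batch, sequence_length, hidden_size): one contextual vector per token
print(outputs.last_hidden_state.shape)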
4. What is Word2Vec?
Word2Vec is a popular technique in Natural Language Processing (NLP) for representing words
as vectors. Word2Vec models are used to capture the semantic relationships between words by
representing them in a continuous vector space, where words with similar meanings are placed
closer together.
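A minimal Word2Vec sketch with gensim (assuming gensim 4.x is installed; the tiny tokenized corpus is invented and far too small for meaningful vectors):

from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chased", "the", "cat"],
]

# vector_size: embedding dimension; window: context size; sg=1 selects skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["king"][:5])                    # first dimensions of the "king" vector
print(model.wv.most_similar("king", topn=2))   # nearest words in the vector space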
5. What is GloVe?
GloVe, or Global Vectors for Word Representation, is an unsupervised learning algorithm
developed by researchers at Stanford for obtaining vector representations of words. The idea
behind GloVe is to leverage the global statistical information of a corpus to produce dense word
embeddings.
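A sketch of using pretrained GloVe vectors by reading the published text format directly (assuming a file such as glove.6B.50d.txt has been downloaded from the Stanford GloVe page; the file name is only an example):

import numpy as np

# Each line of a GloVe file is: word value_1 value_2 ... value_d
embeddings = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words should have a high cosine similarity
print(cosine(embeddings["king"], embeddings["queen"]))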
Text Classification
Named Entity Recognition (NER)
Sentiment Analysis
Machine Translation
7. What is FastText?
FastText is an extension of the Word2Vec model developed by Facebook's AI Research (FAIR)
lab. It addresses some of the limitations of Word2Vec, particularly in handling out-of-
vocabulary (OOV) words and capturing subword information.
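A small gensim FastText sketch showing how subword (character n-gram) information yields a vector even for an out-of-vocabulary word (assuming gensim 4.x; the toy corpus is made up):

from gensim.models import FastText

sentences = [
    ["machine", "learning", "models", "learn", "patterns"],
    ["deep", "learning", "models", "use", "neural", "networks"],
]

# min_n/max_n control the character n-gram lengths used to build word vectors
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

print("learnings" in model.wv.key_to_index)   # False: never seen during training
print(model.wv["learnings"][:5])              # still gets a vector from its subwords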
Deep learning is a subset of machine learning that focuses on algorithms inspired by the
structure and function of the brain, known as artificial neural networks. It is particularly
powerful for tasks involving large amounts of data, such as image recognition, natural language
processing, and speech recognition.
10. What is Self-supervised learning?
Self-supervised learning is an approach where the model learns from unlabeled data by
predicting parts of the input from other parts. It's used in pre-training large models on vast
datasets.
11. What is an RNN?
Recurrent Neural Networks (RNNs) are a type of neural network designed for sequential data,
where the order of the data points is crucial. They are particularly effective for tasks where the
context from previous inputs in a sequence influences the current output. RNNs are widely used
in natural language processing (NLP), time series prediction, and other tasks involving
sequential data.
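A minimal sketch of running sequential data through an RNN layer in PyTorch (assuming PyTorch is installed; the sizes are arbitrary):

import torch
import torch.nn as nn

# A single-layer RNN over batches of sequences of 10-dimensional inputs
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

x = torch.randn(4, 7, 10)     # (batch=4, time steps=7, features=10)
output, h_n = rnn(x)

print(output.shape)   # (4, 7, 20): hidden state at every time step
print(h_n.shape)      # (1, 4, 20): final hidden state summarizing each sequence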
12. What are Transformers?
Transformers are a type of deep learning model architecture that has revolutionized the field of
natural language processing (NLP) and, more recently, has been applied to a variety of other
tasks, including computer vision, time series analysis, and more.
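The core operation inside a Transformer is scaled dot-product attention; a bare NumPy sketch (self-attention without the learned projection matrices, purely for illustration):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position attends to every position and takes a weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                             # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V

X = np.random.randn(5, 8)      # 5 tokens, 8-dimensional representations
print(scaled_dot_product_attention(X, X, X).shape)   # (5, 8)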
13. What is Text summarization?
Text Summarization is a process of distilling the most important information from a source text
and presenting it in a shorter form while maintaining the overall meaning and key points. It's a
crucial tool in managing large amounts of textual data, helping users quickly grasp the main
ideas without having to read the entire content.
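A very naive extractive summarization sketch, scoring sentences by the frequency of their words (a simplification for illustration only; real systems use far more sophisticated extractive or abstractive methods):

from collections import Counter

def extractive_summary(text, n_sentences=2):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(text.lower().split())                 # word frequencies in the text
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in s.lower().split()),
                    reverse=True)
    keep = set(ranked[:n_sentences])                     # highest-scoring sentences
    return ". ".join(s for s in sentences if s in keep) + "."

print(extractive_summary("Cats sleep a lot. Cats also purr. The weather was cold.", 1))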
14. What is Topic Modeling?
Topic Modeling is an unsupervised machine learning technique used to discover the hidden
thematic structure within a large collection of documents. It identifies patterns of word co-
occurrence in the text and groups words into topics that capture the main themes across the
documents.
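A short sketch of topic modeling with Latent Dirichlet Allocation (LDA) in scikit-learn (assuming scikit-learn is installed; the four documents and the choice of two topics are invented for the example):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the team won the football match",
    "the players scored two goals",
    "the election results were announced",
    "voters chose a new government",
]

# LDA works on word counts and groups co-occurring words into topics
counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [terms[j] for j in topic.argsort()[-3:]]
    print("Topic", i, ":", top_words)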
15. What are the applications of Topic Modeling?
1. Content Recommendation: By identifying topics within a user's reading history, content
recommendation systems can suggest articles or books that align with their interests.
2. Document Classification: Topic modeling helps in categorizing documents based on
their main themes, such as clustering news articles by topic.
3. Text Mining: Researchers and analysts use topic modeling to explore large text corpora
and identify key themes, trends, or hidden insights.
UNIT III
QUESTION ANSWERING AND DIALOGUE SYSTEMS
1. What is Information Retrieval?
Information Retrieval (IR) is the process of obtaining relevant information from a large
repository, often in response to a user query. The primary goal of IR is to help users
find the information they need quickly and efficiently. This is commonly applied to
document collections, such as web pages, research articles, or databases, where a
system retrieves documents based on their relevance to the user's search terms.
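A compact sketch of the IR idea, ranking documents against a query with TF-IDF vectors and cosine similarity (assuming scikit-learn; the three documents and the query are invented):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Information retrieval finds relevant documents for a user query.",
    "Speech recognition converts audio signals into text.",
    "Search engines rank web pages by relevance to the search terms.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Put the query into the same vector space and rank documents by similarity
query_vector = vectorizer.transform(["how do search engines find relevant pages"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(scores.argsort()[::-1])   # document indices, most relevant first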
Rather than generating answers directly, IR-based QA systems first find relevant information from a database or set of documents and then extract the most relevant part to form an answer.
Language models for question answering (QA) have advanced significantly in recent years,
particularly with the advent of transformer-based architectures like OpenAI’s GPT, Google’s
BERT, and other similar models. These models use vast amounts of text data to learn linguistic
patterns, enabling them to generate or retrieve accurate answers to questions without relying on
explicit knowledge bases.
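A minimal sketch of extractive question answering with a pretrained transformer through the Hugging Face pipeline API (assuming the transformers library is installed; the default QA model is downloaded on first use, and the context passage reuses the BERT description above):

from transformers import pipeline

qa = pipeline("question-answering")   # loads a pretrained extractive QA model

context = ("BERT is based on the Transformer architecture and produces "
           "contextual embeddings that depend on the surrounding words.")
result = qa(question="What architecture is BERT based on?", context=context)

print(result["answer"], result["score"])   # extracted span and model confidence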
Classic QA models, developed before the deep learning and transformer revolutions, relied more
on structured approaches, traditional machine learning, and rule-based systems. These models
typically focused on understanding and retrieving answers from specific types of data, such as
documents, structured databases, or even human-curated knowledge bases.
Rule-based QA systems were among the earliest attempts at automating question answering.
These systems followed manually defined rules or templates to process the question and retrieve
the answer. They were effective for specific, narrow domains but lacked flexibility.
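A toy illustration of the rule-based approach, with one hand-written template and a tiny lookup table (entirely hypothetical, to show why such systems are narrow but predictable):

import re

CAPITALS = {"france": "Paris", "japan": "Tokyo"}   # tiny hand-curated knowledge

def rule_based_qa(question):
    # One manually defined template; anything else falls outside the rules
    match = re.match(r"what is the capital of (\w+)\??", question.lower())
    if match:
        return CAPITALS.get(match.group(1), "unknown")
    return "Sorry, I can only answer questions that match my templates."

print(rule_based_qa("What is the capital of France?"))   # Paris
print(rule_based_qa("Who wrote Hamlet?"))                # outside the rules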
Pipeline-based QA systems break down the QA process into a sequence of independent steps,
each responsible for a specific task such as parsing, entity extraction, relation identification, and
answer generation.
Statistical machine learning models improved upon rule-based and IR-based systems by learning
patterns from data, often using features derived from questions and text. Classic machine
learning algorithms such as support vector machines (SVMs) and decision trees were applied to
QA tasks.
Chatbots are designed for interactive communication, often implemented in customer service or
personal assistants like Siri and Alexa. While not purely QA systems, they often include QA
components.
ML-based chatbots are more advanced, using natural language understanding (NLU) and dialogue management systems.
1. Customer Support: Chatbots are widely used in customer support for handling FAQs, troubleshooting issues, and guiding users through product features or services. They reduce the workload for human agents and provide instant assistance.
2. Virtual Assistants: AI chatbots like Siri, Google Assistant, and Alexa act as personal assistants, helping users perform tasks like setting reminders, controlling smart home devices, and answering questions.
3. E-commerce: In e-commerce, chatbots help customers find products, process orders, and provide information about discounts or promotions. They can guide users through the shopping process or offer product recommendations.
4. Healthcare: Healthcare chatbots assist patients with appointment scheduling, providing medical information, reminding users to take medications, and even performing symptom checking.
Designing dialogue systems, especially conversational agents like chatbots and virtual assistants,
requires a careful balance of linguistic understanding, interaction flow, and backend integration.
Dialogue systems are typically composed of multiple components that allow them to engage
users in natural, coherent, and goal-oriented conversations.
Combining rule-based components with machine learning models creates hybrid systems. For
example, NLU and dialogue management may be rule-based for task-oriented conversations,
while response generation is handled by an AI model for more natural interaction.
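A toy sketch of such a hybrid design: keyword rules handle two task-oriented intents, and anything else is passed to a placeholder generative model (all intents, keywords, and responses are invented for the example):

import random

RULES = {
    "book_table": {"book", "table", "reservation"},
    "opening_hours": {"open", "hours", "close"},
}
RESPONSES = {
    "book_table": "Sure, for how many people and at what time?",
    "opening_hours": "We are open from 9am to 10pm every day.",
}

def fallback_model(user_text):
    # Stand-in for a learned response generator (e.g. a seq2seq model or LLM)
    return random.choice(["Could you rephrase that?", "Tell me more."])

def respond(user_text):
    words = set(user_text.lower().split())
    for intent, keywords in RULES.items():
        if words & keywords:
            return RESPONSES[intent]      # rule-based, task-oriented path
    return fallback_model(user_text)      # ML path for open-ended input

print(respond("Can I book a table for tonight?"))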
UNIT IV
TEXT-TO-SPEECH SYNTHESIS
1. What is Text Normalization?
Text normalization is the process of transforming text into a standard format to facilitate easier
processing and analysis, especially in natural language processing (NLP) tasks. It involves
several steps that help to reduce the variability in text data.
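A small sketch of typical normalization steps in plain Python (the exact steps vary by application; this version lowercases, removes digits and punctuation, and collapses whitespace):

import re
import string

def normalize(text):
    text = text.lower()                                    # case folding
    text = re.sub(r"\d+", " ", text)                       # drop digits
    text = text.translate(str.maketrans("", "", string.punctuation))   # strip punctuation
    text = re.sub(r"\s+", " ", text).strip()               # collapse whitespace
    return text

print(normalize("  The price is $25.99, isn't it?! "))   # -> "the price is isnt it"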
Text-to-Speech (TTS): Converts written text to speech by first converting letters into
phonemes.
Automatic Speech Recognition (ASR): Uses phoneme models for recognizing speech
and mapping spoken words to text.
Language Learning Tools: Helps learners by generating phonetic transcriptions of
words.
4. Define Prosody.
Prosody refers to the rhythm, intonation, and stress patterns in speech that convey meaning,
emotion, and structure. It's an essential aspect of natural language and spoken communication,
affecting how messages are perceived beyond the basic phonetic sounds.
Signal processing is the analysis, manipulation, and interpretation of signals to extract useful
information, enhance their quality, or convert them into a desired format. Signals can be
anything that conveys information, such as sound, images, sensor readings, or data streams, and
they can be represented in various forms like analog (continuous) or digital (discrete).
1. Analog Signals: Continuous signals, like sound waves or light, that vary over time and
take any value in a given range.
o Example: Human speech captured by a microphone.
2. Digital Signals: Discrete-time signals, often derived from the sampling of analog signals,
represented as sequences of numbers (binary).
o Example: A digitally recorded audio file.
Communication systems are a major application area of signal processing.
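A tiny NumPy sketch of the analog-to-digital step described above: a continuous 440 Hz tone is measured at discrete points in time to give a digital signal (the frequency, duration, and sampling rate are arbitrary):

import numpy as np

frequency = 440.0        # Hz, the "analog" tone being modeled
duration = 0.01          # seconds of signal
sample_rate = 8000       # samples per second (the sampling frequency)

# Digital signal: the continuous waveform evaluated at discrete time steps
t = np.arange(0, duration, 1.0 / sample_rate)
digital_signal = np.sin(2 * np.pi * frequency * t)

print(len(digital_signal))   # 80 discrete samples represent 10 ms of sound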
Parametric TTS systems generate speech by modeling the speech production process. Instead of
concatenating pre-recorded speech, parametric approaches synthesize speech by using statistical
models to control parameters like pitch, duration, and formants (vocal tract resonances) to
generate audio waveforms from scratch.
Deep learning-based text-to-speech (TTS) systems, particularly those like WaveNet, represent a
major leap in generating natural and high-quality synthetic speech. These systems address many
limitations of traditional methods like concatenative and parametric TTS by using neural
networks to learn the complex patterns of human speech directly from data.
UNIT V
AUTOMATIC SPEECH RECOGNITION
1. What is Acoustic Modelling?
Acoustic modeling is a crucial component of speech recognition systems, where it deals with the
representation of the relationship between linguistic units of speech (such as phonemes or
words) and the corresponding audio signal. It focuses on how to statistically model the way
phonetic units are produced in various contexts, including differences in speakers, accents, and
environmental noise.
2. What are Phonemes?
Phonemes are the smallest units of sound in a language, and acoustic models attempt to
recognize these by mapping the audio signal to the corresponding phonetic sounds. For example,
the words "cat" and "bat" differ by just one phoneme: /k/ and /b/.
Gaussian Mixture Models (GMMs) are used to model the distribution of the acoustic features associated with each HMM state. A GMM is a weighted sum of several Gaussian distributions and helps capture the variability in speech signals for a particular phoneme.
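A brief sketch with scikit-learn's GaussianMixture, standing in for the GMM attached to one HMM state (the random 13-dimensional vectors are placeholders for real MFCC frames of a phoneme):

import numpy as np
from sklearn.mixture import GaussianMixture

features = np.random.randn(500, 13)    # placeholder for MFCC frames of one phoneme

# A mixture of a few Gaussians models the spread of those feature vectors
gmm = GaussianMixture(n_components=4, covariance_type="diag").fit(features)

frame = np.random.randn(1, 13)
print(gmm.score_samples(frame))        # log-likelihood of a new frame under this GMM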
A Hidden Markov Model (HMM) is a statistical model that is widely used in speech
recognition, natural language processing, and various other time-series applications. It is
particularly well-suited for modeling sequences where observations are generated by underlying
hidden states, which evolve over time.
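A compact NumPy sketch of a toy HMM and the forward algorithm, which computes how likely an observation sequence is when summed over all hidden state paths (the probabilities below are invented):

import numpy as np

pi = np.array([0.6, 0.4])            # initial state probabilities
A = np.array([[0.7, 0.3],            # A[i, j] = P(next state j | current state i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],       # B[i, k] = P(observing symbol k | state i)
              [0.1, 0.3, 0.6]])

def forward(observations):
    alpha = pi * B[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ A) * B[:, obs]     # propagate and re-weight by the emission
    return alpha.sum()

print(forward([0, 1, 2]))   # likelihood of observing symbols 0, 1, 2 in order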
Discriminative Training
Better Generalization
Labeling
Supervised learning
Evaluation
Speech recognition systems often need to adapt to new speakers, environments, or languages.
Techniques like Maximum Likelihood Linear Regression (MLLR) or speaker adaptation
training can be used to fine-tune acoustic models for specific speakers or conditions.
Step 1: Pre-Emphasis
Step 2: Framing
Step 3: Windowing
Step 4: Fourier Transform
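A NumPy sketch of these four front-end steps on a placeholder waveform (the pre-emphasis coefficient 0.97, the 25 ms frame length, and the 10 ms hop are common choices, not requirements):

import numpy as np

sample_rate = 16000
signal = np.random.randn(sample_rate)        # placeholder for 1 second of speech

# Step 1: Pre-emphasis boosts the high-frequency content
emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

# Step 2: Framing cuts the signal into short overlapping frames
frame_len, hop = int(0.025 * sample_rate), int(0.010 * sample_rate)
n_frames = 1 + (len(emphasized) - frame_len) // hop
frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])

# Step 3: Windowing tapers each frame to reduce spectral leakage
frames = frames * np.hamming(frame_len)

# Step 4: Fourier transform gives the magnitude spectrum of every frame
spectrum = np.abs(np.fft.rfft(frames, axis=1))
print(spectrum.shape)    # (number of frames, frequency bins)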