Data Analytics All Paper Solution

The document provides solutions to questions from two papers on Data Analytics, covering definitions, concepts, and applications related to data analytics, machine learning, and natural language processing. Key topics include data characterization, confusion matrices, prediction models, and types of data analytics. It also discusses techniques such as linear regression, stemming vs. lemmatization, and the importance of support and confidence in association rule mining.


Here are the solutions to both Paper 1 and Paper 2, referencing the Data Analytics textbook.

Paper 1
Q1) Attempt all of the following for 1 mark each:
a) Define Data Analytics.
Data Analytics is the science of extracting meaningful, valuable
information from raw data to aid decision-making and identify
patterns.
b) Define Tokenization.
Tokenization is the process of splitting text into smaller units,
such as words or phrases, for analysis in text processing.
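A minimal word tokenizer can be sketched with Python's standard re module (a toy illustration; real tokenizers handle punctuation, contractions, and Unicode far more carefully):

```python
import re

def tokenize(text):
    # Lowercase the text and pull out runs of letters/digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Data Analytics extracts value from raw data."))
# → ['data', 'analytics', 'extracts', 'value', 'from', 'raw', 'data']
```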
c) Define Machine Learning.
Machine Learning is a subset of AI where systems improve their
performance by learning from data without being explicitly
programmed.
d) What is clustering?
Clustering groups data into clusters based on similarity, with
each cluster containing data points similar to each other.
e) What is Frequent Itemset?
A frequent itemset is a set of items that occur frequently
together in transactions, typically identified in market basket
analysis.
f) What is data characterization?
Data characterization summarizes the general features or
properties of a dataset to provide insights into its content.
g) What is an outlier?
An outlier is a data point that significantly deviates from the
rest of the data, indicating a potential error or anomaly.
h) What is Bag of Words?
The Bag of Words (BoW) model represents text by counting the
frequency of each word, ignoring grammar and word order.
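The BoW idea can be illustrated with collections.Counter from the standard library (a minimal sketch; real pipelines normalize and tokenize the text first):

```python
from collections import Counter

def bag_of_words(text):
    # Count word frequencies, ignoring grammar and word order.
    return Counter(text.lower().split())

print(bag_of_words("the dog chased the cat"))
# Counter({'the': 2, 'dog': 1, 'chased': 1, 'cat': 1})
```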
i) What is Text Analytics?
Text Analytics analyzes textual data to extract meaningful
patterns, insights, and trends.
j) Define Trend Analytics.
Trend Analytics identifies patterns or trends in data over time to
provide actionable insights.

Q2) Attempt all of the following for 1.5 marks each:


a) What is confusion matrix?
A confusion matrix is a table used to evaluate a classification
model's performance by comparing actual and predicted values
across classes, categorizing outcomes as true positives, true
negatives, false positives, and false negatives.
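The four outcome counts can be tallied directly from the label lists; the confusion_matrix helper below is a hypothetical illustration for the binary case:

```python
def confusion_matrix(actual, predicted, positive=1):
    # Tally the four outcome types for a binary classifier.
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(actual, predicted))
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```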
b) Define support and confidence in association rule
mining.
 Support measures how frequently a rule applies in the dataset.
Support = (Transactions containing both X and Y) / (Total transactions)
 Confidence measures how often the rule is correct.
Confidence = (Transactions containing both X and Y) / (Transactions containing X)
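These two formulas can be checked on a toy market-basket example (the helpers and the transaction data below are illustrative, not from the textbook):

```python
def support(transactions, items):
    # Fraction of transactions that contain every item in `items`.
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    # Support of (X and Y together) divided by support of X.
    return support(transactions, set(x) | set(y)) / support(transactions, x)

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
print(support(transactions, {"bread", "milk"}))                 # 0.5
print(round(confidence(transactions, {"bread"}, {"milk"}), 3))  # 0.667
```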
c) Explain any two Machine Learning (ML) Applications.
1. Fraud Detection: Identifies fraudulent transactions by
analyzing patterns in transaction data.
2. Medical Diagnosis: Helps predict diseases by analyzing
patient data and medical history.
d) Write a short note on stop words.
Stop words are common words like "a," "the," "is," and "in,"
which are often removed in text analysis as they add little value
to understanding the content.
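Stop-word removal is a simple filter over tokens; the word list below is a small illustrative subset, not any library's standard list:

```python
# A small illustrative stop-word set (real lists are much longer).
STOP_WORDS = {"a", "the", "is", "in", "and", "of", "to"}

def remove_stop_words(tokens):
    # Keep only tokens that are not stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("the price of the house is high".split()))
# ['price', 'house', 'high']
```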
e) Define supervised learning and unsupervised
learning.
 Supervised Learning: Uses labeled data to train models,
e.g., predicting house prices.
 Unsupervised Learning: Identifies patterns in unlabeled
data, e.g., clustering customer groups.
Here are detailed answers for Main Question 3 from both
Paper 1 and Paper 2, with additional explanation.

Paper 1, Q3
a) What is prediction? Explain any one regression model
in detail.
Prediction involves forecasting future values or outcomes
based on historical data using models or statistical methods. It
is widely used in fields like business forecasting, healthcare,
and finance. Predictions are typically categorized into
classification (categorical output) or regression (continuous
output).
Linear Regression Model:
Linear regression is a supervised learning algorithm used for
predictive analysis. It predicts a dependent variable (Y) based
on the relationship with an independent variable (X) using the
equation:
Y = mX + c
Where:
 m: Slope of the line (rate of change).
 c: Y-intercept (value of Y when X = 0).
Example: Predicting house prices based on size.
Steps in linear regression:
1. Collect historical data (e.g., house sizes and prices).
2. Plot data points on a graph.
3. Determine the best-fitting line that minimizes the sum of
squared errors between actual and predicted values.
Advantages:
 Simple to implement and interpret.
 Useful for small datasets.
Limitations:
 Assumes linear relationships, which may not hold for
complex data.
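The best-fitting line in step 3 has a closed-form least-squares solution; the sketch below uses only the standard library, with illustrative (perfectly linear) house data:

```python
def fit_line(xs, ys):
    # Least-squares estimates of slope m and intercept c for Y = mX + c.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    c = mean_y - m * mean_x
    return m, c

# House sizes (sq. ft.) vs prices in thousands (illustrative numbers).
sizes  = [1000, 1500, 2000, 2500]
prices = [200, 300, 400, 500]
m, c = fit_line(sizes, prices)
print(round(m, 6), round(c, 6))  # slope ≈ 0.2, intercept ≈ 0
print(round(m * 1800 + c, 2))    # predicted price for 1800 sq. ft. ≈ 360
```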

b) Differentiate between Stemming and Lemmatization.


 Definition: Stemming reduces words to their root or base form by removing suffixes; lemmatization converts words to their base or dictionary form using vocabulary and morphological analysis.
 Purpose: Stemming focuses on quick and approximate reduction; lemmatization provides accurate and meaningful base forms.
 Complexity: Stemming is a simple, rule-based approach; lemmatization is complex and requires linguistic understanding.
 Output Example: Stemming: "studying" → "study," "studies" → "studi." Lemmatization: "studying" → "study," "studies" → "study."
 Use Cases: Stemming suits search engines and keyword matching; lemmatization suits text analysis and sentiment analysis.
Example:
 For the word "running":
o Stemming → "run."
o Lemmatization → "run."
 For the word "better":
o Stemming → "better."
o Lemmatization → "good" (correct base word).
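The rule-based nature of stemming can be shown with a deliberately crude suffix-stripper (a toy, not the Porter stemmer; note it produces non-words like "runn", which is exactly the imprecision described above):

```python
def toy_stem(word):
    # Strip the first matching suffix, illustrating rule-based stemming.
    # Real stemmers apply ordered rule sets with extra conditions.
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("studies"))  # 'stud'  (approximate, not a dictionary word)
print(toy_stem("running"))  # 'runn'  (a lemmatizer would return 'run')
```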

c) Describe types of Data Analytics.


Data Analytics can be categorized into four primary types
based on its purpose and methodology:
1. Descriptive Analytics (What happened?)
o Focuses on summarizing historical data to
understand past performance.
o Example: Analyzing past sales data to determine
trends.
o Techniques: Data aggregation, statistical analysis.
2. Diagnostic Analytics (Why did it happen?)
o Investigates the causes of specific outcomes or
events by identifying relationships and patterns.
o Example: Analyzing why website traffic dropped
during a campaign.
o Techniques: Root cause analysis, correlation analysis.
3. Predictive Analytics (What will happen?)
o Uses historical data and statistical models to forecast
future outcomes.
o Example: Predicting customer churn using machine
learning.
o Techniques: Regression analysis, machine learning.
4. Prescriptive Analytics (How can we make it
happen?)
o Suggests actions or strategies to achieve desired
outcomes based on predictions.
o Example: Recommending stock purchases based on
market trends.
o Techniques: Optimization models, decision trees.
Each type builds upon the previous one, providing progressively
actionable insights.

Here is the detailed solution for Paper 2:

Q1) Attempt all of the following for 1.5 marks each:


a) Define Data Analytics.
Data Analytics is the process of examining raw data to uncover
patterns, trends, and valuable insights to make informed
decisions. It involves techniques from statistics, computer
science, and business intelligence.

b) What are AUC & ROC curves?


 AUC (Area Under the Curve): Measures the ability of a
classifier to distinguish between classes. A value closer to
1 indicates a better model.
 ROC (Receiver Operating Characteristic): A graph
plotting True Positive Rate (Sensitivity) against False
Positive Rate (1-Specificity) to evaluate classification
performance.
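AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, which gives a direct way to compute it (a minimal sketch for small datasets; library routines integrate the ROC curve instead):

```python
def auc(labels, scores):
    # Compare every positive score against every negative score;
    # ties count as half a win. This equals the area under the ROC curve.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
print(auc(labels, scores))  # 0.75
```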

c) Write any two applications of Supervised Machine


Learning.
1. Spam Email Detection: Classify emails as spam or non-
spam.
2. Medical Diagnosis: Predict diseases based on patient
symptoms and medical records.

d) Give the formula for support & confidence.


 Support:
Support = (Transactions containing both X and Y) / (Total transactions)
 Confidence:
Confidence = (Transactions containing both X and Y) / (Transactions containing X)

e) What is an outlier?
An outlier is a data point significantly different from others in a
dataset, potentially indicating variability or measurement error.

f) State applications of NLP.


1. Sentiment analysis of customer reviews.
2. Chatbots and virtual assistants like Alexa and Siri.

g) What is web scraping?


Web scraping involves extracting data from websites using
automated tools or programs for purposes like market analysis
or research.

h) What is the purpose of n-gram?


The purpose of n-gram is to analyze sequences of words or
characters in a text, aiding in tasks like text prediction and
machine translation.
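Extracting n-grams amounts to sliding a window of length n over the token sequence; a minimal sketch:

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "data analytics is fun".split()
print(ngrams(tokens, 2))
# [('data', 'analytics'), ('analytics', 'is'), ('is', 'fun')]
```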

i) Define classification.
Classification is a supervised learning technique where input
data is categorized into predefined classes based on its
features.

j) Define Recall.
Recall measures the proportion of actual positives correctly
identified by the model.
Recall = True Positives (TP) / (True Positives (TP) + False Negatives (FN))
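The formula can be computed directly from label lists (an illustrative helper for the binary case):

```python
def recall(actual, predicted, positive=1):
    # TP / (TP + FN): of all actual positives, how many were found.
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp / (tp + fn)

print(round(recall([1, 1, 1, 0], [1, 0, 1, 0]), 3))  # 0.667
```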

Q2) Attempt all of the following for 2.5 marks each:


a) Explain the concept of underfitting & overfitting.
 Underfitting: Occurs when a model is too simple, failing
to capture the underlying patterns in the data. It leads to
poor performance on training and test data.
 Overfitting: Happens when a model is overly complex,
capturing noise in the training data. It performs well on
training data but poorly on test data.

b) What is Linear Regression? What type of Machine


Learning applications can be solved with Linear
Regression?
Linear Regression predicts a continuous dependent variable
based on one or more independent variables.
Applications:
1. Predicting house prices based on size and location.
2. Estimating sales trends over time.

c) What is Social Media Analytics?


Social Media Analytics is the process of analyzing data from
social media platforms to understand user behavior, trends,
and brand perception. It aids in targeted marketing and
customer engagement.

d) What are the advantages of FP-growth Algorithm?


1. No need for candidate generation, unlike Apriori.
2. Scans the database fewer times, making it efficient.
3. Handles large datasets effectively by compressing data
into a prefix tree (FP-tree).
e) What are dependent & independent variables?
 Dependent Variable: The target variable being predicted
or explained (e.g., house price).
 Independent Variable: The input variables used to
predict the dependent variable (e.g., size, location).

Q3) Attempt all of the following for 4 marks each:


a) What are frequent itemsets & association rules?
Describe with example.
Refer to Paper 1, Q4(a) for a detailed explanation.

b) What is stemming & lemmatization?


Refer to Paper 1, Q3(b) for a detailed differentiation.

c) Explain various types of Data Analytics.


Refer to Paper 1, Q3(c) for a detailed explanation.

Q4) Attempt all of the following for 4 marks each:


a) What is Bag of Words & POS tagging in NLP?
 Bag of Words (BoW): Represents text as a collection of
word frequencies, ignoring grammar and order.
Example: "I love dogs and cats" → {"I": 1, "love": 1,
"dogs": 1, "and": 1, "cats": 1}.
 POS Tagging: Assigns grammatical categories (e.g.,
noun, verb) to words.
Example: "The dog runs" → "The (Det), dog (Noun), runs
(Verb)".

b) What is Logistic Regression? Explain it with example.


Logistic Regression is a supervised learning algorithm used for
binary classification. It predicts probabilities using the sigmoid
function:
P(Y = 1) = 1 / (1 + e^(-z))
Where z = b0 + b1·X1 + b2·X2 + ….
Example: Predicting whether an email is spam (1) or not spam
(0).
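The sigmoid and a one-feature prediction can be sketched as follows; the coefficients and the feature (count of a trigger word) are assumed for illustration, not fitted to data:

```python
import math

def sigmoid(z):
    # Squash any real z into the (0, 1) probability range.
    return 1 / (1 + math.exp(-z))

def predict_spam(b0, b1, x1):
    # P(Y=1) for a one-feature logistic model z = b0 + b1*x1.
    return sigmoid(b0 + b1 * x1)

# Assumed coefficients; x1 = number of times "free" appears in the email.
p = predict_spam(b0=-2.0, b1=1.5, x1=3)
print(round(p, 3))  # 0.924 → classified as spam with a 0.5 threshold
```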

c) Frequent Itemsets using Apriori Algorithm with


Minimum Support = 3
Using the database provided:
 Frequent Itemsets: {E}, {T}, {C}.
Refer to Section 3.1 of the textbook for detailed
calculations.

Q5) Attempt all of the following for 3.5 marks each:


a) Define the terms:
i) Confusion Matrix: A table showing actual vs. predicted
classifications, with entries for True Positives (TP), False
Positives (FP), True Negatives (TN), and False Negatives (FN).
ii) Accuracy: Proportion of correct predictions.
Accuracy = (TP + TN) / (Total Predictions)
iii) Precision: Proportion of correctly predicted positives.
Precision = TP / (TP + FP)
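Both metrics follow directly from the confusion-matrix counts (an illustrative helper for the binary case):

```python
def metrics(actual, predicted, positive=1):
    # Derive accuracy and precision from the four outcome counts.
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    return {"accuracy": (tp + tn) / len(actual),
            "precision": tp / (tp + fp)}

m = metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(m["accuracy"], round(m["precision"], 3))  # 0.6 0.667
```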

b) What is Machine Learning? Explain its types.


Machine Learning is a branch of AI that enables systems to
learn and improve from data without explicit programming.
Types:
1. Supervised Learning: Learns from labeled data (e.g.,
classification, regression).
2. Unsupervised Learning: Discovers patterns in unlabeled
data (e.g., clustering).
3. Reinforcement Learning: Learns by interacting with the
environment and maximizing rewards (e.g., game-playing
bots).
