
Rohit

Unit 4 Imp. Points + PYQs

1. Types of Analytics in Big Data.

1. Descriptive Analytics
o Purpose: Understand what has happened in the past.
o Function: Summarizes raw data to make it interpretable by humans.
o Techniques: Data aggregation, data mining, reporting, dashboards.
o Example: Monthly website traffic reports, sales trend analysis.
2. Diagnostic Analytics
o Purpose: Understand why something happened.
o Function: Finds the root cause of outcomes and trends.
o Techniques: Drill-down, data discovery, correlation analysis.
o Example: Analyzing the reason for a decline in sales in a specific quarter.
3. Predictive Analytics
o Purpose: Forecast future outcomes based on historical data.
o Function: Uses statistical models and machine learning to predict trends.
o Techniques: Regression analysis, decision trees, neural networks.
o Example: Predicting customer churn or future product demand.
4. Prescriptive Analytics
o Purpose: Recommend actions to achieve desired outcomes.
o Function: Suggests decisions based on predictive models and optimization.
o Techniques: Simulation, optimization algorithms, decision analysis.
o Example: Recommending pricing strategies or supply chain adjustments.
5. Cognitive Analytics
o Purpose: Mimic human thinking to interpret complex data.
o Function: Leverages AI and natural language processing (NLP) to understand unstructured
data.
o Techniques: Machine learning, deep learning, NLP.
o Example: Chatbots that analyze and respond to customer queries in real time.
2. Apriori Algorithm, Support and Confidence Value, and Association Rules.

1. Apriori Algorithm
o Definition: Apriori is an algorithm used in association rule mining to identify frequent
itemsets in a dataset and derive rules that explain relationships among items.
o Purpose: It helps in market basket analysis to discover item combinations that occur
frequently together.
o Process:
1. Identify frequent individual items using a minimum support threshold.
2. Extend these to larger itemsets as long as they appear frequently.
3. Use the frequent itemsets to generate association rules.
2. Support
o Definition: Indicates how frequently an itemset appears in the dataset.
o Formula: Support(A)= Number of transactions containing (A) / Total number of transactions
o Example: If 100 transactions are made and 20 contain milk, then Support(Milk) = 20/100 =
0.2
3. Confidence
o Definition: Indicates the likelihood that item B is also bought when item A is bought.
o Formula: Confidence(A→B) = Support(A∪B) / Support(A)
o Example: If 15 out of 20 milk transactions also contain bread, Confidence(Milk → Bread) =
15/20 = 0.75
4. Association Rules
o Definition: Implication expressions of the form A → B, meaning if A occurs, then B is likely
to occur.
o Components:
▪ Antecedent (A): The item(s) on the left-hand side of the rule
▪ Consequent (B): The item(s) on the right-hand side
▪ Metrics: Support, Confidence, and Lift (optional, for strength of rule)
o Example:
▪ Rule: If Milk, then Bread
▪ Support = 0.2, Confidence = 0.75
These techniques are widely used in recommendation systems, retail analytics, and customer behavior
analysis.
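The support and confidence computations above are easy to reproduce in Python. Below is a minimal sketch (plain Python, no libraries) over a small hypothetical transaction list; the transactions are invented for illustration and do not come from a real dataset:

transactions = [
    {'milk', 'bread'},
    {'milk'},
    {'bread', 'butter'},
    {'milk', 'bread'},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Confidence(A -> B) = Support(A ∪ B) / Support(A)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({'milk'}, transactions))                # Support(Milk) = 3/4 = 0.75
print(confidence({'milk'}, {'bread'}, transactions))  # Confidence(Milk -> Bread) = 2/3

A full Apriori implementation would repeat the support computation over progressively larger candidate itemsets, pruning any candidate whose support falls below the minimum threshold.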
3. Logistic Regression, Its Need, Types, and Use of Logistic & Sigmoid Function.

1. Logistic Regression
• Definition: Logistic Regression is a statistical method used to model binary (yes/no, true/false) or
categorical outcomes. Unlike linear regression, it predicts the probability that a given input belongs
to a particular category.
• Purpose: Used for classification problems, not regression, despite the name.

2. Need for Logistic Regression


• When the dependent variable is categorical (typically binary, e.g., 0 or 1).
• Linear regression is not suitable because it can predict values outside the 0–1 range, which are not valid probabilities.
• Logistic Regression maps predicted values to probabilities using the logistic/sigmoid function.

3. Logistic Function (Sigmoid Function)


• Formula: σ(z) = 1 / (1 + e^(−z)),
where z = β0 + β1X1 + β2X2 + … + βnXn
• Purpose: Converts any real-valued number into a value between 0 and 1 (interpreted as a
probability).
• Behavior:
o If z is very large → output ≈ 1
o If z is very small → output ≈ 0
o If z = 0 → output = 0.5
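These three behaviours can be checked directly by evaluating the sigmoid. A minimal sketch using NumPy (the input values are arbitrary):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(10))   # ≈ 0.99995 (large z gives output near 1)
print(sigmoid(-10))  # ≈ 0.00005 (very negative z gives output near 0)
print(sigmoid(0))    # 0.5 (z = 0 gives output exactly 0.5)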

4. Types of Logistic Regression

Type | Description | Use Case
Binary Logistic Regression | Predicts one of two possible outcomes (e.g., spam vs. not spam) | Email classification
Multinomial Logistic Regression | Predicts one of three or more unordered categories | Classifying type of cuisine (Indian, Italian, Chinese)
Ordinal Logistic Regression | Predicts ordered categories | Customer satisfaction levels (Poor, Fair, Good, Excellent)
5. Use Cases
• Credit scoring (good/bad credit)
• Disease diagnosis (positive/negative)
• Churn prediction (churn/stay)
• Click-through prediction (click/no-click)
Logistic Regression is simple, efficient, and widely used for binary classification problems in machine
learning and statistics.
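As a concrete illustration, here is a minimal binary-classification sketch using Scikit-learn's LogisticRegression; the single feature and the labels are invented for demonstration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature, binary label (e.g., 1 = churn, 0 = stay)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

model = LogisticRegression()
model.fit(X, y)

# The model applies the sigmoid to z = β0 + β1x to produce probabilities
print(model.predict([[2.5]]))        # predicted class
print(model.predict_proba([[2.5]]))  # [P(class 0), P(class 1)]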

4. Linear Regression.

1. Definition
Linear Regression is a supervised learning algorithm and a statistical technique that models the
relationship between a dependent variable (target) and one or more independent variables (predictors)
using a straight line (linear function).

2. Types of Linear Regression

Type | Description | Use Case Example
Simple Linear Regression | Involves one independent variable and one dependent variable. | Predicting house price based on area.
Multiple Linear Regression | Involves two or more independent variables to predict one dependent variable. | Predicting house price based on area, location, and rooms.
Polynomial Regression | A form of linear regression where the relationship is modeled as an nth-degree polynomial (still linear in the coefficients). | Predicting complex curves such as stock prices.

3. Equation
Simple Linear Regression: Y = β0 + β1X + ϵ
Multiple Linear Regression: Y = β0 + β1X1 + β2X2 + … + βnXn + ϵ
Where:
• Y: Dependent variable (target)
• X: Independent variable(s)
• β0: Intercept (constant)
• βn: Coefficients (slopes for predictors)
• ϵ: Error term (residuals)
4. Assumptions of Linear Regression
• Linearity: The relationship between the dependent and independent variables is linear.
• Independence: Observations are independent of each other.
• Homoscedasticity: Constant variance of the residuals.
• Normality: Residuals should be normally distributed.

5. Use Cases
• Predicting sales based on advertising budget.
• Estimating exam scores from study hours.
• Forecasting revenue from website traffic.
Linear Regression is fundamental in data science and serves as a baseline model for many regression
problems.
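To make the equation concrete, a minimal simple-linear-regression sketch in Scikit-learn follows; the study-hours and exam-score values are invented for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied (X) vs. exam score (Y)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([52, 58, 61, 67, 73])

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)      # β0 (intercept)
print(model.coef_)           # β1 (slope)
print(model.predict([[6]]))  # predicted score for 6 hours of study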
5. Data Preprocessing and Handling Missing Data, Data Transformation, Removing Duplicates, and
Essential Python Libraries.

1. Data Preprocessing
Data preprocessing is the crucial initial step in any data science or machine learning project. It involves
preparing raw data for analysis or modeling by cleaning, transforming, and organizing it.

2. Handling Missing Data

Method | Description
Removing Missing Values | Use dropna() to remove rows/columns with missing values.
Imputation (Mean/Median/Mode) | Fill missing values using fillna() with statistical values like the mean or median.
Forward/Backward Fill | Use method='ffill' or 'bfill' to propagate non-null values forward or backward.
Model-based Imputation | Predict missing values using regression or kNN methods.

Example (Python):
import pandas as pd

df = pd.read_csv('data.csv')
# Fill missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

3. Data Transformation

Technique | Description
Normalization | Scales values to the range [0, 1] using Min-Max scaling.
Standardization | Centers data to have a mean of 0 and a standard deviation of 1.
Encoding Categorical Variables | Converts categories to numeric values using Label Encoding or One-Hot Encoding (see the sketch after the standardization example below).

Example (Standardization):
from sklearn.preprocessing import StandardScaler

# Scale the 'age' and 'income' columns to mean 0 and standard deviation 1
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['age', 'income']])
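Encoding of categorical variables, listed in the table above, can be done with pandas alone. A minimal sketch; the 'city' column and its values are hypothetical:

import pandas as pd

df = pd.DataFrame({'city': ['Pune', 'Mumbai', 'Pune', 'Delhi']})

# One-Hot Encoding: one binary indicator column per category
one_hot = pd.get_dummies(df['city'], prefix='city')

# Label Encoding: one integer code per category
df['city_code'] = df['city'].astype('category').cat.codes

print(one_hot)
print(df)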
4. Removing Duplicates

Method | Description
drop_duplicates() | Removes duplicate rows from the DataFrame.
keep='first'/'last'/False | Option to keep the first, last, or no duplicate records.

Example (Python):
# Keeps the first occurrence of each row; pass keep='last' or keep=False to change this
df = df.drop_duplicates()

5. Essential Python Libraries for Preprocessing

Library | Purpose
Pandas | Data manipulation, missing value handling, duplicates, basic transformations.
NumPy | Numerical operations, array manipulation.
Scikit-learn | Advanced preprocessing like scaling, encoding, and imputation.
Missingno | Visualizing missing data.
Feature-engine | Specialized preprocessing like transformation, binning, encoding, etc.


6. Decision Tree, Its Parts, and Criteria Used for Splitting Nodes

1. What is a Decision Tree?


A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It works by recursively splitting the dataset into subsets based on the feature that yields the best split according to a chosen criterion (for example, the highest information gain).

2. Parts of a Decision Tree

Part | Description
Root Node | The topmost node that represents the entire dataset and the first split.
Internal Nodes | Nodes that represent a test on an attribute (feature).
Branches | Edges connecting nodes, representing outcomes of tests.
Leaf/Terminal Nodes | Final nodes that contain the output label or prediction.

3. Criteria for Splitting Nodes


Splitting is the process of dividing a node into two or more sub-nodes based on a feature. The goal is to
increase the homogeneity of the resultant nodes. Different criteria are used depending on the problem type:

A. For Classification Trees

Criterion | Description
Gini Impurity | Measures the impurity (or purity) of a node. Lower Gini means purer nodes.
Entropy / Information Gain | Measures the information gained by a split. Higher gain is preferred.
Chi-square | Measures the statistical significance of differences in distributions.
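As a worked example of the first two criteria, the sketch below computes Gini impurity and entropy for a node's class counts (the counts are hypothetical):

import numpy as np

def gini(counts):
    # Gini impurity: 1 − Σ p_i². 0 means a perfectly pure node.
    p = np.array(counts) / sum(counts)
    return 1 - np.sum(p ** 2)

def entropy(counts):
    # Entropy: −Σ p_i · log2(p_i). 0 means a perfectly pure node.
    p = np.array(counts) / sum(counts)
    p = p[p > 0]  # skip zero probabilities to avoid log(0)
    return -np.sum(p * np.log2(p))

print(gini([5, 5]))     # 0.5 (maximally impure binary node)
print(gini([10, 0]))    # 0.0 (pure node)
print(entropy([5, 5]))  # 1.0 (maximum entropy for two classes)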

B. For Regression Trees

Criterion | Description
Mean Squared Error (MSE) | Splits based on minimizing the squared difference between predicted and actual values.
Mean Absolute Error (MAE) | Similar to MSE but uses absolute differences.
Reduction in Variance | Chooses splits that reduce the variance of the target variable in the child nodes.
4. Advantages of Decision Trees
• Easy to understand and interpret
• Can handle both numerical and categorical data
• Requires little data preprocessing

5. Limitations
• Prone to overfitting (solved using pruning or ensemble methods like Random Forest)
• Unstable with small variations in data
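A minimal Scikit-learn sketch tying these pieces together; it trains a shallow classification tree on the Iris dataset and prints the learned splits (root, internal nodes, leaves):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# criterion='gini' is the default; criterion='entropy' uses information gain instead
tree = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Text rendering of the tree: each line is a test (internal node) or a class (leaf)
print(export_text(tree, feature_names=list(iris.feature_names)))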
7. Naïve Bayes’ Classifier and Its Applications.

1. What is Naïve Bayes’ Classifier?


Naïve Bayes is a probabilistic supervised learning algorithm based on Bayes' Theorem. It is called
"naïve" because it assumes that all features (attributes) are independent of each other given the class label
— which is rarely true in real-world data, but works surprisingly well in many scenarios.

2. Bayes' Theorem
P(A∣B) = P(B∣A) ⋅ P(A) / P(B)
Where:
• P(A∣B): Posterior probability of class A given feature B
• P(B∣A): Likelihood of feature B given class A
• P(A): Prior probability of class A
• P(B): Prior probability of feature B

3. Types of Naïve Bayes Classifiers

Type | Description
Gaussian Naïve Bayes | Assumes features follow a normal distribution (used for continuous data).
Multinomial Naïve Bayes | Used for discrete counts (e.g., word counts in text classification).
Bernoulli Naïve Bayes | Used for binary/boolean features (e.g., word present or not).

4. Advantages
• Simple and fast to implement
• Works well with high-dimensional data
• Effective for text classification and spam filtering
• Requires less training data

5. Limitations
• Assumes independence among features
• Performs poorly if this assumption is strongly violated
• Not suitable for regression tasks
6. Applications of Naïve Bayes

Domain | Application Example
Text Classification | Spam detection, sentiment analysis, email filtering
Medical Diagnosis | Predicting diseases based on symptoms
Recommendation Systems | Predicting user preferences based on past behavior
Fraud Detection | Identifying fraudulent transactions
Document Categorization | Classifying news articles, product reviews, or research papers

Naïve Bayes is a fast and efficient classifier particularly well-suited for text-based tasks such as spam
detection and sentiment analysis. Despite its simplifying assumptions, it often performs competitively with
more complex models, making it a popular choice for many real-world applications.
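To illustrate, here is a minimal Multinomial Naïve Bayes sketch for spam detection with Scikit-learn; the four example messages are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus: label 1 = spam, 0 = not spam
messages = ["win a free prize now", "meeting at noon tomorrow",
            "free offer click now", "lunch with the team"]
labels = [1, 0, 1, 0]

# Multinomial NB operates on discrete word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["free prize offer"])))  # expected: [1] (spam)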
8. Scikit-learn Library and Visualization Using Matplotlib – With Example.

1. Scikit-learn Overview
Scikit-learn (sklearn) is a powerful open-source Python library for machine learning, providing simple
and efficient tools for data mining and data analysis. It supports various algorithms such as classification,
regression, clustering, dimensionality reduction, and model selection.

2. Matplotlib Overview
Matplotlib is a comprehensive library used for creating static, animated, and interactive visualizations in
Python. When used alongside Scikit-learn, it helps visualize model performance, decision boundaries, and
data distributions.

Simple Example: Iris Dataset with KNN and Matplotlib


from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Load iris dataset
iris = load_iris()
X = iris.data[:, :2]  # First two features for plotting
y = iris.target

# Train k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Predict using the model (on the training data, for simplicity)
y_pred = knn.predict(X)

# Plot the data points, colored by predicted class
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('KNN Classification (k=3)')
plt.show()

What it does:
• Loads the Iris dataset.
• Trains a simple k-NN model.
• Predicts the same data (for simplicity).
• Plots the points colored by predicted class.
