
Rohit

Unit 4 Imp. Points + PYQs

1. Types of Analytics in Big Data.

1. Descriptive Analytics
o Purpose: Understand what has happened in the past.
o Function: Summarizes raw data to make it interpretable by humans.
o Techniques: Data aggregation, data mining, reporting, dashboards.
o Example: Monthly website traffic reports, sales trend analysis.
2. Diagnostic Analytics
o Purpose: Understand why something happened.
o Function: Finds the root cause of outcomes and trends.
o Techniques: Drill-down, data discovery, correlation analysis.
o Example: Analyzing the reason for a decline in sales in a specific quarter.
3. Predictive Analytics
o Purpose: Forecast future outcomes based on historical data.
o Function: Uses statistical models and machine learning to predict trends.
o Techniques: Regression analysis, decision trees, neural networks.
o Example: Predicting customer churn or future product demand.
4. Prescriptive Analytics
o Purpose: Recommend actions to achieve desired outcomes.
o Function: Suggests decisions based on predictive models and optimization.
o Techniques: Simulation, optimization algorithms, decision analysis.
o Example: Recommending pricing strategies or supply chain adjustments.
5. Cognitive Analytics
o Purpose: Mimic human thinking to interpret complex data.
o Function: Leverages AI and natural language processing (NLP) to understand unstructured
data.
o Techniques: Machine learning, deep learning, NLP.
o Example: Chatbots that analyze and respond to customer queries in real time.
2. Apriori Algorithm, Support and Confidence Value, and Association Rules.

1. Apriori Algorithm
o Definition: Apriori is an algorithm used in association rule mining to identify frequent
itemsets in a dataset and derive rules that explain relationships among items.
o Purpose: It helps in market basket analysis to discover item combinations that occur
frequently together.
o Process:
1. Identify frequent individual items using a minimum support threshold.
2. Extend these to larger itemsets as long as they appear frequently.
3. Use the frequent itemsets to generate association rules.
2. Support
o Definition: Indicates how frequently an itemset appears in the dataset.
o Formula: Support(A)= Number of transactions containing (A) / Total number of transactions
o Example: If 100 transactions are made and 20 contain milk, then Support(Milk) = 20/100 =
0.2
3. Confidence
o Definition: Indicates the likelihood that item B is also bought when item A is bought.
o Formula: Confidence(A→B) = Support(A∪B) / Support(A)
o Example: If 15 out of 20 milk transactions also contain bread, Confidence(Milk → Bread) =
15/20 = 0.75
4. Association Rules
o Definition: Implication expressions of the form A → B, meaning if A occurs, then B is likely
to occur.
o Components:
▪ Antecedent (A): The item(s) on the left-hand side of the rule
▪ Consequent (B): The item(s) on the right-hand side
▪ Metrics: Support, Confidence, and Lift (optional, for strength of rule)
o Example:
▪ Rule: If Milk, then Bread
▪ Support = 0.2, Confidence = 0.75
These techniques are widely used in recommendation systems, retail analytics, and customer behavior
analysis.
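The support and confidence computations above are easy to reproduce in Python. Below is a minimal sketch (plain Python, no libraries) over a small hypothetical transaction list; the transactions are invented for illustration and do not come from a real dataset:

transactions = [
    {'milk', 'bread'},
    {'milk'},
    {'bread', 'butter'},
    {'milk', 'bread'},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Confidence(A -> B) = Support(A ∪ B) / Support(A)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({'milk'}, transactions))                # Support(Milk) = 3/4 = 0.75
print(confidence({'milk'}, {'bread'}, transactions))  # Confidence(Milk -> Bread) = 2/3

A full Apriori implementation would repeat the support computation over progressively larger candidate itemsets, pruning any candidate whose support falls below the minimum threshold.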
3. Logistic Regression, Its Need, Types, and Use of Logistic & Sigmoid Function.

1. Logistic Regression
• Definition: Logistic Regression is a statistical method used to model binary (yes/no, true/false) or
categorical outcomes. Unlike linear regression, it predicts the probability that a given input belongs
to a particular category.
• Purpose: Used for classification problems, not regression, despite the name.

2. Need for Logistic Regression


• When the dependent variable is categorical (typically binary, e.g., 0 or 1).
• Linear regression is not suitable because it can predict values outside the 0–1 range, which are not valid probabilities.
• Logistic Regression maps predicted values to probabilities using the logistic/sigmoid function.

3. Logistic Function (Sigmoid Function)


• Formula: σ(z) = 1 / (1 + e^(−z)),
where z = β0 + β1X1 + β2X2 + … + βnXn
• Purpose: Converts any real-valued number into a value between 0 and 1 (interpreted as a
probability).
• Behavior:
o If z is very large → output ≈ 1
o If z is very small → output ≈ 0
o If z = 0 → output = 0.5
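These three behaviours can be checked directly by evaluating the sigmoid. A minimal sketch using NumPy (the input values are arbitrary):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(10))   # ≈ 0.99995 (large z gives output near 1)
print(sigmoid(-10))  # ≈ 0.00005 (very negative z gives output near 0)
print(sigmoid(0))    # 0.5 (z = 0 gives output exactly 0.5)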

4. Types of Logistic Regression

Type | Description | Use Case
Binary Logistic Regression | Predicts one of two possible outcomes (e.g., spam vs. not spam) | Email classification
Multinomial Logistic Regression | Predicts one of three or more unordered categories | Classifying type of cuisine (Indian, Italian, Chinese)
Ordinal Logistic Regression | Predicts ordered categories | Customer satisfaction levels (Poor, Fair, Good, Excellent)
5. Use Cases
• Credit scoring (good/bad credit)
• Disease diagnosis (positive/negative)
• Churn prediction (churn/stay)
• Click-through prediction (click/no-click)
Logistic Regression is simple, efficient, and widely used for binary classification problems in machine
learning and statistics.
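As a concrete illustration, here is a minimal binary-classification sketch using Scikit-learn's LogisticRegression; the single feature and the labels are invented for demonstration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature, binary label (e.g., 1 = churn, 0 = stay)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

model = LogisticRegression()
model.fit(X, y)

# The model applies the sigmoid to z = β0 + β1x to produce probabilities
print(model.predict([[2.5]]))        # predicted class
print(model.predict_proba([[2.5]]))  # [P(class 0), P(class 1)]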

4. Linear Regression.

1. Definition
Linear Regression is a supervised learning algorithm and a statistical technique that models the
relationship between a dependent variable (target) and one or more independent variables (predictors)
using a straight line (linear function).

2. Types of Linear Regression

Type | Description | Use Case Example
Simple Linear Regression | Involves one independent variable and one dependent variable. | Predicting house price based on area.
Multiple Linear Regression | Involves two or more independent variables to predict one dependent variable. | Predicting house price based on area, location, and rooms.
Polynomial Regression | A form of linear regression where the relationship is modeled as an nth-degree polynomial (still linear in the coefficients). | Predicting complex curves such as stock prices.

3. Equation
Simple Linear Regression: Y = β0 + β1X + ϵ
Multiple Linear Regression: Y = β0 + β1X1 + β2X2 + … + βnXn + ϵ
Where:
• Y: Dependent variable (target)
• X: Independent variable(s)
• β0: Intercept (constant)
• βn: Coefficients (slopes for predictors)
• ϵ: Error term (residuals)
4. Assumptions of Linear Regression
• Linearity: The relationship between the dependent and independent variables is linear.
• Independence: Observations are independent of each other.
• Homoscedasticity: Constant variance of the residuals.
• Normality: Residuals should be normally distributed.

5. Use Cases
• Predicting sales based on advertising budget.
• Estimating exam scores from study hours.
• Forecasting revenue from website traffic.
Linear Regression is fundamental in data science and serves as a baseline model for many regression
problems.
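To make the equation concrete, a minimal simple-linear-regression sketch in Scikit-learn follows; the study-hours and exam-score values are invented for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied (X) vs. exam score (Y)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([52, 58, 61, 67, 73])

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)      # β0 (intercept)
print(model.coef_)           # β1 (slope)
print(model.predict([[6]]))  # predicted score for 6 hours of study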
5. Data Preprocessing and Handling Missing Data, Data Transformation, Removing Duplicates, and
Essential Python Libraries.

1. Data Preprocessing
Data preprocessing is the crucial initial step in any data science or machine learning project. It involves
preparing raw data for analysis or modeling by cleaning, transforming, and organizing it.

2. Handling Missing Data

Method | Description
Removing Missing Values | Use dropna() to remove rows/columns with missing values.
Imputation (Mean/Median/Mode) | Fill missing values using fillna() with statistical values like the mean or median.
Forward/Backward Fill | Use method='ffill' or 'bfill' to propagate non-null values forward or backward.
Model-based Imputation | Predict missing values using regression or kNN methods.

Example (Python):
import pandas as pd

df = pd.read_csv('data.csv')
# Fill missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

3. Data Transformation

Technique | Description
Normalization | Scales values to the range [0, 1] using Min-Max scaling.
Standardization | Centers data to have a mean of 0 and a standard deviation of 1.
Encoding Categorical Variables | Converts categories to numeric values using Label Encoding or One-Hot Encoding (see the sketch after the standardization example below).

Example (Standardization):
from sklearn.preprocessing import StandardScaler

# Scale the 'age' and 'income' columns to mean 0 and standard deviation 1
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['age', 'income']])
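Encoding of categorical variables, listed in the table above, can be done with pandas alone. A minimal sketch; the 'city' column and its values are hypothetical:

import pandas as pd

df = pd.DataFrame({'city': ['Pune', 'Mumbai', 'Pune', 'Delhi']})

# One-Hot Encoding: one binary indicator column per category
one_hot = pd.get_dummies(df['city'], prefix='city')

# Label Encoding: one integer code per category
df['city_code'] = df['city'].astype('category').cat.codes

print(one_hot)
print(df)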
4. Removing Duplicates

Method | Description
drop_duplicates() | Removes duplicate rows from the DataFrame.
keep='first'/'last'/False | Option to keep the first, last, or no duplicate records.

Example (Python):
# Keeps the first occurrence of each row; pass keep='last' or keep=False to change this
df = df.drop_duplicates()

5. Essential Python Libraries for Preprocessing

Library | Purpose
Pandas | Data manipulation, missing value handling, duplicates, basic transformations.
NumPy | Numerical operations, array manipulation.
Scikit-learn | Advanced preprocessing like scaling, encoding, and imputation.
Missingno | Visualizing missing data.
Feature-engine | Specialized preprocessing like transformation, binning, encoding, etc.


6. Decision Tree, Its Parts, and Criteria Used for Splitting Nodes

1. What is a Decision Tree?


A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It works by recursively splitting the dataset into subsets based on the feature that yields the best split according to a chosen criterion (for example, the highest information gain).

2. Parts of a Decision Tree

Part | Description
Root Node | The topmost node that represents the entire dataset and the first split.
Internal Nodes | Nodes that represent a test on an attribute (feature).
Branches | Edges connecting nodes, representing outcomes of tests.
Leaf/Terminal Nodes | Final nodes that contain the output label or prediction.

3. Criteria for Splitting Nodes


Splitting is the process of dividing a node into two or more sub-nodes based on a feature. The goal is to
increase the homogeneity of the resultant nodes. Different criteria are used depending on the problem type:

A. For Classification Trees

Criterion | Description
Gini Impurity | Measures the impurity (or purity) of a node. Lower Gini means purer nodes.
Entropy / Information Gain | Measures the information gained by a split. Higher gain is preferred.
Chi-square | Measures the statistical significance of differences in distributions.
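As a worked example of the first two criteria, the sketch below computes Gini impurity and entropy for a node's class counts (the counts are hypothetical):

import numpy as np

def gini(counts):
    # Gini impurity: 1 − Σ p_i². 0 means a perfectly pure node.
    p = np.array(counts) / sum(counts)
    return 1 - np.sum(p ** 2)

def entropy(counts):
    # Entropy: −Σ p_i · log2(p_i). 0 means a perfectly pure node.
    p = np.array(counts) / sum(counts)
    p = p[p > 0]  # skip zero probabilities to avoid log(0)
    return -np.sum(p * np.log2(p))

print(gini([5, 5]))     # 0.5 (maximally impure binary node)
print(gini([10, 0]))    # 0.0 (pure node)
print(entropy([5, 5]))  # 1.0 (maximum entropy for two classes)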

B. For Regression Trees

Criterion | Description
Mean Squared Error (MSE) | Splits based on minimizing the squared difference between predicted and actual values.
Mean Absolute Error (MAE) | Similar to MSE but uses absolute differences.
Reduction in Variance | Chooses splits that reduce the variance of the target variable in the child nodes.
4. Advantages of Decision Trees
• Easy to understand and interpret
• Can handle both numerical and categorical data
• Requires little data preprocessing

5. Limitations
• Prone to overfitting (solved using pruning or ensemble methods like Random Forest)
• Unstable with small variations in data
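A minimal Scikit-learn sketch tying these pieces together; it trains a shallow classification tree on the Iris dataset and prints the learned splits (root, internal nodes, leaves):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# criterion='gini' is the default; criterion='entropy' uses information gain instead
tree = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Text rendering of the tree: each line is a test (internal node) or a class (leaf)
print(export_text(tree, feature_names=list(iris.feature_names)))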
7. Naïve Bayes’ Classifier and Its Applications.

1. What is Naïve Bayes’ Classifier?


Naïve Bayes is a probabilistic supervised learning algorithm based on Bayes' Theorem. It is called
"naïve" because it assumes that all features (attributes) are independent of each other given the class label
— which is rarely true in real-world data, but works surprisingly well in many scenarios.

2. Bayes' Theorem
P(A∣B) = P(B∣A) ⋅ P(A) / P(B)
Where:
• P(A∣B): Posterior probability of class A given feature B
• P(B∣A): Likelihood of feature B given class A
• P(A): Prior probability of class A
• P(B): Prior probability of feature B

3. Types of Naïve Bayes Classifiers

Type | Description
Gaussian Naïve Bayes | Assumes features follow a normal distribution (used for continuous data).
Multinomial Naïve Bayes | Used for discrete counts (e.g., word counts in text classification).
Bernoulli Naïve Bayes | Used for binary/boolean features (e.g., word present or not).

4. Advantages
• Simple and fast to implement
• Works well with high-dimensional data
• Effective for text classification and spam filtering
• Requires less training data

5. Limitations
• Assumes independence among features
• Performs poorly if this assumption is strongly violated
• Not suitable for regression tasks
6. Applications of Naïve Bayes

Domain | Application Example
Text Classification | Spam detection, sentiment analysis, email filtering
Medical Diagnosis | Predicting diseases based on symptoms
Recommendation Systems | Predicting user preferences based on past behavior
Fraud Detection | Identifying fraudulent transactions
Document Categorization | Classifying news articles, product reviews, or research papers

Naïve Bayes is a fast and efficient classifier particularly well-suited for text-based tasks such as spam
detection and sentiment analysis. Despite its simplifying assumptions, it often performs competitively with
more complex models, making it a popular choice for many real-world applications.
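To illustrate, here is a minimal Multinomial Naïve Bayes sketch for spam detection with Scikit-learn; the four example messages are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus: label 1 = spam, 0 = not spam
messages = ["win a free prize now", "meeting at noon tomorrow",
            "free offer click now", "lunch with the team"]
labels = [1, 0, 1, 0]

# Multinomial NB operates on discrete word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["free prize offer"])))  # expected: [1] (spam)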
8. Scikit-learn Library and Visualization Using Matplotlib – With Example.

1. Scikit-learn Overview
Scikit-learn (sklearn) is a powerful open-source Python library for machine learning, providing simple
and efficient tools for data mining and data analysis. It supports various algorithms such as classification,
regression, clustering, dimensionality reduction, and model selection.

2. Matplotlib Overview
Matplotlib is a comprehensive library used for creating static, animated, and interactive visualizations in
Python. When used alongside Scikit-learn, it helps visualize model performance, decision boundaries, and
data distributions.

Simple Example: Iris Dataset with KNN and Matplotlib


from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Load iris dataset
iris = load_iris()
X = iris.data[:, :2]  # First two features for plotting
y = iris.target

# Train k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Predict using the model (on the training data, for simplicity)
y_pred = knn.predict(X)

# Plot the data points, colored by predicted class
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('KNN Classification (k=3)')
plt.show()

What it does:
• Loads the Iris dataset.
• Trains a simple k-NN model.
• Predicts the same data (for simplicity).
• Plots the points colored by predicted class.
