DSBDA Unit 4
1. Types of Data Analytics.
1. Descriptive Analytics
o Purpose: Understand what has happened in the past.
o Function: Summarizes raw data to make it interpretable by humans.
o Techniques: Data aggregation, data mining, reporting, dashboards.
o Example: Monthly website traffic reports, sales trend analysis.
2. Diagnostic Analytics
o Purpose: Understand why something happened.
o Function: Finds the root cause of outcomes and trends.
o Techniques: Drill-down, data discovery, correlation analysis.
o Example: Analyzing the reason for a decline in sales in a specific quarter.
3. Predictive Analytics
o Purpose: Forecast future outcomes based on historical data.
o Function: Uses statistical models and machine learning to predict trends.
o Techniques: Regression analysis, decision trees, neural networks.
o Example: Predicting customer churn or future product demand.
4. Prescriptive Analytics
o Purpose: Recommend actions to achieve desired outcomes.
o Function: Suggests decisions based on predictive models and optimization.
o Techniques: Simulation, optimization algorithms, decision analysis.
o Example: Recommending pricing strategies or supply chain adjustments.
5. Cognitive Analytics
o Purpose: Mimic human thinking to interpret complex data.
o Function: Leverages AI and natural language processing (NLP) to understand unstructured
data.
o Techniques: Machine learning, deep learning, NLP.
o Example: Chatbots that analyze and respond to customer queries in real-time.
2. Apriori Algorithm, Support and Confidence Value, and Association Rules.
1. Apriori Algorithm
o Definition: Apriori is an algorithm used in association rule mining to identify frequent
itemsets in a dataset and derive rules that explain relationships among items.
o Purpose: It helps in market basket analysis to discover item combinations that occur
frequently together.
o Process:
1. Identify frequent individual items using a minimum support threshold.
2. Extend these to larger itemsets as long as they appear frequently.
3. Use the frequent itemsets to generate association rules.
2. Support
o Definition: Indicates how frequently an itemset appears in the dataset.
o Formula: Support(A) = (Number of transactions containing A) / (Total number of transactions)
o Example: If 100 transactions are made and 20 contain milk, then Support(Milk) = 20/100 =
0.2
3. Confidence
o Definition: Indicates the likelihood that item B is also bought when item A is bought.
o Formula: Confidence(A→B) = Support(A∪B) / Support(A)
o Example: If 15 out of 20 milk transactions also contain bread, Confidence(Milk → Bread) =
15/20 = 0.75
4. Association Rules
o Definition: Implication expressions of the form A → B, meaning if A occurs, then B is likely
to occur.
o Components:
▪ Antecedent (A): The item(s) on the left-hand side of the rule
▪ Consequent (B): The item(s) on the right-hand side
▪ Metrics: Support, Confidence, and Lift (optional, for strength of rule)
o Example:
▪ Rule: If Milk, then Bread
▪ Support = 0.2, Confidence = 0.75
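Example (Python) – a minimal sketch using the mlxtend library (an assumption; these notes do not name a specific association-mining library), with illustrative one-hot transaction data:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
# Toy one-hot encoded transactions (illustrative data)
transactions = pd.DataFrame({
    'Milk': [True, True, False, True, False],
    'Bread': [True, True, False, False, True],
    'Eggs': [False, True, True, False, False],
})
# Steps 1-2: find frequent itemsets above a minimum support threshold
frequent = apriori(transactions, min_support=0.4, use_colnames=True)
# Step 3: derive association rules filtered by a confidence threshold
rules = association_rules(frequent, metric='confidence', min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])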
These techniques are widely used in recommendation systems, retail analytics, and customer behavior
analysis.
3. Logistic Regression, Its Need, Types, and Use of Logistic & Sigmoid Function.
1. Logistic Regression
• Definition: Logistic Regression is a statistical method used to model binary (yes/no, true/false) or
categorical outcomes. Unlike linear regression, it predicts the probability that a given input belongs
to a particular category.
• Purpose: Used for classification problems, not regression, despite the name.
2. Need for Logistic Regression
Linear regression is unsuitable for categorical outcomes because its output is unbounded; logistic regression instead outputs a probability between 0 and 1, which can be thresholded into class labels.
3. Types of Logistic Regression
• Binary Logistic Regression: Predicts one of two possible outcomes. Example: spam vs. not spam.
• Multinomial Logistic Regression: Predicts one of three or more unordered categories. Example: classifying type of cuisine (Indian, Italian, Chinese).
• Ordinal Logistic Regression: Predicts one of three or more ordered categories. Example: ratings (low, medium, high).
4. Logistic & Sigmoid Function
The sigmoid (logistic) function maps any real-valued input z to a value between 0 and 1:
Sigmoid(z) = 1 / (1 + e^(−z))
Logistic regression applies the sigmoid to a linear combination of the inputs, z = β0 + β1X1 + … + βnXn, and assigns a class by thresholding the resulting probability (commonly at 0.5).
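Example (Python) – a minimal sketch with scikit-learn; the hours-studied data is illustrative:
import numpy as np
from sklearn.linear_model import LogisticRegression
# Illustrative data: hours studied (X) vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression()
model.fit(X, y)
# predict_proba returns the sigmoid-based probability of each class
print(model.predict_proba([[3.5]]))
print(model.predict([[5]]))  # predicted class label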
4. Linear Regression.
1. Definition
Linear Regression is a supervised learning algorithm and a statistical technique that models the
relationship between a dependent variable (target) and one or more independent variables (predictors)
using a straight line (linear function).
2. Types of Linear Regression
• Simple Linear Regression: Involves one independent variable and one dependent variable. Example: predicting house price based on area.
• Multiple Linear Regression: Involves two or more independent variables to predict one dependent variable. Example: predicting house price based on area, location, and rooms.
• Polynomial Linear Regression: A form of linear regression where the relationship is modeled as an nth-degree polynomial. Example: predicting complex curves such as stock prices.
3. Equation
Simple Linear Regression: Y= β0+ β1X + ϵ
Multiple Linear Regression: Y= β0 +β1X1 + β2X2 +…+ βnXn+ϵ
Where:
• Y: Dependent variable (target)
• X: Independent variable(s)
• β0: Intercept (constant)
• βn: Coefficients (slopes for predictors)
• ϵ: Error term (residuals)
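Example (Python) – a minimal sketch with scikit-learn; the budget/sales numbers are illustrative:
import numpy as np
from sklearn.linear_model import LinearRegression
# Illustrative data: advertising budget (X) vs. sales (Y)
X = np.array([[10], [20], [30], [40], [50]])
y = np.array([25, 45, 62, 85, 105])
model = LinearRegression()
model.fit(X, y)
print(model.intercept_, model.coef_)  # fitted β0 and β1
print(model.predict([[60]]))  # forecast for a new budget value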
4. Assumptions of Linear Regression
• Linearity: The relationship between the dependent and independent variables is linear.
• Independence: Observations are independent of each other.
• Homoscedasticity: Constant variance of the residuals.
• Normality: Residuals should be normally distributed.
5. Use Cases
• Predicting sales based on advertising budget.
• Estimating exam scores from study hours.
• Forecasting revenue from website traffic.
Linear Regression is fundamental in data science and serves as a baseline model for many regression
problems.
5. Data Preprocessing and Handling Missing Data, Data Transformation, Removing Duplicates, and
Essential Python Libraries.
1. Data Preprocessing
Data preprocessing is the crucial initial step in any data science or machine learning project. It involves
preparing raw data for analysis or modeling by cleaning, transforming, and organizing it.
2. Handling Missing Data
• Removing Missing Values: Use dropna() to remove rows/columns with missing values.
• Imputation (Mean/Median/Mode): Fill missing values using fillna() with statistical values like the mean or median.
Example (Python):
import pandas as pd
df = pd.read_csv('data.csv')
# Fill missing values in numeric columns with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)
3. Data Transformation
• Standardization: Rescales features to zero mean and unit variance (e.g., with StandardScaler).
• Normalization: Rescales features to a fixed range such as 0–1 (e.g., with MinMaxScaler).
Example (Standardization):
from sklearn.preprocessing import StandardScaler
# Rescale the selected numeric columns to zero mean and unit variance
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['age', 'income']])
4. Removing Duplicates
• drop_duplicates(): Removes duplicate rows from a DataFrame (keeps the first occurrence by default).
Example (Python):
df = df.drop_duplicates()
5. Essential Python Libraries
• Pandas: Data manipulation and analysis (DataFrames, reading/writing files).
• NumPy: Fast numerical computation on arrays.
• Matplotlib: Plotting and visualization.
• Scikit-learn: Machine learning algorithms and preprocessing utilities.
6. Decision Tree – Structure, Splitting Criteria, Advantages, and Limitations.
1. Structure of a Decision Tree
• Root Node: The topmost node that represents the entire dataset and the first split.
• Leaf/Terminal Nodes: Final nodes that contain the output label or prediction.
2. Splitting Criteria for Classification
• Gini Impurity: Measures the impurity or purity of a node. Lower Gini means purer nodes.
• Entropy / Information Gain: Measures the information gain after splitting. Higher gain is preferred.
3. Splitting Criteria for Regression
• Mean Squared Error (MSE): Splits based on minimizing the squared difference between predicted and actual values.
• Reduction in Variance: Chooses splits that reduce the variance of the target variable in the child nodes.
4. Advantages of Decision Trees
• Easy to understand and interpret
• Can handle both numerical and categorical data
• Requires little data preprocessing
5. Limitations
• Prone to overfitting (solved using pruning or ensemble methods like Random Forest)
• Unstable with small variations in data
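Example (Python) – a minimal sketch of a decision tree classifier with scikit-learn (the dataset, criterion, and depth are illustrative choices):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
# criterion='gini' uses Gini impurity; 'entropy' uses information gain
# max_depth limits tree growth, a simple guard against overfitting
tree = DecisionTreeClassifier(criterion='gini', max_depth=3)
tree.fit(X, y)
print(tree.predict(X[:5]))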
7. Naïve Bayes’ Classifier and Its Applications.
1. Definition
Naïve Bayes is a probabilistic classifier based on Bayes' theorem, with the "naïve" assumption that all features are independent of one another given the class.
2. Bayes' Theorem
P(A∣B) = [P(B∣A) ⋅ P(A)] / P(B)
Where:
• P(A∣B): Posterior probability of class A given feature B
• P(B∣A): Likelihood of feature B given class A
• P(A): Prior probability of class A
• P(B): Prior probability of feature B
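Worked example (hypothetical numbers): if P(Spam) = 0.3, P("free" ∣ Spam) = 0.6, and P("free") = 0.25, then P(Spam ∣ "free") = (0.6 × 0.3) / 0.25 = 0.72.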
3. Types of Naïve Bayes
• Gaussian Naïve Bayes: Assumes features follow a normal distribution (used for continuous data).
• Multinomial Naïve Bayes: Used for discrete counts (e.g., word counts in text classification).
• Bernoulli Naïve Bayes: Used for binary/boolean features (e.g., word present or not).
4. Advantages
• Simple and fast to implement
• Works well with high-dimensional data
• Effective for text classification and spam filtering
• Requires less training data
5. Limitations
• Assumes independence among features
• Performs poorly if this assumption is strongly violated
• Not suitable for regression tasks
6. Applications of Naïve Bayes
Naïve Bayes is a fast and efficient classifier particularly well-suited for text-based tasks such as spam
detection and sentiment analysis. Despite its simplifying assumptions, it often performs competitively with
more complex models, making it a popular choice for many real-world applications.
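Example (Python) – a minimal sketch of the spam-detection use case with scikit-learn; the tiny corpus and labels are illustrative:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Illustrative spam-detection corpus
texts = ['win a free prize now', 'meeting at noon today',
         'free offer click now', 'lunch with the team']
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam
vec = CountVectorizer()  # Multinomial NB works on word-count features
X = vec.fit_transform(texts)
clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vec.transform(['free prize today'])))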
8. Scikit-learn Library and Visualization Using Matplotlib – With Example.
1. Scikit-learn Overview
Scikit-learn (sklearn) is a powerful open-source Python library for machine learning, providing simple
and efficient tools for data mining and data analysis. It supports various algorithms such as classification,
regression, clustering, dimensionality reduction, and model selection.
2. Matplotlib Overview
Matplotlib is a comprehensive library used for creating static, animated, and interactive visualizations in
Python. When used alongside Scikit-learn, it helps visualize model performance, decision boundaries, and
data distributions.
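3. Example (Python)
A minimal sketch consistent with the steps listed below (the choice of k = 3 and of the two plotted features are assumptions):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Train a simple k-NN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
# Predict the same data (for simplicity)
pred = knn.predict(X)
# Plot the points colored by predicted class
plt.scatter(X[:, 0], X[:, 1], c=pred)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('Iris samples colored by k-NN prediction')
plt.show()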
What it does:
• Loads the Iris dataset.
• Trains a simple k-NN model.
• Predicts the same data (for simplicity).
• Plots the points colored by predicted class.