UNIT-2
DATA ANALYSIS
REGRESSION
Regression analysis is a statistical and predictive modeling technique used in data analytics to model and quantify the relationship between a dependent (target) variable and one or more independent (predictor) variables. It measures how changes in the independent variables affect the dependent variable, enabling predictions and insights from data, and is widely used for prediction, forecasting, and causal analysis.
PURPOSE OF REGRESSION ANALYSIS
Understand relationships between variables
Predict future values of the target variable based on predictors
Forecast trends using historical data
Identify key factors influencing outcomes (positive or negative impacts)
Support data-driven decision-making across industries
TYPES OF REGRESSION
LINEAR REGRESSION: Models a linear relationship between one dependent variable and one or
more independent variables. It finds the best fit line minimizing the difference between observed
and predicted values.
MULTIPLE REGRESSION: An extension of linear regression with multiple predictor variables.
POLYNOMIAL REGRESSION: Models nonlinear relationships by introducing polynomial terms.
LOGISTIC REGRESSION: Used for classification problems where the dependent variable is categorical.
HOW REGRESSION WORKS
Data Collection and Preparation: Gather data on dependent and independent variables, clean it,
and handle missing or noisy data.
Model Selection: Choose an appropriate regression model based on data type, relationships, and
problem objective.
Parameter Estimation: Use methods like least squares to find coefficients that best fit the data.
Model Evaluation: Assess accuracy using metrics like R-squared (explained variance), mean
squared error, and visualize residuals.
Prediction and Interpretation: Use the model to predict outcomes and interpret relationships
between variables.
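As a minimal sketch of steps 3-5, the snippet below fits a straight line by least squares with NumPy and evaluates the fit with residuals and R-squared. The hours-vs-score numbers are invented purely for illustration.

```python
# A minimal sketch of the regression workflow above, using NumPy.
# The data values here are made up purely for illustration.
import numpy as np

# Step 1: collected data (hours studied vs. exam score, hypothetical)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 65.0, 68.0, 77.0])

# Step 3: parameter estimation by least squares.
# Design matrix with an intercept column; solve for [a, b].
A = np.column_stack([np.ones_like(X), X])
a, b = np.linalg.lstsq(A, y, rcond=None)[0]

# Step 4: model evaluation with residuals and R-squared
y_hat = a + b * X
residuals = y - y_hat
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Step 5: prediction and interpretation
print(f"intercept a = {a:.2f}, slope b = {b:.2f}, R^2 = {r_squared:.3f}")
print(f"predicted score for 6 hours: {a + b * 6:.1f}")
```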
IMPORTANT CONCEPTS
Dependent Variable: The outcome being predicted or explained.
Independent Variables: Factors influencing the outcome.
Intercept: The expected value of the dependent variable when all independent variables are zero.
Coefficient: The estimated impact of an independent variable on the dependent variable.
Residuals: Differences between observed and predicted values, used to measure model fit.
LINEAR REGRESSION
Linear Regression is a supervised machine learning algorithm and statistical technique used to model the
linear relationship between a dependent (target) variable and one or more independent (predictor)
variables. It predicts the value of the dependent variable based on the independent variables, assuming a
straight-line relationship.
Linear regression describes the connection between one dependent variable (outcome) and one or more
independent variables (predictors) using a straight line. The relationship is mathematically represented by
the equation:
Y = a + bX
● Y: Dependent variable (what you want to predict)
● X: Independent variable(s) (used for prediction)
● a: Intercept (value of Y when X=0)
● b: Slope (how much Y changes for a one-unit increase in X)
TYPES OF LINEAR REGRESSION
● Simple Linear Regression: one predictor and one outcome variable (e.g., studying hours and exam scores).
● Multiple Linear Regression: several predictors for one outcome (e.g., height and gender predicting weight).
MULTIPLE LINEAR REGRESSION
Multiple linear regression extends the simple model to several predictors:
Y = a + b1X1 + b2X2 + ... + bnXn
Each coefficient bi is the expected change in Y for a one-unit increase in Xi, holding the other predictors constant.
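A minimal sketch of multiple linear regression with scikit-learn, using the height-and-gender-predicting-weight example above; the specific numbers are invented for illustration.

```python
# A minimal multiple linear regression sketch with scikit-learn.
# The height/gender/weight values are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Predictors: height in cm, gender encoded as 0/1
X = np.array([
    [160, 0],
    [165, 0],
    [170, 1],
    [175, 1],
    [180, 1],
])
y = np.array([55, 58, 68, 74, 80])  # weight in kg

model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)
print("coefficients b1, b2:", model.coef_)

# Predict weight for a 172 cm person with gender = 1
print("prediction:", model.predict([[172, 1]])[0])
```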
LOGISTIC REGRESSION
Logistic Regression is a statistical and supervised machine learning algorithm used for classification
problems where the dependent variable is binary or categorical (often coded as 0 or 1). Unlike linear
regression that predicts continuous values, logistic regression predicts the probability of an outcome
belonging to a particular class, such as "yes" or "no," "success" or "failure."
Logistic regression maps independent variables to probabilities using an S-shaped curve (the sigmoid function), not a straight line.
It classifies results based on a probability threshold (often 0.5: above = the event occurs; below = the event does not occur).
PURPOSE
Predict the probability of a binary event occurring.
Classify observations into two categories based on predictor variables.
Model relationships where the outcome is categorical and explanatory variables can be continuous
or categorical.
EQUATION
Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It predicts the probability that a given input belongs to a particular class using the logistic (sigmoid) function:
P(Y = 1) = 1 / (1 + e^-(a + bX))
Here a + bX is the same linear combination used in linear regression, and the sigmoid squashes it into a probability between 0 and 1.
EXAMPLE
PREDICTING IF A STUDENT PASSES AN EXAM
Hours Studied (X) Pass (Y)
1 0
2 0
3 0
4 1
5 1
6 1
Predict for a Student Who Studied 3.5 Hours
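As a minimal sketch, the snippet below fits logistic regression to the table above with scikit-learn and predicts for 3.5 hours. Since 3.5 hours sits exactly between the failing and passing groups, the predicted probability lands near the 0.5 threshold.

```python
# A minimal sketch fitting logistic regression to the table above
# with scikit-learn, then predicting for 3.5 hours of study.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])  # hours studied
y = np.array([0, 0, 0, 1, 1, 1])              # pass (1) / fail (0)

model = LogisticRegression().fit(X, y)

p_pass = model.predict_proba([[3.5]])[0, 1]   # probability of passing
print(f"P(pass | 3.5 hours) = {p_pass:.2f}")
print("predicted class:", model.predict([[3.5]])[0])  # 0.5 threshold
```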
MULTIVARIATE ANALYSIS
COMPARISON OF UNIVARIATE, BIVARIATE, AND MULTIVARIATE ANALYSIS
UNIVARIATE
● Number of variables: One
● Purpose: Summarize and describe the characteristics of a single variable without investigating relationships
● Example: Heights of students in a class
● Key techniques and focus: Descriptive statistics (mean, median, mode), measures of dispersion, histograms, bar charts
BIVARIATE
● Number of variables: Two
● Purpose: Explore the relationship or association between two variables
● Example: Relationship between temperature and ice cream sales
● Key techniques and focus: Correlation, scatter plots, simple linear regression
MULTIVARIATE
● Number of variables: More than two
● Purpose: Analyze interactions and relationships among multiple variables simultaneously
● Example: Analyzing ad type, gender, and click rates together
● Key techniques and focus: Multiple linear/logistic regression, MANOVA, PCA, factor and cluster analysis
Multivariate analysis is a collection of powerful statistical techniques used to analyze data involving more
than two variables at a time, facilitating the discovery of complex relationships and patterns in large
datasets.
Multivariate analysis evaluates the relationships among multiple variables simultaneously, moving beyond
univariate (one variable) and bivariate (two variable) approaches. It is invaluable for extracting richer
insights from data and is widely used in fields such as marketing, healthcare, finance, social sciences, and
environmental studies.
Discovers patterns and associations that may not be visible when looking at one or two variables alone.
Empowers more accurate prediction, decision-making, and error correction.
Allows companies and researchers to analyze complex phenomena by considering all relevant
influencing factors.
METHODS AND TECHNIQUES
● MULTIPLE LINEAR REGRESSION
○ Predicts a single outcome using several predictors.
● MULTIPLE LOGISTIC REGRESSION
○ Models a binary outcome using multiple variables.
● MANOVA (MULTIVARIATE ANALYSIS OF VARIANCE)
○ Tests for significant differences in multiple dependent variables across groups.
● FACTOR ANALYSIS
○ Identifies underlying dimensions or factors among related variables.
● PRINCIPAL COMPONENT ANALYSIS (PCA)
○ Reduces data complexity by combining correlated variables into principal components.
● CLUSTER ANALYSIS
○ Classifies observations into groups based on similarities among multiple variables.
TYPES
● Dependence Techniques
○ One or more dependent variables are predicted by independent variables (e.g., regression,
MANOVA).
● Interdependence Techniques
○ No clear dependent variable; seeks to reveal structure or patterns among variables (e.g.,
factor analysis, cluster analysis).
FACTOR ANALYSIS
Factor analysis aims to uncover hidden patterns or structures within data by grouping related variables
into factors. These factors represent shared variance, summarizing the information in several correlated
variables with fewer dimensions, making data easier to analyze and interpret.
HOW IT WORKS
Collect data on many variables believed to be related.
Compute the correlation or covariance matrix to assess relationships between variables.
Extract common factors that explain the maximum shared variance.
Each observed variable loads onto one or more factors with different weights (factor loadings).
The end result is a smaller number of factors representing most of the information in the dataset.
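A minimal factor analysis sketch with scikit-learn, following the steps above; the data are synthetic stand-ins (random numbers built from two hidden factors) rather than real survey items.

```python
# A minimal factor analysis sketch with scikit-learn, following the
# steps above. The data here are random stand-ins for real variables.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Six observed variables driven by two hidden factors (synthetic data)
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.3 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2).fit(X)

# Factor loadings: how strongly each observed variable loads on each factor
print(fa.components_.round(2))

# Project observations onto the two extracted factors
scores = fa.transform(X)
print(scores.shape)  # (200, 2)
```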
WHY
Dimensionality Reduction: Decreases the number of variables to work with.
Identify Latent Constructs: Reveals hidden dimensions like personality traits or customer
satisfaction.
Simplify Data: Helps summarize data for better visualization and understanding.
Variable Selection: Identifies which variables contribute most to key factors.
Improve Models: Reduces multicollinearity and noise in predictive modeling.
APPLICATIONS
Psychology and social sciences (e.g., measuring personality or attitudes).
Marketing (e.g., understanding customer preferences).
Finance (e.g., analyzing market factors influencing asset prices).
Health sciences (e.g., grouping symptoms into syndromes).
CLUSTER ANALYSIS
Cluster analysis is a statistical method used to group a set of objects or data points into clusters based on
their similarities, where objects within a cluster are more similar to each other than to those in other
clusters.
It is an unsupervised learning technique—meaning no predefined labels or categories are needed.
The goal is to discover natural groupings or structures in data.
Clusters formed have high internal similarity and are well-separated from other clusters.
WORKS
1. Choose a clustering method, such as k-means, hierarchical, or two-step clustering, depending on the data size and goals.
2. Select the number of clusters, or let the algorithm determine the best number based on data patterns.
3. Choose variables/features to include for measuring similarity.
4. Calculate similarity or distance between data points using metrics such as Euclidean distance,
Manhattan distance, or cosine similarity.
5. Assign data points into clusters where the total intra-cluster distance is minimized.
6. Visualize and interpret the clusters using plots or dendrograms.
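A minimal k-means sketch with scikit-learn, following the steps above; the 2-D points are synthetic stand-ins for real features.

```python
# A minimal k-means clustering sketch with scikit-learn, following
# the steps above. The 2-D points are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two loose blobs of points (synthetic data)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(50, 2)),
])

# Step 2: choose k = 2; k-means minimizes the total intra-cluster
# (squared Euclidean) distance to the cluster centers
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("cluster labels:", km.labels_[:10])
print("cluster centers:", km.cluster_centers_.round(2))
print("total intra-cluster distance (inertia):", round(km.inertia_, 2))
```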
COMMON APPLICATIONS
Marketing: Customer segmentation to tailor marketing strategies.
Healthcare: Grouping patients by symptoms to assist in diagnosis.
Biology: Classifying species based on traits.
Retail: Market basket analysis to find product groupings.
PRINCIPAL COMPONENT ANALYSIS (PCA)
Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of large
datasets while retaining most of the original information. It transforms potentially correlated variables into
a smaller set of new, uncorrelated variables called principal components, which capture the maximum
variance in the data.
PCA creates new variables (principal components) that are linear combinations of original variables.
The first principal component explains the largest amount of variation in the dataset.
Each subsequent component explains the next highest amount of variation, with the constraint
that it is uncorrelated (orthogonal) to the previous components.
By using a few principal components, you reduce complexity while preserving essential
information.
PCA WORKS (STEP-BY-STEP)
1. Standardize Data: Variables are scaled to have zero mean and unit variance so that they are comparable.
2. Compute Covariance Matrix: Shows how variables vary together.
3. Calculate Eigenvalues and Eigenvectors: Eigenvectors determine the direction of principal
components; eigenvalues measure the amount of variance explained.
4. Select Principal Components: Choose components with the highest eigenvalues (variance
explained).
5. Transform Data: Original data projected into new components to create a smaller, simplified
dataset.
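A minimal PCA sketch with scikit-learn that mirrors the steps above (standardize, then extract components ordered by explained variance); the four correlated variables are synthetic.

```python
# A minimal PCA sketch with scikit-learn, mirroring the steps above
# (standardize, then extract components). The data are synthetic.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Four correlated variables built from two independent sources
base = rng.normal(size=(100, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] * 2 + 0.1 * rng.normal(size=100),
    base[:, 1],
    base[:, 1] * -1 + 0.1 * rng.normal(size=100),
])

# Step 1: standardize
X_std = StandardScaler().fit_transform(X)

# Steps 2-4: PCA computes the covariance structure internally and
# orders components by explained variance (eigenvalues)
pca = PCA(n_components=2).fit(X_std)
print("variance explained:", pca.explained_variance_ratio_.round(3))

# Step 5: project the data onto the first two principal components
X_reduced = pca.transform(X_std)
print(X_reduced.shape)  # (100, 2)
```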
MANOVA (MULTIVARIATE ANALYSIS OF VARIANCE)
Multivariate Analysis of Variance (MANOVA) is a statistical technique used to compare the means of
multiple dependent variables across two or more groups defined by one or more categorical independent
variables. It extends the Analysis of Variance (ANOVA) by evaluating several outcome variables
simultaneously rather than one at a time.
MANOVA tests if the group means on a combination of dependent variables differ significantly.
Considers correlations and interrelations among dependent variables while assessing group differences.
Commonly used when researchers want to understand how multiple outcomes vary by group or
treatment type.
MANOVA WORKS
Tests the null hypothesis that the mean vectors of all dependent variables are equal across groups.
Calculates test statistics such as Wilks’ Lambda, Pillai’s Trace, Hotelling’s Trace, and Roy’s Largest
Root to determine significance.
Assumes multivariate normality, homogeneity of covariance matrices, independence of
observations, and linear relationships among variables.
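A minimal one-way MANOVA sketch using statsmodels, testing whether group means on two dependent variables differ; the data are synthetic stand-ins.

```python
# A minimal MANOVA sketch with statsmodels: do the group means on two
# dependent variables (y1, y2) differ across a categorical group?
# The data here are synthetic stand-ins.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
n = 30

df = pd.DataFrame({
    "group": ["A"] * n + ["B"] * n,
    "y1": np.concatenate([rng.normal(0, 1, n), rng.normal(1, 1, n)]),
    "y2": np.concatenate([rng.normal(0, 1, n), rng.normal(0.5, 1, n)]),
})

# One-way MANOVA: both dependent variables modeled against 'group'
maov = MANOVA.from_formula("y1 + y2 ~ group", data=df)

# Reports Wilks' lambda, Pillai's trace, Hotelling-Lawley trace,
# and Roy's greatest root with their p-values
print(maov.mv_test())
```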
TYPES OF MANOVA
One-Way MANOVA: Compares groups based on one independent variable.
Two-Way MANOVA: Examines effects of two independent variables and their interaction.
Repeated Measures MANOVA: Measures same subjects multiple times under different conditions.