
Noida Institute of Engineering and Technology, Greater Noida

Exploratory Data Analysis

Unit: 4

Data Analytics ACSAI0512


Dr. Kumod Gupta
Associate Professor
B.Tech 5th Semester
CSE-AI

Faculty Introduction

Name: Dr. Kumod Kumar Gupta
Qualification: PhD
Designation: Associate Professor
Department: CSE-AI
NIET Experience: 03 Years
Subjects Taught: Deep Learning, Machine Learning, Programming for Data Analytics

Evaluation Scheme

Syllabus

UNIT-I: Introduction to Data Science

Introduction to Data Science, Big Data, the 5 V’s, Evolution of Data Science, Datafication, Skillsets needed, Data Science Lifecycle, types of Data Analysis, Data Science Tools and technologies, Need for Data Science, Analysis vs Analytics vs Reporting, Big Data Ecosystem, Future of Data Science, Applications of Data Science in various fields, Use cases of Data Science: Facebook, Netflix, Amazon, Uber, AirBnB.


Syllabus

UNIT-II: Data Handling

Type of Data: structured, semi-structured, unstructured data; Numeric, Categorical, Graphical, High Dimensional Data; Transactional Data, Spatial Data, Social Network Data; standard datasets, Data Classification, Sources of Data, Data manipulation in various formats, for example, CSV file, PDF file, XML file, HTML file, text file, JSON, image files etc.; import and export of data in R/Python.


Syllabus

UNIT-III: Data Preprocessing

Form of Data Pre-processing, data Attributes and their types, understanding and extracting useful variables, KDD process, Data Cleaning: Missing Values, Noisy Data, Discretization and Concept hierarchy generation (Binning, Clustering, Histogram), Inconsistent Data, Data Integration and Transformation. Data Reduction: Data Cube Aggregation, Data Compression, Numerosity Reduction.


Syllabus

UNIT-IV: Exploratory Data Analysis

Handling Missing data, Removing Redundant variables, Variable Selection, Identifying outliers, Removing Outliers, Time series Analysis, Data transformation and dimensionality reduction techniques such as Principal Component Analysis (PCA), Factor Analysis (FA) and Linear Discriminant Analysis (LDA), Univariate and Multivariate Exploratory Data Analysis. Data Munging, Data Wrangling: APIs and other tools for scraping data from the web/internet using R/Python.


Syllabus

UNIT-V: Data Visualization

Introduction and overview; Debug and troubleshoot installation and configuration of Tableau. Creating Your First Visualization: Getting started with Tableau Software, Using Data file formats, connecting your Data to Tableau, creating basic charts (line, bar charts, Tree maps), Using the Show Me panel. Tableau Calculations: Overview of SUM, AVG, and Aggregate Features, Creating custom calculations and fields, Applying new data calculations to your visualization. Manipulating Data in Tableau: Cleaning up the data with the Data Interpreter, structuring your data, Sorting and filtering Tableau data, Pivoting Tableau data. Advanced Visualization Tools: Using Filters, Using the Detail panel, Using the Size panel, customizing filters, Using and Customizing tooltips, Formatting your data with colours, Creating Dashboards & Stories, Distributing & Publishing Your Visualization.


Branch Wise Applications

1. Security
2. Digital Advertising
3. E-Commerce
4. Publishing
5. Massively Multiplayer Online Games
6. Backend Services and Messaging
7. Project Management & Collaboration
8. Real-time Monitoring Services
9. Live Charting and Graphing
10. Group and Private Chat
Course Objective

The objective of this course is to understand the fundamental concepts of data analytics and to learn about various types of data formats and their manipulation. It helps students learn exploratory data analysis and visualization techniques, in addition to the R/Python/Tableau programming languages.
Course Outcomes

At the end of the course, the student will be able to:

• Understand the fundamental concepts of data analytics in the areas that play a major role within the realm of data science.
• Explain and exemplify the most common forms of data and their representations.
• Understand and apply data pre-processing techniques.
• Analyse data using exploratory data analysis.
• Illustrate various visualization methods for different types of data sets and application scenarios.
Program Outcomes

Engineering Graduates will be able to:

PO1 : Engineering Knowledge

PO2 : Problem Analysis

PO3 : Design/Development of solutions

PO4 : Conduct Investigations of complex problems

PO5 : Modern tool usage

PO6 : The engineer and society



Program Outcomes

Engineering Graduates will be able to:

PO7 : Environment and sustainability

PO8 : Ethics

PO9 : Individual and teamwork

PO10 : Communication

PO11 : Project management and finance


PO12 : Life-long learning



CO-POs Mapping

CO.K PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12

CO1 2 2 2 3 3 - - - - - - -

CO2 3 2 3 2 3 - - - - - - -

CO3 3 2 3 2 3 - - - - - - -

CO4 3 2 3 2 3 - - - - - - -

CO5 3 2 3 3 3 - - - - - - -

AVG 2.8 2.0 2.8 2.4 3.0 - - - - - - -



Program Specific Outcomes

PSO1: Design innovative intelligent systems for the welfare of the people using machine learning and its applications.

PSO2: Demonstrate ethical, professional and team-oriented skills while providing innovative solutions in Artificial Intelligence and Machine Learning for life-long learning.


CO-PSOs Mapping

CO.K PSO1 PSO2 PSO3 PSO4

CO1 3 - - -

CO2 3 2 - -

CO3 3 3 - -

CO4 3 3 - -

CO5 3 3 - -



Program Educational Objectives

PEO1: Pursue higher education and professional career to excel in the field of Artificial Intelligence and Machine Learning.

PEO2: Lead by example in innovative research and entrepreneurial zeal for 21st century skills.

PEO3: Proactively provide innovative solutions for societal problems to promote life-long learning.
Pattern of External Exam Question Paper



Brief Introduction about the Subject and Videos

Data analytics (DA) is the process of examining data sets in order to find trends and draw conclusions about the information they contain. Increasingly, data analytics is done with the aid of specialized systems and software.

YouTube/other Video Links


https://www.youtube.com/watch?v=KxryzSO1Fjs

Unit Content

• Unit -4 Exploratory Data Analysis


➢ Handling Missing data,
➢ Removing Redundant variables,
➢ variable Selection,
➢ identifying outliers,
➢ Removing Outliers,
➢ Time series Analysis,
➢ Data transformation and dimensionality reduction techniques
➢ Principal Component Analysis (PCA),
➢ Factor Analysis (FA) and
➢ Linear Discriminant Analysis (LDA),
➢ Univariate and Multivariate Exploratory Data Analysis.
➢ Data Munging, Data Wrangling: APIs and other tools for scraping data from the web/internet using R/Python.



Unit Objectives

The objectives of Unit 4 are:

1. To describe how to handle missing data.

2. To remove redundant variables.

3. To explore data transformation and dimensionality reduction techniques such as Principal Component Analysis (PCA), Factor Analysis (FA) and Linear Discriminant Analysis (LDA).

4. To describe univariate and multivariate exploratory data analysis and time series analysis.

5. To discuss Data Munging and Data Wrangling: APIs and other tools for scraping data from the web/internet using Python.


Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of investigating a dataset to summarize its
key characteristics, often using statistical techniques and visualizations.
• EDA is crucial in understanding the data before applying more complex statistical
models or machine learning algorithms.
• It helps in identifying patterns, spotting anomalies, testing hypotheses, and checking
assumptions.
Objectives of EDA:
1. Discover patterns: Identify trends, clusters, and relationships.
2. Spot anomalies or outliers: Detect unusual or extreme values.
3. Test hypotheses: Evaluate assumptions about the data.
4. Determine relationships: Understand how different variables interact.



Exploratory Data Analysis

Key EDA Techniques:


Descriptive Statistics:
o Summarize the data using metrics such as mean, median, mode, range, variance, and standard deviation.
Data Visualization:
o Histograms: Show the distribution of a variable.
o Box plots: Highlight the spread and potential outliers.
o Scatter plots: Display relationships between two continuous variables.
o Heatmaps: Visualize correlations between multiple variables.
Missing Data Analysis:
o Identify missing values and determine strategies to handle them (e.g., imputation, removal).
Correlation and Covariance:
o Quantify the relationships between variables using correlation matrices or covariance calculations.
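To make these techniques concrete, here is a minimal Python sketch; it assumes pandas, matplotlib and seaborn are installed, and the file name data.csv and the column name age are hypothetical placeholders:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data.csv')   # hypothetical input file

# Descriptive statistics: mean, std, quartiles for every numeric column
print(df.describe())

# Histogram: distribution of a single (hypothetical) column 'age'
df['age'].hist(bins=20)
plt.show()

# Box plot: spread and potential outliers of the same column
df.boxplot(column='age')
plt.show()

# Heatmap: correlations between all numeric variables
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')
plt.show()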



Exploratory Data Analysis

Benefits of EDA:
• Improved Data Understanding: Helps you understand the structure and relationships within
the data.
• Error Detection: Helps identify data errors, missing values, or outliers that could affect results.
• Hypothesis Generation: EDA can help in forming new hypotheses for further analysis.
• Model Selection: Helps in choosing the right type of machine learning models by
understanding data distributions and relationships.



Exploratory Data Analysis
There are four primary types of Exploratory Data Analysis (EDA):
1. Univariate
Univariate Non-Graphical EDA
• Focus: Analyzing a single variable without using visual tools.
• Techniques:
o Summary statistics: Mean, median, mode, variance, standard deviation.
o Frequency distribution tables: Count how often each value occurs.
• Purpose: Understand the central tendency, spread, and shape of the distribution.
Univariate Graphical EDA
• Focus: Analyzing one variable using visual methods.
• Techniques:
o Histograms: Show the distribution of the variable.
o Box plots: Highlight the spread, median, and potential outliers.
o Density plots: Show the probability density of the variable.
• Purpose: Gain better insights into the distribution, shape, and spread of the data.
Exploratory Data Analysis

2. Bivariate Analysis
•Bivariate Non-Graphical EDA:
• Focuses on the relationship between two variables using measures like correlation
coefficients (e.g., Pearson’s correlation).
•Bivariate Graphical EDA:
• Uses visual tools like scatter plots, bar plots, and line charts to explore
relationships between two variables.
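A small sketch of both flavours, using made-up height/weight values (pandas and scipy assumed installed):

import pandas as pd
from scipy.stats import pearsonr

# Small illustrative dataset (hypothetical values)
df = pd.DataFrame({
    'height': [150, 160, 165, 170, 175, 180],
    'weight': [50, 55, 62, 66, 72, 80],
})

# Non-graphical: Pearson's correlation coefficient
r, p_value = pearsonr(df['height'], df['weight'])
print(f'Pearson correlation: {r:.2f} (p = {p_value:.3f})')

# Graphical: scatter plot of the two variables
df.plot.scatter(x='height', y='weight')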



Exploratory Data Analysis
3. Multivariate
Multivariate Non-Graphical EDA
• Focus: Analyzing relationships between two or more variables without visual representation.
• Techniques:
o Cross-tabulation (contingency tables): Summarizes categorical data to show relationships
between variables.
o Correlation coefficients: Show the strength of relationships between numerical variables.
• Purpose: Understand the interaction between variables through numerical or tabular data.
Multivariate Graphical EDA
• Focus: Visualizing relationships between two or more variables.
• Techniques:
o Scatter plots: Display the relationship between two continuous variables.
o Pair plots: Display relationships between multiple pairs of variables.
o Heatmaps: Show correlations between many variables at once.
o 3D plots or contour plots: Used for visualizing more complex multivariate relationships.
• Purpose: Discover patterns, correlations, and potential interactions among variables visually.
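As an illustration, the sketch below shows one non-graphical and one graphical multivariate technique; it assumes seaborn is installed (load_dataset fetches the classic iris demo dataset on first use):

import pandas as pd
import seaborn as sns

iris = sns.load_dataset('iris')

# Non-graphical: cross-tabulate species against a derived size category
iris['size'] = pd.cut(iris['petal_length'], bins=2, labels=['short', 'long'])
print(pd.crosstab(iris['species'], iris['size']))

# Graphical: pair plot of all numeric variables, coloured by species
sns.pairplot(iris, hue='species')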
Handling Missing data

Missing Values:-

The data has some missing values in its columns. There are three major categories of
missing values:

1. MCAR (Missing completely at random): These are values that are randomly missing
and do not depend on any other values.

2. MAR (Missing at random): These values are dependent on some additional features.

3. MNAR (Missing not at random): There is a reason behind why these values are
missing.
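Whichever category applies, a practical first step is simply to quantify the missing values; a minimal pandas sketch with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age':    [25, np.nan, 30, 41, np.nan],
    'income': [50000, 62000, np.nan, 71000, 58000],
})

print(df.isnull().sum())          # count of missing values per column
print(df.isnull().mean() * 100)   # percentage of missing values per column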



Handling Missing data
1. MCAR (Missing Completely at Random)
2. MAR (Missing at Random)

MCAR (Missing Completely at Random) refers to a situation where the missing data is independent of
both the observed and unobserved data in the dataset.
• In other words, the likelihood of any particular value being missing is unrelated to any of the variables
in the dataset. The missingness occurs purely by chance, and there’s no systematic reason for why the
data is missing.
Characteristics of MCAR:
• Completely Random: The missing data is random and not influenced by any variables (neither the
missing variable itself nor any other variables).
• No Bias Introduced: If data is MCAR, dropping the missing data or filling it with simple imputations
(like the mean) won’t introduce bias in the analysis.
• Hard to Prove: It’s difficult to verify that data is MCAR because you need to show that the missing
data is not related to any variables.



Handling Missing data

Example of MCAR:
• Imagine you’re conducting a survey where some participants don’t answer a question about
their favorite color.
• If the missing responses are completely random and not related to the participants’ age,
gender, or any other factor, this is MCAR.
• The fact that the data is missing is purely due to chance, like someone accidentally skipping a
question.



Handling Missing data

How to Handle MCAR:


1. Remove Missing Data: If the percentage of missing data is small, you can safely drop
the rows or columns with missing data, as it won’t bias the results.

# Drop rows with missing values
df = df.dropna()

# Drop columns with missing values
df = df.dropna(axis=1)

2. Simple Imputation: You can use simple imputation methods like filling missing values with the mean, median, or mode, since the data is missing randomly and this won’t introduce bias.

# Fill missing values with the column mean
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())


Handling Missing data

Impact of MCAR:
•No Bias: Since the missingness is completely random, removing or imputing the data won’t
significantly affect the results or introduce bias.
•Efficiency Loss: While MCAR doesn't introduce bias, removing too much data can reduce the sample
size, leading to a loss in the statistical power of your analysis.
Conclusion:
When data is MCAR, it’s the simplest case to handle. You can safely remove or fill in the missing data
without worrying about bias.
However, confirming that your data is truly MCAR can be challenging in practice, as you need to
demonstrate that the missing data is completely unrelated to any other factors.



Handling Missing data

Imputation is the process of filling in missing data values with substituted values to allow for complete data analysis.
2. MAR (Missing at Random): For MAR data, where the missingness depends on other observed variables, specific imputation techniques can be used to estimate the missing values.
Imputation Techniques for MAR:
a) Mean, Median, or Mode Imputation:
o Method: Replace the missing values with the mean, median, or mode of the
observed data for a particular variable.
o Usage: Simple but effective when the relationship between variables isn’t complex.
o Limitation: Can underestimate variability and distort relationships between
variables.



Handling Missing data

b) Imputation Using Other Features (Regression Imputation):


o Method: Use regression models to predict the missing value based on other
related features. For example, if age is missing, you can predict it using related
variables like income or education level.
o Usage: More appropriate for MAR data, as it accounts for the relationship
between variables.
o Limitation: Can still introduce bias if the assumptions of the regression model
are violated.
c) K-Nearest Neighbours (KNN) Imputation:
o Method: Finds the "K" closest observations (neighbours) to a missing data
point and imputes the missing value based on the average (or mode) of these
neighbours.
o Usage: Works well for numerical and categorical data when the dataset is
relatively small and relationships between features exist.
Limitation: Computationally expensive for large datasets
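A minimal sketch of KNN imputation using scikit-learn's KNNImputer (the feature values are made up):

import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical feature matrix (age, income) with missing entries
X = np.array([
    [25.0, 50000.0],
    [np.nan, 62000.0],
    [30.0, np.nan],
    [41.0, 71000.0],
])

# Each missing value is filled in from the 2 nearest complete rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))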
Handling Missing data

Why Imputation for MAR is Important:


• For MAR data, simple deletion of rows or columns containing missing data could lead to
biased results, as the missingness is systematically related to other observed variables.
• Imputation techniques, especially those that take other variables into account, help
minimize this bias and preserve the integrity of the data.



Handling Missing data

Comparison of MCAR and MAR

Missingness depends on:
• MCAR: Nothing (completely random)
• MAR: Observed variables (but not the missing values)

Effect on analysis:
• MCAR: No bias if data is dropped
• MAR: Can introduce bias if not handled correctly

Handling:
• MCAR: Can drop or use simple imputation methods (mean, median)
• MAR: More advanced imputation needed (regression, KNN, multiple imputation)


Handling Missing data

Implications of imputation

• Imputation has some effects that can impact analysis.

1. The central tendency of data is retained. For example, if we impute missing


data using the mean of a numeric variable, the mean after imputation will not
change. This is a good reason to impute based on estimates of central
tendency.

2. The spread of the data will change. After imputation, the spread of the data will be smaller relative to the spread computed while ignoring missing values. This can be problematic, as underestimating the spread of the data can yield over-confident inferences in downstream analysis.


Removing Redundant variables

1. Removing redundant columns -


Sometimes more than one column can contain the same/similar value. In that case
having two columns does not add any value to the model. So, it is wise to delete the
redundant column.
2. Remove redundant rows -
This depends on the use case, but if having duplicate records does not make sense,
then it is wise to remove the redundant rows -
df = df.drop_duplicates()

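A minimal sketch of both steps, using a made-up frame in which price_cents merely duplicates price_usd:

import pandas as pd

df = pd.DataFrame({
    'price_usd':   [10, 20, 30, 30],
    'price_cents': [1000, 2000, 3000, 3000],   # redundant copy of price_usd
    'label':       ['a', 'b', 'c', 'c'],
})

# Remove redundant rows (exact duplicates)
df = df.drop_duplicates()

# Remove a redundant column: (near-)perfectly correlated with another column
if df['price_usd'].corr(df['price_cents']) > 0.99:
    df = df.drop(columns=['price_cents'])
print(df)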


Variable selection
Variable selection, also called feature selection, is the process of choosing the most important variables (or
features) from a dataset for building a machine learning model. This helps improve the model's performance and
reduces the time it takes to train.
There are three main types of variable selection techniques:
1. Filter Methods: These techniques rank variables based on statistical measures like correlation. You select
the top-ranked variables for the model. Common methods include:
o Correlation coefficient
o Chi-square test
o Variance threshold
2. Wrapper Methods: These methods evaluate different combinations of variables by actually building models
on them. They are more accurate but take more time. Examples include:
o Forward selection (starting with no variables and adding the best ones one by one)
o Backward elimination (starting with all variables and removing the least important ones)
o Recursive feature elimination (removing the least important feature repeatedly)
3. Embedded Methods: These techniques perform variable selection during the model training process. Some
algorithms, like decision trees or LASSO (L1 regularization), have built-in feature selection.
Choosing the right variables can make your model simpler and better at predictions.
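A scikit-learn sketch showing one technique from each family, run on the built-in iris dataset (the choices of k=2 features and logistic regression are illustrative):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: keep the 2 features with the highest ANOVA F-score
X_filtered = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a simple model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print('RFE kept:', rfe.support_)

# Embedded method: an L1 (LASSO-style) penalty shrinks weak features to zero
lasso = LogisticRegression(penalty='l1', solver='liblinear').fit(X, y)
print('L1 coefficients:', lasso.coef_)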
Variable selection

Typical preventive measures during variable selection include:


•Collaboration with experts in the field to identify the important variables.
•Awareness of any problems in relation to data source, reliability or mismeasurement.
•Cleaning the data.
•Using control variables to account for confounding variables or specific events such as an economic drift.




Identifying Outliers

An outlier is something separate or different from the crowd. Outliers can be the result of a mistake during data collection, or they can simply be an indication of variance in your data. Some of the methods for detecting and handling outliers are:
•Box Plot
•Scatter plot
•Z-score
•IQR(Inter-Quartile Range)



Identifying Outliers

Outlier Detection Methods in Machine Learning


• Outlier detection plays a crucial role in ensuring the quality and accuracy of machine learning
models.
• By identifying and removing or handling outliers effectively, we can prevent them from biasing the
model, reducing its performance, and hindering its interpretability. Here’s an overview of various
outlier detection methods:
1. Statistical Methods:
•Z-Score: This method standardizes each data point as z = (x − mean) / standard deviation and identifies outliers as those with Z-scores exceeding a certain threshold (typically 3 or −3).
•Interquartile Range (IQR): IQR identifies outliers as data points falling outside the range defined by
Q1-k*(Q3-Q1) and Q3+k*(Q3-Q1), where Q1 and Q3 are the first and third quartiles, and k is a factor
(typically 1.5).
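A small sketch of both statistical methods in plain NumPy; the data values are made up, and a z threshold of 2.5 is used because the sample is tiny:

import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 10])   # 95 is an obvious outlier

# Z-score method: flag points whose |z| exceeds the threshold
z = (data - data.mean()) / data.std()
print('Z-score outliers:', data[np.abs(z) > 2.5])

# IQR method with k = 1.5
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print('IQR outliers:', data[mask])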



Identifying Outliers

2. Distance-Based Methods:
•K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest neighbors are
far away from them.
•Local Outlier Factor (LOF): This method calculates the local density of data points and identifies
outliers as those with significantly lower density compared to their neighbors.
3. Clustering-Based Methods:
•Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN clusters data points based on their density and identifies outliers as points not belonging to any cluster.
•Hierarchical clustering: Hierarchical clustering involves building a hierarchy of clusters by iteratively
merging or splitting clusters based on their similarity. Outliers can be identified as clusters containing
only a single data point or clusters significantly smaller than others.

https://www.geeksforgeeks.org/machine-learning-outlier/



Remove Outliers

To remove outliers from a dataset, you can follow different approaches depending on the
method of detection and the type of data you're working with. Here’s a simple guide:
Steps to Remove Outliers:
1.Identify the Outliers: First, you need to detect the outliers using methods such as the Z-
score, Interquartile Range (IQR), or visualization tools like box plots.
a) Using Z-score (for numerical data):

b) Using IQR (Interquartile Range):


The IQR method finds the spread of the middle 50% of the data and marks values outside a
certain range as outliers.
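A minimal pandas sketch of IQR-based removal (the sales figures are hypothetical):

import pandas as pd

df = pd.DataFrame({'sales': [200, 220, 210, 250, 1900, 230]})   # 1900 is extreme

q1 = df['sales'].quantile(0.25)
q3 = df['sales'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows inside the IQR fence
df_clean = df[(df['sales'] >= lower) & (df['sales'] <= upper)]
print(df_clean)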



Time Series Analysis

• Time series analysis is a method used to analyze data that is collected over time.
• The goal is to understand the patterns, trends, and behaviors of the data and to make predictions for
the future.
• Common examples of time series data include stock prices, weather measurements, and sales figures
over time.
Here are some key concepts in time series analysis:
1.Trend: A long-term increase or decrease in the data. For example, if sales data shows a steady rise over
years, that's a trend.
2.Seasonality: Recurring patterns or cycles in the data at specific times, such as higher sales during the
holiday season or higher temperatures in summer.
3.Noise: Random variation in the data that can't be explained by trends or seasonality. Noise makes it
harder to see the patterns clearly.
4. Stationarity: A time series is stationary if its statistical properties (like mean and variance) don't change
over time. If the data shows trends or seasonality, it is non-stationary.
5. Autocorrelation: This measures how a time point in the series is related to earlier time points. In simple
terms, it helps you see if past values are influencing future values.
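These components can be separated programmatically; a sketch using statsmodels' seasonal_decompose on a synthetic monthly series built from a trend, a yearly seasonal cycle, and noise:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
idx = pd.date_range('2020-01-01', periods=48, freq='MS')   # monthly points
t = np.arange(48)
series = pd.Series(2 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 48),
                   index=idx)

# Split the series into trend, seasonal, and residual (noise) parts
result = seasonal_decompose(series, model='additive', period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))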



Time Series Analysis

Popular models for time series analysis include:

•ARIMA (Auto Regressive Integrated Moving Average):


A widely used model that combines autoregressive (AR), differencing (I for integrated), and
moving average (MA) elements.

•Exponential Smoothing: A method that gives more weight to recent data points when making predictions.

•LSTM (Long Short-Term Memory networks): A type of neural network that is often
used for time series forecasting, especially when working with deep learning.
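A minimal ARIMA sketch with statsmodels, fitted to a synthetic random-walk series; the order (1, 1, 1) is an arbitrary illustration, not a tuned choice:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

series = pd.Series(np.cumsum(np.random.normal(0, 1, 100)))   # synthetic data

# order = (p, d, q): AR terms, differencing steps, MA terms
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()

print(fitted.forecast(steps=5))   # forecast the next 5 time steps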



Time Series Analysis

Why organizations use time series data analysis


• Organizations use time series data analysis for several reasons, primarily because it helps them make
informed decisions based on trends, patterns, and future predictions.
Here are some key reasons why organizations use time series analysis:
1. Forecasting Future Trends

•Sales Forecasting: Retailers and businesses analyze historical sales data to predict future demand,
allowing them to adjust inventory, marketing strategies, and production schedules accordingly.
•Stock Market Predictions: Financial organizations use time series analysis to forecast stock prices,
currency exchange rates, and interest rates, helping them make investment decisions.

2. Understanding Patterns
•Seasonality Detection: Organizations can detect seasonal trends, such as higher sales during the
holiday season or increased electricity consumption during summer, and plan operations around these
patterns.
•Customer Behavior Analysis: Time series analysis helps companies understand when customers are
most active or likely to make purchases, aiding in targeted marketing.



Time Series Analysis
3. Anomaly Detection
•Fraud Detection: Financial institutions monitor time series data to identify unusual patterns that
could indicate fraudulent transactions.
•Equipment Failure Prediction: In industries like manufacturing, time series data from sensors can
be used to predict when a machine might fail, enabling preventive maintenance.
4. Risk Management
•Market Risk: Financial firms analyze time series data to assess market risks, such as potential stock
market crashes, allowing them to take measures to mitigate these risks.
•Supply Chain Optimization: Companies use time series data to anticipate disruptions or delays in
the supply chain and adjust operations accordingly.



Time Series Analysis
5. Improving Operations
•Inventory Management: By analyzing historical sales data, companies can optimize inventory
levels, ensuring they stock the right amount of products at the right time.
•Staffing and Resource Allocation: Time series analysis helps organizations predict peak times and
allocate resources like staffing or equipment effectively.
6. Financial Analysis
•Revenue Projections: Companies can project future revenue by analyzing past performance trends.
•Budgeting and Cost Management: Time series analysis helps organizations manage their finances
by predicting future expenses and revenues.



Time Series Analysis

7. Optimization of Marketing Campaigns


• Targeted Advertising: Time series data helps marketers understand when specific products or
services are in demand, allowing them to target advertisements at the right time.
• Customer Engagement: Companies use time series analysis to find the best times to engage
with customers via promotions, emails, or ads.
By leveraging time series data, organizations can not only understand their past and current
performance but also position themselves to make better decisions for the future. This data-driven
approach leads to improved efficiency, cost savings, and increased profitability.



Time Series Analysis

Examples of time series analysis in action include:

•Weather data
•Rainfall measurements
•Temperature readings
•Heart rate monitoring (EKG)
•Brain monitoring (EEG)
•Quarterly sales
•Stock prices
•Automated stock trading
•Industry forecasts
•Interest rates


Time Series Analysis

Time Series Analysis Types


•Classification: Identifies and assigns categories to the data.
•Curve fitting: Plots the data along a curve to study the relationships of
variables within the data.
•Descriptive analysis: Identifies patterns in time series data, like trends, cycles,
or seasonal variation.
•Explanative analysis: Attempts to understand the data and the relationships
within it, as well as cause and effect.



Time Series Analysis

Time Series Analysis Types


1. Exploratory analysis
2. Forecasting
3. Intervention analysis: Intervention analysis is a statistical method used to assess the impact of an event or
intervention on a time series. It helps determine whether a specific event has caused a significant change in the
behavior of the data over time. This method is commonly used in areas such as economics, marketing, and
medicine, where researchers are interested in understanding how policies, treatments, or changes in conditions
influence outcomes.
4. Segmentation : Segmentation is the process of dividing a larger set of data into smaller, more manageable or
meaningful parts, often called segments. It’s widely used in various fields like marketing, image processing,
and machine learning, to make data analysis more focused and efficient.



Time Series Analysis

Types of Segmentation:
1. Market Segmentation:
• Definition: In marketing, segmentation refers to dividing a broad consumer or business market into subgroups based on shared characteristics.
• Types:
o Demographic Segmentation: Dividing based on age, gender, income, education, etc.
o Geographic Segmentation: Dividing based on location (country, city, region).
o Behavioral Segmentation: Based on behavior patterns, like buying habits and brand loyalty.
o Psychographic Segmentation: Based on lifestyle, personality traits, values, and interests.
• Purpose: Helps companies target specific groups with tailored marketing strategies, leading to more efficient use of resources and higher customer satisfaction.


Time Series Analysis

2. Image Segmentation:
• Definition: In image processing, segmentation refers to dividing an image into different parts or regions to make it easier to analyze and interpret.
• Types:
o Thresholding: Divides the image into segments based on pixel intensity (e.g., separating dark objects from bright backgrounds).
o Edge-Based Segmentation: Detects boundaries or edges of objects in the image.
o Region-Based Segmentation: Divides the image into regions based on similarities (color, texture).
o Clustering Methods (e.g., K-means): Group pixels that are similar in characteristics, such as color or intensity.
• Purpose: Used for tasks like object detection, medical image analysis, and facial recognition, making images easier for computers to process.


Time Series Analysis

3. Customer Segmentation in Machine Learning:
• Definition: In machine learning, segmentation often involves grouping data points (such as customers) based on similarities in their behavior or characteristics.
• Methods:
o Clustering Algorithms (e.g., K-means, DBSCAN): These algorithms group similar data points without predefined labels.
o Decision Trees: These can be used to segment data by creating branches based on feature importance.
o Principal Component Analysis (PCA): Reduces the dimensionality of the data to find segments in high-dimensional spaces.
• Purpose: Helps businesses predict customer behavior, improve personalization, and increase retention by understanding different customer groups better.


Data transformation and dimensionality reduction

Dimensionality Reduction refers to the technique of reducing the dimension of a data feature set. Usually,
machine learning datasets (feature set) contain hundreds of columns (i.e., features) or an array of points, creating
a massive sphere in a three-dimensional space. By applying dimensionality reduction, you can decrease or bring
down the number of columns to quantifiable counts, thereby transforming the three-dimensional sphere into a
two-dimensional object (circle).



Data transformation and dimensionality reduction

Dimensionality reduction has many other benefits, such as:


•It eliminates noise and redundant features.
•It helps improve the model’s accuracy and performance.
•It facilitates the usage of algorithms that are unfit for more substantial dimensions.
•It reduces the amount of storage space required (less data needs less storage space).
•It compresses the data, which reduces the computation time and facilitates faster training of the data.



Data transformation and dimensionality reduction

Dimensionality Reduction Techniques:-Dimensionality reduction techniques can be


categorized into two broad categories:
1. Feature selection
The feature selection method aims to find a subset of the input variables (that are
most relevant) from the original dataset. Feature selection includes three strategies,
namely:
•Filter strategy
•Wrapper strategy
•Embedded strategy



Data transformation and dimensionality reduction

Dimensionality Reduction Techniques


2. Feature extraction
Feature extraction, also called feature projection, converts the data from the high-dimensional space to one with fewer dimensions. This data transformation may be either linear or nonlinear. The technique finds a smaller set of new variables, each of which is a combination of the input variables (containing the same information as the input variables).


Weekly/Monthly Assignment

1. What is Data Pre-processing?


2. What is Data cleaning?
3. Explain KDD process.
4. What is Data Reduction?
5. What is Data Discretization?
6. Explain Dimensionality Reduction.
7. What is Inconsistency of Data?
8. What are Outliers?
9. Explain Binning Techniques.
10. What is Data Cube aggregation?



Data transformation and dimensionality reduction

1. Principal Component Analysis (PCA)


➢ Principal Component Analysis is one of the leading linear
techniques of dimensionality reduction.
➢ It is a statistical procedure that orthogonally converts the
‘n’ coordinates of a dataset into a new set of n coordinates,
known as the principal components. This conversion results in
the creation of the first principal component having the
maximum variance.



Principal component analysis

[Illustrative table from the slide, body lost in extraction: ingredient columns Cake, Milk, Lemon Juice and Oil reduced to two principal components, PC1 (Paneer) and PC2 (Ghee).]

PCA is an unsupervised method as it does not use the output information.


Principal component analysis

• Principal component analysis (PCA) is an unsupervised machine learning method for dimensionality reduction, which is used to reduce the number of variables of a dataset into a smaller number of variables while preserving significant patterns and trends in the dataset.

• Principal components are new variables that are constructed as linear combinations or mixtures of the
initial variables.

• These combinations are done in such a way that the new variables (i.e., principal components) are
uncorrelated and most of the information within the initial variables is squeezed or compressed into the
first components.

• So, the idea is 10-dimensional data gives you 10 principal components, but PCA tries to put maximum
possible information in the first component, then maximum remaining information in the second and so
on, until having something like shown in the scree plot below.
Principal component analysis

• In Principal Component Analysis (PCA), PC1 (first principal component) and PC2 (second principal
component) are uncorrelated because PCA is designed to create new axes (principal components) that are
orthogonal (i.e., at 90 degrees) to each other. This means that:

• PC1 and PC2 capture different types of variance: PC1 explains the maximum variance in the data, and
PC2 explains the second highest variance in the data but in a direction that is orthogonal to PC1. Since
they are orthogonal, their dot product is zero, implying no correlation.

• PC1 · PC2 = |PC1| |PC2| cos θ, where θ = 90°

• Therefore PC1 · PC2 = 0

• No linear relationship: If two variables are uncorrelated, there is no linear relationship between them. In
PCA, this means that knowing the value of PC1 gives you no information about PC2, and vice versa.



Principal component analysis

[Figure: 2D-to-1D projection of data points across an axis.]


Principal component analysis

PC1 carries more information than the other PCs.



Principal component analysis

How Do You Do a Principal Component Analysis?:


• Standardize the range of continuous initial variables
• Compute the covariance matrix to identify correlations
• Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components
• Create a feature vector to decide which principal components to keep
• Recast the data along the principal components axes.

https://www.askpython.com/python/examples/principal-component-analysis
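These steps can be reproduced with scikit-learn, which performs the covariance and eigen computations internally; a minimal sketch on the built-in iris data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Step 1: standardize the variables
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: PCA handles the covariance matrix, its eigenvectors and
# eigenvalues, component selection, and projection onto the new axes
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print('Explained variance ratio:', pca.explained_variance_ratio_)
# The components are orthogonal, so their dot product is (numerically) zero
print('PC1 . PC2 =', np.dot(pca.components_[0], pca.components_[1]))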


Data transformation and dimensionality reduction

Factor Analysis
➢ Factor Analysis (FA) is an exploratory data analysis method used
to search influential underlying factors or latent variables from a
set of observed variables. It helps in data interpretations by
reducing the number of variables. It extracts maximum common
variance from all variables and puts them into a common score.
➢ Factor analysis is a linear statistical model. It is used to explain
the variance among the observed variable and condense a set of
the observed variable into the unobserved variable called
factors.
Data transformation and dimensionality reduction

Types of Factor Analysis


•Exploratory Factor Analysis: It is the most popular factor analysis
approach among social and management researchers. Its basic
assumption is that any observed variable is directly associated with
any factor.
•Confirmatory Factor Analysis (CFA): Its basic assumption is that each factor is associated with a particular set of observed variables. CFA confirms what is expected on that basis.


Data transformation and dimensionality reduction

How does Factor Analysis Work?


The primary objective of factor analysis is to reduce the number of
observed variables and find unobservable variables. These
unobserved variables help the market researcher to conclude the
survey. This conversion of the observed variables to unobserved
variables can be achieved in two steps:
1. Factor Extraction
2. Factor Rotation



Data transformation and dimensionality reduction

•Factor Extraction: In this step, the number of factors and


approach for extraction selected using variance partitioning
methods such as principal components analysis and common factor
analysis.
•Factor Rotation: In this step, rotation tries to convert the factors into uncorrelated factors; the main goal of this step is to improve the overall interpretability. Many rotation methods are available, such as the Varimax, Quartimax, and Promax rotation methods.
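A minimal sketch of factor extraction with a varimax rotation, using scikit-learn's FactorAnalysis on the built-in iris data (the rotation parameter exists in recent scikit-learn versions; dedicated packages such as factor_analyzer offer more options):

from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)

# Extract 2 latent factors from the 4 observed variables, varimax-rotated
fa = FactorAnalysis(n_components=2, rotation='varimax')
scores = fa.fit_transform(X)

# Loadings: how strongly each observed variable relates to each factor
print(fa.components_)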
Data Munging

What is Data Munging?


Data munging, also known as data wrangling, is the process of converting
raw data into a more usable format. Often, data munging occurs as a
precursor to data analytics or data integration. High-quality data is essential
for sophisticated data operations.
The munging process typically begins with a large volume of raw data.
Data scientists will mung the data into shape by removing any errors or
inconsistencies.



Data Munging

The modern data munging process now involves six main steps:
1. Discover: First, the data scientist performs a degree of data exploration.
This is a first glance at the data to establish the most important patterns. It
also allows the scientist to identify any major structural issues, such as
invalid data formats.



Data Munging

2. Structure: Raw data might not have an appropriate structure for the
intended usage. The data scientists will organize and normalize the data so
that it’s more manageable. This also makes it easier to perform the next
steps in the munging process.



Data Munging

3. Clean: Raw data can contain corrupt, empty, or invalid cells. There may also be values that require conversion, such as dates and currencies. For instance, the state in a customer's address might appear as Texas, Tex, or TX. The cleaning process will standardize this value for every address.


Data Munging

4. Enrich: Data enrichment is the process of filling in missing details by


referring to other data sources. Data enrichment lets you fill in all address
fields by looking up the missing values elsewhere, such as in the CRM
database or a postal records lookup.



Data Munging

5. Validate: Finally, it’s time to ensure that all data values are logically
consistent. This means checking things like whether all phone numbers
have ten digits, that there are no numbers in name fields, and that all
dates are valid calendar dates. Data validation also involves some
deeper checks, such as ensuring that all values are compatible with the
specified data type.



Data Munging

6. Publish: When the data munging process is complete, the data


science team will push it towards its final destination. Often this is a
data repository, where it will integrate with data from other sources.
This will make the munged data permanently available to all
consumers.



Data Munging

Web Scraping with Python

Web scraping is an automated method used to extract large amounts of data from websites. The data on websites is unstructured; web scraping helps collect this unstructured data and store it in a structured form. There are different ways to scrape websites, such as online services, APIs, or writing your own code.


Data Munging

To extract data using web scraping with Python, you need to follow these basic steps:
1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format


Data Munging

Libraries used for Web Scraping


As we know, Python has various applications, and there are different libraries for different purposes. In our demonstration, we will be using the following libraries:

•Selenium: Selenium is a web testing library. It is used to automate


browser activities.



Data Munging

Libraries used for Web Scraping


•BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that are helpful for extracting data easily.
•Pandas: Pandas is a library used for data manipulation and analysis. It is used to extract the data and store it in the desired format.


Data Munging

Web Scraping Example : Scraping Flipkart Website


Step 1: Find the URL that you want to scrape
For this example, we are going scrape Flipkart website to extract the Price, Name,
and Rating of Laptops. The URL for this page
is https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-
/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2.

9 December 2024 Dr. Kumod Kumar Gupta Data Analytics Unit-4 92


Data Munging

Step 2: Inspecting the Page


The data is usually nested in tags. So, we inspect the page to see,
under which tag the data we want to scrape is nested. To inspect the
page, just right click on the element and click on “Inspect”.



Data Munging

When you click on the “Inspect” tab, you will see a “Browser
Inspector Box” open.

Data Munging

Step 4: Write the code


First, let’s create a Python file. To do this, open the terminal
in Ubuntu and type gedit <your file name> with .py
extension.
gedit web-s.py
First, let us import all the necessary libraries:
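The import block was shown as an image on the original slide; a plausible reconstruction, assuming Selenium, BeautifulSoup (bs4) and pandas are installed:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd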



Data Munging

To configure the web driver to use the Chrome browser, we have to set the path to chromedriver.

Refer the below code to open the URL:
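The slide's code was an image; a minimal sketch, assuming chromedriver is installed and on the system PATH (otherwise pass its location via Selenium's Service object):

from selenium import webdriver

driver = webdriver.Chrome()   # assumes chromedriver is on PATH
url = 'https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2'
driver.get(url)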



Data Munging

Find the div tags with the respective class names, extract the data, and store the data in a variable:
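The original extraction code was an image; a hedged reconstruction using BeautifulSoup on the page source from the driver above. The class names are hypothetical placeholders; Flipkart's real class names change frequently, so inspect the page for the current ones:

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')

names, prices, ratings = [], [], []
# 'product-card' etc. are placeholder class names; use the inspected ones
for card in soup.find_all('div', attrs={'class': 'product-card'}):
    names.append(card.find('div', attrs={'class': 'product-name'}).text)
    prices.append(card.find('div', attrs={'class': 'product-price'}).text)
    ratings.append(card.find('div', attrs={'class': 'product-rating'}).text)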



Data Munging

Step 5: Run the code and extract the data


To run the code, use the below command:
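The command itself was shown as an image; assuming the file was saved as web-s.py, it would be along these lines:

python web-s.py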

Step 6: Store the data in a required format


After extracting the data, you might want to store it in a format.
This format varies depending on your requirement. For this
example, we will store the extracted data in a CSV (Comma
Separated Value) format.
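A minimal sketch of this step with pandas, reusing the lists collected in the extraction step above:

import pandas as pd

df = pd.DataFrame({'Product Name': names, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')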



Faculty Video Links, YouTube & NPTEL Video Links and Online Courses Details

YouTube videos:

https://www.youtube.com/watch?v=q4pyaVZjqk0

https://www.youtube.com/watch?v=7sJaRHF03K8

https://www.youtube.com/watch?v=mKxFfjNyj3c

https://www.youtube.com/watch?v=azXCzI57Yfc

https://www.youtube.com/watch?v=83x5X66uWK0


Glossary Questions

1. Explain the KDD process with a diagram.
2. Explain Data Reduction techniques.
3. What is Data Pre-processing?
4. What is Data Integration?
5. Differentiate the Tight Coupling and Loose Coupling approaches of Data Integration.
6. What is Data Transformation? Why is it needed?
7. What are the benefits of Data Reduction?
8. What is Binning?
9. What is Data Aggregation?
10. Write short notes on:
I. Discretization Operation
II. Data Cleaning
III. Histogram
IV. Data Compression
Expected Questions for End Semester Exam

1. Explain forms of Data Pre-processing.
2. Explain Data Attributes and their types.
3. Explain the techniques used to extract useful variables.
4. Explain the KDD process with a diagram.
5. Explain Data Cleaning techniques.
6. Explain Missing Values and Noisy Data.
7. Explain Binning techniques.
8. What is Data Reduction?
9. Write short notes on:
I. Data Cube Aggregation
II. Data Compression
III. Numerosity Reduction
