Unit 4 DA Revised
Unit: 4
1. Security
2. Digital Advertising
3. E-Commerce
4. Publishing
5. Massively Multiplayer Online Games
6. Backend Services and Messaging
7. Project Management & Collaboration
8. Real-time Monitoring Services
9. Live Charting and Graphing
10. Group and Private Chat
PO8 : Ethics
PO10 : Communication
CO.K PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 2 2 2 3 3 - - - - - - -
CO2 3 2 3 2 3 - - - - - - -
CO3 3 2 3 2 3 - - - - - - -
CO4 3 2 3 2 3 - - - - - - -
CO5 3 2 3 3 3 - - - - - - -
Program Specific Outcomes (PSO): CO-PSO Mapping
CO PSO1 PSO2 PSO3 PSO4
CO1 3 - - -
CO2 3 2 - -
CO3 3 3 - -
CO4 3 3 - -
CO5 3 3 - -
Program Educational Objectives (PEOs)
• Pursue higher education and a professional career to excel in the field of Artificial Intelligence and Machine Learning.
Data analytics (DA) is the process of examining data sets in order to find trends and draw
conclusions about the information they contain. Increasingly, data analytics is performed with
the aid of specialized systems and software.
4. To describe the services an operating system provides to users, processes, and other systems
5. To discuss Data Munging, Data Wrangling, APIs, and other tools for scraping data from the web/internet using Python.
Exploratory Data Analysis (EDA) is the process of investigating a dataset to summarize its
key characteristics, often using statistical techniques and visualizations.
• EDA is crucial in understanding the data before applying more complex statistical
models or machine learning algorithms.
• It helps in identifying patterns, spotting anomalies, testing hypotheses, and checking
assumptions.
Objectives of EDA:
1. Discover patterns: Identify trends, clusters, and relationships.
2. Spot anomalies or outliers: Detect unusual or extreme values.
3. Test hypotheses: Evaluate assumptions about the data.
4. Determine relationships: Understand how different variables interact.
Benefits of EDA:
• Improved Data Understanding: Helps you understand the structure and relationships within
the data.
• Error Detection: Helps identify data errors, missing values, or outliers that could affect results.
• Hypothesis Generation: EDA can help in forming new hypotheses for further analysis.
• Model Selection: Helps in choosing the right type of machine learning models by
understanding data distributions and relationships.
2. Bivariate Analysis
•Bivariate Non-Graphical EDA:
• Focuses on the relationship between two variables using measures like correlation
coefficients (e.g., Pearson’s correlation).
•Bivariate Graphical EDA:
• Uses visual tools like scatter plots, bar plots, and line charts to explore
relationships between two variables.
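A minimal sketch of bivariate EDA in Python, assuming pandas and matplotlib are available; the two columns (height and weight) are made-up illustration data:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with two numeric columns
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 175, 180],
    "weight": [50, 56, 61, 66, 72, 79],
})

# Non-graphical bivariate EDA: Pearson's correlation coefficient
r = df["height"].corr(df["weight"], method="pearson")
print(f"Pearson correlation: {r:.2f}")

# Graphical bivariate EDA: scatter plot of the two variables
df.plot.scatter(x="height", y="weight", title="Height vs. Weight")
plt.show()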
Missing Values:-
The data has some missing values in its columns. There are three major categories of
missing values:
1. MCAR (Missing completely at random): These are values that are randomly missing
and do not depend on any other values.
2. MAR (Missing at random): These values are dependent on some additional features.
3. MNAR (Missing not at random): There is a reason behind why these values are
missing.
MCAR (Missing Completely at Random) refers to a situation where the missing data is independent of
both the observed and unobserved data in the dataset.
• In other words, the likelihood of any particular value being missing is unrelated to any of the variables
in the dataset. The missingness occurs purely by chance, and there’s no systematic reason for why the
data is missing.
Characteristics of MCAR:
• Completely Random: The missing data is random and not influenced by any variables (neither the
missing variable itself nor any other variables).
• No Bias Introduced: If data is MCAR, dropping the missing data or filling it with simple imputations
(like the mean) won’t introduce bias in the analysis.
• Hard to Prove: It’s difficult to verify that data is MCAR because you need to show that the missing
data is not related to any variables.
Example of MCAR:
• Imagine you’re conducting a survey where some participants don’t answer a question about
their favorite color.
• If the missing responses are completely random and not related to the participants’ age,
gender, or any other factor, this is MCAR.
• The fact that the data is missing is purely due to chance, like someone accidentally skipping a
question.
Handling MCAR:
2. Simple Imputation: You can use simple imputation methods, such as filling missing values with the
mean, median, or mode; since the data is missing completely at random, this won't introduce bias.
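A minimal sketch of simple imputation with pandas; the column name and values are made-up illustration data:

import numpy as np
import pandas as pd

# Hypothetical column with values missing completely at random
df = pd.DataFrame({"score": [7.0, np.nan, 5.0, 8.0, np.nan, 6.0]})

# Mean imputation: replace missing values with the column mean
df["mean_imputed"] = df["score"].fillna(df["score"].mean())

# Median imputation works the same way
df["median_imputed"] = df["score"].fillna(df["score"].median())
print(df)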
Impact of MCAR:
•No Bias: Since the missingness is completely random, removing or imputing the data won’t
significantly affect the results or introduce bias.
•Efficiency Loss: While MCAR doesn't introduce bias, removing too much data can reduce the sample
size, leading to a loss in the statistical power of your analysis.
Conclusion:
When data is MCAR, it’s the simplest case to handle. You can safely remove or fill in the missing data
without worrying about bias.
However, confirming that your data is truly MCAR can be challenging in practice, as you need to
demonstrate that the missing data is completely unrelated to any other factors.
Imputation is the process of filling in missing data values with substituted values to allow
for complete data analysis.
2. For MAR (Missing at Random) data, where the missingness depends on other observed
variables, specific imputation techniques can be used to estimate the missing values.
Imputation Techniques for MAR:
a) Mean, Median, or Mode Imputation:
o Method: Replace the missing values with the mean, median, or mode of the
observed data for a particular variable.
o Usage: Simple but effective when the relationship between variables isn’t complex.
o Limitation: Can underestimate variability and distort relationships between
variables.
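One technique that uses the other observed variables, and is therefore suited to MAR data, is KNN imputation. A minimal sketch with scikit-learn's KNNImputer; the columns (age, years of education, income) and values are made-up illustration data:

import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: income is missing for some rows, but age and
# years of education are observed and related to the missingness (MAR)
X = np.array([
    [25, 12, 30000],
    [32, 16, 52000],
    [41, 16, np.nan],   # missing income
    [29, 14, 38000],
    [50, 18, np.nan],   # missing income
])

# Each missing value is estimated from the k most similar rows,
# so the imputed value depends on the other observed variables
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)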
Implications of imputation
2. The spread of the data will change. After imputation, the spread of the data
will be smaller than the spread we would see if we simply ignored the missing values.
This can be problematic, as underestimating the spread of the data can yield
over-confident inferences in downstream analysis.
An outlier is an observation that lies far away from the rest of the data. Outliers can be the result
of a mistake during data collection, or they can simply indicate genuine variance in your data. Some of
the methods for detecting and handling outliers are:
•Box Plot
•Scatter plot
•Z-score
•IQR(Inter-Quartile Range)
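A minimal sketch of the IQR rule and the box plot in Python, assuming pandas; the numbers are made-up illustration data:

import pandas as pd

# Hypothetical numeric column with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 14, 11, 95])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print("Outliers:", outliers.tolist())

# Box plot: outliers appear as individual points beyond the whiskers
s.plot.box()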
Distance-Based Methods:
•K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest neighbors are
far away from them.
•Local Outlier Factor (LOF): This method calculates the local density of data points and identifies
outliers as those with significantly lower density compared to their neighbors.
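A minimal sketch of LOF with scikit-learn; the 2-D points are made-up illustration data:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical 2-D data: a tight cluster plus one far-away point
X = np.array([[1.0, 1.1], [1.1, 0.9], [0.9, 1.0], [1.0, 1.0], [8.0, 8.0]])

# LOF compares each point's local density with that of its neighbours;
# fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)
print(labels)  # the far-away point should be labelled -1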
3. Clustering-Based Methods:
•Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN clusters
data points based on their density and identifies outliers as points that do not belong to any cluster (a sketch follows after this list).
•Hierarchical clustering: Hierarchical clustering involves building a hierarchy of clusters by iteratively
merging or splitting clusters based on their similarity. Outliers can be identified as clusters containing
only a single data point or clusters significantly smaller than others.
https://www.geeksforgeeks.org/machine-learning-outlier/
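A minimal sketch of DBSCAN-based outlier detection with scikit-learn; the 2-D points, eps, and min_samples are made-up illustration values:

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D data: two dense groups and one isolated point
X = np.array([[0, 0], [0.1, 0.2], [0.2, 0.1],
              [5, 5], [5.1, 5.2], [5.2, 4.9],
              [20, 20]])

# DBSCAN groups points that lie in dense regions; points that do not
# belong to any cluster are labelled -1 and can be treated as outliers
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print(labels)  # the isolated point is labelled -1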
To remove outliers from a dataset, you can follow different approaches depending on the
method of detection and the type of data you're working with. Here’s a simple guide:
Steps to Remove Outliers:
1. Identify the Outliers: First, you need to detect the outliers using methods such as the
Z-score, Interquartile Range (IQR), or visualization tools like box plots.
a) Using Z-score (for numerical data):
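A minimal sketch of removing outliers with the Z-score, assuming pandas and NumPy; the data and the cut-off of 3 are illustrative choices, not fixed rules:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical numeric column: 50 typical values plus one extreme value
values = np.concatenate([rng.normal(loc=12, scale=1.0, size=50), [95.0]])
df = pd.DataFrame({"value": values})

# Z-score: how many standard deviations each point lies from the mean
z = (df["value"] - df["value"].mean()) / df["value"].std()

# Keep only rows whose absolute Z-score is below the chosen cut-off
df_no_outliers = df[z.abs() < 3]
print(len(df), "->", len(df_no_outliers))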
• Time series analysis is a method used to analyze data that is collected over time.
• The goal is to understand the patterns, trends, and behaviors of the data and to make predictions for
the future.
• Common examples of time series data include stock prices, weather measurements, and sales figures
over time.
Here are some key concepts in time series analysis:
1. Trend: A long-term increase or decrease in the data. For example, if sales data shows a steady rise over
years, that's a trend.
2. Seasonality: Recurring patterns or cycles in the data at specific times, such as higher sales during the
holiday season or higher temperatures in summer.
3. Noise: Random variation in the data that can't be explained by trends or seasonality. Noise makes it
harder to see the patterns clearly.
4. Stationarity: A time series is stationary if its statistical properties (like mean and variance) don't change
over time. If the data shows trends or seasonality, it is non-stationary.
5. Autocorrelation: This measures how a time point in the series is related to earlier time points. In simple
terms, it helps you see if past values are influencing future values.
• Exponential Smoothing: A method that gives more weight to recent data points for making predictions (a sketch follows below).
• LSTM (Long Short-Term Memory networks): A type of neural network that is often
used for time series forecasting, especially when working with deep learning.
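A minimal sketch of simple exponential smoothing using the ewm method in pandas; the sales series and the smoothing factor alpha are made-up illustration values:

import pandas as pd

# Hypothetical monthly sales figures
sales = pd.Series([100, 102, 101, 105, 110, 108, 115, 120],
                  index=pd.date_range("2023-01-01", periods=8, freq="MS"))

# Simple exponential smoothing: recent observations get more weight;
# alpha close to 1 reacts quickly, alpha close to 0 smooths heavily
smoothed = sales.ewm(alpha=0.5, adjust=False).mean()
print(smoothed)

# For simple exponential smoothing, the one-step-ahead forecast
# is the last smoothed value
print("Next-period forecast:", smoothed.iloc[-1])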
1. Forecasting
• Sales Forecasting: Retailers and businesses analyze historical sales data to predict future demand,
allowing them to adjust inventory, marketing strategies, and production schedules accordingly.
• Stock Market Predictions: Financial organizations use time series analysis to forecast stock prices,
currency exchange rates, and interest rates, helping them make investment decisions.
2. Understanding Patterns
•Seasonality Detection: Organizations can detect seasonal trends, such as higher sales during the
holiday season or increased electricity consumption during summer, and plan operations around these
patterns.
•Customer Behavior Analysis: Time series analysis helps companies understand when customers are
most active or likely to make purchases, aiding in targeted marketing.
Types of Segmentation:
1.Market Segmentation:
1. Definition: In marketing, segmentation refers to dividing a broad consumer or business market into
subgroups based on shared characteristics.
2. Types:
1. Demographic Segmentation: Dividing based on age, gender, income, education, etc.
2. Geographic Segmentation: Dividing based on location (country, city, region).
3. Behavioral Segmentation: Based on behavior patterns, like buying habits, brand loyalty.
4. Psychographic Segmentation: Based on lifestyle, personality traits, values, and interests.
3. Purpose: Helps companies target specific groups with tailored marketing strategies, leading to more
efficient use of resources and higher customer satisfaction.
2. Image Segmentation:
1. Definition: In image processing, segmentation refers to dividing an image into different parts or
regions to make it easier to analyze and interpret.
2. Types:
1. Thresholding: Divides the image into segments based on pixel intensity (e.g., separating dark
objects from bright backgrounds); a sketch follows after this list.
2. Edge-Based Segmentation: Detects boundaries or edges of objects in the image.
3. Region-Based Segmentation: Divides the image into regions based on similarities (color,
texture).
4. Clustering Methods (e.g., K-means): Group pixels that are similar in characteristics, such as
color or intensity.
3. Purpose: Used for tasks like object detection, medical image analysis, and facial recognition, making
images easier for computers to process.
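A minimal sketch of threshold-based image segmentation with NumPy; the 5x5 image and the threshold of 100 are made-up illustration values (real pipelines typically use OpenCV or scikit-image):

import numpy as np

# Hypothetical 5x5 grayscale image: a bright object on a dark background
img = np.array([
    [ 10,  12,  11,  10,  12],
    [ 11, 200, 210,  12,  10],
    [ 10, 205, 220,  11,  12],
    [ 12,  11,  10,  12,  11],
    [ 10,  12,  11,  10,  12],
])

# Thresholding: pixels brighter than the threshold belong to the object (1),
# the rest belong to the background (0)
threshold = 100
mask = (img > threshold).astype(int)
print(mask)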
Dimensionality Reduction refers to the technique of reducing the number of dimensions in a data feature set.
Machine learning datasets often contain hundreds of columns (i.e., features). As an intuition, picture the points
of such a dataset filling a massive sphere in three-dimensional space; applying dimensionality reduction brings
the number of columns down to a manageable count, for example by flattening the three-dimensional sphere into a
two-dimensional object (a circle).
• Principal Component Analysis (PCA) is used to reduce the number of variables of a dataset into a smaller number of variables while
preserving/maintaining the significant patterns and trends in the dataset.
• Principal components are new variables that are constructed as linear combinations or mixtures of the
initial variables.
• These combinations are done in such a way that the new variables (i.e., principal components) are
uncorrelated and most of the information within the initial variables is squeezed or compressed into the
first components.
• So the idea is: 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum
possible information in the first component, then the maximum remaining information in the second, and so
on. Plotting the variance explained by each component (a scree plot) shows this rapid drop-off.
Principal component analysis
• In Principal Component Analysis (PCA), PC1 (first principal component) and PC2 (second principal
component) are uncorrelated because PCA is designed to create new axes (principal components) that are
orthogonal (i.e., at 90 degrees) to each other. This means that:
• PC1 and PC2 capture different types of variance: PC1 explains the maximum variance in the data, and
PC2 explains the second highest variance in the data but in a direction that is orthogonal to PC1. Since
they are orthogonal, their dot product is zero, implying no correlation.
• PC1 · PC2 = 0 (their dot product is zero)
• No linear relationship: If two variables are uncorrelated, there is no linear relationship between them. In
PCA, this means that knowing the value of PC1 gives you no information about PC2, and vice versa.
2D to 1D projection across a principal axis
https://www.askpython.com/python/examples/principal-component-analysis
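A minimal sketch of PCA with scikit-learn, projecting made-up 2-D points onto their first principal component (2D to 1D):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2-D data with most of its variance along one direction
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# Keep only the first principal component: 2-D -> 1-D
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print(X_1d.ravel())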
Factor Analysis
➢ Factor Analysis (FA) is an exploratory data analysis method used
to search influential underlying factors or latent variables from a
set of observed variables. It helps in data interpretations by
reducing the number of variables. It extracts maximum common
variance from all variables and puts them into a common score.
➢ Factor analysis is a linear statistical model. It is used to explain
the variance among the observed variables and condense a set of
observed variables into unobserved variables called factors.
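A minimal sketch of factor analysis with scikit-learn's FactorAnalysis; the simulated data with two hidden factors is made up for illustration (dedicated packages such as factor_analyzer also provide rotations and adequacy tests):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate scores on six observed variables driven by two hidden factors
latent = rng.normal(size=(100, 2))            # the unobserved factors
loadings = rng.normal(size=(2, 6))            # how the factors drive the variables
X = latent @ loadings + rng.normal(scale=0.3, size=(100, 6))

# Extract two common factors from the observed variables
fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(fa.components_)         # estimated factor loadings
print(fa.transform(X)[:5])    # factor scores for the first five observations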
Data transformation and dimensionality reduction
The modern data munging process now involves six main steps:
1. Discover: First, the data scientist performs a degree of data exploration.
This is a first glance at the data to establish the most important patterns. It
also allows the scientist to identify any major structural issues, such as
invalid data formats.
2. Structure: Raw data might not have an appropriate structure for the
intended usage. The data scientists will organize and normalize the data so
that it’s more manageable. This also makes it easier to perform the next
steps in the munging process.
3. Clean: Raw data can contain corrupt, empty, or invalid cells. There may
also be values that require conversions, such as dates and currencies. For
instance, the state in a customer's address might appear as Texas, Tex, or TX.
The cleaning process will standardize this value for every address (a sketch follows after the Validate step).
5. Validate: Finally, it’s time to ensure that all data values are logically
consistent. This means checking things like whether all phone numbers
have ten digits, that there are no numbers in name fields, and that all
dates are valid calendar dates. Data validation also involves some
deeper checks, such as ensuring that all values are compatible with the
specified data type.
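A minimal sketch of the Clean and Validate steps with pandas, reusing the Texas/Tex/TX and ten-digit phone number examples above; the column names and records are made up for illustration:

import pandas as pd

# Hypothetical customer records with inconsistent and invalid values
df = pd.DataFrame({
    "state": ["Texas", "Tex", "TX", "tx"],
    "phone": ["5125551234", "512555123", "5125559999", "5125550000"],
})

# Clean: standardize every spelling of the state to a single code
state_map = {"texas": "TX", "tex": "TX", "tx": "TX"}
df["state"] = df["state"].str.lower().map(state_map).fillna(df["state"])

# Validate: flag phone numbers that do not have exactly ten digits
df["valid_phone"] = df["phone"].str.fullmatch(r"\d{10}")
print(df)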
To extract data using web scraping with python, you need to follow these basic
steps:
1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format
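A minimal sketch of these steps using the requests and BeautifulSoup libraries; the URL and the assumption that the data of interest sits in h2 tags are purely hypothetical:

import csv
import requests
from bs4 import BeautifulSoup

# Steps 1-2: fetch the page you identified and inspected in the browser
url = "https://example.com/products"   # hypothetical URL
response = requests.get(url, timeout=10)

# Steps 3-4: parse the HTML and pick out the elements you want
soup = BeautifulSoup(response.text, "html.parser")
names = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# Steps 5-6: extract the data and store it in the required format (CSV here)
with open("products.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["product_name"])
    writer.writerows([[n] for n in names])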
When you click on the “Inspect” tab, you will see a “Browser
Inspector Box” open.
https://www.youtube.com/watch?v=q4pyaVZjqk0
https://www.youtube.com/watch?v=7sJaRHF03K8
https://www.youtube.com/watch?v=mKxFfjNyj3c
https://www.youtube.com/watch?v=azXCzI57Yfc
https://www.youtube.com/watch?v=83x5X66uWK0
References
Expected Questions for End Semester Exam