DS Unit 2
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Sampling
DATA CLEANING
Missing Data:
This situation arises when some attribute values are absent from the dataset. It can be handled in various ways.
➢ Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
➢ Fill the Missing values:
There are various ways to do this. You can fill the missing values manually, with the attribute mean, or with the most probable value.
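As an illustration, here is a minimal sketch of both options using pandas; the DataFrame and its "age" column are hypothetical:

```python
# A hedged sketch: handling missing values in a hypothetical "age" column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 30, 28, np.nan]})  # made-up data

# Option 1: ignore (drop) tuples that contain missing values
df_dropped = df.dropna()

# Option 2: fill missing values with the attribute mean
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
```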
DATA CLEANING
Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It
can be handled in the following ways:
❑ Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately: all values in a segment can be replaced by the segment's mean, or the segment's boundary values can be used.
Binning is a way to group a number of more or less continuous values into a smaller number of "bins". For example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals (see the first sketch after this list).
❑ Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
❑ Clustering:
This approach groups similar data into clusters. Values that fall outside all clusters can then be treated as outliers, i.e. noise (see the second sketch after this list).
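First, a minimal sketch of equal-size binning with smoothing by bin means, using only numpy; the data values are made up for illustration:

```python
import numpy as np

# Twelve made-up values; sorted first, then split into 3 equal-size bins
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 3)

# Smoothing by bin means: replace every value in a bin with that bin's mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # bin means: 9.0, 22.75, 29.25
```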
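Second, a minimal sketch of flagging noise with clustering. Scikit-learn's DBSCAN is used here as one possible choice of clustering algorithm (the notes do not name one); points that fall outside every cluster get the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2-D points: two tight groups plus one far-away outlier
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.1], [8.2, 7.9], [25.0, 25.0]])

labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print(labels)  # -1 marks the point outside all clusters
```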
DATA TRANSFORMATION
This step transforms the data into forms appropriate for the mining process. It involves the
following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (e.g. 0.0 to 1.0), using methods such as min-max or z-score normalization (see the sketch after this list).
2. Attribute Construction (Feature Construction):
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
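A minimal sketch of the two normalization methods named in point 1, applied to a hypothetical one-dimensional attribute:

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # made-up attribute values

# Min-max normalization: scales values into the range [0.0, 1.0]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)
print(x_zscore)
```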
DATA REDUCTION
Data reduction is a preprocessing technique that obtains a reduced representation of the dataset, one that is much smaller in volume than the available data.
➢ The integrity of the original data should be maintained even after the reduction in data volume.
➢ It should produce the same (or nearly the same) analytical results as the original data.
Need of Data Reduction
➢ Visualization
➢ Increase the efficiency of data science/mining algorithms
➢ Needs less memory space
DATA REDUCTION
3. Numerosity Reduction:
Data is replaced by an estimated/alternative representation. This enables storing a model of the data instead of the whole data, for example regression models (see the sketch below).
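A minimal sketch of parametric numerosity reduction on synthetic data: rather than storing all 100 points, only the two fitted regression coefficients are kept:

```python
import numpy as np

# Synthetic data: a noisy line, purely for illustration
x = np.arange(100, dtype=float)
y = 3.0 * x + 5.0 + np.random.normal(scale=2.0, size=100)

slope, intercept = np.polyfit(x, y, deg=1)  # fit y ~ slope * x + intercept

# Store only two numbers (the model) instead of 100 data points
print(f"model: y = {slope:.2f} * x + {intercept:.2f}")
```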
DATA REDUCTION
4. Dimensionality Reduction:
➢ Remove redundant attributes.
➢ This reduces the size of the data by encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are:
➢ Wavelet transforms and PCA (Principal Component Analysis), sketched below.
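A minimal sketch of dimensionality reduction with scikit-learn's PCA, using the built-in Iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # shape (150, 4): four attributes
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # shape (150, 2): two components

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # variance retained per component
```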
DATA PRE-PROCESSING
Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset to a common scale, so that no single variable dominates the others.
For feature scaling, we can import the StandardScaler class from sklearn.preprocessing.
After applying this scaler, each feature has zero mean and unit variance, so most values fall within a small range around 0 (roughly -1 to 1 for the bulk of typical data).
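A minimal sketch of the scaler in use; the feature matrix X (age and salary on very different scales) is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: column 0 is age, column 1 is salary
X = np.array([[25, 48000],
              [32, 54000],
              [41, 61000],
              [29, 52000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column: zero mean, unit variance
print(X_scaled)
```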
EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions, with the help of summary statistics and graphical representations.
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their
main characteristics, often employing data visualization methods. It helps determine how best to manipulate data
sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a
hypothesis, or check assumptions.
EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate. Originally
developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used
method in the data discovery process today.
Exploratory data analysis is a simple, mostly visual, analysis technique. It is an approach to analyzing data sets to summarize their main characteristics. When you are trying to build a machine learning model, you need to be sure whether your data makes sense or not.
Exploratory data analysis (EDA) is the task of analyzing data using simple statistical tools and simple plotting tools.
IMPORTANCE OF EXPLORATORY DATA ANALYSIS
The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to
any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the
right questions. EDA can help answer questions about standard deviations, categorical variables, and
confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more
sophisticated data analysis or modeling, including machine learning.
NEED OF EXPLORATORY DATA ANALYSIS
Every machine learning problem-solving effort starts with EDA. It is probably one of the most important parts of a machine learning project. As the market grows, the size of data also grows, and it becomes harder for companies to make decisions without analyzing it properly.
With the use of charts and graphs, one can make sense of the data and check whether any relationships exist.
Various plots are used to reach conclusions, which helps the company make firm and profitable decisions. Once exploratory data analysis is complete and insights are drawn, its findings can be used for supervised and unsupervised machine learning modelling.
EXPLORATORY DATA ANALYSIS TOOLS
Some of the most common data science tools used to create an EDA
include:
Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined
with dynamic typing and dynamic binding, make it very attractive for
rapid application development, as well as for use as a scripting or glue
language to connect existing components together. Python and EDA can
be used together to identify missing values in a data set, which is important so you can decide how to handle missing values for machine learning (see the sketch after these tool descriptions).
R: An open-source programming language and free software
environment for statistical computing and graphics supported by the R
Foundation for Statistical Computing. The R language is widely used
among statisticians in data science in developing statistical
observations and data analysis.
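A minimal sketch of the missing-value check mentioned under Python above; the DataFrame and its columns are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [170, np.nan, 165],
                   "weight": [65, 70, np.nan]})  # made-up data

print(df.isnull().sum())   # count of missing values per column
print(df.isnull().mean())  # fraction of missing values per column
```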
EDA LEVEL OF ANALYSIS (STATISTICS: QUANTITATIVE DATA
ANALYSIS)
The level of EDA depends on the kind of quantitative data analysis required; the number of variables/features considered for a particular case study is one deciding factor. Three different types of analysis are listed below:
Univariate analysis
Bivariate analysis
Multivariate analysis
UNIVARIATE ANALYSIS
Univariate analysis is the most basic form of statistical data analysis technique. When the data contains only one variable and does not deal with cause-and-effect relationships, univariate analysis is used.
The key objective of univariate analysis is simply to describe the data and find patterns within it. The relationship or pattern within the data can be found by looking at the mean, median, mode, dispersion, variance, range, standard deviation, etc.
e.g. In a survey of a classroom, the researcher may be looking to count the number of boys and girls. In this instance, the data simply reflects a single variable: the number of boys and girls.
Another example: suppose we are given height and weight as input, and the output to be predicted is a type (Obese, Slim, Fit). Can we predict the type by considering only weight? The answer is yes, but the data may overlap between Obese and Slim, or Slim and Fit, etc. So this works only when we have continuous data, not categorical data (see the sketch below).
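A minimal sketch of univariate analysis on a single hypothetical "weight" variable, combining summary statistics with a histogram:

```python
import matplotlib.pyplot as plt
import pandas as pd

weight = pd.Series([55, 62, 70, 85, 90, 66, 74, 81, 59, 95])  # made-up data

print(weight.describe())            # count, mean, std, min, quartiles, max
print("mode:", weight.mode().tolist())

weight.plot(kind="hist", bins=5, title="Distribution of weight")
plt.show()
```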
STATISTICAL TECHNIQUES TO CONDUCT UNIVARIATE ANALYSIS
MULTIVARIATE ANALYSIS
Multivariate analysis is a more complex form of statistical analysis technique, used when there are more than two variables in the dataset; it tries to understand the relationship of each variable with the others.
e.g. A doctor has collected data on cholesterol, blood pressure, and weight. She has also collected data on the eating habits of the subjects (e.g. how many ounces of red meat, fish, dairy products, and chocolate are consumed per week). She wants to investigate the relationship between the three measures of health and eating habits.
STATISTICAL TECHNIQUES TO CONDUCT MULTIVARIATE ANALYSIS
Specific statistical functions and techniques you can perform with EDA tools include:
▪ Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional
data containing many variables.
▪ Univariate visualization of each field in the raw dataset, with summary statistics.
▪ Bivariate visualizations and summary statistics that allow you to assess the relationship between each
variable in the dataset and the target variable you’re looking at.
▪ Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
▪ K-means Clustering is a clustering method in unsupervised learning where data points are assigned into K groups, i.e. the number of clusters, based on the distance from each group's centroid. The data points closest to a particular centroid are clustered under the same category. K-means Clustering is commonly used in market segmentation, pattern recognition, and image compression (see the sketch after this list).
▪ Predictive models, such as linear regression, use statistics and data to predict outcomes.
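A minimal sketch of K-means clustering as an EDA step, run on synthetic two-dimensional data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points drawn around 3 centers, purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)    # cluster index for every point

print(kmeans.cluster_centers_)    # one centroid per cluster
print(labels[:10])
```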
TYPES OF EXPLORATORY DATA ANALYSIS
Violin plots: an extension of box plots in which a kernel density plot is drawn alongside the box plot (see the sketch below).
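A minimal sketch of a violin plot, drawn with seaborn on its built-in "tips" example dataset:

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")                  # bundled example data
sns.violinplot(x="day", y="total_bill", data=tips)
plt.show()
```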
STACKED HISTOGRAMS FOR MULTIVARIATE DATA
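A minimal sketch of how such a stacked histogram can be drawn with matplotlib; the two groups are synthetic:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(60, 8, 200)   # made-up measurements for group A
group_b = rng.normal(75, 10, 200)  # made-up measurements for group B

plt.hist([group_a, group_b], bins=20, stacked=True,
         label=["group A", "group B"])
plt.legend()
plt.title("Stacked histogram of two groups")
plt.show()
```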