
DATA PRE-PROCESSING AND VISUALIZATION

Dr. Aakanksha Sharaff


Department of Computer Science and Engineering
National Institute of Technology Raipur C.G. India
 Data preprocessing is the process of preparing raw data and making it
suitable for a learning model. It is the first and crucial step in
creating a model.
 Real-world data generally contains noise and missing values, and may be
in an unusable format that cannot be fed directly to machine/deep
learning models. Data preprocessing comprises the tasks required to clean
the data and make it suitable for a machine learning model, which also
increases the accuracy and efficiency of the model.
DATA PRE-PROCESSING
DATA PRE-PROCESSING STEPS

 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction
 Data Sampling
DATA CLEANING

Missing Data:
This situation arises when some data is missing in the dataset. It can be handled in various ways.
➢ Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a
tuple.
➢ Fill in the missing values:
There are various ways to do this. You can choose to fill in the missing values manually, with the attribute mean, or with the
most probable value.
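As a rough illustration, both strategies can be sketched with pandas; the column names and values below are hypothetical.

```python
# A minimal sketch, assuming pandas is available; "Age" and "Salary"
# are made-up columns containing missing entries.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25.0, np.nan, 31.0], "Salary": [50000.0, 60000.0, np.nan]})

dropped = df.dropna()                           # ignore (drop) tuples with missing values
filled = df.fillna(df.mean(numeric_only=True))  # fill missing values with the attribute mean
```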
DATA CLEANING

Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It
can be handled in the following ways:
❑ Binning Method:

➢ This method works on sorted data in order to smooth it. The whole dataset is divided into segments of equal size, and each
segment is handled separately: one can replace all data in a segment by its mean, or use the segment's boundary values
(see the sketch after this list).
➢ Binning is a way to group a number of more or less continuous values into a smaller number of "bins". For example, if you have data
about a group of people, you might want to arrange their ages into a smaller number of age intervals.
❑ Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).
❑ Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
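To make the binning method concrete, here is a minimal pandas sketch of equal-width binning with smoothing by bin means; the age values are made up for illustration.

```python
# Equal-width binning with pd.cut, then smoothing each value by its bin mean.
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 35])

bins = pd.cut(ages, bins=3)                      # three equal-width bins
smoothed = ages.groupby(bins).transform("mean")  # replace each value by its bin mean
print(pd.DataFrame({"age": ages, "bin": bins, "smoothed": smoothed}))
```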
DATA TRANSFORMATION
This step is taken in order to transform the data into forms appropriate for the mining process. This involves the
following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (e.g. 0.0 to 1.0), using methods such as min-max or z-score normalization (a sketch follows this list).

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual levels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be
converted to “country”.
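A minimal sketch of the two normalization methods named in item 1, using scikit-learn; the toy values are made up.

```python
# Min-max scaling maps values into [0, 1]; z-score standardization
# centers each column at 0 with unit variance.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

X_minmax = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(X)  # (x - min) / (max - min)
X_zscore = StandardScaler().fit_transform(X)                        # (x - mean) / std
```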
DATA REDUCTION

Data reduction is a preprocessing technique that helps in obtaining a reduced representation of the dataset that
is much smaller in volume.
➢ The integrity of the original data should be maintained even after the reduction in data volume.
➢ It should produce the same analytical results as the original data.
Need for Data Reduction
➢ Visualization
➢ Increases the efficiency of data science/mining algorithms
➢ Requires less memory space
DATA REDUCTION

The various steps of data reduction are:

1. Data Cube Aggregation:

   Year 2019            Year 2020            Aggregated
   Quarter   Sales      Quarter   Sales      Year    Sales
   Q1        500        Q1        600        2019     800
   Q2        300        Q2        500        2020    1100

➢ It is a process in which information is gathered and expressed in summary form.
➢ The aggregation operation is applied to the data for the construction of the data cube.
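A minimal pandas sketch that reproduces the quarterly-to-yearly aggregation shown above:

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2019, 2019, 2020, 2020],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [500, 300, 600, 500],
})

# Aggregate quarterly sales up to yearly totals.
yearly = sales.groupby("year", as_index=False)["sales"].sum()
print(yearly)   # 2019 -> 800, 2020 -> 1100
```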
DATA REDUCTION
2. Attribute Subset Selection:
➢ Attribute subset selection reduces the dataset size by removing irrelevant or redundant features/attributes/dimensions.
➢ Only the highly relevant attributes should be used; the rest can be discarded.
➢ Keeping irrelevant attributes may be detrimental (non-beneficial), causing confusion for the mining algorithm employed. This can result in
discovering patterns of poor quality.
➢ Moreover, the added volume of irrelevant or redundant attributes can slow down the mining process.
➢ E.g. S. No., Roll No., Name, Age, DoB, Gender, and Marks are attributes. If the query is to find the number of males and females who secured more
than 80 marks, only Gender and Marks are relevant.
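A minimal pandas sketch of this example query; only Gender and Marks are kept, the other attributes are discarded, and the records are made up.

```python
import pandas as pd

students = pd.DataFrame({
    "RollNo": [1, 2, 3, 4],
    "Name":   ["A", "B", "C", "D"],
    "Gender": ["M", "F", "F", "M"],
    "Marks":  [85, 91, 78, 82],
})

subset = students[["Gender", "Marks"]]                          # attribute subset selection
counts = subset[subset["Marks"] > 80].groupby("Gender").size()  # males/females above 80 marks
print(counts)
```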
DATA REDUCTION

3. Numerosity Reduction:
Data is replaced by an estimated/alternative value. This enables storing a model of the data instead of the whole
data, for example: regression models.
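A minimal sketch of numerosity reduction with a regression model: instead of storing 100 points, we keep only the fitted slope and intercept. The data is randomly generated for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(100.0).reshape(-1, 1)
y = 3.0 * x.ravel() + 5.0 + np.random.default_rng(0).normal(0.0, 1.0, 100)

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)   # two parameters stand in for 100 data points
```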
DATA REDUCTION

4. Dimensionality Reduction:
➢ Remove redundant attributes.
➢ This reduces the size of the data through encoding mechanisms, which can be lossy or lossless. If the original
data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it
is called lossy. The two effective methods of dimensionality reduction are:
➢ Wavelet transforms and PCA (Principal Component Analysis).
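A minimal PCA sketch with scikit-learn, on random toy data:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(50, 4))   # 50 samples, 4 features

X_reduced = PCA(n_components=2).fit_transform(X)    # keep the 2 strongest components
print(X_reduced.shape)                              # (50, 2)
```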
DATA PRE-PROCESSING

 Data Preprocessing contains the following necessary steps:


• Getting the dataset
• Importing libraries
• Importing datasets
• Finding Missing Data
• Encoding Categorical Data
• Feature scaling
DATA PRE-PROCESSING (CONTD.)

Get the Dataset

 To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on
data. The collected data for a particular problem in a proper format is known as the dataset. To use a dataset in our code, we
usually put it into a CSV file, although sometimes we may also need to use an HTML or XLSX file. CSV stands for "Comma-
Separated Values". It is a file format that allows us to save tabular data, such as spreadsheets. It is useful for huge
datasets, and these datasets can be used directly in programs.
Importing Libraries
 In order to perform data preprocessing using Python, we need to import some predefined Python libraries. These libraries are
used to perform specific jobs. There are three specific libraries that we will use for data preprocessing:
•Numpy
•Pandas
•Matplotlib
Importing the Datasets
▪ Now we need to import the datasets we have collected for our machine learning project. To import a dataset, we
will use the read_csv() function of the pandas library, which reads a CSV file and performs various operations on it. Using this
function, we can read a CSV file locally as well as through a URL.
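A minimal sketch; the file name and URL below are placeholders.

```python
import pandas as pd

df = pd.read_csv("dataset.csv")                        # read a local CSV file
# df = pd.read_csv("https://example.com/dataset.csv") # ...or read it through a URL
print(df.head())                                       # inspect the first few rows
```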
DATA PRE-PROCESSING (CONTD.)
Handling Missing data
 If our dataset contains missing data, it may create a huge problem for our machine learning model. Hence it is
necessary to handle the missing values present in the dataset.
 There are mainly two ways to handle missing data:
• By deleting the particular row: The first way is commonly used to deal with null values. Here, we simply delete the
specific row or column that contains null values. But this way is not very efficient, and removing data may lead to a loss of
information, which will not give accurate output.
• By calculating the mean: Here, we calculate the mean of the column or row that contains the missing values
and put it in place of each missing value. This strategy is useful for features with numeric data such as age,
salary, year, etc. We will use this approach.
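A minimal sketch of the mean strategy using scikit-learn's SimpleImputer; "Age" and "Salary" are hypothetical columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Age": [25.0, np.nan, 31.0], "Salary": [50000.0, 60000.0, np.nan]})

imputer = SimpleImputer(strategy="mean")   # replace each NaN by its column mean
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```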
Encoding Categorical data
 Since a machine learning model works entirely on mathematics and numbers, a categorical variable in our dataset
may create trouble while building the model. So it is necessary to encode these categorical variables into
numbers.
 We have two methods to do so:
• LabelEncoder: when only two class values are present.
• OneHotEncoder: when more than two class values are present.
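A minimal sketch of both encoders on made-up data; note that the sparse_output parameter assumes scikit-learn >= 1.2 (older versions use sparse=False).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

purchased = np.array(["yes", "no", "yes"])                # two classes -> LabelEncoder
print(LabelEncoder().fit_transform(purchased))            # e.g. [1 0 1]

country = np.array([["France"], ["Spain"], ["Germany"]])  # >2 classes -> OneHotEncoder
print(OneHotEncoder(sparse_output=False).fit_transform(country))
```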
DATA PRE-PROCESSING (CONTD.)

Feature Scaling
 Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent
variables of the dataset within a specific range. In feature scaling, we put our variables in the same range and on the same scale so
that no single variable dominates the others.
 For feature scaling, we will import the StandardScaler class from sklearn.preprocessing.
 By applying the above scaler, each variable is standardized to zero mean and unit variance, so most values fall roughly between -1 and 1.
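A minimal sketch of feature scaling with StandardScaler; the age/salary values are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25.0, 50000.0], [31.0, 60000.0], [35.0, 58000.0]])

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
print(X_scaled)
```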
EXPLORATORY DATA ANALYSIS
 Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover
patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and
graphical representations.
 Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their
main characteristics, often employing data visualization methods. It helps determine how best to manipulate data
sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a
hypothesis, or check assumptions.
 EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis-testing task, and it
provides a better understanding of data set variables and the relationships between them. It can also
help determine whether the statistical techniques you are considering for data analysis are appropriate. Originally
developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used
method in the data discovery process today.
 Exploratory data analysis is usually carried out with simple visual methods. It is an approach to
analyzing data sets to summarize their main characteristics. When you are trying to build a machine learning
model, you need to be quite sure whether your data makes sense or not.
 Exploratory data analysis (EDA) is the task of analyzing data using simple tools from statistics and simple plotting tools.
IMPORTANCE OF EXPLORATORY DATA ANALYSIS

 The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious
errors, better understand patterns within the data, detect outliers or anomalous events, and find
interesting relations among the variables.
 Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to
any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the
right questions. EDA can help answer questions about standard deviations, categorical variables, and
confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more
sophisticated data analysis or modeling, including machine learning.
NEED OF EXPLORATORY DATA ANALYSIS

 Every machine learning problem-solving process starts with EDA. It is probably one of the most important parts of a
machine learning project. As the market grows, the size of data grows too, and it becomes harder for
companies to make decisions without analyzing the data properly.
 With the use of charts and graphs, one can make sense of the data and check whether any
relationship exists or not.
 Various plots are used to draw conclusions. This helps a company make firm and profitable
decisions. Once exploratory data analysis is complete and insights are drawn, its features can be used for
supervised and unsupervised machine learning modelling.
EXPLORATORY DATA ANALYSIS TOOLS

Some of the most common data science tools used to create an EDA
include:
 Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined
with dynamic typing and dynamic binding, make it very attractive for
rapid application development, as well as for use as a scripting or glue
language to connect existing components together. Python and EDA can
be used together to identify missing values in a data set, which is
important so you can decide how to handle missing values for machine
learning.
 R: An open-source programming language and free software
environment for statistical computing and graphics, supported by the R
Foundation for Statistical Computing. The R language is widely used
among statisticians in data science for developing statistical
observations and data analysis.
EDA LEVEL OF ANALYSIS (STATISTICS: QUANTITATIVE DATA ANALYSIS)

The EDA level of analysis depends on various aspects of quantitative data analysis; the number of variables/features
considered for a particular case study is one of them. There are three different analyses, described below:
 Univariate analysis
 Bivariate analysis
 Multivariate analysis
UNIVARIATE ANALYSIS

 Univariate analysis is the most basic form of statistical data analysis technique. When the data contains only one
variable and doesn’t deal with cause-or-effect relationships, univariate analysis is used.
 The key objective of univariate analysis is simply to describe the data and find the patterns within it. The
relationships or patterns within the data can be found by looking into the mean, median, mode, dispersion, variance,
range, standard deviation, etc.
e.g. In a survey of a classroom, the researcher may be looking to count the number of boys and girls. In this instance,
the data would simply reflect the number, i.e. a single variable, and the quantity of boys and girls.
Another example: suppose we are given height and weight as input, and the output to be predicted is a type
(Obese, Slim, Fit). Considering only weight, can we predict the type? The answer is yes, but the data may
overlap between Obese and Slim, or Slim and Fit, etc. So this will work only when we have
continuous data, not categorical data.
STATISTICAL TECHNIQUES TO CONDUCT UNIVARIATE ANALYSIS

 Frequency Distribution Tables


 Histograms
 Frequency Polygons
 Pie Charts
 Bar Charts
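A minimal sketch of two of the techniques above, using pandas and matplotlib; the survey answers are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt

survey = pd.Series(["boy", "girl", "girl", "boy", "boy"])

freq = survey.value_counts()   # frequency distribution table
freq.plot(kind="bar")          # bar chart of the same counts
plt.show()
```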
BIVARIATE ANALYSIS

 Bivariate analysis is slightly more analytical than univariate analysis.


 When the dataset contains two variables and researchers aim to undertake comparisons between the two,
bivariate analysis is the right type of analysis technique.
 Bivariate analysis measures the correlation between the two variables.
e.g. In a survey of a data science class, the researcher wants to analyse the ratio of students who scored above 85%
with respect to their genders.
In this case there are two variables: gender X (independent variable) and result Y (dependent variable).
Another example: suppose we are given height and weight as input, and the output to be predicted is a type
(Obese, Slim, Fit). Considering both variables, can we predict the type more accurately? The answer is yes, but
the data may still overlap between Obese and Slim, or Slim and Fit, etc. So this will work only
when we have continuous data, not categorical data.
STATISTICAL TECHNIQUES TO CONDUCT BIVARIATE ANALYSIS

Bivariate analysis is conducted using


 Correlation coefficients
 Regression analysis (linear regression, logistic regression)
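A minimal sketch of both techniques on made-up height/weight data, assuming SciPy and scikit-learn are available:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

height = np.array([150.0, 160.0, 165.0, 170.0, 180.0])
weight = np.array([50.0, 58.0, 63.0, 66.0, 75.0])

r, p = pearsonr(height, weight)                                # correlation coefficient
model = LinearRegression().fit(height.reshape(-1, 1), weight)  # linear regression
print(r, model.coef_[0])
```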
MULTIVARIATE ANALYSIS

 Multivariate analysis is a more complex form of statistical analysis technique, used when there are more than
two variables in the dataset; it tries to understand the relationship of each variable with the others.
e.g. A doctor has collected data on cholesterol, blood pressure, and weight. She also collected data on the eating habits of
the subjects (e.g. how many ounces of red meat, fish, dairy products, and chocolate are consumed per week). She wants
to investigate the relationship between the three measures of health and eating habits.
STATISTICAL TECHNIQUES TO CONDUCT MULTIVARIATE ANALYSIS

Multivariate analysis is conducted using these commonly known techniques:


 Factor Analysis
 Cluster Analysis
 Variance Analysis
 Discriminant Analysis
 Multidimensional Scaling
 Principal Component Analysis
 Redundancy Analysis
EXPLORATORY DATA ANALYSIS TOOLS

Specific statistical functions and techniques you can perform with EDA tools include:
▪ Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional
data containing many variables.
▪ Univariate visualization of each field in the raw dataset, with summary statistics.
▪ Bivariate visualizations and summary statistics that allow you to assess the relationship between each
variable in the dataset and the target variable you’re looking at.
▪ Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
▪ K-means Clustering is a clustering method in unsupervised learning where data points are assigned into K
groups, i.e. the number of clusters, based on the distance from each group’s centroid. The data points
closest to a particular centroid will be clustered under the same category. K-means Clustering is commonly
used in market segmentation, pattern recognition, and image compression.
▪ Predictive models, such as linear regression, use statistics and data to predict outcomes.
TYPES OF EXPLORATORY DATA ANALYSIS

There are four primary types of EDA:


 Univariate non-graphical. This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since
it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and
find patterns that exist within it.
 Univariate graphical. Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore required.
Common types of univariate graphics include:
 Stem-and-leaf plots, which show all data values and the shape of the distribution.
 Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
 Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
 Multivariate non-graphical: Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques
generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
 Multivariate graphical: Multivariate data uses graphics to display relationships between two or more sets of data. The most used
graphic is a grouped bar plot or bar chart, with each group representing one level of one of the variables and each bar within a group
representing the levels of the other variable.
TYPES OF EXPLORATORY DATA ANALYSIS

Other common types of multivariate graphics include:


 Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one
variable is affected by another.
 Multivariate chart, which is a graphical representation of the relationships between factors and a
response.
 Run chart, which is a line graph of data plotted over time.
 Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional
plot.
 Heat map, which is a graphical representation of data where values are depicted by color.
EXPLORATORY DATA ANALYSIS AND DATA VISUALIZATION (CONTD.)

How can we perform EDA

 In Python, we use libraries like NumPy, Pandas, Matplotlib, and seaborn for EDA.
 The more creative we become with the data, the more insights we can visualize. So, while performing EDA, always ask the right
question, be more creative with the data, and understand the patterns thoroughly.
 Some methods and plots are distinguished as:
• Univariate analysis: the analysis of one (“uni”) variable.
• Bivariate analysis: the analysis of exactly two variables.
• Multivariate analysis: the analysis of more than two variables.

 Here are the common graphs used while performing EDA:
• Scatter Plot
• Pair plots
• Histogram
• Box plots
• Violin Plots
EXPLORATORY DATA ANALYSIS AND DATA VISUALIZATION (CONTD.)

 Scatter plot: It is a type of plot in which the data appear
in a scattered format, typically between two features.
It is used to check whether there is any linear relationship
between these two features.
 Here the two colours (orange and blue) represent two
features.
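A minimal matplotlib sketch of a two-colour scatter plot; the point clouds are randomly generated for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, (50, 2))   # first group (blue)
b = rng.normal(3.0, 1.0, (50, 2))   # second group (orange)

plt.scatter(a[:, 0], a[:, 1], c="tab:blue")
plt.scatter(b[:, 0], b[:, 1], c="tab:orange")
plt.show()
```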
EXPLORATORY DATA ANALYSIS AND DATA VISUALIZATION (CONTD.)

 Pair plots: These are used to see the
behaviour of all the features present in the
dataset. We also get to see the PDF
(probability density function) representation.
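A minimal seaborn sketch; load_dataset("iris") pulls a small sample dataset from seaborn's data repository.

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")   # scatter for every feature pair, densities on the diagonal
plt.show()
```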
EXPLORATORY DATA ANALYSIS AND DATA VISUALIZATION (CONTD.)

 Box plots: Box plots show percentile
information that other plots can’t
convey as easily. They also help in the detection
of outliers.
 Here we can learn about the quartile
range and the outlier situation.
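A minimal seaborn sketch, reusing the iris dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
sns.boxplot(data=iris, x="species", y="sepal_length")   # quartiles and outliers per group
plt.show()
```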
EXPLORATORY DATA ANALYSIS AND DATA VISUALIZATION (CONTD.)

 Histogram: Histogram plots are used
to depict the distribution of any
continuous variable. These types of plots
are very popular in statistical analysis.
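A minimal seaborn sketch of a histogram for one continuous variable, reusing the iris dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
sns.histplot(iris["sepal_length"], bins=20)   # distribution of a continuous variable
plt.show()
```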
EXPLORATORY DATA ANALYSIS AND DATA VISUALIZATION (CONTD.)

 Violin plots: A violin plot is an
extension of the box plot in which a
kernel density plot is also drawn
alongside the box plot.
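A minimal seaborn sketch, reusing the iris dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
sns.violinplot(data=iris, x="species", y="sepal_length")   # box plot + kernel density
plt.show()
```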
STACKED HISTOGRAMS FOR MULTIVARIATE DATA

 A stacked histogram is a type of graph or
graphical representation that shows different
items on a single bar with different
colours, where each colour represents a
different item.
 This histogram is normally used when
we need to represent multiple
variables. It works well for multiple
categorical labels.
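A minimal matplotlib sketch of a stacked histogram; the two samples are randomly generated for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
group_a = rng.normal(60.0, 10.0, 200)
group_b = rng.normal(75.0, 8.0, 200)

# Each bar stacks both groups, with one colour per group.
plt.hist([group_a, group_b], bins=20, stacked=True, label=["Group A", "Group B"])
plt.legend()
plt.show()
```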
BIVARIATE SCATTER PLOT

 A bivariate plot graphs the relationship between two


variables that have been measured on a single sample
of subjects. Such a plot permits you to see at a glance
the degree and pattern of relation between the two
variables.
 On a bivariate plot, the abscissa (X-axis) represents
the potential scores of the predictor variable and the
ordinate (Y-axis) represents the potential scores of
the predicted or outcome variable.
 This is what we mean by a "bivariate" plot: each point
represents two variables.
PARALLEL COORDINATE PLOT

 This type of visualisation is used for plotting multivariate,
numerical data.
 Parallel coordinates plots are ideal for comparing many
variables together and seeing the relationships between them.
 In a parallel coordinates plot, each variable is given its own
axis, and all the axes are placed parallel to each other. Each
axis can have a different scale, as each variable works on a
different unit of measurement, or all the axes can be
normalized to keep the scales uniform. Values are plotted as
a series of lines connected across all the axes. This means
that each line is a collection of points, one on each axis, that
have all been connected together.
 E.g. comparing computer or car specs across different
models.
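A minimal sketch using pandas' built-in parallel_coordinates helper, reusing the iris dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

iris = sns.load_dataset("iris")
parallel_coordinates(iris, class_column="species")   # one parallel axis per variable
plt.show()
```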
HEAT MAP (TABLE PLOT)

 Heat maps are a great tool for visualizing complex
statistical data.
 Doctors, engineers, marketers, sociologists, and
researchers of every kind use heat maps to make data
sets comprehensible and actionable.
 A heat map is a data visualization technique that uses
colour the way a bar graph uses height and width.
 A heat map uses a warm-to-cool colour spectrum to
show which parts of the data carry the highest values
and so deserve the most attention.
 This also helps you answer a crucial question: "Where
are the most important values in this
dataset?"
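A minimal sketch: a heat map of pairwise correlations in the iris dataset.

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
corr = iris.select_dtypes("number").corr()       # pairwise correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")   # warm-to-cool colour spectrum
plt.show()
```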
MOSAIC PLOTS

 A mosaic plot is a graphical display of the cell
frequencies of a contingency table, in which the
areas of the boxes of the plot are proportional to the
cell frequencies of the contingency table. This
procedure can construct mosaic plots for up to
four-way contingency tables.
 The mosaic plot is based on conditional probabilities.
MOSAIC PLOTS (CONTD.)

 For example, consider the admission details of female and male
candidates.
 The widths of the boxes are proportional to the
percentages of females and males, respectively. In fact,
41% of applicants are female and 59% are male.
 The heights of the boxes are proportional to the percentage
admitted. In fact, 45% of the male applicants are
admitted, while only 30% of the female applicants are
admitted. This seems to show a large gender bias in
admission.
 To make the plot easier to interpret, the boxes for
admitted females and males are coloured blue, while
those for not-admitted females and males are coloured pink.
 It is easy to see that the females’ blue box on the left is
much shorter than the males’ blue box on the right.
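A minimal sketch with the mosaic function from statsmodels; the admission counts below are made up to roughly match the percentages described (30% of 100 females and 45% of 140 males admitted).

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# 100 female applicants (30 admitted) and 140 male applicants (63 admitted).
data = pd.DataFrame({
    "gender":   ["F"] * 100 + ["M"] * 140,
    "admitted": ["yes"] * 30 + ["no"] * 70 + ["yes"] * 63 + ["no"] * 77,
})

mosaic(data, ["gender", "admitted"])   # box areas proportional to cell frequencies
plt.show()
```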