Handout 4

The document discusses data management techniques, focusing on cleaning missing values and transforming data for better modeling. It outlines methods for handling missing data, such as dropping rows, creating new categories, and imputing values based on other variables. Additionally, it covers data transformations like normalization, discretization, and log transformations to enhance the interpretability and effectiveness of predictive models.


Managing data:

I. Cleaning data:

First you’ll see how to treat missing values. Then we’ll discuss some common data
transformations and when they’re appropriate: converting continuous variables to discrete;
normalization and rescaling; and logarithmic transformations.

Treating missing values (NAs):

Let’s take another look at some of the variables with missing values in our customer dataset.
Fundamentally, there are two things you can do with these variables: drop the rows with missing
values, or convert the missing values to a meaningful value.

Variables with missing values:

To Drop or not to Drop:

Remember that we have a dataset of 1,000 customers; 56 missing values represents about 6% of
the data. That’s not trivial, but it’s not huge, either. The fact that three variables are all missing
exactly 56 values suggests that it’s the same 56 customers in each case.

Checking locations of missing data:


Step1:-

Step2:-
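As a minimal sketch of this check, assume the data frame is called custdata and the three variables with 56 missing values are housing.type, recent.move, and num.vehicles (these names are assumptions; substitute the names in your own data):

# Look at the rows where housing.type is missing, and summarize the other
# two suspect variables for just those rows.
summary(custdata[is.na(custdata$housing.type),
                 c("recent.move", "num.vehicles")])

# If those rows show nothing but NAs, it's the same 56 customers in each
# case, and it's safe to drop them all in one step.
custdata <- subset(custdata, !is.na(custdata$housing.type))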

Because the missing data represents a fairly small fraction of the dataset, it’s probably safe just to drop these customers from your analysis. But what about the variable is.employed? Here you’re missing data from a third of the customers.

Missing data in categorical variables


The most straightforward solution is just to create a new category for the variable, called
missing.

Remapping NA to a level:

Step1:-

Step2:-

Step3:-
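As a sketch of these steps (assuming the categorical variable in question is custdata$is.employed, as in the listing further below):

# Remap NA to the explicit level "missing"; known values keep a readable label.
custdata$is.employed.fix <- ifelse(is.na(custdata$is.employed),
                                   "missing",
                                   ifelse(custdata$is.employed == TRUE,
                                          "employed",
                                          "not employed"))
# Check the counts of the new levels.
summary(as.factor(custdata$is.employed.fix))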

Most analysis functions in R (and in a great many other statistical languages and packages) will,
by default, drop rows with missing data. Changing each NA (which is R’s code for missing
values) to the token missing (which is people-code for missing values) will prevent that.
As a data scientist, you ought to be interested in why so many customers are missing this
information. It could just be bad record-keeping, but it could be semantic, as well. In this case,
the format of the data (using the same row type for all customers) hints that the NA actually
encodes that the customer is not in the active workforce: they are a homemaker, a student,
retired, or otherwise not seeking paid employment. Assuming that you don’t want to differentiate
between retirees, students, and so on, naming the category appropriately will make it easier to
interpret the model that you build down the line.

custdata$is.employed.fix <- ifelse(is.na(custdata$is.employed),
                                   "not in active workforce",
                                   ifelse(custdata$is.employed==T,
                                          "employed",
                                          "not employed"))

Notice that we created a new variable called is.employed.fix, rather than simply replacing
is.employed. This is a matter of taste. We prefer to have the original variable on hand, in case
we second-guess our data cleaning and want to redo it. This is mostly a problem when the data
cleaning involves a complicated transformation, like determining which customers are retirees
and which ones are students. On the other hand, having two variables about employment in your
data frame leaves you open to accidentally using the wrong one. Both choices have advantages
and disadvantages.

Missing values in numeric data:

Numeric variables can have missing values too. The summary of the income variable shows 328 of them:

> summary(custdata$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0   25000   45000   66200   82000  615000     328

You believe that income is still an important predictor of the probability of health insurance
coverage, so you still want to use the variable. What do you do?

When values are missing randomly


You might believe that the data is missing because of a faulty sensor—in other words, the data
collection failed at random. In this case, you can replace the missing values with the expected, or
mean, income:
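A minimal sketch of mean imputation, assuming the income column is custdata$Income as in the summary above (the name Income.imputed is only for illustration):

# Mean income over the customers whose income is known.
meanIncome <- mean(custdata$Income, na.rm = TRUE)
# Replace each NA with the mean; known values pass through unchanged.
custdata$Income.imputed <- ifelse(is.na(custdata$Income),
                                  meanIncome,
                                  custdata$Income)
summary(custdata$Income.imputed)   # no more NA's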

There might be a relationship between state of residence or marital status and income, as well. If you have this information, you can use it. Note that the method of imputing a missing value of an input variable based on the other input variables can be applied to categorical data, as well. It’s important to remember that replacing missing values by the mean, as well as many more sophisticated methods for imputing missing values, assumes that the customers with missing income are in some sense random (the “faulty sensor” situation). It’s possible that the customers with missing income data are systematically different from the others: for example, the customers with missing income information may truly have no income, because they’re not in the active workforce.
When values are missing systematically:

One thing you can do is to convert the numeric data into categorical data, and then use the methods that we discussed previously. In this case, you would divide the income into income categories of interest, such as “below $10,000” or “from $100,000 to $250,000,” using the cut() function, as sketched after the steps below.

Converting missing numeric data to a level :

Step1:

Step2:

Step3:

Step4:

Step5:

Step6:
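One way to sketch these steps (the break points and the name income.groups below are illustrative, not the handout's exact choices):

# Income ranges of interest; the exact break points are an assumption.
breaks <- c(0, 10000, 50000, 100000, 250000, 1000000)
# cut() turns the numeric income into a factor of ranges;
# include.lowest keeps the customers with exactly zero income.
custdata$income.groups <- cut(custdata$Income,
                              breaks = breaks,
                              include.lowest = TRUE)
# The missing incomes are still NA, so remap them to a level as before.
custdata$income.groups <- as.character(custdata$income.groups)
custdata$income.groups <- ifelse(is.na(custdata$income.groups),
                                 "no income",
                                 custdata$income.groups)
summary(as.factor(custdata$income.groups))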

This grouping approach can work well, especially if the relationship between income and
insurance is nonmonotonic (the likelihood of having insurance doesn’t strictly increase or
decrease with income).
You could also replace all the NAs with zero income—but the data already has customers with
zero income. Those zeros could be from the same mechanism as the NAs (customers not in the
active workforce), or they could come from another mechanism —for example, customers who
have been unemployed the entire year. A trick that has worked well for us is to replace the NAs
with zeros and add an additional variable (we call it a masking variable) to keep track of which
data points have been altered.

Tracking original NAs with an extra categorical variable:
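A sketch of this trick, using the variable names missingIncome and Income.fix referred to below:

# The masking variable: TRUE wherever the original income was missing.
custdata$missingIncome <- is.na(custdata$Income)
# Replace the NAs themselves with zero income.
custdata$Income.fix <- ifelse(is.na(custdata$Income),
                              0,
                              custdata$Income)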

You give both variables, missingIncome and Income.fix, to the modeling algorithm, and it can
determine how to best use the information to make predictions. Note that if the missing values
really are missing randomly, then the masking variable will basically pick up the variable’s mean
value.

In addition to fixing missing data, there are other ways that you can transform the data to address
issues that you found during the exploration phase.

II. Data transformations

The purpose of data transformation is to make data easier to model, and easier to understand. For example, the cost of living will vary from state to state, so what would be a high salary in one region could be barely enough to scrape by in another. If you want to use income as an input to your insurance model, it might be more meaningful to normalize a customer’s income by the typical income in the area where they live.

Normalizing income by state:


Step1:

Step2:
Step3:

Step4:
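A sketch of these steps, assuming a summary data frame called medianincome with one row per state and the columns State and Median.Income, and a state column custdata$state.of.res (all of these names are assumptions):

# Join each customer to the median income of their state.
custdata <- merge(custdata, medianincome,
                  by.x = "state.of.res", by.y = "State")
# Income relative to the typical income where the customer lives.
custdata$income.normalized <- with(custdata, Income / Median.Income)
summary(custdata$income.normalized)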

The need for data transformation can also depend on which modeling method you plan to use. For linear and logistic regression, for example, you ideally want to make sure that the relationship between input variables and output variable is approximately linear, and that the output variable has constant variance (the variance of the output variable is independent of the input variables). You may need to transform some of your input variables to better meet these assumptions.

Converting continuous variables to discrete:

For example, you may notice that customers with incomes less than $20,000 have different
health insurance patterns than customers with higher incomes. Or you may notice that customers
younger than 25 and older than 65 have high probabilities of insurance coverage, because they
tend to be on their parents’ coverage or on a retirement plan, respectively, whereas customers
between those ages have a different pattern.

In these cases, you might want to convert the continuous age and income variables into ranges,
or discrete variables. Discretizing continuous variables is useful when the relationship between
input and output isn’t linear, but you’re using a modeling technique that assumes it is, like
regression.

Health insurance coverage versus income (log10 scale):

You can replace the income variable with a Boolean variable that indicates whether income is less than $20,000:

> custdata$income.lt.20K <- custdata$income < 20000
> summary(custdata$income.lt.20K)
   Mode   FALSE    TRUE    NA's
logical     678     322       0
If you want more than a simple threshold (as in the age example), you can use the cut() function.
Converting age into ranges:
Step1:

Step2:

Step3:
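A sketch of these steps with cut(), using the age thresholds of 25 and 65 mentioned earlier (the column name custdata$age is an assumption):

# Break points follow the discussion above: under 25, 25 to 65, over 65.
brks <- c(0, 25, 65, Inf)
# cut() returns a factor of age ranges; include.lowest keeps age 0 in range.
custdata$age.range <- cut(custdata$age,
                          breaks = brks,
                          include.lowest = TRUE)
summary(custdata$age.range)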

Even when you do decide not to discretize a numeric variable, you may still need to transform it
to better bring out the relationship between it and other variables.
Normalization and rescaling:

Normalization is useful when absolute quantities are less meaningful than relative ones. For
example, you might be less interested in a customer’s absolute age than you are in how old or
young they are relative to a “typical” customer. Let’s take the mean age of your customers to be
the typical age. You can normalize by that, as shown in the following:
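A minimal sketch, assuming the age column is custdata$age:

# Take the mean customer age as the "typical" age.
meanage <- mean(custdata$age)
# Values near 1 are typical; well below 1 is young, well above 1 is old.
custdata$age.normalized <- custdata$age / meanage
summary(custdata$age.normalized)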

A value for age.normalized that is much less than 1 signifies an unusually young customer; much greater than 1 signifies an unusually old customer.
Is a 35-year-old young?
The typical age spread of your customers is summarized in the standard deviation. You can
rescale your data by using the standard deviation as a unit of distance. A customer who is within
one standard deviation of the mean is not much older or younger than typical. A customer who is
more than one or two standard deviations from the mean can be considered much older, or much
younger.
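Sketched the same way, now using the standard deviation as the unit of distance:

# The standard deviation summarizes the typical spread of customer ages.
stdage <- sd(custdata$age)
# Signed distance from the mean age, measured in standard deviations.
custdata$age.normalized <- (custdata$age - meanage) / stdage
summary(custdata$age.normalized)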

Now, values less than -1 signify customers younger than typical; values greater than 1 signify
customers older than typical. Normalizing by mean and standard deviation is most meaningful
when the data distribution is roughly symmetric.

Log transformations for skewed and wide distributions:


Monetary amounts (incomes, customer value, account or purchase sizes) are some of the most commonly encountered sources of skewed distributions in data science applications. Monetary amounts are often lognormally distributed: the log of the data is normally distributed. This leads us to the idea that taking the log of the data can restore symmetry to it.
A nearly lognormal distribution and its log
The common interpretation of standard deviation as a unit of distance implicitly assumes that the
data is distributed normally. For a normal distribution, roughly two-thirds of the data (about
68%) is within plus/minus one standard deviation from the mean. About 95% of the data is
within plus/minus two standard deviations from the mean.

For the purposes of modeling, which logarithm you use (natural logarithm, log base 10, or log base 2) is generally not critical. In regression, for example, the choice of logarithm affects the magnitude of the coefficient that corresponds to the logged variable, but it doesn’t affect the value of the outcome. We like to use log base 10 for monetary amounts, because orders of ten seem natural for money: $100, $1,000, $10,000, and so on. The transformed data is easy to read.

Additive Process Vs Multiplicative Process:

For example, when you’re studying weight loss, the natural unit is often pounds or
kilograms. If you weigh 150 pounds and your friend weighs 200, you’re both equally active, and
you both go on the exact same restricted-calorie diet, then you’ll probably both lose about the
same number of pounds—in other words, how much weight you lose doesn’t (to first order)
depend on how much you weighed in the first place, only on calorie intake. This is an additive
process.

On the other hand, if management gives everyone in the department a raise, it probably isn’t
giving everyone $5,000 extra. Instead, everyone gets a 2% raise: how much extra money ends up
in your paycheck depends on your initial salary. This is a multiplicative process, and the natural
unit of measurement is percentage, not absolute dollars. When the process is multiplicative, log
transforming the process data can make modeling easier.
Of course, taking the logarithm only works if the data is positive. There are other transforms, such as arcsinh, that you can use to decrease data range if you have zero or negative values. We
don’t always use arcsinh, because we don’t find the values of the transformed data to be
meaningful. In applications where the skewed data is monetary (like account balances or
customer value), we instead use what we call a signed logarithm. A signed logarithm takes the
logarithm of the absolute value of the variable and multiplies by the appropriate sign. Values
strictly between -1 and 1 are mapped to zero.

Signed log lets you visualize non-positive data on a logarithmic scale:
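One way to sketch a signed log base 10, following the definition above (values with absolute value at most 1, including zero, map to zero):

# Signed log: log10 of the absolute value, multiplied by the sign of the value.
signedlog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))
}
signedlog10(c(-1000, -1, 0, 0.5, 1000))   # returns -3  0  0  0  3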

Sampling for modeling and validation:

Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling. It is easier to test and debug the code on small subsamples before training the model on the entire dataset. Visualization can be easier with a subsample of the data; ggplot runs faster on smaller datasets, and too much data will often obscure the patterns in a graph.

It is a good idea to pick customers randomly from all the states, because what predicts health
insurance coverage for Texas customers might be different from what predicts health insurance
coverage in Connecticut. The other reason to sample your data is to create test and training splits.
I. Test and training splits:

When you’re building a model to make predictions, like our model to predict the
probability of health insurance coverage, you need data to build the model. You also need data to
test whether the model makes correct predictions on new data. The first set is called the training
set, and the second set is called the test (or hold-out) set.

The training set is the data that you feed to the model-building algorithm—regression, decision
trees, and so on—so that the algorithm can set the correct parameters to best predict the outcome
variable. The test set is the data that you feed into the resulting model, to verify that the model’s
predictions are accurate. Many writers recommend train/calibration/test splits, which is also
good advice. Our philosophy is this: split the data into train/test early, don’t look at test until
final evaluation, and if you need calibration data, resplit it from your training subset.

II. Creating a sample group column:

A convenient way to manage random sampling is to add a sample group column to the data
frame. The sample group column contains a number generated uniformly from zero to one, using
the runif function. You can draw a random sample of arbitrary size from the data frame by using
the appropriate threshold on the sample group column. For example, once you’ve labeled all the
rows of your data frame with your sample group column (let’s call it gp), then the set of all rows
such that gp < 0.4 will be about four-tenths, or 40%, of the data. The set of all rows where gp is
between 0.55 and 0.70 is about 15% of the data (0.7 - 0.55 = 0.15). So you can repeatedly
generate a random sample of the data of any size by using gp.

Step1:

Step2:

Step3:
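A sketch of these steps, assuming the data frame is custdata and a 10% test split:

# One uniform random number between 0 and 1 per row of the data frame.
custdata$gp <- runif(dim(custdata)[1])
# Roughly 10% of the rows become the test set; the rest are training.
testSet <- subset(custdata, custdata$gp <= 0.1)
trainingSet <- subset(custdata, custdata$gp > 0.1)
dim(testSet)[1]       # about 100 of the 1,000 rows
dim(trainingSet)[1]   # about 900 rows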
R also has a function called sample that draws a random sample (a uniform random
sample, by default) from a data frame. Why not just use sample to draw training and test sets?
You could, but using a sample group column guarantees that you’ll draw the same sample group
every time. This reproducible sampling is convenient when you’re debugging code.

You also want repeatable input samples for what software engineers call regression testing (not
to be confused with statistical regression). In other words, when you make changes to a model or
to your data treatment, you want to make sure you don’t break what was already working. If
model version 1 was giving “the right answer” for a certain input set, you want to make sure that
model version 2 does so also.

Reproducible sampling is not just a trick for R: If your data is in a database or other external store, and you only want to pull a subset of the data into R for analysis, you can draw a reproducible random sample by generating a sample group column in an appropriate table in the database, using the SQL function RAND() (or your database’s equivalent random number function).

III. Record grouping:

One caution is that the preceding trick works if every object of interest (every customer, in this
case) corresponds to a unique row. But what if you’re interested less in which customers don’t
have health insurance, and more about which households have uninsured members? If you’re
modeling a question at the household level rather than the customer level, then every member of
a household should be in the same group (test or training). In other words, the random sampling
also has to be at the household level.

Suppose your customers are marked both by a household ID and a customer ID (so the unique ID for a customer is the combination (household_id, cust_id)).
Example of dataset with customers and households:
Ensuring the test/train split doesn’t split inside a household:
Step1:

Step2:

Step3:
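A sketch of household-level sampling, assuming a data frame called hhdata with the columns household_id and cust_id mentioned above:

# One random group value per household, not per customer row.
hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh,
                         gp = runif(length(hh)))
# Join the household-level gp back onto the customer rows, so every member
# of a household ends up with the same gp value.
hhdata <- merge(hhdata, households, by = "household_id")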

The resulting sample group column is shown in the following figure. Now we can generate the test and training sets as before. This time, however, the threshold 0.1 doesn’t represent 10% of the data rows, but 10% of the households (which may be more or less than 10% of the data, depending on the sizes of the households).

Example of dataset with customers and households:

IV. Data provenance:

You’ll also want to add a column (or columns) to record data provenance: when your dataset was
collected, perhaps what version of your data cleaning procedure was used on the data before
modeling, and so on. This is akin to version control for data. It’s handy information to have, to
make sure that you’re comparing apples to apples when you’re in the process of improving your
model, or comparing different models or different versions of a model.
