Real Data
• Does not come nicely packaged in an organized table or
database format!
• Comes from multiple sources and therefore often in
multiple, inconsistent formats (e.g. text, numbers)
• Will be incomplete with missing values
• Will have some information repeated (duplication)
• Will have multiple types of input errors that must be
identified and fixed, including typos and outliers (data
validation)
• Real data is messy!!
Golden Rule for Data Pre-processing!
Always make a copy of the original data before you start
to:
• Cleanse it
• Sample from it, sort or apply filters
• Analyze it
It is good practice to document the steps taken
to prepare the data so that others can repeat them if
necessary or understand what assumptions were made
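Outside Excel, the same rule can be sketched in a few lines of Python. This is a minimal illustration only: the sample records, the field names and the helper apply_step are hypothetical and are not part of the course workbook.

```python
import copy

# Hypothetical sample records standing in for the original data.
original = [{"id": 1, "limit": " 5000 "}, {"id": 2, "limit": ""}]

# All cleaning happens on a deep copy, so the original stays untouched.
working = copy.deepcopy(original)

# Running log of preparation steps, so others can repeat them.
steps = []


def apply_step(description, func, data, log):
    """Apply func to data in place and record what was done."""
    func(data)
    log.append(description)


def strip_limits(rows):
    """Remove leading and trailing spaces from the 'limit' field."""
    for row in rows:
        row["limit"] = row["limit"].strip()


apply_step("Stripped spaces from 'limit'", strip_limits, working, steps)

print(original[0]["limit"] == " 5000 ")  # original is unchanged
print(working[0]["limit"])
print(steps)
```

The log is just a list of descriptions here; a dated text file kept next to the copy of the data would serve the same purpose.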
Ways to Convert Blocks of
Text Data to Numeric
1. Click the Excel error indicator in the top-left corner of the cell and choose Convert to Number.
2. Multiply the text by 1 in another cell, then copy the result and use Paste Special > Values
to paste it over the original text
3. Use the Paste Special > Multiply option to convert a range:
◦ On the CreditCardData sheet, enter 1 in cell L1
◦ Select L1 and choose Copy
◦ Select the entire data range from I3 to I5002
◦ Choose Paste Special > Multiply
(Note that the blank in row 5001 is converted to 0, which is not good pre-processing
unless that is what you intend.)
4. Use the Text to Columns option in the Data menu to convert all text in a column
to numbers
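The blank-to-0 problem noted above can be avoided when the conversion is done in code. The following is a rough Python sketch, not the Excel method: the function to_number and the sample column are made up for illustration, and blanks are kept as missing (None) instead of becoming 0.

```python
def to_number(text):
    """Convert a text entry to a float; return None for blank entries."""
    cleaned = text.strip()
    if cleaned == "":
        return None  # keep blanks as missing rather than turning them into 0
    return float(cleaned)  # raises ValueError if the text is not numeric


# Hypothetical column of numbers stored as text, with one blank entry.
column = ["120.5", " 87", "", "1500"]
numbers = [to_number(value) for value in column]

print(numbers)
```

An entry that is not numeric at all (a typo such as "12O") raises an error here instead of being converted silently, which makes it easier to find and fix.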
Ways to Convert Blocks of Numeric
Data to Text
1. Click the Excel error indicator in the top-left corner of the cell and choose Ignore Error
to keep the entry as text.
2. Use the Trim function to remove all spaces from a text string except for single
spaces between words, and the Clean function to remove hidden characters:
◦ =Clean(cell) – removes all non-printing characters from text
◦ =Trim(cell) – removes spaces at the beginning and end of text and reduces
repeated spaces between words to a single space
◦ =Trim(Clean(cell)) – applies both
>>> =Trim(Clean(cell)) returns the cell entry as a label (text), which can be
copied and pasted with Paste Special > Values back over a numerical entry
3. Use the Text to Columns option in the Data menu to convert all entries in the highlighted
column to text: at Step 3 of the wizard, select “Text” instead of “General”
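For comparison, here is a small Python sketch of what Trim and Clean do. The functions clean, trim and as_label are approximations written for this example, assuming that Clean drops characters with codes 0 to 31 and that Trim only acts on the ordinary space character.

```python
import re


def clean(text):
    """Drop control characters with codes 0-31, similar to Excel's CLEAN."""
    return "".join(ch for ch in text if ord(ch) >= 32)


def trim(text):
    """Collapse runs of spaces to one and strip both ends, similar to TRIM."""
    return re.sub(r" +", " ", text).strip(" ")


def as_label(value):
    """Return the value as cleaned text, like =TRIM(CLEAN(cell))."""
    return trim(clean(str(value)))


print(repr(as_label("  New\t York \n")))
print(repr(as_label(42)))
```

As in Excel, the result is always text, so as_label(42) gives the label "42" and not the number 42.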
Ways to Remove Typos in Text for
Different Rows
1. Use a Pivot Table to identify potential typos in labels
2. In the worksheet, sort the column alphabetically so that a correct spelling can be
copied by dragging it to adjacent rows without overwriting other words/items
3. Filter the column: the Filter menu lists every distinct version of a word/phrase, so
variants of the same entry are easy to spot
4. Use =trim(clean(cell)) on an entry that looks correct to remove any spaces before
or after it as well as any hidden characters. Fix any misspelling or spacing within
words in the phrase. Use Copy/Paste Special Value to replace the original text entry
5. Copy the corrected text up or down over the other filtered versions in the column and
then remove all filters.
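The Pivot Table check in step 1 amounts to counting how often each distinct label occurs. A rough Python equivalent is sketched below. The labels and the normalize helper are invented for illustration and are not taken from the course data.

```python
from collections import Counter

# Hypothetical label column containing spacing, case and spelling variants.
labels = ["Visa", "visa ", "Visa", "Vsia", "MasterCard", " MasterCard"]

# Count each distinct raw label, much like a Pivot Table of the column.
raw_counts = Counter(labels)


def normalize(label):
    """Ignore case and extra spaces when comparing labels."""
    return " ".join(label.split()).casefold()


# Group raw labels that become identical once normalized.
groups = {}
for label in labels:
    groups.setdefault(normalize(label), set()).add(label)

# Keep only groups with more than one raw spelling: likely typos.
suspects = {key: sorted(variants)
            for key, variants in groups.items() if len(variants) > 1}

print(raw_counts)
print(suspects)
```

Normalizing catches spacing and capitalization variants only. A true misspelling such as "Vsia" still has to be spotted by eye, and its count of 1 in raw_counts is the hint that it needs checking.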
Handling Missing Values
Possible strategies for missing data:
◦ Replace any missing data values by the mean or median for that
variable
◦ Replace any missing data values with a flag
◦ Remove records with missing values from data set
Using Excel with simple cases of missing values (CreditCardData
sheet: highlight range A2:J5003)
◦ Click Home – Find & Select – Go To Special – Blanks
◦ Type a formula or text flag into the active cell, then press [CTRL][Enter] to fill
every selected blank
◦ Can be used to replace missing data with any value or text
◦ The mean/median can then be calculated from the data that is not flagged
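The three strategies can be compared side by side in a short Python sketch. The values below are made up, missing entries are represented as None, and the "MISSING" flag text is an arbitrary choice.

```python
from statistics import mean, median

# Hypothetical column with two missing entries.
values = [200.0, None, 350.0, 125.0, None, 410.0]

# Statistics are computed from the entries that are present only.
present = [v for v in values if v is not None]
col_mean = mean(present)
col_median = median(present)

# Strategy 1: replace missing values with the median (or the mean).
filled = [col_median if v is None else v for v in values]

# Strategy 2: replace missing values with a flag.
flagged = ["MISSING" if v is None else v for v in values]

# Strategy 3: remove records with missing values.
complete = [v for v in values if v is not None]

print(col_mean, col_median)
print(filled)
print(flagged)
print(complete)
```

Which strategy is appropriate depends on the analysis. Whichever is used, it belongs in the documented preparation steps.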
Duplicate records
If you suspect that you have duplicate records, you can remove
them in a couple of ways:
{see Duplicates worksheet}
◦ Excel – Data – (Sort and Filter) – Advanced – Unique records
only
◦ Excel – Data – (Data Tools) – Remove Duplicates
You can also highlight and inspect duplicate values with
conditional formatting before deciding whether to remove a
record:
◦ Excel – Home – Conditional Formatting – Highlight Cells Rules
– Duplicate Values
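The same inspect-then-remove sequence can be sketched in Python. The records below are invented examples, and a record counts as a duplicate only when every field matches exactly.

```python
from collections import Counter

# Hypothetical records: (account id, card type, amount).
records = [
    ("1001", "Visa", 250.0),
    ("1002", "MasterCard", 80.0),
    ("1001", "Visa", 250.0),
    ("1003", "Visa", 40.0),
]

# Inspect first: list every record that occurs more than once.
counts = Counter(records)
duplicates = [record for record, n in counts.items() if n > 1]

# Then remove: keep the first occurrence of each record, in original order.
unique = list(dict.fromkeys(records))

print(duplicates)
print(unique)
```

Records that differ only by a stray space or a typo are not caught by an exact match, so labels should be cleaned before duplicates are removed.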