Lecture 5
Data Wrangling
Source: https://bcourses.berkeley.edu/courses/1377158/pages/cs-194-16-introduction-to-data-science-fall-2015
Dimensions of Data Quality
❖ Accuracy
❖ Validity
❖ Completeness
❖ Timeliness
[1] Completeness
• All necessary data have been recorded.
• Data can be considered complete even if optional
data is missing.
Scenario: At the end of the first week of the Autumn term, data
analysis was performed on the ‘First Emergency Contact Telephone
Number’ data item in the Contact table. There are 300 students in
the school, and 294 of the 300 potential records were populated.
Therefore 294/300 × 100 = 98% completeness has been achieved for
this data item in the Contact table.
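A minimal sketch of the completeness calculation in the scenario above. The counts (294 populated out of 300) come from the scenario; the `completeness` helper and the sample phone number are illustrative assumptions.

```python
def completeness(values):
    """Return the percentage of entries that are populated (not None or blank)."""
    populated = sum(1 for v in values if v is not None and str(v).strip() != "")
    return populated * 100 / len(values)


# Hypothetical column: 294 populated phone numbers and 6 missing ones.
phone_numbers = ["01234 567890"] * 294 + [None] * 6

print(f"{completeness(phone_numbers):.0f}% complete")  # 98% complete
```

Blank strings are counted as missing here; whether that is right depends on how the source system records an empty field.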
Scenario 1: A new Year 9 teacher, Sally Hearn (who has no middle name), is appointed,
so there are only two initials. A decision must be made on how to represent two
initials, or the rule will fail and the database will reject the class identifier
“SH09”. It is decided that an additional character “Z” will be added to pad the letters
to three: “SZH09”. However, this could break the accuracy rule. A better solution would
be to amend the database to accept 2 or 3 initials and 1 or 2 digits.
Scenario 2: The age at entry to a UK primary and junior school is captured on the
school application form. It is entered into a database and checked to be between
4 and 11. If it were captured on the form as 14 or N/A, it would be rejected
as invalid.
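The two validity rules in these scenarios can be sketched as simple checks. The function names are hypothetical, and the assumption that initials are uppercase letters is mine, not the scenario's.

```python
import re

# Rule from the suggested fix in Scenario 1: 2 or 3 initials, then 1 or 2 digits.
# Assumption: initials are uppercase A-Z.
CLASS_ID_PATTERN = re.compile(r"[A-Z]{2,3}[0-9]{1,2}")


def is_valid_class_id(class_id):
    """Check the whole identifier against the pattern."""
    return CLASS_ID_PATTERN.fullmatch(class_id) is not None


def is_valid_entry_age(raw_age):
    """Accept only whole numbers between 4 and 11 inclusive."""
    try:
        age = int(raw_age)
    except (TypeError, ValueError):
        return False
    return 4 <= age <= 11


print(is_valid_class_id("SH09"))   # True
print(is_valid_class_id("SZH09"))  # True
print(is_valid_entry_age("14"))    # False
print(is_valid_entry_age("N/A"))   # False
print(is_valid_entry_age("7"))     # True
```

Note that “SZH09” passes the format check even though “Z” is not a real initial: a value can be valid without being accurate.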
Why are there errors & inconsistencies?
❖ The diversity of data sources brings abundant data
types and complex data structures.
Data Science Process
The Problem with Noise
The Issue with Data Quality
❖ Gartner reports have consistently stressed the issue of data
quality.
Data Cleaning / Wrangling
❖ Data Cleaning
❖ Data Preprocessing
❖ Data Preparation
❖ Data Scrubbing
❖ Data Munging
❖ Data Transformation
❖ …
Data Cleaning
❖ Often, Data Cleaning is a more specific process,
Data Cleaning
❖ Number ONE problem in data warehousing
❖ Routine tasks:
An important reminder:
Does a missing value always mean an error in the data?
Common sources of discrepancies in data
Incomplete data comes from:
• Data value not available at the time of collection
• Different criteria between the time when the data was collected and
when it is analysed
• Human/hardware/software problems
Problem I: Incomplete Data
Problem II: Noisy Data
Binning
❖ Sorted data for price (in dollars)
E.g. 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
❖ Techniques:
❖ Partition into equal-size bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
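A small sketch of equal-size partitioning on the price data above. The helper name `equal_size_bins` is hypothetical. The smoothing step at the end is a common follow-up technique that I am assuming for illustration; it is not stated on the slide.

```python
def equal_size_bins(sorted_values, n_bins):
    """Split sorted values into n_bins consecutive bins holding the same number of values."""
    if len(sorted_values) % n_bins != 0:
        raise ValueError("number of values must be divisible by n_bins")
    size = len(sorted_values) // n_bins
    return [sorted_values[i:i + size] for i in range(0, len(sorted_values), size)]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_size_bins(prices, 3)
for number, contents in enumerate(bins, start=1):
    print(f"Bin {number}: {contents}")

# Assumed follow-up step: smooth each bin by replacing its values with the bin mean.
bin_means = [sum(b) / len(b) for b in bins]
print(bin_means)  # [9.0, 22.75, 29.25]
```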
Binning
❖ These bins can now be re-labelled based on the bin
numbers (similar to “categories”)
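Re-labelling can be sketched as replacing each price with the name of the bin it falls in. This assumes the same sorted price list and three equal-size bins; the label text "Bin N" is an arbitrary choice.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
size = len(prices) // n_bins  # 4 values per bin

# The position in the sorted list decides which bin a value belongs to.
labels = [f"Bin {index // size + 1}" for index in range(len(prices))]

for price, label in zip(prices, labels):
    print(price, label)
```

After this step the numeric column can be treated like a categorical one with three categories.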
Outliers
Discuss #1: Outliers
Source: https://www.chegg.com/homework-help/questions-and-answers/local-ice-cream-shop-kept-track-number-cans-cold-soda-sold-day-temperature-day-two-months--q3418124
Problem III: Inconsistent Data
❖ Naming Conventions
  ❖ Synonyms
  ❖ Nicknames/Initials
  ❖ Abbreviations/Acronyms
  ❖ E.g. car, vehicle
  ❖ E.g. New York, NY, NYC
❖ Different Representations
  ❖ How it is written: 2 vs TWO vs II
  ❖ Data types: integer vs float
  ❖ Units: ft vs cm; kg vs pound
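One way to handle these inconsistencies is to map every variant onto a single canonical form. The lookup tables and function names below are hypothetical and cover only the examples listed above; a real project would build them from its own data.

```python
# Hypothetical lookup tables covering the examples above.
CANONICAL_NAMES = {"ny": "New York", "nyc": "New York", "new york": "New York"}
NUMBER_WORDS = {"2": 2, "two": 2, "ii": 2}
TO_KG = {"kg": 1.0, "pound": 0.45359237}  # 1 pound is defined as 0.45359237 kg


def canonical_name(raw):
    """Map known variants to one spelling; leave unknown names unchanged."""
    cleaned = raw.strip()
    return CANONICAL_NAMES.get(cleaned.lower(), cleaned)


def parse_small_number(raw):
    """Turn '2', 'TWO' or 'II' into the integer 2 (raises KeyError if unknown)."""
    return NUMBER_WORDS[raw.strip().lower()]


def to_kilograms(value, unit):
    """Convert a weight to kilograms so all rows share one unit."""
    return value * TO_KG[unit]


print(canonical_name(" NYC "))      # New York
print(parse_small_number("TWO"))    # 2
print(to_kilograms(10, "pound"))
```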
Problem III: Inconsistent Data
❖ Formatting Issues
  ❖ Dates: mon-year, mm/yy, dd/mm/yy
  ❖ Numbers: 1000, 1000.00
  ❖ Currency: $, £, ¥
  ❖ ID: 911101-01-1234 or 107123456
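Formatting differences in dates and numbers can often be removed by parsing each value into a proper type. The sketch below tries the three date layouts listed above in a fixed order. The order and the `parse_date` helper are assumptions, and month names are assumed to be English abbreviations such as "Dec".

```python
from datetime import datetime
from decimal import Decimal

# Candidate layouts: dd/mm/yy, mm/yy and mon-year. Which ones apply to a
# given column is an assumption that should be confirmed against the data.
DATE_FORMATS = ["%d/%m/%y", "%m/%y", "%b-%Y"]


def parse_date(raw):
    """Try each known format in turn and return the first successful parse."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw!r}")


print(parse_date("25/12/15"))  # 2015-12-25
print(parse_date("12/15"))     # 2015-12-01
print(parse_date("Dec-2015"))  # 2015-12-01

# 1000 and 1000.00 are the same number once parsed.
print(Decimal("1000") == Decimal("1000.00"))  # True
```

When the day is missing from the source value, `strptime` fills in the first of the month, so the result is less precise than it looks.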
Duplicate Records
Other Wrangling Tasks
❖ Data Integration: A process of integrating data from multiple sources into a single view
❖ Data Transformation: A process involving normalization, discretization, and concept hierarchy generation
Data Integration
The Need for Integration
2. Communication heterogeneity
❖ Some systems have web interfaces, some allow direct query languages,
some offer APIs…
3. Schema heterogeneity
❖ The structure of tables storing data can be different (even if storing the
same data)
The Need for Integration
4. Data type heterogeneity
❖ Storing the same data (and values) but with different data types
❖ E.g. Storing name as fixed length / variable length
❖ E.g. Storing the phone number as String or as Number
5. Value heterogeneity
❖ Same logical values stored in different ways
❖ E.g. Prof, Prof., Professor…
❖ E.g. “Right”, “R”, 1…
6. Semantic heterogeneity
❖ Same values in different sources can mean different things
❖ E.g. Column “title” in one database means “Job Title” while in
another database it refers to “Person Title”
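A rough sketch of how data type, value and semantic heterogeneity might be handled when mapping two sources onto one shared schema. Everything here is hypothetical: the records, the `unify` function, the target field names, and the assumed 10-digit phone length.

```python
PERSON_TITLES = {"prof": "Professor", "prof.": "Professor", "professor": "Professor"}
PHONE_LENGTH = 10  # assumed fixed length, needed to restore leading zeros


def unify(record, title_means):
    """Map a source record onto a shared schema.

    title_means states what the source's "title" column holds:
    either "person_title" or "job_title" (semantic heterogeneity).
    """
    unified = {
        "name": record["name"].strip(),
        # Data type heterogeneity: a numeric phone column drops leading zeros.
        "phone": str(record["phone"]).zfill(PHONE_LENGTH),
        "person_title": None,
        "job_title": None,
    }
    title = record["title"].strip()
    if title_means == "person_title":
        # Value heterogeneity: Prof, Prof. and Professor become one value.
        title = PERSON_TITLES.get(title.lower(), title)
    unified[title_means] = title
    return unified


source_a = {"name": "A. Example", "phone": "0123456789", "title": "Prof."}
source_b = {"name": "A. Example ", "phone": 123456789, "title": "Lecturer"}

print(unify(source_a, "person_title"))
print(unify(source_b, "job_title"))
```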
Entity Resolution
❖ Data coming from different sources may be different
even if representing the same objects
❖ Entity resolution:
• Process of figuring out which records represent the same
thing
• Linking relevant records together
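A minimal sketch of entity resolution using string similarity from Python's standard `difflib` module. The records, the 0.9 threshold, and the rule that cities must match exactly are all illustrative assumptions; practical systems usually combine several fields and tune the threshold on labelled data.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records gathered from different sources.
records = [
    {"id": 1, "name": "Jonathan Smith", "city": "New York"},
    {"id": 2, "name": "Jonathon Smith", "city": "New York"},
    {"id": 3, "name": "Maria Garcia", "city": "Boston"},
]

THRESHOLD = 0.9  # assumed cut-off for treating two names as the same


def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(rows, threshold=THRESHOLD):
    """Link pairs of records that share a city and have near-identical names."""
    matches = []
    for left, right in combinations(rows, 2):
        same_city = left["city"] == right["city"]
        if same_city and similarity(left["name"], right["name"]) >= threshold:
            matches.append((left["id"], right["id"]))
    return matches


print(find_matches(records))  # [(1, 2)]
```

Comparing every pair of records grows quadratically with the number of rows, so larger data sets usually need a way to limit which pairs are compared.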
Merging Similar Records
Problem with Redundancy
❖ What causes redundancy?
❖ An attribute may be redundant if it can be “derived” from another
attribute or set of attributes
❖ Inconsistencies in attribute or dimension naming
❖ Duplication at the tuple level (likely due to inaccurate data entry /
updating)
❖ The use of denormalized tables (could sometimes be done on purpose
for optimization)
Also highlights the danger of redundancy!
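Two of the causes above can be checked with a few lines of code: duplicated tuples, and a derived attribute that has drifted out of sync with its sources. The order rows below are made up for illustration, and treating `total` as `unit_price * quantity` is an assumption about this hypothetical table.

```python
# Hypothetical order rows; "total" can be derived from the other two numbers.
rows = [
    {"item": "pen", "unit_price": 2.0, "quantity": 3, "total": 6.0},
    {"item": "ink", "unit_price": 5.0, "quantity": 2, "total": 10.0},
    {"item": "pen", "unit_price": 2.0, "quantity": 3, "total": 6.0},  # exact duplicate
    {"item": "pad", "unit_price": 4.0, "quantity": 2, "total": 9.0},  # stale total
]

# Tuple-level duplication: collect rows whose values were all seen before.
seen = set()
duplicates = []
for row in rows:
    key = tuple(sorted(row.items()))
    if key in seen:
        duplicates.append(row)
    seen.add(key)

# Derived attribute: a stored total that disagrees with the recomputed value.
inconsistent = [
    r for r in rows if abs(r["unit_price"] * r["quantity"] - r["total"]) > 1e-9
]

print(len(duplicates))                     # 1
print([r["item"] for r in inconsistent])   # ['pad']
```

The "pad" row shows the danger: once a derivable value is stored separately, it can contradict the data it was derived from.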
Data Reduction
❖ Why do we need to reduce data?
❖ Strategies:
⮚ Dimensionality reduction
  ⮚ Process of reducing number of random variables or attributes under consideration
⮚ Numerosity reduction
  ⮚ Process of replacing original data volume by alternative, smaller forms of data representation
⮚ Data compression
  ⮚ Process of reducing the size of data while preserving the representation of original data (the best possible)
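As one illustration of numerosity reduction, a list of values can be replaced by a histogram that stores only a count per bucket. The price list is reused from the binning example, and the bucket width of 10 is an arbitrary assumption.

```python
from collections import Counter

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
WIDTH = 10  # assumed bucket width

# Key each bucket by its lower bound: 0 covers 0-9, 10 covers 10-19, and so on.
histogram = Counter((p // WIDTH) * WIDTH for p in prices)

for start in sorted(histogram):
    print(f"{start}-{start + WIDTH - 1}: {histogram[start]}")
```

Twelve values are reduced to four (bucket, count) pairs, at the cost of losing the exact prices.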
How to reduce number of attributes?
❖ To find a good subset of the original attributes:
❖ Heuristic approaches that explore a reduced search space:
1) Greedy approach,
2) Statistical measures such as information gain used in
building decision trees
❖ Combination of both (forward selection and backward elimination)
❖ At each iteration, select the best attribute and remove the worst
from the remaining attributes
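Information gain, the statistical measure mentioned above, can be computed directly from value counts. The sketch below ranks attributes by gain in a single pass, which is a simplification of full stepwise selection. The toy rows and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_attributes(rows, attributes, target):
    """Order attributes from most to least informative about the target."""
    return sorted(
        attributes, key=lambda a: information_gain(rows, a, target), reverse=True
    )


# Hypothetical toy data: "outlook" decides "play", "windy" tells us nothing.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

print(rank_attributes(rows, ["windy", "outlook"], "play"))  # ['outlook', 'windy']
```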
Heuristic Methods for Attribute Subset Selection
❖ Decision tree induction constructs a flowchart-like structure where each internal node (non-leaf) represents a test on an attribute, each branch represents an outcome of the test, and each external node (leaf) represents the prediction
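The structure described above can be sketched as nested dictionaries. The tree below is hand-built and hypothetical, not learned from data. The point is that attributes never tested by the tree can be left out of the reduced attribute subset.

```python
# Hypothetical hand-built tree: internal nodes test an attribute,
# branches are test outcomes, and leaves hold a prediction.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"prediction": "no"},
        "rainy": {
            "attribute": "windy",
            "branches": {
                "yes": {"prediction": "no"},
                "no": {"prediction": "yes"},
            },
        },
    },
}


def predict(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while "prediction" not in node:
        node = node["branches"][record[node["attribute"]]]
    return node["prediction"]


def attributes_used(node):
    """Collect every attribute tested by an internal node."""
    if "prediction" in node:
        return set()
    used = {node["attribute"]}
    for child in node["branches"].values():
        used |= attributes_used(child)
    return used


record = {"outlook": "rainy", "windy": "no", "humidity": "high"}
print(predict(tree, record))          # yes
print(sorted(attributes_used(tree)))  # ['outlook', 'windy']
```

Here "humidity" appears in the record but is never tested, so it would be dropped from the selected subset.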
Data Transformation
❖ Normalization
❖ Discretization
❖ Concept Hierarchy Generation
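For normalization, two widely used methods are min-max scaling and z-score standardization. The lecture text here does not say which methods it covers, so the choice of these two is an assumption, as are the sample values.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so the smallest value maps to new_min and the largest to new_max."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("min-max normalization needs at least two distinct values")
    return [(v - low) / (high - low) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre values on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("z-score normalization needs non-constant values")
    return [(v - mu) / sigma for v in values]


incomes = [10, 20, 30]  # hypothetical values
print(min_max(incomes))  # [0.0, 0.5, 1.0]
print(z_score(incomes))
```

Min-max scaling keeps values inside a fixed range but is sensitive to outliers; z-scores have no fixed range and are expressed in standard deviations from the mean.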
Normalization
Normalization Methods
Discretization
Concept Hierarchy Generation
Tools for Data Wrangling
❖ Microsoft Excel
❖ R
❖ Python
❖ SPSS
❖ Other commercial software / services
Software
DataWatch Monarch
Software
Trifacta Wrangler
https://www.youtube.com/watch?v=ogU6BpLa1VQ
Software
OpenRefine (formerly Google Refine) http://openrefine.org/
Reading Material
• Endel & Piringer, “Data Wrangling: Making data useful again”, 2015
• Trifacta, “The Opportunity for Data Wrangling in Financial Services and Insurance”, 2016
• [YouTube] Daniel Chen: Cleaning and Tidying Data in Pandas | PyData DC 2018
• [PDF] Data Wrangling with pandas Cheat Sheet
• Sign up @ Kaggle: https://www.kaggle.com/