Data Mining:

Concepts and Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights reserved.
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents: term-frequency vector
   Transaction data
 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time-series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data

Term-frequency vectors (document data):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

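A term-frequency vector like the ones on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `term_frequency_vector`, the whitespace tokenization, the vocabulary order, and the sample sentence are assumptions made for the example, not something the slides specify.

```python
from collections import Counter

# Assumed vocabulary: the ten terms from the slide's document example.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_frequency_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in `text`.

    Tokenization is a plain lowercase whitespace split, which is a
    simplification; real pipelines usually also strip punctuation.
    """
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


# Made-up sentence, used only to show the shape of the result.
doc = "team play play game score team lost"
vector = term_frequency_vector(doc)
print(vector)  # [2, 0, 2, 0, 1, 1, 0, 1, 0, 0]
```

Each document becomes one row of a data matrix, with one column per term, which is why most entries are zero for short documents.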
Important Characteristics of Structured Data
 Dimensionality
   Curse of dimensionality
 Sparsity
   Only presence counts
 Resolution
   Patterns depend on the scale
 Distribution
   Centrality and dispersion

Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
   sales database: customers, store items, sales
   medical database: patients, treatments
   university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.

Attributes
 Attribute (or dimension, feature, variable): a data field, representing a characteristic or feature of a data object.
   E.g., customer_ID, name, address
 Types:
   Nominal
   Binary
   Ordinal
   Numeric: quantitative
     Interval-scaled
     Ratio-scaled

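The row/column view of data objects and attributes can be made concrete with a small Python sketch. The `Customer` class, its field types, and the two sample records are hypothetical; only the attribute names customer_ID, name, and address come from the slide.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    """One data object; each field is one attribute (column)."""
    customer_id: int
    name: str
    address: str


# Made-up records: each instance corresponds to one database row.
customers = [
    Customer(1, "Ada", "12 Elm St"),
    Customer(2, "Bo", "34 Oak Ave"),
]

# The attribute names are the column headers of the table.
attribute_names = [f.name for f in fields(Customer)]
print(attribute_names)  # ['customer_id', 'name', 'address']
```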
Attribute Types
 Nominal: categories, states, or “names of things”
   Hair_color = {auburn, black, blond, brown, grey, red, white}
   marital status, occupation, ID numbers, zip codes
 Binary
   Nominal attribute with only 2 states (0 and 1)
   Symmetric binary: both outcomes equally important
     e.g., gender
   Asymmetric binary: outcomes not equally important
     e.g., medical test (positive vs. negative)
     Convention: assign 1 to the most important outcome (e.g., HIV positive)
 Ordinal
   Values have a meaningful order (ranking), but the magnitude between successive values is not known
   Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
   Measured on a scale of equal-sized units
   Values have order
   E.g., temperature in °C or °F, calendar dates
   No true zero-point
 Ratio
   Inherent zero-point
   We can speak of a value as being a multiple of another value (10 K is twice as high as 5 K)
   E.g., temperature in Kelvin, length, counts, monetary quantities

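The practical difference between interval and ratio scales shows up as soon as you divide two values. The sketch below is a small illustration with assumed readings of 10 and 5 degrees Celsius; the helper `celsius_to_kelvin` is introduced here for the example.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


# Two assumed temperature readings in degrees Celsius.
a_c, b_c = 10.0, 5.0

# Differences are meaningful on an interval scale and survive the conversion.
difference_c = a_c - b_c
difference_k = celsius_to_kelvin(a_c) - celsius_to_kelvin(b_c)

# Ratios are not: Celsius has no true zero, so 10 / 5 = 2 is an artifact
# of where the scale places its zero point.
naive_ratio = a_c / b_c
true_ratio = celsius_to_kelvin(a_c) / celsius_to_kelvin(b_c)

print(difference_c, round(difference_k, 6))
print(naive_ratio, round(true_ratio, 4))  # 2.0 versus roughly 1.018
```

On the Kelvin scale the warmer reading is only about 1.8% higher, which is the physically meaningful comparison.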
Discrete vs. Continuous Attributes
 Discrete Attribute
   Has only a finite or countably infinite set of values
   E.g., zip codes, profession, or the set of words in a collection of documents
   Sometimes represented as integer variables
   Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
   Has real numbers as attribute values
   E.g., temperature, height, or weight
   Practically, real values can only be measured and represented using a finite number of digits
   Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
 Motivation
   To better understand the data: central tendency, variation and spread
 Data dispersion characteristics
   median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
   Data dispersion: analyzed with multiple granularities of precision
   Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
   Folding measures into numerical dimensions
   Boxplot or quantile analysis on the transformed cube

Basic Statistical Descriptions of Data
 For data preprocessing to be successful, it is essential to have an overall picture of your data.
 Basic statistical descriptions can be used to identify properties of the data and to highlight noise or outliers.
 Three areas of basic statistical descriptions:
   Measures of central tendency: the middle or center of a data distribution, i.e., where most of its values fall (e.g., mean, median, mode).
   Dispersion of the data: how the data are spread out (range, quartiles, and boxplots; variance and standard deviation).
   Graphic displays of basic statistical descriptions: used to inspect the data visually (bar charts, pie charts, line graphs, histograms, and scatter plots).

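The dispersion measures named on this slide can be computed with Python's standard `statistics` module. This is a sketch under assumptions: the salary values (in thousands) are made up for illustration, `statistics.quantiles` requires Python 3.8 or later and uses its default "exclusive" method (other tools may give slightly different quartiles), and the 1.5 x IQR outlier fence is a common rule of thumb rather than something stated on the slide.

```python
import statistics

# Illustrative salary values in thousands (made up for this sketch).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Range: difference between the largest and smallest value.
value_range = max(salaries) - min(salaries)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1  # interquartile range

# Sample variance and sample standard deviation (divide by n - 1).
variance = statistics.variance(salaries)
std_dev = statistics.stdev(salaries)

# Rule-of-thumb outlier fences, as drawn by many boxplot implementations.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in salaries if x < low_fence or x > high_fence]

print(value_range, (q1, q2, q3), iqr)
print(round(variance, 2), round(std_dev, 2))
print(outliers)  # [110]
```

The second quartile `q2` is the median, so the same call covers both a central-tendency measure and the quartile-based spread.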
Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
   Note: n is sample size and N is population size.
 Weighted arithmetic mean: the weights reflect the significance, importance, or occurrence frequency attached to their respective values.
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

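The two formulas on this slide translate directly into code. The following is a minimal sketch; the function names `mean` and `weighted_mean` and the sample scores and weights are assumptions chosen for the example.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean of an empty sequence is undefined")
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Made-up exam scores; the last one counts double.
scores = [80, 90, 100]
weights = [1, 1, 2]

print(mean(scores))                    # 90.0
print(weighted_mean(scores, weights))  # 92.5
```

With all weights equal, `weighted_mean` reduces to the plain mean, which is a quick sanity check on any implementation.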
Measuring the Central Tendency
 Issues: A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
   For example, the mean salary at a company may be substantially pushed up by that of a few highly paid managers.
   Similarly, the mean score of a class in an exam could be pulled down quite a bit by a few very low scores.
 Trimmed mean: Mean obtained after chopping off values at the high and low extremes, to offset the effect caused by a small number of extreme values.
   For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean.
   We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.

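A trimmed mean can be sketched as follows. The function name `trimmed_mean`, the way the cut count is rounded down, and the salary figures are assumptions for illustration; statistical packages may round the number of trimmed values differently.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping the lowest and highest `proportion` of the values.

    `proportion` is the fraction cut from EACH end, e.g. 0.02 for 2%.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    if not values:
        raise ValueError("trimmed mean of an empty sequence is undefined")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values removed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Made-up salaries (in thousands) with one very highly paid manager.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]

print(sum(salaries) / len(salaries))  # 74.0, pulled up by the 400
print(trimmed_mean(salaries, 0.10))   # 38.75, after dropping 30 and 400
```

Note that with only ten values a 2% trim removes nothing, because `int(10 * 0.02)` is 0; small percentages only take effect on larger data sets.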
Measuring the Central Tendency
 Median
   Middle value if odd number of values, or average of the middle two values otherwise
   Estimated by interpolation (for grouped data)
 Mode
   Value that occurs most frequently in the data
   Unimodal, bimodal, trimodal: data sets with one, two, or three modes
   A data set with two or more modes is multimodal
   At the other extreme, if each data value occurs only once, then there is no mode

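The median and mode rules on this slide can be checked with Python's `statistics` module (Python 3.8 or later for `multimode`). The wrapper `modes` is a hypothetical helper that applies the slide's convention of reporting no mode when every value occurs once; the sample lists are made up.

```python
import statistics


def modes(values):
    """Return all most-frequent values, or [] when every value is unique."""
    if len(set(values)) == len(values):
        # Each value occurs only once: by the slide's convention, no mode.
        return []
    # multimode keeps the order in which the modes first appear.
    return statistics.multimode(values)


print(statistics.median([1, 3, 5]))     # 3, the middle value (odd count)
print(statistics.median([1, 3, 5, 7]))  # 4.0, average of the middle two
print(modes([1, 2, 2, 3]))              # [2], unimodal
print(modes([1, 2, 2, 3, 3]))           # [2, 3], bimodal
print(modes([1, 2, 3]))                 # [], no mode
```

The interpolation estimate for grouped data is not covered here, since it needs interval boundaries and frequencies that this sketch does not model.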
Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively skewed, and negatively skewed data

[Figure: distribution curves for symmetric, positively skewed, and negatively skewed data]
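A quick numerical check shows how skew typically separates the mean from the median. The three lists below are made-up toy data, and the pattern they show (mean above the median for a long right tail, below it for a long left tail) is a common tendency rather than a guarantee for every data set.

```python
import statistics

# Toy data sets chosen to illustrate each shape.
symmetric = [1, 2, 3, 4, 5]
positively_skewed = [1, 2, 2, 3, 20]     # long tail to the right
negatively_skewed = [1, 18, 19, 19, 20]  # long tail to the left


def mean_and_median(values):
    """Return (mean, median) so the two can be compared side by side."""
    return statistics.mean(values), statistics.median(values)


for name, data in [("symmetric", symmetric),
                   ("positively skewed", positively_skewed),
                   ("negatively skewed", negatively_skewed)]:
    m, med = mean_and_median(data)
    print(f"{name}: mean={m}, median={med}")
```

In the positively skewed list the single large value drags the mean to 5.6 while the median stays at 2; the negatively skewed list mirrors this with a mean of 15.4 and a median of 19.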

May 24, 2024 Data Mining: Concepts and Techniques 16
