Lect-2 Getting To Know Your Data-Part-I

Uploaded by

tanvipal6661


DATA MINING AND

DATA WAREHOUSING

CS30013 (Cr-3)

Dr. Pradeep Kumar Mallick

Associate Professor(II)
Email: [email protected]

Getting to Know Your Data

Outline
Data Objects and Attribute Types
Types of data sets
Data Quality
What is Data?
 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
◦ Examples: eye color of a person, temperature, etc.
◦ Attribute is also known as variable, field, characteristic, dimension, or feature
 A collection of attributes describes an object
◦ Object is also known as record, point, case, sample, entity, or instance

Example table (columns are attributes, rows are objects):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
What are Data Attributes?
 Data attributes refer to the specific characteristics or
properties that describe individual data objects within a
dataset.
 These attributes provide meaningful information about the
objects and are used to analyze, classify, or manipulate the
data.
 Understanding and analyzing data attributes is fundamental
in various fields such as statistics, machine learning, and data
analysis, as they form the basis for deriving insights and
making informed decisions from the data.
 Within predictive models, attributes serve as the predictors
influencing an outcome. In descriptive models, attributes
constitute the pieces of information under examination for
inherent patterns or correlations.
 A set of attributes used to describe a given object is known as an attribute vector or feature vector.
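As a minimal sketch of the feature-vector idea, the snippet below turns one object from the tax table into an ordered tuple of attribute values. The attribute names follow the table; the identifiers `ATTRIBUTES`, `feature_vector`, and `obj_1` are illustrative assumptions, not part of the lecture.

```python
# Attribute names taken from the example table (Tid only identifies the object).
ATTRIBUTES = ("refund", "marital_status", "taxable_income", "cheat")


def feature_vector(record):
    """Return the record's attribute values in a fixed attribute order."""
    return tuple(record[name] for name in ATTRIBUTES)


# Object with Tid = 1 from the table; income is stored in thousands.
obj_1 = {
    "refund": "Yes",
    "marital_status": "Single",
    "taxable_income": 125,
    "cheat": "No",
}

vector = feature_vector(obj_1)
print(vector)
```

Fixing the attribute order is what makes vectors from different objects comparable position by position.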
Types of attributes:
Attributes can be broadly classified into two main types:
 Qualitative (Nominal (N), Ordinal (O), Binary(B)).
 Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes :
 Nominal attributes, as related to names, refer to categorical data
where the values represent different categories or labels without
any inherent order or ranking. These attributes are often used to
represent names or labels associated with objects, entities, or
concepts.
 Example

2. Binary Attributes:
 Binary attributes are a type of qualitative attribute where the
data can take on only two distinct values or states.
 These attributes are often used to represent yes/no,
presence/absence, or true/false conditions within a dataset.
 They are particularly useful for representing categorical data
where there are only two possible outcomes.
 For instance, in a medical study, a binary attribute could
represent whether a patient is affected or unaffected by a
Binary Attributes…
 Symmetric : In a symmetric attribute, both values or states
are considered equally important or interchangeable. For
example, in the attribute “Gender” with values “Male” and
“Female,” neither value holds precedence over the other, and
they are considered equally significant for analysis purposes.

 Asymmetric: An asymmetric attribute indicates that the two values or states are not equally important or interchangeable. For instance, in the attribute “Result” with values “Pass” and “Fail,” the states are not of equal importance; passing may hold greater significance than failing in certain contexts, such as academic grading or certification exams.
Qualitative Attributes:
3. Ordinal Attributes : Ordinal attributes are a type of qualitative
attribute where the values possess a meaningful order or
ranking, but the magnitude between values is not precisely
quantified. In other words, while the order of values indicates
their relative importance or precedence, the numerical difference
between them is not standardized or known.
 Example:
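The contrast between nominal and ordinal values can be sketched in a few lines. The size scale and eye-color list below are assumed examples chosen for illustration; the point is only that ordinal values can be ranked while nominal values can only be tested for equality.

```python
# Hypothetical ordinal scale: the order is meaningful, the gaps between levels are not.
SIZE_ORDER = ["small", "medium", "large"]
RANK = {label: position for position, label in enumerate(SIZE_ORDER)}


def is_smaller(a, b):
    """Order comparison is valid for ordinal values."""
    return RANK[a] < RANK[b]


sizes = ["large", "small", "medium", "small"]
sorted_sizes = sorted(sizes, key=RANK.get)

# Nominal values (e.g. eye color) support only equality tests.
eye_colors = ["brown", "blue", "brown"]
same_color = eye_colors[0] == eye_colors[2]
```

The integer ranks are used only for ordering; subtracting them would wrongly suggest that the distance between levels is known.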
Quantitative Attributes:
1. Numeric:
 A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values.
 Numeric attributes are of two types: interval-scaled and ratio-scaled.
 An interval-scaled attribute has values whose differences are interpretable, but it has no true reference point (zero point). Values on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided.
 Consider temperature in degrees Centigrade. If one day's temperature reading is twice that of another day, we cannot say that the first day is twice as hot as the other.
 A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, and we can also compute the difference between values, as well as the mean, median, mode, interquartile range, and five-number summary.
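The temperature example can be checked numerically. This is a small sketch with assumed readings of 20 and 10 degrees Celsius: differences agree on both scales, but the ratio only makes sense on the Kelvin scale, which has a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scaled Celsius reading to ratio-scaled Kelvin."""
    return celsius + 273.15


day_a_c, day_b_c = 20.0, 10.0  # assumed readings in degrees Celsius

# Differences are meaningful on both scales and agree.
diff_c = day_a_c - day_b_c
diff_k = celsius_to_kelvin(day_a_c) - celsius_to_kelvin(day_b_c)

# The Celsius ratio is 2.0, but it depends on an arbitrary zero point.
ratio_c = day_a_c / day_b_c

# The Kelvin ratio is the physically meaningful one, roughly 1.035.
ratio_k = celsius_to_kelvin(day_a_c) / celsius_to_kelvin(day_b_c)

print(diff_c, diff_k, ratio_c, round(ratio_k, 3))
```

So a day at 20 degrees Celsius is only about 3.5% warmer in absolute terms than a day at 10 degrees Celsius, not twice as warm.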
Quantitative Attributes:
2. Discrete : Discrete data refer to information that can take on
specific, separate values rather than a continuous range. These
values are often distinct and separate from one another, and
they can be either numerical or categorical in nature.

3. Continuous: Continuous data, unlike discrete data, can take on an infinite number of possible values within a given range. It is characterized by being able to assume any value within a specified interval, often including fractional or decimal values.
 Example :
Test Yourself
Q. What is the difference between nominal and ordinal
attributes?
Ans: Nominal attributes represent categories without any
inherent order or ranking, while ordinal attributes have a
meaningful sequence or ranking between values, but the
magnitude between values is not precisely known.
Q. How do discrete and continuous attributes differ?
Ans: Discrete attributes represent countable values or whole
numbers, while continuous attributes can take on any value
within a range and are typically associated with
measurements.

Q. What are attributes in a data warehouse?

Ans: In a data warehouse, attributes typically refer to the
descriptive characteristics or properties of data entities, such as
dimensions or features, which are used for analysis, reporting,
and decision-making.
Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
◦ sales database: customers, store items, sales
◦ medical database: patients, treatments
◦ university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, or tuples.
 Data objects are described by attributes.
 Database rows correspond to data objects; columns correspond to attributes.
Properties of Attribute Values
 The type of an attribute depends on which of the following properties/operations it possesses:
◦ Distinctness: = and ≠
◦ Order: <, ≤, >, and ≥
◦ Addition: + and - (differences are meaningful)
◦ Multiplication: * and / (ratios are meaningful)

◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & meaningful differences
◦ Ratio attribute: all four properties
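The cumulative relationship between attribute types and properties can be written down directly. The following is a sketch under the assumption that each type keeps every property of the type before it and adds one more; the names `PROPERTIES`, `LEVELS`, and both functions are illustrative.

```python
# Properties in the order the slide lists them.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]

# Attribute types from weakest to strongest.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]


def supported_properties(attribute_type):
    """Each level keeps the properties of the levels before it and adds one."""
    level = LEVELS.index(attribute_type)
    return PROPERTIES[: level + 1]


def supports(attribute_type, prop):
    """Check whether an operation family is meaningful for an attribute type."""
    return prop in supported_properties(attribute_type)


for attribute_type in LEVELS:
    print(attribute_type, supported_properties(attribute_type))
```

A check such as `supports("interval", "multiplication")` returning `False` mirrors the rule that ratios of interval-scaled values are not meaningful.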
Different types of Attributes

Categorical (Qualitative)
◦ Nominal
   Description: Nominal attribute values only distinguish. (=, ≠)
   Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
   Operations: mode, entropy, contingency correlation, χ² test
◦ Ordinal
   Description: Ordinal attribute values also order objects. (<, >)
   Examples: hardness of minerals, {good, better, best}, grades, street numbers
   Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)
◦ Interval
   Description: For interval attributes, differences between values are meaningful. (+, -)
   Examples: calendar dates, temperature in Celsius or Fahrenheit
   Operations: mean, standard deviation, Pearson's correlation, t and F tests
◦ Ratio
   Description: For ratio variables, both differences and ratios are meaningful. (*, /)
   Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
   Operations: geometric mean, harmonic mean, percent variation
Types of data sets
Record
◦ Data Matrix
◦ Document Data
◦ Transaction Data
Graph
◦ World Wide Web
◦ Molecular Structures
Ordered
◦ Spatial Data
◦ Temporal Data
◦ Sequential Data
◦ Genetic Sequence Data
Data Matrix
 If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
 Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x load   Projection of y load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1
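A data matrix maps naturally onto a list of rows. The sketch below stores the two objects from the slide and, as an added illustration that is not on the slide, computes the Euclidean distance between them to show what "points in a multi-dimensional space" buys us. The column identifiers are assumed spellings of the slide's headers.

```python
import math

# Column names as they appear on the slide (identifier spelling is an assumption).
COLUMNS = [
    "projection_of_x_load",
    "projection_of_y_load",
    "distance",
    "load",
    "thickness",
]

# Two objects (rows), five numeric attributes (columns) each.
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)      # number of objects
n = len(data_matrix[0])   # number of attributes

# Each row is a point in n-dimensional space, so distances are defined.
distance = math.dist(data_matrix[0], data_matrix[1])

print(m, n, round(distance, 3))
```

In practice the attributes would usually be rescaled first, since a raw distance is dominated by whichever column has the largest numeric range.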
Document Data
 Each document becomes a ‘term’ vector
◦ Each term is a component (attribute) of the
vector
◦ The value of each component is the number
of times the corresponding term occurs in
the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1   3     0      5     0     2      6     0    2     0        2
Document 2   0     7      0     2     1      0     0    3     0        0
Document 3   0     1      0     0     1      2     2    0     3        0
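Building a term vector amounts to counting words against a fixed vocabulary. The sketch below uses the ten terms from the slide; the vocabulary order, the whitespace tokenization, and the toy document are assumptions made for illustration.

```python
from collections import Counter

# The ten terms from the slide; any fixed order works as long as it is shared.
VOCABULARY = [
    "team", "coach", "play", "ball", "score",
    "game", "win", "lost", "timeout", "season",
]


def term_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Toy document, invented for illustration.
doc = "team play game play score game team"
vector = term_vector(doc)
print(vector)
```

Real pipelines would also handle punctuation, stemming, and stop words, which this sketch deliberately leaves out.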
Transaction Data
A special type of record data, where
◦ Each record (transaction) involves a set of
items.
◦ For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
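The transaction table above can be represented as a mapping from transaction ID to a set of items. The following is a minimal sketch; the helper `transactions_containing` is a hypothetical name, and counting item occurrences is shown only as a typical first step when exploring such data.

```python
from collections import Counter

# Transactions from the slide, keyed by transaction ID (TID).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count in how many transactions each item appears.
item_counts = Counter(item for items in transactions.values() for item in items)


def transactions_containing(*items):
    """Return the TIDs whose item set contains all the given items."""
    wanted = set(items)
    return sorted(tid for tid, basket in transactions.items() if wanted <= basket)


print(item_counts["Milk"], transactions_containing("Diaper", "Milk"))
```

Sets are a natural fit because a transaction records which items were bought, not their order.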
Graph Data
 Examples: Generic graph, a molecule, and
webpages
[Figures: a generic graph; a benzene molecule]
Ordered Data
Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
 Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Data Quality
 What kinds of data quality problems?
 How can we detect problems with the data?
 What can we do about these problems?

Data Preparation = Cleaning the Data

Data Preparation can take 40-80% (or
more) of the effort in a data mining
project
– Dealing with NULL (missing) values
– Dealing with errors
– Dealing with noise
– Dealing with outliers (unless that is your science!)
– Transformations: units, scale, projections
– Data normalization
– Relevance analysis: Feature Selection
– Remove redundant attributes
Noise
 For objects, noise is an extraneous object
 For attributes, noise refers to modification of original values

◦ Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen

[Figure: two sine waves; two sine waves + noise]

Outliers
 Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
◦ Case 1: Outliers are noise that interferes with data analysis
◦ Case 2: Outliers are the goal of our analysis
 Credit card fraud
 Intrusion detection
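The slides do not prescribe a detection rule, so the sketch below swaps in one common heuristic, the 1.5 × IQR fence, purely as an illustration. The input is the taxable income column (in thousands) from the tax table earlier in the deck; whether the flagged value is noise or the object of interest depends on the analysis, as the two cases above point out.

```python
from statistics import quantiles

# Taxable income (in thousands) for Tid 1..10 from the earlier table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# Quartiles; quantiles() with n=4 returns [Q1, Q2, Q3].
q1, _, q3 = quantiles(incomes, n=4)
iqr = q3 - q1

# Assumed heuristic: flag values more than 1.5 * IQR outside the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in incomes if x < lower_fence or x > upper_fence]
print(outliers)
```

Other rules (z-scores, distance-based or density-based methods) would be equally valid choices here.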
Missing Values
Reasons for missing values
◦ Information is not collected
(e.g., people decline to give their age and weight)
◦ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

Handling missing values

◦ Eliminate data objects or variables
◦ Estimate missing values
 Example: time series of temperature
 Example: census results
◦ Ignore the missing value during analysis
◦ Replace with all possible values (weighted by their probabilities)
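The first three handling strategies can be sketched on a short temperature series. The readings are hypothetical, `None` marks a missing value, and estimating from the two neighbors is just one simple choice of estimator.

```python
from statistics import mean

# Hypothetical temperature series; None marks a missing reading.
temperatures = [21.0, 22.5, None, 23.5, 24.0]

# Strategy 1: eliminate objects with missing values.
dropped = [t for t in temperatures if t is not None]


# Strategy 2: estimate the missing value, here as the mean of its neighbors.
def fill_from_neighbors(series):
    filled = list(series)
    for i, value in enumerate(series):
        if value is None and 0 < i < len(series) - 1:
            left, right = series[i - 1], series[i + 1]
            if left is not None and right is not None:
                filled[i] = (left + right) / 2
    return filled


estimated = fill_from_neighbors(temperatures)

# Strategy 3: ignore missing values during analysis.
average = mean(t for t in temperatures if t is not None)

print(dropped, estimated, average)
```

Neighbor-based estimation suits time series such as temperature because adjacent readings tend to be close; it would make little sense for unordered records.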
Types of Missing Values
Some definitions are based on representation:
Missing data is the lack of a recorded answer
for a particular field.
 Missing completely at random (MCAR)
 Missing at Random (MAR)
 Missing Not at Random (MNAR)
Duplicate Data
 Data set may include data objects that are
duplicates, or almost duplicates of one another
◦ Major issue when merging data from heterogeneous
sources

 Examples:
◦ Same person with multiple email addresses

 Data cleaning
◦ Process of dealing with duplicate data issues

 When should duplicate data not be removed?
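The "same person with multiple email addresses" case can be sketched as grouping records by a normalized key. The records, names, and the name-only matching key below are all hypothetical; the sketch only finds candidates for review.

```python
from collections import defaultdict

# Hypothetical records merged from two sources.
records = [
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "asha rao ", "email": "a.rao@example.org"},
    {"name": "Ben Lee", "email": "ben@example.com"},
]


def normalize(name):
    """Crude matching key: lower-case and collapse whitespace."""
    return " ".join(name.lower().split())


# Group email addresses under the normalized name.
groups = defaultdict(list)
for record in records:
    groups[normalize(record["name"])].append(record["email"])

# Candidate duplicates: one key linked to more than one email address.
candidates = {key: emails for key, emails in groups.items() if len(emails) > 1}
print(candidates)
```

Matching on name alone would also merge two different people who share a name, which is one reason candidates should be reviewed rather than removed automatically.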
