DATA MINING AND
DATA WAREHOUSING
CS30013 (Cr-3)
Dr. Pradeep Kumar Mallick
Associate Professor(II)
Email:
[email protected]
Getting to Know Your Data
Outline
Data Objects and Attribute Types
Types of data sets
Data Quality
What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
◦ Examples: eye color of a person, temperature, etc.
◦ Attribute is also known as variable, field, characteristic,
dimension, or feature
A collection of attributes describes an object
◦ Object is also known as record, point, case, sample, entity, or
instance
Columns are attributes; rows are objects:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
What are Data Attributes?
Data attributes refer to the specific characteristics or
properties that describe individual data objects within a
dataset.
These attributes provide meaningful information about the
objects and are used to analyze, classify, or manipulate the
data.
Understanding and analyzing data attributes is fundamental
in various fields such as statistics, machine learning, and data
analysis, as they form the basis for deriving insights and
making informed decisions from the data.
Within predictive models, attributes serve as the predictors
influencing an outcome. In descriptive models, attributes
constitute the pieces of information under examination for
inherent patterns or correlations.
The set of attributes used to describe a given object is
known as its attribute vector or feature vector.
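As a minimal sketch, one row of the tax-record table can be held as a feature vector. The names `attribute_names`, `feature_vector`, and `record` are hypothetical, and reading "125K" as 125,000 is an assumption.

```python
# Attribute names act as the column headers; each object is one row.
attribute_names = ["Refund", "Marital Status", "Taxable Income", "Cheat"]

# One data object (Tid 1 in the table), stored as a feature vector.
feature_vector = ["Yes", "Single", 125_000, "No"]

# Pair each attribute name with its value to describe the object.
record = dict(zip(attribute_names, feature_vector))
print(record["Marital Status"])  # Single
```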
Types of attributes:
Attributes can be broadly classified into two main types:
◦ Qualitative: Nominal (N), Ordinal (O), Binary (B)
◦ Quantitative: Numeric, Discrete, Continuous
Qualitative Attributes:
1. Nominal Attributes :
Nominal means "relating to names". Nominal attributes hold
categorical data whose values are categories or labels with no
inherent order or ranking. They are often used to name objects,
entities, or concepts.
Example
2. Binary Attributes:
Binary attributes are a type of qualitative attribute where the
data can take on only two distinct values or states.
These attributes are often used to represent yes/no,
presence/absence, or true/false conditions within a dataset.
They are particularly useful for representing categorical data
where there are only two possible outcomes.
For instance, in a medical study, a binary attribute could
represent whether a patient is affected or unaffected by a
Binary Attributes…
Symmetric : In a symmetric attribute, both values or states
are considered equally important or interchangeable. For
example, in the attribute “Gender” with values “Male” and
“Female,” neither value holds precedence over the other, and
they are considered equally significant for analysis purposes.
Asymmetric: An asymmetric attribute indicates that the two
values or states are not equally important or interchangeable.
For instance, in the attribute “Result” with values “Pass” and
“Fail,” the states are not of equal importance; passing may
hold greater significance than failing in certain contexts, such
as academic grading or certification exams.
Qualitative Attributes:
3. Ordinal Attributes : Ordinal attributes are a type of qualitative
attribute where the values possess a meaningful order or
ranking, but the magnitude between values is not precisely
quantified. In other words, while the order of values indicates
their relative importance or precedence, the numerical difference
between them is not standardized or known.
Example:
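One possible illustration, assuming a hypothetical three-level scale (`LEVELS`) and made-up responses: the order of the values supports sorting and a median, even though the gap between levels is unknown.

```python
# Hypothetical ordinal scale: the order is meaningful, the spacing is not.
LEVELS = ["low", "medium", "high"]
rank = {level: i for i, level in enumerate(LEVELS)}

responses = ["high", "low", "medium", "high", "medium"]

# Sorting and the median rely only on order, so both are valid.
ordered = sorted(responses, key=rank.get)
median_level = ordered[len(ordered) // 2]
print(median_level)  # medium
```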
Quantitative Attributes:
1. Numeric:
A numeric attribute is quantitative because it is a measurable
quantity, represented as integer or real values.
Numeric attributes are of two types: interval-scaled and
ratio-scaled.
An interval-scaled attribute has values whose differences are
interpretable, but the scale has no true zero point. Values on
an interval scale can be added and subtracted, but multiplying
or dividing them is not meaningful.
Consider temperature in degrees Celsius (Centigrade). If one
day's reading is twice another day's, we cannot say that the
first day is twice as hot, because 0 °C is not a true zero.
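The point can be checked with a small worked example. The two readings below are made-up values; converting to Kelvin (which has a true zero) shows that the Celsius ratio does not carry over.

```python
# Temperatures on two days, in degrees Celsius (an interval scale).
day1_c, day2_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = day2_c - day1_c  # 10 degrees warmer

# The ratio depends on the arbitrary zero point, so it is not meaningful.
ratio_celsius = day2_c / day1_c

# Kelvin has a true zero, so the ratio of Kelvin values is meaningful.
day1_k, day2_k = day1_c + 273.15, day2_c + 273.15
ratio_kelvin = day2_k / day1_k

print(ratio_celsius)           # 2.0
print(round(ratio_kelvin, 3))  # 1.035
```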
A ratio-scaled attribute is a numeric attribute with a fixed
zero point. If a measurement is ratio-scaled, we can speak of a
value as being a multiple (or ratio) of another value. The
values are ordered, and we can also compute the difference
between values, as well as the mean, median, mode,
interquartile range, and five-number summary.
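A minimal sketch of these summaries, using the taxable-income column of the earlier table (read here as thousands, which is an assumption). The choice of `method="inclusive"` is also an assumption; other quartile conventions give slightly different values.

```python
import statistics

# Taxable income in thousands (a ratio-scaled attribute with a true zero).
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# Quartiles; method="inclusive" treats the data as the whole population.
q1, median, q3 = statistics.quantiles(income, n=4, method="inclusive")

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(income), q1, median, q3, max(income))
print(five_number_summary)  # (60, 77.5, 92.5, 115.0, 220)

# Ratios are meaningful: the largest income as a multiple of the smallest.
largest_to_smallest = max(income) / min(income)
```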
Quantitative Attributes:
2. Discrete : Discrete data refer to information that can take on
specific, separate values rather than a continuous range. These
values are often distinct and separate from one another, and
they can be either numerical or categorical in nature.
3. Continuous: Continuous data, unlike discrete data, can take
on an infinite number of possible values within a given range. It
is characterized by being able to assume any value within a
specified interval, often including fractional or decimal values.
Example :
Test Yourself
Q. What is the difference between nominal and ordinal
attributes?
Ans: Nominal attributes represent categories without any
inherent order or ranking, while ordinal attributes have a
meaningful sequence or ranking between values, but the
magnitude between values is not precisely known.
Q. How do discrete and continuous attributes differ?
Ans: Discrete attributes represent countable values or whole
numbers, while continuous attributes can take on any value
within a range and are typically associated with
measurements.
Q. What are attributes in a data warehouse?
Ans: In a data warehouse, attributes typically refer to the
descriptive characteristics or properties of data entities, such as
dimensions or features, which are used for analysis, reporting,
and decision-making.
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
◦ sales database: customers, store items, sales
◦ medical database: patients, treatments
◦ university database: students, professors, courses
Also called samples, examples, instances, data points, objects,
tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.
Properties of Attribute Values
The type of an attribute depends on which of
the following properties/operations it
possesses:
◦ Distinctness : = and ≠
◦ Order : <, ≤, >, and ≥
◦ Addition : + and -
(Differences are meaningful)
◦ Multiplication : * and /
(Ratios are meaningful)
◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & meaningful
differences
◦ Ratio attribute: all four properties/operations
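The cumulative relationship between attribute types and properties can be sketched as a lookup table. `PROPERTIES` and `supports` are hypothetical names used only for illustration; the ratio entry assumes that a ratio attribute possesses all four properties.

```python
# Properties each attribute type possesses, cumulative from nominal to ratio.
PROPERTIES = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio": {"distinctness", "order", "addition", "multiplication"},
}

def supports(attribute_type, prop):
    """Return True if the attribute type possesses the given property."""
    return prop in PROPERTIES[attribute_type]

print(supports("ordinal", "order"))            # True
print(supports("interval", "multiplication"))  # False
```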
Different types of Attributes
Categorical (qualitative) attribute types:
Nominal
◦ Description: Nominal attribute values only distinguish. (=, ≠)
◦ Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
◦ Operations: mode, entropy, contingency correlation, χ² test
Ordinal
◦ Description: Ordinal attribute values also order objects. (<, >)
◦ Examples: hardness of minerals, {good, better, best}, grades, street numbers
◦ Operations: median, percentiles, rank correlation, run tests, sign tests
Numeric (quantitative) attribute types:
Interval
◦ Description: For interval attributes, differences between values are meaningful. (+, -)
◦ Examples: calendar dates, temperature in Celsius or Fahrenheit
◦ Operations: mean, standard deviation, Pearson's correlation, t and F tests
Ratio
◦ Description: For ratio variables, both differences and ratios are meaningful. (*, /)
◦ Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
◦ Operations: geometric mean, harmonic mean, percent variation
Types of data sets
Record
◦ Data Matrix
◦ Document Data
◦ Transaction Data
Graph
◦ World Wide Web
◦ Molecular Structures
Ordered
◦ Spatial Data
◦ Temporal Data
◦ Sequential Data
◦ Genetic Sequence Data
Data Matrix
If data objects have the same fixed set of
numeric attributes, then the data objects can
be thought of as points in a multi-dimensional
space, where each dimension represents a
distinct attribute
Such a data set can be represented by an m by n
matrix, where there are m rows, one for each
object, and n columns, one for each attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
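A minimal sketch of the two rows above as an m by n matrix. The Euclidean distance is not part of the slide; it is included only as an assumption-level illustration of what treating objects as points in a multi-dimensional space makes possible.

```python
import math

# An m-by-n data matrix: m = 2 objects (rows), n = 5 numeric attributes (columns).
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m, n = len(data_matrix), len(data_matrix[0])
print(m, n)  # 2 5

# Each row is a point in n-dimensional space, so a distance is defined.
distance = math.dist(data_matrix[0], data_matrix[1])
```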
Document Data
Each document becomes a ‘term’ vector
◦ Each term is a component (attribute) of the
vector
◦ The value of each component is the number
of times the corresponding term occurs in
the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
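Building a term vector can be sketched as follows. The vocabulary and the document text are hypothetical and chosen only for illustration; splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

# Hypothetical vocabulary and document text, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "the team will play and the coach will score with the team"

# Count every word, then read off the counts in vocabulary order.
counts = Counter(document.split())
term_vector = [counts[term] for term in vocabulary]
print(term_vector)  # [2, 1, 1, 0, 1]
```

Words outside the vocabulary (such as "the") are counted by `Counter` but never appear in the vector, and a vocabulary term absent from the document gets a count of 0.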
Transaction Data
A special type of record data, where
◦ Each record (transaction) involves a set of
items.
◦ For example, consider a grocery store. The
set of products purchased by a customer
during one shopping trip constitutes a
transaction, while the individual products
that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
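One way to hold the table above is a mapping from TID to a set of items. The helper `count_containing` is a hypothetical name; it simply counts how many transactions include a given group of items.

```python
# The five transactions from the table, each stored as a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count_containing(itemset):
    """Count the transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

print(count_containing({"Milk"}))            # 4
print(count_containing({"Diaper", "Milk"}))  # 3
```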
Graph Data
Examples: Generic graph, a molecule, and
webpages
[Figure: a generic graph and a benzene molecule]
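For the webpage case, a graph can be sketched as an adjacency list. The page names and links below are hypothetical; each key is a node and each listed target is a directed edge (a hyperlink).

```python
# Hypothetical web pages (nodes) and their hyperlinks (directed edges).
links = {
    "home": ["about", "courses"],
    "about": ["home"],
    "courses": ["home", "about"],
}

# Out-degree: number of links leaving each page.
out_degree = {page: len(targets) for page, targets in links.items()}

# In-degree: number of links pointing at each page.
in_degree = {page: 0 for page in links}
for targets in links.values():
    for target in targets:
        in_degree[target] += 1

print(in_degree)  # {'home': 2, 'about': 2, 'courses': 1}
```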
Ordered Data
Sequences of transactions
[Figure: each element of the sequence is a set of items/events]
Ordered Data
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
Average Monthly Temperature of land and ocean
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Data Preparation = Cleaning the Data
Data Preparation can take 40-80% (or
more) of the effort in a data mining
project
– Dealing with NULL (missing) values
– Dealing with errors
– Dealing with noise
– Dealing with outliers (unless that is your science!)
– Transformations: units, scale, projections
– Data normalization
– Relevance analysis: Feature Selection
– Remove redundant attributes
Noise
For objects, noise is an extraneous object
For attributes, noise refers to modification of original values
◦ Examples: distortion of a person’s voice when talking on a
poor phone connection and “snow” on a television screen
[Figure: Two Sine Waves; Two Sine Waves + Noise]
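A rough sketch of attribute noise, in the spirit of the sine-wave illustration. The frequencies (3 and 7), the sample count, and the Gaussian noise with standard deviation 0.5 are all assumptions picked for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# A clean signal: the sum of two sine waves sampled at 100 points.
t = [i / 100 for i in range(100)]
clean = [math.sin(2 * math.pi * 3 * x) + math.sin(2 * math.pi * 7 * x) for x in t]

# Noise modifies the original values; here it is Gaussian with sigma = 0.5.
noisy = [value + random.gauss(0, 0.5) for value in clean]

# Largest deviation introduced by the noise.
max_error = max(abs(n - c) for n, c in zip(noisy, clean))
print(round(max_error, 2))
```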
Outliers
Outliers are data objects with characteristics
that are considerably different from most of
the other data objects in the data set
◦ Case 1: Outliers are noise that interferes with data analysis
◦ Case 2: Outliers are the goal of our analysis
(e.g., credit card fraud, intrusion detection)
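The slides do not prescribe a detection method. As one common heuristic (an assumption here, not the course's stated approach), values more than 1.5 times the interquartile range beyond the quartiles can be flagged. The data is the taxable-income column from the earlier table, read as thousands.

```python
import statistics

# Taxable income in thousands; one value is far from the rest.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

q1, _, q3 = statistics.quantiles(income, n=4, method="inclusive")
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in income if x < lower or x > upper]
print(outliers)  # [220]
```

Whether 220 is noise to discard (Case 1) or the record of interest (Case 2) depends on the goal of the analysis.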
Missing Values
Reasons for missing values
◦ Information is not collected
(e.g., people decline to give their age and weight)
◦ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
◦ Eliminate data objects or variables
◦ Estimate missing values
Example: time series of temperature
Example: census results
◦ Ignore the missing value during analysis
◦ Replace with all possible values (weighted by their
probabilities)
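Three of the handling options can be sketched on a short temperature time series. The readings are made-up, and averaging the two neighbours is only one simple way to estimate a missing value; it assumes both neighbours are present.

```python
# Temperature readings in time order; None marks a missing value.
temps = [21.0, 22.0, None, 24.0, 23.0]

# Option 1: eliminate objects with a missing value.
complete = [t for t in temps if t is not None]

# Option 2: estimate the missing value from its neighbours
# (linear interpolation, which assumes both neighbours are present).
estimated = list(temps)
for i, t in enumerate(temps):
    if t is None:
        estimated[i] = (temps[i - 1] + temps[i + 1]) / 2

# Option 3: ignore missing values during analysis.
mean_ignoring_missing = sum(complete) / len(complete)

print(estimated)              # [21.0, 22.0, 23.0, 24.0, 23.0]
print(mean_ignoring_missing)  # 22.5
```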
Types of Missing Values
Some definitions are based on representation:
Missing data is the lack of a recorded answer
for a particular field.
Missing completely at random (MCAR)
Missing at Random (MAR)
Missing Not at Random (MNAR)
Duplicate Data
Data set may include data objects that are
duplicates, or almost duplicates of one another
◦ Major issue when merging data from heterogeneous
sources
Examples:
◦ Same person with multiple email addresses
Data cleaning
◦ Process of dealing with duplicate data issues
When should duplicate data not be removed?
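A minimal sketch of spotting near-duplicates after a merge. The records are hypothetical, and using a normalized name as the matching key is an assumption: it catches the same person with multiple email addresses, but it would also wrongly merge two different people who share a name.

```python
# Hypothetical customer records merged from two sources.
records = [
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "asha rao ", "email": "a.rao@example.org"},
    {"name": "Ravi Das", "email": "ravi@example.com"},
]

# Group records by a normalized name (lowercase, single spaces).
groups = {}
for record in records:
    key = " ".join(record["name"].lower().split())
    groups.setdefault(key, []).append(record["email"])

# Any key with more than one email is a candidate duplicate.
candidates = {name: emails for name, emails in groups.items() if len(emails) > 1}
print(candidates)
```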