Lect-2 Getting To Know Your Data-Part-I

DATA MINING AND DATA WAREHOUSING

CS30013 (Cr-3)

Dr. Pradeep Kumar Mallick


Associate Professor(II)
Email: [email protected]

Getting to Know Your Data


Outline
Data Objects and Attribute Types
Types of data sets
Data Quality
What is Data?
 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
◦ Examples: eye color of a person, temperature, etc.
◦ Attribute is also known as variable, field, characteristic, dimension, or feature
 A collection of attributes describes an object
◦ Object is also known as record, point, case, sample, entity, or instance

In the table below, each column is an attribute and each row is an object:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
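As a rough illustration of the object/attribute view, the first three rows of the Tid/Refund table can be held as a list of dictionaries. This is only a sketch: the variable names are made up for the example, and "125K" is assumed to mean 125,000.

```python
# Each dict is one data object (a row); each key is an attribute (a column).
# Assumption: "125K" in the table is read as 125,000.
records = [
    {"Tid": 1, "Refund": "Yes", "Marital Status": "Single",
     "Taxable Income": 125_000, "Cheat": "No"},
    {"Tid": 2, "Refund": "No", "Marital Status": "Married",
     "Taxable Income": 100_000, "Cheat": "No"},
    {"Tid": 3, "Refund": "No", "Marital Status": "Single",
     "Taxable Income": 70_000, "Cheat": "No"},
]

# The attribute names are the keys shared by every object.
attributes = list(records[0])

# Reading one attribute across all objects gives a column of the table.
marital_status = [r["Marital Status"] for r in records]
```

Reading a row gives everything known about one object; reading a column gives one attribute across all objects.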
What are Data Attributes?
 Data attributes refer to the specific characteristics or
properties that describe individual data objects within a
dataset.
 These attributes provide meaningful information about the
objects and are used to analyze, classify, or manipulate the
data.
 Understanding and analyzing data attributes is fundamental
in various fields such as statistics, machine learning, and data
analysis, as they form the basis for deriving insights and
making informed decisions from the data.
 Within predictive models, attributes serve as the predictors
influencing an outcome. In descriptive models, attributes
constitute the pieces of information under examination for
inherent patterns or correlations.
 The set of attributes used to describe a given object is
known as an attribute vector or feature vector.
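A feature vector can be sketched as an ordered tuple of attribute values. The class name and field names below are hypothetical and chosen only for this illustration.

```python
from typing import NamedTuple


class TaxpayerFeatures(NamedTuple):
    """Hypothetical attribute set describing one data object."""
    refund: str
    marital_status: str
    taxable_income_k: float  # income in thousands (assumed unit)


obj = TaxpayerFeatures(refund="Yes", marital_status="Single",
                       taxable_income_k=125.0)

# The feature vector is the ordered collection of the object's attribute values.
feature_vector = tuple(obj)
```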
Types of attributes:
Attributes can be broadly classified into two main types:
 Qualitative: Nominal (N), Ordinal (O), Binary (B)
 Quantitative: Numeric, Discrete, Continuous
Qualitative Attributes:
1. Nominal Attributes:
 "Nominal" means relating to names. Nominal attributes hold
categorical data whose values are categories or labels without
any inherent order or ranking. They are often used to represent
names or labels associated with objects, entities, or concepts.
 Example

2. Binary Attributes:
 Binary attributes are a type of qualitative attribute where the
data can take on only two distinct values or states.
 These attributes are often used to represent yes/no,
presence/absence, or true/false conditions within a dataset.
 They are particularly useful for representing categorical data
where there are only two possible outcomes.
 For instance, in a medical study, a binary attribute could
represent whether a patient is affected or unaffected by a
Binary Attributes…
 Symmetric: In a symmetric attribute, both values or states
are considered equally important or interchangeable. For
example, in the attribute “Gender” with values “Male” and
“Female,” neither value holds precedence over the other, and
they are considered equally significant for analysis purposes.
 Asymmetric: An asymmetric attribute indicates that the two
values or states are not equally important or interchangeable.
For instance, in the attribute “Result” with values “Pass” and
“Fail,” the states are not of equal importance; passing may
hold greater significance than failing in certain contexts, such
as academic grading or certification exams.
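One place where the symmetric/asymmetric distinction matters in practice is when comparing two objects described by binary attributes. The sketch below is an assumption beyond the slides: it uses two common measures, the simple matching coefficient (which treats both states alike) and the Jaccard coefficient (which ignores 0-0 agreements). The patient vectors are hypothetical, with 1 standing for the more important state.

```python
def simple_matching(a, b):
    """Fraction of positions where two 0/1 vectors agree (0-0 and 1-1 both count)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def jaccard(a, b):
    """Agreement on 1s only; positions where both vectors are 0 are ignored."""
    both_one = sum(x == 1 and y == 1 for x, y in zip(a, b))
    any_one = sum(x == 1 or y == 1 for x, y in zip(a, b))
    # Returning 0.0 when neither vector contains a 1 is a convention chosen here.
    return both_one / any_one if any_one else 0.0


# Hypothetical test results for two patients (1 = positive, 0 = negative).
patient_a = [1, 0, 0, 0, 0, 1]
patient_b = [1, 0, 0, 0, 0, 0]

smc = simple_matching(patient_a, patient_b)  # counts the shared negatives too
jac = jaccard(patient_a, patient_b)          # looks only at the positives
```

With mostly negative results, the shared negatives push the simple matching score up, while the Jaccard score reflects only how the two patients agree on positives.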
Qualitative Attributes:
3. Ordinal Attributes: Ordinal attributes are a type of qualitative
attribute whose values have a meaningful order or ranking, but
the size of the difference between successive values is not
known. In other words, the order tells us which value ranks
higher, but not by how much.
 Example:
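One possible illustration, sketched in Python: the rating labels and their order below are hypothetical. The code sorts ordinal values and takes a median, both of which need only the order, and it never subtracts one label from another.

```python
from statistics import median_low

# Hypothetical ordinal scale, listed from lowest to highest.
GRADE_ORDER = ["poor", "fair", "good", "excellent"]
RANK = {label: position for position, label in enumerate(GRADE_ORDER)}

ratings = ["good", "poor", "excellent", "fair", "good"]

# Ordering is meaningful, so sorting by rank is allowed.
sorted_ratings = sorted(ratings, key=RANK.__getitem__)

# The median needs only the order; median_low always returns an observed rank.
median_rating = GRADE_ORDER[median_low(RANK[r] for r in ratings)]

# Comparisons are meaningful; differences such as "good" - "fair" are not.
good_beats_fair = RANK["good"] > RANK["fair"]
```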
Quantitative Attributes:
1. Numeric:
 A numeric attribute is quantitative because it is a measurable
quantity, represented by integer or real values.
 Numeric attributes are of two types: interval-scaled and
ratio-scaled.
 An interval-scaled attribute has values whose differences are
interpretable, but it has no true zero point. Values on an
interval scale can be added and subtracted but cannot be
meaningfully multiplied or divided.
 Consider temperature in degrees Celsius. If the temperature
on one day is twice that on another day, we cannot say that
the first day is twice as hot, because 0 °C is not a true zero.
 A ratio-scaled attribute is a numeric attribute with a fixed
zero point. If a measurement is ratio-scaled, we can speak of a
value as being a multiple (or ratio) of another value. The values
are ordered, and we can also compute the difference between
values, as well as the mean, median, mode, interquartile range,
and five-number summary.
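The temperature example can be checked numerically. In this sketch the two daily temperatures are made-up values; the conversion to Kelvin uses the standard offset of 273.15.

```python
def celsius_to_kelvin(celsius):
    """Convert to Kelvin, a ratio scale whose zero is a true zero."""
    return celsius + 273.15


day1_c, day2_c = 10.0, 20.0  # hypothetical daily temperatures in Celsius

# Differences are meaningful on an interval scale.
difference_c = day2_c - day1_c

# The Celsius ratio is 2.0, but it does not mean "twice as hot".
naive_ratio = day2_c / day1_c

# On the Kelvin scale the ratio is meaningful, and it is only about 1.035.
kelvin_ratio = celsius_to_kelvin(day2_c) / celsius_to_kelvin(day1_c)

# The difference is the same on both scales, as expected for interval data.
difference_k = celsius_to_kelvin(day2_c) - celsius_to_kelvin(day1_c)
```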
Quantitative Attributes:
2. Discrete: Discrete data can take on only specific, separate
values rather than a continuous range. The values are distinct
from one another, and they can be either numerical or
categorical in nature.

3. Continuous: Continuous data, unlike discrete data, can take
on an infinite number of possible values within a given range. A
continuous attribute can assume any value within a specified
interval, often including fractional or decimal values.
 Example :
Test Yourself
Q. What is the difference between nominal and ordinal
attributes?
Ans: Nominal attributes represent categories without any
inherent order or ranking, while ordinal attributes have a
meaningful sequence or ranking between values, but the
magnitude between values is not precisely known.
Q. How do discrete and continuous attributes differ?
Ans: Discrete attributes represent countable values or whole
numbers, while continuous attributes can take on any value
within a range and are typically associated with
measurements.

Q. What are attributes in a data warehouse?
Ans: In a data warehouse, attributes typically refer to the
descriptive characteristics or properties of data entities, such as
dimensions or features, which are used for analysis, reporting,
and decision-making.
Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
◦ sales database: customers, store items, sales
◦ medical database: patients, treatments
◦ university database: students, professors, courses
 Also called samples, examples, instances, data points, objects,
tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Properties of Attribute Values
 The type of an attribute depends on which of
the following properties/operations it
possesses:
◦ Distinctness: = and ≠
◦ Order: <, ≤, >, and ≥
◦ Addition: + and - (differences are meaningful)
◦ Multiplication: * and / (ratios are meaningful)

◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & meaningful
differences
◦ Ratio attribute: all four properties
Different types of Attributes
Category | Attribute Type | Description | Examples | Operations
Categorical (Qualitative) | Nominal | Nominal attribute values only distinguish. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Categorical (Qualitative) | Ordinal | Ordinal attribute values also order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Numeric (Quantitative) | Interval | For interval attributes, differences between values are meaningful. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Numeric (Quantitative) | Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, current | geometric mean, harmonic mean, percent variation
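The "Operations" column can be read as a rule for choosing a summary statistic. The sketch below picks one measure of central tendency per attribute type, using the first operation listed for each type. The function name and the sample values are hypothetical, and ordinal values are assumed to be already encoded as numeric ranks.

```python
from statistics import geometric_mean, mean, median, mode


def central_tendency(values, attribute_type):
    """Return one summary statistic that is meaningful for the attribute type."""
    if attribute_type == "nominal":
        return mode(values)            # only distinctness is available
    if attribute_type == "ordinal":
        return median(values)          # order is available (values are ranks)
    if attribute_type == "interval":
        return mean(values)            # differences are meaningful
    if attribute_type == "ratio":
        return geometric_mean(values)  # ratios are meaningful
    raise ValueError(f"unknown attribute type: {attribute_type!r}")


eye_color = central_tendency(["brown", "blue", "brown"], "nominal")
hardness = central_tendency([3, 7, 5], "ordinal")
temperature_c = central_tendency([10.0, 20.0, 30.0], "interval")
growth_factor = central_tendency([1.0, 4.0, 16.0], "ratio")
```

Each type also supports the statistics of the types above it in the table, so a mode or median of ratio data is still valid.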
Types of data sets
Record
◦ Data Matrix
◦ Document Data
◦ Transaction Data
Graph
◦ World Wide Web
◦ Molecular Structures
Ordered
◦ Spatial Data
◦ Temporal Data
◦ Sequential Data
◦ Genetic Sequence Data
Data Matrix
 If data objects have the same fixed set of
numeric attributes, then the data objects can
be thought of as points in a multi-dimensional
space, where each dimension represents a
distinct attribute
 Such data set can be represented by an m by n
matrix, where there are m rows, one for each
object, and n columns, one for each attribute
Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
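A minimal sketch of the m by n view, using the two rows of the table as a list of lists. Treating each object as a point also makes a distance between objects well defined. Euclidean distance is used here as one common choice and is not named in the slides.

```python
import math

columns = ["Projection of x load", "Projection of y load",
           "Distance", "Load", "Thickness"]

# m rows (one per object), n columns (one per attribute).
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)     # number of objects
n = len(data_matrix[0])  # number of attributes

# Each row is a point in n-dimensional space, so points can be compared.
distance = math.dist(data_matrix[0], data_matrix[1])
```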
Document Data
 Each document becomes a ‘term’ vector
◦ Each term is a component (attribute) of the
vector
◦ The value of each component is the number
of times the corresponding term occurs in
the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
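Building a term vector amounts to counting how often each vocabulary term occurs. In this sketch the vocabulary reuses the ten terms from the slide, while the sample sentence and the whitespace tokenization are simplifying assumptions.

```python
from collections import Counter

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_vector(text, vocab):
    """Count occurrences of each vocabulary term in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


# Hypothetical document, not taken from the slide.
document = "team play play ball score game game win"
vector = term_vector(document, vocabulary)
```

Terms that never occur get a count of 0, which is why document vectors are usually sparse.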
Transaction Data
A special type of record data, where
◦ Each record (transaction) involves a set of
items.
◦ For example, consider a grocery store. The
set of products purchased by a customer
during one shopping trip constitutes a
transaction, while the individual products
that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
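Because each transaction is a set of items, a mapping from TID to a Python set is a natural representation. The sketch below counts how many transactions contain each item. The term "support" for the resulting fraction is standard in association analysis but is not used on this slide.

```python
from collections import Counter

# TID -> set of items, copied from the table.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions that contain each item.
item_counts = Counter(item for items in transactions.values() for item in items)

# Fraction of transactions containing Milk (its support).
support_milk = item_counts["Milk"] / len(transactions)
```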
Graph Data
 Examples: Generic graph, a molecule, and
webpages
[Figure: a generic graph with numbered nodes; the benzene molecule]
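Graph data is commonly stored as an adjacency structure. As a small sketch, the carbon ring of benzene can be written as six nodes joined in a cycle. Hydrogen atoms and bond types are left out, and the labels C1 to C6 are made up for the example.

```python
# Six carbon atoms of the benzene ring, labelled C1..C6 (hypothetical labels).
ring = [f"C{i}" for i in range(1, 7)]

# Adjacency sets: each node maps to the nodes it is bonded to.
adjacency = {atom: set() for atom in ring}
for index, atom in enumerate(ring):
    neighbor = ring[(index + 1) % len(ring)]  # wrap around to close the ring
    adjacency[atom].add(neighbor)
    adjacency[neighbor].add(atom)

# Each undirected edge is stored twice, once per endpoint.
edge_count = sum(len(neighbors) for neighbors in adjacency.values()) // 2
```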
Ordered Data
Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
 Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Data Quality
 What kinds of data quality problems?
 How can we detect problems with the data?
 What can we do about these problems?

Data Preparation = Cleaning the Data

Data preparation can take 40-80% (or more) of the effort in a
data mining project:
– Dealing with NULL (missing) values
– Dealing with errors
– Dealing with noise
– Dealing with outliers (unless that is your science!)
– Transformations: units, scale, projections
– Data normalization
– Relevance analysis: Feature Selection
– Remove redundant attributes
Noise
 For objects, noise is an extraneous object
 For attributes, noise refers to modification of original values
◦ Examples: distortion of a person’s voice when talking on a
poor phone and “snow” on a television screen

[Figure panels: Two Sine Waves; Two Sine Waves + Noise]
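The two-sine-waves illustration can be reproduced roughly as follows. The frequencies, sampling rate, and noise level are all assumptions, since the slide does not state them. Gaussian noise is used as one common noise model.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def two_sine_waves(t):
    """Sum of two sine waves at assumed frequencies of 1 Hz and 3 Hz."""
    return math.sin(2 * math.pi * 1.0 * t) + math.sin(2 * math.pi * 3.0 * t)


times = [i / 100 for i in range(100)]  # one second sampled at 100 Hz (assumed)
clean = [two_sine_waves(t) for t in times]

# Noise modifies the original attribute values; sigma = 0.5 is an assumption.
noisy = [value + random.gauss(0.0, 0.5) for value in clean]
```

Plotting `clean` and `noisy` against `times` gives two panels comparable to the ones on the slide.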


Outliers
 Outliers are data objects with characteristics
that are considerably different from most of
the other data objects in the data set
◦ Case 1: Outliers are noise that interferes
with data analysis
◦ Case 2: Outliers are the goal of our analysis
 Credit card fraud
 Intrusion detection
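The slides do not prescribe a detection method. One widely used rule of thumb, sketched here as an assumption, flags values lying more than 1.5 interquartile ranges outside the quartiles. The data are the ten Taxable Income values from the earlier table, in thousands.

```python
from statistics import quantiles

# Taxable Income from the earlier table, in thousands.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# Quartiles using the default "exclusive" method of statistics.quantiles.
q1, _, q3 = quantiles(incomes, n=4)
iqr = q3 - q1

# 1.5 * IQR fences: a common convention, not a rule from the slides.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in incomes if x < lower_fence or x > upper_fence]
```

Whether the flagged value is noise to remove (Case 1) or the thing being looked for (Case 2) depends on the application.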
Missing Values
Reasons for missing values
◦ Information is not collected
(e.g., people decline to give their age and weight)
◦ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

Handling missing values


◦ Eliminate data objects or variables
◦ Estimate missing values
 Example: time series of temperature
 Example: census results
◦ Ignore the missing value during analysis
◦ Replace with all possible values (weighted by their
probabilities)
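Three of the handling strategies can be sketched on a short temperature series. The readings are made up, and mean imputation is used as the simplest estimator. For a real time series, interpolating between neighbouring readings is often a better estimate.

```python
from statistics import mean

# Hypothetical temperature readings; None marks a missing value.
temperatures = [21.0, None, 23.0, 22.0, None, 24.0]

# Eliminate: keep only the readings that were actually recorded.
complete = [t for t in temperatures if t is not None]

# Ignore during analysis: compute the statistic over observed values only.
average = mean(complete)

# Estimate: fill each gap with the mean of the observed values.
imputed = [average if t is None else t for t in temperatures]
```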
Types of Missing Values
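Some definitions are based on representation:
Missing data is the lack of a recorded answer
for a particular field.
 Missing Completely at Random (MCAR): the chance that a
value is missing does not depend on any data, observed or
unobserved
 Missing at Random (MAR): the chance that a value is
missing depends only on other observed data
 Missing Not at Random (MNAR): the chance that a value is
missing depends on the missing value itself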
Duplicate Data
 Data set may include data objects that are
duplicates, or almost duplicates of one another
◦ Major issue when merging data from heterogeneous
sources

 Examples:
◦ Same person with multiple email addresses

 Data cleaning
◦ Process of dealing with duplicate data issues

 When should duplicate data not be removed?
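A rough sketch of merging near-duplicates for the "same person with multiple email addresses" case. The records are invented, and matching on a normalized name alone is a strong assumption: it can wrongly merge two different people who share a name, which is one reason duplicates should not always be removed automatically.

```python
# Hypothetical records merged from two sources.
people = [
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "asha  rao ", "email": "a.rao@example.org"},
    {"name": "Ben Lee", "email": "ben@example.com"},
]


def normalize(name):
    """Lowercase the name and collapse repeated or trailing whitespace."""
    return " ".join(name.lower().split())


# Group records that share the same normalized name.
merged = {}
for person in people:
    key = normalize(person["name"])
    entry = merged.setdefault(key, {"name": key, "emails": []})
    entry["emails"].append(person["email"])

deduplicated = list(merged.values())
```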
