Types of Data and Data Quality
Data Mining
• Data Mining is the process of discovering patterns and useful knowledge in large amounts of data
– Also known as knowledge discovery in data (KDD)
• Mining data includes knowing about
• Data
• Finding relations among data
Data
– To know about the data, it is necessary to know about
• Data objects
• Attributes of data
• Different types of data attributes
Data
• Data
– Refers to distinct pieces of information, usually formatted and
stored in a way that is efficient for movement or processing
– Collection of data objects and their attributes
• Object
– Defined by a set of attributes (attribute vector or feature vector)
– Also referred to as a record, entity, sample, etc.
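A minimal sketch of the idea that a data set is a collection of objects, each described by an attribute vector. The attribute names and values below are made up for illustration and do not come from a real data set.

```python
# Each data object is a record: a mapping from attribute name to value.
# Names and values are hypothetical, chosen only to illustrate the idea.
data_set = [
    {"name": "A", "eye_color": "brown", "temperature_c": 36.6},
    {"name": "B", "eye_color": "blue", "temperature_c": 37.2},
]

# A fixed attribute order turns each object into an attribute (feature) vector.
attributes = ["name", "eye_color", "temperature_c"]


def attribute_vector(obj, attributes):
    """Return the object's values in the given attribute order."""
    return tuple(obj[a] for a in attributes)


vectors = [attribute_vector(obj, attributes) for obj in data_set]
print(vectors[0])
```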
Attributes and Their Types
• Property or characteristics of an object
• Also referred to as variable, field, characteristic, or feature
• Examples
– Eye color of a person, temperature, etc.
• Different types of attributes
– Qualitative: nominal, ordinal, binary
– Quantitative: numeric (interval-scaled or ratio-scaled), discrete, continuous
Qualitative Attributes :: Nominal
• Related to names
• The values of a nominal attribute
– Are names of things or symbols
– Represent a category or state
• Also referred to as categorical attributes
– No ordering (rank, position) among values
• Example: hair color (black, brown, blond), marital status, occupation
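Because nominal values have no order, the meaningful operations are equality tests and counting, for instance finding the most frequent value (the mode). A short sketch, using a made-up list of hair colors as the nominal attribute:

```python
from collections import Counter

# Hypothetical nominal attribute: the values are labels with no ordering.
hair_color = ["black", "brown", "black", "blond", "brown", "black"]

# Counting and the mode are meaningful for nominal data.
counts = Counter(hair_color)
mode_value, mode_count = counts.most_common(1)[0]

# Equality comparison is meaningful; "greater than" is not.
same = hair_color[0] == hair_color[2]

print(mode_value, mode_count, same)
```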
Qualitative Attributes :: Ordinal
• Provides sufficient information to order the objects
• But the magnitude of the difference between successive values is not known
• Example: size (small, medium, large), grades (A, B, C)
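For an ordinal attribute, sorting and the median make sense, but the gap between two ranks carries no magnitude. A minimal sketch, assuming a hypothetical size attribute whose order is declared explicitly:

```python
# Hypothetical ordinal attribute: the order is known, the spacing is not.
SIZE_ORDER = ["small", "medium", "large"]
rank = {value: i for i, value in enumerate(SIZE_ORDER)}

observed = ["large", "small", "medium", "small", "large"]

# Sorting by rank is meaningful for ordinal data.
ordered = sorted(observed, key=rank.__getitem__)

# The median (middle element of the sorted odd-length list) is meaningful too.
median_value = ordered[len(ordered) // 2]

# Note: rank["large"] - rank["small"] == 2 says nothing about how much
# larger "large" is; the ranks only encode position.
print(ordered, median_value)
```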
Qualitative Attributes :: Binary
• Has only 2 values or states
• For example
– Yes or no, affected or unaffected, true or false, etc.
• Symmetric:
– Both values are equally important (e.g. gender)
• Asymmetric:
– The two values are not equally important (e.g. a test result)
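One place the symmetric/asymmetric distinction commonly matters is when comparing two objects described by binary attributes. The sketch below is an illustration rather than part of the slides: it contrasts the simple matching coefficient, which treats 0 and 1 alike, with the Jaccard coefficient, which ignores shared 0s. The vectors `p` and `q` are made up.

```python
def simple_matching(a, b):
    """Fraction of positions where both vectors agree (suits symmetric attributes)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def jaccard(a, b):
    """Agreement on 1s only; shared 0s are ignored (suits asymmetric attributes)."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 1.0


# Hypothetical test results: 1 = positive (the rarer, more important value).
p = [1, 0, 0, 0, 0, 1]
q = [1, 0, 0, 0, 0, 0]

print(simple_matching(p, q))  # dominated by the many shared 0s
print(jaccard(p, q))          # looks only at the positives
```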
Quantitative Attributes :: Numeric
• Quantitative
– It is a measurable quantity
– Represented in integer or real values
– Of two types
• Interval-Scaled
• Ratio-Scaled
Quantitative Attributes :: Numeric
• Interval-Scaled
– Has values whose differences are interpretable
– Differences can be computed, but values cannot be meaningfully multiplied or divided because the scale has no true zero point
– Examples: Calendar dates, Temperatures in Celsius or Fahrenheit
• Ratio-Scaled
– Both differences and ratios are meaningful
– The values are ordered, and the difference between values, the
mean, median, mode, etc. can be computed
– Examples: length, time, counts, etc.
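The difference between the two scales can be seen with temperature. Celsius is interval-scaled, while Kelvin (not mentioned above, used here only as a familiar ratio-scaled counterpart) has a true zero. A small sketch:

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15


c1, c2 = 10.0, 20.0

# Differences agree on both scales, so they are meaningful for interval data.
diff_c = c2 - c1
diff_k = celsius_to_kelvin(c2) - celsius_to_kelvin(c1)

# Ratios disagree: 20 °C is not "twice as hot" as 10 °C.
ratio_c = c2 / c1                                        # 2.0, but not meaningful
ratio_k = celsius_to_kelvin(c2) / celsius_to_kelvin(c1)  # roughly 1.035

print(diff_c, diff_k, ratio_c, ratio_k)
```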
Quantitative Attributes :: Discrete
• Has a finite or countably infinite set of values
• Values can be numerical or categorical
• Example: number of children, ZIP codes
Quantitative Attributes :: Continuous
• Has real numbers as attribute values
• Typically represented as floating point variables
• Examples: temperature, height, weight, etc.
Data Quality
• The measure of how well suited a data set is to serve its specific
purpose
• Measures of data quality are based on factors such as
– Accuracy
– Completeness
– Consistency
– Validity
– Uniqueness
– Timeliness
Data Quality
• Accuracy
– The data should reflect actual, real-world scenarios
– The measure of accuracy can be confirmed with a verifiable
source.
• Completeness
– Ability of the data to effectively deliver all the required values
that are available
• Consistency
– The uniformity of data as it moves across networks and
applications.
– The same data values stored in different locations should not
conflict with one another.
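Completeness and consistency can be approximated with simple checks. The sketch below is one possible way to do it, not a standard definition: the field names, the records, and the helper functions `completeness` and `inconsistent_ids` are all hypothetical.

```python
REQUIRED = ["id", "email", "city"]

# Hypothetical records from one system.
records = [
    {"id": 1, "email": "a@example.com", "city": "Pune"},
    {"id": 2, "email": None, "city": "Delhi"},
    {"id": 3, "email": "c@example.com", "city": ""},
]

# Hypothetical copy of some of the same objects held in another system.
other = [
    {"id": 1, "city": "Pune"},
    {"id": 2, "city": "Mumbai"},
]


def completeness(records, required):
    """Share of required fields that hold a non-empty value."""
    total = len(records) * len(required)
    filled = sum(
        1 for r in records for f in required if r.get(f) not in (None, "")
    )
    return filled / total


def inconsistent_ids(source_a, source_b, key, field):
    """Keys present in both sources whose `field` values conflict."""
    b_by_key = {r[key]: r for r in source_b}
    return [
        r[key]
        for r in source_a
        if r[key] in b_by_key and b_by_key[r[key]][field] != r[field]
    ]


print(completeness(records, REQUIRED))
print(inconsistent_ids(records, other, "id", "city"))
```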
Data Quality
• Validity
– Data should be collected according to defined business rules and
parameters
– Data should conform to the right format and fall within the right
range
• Uniqueness
– Ensures that there are no duplicate or overlapping values
across data sets
• Timeliness
– Timely data is data that is available when it is required
– Data may be updated in real time to ensure that it is readily
available and accessible.
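Validity, uniqueness, and timeliness can also be expressed as small checks. This is a sketch under assumptions: the email pattern, the 0 to 120 age range, the two-day freshness window, and the records themselves are illustrative choices, not rules taken from the slides.

```python
import re
from datetime import datetime, timedelta

# Deliberately simple pattern for illustration; real email validation is looser.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid(record):
    """Validity: right format (email) and right range (age)."""
    email_ok = bool(EMAIL_PATTERN.match(record["email"]))
    age_ok = 0 <= record["age"] <= 120
    return email_ok and age_ok


def duplicate_keys(records, key):
    """Uniqueness: key values that appear more than once."""
    seen, dupes = set(), set()
    for r in records:
        if r[key] in seen:
            dupes.add(r[key])
        seen.add(r[key])
    return sorted(dupes)


def is_timely(updated_at, now, max_age):
    """Timeliness: the record was updated within the allowed window."""
    return now - updated_at <= max_age


records = [
    {"id": 1, "email": "a@example.com", "age": 34, "updated_at": datetime(2024, 1, 10)},
    {"id": 2, "email": "not-an-email", "age": 29, "updated_at": datetime(2024, 1, 9)},
    {"id": 2, "email": "b@example.com", "age": 150, "updated_at": datetime(2023, 12, 1)},
]
now = datetime(2024, 1, 10, 12)
max_age = timedelta(days=2)

valid_flags = [is_valid(r) for r in records]
timely_flags = [is_timely(r["updated_at"], now, max_age) for r in records]
print(valid_flags, duplicate_keys(records, "id"), timely_flags)
```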