Data Mining and Predictive Modelling
Lecture 2: Functionalities, KDD Process, Data attributes and Properties
Knowledge Discovery from Data
• Data cleaning (to remove noise and inconsistent data)
• Data integration (where multiple data sources may be combined)
• Data selection (where data relevant to the analysis task are
retrieved from the database)
• Data mining (an essential process where intelligent methods are
applied to extract data patterns)
• Pattern evaluation (to identify the truly interesting patterns
representing knowledge)
• Knowledge representation (where visualization and knowledge
representation techniques are used to present mined knowledge)
[Figure: The KDD process. Databases pass through cleaning and
integration into a data warehouse; selection and transformation,
followed by data mining, yield patterns; evaluation and
presentation turn patterns into knowledge.]
Data Mining Functionalities
Data Characterization
• Summarization of the general characteristics of a target
class of data.
• The data corresponding to the user-specified class are
generally collected by a database query.
Data Discrimination
• Comparison of the general characteristics of target class
data objects with the general characteristics of objects
from one or a set of contrasting classes.
• The target and contrasting classes can be specified by
the user.
Association Analysis
• It analyses sets of items that frequently occur together
in a transactional dataset. Two parameters, support and
confidence, are used to determine association rules.
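To make support and confidence concrete, here is a minimal Python sketch. The transactions and the function names `support` and `confidence` are illustrative assumptions, not part of the lecture; support is taken as the fraction of transactions containing an itemset, and confidence as support(A and B together) divided by support(A).

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent together) over support of antecedent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Rule under test: bread => milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data, bread and milk appear together in 2 of 4 transactions, and 2 of the 3 bread transactions also contain milk. A rule is usually kept only if both values exceed user-chosen minimum thresholds.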
Classification
• It is the procedure of discovering a model that describes and
distinguishes data classes or concepts.
• The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is known).
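As a rough illustration of learning from labelled training data, the sketch below uses a 1-nearest-neighbour rule. This is one simple classifier chosen for brevity, not a method prescribed by the lecture; the data points, labels, and the function name `classify` are hypothetical.

```python
import math

# Hypothetical training data: (feature vector, known class label).
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((5.0, 5.0), "high"),
    ((6.0, 5.5), "high"),
]

def classify(point, training):
    """Return the label of the training object closest to `point`."""
    nearest = min(training, key=lambda pair: math.dist(point, pair[0]))
    return nearest[1]

print(classify((1.2, 1.4), training))
print(classify((5.5, 5.0), training))
```

The "model" here is just the stored training set; other classifiers (decision trees, rule sets, and so on) summarize the training data into a more compact form.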
Prediction
• It predicts unavailable data values or upcoming trends.
• It can be a prediction of missing numerical values or of
increase/decrease trends in time-related information.
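One simple way to predict a numerical value or an increase/decrease trend is to fit a straight line by least squares. The sketch below assumes that approach; the monthly sales figures and the helper `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical time-related data: sales for months 1..4.
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept  # predicted value for the unseen month 5
trend = "increasing" if slope > 0 else "decreasing or flat"
print(slope, intercept, forecast, trend)
```

The sign of the slope gives the trend direction, and evaluating the line at an unseen point gives the predicted value.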
Clustering
• It is similar to classification, but the classes are not
predefined.
• The classes (clusters) are derived from the data attributes
themselves; this is known as unsupervised learning.
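The following sketch shows how groups can emerge from the data without any class labels. It uses plain k-means on one-dimensional values, which is only one of many clustering methods; the data, the starting centroids, and the name `kmeans_1d` are assumptions made for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            idx = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled measurements.
values = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids, clusters = kmeans_1d(values, centroids=[0.0, 10.0])
print(centroids, clusters)
```

No labels are supplied at any point: the two groups are discovered purely from the distances between values.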
Outlier Analysis
• Outliers are the data elements that
cannot be grouped in a given class or
cluster.
• They behave differently from the
general behavior of other data objects.
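As a hedged example of flagging values that deviate from the general behavior, the sketch below applies the common 1.5 × IQR rule. The lecture does not name a specific outlier test, so this rule, the sample data, and the function name `iqr_outliers` are assumptions.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

# Hypothetical measurements with one value far from the rest.
data = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(data)
print(outliers)
```

Whether a flagged value is noise to discard or an interesting event (for example, a fraudulent transaction) depends on the application.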
Evolution Analysis
• It describes the trends of objects whose
behavior changes over time.
Data and Attribute Types
Data Objects and Attributes
• A data object represents an entity.
• Data objects are typically described by
attributes.
• Data objects can also be referred to as samples,
instances, data points, or objects.
• If the data objects are stored in a database, they are
data tuples.
• An attribute is a data field representing a
characteristic or feature of a data object.
• The distribution of data involving one attribute
(or variable) is called univariate.
• A bivariate distribution involves two attributes,
and so on.
Attributes
Nominal Attributes
• Nominal means “relating to names”.
• The values of nominal attributes are symbols or
names of things.
• Each value represents some kind of category, code,
or state, and so nominal attributes are also referred
to as categorical. These values do not have any
meaningful order.
Binary Attributes
• A binary attribute is a nominal attribute with only
two categories: 0 and 1, where 0 typically means
that the attribute is absent, and 1 means that it is
present.
• Binary attributes are referred to as Boolean if the
two states correspond to true or false.
Ordinal Attributes
• An ordinal attribute has possible values with a
meaningful order or ranking among them, but the
magnitude between successive values is not known.
• Examples: student grades, customer satisfaction ratings
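The sketch below illustrates what an ordering without known magnitudes allows: values can be ranked and a median picked, but differences between ranks carry no meaning. The satisfaction scale and survey responses are hypothetical.

```python
# Hypothetical ordinal scale, listed from lowest to highest.
SATISFACTION_ORDER = [
    "very dissatisfied",
    "dissatisfied",
    "neutral",
    "satisfied",
    "very satisfied",
]
# Map each label to its position; only the order of these numbers is meaningful.
rank = {label: i for i, label in enumerate(SATISFACTION_ORDER)}

responses = ["satisfied", "neutral", "very satisfied", "dissatisfied", "satisfied"]

# Sorting and taking the middle element rely only on order, so they are valid.
ranked = sorted(responses, key=rank.__getitem__)
median_response = ranked[len(ranked) // 2]
print(ranked, median_response)

# By contrast, rank["satisfied"] - rank["neutral"] is NOT a meaningful
# quantity: the gap between adjacent categories is unknown.
```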
Numeric Attributes (Quantitative)
• Interval-scaled: an attribute measured on a scale of
equal-size units, e.g., a temperature attribute in Celsius
or Fahrenheit is interval-scaled.
• Ratio-scaled: a numeric attribute with an
inherent zero point.
• If a measurement is ratio-scaled, we can speak of a
value as being a multiple (or ratio) of another value.
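A short worked example of why ratios only make sense with an inherent zero point: Celsius temperatures are interval-scaled, whereas Kelvin temperatures are ratio-scaled. The specific temperatures and the helper name `celsius_to_kelvin` are illustrative.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval-scaled) to Kelvin (ratio-scaled)."""
    return c + 273.15

c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = c_high - c_low

# The Celsius ratio is arithmetically 2.0, but 20 °C is not "twice as hot"
# as 10 °C because 0 °C is an arbitrary zero.
naive_ratio = c_high / c_low

# On the Kelvin scale the zero point is inherent, so the ratio is meaningful.
true_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(difference, naive_ratio, round(true_ratio, 4))
```

The Kelvin ratio comes out only slightly above 1, which shows how misleading the Celsius ratio of 2.0 would be.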
Discrete Attribute
• Has a finite or countably infinite set of values,
which may or may not be represented as
integers.
• Examples: student grades, customer satisfaction
• Countably infinite: customer ID, PIN codes
Continuous Attribute
• An attribute that is not discrete is continuous.
• Continuous attributes are typically represented
as floating-point variables.
• The terms numeric attribute and continuous
attribute are often used interchangeably in the
literature.