UNIT - I
Data mining access of a database differs from this traditional access in several ways:
• Query:
The query might not be well formed or precisely stated. The data miner might not even be exactly sure of what
he wants to see.
• Data:
The data accessed is usually a different version from that of the original operational database. The data have
been cleansed and modified to better support the mining process.
• Output:
The output of the data mining query probably is not a subset of the database. Instead it is the output of some
analysis of the contents of the database.
Data mining involves many different algorithms to accomplish different tasks.
All of these algorithms attempt to fit a model to the data.
The algorithms examine the data and determine a model that is closest to the characteristics of the
data being examined.
Data mining algorithms can be characterized as consisting of three parts:
• Model:
The purpose of the algorithm is to fit a model to the data.
• Preference:
Some criterion must be used to select one model over another.
• Search:
All algorithms require some technique to search the data.
A predictive model makes a prediction about values of data using known results found from different data.
Predictive modeling is often based on the use of historical data.
A descriptive model identifies patterns or relationships in data. Unlike the predictive model, a descriptive model
serves as a way to explore the properties of the data examined, not to predict new properties.
BASIC DATA MINING TASKS
1. Classification:
Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the
classes are determined before examining the data.
2. Regression:
Regression is used to map a data item to a real-valued prediction variable. It assumes that the target data fit into
some known type of function (e.g., linear or logistic).
3. Time Series Analysis:
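The classification and regression tasks can be illustrated with a minimal Python sketch. The heights, labels, and the linear model below are invented for illustration, and 1-nearest-neighbour stands in for the many classification algorithms in use:

```python
# Hypothetical toy data: classify a person as "short", "medium", or "tall"
# from height (in inches) using 1-nearest-neighbour -- a minimal sketch of
# the classification task, not a production algorithm.

training = [(60, "short"), (66, "medium"), (68, "medium"), (74, "tall")]

def classify(height):
    """Return the class label of the closest training example."""
    return min(training, key=lambda ex: abs(ex[0] - height))[1]

# Regression, by contrast, predicts a real value; the slope and intercept
# here are illustrative assumptions mapping age to height:
def predict_height(age):
    return 2.5 * age + 30.0   # assumed linear model for illustration

print(classify(61))        # closest to 60 -> "short"
print(predict_height(10))  # 2.5*10 + 30 = 55.0
```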
With time series analysis, the value of an attribute is examined as it varies over time.
The values usually are obtained as evenly spaced time points (daily, weekly, hourly, etc.).
A time series plot is used to visualize the time series.
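As a small illustrative sketch of time series analysis, a simple moving average over an evenly spaced (e.g. daily) series smooths out fluctuations to expose the trend; the readings below are invented:

```python
# Sketch of time series analysis: a simple moving average smooths an
# evenly spaced series to expose the trend. Data values are made up.

def moving_average(series, window):
    """Average each run of `window` consecutive points."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

daily = [10, 12, 11, 15, 14, 18, 17]   # hypothetical daily readings
print(moving_average(daily, 3))        # smoothed series, 2 points shorter
```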
4. Prediction:
Many real-world data mining applications can be seen as predicting future data states based on past and
current data.
Prediction can be viewed as a type of classification. (Note: This is a data mining task that is different from the
prediction model, although the prediction task is a type of prediction model.)
The difference is that prediction is predicting a future state rather than a current state. Here we are referring
to a type of application rather than to a type of data mining modeling approach.
Prediction applications include flood forecasting, speech recognition, machine learning, and pattern recognition.
Although future values may be predicted using time series analysis or regression techniques, other approaches
may be used as well.
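One simple way to predict a future state, assuming an approximately linear trend, is to fit a least-squares line to past observations and extrapolate; the data points here are invented for illustration:

```python
# Prediction sketch: fit a least-squares line to past observations and
# extrapolate the next value. Data are invented; a real application
# (flood levels, etc.) would use domain data and richer models.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4]          # past time points
ys = [2.0, 4.1, 5.9, 8.0]  # past observations
m, b = fit_line(xs, ys)
print(m * 5 + b)           # extrapolated value for time point 5
```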
5. Clustering
Clustering is similar to classification except that the groups are not predefined, but rather defined by the data
alone.
Clustering is alternatively referred to as unsupervised learning or segmentation.
It can be thought of as partitioning or segmenting the data into groups that might or might not be disjoint.
The clustering is usually accomplished by determining the similarity among the data on predefined attributes.
The most similar data are grouped into clusters.
Since the clusters are not predefined, a domain expert is often required to interpret the meaning of the created
clusters.
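A minimal k-means sketch (one-dimensional, k = 2) shows how clusters emerge from the similarity of the data alone; the points, the choice of k, and the initial centres are illustrative assumptions:

```python
# Minimal k-means sketch (1-D, k=2): groups are not predefined but
# emerge from similarity among the data themselves.

def kmeans(points, centers, rounds=10):
    for _ in range(rounds):
        # assign each point to its nearest centre
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # move each centre to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]
centers, clusters = kmeans(points, [0.0, 10.0])
print(centers)   # roughly [1.0, 8.0] for these points
```

A domain expert would still have to decide what the two groups mean, which is exactly the interpretation issue noted above.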
6. Summarization
Summarization maps data into subsets with associated simple descriptions. It is also called characterization or
generalization.
7. Link Analysis
Link analysis, alternatively referred to as affinity analysis or association, refers to the data mining task of
uncovering relationships among data.
The best example of this type of application is to determine association rules.
An association rule is a model that identifies specific types of data associations.
These associations are often used in the retail sales community to identify items that are frequently purchased
together
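Support and confidence, the two measures usually attached to association rules, can be computed directly over a hypothetical set of market-basket transactions (the item names are invented):

```python
# Sketch of association-rule measures over hypothetical transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Support of (lhs and rhs) divided by support of lhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "milk"}))       # appears in 2 of 4 transactions
print(confidence({"bread"}, {"milk"}))  # rule: bread -> milk
```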
8. Sequence Discovery
Sequential analysis, or sequence discovery, is used to determine sequential patterns in data. These patterns are based
on a time sequence of actions.
DEFINITION 1.1.
Knowledge discovery in databases (KDD) is the process of finding useful information and patterns in data.
DEFINITION 1.2.
Data mining is the use of algorithms to extract the information and patterns derived by the KDD process.
1. Selection:
The data needed for the data mining process may be obtained from many different and heterogeneous data sources.
2. Preprocessing:
The data to be used by the process may have incorrect or missing data.
There may be anomalous data from multiple sources involving different data types and metrics.
There may be many different activities performed at this time.
Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining
tools).
3. Transformation:
Data from different sources must be converted into a common format for processing.
Some data may be encoded or transformed into more usable formats.
Data reduction may be used to reduce the number of possible data values being considered.
4. Data mining:
Based on the data mining task being performed, this step applies algorithms to the transformed data to generate the desired
results.
5. Interpretation/evaluation:
How the data mining results are presented to the users is extremely important because the usefulness of the results is dependent
on it.
Various visualization and GUI strategies are used at this last step.
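Steps 2 and 3 above (preprocessing and transformation) can be sketched on a toy attribute: supply a missing value with the mean of the known values, then min-max normalise into a common [0, 1] range. The records are invented for illustration:

```python
# Toy sketch of KDD preprocessing + transformation on one attribute.

raw = [4.0, None, 8.0, 6.0]       # one attribute with a missing value

# Preprocessing: supply missing data (here, with the mean of known values)
known = [v for v in raw if v is not None]
mean = sum(known) / len(known)
cleaned = [v if v is not None else mean for v in raw]

# Transformation: min-max normalisation into a common [0, 1] range
lo, hi = min(cleaned), max(cleaned)
normalised = [(v - lo) / (hi - lo) for v in cleaned]
print(normalised)                 # [0.0, 0.5, 1.0, 0.5]
```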
Visualization refers to the visual presentation of data. Visualization techniques include:
• Graphical:
Traditional graph structures including bar charts, pie charts, histograms, and line graphs may be used.
• Geometric:
Geometric techniques include the box plot and scatter diagram techniques.
• Icon-based:
Using figures, colors, or other icons can improve the presentation of the results.
• Pixel-based:
With these techniques each data value is shown as a uniquely colored pixel.
• Hierarchical:
These techniques hierarchically divide the display area (screen) into regions based on data values.
• Hybrid:
The preceding approaches can be combined into one display.
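In the spirit of the graphical techniques above, even a plain-text bar chart conveys category counts; the categories and counts below are invented:

```python
# Minimal text "visualization": a bar chart drawn with '#' characters.

counts = {"short": 3, "medium": 7, "tall": 2}

def bar_chart(data):
    width = max(len(k) for k in data)      # align the labels
    return "\n".join(f"{k.ljust(width)} | {'#' * v}"
                     for k, v in data.items())

print(bar_chart(counts))
```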
KDD VERSUS DATA MINING
• Definition:
KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and
relationships in data. Data mining refers to a process of extracting useful and valuable information or patterns from
large data sets.
• Objective:
KDD aims to find useful knowledge from data; data mining aims to extract useful information from data.
• Techniques used:
KDD: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and
knowledge representation and visualization. Data mining: association rules, classification, clustering, regression,
decision trees, neural networks, and dimensionality reduction.
• Output:
KDD produces structured information, such as rules and models, that can be used to make decisions or predictions.
Data mining produces patterns, associations, or insights that can be used to improve decision-making or understanding.
• Focus:
KDD focuses on the discovery of useful knowledge, rather than simply finding patterns in data. Data mining focuses
on the discovery of patterns or relationships in data.
• Role of domain expertise:
Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data,
and interpreting the results. It is less critical in data mining, as the algorithms are designed to identify patterns
without relying on prior knowledge.
DATA MINING ISSUES
There are many important implementation issues associated with data mining:
1. Human interaction:
Since data mining problems are often not precisely stated, interfaces may be needed with both domain and
technical experts. Technical experts are used to formulate the queries and assist in interpreting the results. Users
are needed to identify training data and desired results.
2. Overfitting:
When a model is generated that is associated with a given database state it is desirable that the model also fit
future database states. Overfitting occurs when the model does not fit future states. This may be caused by
assumptions that are made about the data or may simply be caused by the small size of the training database. For
example, a classification model for an employee database may be developed to classify employees as short,
medium, or tall. If the training database is quite small, the model might erroneously indicate that a short person is
anyone under five feet eight inches because there is only one entry in the training database under five feet eight.
In this case, many future employees would be erroneously classified as short. Overfitting can arise under other
circumstances as well, even though the data are not changing.
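The employee example can be caricatured in a few lines: a cutoff learned from a tiny training set (heights in inches, all invented) goes on to classify future employees in a way the full population would not support:

```python
# Overfitting sketch: with one "short" training example, the learned
# cutoff reflects that single entry rather than the true distribution.

def learn_short_threshold(train):
    """Naively call anyone below the tallest 'short' example short."""
    return max(h for h, label in train if label == "short")

tiny_train = [(67, "short"), (70, "medium"), (74, "tall")]
threshold = learn_short_threshold(tiny_train)   # 67 inches = 5 ft 7 in

# With so little training data the cutoff is arbitrary, so future
# employees of ordinary height are all swept into the "short" class:
future = [66, 65, 64]
misclassified = [h for h in future if h < threshold]
print(threshold, misclassified)
```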
3. Outliers:
There are often many data entries that do not fit nicely into the derived model. This becomes even more of an issue
with very large databases. If a model is developed that includes these outliers, then the model may not behave well for
data that are not outliers.
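One common, simple way to flag outliers, offered here as an illustrative sketch rather than a prescribed method, is to mark values more than two standard deviations from the mean; the data and the cutoff of 2 are assumptions:

```python
# Outlier sketch: flag values more than `cutoff` standard deviations
# from the mean. Such entries may be excluded before model building.

def outliers(values, cutoff=2.0):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > cutoff * std]

data = [10, 11, 9, 10, 12, 10, 11, 50]
print(outliers(data))   # 50 lies far from the rest
```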
4. Interpretation of results :
Currently, data mining output may require experts to correctly interpret the results, which might otherwise be
meaningless to the average database user.
5. Visualization of results:
To easily view and understand the output of data mining algorithms, visualization of the results is helpful.
6. Large datasets:
The massive datasets associated with data mining create problems when applying algorithms designed for small
datasets. The cost of many modeling algorithms grows exponentially with dataset size, making them too inefficient for
large datasets. Sampling and parallelization are effective tools to attack this scalability problem.
7. High dimensionality:
A conventional database schema may be composed of many different attributes. The problem here is that not all
attributes may be needed to solve a given data mining problem. In fact, the use of some attributes may interfere
with the correct completion of a data mining task. The use of other attributes may simply increase the overall
complexity and decrease the efficiency of an algorithm. This problem is sometimes referred to as the
curse of dimensionality, meaning that there are many attributes (dimensions) involved and it is difficult to determine
which ones should be used. One solution to this high dimensionality problem is to reduce the number of attributes,
which is known as dimensionality reduction. However, determining which attributes are not needed is not always easy
to do.
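A crude form of dimensionality reduction, given only as an illustrative sketch, is to drop attributes with (near-)zero variance, since a constant attribute cannot help discriminate records; the table and threshold are invented, and real systems use richer criteria such as feature selection or PCA:

```python
# Dimensionality-reduction sketch: keep only attributes whose values
# actually vary across the records.

def variance(column):
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)

# rows are records; columns are attributes
table = [
    [1.0, 5.0, 3.0],
    [2.0, 5.0, 1.0],
    [3.0, 5.0, 2.0],
]
columns = list(zip(*table))
keep = [i for i, col in enumerate(columns) if variance(col) > 1e-9]
print(keep)   # attribute 1 is constant, so only attributes 0 and 2 remain
```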
8. Multimedia data:
Most previous data mining algorithms are targeted to traditional data types (numeric, character, text, etc.). The
use of multimedia data, such as that found in GIS databases, complicates or invalidates many proposed algorithms.
9. Missing data:
During the preprocessing phase of KDD, missing data may be replaced with estimates. This and other approaches
to handling missing data can lead to invalid results in the data mining step.
10. Irrelevant data:
Some attributes in the database might not be of interest to the data mining task being developed.
11. Noisy data:
Some attribute values might be invalid or incorrect. These values are often corrected before running data mining
applications.
12. Changing data:
Databases cannot be assumed to be static. However, most data mining algorithms do assume a static database. This
requires that the algorithm be completely rerun anytime the database changes.
13. Integration:
The KDD process is not currently integrated into normal data processing activities. KDD requests may be treated as
special, unusual, or one-time needs. This makes them inefficient, ineffective, and not general enough to be used on an
ongoing basis. Integration of data mining functions into traditional DBMS systems is certainly a desirable goal.
14. Application:
Determining the intended use for the information obtained from the data mining function is a challenge. Indeed, how
business executives can effectively use the output is sometimes considered the more difficult part, not the running of the
algorithms themselves. Because the data are of a type that has not previously been known, business practices may have
to be modified to determine how to effectively use the information uncovered.
DATA MINING METRICS
Measuring the effectiveness or usefulness of a data mining approach is not always straightforward. In fact, different
metrics could be used for different techniques and also based on the interest level. From an overall business or
usefulness perspective, a measure such as return on investment (ROI) could be used. ROI examines the difference
between what the data mining technique costs and what the savings or benefits from its use are. Of course, this would
be difficult to measure because the return is hard to quantify. It could be measured as increased sales, reduced
advertising expenditure, or both. In a specific advertising campaign implemented via targeted catalog mailings, the
percentage of catalog recipients and the amount of purchase per recipient would provide one means to measure the
effectiveness of the mailings.
In this text, however, we use a more computer science/database perspective to measure various data mining
approaches. We assume that the business management has determined that a particular data mining application be
made. They subsequently will determine the overall effectiveness of the approach using some ROI (or related)
strategy. Our objective is to compare different alternatives to implementing a specific data mining task. The metrics
used include the traditional metrics of space and time based on complexity analysis. In some cases, such as accuracy
in classification, more specific metrics targeted to a data mining task may be used.
Social implications of DATA MINING
The integration of data mining techniques into normal day-to-day activities has become commonplace. We are
confronted daily with targeted advertising, and businesses have become more efficient through the use of data mining
activities to reduce costs. Data mining adversaries, however, are concerned that this information is being obtained at
the cost of reduced privacy. Data mining applications can derive much demographic information concerning
customers that was previously not known or hidden in the data. The unauthorized use of such data could result in the
disclosure of information that is deemed to be confidential.
We have recently seen an increase in interest in data mining techniques targeted to such applications as fraud
detection, identifying criminal suspects, and prediction of potential terrorists. These can be viewed as types of
classification problems. The approach that is often used here is one of "profiling" the typical behavior or
characteristics involved. Indeed, many classification techniques work by identifying the attribute values that
commonly occur for the target class. Subsequent records will be then classified based on these attribute values. Keep in
mind that these approaches to classification are imperfect. Mistakes can be made. Just because an individual makes a
series of credit card purchases that are similar to those often made when a card is stolen does not mean that the card is
stolen or that the individual is a criminal.
Users of data mining techniques must be sensitive to these issues and must not violate any privacy directives or
guidelines.
DATA MINING from a database perspective
The study of data mining from a database perspective involves looking at all types of data mining applications and
techniques.
However, we are interested primarily in those that are of practical interest.
While our interest is not limited to any particular type of algorithm or approach, we are concerned about the
following implementation issues:
• Scalability:
Algorithms that do not scale up to perform well with massive real-world datasets are of limited application. Related
to this is the fact that techniques should work regardless of the amount of available main memory.
• Real-world data:
Real-world data are noisy and have many missing attribute values. Algorithms should be able to work even in the
presence of these problems.
• Update:
Many data mining algorithms work with static datasets. This is not a realistic assumption.
• Ease of use:
Although some algorithms may work well, they may not be well received by users if they are difficult to use or
understand.
These issues are crucial if applications are to be accepted and used in the workplace. Throughout the text we will
mention how techniques perform in these and other implementation categories.