Unit 1: Data Mining Tasks
Data mining is often defined as finding hidden information in a database.
Traditional database queries (Figure ) access a database using a well-defined query stated in a language such as SQL.
Data mining access of a database differs from this traditional access in several ways:
• Query: The query might not be well formed or precisely stated. The data miner might not even be
exactly sure of what he wants to see.
• Data: The data accessed is usually a different version from that of the original operational database. The
data have been cleansed and modified to better support the mining process.
• Output: The output of the data mining query probably is not a subset of the database. Instead it is the
output of some analysis of the contents of the database.
A predictive model makes a prediction about values of data using known results found from different data.
Predictive model data mining tasks include classification, regression, time series analysis, and
prediction.
A descriptive model identifies patterns or relationships in data
A descriptive model serves as a way to explore the properties of the data examined, not to
predict new properties.
Clustering, summarization, association rules, and sequence discovery are usually viewed as
descriptive in nature.
Classification
Classification maps data into predefined groups or classes. It is often referred to as supervised
learning because the classes are determined before examining the data. Two examples of
classification applications are determining whether to make a bank loan and identifying credit
risks.
Pattern recognition is a type of classification where an input pattern is classified into one of
several classes based on its similarity to these predefined classes.
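The idea that classes are fixed before the data are examined can be sketched in code. The class profiles, feature values, and the nearest-class-mean rule below are all invented for illustration; they are not a method prescribed by the text.

```python
# Hypothetical illustration of classification: assign a loan applicant
# to one of two predefined classes ("low risk" / "high risk") by
# finding the closest class profile. All numbers are invented.

def classify(income, debt, class_means):
    """Return the label of the class whose mean profile is closest
    (squared Euclidean distance) to the applicant's (income, debt)."""
    best_class, best_dist = None, float("inf")
    for label, (mean_income, mean_debt) in class_means.items():
        dist = (income - mean_income) ** 2 + (debt - mean_debt) ** 2
        if dist < best_dist:
            best_class, best_dist = label, dist
    return best_class

# The classes and their profiles are fixed in advance -- this is what
# makes classification "supervised".
means = {"low risk": (80.0, 10.0), "high risk": (30.0, 40.0)}
print(classify(75.0, 12.0, means))  # → low risk
```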
Regression
Regression is used to map a data item to a real-valued prediction variable.
Regression assumes that the target data fit some known type of function (e.g., linear, logistic) and then determines the best function of this type that models the given data.
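For the linear case, "determining the best function of this type" amounts to a least-squares fit. A minimal sketch, with invented data points:

```python
# Fit y = a*x + b by ordinary least squares, the simplest instance of
# regression: assume a linear function, then find the best a and b.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4]          # invented sample data
ys = [2.1, 3.9, 6.2, 7.8]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # → 1.94 0.15
```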
In the accompanying example plots of three series X, Y, and Z, the plots for Y and Z show similar behaviour, while X appears to have less volatility.
Prediction
Prediction involves forecasting future data states based on past and current data. It can be viewed as a type of classification; the difference is that prediction concerns a future state rather than a current state. Prediction applications include flood prediction, speech recognition, machine learning, and pattern recognition.
Clustering
Clustering is similar to classification except that the groups are not predefined, but rather defined
by the data alone.
Clustering is alternatively referred to as unsupervised learning. A special type of clustering is called segmentation: with segmentation, a database is partitioned into disjoint groupings of similar tuples called segments. Segmentation is often viewed as being identical to clustering.
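The contrast with classification can be made concrete with k-means, a standard clustering algorithm (not one singled out by the text); here the two group centres emerge from the data alone. The one-dimensional values and initial centroids are invented:

```python
# A minimal k-means sketch (k = 2) on one-dimensional data. No classes
# are predefined: the groups are discovered from the data itself.

def kmeans_1d(points, centroids, iterations=10):
    clusters = [[], []]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[], []]
        for p in points:
            idx = min(range(2), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]   # invented data
cents, groups = kmeans_1d(points, [0.0, 5.0])
print(cents)  # → [1.5, 11.0]
```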
Summarization
Summarization maps data into subsets with associated simple descriptions. Summarization is also called
characterization or generalization. It extracts or derives representative information about the database.
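A simple associated description of this kind can be as basic as descriptive statistics over an attribute. A sketch, with invented salary data:

```python
# Summarization sketch: derive a simple representative description
# (count, mean, min, max) of a numeric attribute. Data are invented.

def summarize(values):
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }

salaries = [42_000, 55_000, 61_000, 48_000]
print(summarize(salaries))
```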
Association Rules
An association rule is a model that identifies specific types of data associations.
These associations are often used in the retail sales community to identify items that are
frequently purchased together.
A common example is the use of association rules in market basket analysis.
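The core of market basket analysis is counting item sets that co-occur in transactions. A minimal sketch that finds frequent item pairs by support count; the baskets and the support threshold are invented:

```python
# Count item pairs that co-occur across transactions and keep those
# meeting a minimum support count -- the first step toward deriving
# association rules. Transactions are invented for illustration.

from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"milk", "beer"},
    {"bread", "milk"},
]
print(frequent_pairs(baskets, min_support=3))  # → {('bread', 'milk'): 3}
```

A rule such as "bread ⇒ milk" would then be derived from a frequent pair whose conditional frequency (confidence) is high enough.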
Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential patterns in data. These patterns
are based on a time sequence of actions. These patterns are similar to associations in that data (or events)
are found to be related, but the relationship is based on time.
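The time-based relationship can be sketched by counting how often one event is followed by another within each entity's time-ordered history. The event sequences below are invented:

```python
# Sequence-discovery sketch: count how often event A is later followed
# by event B within each customer's time-ordered history (not
# necessarily immediately). Histories are invented for illustration.

from collections import Counter

def followed_by(sequences):
    counts = Counter()
    for seq in sequences:
        seen = set()
        for event in seq:
            for earlier in seen:
                counts[(earlier, event)] += 1
            seen.add(event)
    return counts

histories = [
    ["browse", "cart", "buy"],
    ["browse", "buy"],
    ["cart", "browse", "buy"],
]
counts = followed_by(histories)
print(counts[("browse", "buy")])  # → 3
```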
Visualization techniques can be used to display the results of data mining:
• Graphical: Traditional graph structures including bar charts, pie charts, histograms, and line graphs may be used.
• Geometric: Geometric techniques include the box plot and scatter diagram techniques.
• Icon-based: Using figures, colors, or other icons can improve the presentation of the
results.
• Pixel-based: With these techniques each data value is shown as a uniquely colored
pixel.
• Hierarchical: These techniques hierarchically divide the display area (screen) into
regions based on data values.
• Hybrid: The preceding approaches can be combined into one display.
Visualization tools can be used to summarize data as a data mining technique itself. The data
mining process itself is complex. The algorithms must be carefully applied to be effective.
Discovered patterns must be correctly interpreted and properly evaluated to ensure that the
resulting information is meaningful and accurate.
The current evolution of data mining functions and products is the result of years of influence
from many disciplines, including databases, information retrieval, statistics, algorithms, and
machine learning.
Table shows developments in the areas of artificial intelligence (AI), information retrieval (IR),
databases (DB), and statistics (Stat) leading to the current view of data mining.
• The primary objective of data mining is to describe some characteristics of a set of data by a general model; this approach can be viewed as a type of compression.
• An ongoing direction of data mining research is how to define a data mining query and whether
a query language (like SQL) can be developed to capture the many different types of data mining
queries.
• Uncovering hidden information about the data in a large database can be viewed as a type of approximation.
• When dealing with large databases, developing an abstract model can be thought of as a type of search problem, where database size affects efficiency.
The various data mining problems can be viewed from several different perspectives, based on the viewpoint and background of the researcher or developer:
1. Human interaction:
Interfaces may be needed with both domain and technical experts.
Technical experts are used to formulate the queries and assist in interpreting the
results.
Users are needed to identify training data and desired results.
2. Overfitting:
Overfitting occurs when the model does not fit future states. This may be caused by
assumptions that are made about the data or may simply be caused by the small size of
the training database.
Example:-
A classification model for an employee database may be developed to classify
employees as short, medium, or tall.
If the training database is quite small, the model might erroneously indicate that a short
person is anyone under five feet eight inches because there is only one entry in the
training database under five feet eight. In this case, many future employees would be
erroneously classified as short.
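The height example above can be sketched directly. The naive learning rule and the heights (in inches) are invented for illustration; the point is only that a threshold derived from one training example generalizes badly:

```python
# Overfitting sketch: a "short" threshold learned from a tiny training
# set. With only one example labeled short (5'8" = 68 in), the model
# decides that anyone at or under 68 inches is short.

def learn_short_threshold(training_heights):
    # Naive rule (invented for the sketch): "short" is anything up to
    # the tallest training example labeled short.
    return max(h for h, label in training_heights if label == "short")

training = [(68, "short"), (70, "medium"), (74, "tall")]
threshold = learn_short_threshold(training)

# A 5'7" (67 in) future employee is classified as short, as would be
# many people most observers would call medium.
print(67 <= threshold)  # → True
```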
3. Outliers:
There are often many data entries that do not fit nicely into the derived model. This
becomes even more of an issue with very large databases.
4. Interpretation of results:
Currently, data mining output may require experts to correctly interpret the results, which
might otherwise be meaningless to the average database user.
5. Visualization of results:
To easily view and understand the output of data mining algorithms, visualization of the
results is helpful.
6. Large datasets:
The massive datasets associated with data mining create problems when applying
algorithms designed for small datasets.
7. High dimensionality:
The dimensionality curse refers to the fact that many attributes (dimensions) are involved, and it is difficult to determine which ones should be used.
One solution to this high dimensionality problem is to reduce the number of
attributes, which is known as dimensionality reduction.
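One simple form of dimensionality reduction (a deliberately crude stand-in for techniques such as PCA, which the text does not specify) is to keep only the k attributes with the highest variance. The data matrix and k below are invented:

```python
# Dimensionality-reduction sketch: rank attributes (columns) by
# variance and keep the k most variable ones. Data are invented.

def variance(column):
    m = sum(column) / len(column)
    return sum((v - m) ** 2 for v in column) / len(column)

def top_k_attributes(rows, k):
    columns = list(zip(*rows))            # attribute-major view
    ranked = sorted(range(len(columns)),
                    key=lambda i: variance(columns[i]), reverse=True)
    return sorted(ranked[:k])             # indices of the kept attributes

rows = [
    (1.0, 100.0, 5.0),
    (1.1, 250.0, 5.0),
    (0.9, 175.0, 5.0),
]
# The constant third attribute (variance 0) is dropped.
print(top_k_attributes(rows, k=2))  # → [0, 1]
```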
8. Multimedia data:
The use of multimedia data, such as that found in GIS databases, complicates or invalidates
many proposed algorithms.
9. Missing data:
During the preprocessing phase of KDD, missing data may be replaced with estimates.
This and other approaches to handling missing data can lead to invalid results in the data
mining step.
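The simplest such estimate is mean imputation, sketched below with invented age data. As the text notes, this can bias the mining step, since the imputed values understate the attribute's true variability:

```python
# Missing-data sketch: replace missing values (None) in a numeric
# attribute with the mean of the known values. Data are invented.

def impute_mean(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

ages = [25, None, 35, 40, None]
print(impute_mean(ages))  # both gaps filled with the mean of 25, 35, 40
```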
10. Irrelevant data:
Some attributes in the database might not be of interest to the data mining task being
developed.
11. Noisy data:
Some attribute values might be invalid or incorrect. These values are often corrected
before running data mining applications.
12. Changing data: Databases cannot be assumed to be static. However, most data mining
algorithms do assume a static database. This requires that the algorithm be completely
rerun anytime the database changes.
13. Integration: The KDD process is not currently integrated into normal data processing
activities. KDD requests may be treated as special, unusual, or one-time needs. This
makes them inefficient, ineffective, and not general enough to be used on an ongoing
basis.
14. Application: Determining the intended use for the information obtained from the data
mining function is a challenge. Indeed, how business executives can effectively use the
output is sometimes considered the more difficult part, not the running of the algorithms
themselves.