
Data Mining and Warehousing

Unit 1
Data mining is often defined as finding hidden information in a database.

Traditional database queries (Figure ) access a database using a well-defined query stated in a language
such as SQL.

Data mining access of a database differs from this traditional access in several ways:

• Query: The query might not be well formed or precisely stated. The data miner might not even be
exactly sure of what he wants to see.

• Data: The data accessed is usually a different version from that of the original operational database. The
data have been cleansed and modified to better support the mining process.

• Output: The output of the data mining query is probably not a subset of the database. Instead, it is the
output of some analysis of the contents of the database.

BASIC DATA MINING TASKS

The basic outline of these tasks is shown in Figure.

 A predictive model makes a prediction about values of data using known results found from
different data.
 Predictive model data mining tasks include classification, regression, time series analysis, and
prediction.
 A descriptive model identifies patterns or relationships in data.
 A descriptive model serves as a way to explore the properties of the data examined, not to
predict new properties.
 Clustering, summarization, association rules, and sequence discovery are usually viewed as
descriptive in nature.

Classification
 Classification maps data into predefined groups or classes. It is often referred to as supervised
learning because the classes are determined before examining the data. Two examples of
classification applications are determining whether to make a bank loan and identifying credit
risks.
 Pattern recognition is a type of classification where an input pattern is classified into one of
several classes based on its similarity to these predefined classes.

Regression
 Regression is used to map a data item to a real-valued prediction variable.
 Regression assumes that the target data fit some known type of function (e.g., linear, logistic,
etc.) and then determines the best function of this type that models the given data.
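For the linear case, "determining the best function of this type" amounts to ordinary least squares. A minimal sketch, with illustrative sample points:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b to the sample points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(round(a, 2), round(b, 2))  # slope and intercept of the best-fitting line
```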

Time Series Analysis


 With time series analysis, the value of an attribute is examined as it varies over time. The values
usually are obtained as evenly spaced time points (daily, weekly, hourly, etc.). A time series plot
(Figure 1.3) is used to visualize the time series.

The plots for Y and Z have similar behaviour, while X appears to have less volatility.
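One simple way to make the volatility comparison concrete is the standard deviation of period-to-period changes; the series values below are illustrative, not the figure's data.

```python
from statistics import stdev

def volatility(series):
    """Standard deviation of successive changes in an evenly spaced series."""
    changes = [b - a for a, b in zip(series, series[1:])]
    return stdev(changes)

x = [10, 11, 10, 11, 10, 11]          # small, steady movements
y = [10, 15, 8, 17, 6, 18]            # large swings

print(volatility(x) < volatility(y))  # True: x is the less volatile series
```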

Prediction
 Prediction involves predicting future data states based on past and current data.
 The difference from classification is that prediction concerns a future state rather than a
current state.
 Prediction applications include flooding, speech recognition, machine learning, and pattern
recognition.
Clustering
 Clustering is similar to classification except that the groups are not predefined, but rather defined
by the data alone.
 Clustering is alternatively referred to as unsupervised learning or segmentation.
 A special type of clustering is called segmentation. With segmentation a database is partitioned
into disjoint groupings of similar tuples called segments. Segmentation is often viewed as
being identical to clustering.
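A minimal k-means sketch over one-dimensional points shows groups emerging from the data alone rather than from predefined classes; the points and starting centers are illustrative assumptions.

```python
def kmeans_1d(points, centers, iters=10):
    """Simple k-means on 1-D points: assign to nearest center, recompute means."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # move each center to the mean of its cluster (keep it if cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1, 2, 3, 20, 21, 22]
centers, clusters = kmeans_1d(points, centers=[0.0, 10.0])
print(sorted(clusters[0]), sorted(clusters[1]))  # two groups found by the data itself
```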

Summarization
Summarization maps data into subsets with associated simple descriptions. Summarization is also called
characterization or generalization. It extracts or derives representative information about the database.
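A tiny sketch of such a "simple description": representative statistics derived from an attribute. The salary values are illustrative assumptions.

```python
from statistics import mean, median

salaries = [30_000, 45_000, 52_000, 61_000, 150_000]

# a compact, representative description of the attribute
summary = {"count": len(salaries),
           "mean": mean(salaries),
           "median": median(salaries)}
print(summary)
```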

Association Rules
 An association rule is a model that identifies specific types of data associations.
 These associations are often used in the retail sales community to identify items that are
frequently purchased together.
 Example: the use of association rules in market basket analysis.
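The market basket idea can be sketched with the two standard rule measures: for a rule {bread} -> {butter}, support is the fraction of baskets containing both items, and confidence is the fraction of bread-baskets that also contain butter. The baskets below are illustrative.

```python
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

def support(itemset):
    # fraction of baskets that contain every item in `itemset`
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    # of the baskets containing lhs, the fraction also containing rhs
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "butter"}))        # 0.5
print(confidence({"bread"}, {"butter"}))   # 2 of the 3 bread-baskets
```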

Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential patterns in data. These patterns
are based on a time sequence of actions. These patterns are similar to associations in that data (or events)
are found to be related, but the relationship is based on time.
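The time-based relationship can be sketched as counting how many event sequences contain one event followed later by another; the customer events below are illustrative assumptions.

```python
sequences = [
    ["browse", "cart", "purchase"],
    ["browse", "exit"],
    ["cart", "browse", "purchase"],
]

def contains_in_order(seq, a, b):
    """True if event `a` occurs and event `b` occurs somewhere after it."""
    try:
        return b in seq[seq.index(a) + 1:]
    except ValueError:       # `a` never occurs in this sequence
        return False

count = sum(contains_in_order(s, "browse", "purchase") for s in sequences)
print(count)  # sequences where "browse" is later followed by "purchase"
```

Unlike a plain association, the pattern here only counts when the events occur in the stated time order.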

DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES

 Knowledge discovery in databases (KDD) is the process of finding useful information
and patterns in data.
 Data mining is the use of algorithms to extract the information and patterns derived by
the KDD process.
 KDD is a process that involves many different steps. The input to this process is the
data, and the output is the useful information desired by the users.
 Figure illustrates the overall KDD process.

The KDD process consists of the following five steps:


• Selection: The data needed for the data mining process may be obtained from many
different and heterogeneous data sources. This first step obtains the data from various
databases, files, and nonelectronic sources.
• Preprocessing: The data to be used by the process may have incorrect or missing data.
There may be anomalous data from multiple sources involving different data types and
metrics. There may be many different activities performed at this time. Erroneous data
may be corrected or removed, whereas missing data must be supplied or predicted (often
using data mining tools).
• Transformation: Data from different sources must be converted into a common
format for processing. Some data may be encoded or transformed into more usable
formats. Data reduction may be used to reduce the number of possible data values being
considered.
• Data mining: Based on the data mining task being performed, this step applies
algorithms to the transformed data to generate the desired results.
• Interpretation/evaluation: How the data mining results are presented to the users is
extremely important because the usefulness of the results is dependent on it.
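The five steps above can be sketched as a toy pipeline over raw temperature readings; the step boundaries, helper names, and data are illustrative assumptions, not part of the text.

```python
def select(sources):
    # Selection: gather records from several heterogeneous sources
    return [r for src in sources for r in src]

def preprocess(records):
    # Preprocessing: drop missing values (they could instead be estimated)
    return [r for r in records if r is not None]

def transform(records):
    # Transformation: convert Fahrenheit readings to a common Celsius format
    return [round((r - 32) * 5 / 9, 1) for r in records]

def mine(records):
    # Data mining: apply a (trivial) analysis to the transformed data
    return max(records)

sources = [[68, None, 72], [75, 70]]
result = mine(transform(preprocess(select(sources))))
print(result)  # Interpretation/evaluation: present the result to the user
```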

Visualization refers to the visual presentation of data. Visualization techniques include:

• Graphical: Traditional graph structures including bar charts, pie charts, histograms,
and line graphs may be used.
• Geometric: Geometric techniques include the box plot and scatter diagram techniques.
• Icon-based: Using figures, colors, or other icons can improve the presentation of the
results.
• Pixel-based: With these techniques each data value is shown as a uniquely colored
pixel.
• Hierarchical: These techniques hierarchically divide the display area (screen) into
regions based on data values.
• Hybrid: The preceding approaches can be combined into one display.

Visualization tools can be used to summarize data as a data mining technique itself. The data
mining process itself is complex. The algorithms must be carefully applied to be effective.

Discovered patterns must be correctly interpreted and properly evaluated to ensure that the
resulting information is meaningful and accurate.

The Development of Data Mining

The current evolution of data mining functions and products is the result of years of influence
from many disciplines, including databases, information retrieval, statistics, algorithms, and
machine learning.
Table shows developments in the areas of artificial intelligence (AI), information retrieval (IR),
databases (DB), and statistics (Stat) leading to the current view of data mining.

There are several different views of what data mining functions actually are:


• Induction is used to proceed from very specific knowledge to more general information. This
type of technique is often found in AI applications.

• The primary objective of data mining is to describe some characteristics of a set of data by a
general model; this approach can be viewed as a type of compression.

• An ongoing direction of data mining research is how to define a data mining query and whether
a query language (like SQL) can be developed to capture the many different types of data mining
queries.

• Mining a large database can be viewed as using approximation to help uncover hidden information
about the data.

• When dealing with large databases, the impact of size and efficiency of developing an abstract
model can be thought of as a type of search problem.

The various data mining problems can be viewed from several different perspectives, depending
on the viewpoint and background of the researcher or developer.

DATA MINING ISSUES


There are many important implementation issues associated with data mining:

1. Human interaction:
 Interfaces may be needed with both domain and technical experts.
 Technical experts are used to formulate the queries and assist in interpreting the
results.
 Users are needed to identify training data and desired results.
2. Overfitting:
Overfitting occurs when the model does not fit future states. This may be caused by
assumptions that are made about the data or may simply be caused by the small size of
the training database.
Example:
A classification model for an employee database may be developed to classify
employees as short, medium, or tall.
If the training database is quite small, the model might erroneously indicate that a short
person is anyone under five feet eight inches because there is only one entry in the
training database under five feet eight. In this case, many future employees would be
erroneously classified as short.
3. Outliers:
There are often many data entries that do not fit nicely into the derived model. This
becomes even more of an issue with very large databases.
4. Interpretation of results:
Currently, data mining output may require experts to correctly interpret the results, which
might otherwise be meaningless to the average database user.
5. Visualization of results:
To easily view and understand the output of data mining algorithms, visualization of the
results is helpful.
6. Large datasets:
The massive datasets associated with data mining create problems when applying
algorithms designed for small datasets.
7. High dimensionality:
 The dimensionality curse means that there are many attributes (dimensions)
involved and it is difficult to determine which ones should be used.
 One solution to this high dimensionality problem is to reduce the number of
attributes, which is known as dimensionality reduction.
8. Multimedia data:
The use of multimedia data, such as that found in GIS databases, complicates or invalidates
many proposed algorithms.
9. Missing data:
During the preprocessing phase of KDD, missing data may be replaced with estimates.
This and other approaches to handling missing data can lead to invalid results in the data
mining step.
10. Irrelevant data:
Some attributes in the database might not be of interest to the data mining task being
developed.
11. Noisy data:
Some attribute values might be invalid or incorrect. These values are often corrected
before running data mining applications.
12. Changing data: Databases cannot be assumed to be static. However, most data mining
algorithms do assume a static database. This requires that the algorithm be completely
rerun anytime the database changes.
13. Integration: The KDD process is not currently integrated into normal data processing
activities. KDD requests may be treated as special, unusual, or one-time needs. This
makes them inefficient, ineffective, and not general enough to be used on an ongoing
basis.
14. Application: Determining the intended use for the information obtained from the data
mining function is a challenge. Indeed, how business executives can effectively use the
output is sometimes considered the more difficult part, not the running of the algorithms
themselves.
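The dimensionality reduction mentioned under issue 7 above can be sketched as dropping attributes whose values barely vary, so that later mining works over fewer dimensions; the sample rows and the variance threshold are illustrative assumptions.

```python
from statistics import pvariance

# three attributes (columns); the middle one is constant and carries no information
rows = [(1.0, 5.0, 100.0), (2.0, 5.0, 98.0), (3.0, 5.0, 102.0)]

columns = list(zip(*rows))
# keep only the attribute indices whose population variance exceeds a threshold
keep = [i for i, col in enumerate(columns) if pvariance(col) > 0.1]
print(keep)  # the constant column is dropped
```

This variance-threshold rule is only one of many reduction techniques; others derive new combined attributes rather than selecting existing ones.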
