Title: Data Mining
Submitted to
AMITY SCHOOL OF ENGINEERING AND TECHNOLOGY
Guided By: Dr. Sachin Kumar Srivastava, Department of Applied Sciences, ASET
Submitted By: Aishwarya Gupta, A2324711007, Computer Science and Engineering, 3CSE6 (Y)
Amity University, Uttar Pradesh
ACKNOWLEDGEMENT
I would like to take this opportunity to thank my guide Dr. Sachin Kumar
Srivastava, Senior Lecturer, Amity School of Engineering and Technology, for
giving his time and having patience to listen to me. It would not have been
possible to complete the Term Paper without his timely encouragement and
guidance.
I would also like to thank Dr. Abhay Bansal, HOD, Computer Science &
Engineering, ASET and our Institution for giving us this opportunity to work on
this project and for providing us a platform which would broaden our mindset.
Lastly, I offer my regards and blessings to all of those who supported me in any
respect during the completion of the Term Paper.
Aishwarya Gupta
CERTIFICATE
This is to certify that Ms. Aishwarya Gupta, student of B.Tech. in Computer
Science and Engineering, has carried out the work presented in the project of the
Term Paper entitled "Data Mining" as a part of the first-year programme of Bachelor
of Technology in Computer Science from Amity School of Engineering and
Technology, Amity University, Noida, Uttar Pradesh, under my supervision.
Dr. Sachin Kumar Srivastava
Department of Applied Sciences
ASET, Noida.
INDEX
1. Introduction
2. Definition
3. Preparation
3.1. Data Sources
3.2. Data Understanding
3.3. Data Preparation
3.3.1. Cleaning the Data
3.3.2. Removing Variables
3.3.3. Data Transformations
3.3.4. Segmentation
4. Tables
5. Graphs
5.1. Frequency Polygrams
5.2. Histograms
5.3. Scatterplots
5.4. Box Plots
5.5. Multiple Graphs
6. Descriptive Statistics
6.1. Central Tendency
6.2. Variation
6.3. Shape
7. Grouping
8. Clustering
8.1. Hierarchical Agglomerative Clustering
8.2. K-means Clustering
9. Associative Rules
10. Decision Trees
11. Artificial Intelligence Techniques
12. Genetic Algorithm
13. Conclusion
14. References
ABSTRACT
The technologies for generating and collecting data have been advancing rapidly. At the current
stage, lack of data is no longer a problem; the inability to generate useful information from data
is! The explosive growth in data and databases has created the need to develop new technologies
and tools to process data into useful information and knowledge intelligently and automatically.
Data mining, therefore, has become a research area with increasing importance.
It is the process of nontrivial extraction of implicit, previously unknown and potentially useful
information such as knowledge rules, constraints, and regularities from data stored in repositories
using pattern recognition technologies as well as statistical and mathematical techniques.
In a 1997 report, Stamford, Connecticut-based Gartner Group mentioned: "Data mining and
artificial intelligence are at the top of the five key technology areas that will clearly have a major
impact across a wide range of industries within the next three to five years."
Many companies currently use computers to capture details of business transactions such as
banking and credit card records, retail sales, manufacturing warranty, telecommunications, and
myriad other transactions. Data mining tools are then used to uncover useful patterns and
relationships from the data captured. This paper discusses the requirements and challenges of
data mining, and describes major data mining techniques such as statistics, artificial intelligence,
decision tree approach and genetic algorithm.
1. INTRODUCTION
What is data mining?
Data mining is the process of discovering meaningful new correlations, patterns and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
technologies as well as statistical and mathematical techniques.
Why is data mining required?
The explosive growth in data collection
The storing of data in data warehouses, so that the entire enterprise has access to a
reliable current database.
The availability of increased access to data from Web navigation and intranets.
The competitive pressure to increase market share in a globalized economy.
The development of off-the-shelf commercial data mining software suites.
The tremendous growth in computing power and storage capacity.
Any exploratory data mining project should include the following steps:
Problem definition: The problem to be solved should be clearly defined and a plan
should be generated for executing the analysis.
Data preparation: Prior to starting any data mining project, the data should be collected,
characterized, cleaned, transformed and partitioned into an appropriate form for
processing further.
Implementation of the analysis: Appropriate analysis techniques should be selected,
and often these methods need to be optimized. Any task that involves making decisions
from data falls into one of the following categories:
Summarizing the data: The data is reduced for interpretation without sacrificing
any important information
Finding hidden relationships: This refers to the identification of important facts,
relationships, anomalies or trends in the data, which are not obvious from a
summary alone.
Making predictions: Prediction is the process where an estimate is calculated for
something that is unknown.
Deployment of results: The results should be communicated and/or deployed into a
preexisting process. It is in the deployment step that the analysis is translated into a
benefit to the business.
2. DEFINITION
2.1. Objectives
It is critical to spend time defining how the project will impact specific business objectives. A
broad description of the project is useful as a headline. The description should be divided into
smaller problems that ultimately solve the broader issue.
2.2. Deliverables
It is important to identify the deliverables of the project. Defining all variables will provide the
correct expectations for all those working on the project. When developing predictive models, it
is useful to understand any required level of accuracy. This will help prioritize the types of
approaches. The accuracy of the model can often relate directly to the business objective.
2.3. Project Plan
The extent of the project plan depends on the size and scope of the project. A plan should be put
together which defines the problem, the proposed deliverables along with the team who will
execute the analysis. The current situation should be assessed. The sources and locations of the
data should be identified. A timetable of events should be put together that includes preparation,
implementation, and deployment steps. The quality of data is important as it determines the
quality of analysis results.
At the end of the project, a valuable exercise is to spend time evaluating what worked and what
did not work during the project, providing insights for future projects. [1]
3. PREPARATION
Preparing the data is one of the most time consuming parts of any data mining project.
3.1. Data Sources
The quality of data influences the quality of results from analysis. The data should be reliable
and represent the defined target population. Data is often collected using the following types of
studies:
Surveys or polls
A survey or poll can be useful for gathering data to answer specific questions. An
interview using a set of predefined questions is usually conducted over the phone, in
person, or over the Internet. Any bias in the survey should be eliminated. To achieve this,
a true random sample of the target population should be taken.
Experiments
Experiments measure and collect data to answer a specific question in a highly controlled
manner. The data collected should be reliably measured. Experiments attempt to
understand cause-and-effect phenomena by controlling other factors that may be
important.
Observational and other studies
In certain situations it is impossible on either logistical or ethical grounds to conduct a
controlled experiment. In these situations, a large number of observations are measured
and care must be taken when interpreting the results.
3.2. Data Understanding
3.2.1. Data Tables
In a data set there may be many observations for a particular object. Each feature
that describes the object is known as a variable, and each observation has a specific
value for each variable. Data sets used for data analysis are usually described in
tables, where each column describes a variable (a specific attribute).
It is important to understand the variables first, before performing data analysis.
Certain characteristics of the variables have implications in terms of how the results
of the analysis will be interpreted.
3.2.2. Continuous and Discrete Variables
Each variable has to be first defined in terms of the type of values the variable can
take. The descriptive terms used for categorizing variables are:
Constant: A variable where every data value is the same.
Dichotomous: A variable where there are only two values. A special case is a
binary variable whose values are 0 and 1.
Discrete: A variable that can only take a certain number of values.
Continuous: A variable where an infinite number of numeric values are
possible within a specific range
3.2.3. Scales of Measurement
The variable's scale indicates the accuracy at which the data has been measured. The
classification has implications as to the type of analysis that can be performed on the
variable. Scales are categorized as:
Nominal: Scale describing a variable with a limited number of different
values. The scale is made up of a list of possible values that the variable may
take.
Ordinal: This scale describes a variable whose values are ordered. The
difference between the values does not describe the magnitude of actual
difference.
Interval: Scales that describe values where the interval between the values has
meaning.
Ratio: Scales that describe variables where the same difference between
values has the same meaning but where a doubling, tripling, etc. of the values
implies a doubling, tripling, etc. of the measurement.
3.2.4. Roles in Analysis
It is useful to think how the variables will be used in any subsequent analysis.
Example roles in data mining include:
Labels: Variables that describe individual observations in data.
Descriptors: These variables are almost always collected to describe an
observation. They are used in creating both a predictive model and generating
predictions from these models. They are referred to as X variables.
Response: These variables (usually one variable) are predicted from a
predictive model (using the descriptor variables as input). They are referred to
as Y variables.
3.2.5. Frequency Distribution
For variables with an ordered scale (ordinal, interval, or ratio), it is useful to look at
the frequency distribution. A symmetrical bell-shaped distribution is described as a
normal (or Gaussian) distribution.
3.3. Data Preparation
The data must be cleaned and translated into a form suitable for data analysis and data mining. This
will also help us become familiar with the data, which will further help in the analysis.
3.3.1. Cleaning the data
It is important to spend time cleaning the data as the data available for analysis may
not have been originally collected with the project's goal in mind.
For variables measured on a nominal or ordinal scale, it is useful to inspect all
possible values to uncover mistakes and/or inconsistencies. Any assumptions
made concerning the possible values that the variable can take should be tested.
Outliers are a single or small number of data values that are not similar to the
rest of the data set. There are many reasons for outliers. It may be an error in
measurement. A series of outlier data points could be a result of
measurements made using a different calibration. It can also be a genuine data
point. Methods such as clustering and regression are used to identify outliers.
A particular variable may have been measured over different units. Such
observations should be standardized to a single scale.
By combining data from multiple sources, an observation may have been
recorded more than once and any duplicate entries should be removed.
3.3.2. Removing Variables
Constants and variables with too many missing data points should be considered for
removal. Further analysis of the correlations between multiple variables may identify
variables that provide no additional information and hence could be removed.
3.3.3. Data Transformations
It is important to consider applying certain mathematical transformations to the data
since many data mining programs will have difficulty making sense of the data in its
raw form. Some common transformations that should be considered are:
3.3.3.1. Normalization: It is a process where numeric columns are transformed
using a mathematical function to a new range. Certain data mining methods
require the data to be normalized prior to analysis.
o Min-Max Normalization
It works by seeing how much greater the field value is than the
minimum value min(X) and scaling the difference by the range
X* = (X - min(X)) / (max(X) - min(X))
Min-max normalization will range from zero to one, unless new data
values are encountered that lie outside the original range.
o Z-Score Standardization
In Z-score standardization, we take the difference between the field
value and the field mean and scale this difference by the
standard deviation of the field values. It normalizes the values around the
mean of the set.
X* = (X - mean(X)) / SD(X)
Z-score standardization values will usually range between -4 and 4,
with the mean value having a Z-score standardization of zero.
o Decimal scaling
This transformation moves the decimal point to ensure the range is between
-1 and 1.
X* = X / 10^n
where n is the number of digits of the maximum absolute value. For example, if the
largest number is 9948, then n would be 4 and 9948 would be normalized to
9948/10^4 = 0.9948.
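As an illustration, the sketch below applies the three normalizations described above to a small invented column of numbers using Python and NumPy; the values and variable names are assumptions made only for this example.

```python
import numpy as np

# Small invented column of values used only to illustrate the transformations.
x = np.array([70.0, 75.0, 80.0, 9948.0])

# Min-max normalization: rescales the values to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: centers on the mean and scales by the standard deviation.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^n, where n is the digit count of the largest absolute value.
n = len(str(int(np.abs(x).max())))
decimal_scaled = x / (10 ** n)

print(min_max, z_score, decimal_scaled, sep="\n")
```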
3.3.3.2. Value Mapping: To use variables that have been assigned as ordinal and
described using text values within certain numerical analysis methods, it will
be necessary to convert the variable's values into numbers.
3.3.3.3. Discretization: Converting continuous data into discrete values is desirable
in a number of situations. Where a value is defined on an interval or ratio scale,
but knowledge about how the data was collected suggests that the accuracy
of the data does not warrant these scales, the variable may be a candidate for
discretization. It may be more desirable to convert the data into more broadly
defined categories that reflect the true variation in the data. Certain techniques
can only process categorical data, so converting continuous data into
discrete values makes the variable accessible to these methods.
3.3.3.4. Aggregation: The variable that you are trying to use may not be present in
the data set, but it may be derived from other variables present. Any
mathematical operation, such as average or sum, could be applied to one or
more variables to create an additional variable.
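A minimal sketch of value mapping, discretization, and aggregation using pandas is shown below; the table, its column names, and the category boundaries are hypothetical and chosen only to make the three operations concrete.

```python
import pandas as pd

# Hypothetical table used only to illustrate the transformations above.
df = pd.DataFrame({
    "risk":   ["low", "medium", "high", "low"],   # ordinal variable stored as text
    "income": [25000, 48000, 91000, 37000],       # continuous variable
    "q1":     [10, 20, 15, 5],                    # sales in quarter 1
    "q2":     [12, 18, 20, 7],                    # sales in quarter 2
})

# Value mapping: convert ordered text values into numbers.
df["risk_num"] = df["risk"].map({"low": 1, "medium": 2, "high": 3})

# Discretization: convert a continuous variable into broadly defined categories.
df["income_band"] = pd.cut(df["income"],
                           bins=[0, 30000, 60000, float("inf")],
                           labels=["low", "mid", "high"])

# Aggregation: derive a new variable from existing ones.
df["total_sales"] = df["q1"] + df["q2"]

print(df)
```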
3.3.4. Segmentation
Larger data sets take more computational time to analyze. Segmenting the data can
speed up analysis. One approach is to take a random subset. This approach is
effective where the data set closely matches the target population. Another approach
is to use the problem definition to guide how the subset is constructed.
It may be necessary to select a set of observations that more closely matches the new
target population.
Breaking the data set down into subsets based on your knowledge of the data may
allow you to create multiple simpler models. It is important to note the criteria used to
subset the data. [1]
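The two approaches to building a subset can be sketched as follows; the data set, column name, and sampling fraction are invented for illustration.

```python
import pandas as pd

# Invented data set: 100 observations with a region label and a numeric measurement.
df = pd.DataFrame({"region": ["north", "south"] * 50,
                   "sales":  range(100)})

# Random subset: a 10% sample, appropriate when the data closely matches the target population.
random_subset = df.sample(frac=0.1, random_state=42)

# Problem-driven subset: observations selected to match the new target population.
targeted_subset = df[df["region"] == "north"]

print(len(random_subset), len(targeted_subset))
```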
4. TABLES
4.1. Data Tables
The most common way of looking at data is through a table, where the raw data is displayed in
rows and columns of variables. Sorting the table based on one or more variables is useful for
organizing data. It is usually not possible to identify any trends or relationships by looking at the
raw data alone.
4.2. Contingency Tables
A contingency table is also referred to as a two-way cross-classification table. It provides insight
into the relationship between two variables. A contingency table can represent variables with
more than two values.
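For example, a contingency table can be generated in pandas with crosstab; the two categorical variables below are invented.

```python
import pandas as pd

# Invented observations of two categorical variables.
df = pd.DataFrame({
    "smoker":  ["yes", "no", "yes", "no", "yes", "no"],
    "outcome": ["ill", "well", "ill", "well", "well", "well"],
})

# Two-way cross-classification: counts of observations for each pair of values.
contingency = pd.crosstab(df["smoker"], df["outcome"])
print(contingency)
```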
4.3. Summary Tables
A summary table is a common way to understand data. Each row of the table represents a single
group. Descriptive statistics that summarize a set of observations can be used. Commonly used
statistics are:
Mean: The average value.
Median: The value at mid-point.
Sum: The sum over all observations in a group.
Minimum: The minimum value.
Maximum: The maximum value.
Standard deviation: A standardized measure of the deviation of a variable from the mean.
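The descriptive statistics listed above can be computed for each group with pandas, as in the sketch below; the column names and values are invented.

```python
import pandas as pd

# Invented data: fuel economy (mpg) for cars grouped by number of cylinders.
df = pd.DataFrame({
    "cylinders": [4, 4, 6, 6, 8, 8],
    "mpg":       [31, 28, 22, 20, 15, 14],
})

# Summary table: one row per group, one column per descriptive statistic.
summary = df.groupby("cylinders")["mpg"].agg(
    ["mean", "median", "sum", "min", "max", "std"])
print(summary)
```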
5. GRAPHS
Graphs present the data visually, replacing numbers with graphical elements. They enable us to
visually identify trends, ranges, frequency distributions, relationships and outliers, and to make
comparisons. There are many ways to visualize information in the form of a graph. Looking at
multiple graphs simultaneously and viewing common subsets can offer new insights into the
whole data set.
5.1. Frequency Polygrams
Frequency polygrams plot information according to the number of observations reported for each
value for a particular variable.
The shape of the plot reveals trends; in this example, the number of observations fluctuates within a
narrow range of around 25 to 40.
[Figure: frequency polygram with values 70-80 on the x-axis and observation counts (0-45) on the y-axis]
5.2. Histograms
Histograms present similar information to frequency polygrams. The length of bar is
proportional to the size of the group. Variables that are not continuous can be shown as
histograms.
A histogram might, for example, show two values, yes and no, where the length of each bar
represents the number of observations. A histogram is also referred to as a bar chart.
5.3. Scatterplots
Scatterplots are used to identify whether any relationship exists between two continuous
variables based on the ratio or interval scales. The two variables are plotted on the x- and y-axes.
Each point displayed on the scatterplot is a single observation. When the points follow a straight
line or a curve, a simple relationship exists between the two variables. Points that lie far from the
curve or straight line are referred to as outliers. Points that are
scattered throughout the whole graph indicate that there is no immediately obvious relationship
between the two variables.
5.4. Box Plots
Box plots provide a succinct summary of the overall distribution for a variable. The points
displayed are:
Lower extreme: The lowest value of the variable
Lower quartile: The point below which 25% of all observations fall.
Median: The point below which 50% of all observations fall.
Upper quartile: The point below which 75% of all observations fall.
Upper extreme: The highest value for the variable.
Mean: The average value for the variable.
[Figure: box plot showing the lower extreme, lower quartile, median, mean, upper quartile, and upper extreme]
In certain versions of the box plot, outliers are not included within the extremes; instead, these
outliers are drawn explicitly (using small circles) outside the main plot.
5.5. Multiple Graphs
It is often informative to display multiple graphs at the same time in a table format, often referred
to as a matrix. This gives an overview of the data from multiple angles. [4]
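A sketch of such a matrix of graphs, drawn with matplotlib on randomly generated data, is shown below; the data and the panel layout are assumptions made only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated data standing in for two continuous variables.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# A 1x3 matrix of graphs: histogram, scatterplot, and box plot of the same data.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=20)        # frequency distribution of x
axes[0].set_title("Histogram")
axes[1].scatter(x, y, s=10)     # relationship between x and y
axes[1].set_title("Scatterplot")
axes[2].boxplot(x)              # quartiles, median, and outliers of x
axes[2].set_title("Box plot")
plt.tight_layout()
plt.show()
```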
6. DESCRIPTIVE STATISTICS
Parameters are numbers that characterize a population, whereas statistics are numbers that
summarize the data collected from a sample of the population. The use of statistical methods can
play an important role including:
Summarizing the data: Statistics not only provide us with methods for summarizing sample
data sets, they also allow us to make confident statements about the entire population.
Characterizing the data: Prior to building a predictive model or looking for hidden trends in
the data, it is important to characterize the variables and the relationships between them.
Descriptive statistics allow us to quantify descriptions of the data. They calculate different
metrics for defining the center of the variable (central tendency), they define metrics to
understand the range of values (variation) and they quantify the shape of distribution.
6.1. Central Tendency
6.1.1. Mode
The mode is the most commonly reported value for a particular variable. Mode
provides the only measure of central tendency for variables measured on a nominal
scale. The mode can also be calculated for variables measured on the ordinal, interval
and ratio scales.
6.1.2. Median
The median is the middle value of a variable once it has been sorted from low to high.
The median can be calculated for variables measured on the ordinal scale. It is a good
indication of the central value as it does not get distorted by any extreme values.
6.1.3. Mean
The mean (also referred to as average) is the most commonly used indication of
central tendency for variables measured on the interval or ratio scales. If both mean
and median are approximately the same, the distribution should be fairly symmetrical.
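A quick sketch of the three measures of central tendency using Python's standard statistics module; the sample values are invented.

```python
from statistics import mean, median, mode

values = [70, 72, 72, 74, 75, 76, 99]   # small invented sample

print(mode(values))    # most commonly reported value: 72
print(median(values))  # middle value of the sorted sample: 74
print(mean(values))    # average; pulled upward by the extreme value 99
```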
6.2. Variation
Range
The range is a simple measure of the variation for a particular variable. It is calculated
as the difference between highest and lowest values.
Quartiles
Quartiles divide a variable into four even segments based on the number of
observations. The first quartile Q1 is at the 25% mark, the second quartile Q2 is at the
50% mark, the third quartile Q3 is at the 75% mark.
Variance
The variance describes the spread of data. It is a measure of the deviation of a
variable from the mean. The sample variance is referred to as s
2
. The variation
represents the average squared deviation.
Standard Deviation
The standard deviation (the root-mean-square deviation from the mean) is the square root of the
variance. The higher the value, the more widely distributed the variable's data values
are around the mean.
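The measures of variation above can be sketched with NumPy as follows, using the same invented sample as before.

```python
import numpy as np

values = np.array([70, 72, 72, 74, 75, 76, 99])   # same invented sample

value_range = values.max() - values.min()          # range: highest minus lowest value
q1, q2, q3 = np.percentile(values, [25, 50, 75])   # quartiles at the 25%, 50%, and 75% marks
variance = values.var(ddof=1)                      # sample variance s²
std_dev = values.std(ddof=1)                       # sample standard deviation

print(value_range, (q1, q2, q3), variance, std_dev)
```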
6.3. Shape
6.3.1. Skewness
There are methods for quantifying the lack of symmetry or skewness in the
distribution of a variable. A skewness value of zero indicates that the distribution is
symmetrical.
6.3.2. Kurtosis
In addition to the symmetry of the distribution, the type of peak that the distribution
has is important to consider. This measurement is defined as kurtosis.
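Skewness and kurtosis can be computed, for example, with scipy.stats; the sample below is the same invented one used earlier.

```python
from scipy.stats import skew, kurtosis

values = [70, 72, 72, 74, 75, 76, 99]   # same invented sample

print(skew(values))      # positive value: the longer tail is toward the higher values
print(kurtosis(values))  # peakedness relative to a normal distribution (which scores 0 here)
```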
7. GROUPING
Dividing a data set into smaller subsets of related observations or groups is important for
exploratory data analysis for a number of reasons:
Finding hidden relationships: Grouping methods organize observations in different
ways. Looking at the data from different angles will allow us to find relationships that are
not so obvious from a summary alone.
Becoming familiar with the data: Before using a data set to create a predictive
model, it is beneficial to become highly familiar with the contents of the set. Grouping
methods allow us to discover which types of observations are present in the data.
Segmentation: Techniques for grouping data may lead to divisions that simplify the data
for analysis.
Grouping Approaches
There exist numerous methods for grouping observations. When selecting grouping methods,
there are a number of issues to consider.
Supervised versus unsupervised: Methods that do not use any data to guide how the
groups are generated are called unsupervised methods, whereas methods that make use of
the response variable to guide group generation are called supervised methods.
Type of variables: Certain grouping methods will only accept categorical data, whereas
others only accept continuous data.
Data size limit: There are methods that only work with data sets less than a certain size.
In situations where the data set is too large to process, one solution would be to segment
the data prior to grouping.
Interpretable and actionable: Certain grouping methods generate results that are easy
to interpret, whereas other methods require additional analysis.
Overlapping groups: In certain grouping methods, observations can only fall in one
group. There are other grouping methods where the same observations may be a member
of multiple groups.
8. CLUSTERING
Clustering will group the data into sets of related observations or clusters. Observations within
each group are more similar to other observations within the group than to observations within
any other group. Clustering is an unsupervised method for grouping.
Clustering has the following advantages:
Flexible: There are many ways of adjusting how clustering is implemented, including
options for determining the similarity between two observations and options for selecting
the size of clusters.
Hierarchical and nonhierarchical approaches: Certain clustering techniques organize
the data hierarchically, which may provide additional insight into the problem under
investigation.
Clustering has the following limitations:
Subjective: Different problems require different clustering options, and specifying
those options requires repeatedly examining the results and adjusting the clustering
options accordingly.
Interpretation: Observations are grouped together based on some measure of similarity.
Making sense of a particular cluster may require additional analysis.
Speed: There are many techniques for clustering data and it can be time consuming to
generate clusters, especially for large data sets.
Size limitation: Certain techniques for clustering have limitations on the number of
observations that they can process. [1]
8.1. Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering is an example of hierarchical method for grouping
observations. It uses a bottom-up approach to clustering as it starts with each observation as a
member of a separate cluster. The major limitation of this approach is that it is normally limited
to small data sets, and generating the hierarchical tree can be slow for larger numbers of
observations.
Linkage Rules
A linkage rule is used to determine a distance between an observation and an already identified
group.
Average linkage: The distances between all members of the cluster and the observation
under consideration are determined and the average is calculated.
Single linkage: The distances between all members of the cluster and the observation
under consideration are determined and the smallest is selected.
Complete linkage: The distances between all members of the cluster and the observation
under consideration are determined and the highest is selected.
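A minimal sketch of hierarchical agglomerative clustering with a chosen linkage rule, using scipy; the five two-dimensional observations are invented, and the cut into two clusters is an arbitrary choice made only for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five invented two-dimensional observations.
X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 6.1], [5.2, 5.8], [9.0, 9.5]])

# Bottom-up clustering; "average" can be swapped for "single" or "complete" linkage.
Z = linkage(X, method="average")

# Cut the hierarchical tree into two clusters and report the group of each observation.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```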
8.2. K-means Clustering
K-means clustering is an example of a nonhierarchical method for grouping a data set. It groups
data using a top-down approach since it starts with a predefined number of clusters and assigns
observations to them. There are no overlaps in the groups, that is, each observation is assigned to
a single group. This approach is faster and can handle a greater number of observations than
agglomerative hierarchical clustering. A few disadvantages are:
Predefined number of clusters: You must define the number of groups before creating
the clusters.
Distorted by outliers: When a data set contains many outliers, k-means clustering may
not create an optimal grouping. This is because the outliers will be assigned to one of the
many allocated groups.
No hierarchical organization: No hierarchical organization is generated using k-means
clustering. [5]
Grouping Process
The process of generating clusters starts by defining the number of groups to create (k). The
method allocates an observation to each of these groups, usually randomly.
Next, all other observations are compared to each of these allocated observations and placed in
the group they are most similar to.
The center point for each of these groups is then calculated. The grouping process continues by
determining the distance from all observations to these new group centers. If an observation is
closer to the center of another group, it is moved to the group it is closest to. The centers of its
old and new groups are recalculated. The process is repeated until there is no further need to
move any observations.
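The grouping process described above corresponds to the standard k-means procedure; a minimal sketch using scikit-learn, with invented two-dimensional observations and k chosen up front, is shown below.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented two-dimensional observations.
X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 6.1], [5.2, 5.8], [9.0, 9.5], [8.8, 9.2]])

# The number of groups (k) must be defined before the clusters are created.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # the single group assigned to each observation
print(kmeans.cluster_centers_)  # the final center point of each group
```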
9. ASSOCIATIVE RULES
The associative rules method is an example of an unsupervised grouping method, that is, the goal
is not used to direct how the grouping is generated. This method attempts to understand links or
associations between different attributes of the group.
Associative rules have a number of advantages:
Easy to interpret: The results are presented in the form of a rule that is easily
understood.
Actionable: It is possible to perform some sort of action based on the rule.
Large data sets: It is possible to use this technique with a large number of observations.
The limitations to this method are:
Only categorical variables: The method forces you to either restrict your analysis to
variables that are categorical or convert continuous variables to categorical variables.
Time consuming: Generating the rules can be time consuming, especially when a data
set has many variables.
Rule prioritization: The method can generate many rules that must be prioritized and
interpreted.
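As a sketch of what a rule looks like in practice, the snippet below computes the support and confidence of one candidate rule (IF bread THEN milk) over a handful of invented transactions; real associative-rule software enumerates and prioritizes many such rules automatically.

```python
# Invented transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

antecedent, consequent = {"bread"}, {"milk"}   # candidate rule: IF bread THEN milk
n = len(transactions)

# Support: fraction of transactions containing both the antecedent and the consequent.
support = sum((antecedent | consequent) <= t for t in transactions) / n
# Confidence: among transactions with the antecedent, the fraction also containing the consequent.
support_antecedent = sum(antecedent <= t for t in transactions) / n
confidence = support / support_antecedent

print("support:", support, "confidence:", confidence)
```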
10. DECISION TREES
It is often necessary to ask a series of questions before coming to a decision. The answer to one
question may lead to another question or may lead to a decision being reached.
Decision trees should be used because:
Easy to understand: Decision trees are widely used to explain how decisions are
reached based on multiple criteria.
Categorical and continuous variables: Decision trees can be generated using either
categorical data or continuous data.
Complex relationships: A decision tree can partition a data set into distinct regions
based on ranges or specific values.
The disadvantages of decision trees are:
Computationally expensive: Building decision trees can be computationally expensive,
particularly when analyzing a large data set with many continuous variables.
Difficult to optimize: Generating a useful decision tree automatically can be challenging,
since large and complex trees can be easily generated. Trees that are too small may not
capture enough information.
10.1. Tree Generation
A tree is made up of a series of decision points, where the entire set of observations or a subset
of the observations is split based on some criteria. Each point in the tree represents a set of
observations and is called a node. The relationship between two nodes that are joined is defined
as a parent-child relationship. The larger set which is divided into two or more smaller sets is
called a parent node. The nodes resulting from the division of the parent are called child nodes.
A child node with no more divisions is called a leaf node. [3]
Example tree (each node shows its size and the average response, Av; splits are on the number of cylinders):
Parent node: Size = 392, Av = 23.45
    Cylinders < 5  →  Leaf node: Size = 203, Av = 29.11
    Cylinders ≥ 5  →  Node (further divided): Size = 189, Av = 17.36
        Cylinders < 7  →  Leaf node: Size = 86, Av = 20.23
        Cylinders ≥ 7  →  Leaf node: Size = 103, Av = 14.96
10.2. Splitting Criteria
A table of data is used to generate a decision tree where certain variables are assigned as
descriptors and one variable is assigned to be the response. The descriptors will be used to build
the tree, that is, these variables will divide the data set. The response will be used to guide which
descriptors are selected and at what value the split is made.
It is common for the split to be a two-way split. There are methods that split more than two
ways. However, care should be taken using these methods since splitting the set in many ways
early in the construction of the tree may result in missing relationships.
Any variable can be split using a two-way split:
Dichotomous: Variables with two values are the most straightforward to split since each
branch represents a specific value.
Nominal: Since nominal values are discrete without order, a two-way split is
accomplished with one subset being comprised of a set of observations that equal a
certain value and the other subset being those observations that do not equal that value.
Ordinal: In the case where a variable's discrete values are ordered, the resulting subset
may be made up of more than one value, as long as the ordering is retained.
Continuous: For these variables a specific cutoff value needs to be determined, where
on one side of the split there are values less than the cutoff and on the other side there
are values greater than or equal to the cutoff.
A splitting criterion has two components: The variable to split on and the value of the variable to
be split on.
To determine the best split, all possible splits of all variables must be considered. Since it is
necessary to rank the splits, a score should be calculated for each split. There are many ways to
rank the split.
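A small sketch of tree generation and splitting on a continuous descriptor, using scikit-learn's regression tree; the data echoes the cylinders example in the figure above, but the numbers are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Invented data: number of cylinders (descriptor) and fuel economy (response).
X = np.array([[4], [4], [5], [6], [6], [8], [8], [8]])
y = np.array([31.0, 29.0, 24.0, 21.0, 20.0, 16.0, 15.0, 14.0])

# Fit a shallow tree; each split chooses a cutoff value on the descriptor.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Print the chosen splits and the average response at each leaf node.
print(export_text(tree, feature_names=["cylinders"]))
```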
11. ARTIFICIAL INTELLIGENCE TECHNIQUES
Artificial intelligence techniques such as pattern recognition, machine learning, and neural
networks are widely used in data mining. Other AI techniques, such as knowledge
acquisition, knowledge representation, and search, are relevant to the various process steps in
data mining. Classification is one of the major DM problems. Classification is the process of
dividing a data set into mutually exclusive groups such that the members of each group are as
close as possible to one another, and the members of different groups are as far as possible
from one another. One solution to the classification problem is to use a neural network. According
to Lu et al. (1996), a neural network-based DM approach consists of three major phases:
Network Construction and Training: In this phase, a layered neural network is constructed
and trained, based on the number of attributes, the number of classes, and the chosen
input coding method.
Network Pruning: In this phase, redundant links and units are removed without
increasing the classification error rate of the network.
Rule Extraction: Classification rules are extracted in this phase.
Other AI techniques that can be used for DM include case-based reasoning and intelligent agents.
Case-based reasoning uses historical cases to recognize patterns, and the intelligent agent approach
employs a computer program (i.e., an agent) to sift through data.
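The construction-and-training phase can be sketched with scikit-learn's multilayer perceptron classifier on a standard sample data set; the network pruning and rule extraction phases described by Lu et al. are not provided by this library and are omitted here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A standard sample data set with four attributes and three classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Construct and train a layered neural network for classification.
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)

print("classification accuracy:", net.score(X_test, y_test))
```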
12. GENETIC ALGORITHM
The genetic algorithm is a relatively new software paradigm inspired by Darwin's theory of
evolution. A population of rules, each representing a possible solution to a problem, is initially
created at random. Then pairs of rules (usually the strongest rules are selected as parents) are
combined to produce offspring for the next generation. A mutation process is used to randomly
modify the genetic structures of some members of each new generation. The system runs for
dozens or hundreds of generations. The process is terminated when an acceptable or optimum
solution is found, or after some fixed time limit. Genetic algorithms are appropriate for problems
that require optimization with respect to some computable criterion. This paradigm can be
applied to data mining problems. The quantity to be minimized is often the number of
classification errors on a training set. Large and complex problems require a fast computer in
order to obtain appropriate solutions in a reasonable amount of time. Mining large data sets by
genetic algorithms has become practical only recently due to the availability of affordable high-
speed computers. [2]
These are the key components associated with any GA implementation:
Problem encoding: A genetic representation for solutions to the problem
Initial population creation
Fitness function: An objective function that plays the role of the environment by
assigning possible solutions numeric values that are an indicator to the quality of the
solution
Genetic operators: To alter the composition of chromosomes during reproduction
After the creation of a random set of possible solutions to the problem at hand, the fitness
function is used to assign numeric values to each one of those solutions. Genetic operators are
now applied to this first population to create a new population of children solutions. The process
of creation of a new population from a preceding population is referred to as a generation. The
two basic genetic operators used are mutation and crossover. Mutation arbitrarily alters some
genes in a selected chromosome between generations. Crossover takes two parent chromosomes
and interchanges strings of consecutive genes between the two to create two new offspring. The
concept of elitism (or survival of the fittest) is also critical at this step: solutions with a higher
fitness have higher probabilities of surviving and reproducing. With each new generation, the
relatively "good" solutions, or the fit individuals, from the previous generation survive and the
"bad" solutions, or the unfit, perish. Inheritance is also practiced between generations, where the
best solution(s) from the preceding generation is passed on to the next generation without any
changes. As the GA progresses through multiple such generations, the population should
converge towards the global optimum for the problem at hand.
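The generational loop described above can be sketched in a few lines; the example below evolves a bit string toward all ones, a stand-in fitness function chosen only to keep the illustration self-contained, using one-point crossover, random mutation, and elitism.

```python
import random

LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.02

def fitness(chromosome):
    # Objective function: here simply the number of 1-genes in the chromosome.
    return sum(chromosome)

def crossover(a, b):
    # One-point crossover: interchange consecutive genes between two parents.
    point = random.randint(1, LENGTH - 1)
    return a[:point] + b[point:]

def mutate(chromosome):
    # Mutation: arbitrarily flip some genes between generations.
    return [1 - g if random.random() < MUTATION_RATE else g for g in chromosome]

# Initial population of random possible solutions.
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP_SIZE // 2]                 # fitter solutions are selected as parents
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - 1)]
    population = [population[0]] + children               # elitism: best solution passes on unchanged

print("best fitness:", fitness(max(population, key=fitness)))
```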
Genetic algorithms are perhaps unique in their combination of elements of directed and
stochastic search. Not only do genetic algorithms provide alternative methods for solving a
problem, they can also outperform traditional methods on many problems.
13. CONCLUSION
Exploratory data analysis and data mining is a process involving defining the problem, collecting
and preparing the data, and implementing the analysis. Once completed and evaluated, the
project should be delivered to the consumers of the information. Following a process
has many advantages, as it helps us avoid pitfalls. The four main steps are:
Problem definition: Prior to any analysis, the problem to be solved should be clearly
defined and related to one or more business objectives. Describing the deliverables will
focus the team on delivering the solution and provides correct expectations. A plan for
the project should be developed.
Data Preparation: The quality of the data is the most important aspect that influences
the quality of the results from the analysis. The data should be carefully collected,
integrated, characterized and prepared for analysis. Data preparation includes cleaning
the variables to remove potential errors. The variables should be characterized and
potentially transformed to enable the use of data with multiple analysis methods. The data
set should be partitioned into smaller sets to simplify the analysis. One should be familiar
with the data. The steps performed in preparing the data should be documented.
Implementation of the analysis: The three primary tasks that relate to any data mining
project are: summarizing the data, finding hidden relationships and making predictions.
When implementing the analysis one should select appropriate methods that match the
task, the data and the objectives of the project. When assessing the quality of a prediction
model, a separate test and training set should be used. When presenting the results of the
analysis, any transformed data should be presented in its original form. Appropriate
methods for explaining and qualifying the results should be developed when needed.
Deployment: A plan should be set up to deliver the results of the analysis to the
customer. This plan needs to take into account the nontechnical issues of introducing a
solution that potentially changes the users' daily routine. The performance should be
measured. The performance should directly relate to the business objectives of the
project. The performance may change over time and should be monitored.
14. REFERENCES
[1] Daniel T. Larose, Discovering Knowledge in Data.
[2] Sang Jun Lee and Keng Siau, "A Review of Data Mining Techniques."
[3] Glenn J. Myatt, Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining.
[4] http://www.itl.nist.gov/div898/handbook/eda/eda.htm
[5] http://www.statsoft.com/textbook/stathome.html
[6] M. Mitchell, An Introduction to Genetic Algorithms.