Shree Swaminarayan Institute of Technology CE DEPT.
(VI SEMESTER)
EXPERIMENT NO: 7
TITLE: Survey of Different Data Mining Tools.
OBJECTIVE: On completion of this exercise, the student will know about…
This practical attempts to support the decision-making process by discussing
the historical development and presenting a range of existing state-of-the-art
data mining and related tools. It also covers the tool categorization based on
different user groups, data structures, data mining tasks and methods,
visualization and interaction styles, import and export options for data and
models, platforms, and license policies.
THEORY:
This introduction to data mining tools is organized in three sections.
1. The first section, Historical Development and State-of-the-Art, highlights the historical
development of data mining software up to the present.
2. The second section, Criteria for Comparing Data Mining Software, explains the criteria used
to compare data mining software.
3. The last section, Categorization of Data Mining Software into Different Types, proposes a
categorization of data mining software and introduces typical software tools for the different
types.
Historical Development and State-Of-The-Art
Following the original definition given in Ref [1], “data mining is a step in the knowledge
discovery from databases (KDD) process that consists of applying data analysis and discovery
algorithms to produce a particular enumeration of patterns (or models) across the data.” In the
same article, KDD is defined as “the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data.”
Today, a large number of standard data mining methods are available. From a historical perspective, these
methods have different roots. One early group of methods was adopted from classical statistics: the focus
was changed from the proof of known hypotheses to the generation of new hypotheses. Examples include
methods from Bayesian decision theory, regression theory, and principal component analysis. Another
group of methods stemmed from artificial intelligence, such as decision trees, rule-based systems,
and others. The term ‘machine learning’ includes methods such as support vector machines and artificial
neural networks. There are several different and sometimes overlapping categorizations; for example,
fuzzy logic, artificial neural networks, and evolutionary algorithms are summarized as
computational intelligence [2].
The typical life cycle of new data mining methods begins with theoretical papers based on
in-house software prototypes, followed by public or on-demand software distribution of successful
algorithms as research prototypes. Then, either special commercial or open-source packages containing a
family of similar algorithms are developed or the algorithms are integrated into existing open-source or
commercial packages. Many companies have tried to promote their own stand-alone packages, but only
a few have reached notable market shares. The life cycle of some data mining tools is remarkably short.
Typical reasons include internal marketing decisions and acquisitions of specialized companies by larger
ones, leading to a renaming and integration of product lines.
The largest commercial success stories resulted from the step-wise integration of data mining methods
into established commercial statistical tools. Companies such as SPSS, founded in 1975 with precursors
from 1968, or SAS, founded in 1976, have been offering statistical tools for mainframe computers since
the 1970s.
Many companies offering business intelligence products have integrated data mining solutions into their
database products; one example is Oracle Data Mining (established in 2002). Many of these products
also resulted from the acquisition and integration of specialized data mining companies.
In 2008, the worldwide market for business intelligence (i.e., software and maintenance fees) was 7.8
billion USD, including 1.5 billion USD in so-called ‘advanced analytics’, which covers data mining and
statistics. This sector grew by 12.1% between 2007 and 2008, with large players including companies
such as SAS (33.2%, tool: SAS Enterprise Miner), SPSS (14.3%, an IBM company since 2009; tool:
IBM SPSS Modeler), Microsoft (1.7%, tool: SQL Server Analysis Services), Teradata (1.5%, tool:
Teradata Database, former name TeraMiner), and TIBCO (1.4%, tool: TIBCO Spotfire) [3].
Open-source libraries have also become very popular since the 1990s. The most prominent example is the
Waikato Environment for Knowledge Analysis (WEKA). WEKA started in 1994 as a C++
library, with its first public release in 1996. In 1999, it was completely rebuilt as a Java package; since
that time, it has been regularly updated.
Criteria for Comparing Data Mining Software
In the following, different criteria for comparison of data mining software are introduced. These criteria
are based on user groups, data structures, data mining tasks and methods, import and export options, and
license models.
A. User Groups
There are many different data mining tools available, which fit the needs of quite different user
groups:
1. Business applications: This group uses data mining as a tool for solving commercially relevant
business applications such as customer relationship management, fraud detection, and so on.
2. Applied research: This group applies data mining to research problems, for
example, in technology and life sciences. Here, users are mainly interested in tools with
well-proven methods, a graphical user interface (GUI), and interfaces to domain-related
data formats or databases.
3. Algorithm development: This group develops new data mining algorithms and requires tools to both
integrate its own methods and compare these with existing methods. The necessary tools should
contain many alternative algorithms for comparison.
4. Education: For education at universities, data mining tools should be very intuitive, with a
comfortable interactive user interface, and inexpensive. In addition, they should allow the
integration of in-house methods during programming seminars.
B. Data Structures
An important criterion is the dimensionality of the underlying raw data in the processed
dataset. The first data mining applications were focused on handling datasets represented as
two-dimensional feature tables. In this classical format, a dataset consists of a set of N examples (e.g.,
clients of an insurance company) with s features containing real values or
usually integer-coded classes or symbols (e.g., income, age, number of contracts, and the like). This
format is supported by nearly all existing tools. In some cases, the dataset can be sparse, with only a
few nonzero features, such as a list of s shopping items for N different customers. The
computational and memory effort can be reduced if a tool exploits this sparse structure.
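The saving from a sparse structure can be illustrated with a minimal Python sketch. The shopping data, the item names, and the helper names to_sparse and sparse_get are assumptions made up for this illustration; real tools use their own, more elaborate sparse formats.

```python
# Hypothetical shopping data: N = 3 customers, s = 6 possible items.
ITEMS = ["bread", "milk", "eggs", "tea", "rice", "soap"]

# Dense feature table: one row per customer, one column per item (N x s cells).
dense = [
    [1, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 3],
]


def to_sparse(table):
    """Keep only the nonzero cells as {(row, column): value}."""
    return {
        (i, j): value
        for i, row in enumerate(table)
        for j, value in enumerate(row)
        if value != 0
    }


def sparse_get(sparse, i, j):
    """Read one cell; cells that are not stored are implicitly zero."""
    return sparse.get((i, j), 0)


sparse = to_sparse(dense)
stored_dense = sum(len(row) for row in dense)  # every cell is stored
stored_sparse = len(sparse)                    # only nonzero cells are stored
print(stored_dense, stored_sparse)
```

In this toy table the dense form stores all N x s cells, while the sparse form stores only the nonzero ones, so the gap grows as the item catalogue gets larger and each customer buys only a few items.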
Besides feature data (e.g., age and income), there are other types of data, such as texts, time-series
data, sequences, images, graphs, 3D images, videos, and 3D videos.
C. Tasks and Methods
The most important tasks in data mining are
1. supervised learning, with a known output variable in the dataset, including
- classification: class prediction, with the variable typically coded as an integer output;
- fuzzy classification: with gradual memberships with values between 0 and 1
applied to the different classes;
- regression: prediction of a real-valued output variable, including special cases of
predicting future values in a time series from recent or past values;
2. unsupervised learning, without a known output variable in the dataset, including
- clustering: finds and describes groups of similar examples in the data using crisp or
fuzzy clustering algorithms;
- association learning: finds typical groups of items that occur frequently together in
examples;
3. semi-supervised learning, whereby the output variable is known only for some examples.
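The difference between a supervised and an unsupervised task can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are invented, nearest-neighbour classification stands in for classification methods in general, and simple pair counting stands in for association learning. The function names are hypothetical and do not belong to any tool listed in this practical.

```python
from collections import Counter
from itertools import combinations
import math


def nearest_neighbor_classify(train, query):
    """Supervised: return the class label of the closest training example.

    train is a list of (features, label) pairs with integer-coded labels.
    """
    _, label = min(train, key=lambda example: math.dist(example[0], query))
    return label


def frequent_pairs(transactions, min_support):
    """Unsupervised: item pairs found together in at least min_support examples."""
    counts = Counter()
    for items in transactions:
        # Sort the distinct items so each pair is always counted in the same order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical feature table: (age, number of contracts) with a known class 0 or 1.
train = [((25, 1), 0), ((30, 2), 0), ((55, 6), 1), ((60, 5), 1)]
predicted = nearest_neighbor_classify(train, (58, 4))
print(predicted)

# Hypothetical shopping baskets: no output variable is given.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

The classifier needs the label column of the training data, whereas the pair counter works on the baskets alone, which is the distinction drawn in the task list above.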
D. Platforms
Data mining tools can be subdivided into stand-alone and client/server solutions. Client/server
solutions dominate, especially in products designed for business users. They are available for
different platforms, including Windows, Mac OS, Linux, and special mainframe supercomputers.
There is a growing number of Java-based systems that are platform independent, aimed at users in
research and applied research.
E. Licenses
There exists a wide variety of data mining tools with commercial and open-source licenses.
Commercial software is particularly attractive in the business application user group because of its
high software stability, good coupling with other commercial tools for data
warehouses, included software maintenance, and the possibility of user training for sophisticated
topics. For all other user groups, there is a strong trend toward open-source software, although different
types of open-source licenses exist.
Categorization of Data Mining Software into Different Types
Following the criteria from the previous section, different types of similar data mining tools can be found.
In addition, the tools and their group membership are summarized in separate tables for commercial
tools (Tables 1 and 2) and for free and open-source tools (Table 3). In
these tables, very popular tools are marked in bold.
The following types are proposed [3]:
1. Data mining suites (DMS) focus largely on data mining and include numerous methods. They
support feature tables and time series, while additional tools for text mining are sometimes
available. Typical examples include IBM SPSS Modeler, SAS Enterprise Miner, Alice d’Isoft,
DataEngine, DataDetective, GhostMiner, Knowledge Studio, KXEN, NAG Data Mining
Components, Partek Discovery Suite, STATISTICA, and TIBCO Spotfire.
2. Business intelligence packages (BIs) have no special focus on data mining, but include basic data
mining functionality, especially for statistical methods in business applications. Most BI packages
are commercial (IBM Cognos 8 BI, Oracle Data Mining, SAP Netweaver Business Warehouse,
Teradata Database, DB2 Data Warehouse from IBM, and PolyVista), but a few open-source
solutions exist (Pentaho).
3. Mathematical packages (MATs) have no special focus on data mining, but provide a large and
extendable set of algorithms and visualization routines. MATs are attractive to users in algorithm
development and applied research because data mining algorithms can be rapidly implemented,
mostly in the form of extensions (EXT) and research prototypes (RES). MAT packages exist as
commercial (MATLAB and R-PLUS) or open-source tools (R, Kepler).
4. Integration packages (INTs) are extendable bundles of many different open-source algorithms,
either as stand-alone software (mostly based on Java, such as KNIME, the GUI version of WEKA,
KEEL, and TANAGRA) or as a kind of larger extension package for tools from the MAT type
(such as Gait-CAD, PRTools for MATLAB, and RWEKA for R).
5. Extensions (EXT) are smaller add-ons for other tools such as Excel, MATLAB, R, and so forth, with
limited but quite useful functionality.
6. Data mining libraries (LIBs) implement data mining methods as a
bundle of functions. These functions can be embedded in other software tools using an
Application Programming Interface (API) for the interaction between the software tool and the
data mining functions. A graphical user interface is missing, but some functions can support the
integration of specific visualization tools. They are often written in Java or C++, and the
solutions are platform independent. Open-source examples are WEKA (Java-based), MLC++
(C++-based), JAVA Data Mining Package, and LibSVM (C++- and Java-based) for support vector
machines.
7. Specialties (SPECs) are similar to DMS tools, but implement only one special family of methods
such as artificial neural networks. Examples are CART for decision trees, Bayesia Lab for
Bayesian networks, C5.0, WizRule, and Rule Discovery System for rule-based systems, Magnum Opus
for association analysis, and JavaNNS and Neuroshell for artificial neural networks.
8. Research prototypes (RES) are usually the first, and not always stable, implementations of new and
innovative algorithms. They contain only one or a few algorithms with restricted graphical support and
without automation support. RES tools are mostly open source. Examples are GIFT for
content-based image retrieval, Himalaya for mining maximal frequent item sets, sequential pattern
mining, and scalable linear regression trees, Rseslibs for rough sets, and Pegasus for graph mining.
9. Solutions (SOLs) describe a group of tools that are customized to narrow application fields such
as text mining, image processing, and so on.
Table 1 List of Commercial Tools (Part 1) [3]
TOOL TYPE LINK
ADAPA (Zementis) DMS [Link]
Alice (d’Isoft) DMS [Link]
Bayesia Lab SPEC [Link]
C5.0 SPEC [Link]
CART SPEC [Link]
Data Applied DMS [Link]
DataDetective DMS [Link]/?dden
DataEngine DMS [Link]
Datascope DMS [Link]
DB2 Data Warehouse BI [Link]/software/data/infosphere/warehouse
DeltaMaster BI [Link]/deltamaster
Forecaster XL EXT [Link]
GhostMiner DMS [Link]/businessintelligence/products/ghostminer
IBM Cognos 8 BI BI [Link]/software/data/cognos/[Link]
IBM SPSS Modeler DMS [Link]/software/modeling/modeler
IBM SPSS Statistics MAT [Link]/software/statistics
iModel DMS [Link]/products/imodel
InfoSphere Warehouse BI [Link]/software/data/infosphere/warehouse
JMP DMS [Link]
KnowledgeMiner SPEC [Link]
KnowledgeStudio DMS [Link]
KXEN DMS [Link]
Magnum Opus SPEC [Link]
MATLAB MAT [Link]
MATLAB Neural Network Toolbox EXT [Link]
Model Builder DMS [Link]
ModelMAX SOL [Link]/products/[Link]
Table 2 List of Commercial Tools (Part 2) [3]
TOOL TYPE LINK
Molegro Data Modeler SOL [Link]
NAG Data Mining Components LIB [Link]/numeric/DR/[Link]
NeuralWorks Predict SPEC [Link]/[Link]
Neurofusion LIB [Link]
Neuroshell SPEC [Link]
Oracle Data Mining (ODM) DMS [Link]/technology/products/bi/odm/[Link]
Partek Discovery Suite DMS [Link]/software
Partek Genomics Suite SOL [Link]/software
PolyAnalyst DMS [Link]/[Link]
PolyVista BI [Link]
Random Forests SPEC [Link]
RapAnalyst SPEC [Link]/[Link]
R-PLUS MAT [Link]
SAP Netweaver Business Warehouse (BW) BI [Link]/platform/netweaver/components/businesswarehouse
SAS Enterprise Miner DMS [Link]/products/miner
See5 SPEC [Link]
SPAD Data Mining DMS [Link]
SQL Server Analysis Services DMS [Link]/sql
STATISTICA DMS [Link]/products/data-mining-solutions/G259
SuperQuery DMS [Link]
Teradata Database BI [Link]
Think Enterprise Data Miner (EDM) DMS [Link]
TIBCO Spotfire DMS [Link]
Unica PredictiveInsight DMS [Link]
WizRule and WizWhy SPEC [Link]
XAffinity SPEC [Link]
Table 3 List of Free and Open-Source Tools [3]
TOOL TYPE LINK
ADaM LIB [Link]/adam
CellProfiler Analyst SOL [Link]/[Link]
D2K DMS [Link]
Gait-CAD INT [Link]/projects/gait-cad
GATE SOL [Link]/download
GIFT RES [Link]/software/gift
Gnome Data Mine Tools DMS [Link]/datamining/gdatamine
Himalaya RES [Link]
ImageJ SOL [Link]/ij
ITK SOL [Link]
JAVA Data Mining Package LIB [Link]/projects/jdmp
JavaNNS SPEC [Link]/software/JavaNNS/welcome[Link]
KEEL INT [Link]
Kepler MAT [Link]
KNIME INT [Link]
LibSVM LIB [Link]/~cjlin/libsvm
MEGA SOL [Link]/m [Link]
MLC++ LIB [Link]/tech/mlc
Orange LIB [Link]/orange
Pegasus RES [Link]/~pegasus
Pentaho BI [Link]/projects/pentaho
Proximity SPEC [Link]/proximity/[Link]
PRTools EXT [Link]
R MAT [Link]
RapidMiner DMS [Link]
Rattle INT [Link]
ROOT LIB [Link]/root
ROSETTA SPEC [Link]/tools/rosetta/[Link]
Rseslibs RES [Link]/~rses
Rule Discovery System∗ SPEC [Link]
RWEKA INT [Link]/web/packages/RWeka/[Link]
TANAGRA INT [Link]/~ricco/tanagra/en/[Link]
Waffles LIB [Link]
WEKA DMS, LIB [Link]/projects/weka
XELOPES Library∗ LIB [Link]/en/technology/xelopes
XLMiner∗ EXT [Link]/xlminer
References:
[1] Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Mag
1996, 17:37–54.
[2] Engelbrecht AP. Computational Intelligence: An Introduction. Chichester: John Wiley; 2007.
[3] Mikut R, Reischl M. Data mining tools.
EXERCISE:
1) Write down the functionality and advantages of the top 5 analytical tools.
EVALUATION:
Observation & Implementation    Timely completion    Viva    Total
4                               2                    4       10
Signature: ____________
Date: ________________
Student Name (Enrollment No) Page no