1) Architecture of Data Mining
1) Architecture of Data Mining
____________________________________________________________________________________________
Unit II
DATA MINING
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown and potentially useful information from data in databases. While
data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms, data
mining is actually part of the knowledge discovery process.
Data Mining refers to the detection and extraction of new patterns from the already collected data. Data
mining is the amalgamation of the field of statistics and computer science aiming to discover patterns in
incredibly large datasets and then transforming them into a comprehensible structure for later use.
Data mining is a very important process where potentially useful and previously unknown information is
extracted from large volumes of data. There are a number of components involved in the data mining
process. These components constitute the architecture of a data mining system.
_____________________________________________________________________________________________
Dr. Anitha Julian, Professor / CSE 1
19CS521 Data Warehousing and Data Mining Unit II – Data Mining
____________________________________________________________________________________________
Data Source
The actual source of data is the Database, data warehouse, World Wide Web (WWW), text files, and other
documents. A huge amount of historical data is needed for data mining to be successful. Organizations
typically store data in databases or data warehouses. Data warehouses may comprise one or more databases,
text files spreadsheets, or other repositories of data. Sometimes, even plain text files or spreadsheets may
contain information. Another primary source of data is the World Wide Web or the internet.
The data from multiple sources are integrated into a common source known as Data Warehouse.
1. Flat Files
o Flat files is defined as data files in text form or binary form with a structure that can be easily
extracted by data mining algorithms.
o Data stored in flat files have no relationship or path among themselves, like if a relational
database is stored on flat file, then there will be no relations between the tables.
o Flat files are represented by data dictionary. Eg: CSV file.
o Application: Used in DataWarehousing to store data, Used in carrying data to and from
server, etc.
2. Relational Databases
o A Relational database is defined as the collection of data organized in tables with rows and
columns.
o Physical schema in Relational databases is a schema which defines the structure of tables.
o Logical schema in Relational databases is a schema which defines the relationship among
tables.
o Standard API of relational database is SQL.
o Application: Data Mining, ROLAP model, etc.
3. DataWarehouse
o A datawarehouse is defined as the collection of data integrated from multiple sources that
will queries and decision making.
o There are three types of datawarehouse: Enterprise datawarehouse, Data Mart and Virtual
Warehouse.
o Two approaches can be used to update data in DataWarehouse: Query-driven Approach
and Update-driven Approach.
o Application: Business decision making, Data mining, etc.
_____________________________________________________________________________________________
Dr. Anitha Julian, Professor / CSE 2
19CS521 Data Warehousing and Data Mining Unit II – Data Mining
____________________________________________________________________________________________
4. Transactional Databases
o Transactional databases is a collection of data organized by time stamps, date, etc to
represent transaction in databases.
o This type of database has the capability to roll back or undo its operation when a
transaction is not completed or committed.
o Highly flexible system where users can modify information without changing any sensitive
information.
o Follows ACID property (atomicity, Consistency, Isolation and Durability) of DBMS.
o Application: Banking, Distributed systems, Object databases, etc.
5. Multimedia Databases
o Multimedia databases consists audio, video, images and text media.
o They can be stored on Object-Oriented Databases.
o They are used to store complex information in a pre-specified formats.
o Application: Digital libraries, video-on demand, news-on demand, musical database, etc.
6. Spatial Database
o Store geographical information.
o Stores data in the form of coordinates, topology, lines, polygons, etc.
o Application: Maps, Global positioning, etc.
_____________________________________________________________________________________________
Dr. Anitha Julian, Professor / CSE 3
19CS521 Data Warehousing and Data Mining Unit II – Data Mining
____________________________________________________________________________________________
7. Time-series Databases
o Time series databases contains stock exchange data and user logged activities.
o Handles array of numbers indexed by time, date, etc.
o It requires real-time analysis.
o Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
o WWW refers to World wide web is a collection of documents and resources like audio,
video, text, etc which are identified by Uniform Resource Locators (URLs) through web
browsers, linked by HTML pages, and accessible via the Internet network.
o It is the most heterogeneous repository as it collects data from multiple resources.
o It is dynamic in nature as Volume of data is continuously increasing and changing.
o Application: Online shopping, Job search, Research, studying, etc.
_____________________________________________________________________________________________
Dr. Anitha Julian, Professor / CSE 4
19CS521 Data Warehousing and Data Mining Unit II – Data Mining
____________________________________________________________________________________________
Data Mining Process is classified into two stages: Data preparation or data preprocessing and data mining
Data preparation process includes data cleaning, data integration, data selection and data transformation.
Whereas the second phase includes data mining, pattern evaluation, and knowledge representation.
1. Data Cleaning: Data cleaning is defined as removal of noisy and irrelevant data from collection.
o Cleaning in case of Missing values.
o Cleaning noisy data, where noise is a random or variance error.
o Cleaning with Data discrepancy detection and Data transformation tools.
_____________________________________________________________________________________________
Dr. Anitha Julian, Professor / CSE 5
19CS521 Data Warehousing and Data Mining Unit II – Data Mining
____________________________________________________________________________________________
2. Data Integration: Data integration is defined as heterogeneous data from multiple sources combined
in a common source(DataWarehouse).
o Data integration using Data Migration tools.
o Data integration using Data Synchronization tools.
o Data integration using ETL(Extract-Load-Transformation) process.
3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided
and retrieved from the data collection.
o Data selection using Neural network.
o Data selection using Decision Trees.
o Data selection using Naive bayes.
o Data selection using Clustering, Regression, etc.
4. Data Transformation: Data Transformation is defined as the process of transforming data into
appropriate form required by mining procedure.
Data Transformation is a two step process:
o Data Mapping: Assigning elements from source base to destination to capture
transformations.
o Code generation: Creation of the actual transformation program.
_____________________________________________________________________________________________
Dr. Anitha Julian, Professor / CSE 6
19CS521 Data Warehousing and Data Mining Unit II – Data Mining
____________________________________________________________________________________________
5. Data Mining: Data mining is defined as clever techniques that are applied to extract patterns
potentially useful.
o Transforms task relevant data into patterns.
o Decides purpose of model using classification or characterization.
6. Pattern Evaluation: Pattern Evaluation is defined as as identifying strictly increasing patterns
representing knowledge based on given measures.
o Find interestingness score of each pattern.
o Uses summarization and Visualization to make data understandable by user.
7. Knowledge representation: Knowledge representation is defined as technique which utilizes
visualization tools to represent data mining results.
o Generate reports.
o Generate tables.
o Generate discriminant rules, classification rules, characterization rules, etc.
Note:
KDD is an iterative process where evaluation measures can be enhanced, mining can be refined,
new data can be integrated and transformed in order to get different and more appropriate results.
Preprocessing of databases consists of Data cleaning and Data Integration.
_____________________________________________________________________________________________
Dr. Anitha Julian, Professor / CSE 7
19CS521 Data Warehousing and Data Mining Unit II – Data Mining
____________________________________________________________________________________________
KNOWLEDGE BASE
The knowledge base is helpful in the entire process of data mining. It might be helpful to guide the search
or evaluate the stake of the result patterns. The knowledge base may even contain user views and data from
user experiences that might be helpful in the data mining process. The data mining engine may receive
inputs from the knowledge base to make the result more accurate and reliable. The pattern assessment
module regularly interacts with the knowledge base to get inputs, and also update it.
_____________________________________________________________________________________________
Dr. Anitha Julian, Professor / CSE 8
19CS521 Data Warehousing and Data Mining Unit II – Data Mining
____________________________________________________________________________________________
_____________________________________________________________________________________________
Dr. Anitha Julian, Professor / CSE 9
19CS521 Data Warehousing and Data Mining Unit II – Data Mining
____________________________________________________________________________________________
_____________________________________________________________________________________________
Dr. Anitha Julian, Professor / CSE 10