Module 1: Data Mining
Data mining
We live in a world where vast amounts of data are collected daily. Analyzing such
data is an important need. “We are living in the information age” is a popular saying;
however, we are actually living in the data age. Terabytes or petabytes of data pour into
our computer networks, the World Wide Web (WWW), and various data storage devices
every day from business, society, science and engineering, medicine, and almost every
other aspect of daily life.
Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD, while others view data mining as merely an
essential step in the process of knowledge discovery. The terms knowledge discovery in
databases (KDD) and data mining are often used interchangeably.
Over the last few years, KDD has been used to refer to a process consisting of the
following steps, of which data mining is only one:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are
prepared for mining. The data mining step may interact with the user or a knowledge
base. The interesting patterns are presented to the user and may be stored as new
knowledge in the knowledge base. The preceding view shows data mining as one step
in the knowledge discovery process, albeit an essential one because it uncovers hidden
patterns for evaluation.
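To make steps 3 through 7 concrete, the following Python sketch walks a tiny, hypothetical sales table through selection, transformation, mining, and presentation. The DataFrame, its column names, and the choice of a simple trend model are illustrative assumptions, not part of any standard KDD implementation.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical raw data; in practice this would be the cleaned, integrated
# result of steps 1 and 2
sales = pd.DataFrame({
    "region": ["South", "South", "North", "South", "South"],
    "month":  [1, 1, 1, 2, 3],
    "amount": [120.0, 80.0, 300.0, 95.0, 410.0],
})

# Step 3 - data selection: retrieve only the data relevant to the task
relevant = sales[sales["region"] == "South"]

# Step 4 - data transformation: consolidate via a summary/aggregation operation
monthly = relevant.groupby("month", as_index=False)["amount"].sum()

# Step 5 - data mining: apply a method to extract a pattern (here, a sales trend)
trend = LinearRegression().fit(monthly[["month"]], monthly["amount"])

# Steps 6-7 - pattern evaluation and presentation (here, a plain printout)
print("estimated change in sales per month:", trend.coef_[0])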
Architecture of a typical data mining system
The major components of a typical data mining system include the following:
1. A database, data warehouse, World Wide Web, or other information repository, which
is one or a set of data sources on which data cleaning and integration may be performed.
2. A database or data warehouse server, which fetches the relevant data based on the
user's data mining request.
3. A knowledge base that contains the domain knowledge used to guide the search or to
evaluate the interestingness of resulting patterns. For example, the knowledge base may
contain metadata which describes data from multiple heterogeneous sources.
4. A data mining engine, which consists of a set of functional modules for tasks such as
characterization, association, classification, cluster analysis, and evolution and
deviation analysis.
5. A pattern evaluation module that works in tandem with the data mining modules by
employing interestingness measures to help focus the search toward interesting
patterns.
6. A graphical user interface, which allows the user to interact with the data mining
system.
For example, to study the characteristics of software products with sales that
increased by 10% in the previous year, the data related to such products can be
collected by executing an SQL query on the sales database.
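As a minimal sketch of such a data selection step, the snippet below runs a comparable SQL query against an in-memory SQLite table; the table name, columns, and sample rows are hypothetical.

import sqlite3

# Hypothetical sales table; the schema and values are illustrative only
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, year INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("editor", 2023, 100.0), ("editor", 2024, 115.0),
    ("compiler", 2023, 200.0), ("compiler", 2024, 190.0),
])

# Data selection: products whose sales grew by at least 10% over the previous year
query = """
    SELECT cur.product
    FROM sales AS cur
    JOIN sales AS prev
      ON prev.product = cur.product AND prev.year = cur.year - 1
    WHERE cur.amount >= 1.10 * prev.amount
"""
print(con.execute(query).fetchall())   # -> [('editor',)]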
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts. The model is derived based on the analysis of a
set of training data (i.e., data objects for which the class labels are known). The model is
used to predict the class label of objects for which the class label is unknown.
A decision tree is a flowchart-like tree structure, where each node denotes a test
on an attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions.
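A minimal sketch of classification with a decision tree, using scikit-learn; the training attributes (age, income) and the class labels are invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [age, income] together with known class labels
X_train = [[25, 30000], [40, 90000], [35, 60000], [22, 20000]]
y_train = ["no", "yes", "yes", "no"]

# Derive the model from training data (objects whose class labels are known)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Use the model to predict the class label of an object whose label is unknown
print(tree.predict([[30, 70000]]))

# The learned model is itself a flowchart-like tree of attribute tests
print(export_text(tree, feature_names=["age", "income"]))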
4. Clustering analyzes data objects without consulting class labels. In many cases,
class-labeled data may simply not exist at the beginning. Clustering can be used to
generate class labels for a group of data. The objects are clustered or grouped based on
the principle of maximizing the intraclass similarity and minimizing the interclass
similarity.
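A minimal clustering sketch with k-means from scikit-learn; the two-dimensional points are made up and the choice of two clusters is an assumption.

from sklearn.cluster import KMeans

# Hypothetical unlabeled objects; no class labels are consulted
X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1], [8.1, 7.9]]

# Group the objects so that objects within a cluster are similar to one another
# and dissimilar to objects in other clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # generated cluster labels, one per object
print(km.cluster_centers_)  # one representative centre per cluster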
6. Outlier Analysis
A data set may contain objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Many data mining methods discard
outliers as noise or exceptions. However, in some applications (e.g., fraud detection) the
rare events can be more interesting than the more regularly occurring ones. The analysis
of outlier data is referred to as outlier analysis or anomaly mining.
Outlier analysis may uncover fraudulent usage of credit cards by detecting
purchases of unusually large amounts for a given account number in comparison to
regular charges incurred by the same account. Outlier values may also be detected with
respect to the locations and types of purchase, or the purchase frequency.
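A very small sketch of this idea: flag charges that deviate strongly from an account's regular spending. The charge amounts and the two-standard-deviation threshold are arbitrary illustrative choices, not a recommended fraud rule.

import statistics

# Hypothetical charge history for one account number
charges = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0, 940.0]

mean = statistics.mean(charges)
std = statistics.stdev(charges)

# Flag purchases of unusually large amounts compared with the account's
# regular charges (a simple z-score style rule)
outliers = [c for c in charges if abs(c - mean) > 2 * std]
print(outliers)   # -> [940.0]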
Statistics: Studies the collection, analysis, interpretation, and presentation of data.
Statistical models are widely used in data mining to model data and data classes.
Machine learning: Investigates how computers can learn (or improve their
performance) based on data. A main research area is for computer programs to
automatically learn to recognize complex patterns and make intelligent decisions based
on data. For example, a typical machine learning problem is to program a computer so
that it can automatically recognize handwritten postal codes on mail after learning from
a set of examples. Machine learning is a fast-growing discipline. Here, we illustrate
classic problems in machine learning that are highly related to data mining.
Active learning is a machine learning approach that lets users play an active role in the
learning process. An active learning approach can ask a user (e.g., a domain expert) to
label an example, which may be from a set of unlabeled examples or synthesized by the
learning program. The goal is to optimize the model quality by actively acquiring
knowledge from human users, given a constraint on how many examples they can be
asked to label.
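A sketch of one round of active learning with uncertainty sampling, assuming a small labeled set, an unlabeled pool, and a labeling budget of one question; the data and the logistic-regression learner are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: a few labeled examples and a pool of unlabeled ones
X_labeled = np.array([[0.0], [1.0], [9.0], [10.0]])
y_labeled = np.array([0, 0, 1, 1])
X_pool = np.array([[2.0], [5.0], [8.0]])   # labels unknown

budget = 1   # constraint on how many examples the expert may be asked to label
for _ in range(budget):
    model = LogisticRegression().fit(X_labeled, y_labeled)
    # Uncertainty sampling: pick the pool example the model is least sure about
    proba = model.predict_proba(X_pool)[:, 1]
    pick = int(np.abs(proba - 0.5).argmin())
    print("ask the domain expert to label:", X_pool[pick])
    # A real system would append the expert's answer to the labeled set and repeat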
Major Issues in Data Mining
Mining different kinds of knowledge in databases:
Different users are interested in different kinds of knowledge, so a wide spectrum of
data analysis and knowledge discovery tasks must be supported, which requires the
development of numerous data mining techniques.
Interactive mining of knowledge at multiple levels of abstraction:
Interactive mining allows users to focus the search for patterns, providing and refining
data mining requests based on returned results. The user can interact with the data
mining system to view data and discovered patterns at multiple granularities and from
different angles.
Incorporation of background knowledge:
Background knowledge, or information regarding the domain under study, may
be used to guide the discovery process and allow discovered patterns to be expressed in
concise terms and at different levels of abstraction.
3. Performance issues
Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data
mining algorithms must be efficient and scalable. The running time of a data mining
algorithm must be predictable and acceptable in large databases.
Data Warehouse
Data warehousing provides architectures and tools for business executives to
systematically organize, understand, and use their data to make strategic decisions.
A data warehouse refers to a database that is maintained separately from an
organization's operational databases. A data warehouse is "a subject-oriented,
integrated, time-variant, and nonvolatile collection of data in support of management's
decision making process."
A data warehouse focuses on the modelling and analysis of data for decision
makers (not on day-to-day transactions).
It provides a simple and concise view around particular subject issues by excluding
data that are not useful in the decision support process.
Tier-1:
The bottom tier is a warehouse database server, which is almost always a relational
database system. Data from operational databases and external sources are extracted,
cleaned, and transformed, and are then stored in the warehouse for direct querying and
analysis.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a
relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model. A ROLAP
model is an extended relational DBMS that maps operations on multidimensional data
to standard relational operations. A MOLAP model is a special-purpose server that
directly implements multidimensional data and operations.
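The ROLAP idea of mapping multidimensional operations onto relational ones can be sketched with SQLite: a roll-up over a (city, quarter) cube becomes an ordinary GROUP BY on the fact table. The table and column names below are hypothetical.

import sqlite3

# Hypothetical fact table with dimensions (city, quarter) and measure sales
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_fact (city TEXT, quarter TEXT, sales REAL)")
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    ("Kochi", "Q1", 100.0), ("Kochi", "Q2", 120.0),
    ("Chennai", "Q1", 200.0), ("Chennai", "Q2", 180.0),
])

# The multidimensional operation "roll up from (city, quarter) to city"
# maps to a standard relational GROUP BY on the fact table
rollup = "SELECT city, SUM(sales) FROM sales_fact GROUP BY city"
print(con.execute(rollup).fetchall())   # e.g. [('Chennai', 380.0), ('Kochi', 220.0)]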
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the
entire organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-functional in scope. It
typically contains detailed data as well as summarized data, and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may
be implemented on traditional mainframes, computer super servers, or parallel architecture
platforms. It requires extensive business modeling and may take years to design and build.
3. Virtual warehouse:
A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized. A virtual
warehouse is easy to build but requires excess capacity on operational database servers.
Metadata Repository:
Metadata are data about data. When used in a data warehouse, metadata are the data
that define warehouse objects. Metadata are created for the data names and definitions of the
given warehouse. Additional metadata are created and captured for time stamping any
extracted data, the source of the extracted data, and missing fields that have been added by
data cleaning or integration processes.
A metadata repository should contain the following:
A description of the structure of the data warehouse, which includes the warehouse
schema, views, dimensions, hierarchies, and derived data definitions, as well as data
mart locations and contents.
Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or purged),
and monitoring information (warehouse usage statistics, error reports, and audit trails).
The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.
The mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control).
Data related to system performance, which include indices and profiles that improve
data access and retrieval performance, in addition to rules for the timing and scheduling
of refresh, update, and replication cycles.
Business metadata, which include business terms and definitions, data ownership
information, and charging policies.
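As an informal illustration only, the categories above could be recorded as a simple nested structure; the keys and values below are invented and do not follow any particular metadata standard.

# Illustrative sketch of one warehouse's metadata entries (hypothetical values)
metadata_repository = {
    "warehouse_structure": {
        "schema": "star",
        "dimensions": ["time", "item", "branch", "location"],
        "data_marts": ["sales_mart", "inventory_mart"],
    },
    "operational": {
        "lineage": "extracted from orders_db, then cleaned and consolidated",
        "currency": "active",            # active / archived / purged
        "monitoring": {"queries_last_month": 1250, "errors": 3},
    },
    "mapping": {
        "source_database": "orders_db",
        "transformation_rules": ["trim names", "convert currency"],
        "refresh_rule": "nightly at 02:00",
    },
    "business": {
        "owner": "sales department",
        "terms": {"revenue": "gross sales before returns"},
    },
}
print(metadata_repository["operational"]["currency"])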
Consolidation involves the aggregation of data that can be accumulated and computed
in one or more dimensions.
For example, all sales offices are rolled up to the sales department or sales division to
anticipate sales trends.
Drill-down is a technique that allows users to navigate through the details. For
instance, users can view the sales by the individual products that make up a region's
sales.
Slicing and dicing is a feature whereby users can take out (slice) a specific set of data
from the OLAP cube and view (dice) the slices from different viewpoints.
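These three operations can be mimicked on a flat table with pandas; the division/office/product columns and the figures are hypothetical, and a real OLAP server would work on a cube rather than a DataFrame.

import pandas as pd

# Hypothetical sales records at the finest granularity
df = pd.DataFrame({
    "division": ["East", "East", "East", "West"],
    "office":   ["E1", "E1", "E2", "W1"],
    "product":  ["laptop", "phone", "laptop", "phone"],
    "sales":    [100, 80, 60, 90],
})

# Consolidation / roll-up: aggregate offices up to the division level
print(df.groupby("division")["sales"].sum())

# Drill-down: navigate back to the detail of individual products per office
print(df.groupby(["division", "office", "product"])["sales"].sum())

# Slice: take out the data for one member of a dimension ...
east = df[df["division"] == "East"]
# ... and dice/view it from another angle (here, by product)
print(east.groupby("product")["sales"].sum())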
Types of OLAP:
OLAP servers include relational OLAP (ROLAP) and multidimensional OLAP (MOLAP)
servers, as described for the middle tier above; hybrid OLAP (HOLAP) servers combine
the two approaches.
Operational database systems perform on-line transaction processing (OLTP) and cover
most of the day-to-day operations of an organization, such as purchasing, inventory,
manufacturing, banking, payroll, registration, and accounting.
Data Warehouse
Data warehouse systems serve users or knowledge workers in the role of data analysis
and decision making. Such systems can organize and present data in various formats in
order to accommodate the diverse needs of different users. These systems are known as
on-line analytical processing (OLAP) systems.
For example, AllElectronics shop may create a sales data warehouse in order to
keep records of the store’s sales with respect to the dimensions time, item, branch, and
location. These dimensions allow the store to keep track of things like monthly sales of
items and the branches and locations at which the items were sold. Each dimension may
have a table associated with it, called a dimension table, which further describes the
dimension. For example, a dimension table for item may contain the attributes item
name, brand, and type. Dimension tables can be specified by users or experts, or
automatically generated and adjusted based on data distributions.
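The following sketch shows what such a dimension table and its fact table might look like for the AllElectronics example, using pandas DataFrames; all keys, attribute values, and figures are invented for illustration.

import pandas as pd

# Hypothetical item dimension table: one row per item, described by its attributes
item_dim = pd.DataFrame({
    "item_key":  [1, 2],
    "item_name": ["MP3 player", "laptop"],
    "brand":     ["SoundX", "CompuFast"],
    "type":      ["audio", "computer"],
})

# Hypothetical fact table: one row per sale, holding keys into each dimension
sales_fact = pd.DataFrame({
    "time_key":     [101, 101, 102],
    "item_key":     [1, 2, 1],
    "branch_key":   [10, 10, 11],
    "location_key": [5, 5, 6],
    "units_sold":   [30, 12, 25],
})

# Joining the fact table to a dimension table answers questions such as
# "how many units of each brand were sold?"
joined = sales_fact.merge(item_dim, on="item_key")
print(joined.groupby("brand")["units_sold"].sum())

The fact table carries only keys and measures, while the descriptive attributes live in the dimension tables; this is what gives the star schema its shape.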
Snowflake schema: The snowflake schema is a variant of the star schema model, where
some dimension tables are normalized, thereby further splitting the data into additional
tables. The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space. However,
this saving of space is negligible in comparison to the typical magnitude of the fact
table. Furthermore, the snowflake structure can reduce the effectiveness of browsing,
since more joins will be needed to execute a query. Consequently, the system
performance may be adversely affected.
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence
is called a galaxy schema or a fact constellation.
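Continuing the hypothetical item dimension from the earlier sketch, a snowflake variant would normalize it by moving the repeated type description into its own table, trading a little storage for one extra join.

import pandas as pd

# Normalized (snowflake-style) item dimension: the type text moves to its own table
item_dim_sf = pd.DataFrame({
    "item_key":  [1, 2],
    "item_name": ["MP3 player", "laptop"],
    "brand":     ["SoundX", "CompuFast"],
    "type_key":  [100, 200],
})
type_dim = pd.DataFrame({
    "type_key": [100, 200],
    "type":     ["audio", "computer"],
})

# The redundancy is gone, but recovering the full description now costs an extra join
full_item = item_dim_sf.merge(type_dim, on="type_key")
print(full_item)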
Need for Data Warehousing
1. The data warehouse market supports such diverse industries as manufacturing,
retail, telecommunications, and health care. Think of a personnel database for a
company that is continually modified as personnel are added and deleted. If
management wishes to determine whether there is a problem with too many employees
quitting, they would need to know which employees have left, when they left, why they
left, and other information about their employment. For management to make these
types of high-level business analyses, more historical data, not just the current
snapshot, are required.
6. The basic components of a data warehousing system include data migration, the
warehouse, and access tools. The data are extracted from operational systems, but must
be reformatted, cleansed, integrated, and summarized before being placed in the
warehouse.
Challenges for Data Warehousing
8. Data Quality – In a data warehouse, data come from many disparate sources across
all facets of an organization. When a data warehouse tries to combine inconsistent data
from disparate sources, it encounters errors. Inconsistent data, duplicates, logic
conflicts, and missing data all result in data quality challenges. Poor data quality results
in faulty reporting and analytics, which in turn undermine optimal decision making.
of resources to ensure the information provided is accurate.
11. Performance – Building a data warehouse is similar to building a car. A car must
be carefully designed from the beginning to meet the purposes for which it is
intended. Yet, there are options each buyer must consider to make the vehicle
truly meet individual performance needs. A data warehouse must also be
carefully designed to meet overall performance requirements. While the final
product can be customized to fit the performance needs of the organization, the
initial overall design must be carefully thought out to provide a stable foundation
from which to start.
12. Designing the Data Warehouse – People generally don’t want to “waste” their
time defining the requirements necessary for proper data warehouse design.
Usually, there is a high level perception of what they want out of a data
warehouse. However, they don’t fully understand all the implications of these
perceptions and, therefore, have a difficult time adequately defining them. This
results in miscommunication between the business users and the technicians
building the data warehouse. The typical end result is a data warehouse which
does not deliver the results expected by the user. Since the data warehouse is
inadequate for the end user, there is a need for fixes and improvements
immediately after initial delivery.
13. User Acceptance – People are not keen on changing their daily routine, especially
if the new process is not intuitive. There are many challenges to overcome to build a
data warehouse that is quickly adopted by an organization.
Applications of DWH
Data warehouses are also used for knowledge discovery and data mining: finding hidden
patterns and associations, constructing analytical models, performing classification and
prediction, and presenting the mining results using visualization tools.
Banking Industry
Finance Industry
Healthcare
All of their financial, clinical, and employee records are fed into warehouses,
which helps them strategize and predict outcomes, track and analyze service
feedback, generate patient reports, and share data with tie-in insurance
companies, medical aid services, etc.
Hospitality Industry
Insurance
They also use warehouses to keep records of product shipments and product
portfolios, identify profitable product lines, and analyze previous data and
customer feedback to evaluate weaker product lines and eliminate them.
The Retailers
They also analyze sales to determine fast-selling and slow-selling product lines
and determine their shelf space through a process of elimination.
Telephone Industry
The telephone industry operates with both offline and online data, burdening it
with a lot of historical data that has to be consolidated and integrated.
Transportation Industry