ICS 2408 Lecture 1 Introduction
BY
Muchina S.K
Introduction
Motivation
Data explosion problem
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities, patterns, constraints)
from data in large databases
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
What is data mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data.
Goals of Data Mining
The typical goals of data mining projects are:
Identification of groups, clusters, strata, or dimensions in data that
display no obvious structure,
[Figure: knowledge discovery process; labels: Databases, Data Warehouse, Task-relevant Data, Data Mining, Patterns]
Steps of a KDD Process
1. Data cleaning: removal of noise and inconsistent data.
2. Data integration: combination of multiple data sources.
3. Data selection: retrieval of data relevant to the analysis task from the
database.
4. Data transformation: data consolidation into forms appropriate for mining
through summarization, aggregation.
5. Data mining: application of intelligent methods to extract patterns of
interest.
6. Pattern evaluation: identification of truly interesting patterns representing
knowledge based on some interestingness measures.
7. Knowledge presentation: use of visualization and presentation techniques
to present mined knowledge to the user.
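The seven steps above can be sketched as a small pipeline. This is only an illustration: the toy sales records, the function names (clean, integrate, select, transform, mine, evaluate, present) and the 0.6 support threshold are assumptions made for the example, not part of the lecture.

```python
from collections import Counter

# Hypothetical toy data: sales records from two sources.
store_a = [
    {"customer": "c1", "item": "bread", "amount": 2.0},
    {"customer": "c2", "item": "milk", "amount": None},  # incomplete record
    {"customer": "c2", "item": "eggs", "amount": 3.0},
    {"customer": "c1", "item": "milk", "amount": 1.5},
]
store_b = [
    {"customer": "c3", "item": "bread", "amount": 2.0},
    {"customer": "c3", "item": "milk", "amount": 1.5},
]

def clean(records):
    """Step 1: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate items into one basket (set) per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: count how many baskets contain each item."""
    return Counter(item for basket in baskets.values() for item in basket)

def evaluate(counts, n_baskets, min_support):
    """Step 6: keep items whose support reaches the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{item}: support {s:.2f}" for item, s in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
data = select(data, ["customer", "item"])
baskets = transform(data)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.6)
report = present(patterns)
print("\n".join(report))
```

In practice steps 1 to 4 usually dominate the effort; the mining step here is deliberately trivial (single-item counts).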
Data Mining and Business Intelligence
[Figure: layers with increasing potential to support business decisions; top layer: Decision Making (End User)]
Data Mining System Architecture
Pattern evaluation
Knowledge-base
Database or data warehouse server
Necessity for Data Mining
Tremendous amount of data
Algorithms must be highly scalable to handle such data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Desired analyses
Support for planning, Yield management, System performance, Mature
database analysis
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series, text,
multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, Web mining, etc.
Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Data Mining Models and Tasks
Life Cycle of Data Mining Projects
Business understanding - project objectives from business perspective, data mining problem definition
Data understanding - initial data collection, get familiar with data
Data preparation - construct final dataset from raw data
Modeling - select and apply modeling techniques
Evaluation - evaluate model, decide on further deployment
Deployment - create report, carry out actions based on new insights
[Figure: Standardized Data Mining Process [CRISP]]
Data Mining Algorithms
[Figure: labels: Description, Prediction, SQL Query Tools]
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Data Mining Functionalities
Multidimensional concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet regions
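Characterization and discrimination can be illustrated on the dry vs. wet example. The sketch below is a minimal one under stated assumptions: the region records, the 500 mm cut-off and the helper names characterize and discriminate are all hypothetical.

```python
from statistics import mean

# Hypothetical regions with annual rainfall (mm) and vegetation cover (%).
regions = [
    {"name": "r1", "rainfall_mm": 200, "cover_pct": 10},
    {"name": "r2", "rainfall_mm": 300, "cover_pct": 20},
    {"name": "r3", "rainfall_mm": 1200, "cover_pct": 70},
    {"name": "r4", "rainfall_mm": 1600, "cover_pct": 90},
]

DRY_THRESHOLD_MM = 500  # assumed cut-off between dry and wet

def characterize(rows):
    """Characterization: summarize one class of data."""
    return {
        "count": len(rows),
        "avg_rainfall_mm": mean(r["rainfall_mm"] for r in rows),
        "avg_cover_pct": mean(r["cover_pct"] for r in rows),
    }

def discriminate(target, contrast):
    """Discrimination: contrast the summaries of two classes."""
    return {k: target[k] - contrast[k] for k in target if k != "count"}

dry = characterize([r for r in regions if r["rainfall_mm"] < DRY_THRESHOLD_MM])
wet = characterize([r for r in regions if r["rainfall_mm"] >= DRY_THRESHOLD_MM])
difference = discriminate(wet, dry)
print(dry, wet, difference)
```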
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns; not all of them are
interesting
Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or
test data with some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
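The two objective measures named above, support and confidence, can be computed directly from a list of transactions. The sketch uses the usual definitions (support = fraction of transactions containing an itemset; confidence of A => B = support(A and B) / support(A)); the transactions themselves are made up for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent => consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
print(f"support={s:.2f} confidence={c:.2f}")
```

A rule is typically reported only if both values exceed user-chosen minimum thresholds.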
Major Issues in Data Mining
Mining methodology and user interaction
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad-hoc data mining
Expression and visualization of data mining results
Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Major Issues in Data Mining (continued)
Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from heterogeneous databases and global
information systems (www)
Primitives that Define a Data Mining Task
Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
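The task-relevant data primitive above can be pictured as a small specification object. This is a sketch only: the class name TaskRelevantData, its fields and the example warehouse "sales_dw" are hypothetical, and the SQL rendering is just one possible way to express the selection.

```python
from dataclasses import dataclass, field

@dataclass
class TaskRelevantData:
    database: str    # database or data warehouse name
    tables: list     # database tables or data warehouse cubes
    condition: str   # condition for data selection
    attributes: list # relevant attributes or dimensions
    group_by: list = field(default_factory=list)  # data grouping criteria

    def to_sql(self):
        """Render the specification as a SELECT statement.

        The database name is not part of the query text; it would
        determine which connection the query is sent to.
        """
        sql = f"SELECT {', '.join(self.attributes)} FROM {', '.join(self.tables)}"
        if self.condition:
            sql += f" WHERE {self.condition}"
        if self.group_by:
            sql += f" GROUP BY {', '.join(self.group_by)}"
        return sql

task = TaskRelevantData(
    database="sales_dw",
    tables=["sales"],
    condition="year = 2023",
    attributes=["region", "SUM(amount)"],
    group_by=["region"],
)
query = task.to_sql()
print(query)
```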
Primitive 5: Presentation of Discovered Patterns
Different backgrounds/usages may require different forms of representation
E.g., rules, tables, crosstabs, pie/bar chart, etc.
Concept hierarchy is also important
Discovered knowledge might be more understandable when represented
at a high level of abstraction
Interactive drill up/down, pivoting, slicing and dicing provide different
perspectives on data
Different kinds of knowledge require different representation: association,
classification, clustering, etc.
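Drilling up and down along a concept hierarchy, as mentioned above, can be sketched as follows. The city-to-country hierarchy, the sales figures and the function names roll_up and drill_down are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: city -> country.
city_to_country = {"Nairobi": "Kenya", "Mombasa": "Kenya", "Kampala": "Uganda"}

# Hypothetical sales facts: (city, amount).
sales = [("Nairobi", 120), ("Mombasa", 80), ("Kampala", 50), ("Nairobi", 30)]

def roll_up(rows, hierarchy):
    """Drill up: total the amounts at the parent level (country)."""
    totals = defaultdict(int)
    for city, amount in rows:
        totals[hierarchy[city]] += amount
    return dict(totals)

def drill_down(rows, hierarchy, parent):
    """Drill down: total the amounts per city within one country."""
    totals = defaultdict(int)
    for city, amount in rows:
        if hierarchy[city] == parent:
            totals[city] += amount
    return dict(totals)

by_country = roll_up(sales, city_to_country)
kenya_detail = drill_down(sales, city_to_country, "Kenya")
print(by_country, kenya_detail)
```

The same facts give a coarse view (per country) or a detailed one (per city), which is the point of presenting knowledge at different levels of abstraction.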
Potential Applications of Data Mining
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
from an online stream of events identify fraudulent events
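The banking case above (predicting good customers from old customers) is a classification task. One of the simplest possible classifiers is a 1-nearest-neighbour rule, sketched here; the customer features, the labels and the function name predict are hypothetical, and a real credit model would be considerably more careful.

```python
# Hypothetical past customers:
# (income in thousands, years as customer) -> repaid the loan?
past_customers = [
    ((20, 1), False),
    ((25, 2), False),
    ((60, 5), True),
    ((80, 8), True),
]

def predict(applicant, history):
    """Return the label of the most similar past customer (1-NN)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(history, key=lambda rec: squared_distance(rec[0], applicant))
    return label

likely_good = predict((70, 6), past_customers)
likely_bad = predict((22, 1), past_customers)
print(likely_good, likely_bad)
```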
Potential Applications of Data Mining (continued)
Manufacturing and production:
automatically adjust knobs when process parameters change
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationship between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
identify new galaxies by searching for sub-clusters
Web site/store design and promotion:
find affinity of visitor to pages and modify layout
Ex. 1: Market Analysis and Management
Source of data: credit card transactions, customer complaint calls, (public) lifestyle studies
Target marketing
Find clusters of “model” customers who share the same characteristics: interest, income level, spending
habits, etc.
Determine customer purchasing patterns over time
Resource planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and apply a class-based pricing procedure
set pricing strategy in a highly competitive market
Ex. 3: Fraud Detection & Mining Unusual Patterns
Approaches: Clustering & model construction for frauds, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
Auto insurance: ring of collisions
Money laundering: suspicious monetary transactions
Medical insurance
Professional patients, ring of doctors, and ring of references
Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
Phone call model: destination of the call, duration, time of day or week.
Analyze patterns that deviate from an expected norm
Retail industry
Analysts estimate that 38% of retail shrink is due to dishonest employees
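The telecommunications case above (analyzing patterns that deviate from an expected norm) is an instance of outlier analysis. A minimal sketch using z-scores follows; the call durations, the threshold of 2 standard deviations and the function name zscore_outliers are assumptions, and z-scores are only one of many outlier tests.

```python
from statistics import mean, stdev

# Hypothetical call durations (minutes) for one subscriber.
durations = [3, 4, 5, 4, 3, 5, 4, 60]

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` sample standard
    deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

suspicious = zscore_outliers(durations)
print(suspicious)
```

A fuller phone-call model would also use the destination and the time of day or week, as listed above, rather than duration alone.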