ICS 2408 Lecture 1 Introduction

CIT 4207:

DATA MINING & WAREHOUSING

BY

Muchina S.K

Introduction

 Motivation: Why data mining?
 What is data mining?
 Data Mining: On what kind of data?
 Data mining functionality
 Are all the patterns interesting?
 Classification of data mining systems
 Major issues in data mining

Motivation
 Data explosion problem
 Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 Solution: Data warehousing and data mining
 Data warehousing and on-line analytical processing
 Extraction of interesting knowledge (rules, regularities, patterns, constraints)
from data in large databases

Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems

What is data mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data.
 Process of semi-automatically analyzing large databases to find patterns
that are:
 valid: hold on new data with some certainty
 novel: non-obvious to the system
 useful: it should be possible to act on the pattern
 understandable: humans should be able to interpret the pattern

Goals of Data Mining
 The typical goals of data mining projects are:
 Identification of groups, clusters, strata, or dimensions in data that
display no obvious structure
 The identification of factors that are related to a particular outcome of
interest (root-cause analysis)
 Accurate prediction of outcome variable(s) of interest (in the future, or
in new customers, clients, applicants, etc.; this application is usually
referred to as predictive data mining)

Knowledge Discovery (KDD) Process
Data mining: core of the knowledge discovery process
[Figure: KDD pipeline, bottom to top: Databases → Cleaning & Integration → Data Warehouse → Selection and Transformation → Task-relevant Data → Data Mining → Patterns → Evaluation and Presentation]

Steps of a KDD Process
1. Data cleaning: removal of noise and inconsistent data.
2. Data integration: combination of multiple data sources.
3. Data selection: retrieval of data relevant to the analysis task from the
database.
4. Data transformation: data consolidation into forms appropriate for mining
through summarization, aggregation.
5. Data mining: application of intelligent methods to extract patterns of
interest.
6. Pattern evaluation: identification of truly interesting patterns representing
knowledge based on some interestingness measures.
7. Knowledge presentation: use of visualization and presentation techniques
to present mined knowledge to the user.
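
The seven steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the two store lists, the function names, and the 0.5 support threshold are invented for the example, and the "mining" step is reduced to counting item pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw sources: two stores that log baskets slightly differently.
store_a = [
    {"id": 1, "items": ["bread", "milk"]},
    {"id": 2, "items": ["bread", "butter", "milk"]},
    {"id": 3, "items": None},                 # noise: missing basket
]
store_b = [
    {"id": 4, "items": ["Milk", "bread "]},   # inconsistent case and whitespace
    {"id": 5, "items": ["butter"]},
]

def clean(records):
    """Step 1: drop records with missing items and normalise item names."""
    return [
        {"id": r["id"], "items": [i.strip().lower() for i in r["items"]]}
        for r in records if r["items"]
    ]

def integrate(*sources):
    """Step 2: combine several cleaned sources into one list."""
    return [r for src in sources for r in src]

def select(records, min_items=2):
    """Step 3: keep only baskets relevant to pair analysis."""
    return [r for r in records if len(r["items"]) >= min_items]

def transform(records):
    """Step 4: consolidate each basket into a sorted tuple of distinct items."""
    return [tuple(sorted(set(r["items"]))) for r in records]

def mine(baskets):
    """Step 5: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Step 6: keep pairs whose support meets the (assumed) threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }

def present(patterns):
    """Step 7: render the surviving patterns as readable lines."""
    return [f"{a} & {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))
report = present(patterns)
print("\n".join(report))
```

Each function maps to one numbered step, so a step can be swapped (for example, a real mining algorithm in place of `mine`) without touching the others.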

Data Mining and Business Intelligence

[Figure: layered pyramid; potential to support business decisions increases toward the top. The typical user shown beside a layer is given in parentheses.]
 Decision Making (End User)
 Data Presentation: Visualization Techniques (Business Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)

Confluence of Multiple Disciplines

Data Mining System Architecture

[Figure: layered architecture, top to bottom, with a knowledge-base alongside the layers]
 Graphical user interface
 Pattern evaluation
 Data mining engine
 Database or data warehouse server
 Data cleaning, integration and selection
 Data sources: Database, Data Warehouse, World Wide Web, Other Info Repositories

Necessity for Data Mining
 Tremendous amount of data
 Algorithms must be highly scalable to handle such data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structure data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
 Desired analyses
 Support for planning, Yield management, System performance, Mature
database analysis

Multi-Dimensional View of Data Mining
 Data to be mined
 Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series, text,
multimedia, heterogeneous, legacy, WWW
 Knowledge to be mined
 Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, Web mining, etc.

Classification Schemes
 General functionality
 Descriptive data mining
 Predictive data mining

 Different views lead to different classifications


• Data view: Kinds of data to be mined
• Knowledge view: Kinds of knowledge to be discovered
• Method view: Kinds of techniques utilized
• Application view: Kinds of applications adapted

Data Mining Models and Tasks

Life Cycle of Data Mining Projects
Standardized Data Mining Process [CRISP]
 Business understanding - project objectives from business perspective, data mining problem definition
 Data understanding - initial data collection, get familiar with data
 Data preparation - construct final dataset from raw data
 Modeling - select and apply modeling techniques
 Evaluation - evaluate model, decide on further deployment
 Deployment - create report, carry out actions based on new insights

Data Mining Algorithms

[Figure: taxonomy of data mining algorithms]
 Online Analytical Processing
 SQL Query Tools
 Discovery Driven Methods
 Description: Visualization, Clustering, Association, Sequential Analysis
 Prediction: Classification, Regressions, Decision Trees, Neural Networks

Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structure data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web

Data Mining Functionalities
 Multidimensional concept description: Characterization and discrimination
 Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet regions
 Frequent patterns, association, correlation vs. causality
 Diaper → Beer [support 0.5%, confidence 75%] (Correlation or causality?)
 Classification and prediction
 Construct models (functions) that describe and distinguish classes or
concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on
(fuel consumption)
 Predict some unknown or missing numerical values
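
In the Diaper/Beer rule above, the bracketed pair is conventionally read as [support, confidence]. A minimal sketch of how both could be computed from baskets follows; the transactions and the helper names `support` and `confidence` are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions.

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical baskets, one set of items per transaction.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"bread", "milk"},
    {"bread"},
    {"milk"},
    {"beer"},
]

s = support(transactions, {"diaper", "beer"})        # 3 of 8 baskets
c = confidence(transactions, {"diaper"}, {"beer"})   # 3 of the 4 diaper baskets
print(f"diaper -> beer [support {s:.1%}, confidence {c:.0%}]")
```

A high confidence only says the items co-occur; it does not by itself answer the slide's "correlation or causality?" question.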
Data Mining Functionalities (2)
 Cluster analysis
 Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
 Maximizing intra-class similarity & minimizing inter-class similarity
 Outlier analysis
 Outlier: Data object that does not comply with the general behavior of
the data
 Noise or exception? Useful in fraud detection, rare events analysis
 Trend and evolution analysis
 Trend and deviation: e.g., regression analysis
 Sequential pattern mining: e.g., digital camera → large SD memory
 Periodicity analysis
 Similarity-based analysis
 Other pattern-directed or statistical analyses
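
Cluster analysis groups unlabeled data so that members of a group are similar to each other and dissimilar to other groups. The slides do not name an algorithm; the sketch below uses k-means on one numeric attribute as one common choice. The house prices, the starting centers, and the function name `kmeans_1d` are assumptions for the example.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's k-means on a list of numbers, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # converged: assignments will not change
            break
        centers = new_centers
    return centers, clusters

# Hypothetical house prices (in thousands); no class labels are given.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])
print(centers, clusters)
```

The result depends on the starting centers and on the number of clusters chosen, which is why clustering output usually needs human interpretation.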

Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: Not all of them are
interesting
 Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or
test data with some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
 Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
 Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.

Major Issues in Data Mining
Mining methodology and user interaction
 Mining different kinds of knowledge in databases
 Interactive mining of knowledge at multiple levels of abstraction
 Incorporation of background knowledge
 Data mining query languages and ad-hoc data mining
 Expression and visualization of data mining results
 Handling noise and incomplete data
 Pattern evaluation: the interestingness problem

Performance and scalability


 Efficiency and scalability of data mining algorithms
 Parallel, distributed and incremental mining methods

Major Issues in Data Mining
Issues relating to the diversity of data types
 Handling relational and complex types of data
 Mining information from heterogeneous databases and global
information systems (WWW)

Issues related to applications and social impacts


 Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
 Integration of the discovered knowledge with existing knowledge: A
knowledge fusion problem
 Protection of data security, integrity, and privacy

Primitives that Define a Data Mining Task
 Task-relevant data
 Database or data warehouse name
 Database tables or data warehouse cubes
 Condition for data selection
 Relevant attributes or dimensions
 Data grouping criteria

 Type of knowledge to be mined


 Characterization, discrimination, association, classification,
prediction, clustering, outlier analysis, other data mining tasks
 Background knowledge
 Pattern interestingness measurements
 Visualization/presentation of discovered patterns

Primitive 3: Background Knowledge
 A typical kind of background knowledge: Concept hierarchies
 Schema hierarchy
 E.g., street < city < county < country
 Set-grouping hierarchy
 E.g., {20-39} = young, {40-59} = middle_aged
 Operation-derived hierarchy
 email address: [email protected]
 login-name < department < university < country
 Rule-based hierarchy
 low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) < $50
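
Three of the hierarchy kinds above can be written as small mapping functions. This is a sketch under assumptions: the function names, the "other" fallback group, and the sample email address are hypothetical, and the email parser assumes a domain of exactly three parts.

```python
def age_group(age):
    """Set-grouping hierarchy: map an age onto a named range."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return "other"  # assumption: the slide only defines two groups

def email_levels(address):
    """Operation-derived hierarchy: login-name < department < university < country.

    Assumes the domain has exactly three dot-separated parts.
    """
    login, domain = address.split("@", 1)
    department, university, country = domain.split(".")
    return {
        "login_name": login,
        "department": department,
        "university": university,
        "country": country,
    }

def low_profit_margin(price, cost):
    """Rule-based hierarchy: an item is low margin if price - cost < $50."""
    return (price - cost) < 50

print(age_group(25), age_group(45))
print(email_levels("jdoe@cs.exampleuni.ke"))   # hypothetical address
print(low_profit_margin(price=120, cost=90))
```

Mining at a higher level of abstraction then amounts to replacing raw values with the output of such a function before looking for patterns.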

Primitive 4: Pattern Interestingness Measure
 Simplicity
e.g., (association) rule length, (decision) tree size
 Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification reliability or
accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
 Utility
potential usefulness, e.g. support (association), noise threshold (description)
 Novelty
not previously known, surprising (used to remove redundant rules)
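
The four measures above can act as filters over mined rules. The sketch below is one possible reading, not a standard definition: the `Rule` class, all thresholds, and the redundancy test (a longer rule is dropped when a more general rule with the same consequent is at least as confident) are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    antecedent: frozenset
    consequent: str
    support: float
    confidence: float

def is_interesting(rule, max_length=2, min_conf=0.6, min_support=0.1):
    """Apply simplicity, certainty, and utility thresholds (all assumed values)."""
    simple = len(rule.antecedent) <= max_length   # simplicity: rule length
    certain = rule.confidence >= min_conf         # certainty: confidence
    useful = rule.support >= min_support          # utility: support
    return simple and certain and useful

def is_redundant(rule, others):
    """Novelty: a rule adds nothing if a more general rule is at least as confident."""
    return any(
        o.consequent == rule.consequent
        and o.antecedent < rule.antecedent        # strict subset of the antecedent
        and o.confidence >= rule.confidence
        for o in others
    )

# Hypothetical mined rules.
rules = [
    Rule(frozenset({"diaper"}), "beer", 0.30, 0.75),
    Rule(frozenset({"diaper", "milk"}), "beer", 0.12, 0.70),  # redundant
    Rule(frozenset({"bread"}), "butter", 0.05, 0.90),         # support too low
    Rule(frozenset({"milk"}), "bread", 0.40, 0.45),           # confidence too low
]
kept = [r for r in rules if is_interesting(r) and not is_redundant(r, rules)]
print(kept)
```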

Primitive 5: Presentation of Discovered Patterns
 Different backgrounds/usages may require different forms of representation
 E.g., rules, tables, crosstabs, pie/bar chart, etc.
 Concept hierarchy is also important
 Discovered knowledge might be more understandable when represented
at high level of abstraction
 Interactive drill up/down, pivoting, slicing and dicing provide different
perspectives to data
 Different kinds of knowledge require different representation: association,
classification, clustering, etc.

Data mining Potential Applications
 Banking: loan/credit card approval
 predict good customers based on old customers
 Customer relationship management:
 identify those who are likely to leave for a competitor.
 Targeted marketing:
 identify likely responders to promotions
 Fraud detection: telecommunications, financial transactions
 from an online stream of events identify fraudulent events
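
The banking item above, predicting good customers from old customers, is a classification task. The slides do not prescribe a method; the sketch below uses k-nearest neighbours as one simple option. The customer records, the two features (income and existing debt, both in thousands), and the function name `predict` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical history: ((income, existing_debt), outcome) for old customers.
old_customers = [
    ((55, 5), "good"),
    ((60, 10), "good"),
    ((48, 8), "good"),
    ((20, 15), "bad"),
    ((25, 18), "bad"),
    ((30, 25), "bad"),
]

def predict(history, applicant, k=3):
    """Label a new applicant by majority vote among the k closest old customers."""
    nearest = sorted(history, key=lambda row: math.dist(row[0], applicant))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(predict(old_customers, (50, 7)))    # close to the "good" group
print(predict(old_customers, (22, 20)))   # close to the "bad" group
```

Distance-based methods are sensitive to feature scale, so real features would normally be normalised first; that step is left out here because both features share the same unit.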

Data mining Potential Applications
 Manufacturing and production:
 automatically adjust knobs when process parameter changes
 Medicine: disease outcome, effectiveness of treatments
 analyze patient disease history: find relationship between diseases
 Molecular/Pharmaceutical: identify new drugs
 Scientific data analysis:
 identify new galaxies by searching for sub clusters
 Web site/store design and promotion:
 find affinity of visitor to pages and modify layout

Ex. 1: Market Analysis and Management
 Source of data: credit card transactions, customer complaint calls, (public) lifestyle studies
 Target marketing
 Find clusters of “model” customers who share the same characteristics: interest, income level, spending
habits, etc.
 Determine customer purchasing patterns over time
 Cross-market analysis: find associations/correlations between product sales, and predict based on
such associations
 Customer profiling: what types of customers buy what products (clustering or classification)
 Customer requirement analysis
 Identify the best products for different groups of customers
 Predict what factors will attract new customers
 Provision of summary information


 Multidimensional summary reports
 Statistical summary information (data central tendency and variation)
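
Central tendency and variation can be summarised with Python's standard `statistics` module. The spending figures and the `summarize` helper below are invented for the example.

```python
import statistics

def summarize(values):
    """Return basic measures of central tendency and variation."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),   # sample standard deviation
    }

# Hypothetical monthly spend per customer; one customer spends far more.
monthly_spend = [120, 135, 150, 150, 160, 175, 400]
summary = summarize(monthly_spend)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The single large value pulls the mean well above the median, which is why summary reports usually show both.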

Ex. 2: Corporate Analysis & Risk Management
 Finance planning and asset evaluation
 cash flow analysis and prediction
 contingent claim analysis to evaluate assets
 cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

 Resource planning
 summarize and compare the resources and spending

 Competition
 monitor competitors and market directions
 group customers into classes and set up a class-based pricing procedure
 set pricing strategy in a highly competitive market

Ex. 3: Fraud Detection & Mining Unusual Patterns
 Approaches: Clustering & model construction for frauds, outlier analysis
 Applications: Health care, retail, credit card service, telecomm.
 Auto insurance: ring of collisions
 Money laundering: suspicious monetary transactions
 Medical insurance
 Professional patients, ring of doctors, and ring of references
 Unnecessary or correlated screening tests
 Telecommunications: phone-call fraud
 Phone call model: destination of the call, duration, time of day or week.
Analyze patterns that deviate from an expected norm
 Retail industry
 Analysts estimate that 38% of retail shrink is due to dishonest employees
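
The phone-call item above describes modelling expected behaviour and flagging deviations from it. A minimal sketch of that idea uses a z-score on call duration. The call history, the new calls, the threshold of 3 standard deviations, and the function name `flag_deviations` are assumptions, and a real phone-call model would also use destination and time of day.

```python
import statistics

def flag_deviations(history, new_values, threshold=3.0):
    """Return new values lying more than `threshold` standard deviations
    from the mean of the historical values.

    Assumes the history is not constant (standard deviation above zero).
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [v for v in new_values if abs(v - mean) / stdev > threshold]

# Hypothetical call durations in minutes for one subscriber.
usual_calls = [3, 4, 5, 4, 6, 5, 4, 3, 5, 4]
new_calls = [5, 95, 4]

suspicious = flag_deviations(usual_calls, new_calls)
print(suspicious)
```

The norm is fitted on past calls only, so a single extreme new call cannot inflate the standard deviation and hide itself.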