1-Data Mining and Applications
[Figure: example data sources: traffic patterns, sensor networks, computational simulations]
Types of data
▪ Numeric data: Each object is a point in a
multidimensional space
▪ Categorical data: Each object is a vector of categorical
values
▪ Set data: Each object is a set of values (with or without
counts)
▪ Sets can also be represented as binary vectors, or vectors
of counts
▪ Ordered sequences: Each object is an ordered
sequence of values.
▪ Graph data
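The set representations above can be sketched in a few lines. This is a minimal illustration, assuming a small fixed item vocabulary (`VOCAB`) and two helper functions that are hypothetical names, not part of the slides:

```python
from collections import Counter

# Hypothetical item vocabulary; its order fixes the vector positions.
VOCAB = ["Beer", "Bread", "Coke", "Diaper", "Milk"]

def to_binary_vector(items, vocab=VOCAB):
    """Binary vector: 1 if the item occurs in the set, 0 otherwise."""
    present = set(items)
    return [1 if v in present else 0 for v in vocab]

def to_count_vector(items, vocab=VOCAB):
    """Count vector: number of occurrences of each vocabulary item."""
    counts = Counter(items)
    return [counts[v] for v in vocab]

basket = ["Bread", "Coke", "Milk"]
basket_with_counts = ["Milk", "Milk", "Bread"]

print(to_binary_vector(basket))             # [0, 1, 1, 0, 1]
print(to_count_vector(basket_with_counts))  # [0, 1, 0, 0, 2]
```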
What can you do with the data?
▪ Suppose that you are the owner of a supermarket and
you have collected billions of market basket data. What
information would you extract from it and how would you
use it?
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Possible uses: product placement, catalog creation, recommendations
[Figure: application examples: improving health care and reducing costs; predicting the impact of climate change]
[Figure: data mining and knowledge discovery at the intersection of machine learning, visualization, statistics, and databases]
Statistics
▪ Discovery of structures or patterns in data sets
▪ hypothesis testing, parameter estimation
▪ Optimal strategies for collecting data
▪ efficient search of large databases
▪ Static data
▪ constantly evolving data
▪ Models play a central role
▪ algorithms are of a major concern
▪ patterns are sought
Relational Databases
▪ A relational database can contain several tables
▪ Tables and schemas
▪ The goal in data organization is to maintain data and
quickly locate the requested data
▪ Queries and index structures
▪ Query execution and optimization
▪ Query optimization finds the “best” possible evaluation
method for a given query
▪ Providing fast, reliable access to data for data mining
Artificial Intelligence
▪ Intelligent agents
▪ Perception-Action-Goal-Environment
▪ Search
▪ Uniform cost and informed search algorithms
▪ Knowledge representation
▪ FOL, production rules, frames with semantic networks
▪ Knowledge acquisition
▪ Knowledge maintenance and application
Machine Learning
▪ Focusing on complex representations, data-intensive
problems, and search-based methods
▪ Flexibility with prior knowledge and collected data
▪ Generalization from data and empirical validation
▪ statistical soundness and computational efficiency
▪ constrained by finite computing & data resources
▪ Challenges from KDD
▪ scaling up, cost info, auto data preprocessing, more
knowledge types
Visualization
▪ Producing a visual display with insights into the
structure of the data with interactive means
▪ zoom in/out, rotating, displaying detailed info
▪ Various types of visualization methods
▪ show summary properties and explore relationships
between variables
▪ investigate large DBs and convey lots of information
▪ analyze data with geographic/spatial location
▪ A pre- and post-processing tool for KDD
Applications of Data Mining
▪ Banking: loan/credit card approval
▪ predict good customers based on old customers
▪ Customer relationship management:
▪ identify those who are likely to leave for a competitor.
▪ Targeted marketing:
▪ identify likely responders to promotions
▪ Fraud detection: telecommunications, financial
transactions
▪ from an online stream of events, identify the fraudulent ones
▪ Manufacturing and production:
▪ automatically adjust knobs when process parameter changes
Applications of Data Mining
▪ Medicine: disease outcome, effectiveness of treatments
▪ analyze patient disease history: find relationship between
diseases
▪ Molecular/Pharmaceutical: identify new drugs
▪ Scientific data analysis:
▪ identify new galaxies by searching for sub clusters
▪ Web site/store design and promotion:
▪ find affinity of visitor to pages and modify layout
▪ And so on.
Database Processing vs. Data Mining Processing
Aspect   Database processing              Data mining processing
Query    Well defined; SQL                Poorly defined; no precise query language
Output   Precise; subset of database      Fuzzy; not a subset of database
Database Processing vs. Data Mining Processing
▪ Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than
$10,000 in the last month.
– Find all customers who have purchased bread
▪ Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased with
bread. (association rules)
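The contrast can be made concrete with a small sketch. It assumes the five market baskets from the earlier slide, treats each TID as a customer, and uses a hypothetical `purchases` table in an in-memory SQLite database. The first part is a precise SQL query whose answer is a subset of the stored rows; the second counts which items co-occur with bread, a result that is not a subset of the database.

```python
import sqlite3
from collections import Counter

# The five baskets from the supermarket example (TID -> items).
baskets = {
    1: ["Bread", "Coke", "Milk"],
    2: ["Beer", "Bread"],
    3: ["Beer", "Coke", "Diaper", "Milk"],
    4: ["Beer", "Bread", "Diaper", "Milk"],
    5: ["Coke", "Diaper", "Milk"],
}

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (transaction, item).
conn.execute("CREATE TABLE purchases (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(tid, item) for tid, items in baskets.items() for item in items],
)

# Database processing: "find all customers who have purchased bread".
bread_tids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT tid FROM purchases WHERE item = 'Bread' ORDER BY tid"
    )
]

# Data mining flavor: which items are purchased together with bread, and how often?
co_counts = Counter(
    item for tid in bread_tids for item in baskets[tid] if item != "Bread"
)

print(bread_tids)               # [1, 2, 4]
print(co_counts.most_common())
```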
Goals of Data Mining and KDD
▪ Prediction: determine how certain attributes within the data will
behave in the future.
▪ Description: find human-interpretable patterns that describe the
data.
Data Mining Tasks
[Figure: data mining tasks illustrated on a sample data table with columns Tid, Refund, Marital Status, Taxable Income, Cheat]
Basic Operations of Data Mining
▪ Prediction Methods:
▪ Regression
▪ Classification
▪ Collaborative Filtering
▪ Description Methods:
▪ Clustering / similarity matching
▪ Frequent Item sets, Association rules and variants
▪ Deviation detection
Predictive: Regression
▪ Predict a value of a given continuous valued variable
based on the values of other variables, assuming a linear
or nonlinear model of dependency.
▪ Extensively studied in statistics, neural network fields.
▪ Examples:
▪ Predicting sales amounts of new product based on
advertising expenditure.
▪ Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
▪ Time series prediction of stock market indices.
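As a rough illustration of the linear case, the sketch below fits a straight line by ordinary least squares. The advertising and sales numbers are invented for the example, and `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predicted value of the continuous target for input x."""
    return intercept + slope * x

# Made-up data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(advertising, sales)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted sales at 6.0: {predict(6.0, slope, intercept):.2f}")
```

A nonlinear dependency would need a different model family; the least-squares idea of minimizing squared prediction error carries over.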
Classification and Prediction
▪ Classification is the process of learning a model that
describes different classes of data; the classes are
predetermined
▪ Given a collection of records (training set)
▪ Each record contains a set of attributes, one of the attributes
is the class.
▪ Find a model for class attribute as a function of the values
of other attributes.
▪ Goal: previously unseen records should be assigned a
class as accurately as possible.
▪ A test set is used to determine the accuracy of the model. Usually,
the given data set is divided into training and test sets, with
training set used to build the model and test set used to validate it.
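The train/test workflow described above can be sketched as follows. The records are synthetic, the 1-nearest-neighbor classifier is just one simple choice of model, and all function names are hypothetical.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle with a fixed seed, then split into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, features):
    """Return the class of the training record closest to `features`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(train, key=lambda rec: sq_dist(rec[0], features))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(train, f) == label for f, label in test)
    return correct / len(test)

# Synthetic records: ((employed, years_at_address), credit_worthy).
# The labeling rule is an assumption made only to generate example data.
records = [((1, y), "Yes" if y >= 5 else "No") for y in range(11)]
records += [((0, y), "No") for y in range(11)]

train, test = train_test_split(records)
print(f"train={len(train)} test={len(test)} accuracy={accuracy(train, test):.2f}")
```

The test records never take part in building the model, so the measured accuracy estimates performance on previously unseen records.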
Predictive: Classification
▪ Find a model for class attribute as a function of the values of
other attributes
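Model for predicting credit worthiness

Training data (Credit Worthy is the class attribute):
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

[Figure: decision tree. The root tests Employed: No leads to the class No; Yes leads to a test on Education. Each Education branch (Graduate; {High school, Undergrad}) leads to a test on the number of years at the present address, ending in Yes/No leaves.]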
Classification Example
Training set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
…    …         …                   …                           …

Test set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

[Figure: the training set is used to learn a classifier (the model), which is then applied to the test set]
Examples of Classification Task
▪ Classifying credit card transactions
as legitimate or fraudulent
▪ Classifying land covers (water bodies, urban
areas, forests, etc.) using satellite data
▪ Categorizing news stories as finance,
weather, entertainment, sports, etc.
▪ Identifying intruders in cyberspace
▪ Predicting tumor cells as benign or malignant
▪ Classifying secondary structures of protein
as alpha-helix, beta-sheet, or random coil
Predictive: Collaborative Filtering
▪ Given database of user preferences, predict preference
of new user
▪ Example: predict what new movies you will like based
on
▪ your past preferences
▪ others with similar past preferences
▪ their preferences for the new movies
▪ Example: predict what books/CDs a person may want to
buy
▪ (and suggest it, or give discounts to tempt customer)
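One common way to realize this idea is user-based collaborative filtering. The sketch below is only an illustration under stated assumptions: the users, movies, and ratings are invented, similarity is plain cosine similarity over co-rated items, and the prediction is a similarity-weighted average of other users' ratings.

```python
from math import sqrt

# Invented ratings on a 1-5 scale.
ratings = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob":   {"Movie A": 5, "Movie B": 5, "Movie C": 1, "Movie D": 5},
    "carol": {"Movie A": 1, "Movie B": 2, "Movie C": 5, "Movie D": 1},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine_similarity(ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else None

print(predict("alice", "Movie D", ratings))
```

Because all ratings are positive, raw cosine similarity stays fairly high even for users with opposite tastes; mean-centering each user's ratings first is a common refinement.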
Descriptive: Cluster Analysis
▪ The previous data mining task of classification deals with
partitioning data based on a pre-classified training
sample
▪ Clustering is an automated process to group related
records together.
▪ Related records are grouped together on the basis of
having similar values for attributes
▪ The groups are usually disjoint
▪ Given a set of data points, each having a set of attributes,
and a similarity measure among them, find clusters such
that
▪ Data points in one cluster are more similar to one another.
▪ Data points in separate clusters are less similar to one another.
Cluster Analysis
▪ Similarity Measures?
▪ Euclidean Distance if attributes are continuous.
▪ Other Problem-specific Measures
▪ Intra-cluster distances are minimized
▪ Inter-cluster distances are maximized
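A minimal k-means sketch shows how Euclidean distance can drive clustering. k-means is one of several possible algorithms and is chosen here only for illustration; the points are made up, and the initialization (first k points) is deliberately naive.

```python
from math import dist  # Euclidean distance, Python 3.8+

def k_means(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    # `clusters` holds the assignment computed in the last pass.
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Each assignment step reduces intra-cluster distances for the current centroids; the result still depends on the initialization.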
Clustering: Application 1
▪ Market Segmentation:
▪ Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
▪ Approach:
▪ Collect different attributes of customers based on their
geographical and lifestyle related information.
▪ Find clusters of similar customers.
▪ Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
Clustering: Application 2
▪ Document Clustering:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Itemsets discovered: {Milk, Coke}, {Diaper, Milk}
Rules discovered: {Milk} --> {Coke}, {Diaper, Milk} --> {Beer}
Definition of Association Rule
Association Rule Mining
▪ Two-step approach:
1. Generate all frequent itemsets (sets of items whose
support ≥ minsup)
2. Generate high confidence association rules from each
frequent itemset
▪ Each rule is a binary partitioning of a frequent itemset
▪ Frequent itemset generation is the more expensive
operation
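The two steps can be sketched on the five market baskets used earlier. This is a brute-force illustration, not Apriori or any other optimized algorithm, and the thresholds (minsup = 0.6, minconf = 0.7) are assumptions chosen for the example. Support is taken as the fraction of transactions containing an itemset, and the confidence of X → Y as support(X ∪ Y) / support(X).

```python
from itertools import combinations

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions)

def frequent_itemsets(minsup):
    """Step 1 (brute force): enumerate all itemsets, keep support >= minsup."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = support_count(frozenset(combo))
            if count / n >= minsup:
                result[frozenset(combo)] = count
    return result

def association_rules(frequent, minconf):
    """Step 2: split each frequent itemset into lhs -> rhs, keep confident rules."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent.
                conf = count / frequent[lhs]
                if conf >= minconf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

frequent = frequent_itemsets(minsup=0.6)
rules = association_rules(frequent, minconf=0.7)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "-->", sorted(rhs), f"confidence={conf:.2f}")
```

Step 1 dominates the cost because the number of candidate itemsets grows exponentially with the number of items, which is why practical algorithms prune candidates instead of enumerating them all.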
Frequent Item sets: Applications
▪ Text mining: finding associated phrases in text
▪ There are lots of documents that contain the phrases
“association rules”, “data mining” and “efficient algorithm”
▪ Recommendations:
▪ Users who buy this item often buy that item as well
▪ Users who watched James Bond movies, also watched Jason
Bourne movies.
Association Analysis: Applications
▪ Market-basket analysis
▪ Rules are used for sales promotion, shelf management,
and inventory management
▪ Medical Informatics
▪ Rules are used to find combination of patient symptoms
and test results associated with certain diseases
Deviation/Anomaly/Change Detection
▪ Detect significant deviations from
normal behavior
▪ Applications:
▪ Credit Card Fraud Detection
▪ Network Intrusion Detection
▪ Identify anomalous behavior from
sensor networks for monitoring and
surveillance.
▪ Detecting changes in the global forest
cover.
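A very simple form of deviation detection flags values that lie far from the mean. The sketch below uses a z-score rule; the sensor readings and the threshold of 2.0 are assumptions made for the example, and `find_anomalies` is a hypothetical helper.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [
        (i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold
    ]

# Made-up sensor readings with one obvious deviation.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 25.0, 10.1]
print(find_anomalies(readings))  # [(6, 25.0)]
```

A large outlier inflates both the mean and the standard deviation, so on small samples its own z-score stays modest; median-based measures are a common, more robust alternative.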
Trend and Evolution Analysis
▪ Describes and models regularities or trends for objects
whose behavior changes over time
▪ Sequential pattern mining
▪ Cross selling:
digital camera → large memory card
▪ Stock market
Major Challenges in Data Mining
▪ Efficiency and scalability of data mining algorithms
▪ Parallel, distributed, stream, and incremental mining methods
▪ Handling high-dimensionality
▪ Handling noise, uncertainty, and incompleteness of data
▪ Incorporation of constraints, expert knowledge, and
background knowledge in data mining
▪ Pattern evaluation and knowledge integration
▪ Mining diverse and heterogeneous kinds of data: e.g.,
bioinformatics, Web, software/system engineering, information
networks
▪ Application-oriented and domain-specific data mining
▪ Invisible data mining (embedded in other functional modules)
▪ Protection of security, integrity, and privacy in data mining
What is a Data Warehouse?
A process of transforming
data into information and
making it available to users in
a timely enough manner to
make a difference
Data Warehousing – A Process
▪ It is a relational or multidimensional database
management system designed to support management
decision making.
▪ A data warehouse is a copy of transaction data
specifically structured for querying and reporting.
▪ Data warehousing is a technique for assembling and managing data from
various sources for the purpose of answering business
questions, thus enabling decisions that were not previously
possible.
Data Warehouses
▪ Defined in many different ways, but not rigorously.
▪ A decision support database that is maintained separately
from the organization’s operational database
▪ Support information processing by providing a solid platform
of consolidated, historical data for analysis.
▪ “A data warehouse is a subject-oriented, integrated, time-
variant, and non-volatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
▪ Data warehousing: The process of constructing and using
data warehouses
Data Warehouses
▪ A data warehouse thus does not simply contain data accumulated
at a central point; the data is carefully assembled
from a variety of information sources around the
organization, cleaned up, quality assured, and then
released (published).
▪ A Data Warehouse is a repository of integrated
information, available for queries and analysis. Data and
information are extracted from heterogeneous sources as
they are generated. This makes it much easier and more
efficient to run queries over data that originally came from
different sources.
▪ The goal of data warehousing is to support decision
making with data!
Data Warehouse—Subject Oriented
▪ Organized around major subjects, such as customer,
product, sales.
▪ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.
▪ Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process.
Data Warehouse—Integrated
▪ Constructed by integrating multiple, heterogeneous data
sources
▪ relational databases, flat files, on-line transaction records
▪ Data cleaning and data integration techniques are
applied.
▪ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
▪ E.g., Hotel price: currency, tax, breakfast covered, etc.
▪ When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
▪ The time horizon for the data warehouse is significantly
longer than that of operational systems.
▪ Operational database: current value data.
▪ Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
▪ Every key structure in the data warehouse
▪ Contains an element of time, explicitly or implicitly
▪ But the key of operational data may or may not contain a “time
element”.
Data Warehouse—Non-Volatile
▪ A physically separate store of data transformed from the
operational environment.
▪ Operational update of data does not occur in the data
warehouse environment.
▪ Does not require transaction processing, recovery, and
concurrency control mechanisms
▪ Requires only two operations in data accessing:
▪ initial loading of data and access of data.
OLTP
▪ OLTP: Online Transaction Processing
▪ Special data organization, access methods and
implementation methods are needed to support data
warehouse queries (typically multidimensional queries)
▪ OLTP systems are tuned for known transactions and
workloads while workload is not known a priori in a data
warehouse
– e.g., average amount spent on phone calls between 9AM-5PM
in Dhaka during the month of December
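The example query above might look like this in SQL. The sketch runs it against an in-memory SQLite database; the `calls` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table: one row per phone call.
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Dhaka", "2023-12-01 09:30:00", 10.0),       # matches
        ("Dhaka", "2023-12-15 16:59:00", 20.0),       # matches
        ("Dhaka", "2023-12-15 18:00:00", 100.0),      # outside 9AM-5PM
        ("Dhaka", "2023-11-15 10:00:00", 50.0),       # not December
        ("Chittagong", "2023-12-10 10:00:00", 70.0),  # different city
    ],
)

# Aggregate over three dimensions at once: location, month, and time of day.
query = """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Dhaka'
      AND strftime('%m', started_at) = '12'
      AND time(started_at) >= '09:00:00'
      AND time(started_at) <  '17:00:00'
"""
avg_amount = conn.execute(query).fetchone()[0]
print(avg_amount)  # 15.0
```

An ad hoc aggregate like this scans and groups large portions of historical data, a different access pattern from the short, predefined transactions an OLTP system is tuned for.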
OLTP vs Data Warehouse
Database and Data Warehousing
▪ The Difference:
▪ A DWH constitutes the entire information base for all time.
▪ A database constitutes real-time information.
▪ A DWH supports DM and business intelligence.
▪ A database is used for running the business.
▪ A DWH is used for deciding how to run the business.
Data Warehouse Architecture
▪ Data is selected from various sources, then integrated and
stored in a single, consistent format.
▪ Data warehouses contain current detailed data, historical
detailed data, lightly and highly summarized data, and
metadata.
▪ Current and historical data are voluminous because they are
stored at the highest level of detail.
▪ Lightly and highly summarized data are necessary to save
processing time when users request them and are readily
accessible.
▪ Metadata are “data about data”. They are important for designing,
constructing, retrieving, and controlling the warehouse data.
Disadvantages of data warehouses
▪ Data warehouses are not the optimal environment for
unstructured data.
▪ Because data must be extracted, transformed and loaded into
the warehouse, there is an element of latency in data
warehouse data.
▪ Over their life, data warehouses can have high costs.
Maintenance costs are high.
▪ Data warehouses can get outdated relatively quickly. There is a
cost of delivering suboptimal information to the organization.
▪ There is often a fine line between data warehouses and
operational systems. Duplicate, expensive functionality may be
developed. Or, functionality may be developed in the data
warehouse that, in retrospect, should have been developed in
the operational systems and vice versa.