
Data Mining

Course No: CSE 4221

Topic 1: Data Mining and Applications


Data is Everywhere
▪ There has been enormous data growth in both
commercial and scientific databases due to advances in
data generation and collection technologies
▪ New mantra
▪ Gather whatever data you can whenever and wherever
possible.
▪ Expectations
▪ Gathered data will have value either for the purpose
collected or for a purpose not envisioned.
Data is Everywhere

▪ Examples: cyber security, e-commerce, social networking (e.g., Twitter), traffic patterns, sensor networks, computational simulations
Types of data
▪ Numeric data: Each object is a point in a
multidimensional space
▪ Categorical data: Each object is a vector of categorical
values
▪ Set data: Each object is a set of values (with or without
counts)
▪ Sets can also be represented as binary vectors, or vectors
of counts
▪ Ordered sequences: Each object is an ordered
sequence of values.
▪ Graph data
What can you do with the data?
▪ Suppose that you are the owner of a supermarket and
you have collected billions of market basket data. What
information would you extract from it and how would you
use it?

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Possible uses: product placement, catalog creation, recommendations

❑ What if this was an online store?
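As a hedged illustration of the kind of analysis this question invites, the minimal Python sketch below counts which item pairs are bought together in the toy transactions above; the item names mirror the table and are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Toy transactions mirroring the table above (illustrative only)
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# Count how often each pair of items is bought together
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pairs suggest product placement / recommendation candidates
print(pair_counts.most_common(3))
```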


What can you do with the data?
▪ Suppose you are a search engine and you have a toolbar log consisting of
▪ pages browsed,
▪ queries,
▪ pages clicked,
▪ ads clicked,
each with a user id and a timestamp. What information would you like to get out of the data?

Possible uses: ad click prediction, query reformulations
What can you do with the data?
▪ Suppose you are a biologist who has microarray expression
data: thousands of genes, and their expression values
over thousands of different settings (e.g. tissues). What
information would you like to get out of your data?

Groups of genes and tissues


What can you do with the data?
▪ You are the owner of a social network, and you have full
access to the social graph, what kind of information do
you want to get out of your graph?

▪ Who is the most important node in the graph?


▪ What is the shortest path between two nodes?
▪ How many friends do two nodes have in common?
▪ How does information spread on the network?
Why Data Mining?
▪ Cascade of data
▪ In the digital age, terabytes of data are generated every second
▪ Need to analyze the raw data to extract knowledge
▪ Data Mining helps extract such Knowledge
▪ The possibility to use computers to analyze data
▪ Large amounts of data can be more powerful than
complex algorithms and models
▪ Google has solved many Natural Language Processing
problems, simply by looking at the data
▪ Example: misspellings, synonyms
Why Data Mining? Commercial Viewpoint
▪ Lots of data is being collected and warehoused
▪ Web data
▪ Yahoo has petabytes of web data
▪ Facebook has billions of active users
▪ purchases at department/ grocery stores, e-commerce
▪ Amazon handles millions of visits/day
▪ Bank/Credit Card transactions
▪ Data has become the key competitive advantage of
companies
▪ Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
▪ Being able to extract useful information out of the data is
key to exploiting it commercially
Why Data Mining? Scientific Viewpoint
▪ Data collected and stored at enormous speeds
▪ remote sensors on a satellite
▪ NASA EOSDIS archives petabytes of earth science data per year
▪ telescopes scanning the skies
▪ Sky survey data

▪ High-throughput biological data


▪ scientific simulations
▪ terabytes of data generated in a few hours

▪ Data mining helps scientists


▪ in automated analysis of massive datasets
▪ In hypothesis formation
▪ We need the tools to analyze such data to get a better
understanding of the world and advance science
Why Data Mining? Scale
▪ Scale (in data size and feature dimension)
▪ Why not use traditional analytic methods?
▪ Enormity of data, curse of dimensionality
▪ The amount and complexity of the data do not allow for
manual processing of the data. We need automated
techniques.
Great opportunities to improve
productivity in all walks of life
Great Opportunities to Solve
Society’s Major Problems

▪ Improving health care and reducing costs
▪ Predicting the impact of climate change
▪ Finding alternative/green energy sources
▪ Reducing hunger and poverty by increasing agricultural production
Data Mining
Many Definitions:
▪ “Data mining is the analysis of (often large)
observational data sets to find unsuspected
relationships and to summarize the data in novel ways
that are both understandable and useful to the data
analyst” (Hand, Mannila, Smyth)
▪ Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
▪ Exploration & analysis, by automatic or semi-
automatic means, of large quantities of data in order to
discover
meaningful patterns
Data Mining
Data Mining:
▪ Process of semi-automatically analyzing large databases
to find patterns that are:
▪ valid: hold on new data with some certainty
▪ novel: non-obvious to the system
▪ useful: should be possible to act on the pattern
▪ understandable: humans should be able to interpret the
pattern
▪ Also known as Knowledge Discovery in Databases
(KDD)
Data Mining
Data Mining process
▪ Problem formulation
▪ Data collection
▪ subset data: sampling might hurt if the data is highly skewed
▪ feature selection: principal component analysis, heuristic search
▪ Pre-processing: cleaning (see the sketch after this list)
▪ name/address cleaning, reconciling terms with the same meaning (annual, yearly),
duplicate removal, supplying missing values
▪ Transformation:
▪ map complex objects e.g. time series data to features e.g.
frequency
▪ Choosing mining task and mining method:
▪ Result evaluation and Visualization:
Knowledge discovery is an iterative process
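To make the cleaning step concrete, here is a minimal sketch (referenced in the list above) using pandas on a small hypothetical customer table; the column names and values are illustrative assumptions, not part of the course material.

```python
import pandas as pd

# Hypothetical raw customer data with the kinds of problems listed above
df = pd.DataFrame({
    "name":   ["Alice", "alice ", "Bob", None],
    "period": ["annual", "yearly", "annual", "yearly"],
    "income": [50_000, 50_000, None, 42_000],
})

# Cleaning: normalize names, reconcile different terms with the same meaning
df["name"] = df["name"].str.strip().str.title()
df["period"] = df["period"].replace({"yearly": "annual"})

# Duplicate removal and supplying missing values
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```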
KDD process
What is (not) Data Mining?
▪ What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

▪ What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
Origins of Data Mining
▪ Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
▪ Traditional techniques may be unsuitable due to data
that is
▪ Large-scale
▪ High dimensional
▪ Heterogeneous
▪ Complex
▪ Distributed

▪ A key component of the emerging field of data science and


data-driven discovery
Origins of Data Mining

[Diagram: Data Mining and Knowledge Discovery sits at the intersection of Machine Learning, Statistics, Databases, and Visualization]
Statistics
▪ Discovery of structures or patterns in data sets
▪ hypothesis testing, parameter estimation
▪ Statistics: optimal strategies for collecting data; data mining: efficient search of large databases
▪ Statistics: static data; data mining: constantly evolving data
▪ Statistics: models play a central role; data mining: algorithms are a major concern and patterns are sought
Relational Databases
▪ A relational database can contain several tables
▪ Tables and schemas
▪ The goal in data organization is to maintain data and
quickly locate the requested data
▪ Queries and index structures
▪ Query execution and optimization
▪ Query optimization finds the “best” possible evaluation method for a given query
▪ Providing fast, reliable access to data for data mining
Artificial Intelligence
▪ Intelligent agents
▪ Perception-Action-Goal-Environment
▪ Search
▪ Uniform cost and informed search algorithms
▪ Knowledge representation
▪ FOL, production rules, frames with semantic networks
▪ Knowledge acquisition
▪ Knowledge maintenance and application
Machine Learning
▪ Focusing on complex representations, data-intensive
problems, and search-based methods
▪ Flexibility with prior knowledge and collected data
▪ Generalization from data and empirical validation
▪ statistical soundness and computational efficiency
▪ constrained by finite computing & data resources
▪ Challenges from KDD
▪ scaling up, cost info, auto data preprocessing, more
knowledge types
Visualization
▪ Producing a visual display with insights into the
structure of the data with interactive means
▪ zoom in/out, rotating, displaying detailed info
▪ Various types of visualization methods
▪ show summary properties and explore relationships
between variables
▪ investigate large DBs and convey lots of information
▪ analyze data with geographic/spatial location
▪ A pre- and post-processing tool for KDD
Applications of Data Mining
▪ Banking: loan/credit card approval
▪ predict good customers based on old customers
▪ Customer relationship management:
▪ identify those who are likely to leave for a competitor.
▪ Targeted marketing:
▪ identify likely responders to promotions
▪ Fraud detection: telecommunications, financial
transactions
▪ from an online stream of events, identify fraudulent events
▪ Manufacturing and production:
▪ automatically adjust knobs when process parameters change
Applications of Data Mining
▪ Medicine: disease outcome, effectiveness of treatments
▪ analyze patient disease history: find relationship between
diseases
▪ Molecular/Pharmaceutical: identify new drugs
▪ Scientific data analysis:
▪ identify new galaxies by searching for sub clusters
▪ Web site/store design and promotion:
▪ find affinity of visitor to pages and modify layout
▪ So on………….
Database Processing vs. Data
Mining Processing

Database:
▪ Query: well defined, expressed in SQL
▪ Output: precise; a subset of the database

Data Mining:
▪ Query: poorly defined; no precise query language
▪ Output: fuzzy; not a subset of the database
Database Processing vs. Data
Mining Processing
▪ Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than
$10,000 in the last month.
– Find all customers who have purchased bread

▪ Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased with
bread. (association rules)
Goals of Data Mining and KDD
▪ Prediction: how certain attributes within the data will
behave in the future.

▪ Identification: identify the existence of an item, an


event, an activity.

▪ Classification: partition the data into categories.

▪ Optimization: optimize the use of limited resources.


Data Mining Tasks
▪ Prediction Methods
▪ Use some variables to predict unknown or future
values of other variables.

▪ Description Methods
▪ Find human-interpretable patterns that describe the
data.
Data Mining Tasks

Example data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Basic Operations of Data Mining
▪ Prediction Methods:
▪ Regression
▪ Classification
▪ Collaborative Filtering

▪ Description Methods:
▪ Clustering / similarity matching
▪ Frequent Item sets, Association rules and variants
▪ Deviation detection
Predictive: Regression
▪ Predict a value of a given continuous valued variable
based on the values of other variables, assuming a linear
or nonlinear model of dependency.
▪ Extensively studied in statistics, neural network fields.
▪ Examples:
▪ Predicting sales amounts of new product based on
advertising expenditure.
▪ Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
▪ Time series prediction of stock market indices.
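The following is a minimal sketch of such a regression, assuming scikit-learn is available; the advertising-expenditure and sales numbers are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: advertising expenditure (in $1000s) vs. sales amounts
ad_spend = np.array([[10], [20], [30], [40], [50]])
sales    = np.array([25, 43, 61, 78, 98])

# Fit a linear model of the dependency between the variables
model = LinearRegression().fit(ad_spend, sales)

# Predict the sales amount for a new advertising budget
print(model.predict([[35]]))
print(model.coef_, model.intercept_)  # fitted slope and intercept
```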
Classification and Prediction
▪ Classification is the process of learning a model that
describes different classes of data, the classes are
predetermined
▪ Given a collection of records (training set )
▪ Each record contains a set of attributes, one of the attributes
is the class.
▪ Find a model for class attribute as a function of the values
of other attributes.
▪ Goal: previously unseen records should be assigned a
class as accurately as possible.
▪ A test set is used to determine the accuracy of the model. Usually,
the given data set is divided into training and test sets, with
training set used to build the model and test set used to validate it.
Predictive: Classification
▪ Find a model for the class attribute as a function of the values of other attributes

Training data:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Model for predicting credit worthiness (decision tree):
Employed?
  No  -> not credit worthy
  Yes -> Level of Education?
           Graduate                -> credit worthy if # years at present address > 3, otherwise not
           High School, Undergrad  -> credit worthy if # years at present address > 7, otherwise not
Classification Example
Training set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

Learn a model (classifier) from the training set, then apply it to the test set to assign the unknown class labels.
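A minimal sketch of this learn-then-apply workflow, assuming scikit-learn and a hand-made numeric encoding of the slide's attributes (the encoding itself is an assumption added for illustration, not part of the course material):

```python
from sklearn.tree import DecisionTreeClassifier

# Training records from the slide, hand-encoded:
# employed (1/0), education (0=High School, 1=Undergrad, 2=Graduate), years at address
X_train = [[1, 2, 5], [1, 0, 2], [0, 1, 1], [1, 0, 10]]
y_train = ["Yes", "No", "No", "Yes"]   # credit worthy?

# Learn a model from the training set
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Apply the classifier to the test records whose class labels are unknown ("?")
X_test = [[1, 1, 7], [0, 2, 3], [1, 0, 2]]
print(clf.predict(X_test))
```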
Examples of Classification Task
▪ Classifying credit card transactions
as legitimate or fraudulent
▪ Classifying land covers (water bodies, urban
areas, forests, etc.) using satellite data
▪ Categorizing news stories as finance,
weather, entertainment, sports, etc
▪ Identifying intruders in the cyberspace
▪ Predicting tumor cells as benign or malignant
▪ Classifying secondary structures of protein
as alpha-helix, beta-sheet, or random coil
Predictive: Collaborative Filtering
▪ Given database of user preferences, predict preference
of new user
▪ Example: predict what new movies you will like based
on
▪ your past preferences
▪ others with similar past preferences
▪ their preferences for the new movies
▪ Example: predict what books/CDs a person may want to
buy
▪ (and suggest it, or give discounts to tempt customer)
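One simple way to realize this idea (by no means the only one) is user-based collaborative filtering with cosine similarity; the rating matrix below is hypothetical.

```python
import numpy as np

# Hypothetical user-movie rating matrix (0 = not rated)
ratings = np.array([
    [5, 4, 0, 1],   # existing user A
    [4, 5, 1, 0],   # existing user B
    [1, 0, 5, 4],   # existing user C
    [5, 0, 0, 0],   # new user: has only rated the first movie
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

new_user = ratings[-1]
others = ratings[:-1]

# Weight each existing user's ratings by their similarity to the new user
sims = np.array([cosine(new_user, u) for u in others])
predicted = sims @ others / sims.sum()

print(predicted)   # higher scores suggest movies to recommend to the new user
```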
Descriptive: Cluster Analysis
▪ The previous data mining task of classification deals with
partitioning data based on a pre-classified training
sample
▪ Clustering is an automated process to group related
records together.
▪ Related records are grouped together on the basis of
having similar values for attributes
▪ The groups are usually disjoint
▪ Given a set of data points, each having a set of attributes,
and a similarity measure among them, find clusters such
that
▪ Data points in one cluster are more similar to one another.
▪ Data points in separate clusters are less similar to one another.
Cluster Analysis
▪ Similarity Measures?
▪ Euclidean Distance if attributes are continuous.
▪ Other Problem-specific Measures

Intra-cluster distances are minimized; inter-cluster distances are maximized.
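As a minimal sketch, k-means clustering with Euclidean distance (here via scikit-learn, with made-up two-dimensional customer attributes) groups related records so that intra-cluster distances stay small:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 2-D customer attributes (e.g., age, annual spend in $1000s)
points = np.array([
    [25, 20], [27, 22], [24, 19],     # one group of similar customers
    [55, 60], [58, 62], [60, 58],     # another group
])

# Group related records so intra-cluster (Euclidean) distances are minimized
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster assignment for each record
print(kmeans.cluster_centers_)  # prototypes of the discovered segments
```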
Clustering: Application 1
▪ Market Segmentation:
▪ Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
▪ Approach:
▪ Collect different attributes of customers based on their
geographical and lifestyle related information.
▪ Find clusters of similar customers.
▪ Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
Clustering: Application 2
▪ Document Clustering:

▪ Goal: To find groups of documents that are similar to


each other based on the important terms appearing in
them.

▪ Approach: To identify frequently occurring terms in


each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
Descriptive: Frequent Item sets
and Association Rules
▪ Given a set of records each of which contain some number
of items from a given collection;
▪ Identify sets of items (itemsets) occurring frequently together
▪ Produce dependency rules which will predict occurrence of an
item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Itemsets discovered: {Milk, Coke}, {Diaper, Milk}
Rules discovered: {Milk} --> {Coke}, {Diaper, Milk} --> {Beer}
Definition of Association Rule
▪ An association rule is an implication of the form X --> Y, where X and Y are disjoint itemsets
▪ Support of X --> Y: the fraction of transactions that contain both X and Y
▪ Confidence of X --> Y: the fraction of transactions containing X that also contain Y
Association Rule Mining
▪ Two-step approach:
1. Generate all frequent itemsets (sets of items whose
support ≥ minsup)
2. Generate high confidence association rules from each
frequent itemset
▪ Each rule is a binary partitioning of a frequent itemset
▪ Frequent itemset generation is the more expensive
operation
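The sketch below follows this two-step structure in plain Python on the toy transactions from the earlier slide; it uses brute-force enumeration rather than the pruning of the full Apriori algorithm, and the minsup/minconf thresholds are chosen only for illustration.

```python
from itertools import combinations

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
minsup, minconf = 0.4, 0.6   # illustrative thresholds

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: generate all frequent itemsets (support >= minsup)
items = sorted(set().union(*transactions))
frequent = [set(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(set(c)) >= minsup]

# Step 2: generate high-confidence rules by binary-partitioning each frequent itemset
for itemset in (f for f in frequent if len(f) > 1):
    for k in range(1, len(itemset)):
        for lhs in map(set, combinations(itemset, k)):
            rhs = itemset - lhs
            conf = support(itemset) / support(lhs)
            if conf >= minconf:
                print(f"{lhs} --> {rhs}  (support={support(itemset):.2f}, confidence={conf:.2f})")
```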
Frequent Item sets: Applications
▪ Text mining: finding associated phrases in text
▪ There are lots of documents that contain the phrases
“association rules”, “data mining” and “efficient algorithm”

▪ Recommendations:
▪ Users who buy this item often buy that item as well
▪ Users who watched James Bond movies, also watched Jason
Bourne movies.
Association Analysis: Applications
▪ Market-basket analysis
▪ Rules are used for sales promotion, shelf management,
and inventory management

▪ Telecommunication alarm diagnosis


▪ Rules are used to find combinations of alarms that occur
together frequently in the same time period

▪ Medical Informatics
▪ Rules are used to find combinations of patient symptoms
and test results associated with certain diseases
Deviation/Anomaly/Change Detection
▪ Detect significant deviations from
normal behavior
▪ Applications:
▪ Credit Card Fraud Detection
▪ Network Intrusion Detection
▪ Identify anomalous behavior from
sensor networks for monitoring and
surveillance.
▪ Detecting changes in the global forest
cover.
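A minimal sketch of one common technique for this task (an Isolation Forest from scikit-learn, not necessarily the method the course will use), applied to made-up transaction amounts:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative transaction amounts: mostly normal behavior plus a few extreme values
amounts = np.array([[12], [15], [14], [13], [16], [500], [11], [14], [480]])

# Fit an anomaly detector; contamination is the assumed fraction of outliers
detector = IsolationForest(contamination=0.25, random_state=0).fit(amounts)

# -1 marks points that deviate significantly from normal behavior
print(detector.predict(amounts))
```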
Trend and Evolution Analysis
▪ Describes and models regularities or trends for objects
whose behavior changes over time
▪ Sequential pattern mining
▪ Cross selling:
digital camera → large memory card
▪ Stock market
Major Challenges in Data Mining
▪ Efficiency and scalability of data mining algorithms
▪ Parallel, distributed, stream, and incremental mining methods
▪ Handling high-dimensionality
▪ Handling noise, uncertainty, and incompleteness of data
▪ Incorporation of constraints, expert knowledge, and
background knowledge in data mining
▪ Pattern evaluation and knowledge integration
▪ Mining diverse and heterogeneous kinds of data: e.g.,
bioinformatics, Web, software/system engineering, information
networks
▪ Application-oriented and domain-specific data mining
▪ Invisible data mining (embedded in other functional modules)
▪ Protection of security, integrity, and privacy in data mining
What is a Data Warehouse?

A single, complete and consistent store of data obtained from a
variety of different sources, made available to end users in a way
they can understand and use in a business context.
What is a Data Warehouse?

A process of transforming
data into information and
making it available to users in
a timely enough manner to
make a difference
Data Warehousing – A Process
▪ It is a relational or multidimensional database
management system designed to support management
decision making.
▪ A data warehouse is a copy of transaction data
specifically structured for querying and reporting.
▪ A technique for assembling and managing data from
various sources for the purpose of answering business
questions, thus enabling decisions that were not previously
possible.
Data Warehouses
▪ Defined in many different ways, but not rigorously.
▪ A decision support database that is maintained separately
from the organization’s operational database
▪ Support information processing by providing a solid platform
of consolidated, historical data for analysis.
▪ “A data warehouse is a subject-oriented, integrated, time-
variant, and non-volatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
▪ Data warehousing: The process of constructing and using
data warehouses
Data Warehouses
▪ A data warehouse thus does not simply contain data accumulated
at a central point; rather, the data is carefully assembled
from a variety of information sources around the
organization, cleaned up, quality assured, and then
released (published).
▪ A Data Warehouse is a repository of integrated
information, available for queries and analysis. Data and
information are extracted from heterogeneous sources as
they are generated....This makes it much easier and more
efficient to run queries over data that originally came from
different sources.
▪ The goal of data warehousing is to support decision
making with data!
Data Warehouse—Subject Oriented
▪ Organized around major subjects, such as customer,
product, sales.
▪ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.
▪ Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process.
Data Warehouse—Integrated
▪ Constructed by integrating multiple, heterogeneous data
sources
▪ relational databases, flat files, on-line transaction records
▪ Data cleaning and data integration techniques are
applied.
▪ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
▪ E.g., Hotel price: currency, tax, breakfast covered, etc.
▪ When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
▪ The time horizon for the data warehouse is significantly
longer than that of operational systems.
▪ Operational database: current value data.
▪ Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
▪ Every key structure in the data warehouse
▪ Contains an element of time, explicitly or implicitly
▪ But the key of operational data may or may not contain “time
element”.
Data Warehouse—Non-Volatile
▪ A physically separate store of data transformed from the
operational environment.
▪ Operational update of data does not occur in the data
warehouse environment.
▪ Does not require transaction processing, recovery, and
concurrency control mechanisms
▪ Requires only two operations in data accessing:
▪ initial loading of data and access of data.
OLTP
▪ OLTP- ONLINE TRANSACTION PROCESSING
▪ Special data organization, access methods and
implementation methods are needed to support data
warehouse queries (typically multidimensional queries)
▪ OLTP systems are tuned for known transactions and
workloads while workload is not known a priori in a data
warehouse
– e.g., average amount spent on phone calls between 9AM-5PM
in Dhaka during the month of December
OLTP vs Data Warehouse

Database and Data Warehousing
▪ The Difference:
▪ A DWH constitutes the entire information base, across all time.
▪ A database constitutes real-time information.
▪ A DWH supports data mining and business intelligence.
▪ A database is used to run the business.
▪ A DWH helps decide how to run the business.
Data Warehouse Architecture
▪ Data is selected from various sources, then integrated and stored in a single, consistent format.
▪ Data warehouses contain current detailed data, historical
detailed data, lightly and highly summarized data, and
metadata.
▪ Current and historical data are voluminous because they are
stored at the highest level of detail.
▪ Lightly and highly summarized data are necessary to save
processing time when users request them and are readily
accessible.
▪ Metadata are “data about data”. They are important for designing,
constructing, retrieving, and controlling the warehouse data.
Disadvantages of data warehouses
▪ Data warehouses are not the optimal environment for
unstructured data.
▪ Because data must be extracted, transformed and loaded into
the warehouse, there is an element of latency in data
warehouse data.
▪ Over their life, data warehouses can have high costs.
Maintenance costs are high.
▪ Data warehouses can get outdated relatively quickly. There is a
cost of delivering suboptimal information to the organization.
▪ There is often a fine line between data warehouses and
operational systems. Duplicate, expensive functionality may be
developed. Or, functionality may be developed in the data
warehouse that, in retrospect, should have been developed in
the operational systems and vice versa.
