1 Introduction

15CSE401 Machine Learning and Data Mining 1

15CSE401 Machine Learning and Data Mining
Lecture 1,2,3
Course Information
An Introduction to Data Mining

Nalinadevi Kadiresan
CSE Dept.

Amrita School of Engg.


July 2020 Nalinadevi Kadiresan

Course Schedule
● Course code: 15CSE401
● Title: Machine Learning and Data Mining
● Semester: 7
● Batch: CSE-C
● Slots:
○ Tuesday: Slot-1 (9-10)
○ Tuesday: Slot-9 (18-19) (Discussion)
○ Wednesday: Slot-2 (10-11)
○ Saturday: Slot-4 (12-13)


Course Objective
• To provide in-depth knowledge about data mining.
• To implement the machine learning models in data mining problems.
• To improve the understanding of the on-going research.


Course Outcome
Course Outcome                                                                                             BTL
CO1  Understand the fundamental concepts of data mining and the basic theory underlying machine learning   L2
CO2  Understand the types of data to be mined and apply pre-processing methods                             L3
CO3  Apply appropriate classification and clustering techniques for real-world applications                L3
CO4  Analyze the performance of various classification and clustering techniques                           L4
CO5  Apply and evaluate the interesting patterns discovered from association mining                        L4


Course Syllabus
Unit 1: Introduction to Machine learning: Supervised learning, Unsupervised
learning, some basic concepts in machine learning, Review of probability,
Computational Learning theory. Bayesian concept learning, Likelihood, Posterior
predictive distribution, Naive Bayes classifiers, The log-sum-exp trick, Feature
selection using mutual information, Linear Regression, Logistic regression.
Unit 2: Introduction to data mining - challenges and tasks, measures of
similarity and dissimilarity, Classification - Rule based classifier, Nearest-
neighbour classifiers - Bayesian classifiers - decision trees; support vector
machines, class imbalance problem, performance evaluation of the classifier,
comparison of different classifiers.
Unit 3: Association analysis – frequent itemset generation, rule generation,
evaluation of association patterns. Cluster analysis, K means algorithm, cluster
evaluation, application of data mining to web mining and Bioinformatics.
Classifying documents using bag of words, advertising on the Web,
Recommendation Systems, and Mining Social network graphs.
The topics in red were covered in 15CSE432.

Text Books and References


1. Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques”, Third Edition, Elsevier, 2012.
2. Kevin P. Murphy, “Machine Learning: A Probabilistic Perspective”, The MIT Press, 2012.
3. Tom Mitchell, “Machine Learning”, McGraw Hill, 1997.
4. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, “Introduction to Data Mining”, First Edition, Pearson Education, 2006.


Modified Course Content


● Data mining - Introduction, tasks and challenges
● Similarity and dissimilarity metrics
● Statistical concepts
o Distributions, P-value statistics
● Association Rule mining
o Apriori (frequent itemset generation & test)
o Projection-based (FP-growth)
o Vertical format approach (ECLAT)
o Evaluation of Association Rule Mining
● Classification
o Random Forest
o Bagging and Boosting
● Clustering
o DBSCAN
o Fuzzy clustering
o Hierarchical clustering
o Cluster evaluation
● Applications
o Recommender systems
o Web mining
o Bioinformatics
o Classifying documents
o Mining social network graphs

Evaluation Pattern
20 week semester                                                  Marks   Weight   BTL
Internal: 70%      Quizzes: min 1 quiz per week                   20      70%      Knowledge, Comprehension
                   Assignments: minimum 1 assignment per unit     30               Application, Analysis
                   Project: 1 project per semester                20               Synthesis, Evaluation
End Semester: 30%  Online End Sem exam (15%)                      15      30%
                   Viva 15% (Mandatory)                           15

Course Delivery
● Online classes (MS Teams)
○ Live lectures 20 to 30 minutes
○ Discussion
○ Quiz
● Tutorial/Discussion hour (MS Teams)
● Weekly Quizzes (AMPLE / AUMS)
● Assignments (Problems and programming)
● Case study (Group)
● Course Repository - AMPLE


General Comments

● Keep a separate course notebook
● Active participation expected – in class, chats and discussion forums


Motivation: Why Data Mining??

• Data explosion problem
– Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution: Data warehousing and data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge
(e.g. in Customer Relationship Management)


Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation

Examples

[Four slides of example figures; images not preserved]


What is common in all??

● We are living in the ‘BIG DATA’ age
○ Data is wealth

Review of Lecture-1

• To derive knowledge from raw data for decision making in various business and scientific applications.
• The knowledge inference is challenging due to the nature of data.
– That is, Volume, Variety, Velocity, and Veracity – Big Data


What Is Data Mining?

• Data mining (knowledge discovery from data):
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
• Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge Discovery (mining) from Data (KDD)
– Knowledge Extraction
– Data/Pattern Analysis
– Data Archeology
– Business Intelligence


Data Mining Definition

• Finding hidden information in a database
• Fit data to a model
• Similar terms
Exploratory data analysis
Data driven discovery
Deductive learning


Applications of Data Mining


• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, market segmentation
– Fraud detection and detection of unusual patterns (outliers)
– Risk analysis and management
• Forecasting, customer retention, quality control, competitive
analysis
• Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– Bioinformatics and bio-data analysis

Market Analysis and Management


• Where does the data come from?
– Credit card transactions, discount coupons, customer
complaint calls
• Target marketing
– Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits,
etc.
– Determine customer purchasing patterns over time
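
The target-marketing idea above (finding clusters of customers who share characteristics) can be sketched with a plain k-means loop. This is only an illustration: the customer records (income, monthly spend), the choice of two clusters and the starting centroids are all made up and are not from the lecture.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (income in thousands, monthly spend in hundreds)
customers = [(20, 2), (22, 3), (25, 2), (80, 15), (85, 18), (90, 16)]
centroids, clusters = kmeans(customers, centroids=[(20, 2), (90, 16)])
print(clusters)
```

Each resulting cluster is a candidate group of "model" customers; in practice the number of clusters and the features used would have to be chosen for the data at hand.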


Market Analysis and Management


• Cross-market analysis
– Associations/correlations between product sales, and prediction based on such associations
• Customer profiling
– What types of customers buy what products
• Customer requirement analysis
– Identifying the best products for different customers
– Predict what factors will attract new customers


Examples: What is (not) Data Mining?

• What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”

• What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)


Database Processing vs. Data Mining Processing

Database Processing
• Query
– Well defined
– SQL
• Data
– Operational data
• Output
– Precise
– Subset of database

Data Mining Processing
• Query
– Poorly defined
– No precise query language
• Data
– Not operational data (Analytical Data)
• Output
– Fuzzy
– Not a subset of database


Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the
last month.
– Find all customers who have purchased milk

• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (Clustering)
– Find all items which are frequently purchased with milk.
(association rules)
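
The milk example above can be sketched in a few lines to contrast the two kinds of question. The transactions and the 0.5 confidence threshold below are made-up assumptions for illustration; the sketch only counts co-occurrences and is not a full association rule miner such as Apriori.

```python
from collections import Counter

# Hypothetical transactions; each set holds the items in one purchase.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "cereal"},
]

# Database-style query: well defined, and the answer is a subset of the stored data.
bought_milk = [t for t in transactions if "milk" in t]

# Data-mining-style question: which items are frequently purchased with milk?
co_counts = Counter(item for t in bought_milk for item in t if item != "milk")
min_confidence = 0.5  # assumed threshold, not from the lecture
frequent_with_milk = {
    item: count / len(bought_milk)  # confidence of the rule milk -> item
    for item, count in co_counts.items()
    if count / len(bought_milk) >= min_confidence
}
print(frequent_with_milk)
```

The first result is a list of stored records; the second is a set of derived patterns that appears nowhere in the database itself.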


Knowledge Discovery (KDD) Process

• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process

[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]


Knowledge Discovery (KDD) Process
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from
the database)
4. Data transformation (where data are transformed and consolidated into
forms appropriate for mining by performing summary or aggregation
operations)
5. Data mining (an essential process where intelligent methods are applied to
extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge based on interestingness measure)
7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined knowledge to users)
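
The seven steps above can be sketched as a chain of small functions. Everything in this sketch is hypothetical: the record fields (`customer`, `region`, `amount`), the two data sources and the thresholds are invented for illustration, and each function is a toy stand-in for the corresponding step rather than a real mining method.

```python
from collections import Counter

def clean(records):
    # 1. Data cleaning: drop records with a missing customer or amount
    return [r for r in records if r.get("customer") and r.get("amount") is not None]

def integrate(*sources):
    # 2. Data integration: combine multiple sources into one collection
    return [r for source in sources for r in source]

def select(records, region):
    # 3. Data selection: keep only the data relevant to the analysis task
    return [r for r in records if r["region"] == region]

def transform(records):
    # 4. Data transformation: aggregate total spend per customer
    totals = Counter()
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

def mine(totals, threshold):
    # 5. Data mining (toy stand-in): customers whose total spend exceeds a threshold
    return {c: t for c, t in totals.items() if t > threshold}

def evaluate(patterns, min_size):
    # 6. Pattern evaluation: keep the result only if it covers enough customers
    return patterns if len(patterns) >= min_size else {}

def present(patterns):
    # 7. Knowledge presentation: format the result for the user
    return [f"{c}: {t}" for c, t in sorted(patterns.items())]

store = [
    {"customer": "A", "region": "south", "amount": 120},
    {"customer": "B", "region": "south", "amount": 40},
    {"customer": None, "region": "south", "amount": 10},
]
web = [
    {"customer": "A", "region": "south", "amount": 60},
    {"customer": "C", "region": "north", "amount": 300},
    {"customer": "B", "region": "south", "amount": None},
]

data = integrate(clean(store), clean(web))
report = present(evaluate(mine(transform(select(data, "south")), threshold=100), min_size=1))
print(report)
```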


Typical framework of a data warehouse

[Figure not preserved]

Architecture of a Data mining system

[Figure not preserved]


What Kinds of Patterns can be Mined?


• Data mining functionalities are used to specify the kinds of
patterns to be found in data mining tasks.

• Data mining functionalities
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted

• Data mining tasks
– Descriptive data mining - characterize properties of the data in a target data set
– Predictive data mining - perform induction on the current data in order to make predictions


Data Mining Functionalities


• Databases to be mined
– Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
analysis, Web mining, Weblog analysis, etc.

Data Mining Tasks


• Description Tasks
– Find human-interpretable patterns that describe the data
• Prediction Tasks
– Use some variables to predict unknown or future values
of other variables

Common data mining tasks
– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]
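
Two of the predictive tasks listed above, classification and regression, can be sketched with the standard library alone. The training points, labels and numbers below are invented for illustration; nearest-neighbour and a least-squares line are used only because they are short, not because the slide prescribes them.

```python
from math import dist

def nearest_neighbour(train, query):
    """Classification: predict the label of the closest labelled point."""
    _, label = min(train, key=lambda pair: dist(pair[0], query))
    return label

def fit_line(xs, ys):
    """Regression: least-squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return a, mean_y - a * mean_x

# Hypothetical labelled applicants described by two numeric features
train = [((1.0, 1.0), "low risk"), ((1.2, 0.8), "low risk"), ((4.0, 4.5), "high risk")]
label = nearest_neighbour(train, (3.8, 4.0))

# Hypothetical numeric data that happens to lie exactly on y = 2x + 1
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(label, a, b)
```

Both functions use known values of some variables to predict an unknown one: a class label in the first case, a number in the second.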

Data Mining Models and Tasks

[Figure not preserved]


KDD Process: A View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

Data Pre-Processing: Data integration, Normalization, Feature selection, Dimension reduction
Data Mining: Pattern discovery, Classification, Clustering, Outlier analysis, …………
Post-Processing: Pattern evaluation, Pattern selection, Pattern interpretation, Pattern visualization

• This is a view from typical machine learning and statistics communities
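
Two of the stages named in this pipeline, normalization (pre-processing) and outlier analysis (mining), can be sketched as follows. Min-max scaling and a z-score rule are assumed here as simple representatives of those stages; the readings and the threshold of 2.0 are made up for illustration.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values to the range [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def zscore_outliers(values, threshold=2.0):
    """Flag values more than `threshold` population standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10, 12, 11, 13, 12, 11, 50]  # hypothetical sensor readings
normalized = min_max_normalize(readings)
outliers = zscore_outliers(readings)
print(normalized, outliers)
```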

Major Issues in Data Mining

• Mining Methodology: includes mining various
– knowledge kinds,
– multidimensional data,
– exploring domain specific mining,
– data with uncertainty, noise, and incompleteness,
– user-constraint guided mining.
