15CSE401 Machine Learning and Data Mining 1
15CSE401 Machine Learning and Data
Mining
Lectures 1, 2, 3
Course Information
An Introduction to Data Mining
Nalinadevi Kadiresan
CSE Dept.
Amrita School of Engineering
July 2020 Nalinadevi Kadiresan
Course Schedule
● Course code: 15CSE401
● Title: Machine Learning and Data Mining
● Semester: 7
● Batch: CSE-C
● Slots:
○ Tuesday: Slot 1 (9-10)
○ Tuesday: Slot 9 (18-19) (Discussion)
○ Wednesday: Slot 2 (10-11)
○ Saturday: Slot 4 (12-13)
Course Objective
• To provide in-depth knowledge of data mining.
• To implement machine learning models for data mining problems.
• To improve understanding of ongoing research.
Course Outcome
Each course outcome (CO) is listed with its BTL.
● CO1 (BTL: L2): Understand the fundamental concepts of data mining and the basic theory underlying machine learning.
● CO2 (BTL: L3): Understand the types of data to be mined and apply pre-processing methods.
● CO3 (BTL: L3): Apply appropriate classification and clustering techniques for real-world applications.
● CO4 (BTL: L4): Analyze the performance of various classification and clustering techniques.
● CO5 (BTL: L4): Apply and evaluate the interesting patterns discovered from association mining.
Course Syllabus
Unit 1: Introduction to Machine learning: Supervised learning, Unsupervised learning, some basic concepts in machine learning, Review of probability, Computational Learning theory. Bayesian concept learning, Likelihood, Posterior predictive distribution, Naive Bayes classifiers, The log-sum-exp trick, Feature selection using mutual information, Linear Regression, Logistic regression.
Unit 2: Introduction to data mining: challenges and tasks, measures of similarity and dissimilarity. Classification: Rule-based classifier, Nearest-neighbour classifiers, Bayesian classifiers, decision trees, support vector machines, Class imbalance problem, performance evaluation of the classifier, comparison of different classifiers.
Unit 3: Association analysis: frequent itemset generation, rule generation, evaluation of association patterns. Cluster analysis, K-means algorithm, cluster evaluation, application of data mining to web mining and Bioinformatics. Classifying documents using bag of words, advertising on the Web, Recommendation Systems, and Mining Social network graphs.
The topics in red were covered in 15CSE432.
Text Books and References
1. Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”, Third Edition, Elsevier, 2012.
2. Kevin P. Murphy, “Machine Learning: A Probabilistic Perspective”, The MIT Press, 2012.
3. Tom Mitchell, “Machine Learning”, McGraw Hill, 1997.
4. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, “Introduction to Data Mining”, First Edition, Pearson Education, 2006.
Modified Course Content
● Data mining: Introduction, tasks and challenges
● Similarity and dissimilarity metrics
● Statistical concepts
o Distributions, P-value statistics
● Association Rule mining
o Apriori (frequent itemset generation & test)
o Projection-based (FP-growth)
o Vertical format approach (ECLAT)
o Evaluation of Association Rule Mining
● Classification
o Random Forest
o Bagging and Boosting
● Clustering
o DBSCAN
o Fuzzy clustering
o Hierarchical clustering
o Cluster evaluation
● Applications
o Recommender systems
o Web mining
o Bioinformatics
o Classifying documents
o Mining social network graphs
Evaluation Pattern
20-week semester
● Internal (weight 70%)
○ Quizzes (minimum 1 quiz per week): 20 marks; BTL: Knowledge, Comprehension
○ Assignments (minimum 1 assignment per unit): 30 marks; BTL: Application, Analysis
○ Project (1 project per semester): 20 marks; BTL: Synthesis, Evaluation
● End Semester (weight 30%)
○ Online end-semester exam (15%): 15 marks
○ Viva (15%, mandatory): 15 marks
Course Delivery
● Online classes (MS Teams)
○ Live lectures 20 to 30 minutes
○ Discussion
○ Quiz
● Tutorial/Discussion hour (MS Teams)
● Weekly Quizzes (AMPLE / AUMS)
● Assignments (problems and programming)
● Case study (Group)
● Course Repository - AMPLE
General Comments
● Keep a separate course notebook
● Active participation is expected in class, in chats and in discussion forums
Motivation: Why Data Mining?
• Data explosion problem
– Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution: Data warehousing and data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– Remote sensors on a satellite
– Telescopes scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
Examples
What is common to all of these?
● We are living in the ‘Big Data’ age
○ Data is wealth
Review of Lecture-1
• The goal is to derive knowledge from raw data for decision making in various business and scientific applications.
• Knowledge inference is challenging due to the nature of the data.
– That is, the Volume, Variety, Velocity, and Veracity of Big Data
What Is Data Mining?
• Data mining (knowledge discovery from data):
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
• Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge Discovery (mining) from Data (KDD)
– Knowledge Extraction
– Data/Pattern Analysis
– Data Archeology
– Business Intelligence
Data Mining Definition
• Finding hidden information in a database
• Fit data to a model
• Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
Applications of Data Mining
• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, market segmentation
– Fraud detection and detection of unusual patterns (outliers)
– Risk analysis and management
• Forecasting, customer retention, quality control, competitive
analysis
• Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– Bioinformatics and bio-data analysis
Market Analysis and Management
• Where does the data come from?
– Credit card transactions, discount coupons, customer
complaint calls
• Target marketing
– Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits,
etc.
– Determine customer purchasing patterns over time
Market Analysis and Management
• Cross-market analysis
– Associations/correlations between product sales, and prediction based on such associations
• Customer profiling
– What types of customers buy what products
• Customer requirement analysis
– Identifying the best products for different customers
– Predict what factors will attract new customers
Examples: What is (not) Data Mining?
• What is not data mining?
– Looking up a phone number in a phone directory
– Querying a Web search engine for information about “Amazon”
• What is data mining?
– Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Database Processing vs. Data Mining Processing
• Database processing
– Query: well defined; expressed in SQL
– Data: operational data
– Output: precise; a subset of the database
• Data mining processing
– Query: poorly defined; no precise query language
– Data: not operational data (analytical data)
– Output: fuzzy; not a subset of the database
Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the
last month.
– Find all customers who have purchased milk
• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk.
(association rules)
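The contrast between the milk query and the milk mining question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions are made-up toy data, the helper names `baskets_containing` and `frequently_bought_with` are hypothetical, and the 0.5 confidence cut-off is an arbitrary choice, not a value from the course.

```python
from collections import Counter

# Hypothetical toy transactions (made up for illustration): one set of items per basket.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]

def baskets_containing(item, baskets):
    """Database-style query: return exactly the stored baskets that contain `item`."""
    return [b for b in baskets if item in b]

def frequently_bought_with(item, baskets, min_confidence=0.5):
    """Mining-style question: which items co-occur with `item` often enough?

    confidence(item -> other) = count(item and other) / count(item)
    The threshold min_confidence is an assumed, arbitrary cut-off.
    """
    with_item = baskets_containing(item, baskets)
    if not with_item:
        return {}
    counts = Counter(other for b in with_item for other in b if other != item)
    return {
        other: n / len(with_item)
        for other, n in counts.items()
        if n / len(with_item) >= min_confidence
    }

milk_baskets = baskets_containing("milk", transactions)   # a subset of the database
rules = frequently_bought_with("milk", transactions)      # derived, not stored anywhere
print(len(milk_baskets))
print(rules)
```

The query returns rows that already exist in the data, while the mining function returns a pattern (items with their confidence) that is not a subset of the database.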
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]
Knowledge Discovery (KDD) Process
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
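The seven steps can be read as a chain of small functions, each consuming the output of the previous one. The sketch below is a toy illustration only: the purchase records are made up, every function name is hypothetical, and the "mining" step is a trivial stand-in (comparing each customer's total with the mean) rather than a real mining algorithm.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical purchase records from two made-up sources; None marks a missing amount.
store_a = [{"customer": "c1", "amount": 120.0}, {"customer": "c2", "amount": None}]
store_b = [{"customer": "c1", "amount": 80.0}, {"customer": "c3", "amount": 40.0},
           {"customer": "c3", "amount": 60.0}]

def clean(records):
    """Step 1, data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, min_amount=50.0):
    """Step 3, data selection: keep only records relevant to the task (assumed cut-off)."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Step 4, data transformation: aggregate total spend per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Step 5, data mining (trivial stand-in): label spend relative to the mean."""
    average = mean(totals.values())
    return {c: ("high" if t > average else "low") for c, t in totals.items()}

def evaluate(patterns):
    """Step 6, pattern evaluation: keep only the 'high' spenders as interesting."""
    return sorted(c for c, label in patterns.items() if label == "high")

def present(customers):
    """Step 7, knowledge presentation: format the result for a reader."""
    return "High-value customers: " + ", ".join(customers)

data = integrate(clean(store_a), clean(store_b))
report = present(evaluate(mine(transform(select(data)))))
print(report)
```

Keeping each step as its own function mirrors the numbered list: any stage can be swapped for a more realistic method without touching the others.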
Typical framework of a data warehouse
Architecture of a Data mining system
What Kinds of Patterns can be Mined?
• Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
• Data mining functionalities
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
• Data mining tasks
– Descriptive data mining: characterize properties of the data in a target data set
– Predictive data mining: perform induction on the current data in order to make predictions
Data Mining Functionalities
• Databases to be mined
– Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
analysis, Web mining, Weblog analysis, etc.
Data Mining Tasks
• Description Tasks
– Find human-interpretable patterns that describe the data
• Prediction Tasks
– Use some variables to predict unknown or future values
of other variables
Common data mining tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
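To make the predictive/descriptive split concrete, the sketch below runs one task of each kind on the same toy numbers. It assumes made-up one-dimensional data, and the helpers `predict_nearest` (a 1-nearest-neighbour classifier) and `two_means` (a basic two-cluster k-means loop with a fixed min/max initialisation) are hypothetical simplifications chosen for brevity.

```python
# Hypothetical one-dimensional data (e.g. annual spend in thousands), made up for illustration.
labelled = [(20, "low"), (25, "low"), (80, "high"), (90, "high")]
unlabelled = [20, 25, 80, 90]

def predict_nearest(x, training):
    """Predictive task: 1-nearest-neighbour classification using known labels."""
    _, label = min(training, key=lambda pair: abs(pair[0] - x))
    return label

def two_means(values, iterations=10):
    """Descriptive task: split values into two groups with a basic 2-means loop."""
    centres = [min(values), max(values)]  # simple deterministic initialisation
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        # Recompute each centre as the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups

predicted = predict_nearest(30, labelled)   # uses labels to predict an unseen value
clusters = two_means(unlabelled)            # uses no labels, only describes structure
print(predicted)
print(clusters)
```

Classification needs labelled examples and answers a question about a new record; clustering needs no labels and only summarises how the existing records group together.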
Data Mining Models and Tasks
KDD Process: A View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
• Data pre-processing: data integration, normalization, feature selection, dimension reduction
• Data mining: pattern discovery, classification, clustering, outlier analysis, …
• Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
• This is a view from typical machine learning and statistics communities
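One item from each stage of this view can be chained into a tiny example: normalization (pre-processing), outlier analysis (data mining) and pattern selection (post-processing). The readings are made up, the function names are hypothetical, and the z-score threshold of 2.0 is an assumed cut-off; z-scores are used here as one simple way to do outlier analysis, not as the method prescribed by the course.

```python
from statistics import mean, pstdev

# Hypothetical sensor readings, made up for illustration.
readings = [10, 12, 11, 13, 12, 50]

def min_max_normalize(values):
    """Pre-processing: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    """Data mining step: distance of each value from the mean, in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def select_outliers(scores, threshold=2.0):
    """Post-processing: keep positions whose absolute score exceeds the assumed threshold."""
    return [i for i, z in enumerate(scores) if abs(z) > threshold]

normalized = min_max_normalize(readings)
scores = z_scores(normalized)
outlier_positions = select_outliers(scores)
print(outlier_positions)
```

Because min-max scaling is linear, the z-scores are the same before and after normalization; the pre-processing step is shown only to mark where it sits in the pipeline.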
Major Issues in Data Mining
• Mining methodology: includes
– mining various kinds of knowledge,
– mining multidimensional data,
– exploring domain-specific mining,
– handling data with uncertainty, noise, and incompleteness,
– user-constraint guided mining.