Challenges in ML&DM

This document discusses the key differences between generative and discriminative methods in machine learning. Generative methods try to build models of the underlying patterns in the data by estimating probabilities like P(X|Y) and P(Y), while discriminative methods try to directly learn a function to minimize an error function, like classification error, without explicitly modeling the patterns. Examples of generative models include HMMs, Bayesian networks, and Naive Bayes. Discriminative models include SVMs, logistic regression, neural networks, and nearest neighbors algorithms. Generative models can be learned and adapted with small data but discriminative methods often require large labeled training datasets.


Challenges in Machine Learning and Data Mining

Tu Bao Ho, JAIST

Based on materials from:
1. 9 challenges in ML (Caruana & Joachims)
2. 10 challenging problems in DM (Yang & Wu)

What is machine learning?

 The goal of machine learning is to build computer systems that can adapt and learn from their experience (Tom Dietterich).
 A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience (Tom Mitchell's book, p. 2).
 ML problems can be formulated as:
  - Given: (x1, y1), (x2, y2), …, (xn, yn), where xi is a description of an object, phenomenon, etc., and yi is some property of xi; if yi is not available, the learning is unsupervised
  - Find: a function f(x) such that f(xi) = yi
 Finding the hypothesis f in a huge hypothesis space F by narrowing the search with constraints (bias)

Overview of ML challenges

1. Generative vs. discriminative learning
2. Learning from non-vectorial data
3. Beyond classification and regression
4. Distributed data mining
5. Machine learning bottlenecks
6. Intelligible models
7. Combining learning methods
8. Unsupervised learning comes of age
9. More informed information access

1. Generative vs. discriminative methods

Training classifiers involves estimating f: X → Y, or P(Y|X).
Examples: P(apple | red ∧ round), P(noun | "cá")   (cá: fish, to bet)

Generative classifiers
 Assume some functional form for P(X|Y), P(Y)
 Estimate parameters of P(X|Y), P(Y) directly from training data, and use Bayes rule to calculate P(Y|X = xi)
 HMM, Markov random fields, Bayesian networks, Gaussians, Naïve Bayes, etc.

Discriminative classifiers
 Assume some functional form for P(Y|X)
 Estimate parameters of P(Y|X) directly from training data
 SVM, logistic regression, traditional neural networks, nearest neighbors, boosting, MEMM, conditional random fields, etc.
Generative vs. discriminative methods

Generative approach
 Try to build models for the underlying patterns
 Can be learned, adapted, and generalized with small data
 Use Bayes rule:
  P(Y | X) = P(X | Y) P(Y) / P(X)
  P(apple | red ∧ round) = P(red ∧ round | apple) P(apple) / P(red ∧ round)

Discriminative approach
 Try to learn to minimize a utility function (e.g. classification error), not to model, represent, or "understand" the pattern explicitly (detect 99.99% of faces in real images yet not "know" that a face has two eyes)
 Often need large training data, say 100,000 labeled examples, and can hardly be generalized
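The generative recipe above (estimate P(X|Y) and P(Y), then apply Bayes rule) can be sketched on a toy dataset. The counts below are invented for illustration, not taken from the slides.

```python
# Toy generative classifier: estimate P(X|Y) and P(Y) by counting,
# then apply Bayes rule to get P(Y|X). Data are hypothetical.
from collections import Counter

# Each example: ((color, shape), label) -- invented for illustration.
data = [
    (("red", "round"), "apple"), (("red", "round"), "apple"),
    (("green", "round"), "apple"), (("red", "long"), "pepper"),
    (("green", "long"), "pepper"), (("red", "round"), "pepper"),
]

labels = Counter(y for _, y in data)

def p_y(y):                        # prior P(Y)
    return labels[y] / len(data)

def p_x_given_y(x, y):             # likelihood P(X|Y), estimated by counting
    match = sum(1 for xi, yi in data if yi == y and xi == x)
    return match / labels[y]

def p_y_given_x(y, x):             # Bayes rule: P(Y|X) = P(X|Y)P(Y) / P(X)
    evidence = sum(p_x_given_y(x, yy) * p_y(yy) for yy in labels)
    return p_x_given_y(x, y) * p_y(y) / evidence

print(round(p_y_given_x("apple", ("red", "round")), 3))   # 0.667
```

With these counts, P((red, round) | apple) = 2/3, P(apple) = 1/2, and the evidence is 1/2, so P(apple | red ∧ round) = 2/3, exactly the Bayes-rule computation in the slide.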

Generative vs. discriminative learning

 Objective: determine which is better for what, and why
 Current:
  - Discriminative learning (ANN, DT, KNN, SVM) typically more accurate; better with larger data; faster to train
  - Generative learning (graphical models, HMM) typically more flexible; handles more complex problems; more flexible predictions
 Key Challenges:
  - Vapnik: "When solving a problem, don't first solve a harder problem"
  - Making generative methods computationally more feasible
  - When to prefer discriminative vs. generative

2. Kernel methods

 Objective: learning from non-vectorial input data
 Current:
  - Most learning algorithms work on flat, fixed-length feature vectors
  - Each new data type requires a new learning algorithm
  - Difficult to handle strings, gene/protein sequences, natural language parse trees, graph structures, pictures, plots, …
 Key Challenges:
  - One data interface for multiple learning methods
  - One learning method for multiple data types
  - Research already in progress
Kernel methods: the basic ideas

[Figure: a map Φ takes points x1, …, xn in the input space X to Φ(x1), …, Φ(xn) in the feature space F; an inverse map Φ⁻¹ goes back.]

 kernel function k: X × X → R, with k(xi, xj) = Φ(xi)·Φ(xj)
 Gram (kernel) matrix Kn×n = {k(xi, xj)}
 a kernel-based algorithm operates on K (the computation is done on the kernel matrix)
 Example: Φ: X ⊆ R² → H ⊆ R³, (x1, x2) ↦ (x1, x2, x1² + x2²)

Kernel methods: math background

 Linear algebra, probability/statistics, functional analysis, optimization
 Mercer theorem: any positive definite function can be written as an inner product in some feature space.
 Kernel trick: use the kernel matrix instead of the inner product in the feature space.
 Representer theorem: every minimizer of $\min_{f \in H} C(f, \{x_i, y_i\}) + \Omega(\|f\|_H)$ admits a representation of the form $f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i)$
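The kernel trick can be sketched for the example map Φ(x1, x2) = (x1, x2, x1² + x2²) above. A matching kernel (an assumption chosen to fit this map) is k(x, y) = ⟨x, y⟩ + ‖x‖²‖y‖², and the Gram matrix computed from k equals the one computed from explicit feature vectors:

```python
# Kernel trick sketch: the Gram matrix from the kernel function equals the
# Gram matrix from explicit feature-space inner products, so the feature
# map never needs to be evaluated. Data points are invented.
import numpy as np

def phi(x):                        # explicit feature map R^2 -> R^3
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

def k(x, y):                       # kernel matching phi: <x,y> + |x|^2 |y|^2
    return x @ y + (x[0] ** 2 + x[1] ** 2) * (y[0] ** 2 + y[1] ** 2)

X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])

# Gram (kernel) matrix K[i, j] = k(x_i, x_j), computed two ways
K_kernel = np.array([[k(a, b) for b in X] for a in X])
K_feature = np.array([[phi(a) @ phi(b) for b in X] for a in X])

assert np.allclose(K_kernel, K_feature)   # same inner products, no explicit map
print(K_kernel)
```

Because K is a matrix of inner products, it is symmetric and positive semidefinite, which is exactly the property the Mercer theorem characterizes.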
3. Beyond classification and regression

 Objective: learning to predict complex objects
 Current:
  - Most machine learning focuses on classification and regression
  - Discriminative methods often outperform generative methods
  - Generative methods used for learning complex objects (e.g. language parsing, protein sequence alignment, information extraction)
 Key Challenges:
  - Extend discriminative methods (ANN, DT, KNN, SVM, …) to more general settings
  - Examples: ranking functions (e.g. Google top-ten, ROC), natural language parsing, finite-state models
  - Find ways to directly optimize desired performance criteria (e.g. ranking performance vs. error rate)

What is structured prediction? (Daume)

 Structured prediction is a fundamental machine learning task involving classification or regression in which the output variables are mutually dependent or constrained.
 Such dependencies and constraints reflect sequential, spatial or combinatorial structure in the problem domain, and capturing these interactions is often as important for the purposes of prediction as capturing input-output dependencies.
 Structured prediction (SP): the machine learning task of generating outputs with complex internal structure.
What is structured prediction? (Lafferty)

 Text, sound, event logs, biological data, handwriting, gene networks, and linked data structures like the Web can be viewed as graphs connecting basic data elements.
 Important tasks involving structured data require the computation of a labeling for the nodes or the edges of the underlying graph. E.g., POS tagging of natural language text can be seen as the labeling of nodes representing the successive words with linguistic labels.
 A good labeling will depend not just on individual nodes but also on the contents and labels of nearby nodes, that is, the preceding and following words; thus, the labels are not independent.

Handwriting recognition

[Figure: sequential structure in handwriting recognition; the input x is a sequence of character images and the output y is the word "brace".]

Labeling sequence data problem

 X is a random variable over data sequences
 Y is a random variable over label sequences whose labels are assumed to range over a finite label alphabet A
 Problem: learn how to give labels from a closed set Y to a data sequence X
  X: Thinking is being
  Y: noun    verb  noun
 [Figure: (a) an unstructured model, where each label yt depends only on xt; (b) a linear-structured model, where consecutive labels yt-1, yt, yt+1 are also connected.]
 Applications: POS tagging, phrase types, etc. (NLP); named entity recognition (IE); modeling protein sequences (CB); image segmentation, object recognition (PR); etc.
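For the linear-structured model above, the best label sequence can be found with Viterbi dynamic programming. The sketch below tags "Thinking is being" with an HMM-style model; all probabilities are invented for illustration.

```python
# Minimal Viterbi decoder for a linear-structured (HMM-style) sequence
# labeler. Start/transition/emission probabilities are hypothetical.
import numpy as np

states = ["noun", "verb"]
start = {"noun": 0.6, "verb": 0.4}
trans = {"noun": {"noun": 0.3, "verb": 0.7}, "verb": {"noun": 0.7, "verb": 0.3}}
emit = {
    "noun": {"Thinking": 0.5, "is": 0.1, "being": 0.4},
    "verb": {"Thinking": 0.2, "is": 0.7, "being": 0.1},
}

def viterbi(words):
    # delta[s]: best log-probability of any label sequence ending in state s
    delta = {s: np.log(start[s] * emit[s][words[0]]) for s in states}
    back = []
    for w in words[1:]:
        new_delta, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: delta[r] + np.log(trans[r][s]))
            ptr[s] = prev
            new_delta[s] = delta[prev] + np.log(trans[prev][s] * emit[s][w])
        back.append(ptr)
        delta = new_delta
    best = max(states, key=lambda s: delta[s])
    path = [best]
    for ptr in reversed(back):     # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["Thinking", "is", "being"]))   # ['noun', 'verb', 'noun']
```

Because each label's score depends on the previous label, the decoder captures exactly the label-label dependency that distinguishes the linear-structured model from the unstructured one.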
4. Distributed learning

 Objective: DM/ML with distributed data
 Current:
  - Most ML algorithms assume random access to all data
  - Often data come from decentralized sources (e.g. sensor networks, multiple organizations, learning across firewalls, different security systems)
  - Many projects infeasible (e.g. an organization not allowed to share data)
 Key Challenges:
  - Develop methods for distributing data while preserving privacy
  - Develop methods for distributed learning without distributing the data
  - "Data Scoup"

5. Full auto: ML for the masses

 Objective: make ML easier to apply to real problems
 Current:
  - ML applications require detailed knowledge about the algorithms
  - Preparing/preprocessing takes at least 75% of the effort
 Key Challenges:
  - Automatic selection of machine learning method
  - Tools for preprocessing the data: reformatting, sampling, filling in missing values, outlier detection
  - Robust performance estimation and model selection
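The idea of "distributed learning without distributing the data" can be sketched in its simplest form: each site shares only a local sufficient statistic, never its raw records. The two sites and their values are invented for illustration.

```python
# Sketch: estimate a global mean across two organizations without moving
# raw records. Each site reveals only (count, sum). Data are hypothetical.
site_a = [4.0, 6.0, 8.0]           # stays at organization A
site_b = [1.0, 3.0]                # stays at organization B

def local_summary(data):
    # Only aggregates leave the site, never individual records.
    return len(data), sum(data)

n_a, s_a = local_summary(site_a)
n_b, s_b = local_summary(site_b)
global_mean = (s_a + s_b) / (n_a + n_b)   # identical to pooling all the data

print(global_mean)   # 4.4
```

The same pattern (exchange sufficient statistics or model parameters, not data) underlies privacy-preserving distributed versions of richer models.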

6. Interpretable models

 Objective: make learning results more understandable
 Current:
  - Methods often achieve good prediction accuracy
  - The prediction rule appears complex and is difficult to verify
  - Lack of trust in the rule
  - Lack of insight
 Key Challenges:
  - Machine learning methods that are understandable and generate accurate rules
  - Methods for generating explanations
  - Model verification for user acceptance

7. Ensemble methods

 Objective: combining learning methods automatically
 Current:
  - We do not have a single DM/ML method that "does it all"
  - Results indicate that combining models results in large improvements in performance
  - Focus on boosting and bagging
 Key Challenges:
  - Develop methods that combine the best of different learning algs
  - Searching for good combinations might be more efficient than designing one "global" learning algorithm
  - Theoretical explanation for why and when ensemble methods help
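Bagging, one of the two combining schemes named above, can be sketched in a few lines: bootstrap-resample the training set, fit one base model per resample, and combine by majority vote. The 1-nearest-neighbor base learner and the toy data are illustrative assumptions.

```python
# Bagging sketch: bootstrap resampling + majority vote over base models.
# Base learner and training data are invented for illustration.
import random

def nn1_predict(train, x):
    # 1-nearest-neighbor on scalars: label of the closest training point
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bagged_predict(train, x, n_models=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        boot = [rng.choice(train) for _ in train]   # bootstrap resample
        votes.append(nn1_predict(boot, x))
    return max(set(votes), key=votes.count)          # majority vote

train = [(0.0, "a"), (0.2, "a"), (0.4, "a"),
         (1.0, "b"), (1.2, "b"), (1.4, "b")]
print(bagged_predict(train, 0.1))   # "a"
print(bagged_predict(train, 1.3))   # "b"
```

Averaging over resamples reduces the variance of an unstable base learner, which is the usual explanation for why bagging helps.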
8. Unsupervised learning

 Objective: improve the state of the art in unsupervised learning
 Current:
  - Research focus in the 90's was supervised learning
  - Much progress on supervised learning methods like neural networks, support vector machines, boosting, etc.
  - Unsupervised learning needs to "catch up"
 Key Challenges:
  - More robust and stable methods for clustering
  - Include user feedback into unsupervised learning (e.g. clustering with constraints, semi-supervised learning, transduction)
  - Automatic distance metric learning
  - Clustering as an interactive data analysis process

9. Information access

 Objective: more informed information access
 Current:
  - Bag-of-words representations
  - Retrieval functions exploit document structure and link structure
  - Information retrieval is a process without memory
 Key Challenges:
  - Develop methods for exploiting usage data
  - Learn from the query history of a user / user group
  - Preserve privacy while mining access patterns
  - Exploit common access patterns and find "groups" of users
  - Web Expert: an agent that learns the web (beyond Google)
  - Topic modelling

What is data mining?

"Data-driven discovery of models and patterns from massive observational data sets"

[Figure: data mining at the intersection of statistics/inference, languages/representations, and data management, feeding applications.]
Overview of DM challenges (ICDM'05)

1. Developing a Unifying Theory of Data Mining
2. Scaling Up for High Dimensional Data/High Speed Streams
3. Mining Sequence Data and Time Series Data
4. Mining Complex Knowledge from Complex Data
5. Data Mining in a Network Setting
6. Distributed Data Mining and Mining Multi-agent Data
7. Data Mining for Biological and Environmental Problems
8. Data-Mining-Process Related Problems
9. Security, Privacy and Data Integrity
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data

1. Developing a unifying theory of DM

 The current state of the art of data-mining research is too "ad-hoc"
  - techniques are designed for individual problems
  - no unifying theory
 Needs unifying research
  - Exploration vs. explanation
  - Long-standing theoretical issues
  - How to avoid spurious correlations?
 Deep research
  - Knowledge discovery on hidden causes?
  - Similar to the discovery of Newton's laws?
 Example: VC dimension. In statistical learning theory, or sometimes computational learning theory, the VC dimension (for Vapnik-Chervonenkis dimension) is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter. The VC dimension of a perceptron in the plane is 3.
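The claim that a planar perceptron has VC dimension 3 can be checked empirically: three non-collinear points can be shattered (all 8 labelings realized by a line), while the 4-point XOR labeling cannot. The perceptron training loop with an epoch cap is an assumed stand-in for an exact separability test.

```python
# Empirical illustration of VC dimension 3 for a linear classifier with bias
# in the plane: a non-collinear triple is shattered, the XOR square is not.
import itertools

def linearly_separable(points, labels, epochs=1000):
    # Perceptron; returns True iff it reaches zero training errors.
    w0, w1, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        errors = 0
        for (x, y), t in zip(points, labels):
            pred = 1 if w0 * x + w1 * y + b > 0 else -1
            if pred != t:
                w0 += t * x; w1 += t * y; b += t
                errors += 1
        if errors == 0:
            return True
    return False

tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]            # non-collinear triple
shattered = all(linearly_separable(tri, labs)
                for labs in itertools.product([-1, 1], repeat=3))

square = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
xor = [1, 1, -1, -1]                                   # the XOR labeling

print(shattered)                                       # True
print(linearly_separable(square, xor))                 # False
```

The perceptron convergence theorem guarantees the loop terminates for every separable labeling, so the epoch cap only matters in the non-separable XOR case.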

2. Scaling up for high dimensional data and high speed streams

 Scaling up is needed
  - ultra-high dimensional classification problems (millions or billions of features, e.g., bio data)
  - ultra-high speed data streams
 Streams
  - continuous, online process
  - e.g. how to monitor network packets for intruders?
  - concept drift and environment drift?
 RFID network and sensor network data
(Excerpt from Jian Pei's tutorial, http://www.cs.sfu.ca/~jpei/)

3. Sequential and time series data

 How to efficiently and accurately cluster, classify and predict the trends?
 Time series data used for predictions are contaminated by noise
  - How to make accurate short-term and long-term predictions?
  - Signal processing techniques introduce lags in the filtered data, which reduces accuracy
  - Keys: source selection, domain knowledge in rules, and optimization methods

[Figure: real time series data obtained from wireless sensors in the Hong Kong UST CS department hallway.]
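The lag that filtering introduces, mentioned above, is easy to see with a causal moving average: smoothing over past samples shifts a peak later in time, which is exactly what hurts short-term prediction. The synthetic "sensor" series below is invented.

```python
# Sketch: a causal moving average delays the peak of a signal.
# The synthetic series stands in for noisy sensor data.
import numpy as np

t = np.arange(100)
signal = np.exp(-0.5 * ((t - 40) / 3.0) ** 2)    # clean peak at t = 40
noisy = signal + 0.05 * np.sin(0.9 * t)          # deterministic "noise"

w = 9
kernel = np.ones(w) / w
# Causal moving average: filtered[i] = mean of noisy[i-w+1 .. i]
filtered = np.convolve(noisy, kernel)[: len(noisy)]

# The filtered peak occurs later than the raw peak (a lag of ~(w-1)/2 samples)
print(int(np.argmax(noisy)), int(np.argmax(filtered)))
```

A non-causal (centered) filter would avoid the lag, but is unavailable in online prediction, where future samples have not arrived yet.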
4. Mining complex knowledge from complex data (complexly structured data)

 Mining graphs
 Data that are not i.i.d. (independent and identically distributed)
  - many objects are not independent of each other, and are not of a single type
  - mine the rich structure of relations among objects
  - e.g.: interlinked Web pages, social networks, metabolic networks in the cell
 Integration of data mining and knowledge inference
  - The biggest gap: unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user
 More research on interestingness of knowledge

5. Data mining in a network setting

 Community and social networks
  - Linked data between emails, Web pages, blogs, citations, sequences and people
  - Static and dynamic structural behavior
 Mining in and for computer networks
  - detect anomalies (e.g., sudden traffic spikes due to a DoS (Denial of Service) attack)
  - need to handle 10Gig Ethernet links: (a) detect, (b) trace back, (c) drop packet

[Picture from Matthew Pirretti's slides, Penn State; example of packet streams, data courtesy of NCSA, UIUC.]

6. Distributed data mining and mining multi-agent data

 Need to correlate the data seen at the various probes (such as in a sensor network)
 Adversary data mining: deliberately manipulate the data to sabotage the miner (e.g., make it produce false negatives)
 Game theory may be needed for help

[Figure: a two-player game between the miner (Player 1) and an adversary (Player 2); each chooses action H or T, with outcomes (-1,1), (1,-1), (1,-1), (-1,1).]

7. Data mining for biological and environmental problems

 New problems raise new questions
  - large scale problems especially so
 Biological data mining, such as HIV vaccine design
  - DNA, chemical properties, 3D structures, and functional properties need to be fused
 Environmental data mining
 Mining for solving the energy crisis

[Figure: scale of biological data — genomics: 25,000 genes; proteomics: 2,000,000 proteins; metabolomics: 3,000 metabolites.]
8. Data-mining-process related problems

 How to automate the mining process?
  - the composition of data mining operations
  - data cleaning, with logging capabilities
  - visualization and mining automation
 Need a methodology: help users avoid many data mining mistakes
 What is a canonical set of data mining operations?

[Figure: the KDD process, inherently interactive and iterative — (1) understand the domain and define problems; (2) collect and preprocess data (maybe 70-90% of effort and cost); (3) data mining: extract patterns/models; (4) interpret and evaluate the discovered knowledge; (5) put the results in practical use. In many cases, KDD has been viewed as just the data mining step.]

9. Security, privacy and data integrity

 How to ensure the users' privacy while their data are being mined?
 How to do data mining for protection of security and privacy?
 Knowledge integrity assessment
 Methods
  - perturbation methods
  - Secure Multi-Party Computation (SMC) methods

[Figure: randomization-based perturbation — Alice's record "30 | 70K | ..." becomes "65 | 20K | ..." after a randomizer adds a random number (e.g. age 30 + 35 = 65); the miner reconstructs the distributions of age and salary from the perturbed data before running the classification algorithm.]
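The perturbation idea in the figure can be sketched numerically: each user adds zero-mean noise to a sensitive value before sending it, and the miner, who knows only the noise distribution, still recovers aggregate properties (here, just the mean age). The data and noise range are invented.

```python
# Perturbation sketch: users randomize their ages; the miner estimates the
# true mean from perturbed values only. All numbers are hypothetical.
import random

random.seed(7)
true_ages = [random.randint(20, 60) for _ in range(10000)]  # never leave the users

# Each user submits age + r, with r uniform on [-30, 30] (zero mean).
perturbed = [a + random.uniform(-30, 30) for a in true_ages]

# The miner sees only `perturbed`; since E[r] = 0, the sample mean of the
# perturbed values is an unbiased estimate of the true mean age.
est_mean = sum(perturbed) / len(perturbed)
true_mean = sum(true_ages) / len(true_ages)
print(round(true_mean, 1), round(est_mean, 1))
```

Recovering the full age *distribution*, as in the figure, requires a deconvolution step (e.g. Bayesian reconstruction against the known noise density), but the mean already illustrates the privacy/utility trade: individual records are hidden while aggregates survive.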
10. Dealing with non-static, unbalanced and cost-sensitive data

 The UCI datasets are small and not highly unbalanced
 Real world data are large (10^5 features) but only < 1% of the useful classes (+'ve)
 There is much information on costs and benefits, but no overall model of profit and loss
 Data may evolve with a bias introduced by sampling
 Some papers can be found at http://www.jaist.ac.jp/~bao/K417-2008

[Figure: a medical example — temperature 39°C is known, but blood pressure, blood test, essay, and cardiogram are unmeasured ("?"); each test incurs a cost, the data are extremely unbalanced, and the data change with time.]
How can a machine understand the meaning of documents?

 Find documents on Google related to three topics: "food", "shrimp paste", "epidemic".
 Google returns very many documents, with low precision and recall.
 How can the computer understand the content of documents to search effectively?
 Through the topics of each document:
  - "food" = {safety, vegetables, meat, fish, no food poisoning, no stomach ache, …}
  - "shrimp paste" = {shrimp, salty, tofu, dog meat, pork offal, …}
  - "epidemic" = {many people, emergency, hospital, medicine, vaccine, summer, …}
  - e.g. D1 = {food 0.6, shrimp paste 0.35, epidemic 0.8}
 Latent semantic analysis (Deerwester et al., 1990; Hofmann, 1999): represent documents in a Euclidean space whose dimensions are linear combinations of the words (similar to PCA); the normalized term-document co-occurrence matrix C is factorized as C = U Σ Vᵀ.

Term-document co-occurrence matrix:

           D1  D2  D3  D4  D5  D6 | Q1
  rock      2   1   0   2   0   1 |  1
  granite   1   0   1   0   0   0 |  0
  marble    1   2   0   0   0   0 |  1
  music     0   0   0   1   2   0 |  0
  song      0   0   0   1   0   2 |  0
  band      0   0   0   0   1   0 |  0

Two-dimensional LSA representation:

           D1      D2      D3      D4      D5      D6    |  Q1
  dim1   -0.888  -0.759  -0.615  -0.961  -0.388  -0.851 | -0.845
  dim2    0.460   0.652   0.789  -0.276  -0.922  -0.525 |  0.534

Topic modeling: key ideas

 Topic modeling key idea (LDA, Blei et al., JMLR 2003):
  - each document is a mixture of topics
  - each topic is a probability distribution over words
 [Figure: the documents × words matrix is decomposed into a documents × topics matrix times a topics × words matrix.]

Latent Dirichlet Allocation (LDA)

[Plate diagram: Dirichlet parameter α → per-document topic proportions θd → per-word topic assignment zd,n → observed word wd,n, which also depends on the per-topic word proportions βt with topic hyperparameter η; plates over N words, M documents, and T topics.]
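The LSA step above is a truncated SVD of the term-document counts. The sketch below runs it on the 6×6 matrix from the table (the query column Q1 is left out) and keeps two latent dimensions:

```python
# LSA sketch on the term-document counts above: SVD C = U S V^T, then keep
# k = 2 latent dimensions to embed each document in a "semantic" plane.
import numpy as np

terms = ["rock", "granite", "marble", "music", "song", "band"]
C = np.array([
    [2, 1, 0, 2, 0, 1],    # rock
    [1, 0, 1, 0, 0, 0],    # granite
    [1, 2, 0, 0, 0, 0],    # marble
    [0, 0, 0, 1, 2, 0],    # music
    [0, 0, 0, 1, 0, 2],    # song
    [0, 0, 0, 0, 1, 0],    # band
], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
docs_2d = (np.diag(s[:k]) @ Vt[:k]).T    # each row: one document in latent space

# Rank-2 reconstruction; the residual comes from the discarded singular values.
C2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print(docs_2d.shape, round(float(np.linalg.norm(C - C2)), 3))
```

The resulting coordinates play the role of the dim1/dim2 rows in the table (up to sign and normalization conventions, which differ between LSA variants).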
Latent Dirichlet allocation (LDA) model

Dirichlet prior on the per-document topic distributions:

  $p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$

Joint distribution of the topic mixture θ, a set of N topics z, and a set of N words w:

  $p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$

Marginal distribution of a document, by integrating over θ and summing over z:

  $p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$

Probability of a collection, as the product of the marginal probabilities of single documents:

  $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$

Example of topics learned

[Figure: topics learned from 16,000 documents of the AP corpus with a 100-topic LDA model; each color codes the factor from which a word is putatively generated.]
Dirichlet-Lognormal (DLN) topic model

Model description:

  $\theta \mid \alpha \sim \mathrm{Dirichlet}(\alpha)$
  $\beta \mid \mu, \Sigma \sim \mathrm{Lognormal}(\mu, \Sigma)$
  $z \mid \theta \sim \mathrm{Multinomial}(\theta)$
  $w \mid \beta \sim \mathrm{Multinomial}(f(\beta))$

  $\Pr(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2} x_1 \cdots x_n} \exp\!\left( -\tfrac{1}{2} (\log x - \mu)^T \Sigma^{-1} (\log x - \mu) \right)$,
  where $\log x = (\log x_1, \ldots, \log x_n)^T$

[Figure: plate diagrams of the LDA and DLN models, differing in the prior on the per-topic word proportions β.]

Results (Than Quang Khoat, Ho Tu Bao, 2010):

 Spam classification (accuracy): DLN 0.5937, LDA 0.4984, SVM 0.4945
 Predicting crime: DLN 0.2442, LDA 0.1035, SVM 0.2261

Traditional ML vs. TL (P. Langley 06)

 Traditional ML learns in each of multiple domains separately; transfer learning carries what was learned across domains.
 Humans can learn in many domains. Humans can also transfer from one domain to other domains.

[Figure: in traditional ML, each domain has its own training items and test items; in transfer learning, experience from source domains helps in the target domain.]
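The multivariate lognormal density used in the DLN model can be sanity-checked numerically in the 1-D case, where the formula reduces to the standard lognormal pdf and should integrate to about 1 over x > 0:

```python
# Numerical check of the lognormal density formula above for n = 1:
# Pr(x) = exp(-(log x - mu)^2 / (2 sigma^2)) / (x sqrt(2 pi sigma^2)).
import numpy as np

def lognormal_pdf(x, mu, sigma2):
    return np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma2)) \
        / (x * np.sqrt(2 * np.pi * sigma2))

x = np.linspace(1e-6, 50.0, 200000)
f = lognormal_pdf(x, mu=0.0, sigma2=0.25)
# trapezoid rule; essentially all probability mass lies inside (1e-6, 50)
total = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))
print(round(total, 4))   # close to 1
```

The choice μ = 0, σ² = 0.25 is arbitrary; any valid parameters should give an integral near 1, which is what makes the density a proper prior on β.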
Traditional ML vs. TL

 Learning process of traditional ML: each learning system is trained from scratch on its own training items and evaluated on its own test items.
 Learning process of transfer learning: knowledge extracted by the source learning system is passed to the target learning system.

Notation

 Domain: consists of two components, a feature space X and a marginal distribution P(X).
  - In general, if two domains are different, then they may have different feature spaces or different marginal distributions.
 Task: given a specific domain and a label space Y, for each x in the domain, predict its corresponding label y.
  - In general, if two tasks are different, then they may have different label spaces or different conditional distributions.
 For simplicity, we only consider at most two domains and two tasks:
  - the source domain and its task
  - the target domain and its task