Challenges in Machine Learning and Data Mining

Tu Bao Ho, JAIST

Based on materials from
- 9 challenges in ML (Caruana & Joachims)
- 10 challenging problems in DM (Yang & Wu)

The goal of machine learning is to build computer systems that can adapt and learn from their experience (Tom Dietterich).

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E (Tom Mitchell's book, p. 2).

ML problems can be formulated as:

Given: (x1, y1), (x2, y2), …, (xn, yn)
- xi is a description of an object, phenomenon, etc.
- yi is some property of xi; if yi is not available, learning is unsupervised

Find: a function f(x) such that f(xi) = yi

Learning means finding a hypothesis f in a huge hypothesis space F by narrowing the search with constraints (bias).
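As a minimal sketch of this Given/Find formulation, with a hypothetical noiseless dataset: take the hypothesis space F to be the linear functions f(x) = w·x + b (the bias that narrows the search), and pick the member of F minimizing squared error on the given pairs.

```python
# A minimal sketch of the Given/Find formulation: hypothesis space F is
# the set of linear functions f(x) = w*x + b, and we narrow the search
# by least-squares fitting on the given (xi, yi) pairs.
def fit_linear(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return lambda x: w * x + b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]     # hypothetical noiseless labels yi
f = fit_linear(xs, ys)
print(round(f(5.0), 3))          # recovers 2*5 + 1 = 11.0
```

On noiseless linear data the fitted f reproduces the target exactly; with noise it returns the least-squares member of F.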
Overview of ML challenges

1. Generative vs. discriminative learning
2. Learning from non-vectorial data
3. Beyond classification and regression
4. Distributed data mining
5. Machine learning bottlenecks
6. Intelligible models
7. Combining learning methods
8. Unsupervised learning comes of age
9. More informed information access

1. Generative vs. discriminative methods

Training classifiers involves estimating f: X → Y, or P(Y|X).
Examples: P(apple | red ∧ round), P(noun | "cá") (cá: fish, to bet)

Generative classifiers
- Assume some functional form for P(X|Y), P(Y)
- Estimate parameters of P(X|Y), P(Y) directly from training data, and use Bayes rule to calculate P(Y|X = xi)
- Examples: HMM, Markov random fields, Bayesian networks, Gaussians, Naïve Bayes, etc.

Discriminative classifiers
- Assume some functional form for P(Y|X)
- Estimate parameters of P(Y|X) directly from training data
- Examples: SVM, logistic regression, traditional neural networks, nearest neighbors, boosting, MEMM, conditional random fields, etc.
Generative vs. discriminative methods (continued)

Generative approach
- Tries to build models for the underlying patterns
- Can be learned, adapted, and generalized with small data
- Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X)
  e.g., P(apple | red ∧ round) = P(red ∧ round | apple) P(apple) / P(red ∧ round)

Discriminative approach
- Tries to learn to minimize a utility function (e.g. classification error), not to model, represent, or "understand" the pattern explicitly
- Can detect 99.99% of faces in real images yet not "know" that a face has two eyes
- Often needs large training data, say 100,000 labeled examples, and can hardly be generalized
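The generative recipe above (estimate P(X|Y) and P(Y) from training data, then apply Bayes rule) can be sketched on the apple example; the counts below are made up for illustration.

```python
# A toy generative classifier for the apple example: estimate P(X|Y) and
# P(Y) from (made-up) counts, then apply Bayes rule to get P(Y|X).
data = [(("red", "round"), "apple")] * 6 + \
       [(("red", "round"), "not_apple")] * 1 + \
       [(("green", "round"), "apple")] * 2 + \
       [(("red", "long"), "not_apple")] * 3

def p_y_given_x(x, y, data):
    # P(Y=y | X=x) = P(X=x | Y=y) P(Y=y) / P(X=x)
    n = len(data)
    p_x = sum(1 for xi, _ in data if xi == x) / n
    p_y = sum(1 for _, yi in data if yi == y) / n
    p_x_given_y = sum(1 for xi, yi in data if xi == x and yi == y) / (p_y * n)
    return p_x_given_y * p_y / p_x

print(p_y_given_x(("red", "round"), "apple", data))  # 6/7 ≈ 0.857
```

Of the 7 red-and-round training items, 6 are apples, so Bayes rule recovers 6/7, matching the direct count.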
Representer theorem: a representation of the form f(·) = Σi=1..m αi K(·, xi)
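The representer-theorem form can be sketched numerically: with an RBF kernel, the learned function is a weighted sum of kernel evaluations at the training points. The coefficients αi below are made up, not learned.

```python
import math

# Sketch of the representer theorem's form f(.) = sum_i alpha_i K(., x_i):
# the function is a weighted sum of kernel evaluations at training points.
def rbf(a, b, gamma=1.0):
    return math.exp(-gamma * (a - b) ** 2)

x_train = [0.0, 1.0, 2.0]
alpha = [0.5, -1.0, 0.3]   # hypothetical learned coefficients

def f(x):
    return sum(a * rbf(x, xi) for a, xi in zip(alpha, x_train))

print(round(f(1.0), 4))
```

Any kernel method whose solution obeys the theorem (SVMs, kernel ridge regression, …) produces a function of exactly this shape.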
3. Beyond classification and regression

Objective: learning to predict complex objects

Current:
- Most machine learning focuses on classification and regression
- Discriminative methods often outperform generative methods
- Generative methods are used for learning complex objects (e.g. language parsing, protein sequence alignment, information extraction)

Key Challenges:
- Extend discriminative methods (ANN, DT, KNN, SVM, …) to more general settings
- Examples: ranking functions (e.g. Google top-ten, ROC), natural language parsing, finite-state models
- Find ways to directly optimize desired performance criteria (e.g. ranking performance vs. error rate)

What is structured prediction? (Daume)

Structured prediction is a fundamental machine learning task involving classification or regression in which the output variables are mutually dependent or constrained.

Such dependencies and constraints reflect sequential, spatial or combinatorial structure in the problem domain, and capturing these interactions is often as important for the purposes of prediction as capturing input-output dependencies.

Structured prediction (SP): the machine learning task of generating outputs with complex internal structure.
What is structured prediction? (Lafferty)

Text, sound, event logs, biological data, handwriting, gene networks, and linked data structures like the Web can be viewed as graphs connecting basic data elements.

Important tasks involving structured data require the computation of a labeling for the nodes or the edges of the underlying graph. E.g., POS tagging of natural language text can be seen as the labeling of nodes representing the successive words with linguistic labels.

A good labeling will depend not just on individual nodes but also on the contents and labels of nearby nodes, that is, the preceding and following words. Thus, the labels are not independent.

[Figure: handwriting recognition with sequential structure -- an input x (images of the handwritten letters of "brace") mapped to an output label sequence y]
Structured prediction: the labeling sequence data problem

- X is a random variable over data sequences
- Y is a random variable over label sequences whose labels are assumed to range over a finite label alphabet A
- Problem: learn how to give labels from a closed set Y to a data sequence X

  X: Thinking is being
  Y: noun     verb  noun

[Figure: (a) unstructured model -- each label yt depends only on xt; (b) linear-structured model -- consecutive labels yt-1, yt, yt+1 are also connected]

Applications:
- POS tagging, phrase types, etc. (NLP)
- Named entity recognition (IE)
- Modeling protein sequences (CB)
- Image segmentation, object recognition (PR)
- etc.
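A linear-structured model of this kind can be decoded with the Viterbi algorithm. The sketch below uses made-up HMM-style transition and emission probabilities for the noun/verb example, chosen only so that "is" strongly prefers verb.

```python
# A minimal Viterbi decoder for a linear-structured (HMM-style) model,
# with made-up parameters for the slide's "Thinking is being" example.
states = ["noun", "verb"]
start = {"noun": 0.6, "verb": 0.4}
trans = {"noun": {"noun": 0.3, "verb": 0.7},
         "verb": {"noun": 0.7, "verb": 0.3}}
emit = {"noun": {"Thinking": 0.5, "is": 0.1, "being": 0.5},
        "verb": {"Thinking": 0.2, "is": 0.9, "being": 0.2}}

def viterbi(words):
    # delta[s] = best score of any label sequence ending in state s
    delta = {s: start[s] * emit[s][words[0]] for s in states}
    back = []
    for w in words[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] * trans[p][s])
            ptr[s] = best
            delta[s] = prev[best] * trans[best][s] * emit[s][w]
        back.append(ptr)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["Thinking", "is", "being"]))  # ['noun', 'verb', 'noun']
```

The decoder scores whole label sequences jointly, which is exactly what distinguishes the linear-structured model from labeling each position independently.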
4. Distributed learning

Objective: DM/ML with distributed data

Current:
- Most ML algorithms assume random access to all data
- Often data come from decentralized sources (e.g. sensor networks, multiple organizations, learning across firewalls, different security systems)
- Many projects are infeasible (e.g. an organization is not allowed to share data)

Key Challenges:
- Develop methods for distributing data while preserving privacy
- Develop methods for distributed learning without distributing the data ("Data Scoup")

5. Full auto: ML for the masses

Objective: make ML easier to apply to real problems

Current:
- ML applications require detailed knowledge about the algorithms
- Preparing/preprocessing takes at least 75% of the effort

Key Challenges:
- Automatic selection of the machine learning method
- Tools for preprocessing the data: reformatting, sampling, filling in missing values, outlier detection
- Robust performance estimation and model selection
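One way to learn without distributing the data is for each site to share only sufficient statistics rather than raw records. A minimal sketch with made-up site data, estimating a global mean:

```python
# Sketch of "distributed learning without distributing the data":
# each site shares only sufficient statistics (count and sum for a
# global mean), never its raw records. Site data below are made up.
site_a = [2.0, 4.0, 6.0]
site_b = [10.0, 20.0]

def local_stats(data):
    return len(data), sum(data)

# Only the (n, s) pairs cross organizational boundaries.
stats = [local_stats(site_a), local_stats(site_b)]
n_total = sum(n for n, _ in stats)
s_total = sum(s for _, s in stats)
global_mean = s_total / n_total
print(global_mean)  # (12 + 30) / 5 = 8.4
```

The same pattern extends to any estimator with additive sufficient statistics (variances, regression normal equations, Naïve Bayes counts).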
6. Interpretable models

Objective: make learning results more understandable

Current:
- Methods often achieve good prediction accuracy
- The prediction rule appears complex and is difficult to verify
- Lack of trust in the rule
- Lack of insight

Key Challenges:
- Machine learning methods that are understandable and generate accurate rules
- Methods for generating explanations
- Model verification for user acceptance

7. Ensemble methods

Objective: combining learning methods automatically

Current:
- We do not have a single DM/ML method that "does it all"
- Results indicate that combining models yields large improvements in performance
- Focus on boosting and bagging

Key Challenges:
- Develop methods that combine the best of different learning algorithms
- Searching for good combinations might be more efficient than designing one "global" learning algorithm
- Theoretical explanation for why and when ensemble methods help
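A toy sketch of bagging, one of the ensemble methods named above: train simple threshold classifiers on bootstrap resamples of made-up 1-D data, then predict by majority vote.

```python
import random
import statistics

# Toy bagging: threshold "stumps" trained on bootstrap resamples,
# combined by majority vote. Data values below are made up.
random.seed(0)
class0 = [1.0, 1.5, 2.0, 2.5]
class1 = [7.0, 7.5, 8.0, 8.5]

def train_stump(s0, s1):
    # threshold halfway between the class means of this bootstrap sample
    t = (statistics.mean(s0) + statistics.mean(s1)) / 2
    return lambda x: 0 if x < t else 1

stumps = []
for _ in range(11):
    s0 = [random.choice(class0) for _ in class0]   # bootstrap resample
    s1 = [random.choice(class1) for _ in class1]
    stumps.append(train_stump(s0, s1))

def predict(x):
    votes = [s(x) for s in stumps]
    return max(set(votes), key=votes.count)

print(predict(1.2), predict(8.2))  # 0 1
```

Each stump varies with its resample; voting averages that variance away, which is the mechanism behind bagging's performance gains.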
8. Unsupervised learning

Objective: improve the state of the art in unsupervised learning

Current:
- The research focus in the 90's was supervised learning
- Much progress on supervised learning methods like neural networks, support vector machines, boosting, etc.
- Unsupervised learning needs to "catch up"

Key Challenges:
- More robust and stable methods for clustering
- Include user feedback in unsupervised learning (e.g. clustering with constraints, semi-supervised learning, transduction)
- Automatic distance metric learning
- Clustering as an interactive data analysis process
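A minimal 1-D k-means sketch on made-up data illustrates the kind of clustering whose robustness and stability the slide calls out: the result depends on the initial centers, one source of instability.

```python
# Minimal 1-D k-means (k = 2) on made-up data. Results depend on the
# initial centers c0, c1 -- a classic source of clustering instability.
def kmeans_1d(points, c0, c1, iters=10):
    for _ in range(iters):
        a = [p for p in points if abs(p - c0) <= abs(p - c1)]
        b = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0 = sum(a) / len(a)
        c1 = sum(b) / len(b)
    return sorted([c0, c1])

points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
print(kmeans_1d(points, 0.0, 5.0))  # ≈ [1.0, 9.0]
```

Constraints or user feedback (must-link / cannot-link pairs) would restrict the assignment step, which is the direction the slide's "clustering with constraints" challenge points to.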
2. Scaling up for high dimensional data and high speed streams

- Scaling up is needed for ultra-high dimensional classification problems (millions or billions of features, e.g. bio data)
- Ultra-high speed data streams must be processed continuously and online; e.g. how to monitor network packets for intruders?
- Concept drift and environment drift
- RFID network and sensor network data

(Excerpt from Jian Pei's tutorial, http://www.cs.sfu.ca/~jpei/)

3. Sequential and time series data

- How to efficiently and accurately cluster, classify and predict the trends?
- Time series data used for predictions are contaminated by noise: how to do accurate short-term and long-term predictions?
- Signal processing techniques introduce lags in the filtered data, which reduces accuracy
- Keys: source selection, domain knowledge in rules, and optimization methods

[Figure: real time series data obtained from wireless sensors in a Hong Kong UST CS department hallway]
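Concept drift in a stream can be sketched with a simple two-window check: flag positions where the mean of the most recent window moves far from that of the preceding window. Window size and threshold below are illustrative choices, not from the slides.

```python
from collections import deque

# Toy sliding-window concept-drift check on a stream: flag indices where
# the recent window's mean differs from the older window's by > threshold.
def drift_points(stream, w=5, threshold=3.0):
    old, recent = deque(maxlen=w), deque(maxlen=w)
    flags = []
    for i, x in enumerate(stream):
        if len(recent) == w:
            old.append(recent[0])   # item about to leave the recent window
        recent.append(x)
        if len(old) == w:
            if abs(sum(recent) / w - sum(old) / w) > threshold:
                flags.append(i)
    return flags

stream = [1.0] * 10 + [10.0] * 10   # an abrupt shift at index 10
print(drift_points(stream))
```

Because each item is touched once with O(w) memory, the check suits the continuous, online setting the slide describes; real detectors (e.g. statistical tests over windows) refine the same idea.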
4. Mining complex knowledge from complex data (complexly structured data)

- Mining graphs
- Data that are not i.i.d. (independent and identically distributed): many objects are not independent of each other, and are not of a single type
- Mine the rich structure of relations among objects, e.g. interlinked Web pages, social networks, metabolic networks in the cell
- Integration of data mining and knowledge inference
- The biggest gap: systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user
- More research on interestingness of knowledge

5. Data mining in a network setting

- Community and social networks: linked data between emails, Web pages, blogs, citations, sequences and people; static and dynamic structural behavior (picture from Matthew Pirretti's slides, Penn State)
- Mining in and for computer networks: detect anomalies (e.g. sudden traffic spikes due to DoS (Denial of Service) attacks); need to handle 10Gig Ethernet links: (a) detect, (b) trace back, (c) drop packet (example packet streams: data courtesy of NCSA, UIUC)
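A tiny sketch of mining linked, non-i.i.d. data: counting triangles in a toy undirected graph, a basic building block of community analysis. The graph below is made up.

```python
from itertools import combinations

# Count triangles in a small undirected graph -- a simple measure of
# local link structure used in community and social-network analysis.
edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}

def has_edge(u, v):
    return (u, v) in edges or (v, u) in edges

nodes = {u for e in edges for u in e}
triangles = sum(
    1 for u, v, w in combinations(sorted(nodes), 3)
    if has_edge(u, v) and has_edge(v, w) and has_edge(u, w)
)
print(triangles)  # 1: the a-b-c triangle
```

Measures like this depend on relations between objects, which is exactly why i.i.d. assumptions fail on such data.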
10. Dealing with non-static, unbalanced and cost-sensitive data

- The UCI datasets are small and not highly unbalanced
- Real-world data are large (10^5 features), but often less than 1% of the examples belong to the useful (+)'ve class
- There is much information on costs and benefits, but no overall model of profit and loss
- Data may evolve with a bias introduced by sampling

[Figure: a patient record -- temperature 39°C known; pressure, blood test, cardiogram, essay unknown]
- Each test incurs a cost
- Data are extremely unbalanced
- Data change with time

Some papers can be found here: http://www.jaist.ac.jp/~bao/K417-2008
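One common remedy for extreme class imbalance is oversampling the rare (+)'ve class before training. A minimal sketch with made-up counts:

```python
import random

# Oversample the rare positive class until the classes are balanced.
# The 99:1 class ratio below is made up to mirror the slide's "< 1%".
random.seed(1)
majority = [("x%d" % i, 0) for i in range(99)]
minority = [("y%d" % i, 1) for i in range(1)]

def oversample(examples, extra):
    return [random.choice(examples) for _ in range(extra)]

balanced = majority + minority + \
    oversample(minority, len(majority) - len(minority))
pos = sum(1 for _, y in balanced if y == 1)
neg = sum(1 for _, y in balanced if y == 0)
print(pos, neg)  # 99 99
```

Cost-sensitive alternatives keep the data as-is and instead weight errors on the rare class more heavily in the loss, matching the slide's point that each test and error carries a cost.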
How can machines understand the meaning of documents?

Finding documents on Google related to the three topics "thực phẩm" (food), "mắm tôm" (shrimp paste), and "dịch bệnh" (epidemic): Google returns very many documents, with low precision and recall.

Latent semantic analysis (Hofmann, 1999): represent documents in a Euclidean space whose dimensions are linear combinations of words (similar to PCA).

[Figure: SVD of the words × documents matrix, C = U D V]

          D1      D2      D3      D4      D5      D6      Q1
  dim1  -0.888  -0.759  -0.615  -0.961  -0.388  -0.851  -0.845
  dim2   0.460   0.652   0.789  -0.276  -0.922  -0.525   0.534

Topic modeling: key ideas (LDA, Blei, JMLR 2003)

- Each topic is a distribution over words, e.g. {cấp cứu (emergency), bệnh viện (hospital), thuốc (medicine), vaccine, mùa hè (summer), …}
- Each document is a distribution over topics, e.g. D1 = {thực phẩm (food) 0.6, mắm tôm (shrimp paste) 0.35, dịch bệnh (epidemic) 0.8}

[Figure: LDA plate notation -- α → θd → zd,n → wd,n, with N words per document and M documents; topics as a words × topics matrix]
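The LSA decomposition can be sketched with a plain SVD of a term-document count matrix; the matrix and its values below are illustrative only, not the slide's data.

```python
import numpy as np

# Minimal LSA sketch: SVD of a made-up term-document count matrix
# C ≈ U D V^T, keeping 2 latent dimensions as document coordinates.
C = np.array([
    [2, 1, 0, 0],   # e.g. counts of "food"
    [1, 2, 0, 0],   # "shrimp paste"
    [0, 0, 3, 1],   # "hospital"
    [0, 0, 1, 2],   # "vaccine"
], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
doc_coords = (np.diag(s[:2]) @ Vt[:2]).T   # documents in 2-D latent space
print(doc_coords.shape)  # (4, 2)
```

Each latent dimension is a linear combination of words, as in the dim1/dim2 table above; queries are folded into the same space to be compared with documents.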
Model description (DLN):
- θ | α ~ Dirichlet(α)
- β | μ, Σ ~ Lognormal(μ, Σ)
- z | θ ~ Multinomial(θ)
- w | z, β ~ Multinomial(f(z, β))

where the multivariate lognormal density is

  Pr(x1, …, xn) = 1 / ((2π)^(n/2) |Σ|^(1/2) x1 … xn) · exp(−(1/2) (log x − μ)^T Σ^(−1) (log x − μ)),

  with log x = (log x1, …, log xn)^T.

Spam classification:
  Method    DLN     LDA     SVM
  Accuracy  0.5937  0.4984  0.4945

Predicting crime:
  Method    DLN     LDA     SVM
            0.2442  0.1035  0.2261

(Than Quang Khoat, Ho Tu Bao, 2010)

Transfer learning

Humans can learn in many domains. Humans can also transfer knowledge from one domain to other domains.
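The multivariate lognormal density above can be checked numerically; the parameters μ and Σ below are made up for the check, not from the model.

```python
import numpy as np

# Numeric check of the multivariate lognormal density on the slide,
# with made-up parameters mu and Sigma for n = 2.
def lognormal_pdf(x, mu, Sigma):
    n = len(x)
    diff = np.log(x) - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)) * np.prod(x)
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

x = np.array([1.0, 1.0])   # log x = 0, so the exponent reduces to -mu terms
mu = np.zeros(2)
Sigma = np.eye(2)
print(round(float(lognormal_pdf(x, mu, Sigma)), 6))  # 1/(2*pi) ≈ 0.159155
```

At x = (1, 1) with μ = 0 and Σ = I the exponent vanishes and the density is 1/(2π), which the code reproduces.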
In general, if two domains are different, then they may have different feature spaces or different marginal distributions.

Task: given a specific domain and label space, for each instance in the domain, predict its corresponding label.

In general, if two tasks are different, then they may have different label spaces or different conditional distributions.

For simplicity, we only consider at most two domains and two tasks.

Source domain:
Target domain:

[Figure: traditional ML -- a separate learning system trained per domain]
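When only the marginal distributions differ between source and target domains (covariate shift), one standard sketch is importance weighting: reweight each source example by target density over source density. The discrete "densities" below are made-up histograms, not from the slides.

```python
# Toy importance weighting for differing marginal distributions between
# a source and a target domain. Feature values below are made up.
source = [0, 0, 0, 1, 1, 2]   # feature values seen in the source domain
target = [1, 1, 2, 2, 2, 0]   # feature values seen in the target domain

def hist(xs):
    # empirical distribution of a small discrete feature
    return {v: xs.count(v) / len(xs) for v in set(xs)}

p_src, p_tgt = hist(source), hist(target)
weights = [p_tgt.get(x, 0.0) / p_src[x] for x in source]
print([round(w, 2) for w in weights])
```

Training a source-domain classifier with these weights makes its weighted loss mimic the target domain's distribution; values over-represented in the target (here, 2) get weight above 1.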