10 Challenging Problems in Data Mining Research
Scaling up is needed
  Ultra-high dimensional classification problems (millions or billions of features, e.g., bio data)
  Ultra-high speed data streams
Streams
  A continuous, online process; e.g., how to monitor network packets for intruders?
  Concept drift and environment drift?
RFID network and sensor network data
  How to efficiently and accurately cluster, classify and predict the trends?
Time series data used for predictions are contaminated by noise
  How to do accurate short-term and long-term predictions?
  Signal processing techniques introduce lags in the filtered data, which reduces accuracy
  Key issues: source selection, domain knowledge in rules, and optimization methods
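The lag that smoothing introduces can be seen with a simple trailing moving average (a minimal sketch in plain Python; the window size and signal are illustrative, not from the text):

```python
def moving_average(xs, window):
    """Trailing moving average: each output averages the last `window` inputs."""
    out = []
    for t in range(window - 1, len(xs)):
        out.append(sum(xs[t - window + 1 : t + 1]) / window)
    return out

# A linear ramp x_t = t makes the lag exact: the filtered value at time t
# equals t - (window - 1) / 2, i.e. the filter trails the true signal.
signal = list(range(20))
smoothed = moving_average(signal, window=5)
print(smoothed[-1])  # 17.0 at t = 19: the output lags the input by 2 steps
```

Any symmetric filter faces the same trade-off: a larger window suppresses more noise but increases the lag, which is exactly why filtered data hurt short-term prediction accuracy.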
4. Mining Complex Knowledge from Complex Data
Mining graphs
Data that are not i.i.d. (independent and identically distributed)
  Many objects are not independent of each other, and are not of a single type.
  Mine the rich structure of relations among objects, e.g., interlinked Web pages, social networks, metabolic networks in the cell
Integration of data mining and knowledge inference
The biggest gap: mining systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user.
More research on interestingness of knowledge
[Slide diagram: data mining framed as a game with Player 1 = miner; actions H/T with payoffs such as (1, -1) determine the outcome, over a pipeline of sampling, feature selection, and mining.]
How to ensure the users' privacy while their data are being mined?
How to do data mining for protection of security and privacy?
Knowledge integrity assessment
Data are intentionally modified from their original version, in order to misinform the recipients or for privacy and security
Development of measures to evaluate the knowledge integrity of a collection of
  Data
  Knowledge and patterns
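One classical way to mine aggregate patterns while protecting individual answers is randomized response (a minimal sketch; the coin bias, population size, and true rate are illustrative assumptions, not from the text):

```python
import random

def randomized_response(truth, p_truth=0.75, rng=random):
    """Report the true bit with probability p_truth, otherwise a random bit."""
    if rng.random() < p_truth:
        return truth
    return rng.randint(0, 1)

def estimate_true_rate(reports, p_truth=0.75):
    """Invert the randomization: E[report] = p_truth*pi + (1 - p_truth)*0.5."""
    p_yes = sum(reports) / len(reports)
    return (p_yes - (1 - p_truth) * 0.5) / p_truth

rng = random.Random(0)
secrets = [1 if rng.random() < 0.3 else 0 for _ in range(20000)]  # true rate ~0.3
reports = [randomized_response(s, rng=rng) for s in secrets]
print(estimate_true_rate(reports))  # close to 0.3, yet no single report
# reveals an individual's true answer with certainty
```

The miner recovers an accurate population-level estimate, while each stored record is plausibly deniable, which is the essence of the privacy question above.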
Data mining systems rely on databases to supply the raw data for input, and this raises problems in that databases tend to be dynamic, incomplete, noisy, and large. Other problems arise from the adequacy and relevance of the information stored.
A database is often designed for purposes different from data mining, and sometimes the properties or attributes that would simplify the learning task are not present, nor can they be requested from the real world. Inconclusive data cause problems because, if some attributes essential to knowledge about the application domain are not present in the data, it may be impossible to discover significant knowledge about a given domain. For example, one cannot diagnose malaria from a patient database if that database does not contain the patients' red blood cell counts.
Databases are usually contaminated by errors, so it cannot be assumed that the data they contain are entirely correct. Attributes which rely on subjective or measurement judgements can give rise to errors, such that some examples may even be mis-classified. Errors in either attribute values or class information are known as noise. Obviously, where possible, it is desirable to eliminate noise from the classification information, as this affects the overall accuracy of the generated rules.
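The effect of class noise on measured accuracy can be sketched directly: if recorded labels are flipped with probability p, even a perfect classifier appears to be only about (1 - p) accurate (illustrative parameters, plain Python):

```python
import random

rng = random.Random(42)
n, flip_prob = 10_000, 0.1

true_labels = [rng.randint(0, 1) for _ in range(n)]
# Noise: each recorded label is flipped (mis-classified) with probability 0.1.
noisy_labels = [1 - y if rng.random() < flip_prob else y for y in true_labels]

# A hypothetical perfect classifier predicts every true label correctly,
# yet is scored against the noisy recorded labels.
measured_acc = sum(p == y for p, y in zip(true_labels, noisy_labels)) / n
print(measured_acc)  # close to 0.9: noise puts a ceiling on apparent accuracy
```

This is why noise in the classification information directly limits the accuracy any rule-generation method can appear to achieve.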
Missing data can be treated by discovery systems in a number of ways.
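Two of the simplest treatments, discarding incomplete records and mean imputation, can be sketched as follows (the data are invented for illustration; real discovery systems use more sophisticated strategies):

```python
# Values of one attribute across records; None marks a missing value.
records = [4.0, None, 6.0, 5.0, None, 7.0]

# Strategy 1: discard incomplete records.
complete = [v for v in records if v is not None]

# Strategy 2: impute the attribute mean for missing values.
mean = sum(complete) / len(complete)
imputed = [mean if v is None else v for v in records]

print(complete)  # [4.0, 6.0, 5.0, 7.0]
print(imputed)   # [4.0, 5.5, 6.0, 5.0, 5.5, 7.0]
```

Discarding keeps only clean data but shrinks the sample; imputation keeps every record but can bias the discovered rules toward the average case.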
1.5.3 Uncertainty
Uncertainty refers to the severity of the error and the degree of noise in the data. Data
precision is an important consideration in a discovery system.
Databases tend to be large and dynamic, in that their contents are ever-changing as information is added, modified or removed. The problem with this from the data mining perspective is how to ensure that the rules are up-to-date and consistent with the most current information. The learning system also has to be time-sensitive, as some data values vary over time and the discovery system is affected by the 'timeliness' of the data.
Another issue is the relevance or irrelevance of the fields in the database to the current focus of discovery; for example, post codes are fundamental to any study trying to establish a geographical connection to an item of interest, such as the sales of a product.
Workshop Description
Motivation
Early work in predictive data mining did not address the complex circumstances in which models are built and
applied. It was assumed that a fixed amount of training data was available, and only a simple objective, namely predictive accuracy, was considered. Over time, it became clear that these assumptions were unrealistic and that
the economic utility of acquiring training data, building a model, and applying the model had to be considered.
The machine learning and data mining communities responded with research on active learning, which focused on
methods for cost-effective acquisition of information for the training data, and research on cost-sensitive learning,
which considered the costs and benefits associated with using the learned knowledge and how these costs and
benefits should be factored into the data mining process.
All the different stages of the data mining process are affected by economic utility. In the data acquisition phase
we have to consider the costs of obtaining training data, such as the cost of labelling additional examples or
acquiring new feature values. In applying the data mining algorithm, we have to consider the running time of the
algorithm and the costs and benefits associated with cleaning the data, transforming the data and constructing
new features. Economic utility also impacts the assessment of the decisions made based on the learned
knowledge. Simple assessment measures like predictive accuracy have given way to more complex economic
measures, including measures of profitability. These considerations can in turn impact policies for model induction.
The latter topic has received more attention in the context of cost-sensitive learning.
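The shift from accuracy to economic measures can be made concrete with a cost matrix applied to a confusion matrix (a minimal sketch; the counts and costs are invented for illustration):

```python
# Confusion counts for a binary classifier: (actual, predicted) -> count.
confusion = {("pos", "pos"): 40, ("pos", "neg"): 10,
             ("neg", "pos"): 50, ("neg", "neg"): 900}

# Economic utility per outcome: a missed positive costs far more than a
# false alarm; correct predictions carry small benefits (negative cost).
cost = {("pos", "pos"): -5.0, ("pos", "neg"): 100.0,
        ("neg", "pos"): 2.0, ("neg", "neg"): 0.0}

total = sum(confusion.values())
accuracy = (confusion[("pos", "pos")] + confusion[("neg", "neg")]) / total
expected_cost = sum(confusion[k] * cost[k] for k in confusion) / total

print(accuracy)       # 0.94: looks strong by the simple measure
print(expected_cost)  # 0.9 per decision: the costs tell a different story
```

Under an asymmetric cost matrix, a classifier with high accuracy can still be economically poor, which is precisely why profitability-style measures displace predictive accuracy.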
Goals
Almost all work that considers the impact of economic utility on data mining focuses exclusively on one of the
stages in the data mining process. Thus, economic factors have been studied in isolation, without much attention
to how they interact. This workshop will begin to remedy this deficiency by bringing together researchers who
currently consider different economic aspects in data mining, and by promoting an examination of the impact of
economic utility throughout the entire data mining process. This workshop will attempt to encourage the field to
go beyond what has been accomplished individually in the areas of active learning and cost-sensitive learning
(although both of these areas are within the scope of this workshop). In addition, existing research which has
addressed the role of economic utility in data mining has focused on predictive data mining tasks. This workshop
will begin to explore methods for incorporating economic utility considerations into both predictive and descriptive
data mining tasks.
This workshop will be geared toward researchers with an interest in how economic factors affect data mining
(e.g., researchers in cost-sensitive learning and evaluation and active learning) and practitioners who have real-world experience with how these factors influence data mining. Attendance is not limited to the paper authors, and
we strongly encourage interested researchers from related areas to attend the workshop. This will be a full-day
workshop and will include invited talks, paper presentations, short position statements and two panel discussions.
Workshop Topics
• Types of economic factors in data mining
o What economic factors arise in the context of data mining and to what stage of the data mining
process do they apply?
o What assessment metrics are used in response to these economic factors?
o Can the use of economic utility help address previously studied problems in data mining, such as
the problems of learning rare classes and learning from skewed distributions?
• Algorithms
o Utility-based approaches for information acquisition, data preprocessing, mining and knowledge
application. This includes work in active learning/sampling and cost-sensitive learning.
o This workshop will also address how predictive and descriptive data mining tasks such as
predictive modeling, clustering and link analysis can be adapted to incorporate economic utility.
• Consideration of economic utility throughout the data mining process
o Work towards a comprehensive framework for incorporating economic utility to benefit the entire
data mining process. This work includes utility-based data mining techniques which take into
account the dependencies between different phases of the data mining process to maximize the
utility of more than a single phase. For example, methods for acquiring training data which take
into account the costs of errors in addition to the cost of training data; or methods for the
extraction of predictive patterns which take into account the cost of test features necessary at
prediction time.
• Applications
o What existing data mining applications have taken economic utility into account?
o What methods do these applications use to take economic utility into consideration?
o How does economic utility and the methods for dealing with it vary according to the specific
problem addressed (e.g., by industry)?
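One utility-based acquisition policy from the topics above, active learning via uncertainty sampling, can be sketched as: spend the labelling budget on the examples the current model is least sure about (a minimal sketch with a toy probability model; all numbers are illustrative):

```python
# Unlabelled pool: each example carries the current model's estimated
# probability of the positive class.
pool = [0.1, 0.95, 0.48, 0.8, 0.52, 0.03, 0.6]

def certainty(p):
    """Distance from total uncertainty: 0 means the model is maximally unsure."""
    return abs(p - 0.5)

# Spend a labelling budget of 2 on the most uncertain examples, i.e. those
# whose predicted probability is closest to 0.5.
budget = 2
query_order = sorted(range(len(pool)), key=lambda i: certainty(pool[i]))
to_label = query_order[:budget]

print(sorted(to_label))  # [2, 4]: the borderline cases p=0.48 and p=0.52
```

Labels the model is already confident about add little information, so paying for them is poor economics; querying the borderline cases maximizes information gained per labelling dollar.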