Evaluating classification algorithms applied to data streams
Author: Ing. Esteban D. Donato
Advisor: Dr. Fazel Famili
Co-Advisor: Dra. Ana S. Haedo
Dec-2009
Maestría en Explotación de Datos y Descubrimiento del Conocimiento
Introduction
The majority of companies and organizations collect and maintain gigantic databases that grow by millions of records per day. Current algorithms for mining complex models from data cannot mine even a fraction of these data in useful time.
Concept drift: occurs when the underlying data distribution changes over time.
Objective
To perform a benchmarking analysis of several well-known algorithms applied to data streams. The algorithms chosen for this study are UFFT, CVFDT and VFDTc. The analysis focuses on aspects that every algorithm applied to data streams has to deal with.
Related work
A data stream is a sequence of data items x1, …, xi, …, xn, read one at a time in increasing order of the indices.
Off-line learning: assumes that the dataset resides in a static database and has been generated from a static distribution. It also assumes that all the data are available before training and that all the examples fit into memory.
Incremental learning: the items are time-ordered and the distribution that generates them varies over time. Systems evolve and change a concept definition as new observations are processed.
Related work (Cont.): Data Stream Mining
A subarea of incremental learning, where data accumulate faster than they can be mined. An algorithm in this setting:
Must require small constant time per record.
Must use only a fixed amount of main memory.
Must be able to build a model using at most one scan of the data.
Must make a usable model available at any point in time.
Ideally, should produce a model equivalent to the one that would be obtained by the corresponding ordinary database mining algorithm, and the model should be up to date at any time.
Types of algorithms: rule sets, induction trees and ensemble methods.
Related work (Cont.): Very Fast Decision Tree (VFDT)
Requires each example to be read only once, and a small constant time to process it.
Building process: given a stream of examples, the first ones are used to choose the root, and the following examples are passed down to the corresponding leaves.
The Hoeffding bound is used to decide how many examples are needed at each node: with probability 1 - φ, the true mean of a random variable of range R is at least r̄ - ε after n independent observations, where ε = sqrt(R² ln(1/φ) / (2n)).
Let ΔG = G(Xa) - G(Xb) >= 0 be the observed difference in the evaluation measure between the two best attributes. If ΔG > ε, then the true difference satisfies ΔG - ε > 0 with probability 1 - φ, so Xa can safely be chosen for the split.
Other features: pre-pruning, different evaluation measures, ties, memory, poor attributes, initialization, rescans.
Drawback: it does not detect concept drift.
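The split decision can be sketched as follows (a minimal illustration; the helper names and the confidence value φ = 1e-6 are assumptions, not VFDT's actual code):

```python
import math

def hoeffding_bound(R, phi, n):
    """With probability 1 - phi, the true mean of a random variable of
    range R differs from its sample mean over n observations by at most
    this value."""
    return math.sqrt((R ** 2) * math.log(1.0 / phi) / (2.0 * n))

def should_split(g_best, g_second_best, n, phi=1e-6, R=1.0):
    """Split when the observed gain difference between the two best
    attributes exceeds the bound (for information gain, R = log2 of the
    number of classes; R = 1.0 for two classes)."""
    return (g_best - g_second_best) > hoeffding_bound(R, phi, n)

print(round(hoeffding_bound(1.0, 1e-6, 1000), 4))  # → 0.0831
```

With 1000 examples at a leaf, a gain difference of about 0.08 is already enough to commit to the split, which is why the tree can grow from a single scan of the stream.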
Related work (Cont.): Concept Drift
A change in the target concept that depends on some hidden context, not given explicitly in the form of predictive features. Examples: weather prediction, customers' buying preferences, etc.
A concept drift handling system should be able to:
Quickly adapt to concept drift.
Be robust to noise and distinguish it from concept drift.
Recognize and treat recurring contexts.
Types: sudden, gradual, frequent and virtual concept drift.
Conclusion of literature review
A data stream is a sequence of time-ordered items, arriving faster than the time needed to mine them. Changes in the underlying data distribution may occur, requiring the algorithms to detect and adapt to them; the main challenge in incremental learning is how to detect and adapt to a concept drift. To deal with data arriving fast, the algorithms must require a small constant processing time per record. One of the first algorithms developed was VFDT, which uses the Hoeffding bound. In concept drift, a difficult problem is to distinguish between a true concept drift and noise.
Algorithm: VFDTc (Very Fast Decision Tree for Continuous attributes)
Extension of VFDT in three directions: continuous data, functional leaves, and concept drift.
For a continuous attribute, the split-test is a condition of the form attri <= cut_point; information gain is used to select the cut_point.
Functional tree leaves: an innovative aspect of this algorithm is its ability to use naive Bayes classifiers at the tree leaves. A leaf must see nmin examples before computing the evaluation function.
Concept drift handling is based on the assumption that, whatever the cause of the drift, the decision surface moves. Two methods are supported: Drift Detection based on Error Estimates (EE/EBP) and Drift Detection based on Affinity Coefficient (AC).
Reacting to drift: the method pushes all the information of the descendant leaves up to the node; this acts as a forgetting mechanism.
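A minimal sketch of a functional tree leaf with naive Bayes prediction, assuming discrete attribute values for simplicity (the class and method names are illustrative; VFDTc also handles continuous attributes):

```python
from collections import Counter

class FunctionalLeaf:
    """A leaf that, instead of predicting the majority class, keeps
    per-class counts of attribute values and classifies with naive Bayes."""
    def __init__(self):
        self.class_counts = Counter()
        self.value_counts = Counter()  # (class, attr_index, value) -> count

    def learn(self, x, label):
        self.class_counts[label] += 1
        for i, v in enumerate(x):
            self.value_counts[(label, i, v)] += 1

    def predict(self, x):
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for c, n_c in self.class_counts.items():
            score = n_c / total  # class prior P(c)
            for i, v in enumerate(x):
                # Laplace-smoothed conditional probability P(value | class)
                score *= (self.value_counts[(c, i, v)] + 1) / (n_c + 2)
            if score > best_score:
                best, best_score = c, score
        return best
```

The payoff is that a leaf becomes useful long before it has seen enough examples to justify a split.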
Algorithm: UFFT (Ultra Fast Forest Tree)
Generates a forest of binary trees and processes each example in constant time.
It uses analytical techniques to choose the splitting criteria, and information gain to estimate the merit of each possible splitting test.
It maintains a short-term memory for initializing the leaves.
To expand a leaf node: the information gain must be positive and have statistical support.
Functional leaves.
Concept drift detection: the error rate of a naive Bayes classifier is calculated at each node. The error follows a binomial distribution, and two confidence interval levels are used: warning and drift.
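The warning/drift scheme on a binomial error rate can be sketched as follows, assuming the common two- and three-standard-deviation thresholds (UFFT's exact constants and bookkeeping may differ):

```python
import math

class DriftDetector:
    """Tracks the running error rate of a classifier and flags 'warning'
    and 'drift' when the rate rises significantly above its minimum."""
    def __init__(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, misclassified):
        self.n += 1
        self.errors += int(misclassified)
        p = self.errors / self.n             # current error rate
        s = math.sqrt(p * (1 - p) / self.n)  # std. dev. of a binomial rate
        if p + s < self.p_min + self.s_min:  # remember the best point seen
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + 3 * self.s_min:
            return "drift"                   # error clearly above its minimum
        if p + s > self.p_min + 2 * self.s_min:
            return "warning"
        return "in-control"
```

Feeding it a stable ~10% error stream followed by a burst of errors triggers "drift" within a handful of examples after the change.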
Algorithm: CVFDT (Concept-adapting Very Fast Decision Tree)
Extension of VFDT with support for concept drift.
It works by keeping its model consistent with a sliding window of examples, updating only the sufficient statistics as examples enter and leave the window.
It uses information gain for selecting the best attribute; when a different attribute becomes the best, it grows an alternative subtree with the new best attribute at its root.
It periodically scans the tree (HT) and all alternate trees, looking for internal nodes whose alternate subtrees are performing better than the current nodes, and replaces them.
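The sliding-window bookkeeping can be sketched as follows (a toy version that tracks class counts per attribute value; CVFDT maintains richer sufficient statistics at every node of the tree):

```python
from collections import deque, Counter

class SlidingWindowStats:
    """Statistics are incremented when an example enters the window and
    decremented when the oldest example leaves, so the counts always
    reflect only the most recent `size` examples."""
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.counts = Counter()  # (attribute_value, label) -> count

    def add(self, attr_value, label):
        self.window.append((attr_value, label))
        self.counts[(attr_value, label)] += 1
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.counts[old] -= 1  # forget the example that slid out
```

Decrementing rather than rebuilding is what keeps the per-example cost constant even as the window forgets old data.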
Performance measures
Capacity to detect and respond to concept drift
Capacity to detect and respond to virtual concept drift
Capacity to detect and respond to recurring concept drift
Capacity to adapt to sudden concept drift
Capacity to adapt to gradual concept drift
Capacity to adapt to frequent concept drift
Accuracy of the classification task
Capacity to deal with outliers
Capacity to deal with noisy data
Speed (time taken to process an item in the stream)
Data sets generated
Data sets based on a moving hyperplane: a hyperplane in the d-dimensional space [0, 1]^d is denoted by Σ a_i x_i = a_0, and moving the coefficients over time shifts the decision surface.
Generated with the MOA (Massive Online Analysis) tool: http://sourceforge.net/projects/moa-datastream/ (released under the GNU license; free and open source).
Current configurable attributes: instanceRandomSeed, numClasses, numAtts, numDriftAtts, magChange, noisePercentage, sigmaPercentage.
New configurable attributes: driftFreq, driftTran, outlierPercentage, distributionPercentage.
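A minimal sketch of the moving-hyperplane idea (the function and parameter names are illustrative, not MOA's; a point is labelled 1 when the weighted sum of its coordinates reaches half the total weight mass, and drifting the weights moves the decision surface):

```python
import random

def hyperplane_stream(n, d=2, drift_per_example=0.0, seed=1):
    """Yield n labelled points from [0, 1]^d; each step the hyperplane
    weights may drift, producing concept drift in the label function."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(d)]
    w0 = 0.5 * sum(w)  # threshold: half the weight mass
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        label = int(sum(wi * xi for wi, xi in zip(w, x)) >= w0)
        yield x, label
        # random-walk the weights to move the decision surface
        w = [wi + drift_per_example * rng.choice((-1, 1)) for wi in w]
        w0 = 0.5 * sum(w)
```

Setting `drift_per_example=0.0` reproduces the stationary case; larger values give gradual-to-frequent drift.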
Data sets generated
Dataset with no concept drift, outliers or noise
Dataset with 10% of noisy data
Dataset with 1% of outliers
Dataset with 3 concept drifts
Results: Capacity to detect and respond to concept drift
Results: Capacity to detect and respond to virtual concept drift
Results: Capacity to detect and respond to recurring concept drift
Results: Capacity to adapt to sudden concept drift
Results: Capacity to adapt to gradual concept drift
Results: Capacity to adapt to frequent concept drift
Results: Accuracy of the classification task (measures derived from the confusion matrix)

VFDTc (CA):
                 Predicted Class 1   Predicted Class 2
Actual Class 1   44.5% (887)         5.5% (109)
Actual Class 2   5% (101)            45% (903)

VFDTc (EBP):
                 Predicted Class 1   Predicted Class 2
Actual Class 1   39% (777)           11% (219)
Actual Class 2   9% (173)            41% (831)

UFFT:
                 Predicted Class 1   Predicted Class 2
Actual Class 1   46% (928)           3.5% (68)
Actual Class 2   2.5% (48)           48% (956)

CVFDT:
                 Predicted Class 1   Predicted Class 2
Actual Class 1   34.5% (685)         15.5% (311)
Actual Class 2   15.5% (312)         34.5% (692)

              Accuracy (AC)  True positive (TP)  False positive (FP)  True negative (TN)  False negative (FN)  Precision (P)
VFDTc (CA)    0.89           0.89                0.10                 0.90                0.11                 0.90
VFDTc (EBP)   0.80           0.78                0.17                 0.83                0.22                 0.82
UFFT          0.94           0.93                0.05                 0.95                0.07                 0.95
CVFDT         0.69           0.69                0.31                 0.69                0.31                 0.69
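As a check, the reported measures can be derived from a confusion matrix like so (treating Class 1 as the positive class; the function name is illustrative):

```python
def metrics(tp, fn, fp, tn):
    """Derive the standard measures from a 2-class confusion matrix:
    tp/fn are Class 1 examples predicted as Class 1/Class 2,
    fp/tn are Class 2 examples predicted as Class 1/Class 2."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "tp_rate": tp / (tp + fn),   # true positive rate (recall)
        "fp_rate": fp / (fp + tn),
        "tn_rate": tn / (fp + tn),
        "fn_rate": fn / (tp + fn),
        "precision": tp / (tp + fp),
    }

m = metrics(tp=887, fn=109, fp=101, tn=903)  # the VFDTc (CA) matrix
print({k: round(v, 2) for k, v in m.items()})
```

For the VFDTc (CA) matrix this reproduces the first row of the measures table to two decimals (accuracy 0.895, precision about 0.90).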
Results: Dealing with outliers
Results: Dealing with noisy data
Results: Speed (time taken to process an item in the stream)
Conclusions & future work
Data can be generated very fast, which gives us a new and challenging setting for developing data mining algorithms: we have to develop them bearing in mind that the training phase never ends.
Changes in the data distribution are another challenging scenario that data stream mining has to deal with.
VFDT was one of the first data stream mining algorithms developed; it implemented the Hoeffding bound.
We generated different datasets using the moving hyperplane algorithm.
UFFT is suited for short-term predictions; CVFDT for long-term solutions.
Virtual concept drift and recurring concept drift showed no impact on the algorithms.
Conclusions & future work
VFDTc (CA) is not suitable for gradual or sudden concept drift.
Neither VFDTc (CA) nor UFFT is suitable for frequent concept drift.
VFDTc (EBP) and CVFDT are the best choices for data streams with outliers.
CVFDT is the best choice for data streams with noisy points.
CVFDT and UFFT are the fastest algorithms.
Future work:
Clustering algorithms applied to data streams.
Classification algorithms applied to data streams of unstructured data (text, images, etc.).
Questions? E-mail: [email_address] Twitter: @eddonato
