Evaluating classification algorithms applied to data streams
Author: Ing. Esteban D. Donato
Advisor: Dr. Fazel Famili
Co-Advisor: Dra. Ana S. Haedo
Dec-2009
Maestría en Explotación de Datos y Descubrimiento del Conocimiento
Introduction
The majority of companies and organizations collect and maintain gigantic databases that grow by millions of records per day. Current algorithms for mining complex models from data cannot mine even a fraction of these data in useful time.
Concept drift: occurs when the underlying data distribution changes over time.
Objective
To perform a benchmarking analysis of several well-known algorithms applied to data streams. The algorithms chosen for this study are UFFT, CVFDT and VFDTc. The analysis focuses on aspects that every algorithm applied to data streams has to deal with.
Related work
A data stream is a sequence of data items x1, …, xi, …, xn, read one at a time in increasing order of the indices.
Off-line learning: assumes that the dataset resides in a static database and has been generated from a static distribution. It also assumes that all the data are available before training and that all the examples fit into memory.
Incremental learning: the items are time-ordered and the distribution that generates them varies over time. Systems evolve and change a concept definition as new observations are processed.
Related work (Cont.): Data Stream Mining
A subarea of incremental learning, where data accumulate faster than they can be mined. An algorithm in this setting:
Must require small constant time per record.
Must use only a fixed amount of main memory.
Must be able to build a model using at most one scan of the data.
Must make a usable model available at any point in time.
Ideally, should produce a model equivalent to the one that would be obtained by the corresponding ordinary database mining algorithm, and the model should be up to date at any time.
Types of algorithms: rule sets, induction trees and ensemble methods.
Related work (Cont.): Very Fast Decision Tree (VFDT)
Requires each example to be read only once, and a small constant time to process it.
Building process: given a stream of examples, the first ones are used to choose the root, and the following examples are passed down to the corresponding leaves.
The Hoeffding bound is used to decide how many examples are needed at each node: with probability 1 - φ, the true mean of a random variable of range R is at least r̄ - ε after n independent observations, where ε = sqrt(R² ln(1/φ) / (2n)).
Let ΔG = G(Xa) - G(Xb) >= 0 be the observed difference in the evaluation measure between the two best attributes. If ΔG > ε, then the true difference satisfies ΔG - ε > 0 with probability 1 - φ, so Xa can safely be chosen for the split.
Other features: pre-pruning, different evaluation measures, ties, memory, poor attributes, initialization, rescans.
Drawback: it does not detect concept drift.
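The split decision can be sketched as follows (a minimal illustration; the helper names and the confidence value φ = 1e-6 are assumptions, not VFDT's actual code):

```python
import math

def hoeffding_bound(R, phi, n):
    """With probability 1 - phi, the true mean of a random variable of
    range R differs from its sample mean over n observations by at most
    this value."""
    return math.sqrt((R ** 2) * math.log(1.0 / phi) / (2.0 * n))

def should_split(g_best, g_second_best, n, phi=1e-6, R=1.0):
    """Split when the observed gain difference between the two best
    attributes exceeds the bound (for information gain, R = log2 of the
    number of classes; R = 1.0 for two classes)."""
    return (g_best - g_second_best) > hoeffding_bound(R, phi, n)

print(round(hoeffding_bound(1.0, 1e-6, 1000), 4))  # → 0.0831
```

With 1000 examples at a leaf, a gain difference of about 0.08 is already enough to commit to the split, which is why the tree can grow from a single scan of the stream.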
Related work (Cont.): Concept Drift
A change in the target concept that depends on some hidden context, not given explicitly in the form of predictive features. Examples: weather prediction, customers' buying preferences, etc.
A concept drift handling system should be able to:
Quickly adapt to concept drift.
Be robust to noise and distinguish it from concept drift.
Recognize and treat recurring contexts.
Types: sudden, gradual, frequent and virtual concept drift.
Conclusion of literature review
A data stream is a sequence of time-ordered items, arriving faster than the time needed to mine them. Changes in the underlying data distribution may occur, requiring the algorithms to detect and adapt to them; the main challenge in incremental learning is how to detect and adapt to a concept drift. To deal with data arriving fast, the algorithms must require a small constant processing time per record. One of the first algorithms developed was VFDT, which uses the Hoeffding bound. In concept drift, a difficult problem is to distinguish between a true concept drift and noise.
Algorithm: VFDTc (Very Fast Decision Tree for Continuous attributes)
Extension of VFDT in three directions: continuous data, functional leaves, and concept drift.
For a continuous attribute, the split-test is a condition of the form attri <= cut_point; information gain is used to select the cut_point.
Functional tree leaves: an innovative aspect of this algorithm is its ability to use naive Bayes classifiers at the tree leaves. A leaf must see nmin examples before computing the evaluation function.
Concept drift handling is based on the assumption that, whatever the cause of the drift, the decision surface moves. Two methods are supported: Drift Detection based on Error Estimates (EE/EBP) and Drift Detection based on Affinity Coefficient (AC).
Reacting to drift: the method pushes all the information of the descendant leaves up to the node; this acts as a forgetting mechanism.
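A minimal sketch of a functional tree leaf with naive Bayes prediction, assuming discrete attribute values for simplicity (the class and method names are illustrative; VFDTc also handles continuous attributes):

```python
from collections import Counter

class FunctionalLeaf:
    """A leaf that, instead of predicting the majority class, keeps
    per-class counts of attribute values and classifies with naive Bayes."""
    def __init__(self):
        self.class_counts = Counter()
        self.value_counts = Counter()  # (class, attr_index, value) -> count

    def learn(self, x, label):
        self.class_counts[label] += 1
        for i, v in enumerate(x):
            self.value_counts[(label, i, v)] += 1

    def predict(self, x):
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for c, n_c in self.class_counts.items():
            score = n_c / total  # class prior P(c)
            for i, v in enumerate(x):
                # Laplace-smoothed conditional probability P(value | class)
                score *= (self.value_counts[(c, i, v)] + 1) / (n_c + 2)
            if score > best_score:
                best, best_score = c, score
        return best
```

The payoff is that a leaf becomes useful long before it has seen enough examples to justify a split.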
Algorithm: UFFT (Ultra Fast Forest Tree)
Generates a forest of binary trees and processes each example in constant time.
It uses analytical techniques to choose the splitting criteria, and information gain to estimate the merit of each possible splitting test.
It maintains a short-term memory for initializing the leaves.
To expand a leaf node: the information gain must be positive and have statistical support.
Functional leaves.
Concept drift detection: the error rate of a naive Bayes classifier is calculated at each node. The error follows a binomial distribution, and two confidence interval levels are used: warning and drift.
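The warning/drift scheme on a binomial error rate can be sketched as follows, assuming the common two- and three-standard-deviation thresholds (UFFT's exact constants and bookkeeping may differ):

```python
import math

class DriftDetector:
    """Tracks the running error rate of a classifier and flags 'warning'
    and 'drift' when the rate rises significantly above its minimum."""
    def __init__(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, misclassified):
        self.n += 1
        self.errors += int(misclassified)
        p = self.errors / self.n             # current error rate
        s = math.sqrt(p * (1 - p) / self.n)  # std. dev. of a binomial rate
        if p + s < self.p_min + self.s_min:  # remember the best point seen
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + 3 * self.s_min:
            return "drift"                   # error clearly above its minimum
        if p + s > self.p_min + 2 * self.s_min:
            return "warning"
        return "in-control"
```

Feeding it a stable ~10% error stream followed by a burst of errors triggers "drift" within a handful of examples after the change.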
Algorithm: CVFDT (Concept-adapting Very Fast Decision Tree)
Extension of VFDT with support for concept drift.
It works by keeping its model consistent with a sliding window of examples, updating only the sufficient statistics as examples enter and leave the window.
It uses information gain for selecting the best attribute; when a different attribute becomes the best, it grows an alternative subtree with the new best attribute at its root.
It periodically scans the tree (HT) and all alternate trees, looking for internal nodes whose alternate subtrees are performing better than the current nodes, and replaces them.
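The sliding-window bookkeeping can be sketched as follows (a toy version that tracks class counts per attribute value; CVFDT maintains richer sufficient statistics at every node of the tree):

```python
from collections import deque, Counter

class SlidingWindowStats:
    """Statistics are incremented when an example enters the window and
    decremented when the oldest example leaves, so the counts always
    reflect only the most recent `size` examples."""
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.counts = Counter()  # (attribute_value, label) -> count

    def add(self, attr_value, label):
        self.window.append((attr_value, label))
        self.counts[(attr_value, label)] += 1
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.counts[old] -= 1  # forget the example that slid out
```

Decrementing rather than rebuilding is what keeps the per-example cost constant even as the window forgets old data.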
Performance measures
Capacity to detect and respond to concept drift
Capacity to detect and respond to virtual concept drift
Capacity to detect and respond to recurring concept drift
Capacity to adapt to sudden concept drift
Capacity to adapt to gradual concept drift
Capacity to adapt to frequent concept drift
Accuracy of the classification task
Capacity to deal with outliers
Capacity to deal with noisy data
Speed (time taken to process an item in the stream)
Data sets generated
Data sets based on a moving hyperplane: a hyperplane in the d-dimensional space [0, 1]^d is denoted by Σ a_i x_i = a_0, and moving the coefficients over time shifts the decision surface.
Generated with the MOA (Massive Online Analysis) tool: http://sourceforge.net/projects/moa-datastream/ (released under the GNU license; free and open source).
Current configurable attributes: instanceRandomSeed, numClasses, numAtts, numDriftAtts, magChange, noisePercentage, sigmaPercentage.
New configurable attributes: driftFreq, driftTran, outlierPercentage, distributionPercentage.
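A minimal sketch of the moving-hyperplane idea (the function and parameter names are illustrative, not MOA's; a point is labelled 1 when the weighted sum of its coordinates reaches half the total weight mass, and drifting the weights moves the decision surface):

```python
import random

def hyperplane_stream(n, d=2, drift_per_example=0.0, seed=1):
    """Yield n labelled points from [0, 1]^d; each step the hyperplane
    weights may drift, producing concept drift in the label function."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(d)]
    w0 = 0.5 * sum(w)  # threshold: half the weight mass
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        label = int(sum(wi * xi for wi, xi in zip(w, x)) >= w0)
        yield x, label
        # random-walk the weights to move the decision surface
        w = [wi + drift_per_example * rng.choice((-1, 1)) for wi in w]
        w0 = 0.5 * sum(w)
```

Setting `drift_per_example=0.0` reproduces the stationary case; larger values give gradual-to-frequent drift.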
Data sets generated
Dataset with no concept drift, outliers or noise
Dataset with 10% of noisy data
Dataset with 1% of outliers
Dataset with 3 concept drifts
Results: Capacity to detect and respond to concept drift
Results: Capacity to detect and respond to virtual concept drift
Results: Capacity to detect and respond to recurring concept drift
Results: Capacity to adapt to sudden concept drift
Results: Capacity to adapt to gradual concept drift
Results: Capacity to adapt to frequent concept drift
Results: Accuracy of the classification task (measures derived from the confusion matrix)

VFDTc (CA):
                 Predicted Class 1   Predicted Class 2
Actual Class 1   44.5% (887)         5.5% (109)
Actual Class 2   5% (101)            45% (903)

VFDTc (EBP):
                 Predicted Class 1   Predicted Class 2
Actual Class 1   39% (777)           11% (219)
Actual Class 2   9% (173)            41% (831)

UFFT:
                 Predicted Class 1   Predicted Class 2
Actual Class 1   46% (928)           3.5% (68)
Actual Class 2   2.5% (48)           48% (956)

CVFDT:
                 Predicted Class 1   Predicted Class 2
Actual Class 1   34.5% (685)         15.5% (311)
Actual Class 2   15.5% (312)         34.5% (692)

              Accuracy (AC)  True positive (TP)  False positive (FP)  True negative (TN)  False negative (FN)  Precision (P)
VFDTc (CA)    0.89           0.89                0.10                 0.90                0.11                 0.90
VFDTc (EBP)   0.80           0.78                0.17                 0.83                0.22                 0.82
UFFT          0.94           0.93                0.05                 0.95                0.07                 0.95
CVFDT         0.69           0.69                0.31                 0.69                0.31                 0.69
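As a check, the reported measures can be derived from a confusion matrix like so (treating Class 1 as the positive class; the function name is illustrative):

```python
def metrics(tp, fn, fp, tn):
    """Derive the standard measures from a 2-class confusion matrix:
    tp/fn are Class 1 examples predicted as Class 1/Class 2,
    fp/tn are Class 2 examples predicted as Class 1/Class 2."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "tp_rate": tp / (tp + fn),   # true positive rate (recall)
        "fp_rate": fp / (fp + tn),
        "tn_rate": tn / (fp + tn),
        "fn_rate": fn / (tp + fn),
        "precision": tp / (tp + fp),
    }

m = metrics(tp=887, fn=109, fp=101, tn=903)  # the VFDTc (CA) matrix
print({k: round(v, 2) for k, v in m.items()})
```

For the VFDTc (CA) matrix this reproduces the first row of the measures table to two decimals (accuracy 0.895, precision about 0.90).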
Results: Dealing with outliers
Results: Dealing with noisy data
Results: Speed (time taken to process an item in the stream)
Conclusions & future work
Data can be generated very fast, which gives us a new and challenging setting for developing data mining algorithms: we have to develop them bearing in mind that the training phase never ends.
Changes in the data distribution are another challenging scenario that data stream mining has to deal with.
VFDT was one of the first data stream mining algorithms developed; it implemented the Hoeffding bound.
We generated different datasets using the moving hyperplane algorithm.
UFFT is suited for short-term predictions; CVFDT for long-term solutions.
Virtual concept drift and recurring concept drift showed no impact on the algorithms.
Conclusions & future work
VFDTc (CA) is not suitable for gradual or sudden concept drift.
Neither VFDTc (CA) nor UFFT is suitable for frequent concept drift.
VFDTc (EBP) and CVFDT are the best choices for data streams with outliers.
CVFDT is the best choice for data streams with noisy points.
CVFDT and UFFT are the fastest algorithms.
Future work:
Clustering algorithms applied to data streams.
Classification algorithms applied to data streams of unstructured data (text, images, etc.).
Questions? E-mail: [email_address] Twitter: @eddonato
