International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 4, Issue 3 (2016) ISSN 2320–4028 (Online)
Image Classification Using Naïve Bayes Classifier
Dong-Chul Park
Dong-Chul Park is with the Department of Electronics Engineering, Myong Ji University, Yong In, Gyeonggi-do, 449-728, South Korea (e-mail: [email protected]).

Abstract—An image classification scheme using the Naïve Bayes Classifier is proposed in this paper. The proposed Naïve Bayes Classifier-based image classifier can be considered a maximum a posteriori decision rule. The Naïve Bayes Classifier can produce very accurate classification results with a minimum of training time when compared to conventional supervised or unsupervised learning algorithms. Comprehensive experiments on pattern classification tasks over an image dataset are performed in order to evaluate the performance of the proposed classifier. The results show that the proposed Naïve Bayes Classifier outperforms conventional classifiers in terms of training speed and classification accuracy.

Keywords—Bayes classifier, image classification, DCT feature, neural networks.

I. INTRODUCTION

The image classification task is one of the ongoing important topics in various computer vision tasks. A rapid increase in the size of data in various areas has been witnessed recently. Automatic processing of these contents requires efficient pattern classification techniques. In general, automatic data classification tasks, including image retrieval tasks, require two critical processes: an appropriate feature extraction process and an accurate classifier design process. For image classification tasks, a feature extraction process can be considered the basis of content-based image retrieval. Features can be classified into two groups: general features and domain-related features. General features include colors and textures, and domain-related features include faces, fingerprints, and human irises. Among these available features, we never know which one, or which combination of features, is suitable for characterizing an image and distinguishing it from the other images. The best strategy may be to utilize all of the available features, provided there exists a classifier that can use all of them efficiently. With the use of various methods, different features are extracted for different reasons. Nevertheless, all the features can help to describe objects more precisely. In most cases, more features facilitate a more accurate classification. Note that each feature has different properties, including magnitude. When different features are used for describing an object, different magnitudes may cause a problem because each feature is independent of the others. In the case of a set of combined features that consists of different individual features for a classification problem, each feature is usually normalized first. A normalization process is required for using the different available features. However, there is no justification for forcing all the different individual features to have the same minimum and maximum magnitudes in the normalization process. Therefore, it is necessary to devise a more effective approach to deal with different individual features.

Thus far, various approaches, including ensemble classifiers, have been proposed to utilize different feature vectors [1, 2]. A classifier model called the partitioned feature-based classifier (PFC), which efficiently uses a variety of available features extracted through various tools and enhances the classification performance, has also been proposed [3]. In PFC, all the available features are grouped into several groups, where each group has homogeneous features and forms a feature vector. Each feature vector in PFC is used separately in an independent classifier and can preserve the properties of the individual features in the same group. Each local classifier is independently trained with a specific group of features. Since each feature group is used only for a local classifier in the training stage, features from different groups do not interfere with each other. In PFC, each local classifier achieves a specific classification accuracy during the training stage. The accuracy of each local classifier using a group of features is then used as a weight for that local classifier when making decisions at the test stage. The PFC model demonstrates that the classification accuracy of conventional clustering algorithms can be significantly improved when the PFC model is used with them [3]. For the given data, however, the PFC model draws the classification results on the basis of each local classifier's overall performance, irrespective of how each classifier performs on each class. If we know from the training dataset that a certain trained local classifier classifies a certain class of data very well while it gives very inaccurate classification results on another class of data, then it may be reasonable to make the final decision according to the classification results from that local classifier.

In order to improve the classification accuracy, the classifier integration model (CIM) was proposed as a fusion method for multiple classifiers [4]. As a classifier fusion algorithm, the individual local classifiers in CIM are applied in parallel and their outputs are combined in a certain manner to reach an optimal decision. As a multiple-sensor data fusion method, the CIM combines feature data from multiple sensors or multiple feature extraction methods to achieve improved accuracies and more specific inferences as compared to those that could be achieved by the use of a single sensor or feature alone.
In this study, in order to achieve higher classification accuracy and higher training speed than conventional learning algorithms, a Naive Bayes algorithm [9] based on a stochastic process is adopted for the local classifiers. Since the Naive Bayes algorithm does not require any iterative procedure for its training process, its training process is quite simple and requires only a small amount of training data to estimate the parameters while achieving very competitive classification accuracy.

The remainder of this paper is organized as follows: Section 2 summarizes the Naive Bayes algorithm adopted for the local classifiers in this study. The proposed Naive Bayes classifier (NBC) is applied to an image classification problem in order to evaluate the performance of the proposed algorithm in terms of training speed and classification accuracy in Section 3. The concluding remarks are presented in Section 4.

II. NAIVE BAYES CLASSIFIER

The Naive Bayes classifier is based on Bayes' theorem of probability [1]. In Bayes' theorem, the conditional probability that an event belongs to a class can be calculated from the conditional probabilities of finding particular events in each class and the unconditional probability of the event in each class. That is, for given data, X, and classes, C_i, where X denotes a random variable, the conditional probability that an event belongs to a class can be calculated by using the following equation:

    P(C_i | X) = P(X | C_i) P(C_i) / P(X)    (1)

Equation (1) shows that the calculation of P(C_i | X) is a pattern classification problem, since it finds the probability that the given data belong to class C_i, and we can decide the optimum class by choosing the class with the highest probability among all possible classes, C_i, which minimizes the classification error. For doing so, we need to estimate P(X | C_i) and assume that any particular value of the vector X conditional on C_i is statistically independent in each dimension, so that it can be written as follows:

    P(X | C_i) = ∏_{k=1}^{n} P(x_k | C_i)    (2)

where X is an n-dimensional data vector (x_1, x_2, ..., x_n). The Naive Bayes classifier is based on equation (2) and assumes that each feature is statistically independent [2]. This assumption results in a lower calculation cost and efficient data processing. By combining equation (1) and equation (2), the Naive Bayes classifier can be summarized by the following equation:

    C* = argmax_i P(C_i) ∏_{k=1}^{n} P(x_k | C_i)    (3)

where the denominator P(X) is omitted since its value is the same for all classes.

The Naive Bayes classifier is often referred to as the maximum a posteriori (MAP) decision rule. Note that the assumption of statistical independence of each feature sometimes does not hold and causes problems in some practical cases [3]. However, various applications and experimental studies show that training schemes based on the MAP decision rule with the Naive Bayes assumptions yield an optimal classifier even when the assumption does not hold.

The centroid neural network (CNN) [4] is also utilized as a classifier in this paper. The CNN is an unsupervised competitive learning algorithm based on the classical k-means clustering algorithm. It updates the centroids of the clusters at each presentation of a data vector. The CNN first introduces the definitions of the winner neuron and the loser neuron. When a data vector x_i is given to the network at epoch (k), the winner neuron at epoch (k) is the neuron with the minimum distance to x_i. The loser neuron at epoch (k) is the neuron that was the winner for x_i at epoch (k-1) but is not the winner at epoch (k). The CNN updates its weights only when the status of the output neuron for the presented data has changed when compared with the status from the previous epoch.

When an input vector x is presented to the network at epoch n, the weight update equations for winner neuron j and loser neuron i in the CNN can be summarized as follows [4]:

    w_j(n + 1) = w_j(n) + (1 / (N_j + 1)) [x(n) − w_j(n)]
    w_i(n + 1) = w_i(n) − (1 / (N_i − 1)) [x(n) − w_i(n)]    (4)

where w_j(n) and w_i(n) represent the weight vectors of the winner neuron and the loser neuron at iteration n, respectively, and N_j and N_i denote the numbers of data vectors currently assigned to the corresponding clusters.

The CNN has several advantages over conventional algorithms such as the SOM or the k-means algorithm when used for clustering and unsupervised competitive learning. The CNN requires neither a predetermined schedule for the learning gain nor the total number of iterations for clustering. It always converges to sub-optimal solutions, while conventional algorithms such as the SOM may give unstable results depending on the initial learning gains and the total number of iterations. Note that the CNN was designed for deterministic data because the distance measure used in the CNN is the quadratic (Euclidean) distance. A more detailed description of the CNN can be found in [4].
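To make the update rule of equation (4) concrete, the following is a minimal numpy sketch of one CNN training epoch. The function name, the array layout, and the guard against an empty loser cluster are choices made here for illustration rather than details taken from [4].

    import numpy as np

    def cnn_epoch(X, w, counts, assign):
        """One CNN epoch: re-finds the winner of every data vector and applies the
        winner/loser updates of equation (4) only when the winner has changed."""
        changed = 0
        for t, x in enumerate(X):
            j = int(np.argmin(np.linalg.norm(w - x, axis=1)))  # winner at this epoch
            i = assign[t]                                       # winner at the previous epoch
            if i == j:
                continue                                        # status unchanged -> no update
            w[j] += (x - w[j]) / (counts[j] + 1.0)              # winner gains the vector
            counts[j] += 1
            if i >= 0:                                          # loser releases the vector
                w[i] -= (x - w[i]) / max(counts[i] - 1.0, 1.0)
                counts[i] -= 1
            assign[t] = j
            changed += 1
        return changed

    # Training repeats cnn_epoch until it returns 0, i.e. until no cluster
    # membership changes, which is the stopping condition described above.

In this sketch w holds one centroid per row, counts holds the current cluster sizes, and assign (initialized to -1) records the previous winner of each data vector.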
Conventional classifiers calculate a local classification decision probability and use that information as parameters for making a global classification decision. However, the proposed NBC calculates the classification probability by using equation (1) directly when the model adopts a Naive Bayes classifier as its local classifier. When the feature value is continuous, the proposed NBC estimates the probability that a feature vector component belongs to a class with the following probability density function:

    P(x_k | C_i) = (1 / (√(2π) σ_ik)) exp( −(x_k − μ_ik)² / (2 σ_ik²) )    (5)

where the probability density function is formed during the training stage of the local classifiers with the mean, μ_ik, and standard deviation, σ_ik, of the i-th class data for each feature vector component x_k. Note that P(X | C_i) can be calculated by using equation (2) with each dimension treated independently because the NBC adopts the Naive Bayes classifier as its local classifiers.

During the training procedure, the probability density function P(x_k | C_i) for each dimension of the feature vector of each local classifier is first calculated, and P(X | C_i) for each local classifier is found by using equation (2). Once the probability density function is found for each local classifier, the training procedure for the global classifier is terminated. In the decision-making procedure for given data, the feature vectors are passed through the corresponding local classifiers, and the class for the given data is found by the following equation:

    C* = argmax_i (1/N) Σ_{j=1}^{N} P_j(C_i) ∏_k P_j(x_k | C_i)    (6)

where N is the number of local classifiers and P_j denotes the probabilities estimated by the j-th local classifier.
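As an illustration of equations (1)-(3) and (5), the following is a minimal Gaussian Naive Bayes sketch in numpy. The function names, the small variance floor, and the log-domain formulation are conveniences assumed here for numerical stability and are not part of the original formulation.

    import numpy as np

    def nb_fit(X, y):
        """Estimate per-class priors, means, and standard deviations (equation (5))."""
        classes = np.unique(y)
        prior = np.array([np.mean(y == c) for c in classes])
        mu    = np.array([X[y == c].mean(axis=0) for c in classes])
        sigma = np.array([X[y == c].std(axis=0) + 1e-9 for c in classes])
        return classes, prior, mu, sigma

    def nb_predict(X, classes, prior, mu, sigma):
        """MAP decision of equation (3): argmax_i log P(C_i) + sum_k log P(x_k | C_i)."""
        # Log of the Gaussian density of equation (5), evaluated per feature dimension.
        log_pdf = (-0.5 * ((X[:, None, :] - mu) / sigma) ** 2
                   - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
        # Sum over dimensions implements the independence assumption of equation (2).
        log_post = np.log(prior) + log_pdf.sum(axis=2)
        return classes[np.argmax(log_post, axis=1)]

Working in the log domain turns the product of equation (2) into a sum, which avoids numerical underflow when the feature dimensionality is large.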
III. EXPERIMENTS AND RESULTS

For the experiments, image data sets are collected from the Caltech image data set. The Caltech image data set consists of different image classes (categories) in which each class contains different views of an object. The Caltech image data were collected by the Computational Vision Group and are available at the following website: http://www.vision.caltech.edu/html-files/archive.html

Figure 1 shows examples of the airplane, car, motorbike, and face images used in our experiments. Each class consists of 200 images with different views, resulting in a total of 800 images in the data set. Before any further processing for feature extraction, the entire image set was converted to grey-scale data with the same resolution.

Fig. 1: Examples from the Caltech image data set: (a) Airplane (b) Car (c) Motorbike (d) Face

In order to describe the characteristics of each class, the feature vectors should be able to discriminate the images from different categories while producing similar feature values for the images from the same category. The following features are employed in the experiments:

A. Discrete Cosine Transform (DCT)

The discrete cosine transform (DCT) is a tool to convert an image into frequency components and has been successfully applied to the image compression problem [5]. The DCT can decompose an image signal into its underlying spatial frequencies, and the DCT coefficients of an image signal can be used as new features that are able to represent the regularity and some textural properties of the image signal. In order to describe the DCT texture information of image signals, a localized image representation method computes the DCT coefficients at different points of interest in the image signal.

For the localized representation, the DCT coefficients are first calculated with an 8×8 window at the upper left corner of the image. The 8×8 sliding window for calculating the next DCT coefficients is then shifted by an increment of 2 pixels horizontally and vertically. The DCT coefficients from each block consist of 64 coefficients, and the 32 lowest-frequency DCT coefficients from each image block are used as the feature for the local block in our experiments because most of the energy is concentrated in this frequency region. Therefore, the DCT feature vectors used in our experiments are 32-dimensional vectors.

B. Local Binary Pattern (LBP)

The local binary pattern (LBP) is one of the most widely used feature extraction methods for describing image textures, including points, lines, and surfaces, because of its texture representation capability and computational simplicity [6]. The most important practical advantage of the LBP is that it is very robust to monotonic data value variations caused by illumination changes. The LBP operator labels the pixels of an image signal by thresholding the neighborhood of each pixel against the center value and producing the result as a binary number. At a given pixel position (x_c, y_c), the LBP is defined as an ordered set of binary comparisons of pixel intensities between the center pixel and its S_p predetermined surrounding pixels (in this case, S_p = 8).

The uniform local binary pattern (ULBP) is a variant of the LBP that has at most two circular 0-to-1 or 1-to-0 transitions. The ULBP can reduce the length of the feature histogram and improve the classifier performance when compared with the LBP features [6]. While the size of the LBP used in our experiments is 8, so that there exist 2^8 = 256 possible patterns, the ULBP for this case allows only 59 cases; the resulting ULBP therefore produces 59-dimensional feature vectors.
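A minimal sketch of the 8-neighbour LBP code and the 59-bin ULBP histogram described above is given below. The neighbour ordering, the histogram normalization, and the single catch-all bin for non-uniform patterns are common conventions assumed here, not details specified in the paper.

    import numpy as np

    # Offsets of the 8 neighbours (S_p = 8), clockwise from the top-left pixel.
    OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

    def lbp_code(img, r, c):
        """LBP code at (r, c): threshold the 8 neighbours against the centre pixel."""
        bits = [int(img[r + dr, c + dc] >= img[r, c]) for dr, dc in OFFSETS]
        return sum(b << k for k, b in enumerate(bits))

    def is_uniform(code, p=8):
        """Uniform pattern: at most two 0/1 transitions on the circular bit string."""
        bits = [(code >> k) & 1 for k in range(p)]
        transitions = sum(bits[k] != bits[(k + 1) % p] for k in range(p))
        return transitions <= 2

    def ulbp_histogram(img):
        """59-bin ULBP histogram: 58 uniform patterns plus one bin for all the rest."""
        uniform_codes = sorted(c for c in range(256) if is_uniform(c))
        index = {c: i for i, c in enumerate(uniform_codes)}        # 58 uniform bins
        hist = np.zeros(len(uniform_codes) + 1)
        h, w = img.shape
        for r in range(1, h - 1):
            for c in range(1, w - 1):
                hist[index.get(lbp_code(img, r, c), len(uniform_codes))] += 1
        return hist / max(hist.sum(), 1.0)

With 8 neighbours there are exactly 58 uniform patterns, so the histogram has 58 + 1 = 59 bins, matching the 59-dimensional ULBP feature vector mentioned above.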
Fig. 2: DCT feature extraction

TABLE I
COMPARISON OF CLASSIFICATION ACCURACY OF THE CNN CLASSIFIER WITH DIFFERENT FEATURE VECTORS (MEAN AND STANDARD DEVIATION)

    Feature    Average Accuracy (%)    Std. Dev.
    DCT        71.8                     7.68
    ULBP       67.4                    10.12
    CovD       63.2                     9.81
    WPT        58.5                    11.28
C. Covariance Descriptor (CovD)

The covariance feature vector has been widely used in object recognition and pattern classification problems for its advantageous properties, including robustness against noise and low dimensionality when compared with other feature descriptors [7]. For our experiments, covariance feature vectors formed with the first derivatives (I_x, I_y), the second derivatives (I_xx, I_yy), and the gradient angle of the image, as shown in equation (10), are used:

    F(x, y) = [ |I_x|, |I_y|, |I_xx|, |I_yy|, √(I_x² + I_y²), arctan(|I_y| / |I_x|) ]    (10)
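A minimal sketch of the covariance descriptor, under the reconstruction of equation (10) given above, follows; the per-pixel feature layout and the use of np.gradient for the derivatives are assumptions made here for illustration. Flattening the resulting 6×6 matrix would give a 36-dimensional vector, consistent with the CovD dimensionality reported below.

    import numpy as np

    def covariance_descriptor(img):
        """Region covariance of the per-pixel features of equation (10):
        first derivatives, second derivatives, gradient magnitude, gradient angle."""
        iy, ix = np.gradient(img.astype(float))          # first derivatives
        iyy, _ = np.gradient(iy)
        _, ixx = np.gradient(ix)                         # second derivatives
        magnitude = np.sqrt(ix ** 2 + iy ** 2)
        angle = np.arctan2(np.abs(iy), np.abs(ix) + 1e-12)
        feats = np.stack([np.abs(ix), np.abs(iy), np.abs(ixx), np.abs(iyy),
                          magnitude, angle], axis=-1)    # H x W x 6 feature image
        flat = feats.reshape(-1, feats.shape[-1])        # one 6-D vector per pixel
        return np.cov(flat, rowvar=False)                # 6 x 6 covariance matrix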
D. Wavelet Packet Transform (WPT)

The wavelet transform is a useful multi-resolution analysis tool, and it has been widely applied to texture analysis and classification. Features of images, such as the edges of an object, can be projected onto the wavelet coefficients in the low-pass and high-pass sub-bands. In our experiments, we used a 6-step 2-D wavelet packet transform, which produces a 68-dimensional feature vector for each image.

The dimensions of the feature vectors obtained from DCT, ULBP, CovD, and WPT are 36, 59, 36, and 68, respectively.

Throughout the experiments, the 10-fold cross-validation method is adopted to deal with the small sample size of our datasets. That is, the datasets are divided randomly into ten roughly equal parts. Nine parts are used for training the classifiers, and the remaining part is used for evaluating the classifiers. This process is repeated ten times so that each part is used once as the test dataset.
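A minimal sketch of this 10-fold cross-validation protocol is given below; the fit and predict callables are placeholders for whichever classifier is being evaluated, and the fixed random seed is an assumption made here for reproducibility.

    import numpy as np

    def ten_fold_indices(n_samples, seed=0):
        """Split sample indices into 10 roughly equal folds, as described above."""
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(n_samples), 10)

    def cross_validate(X, y, fit, predict):
        """Train on 9 folds and test on the held-out fold, repeated for all 10 folds."""
        folds = ten_fold_indices(len(y))
        accuracies = []
        for k, test_idx in enumerate(folds):
            train_idx = np.hstack([f for i, f in enumerate(folds) if i != k])
            model = fit(X[train_idx], y[train_idx])
            accuracies.append(np.mean(predict(model, X[test_idx]) == y[test_idx]))
        return np.mean(accuracies), np.std(accuracies)

Returning both the mean and the standard deviation over the ten folds matches the way accuracies are reported in Tables I and II.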
Table I summarizes the average classification accuracy of the CNN classifier when it is trained with each individual feature vector separately. As can be seen from Table I, individual classifiers with separate features show limited classification accuracy, while the classifier using the DCT feature gives the most accurate classification performance among the four individual feature types. Based on these observations, the DCT feature is selected for further experiments.

In order to describe the texture information of the images, a localized image representation method is employed. The localized representation method represents the content of the image by a collection of local features extracted from the image. These features are computed at different points of interest in the image. Afterwards, a Gaussian distribution, whose mean vector and covariance matrix are estimated from all the local feature vectors obtained from the image, is used to represent the content of the image. The localized representation keeps the dimensions of the feature vectors tractable by adjusting the sizes of the blocks, and this method is consequently more robust to occlusions and clutter. In order to obtain the texture information from an image, conventional texture descriptors based on a frequency-domain analysis, such as Gabor filters [8] and wavelet filters [9], are often used. However, these algorithms often induce a high computational load for feature extraction and are not suitable for real-time applications. In this paper, the discrete cosine transform (DCT) is adopted for extracting the texture information from each block of the image [10]. The DCT transforms the image from the spatial domain into the frequency domain. For the localized representation, images are transformed into a collection of 8×8 blocks, where each block is shifted by an increment of 2 pixels horizontally and vertically. The DCT coefficients of each block are then computed, yielding 64 coefficients per block. Only the 32 lowest-frequency DCT coefficients, which carry most of the perceptually significant energy, are kept. Therefore, the feature vectors obtained from each block have 32 dimensions. In order to calculate the Gaussian probability density function (GPDF) for the image, the mean vector and the covariance matrix are estimated from all the blocks obtained from the image. Finally, a GPDF with a 32-dimensional mean vector and a 32×32 covariance matrix is used to represent the content of the image. Figure 2 summarizes the data processing procedure used in our experiment.
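A minimal sketch of the localized DCT representation and the GPDF summary described above follows, assuming scipy's dctn for the 2-D DCT; the low-frequency-first ordering used to pick the 32 coefficients approximates the usual zig-zag scan and is an assumption, since the paper does not specify the scan order.

    import numpy as np
    from scipy.fft import dctn

    def block_dct_features(img, block=8, step=2, n_coeffs=32):
        """Slide an 8x8 window with a 2-pixel step, take the 2-D DCT of each block,
        and keep the n_coeffs lowest-frequency coefficients of each block."""
        # Low-frequency-first ordering of the 8x8 coefficient grid (zig-zag-like).
        order = sorted(((u, v) for u in range(block) for v in range(block)),
                       key=lambda uv: (uv[0] + uv[1], uv[0]))[:n_coeffs]
        feats = []
        h, w = img.shape
        for r in range(0, h - block + 1, step):
            for c in range(0, w - block + 1, step):
                coeffs = dctn(img[r:r + block, c:c + block].astype(float), norm='ortho')
                feats.append([coeffs[u, v] for u, v in order])
        return np.asarray(feats)                          # one 32-D vector per block

    def image_gpdf(img):
        """Summarize an image by the mean vector and covariance matrix (GPDF)
        of its block DCT features, as in the representation described above."""
        f = block_dct_features(img)
        return f.mean(axis=0), np.cov(f, rowvar=False)    # 32-D mean, 32x32 covariance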
Table II is a summary of the classification accuracies of different classifiers when the DCT feature is adopted. Note that the Naïve Bayes-based classifier outperforms the other classifiers. The classification accuracies for CNN, FCM, MLPNN, and NB are 71.8%, 66.5%, 72.6%, and 77.2% on average, respectively.

Table III summarizes the training time required by the different classifiers for the image dataset. Since the Multi-Layer Perceptron Neural Network (MLPNN) adopts the iterative Error Back-Propagation learning algorithm, its training time is the largest among the four classifiers. As can be seen from Table III, Naïve Bayes requires the minimum training time among the four classifiers. Note that this training time advantage of Naïve Bayes comes from the fact that Naïve Bayes does not use an iterative training algorithm.

TABLE II
COMPARISON OF CLASSIFICATION ACCURACY OF DIFFERENT CLASSIFIERS WITH DCT FEATURE VECTORS (MEAN AND STANDARD DEVIATION)

    Classifier                               Average Accuracy (%)    Std. Dev.
    Centroid Neural Network                  71.8                     7.68
    Fuzzy C-Means                            66.5                     9.62
    Multi-Layer Perceptron Neural Network    72.6                     8.04
    Naïve Bayes                              77.2                     7.16

TABLE III
COMPARISON OF TRAINING TIME

    Classifier                               Training time (s)
    Centroid Neural Network                  1.22
    Fuzzy C-Means                            1.86
    Multi-Layer Perceptron Neural Network    4.22
    Naïve Bayes                              0.42

IV. CONCLUSION

A classification model for image data using the Naïve Bayes classifier is proposed in this paper. Since the Naive Bayes classifier does not require the extensive training procedure commonly required by most artificial neural network architectures, the resulting classifier can yield an appropriate classification decision with very limited computational effort. The proposed classifier utilizes the Naive Bayes classifier to minimize the training time relative to the conventional classifiers while yielding accurate classification results by exploiting the probabilistic formulation of the Naive Bayes classifier. In order to evaluate the performance of the proposed classifier, experiments on Caltech image data sets are carried out. The performance of the proposed classifier is compared with those of conventional classifiers such as CNN, FCM, and MLPNN in terms of training speed and classification accuracy. The results show that the proposed classifier outperforms the conventional classifiers in terms of both training speed and classification accuracy. The training speed advantage of the proposed classifier over the conventional classifiers is a particularly useful property in practical applications. How to overcome the assumption of independence of the individual feature dimensions in the Naive Bayes classifier will be a subject of future research.

ACKNOWLEDGMENT

This work was supported by the IT R&D program of The MKE/KEIT (10040191, The development of Automotive Synchronous Ethernet combined IVN/OVN and Safety control system for 1Gbps class).

REFERENCES

[1] M. Jang and D. Park, "Stochastic Classifier Integration Model," International Journal of Applied Engineering Research, vol. 11, no. 2, pp. 809-814, 2016.
[2] D. Lowd and P. Domingos, "Naive Bayes models for probability estimation," in Proc. of the 22nd International Conference on Machine Learning, 2005, pp. 529-536.
[3] D. Lewis, "Naive Bayes at forty: The independence assumption in information retrieval," Lecture Notes in Computer Science, vol. 1398, pp. 4-15, 1998.
[4] D.-C. Park, "Centroid Neural Network for Unsupervised Competitive Learning," IEEE Transactions on Neural Networks, vol. 11, no. 2, pp. 520-528, March 2000.
[5] G. Strang, "The discrete cosine transform," SIAM Review, vol. 41, no. 1, pp. 135-147, 1999.
[6] Z. Guo, L. Zhang, and D. Zhang, "A completed modeling of local binary pattern operator for texture classification," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1657-1663, June 2010.
[7] O. Tuzel, F. Porikli, and P. Meer, "Region covariance: A fast descriptor for detection and classification," in Proc. of the Ninth European Conference on Computer Vision, vol. 2, 2006, pp. 589-600.
[8] J. Daugman, "Complete Discrete 2-D Gabor Transform by Neural Networks for Image Analysis and Compression," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, pp. 1169-1179, 1988.
[9] C. Pun and M. Lee, "Extraction of Shift Invariant Wavelet Features for Classification of Images with Different Sizes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1228-1233, 2004.
[10] Y. Huang and R. Chang, "Texture Features for DCT-Coded Image Retrieval and Classification," in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 6, 1999, pp. 3013-3016.

Dong-Chul Park received the B.S. degree in electronics engineering from Sogang University, Seoul, Korea, in 1980, the M.S. degree in electrical and electronics engineering from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1982, and the Ph.D. degree in electrical engineering, with a dissertation on system identification using artificial neural networks, from the University of Washington (UW), Seattle, in 1990. From 1990 to 1994, he was with the Department of Electrical and Computer Engineering, Florida International University, The State University of Florida, Miami. Since 1994, he has been with the Department of Electronics Engineering, MyongJi University, Korea, where he is a Professor. From 2000 to 2001, he was a Visiting Professor at UW. He is a pioneer in the area of electrical load forecasting using artificial neural networks. He has published more than 140 papers, including 40 archival journal papers, in the area of neural network algorithms and their applications to various engineering problems, including financial engineering, image compression, speech recognition, time-series prediction, and pattern recognition. Dr. Park was a member of the Editorial Board of the IEEE TRANSACTIONS ON NEURAL NETWORKS from 2000 to 2002. He is a Senior Member of IEEE.