
Ethiop. J. Sci. & Technol. 6(2): 127-137, 2013

Automatic Amharic text news classification: A neural networks approach


Worku Kelemework

School of Computing and Electrical Engineering, Institute of Technology, Bahir Dar University, Bahir Dar City, P.O. Box 26, Ethiopia; e-mail: [email protected], [email protected]

ABSTRACT
This study addresses the automatic classification of Amharic news using a neural networks approach. The Learning Vector Quantization (LVQ) algorithm is employed to classify new instances of Amharic news based on a classifier developed from a training dataset. Two weighting schemes, Term Frequency (TF) and Term Frequency by Inverse Document Frequency (TF*IDF), are used to weight the features, or keywords, in the news documents. Based on the two weighting methods, a news-by-features matrix is generated and fed to LVQ. Using the TF weighting method, accuracies of 94.81%, 61.61% and 70.08% are obtained in the three, six and nine classes experiments respectively, with an average accuracy of 75.5%. For the same experiments, the TF*IDF weighting method resulted in accuracies of 69.63%, 78.22% and 68.03%, with an average of 71.96%.

Key words: Learning Vector Quantization (LVQ), Text news classification, Term Frequency (TF), Term Frequency by Inverse Document Frequency (TF*IDF)

INTRODUCTION

Text classification is a mapping of text documents to classes (Sebastiani, 2002). In this study, the text documents are Amharic news items and the classes are the categories to which each news item belongs.

The automated classification of texts has been flourishing in the last decade or so due to the incredible increase in electronic documents on the Internet; this renewed the need for automated text classification (Klein, 2004). Amharic electronic documents, too, are increasing in number and need automatic classification. This paper describes how to organize massively available Amharic news items in a meaningful way through automatic classification.

Amharic

Amharic is the working language of the Federal Government of Ethiopia. Twenty seven million people speak the language. It is the second largest Semitic language next to Arabic. Amharic is written from left to right, similar to English and unlike other Semitic languages such as Arabic and Hebrew (Wapedia, 2009). Amharic has its own writing system, taken from the Ge'ez alphabet. The Amharic writing system consists of a core of thirty three characters, each of which occurs in a basic form and in six other forms called orders (Bender et al., 1976). Table 1 shows three core Amharic characters with their orders.

Table 1. Amharic characters example

        1st     2nd     3rd     4th     5th     6th     7th
        Order   Order   Order   Order   Order   Order   Order
        ሀ Hä    ሁ Hu    ሂ Hi    ሃ Ha    ሄ He    ህ H     ሆ Ho
        ለ Lä    ሉ Lu    ሊ Li    ላ La    ሌ Le    ል L     ሎ Lo
        መ Mä    ሙ Mu    ሚ Mi    ማ Ma    ሜ Me    ም M     ሞ Mo

Table 2. Amharic characters with different forms of the same sound

        Character   Other form/s of the character
        ሀ (hä)      ሐ and ኀ
        ሠ (sä)      ሰ
        አ (ä)       ዐ
        ጸ (tsä)     ፀ

A. Amharic Punctuation Marks

Identifying punctuation marks is vital for determining word demarcation in natural language processing. According to Tewodros Hailemeskel (2003), there are about ten punctuation marks in Amharic, though only a few of them are used in the computer writing system. 'Hulet Neteb' (':'), the word separator, and 'Arat Neteb' ('::'), the sentence separator, are the major punctuation marks. However, space is mostly used instead of Hulet Neteb (':'), especially in the computer writing system.

B. Amharic Number System

The Amharic number system consists of twenty characters. They represent the numbers one to ten, the multiples of ten (twenty to ninety), hundred and thousand. The numbering system is not suitable for arithmetic computation because there is no representation for the zero (0) symbol, no place value, no comma and no decimal point. The Amharic numbering system is used in dates, especially the calendar; otherwise, Western numerals are used in most literature these days (Bender et al., 1976).

C. Problems of the Amharic Writing System

There are a number of problems associated with the Amharic writing system which challenge natural language processing of Amharic documents; they are dealt with below.

Spelling variation of the same word: the same word is written in various forms (Tewodros Hailemeskel, 2003; Ethiopia Tadesse, 2002; Zelalem Sintayehu, 2001). For example, the word 'ሰምቶአል' ('he hears') can be written in Amharic as ሰምቶአል, ሰምቷል, ሰምትዋል, etc. Spelling variation may also happen when transliterating a foreign word into Amharic. For instance, the word 'ቴሌቪዥን' ('television') can be written as ቴሌቭዢን, ቴሌቭዥን, ቴሌቪዥን, etc.

Compound words: there is no standard way of writing Amharic compound words (Bender et al., 1976). A space or hyphen is used between the two words in a compound word; sometimes the words are merged together. According to Tewodros Hailemeskel (2003), there is a meaning difference when compound words separated by space are treated separately. For example, the word 'ሆደ-ሰፊ' ('tolerant') is formed from the words 'ሆደ', meaning 'stomach', and 'ሰፊ', meaning 'wide'. One can imagine how the meaning of the original word is diverted to different contexts.

Redundancy of some characters: sometimes more than one character is used for a similar sound in Amharic (Ethiopia Tadesse, 2002; Zelalem Sintayehu, 2001). Though the various forms have their own meaning in Ge'ez, there is no clear-cut rule that shows their purpose and use in Amharic, according to Bender et al. (1976). Table 2 illustrates the different forms of Amharic characters with similar sound.

The problem of the same sound with various characters is not only observed with core characters but is also exhibited in the same order of characters; for example, ሀ and ሃ; ኀ and ኃ; አ and ኣ; etc. (Tewodros Hailemeskel, 2003). The use of various forms of characters for the same sound poses a problem in the process of feature preparation for classifier learning, since the same word is represented in different forms. For example, the word 'ጸሀይ' ('sun') can be represented in Amharic as ጸሀይ, ጸሐይ, ጸኀይ, ፀሀይ, ፀሐይ, ፀኀይ, etc.

Abbreviation: no consistency is kept in abbreviating Amharic words (Ethiopia Tadesse, 2002; Zelalem Sintayehu, 2001). The word 'ዓመተ ምህረት', meaning 'AD', can be abbreviated as ዓም, ዓ.ም, ዓ.ም., ዓ/ም, etc.

All the aforementioned problems pose challenges, since the same word is treated as different forms in the process of feature preparation for the text classifier.

So, care is taken to solve such problems.

Amharic is a technologically under-resourced language (Solomon Tefera and Menzel, 2007). Only three studies had addressed Amharic text classification before this research, to the researcher's knowledge. Zelalem Sintayehu (2001), Surafel Teklu (2003) and Yohannes Afework (2007) did research on Amharic text classification using a statistical method, K-Nearest Neighbor (KNN) and Naïve Bayes algorithms, and Support Vector Machine (SVM) and decision tree algorithms, respectively. The major challenge in all the studies is the decrease in accuracy when the number of classes increases. All of these studies apply only the TF*IDF weighting method.

This research uses one of the neural network learning algorithms, Learning Vector Quantization (LVQ), to study Amharic text news classification. It tries to answer the following questions:

Is a neural network approach using the LVQ learning method feasible for automatic Amharic text news classification?

Can we reduce the effect of an increasing number of classes and news items on Amharic text news classification performance using the LVQ learning method?

What is the effect of the TF and TF*IDF weighting methods on Amharic text news classification performance?

Text classification using Learning Vector Quantization (LVQ)

LVQ is a supervised version of the Kohonen neural network (Martín-Valdivia et al., 2007). An LVQ network has two layers, called the competitive layer and the linear layer (Demuth and Beale, 2004). The competitive layer learns to classify input vectors. The linear layer transforms the competitive layer's classes into target classes defined by the user. The classes learned by the competitive layer are referred to as subclasses, and the classes of the linear layer are called target classes.

A text classifier based on a neural network approach is a network of units, where the input units represent terms or features of news and the output units represent the classes of interest (Sebastiani, 2002). Figure 1 indicates the architecture of LVQ according to Demuth and Beale (2004).

Figure 1. Architecture of LVQ

The construction of the classifier has been done using Learning Vector Quantization (LVQ) in a supervised manner. Hence, the algorithm demands training and test datasets which are pre-classified by experts. The LVQ algorithm uses the training dataset for classifier construction and the test dataset for the evaluation of the constructed classifier. LVQ takes different parameters for experimentation, one of which is the epoch, the learning step. In this study, different epoch levels are experimented with. Nine epoch levels are used for training: 100, 500, 1000, 1500, 2000, 2500, 3000, 3500 and 4000. 100 is the default epoch for the LVQ algorithm; epochs lower than 100 were not selected, based on a preliminary trial. Thus, experiments are made from the default epoch level up to 4000, increasing at an interval of 500 (except the first). The interval of 500 is selected to see the impact of higher epoch levels, because if a smaller interval were chosen it would take a long time to reach 4000.
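To make the two-layer picture concrete, the following is a minimal sketch of an LVQ1-style training loop: the set of prototype vectors plays the role of the competitive layer, and the fixed mapping from prototypes (subclasses) to target classes plays the role of the linear layer. It is an illustration only; the study itself used the MATLAB Neural Network Toolbox, and names such as train_lvq, prototypes_per_class and lr are assumptions of this sketch.

```python
import numpy as np

def train_lvq(X, y, prototypes_per_class=2, epochs=100, lr=0.01, seed=0):
    """LVQ1 sketch: X is a news-by-features matrix, y the class labels."""
    rng = np.random.default_rng(seed)
    protos, proto_labels = [], []
    for c in np.unique(y):
        # Initialize each class's prototypes from its own training vectors.
        idx = rng.choice(np.where(y == c)[0], prototypes_per_class, replace=False)
        protos.append(X[idx].astype(float))
        proto_labels += [c] * prototypes_per_class
    protos, proto_labels = np.vstack(protos), np.array(proto_labels)

    for _ in range(epochs):                      # epoch = one pass over the data
        for i in rng.permutation(len(X)):
            # Competitive layer: the nearest prototype (subclass) wins.
            w = np.argmin(np.linalg.norm(protos - X[i], axis=1))
            # Linear layer: the winner's subclass maps to a target class; pull
            # the winner toward the input if the classes match, else push away.
            sign = 1.0 if proto_labels[w] == y[i] else -1.0
            protos[w] += sign * lr * (X[i] - protos[w])
    return protos, proto_labels

def predict_lvq(X, protos, proto_labels):
    """Assign each row of X the target class of its nearest prototype."""
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return proto_labels[np.argmin(d, axis=1)]
```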

Architecture of automatic Amharic text news classification

Amharic news items are used as the input to the system. Then preprocessing tasks are done: normalization (changing varying Amharic characters with similar sound to one common form and changing punctuation marks to space), tokenization, stop word and number removal, stemming, term weighting and dimension reduction. After all these preprocesses, the dataset is prepared in matrix form, from which training and test datasets are derived and used for training and testing purposes respectively. From the training dataset, the model (classifier) is constructed. The model is tested using the test dataset. The testing outcome is the assignment of classes to news items that were not encountered during training. Finally, evaluation is made based on the test result using accuracy. Figure 2 shows the architecture of automatic Amharic text news classification.

Figure 2. Architecture of automatic Amharic text news classification

THE DATA SET

The data source for this study is news of the Ethiopian News Agency (ENA). The news data are classified in accordance with 13 major classifications and 103 sub-classifications in the Agency. For the purpose of this study, nine classes are taken into consideration, with a total of 1,538 news items. The nine classes, selected based on random sampling, are Bank and insurance, Tourism development, Mines and energy, ICT, Art, Educational coverage, Weather forecast, Religious assemblies and reports, and Creativity work. The number of news items in each class and the total number of news items are shown in Table 3.

Table 3. Number of Amharic news items in the dataset

No.   Class                              News No.
1     Bank and insurance                 297
2     Tourism development                253
3     Mines and energy                   251
4     ICT                                167
5     Art                                152
6     Educational coverage               138
7     Weather forecast                   132
8     Religious assemblies and reports   103
9     Creativity work                    45
      Total                              1,538

Amharic dataset preprocessing

The tasks done for preprocessing of the Amharic news include tokenization, stop word and number removal, stemming, index term weighting, dimension reduction and matrix generation. After the accomplishment of the preprocesses, the classifier is constructed by the Learning Vector Quantization (LVQ) learning method using MATLAB as a tool. Finally, the system is evaluated based on the results obtained, using accuracy. The subsequent sections discuss the methods used for preprocessing the data to make it ready for the classification task.
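As a rough illustration of the normalization step named above, the sketch below folds a few variant characters into one form and replaces Ethiopic punctuation with spaces. The character map covers only the Table 2 examples, and the choice of canonical form is an assumption of the sketch, not the study's actual table.

```python
# Sketch of Amharic text normalization: variant characters with the same
# sound are folded to one form (Table 2 examples only) and punctuation
# marks are changed to spaces, as in the preprocessing described above.
CHAR_MAP = str.maketrans({
    'ሐ': 'ሀ', 'ኀ': 'ሀ',   # variants of hä folded to ሀ
    'ሠ': 'ሰ',             # variant of sä folded to ሰ
    'ዐ': 'አ',             # variant of ä folded to አ
    'ፀ': 'ጸ',             # variant of tsä folded to ጸ
})
PUNCTUATION = '፡።፣፤፥፦'   # common Ethiopic punctuation marks

def normalize(text: str) -> str:
    text = text.translate(CHAR_MAP)
    for mark in PUNCTUATION:
        text = text.replace(mark, ' ')
    return text

print(normalize('ፀሀይ፡ወጣ።'))   # -> 'ጸሀይ ወጣ '
```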

A. Tokenization

Tokenization is the process by which tokens are identified as candidates to be used as features (Baeza-Yates and Ribeiro-Neto, 1999). They are candidates in the sense that stop words and numbers are removed from the tokens, and tokens which do not satisfy Document Frequency (DF) thresholding are not considered.

In this study, words are taken as tokens. All punctuation marks are converted to space, and space is used as the word demarcation. Hence, if a sequence of characters is followed by a space, that sequence is identified as a word.

B. Stop Word and Number Removal

Stop words are non-content-bearing words which are less discriminating among documents, since they appear in most of them (Baeza-Yates and Ribeiro-Neto, 1999).

There are common stop words in Amharic which are used for grammatical purposes, like ነዉ, ነበር, ሆኖም, እና, ነገርግን, etc., and which are non-informative for identifying documents. In addition to the common stop words, there are also news-specific stop words like ገለፁ, ዘግበዋል, አስታወቀ, etc.; they are used for elaboration and are common to all news written by the reporters of ENA. Because of the unavailability of a standard stop list from previous researchers, the researcher of this study had to develop a stop list.

Since stop words are highly frequent words, the total frequency of terms, aided by manual inspection, is the method employed to identify stop words. The stop list is prepared after identifying the stop words; it contains the words which have to be removed from the tokens generated during the tokenization process. Manual inspection is needed because of frequently occurring keywords. For example, the word 'ቱሪዝም' ('tourism') is the most frequent word in the class 'Tourism development' and is crucial in discriminating that class. Hence, such words are not included in the stop list.

The purpose of identifying stop words is to remove such words from the list of index terms. Index terms are features and are believed to represent news or discriminate one news item from the others, whereas stop words are not. Hence, including those words in the list of index terms is unhelpful, which is why their exclusion from the index term list is vital.

In most cases, numbers are less discriminating among documents (Baeza-Yates and Ribeiro-Neto, 1999). In this study also, numbers are not considered as index terms; the index term list does not contain any number.
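Putting the last two subsections together, a compact sketch of tokenization with stop word and number removal might look as follows. It assumes the normalize() helper sketched earlier, and the stop list here holds only the example words quoted above, whereas the study's stop list was built from term frequencies plus manual inspection.

```python
# Sketch: tokenize normalized text, then drop stop words and numbers.
STOP_WORDS = {
    'ነዉ', 'ነበር', 'ሆኖም', 'እና', 'ነገርግን',   # common grammatical stop words
    'ገለፁ', 'ዘግበዋል', 'አስታወቀ',             # news-specific stop words (ENA)
}

def tokenize(text: str) -> list:
    # After normalization, space is the only word demarcation.
    tokens = normalize(text).split()
    # t.isdigit() catches Western numerals, which the text says dominate.
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]
```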

C. Stemming

Stemming is changing words that vary due to grammatical reasons to the root form of the word (Frakes and Baeza-Yates, 2002). Stemming is one of the preprocessing steps applied to the Amharic text news for this study. A stemmer that can remove common Amharic prefixes and suffixes was developed. Table 4 shows examples of the prefixes and suffixes removed, with an example under each affix.

Table 4. Examples of affixes removed during stemming

Type     Affix   Word     Translated to
Prefix   ለ       ለጂማ     ጂማ
         ስለ      ስለጂማ    ጂማ
         በ       በጂማ     ጂማ
Suffix   ም       ጂማም     ጂማ
         ና       ጂማና     ጂማ
         ን       ጂማን     ጂማ

The stemmer developed for this study is based on Nega Alemayehu and Willett (2002); rules are applied to find the stem of Amharic words. The rules for removing a prefix or suffix from a given word may not always hold true. For instance, removing 'ዉ' ('wu') from the word 'ሰዉ' ('person') would give 'ሰ' ('se'), which is meaningless, and removing 'በ' ('be') from 'በልግ' ('autumn') gives 'ልግ' ('lg'), which does not represent the original meaning. Hence, two exception lists are prepared to which the affix removal rules are not applied: a list of words for which the prefix removal rule does not hold true, and a list of words from which the suffix removal rule is not applied.

The stemmer takes a word as input and removes the prefix of the word. After the prefix is removed, the word is checked again for whether it ends with a suffix in the suffix list; if so, the suffix is removed from the word. Table 5 shows an example.

Table 5. Example of stemming

                Example 1   Example 2   Example 3   Example 4
Input:          ጂማ          ጂማን         የጂማ         የጂማን
Prefix:         No          No          የ           የ
Output 1:       ጂማ          ጂማን         ጂማ          ጂማን
Suffix:         No          ን           No          ን
Final output:   ጂማ          ጂማ          ጂማ          ጂማ
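A sketch of the prefix-then-suffix stripping procedure with exception lists is given below. The affix lists contain only the examples from Table 4 and Table 5, and the exception words are the two quoted above; the rule set of Nega Alemayehu and Willett (2002) underlying the study's stemmer is far richer.

```python
# Sketch of the two-pass stemmer: remove a prefix, then check for a suffix.
PREFIXES = ['ስለ', 'የ', 'ለ', 'በ']        # longest first, so ስለ wins over ለ
SUFFIXES = ['ም', 'ና', 'ን', 'ዉ']
PREFIX_EXCEPTIONS = {'በልግ'}             # removing በ would yield meaningless ልግ
SUFFIX_EXCEPTIONS = {'ሰዉ'}              # removing ዉ would yield meaningless ሰ

def stem(word: str) -> str:
    if word not in PREFIX_EXCEPTIONS:
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p):
                word = word[len(p):]
                break
    if word not in SUFFIX_EXCEPTIONS:
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s):
                word = word[:-len(s)]
                break
    return word

print(stem('የጂማን'))   # -> ጂማ, as in Example 4 of Table 5
```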
D. Index Term Weight

Not all index terms are equally important in representing and discriminating a document; it is thus required to measure how important a term is with regard to the representation and discrimination of a document (Giorgino, 2004; Liao et al., 2003). Term Frequency (TF) and Term Frequency by Inverse Document Frequency (TF*IDF) are the weighting schemes used in this study. TF, IDF and TF*IDF are explained below based on Baeza-Yates and Ribeiro-Neto (1999) and Manning et al. (2008).

TF is the number of occurrences of a term in a document. The weight of term k in document i is given by:

TF_ik = FREQ_ik                        (1)

In (1), FREQ_ik is the number of occurrences of term k in document i. TF is zero if the term does not appear in document i.

IDF is a measure of the general importance of the term; (2) depicts the IDF of a term:

IDF_k = log2(N / d_k)                  (2)

In (2), N is the total number of documents in the collection and d_k is the number of documents in which term k occurs.

TF*IDF is the combination of the TF and IDF weighting methods. TF*IDF incorporates two intuitions: if an index term occurs more frequently in a document, the index term is more important for that document (the Term Frequency intuition); if more documents contain the index term, the index term is less discriminating between the documents (the Inverse Document Frequency intuition).

TF*IDF_ik = FREQ_ik × log2(N / d_k)    (3)

In (3), FREQ_ik is the number of occurrences of term k in document i, N is the total number of documents in the collection, and d_k is the number of documents in which term k occurs.
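Equations (1) to (3) translate directly into code. The sketch below computes TF and TF*IDF weights over tokenized documents, with log base 2 as in Equation (2); the function name and arguments are assumptions of the sketch.

```python
import math
from collections import Counter

def weight_matrix(docs, vocab):
    """docs: list of token lists; vocab: list of index terms.
    Returns (TF, TF*IDF) rows per document, per Equations (1)-(3)."""
    N = len(docs)
    counts = [Counter(d) for d in docs]
    # d_k: the number of documents in which term k occurs.
    df = {k: sum(1 for c in counts if c[k] > 0) for k in vocab}
    tf = [[c[k] for k in vocab] for c in counts]                     # Eq. (1)
    tfidf = [[c[k] * math.log2(N / df[k]) if df[k] else 0.0
              for k in vocab] for c in counts]                       # Eq. (3)
    return tf, tfidf
```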
E. Dimension Reduction

The feature space comprises one dimension for each unique term that occurs in the text documents, which can lead to tens of thousands of dimensions for even a small-sized text collection; so, there is a need to integrate a dimension reduction phase into text classification (Skarmeta et al., 2000; Yi and Beheshti, 2004).

After identifying the number of tokens generated during tokenization, stop word and number removal and stemming are applied to reduce the number of tokens to be used as features. But the dimension still has to be reduced so that the most important features of each class are identified. The dimension is reduced because irrelevant features, which may affect performance badly, are removed, and the computational complexity becomes more convenient.

In this study, Document Frequency (DF) thresholding is used to reduce the dimension of the features generated. DF is the number of documents that contain a certain feature (Krishnakumar, 2006). The system is supported by manual observation, checking whether stop words which were not eliminated during stop word removal are mixed in, and whether there are important features which do not satisfy the threshold. After all the preprocessing done on the dataset, 80 features are identified to represent the news.
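Document Frequency thresholding itself is a short computation, sketched below; min_df is an assumed parameter, since the paper does not state the exact cut-off that led to the 80 retained features.

```python
from collections import Counter

def df_threshold(docs, min_df):
    """Keep only terms whose document frequency is at least min_df."""
    df = Counter()
    for d in docs:
        df.update(set(d))        # count each term once per document
    return sorted(t for t, n in df.items() if n >= min_df)
```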
Matrix

The input to the learning algorithm is a matrix generated with the values of the term weights using TF and TF*IDF. Table 6 and Table 7 show an example of the matrix generated for the nine classes experiment using the TF and TF*IDF weight methods respectively; the rows and columns are reduced for viewing purposes, i.e., not all 80 terms (features) and all 9 classes are shown in the Tables.

Table 6. Matrix using TF weight method

ቱሪዝም 'turizm'    0      0      0      0      0       0      0
ትምህርት 'tmhrt'    0      0      0      0      0       0      0
ባንክ 'bank'        0      1      1      0      3       0      0
ምንዛሪ 'mnzari'    2      1      0      1      1       2      2
ማእድን 'maIdn'     0      0      0      0      0       0      0
Class             1      1      1      1      1       1      1

Table 7. Matrix using TF*IDF weight method

ቱሪዝም 'turizm'    0      0      0      0      0       0      0
ትምህርት 'tmhrt'    0      0      0      0      0       0      0
ባንክ 'bank'        0      3.46   3.46   0      10.38   0      0
ምንዛሪ 'mnzari'    6.64   3.32   0      3.32   3.32    6.64   6.64
ማእድን 'maIdn'     0      0      0      0      0       0      0
Class             1      1      1      1      1       1      1

In Table 6 and Table 7, zero indicates that the feature does not occur in that news item; otherwise, the weight value is used to show its importance. In both Tables, class 1 indicates "Bank and insurance" as depicted in Table 3.

Amharic text news classification performance

The classifier is constructed using a training dataset comprising 66.67% of the total dataset. The remaining 33.33% of the total dataset is used to test the accuracy of the classifier. Fifty-four experiments were carried out, covering the nine epoch levels at three, six and nine news classes using both the TF and TF*IDF weight methods, excluding preprocessing experiments. The experiments were carried out with an increasing number of classes. Three classes: ICT, Art and Educational coverage. Six classes: ICT, Art, Educational coverage, Weather forecast, Religious assemblies and reports, and Creativity work. Nine classes: Bank and insurance, Tourism development, Mines and energy, ICT, Art, Educational coverage, Weather forecast, Religious assemblies and reports, and Creativity work.

Table 8 and Table 9 show the accuracy of the classifier evaluated on the test dataset using the TF and TF*IDF weight methods respectively. The best accuracies obtained for the three, six and nine classes using the two weight methods are indicated in Table 10.
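One of the fifty-four runs could be sketched as below, reusing the hypothetical train_lvq and predict_lvq helpers from the earlier LVQ sketch; the split fractions and the accuracy metric follow the text above.

```python
import numpy as np

def run_experiment(X, y, epochs, seed=0):
    """Sketch of a single run: 66.67% training, 33.33% testing, accuracy."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    split = int(len(X) * 2 / 3)
    train, test = order[:split], order[split:]
    protos, labels = train_lvq(X[train], y[train], epochs=epochs)
    return np.mean(predict_lvq(X[test], protos, labels) == y[test])
```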

Table 8. Accuracy using TF weighting scheme at 3, 6 and 9 classes at various epoch levels

Epoch   3 Classes   6 Classes   9 Classes
100     69.63%      61.61%      62.09%
500     68.89%      61.61%      62.30%
1000    68.89%      61.61%      61.48%
1500    69.63%      61.61%      70.08%
2000    94.81%      61.61%      62.30%
2500    68.89%      61.61%      61.07%
3000    69.63%      61.61%      61.48%
3500    68.89%      61.61%      61.89%
4000    68.89%      61.61%      61.68%

Table 9. Accuracy using TF*IDF weighting scheme at 3, 6 and 9 classes at various epoch levels

Epoch   3 Classes   6 Classes   9 Classes
100     69.63%      57.78%      62.09%
500     62.22%      70.22%      62.30%
1000    65.19%      78.22%      61.48%
1500    69.63%      57.78%      68.03%
2000    65.19%      57.78%      62.30%
2500    65.19%      58.22%      62.30%
3000    65.19%      72.89%      62.30%
3500    69.63%      57.78%      54.51%
4000    62.22%      57.78%      62.30%

Table 10. Best accuracy at increasing numbers of classes and news items using TF and TF*IDF weight methods

Classes   TF        TF*IDF
Three     94.81%    69.63%
Six       61.61%    78.22%
Nine      70.08%    68.03%
Average   75.50%    71.96%

DISCUSSION

For the three classes, the TF weight method is better than the TF*IDF weight method by 25.18%. But for the six classes experiment, the TF*IDF weight method is better than the TF weight method by 16.61%. The result of the nine classes experiment testifies that the TF weight method scored better accuracy than the TF*IDF weight method, by 2.05%. The average of all the experiments indicates that the TF weight method registered better accuracy, by 3.54%, than the TF*IDF weight method.

The main performance difference between the two weighting schemes happens because of the range of values in the weighting schemes. In the datasets, the TF weight values lie between 0 and 5, while the TF*IDF weight values lie in the range of 0 to 45. This affects the classifier accuracy because, according to [22], it is recommended that the input patterns of the LVQ algorithm have a maximum value of 1 and a minimum value of -1. This seems a plausible explanation for the greater accuracy of the TF weight method compared with the TF*IDF weight method.

As depicted in Table 8, using the TF weight method the best accuracy is registered in the three classes experiment. The least accuracy is recorded in the six classes experiment. The nine classes experiment resulted in accuracy lower than the three classes but higher than the six classes experiment. Hence, we can say that the increase in the number of classes and news items is not the determinant factor for the decrease of performance with regard to the LVQ algorithm.

Based on Table 9, the least accuracy for the TF*IDF weight method is scored in the nine classes experiment, with a small (1.6%) difference from the three classes experiment. The best accuracy for this weighting method is recorded in the six classes experiment. The three classes experiment resulted in the second best accuracy. As with the TF weight method, it can be said that the increase in the number of classes and news items is not the major factor in the reduction of performance using the LVQ algorithm for the Amharic text classifier.
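The range argument above suggests an obvious remedy that the paper does not pursue: rescaling each feature into the recommended [-1, 1] interval before training. A sketch of such a rescaling is given below, purely as an illustration of the point.

```python
import numpy as np

def scale_to_unit_range(X):
    """Linearly map each feature column of X into [-1, 1];
    constant columns collapse to -1."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return 2.0 * (X - lo) / span - 1.0
```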

CONCLUSION

This study tried to assess the potential application of Learning Vector Quantization to the automatic classification of Amharic text news. Under this umbrella, the effectiveness of the text news classifier at increasing numbers of classes and news items has been investigated using the TF and TF*IDF weighting methods. The concluding remarks are described below.

The best accuracy using the TF weighting scheme, 94.81%, has been obtained at three classes. The least accuracy recorded for the TF weighting scheme, 61.61%, has been scored at six classes. On the other hand, for the TF*IDF weighting scheme, the best accuracy, about 78.22%, was recorded at six classes, and the least accuracy, 68.03%, was obtained at nine classes. The TF weighting scheme is better in accuracy than the TF*IDF weighting scheme by 3.54% on average over the three, six and nine classes experiments.

An increase in classes does not affect the performance of the classifier, unlike in previous studies, which show a consistent decrease in accuracy as the number of classes increases. In the course of training using the LVQ algorithm, it was found that computational time increases as the number of news items, features and classes increases.

Generally, the study shows that Learning Vector Quantization can be employed for automatic Amharic text classification, but the integration of standard preprocessing techniques is crucial. The classifier is constructed using the LVQ neural network algorithm, but the accuracy obtained has to be improved. The recommendations given revolve around improving accuracy, facilitating Amharic text classification, untouched problems related to text classification in the Amharic context, and recommendations for the agency, ENA.

Stemmer: a standard stemmer that can be applied to Amharic is vital to decrease the feature size. The result obtained in this study by applying a stemmer is encouraging in reducing the size of the feature set. If a standard stemmer were available, it could play a great role in the reduction of features; as a result, computational complexity can be decreased and effectiveness can be enhanced. Hence, there is a need to investigate an Amharic stemmer. Prior to stemming, there is actually a need for a stop word removal system. In this study, very few news-specific and language-specific stop words are identified; using these stop words, a good result is obtained. If a complete stop word list were prepared for the domain considered, that again would have an important positive effect on reducing the size of the feature set.

Spell checker: spelling errors multiply features by forming different features for the same word. As a result, spell checker research is essential for guiding the development of Amharic spell checkers. In addition to spelling errors, some Amharic words can take varying forms due to the presence of varying Amharic characters with similar sound. Here, the recommendation is extended to Amharic language experts: to devise a clear rule on the use of those characters. Compound words and abbreviations are also written in various forms; hence, the language experts are again recommended to standardize the way compound words and abbreviations are written.

Dimension reduction: in this study, Document Frequency (DF) thresholding is used as the method of dimension reduction of features; other dimension reduction techniques, like Information Gain (IG), the χ²-test (CHI), etc., can be used for reducing the size of the feature set.

Corpus preparation: in other languages, it is very common to prepare a corpus for research purposes; unfortunately, to the researcher's knowledge, we do not yet have a standard corpus for Amharic text classification. Researchers could devote much more time to their work if a standard corpus were prepared for Amharic classification experiments, like 'Reuters-21578' for English.

Feature preparation: the features that represent classes are selected as words in this study. But confusion occurred when words were common across classes, which resulted in misclassifications. Hence, there is a need for research on text classification that considers features selected based on phrases or using an ontology.

Classification types: the data on the ENA SQL server are hierarchical in nature. As far as the researcher knows, this problem has not been researched for Amharic, so this is a potential area of research. Also, some news items in ENA exhibit the characteristics of more than one class, but ENA uses only a single-label classification scheme. It is therefore recommended that ENA start implementing a multi-label classification scheme so that the true characteristics of news items are exhibited. This would also help researchers to undertake studies on multi-label classification of Amharic news.

ENA: manual classification is used in ENA to date. The results of Amharic text news classification research are promising; hence, the agency should start to consider implementing automatic classification for Amharic news.

Other domains: to the knowledge of the researcher, Amharic text classification has so far been tried on news item text only. Other areas, like classifying research papers, can be researched for Amharic documents.

REFERENCES

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison-Wesley, New York.
Bender, M., Bowen, J., Cooper, R. and Ferguson, C. (1976). Language in Ethiopia. Oxford University Press, London.
Demuth, H. and Beale, M. (2004). Neural Network Toolbox for Use with MATLAB. The MathWorks Inc., Natick.
Ethiopia Tadesse. (2002). Application of case-based reasoning for Amharic legal precedent retrieval: a case study with the Ethiopian labor law. MSc Thesis, Addis Ababa University, Ethiopia.
Frakes, W. and Baeza-Yates, R. (2002). Information Retrieval: Data Structures and Algorithms. Prentice Hall.
Giorgino, T. (2004). An introduction to text classification. www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf. Retrieved on October 13, 2008.
Klein, B. (2004). Text categorization or classification. http://www.bklein.de/text_classification.php. Retrieved on October 12, 2008.
Krishnakumar, A. (2006). Text categorization: building a KNN classifier for the Reuters-21578 collection. http://en.scientificcommons.org/42606011. Retrieved on October 12, 2008.
Liao, C., Alpha, S. and Dixon, P. (2003). Feature preparation in text categorization. www.oracle.com/technology/products/text/pdf/feature_preparation.pdf. Retrieved on October 12, 2008.
Manning, C., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, Cambridge.
Martín-Valdivia, M., Ureña-López, L. and García-Vega, M. (2007). The Learning Vector Quantization algorithm applied to automatic text classification tasks. Neural Networks 20: 748-756.
Nega Alemayehu and Willett, P. (2002). Stemming of Amharic words for information retrieval. Literary and Linguistic Computing 17: 1-17.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys 34: 1-47.
Skarmeta, A., Bensaid, A. and Tazi, N. (2000). Data mining for text categorization with semi-supervised agglomerative hierarchical clustering. International Journal of Intelligent Systems 15: 633-646.

Solomon Tefera and Menzel, W. (2007). Syllable-based speech recognition for Amharic. In: Proceedings of the 5th Workshop on Important Unresolved Matters, pp. 33-40, Addis Ababa, Ethiopia.
Surafel Teklu. (2003). Automatic categorization of Amharic news text: a machine learning approach. MSc Thesis, Addis Ababa University, Ethiopia.
Tewodros Hailemeskel. (2003). Amharic text retrieval: an experiment using Latent Semantic Indexing (LSI) with Singular Value Decomposition (SVD). MSc Thesis, Addis Ababa University, Ethiopia.
Thulasiraman, P. (2005). Semantic classification of rural and urban images using Learning Vector Quantization. MSc Thesis, Madras University, India.
Wapedia. (2009). አማርኛ. http://wapedia.mobi/am/. Retrieved on April 24, 2009.
Yi, K. and Beheshti, J. (2004). A comparative study on feature selection of text categorization for Hidden Markov Models. http://www.jsbi.org/journal/GIW02/GIW02F006.pdf. Retrieved on September 24.
Yohannes Afework. (2007). Automatic Amharic text categorization. MSc Thesis, Addis Ababa University, Ethiopia.
Zelalem Sintayehu. (2001). Automatic classification of Amharic news items: the case of Ethiopian News Agency. MSc Thesis, Addis Ababa University, Ethiopia.
