Chapter 2
Background Research
2.1 Machine Learning
Machine learning is a scientific field within artificial intelligence. It enables computers to find patterns in data and to classify or cluster a given set of texts. Classification is one of the important tasks that can be accomplished with machine learning via natural language processing. Computers are trained to classify texts given certain attributes; experts then evaluate the algorithm to check whether it meets the requirements and report the accuracy of its output [ CITATION Atw1 \l 2057 ]. Natural language processing tasks can thus be implemented using machine learning methods.
2.2 Natural Language Processing
Natural Language Processing (NLP) can be defined as a computational approach to analysing text that is based on a set of theories and technologies. The term NLP is normally used to describe the function of software or hardware components in a computer system that analyse or synthesize spoken or written language. The description "natural" is meant to distinguish human speech and writing from more formal languages, such as mathematical or logical notations, or computer languages such as Java and C++ [ CITATION 1 \l 2057 ]. To achieve these tasks, NLP draws on linguistic tools that help in text mining and information retrieval, and it includes many important techniques for extracting knowledge automatically from text.
Linguistic analysis of text typically proceeds in layers. Documents are split into paragraphs, paragraphs into sentences, and sentences into words. These words are tagged by part of speech, grammatical role, and other features before the sentence is parsed. Parsing therefore builds on sentence delimiters, tokenizers, stemmers, and part-of-speech taggers. Sentence boundaries can be detected fairly accurately with delimiters based on regular expressions or punctuation marks, which can disambiguate, for example, the end of a sentence; this type of analysis can also rely on training corpora, or on richer evidence such as part-of-speech frequencies. A tokenizer can further disambiguate punctuation characters, since tokenizers are lexical analysers that divide a stream of characters into meaningful units called tokens. Parsing cannot proceed in the absence of lexical analysis, so stemmers are also required: stemmers are morphological analysers that identify alternative forms of a word by reducing it to its root form. A part-of-speech tagger, building on the tokenizer and sentence delimiter, labels every word with its proper tag, such as noun, verb, or adjective. Finally, parsing may be accomplished by addressing a simplified variant of the full parsing problem, which helps in extracting interesting parts of the text, or it can be done with respect to a grammar, a set of rules stating which combinations of parts of speech generate well-formed phrase and term structures [ CITATION 1 \l 2057 ].
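To make the layered analysis above concrete, the following is a minimal Python sketch of the first two layers, sentence delimiting and tokenization, using regular expressions. The patterns here are illustrative assumptions, not a production-quality splitter: a real system would also handle abbreviations, quotations, and training-corpus evidence as described above.

```python
import re

def split_sentences(text):
    # Delimit sentences on sentence-final punctuation followed by
    # whitespace; abbreviations and quotations are not handled here.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize(sentence):
    # Lexical analysis: divide the character stream into word and
    # punctuation tokens ("meaningful units").
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLP analyses text. It splits documents into sentences, then words!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

Stemming and part-of-speech tagging would then operate on these token lists before any parsing takes place.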
2.3 Text Classification
Text mining is the practice of analysing text with specific algorithms in order to detect and extract substance. It is a relatively recent field and uses techniques primarily developed in information retrieval, statistics, and machine learning. Text mining is carried out in three main stages. The first is the text preparation stage, in which the text is selected and managed, producing training sets created manually by experts. The second is the text processing stage, which applies text mining algorithms to the prepared data sets. At this stage a fully featured natural language processing system defines the rules, attributes, and features of the provided data to be used in a designated algorithm or technique such as decision trees [ CITATION 4 \l 2057 ]. NLP systems usually convert the input to an internal representation that interacts with external language resources such as dictionaries, in order to produce a useful analysis and annotation of the input text. This can serve many types of application, such as automatic question answering, text summarization, machine translation, analysis of customer preferences, and automated tagging of internet advertising [ CITATION Atw1 \l 2057 ]. The final stage is text analysis, which consists of evaluating whether the assumptions made by the experts at the outset are borne out; the extracted text is passed to analysis tools that visualize and present a detailed analysis [CITATION 4 \l 2057 ]. According to Khitam Jbara [ CITATION Jba10 \l 2057 ], automatic text classification is an essential research subject, owing to the large number of digital documents now in use. In addition, according to Al-Kabi and colleagues [CITATION AlK05 \l 2057 ], Automatic Text Categorization (ATC) refers to producing software that uses a set of predefined categories to handle "unseen" text files.
Text classification is studied worldwide, and many research projects have addressed it in different languages, including English and various European and Asian languages. The example implemented in this project is the classification of Arabic text. Automatic Arabic text classification mostly follows the same sequence: first, compile the text documents into a corpus, optionally labelling them; second, select the most suitable features for classification; third, select a classification algorithm to be trained and tested. This project uses the Arabic version of the Holy Quran and implements classification to extract the verses of interest [ CITATION SAl \l 2057 ].
2.3.1 Text Classification Methods
Data are divided into two main types based on their attributes. First, data in which the attributes of interest are known and intended to be used are called labelled data; this type of data is mined with supervised learning methods. If the target attribute divides into categories, the task is classification; if it is numerical, the task is regression. Second, data for which no such attributes are available are called unlabelled data, and the aim is to extract structure from the data set; this type of data uses unsupervised learning methods. In addition, semi-supervised learning is a third machine learning approach used in text classification, and it combines the two previously mentioned methods.
2.3.1.1 Unsupervised Learning
Unsupervised learning uses training samples that include a number of instances, but these instances carry no labels to direct the outcome of training [ CITATION Zhu09 \l 2057 ]. Clustering is often treated as a synonym for unsupervised learning. Clustering can be the objective function in the implementation of a solution to a problem, and it can be hierarchical for some problems or model-based for others [ CITATION Cio07 \l 2057 ]. Since information extraction from unlabelled data is done with unsupervised methods, it is important to understand that clustering means grouping objects that are similar to each other into a cluster.
One example of clustering is K-means clustering, considered one of the simplest unsupervised learning algorithms. It identifies groups of similar items in a data set without labels or guidance. It operates on points in a multi-dimensional space, where the number of attributes gives the number of dimensions. According to [ CITATION The11 \l 2057 ], the algorithm proceeds as follows. Initially, K random points are chosen in the problem space, where K is the number of clusters the user wants. Each cluster then carries a value that is the average of its members, called the "centroid of the cluster". The algorithm iterates over the points, assigning each point to the cluster whose centroid is closest, and then recomputes the centroids. These steps of refining the centroids and reassigning the points are repeated until the results no longer change.
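The steps above can be sketched in a few lines of Python. This is a minimal two-dimensional illustration, not an optimised implementation; the sample points and the fixed random seed are assumptions made purely for the example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    # Step 1: pick K random points as the initial centroids.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[d.index(min(d))].append((x, y))
        # Step 3: move each centroid to the mean of its cluster.
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stop once assignments stabilise
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centroids, clusters = kmeans(points, k=2)
```

With these points the algorithm converges to the two visually obvious groups regardless of which initial centroids are drawn.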
Another example of clustering is the family of hierarchical clustering algorithms, which can be top-down or bottom-up. The bottom-up variant treats each data point as a cluster of its own and repeatedly merges clusters until all the data sit under one cluster. According to Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze [ CITATION Chr08 \l 2057 ], this type is more commonly used in information retrieval, and its output can be visualised as a dendrogram. The top-down variant, by contrast, repeatedly splits a cluster into smaller ones until every point stands as a cluster by itself.
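The bottom-up (agglomerative) variant can be illustrated with a small Python sketch over one-dimensional data. Single linkage is assumed here for simplicity; the merge history it records is exactly what a dendrogram visualises.

```python
def agglomerative(points):
    # Bottom-up: start with each point as its own cluster and repeatedly
    # merge the two closest clusters (single-linkage distance) until one
    # cluster containing all the data remains.
    clusters = [[p] for p in points]
    merges = []  # merge history; this is what a dendrogram draws
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0])
```

The first recorded merge joins the two closest points (5.0 and 5.1), and five points yield four merges in total.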
2.3.1.2 Supervised Learning
Supervised learning uses a corpus to build and train a system based on specific features. A system of this type determines in advance which text content should be considered and what types of information to use from the context, and it determines the available methods of combining the contextual evidence from the values of features in the training data [CITATION McC09 \l 2057 ]. There are many approaches to supervised machine learning, such as decision trees, the Naïve Bayes classifier, and the nearest-neighbour algorithm [ CITATION Bra07 \l 2057 ]. According to [CITATION 1 \l 2057 ], classification using supervised machine learning methods is achieved if the following conditions are fulfilled:
1. Pre-classification of the data that will be analysed
2. “In the simplest case, these classes should be disjoint”
3. If the data cannot be split into classes directly, convert the task into (n) corresponding sub-problems, in which each sub-problem classifies data into those that belong to the corresponding category and those which do not {Yes, No}.
One example of a supervised classification method is the Naive Bayes classifier, which uses probability theory to arrive at the possible classifications. The underlying theorem is named after Thomas Bayes, the mathematician who formulated it [ CITATION Bra07 \l 2057 ]. "Naïve Bayes looks at the distribution of terms, either with respect to their frequencies or with respect to their presence or absence." It considers the probability of each feature from the data set given a class and classifies accordingly; this is achieved by calculating a function over the frequencies of term occurrences [ CITATION 1 \l 2057 ].
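A minimal frequency-based Naive Bayes classifier can be sketched as follows. The tiny training documents and class names are invented for illustration; Laplace (add-one) smoothing is assumed to handle unseen terms.

```python
import math
from collections import Counter

def train_nb(docs):
    # docs: list of (tokens, label); count term frequencies per class.
    class_counts = Counter(label for _, label in docs)
    term_counts = {}
    vocab = set()
    for tokens, label in docs:
        term_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify_nb(tokens, model):
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        # log P(class) + sum of log P(term | class), Laplace-smoothed
        score = math.log(n_docs / total_docs)
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            score += math.log((term_counts[label][t] + 1)
                              / (total_terms + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train_nb([
    (["mercy", "forgiveness"], "faith"),
    (["mercy", "prayer"], "faith"),
    (["battle", "army"], "history"),
])
```

A new token list is then classified by `classify_nb(tokens, model)`, which returns the highest-scoring class label.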
Another supervised method is to build a decision tree over features from the data sets. The root of the tree holds the attribute used to differentiate the classes; the nodes of the tree are decision points that each test a feature, and a branch corresponds to a value of the test's outcome. If the tree is built from a pre-classified training set, this constitutes an inductive learning technique. According to [ CITATION 1 \l 2057 ], "Decision tree methods work best with large data sets; training sets that are too small will lead to overfitting. The data must be in regular attribute-value format. Thus each datum must be capable of being characterized in terms of a fixed set of attributes and their values, whether symbolic, ordinal or continuous. Continuous values can be tested by thresholding. Assuming that they are applicable, decision tree methods can have a number of advantages over more conventional statistical methods:
1- They make no assumptions about the distribution of attribute values.
2- They do not assume conditional independence of attributes." [ CITATION 1 \l 2057 ]
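The key step in building such a tree, choosing which attribute to test at a node, is usually done by information gain. The following sketch computes it for a toy attribute-value data set; the attribute names and rows are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum p * log2(p) over the class distribution
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Gain = H(labels) minus the weighted entropy after splitting on attr.
    total = entropy(labels)
    n = len(rows)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        total -= len(subset) / n * entropy(subset)
    return total

# Toy attribute-value data in the "regular format" the quotation requires.
rows = [{"has_keyword": "yes", "long": "no"},
        {"has_keyword": "yes", "long": "yes"},
        {"has_keyword": "no", "long": "no"},
        {"has_keyword": "no", "long": "yes"}]
labels = ["A", "A", "B", "B"]

# The attribute with the highest gain becomes the root of the tree.
root = max(["has_keyword", "long"],
           key=lambda a: information_gain(rows, labels, a))
```

Here `has_keyword` separates the classes perfectly (gain 1 bit) while `long` carries no information (gain 0), so it is chosen as the root.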
2.3.1.3 Semi-Supervised Learning
Another machine learning method is semi-supervised learning, essentially a combination of supervised and unsupervised learning: the data set includes some labelled data along with unlabelled data. It is claimed that semi-supervised learning yields higher accuracy than either of the other two methods alone. According to Steven Abney [ CITATION Abn08 \l 2057 ], there are six different families of semi-supervised learning methods: self-training, agreement-based methods (the co-training algorithm), clustering algorithms, boundary-oriented methods, label propagation in graphs, and spectral methods. According to [ CITATION Zhu05 \l 2057 ], self-training has been used in text categorisation by training a classifier on the labelled data and then applying it to the unlabelled data. Co-training is another implementation, accomplished with two classifiers and a split of the features into two subsets: each classifier is trained on the labelled data using one of the subsets, and then each classifier is applied to the unlabelled data.
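Self-training can be sketched with a deliberately simple base learner. Here a nearest-centroid classifier over one-dimensional values stands in for the text classifier, and confidence is approximated by distance to the nearest centroid; both choices are assumptions made to keep the example minimal.

```python
def centroid_classify(x, centroids):
    # Assign x to the class whose (1-D) centroid is nearest.
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def self_train(labelled, unlabelled, rounds=5):
    # labelled: list of (value, label). Start from the supervised seed
    # set, then repeatedly label the unlabelled point we are most
    # confident about and add it to the training data.
    labelled, unlabelled = list(labelled), list(unlabelled)
    for _ in range(rounds):
        if not unlabelled:
            break
        centroids = {}
        for label in set(l for _, l in labelled):
            vals = [v for v, l in labelled if l == label]
            centroids[label] = sum(vals) / len(vals)
        # Confidence proxy: smaller distance to a centroid = more sure.
        best = min(unlabelled,
                   key=lambda x: min(abs(x - c) for c in centroids.values()))
        labelled.append((best, centroid_classify(best, centroids)))
        unlabelled.remove(best)
    return labelled

result = self_train([(1.0, "low"), (9.0, "high")], [2.0, 8.0, 8.5])
```

Starting from one labelled example per class, the loop absorbs the three unlabelled points, assigning each to the class its value lies nearest.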
2.3.2 Classification Examples
Islamic scholars have shown interest in classifying the verses of the Quran for a number of reasons. For example, classification of verses helps in understanding the context of the chapters, in analysing the evolution and wisdom behind some Islamic rulings in the Quran, and in studying the history of the Islamic nation and the biography of the Prophet Muhammad [CITATION Atw1 \l 2057 ]. In addition, according to Khitam Jbara [ CITATION Jba10 \l 2057 ], "Al hadith is the saying of Prophet Mohammed (peace and blessings of Allah be upon him) and the second religious source for all Muslims", and it is important to be able to classify the hadith into subjects, which serve as the classes in such research. Two religious examples are discussed in this section, Quranic classification and hadith classification, followed by an example from the medical domain. The classification examples all required a predefined corpus to train the classifier and test its results. A corpus is a natural-language source of linguistic information, since it is an illustration of linguistic knowledge; corpora are used as data analysis tools to determine patterns and to support other language processing tasks, and a corpus may be created manually to serve the particular aims and objectives of a study. Text mining procedures can be carried out on a large data set such as the Qur'an corpus, using words and interesting features defined by the user to retrieve information, and using a corpus saves the time and effort of classifying manually. Most text classification studies use a corpus tailored to the research. The following examples explain how each corpus was created and which methods and algorithms were implemented to achieve the classification.
Quranic classification
According to Al-Kabi and colleagues [ CITATION AlK05 \l 2057 ], the objective of their study was to classify the verses (Ayat) of The Opening (Al-Fatiha) and Ya-Seen chapters according to Islamic scholars, using a linear classification function. This was accomplished by building a system intended to categorize the different verses in each chapter. The system was implemented in Microsoft Visual Basic because it supports Unicode encoding of the Arabic language. The implementation first selects the desired chapter and verse to classify, generates a list of words as features, counts their occurrences, and checks which subject they relate to in order to assign a class for that subject. A corpus holding the list of subjects was required and was generated manually for the selected chapters. Secondly, the verses are normalized by removing punctuation and parsing them into tokens, and then categorized into classes according to the subjects created for the system. A function representing the percentage match for a specified subject is computed, and the highest-scoring classification is recorded. This system scored 91% accuracy in classifying the verses.
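The scoring step of such a keyword-based scheme can be illustrated briefly. The subject names and keyword lists below are invented (and transliterated) for illustration only; they are not the lists used in the cited study.

```python
def classify_verse(tokens, subjects):
    # Score each subject by how often its keywords occur in the verse,
    # then pick the highest-scoring subject as the class.
    scores = {name: sum(tokens.count(w) for w in words)
              for name, words in subjects.items()}
    return max(scores, key=scores.get), scores

# Hypothetical subject -> keyword lists, for illustration only.
subjects = {"praise": ["praise", "lord", "merciful"],
            "guidance": ["path", "guide", "straight"]}

label, scores = classify_verse(["guide", "us", "straight", "path"], subjects)
```

In a real system the raw counts would be normalised into percentages per subject before the highest score is recorded.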
Furthermore, research at the University of Leeds classified the Quran's verses into Meccan and Madinan. In this research the classification was based on the migration of the Prophet Muhammad, and the algorithm relied on the prior knowledge of Quran scholars, who had already identified some of the chapters as Meccan or Madinan. The accuracy of the algorithm depended on a feature set of keywords, on the availability of a training set drawn from the chapters classified earlier, and on the use of the developed Quranic corpus. In contrast to the previous study, this research defined 14 features obtained from scholars; these features were converted into countable keywords whose frequencies were extracted. After defining these features, the classification was carried out using the open source software WEKA [CITATION Atw1 \l 2057 ].
Hadith classification
Another example of text classification is hadith classification. According to Khitam Jbara [ CITATION Jba10 \l 2057 ], the study was motivated by the importance of the hadith and of its correct classification for Islamic studies. The text of the hadith was obtained from Sahih Al-Bukhari, the hadith collection used by most research involving hadith. The research classified the hadith into the subjects they discuss, yielding thirteen classes. The proposed system included four main procedures: processing; training, in which the selected features that help in classification are learned; classification, which applies a method to classify the hadith; and finally analysis of the classification results. The researchers had to create a corpus of hadith to carry out the classification. The procedure followed to obtain the training set was, firstly, to remove the part of each hadith that lists the names of the individuals who transmitted it from the Prophet. Secondly, the hadith were tokenized into words and punctuation marks were removed. Moreover, words such as pronouns and prepositions, and names of people or places, were removed from the set, and what remained was treated as features. The last step in constructing the corpus was stemming the features by removing prefixes and suffixes, and eliminating words that were no longer meaningful after stemming. The researchers eventually obtained 19 features that were weighted and used to classify the hadith. Supervised classification was used, with a training text file created to extract the features. Three classification methods were implemented: the Al-Kabi method, word-based classification (WBC), and stem-expansion classification (SEC). Of the methods implemented, SEC achieved the best results.
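The prefix-and-suffix stripping step can be sketched as a light stemmer. The affix lists below are short transliterated placeholders chosen for illustration, not the affix lists of the cited study, and the minimum-stem-length check stands in for the study's elimination of words that lose their meaning after stemming.

```python
PREFIXES = ["al", "wa", "bi"]   # illustrative transliterated prefixes
SUFFIXES = ["un", "at", "in"]   # illustrative transliterated suffixes

def light_stem(word, min_stem=3):
    # Strip at most one known prefix and one known suffix, keeping the
    # stem only if enough characters remain to stay meaningful.
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            word = word[:-len(s)]
            break
    return word

stems = [light_stem(w) for w in ["albayt", "wakitabun", "din"]]
```

Note that the short word is left intact: stripping its apparent suffix would leave too little to be a stem.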
Medical classification
According to Aphinyanaphongs and Aliferis [CITATION Aph07 \l 2057 ], many publications spread on the internet promote "unproven treatments" and inaccurate medications, putting cancer patients at risk. To address this problem, research was carried out to identify the web pages making such claims. One concern was that people who are not real physicians were giving medical advice and treatments; moreover, cancer patients had been purchasing dubious medication online. The data set used to retrieve the information was gathered from unproven treatments identified by the Quackwatch website. To create the corpus, the researchers randomly selected eight unproven treatments, with "Cure for all Cancers", "Metabolic Therapy", "Cellular Health", and "Insulin Potentiation Therapy" among the examples. Web pages were identified by appending the words "cancer" and "treatment" to each treatment name and retrieving the top results of a Google query. The web pages were then labelled positive or negative: pages that included unproven claims were labelled positive, the others negative. The final corpus included 191 web pages labelled positive and 98 labelled negative. Web pages were converted to lists of words by removing script tags and replacing punctuation with spaces; the words were then stemmed, and words that appeared in fewer than three web pages were removed. A number of algorithms were implemented and compared with each other; one of them counted the frequencies of the terms against a user-defined threshold.
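The feature-pruning step, removing words that appear in fewer than three web pages, amounts to a document-frequency threshold, which can be sketched as follows. The sample pages are invented for illustration.

```python
from collections import Counter

def filter_features(pages, min_df=3):
    # pages: list of token lists. Keep terms whose document frequency
    # (number of pages containing the term) meets the threshold.
    df = Counter()
    for tokens in pages:
        df.update(set(tokens))  # set(): count each page at most once
    return {t for t, n in df.items() if n >= min_df}

pages = [["cancer", "cure", "therapy"],
         ["cancer", "cure", "miracle"],
         ["cancer", "treatment", "cure"],
         ["cancer", "diet"]]
features = filter_features(pages, min_df=3)
```

Only terms shared by at least three pages survive as features; the rare terms are discarded before any classifier is trained.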
2.5 WEKA
WEKA was developed at the University of Waikato in New Zealand; the name stands for Waikato Environment for Knowledge Analysis. It is open source software written in Java under the terms of the GNU General Public License. WEKA has been tested to run on any operating system platform. An accessible interface is provided to many different learning algorithms, along with methods for pre- and post-processing and for evaluating the results of learning schemes on any given dataset [ CITATION Ian05 \l 2057 ].
2.5.1 Graphical interface & Performance
There are three graphical interfaces to WEKA. The first, the Explorer, gives the user access to all of WEKA's facilities through menu selection and form filling. The second is the Knowledge Flow interface, which lets the user design configurations for streamed data processing: the user specifies a data stream by connecting components representing data sources, pre-processing tools, learning algorithms, evaluation methods, and visualization modules. The third interface, the Experimenter, is designed to help the user answer basic practical questions when applying classification and regression techniques.
Figure 1: WEKA Explorer
WEKA provides a variety of learning algorithms that the user can easily apply to a dataset. It likewise includes a selection of tools for transforming datasets, such as algorithms for discretization. The user can apply any method to a dataset by loading the dataset into a learning algorithm and then analysing the results. WEKA contains methods for all the standard data mining problems: regression, classification, clustering, and so on. It can be used to apply a learning method to a dataset and analyse its output to learn about the data; to apply learned models to produce predictions on new instances; or to apply several different learning methods and compare their performance in order to choose one for prediction [ CITATION Ian05 \l 2057 ].
2.5.2 Data format: ARFF file and Processing
The dataset can be loaded into the WEKA Explorer in a number of formats, such as spreadsheet and database formats, but the built-in format is the ARFF file. This file has three parts:
1. Relation name: the first line in the file should be a relation name starting with @relation, which can be any meaningful name [CITATION Atw1 \l 2057 ].
2. Attribute list: each attribute starts with the @attribute keyword, followed by a name defined by the user and then the attribute's type, such as an enumerated list of possible values [CITATION Atw1 \l 2057 ].
3. Data set: finally, the ARFF file expects the data itself, with each instance represented as a row of comma-separated attribute values [CITATION Atw1 \l 2057 ].
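A minimal ARFF file showing the three parts might look as follows; the relation and attribute names are invented for illustration:

```
@relation verse-classification

@attribute mercy numeric
@attribute battle numeric
@attribute class {faith, history}

@data
2,0,faith
0,3,history
```

Here the first two attributes are numeric keyword counts, and the class attribute is declared as an enumerated list of possible outcomes.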
After loading the data into the Explorer, the attributes defined in the .arff file appear in the Preprocess tab. In addition to the Preprocess tab, a Classify tab is provided, in which the user can build and train a classifier; many classification options are available, each with a description and the parameters that can be set. Clustering can likewise be carried out from the Cluster tab. Both the Classify and Cluster tabs provide options for testing and training, and the results are displayed in the output panel. In addition, WEKA provides an attribute selection facility, accessed from the "Select attributes" tab, which helps in choosing smaller sets of attributes and in identifying those most useful for classifying and clustering. The last tab in the Explorer is the Visualize tab, which displays 2D distributions of the data [ CITATION Ros07 \l 2057 ].