Text Mining Quran Tafsir in Indonesia
EDI IRAWAN
2017390001
SAMPOERNA UNIVERSITY
2020
LEGITIMATION LETTER
Faculty
EXCLUSIVE RIGHT STATEMENT
I, Edi Irawan (2017390001), hereby grant to our school, Sampoerna University,
the non-exclusive right to archive, reproduce, and distribute my Final Year Project entitled
I acknowledge that I retain the exclusive right to my Final Year Project, including using all or
parts of it in future work or output, such as articles, books, software, and information
systems.
Edi Irawan
SIN: 2017390001
[Originality Statement]
DECLARATION OF ORIGINALITY
I, Edi Irawan (2017390001), hereby acknowledge that this submission entitled “TEXT
MINING QURAN TAFSIR IN INDONESIA” is my own work, except where due
acknowledgement is made in the Final Year Project. Any contribution made to the
Edi Irawan
SIN: 2017390001
ACKNOWLEDGEMENT
I would like to thank Allah SWT for always giving His blessings, one of which is
the completion of my thesis. I also have to thank my research supervisor, Prof. Ir. Media
Anugerah Ayu, MSc., PhD. Without her advice at every step of the thesis, it would
not have been possible to complete this research. This thesis also contains
knowledge given by all the lecturers who have taught me at Sampoerna
University; I am very grateful for their dedication in teaching me all the
courses.
ABSTRACT
commentaries and explanations of each surah and/or ayah, forming a large document.
The challenge is how to refer to both the tafsir and the translation quickly and accurately,
as one needs to move back and forth between them. Hence, this study proposes several
Translation. The results show that the tafsir is well clustered into eight groups. From the TF
and TF-IDF results, 17 of the 30 most frequent words are the same.
The correlation between the tafsir and translation frequencies is 0.5306, which is a moderate
rules from the tafsir and translation, for example (“Musa”, “Bani Israil”) -> “Agama”.
The future work in this field is to do semantic text mining and to create a system for
Keywords: Text mining, text clustering, most frequent words mining, association rule.
Table of Contents
IV. RESULTS AND DISCUSSIONS .................................................................................. 31
4.1 Feature Selection and Extraction Result ............................................................. 31
4.1.1 Datasets Gathering Result ................................................................................ 31
4.1.2 Feature Selection Result................................................................................... 32
4.1.3 Feature Extraction Result ................................................................................. 33
4.2 Most Frequent Words Mining Result .................................................................. 34
4.3 Clustering Result .................................................................................................... 40
4.3.1 Preliminary Experiment Result ........................................................................ 40
4.3.2 Final Experiment Result................................................................................... 41
4.4 Association Rules Result........................................................................................ 46
V. CONCLUSIONS AND RECOMMENDATIONS ....................................................... 50
5.1 Conclusions ............................................................................................................. 50
5.2 Recommendations .................................................................................................. 50
REFERENCES....................................................................................................................... 52
APPENDICES ........................................................................................................................ 55
APPENDIX 1 K-MEANS ALGORITHM CODE ........................................................... 55
APPENDIX 2 FP GROWTH ALGORITHM CODE ..................................................... 57
APPENDIX 3 MOST FREQUENT TERMS ALGORITHM CODE............................ 64
APPENDIX 4 DESIGN OF ASSOCIATION RULES .................................................... 65
LIST OF FIGURES
LIST OF TABLES
I. INTRODUCTION
The Holy Quran is the most valuable book for Muslims, as they believe it
contains the words of God (Yusuf Ali, 2001). Inside the Quran, there are fundamental
regardless (YILMAZ GENÇ & Syed, 2019). The Quran was originally
delivered in the Arabic language. In order for the meaning of the Arabic Quran to be known by
Muslims all around the world, the Quran has been translated into several languages, including
Indonesian. In Indonesia, the number of Muslims is around 207 million people, or 87% of the population.
Any translation of the Arabic Quran is not enough for non-native Arabic speakers
to really understand its exact meaning. That is why there are
honorable and knowledgeable people, called mufassir, who write commentary books on
the Quran; the books are called tafsir. The explanation and commentary
inside a tafsir must not be based on individual opinion. The content of the Quran must
stay the same, so all commentary must refer to the explanation of the Prophet Muhammad.
As the tafsir and translation of the Quran deal with long sentences and many words, it
becomes a challenge to extract valuable information from both of them. In this
technological era, people tend to abandon conventional practices such as referring to a thick book
by turning page after page. The invention of information retrieval algorithms
and text mining in Natural Language Processing (NLP) enables people to mine
valuable information inside large text documents automatically and quickly, and might be a
possible answer to the challenge of referencing the tafsir and translation. There are Quran-
related NLP studies, for example by (Khadangi et al., 2018), (Sabah et al., 2015) and
Nowadays, Natural Language Processing has been widely used through the
method called text mining (Zhou et al., 2020). Text mining is considered a branch of
NLP, as it uses some fundamental NLP methods but with different goals. Unlike
NLP, which cares about the semantic information in the text, text mining also
has methods that treat the text as a ‘bag of words’, meaning the semantic
information is not explored. The main goal of text mining is to analyze both
unstructured and structured large text datasets so that one does not have to read the whole
text (Zhou et al., 2020). Because of that, text mining is becoming a valuable research
area, as the improvement of Artificial Intelligence has reached the level where
the extraction of information from textual data has to be automated. The result of text
mining is information from the analysis of terms and words (Zhou et al., 2020).
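To make the bag-of-words idea concrete, the following is a minimal sketch of raw term frequency (TF) and the common TF-IDF weighting; the toy sentences are illustrative stand-ins, not the study's actual dataset or code:

```python
import math
from collections import Counter

# Toy "documents" standing in for tafsir passages (hypothetical data).
docs = [
    "allah maha pengasih maha penyayang",
    "musa menerima wahyu dari allah",
    "bani israil mengikuti musa",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Raw term frequency: occurrences of `term` in one document.
    return Counter(tokens)[term]

def idf(term, corpus):
    # Inverse document frequency: log(N / df), df = documents containing term.
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "allah" appears in 2 of 3 documents, so its IDF is low;
# "bani" appears in only 1, so it is weighted higher.
print(round(tfidf("musa", tokenized[1], tokenized), 4))  # 0.4055
print(round(tfidf("bani", tokenized[2], tokenized), 4))  # 1.0986
```

In a bag-of-words view like this, word order and semantics are discarded; only the term counts and their document frequencies matter.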
Today, there are three types of research areas in text mining: preprocessing
algorithms; comparative studies of machine learning methods for both classification and
clustering, as well as comparisons of feature extraction algorithms; and studies
exploring the mining results on a specific text dataset (Chua & Ellyza Binti Nohuddin, 2014).
Of these, text mining research is mostly about the preprocessing stage of
text mining itself. The development of text mining needs to be explored, because
previous methodologies still have weaknesses, for example in the preprocessing stage
(Alhawarat et al., 2015). As text mining deals with numerous words, feature
extraction and selection need to be improved in order to get a truly “tidy” training
This study explores the Indonesian Quran Tafsir and Translation using
one of the NLP sub-fields, text mining, to group topics based on the similarities
of their words, as well as using frequent words mining. The further mining is
to find the association rules inside the tafsir and translation. As the Indonesian Quran
Tafsir and Translation are available, it is possible to obtain the datasets and create a
methodology to analyze and understand the textual data inside both of them.
The idea of text mining, or data mining in general, is to get the most valuable information
inside large unstructured or structured data sets (Jayasekara & Abu, 2018). In
other words, text mining should fit the task of getting the most valuable knowledge inside the
tafsir and translation.
Many researchers have applied text mining using different types of methods.
In machine learning, there are two types of learning: classification and
clustering (Louridas & Ebert, 2016). In text classification, the text dataset
is classified based on supervised learning, which decides the target labels (Louridas
& Ebert, 2016). Clustering of text means that the text datasets are grouped based on the
similarities of their words (Louridas & Ebert, 2016). Studies of text mining in
general have several sides to explore, such as research on how to prepare the preprocessing
in order to get a really “clean” dataset. Another example is research on how to
implement text mining effectively on a more specific dataset, focusing on the
dataset result after the text mining is done. Furthermore, scholars also have a high interest
in this kind of mining.
The Quran alone contains 114 chapters (the Arabic term is surah), 30 parts
or juz, 60 groups or hizb, and 6,236 verses or ayah. As a result, the tafsir has many words,
and readers rarely read the whole text of the tafsir, especially when looking for something
specific. To support that statement, based on a survey from Badan Pusat Statistik (BPS) done by SUSENAS in 2018, 53.57%
of all Muslims in Indonesia cannot read the Arabic Quran. That number is very high
and shows how reluctant Muslims are to refer to the Quran, and how high the urgency is to
Inside the Quran, if one reads all the chapters, there are many similar topics and
narratives, but they are separated from each other rather than placed in the same chapters (Farooq & Kanwal,
2019). Thus, if one wants to consult the Quran about a problem, there are many possible
answers provided by different chapters and ayah. The reason this happens is that
the Quran's chapters are mostly not organized by similar topics or narratives, but
by the timeline in which the Quran was delivered to the Messenger of God,
Muhammad (Farooq & Kanwal, 2019). In conclusion, in this modern era, there should exist
an approach to help categorize and structure the tafsir and translation. The other
challenge is how to enable readers to optimize information retrieval from both of
them.
Translation and tafsir of the Quran deal with interpretation
of the Arabic language. The problems of natural language, namely the difficulty of
exchanging exactly the same information and meaning from one language to another and of
understanding the syntax or grammar, have become a research topic, for example in
(Verhagen, 2008). This is also why no translation or tafsir of the Quran can be
claimed as an exact representation of the Arabic Quran (Religious Literacy Project,
n.d.). This applies to the Indonesian Tafsir and Translation of the Quran as well. Hence, the
information exchanged inside both of them needs to be explored and compared to prevent
• To group the Indonesian Quran Tafsir and Translation topics based on their
similarities.
• To get valuable information inside the tafsir and translation from a text mining
point of view.
of this version is not biased; the only reason is that the text format for this
3. The range of cluster numbers in the preliminary study is set from 2 to 12 clusters.
This is because the known number of topics is around six themes, based on
(Religious Literacy Project, n.d.), so the range of clusters needs to be around
that number.
4. The clustering is done only for grouping purposes, not to measure the performance
of the algorithm.
1. For the Quran-related text mining study area, this study is expected to contribute
to the variety of studies on grouping Indonesian Quran Tafsir and Translation
texts.
Frequent Pattern (FP) Growth Algorithm for finding association rules inside the
3. For Islamic scholars and new Islamic learners, this study can be a new method
4. For Indonesian Quran application developers, this study can be their reference
for grouping the topics of the text for information retrieval purposes.
5. For Indonesian Quran Application Developers, the association rule inside the
1.6 Limitation of the Study
This study has potential limitations. It uses a non-semantic text mining
approach, i.e., a statistical study only. Thus, the results of the study cannot differentiate such
Study, Objective of the Study, Scope of the Study, Significance of the Study
2. Literature Review. This section includes the text mining sub-chapter, which
covers related works on text mining in general. The second sub-
chapter covers text mining related to the tafsir and the Quran; it presents
all related works on text mining, specifically for the tafsir and the Quran.
3. Research Methodology. This section includes the research workflow and all
4. Results and Discussion. This section includes all the results shown by the
II. LITERATURE REVIEW
The literature review chapter presents the references for all knowledge
and past works related to this study. The organization of this chapter is as follows:
2.1 Quran
All Muslims believe that the Quran is the sacred book of Islam, containing the
word of God. The Prophet Muhammad is the one who received those words through
the angel Jibril and had the duty to deliver the Quran to all human beings (Yusuf
Ali, 2001). Muslims have the duty to obey the laws inside the Quran.
The original Quran was delivered to Muhammad in Arabic, and as Islam
has spread around the world, translations and interpretations (tafsir) have been
made into other languages. Even so, all Muslims must believe that the Arabic Quran
and the translated ones are not the same. No translation or interpretation can claim to
represent the Quran except the original Arabic (Religious Literacy Project, n.d.).
Thus, Muslims need to always recite the Arabic Quran and never replace the Quran with any
Unlike a Quran translation, which only contains the cross-language rendering from Arabic into a
certain language, tafsirs also contain the interpretation of each Quran verse; these
tafsirs are created by highly knowledgeable Islamic teachers (Nirwana, 2017).
There are four methodologies in creating tafsir. The first method is called “Tahlili”, or
analytical, which interprets each verse of each surah, following the surah
order. The second method is called “Ijmali”, or global; unlike the first method,
this tafsir is based on the interpretation of the whole surah, using
simpler wording and fewer words in order to simplify the tafsir and make it easy to understand
and Hadith or with other credible sources. The last is “Maudhui”, or thematic, which
classifies all verses with similar themes and analyzes the reason why the
2.3.1 Definitions
branch of knowledge of Artificial Intelligence (AI) technology which uses three types of
concepts and techniques: information retrieval, information extraction, and natural
language processing. These concepts and techniques should be well combined and
connected with the help of algorithms, knowledge discovery in databases (KDD), and data
2005).
There are three main terms in frequent pattern mining: itemsets,
subsequences, and substructures. The idea of frequent pattern mining is to find those
three kinds of patterns that appear in a data set with a frequency no less than a specified threshold.
These patterns become association rules, each with a certain score. By knowing
the association rules, useful information can be obtained from the data set (Han et
al., 2007).
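As a hedged illustration of how rules are scored (the transactions below are hypothetical term sets, not the study's data), the support and confidence of a candidate rule such as ("musa", "bani israil") -> "agama" can be computed as:

```python
# Toy transactions: each "transaction" is the set of salient terms
# in one passage (hypothetical data for illustration only).
transactions = [
    {"musa", "bani israil", "agama"},
    {"musa", "bani israil", "agama"},
    {"musa", "firaun"},
    {"bani israil", "agama"},
]

def support(itemset, db):
    # Fraction of transactions containing every item of `itemset`.
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    # Estimated P(consequent | antecedent) over the transactions.
    return support(antecedent | consequent, db) / support(antecedent, db)

lhs, rhs = {"musa", "bani israil"}, {"agama"}
print(support(lhs | rhs, transactions))    # 0.5
print(confidence(lhs, rhs, transactions))  # 1.0
```

Algorithms such as Apriori or FP-Growth enumerate frequent itemsets efficiently, but the rule scoring itself reduces to these two ratios against the chosen thresholds.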
There are many fields in text mining that still need further research and improvement.
There are three types of research areas in text mining: preprocessing algorithms;
comparative studies of machine learning methods for both classification and clustering, as
well as comparisons of feature extraction algorithms; and studies exploring the mining
results on a specific text dataset. Currently, studies on text mining in general
are mostly about the preprocessing stage (Jayasekara & Abu, 2018).
Besides the preprocessing, method selection is also one of the trending research areas.
Researchers usually compare two or more common methods in text mining, whether
for clustering or classification (Jayasekara & Abu, 2018). Another research
area is to apply text mining to a specific dataset, with the focus
on that specific dataset. Moreover, there are also studies comparing two or more
classification methods. Figure 1 summarizes the common research types in text
mining.
Figure 1. The Text Mining Research Type.
The first examples of preprocessing research were done by (Agnihotri
et al., 2014; Alhawarat et al., 2015; Matsumoto et al., 2017). The study (Alhawarat et
al., 2015) presents most-frequent-pattern mining from textual data. The proposed
method uses cosine similarity to decide the distance between terms.
The objective of the study is to compare the clustering performance of
two methods, k-means and hierarchical clustering. This type of
study does not present a deep analysis of the textual data itself, but rather explores
the performance of the two algorithms. The datasets the study exploited came
from stories, news, and e-mails. Another similar study (Agnihotri et al.,
2014), like the research done in (Alhawarat et al., 2015), presents the most frequent words
from a dataset, the Quran. The study highlights that this work is worthwhile because Arabic
words differ from the Latin alphabet, and the challenge of natural language processing
for non-English, non-Latin-alphabet languages is high. Hence, a different type of
method should be used, mostly in the preprocessing stage, in order to treat Arabic words well.
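The cosine similarity used as a distance in such studies can be sketched in a few lines of plain Python over bag-of-words count vectors; the strings below are toy examples, not the cited study's code:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Bag-of-words cosine similarity between two term-count vectors:
    # dot(a, b) / (|a| * |b|), ranging from 0 (no shared terms) to 1.
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Two of three tokens shared -> similarity 2/3.
print(round(cosine_similarity("musa menerima wahyu",
                              "musa menerima kitab"), 4))  # 0.6667
```

Because it normalizes by vector length, cosine similarity compares documents by term overlap rather than by raw document size, which is why it is a common choice for clustering text.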
A different type of text mining study is presented in (Matsumoto et al.,
2017), whose objective was to examine different types of dataset, numerical and textual.
In other words, the study explored the dataset itself and how the researchers
treated both kinds of dataset in order to know which type, numerical
or textual, gives the best and most effective result. The study contributes to
reviewing questionnaire or review data that contain not only
textual data but also numerical data. This kind of research is outside the mainstream
of other text mining studies because, in (Matsumoto et al., 2017), the authors
look at another side of text mining: creating a framework for the basic text
mining environment, called Total Environment for Text Data Mining (TETDM),
with the help of the tool R.
Another example of a text mining study type is category classification of a
certain dataset; for example, (Jayasekara & Abu, 2018) used highly cited
papers in text mining as the dataset. The objective was to classify the main topics and sub-
topics in data and text mining studies using text mining methods. Another study
that focuses on the dataset as well as on comparing classification
methods is (Wongso et al., 2017), which explores Indonesian online news articles
as the dataset. The goal is to classify the topics automatically using a combination
of a classification method and a feature extraction method. Interestingly,
the highest accuracy was given by Multinomial Naïve Bayes. The option for
feature extraction was Term Frequency – Inverse Document Frequency (TF-IDF).
For the classification method, or supervised machine learning, the authors used
Multinomial Naïve Bayes, Multivariate Bernoulli Naïve Bayes, and Support Vector Machine.
The last type of research area in text mining is the study of the dataset itself. The
exploration of datasets is usually for business or market analysis, political issue analysis,
trend analysis, and more. For business or market analysis, the
researchers typically compare two or more famous brands of the same products. The
dataset for this kind of research can be obtained from popular microblogs. Usually, the
microblog provides an Application Programming Interface (API), so it is easier to get a large
dataset with some coding. An example of market analysis is (Friedemann, 2015), which exploited one of the
giant microblogs worldwide, Twitter, as the dataset source. The study clusters the
customers of a famous brand, Nike. The clustering method was K-
Means; for feature extraction, the author used Principal Component Analysis (PCA).
The author collected information on Nike's followers on Twitter and clustered them
to learn the customers' characteristics, leading to the conclusion that the clusters gave
the expected performance and that market analysis can be done well using
Twitter as the dataset (Friedemann, 2015). Moreover, another example of
dataset exploration is trend analysis, given by (Kim et al., 2017). The study
identified research topics regarding consumer policy change, using one of the
unsupervised machine learning (clustering) methods, K-Means. This study gave the
expected result: there is an understandable meaning from the keywords of papers
on consumer policy studies (Kim et al., 2017).
Text mining can also make a valuable contribution in the medical area, where
there are numerous documents. Those documents can be so large that
people cannot read them manually, yet they contain valuable
information for medical research, for example the history of patients with
cancer or other diseases. By understanding the information inside
large unstructured medical documents, one may be able to find cures for certain lethal
and fatal diseases. To do that, again, text mining can be really useful. The
study (Waykole & Thakare, 2018) explored a type of clinical text about
cancer, analyzing the clinical literature for cancer. The authors used Logistic
Regression and Random Forest as classifiers. For feature selection, the comparison
covered Bag of Words, TF-IDF, and Word2Vec; the best feature selection method was word2vec, with random forest
There are also studies that present surveys and systematic literature
reviews of text mining. This type of study helps text miners to be better prepared
in doing their research. An example is (Jayasekara & Abu, 2018);
the objective was to give insight for Indonesian Translation Quran (ITQ)
text mining research by reviewing some concepts from the literature. The most standout
research areas related to data mining are GIS, information and bibliometric
analysis (Jayasekara & Abu, 2018). Another point of view on text mining
research combines the text mining method with conventional methods, as
presented in (Zhang et al., 2019). The objective of that study was to hybridize
text mining and item response theory (IRT) in assessing career adaptability. The
methodology was a combination of text mining and IRT.
College students' self-reported career adaptability served as a subjective measure, and
the results showed that, in terms of accuracy, text-IRT gives unique advantages, the best
predictive effect, and higher reliability for 300 subjects; text
classification does so for 600 subjects; and text-IRT again for 900 subjects.
The summary of all the literature reviews of this study on text mining is
presented in Table I.
4. (Jayasekara & Abu, 2018)
   Objective: To give insight for Indonesian Translation Quran (ITQ) text mining research by reviewing some concepts from the literature.
   Method: Using a general text mining technique (similarity based).
   Result: The most standout research areas related to data mining are GIS, information and bibliometric analysis.

5. (Zhang et al., 2019)
   Objective: To hybridize text mining and item response theory (IRT) in assessing career adaptability.
   Method: Combination of text mining and IRT; college students' self-reported career adaptability as a subjective measure and responses to questionnaire items as an objective measure under a Bayesian framework.
   Result: In terms of accuracy, text-IRT gives unique advantages, the best predictive effect, and higher reliability for 300 subjects; text classification does so for 600 subjects; and text-IRT again for 900 subjects.

6. (K. Singh et al., 2016)
   Objective: To compare the performance of two clustering methods in text mining.
   Method: Simple k-means and spectral k-means.
   Result: Both algorithms give similar results. In this experiment, there were 40 nodes with two clusters; the value of k in simple k-means was two.

7. (Friedemann, 2015)
   Objective: To discover clusters of a company's customers based on a publicly available dataset, Nike's Twitter followers.
   Method: K-Means and Principal Component Analysis with the dataset from Twitter.
   Result: The clusters show the expected, well-defined result. The experiment also shows that Twitter can be one of the datasets for market research.

8. (Kim et al., 2017)
   Objective: The identification of a research topic regarding consumer policy change.
   Method: K-Means clustering.
   Result: The study gave the expected result: there is an understandable meaning from the keywords of papers on consumer policy studies.

9. (Wongso et al., 2017)
   Objective: To create a suitable method for classifying Indonesian news articles.
   Method: Multinomial Naïve Bayes, Multivariate Bernoulli Naïve Bayes, and Support Vector Machine as classifiers; TF-IDF and SVD algorithms for feature selection.
   Result: The highest result is shown by the hybrid of TF-IDF and Multinomial Naïve Bayes.

10. (Waykole & Thakare, 2018)
    Objective: To explore a type of clinical text about cancer by analyzing the clinical literature for cancer.
    Method: Logistic Regression and Random Forest classifiers; for feature selection, comparing Bag of Words, TF-IDF, and Word2Vec.
    Result: The best feature selection method is word2vec with random forest as the classifier.
For Quran-related text mining, there are several types of research. The first type
observes the algorithms used for Quran text mining; by doing this type of research,
researchers can compare two or more algorithms to extract the most valuable
information inside the Quran. The other type of research in Quran text mining focuses
on a specific dataset and analyzes the text mining results applied to that dataset, in
this case the Indonesian Tafsir and Translation. There are also previous related works on text
mining for the Quran and tafsir (Alromima et al., 2015; Chua & Ellyza Binti
Nohuddin, 2014; Hamoud & Atwell, 2016; Khadangi et al., 2018; YILMAZ GENÇ &
Syed, 2019).
As the Quran's division into chapters was already decided in the past, researchers want
to explore the rules behind that division. A good example is this
study (Chua & Ellyza Binti Nohuddin, 2014), whose goal was to analyze the frequent
patterns that can be found in the chapters of a Malay-translated tafsir of the Quran; the
techniques were frequent pattern mining, non-trivial patterns, and interesting relations.
The findings of the study were the processed dataset: 6 documents and 17 terms. The
term weighting is Term Frequency – Inverse Document Frequency (TF-IDF). The three
most frequent terms are “Allah”, “Muhammad”, and “wahai”. A different type of
study examined the Quranic surahs; the methodology comprised natural language processing methods,
namely word2vec and roots' accompaniment in verses. The finding was the knowledge that
the choice of the surah's title is based on rational logic, and that the surahs hold inner
coherence between the concepts, so that they form a single topic or a few topics tightly related to each other.
The analysis of a text mining algorithm on the Quran is presented by Liu, et al. This
research (Qi et al., 2017) is more challenging because it deals with the semantic
information inside the Quran. The objective was to contribute to building an algorithm
with semantic analysis and automatic identification areas, and to compare and
semantically analyze the Chinese and Arabic written-language Qurans; the
chosen approach was a semantic annotated corpus and a semantic knowledge base.
To be more specific, there is a study exploring the Malay Quran tafsir
automatically: (Hamoud & Atwell, 2016) used K-Nearest Neighbor (KNN)
as the classifier and cosine similarity as the distance; the result of the study was a
contribution to Malay Quran tafsir category classification because the experiment
performed well, although further work is needed to strengthen the algorithm in building a tidy corpus. The study
(YILMAZ GENÇ & Syed, 2019) explored making a
corpus and building a tagging algorithm for a prototype able to extract
collocations of N-gram words, consisting of 2 to 6 words, from the Arabic
Quran corpus ordered by Part-of-Speech tagging. The result showed that the proposed
system succeeded in letting users select a sequence of tags (2-6 gram) and the scope of
the corpus source. Table 2 presents the summary of the literature review on Quran-related
text mining.
… a few topics tightly related to each other.

3. (Hamoud & Atwell, 2016)
   Objective: To compile a holy Quran questions-and-answers dataset corpus created for data mining.
   Method: Various data mining techniques.
   Result: 18 instances were clustered in cluster 0, 1 instance in cluster 1, 8 instances in cluster 2, and 3 instances in cluster 3.

4. (Sabah et al., 2015)
   Objective: To provide a classification algorithm for Quran tafsir verses automatically.
   Method: K-Nearest Neighbor (KNN) classifier with cosine similarity as the distance.
   Result: A contribution to Malay Quran tafsir category classification because the experiment performed well.

5. (Alromima et al., 2015)
   Objective: To create a prototype able to extract collocations of N-gram words, consisting of 2 to 6 words, from the Arabic Quran corpus ordered by Part-of-Speech tagging.
   Method: Matching the input structured pattern of the Arabic language with the Part-of-Speech tagging of the Quran corpus.
   Result: The proposed system succeeded in letting users select a sequence of tags (2-6 gram) and the scope of the corpus source.

6. (Liu et al., 2019)
   Objective: To compare and semantically analyze the Chinese and Arabic written-language Qurans.
   Method: Semantic annotated corpus and semantic knowledge base.
   Result: Contributes to building an algorithm with semantic analysis and automatic identification areas.
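The 2- to 6-gram collocation extraction described in these studies can be sketched generically as follows (an illustrative helper, not the cited studies' implementation):

```python
def ngrams(tokens, n):
    # All contiguous n-gram sequences from a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "bismillah ar rahman ar rahim".split()
for n in range(2, 4):  # the cited prototype covers 2- to 6-grams
    print(n, ngrams(tokens, n))
```

A collocation extractor would then count how often each n-gram occurs across the corpus and keep those above a frequency threshold, optionally filtered by part-of-speech tag patterns as in the cited work.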
To sum up the literature review: in the field of text mining, the majority of
researchers studied only the Quran and the Quran translation. Those studies
contributed to the classification and clustering of both the Quran
and the translated Quran, so that one can refer to those datasets more easily, as the datasets are
organized based on similarities. If it works for the Quran and the Quran translation, it
To find the most suitable clustering method for the Indonesian Tafsir and Translation,
a review of comparative studies is presented in this section. Zait et al. did a
comparative study of clustering methods and scored each method based on the
quality of its results and its execution performance; one of the metrics can be based
on each method's ability to discover some or all hidden patterns inside the data set (Zait &
Messafta, 1997).
Rodriguez et al. did a similar study comparing five types of clustering methods, including K-means, dynamical clustering, self-organized map (SOM), and fuzzy c-means; the authors found that K-means, dynamical clustering, and SOM had high accuracy in all experiments (Rodriguez et al., 2019). Other, more specific comparative studies (Alfina et al., 2012; Qi et al., 2017; N. Singh & Singh, 2012) both compared and combined k-means and hierarchical clustering. From those studies, the two methods gave the optimum result when combined with each other. Alfina et al. performed the comparison using two testing parameters: cluster variance and the silhouette coefficient. The study shows that k-means did a good job in terms of grouping (Alfina et al., 2012).
III. RESEARCH METHODOLOGY
This research methodology section presents the overall steps in conducting the study.
3.1 Planning
Planning is to prepare the all the information regarding to the study which later on
be the theoretical base and justification for this study. The two processes included in
Literature review is the method to gather past studies and cite useful insights from them. This step also gathers the problems in the related field and brainstorms possible solutions to those problems. The information sources can be journal articles, conference papers, books, etc., as long as they are credible. Samples of the reviewed literature are Performance Evaluation of K-Means and Hierarchical Clustering in Terms of Accuracy and Running Time (N. Singh & Singh, 2012), Processing the Text of the Holy Quran: a Text Mining Study (Alhawarat et al., 2015), Pattern and Cluster Mining on Text Data (Agnihotri et al., 2014), Categorization of 'Holy Quran-Tafseer' using K-Nearest Neighbor Algorithm (Sabah et al., 2015), and The Semantic Annotation of the Quran Corpus Based on Hierarchical Network of Concepts Theory (Liu et al., 2019).
The second process is planning the methods based on the literature review, to answer the objectives of the study, which includes choosing the methods, among them:
• Frequent Pattern Growth as the association rule algorithm.
This step applies the text mining process. There are five steps in the text mining for this study: preprocessing (feature selection), TF – IDF (feature extraction), most frequent words mining, k-means clustering, and association rules mining. The overall sequence of the whole text mining implementation is presented in Figure 3. This whole process is done for the tafsir and the translation with the same steps.
3.2.1 Preprocessing (Feature Selection)
Preprocessing consists of case folding, tokenization, stemming words, and stop words elimination. It is needed to remove unwanted words that carry no significant meaning and act as noise in text mining. This step also reduces redundancy and repetition, and the steps can be revisited back and forth as needed.
• Case folding
Case folding is required to avoid the same word being treated as different terms. Although the occurrences are in different cases, the machine has to treat them as one single word.
• Tokenization
In this part, the text is split into individual words, or tokens, so that each word can be processed separately.
• Stemming words
In this part, to avoid redundant words appearing in the dataset, words that share the same root are "stemmed" into one single root form.
In text mining, stop words are treated as noise in the process, as they carry no significant meaning; an example in Indonesian is "oleh".
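A minimal sketch of these preprocessing steps in Python is shown below. It is illustrative only: the stop-word set is a tiny assumed subset, and the stemming step is omitted (in practice an Indonesian stemmer, e.g. a library such as Sastrawi, would be applied to the surviving tokens).

```python
import re

# Illustrative subset only; the real study uses a full Indonesian stop-word list.
STOPWORDS = {"oleh", "dan", "yang", "di"}

def preprocess(text):
    # Case folding: this study uppercases so that "Allah" is never lowercased.
    text = text.upper()
    # Tokenization: split the text on any non-letter characters.
    tokens = re.findall(r"[A-Z]+", text)
    # Stop-word elimination (compared case-insensitively).
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(preprocess("Segala puji bagi Allah, oleh karena itu"))
```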
3.2.2 TF – IDF (Feature Extraction)
(Qaiser & Ali, 2018); it is because, unlike the bag-of-words method, this method does not only look at the most frequent terms, so non-dominant words are not simply eliminated; TF – IDF weights a term based on how frequent it is in a document compared to how many documents in the whole collection contain it. By doing TF – IDF, the raw frequency of the most frequent words is rescaled: the term frequency (TF) is weighted by the inverse document frequency (IDF), which penalizes terms that appear in many documents. The mathematical model for TF – IDF is shown in Equation 1 (Qaiser & Ali, 2018).
w(i,j) = tf(i,j) × log( N / df(i) )        (1)

Where:
w(i,j) = the TF – IDF weight of term i in document j
tf(i,j) = the frequency of term i in document j
df(i) = the number of documents containing term i
N = total number of documents
The overall pseudocode for TF-IDF, written as a MapReduce job, is shown below.

class Mapper
    function Map((dcmntId, N), (word, o))
        Emit(word, (dcmntId, o, N))

class Reducer
    function Reduce(word, list of (dcmntId, o, N))
        n = 0
        for each (dcmntId, o, N) in list do n = n + 1
        for each (dcmntId, o, N) in list do
            Emit((word, dcmntId), o × log(N / n))
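Outside a MapReduce setting, Equation 1 can also be computed directly. The following is an illustrative single-machine sketch (the function name and toy documents are mine, not from the study); log base 10 is assumed for the "log" in Equation 1.

```python
import math

def tf_idf_weights(docs):
    """Compute w(i, j) = tf(i, j) * log10(N / df(i)) for tokenized documents."""
    N = len(docs)
    # df(i): number of documents containing term i
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for j, doc in enumerate(docs):
        for term in set(doc):
            weights[(term, j)] = doc.count(term) * math.log10(N / df[term])
    return weights

# Toy example: a term appearing in every document gets weight 0.
docs = [["allah", "firman"], ["allah", "tuhan"], ["allah", "ayat", "ayat"]]
w = tf_idf_weights(docs)
```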
In this stage, the most frequent words are extracted from both the tafsir and the translation. The most frequent words measured by TF will be represented and visualized in the form of word clouds. The frequencies measured by TF – IDF will be presented as bar plots for the tafsir and translation results. Beyond listing the most frequent words, the result will also be evaluated.
For the clustering, this study uses the Euclidean distance between terms:

‖A − B‖ = √( Σ(i=1..d) (a(i) − b(i))² )        (2)

where A and B are points in d-dimensional space such that A = [a1, a2, …, ad] and B = [b1, b2, …, bd]. After computing the distances, the clustering method is applied.
The K-Means algorithm is a partitional clustering method, meaning the clusters fully divide the dataset and each sample belongs to exactly one cluster. The first thing to do in K-Means clustering is to assign the number of clusters, k. After that, a random centroid for each of the k clusters is chosen initially. The K-Means iteration runs until the means of the training data relative to the centroids meet the stopping criterion, where each sample is assigned to the nearest centroid, i.e., the one with the smallest Euclidean distance from the sample.
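The assignment/update loop described above can be sketched with plain Python and the Euclidean distance of Equation 2. This is a bare-bones illustration with fixed initial centroids and a simple iteration cap as the stopping criterion, not the study's actual implementation (the experiments used scikit-learn's MiniBatchKMeans, as shown in the Appendix).

```python
import math

def euclidean(a, b):
    # Equation 2: ||A - B|| = sqrt(sum_i (a_i - b_i)^2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):  # simple stopping criterion: fixed iteration cap
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            clusters[j].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        centroids = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids, clusters = kmeans(points, centroids=[[0, 0], [10, 10]])
```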
In order to present the best clustering results, preliminary experiments are done. K-means clustering depends on the cluster centers. One approach to finding the optimal number of clusters k is to look for the elbow of the Sum of Squared Errors (SSE) versus cluster centers plot. So, for the k-means clustering stage, this study runs a preliminary experiment to get the best number of k. Figure 4 shows the overall steps of the preliminary experiment.
Figure 4. Flowchart of Preliminary K-Means
Originally, the Frequent Pattern (FP) Growth algorithm is used to discover association rules in relational databases of transactions. This study creates a matrix in which the frequent words are treated the same as frequent items, and each document is treated as a transaction in the original FP-Growth setting. The formal definition of association rules was presented by Agrawal et al.: let I = {i1, i2, …, im} be a set of items and D a set of transactions, where each transaction T is a set of items such that T ⊆ I. Let X, Y be sets of items such that X, Y ⊆ I. From those definitions, an association rule is an implication of the form X ⇒ Y, where X ∩ Y = ∅ (Agrawal et al., 1993).
When dealing with association rules, there are two values that need to be analyzed:
• Support: the fraction of all transactions that contain the union of the items in the rule (X ∪ Y).
• Confidence: the fraction of the transactions containing X that also contain Y, i.e., support(X ∪ Y) / support(X).
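As a sketch, treating each document's word set as a transaction (as this study does), support and confidence can be computed as follows. The toy documents and the function are illustrative, not the study's FP-Growth implementation.

```python
def support_confidence(transactions, X, Y):
    """Support and confidence of the rule X -> Y over a list of item sets."""
    n = len(transactions)
    both = sum(1 for t in transactions if X <= t and Y <= t)
    antecedent = sum(1 for t in transactions if X <= t)
    support = both / n
    confidence = both / antecedent if antecedent else 0.0
    return support, confidence

# Toy "documents as transactions": each document's frequent words are the items.
docs = [
    {"allah", "iman"}, {"allah", "kafir"}, {"allah", "kafir", "azab"},
    {"musa", "mesir"}, {"allah"},
]
s, c = support_confidence(docs, {"allah"}, {"kafir"})
```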
IV. RESULTS AND DISCUSSIONS
This section presents and discusses all the results obtained from the experiments. The datasets of the Tafsir and Translation Kemenag are required in the form of UTF-8 text files. A sample of the raw tafsir and translation datasets, from Surah Al-Fatihah, is shown below:
pembatas antara satu surah dengan surah yang lain. Jadi dia bukanlah satu ayat
dari al-Fatihah atau dari surah yang lain, yang dimulai dengan Basmalah itu.
…”
4.1.2 Feature Selection Result
When it comes to the feature selection result, the datasets no longer form meaningful sentences, as some words have been removed from them. Here are samples of the feature-selected tafsir and translation, in which each word has been tokenized. The common case folding for text mining is to lowercase everything. However, in this study, the words are converted to uppercase because the word Allah should not be written in lowercase.
There is a concern in this process. In the process of selecting the significant words, there was some confusion about whether to eliminate or keep certain words due to language ambiguity. For example, the word "satu" (one) is significant when talking about Tauhid, or monotheism, to state that there is only one God. However, that word can also mean something else in certain contexts; for example, in "salah satu dari …", the word "satu" does not say anything about monotheism. The solution, for now, is to eliminate words whose meaning is ambiguous in this way.
This stage shows the result of the TF – IDF algorithm for weighting the corpus. Table III shows the matrix properties of the term document matrices (TDM) of the tafsir and translation.
Table III. Properties of Tafsir and Translation Term Document Matrices (TDM)
The tafsir contains 488 significant terms for the TF – IDF calculation, while the translation contains 116 terms. These terms form the columns of the term document matrix, and the occurrence of each term is weighted per document. The total number of documents, in this case sentences, is 18450 for the tafsir and 6234 for the translation. The non-sparse entries of each matrix are the nonzero entries, and the sparse entries are the zero entries. The maximal document length is 14 words in the tafsir and 13 words in the translation.
(a) Tafsir TF – IDF Plot (b) Translation TF – IDF Plot
Figure 6. TF – IDF Frequencies Plot
The visualizations of the word TF – IDF values for both the tafsir and translation are presented in Figure 6. The two figures show similar curves for the TF – IDF values. There are around 4 words or terms whose values differ significantly from the others. Further discussion of those numbers is presented in the next section. Since the feature extraction deals with weighting the term frequency by TF – IDF, the most frequent words mining follows automatically from the TF – IDF result. Figures 7(a) and 7(b) show the 20 most frequent words in the tafsir and translation measured by TF – IDF, respectively. By the definition of TF – IDF, those words are the most likely to appear in each sentence of the tafsir and translation.
(a) Most Frequent words tafsir (b) Most Frequent words translation
Figure 7. The 20 Most Frequent Words Measured by TF - IDF
The mutual and distinct words among the 30 most frequent words are shown in Table IV below.
The word "Firman" has the highest frequency in the tafsir because the Mufassir always gives commentary on each ayah by reciting the Words of Allah Azza Wa Jalla, or "Firman" in Indonesian. Here is an example, a piece of the tafsir of Surah Al-Fatihah Ayah 1:
hamba-Nya agar memulai suatu perbuatan yang baik dengan menyebut basmalah,
sebagai pernyataan bahwa dia mengerjakan perbuatan itu karena Allah dan kepada-
Now, in order to know how strong the correlation between the tafsir and translation is, the Pearson correlation is calculated. The observation is on the mutual words between the tafsir and translation, to see whether the pattern is the same or not. The pattern observation measures how much the frequency of a particular word in the tafsir and its frequency in the translation tend to affect each other.
Word        Tafsir (X)   Translation (Y)
Neraka      197.6404     179.5792
Musa        207.5235     189.0001
Kebenaran   242.2329     211.3072
…

X Values: ΣX = 5672.065; Mean Mx = 333.651; SSx = Σ(X − Mx)² = 366567.528
Y Values: ΣY = 3528.75; Mean My = 207.574; SSy = Σ(Y − My)² = 127689.396
X and Y Combined: N = 17; Σ(X − Mx)(Y − My) = 114805.133
R Calculation:
r = Σ((X − Mx)(Y − My)) / √(SSx × SSy)
r = 114805.133 / √(366567.528 × 127689.396) = 0.5306

Where:
(X − Mx)² and (Y − My)²: deviations squared
(X − Mx)(Y − My): product of deviation scores
r: Pearson correlation
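The hand calculation above can be checked numerically; the sketch below plugs the reported sums into the r formula and recovers 0.5306.

```python
import math

def pearson_from_sums(sum_products, ss_x, ss_y):
    # r = sum((X - Mx)(Y - My)) / sqrt(SSx * SSy)
    return sum_products / math.sqrt(ss_x * ss_y)

# Values reported in the correlation calculation above.
r = pearson_from_sums(114805.133, 366567.528, 127689.396)
print(round(r, 4))  # 0.5306
```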
The r value of 0.5306 indicates a moderate positive correlation between the frequencies of the mutual words inside both the tafsir and the translation. This means there is a tendency that the higher the frequency of a word in the tafsir, the higher its frequency in the translation, and vice versa. So, even though there are some differences in the most frequent words between the tafsir and translation, the same words will tend to occur in both of them. This information is beneficial for ensuring that the tafsir and translation versions have the same direction. In other words, one can trust references from both the tafsir and translation of this version due to the same pattern of words.
Compared to the most frequent words measured by TF – IDF, Figure 8 shows the word clouds of the most frequent words measured by plain Term Frequency, meaning the weight of a word is its total number of appearances across the whole collection of documents. Overall, the results from TF – IDF and TF alone are similar. The word "Allah" appears to be the dominant one in both datasets.
(a) Word Cloud of Tafsir (b) Word Cloud of Translation
Figure 8. Word Clouds of the Most Frequent Words Measured by TF
These results can benefit readers, especially new Quran learners, in understanding what the Quran covers in general. In the tafsir, the word "ayat" shows a large frequency. Looking at tafsir passages containing the word "ayat", the word is so frequent because the Mufassir keeps referencing each ayah, or "ayat". Here is an example:
“Tidak sedikit ayat di dalam Al-Qur'an yang menjelaskan bahwa…
For the translation, the word "tuhan" shows a large frequency. In the translation, Allah Azza Wa Jalla, in revelation to Prophet Muhammad, refers to Himself as "Tuhan" besides using "Kami"; for example:
Dan adapun orang-orang yang berbahagia, maka (tempatnya) di dalam surga; mereka
kekal di dalamnya selama ada langit dan bumi, kecuali jika Tuhanmu menghendaki
(yang lain); sebagai karunia yang tidak ada putus-putusnya.” (Surah Hud Ayah 107-
108)
Figure 9 shows the SSE cluster center plots for the tafsir and translation, respectively. As seen in the figure, as the number of clusters k increases, the SSE decreases. The best K should be the point where the rate of decrease starts to become small, causing the elbow effect. The elbow for the tafsir is at K = 10 and for the translation at K = 8 clusters. Thus, this study focuses on comparing 8 and 10 clusters for both datasets.
(a) Tafsir (b) Translation
Figure 9. SSE Cluster Plots
Table IV shows the clustering result with K = 8 for the tafsir and translation. One should note that the cluster numbers are not ordered and the numbering does not matter in clustering. Cluster 0 of the tafsir contains the words "kitab", "anak", "sihir", "mesir", "agama", "harun", "Tuhan", "firaun", "bani israil", and "Musa". This is a good example of a cluster: referring to the Quran, the story of Prophet Musa is narrated there, about his duty to remind Firaun, accompanied by Prophet Harun, in Egypt ("Mesir"), where bad magic ("sihir") was popular at that time. An example for cluster 0 is found in Surah Thaha Ayah 30.
Table IV. Clustering Result for K = 8 (tafsir and translation terms; recovered terms include: HATI, PERINGATAN, AJARAN, HUKUM, KITAB, KIAMAT, AZAB, BERIMAN, TANDA, BENAR)
Looking at the members of cluster 2 in the tafsir, this cluster contains astronomical terms, for example "Planet", "Bulan", "Bintang", "Matahari", "Langit", and "Bumi". This cluster might be a partition about the perspective of universe creation.
“Dalam pemahaman astronomi, langit adalah seluruh ruang angkasa semesta, yang
Hal ini dikemukakan oleh Allah di dalam Surah al-Mulk/67: 5, yang artinya:
Sesungguhnya Kami telah menghiasi langit yang dekat (langit dunia) dengan bintang-
bintang, dan Kami jadikan bintang-bintang itu alat-alat pelempar syaitan, dan Kami
Jadi, langit yang berisi bintang-bintang itu memang disebut sebagai langit dunia.
Itulah langit yang kita kenal selama ini. Dan itu pula yang dipelajari oleh para ahli
astronomi selama ini, yang diduga diameternya sekitar 30 miliar tahun cahaya. Dan
Namun demikian, ternyata Allah menyebut langit yang demikian besar dan dahsyat itu
baru sebagian dari langit dunia, dan mungkin langit pertama. Maka dimanakah letak
From the tafsir example above, the Mufassir gives commentary about the word "langit" and how it relates to other creations such as "bintang", "bulan", "bumi", and "matahari". The Mufassir lists the ayahs from different surahs about "langit"; that is why the clustering result from the tafsir is good.
Another interesting cluster in the tafsir is cluster 3, which contains the names of hadith narrators and forms a good cluster as well. Cluster 4 of the tafsir is quite interesting, as it contains pairs of opposite words, for example "neraka" and "surga", "hamba" and "kafir", or "pahala" and "dosa". From this, one can conclude that in the tafsir the bad and the good are always narrated together, so they end up close to each other in one cluster. Cluster 5 also shows a good example of a cluster because of its topical closeness, about the Quran and Kitab as the laws of Muslims, which was already discussed earlier.
For the translation, the result is not as clear as for the tafsir. Some of the same words appear in several clusters, for example "Azab" and "Petunjuk", which makes it difficult to decide the main topic of each cluster. The information that can be drawn is that the distances between words in the translation are not very far from each other; that is, under the TF – IDF weighting, the terms tend to overlap across clusters. The next observation, shown in Table V, is for K = 10. For the tafsir, five clusters are similar to the previous result. For the translation, some clusters start to get clearer; for example, the words "mata", "kekal", "sungai", and "surga" fall into one cluster. But overall, the translation clustering remains less clear than the tafsir.
Table V. Clustering Result for K = 10 (tafsir and translation terms; recovered terms include: cluster 1: ISTIDRAJ, ISTERI, ISRAFIL, GEMBIRA, PERJALANAN, KAFIR, NERAKA, DUNIA; cluster 4: ISA, HUD, ESA, MUSYRIK, SEMBAH, DIUTUS, AZAB, UMAT, YATIM, NUH, SUNGAI, SURGA, ANAK, LANGIT, BUMI; cluster 9: IBRAHIM, MENYAMPAIKAN, HAMBA, PUJI, ZALIM, DISEMBAH, ESA)
To sum up, for K = 8 the tafsir clustering shows a good partition, and for K = 10 the translation clustering becomes somewhat clearer.

This section presents the interesting associations found inside the tafsir and translation. Figure 10 shows the associations of the word "Allah" from the translation. Except for the word "kafir", all the associations show positive sentiments. The support value of the ("Allah", "kafir") association is 0.004 and the confidence is 0.781, meaning that 0.4% of the whole translation documents contain both the terms "Allah" and "kafir" in one document, and 78.1% of the documents in the translation that contain "Allah" also contain "kafir". To see the meaning of this, a further reference is made by looking up the translation dataset. One sample from this association is Surah An-Nahl Ayah 106 – 107. The ayahs show how Allah narrates the bad fate of those who disbelieve.
Figure 10. Association Rules Sample of Translation
From the context of Ayah 106, the word "kafir" can also be a verb meaning "to deny".
“Barangsiapa kafir kepada Allah setelah dia beriman (dia mendapat kemurkaan
Allah), kecuali orang yang dipaksa kafir padahal hatinya tetap tenang dalam beriman
(dia tidak berdosa), tetapi orang yang melapangkan dadanya untuk kekafiran, maka
kemurkaan Allah menimpanya dan mereka akan mendapat azab yang besar. Yang
demikian itu disebabkan karena mereka lebih mencintai kehidupan di dunia daripada
akhirat, dan Allah tidak memberi petunjuk kepada kaum yang kafir.” (Quran, Surah
An-Nahl 106-107).
Figure 11. Association Rule Sample of Tafsir
For the tafsir, Figure 11 above shows an interesting association. The words "israil", "musa", and "bani" have the same support and confidence values: 0.1% of the whole documents contain their union, and the confidence of the rule is 54%. Compared with the previous clustering result, this is also related to cluster 0, which contains those words together with the word "Harun". Searching for the word "Musa" in the tafsir, one example is the tafsir of Surah Thaha Ayah 30.
“Ayat ini menerangkan bahwa Musa a.s. mengusulkan agar yang diangkat menjadi
pembantunya itu ialah Harun, saudaranya sendiri yang lebih tua dari dia, Musa
memilih Harun antara lain karena Harun itu seorang yang saleh, ucapannya fasih,
intonasi bicaranya seperti orang Mesir, karena ia banyak bergaul dengan orang-orang
From this information retrieval by association rules, one can directly see the association among those words: Prophet Musa carried out a duty from Allah to remind Firaun, and he asked Allah Azza Wa Jalla that his brother, Prophet Harun, accompany him in this duty. The story took place in Mesir, which is Egypt today.
There are several benefits of knowing the association rules. The first possible benefit is to enable Islamic scholars and Muslims to get all the connections of a certain topic that they would like to learn. For example, say one wants to learn about Prophet Musa by referring to the Indonesian Tafsir. Without knowing the association rules, he/she might focus only on the word "Musa" in the tafsir and have to read all the sentences about "Musa" in order to draw valuable information about Prophet Musa. It is a different case if he/she has already seen the association rules list for the word "Musa": given, for instance, ("Musa", "Mesir") -> "Agama" or ("Musa", "Bani Israil") -> "Agama", one can look at those words to focus the search.
The other benefit is in business, specifically online book stores or libraries. Say a user accesses an online book store or library and is interested in a book tagged with "Musa" as the keyword. The system could then suggest other books that might interest the user and build a book preference profile for the user. Because in the tafsir the word "Musa" has association rules with "Mesir" and/or "Bani Israil", the system can recommend books tagged with the words "Mesir" and/or "Bani Israil". Of course, doing this would require further processing; however, that is, in general, what association rules can do in this business area.
V. CONCLUSIONS AND RECOMMENDATIONS
5.1 Conclusions
From this study, the author can conclude several things. Firstly, clustering results for the tafsir and translation were obtained using the K-Means technique. The best partition was shown by the tafsir with 8 clusters, while the translation shows a less clear partition. Secondly, general information about the datasets was obtained from a text mining perspective. This information was collected from the most frequent words results: the 30 most frequent words in the tafsir and translation were presented, and 17 mutual words occurred in both rankings. The correlation result shows that the mutual word frequencies of the tafsir and translation have a value of 0.5306, meaning there is a tendency that the higher the frequency of a word in the tafsir, the higher its frequency in the translation, and vice versa. Thirdly, association rules were mined with the Frequent Pattern Growth algorithm, producing several rules. A sample rule in the tafsir is ("Musa", "Bani Israil") -> "Agama"; for the translation, a sample is ("Iman", "Taqwa") -> "Allah". The benefits of the association rules were explained, varying from religious learning to business applications.
5.2 Recommendations
1. To assign each cluster a certain theme based on the context of the clusters' member words.
2. To take the next step of this study, which is building the information retrieval application based on this study.
3. To do comparative study in the clustering stage to find the most robust method.
4. To evaluate the clustering performance with a prediction test, not only on training data.
REFERENCES
Agnihotri, D., Verma, K., & Tripathi, P. (2014). Pattern and cluster mining on text
data. Proceedings - 2014 4th International Conference on Communication
Systems and Network Technologies, CSNT 2014, 428–432.
[Link]
Agrawal, R., Swami, A., & Imieli’nski, T. (1993). Mining Association Rules between
Sets of Items in Large Databases. SIGMOD ’93: Proceedings of the 1993 ACM
SIGMOD International Conference on Management of Data, At Washington,
D.C., United States, January 1993. [Link]
Al Qaththan, S. M. (2000). MABAHITS FI ULUMIL QURAN. Pustaka Al Kautsar.
Alfina, T., Santosa, B., & Barakbah, A. R. (2012). Analisa Perbandingan Metode
Hierarchical Clustering, K-means dan Gabungan Keduanya dalam Cluster Data.
Teknik Its, 1(1), 521–525. [Link]
Alhawarat, M., Hegazi, M., & Hilal, A. (2015). Processing the Text of the Holy
Quran: a Text Mining Study. International Journal of Advanced Computer
Science and Applications, 6(2). [Link]
Alromima, W., Moawad, I. F., Elgohary, R., & Aref, M. (2015). Extracting N-gram
terms collocation from tagged Arabic corpus. 2014 9th International Conference
on Informatics and Systems, INFOS 2014, NLP10–NLP15.
[Link]
Atenstaedt, R., & Leder, D. (2012). Word cloud analysis of the BJGP. British Journal
of General Practice, 62(March), 2520. [Link]
Balsor, J. L., Jones, D. G., & Murphy, K. M. (2019). A primer on high-dimensional
data analysis workflows for studying visual cortex development and plasticity.
[Link]
Chua, S., & Ellyza Binti Nohuddin, P. N. (2014). Frequent pattern extraction in the
Tafseer of Al-Quran. 2014 the 5th International Conference on Information and
Communication Technology for the Muslim World, ICT4M 2014.
[Link]
Farooq, M., & Kanwal, N. (2019). Summary of Holy Quran: An Ultimate Guide
Series (Issue October 2019).
[Link]
Friedemann, V. (2015). Clustering a Customer Base Using Twitter Data. Cs, 229(1),
1–5.
Hamoud, B., & Atwell, E. (2016). Quran question and answer corpus for data mining
with WEKA. Proceedings of 2016 Conference of Basic Sciences and
Engineering Studies, SGCAC 2016, February, 211–216.
[Link]
Han, J., Cheng, H., Xin, D., & Yan, X. (2007). Frequent pattern mining: Current
status and future directions. Data Mining and Knowledge Discovery, 15(1), 55–
86. [Link]
Hotho, Andreas; Nürnberger, Andreas; Paass, G. (2005). A Brief Survey of Text
Mining. 19–62.
Jayasekara, P. K., & Abu, K. S. (2018). Text Mining of Highly Cited Publications in
Data Mining. IEEE 5th International Symposium on Emerging Trends and
Technologies in Libraries and Information Services, ETTLIS 2018, 128–130.
[Link]
Khadangi, E., Fazeli, M. M., & Shahmohammadi, A. (2018). The study on Quranic
surahs’ topic sameness using NLP techniques. 2018 8th International
Conference on Computer and Knowledge Engineering, ICCKE 2018, Iccke, 298–
302. [Link]
Kim, M. J., Ohk, K., & Moon, C. S. (2017). Trend analysis by using text mining of
journal articles regarding consumer policy. New Physics: Sae Mulli, 67(5), 555–
561. [Link]
Liu, Z., Yang, L., & Atwell, E. (2019). The Semantic Annotation of the Quran Corpus
Based on Hierarchical Network of Concepts Theory. Proceedings of the 2018
International Conference on Asian Language Processing, IALP 2018, 318–321.
[Link]
Louridas, P., & Ebert, C. (2016). Machine Learning. IEEE Software, 33(5), 110–115.
[Link]
Matsumoto, T., Sunayama, W., Hatanaka, Y., & Ogohara, K. (2017). Data Analysis
Support by Combining Data Mining and Text Mining. Proceedings - 2017 6th
IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2017,
313–318. [Link]
Nirwana, A. (2017). MADARISUT TAFSIR FI QARNIS SAHABAH.
[Link]
Portal Informasi Indonesia. (2010). Agama. [Link]
Qaiser, S., & Ali, R. (2018). Text Mining: Use of TF-IDF to Examine the Relevance
of Words to Documents. International Journal of Computer Applications,
181(1), 25–29. [Link]
Qi, J., Yu, Y., Wang, L., Liu, J., & Wang, Y. (2017). An effective and efficient
hierarchical K-means clustering algorithm. International Journal of Distributed
Sensor Networks, 13(8), 1–17. [Link]
Religious Literacy Project. (n.d.). Qur’an: The Word of God. Harvard Divinity
School. Retrieved November 7, 2020, from
[Link]
Rodriguez, M. Z., Comin, C. H., Casanova, D., Bruno, O. M., Amancio, D. R., Costa,
L. da F., & Rodrigues, F. A. (2019). Clustering algorithms: A comparative
approach. In PLoS ONE (Vol. 14, Issue 1).
[Link]
Sabah, G., Khaotijah, S., & Mahdi, F. (2015). Categorization of ‘Holy Quran-Tafseer’
using K-Nearest Neighbor Algorithm. International Journal of Computer
Applications, 129(12), 1–6. [Link]
Saleh, W. A. (2010). Preliminary Remarks on the Historiography of tafsīr in Arabic:
A History of the Book Approach. Journal of Qur Anic Studies, 12, 6–40.
[Link]
Singh, K., Shakya, H. K., & Biswas, B. (2016). Clustering of people in social network
based on textual similarity. Perspectives in Science, 8, 570–573.
[Link]
Singh, N., & Singh, D. (2012). Performance Evaluation of K-Means and Heirarichal
Clustering in Terms of Accuracy and Running Time. International Journal of
Computer Science and Information Technologies, 3(3), 4119–4121.
Verhagen, A. (2008). Syntax, Recursion, Productivity – a Usage-Based Perspective
on the Evolution of Grammar. Evidence and Counter-Evidence: Essays in
Honour of Frederik Kortlandt, Volume 2, January, 399–414.
[Link]
Waykole, R. N., & Thakare, A. D. (2018). a Review of Feature Extraction Methods
for Text Classification. International Journal of Advance Engineering and
Research Development, 5(04), 351–354.
Wongso, R., Luwinda, F. A., Trisnajaya, B. C., Rusli, O., & Rudy. (2017). News
Article Text Classification in Indonesian Language. Procedia Computer Science,
116, 137–143. [Link]
YILMAZ GENÇ, S., & Syed, H. (2019). Quranic Principles of Universal Law on the
Quranic Exegesis. Bilimname, December, 165–186.
[Link]
Yusuf Ali, A. (2001). The Holy Qur’an. Wordsworth Editions Ltd; 5th edition.
Zait, M; Messafta, H. (1997). A Comparative Study of Clustering Methods. Future
Generation Computer System, 149–159.
Zhang, L., Zhu, G., Zhang, S., Zhan, X., Wang, J., Meng, W., Fang, X., & Wang, P.
(2019). Assessment of Career Adaptability: Combining Text Mining and Item
Response Theory Method. IEEE Access, 7, 125893–125908.
[Link]
Zhou, M., Duan, N., Liu, S., & Shum, H. Y. (2020). Progress in Neural NLP:
Modeling, Learning, and Reasoning. Engineering, 6(3), 275–290.
[Link]
APPENDICES
# Preliminary K-Means experiment: SSE ("elbow") plot, clustering, and 2-D projections.
# tf_idf (a fitted TF-IDF vectorizer), dataset, labels, max_label, variableK,
# variableN, plot_tsne and getwords are defined elsewhere in the project.
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np

text = tf_idf.transform(dataset.Column1)

def optimum_clusters(data, max_k):
    iters = range(2, max_k + 1, 2)
    sse = []
    for k in iters:
        sse.append(MiniBatchKMeans(n_clusters=k,
                                   init_size=1024, batch_size=5000,
                                   random_state=20).fit(data).inertia_)
        print('Fit {} clusters'.format(k))
    f, ax = plt.subplots(1, 1)
    ax.plot(iters, sse, marker='o')
    ax.set_xlabel('Cluster Centers')
    ax.set_xticks(iters)
    ax.set_xticklabels(iters)
    ax.set_ylabel('SSE')
    ax.set_title('SSE by Cluster Center Plot')

optimum_clusters(text, 20)

clusters = MiniBatchKMeans(n_clusters=variableK).fit_predict(text)

# Sample a subset of documents for the 2-D projections.
max_items = np.random.choice(range(text.shape[0]), size=3000, replace=False)
the_pca = PCA(n_components=2).fit_transform(text[max_items, :].todense())
tsne = TSNE().fit_transform(
    PCA(n_components=50).fit_transform(text[max_items, :].todense()))
idx = np.random.choice(range(the_pca.shape[0]), size=300, replace=False)
label_subset = labels[max_items]
label_subset = [cm.hsv(i / max_label) for i in label_subset[idx]]

plot_tsne(text, clusters)
getwords(text, clusters, tf_idf.get_feature_names(), variableN)
<output/>
<macros/>
</context>
<operator activated="true" class="process"
compatibility="9.7.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel"
compatibility="9.7.000" expanded="true" height="68" name="Read
Excel" width="90" x="179" y="34">
<parameter key="excel_file" value="F:\CAPSTONE PROJECT
1\assrul\[Link]"/>
<parameter key="sheet_selection" value="sheet
number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United
States)"/>
<parameter key="read_all_values_as_polynominal"
value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0"
value="[Link]"/>
</list>
<parameter key="read_not_matching_values_as_missings"
value="false"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="nominal_to_text"
compatibility="9.7.000" expanded="true" height="82"
name="Nominal to Text" width="90" x="179" y="136">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception"
value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception"
value="false"/>
<parameter key="except_block_type"
value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes"
value="false"/>
</operator>
<operator activated="true"
class="text:process_document_from_data"
compatibility="9.3.001" expanded="true" height="82"
name="Process Documents from Data" width="90" x="179" y="238">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="Binary Term
Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="none"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement"
value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights"
value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true"
class="web:extract_html_text_content" compatibility="9.3.001"
expanded="true" height="68" name="Extract Content" width="90"
x="112" y="34">
<parameter key="extract_content" value="true"/>
<parameter key="minimum_text_block_length"
value="5"/>
<parameter key="override_content_type_information"
value="true"/>
<parameter key="neglegt_span_tags" value="true"/>
<parameter key="neglect_p_tags" value="true"/>
<parameter key="neglect_b_tags" value="true"/>
<parameter key="neglect_i_tags" value="true"/>
<parameter key="neglect_br_tags" value="true"/>
<parameter key="ignore_non_html_tags"
value="true"/>
</operator>
<operator activated="true" class="text:tokenize"
compatibility="9.3.001" expanded="true" height="68"
name="Tokenize" width="90" x="112" y="136">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true"
class="text:transform_cases" compatibility="9.3.001"
expanded="true" height="68" name="Transform Cases" width="90"
x="112" y="238">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true"
class="text:filter_by_length" compatibility="9.3.001"
expanded="true" height="68" name="Filter Tokens (by Length)"
width="90" x="313" y="85">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="25"/>
</operator>
<connect from_port="document" to_op="Extract
Content" to_port="document"/>
<connect from_op="Extract Content"
from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document"
to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases"
from_port="document" to_op="Filter Tokens (by Length)"
to_port="document"/>
<connect from_op="Filter Tokens (by Length)"
from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true"
class="numerical_to_binominal" compatibility="9.7.000"
expanded="true" height="82" name="Numerical to Binominal"
width="90" x="179" y="340">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception"
value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception"
value="false"/>
<parameter key="except_block_type"
value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes"
value="false"/>
<parameter key="min" value="0.0"/>
<parameter key="max" value="0.0"/>
</operator>
<operator activated="true" class="concurrency:fp_growth"
compatibility="9.7.000" expanded="true" height="82" name="FP-
Growth" width="90" x="179" y="442">
<parameter key="input_format" value="items in dummy
coded columns"/>
<parameter key="item_separators" value="|"/>
<parameter key="use_quotes" value="false"/>
<parameter key="quotes_character" value="&amp;quot;"/>
<parameter key="escape_character" value="\"/>
<parameter key="trim_item_names" value="true"/>
<parameter key="min_requirement" value="support"/>
<parameter key="min_support" value="0.001"/>
<parameter key="min_frequency" value="100"/>
<parameter key="min_items_per_itemset" value="1"/>
<parameter key="max_items_per_itemset" value="0"/>
<parameter key="max_number_of_itemsets"
value="1000000"/>
<parameter key="find_min_number_of_itemsets"
value="true"/>
<parameter key="min_number_of_itemsets" value="100"/>
<parameter key="max_number_of_retries" value="15"/>
<parameter key="requirement_decrease_factor"
value="0.9"/>
<enumeration key="must_contain_list"/>
</operator>
<operator activated="true"
class="create_association_rules" compatibility="9.7.000"
expanded="true" height="82" name="Create Association Rules"
width="90" x="380" y="442">
<parameter key="criterion" value="confidence"/>
<parameter key="min_confidence" value="0.5"/>
<parameter key="min_criterion_value" value="0.8"/>
<parameter key="gain_theta" value="2.0"/>
<parameter key="laplace_k" value="1.0"/>
</operator>
<connect from_op="Read Excel" from_port="output"
to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example
set output" to_op="Process Documents from Data"
to_port="example set"/>
<connect from_op="Process Documents from Data"
from_port="example set" to_op="Numerical to Binominal"
to_port="example set input"/>
<connect from_op="Numerical to Binominal"
from_port="example set output" to_op="FP-Growth"
to_port="example set"/>
<connect from_op="FP-Growth" from_port="frequent sets"
to_op="Create Association Rules" to_port="item sets"/>
<connect from_op="Create Association Rules"
from_port="rules" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
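The process above reads the tafsir spreadsheet, tokenizes and lowercases the text, converts term counts to binominal (present/absent) attributes, and then mines frequent itemsets with FP-Growth before generating association rules at a minimum confidence of 0.5. The same support/confidence logic can be sketched in plain Python (the documents and terms below are hypothetical toy data, and brute-force enumeration stands in for FP-Growth, which reaches the same itemsets more efficiently):

```python
from itertools import combinations

# Toy binary term-occurrence data (hypothetical): one set of terms per
# document. The actual input in the process above is the Excel file of
# tafsir text, converted to binominal term columns.
docs = [
    {"musa", "bani", "israil", "agama"},
    {"musa", "bani", "israil", "agama"},
    {"musa", "agama"},
    {"bani", "israil"},
]

def support(itemset, transactions):
    """Fraction of documents that contain every term in the itemset."""
    return sum(itemset <= d for d in transactions) / len(transactions)

# Frequent itemsets by brute force; FP-Growth finds the same sets
# without enumerating every candidate subset.
min_support = 0.5
terms = sorted(set().union(*docs))
frequent = [frozenset(c)
            for n in range(1, len(terms) + 1)
            for c in combinations(terms, n)
            if support(set(c), docs) >= min_support]

# Rules A -> B with confidence = support(A and B) / support(A),
# mirroring the min_confidence = 0.5 setting of Create Association Rules.
min_confidence = 0.5
rules = []
for items in frequent:
    for n in range(1, len(items)):
        for antecedent in combinations(items, n):
            a = frozenset(antecedent)
            conf = support(items, docs) / support(a, docs)
            if conf >= min_confidence:
                rules.append((a, items - a, conf))
```

On this toy data the rule ("musa", "bani", "israil") -> "agama" holds with confidence 1.0, the same shape of rule the study reports from the tafsir and translation.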
# PLOT TF-IDF
library(ggplot2)
freq.terms <- tail(sort(frequency), n = 25)
freq.df <- as.data.frame(freq.terms)
freq.df$names <- rownames(freq.df)
ggplot(freq.df, aes(reorder(names, freq.terms), freq.terms)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Terms") + ylab("Frequency") +
  ggtitle("Term frequencies")
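The `frequency` vector plotted above comes from the study's document-term matrix. As a minimal sketch of the TF-IDF weighting behind such a vector (the corpus and term names here are hypothetical toy data, not the thesis dataset):

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the tafsir documents.
docs = [
    "musa bani israil agama",
    "musa agama",
    "bani israil",
]
tokenized = [d.split() for d in docs]

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))
n_docs = len(tokenized)

def tf_idf(doc):
    """TF-IDF per term: raw term count * log(N / document frequency)."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

scores = tf_idf(tokenized[0])
```

Sorting such per-term scores and taking the tail gives exactly the kind of top-25 vector the R snippet plots.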
APPENDIX 4 DESIGN OF ASSOCIATION RULES