Text Mining Quran Tafsir in Indonesia

This document is Edi Irawan's undergraduate final year project report on applying text mining techniques to Indonesian tafsir (commentary) and translations of the Quran. It includes sections on background, objectives, methodology, results, and conclusions. The methodology involves preprocessing text data, identifying frequent words, clustering tafsir texts, and mining association rules between tafsir and translations. Key results include clustering the tafsir into 8 groups, identifying 17 common frequent words between tafsir and translations, finding a moderate correlation between frequency distributions, and association rules linking terms like "Musa" and "Bani Israil" to the concept of "religion".

Uploaded by

Edi Irawan

Text Mining on Indonesian Tafsir and Translation of

The Holy Quran

UNDERGRADUATE FINAL YEAR PROJECT

EDI IRAWAN

2017390001

COMPUTER SCIENCE STUDY PROGRAM

FACULTY OF ENGINEERING AND TECHNOLOGY

SAMPOERNA UNIVERSITY

2020
LEGITIMATION LETTER

FINAL YEAR PROJECT REPORT


TEXT MINING ON INDONESIAN TAFSIR AND TRANSLATION OF THE
HOLY QURAN
Proposed as the partial fulfilment of the requirements for Sarjana Komputer
Prepared by

Name : Edi Irawan


SIN : 2017390001
Campus : Sampoerna University

Checked and approved on Tuesday, 14 July 2020

Faculty

Final Year Project Supervisor,

Prof. Ir. Media Anugerah Ayu, MSc., PhD


Date: 14 July 2020

EXCLUSIVE RIGHT STATEMENT

I, Edi Irawan (2017390001), hereby grant to our school, Sampoerna University, the non-exclusive right to archive, reproduce, and distribute my Final Year Project entitled “TEXT MINING ON INDONESIAN TAFSIR AND TRANSLATION OF THE HOLY QURAN” in whole or in part, whether in printed or electronic formats.

I acknowledge that I retain the exclusive right to use all or parts of my Final Year Project in future work or output, such as articles, books, software, and information systems.

Jakarta, 14 July 2020

Edi Irawan
SIN: 2017390001


DECLARATION OF ORIGINALITY

I, Edi Irawan (2017390001), hereby acknowledge that this submission entitled “TEXT MINING ON INDONESIAN TAFSIR AND TRANSLATION OF THE HOLY QURAN” is my own work with the guidance of my Supervisor.

I also truly acknowledge that it contains no materials previously published or written by another person, or any other educational institution, except where due acknowledgement is made in the Final Year Project. Any contribution made to the research by others is explicitly acknowledged in the Final Year Project.

Jakarta, 14 July 2020

Edi Irawan

SIN: 2017390001

ACKNOWLEDGEMENT

I would like to thank Allah SWT for always giving His blessings, one of which is the completion of my thesis. I also thank my research supervisor, Prof. Ir. Media Anugerah Ayu, MSc., PhD. Without her advice at every step of the thesis, it would not have been possible to complete this research. This thesis also draws on the knowledge given by all the lecturers who have taught me at Sampoerna University; I am very grateful for their dedication in teaching me all the courses.

ABSTRACT

The Indonesian Quran Tafsir and Translation are an important source of knowledge for Indonesian Muslims. The tafsir in particular is full of commentary and explanation of each surah and/or ayah, which makes it a large document. The challenge is how to refer to both tafsir and translation quickly and accurately, since a reader constantly needs to go back and forth between them. Hence, this study applies several text mining techniques to the Kemenag version of the Indonesian Quran Tafsir and Translation. The results show that the tafsir clusters well into eight groups. From the TF and TF-IDF results, 17 of the 30 most frequent words are the same. The correlation between the tafsir and translation frequencies is 0.5306, which is a moderate correlation. Finally, the FP-Growth algorithm succeeds in drawing several association rules from the tafsir and translation, for example (“Musa”, “Bani Israil”) -> “Agama”. Future work in this field is to perform semantic text mining and to build a system for real user applications.

Keywords: Text mining, text clustering, most frequent words mining, association rule.

Table of Contents

LEGITIMATION LETTER
EXCLUSIVE RIGHT STATEMENT
DECLARATION OF ORIGINALITY
ACKNOWLEDGEMENT
ABSTRACT
I. INTRODUCTION
1.1 Background of Study
1.2 Problem Statement of the Study
1.3 Objectives of the Study
1.4 Scopes of the Study
1.5 Significance of the Study
1.6 Limitation of the Study
1.7 Organization of Report
II. LITERATURE REVIEW
2.1 Quran
2.2 Quran Tafsir and Translation
2.3 Text Mining Techniques
2.3.1 Definitions
2.3.2 Related Works
2.4 Related Works on Quran Text Mining
2.5 Review on Comparative Study of Clustering Methods
III. RESEARCH METHODOLOGY
3.1 Planning
3.1.1 Literature Review
3.1.2 Methodology Selection
3.2 Text Mining Implementation
3.2.1 Preprocessing (Feature Selection)
3.2.2 Term Frequency – Inverse Document Frequency, TF-IDF (Feature Extraction)
3.2.3 Most Frequent Words Mining
3.2.4 K-Means Clustering
3.2.5 Association Rules Mining
IV. RESULTS AND DISCUSSIONS
4.1 Feature Selection and Extraction Result
4.1.1 Datasets Gathering Result
4.1.2 Feature Selection Result
4.1.3 Feature Extraction Result
4.2 Most Frequent Words Mining Result
4.3 Clustering Result
4.3.1 Preliminary Experiment Result
4.3.2 Final Experiment Result
4.4 Association Rules Result
V. CONCLUSIONS AND RECOMMENDATIONS
5.1 Conclusions
5.2 Recommendations
REFERENCES
APPENDICES
APPENDIX 1 K-MEANS ALGORITHM CODE
APPENDIX 2 FP GROWTH ALGORITHM CODE
APPENDIX 3 MOST FREQUENT TERMS ALGORITHM CODE
APPENDIX 4 DESIGN OF ASSOCIATION RULES

LIST OF FIGURES

Figure 1. The Text Mining Research Type
Figure 2. Research Framework
Figure 3. Text Mining Process
Figure 4. Flowchart of Preliminary K-Means
Figure 5. The Sample of Selected Features
Figure 6. TF-IDF Frequencies Plot
Figure 7. The 20 Most Frequent Words Measured by TF-IDF
Figure 8. Word Cloud Measured by TF
Figure 9. SSE Cluster Plots
Figure 10. Association Rules Sample of Translation
Figure 11. Association Rule Sample of Tafsir

LIST OF TABLES

Table I. Summary of Literature Reviews on Text Mining
Table II. Summary of Literature Reviews on Quran Text Mining
Table III. Properties of Tafsir and Translation Term Document Matrices (TDM)
Table IV. Mutual and Distinct Words of Tafsir and Translation
Table V. Mutual Frequency Words
Table VI. Clustering Results, K = 8
Table VII. Clustering Result, K = 10

I. INTRODUCTION

1.1 Background of Study

The Holy Quran is the most valuable book for Muslims, who believe that it contains the words of God (Yusuf Ali, 2001). The Quran contains fundamental categories of knowledge that all Muslims have to understand and recite (Yilmaz Genç & Syed, 2019). The Quran was originally delivered in Arabic. So that its meaning can be known by Muslims all around the world, the Quran has been translated into a number of languages, one of which is Indonesian. The Indonesian translation of the Quran is important because Indonesia is one of the Muslim-majority countries. According to Portal Informasi Indonesia, the number of Muslims in Indonesia is around 207 million people, or 87% of the total population (Portal Informasi Indonesia, 2010).

A translation of the Arabic Quran alone is not enough for non-native Arabic speakers to understand the exact meaning of the Quran. That is why honorable and knowledgeable scholars, called mufassir, write commentary books on the Quran, called tafsir. The explanation and commentary in a tafsir must not be based on individual opinion. The content of the Quran must stay the same, so all commentary must refer to the explanation of Prophet Muhammad (Al Qaththan, 2000).

Because the tafsir and translation of the Quran involve long sentences and many words, extracting valuable information from them is a challenge. In this technological era, people tend to abandon conventional practices such as consulting a thick book page by page. Information retrieval algorithms and text mining in Natural Language Processing (NLP) enable people to mine valuable information from large text documents faster and automatically, and may be an answer to the challenge of referencing tafsir and translation. There are Quran-related NLP studies, for example by Khadangi et al. (2018), Sabah et al. (2015), and Chua & Ellyza Binti Nohuddin (2014).

Nowadays, Natural Language Processing is widely used to automate tasks related to translation or interpretation. Within NLP there is a branch called text mining (Zhou et al., 2020). Text mining is considered a branch of NLP because it uses some fundamental NLP methods, but with different goals. Unlike NLP in general, which cares about the semantic information in a text, text mining also includes methods that treat the text as a ‘bag of words’, meaning that the semantic information is not explored. The main goal of text mining is to analyze large text datasets, both unstructured and structured, so that one does not have to read the whole text (Zhou et al., 2020). Text mining is therefore becoming a valuable research area, as Artificial Intelligence has advanced to the level where the extraction of information from textual data needs to be automated. The result of text mining is information derived from the analysis of terms and words (Zhou et al., 2020).
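The ‘bag of words’ idea mentioned above can be illustrated with a minimal sketch. The tokenizer and the two sample sentences are assumptions made for illustration only; they are not taken from the datasets of this study.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Two hypothetical sentences with the same words in a different order.
bag_a = bag_of_words("Musa memimpin Bani Israil")
bag_b = bag_of_words("Bani Israil memimpin Musa")

# Word order is discarded, so both sentences map to the same bag.
print(bag_a == bag_b)
```

Because only the counts survive, the two sentences become indistinguishable, which is exactly the loss of semantic information described in the paragraph above.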

Today, there are three types of research areas in text mining: preprocessing algorithms; comparative studies of machine learning methods for classification and clustering, together with comparisons of feature extraction algorithms; and studies of what mining reveals about a particular text dataset (Chua & Ellyza Binti Nohuddin, 2014). Of those types, most text mining research concerns preprocessing. Text mining still needs further development, because previous methodologies have weaknesses, for example in the preprocessing stage (Alhawarat et al., 2015). As text mining deals with numerous words, feature extraction and selection need to be improved in order to obtain a truly “tidy” training dataset (Waykole & Thakare, 2018).

This study explores the Indonesian Quran Tafsir and Translation using text mining, an NLP sub-field, to group topics based on the similarity of their words, and uses frequent words mining as well. Further mining looks for association rules inside the tafsir and translation. Because the Indonesian Quran Tafsir and Translation are available, it is possible to obtain the datasets and build a methodology to analyze and understand the textual data in both of them. The idea of text mining, or data mining in general, is to get the most valuable information out of large datasets, whether unstructured or structured (Jayasekara & Abu, 2018). In other words, text mining should be suited to extracting the most valuable knowledge from the tafsir and translation automatically.

Many researchers have presented text mining work using different types of methods. In machine learning, there are two types of learning, classification and clustering (Louridas & Ebert, 2016). In text classification, the text dataset is classified by supervised learning, in which the target labels are decided in advance (Louridas & Ebert, 2016). In text clustering, the text datasets are grouped based on the similarity of their words (Louridas & Ebert, 2016). Text mining studies in general explore several directions, such as how to design the preprocessing so as to get a really “clean” dataset. Another example is research on how to apply text mining effectively to a more specific dataset, with a focus on what the mining reveals about that dataset. Furthermore, scholars also have a high interest in combining or comparing two or more clustering and classification methods in text mining.

1.2 Problem Statement of the Study

Tafsir in general contains the explanation and interpretation of each verse of the Quran. The Quran itself contains 114 chapters (Surah in Arabic), 30 parts (Juz), 60 groups (Hizb), and 6236 verses (Ayah). As a result, a tafsir has many words and pages, and can be considered a large dataset. Indonesian Muslims, specifically, rarely read the whole text of the tafsir. To support that statement, the 2018 SUSENAS survey by Badan Pusat Statistik (BPS) found that 53.57% of all Muslims in Indonesia cannot read the Arabic Quran. That number is very high; it suggests how rarely Muslims refer to the Quran directly, and how urgent it is to educate Muslims through reading the tafsir or translation.

Anyone who reads all the chapters of the Quran finds many similar topics and narratives that are separated from each other rather than placed in the same chapter (Farooq & Kanwal, 2019). Thus, if one wants to look up a problem in the Quran, there are many possible answers spread across different chapters and ayah. The reason is that the chapters of the Quran are mostly not organized by similar topics or narratives, but by the timeline in which the Quran was delivered to the Messenger of God, Muhammad (Farooq & Kanwal, 2019). In conclusion, in this modern era there should be an approach that helps in categorizing and structuring the tafsir and translation. The other issue is how to enable readers to optimize information retrieval from both of them.

When it comes to the translation and tafsir of the Quran, one is dealing with the interpretation of the Arabic language. The problems of natural language are the difficulty of carrying exactly the same information and meaning from one language to another, and of understanding syntax or grammar; these have become research topics, for example in Verhagen (2008). This is also why no translation or tafsir of the Quran can be claimed to be an exact representation of the Arabic Quran (Religious Literacy Project, n.d.). This applies to the Indonesian Tafsir and Translation of the Quran as well. Hence, the information exchanged between the two needs to be explored and compared to prevent a misleading interpretation of the Quran.

1.3 Objectives of the Study

The objectives of this study consist of:

• To group the Indonesian Quran Tafsir and Translation topics based on their similarities.

• To get valuable information inside the tafsir and translation from a text mining point of view.

• To present association rules mining inside the tafsir and translation.

1.4 Scopes of the Study

The scopes of this study include:

1. The dataset is the version published by Kementerian Agama (Kemenag). The selection of this version is not biased; the only reason is that the text format of this version is available and easy to obtain.

2. This study uses all 114 surah.

3. The range of cluster numbers in the preliminary study is set to 2 – 12 clusters. This is because the known number of topics is around six themes, based on (Religious Literacy Project, n.d.), so the range of clusters needs to be around that number.

4. The clustering is done only for grouping purposes, not to measure the performance of the clustering training.

5. The association rules algorithm used is the Frequent Pattern (FP) Growth algorithm.
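To make scope item 3 more concrete, the sketch below runs a plain k-means (Lloyd's algorithm) for several values of k and reports the sum of squared errors (SSE) for each. It is a toy illustration under stated assumptions: the two-dimensional points, the range of k, the seed, and the helper names `sq_dist`, `mean`, and `kmeans_sse` are all hypothetical, and this is not the code from Appendix 1.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def kmeans_sse(points, k, iterations=50, seed=0):
    """Run plain Lloyd's k-means and return the final sum of squared errors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # use k of the input points as starting centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

Plotting SSE against k and looking for the point where the curve flattens is one common way to choose the number of clusters within a preset range such as 2 – 12.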

1.5 Significance of the Study

The significance of the Study includes:

1. For the Quran-related text mining study area, this study is expected to add to the variety of studies on grouping Indonesian Quran Tafsir and Translation texts.

2. For the association rules mining study area, it gives insight into the use of the Frequent Pattern (FP) Growth algorithm for finding association rules in text document datasets.

3. For Islamic scholars and new learners of Islam, this study can offer a new way of understanding the Quran more easily from a text mining perspective.

4. For Indonesian Quran application developers, this study can be a reference for grouping the topics of the text for information retrieval purposes.

5. For Indonesian Quran application developers, the association rules found in the Indonesian Tafsir and Translation are expected to be one of the methods for optimizing information retrieval.

1.6 Limitation of the Study

This study has potential limitations. It uses a non-semantic text mining approach, that is, a statistical study only. Thus, the results of the study cannot recognize figures of speech if any exist in the datasets.

1.7 Organization of Report

This research paper is presented in the following sections:

1. Introduction. This section includes the Background, Problem Statement of the Study, Objectives of the Study, Scopes of the Study, Significance of the Study, and Limitation of the Study.

2. Literature Review. This section includes a text mining sub-chapter, which covers related works on text mining in general. The second sub-chapter presents related works on text mining specifically of the Tafsir and Quran.

3. Research Methodology. This section includes the research workflow and an explanation of each stage of the methodology.

4. Results and Discussions. This section includes all the results shown by the implementation of the experiment as well as the discussion of those results.

5. Conclusions and Recommendations. This section gives the overall conclusion of the study and recommendations for future work.

II. LITERATURE REVIEW

The literature review chapter presents the references for the knowledge and past works related to this study. The organization of this chapter is as follows:

• Section 2.1 is about Quran literature review.

• Section 2.2 is about Quran Tafsir and Translation literature review.

• Section 2.3 is about text mining-related literature review.

• Section 2.4 is about the related works on text mining of Quran.

• Section 2.5 is about comparative studies of clustering methods.

2.1 Quran

All Muslims must believe that the Quran is the sacred book of Islam, which contains the word of God. The Prophet Muhammad is the one who carried all those words, received through one of the angels, Jibril, and had the duty to deliver the Quran to all human beings (Yusuf Ali, 2001). Muslims have the duty to obey the laws in the Quran.

2.2 Quran Tafsir and Translation

The original Quran was delivered to Muhammad in Arabic, and since Islam has been spreading around the world, translations and interpretations (tafsir) have been made in other languages. Even so, all Muslims must believe that the Arabic Quran and a translated one are not the same. No translation or interpretation can claim to present the Quran; only the original Arabic can (Religious Literacy Project, n.d.). Thus, Muslims need to always recite the Arabic Quran and never replace the Quran with any translation or interpretation.

Unlike a Quran translation, which only renders the Arabic text in another language, a tafsir also contains the interpretation of each Quran verse; tafsirs are written by highly knowledgeable Islamic teachers (Nirwana, 2017). There are four methodologies for creating a tafsir. The first method is called “Tahlili”, or analytical, which interprets each verse of each surah, following the surah order. The second method is called “Ijmali”, or global; unlike the first method, this tafsir interprets the surah as a whole, using simpler and fewer words in order to make the tafsir easy to understand. The third method is “Muqaran”, or comparison, which compares the Quran with Hadith or with other credible sources. The last is “Maudhui”, or thematic, which groups all verses with a similar theme and analyzes the reason why the verses were delivered (Saleh, 2010).

2.3 Text Mining Techniques

2.3.1 Definitions

Text mining, also known as knowledge discovery from text (KDT), is a branch of Artificial Intelligence (AI) that uses three types of concepts and techniques: information retrieval, information extraction, and natural language processing. Those concepts and techniques are combined and connected with the help of algorithms and methods from knowledge discovery in databases (KDD), data mining, machine learning, or statistics (Hotho et al., 2005).

There are three main kinds of frequent patterns: itemsets, subsequences, and substructures. The idea of frequent pattern mining is to find those patterns that appear in a dataset with a frequency no less than a specified threshold. The patterns then become association rules, each with a certain score. By knowing the association rules, useful information can be obtained from the dataset (Han et al., 2007).
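A small sketch may help to show what a frequency threshold and a rule score mean in practice. It assumes the commonly used support and confidence measures, which this section does not name explicitly, and it enumerates itemsets by brute force rather than with FP-Growth. The transactions and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions: each set holds the terms found in one verse.
transactions = [
    {"musa", "bani israil", "agama"},
    {"musa", "bani israil", "agama"},
    {"musa", "firaun"},
    {"bani israil", "agama"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)

frequent = frequent_itemsets(transactions, min_support=0.5)
rule_conf = confidence(
    frozenset({"musa", "bani israil"}), frozenset({"agama"}), transactions
)
print(len(frequent), rule_conf)
```

Brute-force enumeration grows exponentially with the number of distinct items, which is why algorithms such as FP-Growth are used on real datasets; the sketch is only meant to show the quantities involved.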

2.3.2 Related Works

Many areas of text mining still need improvement and further research. There are three types of research areas in text mining: preprocessing algorithms; comparative studies of machine learning methods for classification and clustering, together with comparisons of feature extraction algorithms; and studies of what mining reveals about a particular text dataset. Currently, studies of text mining in general are mostly about the preprocessing stage (Jayasekara & Abu, 2018). Preprocessing needs to be improved because it is a crucial stage that can affect the result significantly.

Preprocessing includes tokenization, normalization, and substitution. Besides preprocessing, the selection of methods is also one of the trending research areas. Researchers usually compare two or more common text mining methods, whether for clustering or classification (Jayasekara & Abu, 2018). Another research area is to apply text mining to a specific dataset, with the focus on that dataset. Moreover, there are also studies that compare two or more distance measures for determining similarity when doing clustering or classification. Figure 1 summarizes the common research types in text mining.

Figure 1. The Text Mining Research Type.
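The three preprocessing steps named in this section (tokenization, normalization, and substitution) can be sketched as follows. The substitution table and the sample sentence are assumptions for illustration only, and the function name `preprocess` is hypothetical; the actual preprocessing of this study is described in the methodology chapter.

```python
import re

# Hypothetical substitution table mapping abbreviations to full words.
SUBSTITUTIONS = {"yg": "yang", "dgn": "dengan"}

def preprocess(text):
    """Normalize, tokenize, and substitute tokens in a piece of text."""
    # Normalization: lowercase and replace non-letter characters with spaces.
    normalized = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenization: split on whitespace.
    tokens = normalized.split()
    # Substitution: replace known abbreviations with their full forms.
    return [SUBSTITUTIONS.get(token, token) for token in tokens]

tokens = preprocess("Musa dan Bani Israil, yg beriman dgn sabar.")
print(tokens)
```

Keeping each step on its own line makes it easy to swap in a different rule, for example a richer substitution table, without touching the other steps.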

The first examples of preprocessing research are the studies by Agnihotri et al. (2014), Alhawarat et al. (2015), and Matsumoto et al. (2017). Agnihotri et al. (2014) present frequent pattern mining from textual data. The proposed method uses cosine similarity to determine the distance between terms. The objective of the study is to compare the clustering performance of two methods, k-means and hierarchical clustering. This type of text mining study does not go deep into the analysis of the textual data, but explores the performance of the two algorithms. The datasets used are stories, news, and e-mails. Similar to Agnihotri et al. (2014), the research in Alhawarat et al. (2015) presents the most frequent words in a dataset, the Quran. The study highlights that the work is worthwhile because Arabic script differs from the Latin alphabet, and natural language processing for non-English, non-Latin-script languages is highly challenging. Hence, different methods are needed, mostly in the preprocessing stage, in order to treat Arabic words well.
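Cosine similarity, mentioned above as a way of deciding how close two pieces of text are, can be sketched with plain term counts. The three toy documents and the function name are assumptions for illustration; the cited studies may compute the vectors differently.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors stored as Counters."""
    shared_terms = a.keys() & b.keys()
    dot = sum(a[term] * b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents represented as bags of words.
doc1 = Counter("musa bani israil agama".split())
doc2 = Counter("musa bani israil firaun".split())
doc3 = Counter("shalat zakat puasa".split())

sim_12 = cosine_similarity(doc1, doc2)  # three shared terms out of four
sim_13 = cosine_similarity(doc1, doc3)  # no shared terms
print(sim_12, sim_13)
```

A value near 1 means the two documents use the same terms in similar proportions, while 0 means they share no terms; a distance for clustering is often taken as one minus this value.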

A different type of text mining study is presented by Matsumoto et al. (2017), whose objective was to handle different types of data, numerical and textual. In other words, the study explored the dataset and how the researchers treated both kinds of data in order to find out which form, numerical or textual, gives the best and most effective result. The study contributes to the analysis of questionnaire or review data that contain not only textual but also numerical data. This kind of research is unconventional compared with other text mining studies. The reason is that Matsumoto et al. (2017) look at another side of text mining, namely creating a framework for a basic text mining environment, called the Total Environment for Text Data Mining (TETDM), with the help of the tool R.

Another type of text mining study is the category classification of a certain dataset; for example, (Jayasekara & Abu, 2018) used highly cited papers in text mining as the dataset. The objective is to classify the main topics and subtopics in data and text mining studies using text mining methods. A study that focuses on the dataset as well as on a comparison of classification methods is (Wongso et al., 2017). This study explores online Indonesian news articles as the dataset. The goal is to classify the topics automatically using a combination of a classification method and a feature extraction method. The options for feature extraction were Term Frequency – Inverse Document Frequency (TF-IDF) and Singular Value Decomposition (SVD). For the classification method (supervised machine learning), the authors used Multinomial Naïve Bayes, Multivariate Bernoulli Naïve Bayes, and Support Vector Machine (SVM). The highest accuracy was given by the combination of TF-IDF and Multinomial Naïve Bayes.

The last type of research area in text mining is the study of the dataset itself. Dataset exploration is usually done for business or market analysis, political issue analysis, trend analysis, and more. For business or market analysis, researchers typically compare two or more famous brands of the same product. The dataset for this kind of research can be obtained from popular microblogs, which usually provide an Application Programming Interface (API) so that a large dataset is easier to collect with some coding. An example of market analysis is (Friedemann, 2015), which used one of the world's largest microblogs, Twitter, as the dataset source. The study clustered the customers of a famous brand, Nike. The clustering method was K-Means, and the author used Principal Component Analysis (PCA) for feature extraction. The author collected information about Nike’s Twitter followers and clustered them to learn the characteristics of the customers. The experiment showed that the clusters gave the expected performance and that market analysis can be done well using Twitter as the dataset (Friedemann, 2015). Another type of dataset exploration is trend analysis, as in (Kim et al., 2017). The objective was to identify research topics regarding changes in the direction of consumer policy using an unsupervised machine learning (clustering) method, K-Means. The study gave the expected result: the keywords of papers on consumer policy studies carried an understandable meaning (Kim et al., 2017).

Text mining can also make a valuable contribution in the medical area, where there are numerous documents. These documents can be so large that people cannot read them manually, yet they are expected to contain valuable information for medical research, for example the history of patients who had cancer or other diseases. By understanding the information inside these large unstructured documents, one may be able to help find cures for certain fatal diseases, and text mining can be really useful for that. This study (Waykole & Thakare, 2018) explored clinical text about cancer by analyzing clinical literature. The authors used Logistic Regression and Random Forest as the classifiers. For feature selection, they compared Bag of Words, Term Frequency – Inverse Document Frequency (TF-IDF), and Word2Vec; the best result was shown by Word2Vec with Random Forest as the classifier (Waykole & Thakare, 2018).

There are also studies that present surveys and systematic literature reviews in text mining. This type of study helps the text miner to be better prepared for the research. An example is (Jayasekara & Abu, 2018); the objective of the study was to give insight for Indonesian Translation Quran (ITQ) text mining research by reviewing some concepts from the literature. The most prominent research areas related to data mining are GIS, information and bibliometric analysis (Jayasekara & Abu, 2018). Another kind of text mining research combines a text mining method with a conventional method, as presented by (Zhang et al., 2019). The objective of the study was to combine text mining and item response theory (IRT) in assessing career adaptability. The methodology used college students' self-reported career adaptability as a subjective measure and responses to questionnaire items as an objective measure under a Bayesian framework. In terms of accuracy, text-IRT gave unique advantages, the best predictive effect and higher reliability for 300 subjects; text classification did so for 600 subjects; and text-IRT did so again for 900 subjects.

A summary of all the text mining literature reviewed in this study is presented in Table I.

Table I. Summary of Literature Reviews on Text Mining

| No | Researcher | Objective | Proposed Method | Findings |
|---|---|---|---|---|
| 1. | (Agnihotri et al., 2014) | To compare two different clustering methods (hierarchical and k-means clustering). | Frequent Pattern Mining using Cosine similarity as distance. Hierarchical and k-means clustering. | The illustration of k-means and hierarchical clustering and small example of text mining. Both clustering methods show similar result. |
| 2. | (Alhawarat et al., 2015) | To provide valuable framework and dataset in Arabic Natural Language Processing. | Finding most frequent words assessed from different groups. | The most frequent words are different from each group matrices. |
| 3. | (Matsumoto et al., 2017) | To create a framework in which can be a method for text mining both numerical and text data. | Total Environment for Text Data Mining (TETDM), is used as a basic environment for constructing the proposed framework. | The proposed system was effectively used to data analysis for review texts. |
| 4. | (Jayasekara & Abu, 2018) | To give insight for Indonesian Translation Quran (ITQ) text mining research by reviewing some concepts from literature. | Using general text mining technique (similarity based). | The most standout research areas that relate with data mining are GIS, information and bibliometric analysis. |
| 5. | (Zhang et al., 2019) | To hybrid text mining and item response theory (IRT) in assessing career adaptability. | Combination of Text Mining and Item Response Theory (IRT). College students' self-reported career adaptability as a subjective measure and responses to questionnaire items as an objective measure under Bayesian framework. | In terms of accuracy, text-IRT gives unique advantages, best predictive effect and higher reliability (for 300 subjects). In terms of accuracy, text classification gives unique advantages, best predictive effect and higher reliability (for 600 subjects). In terms of accuracy, text-IRT gives unique advantages, best predictive effect and higher reliability (for 900 subjects). |
| 6. | (K. Singh et al., 2016) | To compare the two performance of clustering methods in text mining. | Simple k-means and spectral k-means. | Both algorithms give the similar results. In this experiment, there were 40 nodes with the number of clusters was two. The value of k in the simple k-means was two. |
| 7. | (Friedemann, 2015) | To know the customers of a company clusters based on the publicly available dataset which is Nike’s Twitter Followers. | K-Means and Principal Component Analysis with the dataset from Twitter. | The clusters show expected result which is well defined. This experiment also tells that Twitter can be one of the datasets for market research. |
| 8. | (Kim et al., 2017) | The identification in a research topic regarding consumer direction policy change. | K-Means Clustering | This study gave the expected result which was there is an understandable meaning from keywords of papers on consumer policy studies. |
| 9. | (Wongso et al., 2017) | To create a suitable method for classifying Indonesian news article. | Multinomial Naïve Bayes, Multivariate Bernoulli Naïve Bayes and Support Vector Machine for classifier. TF-IDF and SVD algorithm for feature selection. | The highest result is showed by the hybrid of TF-IDF and Multinomial Naïve Bayes. |
| 10. | (Waykole & Thakare, 2018) | This study explores one of clinical text which is about cancer. The study analyzes clinical literature for cancer. | Logistic Regression and Random Forest Classifier. For feature selection, comparing Bag of Word, TF-IDF, and Word2Vec. | The best feature selection method showed by word2vec with random forest as the classifier. |

2.4 Related Works on Quran Text Mining

For Quran-related text mining, there are three types of research. The first type observes the algorithms used for Quran text mining; with this type of research, the researchers can compare two or more algorithms for extracting the most valuable information inside the Quran. Another type focuses on a specific dataset and analyzes the text mining results on that dataset, such as the Indonesian Tafsir and Translation. There are also previous related works on text mining for the Quran and Tafsir (Alromima et al., 2015; Chua & Ellyza Binti Nohuddin, 2014; Hamoud & Atwell, 2016; Khadangi et al., 2018; YILMAZ GENÇ & Syed, 2019), each with different goals.

Because the Quran is divided into chapters that were fixed in the past, researchers want to explore the rules behind this division. A good example is the study by (Chua & Ellyza Binti Nohuddin, 2014), whose goal was to analyze the frequent patterns found in the chapters of a Malay translated Tafsir of the Quran; the techniques were frequent pattern mining, non-trivial patterns and interesting relations. The processed dataset consisted of 6 documents and 17 terms, with Term Frequency – Inverse Document Frequency (TF-IDF) as the term weighting. The three most frequent terms were “Allah”, “Muhammad”, and “wahai”. A different type of research is presented by Khadangi et al., who studied topic sameness in Quranic surahs; the methodology used natural language processing methods, namely word2vec and roots' accompaniment in verses. The finding was that the choice of a surah's title is based on rational logic and that the surahs hold an inner coherence between their concepts, so that each surah is formed on a single topic or a few tightly related topics (Khadangi et al., 2018).

An analysis of text mining algorithms on the Quran is presented by Liu et al. This research (Liu et al., 2019) is more challenging because it deals with the semantic information inside the Quran. The objective was to contribute to building an algorithm in the areas of semantic analysis and automatic identification by semantically comparing and analyzing the Quran written in Chinese and in Arabic; the approach chosen was a semantically annotated corpus and a semantic knowledge base.

To be more specific, one study explores the Quran Tafsir in the Malay language. To provide an automatic classification algorithm for Quran Tafsir verses, this study (Sabah et al., 2015) used K-Nearest Neighbor (KNN) as the classifier and cosine similarity as the distance; the study contributes to Malay Quran Tafsir category classification because the experiment performed well. To contribute to natural language processing of the Quran, it is also necessary to strengthen the algorithms for building a tidy corpus. This study (YILMAZ GENÇ & Syed, 2019) explored building a corpus and a tagging algorithm in order to create a prototype that can extract collocations of N-gram words. These N-grams consist of 2 to 6 words from the Arabic Quran corpus, ordered by Part of Speech tagging. The result showed that the proposed system lets users select a sequence of tags (2-6 gram) and the scope of the corpus source. Table II presents the summary of the literature review on Quran-related text mining and pattern mining.

Table II. Summary of Literature Reviews on Quran Text Mining

| No | Researcher | Objective | Method | Finding |
|---|---|---|---|---|
| 1. | (Chua & Ellyza Binti Nohuddin, 2014) | Analysis on the frequent patterns that can be found in the chapters of a Malay translated Tafsir of Al-Quran. | The techniques are frequent pattern mining, non-trivial patterns and interesting relations. | The processed dataset: 6 documents and 17 terms. The term weighting is Term Frequency – Inverse Document Frequency (TF-IDF). Three most frequent terms are “Allah”, “Muhammad”, and “wahai”. |
| 2. | (Khadangi et al., 2018) | Intended to study the topic sameness in Quranic surahs. | Natural language processing methods which are word2vec and Roots' accompaniment in Verses. | The choice of the surah's title is based on rational logic, the surahs hold the inner coherence between the concepts so that they have formed on a single topic or a few topics tightly related to each other. |
| 3. | (Hamoud & Atwell, 2016) | To compile holy Quran questions and answers dataset corpus, created for data mining. | Various data mining techniques. | 18 instances were clustered in cluster 0, 1 instance in cluster 1, 8 instances in cluster 2, and 3 instances in cluster 3. |
| 4. | (Sabah et al., 2015) | To provide classification algorithm for Quran Tafsir verses automatically. | K-Nearest Neighbor (KNN) or classifier and cosine similarity as the distance. | Contribution for Malay Quran Tafsir category classification because the experiment performed well. |
| 5. | (Alromima et al., 2015) | To create a prototype in which is able to extract collocation of N-gram words. This N-gram words consist of 2 until 6 words from Arabic Quran corpus ordered by Part of Speech Tagging. | This study implements matching the input structured pattern of Arabic language with the Part of Speech Tagging of Quran corpus. | The proposed system succeeded to make the users select a sequence of tags (2-6 gram) and scope of the corpus source. |
| 6. | (Liu et al., 2019) | To compare and analyze semantically between Chinese and Arabic written language Quran semantically. | Semantic annotated corpus and semantic knowledge base. | To contribute in building an algorithm with semantic analysis and automatic identification areas. |

To sum up the literature review, in the field of text mining the majority of researchers studied only the Quran and the Quran translation. Those studies contributed to the classification and clustering of both the Quran and the translated Quran, so that one can refer to those datasets more easily because they are organized by similarity. If this works for the Quran and the Quran translation, it should be beneficial for research on the Indonesian Tafsir as well.

2.5 Review on Comparative Study of Clustering Methods

To identify the most suitable clustering method for the Indonesian Tafsir and Translation, a review of comparative studies is presented in this section. Zait et al. compared clustering methods and scored each method on the quality of its results and its execution performance; one of the metrics is the ability of each method to discover some or all of the hidden patterns inside the dataset (Zait, M; Messafta, 1997).

Rodriguez et al. did a similar study comparing several clustering methods, namely k-means, random swap, expectation-maximization, hierarchical clustering, self-organizing map (SOM), and fuzzy c-means; the authors found that K-Means, dynamical clustering and SOM had high accuracy in all experiments (Rodriguez et al., 2019). Other comparative clustering studies (Alfina et al., 2012; Qi et al., 2017; N. Singh & Singh, 2012) were more specific, both comparing and combining k-means and hierarchical clustering. According to those studies, the two methods gave the optimum result when they were combined with each other. Alfina et al. compared the k-means and hierarchical clustering methods using two testing parameters, cluster variance and the silhouette coefficient. The study shows that k-means did a good job in terms of grouping (Alfina et al., 2012).

III. RESEARCH METHODOLOGY

This chapter shows the overall steps of the study, as shown in Figure 2. It is organized as follows:

• Section 3.1 discusses the planning stage.

• Section 3.2 discusses the text mining implementation.

Figure 2. Research Framework

3.1 Planning

Planning prepares all the information for the study, which later becomes its theoretical base and justification. The two processes included in the planning are literature review and methodology selection.

3.1.1 Literature Review

Literature review is the method of gathering past studies and citing useful insights from them. This step also gathers the problems in the related field and brainstorms possible solutions to them. The information source can be journal articles, conference papers, books, etc., as long as the source is credible. Samples of the literature reviewed in this study include:

1. Performance Evaluation of K-Means and Hierarchal Clustering in Terms of

Accuracy and Running Time (N. Singh & Singh, 2012), Processing the Text of

the Holy Quran: a Text Mining Study (Alhawarat et al., 2015) and Pattern and

cluster mining on text data (Agnihotri et al., 2014).

2. Categorization of ‘Holy Quran-Tafseer’ using K-Nearest Neighbor Algorithm

(Sabah et al., 2015) and The Semantic Annotation of the Quran Corpus Based

on Hierarchical Network of Concepts Theory (Liu et al., 2019).

3.1.2 Methodology Selection

Methodology selection decides how to conduct the experiments, based on the literature review, to answer the objectives of the study, which includes choosing the algorithms. In this step, the decisions are as follows:

• K-Means as clustering method

• TF-IDF as feature extraction

• Frequent Pattern Growth as association rule algorithm.

3.2 Text Mining Implementation

This step applies the text mining steps. There are five steps in the text mining for this study: preprocessing (feature selection), TF-IDF (feature extraction), most frequent words mining, k-means clustering, and association rules mining. The overall sequence of the text mining implementation is presented in Figure 3. The whole process is carried out for the tafsir and the translation with the same steps.

Figure 3. Text Mining Process

The list of datasets information:

• Version: Kemenag Indonesian Tafsir and Translation

• Number of surah: 114 surah.

3.2.1 Preprocessing (Feature Selection)

The preprocessing or feature selection stage includes case folding, tokenization, stemming and stop word elimination. Preprocessing is needed to reduce noise, that is, unwanted words with no significant meaning, in the text mining input. This step also reduces redundancy and repetition. The steps can be revisited: one can go back to any step if it is required.

• Case folding

Case folding is required so that the same word is not treated as different words. For example, “mining” and “Mining” differ only in case, so the machine has to treat them as one single word.

• Tokenization

Tokenization chops the sentences into individual words. For example, “I love text mining” yields the tokens “I”, “love”, “text”, and “mining”.

• Stemming words

To avoid redundant words in the dataset, words that share a root are “stemmed” into that single root. An example of Indonesian stemming: “rumah”, “perumahan”, “rumahku”, “rumahmu”, and “rumah-rumah” are all stemmed to “rumah”.

• Stop words removal

In text mining, stop words are treated as noise because they occur in abundance. Examples of Indonesian stop words are “yang”, “untuk”, “ke”, “di”, “dari”, “karena”, “namun”, and “oleh”.
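
The four preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word list contains just the examples quoted above, the function names are hypothetical, lowercase folding is assumed for simplicity, and `toy_stem` is a toy affix stripper that covers only the “rumah” examples. It is not a real Indonesian stemmer and not necessarily the tool used in this study.

```python
import re

# Hypothetical stop word list: only the examples quoted in the text above.
STOP_WORDS = {"yang", "untuk", "ke", "di", "dari", "karena", "namun", "oleh"}

# Toy affix rules that cover only the "rumah" examples; NOT a real stemmer.
PREFIXES = ("pe",)
SUFFIXES = ("ku", "mu", "an")


def toy_stem(word):
    """Reduce a word to a rough root using the toy rules above."""
    # A reduplicated form such as "rumah-rumah" collapses to its first half.
    if "-" in word:
        first, second = word.split("-", 1)
        if first == second:
            word = first
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            word = word[: -len(suffix)]
            break
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 4:
            word = word[len(prefix):]
            break
    return word


def preprocess(text):
    """Case folding, tokenization, stop word removal, then stemming."""
    folded = text.lower()                               # case folding (assumed lowercase)
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", folded)  # tokenization
    kept = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [toy_stem(t) for t in kept]                  # stemming


print(preprocess("Rumahku dan rumahmu di perumahan yang sama"))
```

In practice a dictionary-based Indonesian stemmer and a full stop word list would replace the toy rules; the order of the steps is what the sketch is meant to show.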

3.2.2 Term Frequency – Inverse Document Frequency, TF-IDF (Feature Extraction)

TF-IDF is considered one of the most powerful feature extraction methods (Qaiser & Ali, 2018). Unlike the bag-of-words method, it does not look only at the most frequent terms, which would eliminate the less dominant words; TF-IDF weights each term by how frequent it is in a document compared to how frequent it is across all documents. With TF-IDF, the raw frequency of each word is rescaled: the term frequency (TF) is multiplied by the inverse document frequency (IDF), which becomes smaller the more documents contain the word. The mathematical model for TF-IDF is shown in Equation 1 (Qaiser & Ali, 2018).

For a term i in the document j:

$$W_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right) \qquad (1)$$

Where:

$tf_{i,j}$ = number of occurrences of term $i$ in document $j$

$df_i$ = number of documents containing term $i$

$N$ = total number of documents

The overall pseudocode for TF-IDF is shown below.

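class Mapper
    function Map((dcmntId, N), (word, o))
        // o: occurrences of word in document dcmntId
        // N here appears to be the word count of that document, not the N of Equation 1
        for each element ∈ (word, o) do
            write(word, (dcmntId, o, N))

class Reducer
    function Reduce(word, (dcmntId, o, N))
        n = 0
        // n counts the documents that contain the word
        for each element ∈ (dcmntId, o, N) do
            n = n + 1
        tf = o / N
        // |D| is the total number of documents; 1 + n is a smoothed form of df in Equation 1
        idf = log(|D| / (1 + n))
        return (dcmntId, tf × idf)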
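
As a concrete illustration, the sketch below follows Equation 1 directly (raw counts for tf and N/df inside the logarithm) rather than the smoothed, length-normalized variant in the pseudocode. The natural logarithm is an assumption, since the base is not stated, and the function name `tf_idf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, with W = tf * log(N / df)."""
    n_docs = len(documents)                             # N in Equation 1
    term_counts = [Counter(doc) for doc in documents]   # tf_{i,j}: raw counts
    doc_freq = Counter()                                # df_i
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]


# Hypothetical, already preprocessed documents (lists of tokens).
docs = [
    ["allah", "tuhan", "alam"],
    ["allah", "hari", "balas"],
    ["allah", "allah", "jalan"],
]
weights = tf_idf(docs)
print(weights[2])
```

A term that occurs in every document, such as "allah" in this toy corpus, gets weight zero under Equation 1 because log(N/df) = log(1) = 0, while a term confined to one document keeps a positive weight.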

3.2.3 Most Frequent Words Mining

In this stage, the most frequent words are extracted from both the tafsir and the translation. The most frequent words measured by TF are visualized as word clouds. The frequencies measured by TF-IDF are presented as bar plots for the tafsir and the translation. Beyond identifying the most frequent words, the results are evaluated in terms of their correlation using the Pearson Correlation Coefficient.

3.2.4 K-Means Clustering

For the clustering, this study uses the Euclidean distance between terms. The Euclidean distance is formulated in Equation 2.

$$\lVert A - B \rVert = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2} \qquad (2)$$

Where A and B are points in d-dimensional space such that $A = [a_1, a_2, \dots, a_d]$ and $B = [b_1, b_2, \dots, b_d]$.

After the distances are computed, the clustering method is applied.

K-Means is a partitional clustering algorithm, meaning that the dataset is fully divided into clusters that do not overlap. The first step in K-Means clustering is to assign the number of clusters, k. Then k random centroids are chosen initially. Each sample is assigned to its nearest centroid, that is, the one with the smallest Euclidean distance to the sample, and each centroid is updated to the mean of the samples assigned to it. The iteration is repeated until the stopping criterion is met (Friedemann, 2015).
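
A minimal sketch of this procedure in plain Python is given below. It assumes the common stopping criterion "centroids no longer move" (the text does not name one), and the function names, the fixed random seed and the six 2-D points are hypothetical stand-ins for real TF-IDF vectors.

```python
import math
import random


def euclidean(a, b):
    """Equation 2: square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, max_iter=100, seed=0):
    """Plain K-Means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct samples as random initial centroids
    labels = []
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:  # assumed stopping criterion: no movement
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical 2-D vectors forming two obvious groups.
data = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
centroids, labels = k_means(data, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.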

To present the best clustering results, preliminary experiments are done. K-means clustering depends on the cluster centers. One approach to finding the optimal number of clusters k is to look for the elbow in the plot of the Sum of Squared Errors (SSE) against k. So, for the k-means clustering stage, this study runs a preliminary experiment to get the best value of k. Figure 4 shows the overall steps of the preliminary experiment.
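
The elbow idea can be illustrated with a small, self-contained sketch. It uses a tiny one-dimensional K-Means with a deterministic start so that the numbers are reproducible; the data, the function names and the initialization rule are all assumptions made for illustration, not the preliminary experiment of this study.

```python
def k_means_1d(values, k, max_iter=100):
    """Tiny 1-D K-Means with a deterministic start (evenly spaced sorted samples)."""
    ordered = sorted(values)
    step = (len(ordered) - 1) / max(k - 1, 1)
    centroids = [ordered[round(i * step)] for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids, clusters


def sse(centroids, clusters):
    """Sum of squared distances from every sample to its cluster centre."""
    return sum((v - m) ** 2 for m, c in zip(centroids, clusters) for v in c)


# Hypothetical 1-D data with three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
sse_by_k = {k: sse(*k_means_1d(values, k)) for k in range(1, 7)}
for k, err in sse_by_k.items():
    print(k, round(err, 3))
```

On this toy data the SSE falls sharply up to k = 3 and only marginally afterwards, so the elbow of the plot sits at k = 3, which matches the three groups built into the data.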

Figure 4. Flowchart of Preliminary K-Means

3.2.5 Association Rules Mining

Originally, the Frequent Pattern (FP) Growth algorithm is used to find association rules in a relational database of transactions. This study creates a matrix in which the frequent words are treated as the frequent items and each document is treated as a transaction of the original FP-Growth algorithm. The formal definition of an association rule was presented by (Agrawal et al., 1993) as follows.

Let I = {I1, I2, …, Im} be a set of items or binary attributes. Let D be the set of all transactions, where each transaction T is a set of items such that T ⊆ I. Let X and Y be sets of items such that X, Y ⊆ I. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅ (Agrawal et al., 1993).

When dealing with association rules, two values need to be analyzed: support and confidence.

• Support

If s% of the transactions in D contain X ∪ Y, then the association rule X ⇒ Y has support s.

• Confidence

If c% of the transactions in D that contain X also contain Y, then the association rule X ⇒ Y has confidence c.
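
The two definitions can be checked on a toy example. The sketch below only computes support and confidence by direct counting; it is not an implementation of FP-Growth, and the four transactions (documents reduced to sets of frequent words) as well as the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among the transactions containing X, the fraction that also contain Y."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)


# Hypothetical documents, each reduced to the set of frequent words it contains.
transactions = [
    {"musa", "bani israil", "agama"},
    {"musa", "agama"},
    {"musa", "bani israil"},
    {"allah", "agama"},
]
x, y = {"musa"}, {"agama"}
s = support(transactions, x | y)
c = confidence(transactions, x, y)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

Here two of the four transactions contain both words, so the rule {musa} ⇒ {agama} has 50% support, and two of the three transactions containing "musa" also contain "agama", which gives a confidence of about 67%.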

IV. RESULTS AND DISCUSSIONS

This chapter presents and discusses all the results obtained from the experiments. It is organized as follows:

• Section 4.1 presents the feature selection and extraction results.

• Section 4.2 presents the most frequent words mining result.

• Section 4.3 discusses the clustering result.

• Section 4.4 discusses the association rule mining result.

4.1 Feature Selection and Extraction Result

4.1.1 Datasets Gathering Result

The Kemenag Tafsir and Translation datasets are required in the form of UTF-8 text files. Samples of the raw tafsir and translation of Surah Al-Fatihah are presented below, respectively:

“Surah al-Fatihah dimulai dengan Basmalah\n\nAda beberapa pendapat ulama

berkenaan dengan Basmalah yang terdapat pada permulaan surah Al-Fatihah. Di

antara pendapat-pendapat itu, yang termasyhur ialah:\n\[Link] adalah ayat

tersendiri, diturunkan Allah untuk jadi kepala masing-masing surah, dan

pembatas antara satu surah dengan surah yang lain. Jadi dia bukanlah satu ayat

dari al-Fatihah atau dari surah yang lain, yang dimulai dengan Basmalah itu.

…”

“Dengan nama Allah Yang Maha Pengasih, Maha Penyayang.

Segala puji bagi Allah, Tuhan seluruh alam,

Yang Maha Pengasih, Maha Penyayang,

Pemilik hari pembalasan. …”

4.1.2 Feature Selection Result

After feature selection, the datasets no longer form meaningful sentences because some words have been removed. Samples of the feature-selected tafsir and translation are shown in Figure 5. Each word has been tokenized, case-folded into uppercase, and stemmed.

(a) Feature Selected Samples of Tafsir

(b) Feature Selected Sample Translation


Figure 5. The Sample of Selected Features

The common case folding in text mining is to lowercase all text. However, in this study the words are converted to uppercase because the word Allah should not be written in lowercase, out of respect for the Muslim community.

There is a concern in this process. In selecting the significant words, there was some confusion about whether to eliminate or keep certain words due to language ambiguity. For example, the word “satu”, meaning one, is significant when it refers to Tauhid or monotheism, to state that there is only one God. However, the word means something else in other contexts, for example “salah satu dari …” (“one of …”). In that context, “satu” says nothing about monotheism. The solution, for now, is to eliminate words that contribute more noise than support to the process.

4.1.3 Feature Extraction Result

This stage shows the result of the TF-IDF algorithm for weighting the corpus. Table III shows the properties of the term document matrices (TDM) of the tafsir and the translation.

Table III. Properties of Tafsir and Translation Term Document Matrices (TDM)

| | Total terms | Documents | Non-sparse entries | Sparse entries | Maximal length |
|---|---|---|---|---|---|
| Tafsir | 488 | 18450 | 234693 | 8768907 | 14 |
| Translation | 116 | 6234 | 18815 | 704329 | 13 |

The tafsir contains 488 significant terms for the TF-IDF calculation, while the translation contains 116 terms. These terms form the columns of the term document matrix, and the occurrence of each term is weighted for each document. The total number of documents, in this case sentences, is 18450 for the tafsir and 6234 for the translation. The non-sparse entries of each matrix are the nonzero entries and the sparse entries are the zero entries. The maximal length is 14 words per document for the tafsir and 13 words per document for the translation.
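
The figures in Table III can be cross-checked with simple arithmetic, since the number of cells in each matrix is terms × documents and every cell is either zero or nonzero. The short sketch below does this check and derives the sparsity; the variable names are illustrative and the sparsity percentage is computed here, not reported in the table.

```python
# Values copied from Table III: (terms, documents, non-sparse entries, sparse entries).
tdm = {
    "Tafsir": (488, 18450, 234693, 8768907),
    "Translation": (116, 6234, 18815, 704329),
}

sparsity = {}
for name, (terms, documents, non_sparse, sparse) in tdm.items():
    cells = terms * documents
    # Every cell is either zero or nonzero, so the two counts must add up.
    assert non_sparse + sparse == cells
    sparsity[name] = sparse / cells
    print(f"{name}: {cells} cells, sparsity {sparsity[name]:.2%}")
```

Both matrices turn out to be about 97% zeros, which is typical for sentence-level term document matrices where each sentence contains only a handful of the selected terms.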

(a) Tafsir TF – IDF Plot (b) Translation TF – IDF Plot
Figure 6. TF- IDF Frequencies Plot

The visualizations of the TF-IDF values of words for both the tafsir and the translation are presented in Figure 6. The two plots show similar curves. Around 4 words or terms have values significantly different from the others. These values are discussed further in the next section, on the most frequent words mining result.

4.2 Most Frequent Words Mining Result

Since the feature extraction weights the term frequency with TF-IDF, the most frequent words mining follows directly from the TF-IDF result. Figure 7 shows bar plots of the most frequent words in the tafsir and the translation, respectively, measured by TF-IDF. By the definition of TF-IDF, these words are the most likely to appear in each sentence of the tafsir and the translation.

(a) Most Frequent words tafsir (b) Most Frequent words translation
Figure 7. The 20 Most Frequent Words Measured by TF - IDF

The mutual and distinct words among the 30 most frequent words are shown in Table IV below.

Table IV. Mutual and Distinct Words of Tafsir and Translation

| Mutual Words | Distinct Words from Tafsir | Distinct Words from Translation |
|---|---|---|
| {“Allah”, “Tuhan”, “Ayat”, “Iman”, “Muhammad”, “Bumi”, “Manusia”, “Alquran”, “Kafir”, “Kebaikan”, “Kebenaran”, “Rasul”, “Memberi”, “Azab”, “Kaum”, “Neraka”, “Musa”} | {“Firman”, “Nabi”, “Rasulullah”, “Hadis”, “Perbuatan”, “Sabda”, “Agama”, “Dunia”, “Mengetahui”, “Ibrahim”, “Perintah”, “Riwayat”, “Salat”} | {“Peringatan”, “Mendustakan”, “Melihat”, “Taqwa”, “Langit”, “Besar”, “Cipta”, “Kehendak”, “Petunjuk”, “Kiamat”, “Surga”, “Jalan”} |

The word “Firman” has the highest frequency in the tafsir because the Mufassir always comments on each ayah by reciting the words of Allah Azza Wa Jalla, or “Firman” in Indonesian. Here is an example, a piece of the tafsir of Surah Al-Fatihah Ayah 1:

“Allah memulai firman-Nya dengan menyebut "Basmalah" untuk mengajarkan kepada

hamba-Nya agar memulai suatu perbuatan yang baik dengan menyebut basmalah,

sebagai pernyataan bahwa dia mengerjakan perbuatan itu karena Allah dan kepada-

Nyalah dia memohonkan pertolongan dan berkah. …”

To know how strong the correlation between the tafsir and the translation is, the Pearson Correlation Coefficient is calculated. The correlation is observed on the mutual words between the tafsir and the translation, to see whether the pattern is the same or not, that is, how strongly the frequency of a particular word in the tafsir and its frequency in the translation tend to move together.

Table V. Mutual Frequency Words

| Words | X, Frequency of word in Tafsir | Y, Frequency of word in Translation |
|---|---|---|
| Allah | 808.3605 | 462.6793 |
| Tuhan | 332.6135 | 385.8154 |
| Ayat | 553.4683 | 106.3765 |
| Iman | 318.8309 | 247.5849 |
| Muhammad | 342.1532 | 211.9196 |
| Bumi | 277.7571 | 188.8360 |
| Manusia | 437.1579 | 189.0001 |
| Alquran | 322.9600 | 183.2993 |
| Kafir | 249.3606 | 180.5655 |
| Kebaikan | 328.6329 | 167.4989 |
| Rasul | 244.2259 | 158.6053 |
| Memberi | 265.4201 | 142.9594 |
| Azab | 206.7804 | 203.4811 |
| Kaum | 336.9472 | 120.2425 |
| Neraka | 197.6404 | 179.5792 |
| Musa | 207.5235 | 189.0001 |
| Kebenaran | 242.2329 | 211.3072 |

The result of the Pearson Correlation Coefficient calculation is as follows:

X values:
\[ \sum X = 5672.065, \quad M_x = 333.651, \quad SS_x = \sum (X - M_x)^2 = 366567.528 \]

Y values:
\[ \sum Y = 3528.75, \quad M_y = 207.574, \quad SS_y = \sum (Y - M_y)^2 = 127689.396 \]

X and Y combined:
\[ N = 17, \quad \sum (X - M_x)(Y - M_y) = 114805.133 \]

r calculation:
\[ r = \frac{\sum (X - M_x)(Y - M_y)}{\sqrt{SS_x \, SS_y}} = \frac{114805.133}{\sqrt{(366567.528)(127689.396)}} = 0.5306 \]

Where:

X: frequency values of the words in the tafsir (as in Table V)
Y: frequency values of the words in the translation (as in Table V)
M_x: mean of the X values
M_y: mean of the Y values
X - M_x and Y - M_y: deviation scores
(X - M_x)^2 and (Y - M_y)^2: squared deviations
(X - M_x)(Y - M_y): product of the deviation scores
r: Pearson correlation coefficient
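The calculation above can be reproduced with a few lines of Python. The following is a minimal sketch, not the author's script: the function name `pearson_r` is hypothetical, and the two lists are simply the X and Y columns of Table V typed in by hand.

```python
import math

# Table V values: x = frequency in tafsir, y = frequency in translation
x = [808.3605, 332.6135, 553.4683, 318.8309, 342.1532, 277.7571, 437.1579,
     322.9600, 249.3606, 328.6329, 244.2259, 265.4201, 206.7804, 336.9472,
     197.6404, 207.5235, 242.2329]
y = [462.6793, 385.8154, 106.3765, 247.5849, 211.9196, 188.8360, 189.0001,
     183.2993, 180.5655, 167.4989, 158.6053, 142.9594, 203.4811, 120.2425,
     179.5792, 189.0001, 211.3072]


def pearson_r(xs, ys):
    """Pearson correlation from deviation scores (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((a - mean_x) ** 2 for a in xs)   # sum of squared deviations of X
    ss_y = sum((b - mean_y) ** 2 for b in ys)   # sum of squared deviations of Y
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return sp / math.sqrt(ss_x * ss_y)


r = pearson_r(x, y)
print("N =", len(x), "r =", round(r, 4))
```

Computing the coefficient from the raw table values, rather than from the rounded intermediate sums, is a quick way to check that the reported figures are consistent with each other.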

The value of 0.5306 indicates a moderate positive correlation between the frequencies of the mutual words in the tafsir and in the translation. In other words, the more frequently a word occurs in the tafsir, the more frequently it tends to occur in the translation, and vice versa. So, even though the most frequent words of the tafsir and the translation differ in part, the same words tend to be prominent in both. This is useful as an indication that this tafsir and this translation point in the same direction: a reader can refer to either of them and expect a similar pattern of words.

In comparison with the most frequent words measured by TF-IDF, Figure 8 shows word clouds of the most frequent words measured by Term Frequency (TF) only, meaning that the weight of a word is its total number of occurrences across all documents. Overall, the results from TF-IDF and from TF alone are similar. The word “Allah” is dominant in both datasets.
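The difference between the two weightings can be illustrated with a small sketch. It is not the study's pipeline: the three toy documents are made up, the function names are hypothetical, and the IDF variant ln(N / df) + 1 without vector normalization is an assumption chosen for simplicity.

```python
import math
from collections import Counter

# Toy tokenized documents (illustrative only, not the study's data)
docs = [
    ["allah", "ayat", "firman"],
    ["allah", "tuhan"],
    ["allah", "ayat", "ayat"],
]


def total_term_frequency(documents):
    """TF only: total number of occurrences of each term over all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(doc)
    return totals


def summed_tf_idf(documents):
    """TF-IDF summed over documents, using the assumed IDF = ln(N / df) + 1."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))          # count each term once per document
    scores = Counter()
    for doc in documents:
        for term, count in Counter(doc).items():
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            scores[term] += count * idf
    return scores


tf = total_term_frequency(docs)
weighted = summed_tf_idf(docs)
print(tf.most_common())
print(weighted.most_common())
```

In this toy corpus "allah" and "ayat" have the same total count, but "allah" occurs in every document and therefore receives a lower IDF, so TF-IDF ranks "ayat" above it. This is the kind of reordering that can make the TF and TF-IDF rankings differ.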

(a) Word Cloud of Tafsir (b) Word Cloud of Translation
Figure 8. Word Cloud Measured by TF

These results can help readers, especially new Quran learners, to understand in general what the Quran is about. In the tafsir, the word “ayat” has a high frequency. Looking at examples of tafsir that contain “ayat”, the word is frequent because the Mufassir keeps referring to each ayah, or “ayat”. Here is an example:

“Tidak sedikit ayat di dalam Al-Qur'an yang menjelaskan bahwa…

Di dalam ayat-ayat sebelumnya disebutkan…

Sebab itu pada ayat ini Allah mengajarkan…”

For the translation, the word “tuhan” has a high frequency. In the translation, Allah Azza Wa Jalla refers to Himself, when addressing Prophet Muhammad, not only with “Kami” but also with the word “Tuhanmu”. Here is an example of Him referring to Himself:

“Sungguh, Tuhanmu Maha Pelaksana terhadap apa yang Dia kehendaki.

Dan adapun orang-orang yang berbahagia, maka (tempatnya) di dalam surga; mereka
kekal di dalamnya selama ada langit dan bumi, kecuali jika Tuhanmu menghendaki
(yang lain); sebagai karunia yang tidak ada putus-putusnya.” (Surah Hud Ayah 107-
108)

4.3 Clustering Result

4.3.1 Preliminary Experiment Result

Figure 9 shows the SSE against the number of cluster centers for the tafsir and the translation, respectively. As seen in the figure, the SSE decreases as K increases. The best K is the point where the decrease starts to level off, which produces the elbow effect. The elbow is at K = 10 for the tafsir and at K = 8 for the translation. Thus, this study focuses on comparing 8 and 10 clusters for both datasets.

(a) Tafsir (b) Translation
Figure 9. SSE Cluster Plots
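The elbow reading can be sketched in code. In this study the elbow was read from the plot; the example below is only an illustration on made-up one-dimensional data with hand-picked centers, and the threshold rule in the hypothetical `elbow` function (stop when the next drop in SSE is below 10% of the initial SSE) is an assumption, not the author's criterion.

```python
def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


def elbow(curve, min_drop_fraction=0.1):
    """Return the first K after which the SSE drop becomes small (assumed rule)."""
    ks = sorted(curve)
    threshold = min_drop_fraction * curve[ks[0]]
    for k, k_next in zip(ks, ks[1:]):
        if curve[k] - curve[k_next] < threshold:
            return k
    return ks[-1]


# Toy 1-D data with three obvious groups (illustrative only)
points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]

# Hand-picked centers for K = 1..4 (a real run would fit them with K-Means)
candidate_centers = {
    1: [5.0],
    2: [1.0, 7.0],
    3: [1.0, 5.0, 9.0],
    4: [1.0, 5.0, 8.8, 9.1],
}

curve = {k: sse(points, centers) for k, centers in candidate_centers.items()}
best_k = elbow(curve)
print(curve)
print("elbow at K =", best_k)
```

The SSE keeps decreasing as K grows, so the smallest SSE is never a useful criterion on its own; what matters is where an extra cluster stops paying off.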

4.3.2 Final Experiment Result

Table VI shows the clustering result for K = 8 for the tafsir and the translation. Note that the cluster numbers are arbitrary labels; their order carries no meaning. Cluster 0 of the tafsir contains the words “kitab”, “anak”, “sihir”, “mesir”, “agama”, “harun”, “Tuhan”, “firaun”, “bani israil”, and “Musa”. This is a good example of a coherent cluster, because it corresponds to the story of Prophet Musa narrated in the Quran. The story is about the duty of Prophet Musa, accompanied by Prophet Harun, to give a reminder to Firaun. It took place in Egypt, or “Mesir”, where bad magic, or “sihir”, was popular at that time. An example for cluster 0 is found in the tafsir of Surah Thaha Ayah 30, which is shown in the association rules result.

Table VI. Clustering Results, K = 8

Cluster TAFSIR TRANSLATION

Terms Terms

0 KITAB, ANAK, MUKJIZAT, SIHIR, PETUNJUK, RASUL, BENAR,

MESIR, AGAMA, HARUN, TUHAN, MENGINGKARI, KAFIR, AZAB,

FIRAUN, ISRAIL, BANI, MUSA MENDUSTAKAN, KITAB, QURAN.

1 HATI, PERINGATAN, AJARAN, HUKUM, KITAB, KIAMAT, PERINGATAN,

PETUNJUK, AGAMA, KITAB, QURAN RASUL, MUSA, KAFIR, NERAKA,

AZAB.

2 TANAH, PLANET, GUNUNG, SIANG, BAIK, KEBESARAN, HUJAN, TANDA,

HUJAN, KEKUASAAN, BULAN, MALAM, AIR, GUNUNG, LANGIT, BUMI

BENDA, ALAM, TANDA, MATAHARI,

AIR, MENCIPTAKAN, BINTANG,

MAHLUK, MALAIKAT, LANGIT, BUMI

3 TIRMIDZI, IMAM, IBNU, ISMAIL, KARUNIA, KAFIR, HATI, PETUNJUK,

AHMAD, BUKHARI, HURAIRAH, ABU. MUHAMMAD, HAMBA, RASUL,

BERIMAN.

4 BALASAN, PAHALA, KEHIDUPAN, KENIKMATAN, BERTAQWA,

BERHALA, DOSA, HAMBA, KAFIR, KEBAJIKAN, KEKAL, PETUNJUK,

NIKMAT, NERAKA, SURGA, AMAL, LURUS, SUNGAI, SURGA, MANUSIA.

AZAB, AKHIRAT, DUNIA.

5 HATI, PERINGATAN, AJARAN, HUKUM, PETUNJUK, JANJI, JALAN, AZAB,

PETUNJUK, AGAMA, KITAB, QURAN FIRAUN, KITAB, RASUL, MUHAMMAD,

TANDA, BENAR

6 DOSA, KIAMAT, PEREMPUAN, LAKI, BERDOA, RAHMAT, QURAN, AZAB,

TEMPAT, KAFIR. PENGAMPUN, PENGASIH, PENYAYANG.

7 ISTIDRAJ, ISTIADAT, ISRAFIL, BERIMAN, ISTRI, AZAB, AKHIRAT,

ISRAIL, ISTANA, ISTILAH, ISTERI, YATIM, DUNIA, HARTA, NIKMAT,

PETUNJUK, MALAIKAT PEREMPUAN, LAKI

The members of cluster 2 in the tafsir are astronomical terms, for example “Planet”, “Bulan”, “Bintang”, “Matahari”, “Langit” and “Bumi”. This cluster may represent the Quranic perspective on the creation of the universe. An example from the tafsir is as follows:

“Dalam pemahaman astronomi, langit adalah seluruh ruang angkasa semesta, yang

di dalamnya ada berbagai benda langit termasuk matahari, bumi, planet-planet,

galaksi-galaksi, supercluster, dan sebagainya.

Hal ini dikemukakan oleh Allah di dalam Surah al-Mulk/67: 5, yang artinya:

Sesungguhnya Kami telah menghiasi langit yang dekat (langit dunia) dengan bintang-

bintang, dan Kami jadikan bintang-bintang itu alat-alat pelempar syaitan, dan Kami

sediakan bagi mereka siksa Neraka yang menyala-nyala (al-Mulk/67: 5)

Jadi, langit yang berisi bintang-bintang itu memang disebut sebagai langit dunia.

Itulah langit yang kita kenal selama ini. Dan itu pula yang dipelajari oleh para ahli

astronomi selama ini, yang diduga diameternya sekitar 30 miliar tahun cahaya. Dan

mengandung trilyunan benda langit dalam skala tak berhingga.

Namun demikian, ternyata Allah menyebut langit yang demikian besar dan dahsyat itu

baru sebagian dari langit dunia, dan mungkin langit pertama. Maka dimanakah letak

langit kedua sampai ke tujuh?”

The tafsir example above shows that the Mufassir comments on the word “langit” and on how it relates to other creations such as “bintang”, “bulan”, “bumi” and “matahari”. The Mufassir lists all the ayahs from different surahs about “langit”. This is why the clustering result for the tafsir is good.

Another interesting tafsir cluster is cluster 3, which consists of the names of hadith narrators and also forms a good cluster. Cluster 4 of the tafsir is interesting as well, because it contains pairs of opposite words, for example “neraka” and “surga”, “hamba” and “kafir”, or “pahala” and “dosa”. This suggests that in the tafsir the good and the bad are usually narrated together, so they end up close to each other in one cluster. Cluster 5 is also a good example of a coherent cluster because its terms share one topic, the Quran and Kitab as laws of Muslims, which was already discussed in the literature review.

For the translation, the result is not as clear as for the tafsir. Some words appear in several clusters, for example “Azab” and “Petunjuk”, which makes it difficult to decide the main topic of each cluster. What can be learned from this is that the words in the translation are not far from each other. That is, under the TF-IDF weighting, a term is likely to appear a similar number of times in each document.

The next observation is from the result for K = 10, shown in Table VII. For the tafsir, five clusters are similar to those of the previous result. For the translation, some clusters start to become clearer. For example, the words “mata”, “air”, “balasan”, “baik”, “taman”, “buah”, “penghuni”, “kenikmatan”, “mengalir”, “kekal”, “sungai”, and “surga” fall into one cluster. Overall, however, the translation clusters are still difficult to interpret.

Table VII. Clustering Result, K = 10

Cluster TAFSIR TRANSLATION

Terms Terms

0 MUNAFIK, LARANGAN, KEMENANGAN, HATI, LANGIT, HAMBA, TOBAT,

MUSUH, YAHUDI, IBRAHIM, MEKAH, BUMI, QURAN, MUHAMMAD, BERIMAN

PERANG, MUSYRIK, KAFIR.

1 ISTIDRAJ, ISTERI, ISRAFIL, GEMBIRA, PERJALANAN, KAFIR,

ISRAIL, ISTANA, PETUNJUK, CELAKALAH, MANUSIA, BERIMAN,

MALAIKAT KEBENARAN, PERINGATAN.

2 NEGERI, TANDA, NUH, SETAN, MALAIKAT, HAMBA, BENAR, SAHAYA,

NIKMAT, HATI, KIAMAT, KAFIR, ISTRI, ANAK, PEREMPUAN, LAKI.

NERAKA.

3 NIKMAT, HAMBA, QURAN, AJARAN, SAPI, TAKUT, BUMI, MALAM,

SURGA, UMAT, BAIK, KESENANGAN, FIRAUN, HARUN, KEKUASAAN,

KEBAHAGIAAN, NERAKA, KAFIR, KEBESARAN, TANDA.

KEHIDUPAN, AZAB, HIDUP, AKHIRAT,

DUNIA.

4 ISA, HUD, ESA, MUSYRIK, SEMBAH, DIUTUS, AZAB, UMAT, YATIM, NUH,

BERHALA, PATUNG, TUHAN. HARTA, ANAK, RASUL.

5 DAWUD, ABDULLAH, UMAR, TIRMIDZI, MATA, AIR, BALASAN, BAIK,

IMAM, AHMAD, IBNU, BUKHARI, TAMAN, BUAH, PENGHUNI,

MUSLIM, ABU, HURAIRAH, SABDA. KENIKMATAN, MENGALIR, KEKAL,

SUNGAI, SURGA.

6 JALAN, KEBAJIKAN, BURUK, AIR, GOLONGAN, DUNIA, GUNUNG,

BALASAN, SIFAT, ISTERI, IBU, WAKTU, NEGERI, MUHAMMAD,

SURGA, SALEH, PAHALA, PEREMPUAN, MALAIKAT, BAIK, KIAMAT,

LAKI, HAMBA, HARTA, AMAL, DOSA, MANUSIA, AZAB, NERAKA.

ANAK.

7 HARUN, HATI, SIHIR, UMAT, PETUNJUK, NIKMAT, MUHAMMAD,

PETUNJUK, KAUM, MUKJIZAT, DUSTAKAN, AZAB, KAFIR.

KEBENARAN, TAURAT, FIRAUN, BANI,

ISRAIL, KITAB, MUSA.

8 TUMBUHAN, PLANET, BULAN, AZAB, KAFIR, KERAJAAN, TANDA,

KEKUASAAN, TANDA, BENDA, TANAH, JANJI, RASUL, BUMI, LANGIT,

GUNUNG, ALAM, MATAHARI, HUJAN, BESAR

CIPTA, MAHLUK, BINATANG, AIR,

LANGIT, BUMI.

9 IBRAHIM, MENYAMPAIKAN, HAMBA, PUJI, ZALIM, DISEMBAH, ESA,

MAHLUK, UTUSAN, LAKI, WAHYU, LANGIT, BUMI, AZAB, PENGASIH.

JIBRIL, LUT, ADAM, MALAIKAT.

To sum up, when K = 8 the tafsir clustering shows a good partition, and when K = 10 the translation shows a better partition than it does with K = 8.

4.4 Association Rules Result

This section shows the interesting associations found in the tafsir and the translation. Figure 10 shows the associations of the word “Allah” in the translation. Except for the word “kafir”, all the associated words carry a positive sentiment. For the association between “Allah” and “kafir”, the support is 0.004 and the confidence is 0.781. The support means that 0.4% of all translation documents contain both “Allah” and “kafir”. The confidence means that, of the documents that contain the premise of the rule, 78.1% also contain its conclusion. To see what this means, the translation dataset is consulted. One sample of this association is Surah An-Nahl Ayah 106 – 107. These ayahs show that Allah narrates the bad fate that will come to the kafir, the people who deny the truth of Allah.
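Support and confidence can be made concrete with a small sketch. The five toy documents below are invented and the function names are hypothetical; the study itself obtained its rules with FP-Growth in RapidMiner (Appendix 2), so this only illustrates the two definitions.

```python
def support(documents, items):
    """Fraction of documents that contain every term in `items`."""
    items = set(items)
    return sum(1 for doc in documents if items <= doc) / len(documents)


def confidence(documents, premise, conclusion):
    """Of the documents containing the premise, the share that also
    contain the conclusion (assumes the premise occurs at least once)."""
    both = set(premise) | set(conclusion)
    return support(documents, both) / support(documents, premise)


# Toy documents as sets of terms (illustrative only, not the study's data)
docs = [
    {"allah", "kafir", "azab"},
    {"allah", "iman"},
    {"kafir", "allah"},
    {"iman", "taqwa", "allah"},
    {"bumi", "langit"},
]

supp_pair = support(docs, {"allah", "kafir"})
conf_kafir_to_allah = confidence(docs, {"kafir"}, {"allah"})
conf_allah_to_kafir = confidence(docs, {"allah"}, {"kafir"})
print(supp_pair, conf_kafir_to_allah, conf_allah_to_kafir)
```

Support is symmetric, but confidence depends on the direction of the rule: in the toy data every document with "kafir" also contains "allah", while only half of the documents with "allah" contain "kafir". When reading a reported confidence, it therefore matters which term is the premise.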

Figure 10. Association Rules Sample of Translation

In the context of Ayah 106, the word “kafir” can also be a verb meaning to deny.

“Barangsiapa kafir kepada Allah setelah dia beriman (dia mendapat kemurkaan

Allah), kecuali orang yang dipaksa kafir padahal hatinya tetap tenang dalam beriman

(dia tidak berdosa), tetapi orang yang melapangkan dadanya untuk kekafiran, maka

kemurkaan Allah menimpanya dan mereka akan mendapat azab yang besar. Yang

demikian itu disebabkan karena mereka lebih mencintai kehidupan di dunia daripada

akhirat, dan Allah tidak memberi petunjuk kepada kaum yang kafir.” (Quran, Surah

An-Nahl 106-107).

Figure 11. Association Rule Sample of Tafsir

For the tafsir, Figure 11 above shows an interesting association. The words “israil”, “musa”, and “bani” have the same support and confidence values: 0.1% of all documents contain these words together, and the confidence is 54%. This association is also related to cluster 0 of the previous clustering result, which contains those words together with the word “Harun”. Searching the tafsir for the word “Musa” gives, as one example, the tafsir of Surah Thaha Ayah 30.

“Ayat ini menerangkan bahwa Musa a.s. mengusulkan agar yang diangkat menjadi

pembantunya itu ialah Harun, saudaranya sendiri yang lebih tua dari dia, Musa

memilih Harun antara lain karena Harun itu seorang yang saleh, ucapannya fasih,

intonasi bicaranya seperti orang Mesir, karena ia banyak bergaul dengan orang-orang

Mesir, tempat untuk melaksanakan dakwahnya bersama Musa a.s., ….”

From the association rule, one can directly see that these words are associated with the duty that Prophet Musa received from Allah to remind Firaun. He then asked Allah Azza Wa Jalla that his brother, Prophet Harun, accompany him in this duty. The story took place in Mesir, which is Egypt now.

There are several benefits of knowing the association rules. The first is to enable Islamic scholars and Muslims to find all the connections of a topic they would like to learn. For example, say one wants to learn about Prophet Musa from the Indonesian tafsir. Without the association rules, he or she might focus only on the word “Musa” and have to read every sentence in the tafsir about “Musa” to draw valuable information about Prophet Musa. The case is different if he or she has seen the list of association rules for the word “Musa”, for instance (“Musa”, “Mesir”) -> “Agama” or (“Musa”, “Bani Israil”) -> “Agama”: these words can be used to focus the search for information about Prophet Musa.

Another benefit is in business, specifically for online book stores or libraries. Say a user of an online book store or library is interested in books tagged with the keyword “Musa”. The system could then suggest books that might interest the user and build a list of preferred books for that user. Because in the tafsir the word “Musa” has association rules with “Mesir” and/or “Bani Israil”, the system can recommend books tagged with “Mesir” and/or “Bani Israil”. Of course, this needs further processing, but it is the general kind of thing that association rules can do in this business area.

V. CONCLUSIONS AND RECOMMENDATIONS

5.1 Conclusions

From this study, the author can conclude several things. Firstly, the clustering results of the tafsir and the translation were obtained using the K-Means technique. The best partition is shown by the tafsir with 8 clusters. The translation, however, does not partition as well as the tafsir.

Next, valuable information from the tafsir and the translation was successfully obtained from a text mining perspective. The information was collected from the most frequent words. The 30 most frequent words of the tafsir and of the translation were presented, and 17 mutual words occur in both rankings. The correlation of the mutual words between the tafsir and the translation is 0.5306, meaning that the higher the frequency of a word in the tafsir, the higher its frequency tends to be in the translation, and vice versa.

Furthermore, association rules were successfully found using the FP-Growth algorithm. A sample rule from the tafsir is (“Musa”, “Bani Israil”) -> “Agama”. For the translation, a sample is (“Iman”, “Taqwa”) -> “Allah”. The benefits of the association rules, which range from information retrieval to business purposes, were explained.

5.2 Recommendations

The recommendations for future work are as follows:

1. To assign each cluster a theme based on the context of its member words, in consultation with knowledgeable Islamic scholars.

2. To carry out the next step of this study, which is an information retrieval application study.

3. To do a comparative study at the clustering stage to find the most robust method.

4. To evaluate the clustering performance with a prediction test, not only with training.

5. To do a semantic study of Indonesian Quran tafsir and translation text mining.

51
REFERENCES

Agnihotri, D., Verma, K., & Tripathi, P. (2014). Pattern and cluster mining on text
data. Proceedings - 2014 4th International Conference on Communication
Systems and Network Technologies, CSNT 2014, 428–432.
[Link]
Agrawal, R., Swami, A., & Imieli’nski, T. (1993). Mining Association Rules between
Sets of Items in Large Databases. SIGMOD ’93: Proceedings of the 1993 ACM
SIGMOD International Conference on Management of Data, At Washington,
D.C., United States, January 1993. [Link]
Al Qaththan, S. M. (2000). MABAHITS FI ULUMIL QURAN. Pustaka Al Kautsar.
Alfina, T., Santosa, B., & Barakbah, A. R. (2012). Analisa Perbandingan Metode
Hierarchical Clustering, K-means dan Gabungan Keduanya dalam Cluster Data.
Teknik Its, 1(1), 521–525. [Link]
Alhawarat, M., Hegazi, M., & Hilal, A. (2015). Processing the Text of the Holy
Quran: a Text Mining Study. International Journal of Advanced Computer
Science and Applications, 6(2). [Link]
Alromima, W., Moawad, I. F., Elgohary, R., & Aref, M. (2015). Extracting N-gram
terms collocation from tagged Arabic corpus. 2014 9th International Conference
on Informatics and Systems, INFOS 2014, NLP10–NLP15.
[Link]
Atenstaedt, R., & Leder, D. (2012). Word cloud analysis of the BJGP. British Journal
of General Practice, 62(March), 2520. [Link]
Balsor, J. L., Jones, D. G., & Murphy, K. M. (2019). A primer on high-dimensional
data analysis workflows for studying visual cortex development and plasticity.
[Link]
Chua, S., & Ellyza Binti Nohuddin, P. N. (2014). Frequent pattern extraction in the
Tafseer of Al-Quran. 2014 the 5th International Conference on Information and
Communication Technology for the Muslim World, ICT4M 2014.
[Link]
Farooq, M., & Kanwal, N. (2019). Summary of Holy Quran: An Ultimate Guide
Series (Issue October 2019).
[Link]
Friedemann, V. (2015). Clustering a Customer Base Using Twitter Data. Cs, 229(1),
1–5.
Hamoud, B., & Atwell, E. (2016). Quran question and answer corpus for data mining
with WEKA. Proceedings of 2016 Conference of Basic Sciences and
Engineering Studies, SGCAC 2016, February, 211–216.
[Link]

52
Han, J., Cheng, H., Xin, D., & Yan, X. (2007). Frequent pattern mining: Current
status and future directions. Data Mining and Knowledge Discovery, 15(1), 55–
86. [Link]
Hotho, Andreas; Nürnberger, Andreas; Paass, G. (2005). A Brief Survey of Text
Mining. 19–62.
Jayasekara, P. K., & Abu, K. S. (2018). Text Mining of Highly Cited Publications in
Data Mining. IEEE 5th International Symposium on Emerging Trends and
Technologies in Libraries and Information Services, ETTLIS 2018, 128–130.
[Link]
Khadangi, E., Fazeli, M. M., & Shahmohammadi, A. (2018). The study on Quranic
surahs’ topic sameness using NLP techniques. 2018 8th International
Conference on Computer and Knowledge Engineering, ICCKE 2018, Iccke, 298–
302. [Link]
Kim, M. J., Ohk, K., & Moon, C. S. (2017). Trend analysis by using text mining of
journal articles regarding consumer policy. New Physics: Sae Mulli, 67(5), 555–
561. [Link]
Liu, Z., Yang, L., & Atwell, E. (2019). The Semantic Annotation of the Quran Corpus
Based on Hierarchical Network of Concepts Theory. Proceedings of the 2018
International Conference on Asian Language Processing, IALP 2018, 318–321.
[Link]
Louridas, P., & Ebert, C. (2016). Machine Learning. IEEE Software, 33(5), 110–115.
[Link]
Matsumoto, T., Sunayama, W., Hatanaka, Y., & Ogohara, K. (2017). Data Analysis
Support by Combining Data Mining and Text Mining. Proceedings - 2017 6th
IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2017,
313–318. [Link]
Nirwana, A. (2017). MADARISUT TAFSIR FI QARNIS SAHABAH.
[Link]
Portal Informasi Indonesia. (2010). Agama. [Link]
Qaiser, S., & Ali, R. (2018). Text Mining: Use of TF-IDF to Examine the Relevance
of Words to Documents. International Journal of Computer Applications,
181(1), 25–29. [Link]
Qi, J., Yu, Y., Wang, L., Liu, J., & Wang, Y. (2017). An effective and efficient
hierarchical K-means clustering algorithm. International Journal of Distributed
Sensor Networks, 13(8), 1–17. [Link]
Religious Literacy Project. (n.d.). Qur’an: The Word of God. Harvard Divinity
School. Retrieved November 7, 2020, from
[Link]
Rodriguez, M. Z., Comin, C. H., Casanova, D., Bruno, O. M., Amancio, D. R., Costa,
L. da F., & Rodrigues, F. A. (2019). Clustering algorithms: A comparative

53
approach. In PLoS ONE (Vol. 14, Issue 1).
[Link]
Sabah, G., Khaotijah, S., & Mahdi, F. (2015). Categorization of ‘Holy Quran-Tafseer’
using K-Nearest Neighbor Algorithm. International Journal of Computer
Applications, 129(12), 1–6. [Link]
Saleh, W. A. (2010). Preliminary Remarks on the Historiography of tafsīr in Arabic:
A History of the Book Approach. Journal of Qur Anic Studies, 12, 6–40.
[Link]
Singh, K., Shakya, H. K., & Biswas, B. (2016). Clustering of people in social network
based on textual similarity. Perspectives in Science, 8, 570–573.
[Link]
Singh, N., & Singh, D. (2012). Performance Evaluation of K-Means and Heirarichal
Clustering in Terms of Accuracy and Running Time. International Journal of
Computer Science and Information Technologies, 3(3), 4119–4121.
Verhagen, A. (2008). Syntax, Recursion, Productivity – a Usage-Based Perspective
on the Evolution of Grammar. Evidence and Counter-Evidence: Essays in
Honour of Frederik Kortlandt, Volume 2, January, 399–414.
[Link]
Waykole, R. N., & Thakare, A. D. (2018). a Review of Feature Extraction Methods
for Text Classification. International Journal of Advance Engineering and
Research Development, 5(04), 351–354.
Wongso, R., Luwinda, F. A., Trisnajaya, B. C., Rusli, O., & Rudy. (2017). News
Article Text Classification in Indonesian Language. Procedia Computer Science,
116, 137–143. [Link]
YILMAZ GENÇ, S., & Syed, H. (2019). Quranic Principles of Universal Law on the
Quranic Exegesis. Bilimname, December, 165–186.
[Link]
Yusuf Ali, A. (2001). The Holy Qur’an. Wordsworth Editions Ltd; 5th edition.
Zait, M; Messafta, H. (1997). A Comparative Study of Clustering Methods. Future
Generation Computer System, 149–159.
Zhang, L., Zhu, G., Zhang, S., Zhan, X., Wang, J., Meng, W., Fang, X., & Wang, P.
(2019). Assessment of Career Adaptability: Combining Text Mining and Item
Response Theory Method. IEEE Access, 7, 125893–125908.
[Link]
Zhou, M., Duan, N., Liu, S., & Shum, H. Y. (2020). Progress in Neural NLP:
Modeling, Learning, and Reasoning. Engineering, 6(3), 275–290.
[Link]

54
APPENDICES

APPENDIX 1 K-MEANS ALGORITHM CODE

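# Import the required packages
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

# Read the tafsir or translation CSV file ("NameOfFile" is a placeholder)
dataset = pd.read_csv("NameOfFile", encoding='unicode_escape')

# Indonesian stopword list from the Sastrawi factory
factory = StopWordRemoverFactory()
stopword = factory.get_stop_words()

# Build the TF-IDF matrix from the text column
tf_idf = TfidfVectorizer(
    min_df=5,
    max_df=0.95,
    max_features=8000,
    lowercase=True,
    stop_words=stopword
)
tf_idf.fit(dataset.Column1)
text = tf_idf.transform(dataset.Column1)


def optimum_clusters(dataset, max_k):
    # Fit MiniBatchKMeans for k = 2, 4, ..., max_k and plot the SSE (inertia)
    iters = range(2, max_k + 1, 2)

    sse = []
    for k in iters:
        sse.append(MiniBatchKMeans(n_clusters=k, init_size=1024,
                                   batch_size=5000,
                                   random_state=20).fit(dataset).inertia_)
        print('Fit {} clusters'.format(k))

    f, ax = plt.subplots(1, 1)
    ax.plot(iters, sse, marker='o')
    ax.set_xlabel('Cluster Centers')
    ax.set_xticks(iters)
    ax.set_xticklabels(iters)
    ax.set_ylabel('Name of Plot Label')
    ax.set_title('Name of Axis')


optimum_clusters(text, 20)

# variableK is a placeholder for the chosen number of clusters
clusters = MiniBatchKMeans(n_clusters=variableK).fit_predict(text)


def plot_tsne(dataset, labels, n_items=3000, n_points=300):
    # n_items and n_points are assumed sample sizes; the original listing
    # does not show them
    max_label = max(labels)
    n_items = min(n_items, dataset.shape[0])
    max_items = np.random.choice(range(dataset.shape[0]), size=n_items,
                                 replace=False)

    # 2-D PCA of the sample (only its row count is used below)
    the_pca = PCA(n_components=2).fit_transform(dataset[max_items, :].toarray())
    # Reduce to 50 dimensions with PCA first, then embed with t-SNE
    tsne = TSNE().fit_transform(
        PCA(n_components=50).fit_transform(dataset[max_items, :].toarray()))

    idx = np.random.choice(range(the_pca.shape[0]),
                           size=min(n_points, the_pca.shape[0]), replace=False)
    label_subset = labels[max_items]
    # The colormap name is lost in the source listing; cm.hsv is an assumption
    label_subset = [cm.hsv(i / max_label) for i in label_subset[idx]]

    # Only the t-SNE panel is drawn, as in the original listing
    f, ax = plt.subplots(1, 2, figsize=(14, 6))
    ax[1].scatter(tsne[idx, 0], tsne[idx, 1], c=label_subset)
    ax[1].set_title('TSNE Cluster Plot')


plot_tsne(text, clusters)


def getwords(dataset, clusters, labels, n_terms):
    # Mean TF-IDF weight of every term within each cluster
    dataframe = pd.DataFrame(dataset.toarray()).groupby(clusters).mean()

    for i, r in dataframe.iterrows():
        print('\nCluster {}'.format(i))
        print(','.join([labels[t] for t in np.argsort(r.to_numpy())[-n_terms:]]))


# variableN is a placeholder for the number of top terms printed per cluster.
# On scikit-learn older than 1.0, use tf_idf.get_feature_names() instead.
getwords(text, clusters, tf_idf.get_feature_names_out(), variableN)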

APPENDIX 2 FP GROWTH ALGORITHM CODE

<?xml version="1.0" encoding="UTF-8"?><process


version="9.7.000">
<context>
<input/>

<output/>
<macros/>
</context>
<operator activated="true" class="process"
compatibility="9.7.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel"
compatibility="9.7.000" expanded="true" height="68" name="Read
Excel" width="90" x="179" y="34">
<parameter key="excel_file" value="F:\CAPSTONE PROJECT
1\assrul\[Link]"/>
<parameter key="sheet_selection" value="sheet
number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United
States)"/>
<parameter key="read_all_values_as_polynominal"
value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0"
value="[Link]"/>
</list>

<parameter key="read_not_matching_values_as_missings"
value="false"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="nominal_to_text"
compatibility="9.7.000" expanded="true" height="82"
name="Nominal to Text" width="90" x="179" y="136">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception"
value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception"
value="false"/>
<parameter key="except_block_type"
value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes"
value="false"/>
</operator>
<operator activated="true"
class="text:process_document_from_data"
compatibility="9.3.001" expanded="true" height="82"
name="Process Documents from Data" width="90" x="179" y="238">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="Binary Term
Occurrences"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="none"/>

<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement"
value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights"
value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true"
class="web:extract_html_text_content" compatibility="9.3.001"
expanded="true" height="68" name="Extract Content" width="90"
x="112" y="34">
<parameter key="extract_content" value="true"/>
<parameter key="minimum_text_block_length"
value="5"/>
<parameter key="override_content_type_information"
value="true"/>
<parameter key="neglegt_span_tags" value="true"/>
<parameter key="neglect_p_tags" value="true"/>
<parameter key="neglect_b_tags" value="true"/>
<parameter key="neglect_i_tags" value="true"/>
<parameter key="neglect_br_tags" value="true"/>
<parameter key="ignore_non_html_tags"
value="true"/>
</operator>
<operator activated="true" class="text:tokenize"
compatibility="9.3.001" expanded="true" height="68"
name="Tokenize" width="90" x="112" y="136">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>

<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true"
class="text:transform_cases" compatibility="9.3.001"
expanded="true" height="68" name="Transform Cases" width="90"
x="112" y="238">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true"
class="text:filter_by_length" compatibility="9.3.001"
expanded="true" height="68" name="Filter Tokens (by Length)"
width="90" x="313" y="85">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="25"/>
</operator>
<connect from_port="document" to_op="Extract
Content" to_port="document"/>
<connect from_op="Extract Content"
from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document"
to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases"
from_port="document" to_op="Filter Tokens (by Length)"
to_port="document"/>
<connect from_op="Filter Tokens (by Length)"
from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true"
class="numerical_to_binominal" compatibility="9.7.000"
expanded="true" height="82" name="Numerical to Binominal"
width="90" x="179" y="340">

<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception"
value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception"
value="false"/>
<parameter key="except_block_type"
value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes"
value="false"/>
<parameter key="min" value="0.0"/>
<parameter key="max" value="0.0"/>
</operator>
<operator activated="true" class="concurrency:fp_growth"
compatibility="9.7.000" expanded="true" height="82"
name="FP-Growth" width="90" x="179" y="442">
<parameter key="input_format"
value="items in dummy coded columns"/>
<parameter key="item_separators" value="|"/>
<parameter key="use_quotes" value="false"/>
<parameter key="quotes_character" value="&quot;"/>
<parameter key="escape_character" value="\"/>
<parameter key="trim_item_names" value="true"/>
<parameter key="min_requirement" value="support"/>
<parameter key="min_support" value="0.001"/>
<parameter key="min_frequency" value="100"/>
<parameter key="min_items_per_itemset" value="1"/>

<parameter key="max_items_per_itemset" value="0"/>
<parameter key="max_number_of_itemsets"
value="1000000"/>
<parameter key="find_min_number_of_itemsets"
value="true"/>
<parameter key="min_number_of_itemsets" value="100"/>
<parameter key="max_number_of_retries" value="15"/>
<parameter key="requirement_decrease_factor"
value="0.9"/>
<enumeration key="must_contain_list"/>
</operator>
<operator activated="true"
class="create_association_rules" compatibility="9.7.000"
expanded="true" height="82" name="Create Association Rules"
width="90" x="380" y="442">
<parameter key="criterion" value="confidence"/>
<parameter key="min_confidence" value="0.5"/>
<parameter key="min_criterion_value" value="0.8"/>
<parameter key="gain_theta" value="2.0"/>
<parameter key="laplace_k" value="1.0"/>
</operator>
<connect from_op="Read Excel" from_port="output"
to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text"
from_port="example set output" to_op="Process Documents from Data"
to_port="example set"/>
<connect from_op="Process Documents from Data"
from_port="example set" to_op="Numerical to Binominal"
to_port="example set input"/>
<connect from_op="Numerical to Binominal"
from_port="example set output" to_op="FP-Growth"
to_port="example set"/>
<connect from_op="FP-Growth" from_port="frequent sets"
to_op="Create Association Rules" to_port="item sets"/>
<connect from_op="Create Association Rules"
from_port="rules" to_port="result 1"/>

<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
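
The process above chains text preprocessing (lower-casing, keeping tokens of 4 to 25 characters), binarization of term counts, frequent itemset mining, and rule creation with confidence as the criterion. The following is a minimal plain-Python sketch of that flow, not the RapidMiner implementation. It assumes tokens are split on non-letter characters, replaces FP-Growth with brute-force support counting (which yields the same itemsets on small inputs), and uses invented toy sentences with a toy support threshold of 0.5 instead of the listing's 0.001. All function and variable names are hypothetical.

```python
import re
from itertools import combinations


def tokens(text, min_chars=4, max_chars=25):
    """Split on non-letters (assumed), lower-case, keep tokens of 4..25 chars.

    Returning a set mirrors the binarization step: only presence matters.
    """
    words = re.split(r"[^A-Za-z]+", text)
    return {w.lower() for w in words if min_chars <= len(w) <= max_chars}


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force stand-in for FP-Growth: count support of every candidate."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result


def association_rules(itemsets, min_confidence):
    """Emit premise -> conclusion rules whose confidence meets the threshold."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for premise in combinations(sorted(itemset), k):
                premise = frozenset(premise)
                # confidence = support(premise and conclusion) / support(premise)
                confidence = support / itemsets[premise]
                if confidence >= min_confidence:
                    rules.append((set(premise), set(itemset - premise), confidence))
    return rules


# Invented toy documents, not data from the report.
docs = [
    "Musa led Bani Israil",
    "Musa spoke to Bani Israil about religion",
    "Religion guides the people",
    "Musa and religion",
]
transactions = [tokens(d) for d in docs]
itemsets = frequent_itemsets(transactions, min_support=0.5)  # toy threshold
rules = association_rules(itemsets, min_confidence=0.5)      # as in the listing

for premise, conclusion, confidence in rules:
    print(sorted(premise), "->", sorted(conclusion), round(confidence, 3))
```

Short words such as "led" and "the" are dropped by the length filter before mining, which is why they never appear in a rule. On a real corpus the brute-force counting would be too slow, which is the reason the process uses FP-Growth.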

APPENDIX 3 MOST FREQUENT TERMS ALGORITHM CODE

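#PLOT TF IDF
# Bar chart of the 25 most frequent terms.
# The names high.freq and hfp.df are assumed; `frequency` is a named numeric vector.
library(ggplot2)
high.freq=tail(sort(frequency),n=25)
hfp.df=as.data.frame(sort(high.freq))
hfp.df$names <- rownames(hfp.df)
ggplot(hfp.df, aes(reorder(names,high.freq),
high.freq)) +
geom_bar(stat="identity") + coord_flip() +
xlab("Terms") + ylab("Frequency") +
ggtitle("Term frequencies")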

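#plot word clouds
library(wordcloud2)

# Sum each term's count across documents and keep terms occurring at least 100 times
tdm2_ter <- as.matrix(t_ter)
w_ter <- rowSums(tdm2_ter)
w_ter <- subset(w_ter, w_ter>=100)
# wordcloud2() expects a data frame with a word column and a frequency column
wo2_ter <- data.frame(names(w_ter), w_ter)
colnames(wo2_ter) <- c('word', 'freq')
wordcloud2(wo2_ter, size = 0.6,
shape = 'circle', rotateRatio = 0, color = "black")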

APPENDIX 4 DESIGN OF ASSOCIATION RULES

