Sentiment Analysis of School Zoning System On
Youtube Social Media Using The K-Nearest
Neighbor With Levenshtein Distance Algorithm
1st Nenny Anggraini, 2nd Muhammad Jabal Tursina
Informatics Engineering Department
Syarif Hidayatullah State Islamic University
Jakarta, Indonesia
[email protected],
[email protected],
[email protected]

Abstract— Education plays a role in the progress of every nation and state. The Ministry of Education and Culture (Kemendikbud) issued Regulation of the Minister of Education and Culture (Permendikbud) No. 14 of 2018 on the admission of new students (Penerimaan Peserta Didik Baru, PPDB). It replaces the previous rules, among other things by introducing a zoning system to distribute students more evenly. The PPDB policy has drawn both support for and criticism of the zoning system. This research analyzes the sentiment of public comments on the policy. The data consist of 160 comments taken from YouTube. The research uses the K-Nearest Neighbor algorithm with k = 3, k = 5, k = 7, and k = 9 to classify the test data, and Levenshtein Distance to correct misspelled words. The two are combined in order to determine the accuracy of the combination. Testing was done using a confusion matrix on the test data. The tests show that combining K-Nearest Neighbor with Levenshtein Distance can increase classification accuracy: K-Nearest Neighbor alone reached 50% at k = 3 and k = 7, whereas the combination of K-Nearest Neighbor and Levenshtein Distance reached 65.625% at k = 3.

Keywords— sentiment analysis, zoning system, K-Nearest Neighbor, Levenshtein Distance.

I. INTRODUCTION

The Ministry of Education and Culture (Kemendikbud), under Minister Muhadjir Effendy, issued Regulation of the Minister of Education and Culture (Permendikbud) No. 14 of 2018 on the admission of new students (PPDB). The regulation replaces the previous rules, among other things by introducing a zoning system to distribute students more evenly. The Indonesian Child Protection Commission (KPAI) urged that the 2018 PPDB policy be evaluated, both by Kemendikbud and by the education offices in every region. In its view, the zoning system in the PPDB process still has a number of weaknesses and needs further evaluation. [1]

According to an interview with Mrs. Any Sayekti, head of the Sub-Division of Law, Governance, and Staffing, the school zoning system was established on 2 May 2018 and enacted on 8 May 2018. It has been implemented throughout Indonesia for all schools except vocational high schools (SMK). Kemendikbud and PDSPK (Centre for Education and Culture Statistical Data) interpret the zoning system differently. According to her, zoning means a shorter distance between students and their school, whereas PDSPK takes the opposite view: a school becomes the center of a zone and looks for students who live close to it. PDSPK interprets the zoning system as a way to equalize the quality of education, while Kemendikbud interprets it as giving students closer access from home to school. The school zoning system has several problems, including the timing of socialization relative to the implementation of the 2018 PPDB. Not all of the district, city, and provincial representatives who attended the socialization passed the information on to the public or to the stakeholders below them, and local governments did not fully understand the regulation or translate it into rules for their own territories. Many regions still do not comply with Regulation No. 14 of 2018: not all of them apply the requirement that at least 90% of admitted students be those living closest to the school. The policy is still being evaluated to determine its future design, so that the system can continue to run on the basis of each year's PPDB evaluation. The quality of education can be improved through policy intervention.

According to 2017 data from WeAreSocial.net and Hootsuite, internet usage in Indonesia grew very rapidly, by 51% within one year. More than 69% of Indonesians access the internet from mobile devices. A GlobalWebIndex survey of Indonesian internet users aged 16 to 64 shows that several social media platforms are actively used in Indonesia. The platforms fall into two categories, namely social networking media and messengers. YouTube ranked first with 43% usage, Facebook second with 41%, followed by WhatsApp with 40%. [2]

In general, opinions can be expressed about anything, for example a product, a service, an individual, an organization, or an event. The term object is used for the entity that is commented upon. An object has a set of components and a set of attributes. An opinion sentence is a sentence that expresses a positive or negative opinion, explicitly or implicitly. [3]

Levenshtein Distance, commonly referred to as edit distance, is a method that can be used to handle misspellings. A word is treated as misspelled if the word typed by the user is not found in the Indonesian dictionary. The Levenshtein Distance method measures how close two strings are through the addition of
The 7th International Conference on Cyber and IT Service Management (CITSM 2019)
characters, the substitution of characters, and the removal of characters until the two strings match. [4]

Based on the background above, the authors carry out an analysis that categorizes a person's comment as a positive or a negative opinion.

II. RELATED WORK
In Ernawati and Wati's research, entitled "Application of the K-Nearest Neighbors Algorithm to Sentiment Analysis of Travel Agent Reviews", the researchers used 200 reviews. They tested the model with 10-fold cross-validation, with the folds formed at random. The principle of 10-fold cross-validation is a 1:9 split: one part of the data is used for testing and the other nine parts for training, so that each of the 10 parts has the opportunity to serve as testing data. [5]

In the research of Rizal Setya, entitled "Sentiment Analysis of Movie Opinions in Indonesian-Language Twitter Documents Using Naive Bayes with Non-Standard Word Repair", the data are original tweets in which Twitter users express opinions about films in Indonesian. The training data comprise 140 opinions, 70 positive and 70 negative. The test data comprise 60 opinions, 30 positive and 30 negative. The study found that Naive Bayes classification with non-standard word repair can be applied to sentiment analysis of movie opinions in Indonesian-language Twitter documents. Training and test data are preprocessed first. The preprocessing includes an additional repair of non-standard words using kamus_katabaku, performed after case folding, and a repair of non-standard words using Levenshtein Distance normalization, performed after preprocessing. [6]

III. METHODOLOGY
In this study, the authors use a simulation method to examine public sentiment toward the Government's school zoning policy, using the K-Nearest Neighbor algorithm for classification and Levenshtein Distance to repair misspelled words. The simulation has several stages, namely:

A. Problem Formulation
The first stage is to identify the problems in the results of previous research.

B. Conceptual Model
At this stage, the authors design the conceptual model for the simulation. The first concept is the design of the text mining processes to be used. The second concept is to identify the comments that have been obtained, divide them into training and test data, and label them manually. The third concept is to create a test phase to see the accuracy of sentiment classification using Levenshtein Distance and KNN.

C. Input/Output Data
At this point, the authors define the input and output of the simulation. The input is 160 comments retrieved through the YouTube API, divided into 128 training data and 32 test data. The data are then processed using the K-NN algorithm and Levenshtein Distance, and the output is an accuracy value.

D. Modeling Phase
At this stage, the authors design the model scenarios. The first scenario uses the K-NN algorithm alone. The second scenario uses a combination of the K-NN algorithm and Levenshtein Distance.

E. Simulation Phase
At this stage, the system is run to simulate the performance of the algorithms according to the predetermined scenarios. The simulation consists of loading the input dataset, labeling the sentiment of the dataset, training on the training data, and classifying the test data. The result of the simulation is a comparison of the accuracy of the algorithms used in this research.

F. Verification, Validation and Experiment
At this stage, the authors verify and validate the previous stages, so that the simulation is ready to run and experiments can be performed in accordance with the model scenarios defined earlier.

G. Output Analysis
At this last stage, the authors analyze the output of the scenarios, namely by calculating the accuracy of the algorithms used in this study.

IV. IMPLEMENTATION AND RESULTS
Testing was carried out three times to ensure the accuracy of the system. The first stage of the research is preprocessing. The following flowchart shows the stages from preprocessing to sentiment labeling.

Fig. 1. Preprocessing stage and labeling

1. The first stage is case folding, which changes all words to lowercase.
2. The second stage is filtering, which removes characters such as URL links, hashtags, etc.
3. The third stage is tokenization, which breaks a sentence into individual words.
4. The fourth stage is stopword removal, which removes words deemed unimportant, such as I, you, he, etc.
5. The fifth stage is stemming, which changes words into their base forms.
6. The sixth stage is normalization of words using the Levenshtein distance: every word of the dataset is matched against the words of the KBBI, and the dictionary word with the smallest edit distance replaces it.
7. The seventh stage is labeling the sentiment as positive or negative.

• Case Folding

TABLE I. CASE FOLDING
Document Text                   Case Folding
Saya Bukannya Anakk ajaibb!!    saya bukannya anakk ajaibb!!

• Filtering

TABLE II. FILTERING
Document Text                   Filtering
saya bukannya anakk ajaibb!!    saya bukannya anakk ajaibb

• Tokenization

TABLE III. TOKENIZATION
Document Text                   Tokenization
saya bukannya anakk ajaibb      | saya | bukannya | anakk | ajaibb |

• Stopword Removal

TABLE IV. STOPWORD REMOVAL
Document Text                             Stopword Removal
| saya | bukannya | anakk | ajaibb |      | bukannya | anakk | ajaibb |

• Stemming

TABLE VI. STEMMING
Document Text                      Stemming
| bukannya | anakk | ajaibb |      | bukan | anakk | ajaibb |

• Normalization of Words

TABLE VII. NORMALIZATION
Document Text                   Normalization of Words
| bukan | anakk | ajaibb |      | bukan | anak | ajaib |

Every word of the dataset is matched against the words of the KBBI, and the dictionary word with the smallest edit distance replaces it.

        b  e  n  c  i
     0  1  2  3  4  5
  b  1  0  1  2  3  4
  e  2  1  0  1  2  3
  n  3  2  1  0  1  2
  c  4  3  2  1  0  1
  i  5  4  3  2  1  0
  i  6  5  4  3  2  1
  i  7  6  5  4  3  2
  i  8  7  6  5  4  3

Fig. 2. Calculation Of The Levenshtein Distance

In this example, the misspelled word "benciiii" is compared with the correct word "benci", and the smallest distance obtained is 3.

Following are the results of the classification using KNN and Levenshtein Distance.

TABLE V. RESULT KNN
Test         K Value    Accuracy
Testing 1    3          56.25%
Testing 1    5          53.125%
Testing 1    7          53.125%
Testing 1    9          53.125%
Testing 2    3          65.625%
Testing 2    5          62.5%
Testing 2    7          62.5%
Testing 2    9          62.5%
Testing 3    3          62.5%
Testing 3    5          59.375%
Testing 3    7          62.5%
Testing 3    9          62.5%

The results of the KNN and Levenshtein Distance tests in the table above show the accuracy obtained for each predefined k value.

Fig. 3. Calculation Of The Levenshtein Distance

The chart above displays the accuracy for the first test. The best result, 56.25%, is obtained with k = 3.
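The Levenshtein calculation used in the normalization stage, including the "benciiii" and "benci" example, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `levenshtein` and `normalize`, the tiny `SAMPLE_DICTIONARY` standing in for the KBBI word list, and the rule that the first entry wins on ties are all assumptions made for the example.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn source into target."""
    # previous[j] is the distance between the prefix of source handled
    # so far and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def normalize(word: str, dictionary: list[str]) -> str:
    """Return word if it is in the dictionary, otherwise the entry with
    the smallest edit distance (assumption: first entry wins on ties)."""
    if word in dictionary:
        return word
    return min(dictionary, key=lambda entry: levenshtein(word, entry))


# Hypothetical stand-in for the KBBI word list.
SAMPLE_DICTIONARY = ["anak", "ajaib", "benci", "bukan"]

print(levenshtein("benciiii", "benci"))        # 3
print(normalize("anakk", SAMPLE_DICTIONARY))   # anak
print(normalize("ajaibb", SAMPLE_DICTIONARY))  # ajaib
```

The sketch keeps only two rows of the distance matrix at a time, which yields the same bottom-right value as the full matrix. Scanning the whole dictionary for every word is linear in the dictionary size, so a real KBBI-sized list would probably call for a length filter or an index first.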
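The reported accuracy values can be checked against the stated test set of 32 comments, since accuracy is the number of correctly classified comments divided by the total. The sketch below only performs that arithmetic. The helper name `correct_count` is an assumption for illustration, and the split of correct predictions into true positives and true negatives is not given in the paper, so it is not reconstructed here.

```python
from fractions import Fraction

TEST_SET_SIZE = 32  # number of test comments stated in the paper


def correct_count(accuracy_percent: str, total: int = TEST_SET_SIZE) -> Fraction:
    """Convert a reported accuracy percentage into a count of correct
    predictions, using exact rational arithmetic to avoid rounding."""
    return Fraction(accuracy_percent) / 100 * total


# Accuracy values that appear in the abstract and in the results table.
reported = ["50", "53.125", "56.25", "59.375", "62.5", "65.625"]

for value in reported:
    print(f"{value}% of {TEST_SET_SIZE} = {correct_count(value)} correct")
```

Every reported percentage maps to a whole number of comments. For example, 65.625% corresponds to 21 of 32 and 50% to 16 of 32, which is consistent with a 32-comment test set.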
Fig. 4. Accuracy results using KNN and Levenshtein Distance on the second testing

Fig. 5. Accuracy results using KNN and Levenshtein Distance on the second testing

The chart above displays the accuracy for the second test. The best result, 65.625%, is obtained with k = 3.

Based on the tests of the combination of K-Nearest Neighbor and Levenshtein Distance described above, the highest accuracy, 65.625%, was obtained in the second test with k = 3. The increase in accuracy is caused by the normalization of misspelled words, which increases the frequency of the corrected words.

V. CONCLUSION AND SUGGESTIONS
Based on the discussion above, it can be inferred that:
1. The combination of K-Nearest Neighbor and Levenshtein Distance can be applied to sentiment analysis. Its highest accuracy, 65.625%, was obtained with k = 3 in the second test.
2. The combination of K-Nearest Neighbor and Levenshtein Distance can increase accuracy in classifying sentiment on the social media platform YouTube.
3. The increase in accuracy obtained with the combination of K-Nearest Neighbor and Levenshtein Distance is caused by the repair of inappropriate or misspelled words into words that conform to the KBBI.

Suggestions for future work are as follows:
1. Increase the amount of training data used in the classification process.
2. Use cross-validation to find the optimal accuracy.
3. Use other spell-checking algorithms to compare the resulting accuracy with that of the Levenshtein Distance algorithm.

REFERENCES
[1] B. P. Nugroho, "Ramai soal PPDB, begini aturan sistem zonasi sekolah," 4 July 2018. [Online]. Available: https://news.detik.com/berita/d-4097504/ramai-soal-ppdb-begini-aturan-sistem-zonasi-sekolah. [Accessed 28 June 2018].
[2] K. Data, "Inilah media sosial dengan pengguna aktif terbesar di Indonesia," Kata Data, 13 September 2017. [Online]. Available: https://databoks.katadata.co.id/datapublish/2017/09/13/inilah-media-sosial-dengan-pengguna-aktif-terbesar-di-indonesia. [Accessed 13 June 2018].
[3] S. Adi, "Perancangan klasifikasi tweet berdasarkan sentimen dan fitur calon gubernur DKI Jakarta 2017," vol. III, pp. 10-16, 2018.
[4] P. Antinasari, R. P. Setya and M. A. Fauzi, "Analisis Sentimen Tentang Opini Film Pada Dokumen Twitter Berbahasa Indonesia Menggunakan Naive Bayes Dengan Perbaikan Kata Tidak Baku," vol. I, pp. 1733-1741, 2017.
[5] S. Ernawati and R. Wati, "Penerapan Algoritma K-Nearest Neighbors Pada Analisis Sentimen Review Agen Travel," vol. VI, 2018.
[6] M. D. A. Putri, A. Syukur, A. Prihandono and D. Rosal, "Analisa Sentimen Untuk Penilaian Pelayanan Situs Belanja Online Menggunakan Algoritma Naïve Bayes," 2018.