Basepaper (Water Fraud)
Abstract—Fraudulent behavior in drinking water consumption is a significant problem facing water supply companies and agencies. This behavior results in a massive loss of income and forms the highest percentage of non-technical loss. Finding efficient measures for detecting fraudulent activities has been an active research area in recent years. Intelligent data mining techniques can help water supply companies detect these fraudulent activities and reduce such losses. This research explores the use of two classification techniques (SVM and KNN) to detect suspected fraudulent water customers. The main motivation of this research is to assist Yarmouk Water Company (YWC) in Irbid city, Jordan, to overcome its profit loss. The SVM-based approach uses customer load profile attributes to expose abnormal behavior that is known to be correlated with non-technical loss activities. The data has been collected from the historical data of the company billing system. The accuracy of the generated model reached over 74%, which is better than the current manual prediction procedures used by YWC. To deploy the model, a decision tool has been built using the generated model. The system will help the company predict suspicious water customers to be inspected on site.

Keywords— Fraud Detection, Data Mining, SVM, KNN, Water Consumption.

I. INTRODUCTION
Water is an essential element for households, industry, and agriculture. Jordan, like several other countries in the world, suffers from water scarcity, which poses a threat to all sectors that depend on the availability of water for the sustainability of their activities, development, and prosperity [1].

According to the Jordanian Ministry of Water and Irrigation, this issue has always been one of the biggest barriers to economic growth and development in Jordan. The crisis has been aggravated by a population that has doubled in the last two decades. Efforts of the Ministry of Water and Irrigation to improve water and sanitation services are constrained by managerial, technical, and financial determinants and by the limited amount of renewable freshwater resources [2].

To address these challenges, the Ministry of Water and Irrigation, as in many other countries, is striving, through the adoption of a long-term plan, to improve services provided to citizens through restructuring and rehabilitation of networks, reducing the non-revenue water rates, providing new sources, and maximizing the efficient use of available sources. At the same time, the Ministry continues its efforts to regulate water usage and to detect the loss of supplied water [2].

Water supply companies incur significant losses due to fraud in water consumption. A customer who tampers with water meter readings to avoid or reduce the billed amount is called a fraud customer. In practice, there are two types of water loss. The first is technical loss (TL), which is related to problems in the production system, the transmission of water through the network (i.e., leakage), and network washout. The second is non-technical loss (NTL), which is the amount of water delivered to customers but not billed, resulting in loss of revenue [3].

The management of Yarmouk Water Company (Jordan) is significantly concerned with reducing its profit losses, especially those derived from NTL, which was estimated at over 35% across the whole service area in 2012. One major part of NTL is customers’ fraudulent activities. The commercial department manages the detection process without an intelligent computerized system, and the current process is costly and neither effective nor efficient.

NTL is a serious problem facing Yarmouk Water Company (YWC). In 2012 the NTL reached over 35%, ranging from 31% to 61% according to district, which results in a loss of 13 million dollars per year. Currently, YWC relies on random inspections of customers. The model proposed in this paper provides a valuable tool to help YWC teams detect theft, which will reduce NTL and raise profit.

The literature has abundant research on Non-Technical Loss (NTL) in electricity fraud detection, but little research has been conducted for the water consumption sector. This paper focuses on customers’ historical data, selected from the YWC billing system. The main objective of this work is to use two well-known data mining techniques, Support Vector Machines (SVM) and K-Nearest Neighbor (KNN), to build a suitable model to detect suspected fraudulent customers, based on their historical metered water consumption.
II. RELATED WORK
This section reviews some applications of data mining classification techniques to fraud detection in different areas, such as detection of fraudulent financial statements, fraud detection in mobile communication networks, credit card fraud detection, and fraud detection in medical claims. For example, Kirkos et al. [4] proposed a model for detecting fraud in financial statements, in which three data mining classifiers were used, namely Decision Tree, Neural Network, and Bayesian Belief Network. Sahin and Duman [5] introduced a model for credit card fraud detection using decision trees and support vector machines (SVM). In addition, Panigrahi et al. [6] proposed a model for credit card fraud detection using a rule-based filter, a Bayesian classifier, and a Dempster-Shafer adder.

Carneiro et al. [7] developed and deployed a fraud detection system at a large e-tail merchant. They explored the combination of manual and automatic classification and compared different machine learning methods. Ortega et al. [8] proposed a fraud detection system for medical claims using data mining methods. The proposed system uses multilayer perceptron (MLP) neural networks. The researchers showed that the model was able to detect 75 fraud cases per month.

Kusaksizoglu [9] introduced a model for detecting fraud in mobile communication networks. The results showed that the neural network methods MLP and SMO were found to give the best results. In addition, Chen et al. [10] proposed and developed an integrated platform for fraud analysis and detection based on real-time messaging communications in social media.

Nagi et al. [11] [12] [13] introduced a technique for classifying fraudulent behavior in electricity consumption. The proposed method is a combination of two algorithms, Genetic Algorithm (GA) and Support Vector Machine (SVM), which yields a hybrid model (named GA-SVM). The technique processed customers’ past consumption profiles to reveal abnormal consumption among the customers of the Tenaga Nasional Berhad (TNB) electricity utility in Malaysia. After investigation, four categories of customers were found (change of tenant, replaced meter, faulty meter, and abandoned house). An expert system was designed to remove such customers by considering characteristics that distinguish these four categories from theft customers. The authors reported that this intelligent system raised the detection hit rate from 3% under the company’s existing procedures to 60% after onsite inspection.

Ramos et al. [3] presented an optimum-path forest (OPF) classifier to detect fraud customers in electricity consumption. The classifier was compared with other robust classifiers: ANN, SVM-Linear, and SVM-RBF. The results showed that OPF accuracy is similar to that of SVM-RBF but that OPF is superior in training time, which enables real-time classification. The accuracy of the other two classifiers was not comparable.

León et al. [14] suggested a model that can reveal electricity fraud customers. The data was obtained from the Spanish Endesa Company. The classification model is based on Generalized Rule Induction (GRI) and Quest Decision Tree methods. Furthermore, they introduced two statistical estimators, which are used to weigh customers’ trend and non-constant consumption. The model helps identify abnormal consumption, which may arise from anomalies without fraud, so that those customers can easily be re-billed, or from fraud customers, where adequate procedures can take place. The accuracy of the model reached 22%.

Filho et al. [15] applied a decision tree classification technique to the detection of suspected fraud customers and corrupted measurement meters. They used five months of customers’ consumption data, in which customers were classified as fraud or non-fraud. The technique raised the hit rate from 5% using current techniques to 40%.

Jiang et al. [16] suggested an approach using wavelet techniques and a group of classifiers to automatically detect fraud customers in electricity consumption. The wavelet technique was used to express the properties of the meter readings. These readings were used to build models using several classifiers, based on the assumption that abnormalities in consumption appear when fraud occurs. Cabral et al. [17] introduced a fraud detection system using data mining techniques for high-voltage electricity customers in Brazil. The techniques compare customers’ historical data with current consumption and present the possible fraud status. Customers whose consumption falls below their regular level are flagged for investigation by the company inspection team.

De Faria et al. [18] presented a use case of forensic investigation procedures applied to detect electricity theft based on tampered electronic devices. Viegas et al. [19] provided an extended literature review with an analysis of a selection of scientific studies on detection of non-technical losses in the electric grid reported since 2000 in three well-known databases: ScienceDirect, ACM Digital Library, and IEEE Xplore.

Coma-Puig et al. [20] developed a system that detects anomalous meter readings on the basis of models built from past data using machine learning techniques. The system detects meter anomalies and fraudulent customer behavior (meter tampering), and it was developed for a company that provides electricity and gas. Richardson et al. [21] introduced a novel privacy-preserving approach to detecting energy theft in smart grids. Malicious behavior is detected by calculating the Euclidean distance between energy output measurements from installations over a day. These distances are then clustered to identify outliers and potentially malicious behavior.

The available literature on detecting the fraudulent activities behind Non-Technical Loss in water consumption is limited in comparison to other sectors such as electricity consumption and finance. For example, Monedero et al. [22] developed a methodology consisting of a set of three algorithms for the detection of meter tampering at the Emasesa Company (a water distribution company in Seville).

Humaid’s work [23] is the only research conducted in the Arab region related to suspicious water consumption activities. Humaid used data mining techniques to discover fraudulent water consumption by customers in Gaza city. The historical
2018 9th International Conference on Information and Communication Systems (ICICS)
data of water consumption was used as a training dataset to build the intelligent model. The author focused on the support vector machine (SVM) classifier and compared it with the K-Nearest Neighbor (KNN) classifier and the Neural Network (ANN) classifier. As the monthly water consumption data were scattered over 44 tables, Humaid [23] unified them in one table representing customers’ consumption for 144 months, from 03/2000 to 02/2012. At the beginning of the period, consumption was read every two months, so null values appeared in the data. To overcome this, he divided each reading by 2 and assigned the result equally to the two related months. Then an attribute, ‘Fraud_Status’, was added as the class label indicating the fraudulent status of each customer.

generalization; it merely stores the training tuples or instances [25]. KNN works by comparing a given test tuple with the training tuples that are closest or most similar to it. KNN is therefore based on analogy. The training tuples are stored as points in an n-dimensional pattern space. For an unknown test tuple, KNN finds the k training tuples that are closest to it; these tuples are the k nearest neighbors. The test tuple is classified by the majority vote of its k neighbors.

Similarity can be measured using several distance metrics, such as the Euclidean distance. Let two tuples be x_1 = (x_{11}, x_{12}, ..., x_{1n}) and x_2 = (x_{21}, x_{22}, ..., x_{2n}). The Euclidean distance (dist) between x_1 and x_2 is computed using equation (1):

dist(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}    (1)
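The KNN procedure described above (store the training tuples, find the k closest ones by Euclidean distance, take a majority vote) can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in the paper: the function names, the three-month consumption profiles, and the class labels "normal" and "fraud" are all assumptions made up for the example.

```python
import math
from collections import Counter


def euclidean_distance(x1, x2):
    """Euclidean distance between two equal-length numeric tuples."""
    if len(x1) != len(x2):
        raise ValueError("tuples must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def knn_classify(training, test_tuple, k=3):
    """Classify test_tuple by majority vote of its k nearest training tuples.

    training is a list of (attributes, label) pairs.
    """
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of training tuples")
    # Sort the stored tuples by distance to the test tuple and keep the k closest.
    neighbours = sorted(
        training, key=lambda item: euclidean_distance(item[0], test_tuple)
    )[:k]
    # Majority vote over the labels of the k nearest neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical monthly consumption profiles (cubic metres); illustrative only.
training = [
    ((30.0, 32.0, 31.0), "normal"),
    ((28.0, 29.0, 30.0), "normal"),
    ((33.0, 31.0, 34.0), "normal"),
    ((30.0, 12.0, 5.0), "fraud"),
    ((29.0, 10.0, 4.0), "fraud"),
]

prediction = knn_classify(training, (31.0, 11.0, 6.0), k=3)
print(prediction)
```

With an odd k and two classes, as here, the vote cannot tie; for other settings a tie-breaking rule would have to be chosen explicitly.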
are solved the bills are approved. The legacy billing system was developed in the mid-1980s in COBOL. It is still in service and is installed on a mainframe running the OpenVMS platform. Yarmouk Water Company implemented an HHU billing system in Oct 2011. This system is intended to issue the bills in the field. The HHU billing system is integrated with the COBOL-based billing system, and all the bills are computed and issued from this system.

The commercial departments in the ROUs try to fight water theft. When a zone is supplied with water, a random inspection of customers’ properties and water connections is performed. If a theft case is detected, the inspectors record the case on a dedicated form and return the form to the department, and a penalty is imposed on that customer.

B. Data Understanding
The essential part of the data mining process is the data itself. The following sections characterize the structure and nature of the collected data. This paper is limited to the data collected from the billing system, which is mainly used for issuing customers’ water bills. Suitable COBOL programs have been developed to extract the most important customer billing data into text-format data files. Oracle tables were created with a format similar to the COBOL data files. The tables are the customers’ main information table, the customers’ water consumptions table, and the customers’ payments table. The description of the customers’ water consumptions table (relation) is presented in Table 1.

Attribute          Description
DIST_NO            The district number
TOWN_NO            The village/town number
CONS_NO            Customer number
CONS_NAME          Customer name
INFRACTION_NO      Infraction number
INFRACTION_DATE    Infraction date

C. Data Preparation
This phase of knowledge discovery requires considerable effort to prepare the data with high quality and in a format suitable for the later modeling phase. The data preparation phase includes the following tasks: Data Preprocessing, Customer Filtering and Selection, Features Extraction, Data Normalization, and Feature Adjustment.
• Data Preprocessing
The main steps of this phase are illustrated in Figure 1. The consumption table of the historical customers’ data contains around 16 million records for 109 thousand customers. It includes the consumptions for the interval from 1990 to the current time. The customers’ consumption records related to the Qasabat Irbid ROU comprise around 1.5 million consumption records for around 90 thousand customers. The consumption data for the customers are stored in a vertical format, as shown in Figure 2.

[Figure 1. Data preprocessing steps (figure not recovered; first box: Historical customers’ data)]
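To make the "vertical format" mentioned above concrete, the sketch below shows one consumption record per customer per billing period and regroups the records into one consumption profile per customer. It is an illustration under stated assumptions rather than the paper's preprocessing code: only the CONS_NO attribute name comes from the text, while the period strings, the consumption values, the function name `to_horizontal`, and the decision to mark missing periods with None are hypothetical.

```python
from collections import defaultdict

# Hypothetical vertical records: (CONS_NO, billing period, consumption in m3).
vertical_records = [
    (1001, "2012-01", 14.0),
    (1001, "2012-02", 15.0),
    (1001, "2012-03", 3.0),
    (1002, "2012-01", 22.0),
    (1002, "2012-03", 21.0),
]


def to_horizontal(records):
    """Regroup vertical records into one row of readings per customer.

    Returns (periods, profiles): the sorted list of billing periods and a
    dict mapping each customer number to its readings in period order.
    Periods with no record are kept as None so that gaps stay visible.
    """
    periods = sorted({period for _, period, _ in records})
    by_customer = defaultdict(dict)
    for cons_no, period, consumption in records:
        by_customer[cons_no][period] = consumption
    profiles = {
        cons_no: [readings.get(period) for period in periods]
        for cons_no, readings in by_customer.items()
    }
    return periods, profiles


periods, profiles = to_horizontal(vertical_records)
for cons_no, profile in profiles.items():
    print(cons_no, profile)
```

A per-customer row of this kind is the shape that distance-based classifiers such as KNN expect, which is why a regrouping step like this is a plausible part of preprocessing; how the paper actually handles missing periods is not stated in the text shown here.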
than random manual inspections held by YWC teams, with a hit rate of around 1% in identifying fraud customers. This model introduces an intelligent tool that can be used by YWC to detect fraud customers and reduce its profit losses. The suggested model helps save the time and effort of Yarmouk Water Company employees by identifying billing errors and corrupted meters. With the use of the proposed model, water utilities can increase cost recovery by reducing administrative Non-Technical Losses (NTLs) and increasing the productivity of inspection staff through onsite inspections of suspected fraud customers.

Fig. 4. The Prediction System of the Fraud Detection.

ACKNOWLEDGMENT
The authors would like to thank Yarmouk Water Company for providing the data to be used for the purpose of this study.

REFERENCES
[1] N/A, “Jordan Water Sector Facts & Figures”, Ministry of Water and Irrigation of Jordan, Technical Report, 2015.
[2] N/A, “Water Reallocation Policy”, Ministry of Water and Irrigation of Jordan, Technical Report, 2016.
[3] C. Ramos, A. Souza, J. Papa and A. Falcao, “Fast non-technical losses identification through optimum-path forest”, In Proc. of the 15th Int. Conf. Intelligent System Applications to Power Systems, 2009, pp. 1-5.
[4] E. Kirkos, C. Spathis and Y. Manolopoulos, “Data mining techniques for the detection of fraudulent financial statements”, Expert Systems with Applications, 2007, 32: 995–1003.
[5] Y. Sahin and E. Duman, “Detecting credit card fraud by decision trees and support vector machines”, IMECS, 2011, Vol. I, pp. 16–18.
[6] S. Panigrahi, A. Kundu, S. Sural and A. Majumdar, “Credit card fraud detection: a fusion approach using Dempster–Shafer theory and Bayesian learning”, Information Fusion, 2009, 10(4): 354–363.
[7] N. Carneiro, G. Figueira and M. Costa, “A data mining based system for credit-card fraud detection in e-tail decision support systems”, Decision Support Systems, 2017, 95(C): 91-101.
[8] P. Ortega, C. Figueroa and G. Ruz, “A medical claim fraud/abuse detection system based on data mining: a case study in Chile”, In Proc. of DMIN, 2006.
[9] B. Kusaksizoglu, “Fraud detection in mobile communication networks using data mining”, Master thesis, Bahcesehir University, Department of Computer Engineering, 2006.
[10] C. Liang-Chun, H. Chien-Lung, L. Nai-Wei, Y. Kuo-Hui and L. Ping-Hsien, “Fraud analysis and detection for real-time messaging communications on social networks”, IEICE Trans. Inf. & Syst., 2017, Vol. E100–D, No. 10, pp. 2267-2274.
[11] J. Nagi, K. Yap, S. Tiong, S. Ahmed and A. Mohammad, “Detection of abnormalities and electricity theft using genetic support vector machines”, In Proc. IEEE TENCON Region 10 Conf., 2008, pp. 1-6.
[12] J. Nagi, A. Mohammad, K. Yap, S. Tiong and S. Ahmed, “Non-technical loss analysis for detection of electricity theft using support vector machines”, In Proc. IEEE 2nd International Power and Energy Conference, 2008, pp. 907-912.
[13] J. Nagi, K. Yap, S. Tiong, S. Ahmed and M. Mohamad, “Nontechnical loss detection for metered customers”, IEEE Transactions on Power Delivery, 2010, 25(2): 1162-1171.
[14] C. León, F. Biscarri, I. Monedero, J. Guerrero, J. Biscarri and R. Millán, “Variability and trend-based generalized rule induction model to NTL detection”, IEEE Transactions on Power Systems, 2011, 26(4): 1798-1807.
[15] J. Filho, E. Gontijio, A. Delaiba, E. Mazina, J. Cabral and J. Pinto, “Fraud identification in electricity company customers using decision tree”, IEEE International Conference on Systems, Man and Cybernetics, 2004, 4: 3730–3734.
[16] R. Jiang, H. Tagiris, A. Lachsz and M. Jeffrey, “Wavelet-based features extraction and multiple classifiers for electricity fraud detection”, In Proc. IEEE/PES Transmission and Distribution Conf. Exhibit., 2002.
[17] J. Cabral, J. Pinto, E. Martins and A. Pinto, “Fraud detection in high voltage electricity consumers”, 2008.
[18] R. De Faria, K. Ono Fonseca, B. Schneider and S. Nguang, “Collusion and fraud detection on electronic energy meters - a use case of forensics investigation procedures”, In 2014 IEEE Security and Privacy Workshops, pp. 65-68.
[19] J. Viegas, P. Esteves, R. Melicio, V. Mendes and S. Vieira, “Solutions for detection of non-technical losses in the electricity grid: a review”, Renewable and Sustainable Energy Reviews, 2017, 80: 1256-1268.
[20] B. Coma-Puig, J. Carmona, R. Gavaldà, S. Alcoverro and V. Martin, “Fraud detection in energy consumption: a supervised approach”, In Proc. IEEE Intl. Conf. on DSAA, 2016, pp. 120-129.
[21] C. Richardson, N. Race and P. Smith, “A privacy preserving approach to energy theft detection in smart grids”, 2016 IEEE International Smart Cities Conference (ISC2), Trento, pp. 1-4.
[22] I. Monedero, F. Biscarri, J. Guerrero, M. Roldán and C. León, “An approach to detection of tampering in water meters”, Procedia Computer Science, 2015, 60: 413-421.
[23] E. Humaid, “A data mining based fraud detection model for water consumption billing system in MOG”, Master thesis, Islamic University of Gaza, Deanery of Higher Studies, Information Technology Program, Department of Computer Science, 2012.
[24] C. Cortes and V. Vapnik, “Support-vector networks”, Machine Learning, 1995, 20(3): 273-297.
[25] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, 3rd Ed., Morgan Kaufmann, 2012.
[26] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer and R. Wirth, “CRISP-DM 1.0: step-by-step data mining guide”, SPSS Inc., USA, 2000.
[27] I. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes and S. Cunningham, “WEKA: practical machine learning tools and techniques with Java implementations”, In Proc. of the ICONIP/ANZIIS/ANNES Workshop on Emerging Knowledge Engineering and Connectionist-Based Information Systems, 1999, pp. 192–196.
[28] C. Chang and C. Lin, “LIBSVM: a library for support vector machines”, ACM Transactions on Intelligent Systems and Technology, 2011, 2: 27:1-27:27.
[29] Y. EL-Manzalawy and V. Honavar, “WLSVM: integrating LibSVM into WEKA environment”, 2005.