Redes Neuronales - 2015
DOI 10.1007/s10922-015-9348-6
J Netw Syst Manage
1 Introduction
Plain Old Telephone Service (POTS) networks were developed over a long period
of time during the last century. Due to their long-term development and fault-
tolerant design, these systems are reliable and work in a stable manner. Fault
tolerance is achieved by duplication of the essential parts of access and transit
switches as well as transmission and management systems. Even though an IP
(Internet Protocol)-based broadband telecommunication network in Croatia was
introduced to the mass market more than 10 years ago, and the quality of the
network is continuously improving, it still hasn’t reached the level of reliability that
the POTS networks have. Therefore, fault detection, diagnosis, and correction are
still major concerns for a telecom operator. The parameter that best reflects the
quality of a network regarding fault occurrence is the mean time between failures
(MTBF). The service complexity, longer average service usage time, many more
instances of terminal equipment compared to the POTS network, and a higher
bandwidth demand on the access network all have an impact, so the MTBF of
broadband services is 2–6 times lower than the MTBF of narrowband services.¹
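As a rough illustration of the MTBF arithmetic behind this comparison (the figures below are hypothetical, not taken from the operators' analysis), MTBF is simply the observed operating time divided by the number of failures in that time:

```python
def mtbf(total_operating_hours, num_failures):
    """Mean time between failures: operating time divided by failure count."""
    if num_failures == 0:
        raise ValueError("no failures observed; MTBF is undefined")
    return total_operating_hours / num_failures

# Hypothetical figures: one narrowband and one broadband line over a year.
narrowband_mtbf = mtbf(8760, 2)   # 4380 hours between failures
broadband_mtbf = mtbf(8760, 8)    # 1095 hours between failures
ratio = narrowband_mtbf / broadband_mtbf  # 4.0, inside the reported 2-6 range
```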
A broadband network, as opposed to a POTS network, includes a multitude of
active and passive elements that can be subject to fault. Elements most susceptible
to faults are as follows: an Asymmetric Digital Subscriber Line (ADSL) modem,
customer’s equipment, Internet Protocol TeleVision (IPTV) set-top box, ADSL
splitter, home installation, copper twisted pair, distribution point, main distribution
frame, fibre optic cable, and ADSL Digital Subscriber Line Access Multiplexer
(DSLAM) port. Faults, as commonly defined, are problems that can be detected and
handled directly. A consequence of a fault, i.e., its manifestation, is a discrepancy
between some observed value or condition and a true, specified, or theoretically
correct value or condition. Faults are usually reported by a surveillance system in
the form of alarms. Generally, faults can be caused by the following:
¹ The range of values of this MTBF reduction factor has been published as a result of an internal technical analysis encompassing networks of 16 telecom operators in Western and Central Europe. The range (2–6) is quite large because of considerable differences in the equipment that is installed in the analyzed national networks and because of different efficiencies of the fault-repair systems and processes implemented.
Some faults result in the service delivered deviating from the agreed or specified
service that is visible to the outside world. The term failure is used to denote this
situation. Failures are commonly defined as follows: a system failure occurs when
the service delivered deviates from the specified service, where the service
specification is an agreed description of the expected service [1]. Similar definitions
can be found in papers of Melliar-Smith and Randell [2], Laprie and Kanoun [3],
and Salfner [2–4]. The main point here is that a failure refers to misbehavior that
can be observed by the user, which can either be a human or another software or
hardware component. Failures can be reported either by a surveillance system or by
users. For example, the most common failures that can be reported by users are
complete interruption of a service, low downstream bandwidth, inability to access
web sites, noise during Voice Over Internet Protocol (VoIP) phone calls, inability to
establish a phone call, and problems with IPTV service like error blocks or
jerkiness.
There are many other problems that can affect customer service; for IPTV, the following issues are listed: Tiling, Ringing, Quantization Noise, Aliasing Effects, Artifacts, Object Retention, Slice Losses, Blurring, and Color Pixelation [5, 6].
In some cases, faults are not recognized immediately from the systemic alarms,
but later they become apparent due to failures reported by users. More often,
occurrence of a fault is accompanied by one or more alarms while users report
failures afterward. Operators can minimize failure occurrence with proper design
and preventive maintenance of the network. In order to resolve a failure, the fault
that caused it has to be detected and fixed. Failures should be eliminated as soon as
possible for the sake of the customers’ satisfaction as well as respecting signed
Service-Level Agreements (SLA) and rules laid down by the regulatory agencies.
They need to be resolved reactively after a user complaint, but it would be better to
act preventively and proactively. The enabler for this is failure prediction.
Generally, proactivity based on failure prediction increases overall quality of
service (QoS) and a customer’s perception of the QoS has a major impact on her
satisfaction and loyalty.
Two types of failure predictions are considered in the literature: online failure-
occurrence prediction and quantity-of-failures prediction. The aim of online
prediction is to predict the occurrence of failures during runtime based on
system-state monitoring, [4]. This type of prediction can enable proactive action and
thus, directly increase customer satisfaction and loyalty at the individual level. On
the other hand, accurate prediction of the expected number of failures (quantity of failures) that will be reported by customers of a broadband network is becoming increasingly important. Predicting the quantity of reported failures several days in advance enables operators to plan and allocate necessary resources and can considerably decrease operational costs.
2 Related Work
Various systems, such as SHRINK [7], NetworkMD [8], and Draco [9], have been developed for real-world networks; their purpose is to enable proactive action based on network analysis and network diagnosis. Improvement of performance
management and network reliability in similar types of networks are analyzed in
papers [10–12].
Generally, proactivity assumes existence of data and knowledge about processes
as well as efficient intelligent methods for data analysis, learning, and predictions.
Selection of the optimal prediction method depends on the nature of the processes
being modeled, data availability, and the duration of the monitoring period, as well
as on adaptability of involved operational support systems. In order to improve the
accuracy of prediction models, research is conducted in two directions. First, there
are efforts to improve performance of the existing prediction methods, e.g., to
develop a new training method for a neural network or to propose a new network
topology, and second, researchers are developing their own predictive models
customized to a specific application [13, 14]. An interesting example of multivariate
forecasting is presented in [13], where the authors develop their own predictive
model to forecast the overall sales of retail products. Their model consists of three
modules: Data Preparation and Pre-processing (DPP), Harmony search Wrapper-
based Variable Selection (HWVS), which prunes redundant and irrelevant variables
and selects out the optimal input variable subset, and Multivariate Intelligent
Forecaster (MIF) used to establish the relationship among variables and forecast the
sales’ volumes. The proposed model has proved to be effective in handling
multivariate forecasting problems. Similar principles for using three-stage predic-
tive models (preprocessing-selecting-forecasting) were used extensively in the field
of forecasting; this idea is used in our work as well. Variation and evaluation of
different configurations of neural networks are frequently encountered for predictive
purposes. In [14], the authors compare different types of neural networks with their
– The users' average daily usage of services—whether the user is using the service at the time or shortly after the fault occurs (e.g., after midnight service usage is minimal, so almost no reporting exists);
– The users' expected actions/behavior (active or passive/indifferent) at the moment when they become aware of a service failure (whether the user knows the reporting procedure, tries to fix the problem alone, has a habit of calling the call center, or habitually waits passively until the service starts to work again, etc.).
[Figure: Relationship between the fault occurrence process, the failure reporting process, and the failure fixing/handling process. Environmental factors (lightning, humidity, electrical discharges, ice), human factors (improper handling), and user behavior and service usage feed the fault occurrence process, which produces detected and undetected faults. Failures noticed by the customers lead to trouble ticket opening in the failure reporting process, while information about faults affecting a whole group of customers flows directly into the fixing/handling process.]
[Figure: Topology of the broadband network, showing DSLAMs in the access part, PE routers on Metro Ethernet aggregation rings, and an IP/MPLS core of label edge routers (LER) and label switch routers (LSR) connecting the VoIP SIP server, the IPTV & VOD content center, and the Internet exchange/ISP. The numbered regions (1)–(3) mark the three parts of the network discussed in the text.]
traffic flows into an aggregation card and is transported to the network through
Gigabit Ethernet rings (Metro Ethernet transport). Cables with twisted copper
pairs that form a part of the access network were inherited from the Public
Switched Telephone Network (PSTN). Broadband technology has imposed
much higher requirements on this part because the speed, and therefore the
spectrum, have increased dramatically. This has introduced new problems. The
physical link between user and the DSLAM port is a twisted copper pair in a
subscriber cable. Access via the copper pair is, at the moment, the most
common kind of access to the network. In each area of central or remote
subscriber access there is a main distribution frame (MDF), which marks the
beginning of the subscriber lines. The end point of the access part is a
Distribution point behind which the customer installation begins.
3. The third, user part (3) includes network termination equipment (ADSL
modem, Splitter), other customer premises equipment (IPTV set-top box,
television set, handset and other devices) and in-house customer installations.
This part of the network is spatially the most extensive.
A variety of elements in all three parts are possible locations of faults. The fault-
management system is designed to record all faults detected and failures resolved.
The data about alarms are entered into the database automatically while other
data are entered by technicians during the resolving process. The result is that the
database gives an accurate insight into the faults, causes of faults, and failures that
have been reported. By analyzing operational data on faults and their locations we
get the distribution of faults displayed in Table 1.
The majority of faults, 70.86 %, occur in the customer part of the network. Of
these, 34.55 % relate to the user equipment, 14.36 % to the ADSL modem (router)
and 12.36 % to the in-house customer installation. In the access part we find
26.53 % of the faults, while the rest, or 2.61 %, occur in the core part of the
network. Over the years, causes have been recorded for each fault detected. Their
frequencies are shown in Table 2. The majority of failures that occur in the user part of the network (34.89 %) are caused by users themselves (by improper handling
and wrong initial settings) or because of errors in CPE software.
The distribution of faults by locations together with the preponderance of faults
in the user part of the network have an influence on the shape of the time series of
reported failures. In fact, the dynamics of service usage by customers introduces
periodicity and seasonality in the time series. Slight discrepancies between
percentages related to total quantities of faults per parts of the network (Tables 1,
2) arise from the measurement noise already mentioned in the Introduction. Table 3
shows an overview of the most frequent causes of faults recorded during the period
2010–2012.
The diagram in Fig. 3 shows annual quantities of reported failures by services in the last 10 years. During this period the transition from the POTS network to the broadband network took place.

[Fig. 3: Annual numbers of reported failures (0–350,000) by service (ADSL, IPTV, POTS, ISDN), 2003–2012.]

The number of failures related to the traditional
services such as POTS voice telephony and Integrated Services for Digital Network
(ISDN) services is in decline because these services are being replaced by
alternative services on broadband platforms. Another reason for the decline of POTS failures is the migration to shared models of service provision among multiple service providers. On the other hand, the number of failures on broadband
services such as ADSL and IPTV grows along with the total number of users of
these services. Note that the total number of failures in the whole observed period
increases, and this trend is expected to continue in the upcoming years.
The failure-reporting process can be represented by a time series. These are
stochastic series whose future development in time can be estimated based on
previous values. Thanks to the fault-management system we have precise data about
failure reporting, i.e., daily, weekly, monthly and annual series on the number of
reported failures are available. Sampling was carried out at equal intervals, so the interval sequences are cumulative in nature and can be treated as discrete-time series.
As we said before, the failure-reporting process is strongly driven by customers.
Failure reporting has daily dynamics that depend on the users’ habits of using
services and on their actions when they notice the presence of failure. Therefore, the
time series representing the number of reported failures displayed in hourly and
daily intervals reveals periodicity in time (Figs. 4, 5).
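This periodicity can be checked numerically with a sample autocorrelation; the sketch below uses a synthetic hourly series with a built-in 24-hour cycle rather than the operator's data:

```python
import math

def autocorr(series, lag):
    """Sample autocorrelation of a series at a given lag."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i - lag] - mean) for i in range(lag, n))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

# Synthetic hourly failure counts over two weeks with a 24-hour cycle
# (usage, and hence reporting, is minimal during the night).
series = [100 + 80 * math.sin(2 * math.pi * h / 24) for h in range(24 * 14)]

r24 = autocorr(series, 24)  # strong positive: lag matches the daily period
r12 = autocorr(series, 12)  # negative: half a period out of phase
```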
Daily or weekly patterns simply reflect work activities (usage of services) of
residential and business customers during a day or week (Monday–Saturday). In the
daily pattern a notable drop happens during the night while in the weekly pattern a
reduction in the usage of the services on Sundays results in a lower number of
reported failures. In a time series representing reported failures displayed in weekly
and monthly intervals (monthly and yearly patterns) no periodicity in time is notable
(Figs. 6, 7), because the cumulative number of reported failures in a week or in a
month is more under the influence of random factors, such as bad weather or
unexpected breakdowns in the network.

[Fig. 4: Number of reported failures per hour over roughly 100 hours; the daily periodicity is visible.]
[Fig. 5: Number of reported failures per day over about 70 days; the weekly periodicity is visible.]
[Fig. 6: Number of reported failures per week over about 70 weeks.]

In regular circumstances, time series have a common and recognizable shape. However, unexpected events such as core network
element malfunctions or thunderstorms can significantly affect the shape of the
curve. One such anomaly that occurred in the 21st week can be seen in Fig. 6.
Furthermore, the sharp decline that reaches a minimal value in the 52nd week is
caused by characteristics of the calendar, i.e., by an incomplete last week in the
year. Prediction of such anomalies on the curve is important in order to reserve the
human resources that can prevent or resolve additional failures. A common daily
distribution of reported failures (working day) is shown in Fig. 8 (gray line).
These are ‘‘regular failures’’ that occur mainly in the access network, and are
caused by problems in the customer equipment; their reporting can be expected in
similar daily distributions. For these quantities of failures, telecom operators have
reserved resources to deal with their removal. However, in some situations
anomalies occur, i.e., incidents that lead to an increased number of failures. One
such anomaly recorded by the management system is shown in Fig. 8. The black-
colored addition to the common shaped baseline curve represents the increased
number of reported failures caused by a serious fault in equipment. Two things are
important to define a prediction model of good quality: knowing the characteristics of the common series (trend, periodicity) and detecting the main external factors that cause substantial increases in the number of reported failures.

[Fig. 7: Cumulative number of reported failures per month over about 30 months.]
[Fig. 8: Hourly distribution of reported failures during a working day: the normal baseline (gray) and an anomaly caused by a serious equipment fault (black).]
Estimates of future failures based only on past information about quantities, without recognizing the external factors (environmental and human factors) and their influences, can be optimistic and unreliable. For example, the access
and customer parts of the network are susceptible to environmental factors.
Particular weather conditions (lightning or high humidity) can result in negative
influences on lines and equipment, leading to sharp increases in the failure rate. In
the next Section we show that the additional quantity of failures that appear under
the influence of external factors (like adverse weather conditions) can be
successfully predicted by the recurrent neural network model. The multivariate modeling
concept is introduced to reflect the effect of continuously varying influences of
internal and external factors. It is very important that the model is scalable in a way
that allows inclusion of additional factors that will subsequently be detected as relevant.
A large variety of data from network management systems or data about external
conditions are now available to the service providers. In this Section, we discuss the
influence of the data and other characteristics of the broadband network
environment and its processes on the choice of prediction method, predictor type,
topology of predictor, and learning method. On the other hand, comprehensive data
analysis and evaluation of the significance of input variables represent a
precondition for development of a multivariate prediction model of good quality
that encompasses the most relevant predictor variables. With such an approach it is
possible to eliminate redundancy, enhance processing efficiency, and improve
prediction accuracy.
The data sets that have been used in this study were obtained from three different
sources.
The first source—the Trouble Tickets database—contains information related to
trouble reporting and troubleshooting. Three fields from the database were used, see
Table 4. The second data source—the Error Logging database—is a component of
the Network Management System. It includes information about network-element
outages (alarm logs). Relevant data extracted from the Error Logging database is
shown in Table 5.
Finally, the third data source—the Meteorological database—contains data from
external sources. These data represent daily readings of meteorological measure-
ments from 3 main regional centers in Croatia (Zagreb, Split, and Rijeka) that cover
the most populated areas in the country. Seven relevant fields were extracted from
the Meteorological database, Table 6.
Table 4 Relevant data extracted from the Trouble Tickets database (Trouble Tickets table)
ID Field name Field description
1 Faulty_Service Affected service, identified according to customer’s reports. For the purpose
of this study only ADSL and IPTV related services have been selected
2 Reporting_Time The time at which the customer who reported the failure called the contact
center
3 General_Description General description of a failure and possibly additional text about noticed
causes
Table 5 Relevant data extracted from the Error Logging database (Alarm Logs table)
ID Field name Field description
1 Element_Name DSLAM identification. Unique ID for the entire network. This field is used as a link to the Trouble Tickets table
2 Fault_Type Fault type. Possible types are: breakdown, service degradation or
occasionally occurring fault, and announced work
3 Fault_Cause Causes of problems are grouped as software errors, hardware failure,
transmission, and power supply
4 Alarm_Start_Time The time at which the alarm first appeared
5 Alarm_End_Time Alarm ceasing time, after repair
6 Affected_Customers The number of customers affected by the network-element outage
We analyzed data collected during the period from January 2012 to August 2012.
The total number of failure reports recorded in the Service Management Center in
this period was 585,000, while the number of network-element outages in the same
period was 591. There were a total of 103 rainy days, 53 days with lightning,
23 days with snow, 2 days with fog, and a day with hail observed at all 3 meteo-stations in the period of observation.
Fig. 9 Armstrong’s decision tree that helps in the selection of an appropriate prediction method
In [22], Armstrong developed a decision tree (Fig. 9) that helps in the selection
of an appropriate method. Bearing in mind the characteristics of the broadband
network environment and its processes, the following facts, which are relevant to
the selection of the method, can be stated:
– Data sets of sufficient size and accuracy are available—input data for prediction can be obtained from Operations Support Systems (OSS), Business Support Systems (BSS), and external sources. In the research described in this paper, we used actual data exceeding 1.5 million items collected by performance and fault management systems in the period 2009–2012, data about users' habits, data warehousing, and external sources (meteo-data logs and data about relevant announced events);
– Good knowledge about the relationships between relevant variables is missing;
– Data type—discrete time series;
– The inner nature of the system is not well known;
– The system is massive, with inertia, i.e., there is a low probability of changing conditions in the system during the period of predictions.
Considering these facts in relation to the Armstrong decision tree reveals that the
most appropriate prediction approaches should be: extrapolation, neural networks,
and data mining. These prediction approaches were reached by passing through the
decision tree according to the responses to the following questions:
broadband networks, the following data can be used as input for predicting failure
quantities:
Data from the very recent past—this relates to data about reported failures
collected in a period from the past few minutes up to the past few hours;
Archived data (data warehousing (DWH))—short-term and long-term historical
data about failures (from the past few days up to several years); these data allow
time series analysis and identification of trends and seasonal patterns;
Data about network loads, performances and the operational statuses of network
elements—provide information about the operating regime of network elements
or some kind of observed irregularity (peak loads, overloads, traffic rejections,
…);
Fault-management data—degradation of services, outages of individual network
elements, faults logged and stored in the fault-management system that are
correlated with failures reported by users;
Service-usage information—data about users’ habits, i.e., about average daily
service usage time, average service usage time per session, preferences according
to types of service, the distribution of daily traffic volumes, etc.;
Equipment reliability—reliability of network elements can be calculated from
error logging and error statistic databases, providing a basis for calculating Time
To Failure (TTF) and Time To Repair (TTR) parameters; by using these
parameters, we can predict the dynamics of equipment breakdowns,
Data about external influences on the network—information about the events that
cause external influences on the system; there are many different influences that
can cause faults in the network; for example, faults in power supply can disrupt
the operational status of customer premises equipment;
Meteorological data—due to the large impact of humidity and electrical
discharge on the network, particularly the access and transport part of the
network, these data represent important input to the model;
Data about social events—information on social events that may affect the
network, for example large gatherings of customers in a given area, a variety of
migrations, seasonal loads, ‘‘nomadism’’, scheduled or announced events that can
be taken as input variables for prediction.
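As a sketch of how the equipment-reliability inputs above might be derived, TTF and TTR can be computed from alarm start and end times. The field layout follows Table 5, but the helper and the records below are illustrative, not the operator's actual tooling:

```python
from datetime import datetime

def ttf_ttr(alarm_logs):
    """Mean Time To Failure and Mean Time To Repair, in hours.

    Each log entry is a (Alarm_Start_Time, Alarm_End_Time) pair, sorted in
    time. TTR is the mean outage duration; TTF is the mean working interval
    between one outage ending and the next one starting.
    """
    fmt = "%Y-%m-%d %H:%M"
    spans = [(datetime.strptime(s, fmt), datetime.strptime(e, fmt))
             for s, e in alarm_logs]
    ttr = sum((e - s).total_seconds() for s, e in spans) / len(spans) / 3600
    gaps = [(spans[i + 1][0] - spans[i][1]).total_seconds() / 3600
            for i in range(len(spans) - 1)]
    ttf = sum(gaps) / len(gaps)
    return ttf, ttr

# Illustrative outages of a single network element.
logs = [("2012-01-01 10:00", "2012-01-01 12:00"),
        ("2012-01-11 12:00", "2012-01-11 13:00"),
        ("2012-01-21 13:00", "2012-01-21 17:00")]
ttf, ttr = ttf_ttr(logs)  # ttf = 240.0 hours, ttr = 7/3 hours
```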
Certainly, data of good quality from all the above-mentioned sources were not
available during the research described in this paper. Some data were not available
at all and it would take significant, additional effort to make them available.
Therefore, the model has been developed on the basis of available data sets. As
we emphasized before, the model is scalable, i.e. subsequent addition of input
variables are possible. The neural network used in the model belongs to the group of
nonlinear dynamic networks, which is known in the literature as NARX, NARMA,
NARMAX [28]. This is a nonlinear autoregressive neural network with exogenous
inputs, also referred to as an input–output recurrent model; the principle scheme is
shown in Fig. 10. A special feature of this network configuration is two delay lines.
The first line, known as a recurrent delay line, connects output with input and allows
the dynamics of the signal to be captured. The second line, known as a tapped delay
line, accepts an input vector with a time delay. Both nonlinear and linear functions
can be employed in one hidden and one output layer; input and output can be
multidimensional.
An additional advantage of the model, as opposed to some other recurrent
models, is a standard multilayer perceptron located in the center of the network and
enabling learning by a standard algorithm. This ensures simplicity and reduces
learning time. The network dynamics is described by Eq. (1):
y(n + 1) = F(y(n), ..., y(n − q + 1), u(n), ..., u(n − r + 1))    (1)
where u(n) is the current observed value, u(n − 1), ..., u(n − r + 1) are past observations of the variables memorized up to r − 1 lags, and y(n), ..., y(n − q + 1) are the q past outputs fed into a recurrent delay line. In this way, past information can be
preserved, which means that information from the initial moment up to the current
moment affects the calculation of the new output value.
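The recurrence in Eq. (1) can be sketched as plain code. Here F is a toy stand-in for the trained network (a simple average), and all names are illustrative:

```python
def narx_step(F, y_hist, u_hist):
    """One step of the NARX recurrence y(n+1) = F(y(n..n-q+1), u(n..n-r+1)).

    y_hist holds the last q outputs (most recent first), u_hist the last r
    inputs (most recent first); F maps the concatenated vector to a scalar.
    """
    return F(y_hist + u_hist)

def simulate(F, y_init, u_series, q, r):
    """Feed each prediction back through the recurrent delay line."""
    y = list(y_init)          # most recent output first, length q
    u = [0.0] * r             # tapped delay line, initially empty
    outputs = []
    for u_n in u_series:
        u = [u_n] + u[:r - 1]          # shift the input delay line
        y_next = narx_step(F, y[:q], u)
        y = [y_next] + y[:q - 1]       # shift the recurrent delay line
        outputs.append(y_next)
    return outputs

# Toy "network": the average of all delayed values.
F = lambda v: sum(v) / len(v)
out = simulate(F, y_init=[0.0, 0.0], u_series=[4.0, 4.0, 4.0], q=2, r=2)
# out == [1.0, 2.25, 2.8125]: the output converges as past values accumulate
```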
We used the Matlab [26] implementation of the neural network with configura-
tion parameters shown in Table 8. The input and output vectors consist of a total of
250 records that represent the observed period (approximately 8 months).
The network training function updates weight and bias values according to the
Levenberg–Marquardt optimization (function trainlm) [26], because this is much
faster than training by the basic gradient-descent back-propagation algorithm (function traingd).
Another advantage of the Levenberg–Marquardt method is the ability to find a
solution in situations when training starts far away from a global minimum. The only
problem related to this method is its memory consumption. Memory exhaustion can
be avoided by adjusting the parameters to reduce memory usage, or by using other
methods such as the quasi-Newton back-propagation method, which is slower, but
uses memory more efficiently. The choice of transfer functions, like tansig, logsig,
and purelin, depends on the characteristic of the modeled system. In this specific
case, a hyperbolic tangent sigmoid transfer function (tansig) was used in the hidden
layer while a linear transfer function (purelin) was used in the output layer. The
configuration of the NARX network with 6 inputs, 1 recurrent feedback loop, and the
above-listed parameter values is shown in Fig. 11. The network consists of three
main parts. The input section ensures that the values of the variables are passed into
the 6 inputs through the delay lines. The central position is occupied by the classic
multilayer perceptron. Its function is to determine the significance and functional
dependencies using regression in the hidden and output layers of the network. The
third part is the output layer, which makes input/output links through the delay lines.
Input delay lines ensure that the input data related to past events impact the output
value (links for temporal cross-correlations); for example, how a thunderstorm that
occurred 2 days ago affects the number of faults today.
they are candidates to be quantified and included as input variables in the prediction
model. We can simply use scatter diagrams or mathematical methods (e.g.,
calculating Spearman’s or Kendall’s rank correlation coefficient, or coefficient of
linear correlation) to compare two phenomena. The scatter diagram in Fig. 12
shows the influence of electrical discharges on the number of failures. We can see a
significant, positive impact. The diagram in Fig. 13 shows the effect of temperature.
In this case, the impact is negligible—the trend line is horizontal.
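Spearman's coefficient, for instance, needs only the ranks of the two samples. The sketch below uses the standard formula and synthetic values (no tie handling), not the study's measurements:

```python
def ranks(values):
    """1-based rank of each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Synthetic example: failures rise monotonically with discharge intensity.
discharge = [0, 1, 2, 3, 4]
failures = [500, 800, 1500, 2200, 3100]
rho = spearman(discharge, failures)  # 1.0 for a perfectly monotone relation
```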
An additional possibility to find relations between data is detecting temporal
cross-correlations, as shown in the following example. In a large network that is
built up over years, users’ access lines do not have the same characteristics, and thus
are not all equally resistant to the effects of humidity. The reasons for this are
manifold:
Even though pure water is an insulator, the water that reaches the cable usually is
not clean. It picks up dust and pollutants from the air and minerals that come from the soil.

[Fig. 12: Scatter diagram of the number of failures against electrical discharge intensity, showing a significant positive correlation.]
[Fig. 13: Scatter diagram of the number of failures against temperature (°C); the trend line is horizontal.]

Manufacturing defects and wire insulation deteriorate over time (aging), allowing moisture to penetrate into the cable and therefore into the twisted pair. The
results of this are changed electrical parameters, such as the capacity and impedance
of the cable, causing a stronger attenuation of the higher frequency spectrum of the
ADSL signal [29, 30]. To analyze the influence of humidity on the occurrence of failures, measurement data were collected from a large number of lines.
The measurement data were collected from the DSLAM measurement system and
correlated with data on humidity. Our previous research [31] showed that the higher
values of humidity caused by humid weather have a significant influence on a specific number of the twisted pairs that are poorly protected from these external influences.

[Fig. 14: Time series of relative humidity and of the number of reported failures over about 90 days; failure peaks lag humidity peaks.]
Based on long-term monitoring, it was found empirically that the appearance of rain
and humidity increases the total number of failures. So the idea was to discover the
degree of correlation between the humidity and the number of reported failures. The
measured data and the data obtained as a result of the failure fixing/handling process
allowed us to show the dependency between two variables: relative humidity in the
air and number of faults (see Fig. 14). The diagram depicts two time series. A time
delay between peaks of humidity and the number of existing reported failures
indicates the correlation between them.
If we display the humidity and the number of failures in a scatter diagram (Fig. 15), the correlation is evident. Moreover, one can recognize a circular shape of the scattered points in the diagram, which suggests the existence of a temporal cross-correlation with a delay period d.
For two series x(i) and y(i), with mean values m_x and m_y, shifted relative to each other by the delay d, expression (2) defines their cross-correlation:

r = \frac{\sum_i \left[(x(i) - m_x)(y(i-d) - m_y)\right]}{\sqrt{\sum_i (x(i) - m_x)^2}\,\sqrt{\sum_i (y(i-d) - m_y)^2}}    (2)
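Expression (2) can be applied directly in code. The following is a minimal NumPy sketch; the sums run over the overlap in which both x(i) and y(i−d) are defined, a boundary choice the text leaves implicit:

```python
import numpy as np

def cross_correlation(x, y, d):
    """Normalized cross-correlation of x and y at delay d, per Eq. (2).

    Only indices i for which both x(i) and y(i-d) exist are summed,
    and the means m_x, m_y are taken over that overlap.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    i = np.arange(max(0, d), min(n, n + d))  # valid overlap of the two series
    xs, ys = x[i], y[i - d]
    mx, my = xs.mean(), ys.mean()
    num = np.sum((xs - mx) * (ys - my))
    den = np.sqrt(np.sum((xs - mx) ** 2)) * np.sqrt(np.sum((ys - my) ** 2))
    return num / den
```

Evaluating r over a range of lags d yields a cross-correlogram like the one in Fig. 16; the lag with the largest r estimates the delay between the two series.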
Peak values of reported failures are delayed with regard to the peaks of measured
humidity by approximately 24 h. Figure 16 shows the temporal cross-correlation as a function of the time delay (lag) d; the highest correlation is around the 1-day mark. This delay is explained by the time necessary for the
moisture to enter the cable, combined with the time it takes the user to notice and
report a failure in the service.
[Fig. 15: scatter plot of humidity (%) versus the number of failures (in thousands).]
Data about lightning and rainfall are downloaded from the three main meteorological stations in Croatia [32]; data about outages of network elements, the number of failures in the last 4 days, announced work on the network, and historical data on weekly averages of the number of failures are all taken from the OSS and DWH systems at T-HT.
Network element outages and external factors do not equally affect all parts of
the network. Electrical discharge mainly causes problems with equipment in the
user and the access part of the network, rainfall affects the operation of equipment in
the access part, while network element outages or announced works significantly
impact the access and the core part of the network.
There were some practical limitations on the data that generally reduce the
accuracy of predictions. These limitations could be avoided by improving the data-
collection process.

[Fig. 16: cross-correlation between humidity and the number of failures as a function of the lag in days; the peak lies near a 1-day lag.]

Variables representing thunderstorms and rainy weather are quantized by four values (weighting factors) from the set {0, 1, 2, 3}, where 3 denotes
the greatest impact. This resolution is too low to provide the precise expression of
impact. The same quantization is applied to the variables representing outages of
network elements and announced work on the network. Data on the daily numbers of failures are quite accurate, with the exception of a few cases where data were not collected due to errors in the OSS and DWH systems. There were also some minor errors in the classification of failures in the DWH system. These limitations on the data affect, to some extent, the accuracy of predictions. Due to the aforementioned constraints, the significance of the input variables is estimated not solely on the basis of correlation coefficients but also by using expert knowledge.
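As an illustration of this quantization, a rainfall reading can be mapped onto the {0, 1, 2, 3} scale as follows. The millimetre thresholds here are hypothetical assumptions, standing in for whatever calibration the operator actually uses:

```python
def quantize_rainfall(mm_per_day: float) -> int:
    """Map daily rainfall to a weighting factor in {0, 1, 2, 3}.

    The thresholds are illustrative assumptions, not the operator's
    actual calibration.
    """
    if mm_per_day <= 0.1:
        return 0  # dry
    if mm_per_day <= 5.0:
        return 1  # light rain
    if mm_per_day <= 20.0:
        return 2  # moderate rain
    return 3      # heavy rain or storm
```

The low resolution is visible here: all storms above the top threshold collapse into the single value 3, which is exactly the loss of precision the text describes.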
Model learning and testing were conducted on data collected during the first
8 months of 2012. Data collected during the first 5 months were used for learning
while the rest of the data were used for prediction and model testing. This learning/
predicting ratio (5/3) was chosen because of relatively rare influences of external
factors on the network. A longer learning period is necessary to ensure proper
inclusion of the external influences into the model parameters. But once the learning
is completed, predictions can be performed for months ahead with periodic updates,
assuming a stable, unchangeable system. It is known that large systems like
broadband networks are slowly changing systems.
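The lagged input/output structure of such a predictor and the learning/prediction split can be sketched as follows. The paper's actual model is a neural NARX network; this sketch substitutes a linear ARX model fitted by least squares on synthetic daily data, purely to show the structure, and all numbers (the 0.5/0.3 dynamics, the series length) are invented:

```python
import numpy as np

def build_narx_matrix(y, u, ny=2, nu=2):
    """Feature matrix for a NARX-style predictor: each row stacks the
    past outputs y(t-1..t-ny) and past exogenous inputs u(t-1..t-nu)
    used to predict y(t)."""
    lag = max(ny, nu)
    X = np.array([np.concatenate([y[t - ny:t], u[t - nu:t]])
                  for t in range(lag, len(y))])
    return X, y[lag:]

rng = np.random.default_rng(0)
u = rng.normal(size=240)            # stand-in exogenous factor (e.g. weather)
y = np.zeros(240)                   # synthetic daily failure-like series
for t in range(2, 240):
    y[t] = 0.5 * y[t - 1] + 0.3 * u[t - 1] + 0.01 * rng.normal()

X, target = build_narx_matrix(y, u)
split = len(X) * 5 // 8             # learn on ~5/8, predict the remaining ~3/8
w, *_ = np.linalg.lstsq(X[:split], target[:split], rcond=None)
mse = np.mean((X[split:] @ w - target[split:]) ** 2)
```

Once the parameters are learned on the first part of the series, the same weights are reused to predict the remainder, mirroring the 5-month/3-month arrangement described above.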
Figure 17 shows the actual daily reported number of failures and the values obtained by prediction (gray circles) for a 4-day prediction horizon. The daily average failure values, i.e., the mean values of all Mondays, all Tuesdays, etc. (black pluses in the figure), were taken as reference values. Prediction based on these daily averages is probably the simplest method of rough prediction and was the one used in the real process. Values of all input variables were fed into the NARX model; the figure shows the gain of the NARX model (gray circles) relative to the reference values.
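The reference predictor (the mean of all Mondays, all Tuesdays, and so on) is simple enough to state in a few lines. This is a sketch of the idea, not the operator's code; the alignment of day 0 of the series with a particular weekday is an arbitrary assumption:

```python
import numpy as np

def weekday_baseline(train_counts, horizon):
    """Reference predictor: forecast each future day as the mean of all
    training days that fell on the same day of the week.
    Day 0 of the series is taken as weekday 0 (an assumption)."""
    train = np.asarray(train_counts, dtype=float)
    weekdays = np.arange(len(train)) % 7
    weekday_mean = np.array([train[weekdays == w].mean() for w in range(7)])
    start = len(train)  # first day of the prediction horizon
    return np.array([weekday_mean[(start + h) % 7] for h in range(horizon)])
```

Because it ignores weather, outages, and announced work entirely, this baseline sets the bar that the multivariable NARX model has to beat.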
Certain prediction deviations are visible at some points. For example, on the
105th day the network suffered a quite unexpected outage of a large number of
devices caused by a sudden fault in an aggregation of DSLAMs. This event was not
predictable on the basis of available input information. The large number of devices
affected by the fault caused a significant increase in the number of failure reports.
Therefore, the deviation between the predicted and actual number of reports was
significant.
[Fig. 17: actual and predicted daily numbers of failures over a 120-day period.]
[Fig. 18: cumulative MSE over time for the mean-based predictor and the NARX model.]
The coefficient of determination R² is a measure of model quality that quantifies the correspondence between actual and modeled data. R² = 1 means complete correspondence between the model and reality. In real models, the values are always less than 1; the lower the value of the coefficient, the weaker the correspondence.
We have also measured the quality of the multi-variable prediction model by
varying the number of input variables included. By varying the number of input
variables, it is possible to observe the effect that each individual input has on the
accuracy. Table 11 shows mean square errors of the model when the number of
variables involved in the model is varied.
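Both quality measures used here, MSE and R², have standard definitions; a minimal implementation for checking a predictor against the observed daily counts:

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error between observed and predicted series."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 means complete correspondence;
    lower values mean weaker correspondence."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Recomputing both measures while input variables are added or removed one at a time is how the per-variable effect on accuracy, as in Table 11, can be observed.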
The model is designed based on the assumption that the input variables have the
same accuracy during the whole prediction horizon. In reality, it is very difficult to
maintain the same level of variable accuracy over a longer period. For the input
variables that represent environmental factors, such as rain and electrical discharges,
it is realistic to expect that the accuracy of prediction declines considerably when
the prediction period exceeds 3 or 4 days (the accuracy limit of weather forecasts). Announced work is usually planned more than 10 days in advance, while outages of network elements are very unpredictable. However, there are various
methods to anticipate outages of elements; the use of any particular method depends
on the nature and type of analyzed systems. These methods have already been
described in Salfner’s study [4]. Estimated prediction horizon lengths in which the
input variables have satisfactory accuracy are shown in Table 12. Each variable has
its own characteristics, but it is reasonable to assume that the multivariable model
developed here will provide prediction results of considerable accuracy for a
3–4 day horizon. This would be sufficient for the needs of the telecom business, i.e.,
operational network maintenance and workforce allocation.
Notwithstanding the described positive characteristics of the model, it should be
stated that the model is susceptible to erroneous input data. Generally, erroneous
data that enter into neural network models with long-term memory cause
considerable reduction in the accuracy. Therefore, it is important to estimate the
likelihood of errors in data or even better, if possible, to provide some procedure for
identifying and correcting erroneous data before they enter the model. Because we were aware of the problem, we have been using a semi-automatic procedure for checking and correcting the input data.
Table 12 Estimated prediction horizon lengths with satisfactory accuracy for input variables
Variable ID Variable description Estimated horizon length (days)
5 Conclusion
References
1. Avizienis, A., Laprie, J.-C.: Dependable computing: from concepts to design diversity. In: Pro-
ceedings of the IEEE (1986). doi:10.1109/PROC.1986.13527
2. Melliar-Smith, P.M., Randell, B.: Software reliability: the role of programmed exception handling.
In: Proceedings of the ACM Conference on Language Design for Reliable Software (1977). doi:10.1145/800022.808315
3. Laprie, J.-C., Kanoun, K.: Software reliability and system reliability. In: Lyu, M.R. (ed.) Handbook
of Software Reliability Engineering, chapter 2, pp. 27–69. McGraw-Hill, New York (1996)
4. Salfner, F., Lenk, M., Malek, M.: A survey of online failure prediction methods. ACM Comput. Surv. (2010). doi:10.1145/1670679.1670680
5. A guide to standard and high-definition digital video measurements. Publishing Tektronix. http://
www.tek.com/regional-page/guide-standard-hd-digital-video-measurements (2007). Accessed 15
August 2013
6. Held, G.: Understanding IPTV (Informa Telecoms and Media). Auerbach Publications, Boston (2006)
7. Kandula, S., Katabi, D., Vasseur, J.P.: Shrink: A tool for failure diagnosis in IP networks. In:
Proceedings of the ACM SIGCOMM Workshop on Mining Network Data (MineNet 2005),
pp. 173–178. ACM (2005). doi:10.1145/1080173.1080178
8. Mao, Y., Jamjoom, H., Tao, S., Smith, J.M.: Networkmd: topology inference and failure diagnosis in
the last mile. In: Proceedings of the 7th ACM SIGCOMM Conference on Internet measurement (IMC
2007), pp. 189–202. ACM (2007). doi:10.1145/1298306.1298333
9. Kavulya, S.P., Joshi, K., Hiltunen, M., Daniels, S., Gandhi, R., Narasimhan, P.: Draco: Top-down
statistical diagnosis of large-scale VoIP networks. Carnegie Mellon University, AT&T Labs-Re-
search. http://www.pdl.cmu.edu/PDL-FTP/ProblemDiagnosis/CMU-PDL-11-109.pdf (2011)
10. Mahimkar, A., Ge, Z., Shaikh, A., Wang, J., Yates, J., Zhang, Y., Zhao, Q.: Towards automated
performance diagnosis in a large IPTV network. In: Proceedings of the ACM SIGCOMM 2009
conference on Data communication, pp. 231–242. ACM (2009). doi:10.1145/1592568.1592596
11. Song, H.H., Ge, Z., Mahimkar, A., Wang, J., Yates, J., Zhang, Y.: Analyzing IPTV set-top box
crashes. In: Proceedings of the 2nd ACM SIGCOMM Workshop on Home Networks (HomeNets
2011), pp. 31–36. ACM (2011). doi:10.1145/2018567.2018575
12. Mahimkar, A., Song, H.H., Ge, Z., Shaikh, A., Wang, J., Yates, J., Zhang, Y., Emmons, J.: Detecting
the performance impact of upgrades in large operational networks. In: Proceedings of the ACM
SIGCOMM 2010 Conference, pp. 303–314. ACM (2010). doi:10.1145/1851182.1851219
13. Guo, Z.X., Wong, W.K., Li, M.: A multivariate intelligent decision-making model for retail sales
forecasting. Decis. Support Syst. 55, 247–255 (2013). doi:10.1016/j.dss.2013.01.026
14. Soroush, A., Bahreininejad, A., van den Berg, J.: A hybrid customer prediction system based on multiple forward stepwise logistic regression model. Intell. Data Anal. 16, 265–278 (2012). doi:10.3233/IDA-2012-0523
15. Mastorocostas, P., Hilas, C., Varsamis, D., Dova, S.: A recurrent neural network-based forecasting
system for telecommunications call volume. Appl. Math. Inf. Sci. 7(5), 1643–1650 (2013). doi:10.12785/amis/070501
16. Oduro-Gyimah, F.K., Azasoo, J.Q., Boateng, K.O.: Statistical analysis of outage time of commercial
telecommunication networks in Ghana. In: Proceedings of the International Conference on Adaptive
Science and Technology, pp. 1–8. ICAST (2013). doi:10.1109/ICASTech.2013.6707520
17. Jaudet, M., Iqbal, N., Hussain, A., Sharif, K.: Temporal classification for fault-prediction in a real-world telecommunications network. In: Proceedings of the International Symposium on Emerging Technologies, pp. 209–214. IEEE (2005). doi:10.1109/ICET.2005.1558882
18. Zhang, X., Sugiyama, A., Kitabayashi, H.: Estimating telecommunication equipment failures due to
lightning surges by using population density. In: Proceedings of the International Conference on
Quality and Reliability, pp. 182–185. ICQR (2011). doi:10.1109/ICQR.2011.6031705
19. Barbosa, C., Ying, X., Day, P., Zeddam, A.: Recent progress of ITU-T recommendations on lightning
protection. In: Proceedings of the 7th Asia-Pacific International Conference on Lightning (APL 2011),
pp. 258–262. IEEE (2011). doi:10.1109/APL.2011.6110120
20. Schulman, A., Spring, N.: Pingin' in the rain. In: Proceedings of the ACM SIGCOMM Conference on
Internet Measurement (IMC 2011), pp. 19–28. ACM (2011). doi:10.1145/2068816.2068819
21. Jin, Y., Duffield, N., Gerber, A., Haffner, P., Sen, S., Zhang Z.: NEVERMIND, the problem is
already fixed: Proactively detecting and troubleshooting customer DSL problems. In: Proceedings of
the 6th International Conference (Co-NEXT '10), Article 7. ACM (2010). doi:10.1145/1921168.1921178
22. Armstrong, J.S.: Principles of forecasting: a handbook for researchers and practitioners. Stanford
University, Kluwer. http://www.gwern.net/docs/predictions/2001-principlesforecasting.pdf (2002)
23. Deljac, Ž., Kunštić, M.: A comparison of methods for fault prediction in the broadband networks. In:
Proceedings of the 18th International Conference on Software, Telecommunications and Computer
Networks (SoftCOM 2010), pp. 42–46. IEEE (2010)
24. Deljac, Ž., Kunštić, M., Spahija, B.: A comparison of traditional forecasting methods for short-term
and long-term prediction of faults in the broadband networks. In: Proceedings of 34th international
convention on information and communication technology, electronics and microelectronics
(MIPRO 2011), pp. 517–522. IEEE (2011)
25. Deljac, Ž., Kunštić, M., Spahija, B.: Using temporal neural networks to forecasting of broadband
network faults. In: Proceedings of the 19th International Conference on Software, Telecommuni-
cations and Computer Networks (SoftCOM 2011), pp. 1–5. IEEE (2011)
26. Demuth, H., Beale, M., Hagan, M.: Neural Network Toolbox™ 5. The MathWorks Inc., Natick (2006)
27. Coulibaly, P., Anctil, F., Bobee, B.: Multivariate reservoir inflow forecasting using temporal neural
networks, pp. 367–376. J. Hydrol. Eng., ASCE (2001)
28. Dijk, O.E.: Analysis of recurrent neural networks with application to speaker independent phoneme
recognition. University of Twente, Department of Electrical Engineering, pp. 21–24. http://www.
eskodijk.nl/doc/Dijk99_Recurrent_Neural_Networks.pdf (1999)
29. Dodds, E.D., Celaya, B.: Locating Water Ingress in Telephone Cables Using Frequency Domain
Reflectometry. In: Proceedings of the Canadian Conference on Electrical and Computer Engineering,
pp. 324–327. IEEE (2005). doi:10.1109/CCECE.2005.1556938
30. Celaya, B., Dodds, E.D.: Single-ended DSL line tester. In: Proceedings of the Canadian Conference
on Electrical and Computer Engineering, pp. 2155–2158. IEEE (2004). doi:10.1109/CCECE.2004.1347670
31. Spahija, B., Deljac, Ž.: Proactive copper pair troubleshooting utilizing principal component analysis.
In: Proceedings of the 18th International Conference on Software, Telecommunications and Com-
puter Networks (SoftCOM 2010). IEEE (2010)
32. Free Meteo Forecast Archive (2012) (Stations: Zagreb/Grič, Rijeka/Kozala and Split/Marjan). http://
freemeteo.com.hr
Željko Deljac has been working as a Senior Service Management and Quality Assurance Expert at the
T-Croatian Telecom. He is currently pursuing his Ph.D. in the Department of Telecommunications at the Faculty of Electrical Engineering and Computing, University of Zagreb. His doctoral research focuses on
the development and application of data mining techniques in fault management and failure prediction.
His research interests include research on usage of artificial intelligence in network and services
management.
Mirko Randić is an Assistant Professor at the Faculty of Electrical Engineering and Computing,
University of Zagreb where he received his Ph.D. His research interests include systems, networks and
service management, software systems modeling and service performance modeling. His work has been
published in several peer-reviewed journals, such as Software: Practice and Experience, the Journal for Control, Measurement, Electronics, Computing and Communications, and the Journal of Computing and Information Technology, as well as in other journals and book chapters.
Gordan Krčelić received his B.S. in Mechanical Engineering from the Faculty of Mechanical
Engineering and Naval Architecture, University of Zagreb and M.S. in Electrical Engineering from the
Faculty of Electrical Engineering and Computing, University of Zagreb. He is working as Quality
Assurance Expert at the T-Croatian Telecom. He is certified Project Management Professional and Six
Sigma Black Belt. His projects have mostly focused on process improvement using the Six Sigma + LEAN methodology. He started his postgraduate doctoral study at the Faculty of Organization and Informatics,
University of Zagreb. His research interests include processes management, service management and
high-level planning.