Developing A Hybrid Intrusion Detection System Using Data Mining For Power Systems
IEEE TRANSACTIONS ON SMART GRID, VOL. 6, NO. 6, NOVEMBER 2015
Abstract—Synchrophasor systems provide an immense volume of data for wide area monitoring and control of power systems to meet the increasing demand of reliable energy. The construction of traditional intrusion detection systems (IDSs) that use manually created rules based upon expert knowledge is knowledge-intensive and is not suitable in the context of this big data problem. This paper presents a systematic and automated approach to build a hybrid IDS that learns temporal state-based specifications for power system scenarios including disturbances, normal control operations, and cyber-attacks. A data mining technique called common path mining is used to automatically and accurately learn patterns for scenarios from a fusion of synchrophasor measurement data and power system audit logs. As a proof of concept, an IDS prototype was implemented and validated. The IDS prototype accurately classifies disturbances, normal control operations, and cyber-attacks for the distance protection scheme for a two-line three-bus power transmission system.
Index Terms—Cyber-attacks, data mining, distance protection, intrusion detection system (IDS), power system, synchrophasor system.
Manuscript received July 14, 2014; revised January 15, 2015; accepted February 17, 2015. Date of publication March 18, 2015; date of current version October 17, 2015. This work was supported by the U.S. National Science Foundation under Grant DUE-1344369 and Grant DUE-1315726. Paper no. TSG-00716-2014.
The authors are with Mississippi State University, Starkville, MS 39762 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSG.2015.2409775
I. INTRODUCTION
THE NEXT generation power system, also known as the smart grid, will rely on advanced technologies such as synchrophasor systems for wide area monitoring and control in order to meet the increasing demand of reliable energy. While in the past power system components were isolated, they are now interconnected via information infrastructure, e.g., Ethernet, and therefore are under the threat of cyber-attacks. Due to the critical role that the power system plays in our society, there is a common agreement that the electric power grid needs to be better secured to ensure continually available power for the nation [1]. There have been multiple documents from different organizations which provide recommendations and guidelines for industry to better secure their facilities [2], [3]. However, the U.S. Government Accountability Office (GAO) has concluded that current guidelines are insufficient to securely implement the smart grid, and the GAO calls for research and development to improve upon current security mechanisms [4].
Intrusion detection systems (IDSs) identify activities that violate the security policy of a computer system or network. IDS are a necessary complement to preventive security mechanisms such as firewalls because IDS detect attacks that exploit system design flaws or bugs, and IDS provide forensic evidence to inform system administrators' reactions to cyber-attacks [5]. The increasing coupling of cyber infrastructure and physical devices of the smart grid makes a traditional host-based IDS inadequate because host-based IDS monitor each host in the system individually, while power system control algorithms such as the distance protection scheme usually involve multiple devices at multiple locations. Therefore, new IDS should have the ability to take multiple data sources into account and perform stateful monitoring at the system level. Manually building a stateful IDS is a knowledge intensive task which requires vulnerability analysis and manual creation of rules and patterns which describe attacks and normal behaviors. The manual development process results in limited scalability, and updates are slow and expensive.
This paper documents a systematic and automated approach to building a hybrid IDS that leverages features of signature-based and specification-based IDS. The IDS classifies system behaviors over time as specific disturbances, normal control operations, or cyber-attacks. Sequences of critical states, called common paths, provide a specification or signature for each scenario. A fundamental ingredient of the IDS presented in this paper is a data mining technique that aggregates synchrophasor measurement data and audit logs from multiple system devices to learn the common paths. The automatic approach eliminates the need to manually analyze and manually code patterns and is able to handle very large amounts of data.
Common paths are signatures of events present in a training database. Common paths are also specifications, since they describe the expected system behaviors for both normal operation and cyber-attacks. The IDS matches a temporal set of monitored system states to common paths to make a classification. Behaviors which do not match a common path are considered unspecified events and are either zero-day attacks or unknown system behaviors.
A case study is included to demonstrate that the proposed IDS provides high detection accuracy for both known and unknown scenarios and thus is suitable for mission critical environments such as power systems.
The rest of this paper is organized as follows. Section II reviews related works. An overview of the test bed and simulated power system scenarios is presented in Section III.
1949-3053 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Jaypee Insituite of Information Technology-Noida Sec 128 (L3). Downloaded on February 13,2024 at 05:12:52 UTC from IEEE Xplore. Restrictions apply.
PAN et al.: DEVELOPING A HYBRID IDS USING DATA MINING FOR POWER SYSTEMS 3105
Section IV introduces the procedure to construct the proposed IDS. Experiments and results are discussed in Section V. The conclusion is provided in Section VI.
II. RELATED WORKS
A. IDS for Smart Grid
In recent years, the emergence of the smart grid has motivated research into a variety of IDS techniques. People with different backgrounds have created various IDS that focus on different aspects of the smart grid. One type of IDS research focuses on intelligent electronic device (IED) security within the smart grid [6], [7]. This type of IDS is usually host-based and thus only identifies attacks against a single IED/network appliance in the system based on its intended behaviors. While host-based IDS secure individual devices in the smart grid, they do not provide stateful monitoring at the system level. More advanced IDS of this type consider behaviors of multiple devices within the system to obtain system level detection. Mitchell and Chen [8] proposed a rule-based IDS for the electric grid by considering the behaviors of three types of physical devices in the electric grid: 1) head-ends; 2) distribution access points/data aggregation points; and 3) subscriber energy meters. Readings from 22 sensors from the three types of devices were used as state components. The method quantized each of the 22 components into a limited number of ranges. Three state machines with 3456, 1728, and 3456 states were manually built for the three devices, and the state machines act as specifications for the three types of devices. Manual construction of such an IDS is cost prohibitive and does not scale for larger power systems. Additionally, changes to system behaviors require updating the specification state machines via the manual process.
Network-based IDS leverage communication traffic in the information infrastructure of the smart grid to detect cyber-attacks. IDS can leverage trust systems which monitor communications to and from a device [24] to validate communications and limit command and control actions to those approved by the trust system. Yang et al. [9] proposed an IDS for synchrophasor systems that detects cyber-attacks by using white lists of packets with legitimate source IP addresses, correct packet formats, and legal values for fields. The Yang IDS was evaluated for man-in-the-middle (MITM) and denial of service attacks against synchrophasor devices using the IEEE C37.118 protocol. Zhang et al. [10] proposed a distributed IDS that analyzes communication traffic at different network levels of the smart grid including home area networks, neighborhood area networks, and wide area networks. An intelligent module was deployed at each level to classify malicious data and possible cyber-attacks using data mining algorithms. These modules then communicate to provide a system level view of the communication network to improve the detection accuracy. Hadeli et al. [11] proposed an anomaly detection technique for industrial control systems that whitelists legitimate communication patterns extracted from different industrial control system protocols available in the system. The Hadeli IDS uses a system description file to provide a description of the overall expected communication patterns in the industrial control system. The IDS proposed by [9]–[11] can detect malicious changes to network traffic, but all three IDS fail to detect a malicious payload that results in invalid changes to the physical system. For example, Hadeli et al.'s [11] method cannot detect an injected but otherwise valid command to trip a protection relay from a valid IP address, which will take a transmission line out of service and cause a blackout. A specification-based IDS was developed to track sequential events in an advanced metering infrastructure (AMI) [12]. A manually constructed state machine was used to extract legitimate sequential system states from two AMI protocols and device status. To prove the correctness of the state machine, a model checking technique was used to verify the specifications. This IDS is not applicable for use with transmission systems because transmission systems have far more control actions and disturbances than AMI. As such, manually building such a state machine would be very expensive.
Other proposed IDS for smart grid leverage power system theory. For instance, Valenzuela et al. [13] used optimal power flow programs to detect cyber-attacks which alter system measurement data to cause the power flow to be dispatched erroneously. Talebi et al. [14] proposed a mechanism for identification of bad data attacks in a power system using weighted state estimation. Although these works are all proven capable of detecting altered data, these IDS are limited to one type of attack and cannot be extended to detect other attacks against power systems.
B. Accuracy of Specification-Based IDS
The detection accuracy of specification-based IDS depends on how accurately the specifications describe system behaviors. A promising way to improve the accuracy of specifications is through the use of data mining. A data mining technique was applied to an IDS framework proposed by Lee et al. [15] that combined signature-based IDS and anomaly-based IDS. Data mining programs were applied to a large volume of log data to learn attack signatures and normal behavior patterns and automatically create detection rules. Lee et al. [15] showed that the signatures for attacks and patterns for system normal behaviors created using their data mining technique are accurate by comparing their detection results to all other participants in the Defense Advanced Research Projects Agency intrusion detection evaluation program prepared by MIT Lincoln Laboratories. Lee et al.'s [15] IDS was originally designed for stateless IDS; therefore, it cannot be directly applied to specification-based IDS. A new data mining algorithm must be developed to discover sequential events for specifications.
C. Data Mining Techniques for Learning Specifications
A specification for a scenario contains a sequence of execution events or system states. The nature of specifications requires the data mining technique applied to the proposed IDS to be able to mine sequential patterns and identify the dependent relationship between events. The data mining technique used in this paper uses the mining sequential
3106 IEEE TRANSACTIONS ON SMART GRID, VOL. 6, NO. 6, NOVEMBER 2015
patterns technique which discovers patterns of activity from time ordered data. The mining sequential patterns algorithm was first mentioned in [16]. Lin et al. [17] applied it to discover patterns in clinical client care management process data that consists of patient records and log data over a period of treatment time. This technique was extended in [18] by employing a Bayesian network to graphically represent patterns of different hemodialysis processes, which consist of a sequence of patients' physiological states that are snapshots of clinical log data and patient records, e.g., body temperature, pulse rate, etc. In Lin et al.'s [18] work, states were assigned probabilities for the purpose of prediction.
For the work presented in this paper, the FP-growth [19] algorithm was used in the training process to mine frequent sequential patterns from power system data. FP-growth is an implementation of frequent item set mining. A common example of frequent item set mining is market basket analysis in which stores attempt to find associative relationships among products purchased by multiple customers, such as finding products often purchased together. Common path mining is similar to market basket analysis except common path mining finds system states which are commonly found together in a set. Common path mining also preserves temporal order of the system states.
III. COMMON PATH MINING
This paper uses the concept of a common path to represent the patterns encoded in a fusion of time stamped sensor data. A common path consists of a sequence of critical system states in temporal order. Describing the common path mining algorithm requires definitions of the concepts of state, feature, sequence, and path.
A state is used to represent a system's instantaneous status. A state consists of a set of observed system measurements or features f as well as a normalized time stamp (TS), i.e., S = {TS, f1, ..., fn}. The value of a feature is read from a sensor. The possible values for a feature are in a range called its domain. A feature that has continuous values in its domain should be discretized to finite ranges to avoid an infinite state space.
A path P is a list of observed system states arranged in temporal order according to their TSs, namely, Pi = {S1, S2, ..., Sn}, ordered by increasing time. A sequence s is a subset of a path, i.e., s ⊆ P. We denote a sequence s by {Si+1, Si+2, ..., Si+m}. A path P contains sequence s if all of the elements in s appear in P in the same order. In a set of sequences, a sequence is maximal if the sequence is not contained in any other sequence.
Let G be the set of all observed paths for a scenario Q, so G = {P1, P2, ..., Pn} where n is the number of observed paths for Q. A path supports sequence s if the sequence is contained in the path. Support can be defined as a metric in which the support of sequence s is the percentage of paths in G that contain sequence s. A common path for scenario Q is any sequence whose support is greater than a minimum support threshold and is maximal. There may be multiple common paths for a single scenario. Common paths reflect the states that occur most frequently for a scenario.
The common path mining algorithm consists of six steps. The first five steps create paths, P, for each instance of a scenario. First, raw data is collected from various sensors in the system. Second, raw data is fused or merged into a single database. Sensors may measure at different times and frequencies. Lower frequency sensor data is upsampled so that all high frequency measurements are maintained. Third, measurements which are continuous are quantized to minimize the total number of possible states in a database. Expert knowledge is used to design ranges for each sensor. A database is a table with columns for each sensor and rows representing the state of the system at increasing TSs. In the fourth step, the database is parsed to find all unique states. Fifth, the database is compressed by merging all rows which are the same state. In the sixth step, all known paths for a scenario, the set G, are processed with the mining frequent patterns algorithm FP-growth [19] to mine for frequent sequences of states. The support threshold is set via trial and error or using expert knowledge. Maximal frequent sequences are common paths for the scenario.
TABLE I
EXAMPLE PATHS FOR A SCENARIO
Example: Consider the set of paths shown in Table I. For the example G = {P1, P2, P3, P4, P5}. If the minimum support threshold is set to 60%, the set of frequent sequences in G which meet the minimum support threshold includes {S1, S2, S3, S4, S5}, {S1, S3, S4, S5}, and {S1, S4, S5}. For this example, {S1, S2, S3, S4, S5} is maximal and is therefore the common path. The sequences {S1, S3, S4, S5} and {S1, S4, S5} are not maximal because they are contained in {S1, S2, S3, S4, S5}. Alternatively, if the minimum support threshold is changed to 70%, the set of sequences in G which meet the minimum support threshold includes only {S1, S4, S5}. Since {S1, S4, S5} is the only sequence that meets the threshold in this case, it is maximal and is a common path.
Table I also provides examples of possible types of paths that could be found in the dataset. P1 represents the ideal case for a path representing a scenario. P2 matches P1 except that a subset of states is delayed. This may occur due to a measurement error or due to power system dynamics. P3 contains an extra state. Extra states may occur when a feature oscillates during a state transition. P4 represents the case when a path is similar but a state is different from the ideal case. This can happen when an event that should have occurred at T2 occurs at T3 instead, which mangles states S2 and S3 (they change to S11, S12). P5 represents an error path. In the error path no sequences match the ultimate common path.
The common path is used as a specification during classification. Changing the minimum support threshold,
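The definitions of containment, support, and maximality, together with the quantization and compression steps, can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: it enumerates candidate sequences by brute force instead of using FP-growth, it treats "meets the threshold" as support greater than or equal to the threshold, and the function names and the five paths in `G` are hypothetical (they only mimic the kinds of paths described above and are not the contents of Table I).

```python
from itertools import combinations


def quantize(value, edges, labels):
    """Step 3: map a continuous reading to a discrete label.

    `edges` are ascending upper bounds; len(labels) == len(edges) + 1.
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


def compress(states):
    """Step 5: merge consecutive rows that are the same state."""
    path = []
    for state in states:
        if not path or path[-1] != state:
            path.append(state)
    return tuple(path)


def contains(path, seq):
    """True if all elements of seq appear in path in the same order."""
    remaining = iter(path)
    return all(any(s == p for p in remaining) for s in seq)


def support(seq, paths):
    """Fraction of the paths in G that contain seq."""
    return sum(contains(p, seq) for p in paths) / len(paths)


def common_paths(paths, min_support):
    """Step 6 by brute force (not FP-growth): maximal frequent sequences."""
    candidates = set()
    for p in paths:
        for r in range(1, len(p) + 1):
            candidates.update(combinations(p, r))  # order-preserving subsequences
    frequent = [c for c in candidates if support(c, paths) >= min_support]
    return sorted(
        c for c in frequent
        if not any(c != other and contains(other, c) for other in frequent)
    )


# Hypothetical paths for one scenario (assumed for illustration only).
G = [
    ("S1", "S2", "S3", "S4", "S5"),        # ideal path
    ("S1", "S2", "S3", "S4", "S5"),        # same states, delayed
    ("S1", "S2", "S6", "S3", "S4", "S5"),  # one extra state
    ("S1", "S11", "S12", "S4", "S5"),      # two states differ
    ("S20", "S21"),                        # error path
]

print(common_paths(G, 0.6))  # [('S1', 'S2', 'S3', 'S4', 'S5')]
print(common_paths(G, 0.7))  # [('S1', 'S4', 'S5')]
```

Brute-force enumeration grows exponentially with path length, so it is only suitable for toy inputs like this one; a frequent-pattern miner is needed at the data volumes discussed in the paper.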
IV. TEST BED ARCHITECTURE
A. Distance Protection for Transmission Lines
The distance protection scheme is the most popular scheme for protecting transmission lines. The principle of operation recognizes that the impedance of a high-voltage transmission line is approximately proportional to its length. This means the impedance "seen" by the relay during a fault is proportional to the distance between the point of fault and the relay. Distance relays are encoded with multiple protection zones. Each zone is assigned an apparent impedance threshold and a trip time. Relays have overlapping protection zones to provide system protection redundancy. One relay's zone 1 is part of another relay's zone 2 and so forth. For this case study, the distance protection scheme was simplified by disabling reverse time delay backup and limiting the number of protection zones for each relay to 2. Fig. 1 shows a three-bus two-line transmission system that is modified from the IEEE four-bus three-generator system. Relay R1's zones 1 and 2 are shown as dashed line boxes. Each relay provides primary protection up to 80% of the line (zone 1 protection) and backup protection (zone 2 protection) up to 150% of the line in case the primary protection fails. The trip time for zone 1 protection is configured to be instantaneous, while the trip time for the zone 2 protection is time-delayed to avoid false tripping unless the primary relay fails.
B. Test Bed Architecture
A hardware-in-the-loop test bed, shown in Fig. 2, was used for power system scenario implementation and data generation. A real time digital simulator (RTDS) was used to simulate transmission lines, breakers, generators, and load. Four physical relays were wired to the RTDS in a hardware-in-the-loop configuration. The relays implemented the two zone distance protection scheme. The relays trip and open the breakers when a fault occurs on a transmission line. All relays included integrated phasor measurement unit (PMU) functionality to measure power system transmission line state; however, the PMUs were drawn separately in the graph because relays are controlled by Modbus/transmission control protocol (TCP) and PMUs stream synchrophasor measurements using the IEEE C37.118 protocol. The PMUs streamed real-time synchrophasor measurement data at a rate of 120 samples/s to the phasor data concentrator (PDC), which aggregates network frames from multiple PMUs and forwards the aggregated synchrophasor frames to the OpenPDC application. A set of scripts controls the simulation by inducing random state changes, capturing measurements, labeling captured data by scenario type, and merging data from multiple sources into a single file. The synchrophasor measurement data includes frequency, current phasors, voltage phasors, and sequence components. The four relays were sources of time stamped relay state changes. A signature-based IDS, Snort, runs on a PC to detect network activity. Snort provides alerts when it detects remote tripping command activities in the network. Snort, by itself, cannot distinguish between legitimate and illegitimate remote trip commands since they appear the same on the network. A control panel computer simulates energy management system (EMS) functionality. The EMS simulation was used to disconnect a transmission line for maintenance by remotely tripping relays via a Modbus/TCP network packet. An EMS log provides the TS of such a line maintenance event. For this paper, it is assumed that an attacker computer has successfully penetrated the utility's operational network and can launch cyber-attacks from a node on the operational network. Scenarios of power system disturbances, normal operations,
and power system cyber-attacks are applied against the simulated power system and its components. Data logs were captured from the synchrophasor system, relays, Snort, and the simulated EMS. All data logs were time stamped and labeled with the name of the scenario being simulated.
C. Test Bed Scenarios
The power system scenarios used to train and validate the IDS presented in this paper have been grouped into three categories: 1) power system single-line-to-ground (SLG) faults; 2) normal operations; and 3) cyber-attacks. Each category is described in this section with details. There are a total of 25 scenarios, each named with capital "Q" along with a number. The system load was randomized at the beginning of each scenario. Power system SLG faults belong to the shunt fault family and account for up to 70% of faults in a power system [20]. For this paper, only phase-a-to-ground faults were simulated, as each phase to ground fault has similar characteristics. The phase-a-to-ground fault is abbreviated as "fault" in the rest of this paper. Table II provides a summary of the simulated scenarios used to validate the proposed IDS.
TABLE II
SIMULATED SCENARIOS
For the SLG fault scenarios (Q1 and Q2) the relay operates instantaneously for zone 1 and after a time delay for faults in zone 2. The auto-reclosing scheme models a high speed three-phase reclosing scheme [21] which closes the breaker after one second.
The SLG fault replay attacks (Q3 and Q4) attempt to emulate a valid fault by altering system measurements followed by sending an illicit trip command to relays at the ends of the transmission line. This attack may lead to confusion and potentially cause an operator to take invalid control actions. A Python script was used to initiate a MITM attack between the hardware PDC and the OpenPDC application. The attacks replay synchrophasor measurements from a valid SLG fault, then replay commands to trip the relays on the affected line.
The transmission line maintenance scenarios (Q5 and Q6) simulate the situation when an operator remotely trips relays to open breakers at both ends of a transmission line to take the line out of service for line maintenance. The operator initiated remote trip commands are recorded and time stamped in the control panel log.
Power system cyber-attacks may originate from insiders, amateur hackers, political activists, criminal organizations, governments, and terrorists. Cyber-attacks may appear as a nuisance or may bring the system to collapse. Attacks can be carried out from within power system substations, a control center, or in transmission and distribution infrastructures by exploiting weaknesses in physical security policies. Alternatively, attacks may take advantage of security flaws and vulnerabilities in software, devices, communication infrastructures, or communication protocols to electronically infiltrate power system operational networks. Three types of attacks are simulated: 1) relay trip command injection; 2) disabling relay function; and 3) SLG fault replay.
Relay trip command injection attacks (Q7–Q12) create contingencies by sending unexpected relay trip commands remotely from an attacker's computer to the relays at the ends of the two transmission lines. The trip command injection attack used for this paper closely mimics the line maintenance scenario. The malicious trip command originates from another node on the communications network with a spoofed legitimate IP address. Since the attack is not from the control panel computer, there will be no record in the control panel log; however, the Snort network traffic monitor will detect this remote trip command.
The disabled relay attacks (Q13–Q24) mimic the effects of insiders taking illicit control actions or malware taking control of software systems to manipulate control devices. A Python script accesses a relay's internal registers via Modbus/TCP commands sent from the attacker's computer which modify the relevant relay settings. The disabled relay attacks overlap fault and maintenance events. The final scenario, Q25, represents a stable system state. For this scenario, the load may change, but no other attacks, disturbances, or control actions are simulated.
Scenarios start and end with the system in a stable state. As such, all faults are cleared, transmission lines taken out of service for maintenance are returned to service, and all attacks end before the next scenario is simulated.
D. Test Data
Test data used for this paper includes data logs associated with 10 000 simulated instances of the 25 aforementioned scenarios. The data log is a comma separated file with labeled tuples that include 56 sensor measurements and a TS. The 56 data sources consist of 52 synchrophasor measurements; 13 from each relay location in Fig. 1. The synchrophasor data from a single relay consists of phase voltage and current phasor magnitude, zero, positive, and
from 24% to 79% of the transmission line. The relay trip time for Fig. 3 was calculated from the MED as the time the relay status transitions from closed to open minus the initial time at which the line current is high. System behavior also varies as the system load changes.
Ideally, instances of SLG faults from a two zone distance protection scheme can be separated into three groups according to the area of the line in which the fault occurs. Group 1 includes faults from the length of the line which is protected by relay R1's zone 1 and relay R2's zone 2. From Fig. 3, group 1 includes faults which occur between 10% and 23% of the line. For group 1 faults, relay R1 should trip instantly and R2 should trip after 0.4 s. Group 2 includes faults protected by relay R1's and R2's zone 1. Both relays should trip instantly for group 2 faults. From Fig. 3, group 2 faults occur between 24% and 79% of the line. Group 3 includes faults protected by relay R1's zone 2 and relay R2's zone 1. Relay R1 should trip after 20 cycles and R2 should trip instantly for group 3 faults. From Fig. 3, group 3 faults occur between 80% and 90% of the line.
Observed trip times in group 2 tend to increase as the fault approaches the zones 1 and 2 boundary points. To compensate for this observed behavior, the SLG fault paths were grouped by fault location per the following groups: 10%–23%, 24%–29%, 30%–35%, 36%–40%, 41%–60%, 61%–65%, 66%–70%, 71%–80%, and 81%–90%. Additionally, it was observed that trip times partially correlated to the system load. As a result, the SLG fault paths were grouped by fault location and load. Four load ranges were used: 200–249, 250–299, 300–349, and 350–399 MW. This grouping subdivided the SLG fault paths into 9 × 4 = 36 sub-groups.
C. Common Path Mining
For this experiment the set G consists of 5000 raw paths from 5000 instances of the 25 scenarios. The common path mining algorithm produced 477 common paths across all scenarios. The minimum and maximum number of common paths for a single scenario were 4 and 53, respectively. The 15 SLG fault scenarios had 421 common paths spread among them. The remaining ten scenarios had 56 common paths. The large number of common paths for the SLG faults is due to the large variation in relay trip times as fault location and system load vary.
Common paths can be mapped into 2-D coordinates with the y-axis indicating the state identification code (state ID) and the x-axis indicating normalized TSs. An edge between two vertices represents the temporal transition between two states. Each vertex is marked with state information. Note that only necessary features are displayed to save space. Fig. 4 shows common paths for two scenarios, a fault in the 36%–40% fault location of line L1 and a fault replay attack on line L1. The fault and fault replay paths both start at the system normal state. For real faults, the PMU will measure high current when a fault is present, while for the fault replay attack, the attacker injects high current measurements to the PDC. This makes the second state of both common paths high current detected at relay R1, i.e., IR1 = high. However, these paths differ immediately because for the fault replay, the attacker has to inject relay trip commands to relay R1 and R2 at the same time. As such, the second state for the fault replay attack has the trip commands to R1 and R2 detected by Snort, i.e., SNT = (R1, R2) in Fig. 4.
Fig. 4. 2-D coordinates documenting fault versus fault replay attack common paths.
Fig. 5 shows common paths for line maintenance and command injection attack scenarios. The primary difference between the two scenarios is that the command to open relays R1 and R2 originates from the control panel computer for the line maintenance scenario. This causes the control panel log to include a trip command message. The common path for the line maintenance scenario includes a state noting the detection of control panel log events [i.e., CP = (R1, R2)] and states showing Snort detecting remote trip command network packets [i.e., SNT = (R1, R2)]. The common path for command injection includes the Snort alert but excludes the control panel log state.
Fig. 5. 2-D coordinates documenting line maintenance versus command injection attack common paths.
Figs. 4 and 5 demonstrate that common paths contain the critical states for different scenarios. The primary contribution of the common path mining algorithm is the ability to automatically create unique paths for each scenario type from data sets which measure behavior associated with the scenarios.
VI. EVALUATION
Three approaches were used to evaluate the IDS. First, the IDS was used to classify 5000 instances of scenarios from the test data set described in Section IV of this paper. Confusion matrices are provided to show IDS accuracy. A detailed review
of the algorithm’s ability to classify SLG faults by fault location is also provided. Second, training and testing were repeated with sets of four scenarios missing from the data set. This test was used to demonstrate the IDS’s ability to detect zero-day attacks and unknown scenarios. Finally, IDS cost and performance were assessed by measuring the amount of processing time and memory required during training and evaluation.
Tables III and IV provide confusion matrices for the 25 tested scenarios. The confusion matrices were separated into two tables to allow them to fit in the column width of this paper. The row labeled “Oth” represents scenarios Q14–Q25 in Table III and Q1–Q13 in Table IV. The row labeled “Unk” provides the number of instances which were unclassified due to no matching common path. Finally, the row labeled “Unc” provides the number of instances with uncertain classification due to matching more than one common path from more than one scenario.
TABLE III
CONFUSION MATRIX FOR SCENARIOS Q1–Q13
TABLE IV
CONFUSION MATRIX FOR SCENARIOS Q14–Q25
In total, 90.4% of the tested instances were correctly classified and 2.7% of the instances were misclassified. A further 4.7% of instances were classified as unknown and 2.2% were classified as uncertain. All of the cases of uncertain classification were related to SLG fault instances which matched a common path for more than one fault scenario.
The IDS can generate false positives, especially in the case of scenarios which are designed to mimic a nonattack scenario or event. For this paper, false positive rates were calculated for all nonattack scenarios misclassified as attacks. Scenarios Q1 and Q2, both SLG faults, had 2.1% and 1.6% false positive rates, respectively. In both cases, the majority of false positives were classified as fault replay attacks. Replay attacks are designed to mimic SLG faults. One out of eleven false positives was classified as a relay disable attack. Scenarios Q5 and Q6, both line maintenance events, had 0.8% and 0.9% false positive rates, respectively, corresponding to one false positive each. For the Q5 scenario, the false positive was a command injection attack to open both relays at the end of the transmission line. For the Q6 scenario, the false positive was a fault replay attack. In both cases, the sequences of states in the common paths for the actual scenario and the misclassified scenario have overlapping sub-sequences of states. This overlap, combined with variability in observed data due to power system and measurement system dynamics, can lead to false positives.
Additional evaluation was performed for classifications of the sub-groups of scenario Q1, an SLG fault on line L1. The paths for Q1 were grouped into sub-groups by fault location and circuit load as previously mentioned. The accuracy rate for SLG faults with grouping was 84.6%, while 11.35% of the paths were misclassified. Further analysis showed that a majority of misclassifications occurred when SLG fault groups were classified as members of a neighboring or nearby fault group. The grouping experiment demonstrates the common path mining algorithm’s strength of finding unique paths for even similar scenarios.
Tenfold cross-validation was used to evaluate the detection accuracy of zero-day attack scenarios as shown in Table V. For each round of testing, four scenarios were randomly selected to be excluded from training but present in the testing data set. The average detection accuracy for zero-day attack scenarios was 73.43%. However, there were cases where the detection rate for zero-day attacks was low. For example, analysis of round three results showed that scenario Q6 (command injection to trip relays R1 and R2) was always misclassified as scenario Q3 (fault replay attack on line L1). This occurs because the expected common paths for Q6 and Q3 are similar. Therefore, when Q6 is unavailable in training, instances of Q6 are classified as instances of Q3, which leads to misclassification. In this case, both Q6 and Q3 are attacks and the zero-day attack is classified as another attack, which is better than classifying it as a nonattack. To improve the classification accuracy between similar scenarios, additional sensors are needed to illuminate events which are different between the two scenarios. Of course, in the zero-day case it is difficult to predict which additional sensors may be required.
Training and classification processing time and memory usage were measured using an Ubuntu Linux virtual machine with a 3.5 GHz CPU and 2 GB of memory. Training required 0.33 s per scenario instance and 34 MB of memory. Classification of test cases required 0.85 s per scenario instance and 26.2 MB of memory.
Multiple batch processing-based data mining algorithms were used to classify power system faults and cyber-attacks in [22] using the same data used for the work presented
in this paper. The results in [22] were for classification with binary classes (attack and nonattack), three classes (attacks, nonattacks, and normal), and multiclass (all classes maintained). The common paths mining-based IDS outperformed all traditional methods in [22] for overall accuracy in the multiclass case. A combination of the JRipper and Adaboost algorithms produced accuracy approaching 90%, which is similar to the accuracy of the common paths mining-based IDS. All other test approaches had significantly lower accuracy than the IDS presented in this paper. The binary and three-class methods in [22] lead to improved accuracy at the expense of classification precision. The common paths mining-based IDS provides accurate and precise classification of each scenario type. Precise classification by scenario type is needed to speed understanding of attacks and to enable automated or manual response. Binary and three-class IDSs need post processing to provide additional detail before response. The primary advantage of the common paths mining-based IDS over a traditional batch processing IDS is the ability to process data as a stream rather than collecting batches of data for offline analysis. Stream processing minimizes the amount of memory required to train and classify and therefore is better suited for an IDS at the scale of a power system.
TABLE V
DETECTION ACCURACY FOR FOUR RANDOM ZERO-DAY ATTACKS 10× VALIDATION

VII. CONCLUSION
The common paths mining-based IDS provides stateful monitoring of an electric transmission distance protection system by leveraging a fusion of synchrophasor data and information from relay logs, network security logs, and EMS logs. The IDS is trained using a common path mining algorithm. Common paths are hybrid signatures and specifications which describe patterns of system behavior associated with power system events. The algorithm provides a time-domain data analysis approach to overcome transients present in the measurements. This is done by mining shared states out of a group of observed paths. Common paths are used to describe system responses to power system disturbances, control actions, and cyber-attacks.
The IDS matches monitored system state traversal to common paths to make classification decisions. Classification is specific to each trained scenario rather than simply an indication of normal or abnormal activity.
In this paper, the IDS was trained and evaluated for a three-bus two-line transmission system which implements a two-zone distance protection scheme. Twenty-five scenarios consisting of SLG faults, control actions, and cyber-attacks were implemented on a hardware-in-the-loop test bed. Scenarios were run in a loop 10 000 times with randomized system parameters to create a dataset for IDS training and evaluation. The IDS correctly classified 90.4% of tested scenario instances. Evaluation also included a tenfold cross-validation to evaluate the detection accuracy of zero-day attack scenarios. The average detection accuracy for zero-day attack scenarios was 73.43%. The common paths mining-based IDS outperforms traditional machine learning algorithms and is better suited for the high volume of data present in power systems.
Currently, the common paths mining-based IDS builds common paths from captured data logs. Capturing such data logs for real systems is difficult. As such, future work is required to limit the number of captured scenario instances required to train the algorithm. The IDS was tested by offline review of test data sets. Future work is needed to update the IDS to perform real-time classification from live system inputs and to incorporate the classifier with an intelligent adaptive control framework [23] to achieve increased automation of power systems.

REFERENCES
[1] A Systems View of the Modern Grid, Nat. Energy Technol. Lab. (NETL), Morgantown, WV, USA, 2007. [Online]. Available: https://www.smartgrid.gov/sites/default/files/pdfs/a_systems_view_of_the_modern_grid.pdf
[2] NERC Standards Critical Infrastructure Protection CIP-002-3 Through CIP-009-3, North American Elect. Rel. Corp., Atlanta, GA, USA, 2010. [Online]. Available: http://www.nerc.com/page.php?cid=2|20
[3] NISTIR 7628 Guidelines for Smart Grid Cyber Security: Vol. 1, Smart Grid Cybersecurity Strategy, Architecture and High-Level Requirements, Nat. Inst. Stand. Technol., Gaithersburg, MD, USA, 2014. [Online]. Available: http://nvlpubs.nist.gov/nistpubs/ir/2014/NIST.IR.7628r1.pdf
[4] D. Powner and D. Trimble, “Electricity grid modernization: Progress being made on cyber-security guidelines, but key challenges remain to be addressed,” Gov. Acc. Office, Washington, DC, USA, Tech. Rep. GAO-11-117, 2011. [Online]. Available: http://www.gao.gov/new.items/d11117.pdf
[5] N. Falliere, L. O’Murchu, and E. Chien, “W32.Stuxnet dossier, V 1.4,” Symantec Corp., Mountain View, CA, USA, Tech. Rep. MS10-046, 2011. [Online]. Available: http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/w32_stuxnet_dossier.pdf
[6] C. W. Ten, J. Hong, and C. C. Liu, “Anomaly detection for cybersecurity of the substations,” IEEE Trans. Smart Grid, vol. 2, no. 4, pp. 865–873, Dec. 2011.
[7] Y. Chen and B. Lou, “S2A: Secure smart household appliances,” in Proc. 2nd ACM Conf. Data Appl. Sec. Privacy, San Antonio, TX, USA, 2012, pp. 217–228.
[8] R. Mitchell and I.-R. Chen, “Behavior-rule based intrusion detection systems for safety critical smart grid applications,” IEEE Trans. Smart Grid, vol. 4, no. 3, pp. 1254–1263, Sep. 2013.
[9] Y. Yang et al., “Intrusion detection system for network security in synchrophasor systems,” in Proc. IET Int. Conf. Inf. Commun. Technol., Beijing, China, 2013, pp. 246–252.
[10] Y. Zhang, L. Wang, W. Sun, R. C. Green, and M. Alam, “Distributed intrusion detection system in a multi-layer network architecture of smart grids,” IEEE Trans. Smart Grid, vol. 2, no. 4, pp. 796–808, Dec. 2011.
[11] H. Hadeli, R. Schierholz, M. Braendle, and C. Tuduce, “Leveraging determinism in industrial control systems for advanced anomaly detection and reliable security configuration,” in Proc. IEEE Conf. Emerg. Technol. Factory Autom., Mallorca, Spain, 2009, pp. 1–8.
[12] R. Berthier and W. H. Sanders, “Specification-based intrusion detection for advanced metering infrastructures,” in Proc. IEEE 17th Pac. Rim Int. Symp. Depend. Comput., Pasadena, CA, USA, 2011, pp. 184–193.
[13] J. Valenzuela, J. Wang, and N. Bissinger, “Real-time intrusion detection in power system operations,” IEEE Trans. Power Syst., vol. 28, no. 2, pp. 1052–1062, May 2013.
[14] M. Talebi, J. Wang, and Z. Qu, “Secure power systems against malicious cyber-physical data attacks: Protection and identification,” World Acad. Sci. Eng. Technol., vol. 6, no. 6, pp. 112–119, 2012. [Online]. Available: http://waset.org/publications/6605/secure-power-systems-against-malicious-cyber-physical-data-attacks-protection-and-identification
[15] W. Lee, S. Stolfo, and K. Mok, “A data mining framework for building intrusion detection models,” in Proc. IEEE Symp. Sec. Privacy, Oakland, CA, USA, 1999, pp. 120–132.
[16] R. Agrawal and R. Srikant, “Mining sequential patterns,” in Proc. 11th Int. Conf. Data Eng., Taipei, Taiwan, 1995, pp. 3–14.
[17] J. L. Lin, X. S. Wang, and S. Jajodia, “Abstraction-based misuse detection: High-level specifications and adaptable strategies,” in Proc. 11th IEEE Comput. Sec. Found. Workshop, Rockport, MA, USA, 1998, pp. 190–201.
[18] F. Lin, C. Chiu, and S. Wu, “Using Bayesian networks for discovering temporal-state transition patterns in hemodialysis,” in Proc. 35th Annu. Hawaii Int. Conf. Syst. Sci., Big Island, HI, USA, 2002, pp. 1995–2002.
[19] J. Han, M. Kamber, and J. Pei, Data Mining Concepts and Techniques, 3rd ed. Burlington, MA, USA: Morgan Kaufmann, 2012.
[20] H. Saadat, Power System Analysis. New York, NY, USA: McGraw-Hill, 2010.
[21] R. Nylén, “Auto-reclosing,” ASEA J., vol. 52, no. 6, pp. 127–132, 1979.
[22] R. Borges et al., “Machine learning for power system disturbance and cyber-attack discrimination,” in Proc. Int. Symp. Resil. Control Syst., Denver, CO, USA, 2014, pp. 1–8.
[23] R. Amgai, J. Shi, and S. Abdelwahed, “An integrated lookahead control-based adaptive supervisory framework for autonomic power system applications,” Int. J. Elect. Power Energy Syst., vol. 63, pp. 824–835, Dec. 2014.
[24] G. Coates, K. Hopkinson, S. Graham, and S. Kurkowski, “Collaborative, trust-based security mechanisms for a regional utility intranet,” IEEE Trans. Power Syst., vol. 23, no. 3, pp. 831–844, Aug. 2008.

Shengyi Pan (S’12–M’15) received the B.Eng. degree in electronic information engineering from Fuzhou University, China, in 2008; the M.Sc. degree in data communications from the University of Sheffield, Sheffield, U.K., in 2009; and the Ph.D. degree in electrical and computer engineering from Mississippi State University, MS, USA, in 2014.
From 2010 to 2014, he was a Research Assistant with the Department of Electrical and Computer Engineering, Mississippi State University, where his research focused on smart grid cyber security and data-driven intrusion detection technologies. He is currently a Software Engineer with MaxPoint Interactive Inc., Morrisville, NC, USA, for big data application development in internet digital advertising. His current research interests include smart grid technologies, cyber security, data mining, and big data technologies.

Thomas Morris (M’06–SM’08) received the B.S. degree in electrical engineering from Texas A&M University, College Station, TX, USA, in 1994, and the M.S. and Ph.D. degrees in computer engineering from Southern Methodist University, Dallas, TX, in 2001 and 2008, respectively.
He joined Mississippi State University, Starkville, MS, USA, in 2008, where he currently serves as an Associate Professor of Electrical and Computer Engineering, an Associate Director of the Distributed Analytics and Security Institute, and the Director of the Critical Infrastructure Protection Center. His current research interests include cyber security for power systems and industrial control systems.

Uttam Adhikari (S’11) received the B.S. degree in electrical engineering from Tribhuvan University, Kirtipur, Nepal, in 2005. He is currently pursuing the Ph.D. degree in electrical and computer engineering from Mississippi State University, Starkville, MS, USA.
His current research interests include cyber-physical system modeling and simulation, wide area measurement systems, data mining, and cyber security in smart grid.
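
As a supplementary illustration of the classification outcomes reported in Section VI, the following Python sketch shows one way an observed state sequence could be labeled with a scenario name, "Unk" (no matching common path), or "Unc" (matches in more than one scenario). It is a minimal sketch under stated assumptions and not the authors' implementation: matching is simplified to exact equality after collapsing repeated states, the scenario names are hypothetical, and the state labels only loosely follow the discussion of Figs. 4 and 5. The "TRIP=(R1)" state and the two "slg_fault_*" scenarios are invented purely to trigger the uncertain case.

```python
from itertools import groupby
from typing import Dict, List, Sequence, Tuple

Path = Tuple[str, ...]

# Hypothetical common paths (assumed for illustration only).
COMMON_PATHS: Dict[str, List[Path]] = {
    "line_maintenance": [("normal", "CP=(R1,R2)", "SNT=(R1,R2)")],
    "command_injection": [("normal", "SNT=(R1,R2)")],
    "fault_replay_L1": [("normal", "IR1=high", "SNT=(R1,R2)")],
    # Two invented SLG fault scenarios sharing one path, to produce "Unc".
    "slg_fault_a": [("normal", "IR1=high", "TRIP=(R1)")],
    "slg_fault_b": [("normal", "IR1=high", "TRIP=(R1)")],
}


def collapse(states: Sequence[str]) -> Path:
    """Drop consecutive repeats so that only state transitions remain."""
    return tuple(state for state, _ in groupby(states))


def classify(observed: Sequence[str], common_paths: Dict[str, List[Path]]) -> str:
    """Return a scenario name, 'Unk' (no match), or 'Unc' (several scenarios match)."""
    path = collapse(observed)
    matched = sorted(name for name, paths in common_paths.items() if path in paths)
    if not matched:
        return "Unk"
    if len(matched) > 1:
        return "Unc"
    return matched[0]
```

Under these assumptions, a sequence that includes the control panel log state before the Snort alert is labeled as line maintenance, while the same Snort alert without the control panel state is labeled as command injection, mirroring the distinction drawn for Fig. 5.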