JITA 7(2017) 2:84-91 G. JANANI, N.
RAMYA DEVI:
ROAD TRAFFIC ACCIDENTS ANALYSIS USING DATA
MINING TECHNIQUES
G. Janani, N. Ramya Devi
Assistant Professor, Department of Information Technology, Sri Shakthi Institute of Engineering and
Technology, Coimbatore
Contribution to the state of the art
DOI: 10.7251/JIT1702084J UDC: 656.1.08:656.08
Abstract: Road Traffic Accidents (RTAs) are a major public concern, resulting in an estimated 1.2 million deaths and 50 million injuries worldwide each year. In the developing world, RTAs are among the leading causes of death and injury. Most analyses of road accidents use data mining techniques, which provide productive results. Analysis of accident locations can help identify the features that make road accidents occur frequently at those locations. Association rule mining is one of the popular data mining techniques for identifying correlations among the various attributes of road accidents. Data analysis is capable of identifying the different reasons behind road accidents. In the existing system, the k-means algorithm is applied to group the accident locations into three clusters, and association rule mining is then used to characterize the locations. Most state-of-the-art traffic management and information systems focus on data analysis, and very little has been done in the sense of classification. The proposed system therefore uses a classification technique to predict the severity of an accident, which brings out the factors behind the road accidents that occurred, and a predictive model is constructed using fuzzy logic to predict the location-wise accident frequency.
Keywords: Road Traffic Accident (RTA), Data Mining, k-means, Association Rule Mining (ASM), Classification, Prediction.
INTRODUCTION
Road and traffic accidents are major causes of fatality and disability in Coimbatore. An RTA not only causes property damage but may also lead to partial or full disability, and can sometimes be fatal. An increasing rate of RTAs is not a good sign for transportation safety. The analysis of traffic accident data helps to identify the different causes of road accidents and to undertake preventive measures. Various types of research have been done on road accident data from different countries. Identifying the key factors that cause road traffic accidents is a significant help in this situation [1]. Applying suitable data mining methods to the collected datasets, which represent different situations on the roads and the accidents that occurred, can help in understanding the most significant factors or often repeating patterns. The success of such analysis depends strongly on the quality of the data available for the experiments. An interesting source of data in this domain is continually created by the Department of Police, Coimbatore. All available data files provide detailed road safety data about the circumstances of personal injury road accidents and the consequential casualties. The statistics relate only to accidents that occurred on public roads, were reported to the police and were subsequently recorded using the RADMS (Road Accident Database Management System) accident reporting form (completed by police).
This paper consists of four main sections. Section 1 discusses the literature review. The proposed methodology is discussed in Section 2. Section 3 describes the simulation results.
84 Journal of Information Technology and Applications www.jita-au.com
Section 4 presents the conclusion, which summarizes the extracted knowledge with respect to other relevant work.

REFERENCES SURVEY
Sachin Kumar [2] discussed various data mining techniques used to cluster the data into various categories and to identify the correlation between the attributes in the dataset. Lee et al. [3] stated that statistical models were a good choice for analyzing road accidents in order to identify the correlation between accidents and other traffic and geometric factors. However, Chen and Jovanis [4] stated that analyzing large dimensional datasets using traditional statistical techniques may result in certain problems, such as sparse data in large contingency tables. In addition, statistical models come with their own specific assumptions, and violating these can lead to erroneous results. Due to these limitations of statistical methods, data mining techniques are being used to analyze road accidents. Data mining techniques are used to extract novel, implicit and hidden information from large data. Barai [5] discussed the variety of applications in transportation engineering that use data mining, such as road roughness analysis, pavement analysis and road accident analysis. Various data mining techniques [6] such as ASM, classification and clustering are widely used for the analysis of road accidents. Accident cases in India are usually recorded by the police officer of the region in which the accident occurred; the area covered by a police station is limited, and each station keeps records of the accidents that occur in the region under its control. Abellan et al. [7] developed various decision trees to extract different decision rules in order to analyze two-lane rural highway data from Spain. It was found that bad light conditions and safety barriers badly affect the crash severity. Geurts et al. [8] used the ASM technique to analyze the various circumstances that occur at high-frequency accident locations on Belgian road networks. Tesema et al. [9] used an adaptive regression tree model to build a decision support system for road accidents in Ethiopia. Kashani et al. [10] used the Classification and Regression Tree (CART) to analyze road accident data from Iran and found that not using a seat belt, improper overtaking and over speed affect the severity of accidents. Kwon et al. [11] used classification algorithms to analyze factor dependencies related to road safety. Accident severity is directly concerned with the victims involved in accidents; it targets only the type of severity and shows the circumstances that affect the injury severity of accidents. Most accidents are associated with certain location characteristics which make them occur frequently at these locations. Hence, identifying the locations where accident frequencies are high and analyzing them further is very beneficial for identifying the factors that affect the accident frequency at these locations. Depaire et al. [12] showed that cluster analysis of road accident data can extract better information than analyzing the data without clustering. Preeti Mulay [13] related fatal-crash RTA data to nutritional health survey data in order to analyse the association between the dietary habits of motor vehicle drivers and road traffic accidents by applying association rule mining algorithms. So the previous work focuses mainly on driver characteristics and dietary habits, which were analysed using ASM, whereas this paper focuses on the contribution of various road-related factors, such as the role of the environment, the place where the accident occurred and the cause of the accident, in order to classify the severity of the accident.
In this paper, data mining techniques are used to identify accident locations which are more prone to risk and to analyze them further in order to identify the various factors that affect road accidents at those locations. Initially, the dataset is divided into k groups based on location using the k-means clustering algorithm. Then, the association rule mining algorithm is applied to those groups to reveal the correlation between different attributes in the accident data and to understand the characteristics of these locations. Finally, the classification algorithm (Naive Bayes) is applied to classify the severity of the accident.

METHODOLOGY
The proposed methodology consists of four phases, namely Preprocessing, Clustering of data, Association Rule Mining, and Classification. Figure 3.1 represents the system architecture of the project.
December 2017 Journal of Information Technology and Applications 85
Data Preprocessing
Data preprocessing is the initial step in data mining, which mainly involves transforming the raw data into an understandable format. Real-world data is generally incomplete, inconsistent and likely to contain many errors. Data preprocessing resolves such issues and prepares the raw data for further processing. In this paper, the data preprocessing techniques Data Cleaning and Data Transformation are used.

Figure 3.1. System Architecture

Clustering
Clustering is an unsupervised data mining technique which is used to group the data objects into different clusters in such a way that objects within a group are more similar to each other than to the objects in other clusters. The k-means algorithm [14] is a very popular clustering technique for numerical data. It groups the data objects into k clusters. Various clustering algorithms exist, but the selection of a suitable clustering algorithm depends on the type and nature of the data. The prime motive of this paper is to separate the data into different clusters based on the accident location.

Association Rule Mining
Association rule mining is a very popular data mining technique based on market basket analysis that extracts interesting rules between various attributes in a large data set [18]. Association rule mining produces a set of rules that define the underlying patterns in the data set. Consider a data set D of n transactions, where each transaction T ∈ D. Let I = {I1, I2, …, In} be a set of items. An item set A occurs in T if and only if A ⊆ T. A → B is an association rule, provided that A ⊂ I, B ⊂ I and A ∩ B = Ø. In the case of road accident data, an association rule can identify the various attribute values which are responsible for an accident occurrence. In association rule mining, various interestingness measures are available to assess the quality of a rule. These measures for the rule A → B are discussed as follows:

Support
The support of the rule A → B defines the percentage of how often A and B occur together in a data set and can be calculated using Equation (1). Support is also known as the frequency constraint. A set of items satisfying a certain support threshold is known as a frequent item set. These frequent item sets are further used to generate association rules based on other measures.
\[ \mathrm{Support}(A \rightarrow B) = \mathrm{Support}(A \cup B) = \frac{\sigma(A \cup B)}{N} \qquad (1) \]
where σ(A ∪ B) is the number of accident records that contain both A and B, and N is the total number of accident records.

Confidence
The confidence of the rule A → B defines the ratio of the occurrences of A and B together to the occurrences of A and can be calculated using Equation (2). The higher the confidence value of the rule A → B, the higher the chance that B occurs when A occurs. Sometimes, confidence values alone are not sufficient to evaluate the descriptive interest of a rule.
\[ \mathrm{Confidence}(A \rightarrow B) = \frac{\mathrm{Support}(A \cup B)}{\mathrm{Support}(A)} \qquad (2) \]

Lift
Lift for the rule A → B measures how much more often A and B occur together than expected. In other words, lift is the ratio of the confidence to the expected confidence of a rule. The expected confidence is the confidence the rule would have if A and B were independent, which equals the support of B. A lift value ranges from 0 to ∞. Lift values greater
than 1 make a rule potentially useful for predicting the consequent in future data sets. Lift determines how far A and B are from independence. Lift measures co-occurrence only and is symmetric with respect to A and B. Lift can be calculated using Equation (3).
\[ \mathrm{Lift}(A \rightarrow B) = \frac{\mathrm{Confidence}(A \rightarrow B)}{\mathrm{Support}(B)} = \frac{\mathrm{Support}(A \cup B)}{\mathrm{Support}(A)\,\mathrm{Support}(B)} \qquad (3) \]
where Support(X) is the fraction of accident records that contain the item set X.

Apriori Algorithm
The Apriori algorithm [14] is used to generate the frequent item sets and the strong association rules. The input of the algorithm is the transaction database of accident data, and the output is the frequent item sets and the association rules that satisfy the minimum threshold of lift.

Classification
Classification is the process of finding a derived model that describes the data classes. The main purpose is to use the model to predict the class of objects whose labels are unknown. The derived model is based on the analysis of a set of training data.

Naive Bayes Classifier
The Naive Bayes classifier [14] uses a probabilistic method to predict a class for every instance of the data set. The input of the algorithm is the test data and the output is the predicted severity level. The specific working process of Naive Bayes is as follows:
Let T be the training sample set. Each sample has a category label. The sample set has a total of m classes: C1, C2, ..., Cm. Each sample is represented by an n-dimensional vector X = {x1, x2, ..., xn}, and each vector describes n attributes A1, A2, ..., An. The steps in calculating the probability of the class are explained below.
1. Given a sample X, the classifier predicts that X belongs to the class with the highest posterior probability. X is predicted to belong to class Ci if and only if P(Ci|X) > P(Cj|X) for all 1 ≤ j ≤ m, j ≠ i. According to Bayes' theorem, the probability is calculated as in equation (4).
\[ P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \qquad (4) \]
Because P(X) is the same for all classes, it is only necessary to find the largest P(X|Ci)P(Ci). The prior probability of class Ci can be calculated as P(Ci) = si/s, where si is the number of training samples of class Ci and s is the total number of training samples. If the prior probability of class Ci is unknown, it is usually assumed that the classes are equally probable, P(C1) = P(C2) = … = P(Cm), and the problem is then transformed into maximizing P(X|Ci).
2. If the data set has many attributes, the workload of calculating P(X|Ci) is very high. In order to reduce the computational overhead of P(X|Ci), the naive assumption is made that, given the class, the attribute values are independent of each other. P(X|Ci) is then calculated as in equation (5).
\[ P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) \qquad (5) \]
3. The probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be calculated from the training set. Here xk refers to the value of attribute Ak for sample X.
4. For each class, P(X|Ci)P(Ci) is calculated. The classifier predicts that sample X belongs to class Ci if and only if P(X|Ci)P(Ci) is the maximum. Bayes' theorem is used for classification because past information about a parameter can be incorporated to form a prior distribution for future analysis.

Performance Evaluation
Classification performance is evaluated in terms of three commonly used metrics: accuracy, precision and recall, as defined in equations (6) to (8). Table 3.1 is a confusion matrix whose entries are given as a function of the two typical classes in severity classification.
• Accuracy is the percentage of test set samples that are correctly classified by the model.
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6) \]
• Precision is the fraction of retrieved instances which are relevant.
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \qquad (7) \]
• Recall is the fraction of relevant instances which are retrieved.
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \qquad (8) \]
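The Naive Bayes steps and the evaluation metrics described above can be illustrated with a minimal sketch. It is not the authors' Java implementation: the class name `CategoricalNaiveBayes`, the function `evaluate`, the attribute values and the toy records are all hypothetical, and add-one (Laplace) smoothing is an added assumption that the paper does not mention. The score follows equations (4) and (5) with P(X) dropped, and the metrics follow equations (6) to (8).

```python
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes for nominal attributes, following equations (4) and (5)."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)       # s_i: training samples per class
        self.total = len(labels)                  # s: all training samples
        self.value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
        self.domains = defaultdict(set)           # distinct values seen per attribute
        for row, c in zip(rows, labels):
            for k, value in enumerate(row):
                self.value_counts[(k, c)][value] += 1
                self.domains[k].add(value)
        return self

    def score(self, row, c):
        """Return P(X|C_i) * P(C_i); P(X) is dropped because it is the same for all classes."""
        p = self.class_counts[c] / self.total     # prior P(C_i) = s_i / s
        for k, value in enumerate(row):
            # Add-one (Laplace) smoothing: an assumption, used to avoid zero probabilities.
            numerator = self.value_counts[(k, c)][value] + 1
            denominator = self.class_counts[c] + len(self.domains[k])
            p *= numerator / denominator          # one factor of the product in equation (5)
        return p

    def predict(self, row):
        """Pick the class with the largest P(X|C_i) * P(C_i)."""
        return max(self.class_counts, key=lambda c: self.score(row, c))


def evaluate(actual, predicted, positive):
    """Return (accuracy, precision, recall) as in equations (6) to (8)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical toy records: (reason, lighting) -> severity label.
train_rows = [
    ("overspeed", "poor"), ("overspeed", "good"), ("negligence", "good"),
    ("negligence", "good"), ("overspeed", "poor"), ("negligence", "poor"),
]
train_labels = ["high", "high", "low", "low", "high", "low"]
model = CategoricalNaiveBayes().fit(train_rows, train_labels)

test_rows = [("overspeed", "poor"), ("negligence", "good"),
             ("negligence", "poor"), ("overspeed", "good")]
test_labels = ["high", "low", "high", "high"]
predicted = [model.predict(row) for row in test_rows]
print(predicted)
print(evaluate(test_labels, predicted, positive="high"))
```

In this sketch the positive class is passed explicitly, so the same function can score either severity level; which class counts as positive in the paper's own evaluation is not stated and is left open here.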
Table 3.1. Confusion Matrix
(TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative)

                  Predicted Slight    Predicted Severe
Actual Slight     TP                  FN
Actual Severe     FP                  TN

SIMULATION RESULTS
The proposed methodology is implemented in the Java language and executed in the NetBeans IDE.

Data Set
In Coimbatore, all accident-related details are collected and maintained by the Department of Police using software called the Road Accident Database Management System (RADMS), and the information is stored on a central server located at one particular place in Tamil Nadu. Hence, these data provide information about accidents that have happened in the road network of the entire city. The data for this study were obtained from the Commissioner office, Coimbatore. The data consist of 570 accident records for 3 years, from 2013 to 2015, in Coimbatore. After pre-processing, 542 accident records were considered for this study. A description of the data set is provided in Table 4.1. A sample of the data set is provided in Table 4.2.

Table 4.1. Data set Description

Attribute Name             Type
Date & Time of Accident    Nominal
Road name                  Nominal
Road No                    Nominal
Municipality               Nominal
Fatal                      Nominal
Grievous                   Nominal
Injury                     Nominal
Property Damage            Nominal
Nature of Accident         Nominal
Reason as in FIR           Nominal
Place                      Nominal
Lighting                   Binary

Table 4.2. Sample Data set

Road name    Road no    Municipality    Fatal    Grievous    Nature
1            1          1               1        0           1
2            1          2               0        1           2
2            1          2               1        1           3
3            2          1               0        1           1

Categorization of accident locations
The k-means clustering technique was applied to the accident data to obtain three clusters based on the accident locations. The clusters are named Area under city, Area beyond city limit and Area under Highways. There are 52 locations where accidents happened in Coimbatore.

Association rule mining
The Apriori algorithm is used to generate the rules. To find the strong association rules, a minimum support of 5% is set. Association rules provide the correlation between the different attributes when an accident happens. The interesting rules in this paper have been chosen based on the lift value. The rules for the various clusters are discussed below:

Association rules for Cluster 1
The association rules of Cluster 1 show that most of the accidents that happened in these locations are mainly due to over speed and careless driving. These locations are highly sensitive to Hit and Run. Most of the accidents that happened here led to injuries and some led to property damage. If the nature of the accident is RTA, then the area comes under the road type CH. Strong rules with high lift values show that the accidents happen mostly near junction areas and are due to poor lighting and road surface.

Association Rules for Cluster 2
The association rules of Cluster 2 show that most of the accidents that happened in these locations are mainly self accidents. Most of the accidents that happened in these locations are due to negligence and some are due to the intersection road feature. The accidents that happened in these areas belong to the road type SH. Compared to Cluster 1, the fatal and injury levels are lower in this cluster. Most of the vehicles involved in the accidents have crossed the minimum speed limit.
Association Rules for Cluster 3
The association rules of Cluster 3 show that most of the accidents that happened in these locations are mainly due to rash driving. The nature of most of the accidents is RTA. Most fatal accidents happen due to rash driving, and a few self accidents also happened. The rules suggest that highways are more prone to accidents. Compared to the other areas, Cluster 3 areas are more prone to severe fatal accidents. Most of the accidents happened in the highway areas.
The association rules for the various clusters show the factors behind the accidents and reveal the correlation between different attributes. Some of the rules are similar across all clusters. From such similar rules, for example the one in which the nature of the accident is RTA and the reason in the FIR is Rash, we conclude that rash driving leads to fatalities and injuries. If the road lighting is poor, then accidents are likely to occur in those locations.

Classification
The Naïve Bayes algorithm is used to classify the severity of accidents. The severity of an accident is directly concerned with the victims involved in the accident. The severity level of an accident is classified based on the affected victims. To train the model, 70% of the data is taken, and to test the model, 30% of the data is used. The class label is created based on the attributes Fatal, Grievous, Injury and Damage. The class label represents the severity level of the accident that happened. Class 0 represents low severity and class 1 represents high severity. Naïve Bayes performs well in terms of accuracy when compared with other classification algorithms such as Decision tree J48 and Random forest. The outcome of this phase is the severity level of the accidents that occurred. In Coimbatore, 40% of the accidents belong to the high severity level and 60% belong to the slight severity level. To measure the performance of the classifier, the classification accuracy is computed from the test set.

Prediction
In the proposed system, a prediction model using fuzzy logic is built in order to predict the probability of accident occurrence in Coimbatore. Fuzzy rule based systems are an extension of classical rule based systems. Fuzzy rules are linguistic IF-THEN constructions that have the general form “IF A, THEN B”, where A and B are propositions containing linguistic variables. In effect, the use of linguistic variables and fuzzy IF-THEN rules exploits the tolerance for imprecision and uncertainty. In this respect, fuzzy logic mimics the crucial ability of the human mind to summarize data and focus on decision-relevant information.
A fuzzy rule based system consists of four major modules: fuzzification, inference engine, knowledge base and defuzzification [18]. The fuzzification module transforms the crisp input(s) into fuzzy values. These values are then processed in the fuzzy domain by the inference engine, based on the knowledge base supplied by the domain expert(s). The knowledge base is composed of the Rule Base (RB), which characterizes the control goals and control policy of the domain expert by a set of linguistic control rules, and of the Data Base (DB), which contains the term sets and the membership functions defining their semantics. Finally, the processed output is transformed from the fuzzy domain to the crisp domain by the defuzzification module.
The structure of a rule base can be stated as follows:
Ri: if X1 is Ai1 … Xn is Ain then Y is Bj
where Ain and Bj are fuzzy sets defined on the input and output domains respectively. X1, …, Xn and Y are the input and output linguistic variables, respectively, and Ai1, …, Ain and Bj are linguistic labels, each of which has an associated fuzzy set defining its meaning.
Figure 4.1 represents the yearly distribution of accidents that happened in Coimbatore. In the year 2013, 207 accidents occurred; in the year 2014, 229 accidents occurred; and in the year 2015, 218 accidents occurred.
Figure 4.2 represents the monthly distribution of accidents that happened in the various clusters.
Figure 4.3 represents the rate of accidents that occurred in the various locations of Coimbatore.
Figure 4.4, Figure 4.5 and Figure 4.6 represent the comparison of performance metrics of different classification algorithms.
Figure 4.7 and Figure 4.8 represent the comparison of the prediction results and the location-wise prediction results respectively.
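The four modules of a fuzzy rule based system described in the Prediction section (fuzzification, knowledge base, inference engine, defuzzification) can be sketched in a few lines. This is only an illustration under stated assumptions, not the model built in the paper: the input variables `traffic_volume` and `past_accident_rate`, the membership breakpoints 0.2 and 0.8, the four rules and their output levels are all hypothetical. The sketch also simplifies the rule consequents to crisp levels and defuzzifies by weighted average (a zero-order Sugeno-style scheme), because the paper does not state which inference and defuzzification methods it uses.

```python
def falling(x, lo, hi):
    """Membership that is 1 below lo, 0 above hi and linear in between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)


def fuzzify(x, lo=0.2, hi=0.8):
    """Fuzzification: map a crisp input in [0, 1] to degrees for the labels 'low' and 'high'."""
    low = falling(x, lo, hi)
    return {"low": low, "high": 1.0 - low}


# Knowledge base (hypothetical): each rule pairs the antecedent label of every
# input variable with a crisp output level for the accident risk.
RULES = [
    ({"traffic_volume": "low", "past_accident_rate": "low"}, 0.1),
    ({"traffic_volume": "high", "past_accident_rate": "low"}, 0.4),
    ({"traffic_volume": "low", "past_accident_rate": "high"}, 0.6),
    ({"traffic_volume": "high", "past_accident_rate": "high"}, 0.9),
]


def predict_risk(inputs):
    """Run inference and defuzzification for crisp inputs {variable: value in [0, 1]}."""
    degrees = {name: fuzzify(value) for name, value in inputs.items()}
    weighted_sum = 0.0
    weight_total = 0.0
    for antecedent, output_level in RULES:
        # Firing strength: the AND of the antecedent clauses, taken as the minimum degree.
        strength = min(degrees[var][label] for var, label in antecedent.items())
        weighted_sum += strength * output_level
        weight_total += strength
    # Defuzzification: weighted average of the rule outputs.
    return weighted_sum / weight_total if weight_total else 0.0


print(predict_risk({"traffic_volume": 0.9, "past_accident_rate": 0.7}))
```

Keeping the rules in a plain list mirrors the split between rule base and data base in the text: the rules could be edited by a domain expert without touching the membership functions.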
Figure 4.1. Yearly Distribution of Accident rate
Figure 4.2. Month wise Accident rate in various clusters
Figure 4.3. Location wise accident rate
Figure 4.4. Performance Evaluation of Classifier in terms of accuracy
Figure 4.5. Performance Evaluation of Classifier in terms of Precision
Figure 4.6. Performance Evaluation of Classifier in terms of Recall
Figure 4.7. Comparison of prediction results
Figure 4.8. Location wise prediction results

CONCLUSION AND FUTURE WORK
In this paper, traffic accident data of Coimbatore is collected and cleaned in order to use it to test the predictive model. The endeavour of this paper is to spot the factors behind an accident and the severity of accidents. The assessment of the classification model showed that the Naive Bayes algorithm outperforms the other algorithms, with an accuracy of 92.45%. In contrast with the previously published work of authors, which focused on driver characteristics and dietary habits, this paper focused on the contribution of various road-related factors, such as the role of the environment, the place where the accident occurred and the cause of the accident, that have an impact on the accident severity. The results of this study could be used by the respective authorities to promote road safety and create awareness about risk factors. Thus, this work could have a tremendous impact on the well-being of Coimbatore civilians. A predictive model is also constructed in order to predict the probability of accident occurrence, which helps Coimbatore civilians to be aware of the accident-prone zones in advance.
REFERENCES:
[1] Abellan J, López G, and De Oña J, “Analysis of traffic accident severity using Decision Rules via Decision Trees”, Expert Systems with Applications, 40, 6047–6054, 2013.
[2] Addi, Ait-Mlouk et al. “An approach based on association rules mining to improve road safety in Morocco”, International Conference on Information Technology for Organizations Development (IT4OD), 2016.
[3] Barai S, “Data mining application in transportation engineering”, Transport 18:216–223, 2003.
[4] Beshah, Tibebe and Shawndra Hill. “Mining Road Traffic Accident Data to Improve Safety: Role of Road-Related Factors on Accident Severity in Ethiopia”, AAAI Spring Symposium: Artificial Intelligence for Development, 2010.
[5] Chen W, Jovanis P, “Method for identifying factors contributing to driver-injury severity in traffic crashes”, Transp Res Rec, 2000.
[6] Depaire B, Wets G, Vanhoof K, “Traffic accident segmentation by means of latent class clustering”, Accid Anal Prev 40:1257–1266, 2008.
[7] František Babič, Karin Zuskáčová, “Descriptive and Predictive Mining on Road Accidents Data”, IEEE 14th International Symposium on Applied Machine Intelligence and Informatics, January 21-23, 2016.
[8] Geurts K, Wets G, Brijs T, Vanhoof K, “Profiling of high frequency accident locations by use of association rules”, Transp Res Rec, 2003.
[9] Han J, Kamber M, “Data mining: concepts and techniques”, Morgan Kaufmann Publishers, Burlington, 2001.
[10] Kashani T, Mohaymany AS, Rajbari A, “A data mining approach to identify key factors of traffic injury severity”, Promet-Traffic Transp 23:11–17, 2011.
[11] Kwon OH, Rhee W, Yoon Y, “Application of classification algorithms for analysis of road safety risk factor dependencies”, Accid Anal Prev 75:1–15, 2015.
[12] Lee C, Saccomanno F, Hellinga B, “Analysis of crash precursors on instrumented freeways”, Transp Res Rec, 2002.
[13] Matsatsinis N, “A fuzzy decision aiding method for the assessment of corporate bankruptcy”, Fuzzy economic review, vol. 8, 2003.
[14] Preeti Mulay and Selam Mulat, “What You Eat Matters Road Safety: A Data Mining Approach”, Indian Journal of Science and Technology, Vol 9(15), 2016.
[15] Sachin Kumar, Durga Toshniwal, “A data mining approach to characterize road accident locations”, Springer Journal Vol. 24(1):62-72, 2016.
[16] Tan PN, Steinbach M, Kumar V, “Introduction to data mining”, Pearson Addison-Wesley, Boston, 2006.
[17] Tesema TB, Abraham A, Grosan C, “Rule mining and classification of road accidents using adaptive regression trees”, Int J Simul 6:80–94, 2005.
Submitted: September 10, 2017.
Accepted: November 30, 2017.
ABOUT THE AUTHORS
Janani G holds an M. Tech in IT from Anna University and is an assistant professor in the Department of IT. Her main area of interest is the study of data mining and analytics. She has presented papers at conferences and published papers in various journals. She has taught Grid and Cloud Computing and Problem Solving and Python Programming.
Ramya Devi N holds an M. Tech in IT from Anna University and is an assistant professor in the Department of IT. Her main area of interest is the study of data mining and networking. She has presented papers at conferences and published papers in various journals. She has taught Web Programming and Computer Architecture.