

Integrating Large Language Models for Severity Classification in Traffic Incident Management: A Machine Learning Approach*
1st Artur Grigorev
Faculty of Engineering and IT
University of Technology Sydney
Sydney, Australia
ORCID: 0000-0001-6875-3568

2nd Khaled Saleh
Faculty of Engineering and IT
University of Newcastle
Newcastle, Australia

3rd Yuming Ou
Faculty of Engineering and IT
University of Technology Sydney
Sydney, Australia

4th Adriana-Simona Mihăiţă
Faculty of Engineering and IT
University of Technology Sydney
Sydney, Australia
ORCID: 0000-0001-7670-5777

arXiv:2403.13547v1 [cs.LG] 20 Mar 2024

I. ABSTRACT
This study evaluates the impact of large language models on enhancing machine learning processes for managing traffic
incidents. It examines the extent to which features generated by modern language models improve or match the accuracy
of predictions when classifying the severity of incidents using accident reports. Multiple comparisons are performed between
combinations of language models and machine learning algorithms, including Gradient Boosted Decision Trees, Random
Forests, and Extreme Gradient Boosting. Our research uses both conventional and language model-derived features from texts
and incident reports, and their combinations to perform severity classification. Incorporating features from language models
with those directly obtained from incident reports has been shown to improve, or at least match, the performance of machine learning
techniques in assigning severity levels to incidents, particularly when employing Random Forests and Extreme Gradient Boosting
methods. This comparison was quantified using the F1-score over uniformly sampled data sets to obtain balanced severity
classes. The primary contribution of this research is in the demonstration of how Large Language Models can be integrated
into machine learning workflows for incident management, thereby simplifying feature extraction from unstructured text and
enhancing or matching the precision of severity predictions using a conventional machine learning pipeline. The engineering
application of this research is illustrated through the effective use of these language processing models to refine the modelling
process for incident severity classification. This work provides significant insights into the application of language processing
capabilities in combination with traditional data for improving machine learning pipelines in the context of classifying incident
severity.

Keywords: traffic accident, incident severity classification, machine learning, traffic management, large language models
II. INTRODUCTION
The rise in vehicular traffic over the past few decades has led to a corresponding increase in traffic accidents, with over five
million reported in the United States in 2013 alone according to the National Highway Traffic Safety Administration (NHTSA)
[1]. This surge underscores the need for effective Traffic Incident Management Systems (TIMS) capable of handling complex
datasets involving accident details, traffic conditions, and environmental factors.
A critical aspect of TIMS is the ability to classify the traffic accident severity accurately, which is essential in determining
the resources required for response - including team size, equipment, and traffic control measures [2]. However, classifying
accident severity poses significant challenges due to the stochastic nature of traffic accidents [3]. Therefore, further
research into more efficient models is needed.
Large Language Models (LLMs), with their capability to comprehend and process unstructured textual data from accident
reports (accident descriptions), offer a significant opportunity to augment conventional machine learning approaches.
Traditionally, these approaches have been applied to structured, tabular data, which makes prediction models non-transferable
because accident report formats and the structure of accident report data sets differ. Their performance can potentially be
enriched by features extracted from incident descriptions using LLMs.
Objectives:
1. Our primary goal is to investigate the potential of LLMs in feature extraction from textual accident reports. By ’feature
extraction’, we refer to the process of selecting and encoding information from raw accident report data to represent the
properties of an accident.
2. Furthermore, we aim to evaluate whether the extracted features, when used together with traditional accident reports, can
advance or at least match the performance of the traditional feature engineering pipeline in classifying accident severity. The
feature engineering pipeline in a machine learning process is known to involve multiple complex steps, which can be avoided by
using a full-text representation of incident reports with an LLM (see Figure 1). For example, a language model can be used to
encode the textual representation of a date field, instead of performing a complex set of parsing steps to represent its values
in numerical format. The use of textual accident descriptions was also found to improve the performance of traffic
incident duration prediction models when incorporated with the original feature set, using a simpler LSTM model for
feature extraction from the accident narrative [4]. The full-text representation allows for seamless combination of the traffic
accident description with other relevant accident report variables. Removing the other feature engineering steps also reduces
the effort required for data preparation. Therefore, we seek to determine whether LLM-extracted features can match manual
feature engineering within a traditional machine learning pipeline and, when combined with conventional structured data, whether
they can enhance the performance of traditional machine learning models (like Random Forest, XGBoost, and LightGBM) in
classifying traffic incident severity.
3. The proposed methodology simplifies traditional feature engineering by using LLMs to encode full-text data representations
into numerical features for machine learning models. The framework integrates LLM-extracted features with traditional machine
learning techniques. Because the feature engineering procedure is inherently complex, the manual procedure may be more efficient
than the use of an LLM: all of these steps are expected to be performed virtually by the large language model, which places a
high demand on the model’s capabilities. A difference in performance is therefore expected, but it is unknown whether the LLM
approach will provide acceptable, subpar or superior performance.
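The date-field example from objective 2 can be illustrated with a minimal Python sketch. The field name Start_Time appears in Figure 2, but the timestamp value, its format string, and both helper functions are assumptions introduced here for illustration; they are not taken from the study's pipeline.

```python
from datetime import datetime

# Hypothetical raw value; the timestamp format of a real data set may differ.
raw_start_time = "2019-06-12 08:15:00"

def parse_date_features(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Classic approach: parse the string and emit numeric date parts."""
    ts = datetime.strptime(value, fmt)
    return {"year": ts.year, "month": ts.month, "day": ts.day,
            "hour": ts.hour, "minute": ts.minute}

def date_as_text(name, value):
    """Full-text approach: keep the raw string for the language model to encode."""
    return f"{name} {value}"

classic = parse_date_features(raw_start_time)
textual = date_as_text("Start_Time", raw_start_time)
```

The classic route needs a format string per data source, whereas the textual route passes the value through unchanged and leaves its interpretation to the language model.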
Challenges in incident management: a) Solving prediction tasks in relation to traffic incidents is notoriously difficult due
to their stochastic nature: accidents can happen anywhere in time and space, under various external conditions (weather, events,
road closures, other cascading accidents in the network), b) The complexity of the task is amplified by integrating various
forms of non-numerical data, such as textual incident description. Often, this data is in unstructured text form, making it more
challenging to include in predictive models, c) Various traffic departments have different ways of storing the incident data logs
and the format can vary significantly from one city to another, one country to another; therefore, adopting a single AI model
that is efficient across multiple data sets represents a high challenge; one needs to tailor, adapt, and construct various hybrid
models that can meet the challenges of each urban set-up; this can lead to a waste of computational resources.
Cross-domain challenges and applicability of our research: The classification of traffic incident severity holds relevance
and applicability across a diverse array of fields, extending well beyond the area of traffic management. This positions it
as a concept with applicability to other domains. For instance, language models were applied to the field of construction
injury precursors [5] and construction accidents classification [6]. The outcomes of this research can be applied to any area
of transportation, including classification of water transportation accidents [7]. Also, in healthcare, this approach could be
utilized to classify patient needs based on severity; in public safety, to determine the urgency of response to public complaints;
or in customer service, to rank customer issues based on incoming queries. Ultimately, our goal is to streamline the data
preparation process using LLMs, potentially reducing costs for traffic management organizations through time and labour
savings in data-processing and improved prediction accuracy.
[Figure 1: two pipelines that turn an accident report (table data) into a numerical feature representation to be used with
machine learning models. Classic approach (manual/automated feature engineering): label encoding (categorical values), data
imputation (missing values), feature transformation (scaling, distribution transformation), creating polynomial features, date
parsing (extract day, month, year, hour, minutes), embedding of textual fields (accident description), feature selection
(attention to specific variables). Novel approach (feature embedding through LLM): full-text representation, then LLM embedding
(feature extraction from textual representation).]

Fig. 1. The advantage of using LLM models for translating unstructured text data into meaningful features for machine learning models

Our contributions are threefold: First, we provide an extensive comparison of combinations of various machine learning
models and large language models (used for feature extraction) for the task of incident severity classification. The aim here
is to identify the optimal pairing for the most accurate prediction. Secondly, we perform a comparison of the traditional feature
engineering pipeline and LLM-based feature engineering processes. Thirdly, we investigate how a combination of baseline
features (accident reports converted to numerical representation through the process of manual feature engineering) with LLM-
extracted features can enhance the accident severity classification accuracy. Finally, our proposed method has cross-domain
application potential, especially in other areas that involve predicting event outcomes based on unstructured textual data or
features converted to textual representation.
III. RELATED WORKS
Traffic incident severity prediction is an area of extensive research, with numerous investigations employing machine learning
models for accurate forecasting [8]. These studies often process structured data sourced from Traffic Management Centres
(TMC) to apply predictive models. For instance, the k-nearest neighbors algorithm and Bayesian networks have been used
by some researchers to predict traffic accident severity [9]. More recent studies have seen the use of ensemble methods like
Random Forests and XGBoost, demonstrating their efficacy in this application [10].
Text mining, particularly with the assistance of Large Language Models (LLMs) such as ChatGPT, has demonstrated its effectiveness
in various Natural Language Processing (NLP) tasks like summarization, information extraction, and text classification. These
techniques can be applied to address different problems in the field of Intelligent Transportation Systems (ITS). One such
problem is the analysis of unstructured crash descriptions, which may be crucial for documenting crash information. Generating
a narrative summary containing details about the sequence of events, human behavior, and crash outcomes is also
essential. LLM models like ChatGPT prove to be efficient in generating clear, professional, and easily understandable crash
summaries, assisting in evaluating the nature of the crash [11]. Another application lies in crash news mining. LLM models
possess the ability to accurately extract relevant crash information from news articles and present it in a tabular form, ready
for processing using machine learning algorithms.
Limitations: This research includes a novel approach that goes beyond existing methodologies that rely solely on machine
learning models or use a single LLM model for accident severity classification [12], [13], [14], [8] by considering multiple
Large Language Models in combination with a diverse set of predictive machine learning models. A similar multiple model
approach (including both LLM and ML models) was utilized to classify accident narratives against multiple classes of outcomes
[15], but without using the entire table of accident report parameters as we do in the current research. The use of machine
learning models offers the benefits of rapid training and evaluation (e.g. the XGBoost method running on a
Graphical Processing Unit was shown to classify 10 million rows of 100 features each in under 45 seconds [16]) and minimal
memory requirements, which make possible a rich subsequent analysis including word importance estimation (e.g. using
Shapley values). The key challenge in computing Shapley values lies in the number of subsets of features to be evaluated [17].
For a model with $d$ features, there are $2^d$ possible subsets, since each feature can either be included or excluded from a subset.
Therefore, the computational complexity for calculating the Shapley value of just one feature is exponential in the number of
features, $O(2^d)$. For all $d$ features, this remains exponentially expensive. The use of fast machine learning models to process
feature vectors produced by LLM makes word importance analysis at least feasible.
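The exponential cost can be seen in a minimal Python sketch of exact Shapley value computation, which enumerates every subset of the remaining features for each feature. The additive toy value function and its weights are assumptions made for illustration; in practice the value function would be the output of a trained model restricted to a feature subset, and approximate estimators are normally used instead of full enumeration.

```python
from itertools import combinations
from math import factorial

def exact_shapley(value, d):
    """Exact Shapley values for a set function `value` over features 0..d-1.

    For each feature, all 2**(d-1) subsets of the other features are visited.
    """
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical additive value function: each feature contributes a fixed amount.
toy_weights = [1.0, 2.0, 3.0]

def toy_value(subset):
    return sum(toy_weights[j] for j in subset)

phi = exact_shapley(toy_value, len(toy_weights))
```

For an additive value function each feature's Shapley value equals its own contribution, which gives a simple sanity check; the number of subsets visited per feature doubles with every added feature.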
Gap 1: Overall, there appears to be a lack of comprehensive studies comparing traditional feature engineering methods with
those based on LLMs. Understanding the strengths, weaknesses, and applicability of each approach in the context of traffic
incident severity prediction is crucial for developing more efficient data processing pipelines and accurate predictive models.
Gap 2: Also, there is a need for more extensive research to identify the most effective combinations of LLMs and ML
models, since the use of LLMs represents an area of rapidly advancing research [18]. Such studies are essential to objectively
assess the effectiveness of different methodologies and to provide clear guidance for practitioners in the field. This includes
exploring different types of LLMs and ML algorithms to determine the optimal pairing for various contexts and datasets.
Gap 3: Finally, there’s a gap in research on methods and strategies to reduce computational complexity and enhance
scalability, especially when dealing with large datasets and feature sets. For example, one of the latest data sets on traffic
accidents contains 1.5 million records with 48 features each with accident narrative included [19].
One study utilized advanced language models, specifically fine-tuned Bidirectional Encoder Representations
from Transformers (BERT), to classify traffic injury types using a large dataset of over 750,000 crash narrative reports. The
models achieved a high predictive accuracy of 84.2% and demonstrated their effectiveness in classifying crash injury types [20].
Previous studies using basic natural language processing (NLP) tools have limitations in handling complex sentence structures
and text ambiguity, whereas advanced language models like BERT address these limitations.
Another study regarding construction accident classification [21] explores the application of machine learning algorithms in
efficiently categorizing accident narratives, using accident reports as the dataset. The researchers evaluated the performance of
Support Vector Machine (SVM) [22], K-Nearest Neighbors (KNN) [23], Decision Tree (DT) [24] and other machine learning
algorithms on a dataset of 1000 construction accident narratives from the US OSHA website, with the use of unigram
tokenization for coding accident narratives. The results showed that SVM performed the best in classifying a test set of
251 cases, with linear SVM and RBF SVM using unigram tokenization being the most effective classifiers. This underlines the
importance of comparing the performance of multiple models.
Multiple studies utilize BERT word embeddings to represent textual data from incident reports [12], [13],
[25], [20] and then apply various regressors to predict traffic incident duration. The BERT embeddings are commonly
fed into models like XGBoost, RandomForest, and Support-Vector models to perform the prediction task. To evaluate the
effectiveness of the approach, various comparisons are made with the state-of-the-art LDA representation. LDA topic modelling
is commonly used for representing textual data, but the results show that the BERT-LSTM hybrid model outperforms it in
terms of mean absolute error (MAE), which indicates that the contextual understanding provided by BERT embeddings leads
to more accurate predictions of traffic incident duration. The BERT model was also combined with the Recurrent Convolutional
Neural Network (RCNN) model for fine-tuning [13].
The reviewed studies showcase the diverse applications of language models, including the analysis of unstructured crash
descriptions, the generation of narrative summaries, and the classification of traffic severity and injury types. The findings indicate
that the integration of advanced language models and machine learning algorithms enhances the accuracy and effectiveness of
these tasks.
IV. METHODOLOGY
Our study explores how the combination of LLM and ML models together with full-text representation can be utilized to
benefit traffic accident modelling: 1) the inclusion of unstructured data from various international sources can enhance the
predictive models, potentially improving the accuracy of traffic incident classification predictions in general, 2) The use of
full-text representation can streamline the process of data preparation without compromising prediction accuracy.
The full-text representation is an approach that represents both the tabular data from accident reports and the accident narrative as
a single text string, which can then be used to perform feature extraction using various LLM models (see Figure 2): accident
report values, including the accident narrative, are combined with their corresponding names. Tokenization is performed before
the feature extraction. It is a fundamental step that involves dividing text into smaller units, known as tokens, to facilitate
processing of the text by the LLM model. These tokens can be words (e.g. numbers or abbreviations) or parts of words (e.g.
parts of a highway index: I-81 is split into ’I’, ’-’ and ’81’).
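The splitting of a highway index into several tokens can be illustrated with a minimal Python sketch. The regular-expression splitter below is a simplified stand-in introduced here for illustration only; it is not the subword tokenizer of BERT or of any other model considered in this study, each of which ships its own vocabulary and splitting rules.

```python
import re

def simple_tokenize(text):
    """Split text into runs of letters, runs of digits, and single punctuation marks.

    Simplified stand-in for a model tokenizer (assumption, for illustration only).
    """
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)

tokens = simple_tokenize("Entry ramp to I-81 Southbound")
```

With this splitter the highway index I-81 yields three separate tokens, matching the behaviour described above.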
These extracted features (numerical representations of textual accident reports) can then be used with machine learning
models to perform traffic accident classification. As shown previously (see Figure 1), we expect LLM models to be able to
intrinsically perform various data representation tasks, including label encoding, creating polynomial features (owing to the
reliance on convolutional layers in the architecture), handling missing data (enabled by the intrinsic flexibility of the text
representation), performing feature selection (by utilizing the attention mechanism) and keyword extraction from the narrative (a
frequently utilized approach in machine learning that creates binary variables based on word presence in the text [26]).
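The keyword-based approach mentioned above, in which binary variables are created from word presence in the narrative, can be sketched in a few lines of Python. The keyword list and the has_ naming scheme are hypothetical choices made for this example; the narrative is the one shown in Figure 2.

```python
import re

def keyword_features(narrative, keywords):
    """Return one binary variable per keyword: 1 if the word occurs in the narrative."""
    words = set(re.findall(r"[a-z0-9]+", narrative.lower()))
    return {f"has_{kw}": int(kw in words) for kw in keywords}

narrative = "Entry ramp to I-81 Southbound from 7th North St closed due to stalled truck."
# Hypothetical keyword list; a real pipeline would choose it per data set.
features = keyword_features(narrative, ["closed", "truck", "fire"])
```

This is the manual step that an LLM embedding of the narrative is expected to make unnecessary.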

[Figure 2: the accident report features (Description, Distance(mi), Street, Temperature(F), Lane Number, Start_Time, ...) and the
accident description (‘Entry ramp to I-81 Southbound from 7th North St closed due to stalled truck.’) are combined into a
full-text representation, tokenized into tokens [0..N] and passed to the LLM model, which produces a feature vector (up to 1024
units) used by the classification model to predict the accident severity class. Example of a full-text representation:
Accident ID A-2760450, Source Source2, Start Latitude 43.090641, Start Longitude -76.168594, Accident extent (miles) 0.49,
Description Entry ramp to I-81 Southbound from 7th North St closed due to stalled truck., Street 7th North St, City Liverpool,
County Onondaga, State NY, ZipCode 13088, Timezone US/Eastern, Airport Code KSYR, Temperature (F) 62.1, Humidity (%) 72.0,
Pressure (inch) 29.86, Visibility (miles) 10.0, Wind Direction WNW, Wind Speed (mph) 15.0, Weather Condition Overcast, Traffic
Signal, Sunrise/Sunset Day]

Fig. 2. Diagram of the full-text representation and feature extraction

BERT (Bidirectional Encoder Representations from Transformers) [27] utilizes a training method known as Masked Language
Modeling (MLM). Its notable features include the introduction of bidirectional context, allowing the model to better understand
the semantics of each word. Additionally, BERT introduced a distinct pretraining and fine-tuning methodology that has been
widely used for incident analysis [21], [12], [13].
BERT-large [27] also employs Masked Language Modeling (MLM) as its training method. As an enlarged version of BERT,
it maintains the core principles but scales up the architecture to handle more complex tasks and deliver better performance.
XLNet [28] uses a Generalized Autoregressive Pretraining method. It addresses some limitations of BERT and incorporates
features of Transformer-XL, thus allowing for better handling of long-term dependencies in the text.
XLNet-large [28] is a scaled-up version of XLNet, and it also employs Generalized Autoregressive Pretraining. It brings the
advantages of XLNet into a larger and more powerful architecture, suitable for more complicated tasks.
RoBERTa (Robustly Optimized BERT) [29] aims to improve upon BERT by optimizing its training process. It employs a
modified form of MLM as its training method, removes the next sentence prediction task, and introduces dynamic masking
for better performance.
ALBERT (“A Lite BERT”) [30] is a self-supervised transformer-based NLP model that utilizes Masked Language Modeling
and Sentence Ordering Prediction for pretraining, and a shared-layer architecture for a reduced memory footprint. The latest
version (2.0) improves upon its predecessor with adjustments like lower dropout rates and additional training data for enhanced
performance in downstream tasks. The largest configuration, ALBERT-xxlarge version 2, has been selected as the large model
in this study.
Including variations of BERT in the analysis is a design choice due to the expectation of higher performance from these
models. BERT and its variants are designed to capture complex language nuances and contextual relationships within text,
which often result in better performance on tasks involving natural language understanding. Even if current datasets do not
show substantial differences between language models, it remains a reasonable approach to test BERT variations, as they have
the potential to yield superior results, especially with richer narrative content in the data.
The model summary table (see Table I) provides a synthesized overview of several prominent NLP models: their technical
aspects and functionalities, number of parameters, vector size, primary features, and relevance to traffic
incident severity classification. This comparative analysis helps to understand the capabilities of each model when dealing with
incident report texts, thereby informing our selection of models for the study. The descriptions of each model’s characteristics
highlight their suitability for processing and analyzing text from traffic incident reports.
Model (Reference) | Num. of Parameters | Vector Size | Notable Features | Relevance for the Study
--- | --- | --- | --- | ---
BERT [27] | 110 mil | 768 | Masked Language Modeling (MLM), Bidirectional context, Pretrain-finetune discrepancy. Wide application in similar NLP tasks and versatility in fine-tuning. | Effective contextual understanding of text. Widely used for incident analysis.
BERT-large [27] | 345 mil | 1024 | Masked Language Modeling (MLM), Bidirectional context, Pretrain-finetune discrepancy. Capability to process complex tasks and large text sequences. Expected higher performance due to larger architecture. | Enlarged version with improved contextual capturing.
XLNet [28] | 110 mil | 768 | Generalized Autoregressive Pretraining, Overcomes BERT limitations, Transformer-XL integration. Superior handling of sequence prediction and permutation-based training. | Addresses BERT’s limitations and incorporates long-term dependencies in text.
XLNet-large [28] | 340 mil | 1024 | Generalized Autoregressive Pretraining, Overcomes BERT limitations, Transformer-XL integration. Capability to process complex tasks and large text sequences. Expected higher performance due to larger architecture. | Enlarged version with improved contextual capturing.
RoBERTa [29] | 125 mil | 768 | Optimized BERT (MLM with changes), Longer training, Removed next sentence prediction, Dynamic masking. Better performance due to optimized training and handling of more data during pretraining. | Improved training process for BERT-like models.
RoBERTa-large [29] | 355 mil | 1024 | Optimized BERT (MLM with changes), Longer training, Removed next sentence prediction, Dynamic masking. Capability to process complex tasks and large text sequences. Expected higher performance due to larger architecture. | Enlarged version with improved contextual capturing.
ALBERT [30] | 18.2 mil | 768 | Optimized BERT (MLM with changes), Sentence Ordering Prediction, Layer-Sharing Architecture, Reduced Memory Footprint. Implementation efficiencies and advancements over BERT with comparable abilities in context understanding. | Reduces BERT’s memory footprint with shared-layer architecture and efficient pretraining tasks.
ALBERT-large [30] (ALBERT-xxlarge) | 223 mil | 4096 | Optimized BERT (MLM with changes), Sentence Ordering Prediction, Layer-Sharing Architecture, Reduced Memory Footprint. Capability to process complex tasks and large text sequences. Expected higher performance due to larger architecture. | Enlarged version with improved contextual capturing.

TABLE I
COMPREHENSIVE OVERVIEW OF NLP MODELS, THEIR FEATURES, AND RELEVANCE

We perform traffic accident severity classification using various machine learning models:
XGBoost [31] is an advanced gradient boosting algorithm that boosts weak learners within a gradient descent
framework. It features several advanced techniques such as regularization (L1 and L2), which prevents overfitting
and improves model generalization. XGBoost also supports various objective functions including regression, classification, and
ranking.
LightGBM [32], standing for Light Gradient Boosting Machine, is a gradient boosting framework that uses tree-based
learning algorithms. Its main advantage lies in its use of Gradient-based One-Side Sampling (GOSS) and Exclusive Feature
Bundling (EFB), which collectively reduce the data size and feature space without compromising accuracy. LightGBM is
designed for distributed and efficient training, particularly on large datasets, and supports categorical features natively.
RandomForest (RF) [33] aggregates the predictions of multiple decision trees to improve predictive accuracy and control
overfitting. Each tree in the RandomForest is built from a sample drawn with replacement (bootstrap sample) from the training
set. Moreover, when splitting a node during the construction of a tree, the split that is chosen is no longer the best split among
all features. Instead, the split that is picked is the best split among a random subset of the features. This produces a wide
diversity among the trees, which generally results in a better model.
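The two sources of randomness described above, bootstrap sampling of rows and a random subset of features at each split, can be sketched in plain Python. The square-root subset size is a common default that is assumed here and not stated in the text, and both function names are hypothetical; the sketch shows only the sampling steps, not tree construction.

```python
import math
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, rng, k=None):
    """Pick k distinct candidate features for one split (default: floor(sqrt(n_features)))."""
    if k is None:
        k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = bootstrap_indices(10, rng)
candidate_features = random_feature_subset(16, rng)
```

Each tree would be grown on its own bootstrap sample, and a fresh feature subset would be drawn at every node before the best split within that subset is chosen.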
K-Nearest Neighbors (KNN) [23] operates by finding the predefined number k of training samples closest in distance to the
new point and predicting the label from these. The distance can be any metric measure: standard Euclidean distance is the most
common choice. KNN has the advantage of being simple to interpret and having little to no assumption about the data.
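The KNN procedure described above can be written as a minimal Python sketch using Euclidean distance and a majority vote. The two-dimensional points and the severity labels 0 and 1 are made-up toy data, and the function name is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    dists = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups with hypothetical severity labels 0 and 1.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

low = knn_predict(train_X, train_y, (0.5, 0.5))
high = knn_predict(train_X, train_y, (5.5, 5.5))
```

Every prediction scans all training points, which is why the cost of this method grows with the number of samples.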
Machine learning models have various specifics in application and require performance considerations (see Table II).
Model | Description | Specifics | Performance
--- | --- | --- | ---
XGBoost | Gradient boosting utilizing L1/L2 regularization | Exhaustive gradient-based approach | Fast training with exhaustive search; iteratively refines model
LightGBM | Gradient boosting employing GOSS and EFB | Approximate gradient-based approach | Faster than XGBoost, ideal for large datasets
RandomForest | Ensemble of decision trees built on bootstrapped samples | Random feature/rows subset selection for training and averaging of predictions | Robust but relatively slower than gradient boosting methods
K-Nearest Neighbors (KNN) | Instance-based method using distance metrics | Performance significantly depends on hyper-parameters - distance metric and k need to be appropriately chosen | Can be computationally expensive but easy interpretation via neighborhood concept

TABLE II
COMPARISON OF SELECTED MACHINE LEARNING MODELS

Overall, XGBoost and RandomForest can be computationally expensive when used with high-dimensional vectors provided
by language models. KNN performance significantly decreases with increasing number of samples. There is a need for
computationally effective methods that can be used with LLM features to perform classification and regression tasks. LightGBM
may provide an effective solution for these tasks. Moreover, the use of tree-based methods can be more practical due to low
dependence on hyper-parameters, in comparison to artificial neural networks where number of units, types of units, number
of layers and training method have a high impact on final result, which requires a lot of trial-and-error attempts in network
design [34], [35]. Tree-based methods provide a streamlined approach to utilizing language model features.
V. CASE STUDY & EXPERIMENT SETUP
The three datasets, originating from the United States, the United Kingdom, and Queensland (Australia), offer varied but
rich contextual information about traffic accidents (see Table III). The U.S. dataset consists of 31 fields (reduced from 49 due
to zero variance in columns) and has emphasis on environmental and lighting conditions, such as weather and astronomical
twilight state, which could be crucial for understanding the impact of these factors on accidents. It also includes latitude and
longitude for both the start and end points of an accident (as sourced from MapQuest and Bing services), as well as a detailed
description of the accident itself. In contrast, the UK dataset comprises 34 fields with a focus on infrastructure details like
pedestrian crossings and local authority information, which can be valuable for urban planning and public policy analyses. The
Queensland dataset is the most comprehensive with 41 fields, providing hyper-local geographical information, including the
suburb and local government area, as well as specific roadway features and traffic control setups. This makes the Queensland
data potentially useful for localized analysis. While all three datasets contain some form of time and location information, they
vary significantly in the types of fields and the level of detail, suggesting that accident report unification may be required for
the transferability of any international accident model. Overall, each dataset has its own unique focus—environmental factors
and accident extent in the U.S., infrastructure in the UK, and detailed geographical and situational context in Queensland.
Following our methodology, the tabular representation of each accident report is converted into a full-text representation: all
columns are converted to text of the form “column name: column value” and joined into a single string.
Large Language Models are primarily optimized for natural language and may find it challenging to accurately interpret
short strings full of domain-specific abbreviations and numbers unless specifically fine-tuned for such tasks.
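The conversion of a tabular record into its full-text representation can be sketched as follows. This is a minimal sketch, not the authors' code: the function name `row_to_text` and the comma separator are assumptions (the separator mirrors the examples in Table III), and the record holds only a few fields taken from the UK example.

```python
def row_to_text(row):
    """Join every field of one accident record as 'column name: column value'."""
    return ", ".join(f"{name}: {value}" for name, value in row.items())

# Hypothetical record with a few fields from the UK example in Table III
record = {
    "accident year": 2018,
    "number of vehicles": 1,
    "speed limit": 60,
    "time": "11:35",
}

text = row_to_text(record)
print(text)
```

The resulting string is what would be passed to a language model for feature extraction, with no normalization or label encoding applied first.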
A. Data preparation
The preprocessing of data is a crucial step in developing our model. Initially, the sample distribution is imbalanced
across the severity classes. To resolve this, we use even sampling: the dataset is balanced by drawing the same number
of instances (e.g. 12,500 for the USA data set) from each unique class present in the ’Severity’ column. This
even sampling ensures a fair representation of each severity level in the data we analyze. Secondly, we identify and remove
zero-variance features from the dataset. These are columns that contain only a single unique value and are not beneficial for the
modeling process. In fact, some algorithms require the removal of zero-variance features for successful convergence. Finally,
we enhance the feature set by processing the time-based information where it is available. Specifically, we calculate and include
Description Example Crash Report
USA Data Set Accident ID A-7463401, Source Source1, Start Latitude 32.68116, Start Longitude -97.02426, End
Latitude 32.67618, End Longitude -97.03483, Accident extent (miles) 0.704, Description Ramp to
I-20 Westbound - Accident., Street President George Bush Tpke S, City Grand Prairie, County
Dallas, State TX, ZipCode 75052, Timezone US/Central, Airport Code KGPM, Temperature (F)
48.2, Humidity (%) 75.0, Pressure (inch) 30.26, Visibility (miles) 10.0, Wind Direction South, Wind
Speed (mph) 5.8, Weather Condition Mostly Cloudy, Junction, Sunrise/Sunset Night, Civil Twilight
Night, Nautical Twilight Night, Astronomical Twilight Night, Start Time hour 22, Start Time month
1, Weather Timestamp hour 22, Weather Timestamp month 11
UK Data Set accident index: 2018460317259, accident year: 2018, accident reference: 460317259,
location easting osgr: 556147.0, location northing osgr: 165830.0, longitude: 0.241871, latitude:
51.370065, police force: 46, number of vehicles: 1, number of casualties: 1, date: 08/08/2018,
day of week: 4, time: 11:35, local authority district: 538, local authority ons district: E07000111,
local authority highway: E10000016, first road class: 3, first road number: 20, road type: 6,
speed limit: 60, junction detail: 3, junction control: 4, second road class: 6, second road number: 0,
pedestrian crossing human control: 0, pedestrian crossing physical facilities: 0, light conditions: 1,
weather conditions: 1, road surface conditions: 1, special conditions at site: 0, carriageway hazards:
0, urban or rural area: 2, did police officer attend scene of accident: 1, trunk road flag: 2,
lsoa of accident location: E01024433
Queensland (Australia) Data Set Crash Ref Number: 28863.0, Crash Year: 2004.0, Crash Month: September, Crash Day Of Week:
Wednesday, Crash Hour: 6.0, Crash Nature: Angle, Crash Type: Multi-Vehicle, Crash Longitude:
152.872284325108, Crash Latitude: -27.5455985592659, Crash Street: Kangaroo Gully Rd,
Crash Street Intersecting: Mount Crosby Rd, State Road Name: Mount Crosby Road,
Loc Suburb: Anstead, Loc Local Government Area: Brisbane City, Loc Post Code: 4070,
Loc Police Division: Indooroopilly, Loc Police District: North Brisbane, Loc Police Region:
Brisbane, Loc Queensland Transport Region: SEQ North, Loc Main Roads Region: Metropolitan,
Loc ABS Statistical Area 2: Pinjarra Hills - Pullenvale, Loc ABS Statistical Area 3:
Kenmore - Brookfield - Moggill, Loc ABS Statistical Area 4: Brisbane - West,
Loc ABS Remoteness: Major Cities, Loc State Electorate: Moggill, Loc Federal Electorate:
Ryan, Crash Controlling Authority: State-controlled, Crash Roadway Feature: Intersection
- T-Junction, Crash Traffic Control: No traffic control, Crash Speed Limit: 70 km/h,
Crash Road Surface Condition: Sealed - Dry, Crash Atmospheric Condition: Clear,
Crash Lighting Condition: Daylight, Crash Road Horiz Align: Curved - view open,
Crash Road Vert Align: Level, Crash DCA Code: 202.0, Crash DCA Description: Veh’S
Opposite Approach: Thru-Right, Crash DCA Group Description: Opposing vehicles turning,
DCA Key Approach Dir: E, Count Unit Car: 1.0, Count Unit Motorcycle Moped: 1.0,
Count Unit Truck: 0.0, Count Unit Bus: 0.0, Count Unit Bicycle: 0.0, Count Unit Pedestrian:
0.0, Count Unit Other: 0.0
TABLE III
EXAMPLE OF FULL TEXT REPRESENTATIONS FOR DIFFERENT DATA SETS

the starting hour and the month of each accident. This addition aims to capture any temporal patterns that might exist in the
occurrence of accidents.
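The three preprocessing steps (even sampling, zero-variance removal, and time features) can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function names, the `Start_Time` and `Country` columns, the timestamp format, and the toy rows are all hypothetical, and sampling is assumed to be without replacement.

```python
import random
from collections import defaultdict
from datetime import datetime

def balance_by_class(rows, label_key, n_per_class, seed=0):
    """Sample the same number of rows from each class (even sampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    balanced = []
    for label in sorted(by_class):
        group = by_class[label]
        # Without replacement; a class smaller than n_per_class is kept whole
        balanced.extend(rng.sample(group, min(n_per_class, len(group))))
    return balanced

def drop_zero_variance(rows, label_key):
    """Remove columns that hold a single unique value across all rows."""
    columns = [c for c in rows[0] if c != label_key]
    constant = {c for c in columns if len({row[c] for row in rows}) <= 1}
    return [{k: v for k, v in row.items() if k not in constant} for row in rows]

def add_time_features(rows, time_key, fmt="%Y-%m-%d %H:%M:%S"):
    """Add the starting hour and month parsed from a timestamp string."""
    out = []
    for row in rows:
        ts = datetime.strptime(row[time_key], fmt)
        out.append({**row, "start_hour": ts.hour, "start_month": ts.month})
    return out

# Toy rows (hypothetical values)
rows = [
    {"Severity": 1, "Start_Time": "2019-01-05 22:10:00", "Country": "US", "Temp": 48.2},
    {"Severity": 1, "Start_Time": "2019-03-11 08:30:00", "Country": "US", "Temp": 51.0},
    {"Severity": 1, "Start_Time": "2019-07-21 17:45:00", "Country": "US", "Temp": 90.1},
    {"Severity": 2, "Start_Time": "2019-02-14 06:05:00", "Country": "US", "Temp": 33.4},
    {"Severity": 2, "Start_Time": "2019-11-30 23:59:00", "Country": "US", "Temp": 40.0},
]

balanced = balance_by_class(rows, "Severity", n_per_class=2)
prepared = add_time_features(drop_zero_variance(balanced, "Severity"), "Start_Time")
```

In this toy run the constant `Country` column is dropped and each severity class contributes two rows.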
Table IV outlines three scenarios for the task of severity classification of traffic accidents using LLMs.
• Severity Classification with Baseline Accident Report Features: In this scenario, the model is trained using only numerical
baseline accident report features obtained from tabular representation of accident reports. Objective: To assess how well
traditional, structured data performs in predicting the severity of traffic accidents.
• Severity Classification with NLP Features: This scenario focuses on training the model using features derived from fulltext
representation and LLM. Objective: To determine the performance of the proposed fulltext+LLM approach.
• Severity Classification with a Combination of Baseline and NLP Features: This scenario combines both baseline accident
report features and NLP-derived features for model training. Objective: To evaluate if a combination of structured and
unstructured data improves the model’s ability to predict the severity of an accident. All the features used in scenarios 1
and 2 are combined into a single feature set.
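The three feature sets in these scenarios can be sketched as a simple row-wise concatenation. This is a minimal sketch: the function name `build_feature_sets` and all values are hypothetical, and the short four-element vectors stand in for the much longer embeddings a language model would produce.

```python
def build_feature_sets(report_features, nlp_features):
    """Return the three feature sets compared in the scenarios, one row per report."""
    if len(report_features) != len(nlp_features):
        raise ValueError("both inputs must describe the same reports")
    # Scenario 3: append the language-model vector to the report features
    combined = [list(r) + list(n) for r, n in zip(report_features, nlp_features)]
    return {"report": report_features, "NLP": nlp_features, "report+NLP": combined}

# Hypothetical numeric report features for two accident reports
report = [[48.2, 75.0, 22], [51.0, 60.0, 8]]
# Stand-in for language-model embeddings of the same two reports
nlp = [[0.12, -0.40, 0.33, 0.08], [0.05, 0.21, -0.17, 0.90]]

feature_sets = build_feature_sets(report, nlp)
```

Each of the three sets can then be passed to the same classifier, so that differences in scores reflect the features rather than the model.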

VI. RESULTS
The following figures show the accident severity classification performance for different feature sets: report (report-only
features), NLP (features extracted from the full text representation of accident reports, as discussed previously), and report+NLP
(report-only features combined with features extracted by a language model). The values are cross-validation results for the
different models and feature sets. We assess the models on four metrics: Accuracy, F1-score, Precision, and Recall.
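The four metrics can be computed for one set of predictions as sketched below. The text does not state how per-class scores are averaged, so macro averaging is an assumption here, and the labels are hypothetical toy values.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Hypothetical severity labels for six reports
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
metrics = classification_metrics(y_true, y_pred)
```

In a cross-validation setting, the same function would be applied to each fold and the results averaged.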
Severity Classification | LightGBM | Random Forest | XGBoost | KNN
NLP Features (BERT, etc.) | ✓ | ✓ | ✓ | ✓
Baseline Features (report) | ✓ | ✓ | ✓ | ✓
Combined Features (report+NLP) | ✓ | ✓ | ✓ | ✓
TABLE IV
EXTENDED MATRIX OF EXPERIMENTS FOR SEVERITY CLASSIFICATION USING VARIOUS MODELS

The findings are as follows:
• F1-score is representative for the evaluation of the models and may be prioritized for interpretation and further evaluations
over accuracy, precision and recall.
• There is no apparent difference between the language models. This could be attributed to the limited narrative content
within the accident reports, suggesting that the models are primarily leveraging tabular data. If the textual data does not
contain much discriminative information (e.g. complex textual descriptions), language models may not have sufficient
context or features to significantly outperform each other, leading to uniform performance metrics.
• There is also a negligible difference between base and large variants of models (e.g. bert vs bert-large). Accident reports
are represented as small paragraphs of text, where the advantages of larger language models in interpreting language
features may not be relevant.
• The ability to use the text representation right away while achieving acceptable prediction performance, instead of feature
engineering (e.g. normalization of values, label encoding, functional feature transformations, date interpretation, etc.), is of
interest to traffic management authorities and data analysts in transportation.
• In general, the use of additional language features together with report features may improve the severity classification
performance. It may lead to the use of vectors of higher dimensionality, but tree ensemble models show sufficient
performance in this case.
These results suggest that machine learning models trained with a combination of LLM-features and report-features tend
to perform better at predicting traffic incident severity class. The combination of BERT and XGBoost specifically presents a
robust method in this task.
For the Queensland data set, the best performing model is the combination of report and language features (extracted using
GPT-2) with RandomForest, reaching an F1-score of 0.65 (see Figure 3).

[Figure 3: Heat map of average F1 score sorted by features. Rows (Model + Feature Set): K-Nearest Neighbors, Light GBM, Random Forest and XGBoost, each with NLP, report and report+NLP features. Columns (Language Model): albert, albert-large, bert, bert-large, roberta, roberta-large, xlnet, xlnet-large.]
Fig. 3. Random Forest for Queensland, Australia

In our analysis, certain combinations of features and models clearly outperform others. Notably, the highest F1-score
of 0.58 was achieved by a RandomForest model that uses both report and language features, the latter extracted with
the GPT-2 language model, as detailed in Figure 4. It is followed closely by the XGBoost model with BERT features,
which reached an F1-score of 0.56, also shown in Figure 4. The results did not show any significant performance
variation between different language models.

[Figure 4: Heat map of average F1 score sorted by features. Rows (Model + Feature Set): K-Nearest Neighbors, Light GBM, Random Forest and XGBoost, each with NLP, report and report+NLP features. Columns (Language Model): albert, albert-large, bert, bert-large, roberta, roberta-large, xlnet, xlnet-large.]
Fig. 4. Random Forest for UK

The USA data set responds rather differently to the use of language models than the two other data sets (Queensland
and UK), see Figure 5:
• The best performing model is the combination of report and language features (extracted using BERT) with XGBoost,
reaching an F1-score of 0.89. In comparison, the F1-score when using report features only is 0.82.
• The performance of NLP features from different models can be ordered in the following way: 1) BERT, 2) BERT-
large, 3) XLNet-large, 4) XLNet, 5) Roberta.
• Importantly, the BERT model can be used with the full text representation of accident reports right
away and achieves higher performance than using accident report features alone.
• XGBoost and RandomForest show the best performance and may be used based on model preference.
• The remaining models have simpler architectures and perform worse than tree ensembles. Their behavior is nonetheless
notable: using only language features with LogisticRegression or KNN yields much better performance
(0.81 and 0.84, respectively) than using accident report features alone (F1-score of 0.6). These results highlight the
applicability of combining older or simpler models with advanced language models.
In general, the performance of RandomForest is highest, with XGBoost as the second best model. In all cases, combining
language feature sets with accident report features shows lower performance than using accident report features alone.
For the UK data set, the performance of RandomForest is highest, with negligible variation between language models, and
XGBoost is the second best model. In all cases, combining language feature sets with accident report
features shows lower performance than using accident report features alone. Features extracted using GPT-2 give the
lowest performance with each of the ML models.
The analysis of the USA dataset reveals that XGBoost outperforms other models when utilizing either report-only features or
a combination of report and language features. The bert-large variant demonstrates marginally better performance than
the base BERT model. All other language models score lower than BERT. This data set shows both
the highest severity classification performance and the greatest improvement from using language model features.
The performance of different language models was further analyzed through a point plot, with the models arranged on the
x-axis in descending order of their average F1 scores (see Figure 6). Across all three datasets, roberta-large and albert-large
show the lowest performance, while bert-large and xlnet models consistently show high performance levels. XGBoost displayed
competitive results in comparison to RandomForest, with only negligible differences in their performance. The K-Nearest
Neighbors algorithm was observed to have the lowest performance.
[Figure 5: Heat map of average F1 score sorted by features. Rows (Model + Feature Set): K-Nearest Neighbors, Light GBM, Random Forest and XGBoost, each with NLP, report and report+NLP features. Columns (Language Model): albert, albert-large, bert, bert-large, roberta, roberta-large, xlnet, xlnet-large.]
Fig. 5. Random Forest for USA

A. Performance comparison
The computational experiments were conducted on a high-performance computing system equipped with dual Intel Xeon
Gold 6238R CPUs, each featuring 28 cores operating at a base frequency of 2.20 GHz, capable of reaching up to 4.00 GHz.
The system follows a 64-bit architecture, supports multi-threading with a total of 56 logical CPUs, and supports hardware
virtualization (VT-x) and large-scale memory operations. It is further equipped
with an NVIDIA Quadro RTX 6000 graphics card with 24 GB of GDDR6 memory.
Language models were benchmarked using the LightGBM model with a consistent batch size of 32, and all timings were
averaged over four batches. We measured the total per-batch processing time, the tokenization time, and the model inference
time, and we express throughput in samples per minute.
The results in Figure 7 indicate that BERT and RoBERTa are the fastest models overall. Large variants are consistently
slower than their base counterparts, by up to twelve times (ALBERT vs ALBERT-large).
Tokenization speed also varies notably, as shown in Figure 8. The trend suggests that more recent variations of BERT have
more efficient tokenization: BERT has the slowest tokenizer and ALBERT-large the fastest, an eightfold difference in speed.
During inference (the model run once tokenization has already been performed over the text lines), BERT is among the
three slowest models, while RoBERTa is the fastest.
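The timing procedure can be sketched as follows: tokenization and inference are timed separately per batch, and throughput is reported in samples per minute. This is a hedged sketch rather than the benchmark code used in the study. The function `measure_throughput` and the toy `toy_tokenize` and `toy_infer` stand-ins are hypothetical, and a real run would call a language model's tokenizer and forward pass instead.

```python
import time

def measure_throughput(texts, tokenize, infer, batch_size=32, n_batches=4):
    """Return tokenization, inference, and total throughput in samples per minute."""
    tokenization_seconds = 0.0
    inference_seconds = 0.0
    n_samples = 0
    for b in range(n_batches):
        batch = texts[b * batch_size:(b + 1) * batch_size]
        if not batch:
            break
        start = time.perf_counter()
        tokens = tokenize(batch)
        after_tokenize = time.perf_counter()
        infer(tokens)
        after_infer = time.perf_counter()
        tokenization_seconds += after_tokenize - start
        inference_seconds += after_infer - after_tokenize
        n_samples += len(batch)

    def per_minute(seconds):
        # Guard against a timer reading of exactly zero on very fast stand-ins
        return 60.0 * n_samples / seconds if seconds > 0 else float("inf")

    return {
        "samples": n_samples,
        "tokenization": per_minute(tokenization_seconds),
        "inference": per_minute(inference_seconds),
        "total": per_minute(tokenization_seconds + inference_seconds),
    }

# Hypothetical stand-ins; a real run would call an LLM tokenizer and model here
def toy_tokenize(batch):
    return [text.split() for text in batch]

def toy_infer(token_lists):
    return [len(tokens) for tokens in token_lists]

texts = ["Ramp to I-20 Westbound - Accident."] * 128
stats = measure_throughput(texts, toy_tokenize, toy_infer)
```

Because the total time is the sum of both stages, the total throughput can never exceed the throughput of either stage alone, which is why a fast tokenizer does not guarantee a fast model overall.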
B. General Comparison for Area: USA, Description only
In this scenario, we evaluate models on the task of traffic incident severity classification using the description field only.
We compare the performance of language models by extracting NLP features with an LLM directly from the ’description’
field of the incident reports. The evaluation metric chosen is the F1-score,
which balances precision and recall and is particularly useful for imbalanced classification problems (traffic incident severity
classes are often imbalanced).
[Figure 6: Point plots of average F1 score by language model in three panels (USA, Queensland, UK). Y-axis: Average F1 Score. X-axis: Language Model. Legend (ML Model): Light GBM, K-Nearest Neighbors, Random Forest, XGBoost.]
Fig. 6. General comparison of Average F1 Score for traffic accident severity classification on USA data set using incident description only
There are several findings related to this description-only scenario (see Figure 9):
• Effectiveness of BERT: Among the LLMs, BERT and its large variant outperform all other models in feature extraction
relevant to incident severity classification.
• Random Forest and XGBoost: Both of these algorithms are ensemble methods that combine multiple weak learners to
make a strong learner. These machine learning models are the most effective in utilizing the LLM-extracted features for
severity classification.
• NLP features extracted from the incident description field can be nearly as effective as report-only features from the USA
data set: this suggests that the unstructured text in the description contains nuances or contextual information that the
tabular form does not capture. The results indicate that the textual description in the incident reports contains valuable
information, necessary to determine the accident severity. This makes sense, as the incident description might include
details about how the accident occurred, how many vehicles were involved and more, even though in unstructured form.
The findings suggest that reporting authorities should place more focus on collecting high-quality incident descriptions,
as these are found to be very informative for the task at hand. It also means that an LLM combined with an ML pipeline can
be used right away (e.g. in a mobile application) once the accident report is received, and can provide a high degree of
classification accuracy. It would be interesting to explore why exactly BERT models are so effective in this scenario
compared to other LLMs.
C. Use of PCA for Dimensionality Reduction
Given the high dimensionality of the feature set extracted from traffic incident reports using LLMs (e.g. 768 dimensions
per report for the base bert model and 1024 for bert-large), traditional machine learning models may face challenges in
handling the data efficiently. The model training process can be time-consuming, and high dimensionality can also lead to
overfitting or poor generalization.
To mitigate these issues, we used fast machine learning models, such as XGBoost and the Decision Tree Regression model, which
are known for their efficiency and scalability in handling high-dimensional data. These models can process a large amount of data
relatively quickly, enabling rapid model training and prediction, which is crucial for timely traffic incident response.
[Figure 7: Bar chart of total throughput per language model. Y-axis: Total [Samples per Minute]. X-axis: Model (albert, albert-large, bert, bert-large, roberta, roberta-large, xlnet, xlnet-large).]
Fig. 7. Language model performance (total): samples per minute

Furthermore, we used Principal Component Analysis (PCA) for dimensionality reduction. PCA is a popular technique that
transforms the original variables into a new set of variables that are linear combinations of the original ones and are
orthogonal to each other, so that they carry no linearly redundant information. The transformation retains most of the
variance in the data using fewer components, thus reducing the data’s dimensionality.
Dimensionality reduction was applied to language feature vectors using PCA, as illustrated in Figure
10. The F1 scores from traffic accident severity classification were compared between language models, using LightGBM as
the machine learning model on the USA data set. The findings suggest that base models work well with dimensionality
reduction, showing a minimal difference in performance between 64 and 768 components. Conversely, large models show
a distinctly slower progression, indicating that they require a larger number of components (e.g. albert vs albert-large), or
the full feature vector, to achieve sufficient results. The worst performance was recorded for roberta-large, which for
an unknown reason improves very slowly as the number of components grows, hinting that it requires the full
feature vector.
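The PCA transformation described above can be sketched with a singular value decomposition. This is an illustrative sketch, not the study's code: the functions `pca_fit` and `pca_transform` are hypothetical names, and the input is a synthetic stand-in for language model embeddings (200 reports, 768 dimensions, driven by 16 latent factors) rather than real features.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the mean, top principal axes, and explained-variance ratios of X."""
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of vt are orthonormal principal axes,
    # ordered by decreasing singular value
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    ratios = s**2 / np.sum(s**2)
    return mean, vt[:n_components], ratios[:n_components]

def pca_transform(X, mean, components):
    """Project X onto the given principal axes."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
# Synthetic stand-in for embeddings: 16 latent factors mixed into 768 dimensions
latent = rng.normal(size=(200, 16))
mixing = rng.normal(size=(16, 768))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 768))

for k in (2, 4, 8, 16, 32, 64):
    mean, components, ratios = pca_fit(X, k)
    Z = pca_transform(X, mean, components)
    print(k, Z.shape, round(float(ratios.sum()), 3))
```

In a comparison such as the one in Figure 10, the reduced matrix `Z` for each component count would be passed to the classifier, and the resulting F1 scores compared against the full feature vector.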
VII. CONCLUSION
In the present study, we investigate the efficiency of large language models for predicting traffic
incident severity. We utilize the unstructured data representation across various traffic incident reports. By combining the
features taken from these reports with features extracted by large language models, our methodology improves
prediction accuracy.
Our findings show that the integration of LLM-derived features with those from incident reports leads to a notable improve-
ment in the performance of all tested machine learning models on USA dataset. Specifically, RandomForest and XGBoost
exhibit the highest classification accuracy, signifying the potential of combining traditional machine learning with advanced
NLP techniques for more accurate and efficient incident management.
The study illustrates LLMs’ potential in enhancing TIMS through improved traffic incident severity classification. By applying
advanced language models in conjunction with traditional machine learning, we can better understand and predict the outcomes
of traffic incidents. Our approach represents a paradigm shift from relying solely on structured accident reports with pre-defined
set of parameter variations to integrating the rich, contextual information embedded within the narrative of traffic accident
reports and utilizing completely textual accident representation. Additionally, our evaluation revealed the trade-off between
performance and computational cost across LLMs, providing a valuable reference for choosing suitable models based on the
application requirements.
Comparison with Tabular Data: When compared with traditional, tabular accident report features (report), the LLM-
derived features sometimes show either superior or acceptable performance.
Best Models: For tasks like traffic incident severity classification, using advanced NLP models like BERT, coupled with
powerful ML algorithms like Random Forest or XGBoost, is likely to yield the best results.
Mitigating Feature Engineering: Instead of relying solely on tabular data from
accident reports, incorporating text-based features could provide a more holistic view and improve classification performance.
[Figure 8: Two bar charts of throughput per language model. Y-axes: Tokenization [Samples per Minute] and Inference [Samples per Minute]. X-axis: Model (albert, albert-large, bert, bert-large, roberta, roberta-large, xlnet, xlnet-large).]
Fig. 8. Language model performance (tokenization): samples per minute
[Figure 9: Heat map of average F1 score. Rows (Model + Feature Set): K-Nearest Neighbors-NLP, Light GBM-NLP, Random Forest-NLP, XGBoost-NLP. Columns (Language Model): albert, albert-large, bert, bert-large, roberta, roberta-large, xlnet, xlnet-large.]
Fig. 9. General comparison of Average F1 Score for traffic accident severity classification on USA data set using incident description only
Limitations of this work: Our current approach has been primarily tested and validated on publicly available CTADS
dataset from the United States. Nevertheless, there is a potential in applying this approach to traffic incident data from other
countries.
Future works: By utilizing the flexibility of LLMs in interpreting incident reports represented in textual form, we can
build traffic incident severity classification frameworks that have the potential to be transferable between various accident
reporting systems and locations. Our future research will also focus on performing a joint prediction of traffic accident
severity, impact and duration. LLMs can be fine-tuned for specific tasks, including the interpretation of technical
terms, abbreviations, and numbers. However, out of the box, they are not specifically designed to handle the highly
abbreviated or numerical short strings present in accident reports. For such specialized tasks, a domain-specific model fine-tuned
on a large corpus of accident reports would likely perform better.
[Figure 10: Line plot titled "Average F1 Scores for Language Models by PCA Components. USA data set". Y-axis: Average F1 Score. X-axis: PCA Components. Legend: Base Model Name (bert, roberta, xlnet, albert) and Model Size (Regular, Large).]
Fig. 10. General comparison of Average F1 Score for traffic accident severity classification on USA data set using incident description only
ACKNOWLEDGMENT
This work has been done as part of the ARC Linkage Project LP180100114. The authors are highly grateful for the support
of Transport for NSW, Australia.
REFERENCES
[1] N. H. T. S. Administration, Traffic safety facts, a compilation of motor vehicle crash data from the fatality analysis reporting system and the general
estimates system, Available at: https://crashstats.nhtsa.dot.gov/Api/Public/Publication/812261, accessed: 2 June 2023 (2013).
[2] W. Kim, G.-L. Chang, Development of a hybrid prediction model for freeway incident duration: A case study in maryland, International Journal of
Intelligent Transportation Systems Research 10 (01 2011). doi:10.1007/s13177-011-0039-8.
[3] A. Theofilatos, G. Yannis, P. Kopelias, F. Papadimitriou, Predicting road accidents: A rare-events modeling approach, Transportation Research Procedia
14 (2016) 3399–3405, Transport Research Arena TRA2016. doi:https://doi.org/10.1016/j.trpro.2016.05.293.
URL https://www.sciencedirect.com/science/article/pii/S235214651630299X
[4] A. Grigorev, A.-S. Mihăiţă, K. Saleh, M. Piccardi, Traffic incident duration prediction via a deep learning framework for text description encoding, in:
2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2022, pp. 1770–1777.
[5] H. Baker, M. R. Hallowell, A. J.-P. Tixier, Automatically learning construction injury precursors from text, Automation in Construction 118 (2020)
103145. doi:https://doi.org/10.1016/j.autcon.2020.103145.
URL https://www.sciencedirect.com/science/article/pii/S0926580519310209
[6] F. Zhang, H. Fleyeh, X. Wang, M. Lu, Construction site accident analysis using text mining and natural language processing techniques, Automation in
Construction 99 (2019) 238–248. doi:https://doi.org/10.1016/j.autcon.2018.12.016.
URL https://www.sciencedirect.com/science/article/pii/S0926580518306137
[7] J. Yu, J. Ouyang, X. Bao, Water accidents severity classification based on prompt-bert, in: 2023 3rd International Symposium on Computer Technology
and Information Science (ISCTIS), IEEE, 2023, pp. 942–946.
[8] R. Li, F. C. Pereira, M. E. Ben-Akiva, Overview of traffic incident duration analysis and prediction, European transport research review 10 (2) (2018)
22.
[9] S. Ahmed, M. A. Hossain, M. M. I. Bhuiyan, S. K. Ray, A comparative study of machine learning algorithms to predict road accident severity, in: 2021
20th International Conference on Ubiquitous Computing and Communications (IUCC/CIT/DSCI/SmartCNS), IEEE, 2021, pp. 390–397.
[10] A. S. Mihaita, Z. Liu, C. Cai, M. Rizoiu, Arterial incident duration prediction using a bi-level framework of extreme gradient-tree boosting, CoRR
abs/1905.12254 (2019). arXiv:1905.12254.
URL http://arxiv.org/abs/1905.12254
[11] O. Zheng, M. Abdel-Aty, D. Wang, Z. Wang, S. Ding, Chatgpt is on the horizon: Could a large language model be all we need for intelligent
transportation?, arXiv preprint arXiv:2303.05382 (2023).
[12] P. Agrawal, A. Franklin, D. Pawar, P. Srijith, Traffic incident duration prediction using bert representation of text, in: 2021 IEEE 94th Vehicular
Technology Conference (VTC2021-Fall), IEEE, 2021, pp. 1–5.
[13] S. Yuan, Q. Wang, Imbalanced traffic accident text classification based on bert-rcnn, in: Journal of Physics: Conference Series, Vol. 2170, IOP Publishing,
2022, p. 012003.
[14] T. Yuanlai, Z. Jiale, W. Huifeng, Text classification method of accident cases based on bert pre-training model, Journal of East China University of
Science and Technology 49 (4) (2023) 576–582.
[15] D. M. Goldberg, Characterizing accident narratives with word embeddings: Improving accuracy, richness, and generalizability, Journal of safety research
80 (2022) 441–455.
[16] R. Mitchell, A. Adinets, T. Rao, E. Frank, Xgboost: Scalable gpu accelerated learning, arXiv preprint arXiv:1806.11248 (2018).
[17] N. Jethani, M. Sudarshan, I. C. Covert, S.-I. Lee, R. Ranganath, Fastshap: Real-time shapley value estimation, in: International Conference on Learning
Representations, 2021.
[18] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Barnes, A. Mian, A comprehensive overview of large language models, arXiv
preprint arXiv:2307.06435 (2023).
[19] S. Moosavi, M. H. Samavatian, S. Parthasarathy, R. Teodorescu, R. Ramnath, Accident risk prediction based on heterogeneous sparse data: New dataset
and insights, in: Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2019, pp.
33–42.
[20] A. H. Oliaee, S. Das, J. Liu, M. A. Rahman, Using bidirectional encoder representations from transformers (bert) to classify traffic crash severity types,
Natural Language Processing Journal 3 (2023) 100007.
[21] Y. M. Goh, C. Ubeynarayana, Construction accident narrative classification: An evaluation of text mining techniques, Accident Analysis & Prevention
108 (2017) 122–130.
[22] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, B. Scholkopf, Support vector machines, IEEE Intelligent Systems and their applications 13 (4) (1998)
18–28.
[23] L. E. Peterson, K-nearest neighbor, Scholarpedia 4 (2) (2009) 1883.
[24] A. J. Myles, R. N. Feudale, Y. Liu, N. A. Woody, S. D. Brown, An introduction to decision tree modeling, Journal of Chemometrics: A Journal of the
Chemometrics Society 18 (6) (2004) 275–285.
[25] P. Hosseini, S. Khoshsirat, M. Jalayer, S. Das, H. Zhou, Application of text mining techniques to identify actual wrong-way driving (wwd) crashes in
police reports, International Journal of Transportation Science and Technology (2022).
[26] W. A. Qader, M. M. Ameen, B. I. Ahmed, An overview of bag of words; importance, implementation, applications, and challenges, in: 2019 international
engineering conference (IEC), IEEE, 2019, pp. 200–204.
[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint
arXiv:1810.04805 (2018).
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, Xlnet: Generalized autoregressive pretraining for language understanding, Advances
in neural information processing systems 32 (2019).
[29] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining
approach, arXiv preprint arXiv:1907.11692 (2019).
[30] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A lite bert for self-supervised learning of language representations, arXiv
preprint arXiv:1909.11942 (2019).
[31] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery
and data mining, 2016, pp. 785–794.
[32] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, Lightgbm: A highly efficient gradient boosting decision tree, Advances in
neural information processing systems 30 (2017).
[33] A. Liaw, M. Wiener, et al., Classification and regression by randomforest, R news 2 (3) (2002) 18–22.
[34] S. L. Özesmi, C. O. Tan, U. Özesmi, Methodological issues in building, training, and testing artificial neural networks in ecological applications,
Ecological Modelling 195 (1) (2006) 83–93, selected Papers from the Third Conference of the International Society for Ecological Informatics (ISEI),
August 26–30, 2002, Grottaferrata, Rome, Italy. doi:https://doi.org/10.1016/j.ecolmodel.2005.11.012.
URL https://www.sciencedirect.com/science/article/pii/S0304380005005806
[35] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, X. Chen, X. Wang, A comprehensive survey of neural architecture search: Challenges and solutions,
ACM Computing Surveys (CSUR) 54 (4) (2021) 1–34.
