Abstract Detecting damage assessment tweets during a disaster is beneficial to both humanitarian organizations and victims. Most previous works that identify tweets during a disaster have dealt with situational information, availability/requirement of resources, infrastructure damage, etc.; only a few have focused on detecting damage assessment tweets. In this paper, a novel method is proposed for identifying damage assessment tweets during a disaster. The proposed method effectively utilizes low-level lexical features, top-most frequency word features, and syntactic features that are specific to damage assessment. These features are weighted using simple linear regression and Support Vector Regression (SVR). A random forest classifier is then used to classify the tweets. We experimented on 14 standard disaster datasets of different categories for binary and multi-class classification. The proposed method achieves an accuracy of 94.62% for detecting damage assessment tweets. Most importantly, it can be applied when enough labeled tweets are not available, or when tweets of the specific disaster type are not available, by training the model on past disaster datasets. Our model is trained on (i) a combination of earthquake disaster datasets, (ii) a combination of old earthquake disaster datasets, and (iii) a combination of old diverse disaster datasets, and is tested on the other datasets in the cross-domain scenario. The proposed approach is also compared with state-of-the-art approaches, both in-domain and cross-domain, for binary and multi-class classification, and improves accuracy by up to 37.12% over existing methods.
Keywords Twitter · Disaster · Damage assessment · Infrastructure damage · Social media
1 Introduction
Social media platforms have received much attention in the past decade as a source of situational updates during a crisis [1–3], such as reports of dead people, availability and requirement of resources [4–6], needs of affected and injured people, etc. People post a large number of messages related to situational information [7–9] during and in the aftermath of a disaster. Along with situational tweets, sentiment tweets (e.g., sympathies for those affected) are also posted. Some works [10–13, 8] have used feature-based approaches for extracting situational information from social media, while other studies [14–20] have used either feature-based methods or deep learning methods for sentiment analysis of tweets.
Damage assessment is one of the critical situational awareness steps that lets humanitarian organizations understand the seriousness of the damage and provide services to people according to the emergency during a disaster. Existing works [10–12] focused on damage assessment during a disaster. The authors in [11] concentrated only on damage assessment of buildings and infrastructure in Italian tweets; they did not consider English tweets and did not address human damage assessment. The authors in [12] used a split-query-based information retrieval approach for detecting infrastructure damage during a disaster, but did not address human damage assessment, and their method requires manual selection of keywords for setting the query. Similarly, the authors in [10] focused on damage assessment of buildings and infrastructure using images, but missed important information that might be available from text data. To our knowledge, there is no reported work on human damage assessment from either text or images during a disaster. Human damage assessment information helps government organizations provide necessary services to victims during a disaster. Therefore, it is essential to detect both human and infrastructure damage assessment from tweets during a disaster. A few example tweets related to damage and non-damage assessment are shown in Table 1.
Table 1 Example tweets related to damage and non-damage assessment during a disaster
Tweet No.  Tweet
Damage Assessment Tweets
1. Deadly monsoon hits India, Nepal: Dozens of people have been killed in flooding in northern and
eastern India [URL].
2. Hundreds dead as monsoon hits India, Nepal [URL] #world #news #last
3. What about? Scores Killed in Flooding in Nepal and India: More than 11,000 homes have been
damaged [URL] Next time
4. #India news - Odisha Villagers Keeping Damaged Houses Intact for Getting Government Help
[URL]
Non-Damage Assessment Tweets
5. A Flood Statement has been issued for portions of the WATE 6 area. Details [URL]
6. Selfie with the flood @ cameron grace #peepthecar [URL]
7. @FOX2now #earthquake USGS reports 6.2 mag earthquake in central Italy.
8. 6.4 magnitude quake strikes central Italy near the city of Perugia: [URL] #9News.
The first four tweets are related to damage assessment, and the rest to non-damage assessment.
In this paper, we treat the identification of damage assessment tweets (covering both human and infrastructure damage) as a binary and multi-class classification problem. As different categories of damage information are essential during a disaster, we formulate a new approach that weights features, comprising low-level lexical features, syntactic features, and highest-frequency words, using linear regression and SVR for both binary and multi-class classification. Experiments are performed with different classifiers to select a suitable classifier for the proposed features. The selected classifier is then applied to different disaster datasets and evaluated using various metrics.
The contributions of this work are summarized as follows:
1. We propose a novel method based on low-level lexical features, syntactic features, and top-most frequency words, weighted using SVR and linear regression, for identifying damage assessment tweets during a disaster. The proposed method is vocabulary independent, which allows the model to identify tweets accurately even when trained on different disaster datasets.
2. We compare our proposed method with different state-of-the-art methods on various datasets. The proposed method improves accuracy by up to 37.12% over existing methods and achieves the best performance in identifying damage assessment tweets, both in-domain and cross-domain, for binary and multi-class classification.
The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 describes the proposed method for identifying damage assessment tweets during a disaster. Section 4 presents the experimental results and performance analysis of the proposed method. Section 5 concludes the paper.
2 Related Work
We describe the related work in three areas: a) crisis-related tweets, discussing different approaches for classifying crisis-related tweets; b) sub-categories of crisis-related tweets, explaining various methods for identifying sub-categories of crisis-related tweets; and c) specific damage assessment tweets, covering works related to damage assessment such as infrastructure damage, utility damage, etc.
2.1 Crisis-Related Tweets

In this section, we describe various methods for identifying crisis-related tweets during a disaster. In [1], the authors showed the use of social media during emergencies and presented computational methods for processing social media data. Many studies [5, 6, 21, 22, 13, 8, 23–29] have been published on extracting useful information from social media during a disaster. Most of these methods [5, 6, 21, 13, 8, 23–25] mainly depend on the features used for classifying crisis-related data. In [13], the authors used uni-grams, Part-Of-Speech (POS) tags, etc. for extracting situational awareness information during a disaster. In [21], the authors developed an earthquake reporting system based on features such as the length of a tweet, the position of the keyword "earthquake", and its context words for detecting tweets related to earthquakes. The authors in [23] developed a system named Artificial Intelligence for Disaster Response (AIDR), based on n-gram features, for detecting user-defined categories of tweets during a disaster. However, all these features are vocabulary dependent and do not work across domains. In 2018, the authors of [8] used vocabulary-independent features such as low-level
lexical and syntactic features for classifying tweets into situational and non-situational information. They experimented with different types of disaster datasets and compared their method against the Bag-Of-Words (BOW) model for both English and Hindi tweets. The authors in [30] fed textual features and domain-expert features to a random forest classifier for automatic identification of eyewitness reports, categorized into direct eyewitness, indirect eyewitness, and vulnerable direct eyewitness, with experiments on earthquake, flood, hurricane, and forest fire datasets. The authors in [31] used low-supervision and transfer learning-based approaches for detecting urgency tweets, experimenting on the Nepal, Macedonia, and Kerala datasets. They showed that their method is beneficial especially when labeled data is scarce, and that it outperforms existing baseline methods. However, all these works focus on disaster-related tweets in general rather than on sub-categories such as infrastructure damage, human damage, resource needs and availability, etc.
2.2 Sub-Categories of Crisis-Related Tweets

In this section, we discuss works related to identifying various sub-categories of crisis-related tweets during a disaster. A few studies [5, 6, 22] focused on extracting different categories of tweets during a disaster. In [22], the authors used context and content features with different classifiers such as decision tree, SVM, random forest, and AdaBoost for finding help requests during a disaster; among these, the decision tree gave better results than the others. Recently, the authors in [6] used features such as terms related to communication, location, infrastructure damage, etc. for detecting resources during a disaster. Consequently, the authors of [5] specifically focused on detecting the availability and requirement of resources: they used a re-ranking feature selection algorithm to extract features from the tweets and fed its output to a classifier. In [32], the authors used information retrieval methodologies based on word embeddings and a combination of word and character embeddings for extracting resource needs and availabilities from disaster tweets, using the Nepal and Italy earthquake datasets. The authors in [33] developed a methodology for understanding the semantics of need and availability tweets, i.e., what resource is available or needed, the resource location, etc. They also designed a methodology for matching availability and need tweets based on resource similarity and location. The authors in [34] analyzed tweets posted during disasters such as Hurricane Harvey, Hurricane Irma, and Hurricane Maria. They performed multi-class classification experiments using a random forest based on the bag-of-words model, with classes such as affected individuals, infrastructure and utilities damage, caution and advice, etc., and showed that the text and image of a tweet give complementary information. Specifically, the authors in [35] designed a Social-EOC model for identifying and ranking social service requests from social media, using a semantic grouping approach to reduce redundancy and group similar requests to save time; they used datasets such as Hurricane Sandy, Alberta floods, Hurricane Harvey, and the Nepal earthquake. The authors in [36] used a majority-voting-based ensemble model for identifying medical resource tweets during a disaster, utilizing classifiers such as SVM, AdaBoost, random forest, bagging, and gradient boosting based on informative features specifically related to medical resource tweets. However, none of these works focus on damage assessment from social media during a disaster.
2.3 Damage Assessment Tweets

In this section, we describe various works for identifying posts specifically related to damage assessment during a disaster. Recent works on damage assessment [10, 37] used pre-trained models during a disaster. Specifically, the authors in [10] compared a fine-tuned VGG-16 architecture against a VGG-16 trained from scratch and the Bag-of-Visual-Words model. They used three classes, namely severe damage, mild damage, and little-to-no damage, and experimented with different disaster datasets such as the Nepal earthquake, Typhoon Ruby, etc. The authors in [37] used image processing techniques based on deep neural networks for rapid damage assessment from social media images during a disaster, experimenting on the Hurricane Dorian dataset, and gave a detailed explanation of where their model failed to detect damage assessment images. The authors in [38] combined a Domain-Adversarial Neural Network (DANN) with the VGG-19 model for identifying damage from images during a disaster, with experiments on the Nepal earthquake, Ecuador earthquake, Typhoon Ruby, and Hurricane Matthew datasets; they showed that their method works well even when the model is trained and tested on different disaster datasets. However, all of these works focus on social imagery data during a disaster. Although the authors in [11] used text data for damage assessment, they worked on Italian rather than English tweets. The authors of [39] used information retrieval-based methods to classify English-language damage assessment tweets during a disaster, with a semi-automated approach to pick the initial set of seed keywords for finding relevant damage assessment tweets. They experimented on the Nepal earthquake, Italy earthquake, and Indonesia tsunami datasets and showed that their methods outperform the state of the art in terms of precision, recall, Bpref, and MAP. However, their method requires human annotators to select the relevant keywords, and they did not explore other disaster datasets.
The performance and limitations of the existing methods can be summarized as follows:
1. Most existing works focus on general categories of disaster tweets, such as informative and non-informative. These approaches do not perform well for identifying damage assessment tweets during a disaster, as explained in Section 4.5.
2. The few works that focus on damage assessment from social media use only imagery data, not text data. The drawback is that they give information only about infrastructure damage and do not provide information on human damage assessment.
3. Some approaches work only for regional-language (e.g., Italian) tweets but not for English-language tweets. Some methods also require keywords to be supplied manually to identify damage assessment tweets.
4. Finally, most existing approaches have not been explored on diverse disaster datasets such as Hurricane Irma, wildfires, floods, earthquakes, etc.
In order to overcome these limitations, this paper introduces a framework based on novel features weighted using linear regression and SVR for the automatic detection of damage assessment tweets during a disaster. Extensive experiments are performed on various diverse disaster datasets, both in-domain (training and testing on the same dataset) and cross-domain (training on one dataset and testing on another). We also perform an error analysis that is very helpful for future enhancements.
3 Proposed Method
In this section, we describe the proposed method, which detects damage assessment tweets with up to 94.62% accuracy during a disaster. The proposed method is illustrated in Figure 1.
3.1 Pre-processing
The following pre-processing operations are performed on the tweets before feature engineering.
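As a minimal sketch, typical operations would include lowercasing, URL and user-mention removal, and tokenization; the exact steps and patterns below are illustrative assumptions rather than the authors' verbatim pipeline:

```python
import re

def preprocess_tweet(text):
    """Minimal tweet-cleaning sketch; the exact steps are assumptions."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove user mentions
    text = text.replace("#", "")                        # keep hashtag words, drop '#'
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # strip punctuation
    return text.split()                                 # whitespace tokenization

print(preprocess_tweet("Hundreds dead as monsoon hits India, Nepal http://t.co/x #world"))
# ['hundreds', 'dead', 'as', 'monsoon', 'hits', 'india', 'nepal', 'world']
```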
3.2 Feature Engineering

The features are divided into three types: low-level lexical features, top-most frequency words, and syntactic features. We now describe each category in detail.
In [8], the authors used low-level lexical features such as the count of personal pronouns, count of modal verbs, count of wh-words, etc. for classifying tweets into situational and non-situational during a disaster. However, these low-level lexical features do not perform well for detecting damage assessment tweets. Therefore, we propose a new set of low-level lexical features: count of words related to people, count of words related to adjectives/adverbs of people, count of words related to infrastructure damage, and count of words related to adjectives/adverbs of infrastructure damage. The details of these features are as follows:
– Count of words related to people: This feature counts the words or terms related to people like ‘patient’,
‘people’, ‘residents’, ‘fatalities’, ‘deaths’, ‘funerals’, ‘injuries’, etc. that are present in a tweet.
– Count of words related to infrastructure damage: This feature counts the words related to infrastructure
damage that are present in a tweet. Examples of these words are ‘buildings’, ‘homes’, ‘houses’, ‘villa’,
‘wagons’, ‘churches’, ‘dam’, ‘havoc’, ‘hospital’, ‘hotel’, ‘house’, ‘lodges’, ‘malls’, ‘reservoirs’, ‘restoration’,
‘road’, ‘roof’, ‘ruins’, ‘schools’, etc.
– Count of words related to adjectives/adverbs of people: It counts the words related to adjectives/adverbs of people, like 'critical', 'wounded', 'treated', 'dead', 'killed', 'affected', etc., that are present in a disaster tweet.
– Count of words related to adjectives/adverbs of infrastructure damage: This feature counts the words related to adverbs/adjectives of infrastructure damage, such as 'damaged', 'destruction', 'destructive', 'devastated', 'facilities', 'flooded', etc., that are present in a tweet.
Custom lexicons for damage assessment are manually created from standard lexicons, such as emergency management terms [41] and the crisis lexicon [42], to identify the presence of the low-level lexical features in a tweet. Low-level lexical features help the model identify the presence of either infrastructure or human damage information in a tweet across various disasters. These features remain beneficial even when the model is trained on different disasters.
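A minimal sketch of these four counts follows; the word sets are small illustrative subsets of the manually curated lexicons described above, not the full lexicons:

```python
# Illustrative subsets of the manually curated lexicons (not the full lists).
PEOPLE_WORDS = {"patient", "people", "residents", "fatalities", "deaths", "funerals", "injuries"}
PEOPLE_ADJ_ADV = {"critical", "wounded", "treated", "dead", "killed", "affected"}
INFRA_WORDS = {"buildings", "homes", "houses", "dam", "hospital", "road", "roof", "schools"}
INFRA_ADJ_ADV = {"damaged", "destruction", "destructive", "devastated", "flooded"}

def low_level_lexical_features(tokens):
    """Return the four low-level lexical counts for a tokenized tweet."""
    return [
        sum(t in PEOPLE_WORDS for t in tokens),    # words related to people
        sum(t in INFRA_WORDS for t in tokens),     # words related to infrastructure damage
        sum(t in PEOPLE_ADJ_ADV for t in tokens),  # adjectives/adverbs of people
        sum(t in INFRA_ADJ_ADV for t in tokens),   # adjectives/adverbs of infrastructure damage
    ]

print(low_level_lexical_features("dozens of people killed and homes damaged".split()))
# [1, 1, 1, 1]
```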
The work in [13] uses word frequencies as features, which perform well in identifying disaster-related tweets because of the specific vocabulary used in such events. The main drawbacks of this approach are the sparsity of the features and their dominance when combined with other features, because the resulting feature vector is very long. To overcome these problems, we use only the top-most frequent words as features for identifying damage assessment tweets. The top-most frequent words may differ from one disaster to another, and their frequencies also help differentiate damage from non-damage assessment tweets. Therefore, the top-ten most frequent words from the training data are used as features in our approach.
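A minimal sketch of this feature follows; whether the per-tweet value is a count or a binary presence flag is not stated, so counts are assumed here:

```python
from collections import Counter

def top_k_words(training_tokens, k=10):
    """Collect the k most frequent words across the tokenized training tweets."""
    counts = Counter(word for tweet in training_tokens for word in tweet)
    return [word for word, _ in counts.most_common(k)]

def top_word_features(tokens, top_words):
    """One count per top word, giving a length-k feature vector for a tweet."""
    return [tokens.count(word) for word in top_words]
```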
The work in [8] uses syntactic features such as the count of numerals and count of intensifiers for classifying situational and non-situational tweets during a disaster. However, these features do not perform well for identifying damage assessment tweets. In this work, we propose the following specific syntactic features:
– Length of a tweet: It counts the number of words present in a tweet.
– Count of Determiners: It counts the determiner terms (i.e., a, an, the) present in a tweet.
– Count of Verbs: It counts the Verb terms present in a tweet.
– Count of Pronouns: It counts the pronoun terms present in a tweet.
– Count of Nominal+verbal terms: It counts the Nominal+verbal terms present in a tweet.
– Count of URLs: It counts the URLs present in a tweet.
– Count of Common Noun terms: It counts the common noun terms present in a tweet.
– Count of Adverb terms: It counts the adverb terms present in a tweet.
– Count of Adjective terms: It counts the adjective terms present in a tweet.
– Count of unknown words: It counts the abbreviations, foreign words, unknown words, etc. present in a tweet.
– Count of Interjections: It counts the interjection terms present in a tweet.
– Count of Pre/Post-Position terms: It counts the pre/post-position terms that are present in a tweet.
Syntactic features help the model filter out non-damage assessment tweets by distinguishing them from damage assessment tweets.
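The paper does not name its POS tagger; tags such as nominal+verbal and unknown words suggest a Twitter-specific tagset (e.g., the CMU ARK tagger). The sketch below approximates most of the counts with NLTK's Penn Treebank tagger; the tag mapping is an assumption, and the nominal+verbal count, which has no direct Penn Treebank equivalent, is omitted:

```python
import nltk  # assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

def syntactic_features(tweet_text):
    """Approximate syntactic counts using Penn Treebank tags (a sketch)."""
    tokens = nltk.word_tokenize(tweet_text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return [
        len(tokens),                                    # length of the tweet
        tags.count("DT"),                               # determiners
        sum(t.startswith("VB") for t in tags),          # verbs
        sum(t in ("PRP", "PRP$", "WP") for t in tags),  # pronouns
        sum(tok.lower().startswith(("http", "www.")) for tok in tokens),  # rough URL count
        sum(t in ("NN", "NNS") for t in tags),          # common nouns
        sum(t.startswith("RB") for t in tags),          # adverbs
        sum(t.startswith("JJ") for t in tags),          # adjectives
        tags.count("FW"),                               # foreign/unknown words
        tags.count("UH"),                               # interjections
        tags.count("IN"),                               # pre/post-positions
    ]
```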
The feature values coming from the low-level lexical, top-most frequency word, and syntactic features are concatenated into a single feature vector for every tweet. SVR [43] and linear regression [44] are used to assign weights to the features of a tweet. Linear regression minimizes the overall error, while SVR keeps the error within fixed thresholds while assigning importance to the words in a tweet.
3.3 Classification Model

The classification model is mainly based on the proposed features and the classifier used for classifying the given tweets into infrastructure damage, human damage, and non-damage assessment during a disaster.
Initially, tweets are given as input to the classification model. There are three challenges in detecting damage assessment tweets during a disaster. The first is to capture information from different disasters so that damage assessment tweets can be identified even when the test data has a different vocabulary from the training data. The second is that the model needs to differentiate damage from non-damage assessment tweets. The third is that the model needs to identify damage assessment tweets accurately. To meet these three goals, we propose three types of features: low-level lexical features, top-most frequency words, and syntactic features. Let $F_1$, $F_2$, and $F_3$ be the feature vectors of the low-level lexical, top-most frequency word, and syntactic features, with lengths 4, 10, and 12, respectively. The concatenated feature vector (CF) is given in equation 1.
$$CF = F_1 + F_2 + F_3 \quad (1)$$
where '+' denotes the concatenation operator, giving a feature vector of length 26. Table 3 shows that all three feature types are essential for identifying damage assessment tweets, and also that the different features do not contribute equally to the final identification. Therefore, weights are assigned to the features using Support Vector Regression (SVR) [43] and linear regression [44]. The weighted feature vector is given in equation 2.
$$\mathrm{WFV} = \frac{W_{LR} + W_{SVR}}{2} \times CF \quad (2)$$

where $W_{LR}$ and $W_{SVR}$ are the learned weight vectors, of length 26, from the linear regression and Support Vector Regression algorithms, and '×' denotes the element-wise multiplication of the averaged weight vector and the concatenated feature
vector. The weighted feature vector (WFV) is used as an input to the random forest classifier for classifying
the tweets into infrastructure damage, human damage and non-damage assessment tweets during a disaster. It
is shown in equation 3.

$$\text{Class} = \mathrm{RF}(\mathrm{WFV}) \quad (3)$$

where RF denotes the random forest classifier. The reasoning for using a random forest classifier is given in Section 4.4.
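As a minimal sketch (not the authors' released code), equations 1–3 can be realized with scikit-learn as follows; using the numeric class labels as the regression target for learning the weights is our assumption, since the paper does not state it explicitly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestClassifier

def concatenate_features(lexical, top_words, syntactic):
    """Equation 1: CF = F1 + F2 + F3 (4 + 10 + 12 = 26 values)."""
    return np.concatenate([lexical, top_words, syntactic])

def train_weighted_classifier(X_train, y_train):
    """Equations 2-3 (sketch): average the LR and SVR coefficients as feature
    weights, then train a random forest on the weighted vectors.
    X_train: ndarray of shape (n_tweets, 26); y_train: numeric class labels."""
    w_lr = LinearRegression().fit(X_train, y_train).coef_            # length 26
    w_svr = SVR(kernel="linear").fit(X_train, y_train).coef_.ravel()
    weights = (w_lr + w_svr) / 2.0                                   # equation 2
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train * weights, y_train)                               # equation 3
    return rf, weights

def classify(rf, weights, X):
    return rf.predict(X * weights)
```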
4 Experiments and Results

In this section, we conduct substantial experiments on different datasets to demonstrate the effectiveness of the proposed method for detecting damage assessment tweets. We first introduce the datasets, then describe the experimental setup, and finally present the experimental results and a comparison of the proposed method with existing methods.
4.1 Datasets
To demonstrate the effectiveness of the proposed method, we perform experiments using datasets such as SMERP 2017 [45], CrisisNLP [46], and CrisisMMD [47]. These datasets contain tweets related to various disasters at different locations and times. The details of each dataset are shown in Table 2.
Different datasets have different classes of tweets, such as injured or dead people; missing, trapped, or found people; displaced people and evacuations; infrastructure and utilities damage; donation needs, offers, or volunteering services; etc. For each dataset in Table 2, tweets of the classes injured or dead people; missing, trapped, or found people; affected individuals; displaced people; evacuations; vehicle damage; infrastructure damage; utility damage; restoration; and casualties are considered damage assessment tweets. Among these, vehicle damage, infrastructure damage, utility damage, and restoration are considered infrastructure damage, and the rest are regarded as human damage assessment tweets. For binary classification, both classes (infrastructure and human damage) are merged into a single damage class. An equal number of tweets is selected from the remaining classes of each dataset and considered as non-damage assessment tweets, as sketched below.
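A minimal sketch of this label mapping (the class names follow the text above; how each corpus spells its labels is an assumption):

```python
INFRASTRUCTURE_CLASSES = {"vehicle damage", "infrastructure damage",
                          "utility damage", "restoration"}
HUMAN_CLASSES = {"injured or dead people", "missing, trapped, or found people",
                 "affected individuals", "displaced people", "evacuations",
                 "casualties"}

def map_label(original_class, binary=False):
    """Map a dataset-specific class name to the labels used in this work."""
    if original_class in INFRASTRUCTURE_CLASSES:
        return "damage" if binary else "infrastructure damage"
    if original_class in HUMAN_CLASSES:
        return "damage" if binary else "human damage"
    return "non-damage"
```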
4.2 Experimental Setup

Experiments are performed on the different datasets in two ways, 1. In-domain and 2. Cross-domain, for both binary and multi-class classification of damage assessment tweets.
4.2.1 In-domain
This refers to training and testing the proposed method on the same disaster dataset. We use 10-fold cross-validation in our experiments.
4.2.2 Cross-domain
This refers to training and testing the proposed method on different disaster datasets. The cross-domain setting is divided into three cases: a) a combination of all earthquake disaster datasets is used for training, and the other disaster datasets are used for testing; b) a combination of different old disaster datasets is used for training, and new datasets are used for testing, where old disaster datasets are disasters that happened in or before 2016 and new datasets are disasters that happened after 2016; c) all old earthquake disaster datasets are used for training, and the new earthquake disaster datasets are used for testing. A minimal sketch of both settings follows.
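In this sketch, X and y are assumed to be the weighted feature vectors and labels from Section 3, with dataset loading handled elsewhere:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def in_domain_accuracy(X, y):
    """In-domain: 10-fold cross-validation on a single disaster dataset."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()

def cross_domain_accuracy(train_sets, test_set):
    """Cross-domain: train on a combination of datasets, test on another one."""
    X_train = np.vstack([X for X, _ in train_sets])
    y_train = np.concatenate([y for _, y in train_sets])
    X_test, y_test = test_set
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return clf.fit(X_train, y_train).score(X_test, y_test)
```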
4.3 Evaluation Metrics

Evaluation metrics such as macro-precision, macro-recall, macro-F1-score, and accuracy are used to assess the quality of the proposed method in identifying damage assessment tweets. Let $TP_1$, $TP_2$, and $TP_3$ be the numbers of tweets correctly identified as infrastructure damage, human damage, and non-damage, respectively; $FP_1$, $FP_2$, and $FP_3$ the numbers of tweets incorrectly identified as infrastructure damage, human damage, and non-damage; $TN_1$, $TN_2$, and $TN_3$ the numbers of tweets correctly identified as not infrastructure damage, not human damage, and not non-damage; and $FN_1$, $FN_2$, and $FN_3$ the numbers of tweets incorrectly identified as not infrastructure damage, not human damage, and not non-damage.
$$Precision_i = \frac{TP_i}{TP_i + FP_i} \quad (4)$$

where $Precision_i$ is the precision of class $i$ ($i$ = 1, 2, 3 for the infrastructure damage, human damage, and non-damage classes, respectively).

$$Recall_i = \frac{TP_i}{TP_i + FN_i} \quad (5)$$

where $Recall_i$ is the recall of class $i$.

$$F1\text{-}score_i = \frac{2 \cdot Precision_i \cdot Recall_i}{Precision_i + Recall_i} \quad (6)$$

where $F1\text{-}score_i$ is the F1-score of class $i$.

$$Macro\text{-}Precision = \frac{\sum_{i}^{n} Precision_i}{n} \quad (7)$$

$$Macro\text{-}Recall = \frac{\sum_{i}^{n} Recall_i}{n} \quad (8)$$

$$Macro\text{-}F1\text{-}score = \frac{\sum_{i}^{n} F1\text{-}score_i}{n} \quad (9)$$

$$Accuracy = \frac{\sum_{i}^{n} (TP_i + TN_i)}{\sum_{i}^{n} (TP_i + TN_i + FP_i + FN_i)} \quad (10)$$

where Macro-Precision, Macro-Recall, Macro-F1-score, and Accuracy are averaged over all three classes (infrastructure damage, human damage, and non-damage), and $n$ is the number of classes.
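In practice these metrics can be computed with scikit-learn, as in the sketch below; note that accuracy_score gives the standard accuracy, to which equation 10 reduces in the binary case, and the function name is illustrative:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Equations 4-10 via macro averaging (a sketch)."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "macro_precision": p, "macro_recall": r, "macro_f1": f1}

print(evaluate(["infra", "human", "none"], ["infra", "none", "none"]))
```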
4.4 Classifier Selection and Ablation Study

Several experiments are conducted to gain insight into the performance of the proposed model. Initially, a simple decision tree classifier is used as the classification model, and the results are shown in Table 3. Ablation experiments are performed to measure the effect of each individual component: removing any individual component from the proposed method decreases performance on all metrics. Next, experiments with various classifiers are performed to find the best classifier for the proposed features, with results shown in Table 4. The random forest classifier gives the best results, while the SVM with a polynomial kernel gives the worst results for classifying damage and non-damage assessment tweets; therefore, a random forest classifier is used in the proposed method for all further experiments. Subsequently, the results of the proposed method on various datasets, such as Italy Earthquake, Chile Earthquake, and India Floods, are shown in Table 5. The Chile Earthquake dataset has the highest performance, while California Wildfires has the lowest performance among all the disaster datasets.
Table 3 Different experiments of the proposed method (decision tree classifier) on the large dataset (SMERP-1 and SMERP-2)
Table 4 Comparison of different classifiers on the proposed method for SMERP Level-1 and Level-2 using various parameters
4.5 Comparison with Existing Methods

This section describes the existing methods and compares their results with the proposed method in detecting damage assessment tweets. The existing methods used in this work are as follows:
– The authors in [23] developed a system using n-gram features for classifying tweets into user-defined categories during a disaster. Similar to [23], uni-gram features are used in our experiments, with an SVM classifier, due to its good performance on disaster tweets [48, 49].
– An SVM classifier with low-level lexical and syntactic features [8], such as count of subjective words, presence of modal verbs, presence of intensifiers, presence of wh-words, presence of exclamations, count of question marks, presence of religious words, etc., achieved good performance for classifying disaster-related tweets.
– The authors in [34] used a random forest classifier with a bag-of-words approach for classifying tweets into multiple class labels during a disaster, including damage assessment classes. They reported results on various disaster datasets, including hurricanes, earthquakes, and floods; their reported results are used for comparison with the proposed method.
– The authors in [50] used methodologies such as Term Frequency–Inverse Document Frequency (TF-IDF), TF-IDF with Normalized Entropy boosting (TF-IDF_NE), TF-IDF with Class Normalized Entropy boosting (TF-IDF_CNE), and TF-IDF with Entropy-based Category Coverage Difference (TF-IDF_ECCD) for classifying tweets into multiple categories of disaster-related tweets. The results reported for all these methods on the Nepal Earthquake and Italy Earthquake datasets in identifying damage assessment tweets are used for comparison with the proposed method.
– The authors in [51] used a Convolutional Neural Network (CNN) with the Adam optimizer and a Bidirectional Encoder Representations from Transformers (BERT-base) model with 12 layers for identifying damage assessment tweets. They provided results for various disaster datasets such as CrisisMMD, CrisisNLP, CrisisLex, AIDR, and DSM for both binary and multi-class classification. The binary classes are informative and non-informative, while the multi-class labels include affected individuals, infrastructure damage, utilities damage, injured or dead people, requests or needs, etc.
– The authors in [52] used TF-IDF features with machine learning classifiers such as SVM, random forest, Naive Bayes, K-Nearest Neighbors (KNN), and gradient boosting. They also used deep learning methodologies such as CNN, Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Bi-directional GRU, and GRU-CNN based on crisis word embeddings (dimension 300) and GloVe embeddings (dimension 100), reporting results on various disaster datasets for detecting damage assessment tweets. Among the different machine learning and deep learning methodologies, the best-performing results are used for comparison with the proposed method.
– The authors in [39] used information retrieval methodologies such as the Topic Aligned Query Expansion (TAQE) technique for retrieving damage assessment tweets. They used Boolean retrieval to create the initial set of relevant tweets, and then co-occurrence-based and topic-aligned keyword selection methods to extract damage assessment tweets, reporting results for the Nepal, Italy, and Indonesia disaster datasets.
Tables 6, 7, 8, and 9 show the comparison of the proposed method with the state-of-the-art methods on different disaster datasets for both binary and multi-class classification. The first and second columns of Table 6 give the dataset and the model, respectively; columns 3 to 6 give the metrics accuracy, macro-precision, macro-recall, and macro-F1-score; and the last column gives the year of the method. We observe that, in most cases, SVM with unigram features [23] performs better than SVM with low-level lexical and syntactic features [8]. This indicates that the features of [8] cannot differentiate damage from non-damage assessment tweets, since they are general to disaster tweets and helpful only for situational tweets. The proposed method gives the best performance compared with the existing methods on all disaster datasets except Sri Lanka Floods in binary classification, where the dataset size is very small. The comparison of the proposed method with the state-of-the-art methods using different metrics for multi-class classification is shown in Tables 7, 8, and 9. There, the third, fourth, and fifth columns give the per-class results for infrastructure damage, human damage, and non-damage assessment tweets, respectively, and the last column gives the average over all classes. These results reveal that, on most datasets, SVM with low-level lexical and syntactic features [8] cannot differentiate infrastructure damage from human damage assessment tweets, whereas the proposed method performs well even in differentiating and detecting infrastructure and human damage assessment tweets. Furthermore, it gives better results than the existing methods in metrics such as accuracy, macro-precision, macro-recall
Table 6 Comparison of the proposed method with different existing state-of-the-art methods on different disaster datasets

Dataset                 Model                                            Accuracy  Macro-Precision  Macro-Recall  Macro-F1-Score  Year
Hurricane Irma          Imran et al. [23]                                68.04     70.37            68.04         67.04           2014
                        Rudra et al. [8]                                 48.51     36.73            48.33         40.07           2018
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Abinav et al. [52]                               -         66.00            63.40         57.80           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  77.13     77.27            77.13         77.10           present
Hurricane Harvey        Imran et al. [23]                                65.18     70.41            65.18         62.82           2014
                        Rudra et al. [8]                                 45.18     29.03            43.75         33.69           2018
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Abinav et al. [52]                               -         66.00            63.40         57.80           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  77.32     76.14            76.72         76.28           present
Hurricane Maria         Imran et al. [23]                                64.68     69.34            64.68         62.43           2014
                        Rudra et al. [8]                                 65.00     62.17            63.33         61.13           2018
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Abinav et al. [52]                               -         66.00            63.40         57.80           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  79.14     79.41            79.14         79.10           present
Mexico Earthquake       Imran et al. [23]                                71.16     73.93            71.16         70.11           2014
                        Rudra et al. [8]                                 55.83     41.67            52.50         43.33           2018
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Abinav et al. [52]                               -         55.60            63.40         57.80           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  79.37     79.90            79.37         79.26           present
Iraq-Iran Earthquake    Imran et al. [23]                                61.70     30.84            50.00         38.16           2014
                        Rudra et al. [8]                                 56.67     28.33            50.00         36.00           2018
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Abinav et al. [52]                               -         55.60            63.40         57.80           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  79.58     78.73            78.86         78.48           present
Chile Earthquake        Imran et al. [23]                                50.68     25.34            50            33.63           2014
                        Rudra et al. [8]                                 57.50     38.33            57.50         44              2018
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Abinav et al. [52]                               -         55.60            63.40         57.80           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  94.62     94.86            94.62         94.61           present
Nepal Earthquake        Imran et al. [23]                                56.31     28.16            50            36.03           2014
                        Rudra et al. [8]                                 45.18     37.58            42.92         38.80           2018
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Abinav et al. [52]                               -         55.60            63.40         57.80           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  78.16     78.02            78.05         77.87           present
Pakistan Earthquake     Imran et al. [23]                                60.78     78.12            60.78         53.14           2014
                        Rudra et al. [8]                                 57.50     55.83            57.50         53.67           2018
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Abinav et al. [52]                               -         55.60            63.40         57.80           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  82.26     82.74            82.25         82.19           present
Cyclone                 Imran et al. [23]                                74.09     78.69            74.09         72.87           2014
                        Rudra et al. [8]                                 48        24               45            31.25           2018
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  86.92     87.26            86.92         86.89           present
Typhoon Hagupit         Imran et al. [23]                                68.69     78.74            68.69         65.23           2014
                        Rudra et al. [8]                                 36.67     25.55            33.33         27.78           2018
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  85.22     86.58            85.22         85.06           present
India floods            Imran et al. [23]                                74.42     82.17            74.42         72.73           2014
                        Rudra et al. [8]                                 35.24     24               32.67         26.40           2018
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Abinav et al. [52]                               -         76.50            68.00         67.00           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  91.46     91.48            91.48         91.43           present
Italy Earthquake        TF-IDF [50]                                      -         86.29            86.04         86.05           2018
                        TF-IDF Normalized Entropy boosting [50]         -         88.50            87.10         87.71           2018
                        TF-IDF Class Normalized Entropy boosting [50]   -         89.94            86.79         88              2018
                        TF-IDF ECCD [50]                                 -         90.84            83.33         86.76           2018
                        Shalini et al. [39]                              -         26.00            79.00         39.00           2020
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Proposed Method                                  93.27     92.81            92.79         92.81           present
California Wildfires    Imran et al. [23]                                51.48     25.74            50.00         33.98           2014
                        Rudra et al. [8]                                 55.17     52.50            55.83         50.02           2018
                        Abinav et al. [52]                               -         44.40            46.20         44.80           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  73.31     73.78            73.45         73.24           present
Sri Lanka Floods        Imran et al. [23]                                90.00     92.27            90.00         89.76           2014
                        Rudra et al. [8]                                 50.00     35.00            50.00         40.00           2018
                        Alam et al. [34]                                 -         70.04            63.40         64.8            2019
                        Abinav et al. [52]                               -         76.50            68.00         67.00           2019
                        Firoj et al. [51]                                -         67.10            68.29         67.60           2020
                        Proposed Method                                  86.92     87.26            86.92         86.89           present
Table 7 Comparison of proposed method with different existing state-of-the-art methods on different disaster datasets using macro-f1-score parameter for multi-class classification

Dataset                 Model              Infrastructure damage  Human damage  Non-damage  Average
Hurricane Irma          Rudra et al. [8]   59.50                  0             54.08       37.86
                        Proposed Method    75.26                  49.89         74.34       66.50
Hurricane Harvey        Rudra et al. [8]   53.59                  13.01         43.97       36.86
                        Proposed Method    68.52                  58.69         72.54       66.58
Hurricane Maria         Rudra et al. [8]   0                      77.36         77.87       51.74
                        Proposed Method    75.92                  57.13         79.14       70.73
Mexico Earthquake       Rudra et al. [8]   0                      63.33         63.31       42.21
                        Proposed Method    61.15                  71.21         70.65       67.67
Iraq-Iran Earthquake    Rudra et al. [8]   0                      69.36         65.13       44.83
                        Proposed Method    74.33                  80.00         75.18       76.50
Chile Earthquake        Rudra et al. [8]   0                      87.04         85.09       57.38
                        Proposed Method    49.18                  90.46         93.84       77.83
Nepal Earthquake        Rudra et al. [8]   0                      69.31         70.78       46.70
                        Proposed Method    41.55                  75.06         75.98       64.20
Pakistan Earthquake     Rudra et al. [8]   0                      77.36         77.87       51.74
                        Proposed Method    49.67                  82.49         82.49       71.55
Cyclone                 Rudra et al. [8]   54.76                  41.14         64.42       53.44
                        Proposed Method    73.90                  80.43         80.30       78.21
Typhoon Hagupit         Rudra et al. [8]   66.48                  68.65         75.28       70.14
                        Proposed Method    71.48                  81.32         78.51       77.10
India floods            Rudra et al. [8]   0                      67.76         78.13       48.63
                        Proposed Method    41.94                  91.26         91.79       75.00
California Wildfires    Rudra et al. [8]   39.81                  51.35         53.18       48.11
                        Proposed Method    55.02                  68.98         61.53       61.84
Sri Lanka Floods        Rudra et al. [8]   41.72                  64.56         39.52       48.60
                        Proposed Method    74.61                  74.69         83.59       77.63
Table 8 Comparison of proposed method with different existing state-of-the-art methods on different disaster datasets using macro-precision parameter for multi-class classification

Dataset                 Model              Infrastructure damage  Human damage  Non-damage  Average
Hurricane Irma          Rudra et al. [8]   50.17                  0             57.10       35.76
                        Proposed Method    75.24                  59.81         71.79       68.95
Hurricane Harvey        Rudra et al. [8]   42.42                  46.22         49.98       46.21
                        Proposed Method    73.34                  64.31         67.17       68.27
Hurricane Maria         Rudra et al. [8]   45.15                  0             53.63       32.97
                        Proposed Method    77.16                  71.17         74.73       74.35
Mexico Earthquake       Rudra et al. [8]   0                      69.00         50.32       39.77
                        Proposed Method    66.02                  73.05         67.71       68.93
Iraq-Iran Earthquake    Rudra et al. [8]   0                      74.53         56.99       43.84
                        Proposed Method    88.33                  79.88         74.50       80.90
Chile Earthquake        Rudra et al. [8]   0                      84.91         78.72       54.54
                        Proposed Method    68.50                  90.19         92.17       83.62
Nepal Earthquake        Rudra et al. [8]   0                      68.11         66.44       44.85
                        Proposed Method    55.38                  75.84         72.30       67.84
Pakistan Earthquake     Rudra et al. [8]   0                      77.59         75.30       50.96
                        Proposed Method    61.67                  85.38         79.56       75.54
Cyclone                 Rudra et al. [8]   45.95                  54.37         71.63       57.32
                        Proposed Method    74.21                  86.24         78.19       79.55
Typhoon Hagupit         Rudra et al. [8]   68.36                  62.46         84.17       71.66
                        Proposed Method    74.13                  88.32         74.33       78.93
India floods            Rudra et al. [8]   0                      81.40         68.67       50.02
                        Proposed Method    53.05                  92.06         90.23       78.45
California Wildfires    Rudra et al. [8]   33.15                  54.16         72.78       53.36
                        Proposed Method    59.98                  72.14         57.67       63.26
Sri Lanka Floods        Rudra et al. [8]   42.08                  63.33         64.94       56.78
                        Proposed Method    85.33                  84.83         78.02       82.73
Table 9 Comparison of the proposed method with different existing state-of-the-art methods on different disaster datasets using macro-recall parameter for multi-classification

Dataset                 Model              Infrastructure damage  Human damage  Non-damage  Average
Hurricane Irma          Rudra et al. [8]   73.95                  0             52.47       42.14
                        Proposed Method    76.19                  43.42         77.92       65.84
Hurricane Harvey        Rudra et al. [8]   73.23                  7.92          40.06       40.40
                        Proposed Method    64.97                  55.25         79.55       66.59
Hurricane Maria         Rudra et al. [8]   76.90                  0             40.34       39.08
                        Proposed Method    75.52                  48.63         85.52       69.89
Mexico Earthquake       Rudra et al. [8]   0                      59.61         85.98       48.53
                        Proposed Method    58.27                  70.16         74.25       67.56
Iraq-Iran Earthquake    Rudra et al. [8]   0                      66.92         77.73       48.22
                        Proposed Method    66.67                  80.66         77.27       74.87
Chile Earthquake        Rudra et al. [8]   0                      89.47         92.81       60.76
                        Proposed Method    43.50                  91.20         96.14       76.95
Nepal Earthquake        Rudra et al. [8]   0                      72.64         77.44       50.03
                        Proposed Method    34.59                  74.43         80.33       63.12
Pakistan Earthquake     Rudra et al. [8]   0                      77.49         80.93       52.81
                        Proposed Method    43.33                  80.36         85.98       69.89
Cyclone                 Rudra et al. [8]   68.12                  33.45         59.11       53.56
                        Proposed Method    74.52                  76.17         83.24       77.98
Typhoon Hagupit         Rudra et al. [8]   65.22                  77.14         69.33       70.56
                        Proposed Method    69.67                  74.50         84.67       76.28
India floods            Rudra et al. [8]   0                      58.41         90.82       49.74
                        Proposed Method    37.67                  90.57         93.49       73.91
California Wildfires    Rudra et al. [8]   51.00                  50.32         43.76       48.36
                        Proposed Method    51.69                  66.87         66.95       61.84
Sri Lanka Floods        Rudra et al. [8]   44.50                  70.50         37.50       50.83
                        Proposed Method    70.50                  70.50         93.00       78
and macro-F1-score, as the experiments on the various disaster tweet datasets confirm. The method is also beneficial for organizations looking for either human damage or infrastructure damage information during a disaster.
Table 10 Comparison of the proposed method with the method from [8] in cross-domain using accuracy, macro-precision, macro-
recall and macro-f1-score parameters for binary classification
Testing Dataset         Accuracy               Macro-Precision        Macro-Recall           Macro-F1-score
                        Rudra      Proposed    Rudra      Proposed    Rudra      Proposed    Rudra      Proposed
                        et al. [8] Method      et al. [8] Method      et al. [8] Method      et al. [8] Method
Training All old datasets
Hurricane Harvey 58.85 62.18 59 63.52 59 62.17 59 61.21
Hurricane Irma 60.61 59.33 61 60.08 61 59.33 61 58.56
Hurricane Maria 60.07 61.97 60 63.83 60 61.97 60 60.64
California Wildfires 65.30 63.75 66 63.92 65 63.87 65 63.74
Sri Lanka Floods 58.89 75 59 76.44 59 75 58 74.65
Mexico earthquake 68.97 72.26 68 72.97 67 72.26 67 72.04
Iraq-Iran earthquake 68.24 74.09 66 72.58 65 72.32 66 72.44
Training all earthquake datasets
Hurricane Harvey 58.60 64.29 59 66.46 59 64.29 59 63.06
Hurricane Irma 60.61 60.27 61 61.54 61 60.27 61 59.15
Hurricane Maria 60.07 63.27 60 65.25 60 63.27 60 62.04
California Wildfires 65.30 67.14 66 67.60 65 67.34 65 67.06
Sri Lanka Floods 58.89 82.22 59 82.48 59 82.22 58 82.19
Cyclone 68.82 76.38 69 77.16 69 76.37 69 76.20
Typhoon Hagupit 79.81 76.80 80 80.11 80 76.80 80 76.15
India Floods 72.80 76.86 75 81.20 73 76.87 72 76.04
Training old earthquake datasets
Mexico Earthquake 66.78 73.90 67 73.92 67 73.91 67 73.90
Iraq-Iran Earthquake 68.20 74.45 66 73.40 65 74.42 66 73.64
The experimental results of the cross-domain scenario for binary classification using different metrics are shown in Table 10. The first column gives the dataset used for testing, and the remaining columns give the evaluation metrics accuracy, macro-precision, macro-recall, and macro-F1-score, respectively. The proposed method is compared with the state-of-the-art method [8] in the three cross-domain cases explained in Section 4.2.2. From Table 10, when comparing the test results on Hurricane Harvey, Hurricane Irma, Hurricane Maria, California Wildfires, and Sri Lanka Floods between case-1 and case-2, case-2 achieves the highest values on all metrics for binary classification. This indicates that the proposed model works best when trained only on the combination of all earthquake datasets instead of being trained on the diverse old datasets and tested on the other disaster datasets. Even though case-3 has fewer training tweets than case-1, the proposed method achieves better performance when testing on the new earthquake disaster datasets. In case-1 and case-2, the Sri Lanka Floods and Hurricane Irma datasets obtain the highest and lowest values, respectively, for the proposed model across the different metrics. In case-3, the Iraq-Iran Earthquake and Mexico Earthquake datasets give the highest and lowest values for the proposed model. In case-1, the proposed model provides better results than the existing method on all datasets except Hurricane Irma and California Wildfires; in case-2, it produces better results on all datasets except Typhoon Hagupit and California Wildfires. From this, it is observed that the proposed method captures the information required for identifying damage assessment tweets from earthquake disaster tweets.
The experimental results of the cross-domain scenario for multi-class classification using different metrics are shown in Tables 11 and 12, where Class-1, Class-2, and Class-3 denote infrastructure damage, human damage, and non-damage assessment tweets, respectively. As in the in-domain multi-class setting, the existing method [8] does not differentiate infrastructure from human damage assessment tweets well on most datasets in all three cross-domain cases. The proposed method, however, performs well compared with the existing method even for cross-domain multi-class classification, in terms of accuracy, macro-F1-score, macro-precision, and macro-recall. In case-1, the Iraq-Iran Earthquake and California Wildfires datasets have the highest and lowest values, respectively, for the proposed model across all metrics. Similarly, in case-2, India Floods and Hurricane Irma obtain the highest and lowest values, and in case-3, the Iraq-Iran Earthquake and Mexico Earthquake do. In case-2, the proposed model performs well on all datasets except Hurricane Irma, and in the other cases it performs well on all datasets in cross-domain multi-class classification.

Finally, comparing the results of the proposed method for the in-domain and cross-domain scenarios, the in-domain scenario has higher values; however, the proposed method gives the best performance in both scenarios compared with the existing methods. This suggests that, when no labeled dataset is available for a new disaster, the cross-domain model can be used for classifying tweets during the initial days of the disaster, and the in-domain model can be used once a labeled dataset of the new disaster is obtained.
Table 11 Comparison of the proposed method with the method from [8] in cross-domain for Multi-classification using Accuracy
and Macro-F1-score parameters
Testing Dataset         Accuracy                  Macro-F1-score
                        Rudra       Proposed      Rudra et al. [8]                     Proposed Method
                        et al. [8]  Method        Class-1  Class-2  Class-3  Average   Class-1  Class-2  Class-3  Average
Training All old datasets
Hurricane Harvey 38.87 49.74 0 28 55 27.67 38 47 57 47.33
Hurricane Irma 40.40 49.60 0 25 58 27.67 43 35 73 50.33
Hurricane Maria 40.44 52.94 0 28 88 38.67 43 39 62 48
California Wildfires 51.47 52.66 0 65 60 41.67 42 58 55 51.67
Sri Lanka Floods 51.47 64.71 0 65 46 37 55 62 73 63.33
Mexico earthquake 54.83 57.98 0 62 62 41.33 52 65 54 57
Iraq-Iran earthquake 64.23 68.98 0 69 67 45.33 56 77 63 65.33
Training all earthquake datasets
Hurricane Harvey 49.38 50.96 41 0 61 34 38 46 60 48
Hurricane Irma 50.31 48.4 42 0 61 34.33 34 40 58 44
Hurricane Maria 48.20 51.12 26 0 62 29.33 42 33 62 45.67
California Wildfires 51.85 54.08 42 0 65 35.67 39 61 56 52
Sri Lanka Floods 55.06 58.09 46 0 68 38 43 55 69 55.67
Cyclone 49.90 69.34 36 0 63 33 61 75 71 69
Typhoon Hagupit 56.44 63.57 46 0 70 38.67 41 77 67 61.67
India Floods 59.32 72.14 50 0 76 42 15 81 72 56
Training old earthquake datasets
Mexico Earthquake 54.83 62.25 0 62 62 41.33 52 68 62 60.67
Iraq-Iran Earthquake 64.23 65.69 0 69 67 45.33 41 73 64 59.33
4.6 Limitations and Error Analysis

This section describes the limitations of the proposed method, i.e., where it fails to detect damage assessment tweets during a disaster, and also discusses future research directions.

The present work is beneficial for all types of disasters, such as earthquakes, hurricanes, and floods, in detecting damage assessment tweets, even when the model is trained on different disasters. However, it does not perform well when training the model on old (past) disaster datasets and testing on the Hurricane Irma dataset, for both binary and multi-class classification; the reasons are also
Table 12 Comparison of the proposed method with the method from [8] in cross-domain for Multi-classification using Macro-Precision and Macro-Recall parameters

Testing Dataset         Method             Macro-Precision                       Macro-Recall
                                           Class-1  Class-2  Class-3  Average    Class-1  Class-2  Class-3  Average
Training All old datasets
Hurricane Harvey        Rudra et al. [8]   0        31       41       24         0        26       86       37.33
                        Proposed Method    59       58       45       54         28       39       79       48.67
Hurricane Irma          Rudra et al. [8]   0        24       44       22.67      0        25       87       37.33
                        Proposed Method    65       36       48       49.67      32       33       73       46
Hurricane Maria         Rudra et al. [8]   0        31       42       24.33      0        26       88       38
                        Proposed Method    72       46       50       56         30       33       84       49
California Wildfires    Rudra et al. [8]   0        61       46       35.67      0        70       85       51.67
                        Proposed Method    55       64       45       54.67      34       53       68       51.67
Sri Lanka Floods        Rudra et al. [8]   0        61       46       35.67      0        70       85       51.67
                        Proposed Method    68       69       61       66         46       57       91       64.67
Mexico Earthquake       Rudra et al. [8]   0        66       48       38         0        55       85       46.67
                        Proposed Method    47       61       65       57.67      58       69       47       58
Iraq-Iran Earthquake    Rudra et al. [8]   0        80       54       44.67      0        60       88       49.33
                        Proposed Method    48       76       67       63.67      67       77       59       67.67
Training all earthquake datasets
Hurricane Harvey        Rudra et al. [8]   56       0        47       34.33      32       0        85       39
                        Proposed Method    62       54       47       54.33      27       41       81       49.67
Hurricane Irma          Rudra et al. [8]   63       0        47       36.67      31       0        88       39.67
                        Proposed Method    67       39       47       51         23       41       77       47
Hurricane Maria         Rudra et al. [8]   50       0        48       32.67      17       0        88       35
                        Proposed Method    78       33       50       53.67      29       33       82       48
California Wildfires    Rudra et al. [8]   54       0        51       35         34       0        88       40.67
                        Proposed Method    63       65       46       58         28       57       73       52.67
Sri Lanka Floods        Rudra et al. [8]   53       0        56       36.33      40       0        88       42.67
                        Proposed Method    65       56       57       59.33      33       55       87       58.33
Cyclone                 Rudra et al. [8]   52       0        49       33.67      28       0        86       38
                        Proposed Method    87       75       60       74         47       75       87       69.67
Typhoon Hagupit         Rudra et al. [8]   54       0        57       37         40       0        89       43
                        Proposed Method    81       75       54       70         27       80       87       64.67
India Floods            Rudra et al. [8]   44       0        68       37.33      58       0        88       48.67
                        Proposed Method    10       77       87       58         36       85       61       60.67
Training old earthquake datasets
Mexico Earthquake       Rudra et al. [8]   0        66       49       38.33      0        58       85       47.67
                        Proposed Method    53       61       71       61.67      51       76       55       60.67
Iraq-Iran Earthquake    Rudra et al. [8]   0        80       54       44.67      0        60       88       49.33
                        Proposed Method    36       72       69       59         47       75       59       60.33
provided. This issue needs to be investigated in the future. Also, in the cross-domain case, the India Floods dataset needs to be investigated further, specifically for identifying infrastructure damage rather than human damage.
Furthermore, we perform an extensive error analysis on the results of the proposed method to find the error patterns where it fails to detect tweets related to damage assessment. For this analysis, we selected the datasets with the lowest performance across all the cases (binary and multi-class classification, both in-domain and cross-domain), for instance, California Wildfires in binary classification. We compared the actual and predicted labels of the tweets for the selected datasets and focused on the error patterns with the highest frequency among the observed tweets. The reasons for the misclassification of tweets, along with examples, are as follows:
Description: The actual label of the tweet is infrastructure damage, but the proposed model predicts it as human damage. This is due to the presence of words such as 'deadly' and 'heads' in the infrastructure damage tweet.
4. Presence of more infrastructure damage words than human damage words in the human damage assessment tweet.
Example tweet: Joy & Jerry has been sleeping in crushed Stock Island home. No place to stay, but have food/H20 @WPLGLocal10 #Irma [URL].
Description: The tweet above is labeled as human damage, but the model predicts it as infrastructure damage. The reason is that the majority of the tweet's content relates to infrastructure damage rather than human damage.
5. Presence of question words along with general terms in tweets related to either infrastructure damage or human damage.
Example tweet: How Harvey changed the shape of three families — one forever Read more: [URL] #news [URL].
Description: The example tweet above is labeled as human damage, but the model predicts it as non-damage in the cross-domain case (training and testing on different disaster datasets). This is due to the presence of question words such as 'how' and 'why' in the tweet.
6. Very few infrastructure damage related terms in the tweet.
Example tweet: Our boardwalk through the Cypress tree preserve sustained heavy damage from #Irma. We will rebuild. @LeeSchools [URL].
Description: The tweet above is labeled as infrastructure damage, but the model predicts it as non-damage in the cross-domain case. The explanation is that the tweet contains very few words related to infrastructure damage, such as 'heavy damage' and 'rebuild', while the rest of the content is unrelated to damage.
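The label-comparison step referenced above can be sketched as follows. This is a minimal, generic illustration (not the analysis code used in this work) that counts (actual, predicted) misclassification pairs and reports the most frequent ones; the label names are assumptions.

```python
# Minimal sketch of the error analysis described above: compare actual and
# predicted labels and count the most frequent (actual -> predicted)
# misclassification patterns. Generic illustration; label names are assumptions.
from collections import Counter

def top_error_patterns(actual, predicted, k=5):
    """Most common (actual, predicted) pairs among misclassified tweets."""
    errors = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return errors.most_common(k)

actual    = ["infra", "human", "human", "non",   "infra", "infra"]
predicted = ["human", "infra", "human", "infra", "human", "non"]
for (a, p), n in top_error_patterns(actual, predicted):
    print(f"{a} misclassified as {p}: {n} tweet(s)")
```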
To resolve these error patterns, there is scope for designing novel methodologies in the future for identifying damage assessment tweets during a disaster. Besides, the work can be extended to identify the specific type and severity of human and infrastructure damage from the tweets, i.e., whether it is severe damage, mild damage, or no damage. Another interesting direction is to rank the tweets by assigning priorities based on the severity of the damage. If a tweet contains geo-location information and indicates a medical emergency for an injured person, it is given the highest priority; if a tweet contains neither location information nor a specific damage type, it is given the lowest priority. When time is critical during a disaster, such ranking will be beneficial to both the authorities and the victims.
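A minimal sketch of this prioritization direction is given below. The scoring levels, field names, and example tweets are illustrative assumptions, not part of the proposed method.

```python
# Illustrative sketch of the tweet-prioritization direction outlined above:
# a tweet with geo-location and a medical emergency gets the highest priority;
# a tweet with neither location nor a damage type gets the lowest. The scoring
# levels and field names are assumptions, not part of the proposed method.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tweet:
    text: str
    has_geo: bool               # geo-location information attached?
    medical_emergency: bool     # mentions an injured person needing urgent help?
    damage_type: Optional[str]  # "human", "infrastructure", or None

def priority(t: Tweet) -> int:
    """Higher value = more urgent."""
    if t.has_geo and t.medical_emergency:
        return 3  # highest: locatable medical emergency
    if t.damage_type is not None:
        return 2 if t.has_geo else 1  # known damage, with/without location
    return 0      # lowest: no location information and no damage type

tweets = [
    Tweet("Injured man trapped near the bridge, send medics", True, True, "human"),
    Tweet("Several buildings collapsed downtown", True, False, "infrastructure"),
    Tweet("Thoughts and prayers for everyone affected", False, False, None),
]
for t in sorted(tweets, key=priority, reverse=True):
    print(priority(t), "-", t.text)
```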
5 Conclusion
A novel method based on features weighted using linear regression and SVR has been proposed to identify damage assessment tweets during a disaster, for both binary and multi-class classification. We showed that the proposed method is effective in both the in-domain and cross-domain settings. The experimental results show that the proposed method consistently outperforms existing methods on most of the datasets in both settings. The in-domain and cross-domain results demonstrate the ability of the proposed approach to perform well with and without labeled data of a specific disaster, respectively. Besides, the results suggest that the proposed method, particularly when trained with earthquake disaster datasets, can be used to detect tweets relevant to damage assessment for any disaster.
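As a compact illustration of the feature-weighting scheme summarized above, the sketch below fits a linear SVR on the extracted features, takes the magnitudes of its coefficients as feature weights, and trains a random forest on the re-weighted features. It is a hedged approximation of the general scheme on synthetic data, not the exact implementation; in particular, the multiplicative weighting form is an assumption.

```python
# Compact, hedged sketch of the weighted-feature scheme summarized above:
# fit a linear SVR on the extracted features, take the magnitudes of its
# coefficients as feature weights, and train a random forest on the
# re-weighted features. Synthetic data and the multiplicative weighting
# form are assumptions; this is not the authors' exact implementation.
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestClassifier

def svr_feature_weights(X, y):
    """Per-feature weights from the coefficients of a linear-kernel SVR."""
    svr = SVR(kernel="linear").fit(X, y)
    return np.abs(svr.coef_).ravel()

# X: rows = tweets, columns = lexical/frequency/syntactic features (assumed
# to be extracted elsewhere); y: binary damage/non-damage labels.
rng = np.random.default_rng(0)
X, y = rng.random((200, 10)), rng.integers(0, 2, 200)

w = svr_feature_weights(X, y)
clf = RandomForestClassifier(random_state=0).fit(X * w, y)
print(clf.predict(X[:5] * w))
```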
References
1. Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. Processing social media messages in mass emergency:
A survey. ACM Computing Surveys (CSUR), 47(4):67, 2015.
2. Kate Starbird, Leysia Palen, Amanda L Hughes, and Sarah Vieweg. Chatter on the red: what hazards threat reveals about the
social life of microblogged information. In Proceedings of the 2010 ACM conference on Computer supported cooperative work,
pages 241–250. ACM, 2010.
3. Madichetty Sreenivasulu and M Sridevi. A survey on event detection methods on various social media. In Recent Findings in
Intelligent Computing Techniques, pages 87–93. Springer, 2018.
4. Sreenivasulu Madichetty and M Sridevi. A stacked convolutional neural network for detecting the resource tweets during a
disaster. Multimedia Tools and Applications, pages 1–23, 2020.
5. Sreenivasulu Madichetty et al. Re-ranking feature selection algorithm for detecting the availability and requirement of resources
tweets during disaster. International Journal of Computational Intelligence & IoT, 1(2), 2018.
6. Madichetty Sreenivasulu and M Sridevi. Mining informative words from the tweets for detecting the resources during disaster.
In International Conference on Mining Intelligence and Knowledge Exploration, pages 348–358. Springer, 2017.
7. Sreenivasulu Madichetty and M Sridevi. Improved classification of crisis-related data on twitter using contextual representations.
Procedia Computer Science, 167:962–968, 2020.
8. Koustav Rudra, Niloy Ganguly, Pawan Goyal, and Saptarshi Ghosh. Extracting and summarizing situational information from
the twitter social media during disasters. ACM Transactions on the Web (TWEB), 12(3):17, 2018.
9. Sreenivasulu Madichetty and M Sridevi. Classifying informative and non-informative tweets from the twitter by adapting image
features during disaster. Multimedia Tools and Applications, pages 1–23, 2020.
10. Dat T Nguyen, Ferda Ofli, Muhammad Imran, and Prasenjit Mitra. Damage assessment from social media imagery data
during disasters. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis
and Mining 2017, pages 569–576. ACM, 2017.
11. Stefano Cresci, Maurizio Tesconi, Andrea Cimino, and Felice Dell’Orletta. A linguistically-driven approach to cross-event
damage assessment of natural disasters from social media messages. In Proceedings of the 24th International Conference on
World Wide Web, pages 1195–1200. ACM, 2015.
12. Shalini Priya, Manish Bhanu, Sourav Kumar Dandapat, Kripabandhu Ghosh, and Joydeep Chandra. Characterizing infras-
tructure damage after earthquake: a split-query based ir approach. In 2018 IEEE/ACM International Conference on Advances
in Social Networks Analysis and Mining (ASONAM), pages 202–209. IEEE, 2018.
13. Sudha Verma, Sarah Vieweg, William J Corvey, Leysia Palen, James H Martin, Martha Palmer, Aaron Schram, and Kenneth M Anderson. Natural language processing to the rescue? Extracting "situational awareness" tweets during mass emergency. In Fifth International AAAI Conference on Weblogs and Social Media, pages 385–392. Citeseer, 2011.
14. Erik Cambria, Soujanya Poria, Amir Hussain, and Bing Liu. Computational intelligence for affective computing and sentiment
analysis [guest editorial]. IEEE Computational Intelligence Magazine, 14(2):16–17, 2019.
15. Wei Li, Luyao Zhu, Yong Shi, Kun Guo, and Erik Cambria. User reviews: Sentiment analysis using lexicon integrated two-
channel cnn-lstm family models. Appl. Soft Comput., 94:106435, 2020.
16. Mohammad Ehsan Basiri, Shahla Nemati, Moloud Abdar, Erik Cambria, and U Rajendra Acharya. Abcdm: An attention-based bidirectional cnn-rnn deep model for sentiment analysis. Future Generation Computer Systems, 2020.
17. Md Shad Akhtar, Asif Ekbal, and Erik Cambria. How intense are you? predicting intensities of emotions and sentiments using
stacked ensemble. IEEE Computational Intelligence Magazine, 15(1):64–75, 2020.
18. Amir Hussain and Erik Cambria. Semi-supervised learning for big social data analysis. Neurocomputing, 275:1662–1673, 2018.
19. Yunqing Xia, Erik Cambria, Amir Hussain, and Huan Zhao. Word polarity disambiguation using bayesian model and opinion-
level features. Cognitive Computation, 7(3):369–380, 2015.
20. Sreekanth Madisetty and Maunendra Sankar Desarkar. An ensemble based method for predicting emotion intensity of tweets.
In International Conference on Mining Intelligence and Knowledge Exploration, pages 359–370. Springer, 2017.
21. Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Tweet analysis for real-time event detection and earthquake reporting
system development. IEEE Transactions on Knowledge and Data Engineering, 25(4):919–931, 2013.
22. Tahora H Nazer, Fred Morstatter, Harsh Dani, and Huan Liu. Finding requests in social media for disaster relief. In Advances
in Social Networks Analysis and Mining (ASONAM), 2016 IEEE/ACM International Conference on, pages 1410–1413. IEEE,
2016.
23. Muhammad Imran, Carlos Castillo, Ji Lucas, Patrick Meier, and Sarah Vieweg. Aidr: Artificial intelligence for disaster response.
In Proceedings of the 23rd International Conference on World Wide Web, pages 159–162. ACM, 2014.
24. Mark A Cameron, Robert Power, Bella Robinson, and Jie Yin. Emergency situation awareness from twitter for crisis manage-
ment. In Proceedings of the 21st International Conference on World Wide Web, pages 695–698. ACM, 2012.
25. Anirban Sen, Koustav Rudra, and Saptarshi Ghosh. Extracting situational awareness from microblogs during disaster events.
In Communication Systems and Networks (COMSNETS), 2015 7th International Conference on, pages 1–6. IEEE, 2015.
26. Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz, and Patrick Meier. Extracting information nuggets from
disaster-related messages in social media. In Iscram, 2013.
27. Cornelia Caragea, Adrian Silvescu, and Andrea H Tapia. Identifying informative messages in disaster events using convolutional
neural networks. In International Conference on Information Systems for Crisis Response and Management, 2016.
28. Dat Tien Nguyen, Kamela Ali Al Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, and Prasenjit Mitra. Rapid clas-
sification of crisis-related data on social networks using convolutional neural networks. arXiv preprint arXiv:1608.03902,
2016.
29. Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz, and Patrick Meier. Practical extraction of disaster-
relevant information from social media. In Proceedings of the 22nd International Conference on World Wide Web, pages
1021–1024. ACM, 2013.
30. Kiran Zahra, Muhammad Imran, and Frank O Ostermann. Automatic identification of eyewitness messages on twitter during disasters. Information Processing & Management, 57(1):102107, 2020.
31. Mayank Kejriwal and Peilin Zhou. On detecting urgency in short crisis messages using minimal supervision and transfer
learning. Social Network Analysis and Mining, 10(1):1–12, 2020.
32. Moumita Basu, Anurag Shandilya, Prannay Khosla, Kripabandhu Ghosh, and Saptarshi Ghosh. Extracting resource needs and
availabilities from microblogs for aiding post-disaster relief operations. IEEE Transactions on Computational Social Systems,
6(3):604–618, 2019.
33. Ritam Dutt, Moumita Basu, Kripabandhu Ghosh, and Saptarshi Ghosh. Utilizing microblogs for assisting post-disaster relief
operations via matching resource needs and availabilities. Information Processing & Management, 56(5):1680–1697, 2019.
34. Firoj Alam, Ferda Ofli, and Muhammad Imran. Descriptive and visual summaries of disaster events using artificial intelligence
techniques: case studies of hurricanes harvey, irma, and maria. Behaviour & Information Technology, pages 1–31, 2019.
35. Hemant Purohit, Carlos Castillo, and Rahul Pandey. Ranking and grouping social media requests for emergency services using
serviceability model. Social Network Analysis and Mining, 10(1):1–17, 2020.
36. Sreenivasulu Madichetty and M Sridevi. Identification of medical resource tweets using majority voting-based ensemble during
disaster. Social Network Analysis and Mining, 10(1):1–18, 2020.
37. Muhammad Imran, Firoj Alam, Umair Qazi, Steve Peterson, and Ferda Ofli. Rapid damage assessment using social media
images by combining human and machine intelligence. arXiv preprint arXiv:2004.06675, 2020.
38. Xukun Li, Doina Caragea, Cornelia Caragea, Muhammad Imran, and Ferda Ofli. Identifying disaster damage images using a
domain adaptation approach. In ISCRAM, 2019.
39. Shalini Priya, Manish Bhanu, Sourav Kumar Dandapat, Kripabandhu Ghosh, and Joydeep Chandra. Taqe: tweet retrieval-
based infrastructure damage assessment during disasters. IEEE transactions on computational social systems, 7(2):389–403,
2020.
40. Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A Smith. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 380–390, 2013.
41. Irina P Temnikova, Carlos Castillo, and Sarah Vieweg. Emterms 1.0: A terminological resource for crisis tweets. In ISCRAM,
2015.
42. Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. Crisislex: A lexicon for collecting and filtering mi-
croblogged communications in crises. In Eighth International AAAI Conference on Weblogs and Social Media, 2014.
43. Harris Drucker, Christopher JC Burges, Linda Kaufman, Alex J Smola, and Vladimir Vapnik. Support vector regression
machines. In Advances in neural information processing systems, pages 155–161, 1997.
44. Sanford Weisberg. Applied linear regression, volume 528. John Wiley & Sons, 2005.
45. Saptarshi Ghosh, Kripabandhu Ghosh, Tanmoy Chakraborty, Debasis Ganguly, Gareth Jones, and Marie-Francine Moens. First International Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP). In Proceedings of the 39th European Conference on IR Research (ECIR 2017), LNCS 10193, pages 779–783. Springer International Publishing, 2017.
46. Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. Twitter as a lifeline: Human-annotated twitter corpora for nlp of
crisis-related messages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC
2016), Paris, France, May 2016. European Language Resources Association (ELRA).
47. Firoj Alam, Ferda Ofli, and Muhammad Imran. Crisismmd: Multimodal twitter datasets from natural disasters. In Proceedings
of the 12th International AAAI Conference on Web and Social Media (ICWSM), June 2018.
48. J Rexiline Ragini, PM Rubesh Anand, and Vidhyacharan Bhaskar. Big data analytics for disaster response and recovery
through sentiment analysis. International Journal of Information Management, 42:13–24, 2018.
49. Koustav Rudra, Subham Ghosh, Niloy Ganguly, Pawan Goyal, and Saptarshi Ghosh. Extracting situational information from microblogs during disaster events: a classification-summarization approach. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 583–592. ACM, 2015.
50. Samujjwal Ghosh and Maunendra Sankar Desarkar. Class specific tf-idf boosting for short-text classification. Proc. of SMERP 2018, 2018.
51. Firoj Alam, Hassan Sajjad, Muhammad Imran, and Ferda Ofli. Standardizing and benchmarking crisis-related social media
datasets for humanitarian information processing. arXiv preprint arXiv:2004.06774, 2020.
52. Abhinav Kumar, Jyoti Prakash Singh, and Sunil Saumya. A comparative analysis of machine learning techniques for disaster-
related tweet classification. In 2019 IEEE R10 Humanitarian Technology Conference (R10-HTC)(47129), pages 222–227.
IEEE, 2019.