Building A Business Email Compromise Research Dataset With Large Language Models

This paper presents a novel system utilizing Large Language Models (LLMs) to create publicly available datasets for Business Email Compromise (BEC) research, addressing the current lack of such resources. Two datasets are generated: a small proof-of-concept dataset (BEC-1) and a larger research dataset (BEC-2), both achieving high agreement scores, indicating their credibility for analysis. The work aims to enhance BEC detection methodologies by providing researchers with realistic email samples that reflect actual attack scenarios.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/380151635

Building a Business Email Compromise Research Dataset with Large Language Models

Preprint · October 2024
DOI: 10.13140/RG.2.2.32482.95689

1 author: Rohit Dube (University of California, Berkeley)

All content following this page was uploaded by Rohit Dube on 18 October 2024.


Building a Business Email Compromise Research
Dataset with Large Language Models
Rohit Dube
Independent Researcher
California, USA
October 18, 2024

Abstract—Email-based attacks, such as Business Email Compromise, seriously threaten many organizations. In recent years, Large Language Models have improved the potency of email-based attacks by giving attackers an easy-to-use tool to overcome the language barrier and craft believable emails. At the same time, Business Email Compromise research remains hamstrung by the lack of a publicly available dataset. This paper proposes a novel system composed of Large Language Models to create Business Email Compromise datasets. Two datasets are generated. The first one (BEC-1) is a small 20-email proof-of-concept dataset that demonstrates that the system produces a dataset that a human analyst finds credible. The second (BEC-2) is a larger 279-email dataset generated using the same system. BEC-2 is the first public Business Email Compromise dataset available to the email security research community. The paper also proposes an accuracy-like metric called "agreement score" to measure the quality of datasets produced. Both BEC-1 and BEC-2 have high agreement scores – 90 and 93, respectively – validating the effectiveness of the Large Language Model system.

Index Terms—Phishing, Business Email Compromise, Email Dataset, Natural Language Processing, Large Language Model

I. INTRODUCTION

Email-borne attacks remain one of the primary cybersecurity attack vectors today. "Phishing" is a category of email attacks where the recipient of an email is tricked into divulging sensitive data or downloading malware [1]. "Spear Phishing" is a sub-category of Phishing where the email-based attack is targeted towards an individual or a small group of related (usually by association with the same business organization) individuals [2]. "Business Email Compromise (BEC)" is a sub-category of Spear Phishing where the attacker attempts to steal money or confidential information from an organization [3].¹ Attackers using BEC hope to (eventually) gain financially from their efforts.²

¹ BEC is also known as "email fraud" [4].
² Examples of BEC emails are included in appendix A.

As per the US Federal Bureau of Investigation, documented BEC losses in the US in 2016 were $360 million. In 2021, these losses escalated to $2.4 billion [5]. The 2023 estimate is even higher at $2.9 billion [6]. Clearly, BEC is a consequential attack type worth studying.

A frequently encountered BEC attack involves an attacker pretending to be a trusted business partner whose IT infrastructure has already been compromised by the attacker. The attacker uses social engineering techniques in urgent-sounding emails to trick the victim into changing payment information so that a future payment is redirected. The email conversation between the attacker and the victim may span multiple days and require the attacker to have some non-public information about the victim. The attacker's goal is to have a legitimate payment transaction end up in an attacker-controlled account [7]. This attack is difficult to detect as there is no suspicious URL or attachment in the emails, and the email conversation is with a known partner.

Email security systems attempt to detect BEC attacks using signals from the three major components of emails. First, such systems scrutinize the email header to detect signs of compromise, such as emails coming from "known bad" sender domains. Second, they determine if the URLs or attachments (if these exist) in the email are malicious using techniques such as URL lookups and network sandboxing. Third, they detect behaviors embedded in the text of the email to ascertain if social engineering is afoot. A cleverly crafted email from a trusted business partner (as described above) containing no URLs or attachments can bypass the email security system and land in the victim's inbox.

Given the volume of email and the criticality of business transactions discussed over this medium, real-world email security systems have to balance BEC detection performance measures (accuracy, false positives, false negatives), system performance characteristics (throughput, latency), and economics (system cost). This balancing is no easy task and requires the tiering of multiple detection mechanisms to make the systems commercially viable [8], [9].

If email security researchers had access to a publicly available BEC dataset, they could advance the state of the art in BEC detection. Unfortunately, no such dataset exists. We attempt to correct this problem by developing a system that takes real-world BEC examples and perturbs them to create BEC samples.

Large Language Models (LLMs) have emerged as a general-purpose tool that makes many email security tasks tractable [10]. Our system uses LLMs to generate email variations from the input real-world examples and automatically label them. We test this system by creating a small proof-of-concept dataset (BEC-1) and manually labeling it. We find that the system does, in fact, generate BEC emails that a human would find credible. Encouraged by the success of the proof-of-concept dataset, we create a larger research dataset (BEC-2) that can be used to test the efficacy of third-party BEC detection systems.

The primary contributions of this paper include:
• confirming the unavailability of a public BEC dataset via a literature survey.
• providing proof that LLMs can be used to create BEC datasets.
• creating and making public a research BEC dataset.

The rest of this paper is organized as follows. Section II reviews recent research papers on BEC. Special attention is given to research at the intersection of BEC and LLMs. Section III provides an overview of the system that creates BEC datasets and labels them. Section IV details how BEC emails are labeled. This section also discusses metrics used to evaluate the quality of BEC datasets. Section V sheds light on the LLM-based models in the system. Sections VI and VII evaluate the proof-of-concept BEC-1 dataset in great detail. Section VIII discusses the larger research-ready dataset BEC-2 using the same analysis principles as BEC-1. Section IX concludes the paper. Finally, section X presents some ideas for future research.

II. RELATED WORK

The papers selected for review in this section discuss three problems in email security:
1) The detection of unwanted emails (Spam, Phishing, Spear Phishing, and BEC) using Artificial Intelligence (AI) techniques such as natural language processing.
2) The generation of such emails using LLMs for adversarial demonstration. The idea is to provide proof that an LLM can be used to produce emails with specific properties.
3) The analysis, creation, and use of public or proprietary email datasets. These datasets are used to determine the efficacy of commercial email security systems and research prototypes.

Since the topics above are related, some papers touch upon all three problems.

[8] dates back to 2019 and is perhaps the earliest description of a BEC attack detection system. The paper's authors describe a system that has two stages. The first stage uses information other than the subject and body (such as the email header) to determine if an email is suspicious. Suspicious emails are taken through a second stage that analyzes the subject and body of the email. If one of the analyses in the second stage also finds an email to be suspicious, it is blocked. The second stage includes a natural language processing technique – the authors create a vector for each processed email using "term frequency-inverse document frequency" (TF-IDF). A pre-trained classifier then classifies the vectors as either malicious or benign.

[8] uses a proprietary commercial dataset to train and evaluate the BEC detection system it describes. It also makes a few observations regarding BEC research and datasets that are worth noting:
• Existing research work on BEC (as of 2019) uses small and unrealistic datasets.
• BEC research is limited because BEC attacks primarily affect corporations rather than consumers. Thus, the email data resides with corporations, which do not release it to outside researchers due to privacy concerns.

[11] observes that compromised legitimate accounts could be used in BEC attacks. The paper's authors propose stopping such emails by detecting them before they are sent. The detection of BEC emails, in turn, depends on manually identified features. The paper describes two natural language processing techniques – word embeddings and sentiment analysis – that function as features of the detection system. For both word embeddings and sentiment analysis, there is a notion of a centroid that represents the average email behavior of a user or a group of users. When an email deviates too far from the centroid, it is flagged and may be blocked. Note that [11] was written in late 2019, just before the widespread recognition of LLMs. As such, it doesn't reference LLMs.

[12] generates perturbed versions of (real) BEC emails via a natural language processing pipeline that includes an LLM (BERT) to generate a BEC dataset. The input BEC emails and the dataset generated by the pipeline remain proprietary. The paper confirms (as of 2020) the lack of public corpora for BEC.

[13] is a survey paper that reviews 38 research studies on using machine learning in BEC detection. The survey highlights the appropriateness of natural language processing techniques for (often text-only) BEC emails. The paper also reports that most of the studies it surveyed used a customized dataset to report performance and that there are no publicly available yet realistic BEC datasets (as of 2022).

[14] describes the experience of a technically competent person without extensive security experience in generating Spear Phishing (not BEC) emails using GPT-3/3.5/4. The paper's author was successful in using LLMs to collect information about the target group (600 British members of parliament), craft credible emails using the collected information, and inject a script into email-attached documents that would download malware onto the recipient's computer when the recipient opened the document. However, the paper does not describe a systematic toolchain to regenerate the emails or allude to a new publicly available Spear Phishing dataset.

Can LLMs be used to detect BEC attacks? [15] discusses this topic – it describes a custom model that combines LLM technology (BERT) with another neural network model (Long Short Term Memory or LSTM). The composite model is used to detect BEC attacks. However, since no public BEC dataset was available, [15] uses a Phishing ("Nigerian fraud") and benign email dataset to develop, validate, and test the custom model. Additionally, to create a BEC dataset proxy from the existing datasets, the paper's authors remove URLs and attachments to extract the text portion of emails. A skeptic would argue that such transformations significantly change the nature of the email datasets, potentially making them
inappropriate for BEC detection model testing (and perhaps for development and validation as well).

[16] describes its authors' experience building, bootstrapping, and maintaining a commercial email security system that detects BEC attacks. The system described breaks each incoming email into sentences or sentence fragments. Each of these fragments is fed into statistical and AI models called detectors. There are ≈ 90 such detectors in the system, and each detector detects the presence of one or more specific behaviors in the fragment (e.g., urgency, call to action, frequent communication). The training data for some detectors is obtained by classifying emails using GPT-4 – an LLM. While GPT-4's classification is expensive and lacks the classification performance needed for production use, the performance is sufficient for offline labeling of training data. The paper includes some additional observations on training data:
• Privacy regimes governing commercial email systems hinder obtaining samples of benign emails. If an email is not problematic, a commercial system's staff typically cannot access it.
• In general, no publicly available dataset includes BEC attacks (as of 2023). Publicly available email datasets are either old or focused on less severe threats than BEC.
• The characteristics of incoming email to commercial systems drift over time. The drift can be gradual as communication style and vocabulary change. However, the drift can sometimes be abrupt, as when a batch of new customers is brought on board.

[17] provides evidence in favor of LLMs' capabilities by running a Spear Phishing experiment at a large university – 9,000 staff, 35,000 students – with the help of the university's security staff. The authors conclude that LLM-crafted Spear Phishing (not BEC) emails are as effective in tricking recipients as human-crafted emails under similar conditions. They also compare various LLMs' Phishing (but not BEC) detection performance.

[18] also describes an LLM-based system to detect Spear Phishing (not BEC) emails. The authors use a set of prompts and an ensemble of chat-friendly LLMs – GPT-3.5, GPT-4, and Gemini Pro – to generate a document vector containing probabilities of various behaviors in an incoming email. These prompts were derived from social engineering literature and included questions such as "Does this email convey a sense of urgency?" Subsequently, these vectors are fed to a K-nearest neighbors (KNN) classifier to distinguish between the vectors of Spear Phishing and other email types.

To test their system, the paper's authors use a proprietary (and industrial) LLM-based system to produce a set of 333 Spear Phishing emails. These emails are combined with samples from other (public) sources to create a test dataset. There are some other points of note in [18]:
• It confirms the lack of a publicly available Spear Phishing dataset.
• It makes the generated Spear Phishing emails public. However, the system that generated the emails remains proprietary.
• The emails in the dataset are oriented toward Spear Phishing rather than BEC.

Several papers reviewed above have found evidence supporting the lack of a publicly available BEC dataset. The same papers also suggest the general lack of public datasets in related sub-fields such as Spear Phishing. Further, the use of LLMs in some of the papers suggests the possibility of building a dataset generation system using LLMs (see figure 1).

III. SYSTEM OVERVIEW

Our primary hypothesis is that an LLM can be used to create a BEC dataset. Our secondary hypothesis is that an LLM can be used to add labels to the dataset – the idea is to construct an LLM-based classifier that can determine whether an email is similar to those seen in BEC attacks.

Figure 1 shows the conceptual building blocks of our system. The input data to our system is the small set of real-world BEC examples in appendix A. These examples have been sourced from public forums and blog posts.³

Fig. 1. The BEC labeled dataset creation system (real-world examples → generation model → BEC samples → human labeler and validation model → label agreement?).

An LLM is used to generate variations of the real-world example emails. This step is depicted as the "Generation model" block in figure 1. The Generation model creates a candidate dataset – BEC samples.

³ Each example in appendix A lists the source from where the email was derived. Not all the examples in the appendix are BEC – one is a Phishing example, and another, while part of a BEC attack, cannot be classified as BEC on its own.
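Taken together, the blocks in figure 1 amount to a short pipeline. The sketch below is illustrative only (it is not the paper's Appendix K DSPy code): the function names and the callable signatures are assumptions, with the LLM-backed models abstracted as plain callables, and the duplicate-elimination step mirrors the one the paper applies when constructing BEC-2.

```python
"""Illustrative sketch of the figure-1 flow:
real-world examples -> generation model -> BEC samples -> validation labels.
Hypothetical names; the LLM-backed models are stand-in callables."""

from typing import Callable, List, Tuple


def build_dataset(
    examples: List[str],
    n_variations: int,
    generate: Callable[[str, int], List[str]],  # stand-in for the generation model
    validate: Callable[[str], str],             # stand-in for the validation model
) -> List[Tuple[str, str]]:
    # Generation model: produce n variations of each real-world example.
    samples: List[str] = []
    for example in examples:
        samples.extend(generate(example, n_variations))

    # Drop exact duplicates while preserving order (as in the BEC-2 construction).
    seen, unique = set(), []
    for s in samples:
        if s not in seen:
            seen.add(s)
            unique.append(s)

    # Validation model: attach a label to each surviving sample.
    return [(s, validate(s)) for s in unique]
```

With stub callables standing in for the LLM-backed models, the pipeline can be exercised without any API access; in the real system, `generate` would prompt an LLM at varying temperatures and `validate` would apply the labeling rubric.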
Subsequently, the candidate dataset is labeled twice – once by a human and a second time by another LLM-based model depicted as the "Validation model" in figure 1.

Finally, the degree of agreement between the human labeler and the model is calculated. This is the last step in figure 1. The results from this step determine the final labels used in the dataset as well as the quality of the dataset.

The idea of generating variations is similar to the one described in [12], where a natural language processing pipeline generates perturbed versions of authentic emails. However, we use a general-purpose LLM instead of a custom-built pipeline to perform natural language processing tasks. Our work is similar in spirit to [16], where the authors use LLMs for various email processing tasks. The classification of BEC emails using an LLM resembles the email vector creation of [18].

Note that BEC attacks typically involve a conversation – i.e., multiple emails going back and forth between the attacker and the victim. Our goal isn't to model the entire conversation. Instead, we aim to create email bodies and subjects consistent with attack conversations.

IV. EVALUATING REAL-WORLD EXAMPLES AND GENERATED DATASETS

The first batch of emails generated by the system described above is preserved in a small proof-of-concept dataset (20 emails) that we call BEC-1. Suppose we can generate a high-quality BEC-1 with a few real-world emails. In that case, we can likely increase the number and types of variations to produce a larger dataset in the future. Further, we can obtain more real-world examples that may, in turn, help increase the size of a future dataset. Given the potential future gains, properly evaluating the quality of BEC-1 is a significant step in the system-building process.

We validate each email in BEC-1 in two ways. First, we manually label each email in BEC-1. Second, we use an LLM classifier to label the same emails. Both efforts use the rubric and labeling scheme described below.

We consider the human-generated labels to be gold labels. The degree of agreement between the validation model and human labels gives us an idea of the system's efficacy.

A. Labeling rubric

The rubric questions below refer only to the subject and body of the email. Email header information, if present, is not used; it is assumed that an email security system would evaluate the header separately. Similarly, URLs are not evaluated if they are present in an email. The email security system would evaluate the URLs separately and (likely) pre-classify emails with URLs as something other than BEC.

Here are the rubric questions:
1) Does the email appear to be related to business?
2) Does the email have an authoritative tone?
3) Does the email ask the recipient to take an action related to an organization?
4) Does the email convey urgency?

One point is awarded to an email if the answer to a question is yes and 0 otherwise. If an email receives a total (Σ) of 4 points, it is deemed part of a BEC attack and is assigned a "positive (+)" label. If it receives 2 or 3 points, it is assigned a "neutral (=)" label. If it receives 0 or 1 points, it is assigned a "negative (–)" label.

B. Real-world examples evaluated against the rubric

Before proceeding with the rest of the project, we test the rubric against the example emails in appendix A. We run through the questions and assign scores and labels based on the rubric. The results are shown in table I.

Since we know the origin of these emails (see appendix A for details) and have read why some are considered BEC, we intuitively know the label each should receive. The manual labels in table I derived using the rubric match the intuitive labels.

Example  Q1  Q2  Q3  Q4  Σ  Human
1         1   1   1   1  4  +
2         1   1   1   1  4  +
3a        1   1   1   1  4  +
3b        1   0   0   1  2  =
3c        1   1   1   1  4  +
4         0   0   0   1  1  –

TABLE I. Human-calculated scores and labels for email examples in appendix A.

C. Dataset evaluation metric

Our primary hypothesis (section III) will be validated if a human trained on the rubric labels the emails generated by the system as consistent with BEC attacks. Our secondary hypothesis will be validated if the LLM classifier's labels match the human's labels. Both hypotheses must be proven true if we hope to eventually generate large BEC datasets (containing hundreds of emails) on demand without needing human labeling.

Thus, a key metric is the agreement between the human validator and the classification model over BEC datasets. We'll keep track of this agreement via an agreement score that is defined as "the percentage of human-labeled (manually-labeled) emails labeled identically by the classifier."

The agreement score reports the accuracy without the % sign and has a 0–100 range. A score of 0 implies that all of the generated emails in the dataset have a different label than the human-produced gold label. A score of 100 implies that every generated email has a label identical to the human-produced gold label.

We toyed with a modified accuracy metric that accounted for the greater semantic similarity between a positive email
positive email example and a negative generated sample. first score, then total the score, and finally produce a label for
However, we discarded this idea because of greater complexity each email in BEC-1.
that was unlikely to change the acceptability of a generated Appendix K shows the DSPy code for the two models
dataset. Note that such a metric would be more challenging discussed above. The generation model depicted is sufficient
to interpret than the agreement score. to create BEC-1. As we will see in sections VI and VII, the
We also considered using additional metrics such as macro validation model appears sufficiently powerful to validate the
and micro F1 scores. Given that we start with positively quality of BEC-1.
labeled examples of an underrepresented email class (BEC)
VI. R ESULTS : REAL - WORLD EXAMPLES AND BEC-1
and create variations of the examples, metrics other than the
agreement score discussed above seemed unnecessary. It does The system creates 20 emails collected in a comma-
not seem likely that the quality of a generated BEC dataset, separated values (CSV) file: the BEC-1 dataset. The dataset
as indicated by the agreement score, would differ in practice separates the email subject from the email body (both gener-
from that indicated by an F1 score. ated by the generation model) and includes a label (output by
Ultimately, we decided to stay with an accuracy measure as the validation model)
an adequate proxy for the quality of the dataset generated. If We present two results from our experimentation with the
the system can generate a dataset with a high agreement score system discussed above.
(say >= 80), we believe it will be useful to email security First, we compare the human labeling on the BEC examples
researchers. On the other hand, if the system generates a in table I with the validation model’s labeling. We do this to
dataset with a low agreement score (say <= 50), it is unlikely establish a baseline for the system’s behavior. Table II presents
to be useful to researchers. That said, we do use confusion the validation model’s scores and labels. The human and the
matrices to visualize some of our results. model agree on the label of 4 of 6 examples for an agreement
score of 67.
V. M ORE ON G ENERATION AND VALIDATION MODELS Example 1 2 3 4 Σ Model
1 1 1 1 1 4 +
The model that generates BEC-1 is called the “generation 2 1 1 1 1 4 +
model.” gpt-3.5-turbo – a GPT-3.5 variant – is the LLM 3a 1 1 1 1 4 +
3b 0 0 0 0 0 –
that powers this model [19]. gpt-3.5-turbo was used in our 3c 1 1 1 1 4 +
experimentation as some of the research papers reviewed in 4 1 0 1 1 3 =
section II obtained good results with it. This LLM was also TABLE II
cost-effective to use as it was not the latest model (which tends VALIDATION MODEL SCORES AND LABELS FOR EMAIL EXAMPLES IN
to be the most expensive) available from the LLM provider. APPENDIX A.

It turned out that gpt-3.5-turbo produced good results in our


experimentation, making the use of other LLMs redundant.
Second, we manually label each email sample in BEC-1
Note that current-day LLMs such as gpt-3.5-turbo have gotten
and compare the labels with those produced by the validation
powerful enough to produce linguistically accurate output
model. The labels from both efforts are presented in table III.
without much human intervention.
18 of 20 emails are labeled identically for an agreement score
DSPy was used to manage our interaction with the LLM
of 90.
[20]. DSPy is a framework for developing applications using
language models. The use of DSPy was defensive in that VII. A NALYSIS : REAL - WORLD EXAMPLES AND BEC-1
there was no apriori guarantee that the first LLM used (gpt- We focus our analysis on the level of agreement between
3.5-turbo) would produce good results. In principle, DSPy human labeling and the validation model’s labels.
enables the easy replacement of one LLM with another without
rewriting the code. A. Label agreement on real-world examples
We feed the LLM the 4 positively labeled BEC examples At first glance, the agreement score of 67 between human
from appendix A one by one and ask the LLM to generate labeling and the validation model for the BEC examples seems
5 variations of each example. Each of the 5 requests sends low. Looking closer, we realize that given the small number of
the LLM a slightly different “temperature” configuration to examples (6), even a single disagreement would result in an
encourage it to produce unique variations. The temperature agreement score of 83 – a considerable drop from the perfect
variation is crucial as this mechanism coaxes the LLM to agreement score of 100.
produce variations of the original text. Appendix section F Examining the first disagreement (example 3b, section D),
shows samples generated from example 1. we see that human labeling scored the email with 1s on
The model that scores and creates labels is called the the email being related to a business and having a sense
“validation model” (referred to above as the LLM classifier). of urgency, whereas the validation model did not. 4 The
Here, too, we use gpt-3.5-turbo and DSPy for the same reason human labeling here is influenced by the labeler’s years of
as they were used in the generation model. We use the rubric experience working in various businesses and receiving similar
(section IV-A) in the prompt to the LLM and ask the LLM to emails. When a known person asks you if you are in the
Sample Human Model
1 + +
2 + =
3 + +
4 + +
5 + =
6 + +
7 + +
8 + +
9 + +
10 + +
11 + +
12 + +
13 + +
14 + +
15 + +
16 + +
17 + +
18 + +
19 + +
20 + +
TABLE III
H UMAN ( MANUAL ) AND VALIDATION MODEL LABELS FOR EMAIL Fig. 2. Confusion matrix for BEC-1; numbers indicate email samples of each
SAMPLES PRODUCED BY THE GENERATION MODEL . type

In the interest of conservatism, samples such as 2 and 5


office, they usually try to meet or call you at an office should remain labeled as neutral, enabling users of BEC-1 to
phone number. Unsurprisingly, the LLM missed this nuance drop them at their discretion.
of business culture.
In the second disagreement (example 4, section F), the
human labeler scored the email with 0 on the email being VIII. F ROM BEC-1 TO RESEARCH DATASET BEC-2
related to a business and asking for an action related to the
business organization whereas the validation model scored Given the successful generation and validation of BEC-1,
otherwise. 5 Here, the human labeler is conditioned by years we run an experiment to understand if the examples from table
of receiving SPAM and Phishing emails that are nominally I can generate a larger dataset.
related to a business but are clearly fake (in the labeler’s mind). Toward this experiment, we modify the generation model
The LLM is likely not trained on this nuance and takes the email’s request to fill out a survey for a business at face value.

B. Label agreement on BEC-1 samples

The generation model appears to have done an excellent job on the BEC-1 samples: manual inspection declares positive labels for all 20 emails, and the validation model agrees in 18 of those cases. Figure 2 visualizes the label agreement using a confusion matrix. Here, the “True” labels are produced via human labeling, and the “Predicted” labels are obtained from the validation model.

Below, we focus on the two cases where human labeling and the validation model disagree.

In both samples 2 (H) and 5 (K), the validation model produces a score of 3. In both cases, the difference between the human and the validation model concerns whether the email asks for an action (question 3 in IV-A). To a human conditioned by past interactions with banks, the message seems clear – an action of “wire transfer” is being requested. However, the validation model seems unconvinced, for a reason we cannot fully explain.

4 Compare row with example 3b in tables I and II.
5 Compare row with example 4 in tables I and II.

to generate 75 samples per example, for a total of 300 samples. Then, we eliminate duplicate samples (those with a subject and body identical to another sample). This de-duplication reduces the number of samples generated to 279. Finally, we assign labels to each sample via the validation model and save the samples and labels to a CSV file. This CSV file is BEC-2 – our research dataset. BEC-2 contains 262 positively labeled emails and 17 neutrally labeled emails.

We manually label BEC-2 – a tedious task – and compare the labels with those generated by the validation model. The confusion matrix in figure 3 summarizes the comparison. 260 emails are labeled positive by both the human labeler and the validation model. 1 email is labeled neutral by both parties. The two parties disagree on 18 emails. Curiously, neither the human effort nor the validation model labels any emails as negative. Given that all samples are variations of known BEC examples, perhaps the lack of negative samples is unsurprising.

The agreement score for BEC-2 is 93 (261×100/279), even higher than that of BEC-1, suggesting that the dataset is usable in a research setting. Further, given the high agreement scores for both BEC-1 and BEC-2, it appears that future BEC datasets could be automatically generated by bypassing the “Human labeler” and “Label agreement” steps in figure 1.
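The de-duplication and agreement-score arithmetic described above can be sketched in a few lines. This is a minimal illustration only: the DataFrame `df_samples` and its toy contents are our stand-ins, not the paper’s published code (which is available at [21]).

```python
import pandas as pd

# Hypothetical generated samples; in the pipeline these come from the
# generation model (75 variations per seed example, 300 in total).
df_samples = pd.DataFrame({
    "subject": ["Re: Urgent Wire Transfer", "Re: Urgent Wire Transfer",
                "Re: Discussion Urgent"],
    "body": ["Process the wire now.", "Process the wire now.",
             "Confirm once done."],
})

# Drop samples whose subject AND body are identical to another sample;
# only exact duplicates are eliminated, near-duplicates survive.
df_dedup = df_samples.drop_duplicates(subset=["subject", "body"]).reset_index(drop=True)
print(len(df_dedup))  # 2

# Agreement score: the percentage of samples where the human label and
# the validation model's label match, truncated to an integer
# (261 agreements out of 279 samples for BEC-2).
agreement = (261 * 100) // 279
print(agreement)  # 93
```

The same `drop_duplicates` call with a fuzzier key (for example, a normalized or embedded representation of the text) would extend this step to the near-duplicate elimination contemplated in the future-work discussion.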
[Figure 4 diagram: Real-world examples → Generation model → BEC samples → Validation model → BEC dataset]

Fig. 3. Confusion matrix for BEC-2; numbers indicate email samples of each type

Fig. 4. The automatic BEC labeled dataset creation system; human labeler not needed

IX. CONCLUSION

Our paper corroborates the claim (by surveying existing literature) that there isn’t a publicly available BEC dataset for email security researchers to use in their work. This is an unfortunate situation that we hope to correct.

Toward this goal, we describe and analyze a novel system that can generate a BEC dataset. This system uses LLMs to generate and label the emails in the dataset. The labels match those produced manually (by a human) in most cases.

Initially, we generate a small proof-of-concept (20 emails) BEC dataset called BEC-1. The idea is to prove various components of the generation system while minimizing the manual labeling effort involved.

Once the system was proven, we generated a larger research dataset called BEC-2 (279 emails). Most of the system’s (automatically) generated labels for BEC-2 match the human-generated ones, suggesting that BEC-2 is appropriate for research use. To our knowledge, BEC-2 is the first public BEC dataset made available to email security researchers.

Note that the input data for BEC-1 and BEC-2 are real-world BEC emails. Our system simply produces unique variations of real-world emails. As a result, the email samples in BEC-1 and BEC-2 make for credible BEC emails themselves.

BEC-1, BEC-2, and the code that generated them are available at [21]. We know from research in related sub-fields of information security that the availability of datasets sparks innovation [22], [23]. We hope that similar productivity in BEC research will be encouraged by the availability of BEC-2.

Finally, the results presented in this paper enable us to take the human in figure 1 out of the loop and generate BEC datasets from BEC examples on demand. We are left with the streamlined system of figure 4 that can produce new datasets as BEC examples drift due to changes in vocabulary, attacker tactics, or incoming email corpora.

X. FUTURE WORK

BEC-1 and BEC-2 can be used to test detection systems trained on independent data. In particular, research such as [15] can be reworked to use BEC-2 to study BEC detection performance rather than the modified Phishing dataset used in that work. However, both BEC-1 and BEC-2 are too small to use for training detection systems. In the future, this limitation can be removed by generating larger datasets using the generation system described above.

Given that we only eliminated exact duplicates, the samples in BEC-2 may be syntactically too close together. One can envision a future BEC dataset where near duplicates are also eliminated. Similarly, one can create a BEC dataset from more real-world BEC examples, resulting in a more diverse dataset. Additional real-world examples are available in [8], [12], and [16].

Currently, the system that generated BEC-1 and BEC-2 pays no particular attention to proper nouns in the emails. As such, the generated emails retain the proper nouns from the example emails. Using BEC-1 or BEC-2 for training may produce misleading results unless the training regime masks the proper nouns. A future version of the system could include a proper noun masking enhancement.

Finally, the method employed to create the BEC datasets can also create other email datasets from a set of seed examples. Indeed, techniques adapted from the system described in this paper can be used to create non-email text datasets.
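The proper noun masking enhancement mentioned in the future-work discussion could take many forms. The sketch below is our own naive illustration, not part of the published system: it masks capitalized, non-sentence-initial words with a regex heuristic, whereas a production version would more plausibly use a named-entity recognizer.

```python
import re

def mask_proper_nouns(text: str) -> str:
    """Naively mask capitalized words that are not sentence-initial.

    Illustration only: a capitalization heuristic over-matches (e.g.
    acronyms) and under-matches (e.g. lowercase brand names); an NER
    model would be the realistic choice for masking training data.
    """
    tokens = text.split()
    masked = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else ""
        sentence_initial = (i == 0) or prev.endswith((".", "!", "?"))
        if re.match(r"^[A-Z][a-z]+[,.!?]?$", tok) and not sentence_initial:
            # Preserve trailing punctuation on the masked token.
            tail = tok[-1] if tok[-1] in ",.!?" else ""
            masked.append("[NAME]" + tail)
        else:
            masked.append(tok)
    return " ".join(masked)

print(mask_proper_nouns("Can you process a wire transfer? Thanks, Bob"))
# Can you process a wire transfer? Thanks, [NAME]
```

Applying such a pass to BEC-1 or BEC-2 before training would reduce the risk of a detector keying on the proper nouns inherited from the seed examples.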
ACKNOWLEDGEMENT

We thank John Doe, an industry expert who has chosen to remain anonymous, for providing background information on the nature of Phishing and BEC. We also thank Amelia Hardy (Stanford University) for her valuable comments on this paper as it progressed.

DATA AVAILABILITY

The datasets generated and/or analyzed during this research study are available at [21].

DECLARATIONS

The authors have no relevant financial or non-financial interests to disclose. This paper was not funded by any third party (including the author’s employer).

The work in this paper does not introduce new ethical concerns, as only previously public real-world BEC examples are used. The variations of the real-world examples in BEC-1 and BEC-2 do not expose any information that was not already known.

REFERENCES

[1] “What is phishing?” Cisco Systems Inc., 2024, retrieved February 18, 2024 from https://www.cisco.com/c/en/us/products/security/email-security/what-is-phishing.html.
[2] “What is spear phishing?” Cisco Systems Inc., 2024, retrieved February 18, 2024 from https://www.cisco.com/site/us/en/learn/topics/security/what-is-spear-phishing.html.
[3] “What is bec?” IBM, 2024, retrieved February 18, 2024 from https://www.ibm.com/topics/business-email-compromise.
[4] “Understanding email fraud,” Proofpoint Inc., 2018, Technical Report 0218-017; retrieved February 18, 2024 from https://www.proofpoint.com/sites/default/files/pfpt-uk-tr-survey-of-understanding-email-fraud-180315.pdf.
[5] “Business email compromise and real estate wire fraud,” Federal Bureau of Investigation, 2022, FBI 2022 Congressional Report on BEC and Real Estate Wire Fraud; retrieved March 17, 2024 from https://www.fbi.gov/file-repository/fy-2022-fbi-congressional-report-business-email-compromise-and-real-estate-wire-fraud-111422.pdf/view.
[6] “Internet crime report 2023,” Internet Crime Complaint Center, 2023, retrieved March 17, 2024 from https://www.ic3.gov/Media/PDF/AnnualReport/2023_IC3Report.pdf.
[7] S. Pinto, “Understanding business email compromise to better protect against it,” Cisco Systems Inc., 2023, retrieved February 18, 2024 from https://blogs.cisco.com/security/understanding-business-email-compromise-to-better-protect-against-it.
[8] A. Cidon, L. Gavish, I. Bleier, N. Korshun, M. Schweighauser, and
A. Tsitkin, “High precision detection of business email compromise,”
in 28th USENIX Security Symposium. USENIX, 2019, pp. 1291–1307.
[9] C. Beaman and H. Isah, “Anomaly detection in emails using machine
learning and header information,” arXiv preprint arXiv:2203.10408,
2022.
[10] R. Dube, “The intersection of large language models and business
email compromise: What we know so far,” ResearchGate preprint
RG.2.2.27907.72480, 2024.
[11] N. Maleki, “A behavioral based detection approach for business email
compromises,” University of New Brunswick M.S. Thesis, 2019.
[12] M. Regina, M. Meyer, and S. Goutal, “Text data augmentation:
Towards better detection of spear-phishing emails,” arXiv preprint
arXiv:2007.02033, July 2020.
[13] H. F. Atlam and O. Oluwatimilehin, “Business email compromise
phishing detection based on machine learning: A systematic literature
review,” Electronics, vol. 12, no. 1, p. 42, December 2022.
[14] J. Hazell, “Spear phishing with large language models,” December 2023.
[15] A. Almutairi, B. Kang, and N. F. Fadhel, “The effectiveness of
transformer-based models for bec attack detection,” International Con-
ference on Network and System Security, August 2023.
[16] J. Brabec, F. Srajer, R. Starosta, T. Sixta, M. Dupont, M. Lenoch,
J. Mensik, F. Becker, J. Boros, T. Pop et al., “A modular and adap-
tive system for business email compromise detection,” arXiv preprint
arXiv:2308.10776, August 2023.
[17] M. Bethany, A. Galiopoulos, E. Bethany, M. B. Karkevandi, N. Vish-
wamitra, and P. Najafirad, “Large language model lateral spear phish-
ing: A comparative study in large-scale organizational settings,” arXiv
preprint arXiv:2401.09727, January 2024.
[18] D. Nahmias, G. Engelberg, D. Klein, and A. Shabtai, “Prompted
contextual vectors for spear-phishing detection,” arXiv preprint
arXiv:2402.08309, February 2024.
[19] “Api reference,” OpenAI Inc., retrieved March 24, 2024 from
https://platform.openai.com/docs/api-reference.
[20] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam,
S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam et al.,
“Dspy: Compiling declarative language model calls into self-improving
pipelines,” arXiv preprint arXiv:2310.03714, 2023.
[21] “Bec datasets,” Dube, Rohit, 2024, retrieved April 16, 2024 from
https://github.com/r-dube/bec.
[22] R. Dube, “Faulty use of the cic-ids 2017 dataset in information security
research,” Journal of Computer Virology and Hacking Techniques, 2023.
[23] ——, “Large language models in information security research: A jan-
uary 2024 survey,” ResearchGate preprint RG.2.2.20107.26404, 2024.
[24] “Real world example: Business email compromise aka ceo fraud,” Reddit, 2016, retrieved March 9, 2024 from https://www.reddit.com/r/sysadmin/comments/5hfcgr/real_world_example_business_email_compromise_aka/.
[25] “What is business email compromise (bec)?” Microsoft Inc., 2022, retrieved March 9, 2024 from https://www.microsoft.com/en-us/security/business/security-101/what-is-business-email-compromise-bec.
[26] “Ceo fraud scams and how to deal with them at the email gateway,” Trustwave Inc., 2016, retrieved March 9, 2024 from https://www.trustwave.com/en-us/resources/blogs/spiderlabs-blog/ceo-fraud-scams-and-how-to-deal-with-them-at-the-email-gateway/.
[27] “Phishing email,” Sent to author, 2022, retrieved March 9, 2024.

APPENDIX

The following examples were gathered from Internet discussion forums and security firms’ posts on BEC.

A. Example 1

Adapted from [24] with minor modifications.

From: Robert Smith <Finance Director>
Sent: Monday, December 05, 2016 11:03 AM
To: Jennifer Smith <Finance Manager>
Subject: Fwd: Discussion

Jennifer,

Can you get a wire processed right away?

Get back to me.

Bob

---------- Forwarded message ----------

From: Pam Smith <Vice President>
Date: December 5, 2016 at 9:31:59 am
Subject: Discussion
To: Robert Smith <Finance Director>

Bob,

Sequel to our phone discussion, I have finalized talks with the new vendor and I need the wire processed right away.

Let me know once it has been sent.

Regards,
Pam
B. Example 2
Adapted from the text message example in [25] with minor
modifications.
From: Anthony Ivanov
To: Dakota Sanchez
Subject: cell #

Hello Dakota,

Kindly re-confirm your cell #, I need a task done as soon as possible and look forward to my next email.

Thanks

C. Example 3a

Adapted from the first example in [26] with minor modifications.

From: Robert Smith <[email protected]>
To: Sue Brown <[email protected]>
Subject: Please get back to me asap. Regards Dave

Sue,

Please do you have a moment? Am tied up in a meeting and there is something I need you to take care of.

We have pending invoice from our Vendor. I have asked them to email me a copy of the invoice. I will be highly appreciative if you can handle it before the close of banking transactions for today. I can’t take calls now so an email will be fine.

Robert

D. Example 3b

Adapted from the second example in [26] with minor modifications.

From: James Anderson <[email protected]>
To: [email protected]
Subject: Payment

Hi Mike,

Are you in the office?

Thanks,
James Anderson

E. Example 3c

Adapted from the third example in [26] by adding fictitious transaction information while retaining the format of an actual transaction.

Subject: Request

Hi Alvin,

I need you to process a fund transfer into the bank detail below:

Amount: $28,850
Bank name: Wells Fargo
Account name: Alfred Pinto
Account num: 113452291972
Swift code: WFBIUS6S
Route num: 121042882
Bank address: Wells Fargo, 420 Montgomery Street, San Francisco, CA 94104

Kindly get back to me with an electronic wire report confirmation when it’s finally processed.

F. Example 4

This email was sent to the author’s email address [27]. It appears to be a Phishing attempt. The phrase “START THE SURVEY” had an embedded URL that led to a LinkedIn web page.

Subject: Re:

Dear Cvs Client,

If you have not already completed our 2022 survey, we encourage you to take a few minutes to do so before Thursday March 31. Your answers will help us give you an even better Shopping experience.

NOTE : there’s a $93 reward after finishing the survey

START THE SURVEY

Thank you in advance for your time and valuable insights.

Warm regards,

The CVS team
All five samples generated from example 1 are included below. In samples 2 and 5, the human labeling differs from the model’s labeling.

G. Sample 1

Variation of A [24].

Subject: Re: Urgent Wire Processing

Hi Jennifer, Can you please prioritize processing the wire transfer we discussed? Let me know once it’s been sent. Thanks, Bob

H. Sample 2

Variation of A [24].

Subject: Re: Urgent Wire Transfer

Hi Jennifer, Can you please process a wire transfer urgently? Let me know once it’s done. Thanks, Bob

I. Sample 3

Variation of A [24].

Subject: Re: Discussion Urgent

Jennifer, Can you process a wire transfer urgently? Please confirm once it’s done. Thanks, Bob

J. Sample 4

Variation of A [24].

Subject: Re: Urgent Wire Transfer Request

Jennifer, Can you process a wire transfer urgently? Please let me know. Bob

---------- Forwarded message ----------
From: Pam Smith <Vice President> Date: December 5, 2016 at 9:31:59 am Subject: Discussion To: Robert Smith <Finance Director> Bob, Following our conversation, I have completed negotiations with the new vendor and require the wire transfer to be processed immediately. Kindly confirm once it has been initiated. Thank you, Pam

K. Sample 5

Variation of A [24].

Subject: Re: Urgent Wire Processing Request

Hi Jennifer, Can you please process a wire transfer urgently for me? Let me know once it’s done. Thanks, Bob
The DSPy model signatures and classes are included below.

L. Generation model

# signature to generate email variation
class EmailSig1(dspy.Signature):
    __doc__ = """Generate output from input using the following steps:
    1. input email subject -> output email subject that is a variation
       of the input email subject
    2. input email body -> output email body that is a variation of
       the input email body
    """

    input_subject = dspy.InputField(desc="input email subject")
    input_body = dspy.InputField(desc="input email body")
    output_subject = dspy.OutputField(desc="output email subject")
    output_body = dspy.OutputField(desc="output email body")

# the generation model
class EmailVar1(dspy.Module):
    def __init__(self):
        super().__init__()
        self.max_index = len(df_pos)
        self.generate_email = dspy.Predict(EmailSig1)

    def forward(self, index, var):
        subject = ""
        body = ""
        if index >= self.max_index:
            print("Error: index is too large")
        else:
            subject = df_pos.iloc[index]['subject']
            body = df_pos.iloc[index]['body']
        return self.generate_email(input_subject=subject,
                                   input_body=body,
                                   config=dict(temperature=0.7 + 0.0001 * var))

M. Validation model

# signature to determine label
class EmailLabel1(dspy.Signature):
    """
    To generate the label follow these steps:
    1. Does the email appear to be related to business? Score 1 if yes, 0 if no
    2. Does the email have an authoritative tone? Score 1 if yes, 0 if no
    3. Does the email ask the recipient to take an action related to an
       organization? Score 1 if yes, 0 if no
    4. Does the email convey urgency? Score 1 if yes, 0 if no
    5. Add the scores from steps 1, 2, 3, 4 to get a total
    6. Determine the label as follows:
       a) If the total is 4, the label is positive
       b) If the total is 2 or 3, the label is neutral
       c) If the total is 0 or 1, the label is negative
    7. Output the four scores, the total and the one-word label
    """
    email = dspy.InputField(desc="an email in english")
    score_business = dspy.OutputField(desc="score for business")
    score_authority = dspy.OutputField(desc="score for authority")
    score_action = dspy.OutputField(desc="score for action")
    score_urgency = dspy.OutputField(desc="score for urgency")
    total = dspy.OutputField(desc="total")
    label = dspy.OutputField(desc="label")

# the validation model
class EmailVal1(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_label = dspy.Predict(EmailLabel1)

    def forward(self, email, index):
        return self.generate_label(email=email,
                                   config=dict(temperature=0.0001 * index))
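The deterministic portion of the EmailLabel1 rubric (steps 5 and 6) reduces to a simple mapping from the four binary scores to a one-word label. The helper below is our restatement of that mapping for clarity; it is not code from the system itself, which delegates the entire rubric to the LLM.

```python
def label_from_scores(business: int, authority: int, action: int, urgency: int) -> str:
    """Map the four 0/1 scores to the one-word label per steps 5 and 6."""
    total = business + authority + action + urgency
    if total == 4:
        return "positive"
    if total in (2, 3):
        return "neutral"
    return "negative"  # total of 0 or 1

# A total of 3 (e.g. the action question judged 0, as the validation
# model did for samples 2 and 5) yields "neutral", whereas the human
# labeler scored all four questions 1 and labeled those emails positive.
print(label_from_scores(1, 1, 0, 1))  # neutral
```

Writing the mapping out this way makes the disagreement mechanism explicit: any single dissenting score among the four questions moves a sample from positive to neutral.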
