Building A Business Email Compromise Research Dataset With Large Language Models

Rohit Dube
University of California, Berkeley
Abstract—Email-based attacks, such as Business Email Compromise, seriously threaten many organizations. In recent years, Large Language Models have improved the potency of email-based attacks by giving attackers an easy-to-use tool to overcome the language barrier and craft believable emails. At the same time, Business Email Compromise research remains hamstrung by the lack of a publicly available dataset. This paper proposes a novel system composed of Large Language Models to create Business Email Compromise datasets. Two datasets are generated. The first one (BEC-1) is a small 20-email proof-of-concept dataset that demonstrates that the system produces a dataset that a human analyst finds credible. The second (BEC-2) is a larger 279-email dataset generated using the same system. BEC-2 is the first public Business Email Compromise dataset available to the email security research community. The paper also proposes an accuracy-like metric called “agreement score” to measure the quality of datasets produced. Both BEC-1 and BEC-2 have high agreement scores – 90 and 93, respectively – validating the effectiveness of the Large Language Model system.

Index Terms—Phishing, Business Email Compromise, Email Dataset, Natural Language Processing, Large Language Model

I. INTRODUCTION

Email-borne attacks remain one of the primary cybersecurity attack vectors today. “Phishing” is a category of email attacks where the recipient of an email is tricked into divulging sensitive data or downloading malware [1]. “Spear Phishing” is a sub-category of Phishing where the email-based attack is targeted towards an individual or a small group of related (usually by association with the same business organization) individuals [2]. “Business Email Compromise (BEC)” is a sub-category of Spear Phishing where the attacker attempts to steal money or confidential information from an organization [3]; BEC is also known as “email fraud” [4]. Attackers using BEC hope to (eventually) gain financially from their efforts. Examples of BEC emails are included in appendix A.

As per the US Federal Bureau of Investigation, documented BEC losses in the US in 2016 were $360 million. In 2021, these losses escalated to $2.4 billion [5]. The 2023 estimate is even higher at $2.9 billion [6]. Clearly, BEC is a consequential attack type worth studying.

A frequently encountered BEC attack involves an attacker pretending to be a trusted business partner whose IT infrastructure has already been compromised by the attacker. The attacker uses social engineering techniques in urgent-sounding emails to trick the victim into changing payment information so that a future payment is redirected. The email conversation between the attacker and the victim may span multiple days and require the attacker to have some non-public information about the victim. The attacker’s goal is to have a legitimate payment transaction end up in an attacker-controlled account [7]. This attack is difficult to detect as there is no suspicious URL or attachment in the emails, and the email conversation is with a known partner.

Email security systems attempt to detect BEC attacks using signals from the three major components of emails. First, such systems scrutinize the email header to detect signs of compromise, such as emails coming from “known bad” sender domains. Second, they determine if the URLs or attachments (if these exist) in the email are malicious using techniques such as URL lookups and network sandboxing. Third, they detect behaviors embedded in the text of the email to ascertain if social engineering is afoot. A cleverly crafted email from a trusted business partner (as described above) containing no URLs or attachments can bypass the email security system and land in the victim’s inbox.

Given the volume of email and the criticality of business transactions discussed over this medium, real-world email security systems have to balance between BEC detection performance measures (accuracy, false positives, false negatives), system performance characteristics (throughput, latency), and economics (system cost). This balancing is no easy task and requires the tiering of multiple detection mechanisms to make the systems commercially viable [8], [9].

If email security researchers had access to a publicly available BEC dataset, they could advance the state of the art in BEC detection. Unfortunately, no such dataset exists. We attempt to correct this problem by developing a system that takes real-world BEC examples and perturbs them to create BEC samples.

Large Language Models (LLMs) have emerged as a general-purpose tool that makes many email security tasks tractable [10]. Our system uses LLMs to generate email variations from the input real-world examples and automatically label them. We test this system by creating a small proof-of-concept dataset (BEC-1) and manually labeling it. We find that the system does, in fact, generate BEC emails that a human would find credible. Encouraged by the success of the proof-of-concept dataset, we create a larger research dataset (BEC-2) that can be used to test the efficacy of third-party BEC detection systems.

The primary contributions of this paper include:
• confirming the unavailability of a public BEC dataset via a literature survey.
• providing proof that LLMs can be used to create BEC datasets.
• creating and making public a research BEC dataset.

The rest of this paper is organized as follows. Section II reviews recent research papers on BEC. Special attention is given to research at the intersection of BEC and LLMs. Section III provides an overview of the system that creates BEC datasets and labels them. Section IV details how BEC emails are labeled. This section also discusses metrics used to evaluate the quality of BEC datasets. Section V sheds light on the LLM-based models in the system. Sections VI and VII evaluate the proof-of-concept BEC-1 dataset in great detail. Section VIII discusses the larger research-ready dataset BEC-2 using the same analysis principles as BEC-1. Section IX concludes the paper. Finally, section X presents some ideas for future research.

II. RELATED WORK

The papers selected for review in this section discuss three problems in email security:
1) The detection of unwanted emails (Spam, Phishing, Spear Phishing, and BEC) using Artificial Intelligence (AI) techniques such as natural language processing.
2) The generation of such emails using LLMs for adversarial demonstration. The idea is to provide proof that an LLM can be used to produce emails with specific properties.
3) The analysis, creation, and use of public or proprietary email datasets. These datasets are used to determine the efficacy of commercial email security systems and research prototypes.
Since the topics above are related, some papers touch upon all three problems.

[8] dates back to 2019 and is perhaps the earliest description of a BEC attack detection system. The paper’s authors describe a system that has two stages. The first stage uses information other than the subject and body (such as the email header) to determine if an email is suspicious. Suspicious emails are taken through a second stage that analyzes the subject and body of the email. If one of the analyses in the second stage also finds an email to be suspicious, it is blocked. The second stage includes a natural language processing technique – the authors create a vector for each processed email using “term frequency-inverse document frequency” (TF-IDF). A pre-trained classifier then classifies the vectors as either malicious or benign.

[8] uses a proprietary commercial dataset to train and evaluate the BEC detection system it describes. It also makes a few observations regarding BEC research and datasets that are worth noting:
• Existing research work on BEC (as of 2019) uses small and unrealistic datasets.
• BEC research is limited because BEC attacks primarily affect corporations rather than consumers. Thus, the email data resides with corporations, which do not release it to outside researchers due to privacy concerns.

[11] observes that compromised legitimate accounts could be used in BEC attacks. The paper’s authors propose stopping such emails by detecting them before they are sent. The detection of BEC emails, in turn, depends on manually identified features. The paper describes two natural language processing techniques – word embeddings and sentiment analysis – that function as features of the detection system. For both word embeddings and sentiment analysis, there is a notion of a centroid that represents the average email behavior of a user or a group of users. When an email deviates too far from the centroid, it is flagged and may be blocked. Note that [11] was written in late 2019, just before the widespread recognition of LLMs. As such, it doesn’t reference LLMs.

[12] generates perturbed versions of (real) BEC emails via a natural language processing pipeline that includes an LLM (BERT) to generate a BEC dataset. The input BEC emails and the dataset generated by the pipeline remain proprietary. The paper confirms (as of 2020) the lack of public corpora for BEC.

[13] is a survey paper that reviews 38 research studies on using machine learning in BEC detection. The survey highlights the appropriateness of natural language processing techniques for (often text-only) BEC emails. The paper also reports that most of the studies it surveyed used a customized dataset to report performance and that there are no publicly available yet realistic BEC datasets (as of 2022).

[14] describes the experience of a technically competent person without extensive security experience in generating Spear Phishing (not BEC) emails using GPT-3/3.5/4. The paper’s author was successful in using LLMs to collect information about the target group (600 British members of parliament), craft credible emails using the collected information, and inject a script into email-attached documents that would download malware onto the recipient’s computer when the recipient opened the document. However, the paper does not describe a systematic toolchain to regenerate the emails or allude to a new publicly available Spear Phishing dataset.

Can LLMs be used to detect BEC attacks? [15] discusses this topic – it describes a custom model that combines LLM technology (BERT) with another neural network model (Long Short Term Memory or LSTM). The composite model is used to detect BEC attacks. However, since no public BEC dataset was available, [15] uses a Phishing (“Nigerian fraud”) and benign email dataset to develop, validate, and test the custom model. Additionally, to create a BEC dataset proxy from the existing datasets, the paper’s authors remove URLs and attachments to extract the text portion of emails. A skeptic would argue that such transformations significantly change the nature of the email datasets, potentially making them
inappropriate for BEC detection model testing (and perhaps for development and validation as well).

[16] describes its authors’ experience building, bootstrapping, and maintaining a commercial email security system that detects BEC attacks. The system described breaks each incoming email into sentences or sentence fragments. Each of these fragments is fed into statistical and AI models called detectors. There are ≈ 90 such detectors in the system, and each detector detects the presence of one or more specific behaviors in the fragment (e.g., urgency, call to action, frequent communication). The training data for some detectors is obtained by classifying emails using GPT-4 – an LLM. While GPT-4’s classification is expensive and lacks the classification performance needed for production use, the performance is sufficient for offline labeling of training data. The paper includes some additional observations on training data:
• Privacy regimes governing commercial email systems hinder obtaining samples of benign emails. If an email is not problematic, a commercial system’s staff typically cannot access it.
• In general, no publicly available dataset includes BEC attacks (as of 2023). Publicly available email datasets are either old or focused on less severe threats than BEC.
• The characteristics of incoming email to commercial systems drift over time. The drift can be gradual as communication style and vocabulary change. However, the drift can sometimes be abrupt, as when a batch of new customers is brought on board.

[17] provides evidence in favor of LLMs’ capabilities by running a Spear Phishing experiment at a large university – 9,000 staff, 35,000 students – with the help of the university’s security staff. The authors conclude that LLM-crafted Spear Phishing (not BEC) emails are as effective in tricking recipients as human-crafted emails under similar conditions. They also compare various LLMs’ Phishing (but not BEC) detection performance.

[18] also describes an LLM-based system to detect Spear Phishing (not BEC) emails. The authors use a set of prompts and an ensemble of chat-friendly LLMs – GPT-3.5, GPT-4, and Gemini Pro – to generate a document vector containing probabilities of various behaviors in an incoming email. These prompts were derived from social engineering literature and included questions such as “Does this email convey a sense of urgency?” Subsequently, these vectors are fed to a K-nearest neighbors (KNN) classifier to distinguish between the vectors of Spear Phishing and other email types.

To test their system, the paper’s authors use a proprietary (and industrial) LLM-based system to produce a set of 333 Spear Phishing emails. These emails are combined with samples from other (public) sources to create a test dataset. There are some other points of note in [18]:
• It confirms the lack of a publicly available Spear Phishing dataset.
• It makes the generated Spear Phishing emails public. However, the system that generated the emails remains proprietary.
• The emails in the dataset are oriented toward Spear Phishing rather than BEC.

Several papers reviewed above have found evidence supporting the lack of a publicly available BEC dataset. The same papers also suggest the general lack of public datasets in related sub-fields such as Spear Phishing. Further, the use of LLMs in some of the papers suggests the possibility of building a dataset generation system using LLMs (see figure 1).

III. SYSTEM OVERVIEW

Our primary hypothesis is that an LLM can be used to create a BEC dataset. Our secondary hypothesis is that an LLM can be used to add labels to the dataset – the idea is to construct an LLM-based classifier that can determine whether an email is similar to those seen in BEC attacks.

Figure 1 shows the conceptual building blocks of our system. The input data to our system is the small set of real-world BEC examples in appendix A. These examples have been sourced from public forums and blog posts. Each example in appendix A lists the source from where the email was derived. Not all the examples in the appendix are BEC – one is a Phishing example, and another, while part of a BEC attack, cannot be classified as BEC on its own.

[Figure 1: Real-world examples → Generation model → BEC samples → Human labeler / Validation model → Label agreement?]
Fig. 1. The BEC labeled dataset creation system

An LLM is used to generate variations of the real-world example emails. This step is depicted as the “Generation model” block in figure 1. The Generation model creates a candidate dataset – BEC samples.
Subsequently, the candidate dataset is labeled twice – once by a human and a second time by another LLM-based model depicted as the “Validation model” in figure 1. Finally, the degree of agreement between the human labeler and the model is calculated. This is the last step in figure 1. The results from this step determine the final labels used in the dataset as well as the quality of the dataset.

The idea of generating variations is similar to the one described in [12], where a natural language processing pipeline generates perturbed versions of authentic emails. However, we use a general-purpose LLM instead of a custom-built pipeline to perform natural language processing tasks. Our work is similar in spirit to [16], where the authors use LLMs for various email processing tasks. The classification of BEC emails using an LLM resembles the email vector creation of [18].

Note that BEC attacks typically involve a conversation – i.e., multiple emails going back and forth between the attacker and the victim. Our goal isn’t to model the entire conversation. Instead, we aim to create email bodies and subjects consistent with attack conversations.

1) Does the email appear to be related to business?
2) Does the email have an authoritative tone?
3) Does the email ask the recipient to take an action related to an organization?
4) Does the email convey urgency?

One point is awarded to an email if the answer to a question is yes and 0 otherwise. If an email receives a total (Σ) of 4 points, it is deemed part of a BEC attack and is assigned a “positive (+)” label. If it receives 2 or 3 points, it is assigned a “neutral (=)” label. If it receives 0 or 1 points, it is assigned a “negative (–)” label.

B. Real-world examples evaluated against the rubric

Before proceeding with the rest of the project, we test the rubric against the example emails in appendix A. We run through the questions and assign scores and labels based on the rubric. The results are shown in table I.

Since we know the origin of these emails (see appendix A for details) and have read why some are considered BEC, we intuitively know the label each should receive. The manual labels in table I derived using the rubric match the intuitive labels.
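The point system above reduces to a small scoring function. This is a sketch of the rubric as described, with illustrative names rather than code from the paper:

```python
# Rubric sketch: one point per "yes" answer to the four questions,
# then a threshold on the total score.
def rubric_label(business, authority, action, urgency):
    total = int(business) + int(authority) + int(action) + int(urgency)
    if total == 4:
        return total, "positive"   # deemed part of a BEC attack
    if total in (2, 3):
        return total, "neutral"
    return total, "negative"       # total of 0 or 1
```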
Fig. 3. Confusion matrix for BEC-2; numbers indicate email samples of each type

[Figure 4: Generation model → BEC samples → Validation model → BEC dataset]
Fig. 4. The automatic BEC labeled dataset creation system; human labeler not needed

IX. CONCLUSION

Our paper corroborates the claim (by surveying existing literature) that there isn’t a publicly available BEC dataset for email security researchers to use in their work. This is an unfortunate situation that we hope to correct.

Toward this goal, we describe and analyze a novel system that can generate a BEC dataset. This system uses LLMs to generate and label the emails in the dataset. The labels match those produced manually (by a human) in most cases.

Initially, we generate a small proof-of-concept (20 emails) BEC dataset called BEC-1. The idea is to prove various components of the generation system while minimizing the manual labeling effort involved.

Once the system was proven, we generated a larger research dataset called BEC-2 (279 emails). Most of the system-generated (automatic) labels for BEC-2 match the human-generated ones, suggesting that BEC-2 is appropriate for research use. To our knowledge, BEC-2 is the first public BEC dataset made available to email security researchers.

Note that the input data for BEC-1 and BEC-2 are real-world BEC emails. Our system simply produces unique variations of real-world emails. As a result, the email samples in BEC-1 and BEC-2 make for credible BEC emails themselves.

BEC-1, BEC-2, and the code that generated them are available at [21]. We know from research in related sub-fields of information security that the availability of datasets sparks innovation [22], [23]. We hope that the availability of BEC-2 will similarly spark productivity in BEC research.

Finally, the results presented in this paper enable us to take the human in figure 1 out of the loop and generate BEC datasets from BEC examples on demand. We are left with the streamlined system of figure 4 that can produce new datasets as BEC examples drift due to changes in vocabulary, attacker tactics, or incoming email corpora.

X. FUTURE WORK

BEC-1 and BEC-2 can be used to test detection systems trained on independent data. In particular, research such as [15] can be reworked to use BEC-2 to study BEC detection performance rather than the modified Phishing dataset used in that work. However, both BEC-1 and BEC-2 are too small to use for training detection systems. In the future, this limitation can be removed by generating larger datasets using the generation system described above.

Given that we only eliminated exact duplicates, the samples in BEC-2 may be syntactically too close together. One can envision a future BEC dataset where near duplicates are also eliminated. Similarly, one can create a BEC dataset from more real-world BEC examples, resulting in a more diverse dataset. Additional real-world examples are available in [8], [12] and [16].

Currently, the system that generated BEC-1 and BEC-2 pays no particular attention to proper nouns in the emails. As such, the generated emails retain the proper nouns from example emails. Using BEC-1 or BEC-2 for training may produce misleading results unless the training regime masks the proper nouns. A future version of the system could include a proper noun masking enhancement.

Finally, the method employed to create the BEC datasets can also create other email datasets from a set of seed examples. Indeed, techniques adapted from the system described in this paper can be used to create non-email text datasets.
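The agreement between the human-assigned and model-assigned labels, described as an accuracy-like metric in the abstract, can be sketched as the percentage of samples on which the two labelers agree. The paper’s exact formula may differ; this definition is an assumption for illustration:

```python
# Agreement score sketch: percentage of samples where the human labeler
# and the validation model assign the same label.
def agreement_score(human_labels, model_labels):
    if len(human_labels) != len(model_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return 100.0 * matches / len(human_labels)
```

Under this reading, a 20-email dataset with 18 matching labels would score 90, in line with the value reported for BEC-1.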
ACKNOWLEDGEMENT

We thank John Doe, an industry expert who has chosen to remain anonymous, for providing background information on the nature of Phishing and BEC. We also thank Amelia Hardy (Stanford University) for her valuable comments on this paper as it progressed.

DATA AVAILABILITY

The datasets generated and/or analyzed during this research study are available at [21].

DECLARATIONS

The authors have no relevant financial or non-financial interests to disclose. This paper was not funded by any third party (including the author’s employer).

The work in this paper does not introduce new ethical concerns, as only previously public real-world BEC examples are used. The variations of the real-world examples in BEC-1 and BEC-2 do not expose any information that was not already known.

REFERENCES

[1] “What is phishing?” Cisco Systems Inc., 2024, retrieved February 18, 2024 from https://www.cisco.com/c/en/us/products/security/email-security/what-is-phishing.html.
[2] “What is spear phishing?” Cisco Systems Inc., 2024, retrieved February 18, 2024 from https://www.cisco.com/site/us/en/learn/topics/security/what-is-spear-phishing.html.
[3] “What is bec?” IBM, 2024, retrieved February 18, 2024 from https://www.ibm.com/topics/business-email-compromise.
[4] “Understanding email fraud,” Proofpoint Inc., 2018, Technical Report 0218-017; retrieved February 18, 2024 from https://www.proofpoint.com/sites/default/files/pfpt-uk-tr-survey-of-understanding-email-fraud-180315.pdf.
[5] “Business email compromise and real estate wire fraud,” Federal Bureau of Investigation, 2022, FBI 2022 Congressional Report on BEC and Real Estate Wire Fraud; retrieved March 17, 2024 from https://www.fbi.gov/file-repository/fy-2022-fbi-congressional-report-business-email-compromise-and-real-estate-wire-fraud-111422.pdf/view.
[6] “Internet crime report 2023,” Internet Crime Complaint Center, 2023, retrieved March 17, 2024 from https://www.ic3.gov/Media/PDF/AnnualReport/2023_IC3Report.pdf.
[7] S. Pinto, “Understanding business email compromise to better protect against it,” Cisco Systems Inc., 2023, retrieved February 18, 2024 from https://blogs.cisco.com/security/understanding-business-email-compromise-to-better-protect-against-it.
[8] A. Cidon, L. Gavish, I. Bleier, N. Korshun, M. Schweighauser, and
A. Tsitkin, “High precision detection of business email compromise,”
in 28th USENIX Security Symposium. USENIX, 2019, pp. 1291–1307.
[9] C. Beaman and H. Isah, “Anomaly detection in emails using machine
learning and header information,” arXiv preprint arXiv:2203.10408,
2022.
[10] R. Dube, “The intersection of large language models and business
email compromise: What we know so far,” ResearchGate preprint
RG.2.2.27907.72480, 2024.
[11] N. Maleki, “A behavioral based detection approach for business email
compromises,” University of New Brunswick M.S. Thesis, 2019.
[12] M. Regina, M. Meyer, and S. Goutal, “Text data augmentation:
Towards better detection of spear-phishing emails,” arXiv preprint
arXiv:2007.02033, July 2020.
[13] H. F. Atlam and O. Oluwatimilehin, “Business email compromise
phishing detection based on machine learning: A systematic literature
review,” Electronics, vol. 12, no. 1, p. 42, December 2022.
[14] J. Hazell, “Spear phishing with large language models,” December 2023.
[15] A. Almutairi, B. Kang, and N. F. Fadhel, “The effectiveness of
transformer-based models for bec attack detection,” International Con-
ference on Network and System Security, August 2023.
[16] J. Brabec, F. Srajer, R. Starosta, T. Sixta, M. Dupont, M. Lenoch,
J. Mensik, F. Becker, J. Boros, T. Pop et al., “A modular and adap-
tive system for business email compromise detection,” arXiv preprint
arXiv:2308.10776, August 2023.
[17] M. Bethany, A. Galiopoulos, E. Bethany, M. B. Karkevandi, N. Vish-
wamitra, and P. Najafirad, “Large language model lateral spear phish-
ing: A comparative study in large-scale organizational settings,” arXiv
preprint arXiv:2401.09727, January 2024.
[18] D. Nahmias, G. Engelberg, D. Klein, and A. Shabtai, “Prompted
contextual vectors for spear-phishing detection,” arXiv preprint
arXiv:2402.08309, February 2024.
[19] “Api reference,” OpenAI Inc., retrieved March 24, 2024 from
https://platform.openai.com/docs/api-reference.
[20] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam,
S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam et al.,
“Dspy: Compiling declarative language model calls into self-improving
pipelines,” arXiv preprint arXiv:2310.03714, 2023.
[21] “Bec datasets,” Dube, Rohit, 2024, retrieved April 16, 2024 from
https://github.com/r-dube/bec.
[22] R. Dube, “Faulty use of the cic-ids 2017 dataset in information security
research,” Journal of Computer Virology and Hacking Techniques, 2023.
[23] ——, “Large language models in information security research: A jan-
uary 2024 survey,” ResearchGate preprint RG.2.2.20107.26404, 2024.
[24] “Real world example: business email compromise aka ceo fraud,” Reddit, 2016, retrieved March 9, 2024 from https://www.reddit.com/r/sysadmin/comments/5hfcgr/real_world_example_business_email_compromise_aka/.
[25] “What is business email compromise (bec)?” Microsoft Inc., 2022, retrieved March 9, 2024 from https://www.microsoft.com/en-us/security/business/security-101/what-is-business-email-compromise-bec.
[26] “Ceo fraud scams and how to deal with them at the email gateway,” Trustwave Inc., 2016, retrieved March 9, 2024 from https://www.trustwave.com/en-us/resources/blogs/spiderlabs-blog/ceo-fraud-scams-and-how-to-deal-with-them-at-the-email-gateway/.
[27] “Phishing email,” Sent to author, 2022, retrieved March 9, 2024.

APPENDIX

The following examples were gathered from Internet discussion forums and security firms’ posts on BEC.

A. Example 1

Adapted from [24] with minor modifications.

From: Robert Smith <Finance Director>
Sent: Monday, December 05, 2016 11:03 AM
To: Jennifer Smith <Finance Manager>
Subject: Fwd: Discussion

Jennifer,

Bob

Bob,

Regards,
Pam

B. Example 2

Adapted from the text message example in [25] with minor modifications.

From: Anthony Ivanov
To: Dakota Sanchez
Subject: cell #

Hello Dakota,

Thanks
C. Example 3a

Adapted from the first example in [26] with minor modifications.

From: Robert Smith <[email protected]>
To: Sue Brown <[email protected]>
Subject: Please get back to me asap.

Sue,

Please do you have a moment? Am tied up in a meeting and there is something I need you to take care of.

We have pending invoice from our Vendor. I have asked them to email me a copy of the invoice. I will be highly appreciative if you can handle it before the close of banking transactions for today. I can’t take calls now so an email will be fine.

Robert

D. Example 3b

Adapted from the second example in [26] with minor modifications.

From: James Anderson <[email protected]>
To: [email protected]
Subject: Payment

E. Example 3c

Adapted from the third example in [26] by adding fictitious transaction information while retaining the format of an actual transaction.

Subject: Request

Hi Alvin,

Amount: $28,850
Bank name: Wells Fargo
Account name: Alfred Pinto
Account num: 113452291972
Swift code: WFBIUS6S
Route num: 121042882
Bank address: Wells Fargo, 420 Montgomery Street, San Francisco, CA 94104

Regards
Dave

F. Example 4

This email was sent to the author’s email address [27]. It appears to be a Phishing attempt. The phrase “START THE SURVEY” had an embedded URL that led to a LinkedIn web page.

Subject: Re:

Dear Cvs Client,

If you have not already completed our 2022 survey, we encourage you to take a few minutes to do so before Thursday March 31. Your answers will help us give you an even better Shopping experience.

NOTE : there’s a $93 reward after finishing the survey

START THE SURVEY
H. Sample 2
Variation of A [24].
Subject: Re: Urgent Wire Transfer
I. Sample 3
Variation of A [24].
Subject: Re: Discussion Urgent
J. Sample 4
Variation of A [24].
Subject: Re: Urgent Wire Transfer Request
K. Sample 5
Variation of A [24].
Subject: Re: Urgent Wire Processing
Request
The DSPy model signatures and classes are included below.

L. Generation model

# signature to generate email variation
class EmailSig1(dspy.Signature):
    __doc__ = """Generate output from input using the following steps:
    1. input email subject -> output email subject that is a
       variation of the input email subject
    2. input email body -> output email body that is a
       variation of the input email body
    """

    input_subject = dspy.InputField(desc="input email subject")
    input_body = dspy.InputField(desc="input email body")
    output_subject = dspy.OutputField(desc="output email subject")
    output_body = dspy.OutputField(desc="output email body")

M. Validation model

# signature to determine label
class EmailLabel1(dspy.Signature):
    """
    To generate the label follow these steps:
    1. Does the email appear to be related to business?
       Score 1 if yes, 0 if no
    2. Does the email have an authoritative tone?
       Score 1 if yes, 0 if no
    3. Does the email ask the recipient to take an action
       related to an organization? Score 1 if yes, 0 if no
    4. Does the email convey urgency? Score 1 if yes, 0 if no
    5. Add the scores from steps 1, 2, 3, 4 to get a total
    6. Determine the label as follows:
       a) If the total is 4, the label is positive
       b) If the total is 2 or 3, the label is neutral
       c) If the total is 0 or 1, the label is negative
    7. Output the four scores, the total and the one-word label
    """

    email = dspy.InputField(desc="an email in english")
    score_business = dspy.OutputField(desc="score for business")
    score_authority = dspy.OutputField(desc="score for authority")
    score_action = dspy.OutputField(desc="score for action")
    score_urgency = dspy.OutputField(desc="score for urgency")
    total = dspy.OutputField(desc="total")
    label = dspy.OutputField(desc="label")

# the validation model
class EmailVal1(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_label = dspy.Predict(EmailLabel1)

    def forward(self, email, index):
        # use a slightly different temperature for each email index
        return self.generate_label(email=email,
                                   config=dict(temperature=0.0001 * index))
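Section X above notes that only exact duplicates were eliminated from BEC-2 and suggests removing near duplicates as well. One possible near-duplicate filter can be sketched with Python’s standard difflib; the 0.9 similarity threshold below is an arbitrary illustrative choice, not a value from the paper:

```python
import difflib

def drop_near_duplicates(texts, threshold=0.9):
    """Keep each text only if it differs enough from every text kept so far."""
    kept = []
    for text in texts:
        if all(difflib.SequenceMatcher(None, text, k).ratio() < threshold
               for k in kept):
            kept.append(text)
    return kept
```

Note that this pairwise comparison is quadratic in the number of samples, which is acceptable at the scale of BEC-2 but would need revisiting for much larger datasets.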