1 Introduction

Generating high-quality test data that can provide strong assurances about the correctness of a software system has been studied for many years by software practitioners as well as the research community. The challenge is to generate test data that can maximize fault detection while minimizing the cost and time of testing. One group of promising approaches focuses on maximizing input domain coverage. While high input domain coverage is a very effective fault detection technique at lower levels of testing, such as unit testing, its effectiveness drops at higher levels of testing, for instance, system-integration testing. This is due to the increased complexity of the input domain at the higher levels of testing, which makes these approaches prone to generating combinations of input values that are unlikely to occur in reality. As a result, the faults they uncover are more likely to be insignificant and down-prioritized during planning. In addition, test data generation approaches generally do not address the problem of reproducing production failures. When a failure happens in production, the engineering team wants to reproduce the situation in a test environment to investigate and fix the issue. In the case of complex systems, provisioning such a test environment requires a comprehensive test database.

This lack of statistical representativeness in the test data and their inability to simulate production failures leave many software teams with no choice but to use production data for higher-level testing and for debugging production failures. However, the introduction of the General Data Protection Regulation (GDPR) and similar privacy protection regulations has made it more difficult to use production data, or even anonymized or masked variants of it, for testing. In this situation, software companies and organizations are looking for new test data generation techniques that can provide production-like test data with variations and statistical distributions similar to those of real data. In this paper, we investigate this problem in the context of complex and distributed event-based systems.

1.1 Problem statement

Data-intensive systems often use an event-sourcing model (Betts et al. 2013) to keep the state of the system up to date as events occur. When integrated within and across organizations, these systems exchange data by subscribing to and receiving events from each other. These events then trigger actions in the receiving systems.

To deeply understand the test data needs in cross-organizational integration testing of event-based systems, we performed a set of semi-structured interviews (Tan et al. 2018) with seven organizations in the public and private sectors in Norway, namely, the National Population Registry (NPR), the Norwegian Tax Administration, the Brønnøysund Register Centre (a government agency responsible for numerous public registries for Norway), the National Labour and Welfare Administration, the Norwegian Directorate of eHealth, the Agency for Public Management and eGovernment, and one private data distribution IT company, Evry. The software systems in these organizations are event-based systems and frequently exchange data, including personal information, by distributing and receiving events related to the registered residents in Norway. Through the interviews, the following test data needs were identified:

  • Artificial data: To comply with privacy regulations such as GDPR, production data cannot be used for testing purposes. A common alternative is to use masked or anonymized data. These approaches can preserve the statistical properties of the production data and hide the identities of real people. Despite this de-identification, anonymized data are not entirely immune to so-called linking attacks, where specific attributes of individuals, such as age, gender, and zip code, are joined to reveal a person’s identity (Bayardo and Agrawal 2005). As a result, within the public sector, these approaches are usually considered non-conforming with GDPR. A better alternative is synthetically (artificially) generated data.

  • Representativeness: End-to-end tests are expensive to implement and run and are only justifiable if they can simulate and uncover realistic failures. For this purpose, artificially generated test data need to be statistically representative of reality. A dataset is said to be statistically representative of another if both datasets have similar statistical properties, usually expressed in terms of the probability distributions of the information fields.

  • Dynamic data: In many cases, the succession of events and the timing between them are important. For example, consider the event of naming a newborn in the population registry domain. Different rules may apply depending on whether this event occurs within one week of birth or later. To ensure that both of these scenarios can be tested, it is important to generate new data regularly. In the case of NPR, new synthetic data has to be generated at least daily to meet the needs of the various consumers.

Although the interviews were conducted within one specific application domain, we believe the participating organizations are diverse enough to make the identified test data needs generic and representative of the needs in a wide range of event-based application scenarios. The solution we propose in this paper to address these test data needs is, therefore, generally applicable to complex and distributed event-based systems.

1.2 Solution overview

In event-based systems, in order to build a synthetic dataset that is dynamic and statistically representative, it is sufficient to generate events that are statistically representative. Statistically representative events propagate through the systems and maintain a statistically representative state throughout the system. While statistically representative events can be generated in many different ways, the novelty of our work lies in framing the problem of generating representative events as a language modelling problem. The benefit of this framing is that a language model provides a compact and succinct representation of the statistical properties of the domain and offers an easy-to-maintain approach for generating varying amounts of representative test data. The rationale for this is that events can be treated as sentences from a language with well-defined grammar. By learning the statistical properties of this language (e.g., the likelihood of each sentence) and storing them in a language model, one can generate events that are statistically representative of real events.

The contributions of our work are as follows:

  • We define an abstract data model of event-based systems, which describes the data and its dynamicity as a collection of stateful entities and a collection of events that alter the states of the entities. Based on this abstract model, we propose a conceptual model of an event generator that generates test data that meets the identified needs. At the core of this generator is a statistical model that captures the statistical properties of the production data.

  • To build a statistical model of the production data, we propose a novel approach that frames the data generation task as a language modelling task and uses deep learning techniques to solve it.

  • We propose a model evaluation measure to be used alongside the loss function during the model training process. The proposed evaluation measure involves computing the Jensen-Shannon divergence between the training data and the generated data from the model under training. This approach is specifically designed to speed up the training process while ensuring the representativeness of the generated data.

  • We propose a generic framework for evaluating the representativeness of the generated data with respect to the statistical properties of the domain and their conformance to domain-specific business constraints.

  • In the context of our case study, within the NPR, we experimented with three of the most popular deep learning algorithms, namely, Char-RNN, VAE and GANs. We evaluated the resulting models using the evaluation framework described above. The results show that the Char-RNN model outperforms the other two and is able to generate high-quality test data. Using the Char-RNN model, we have implemented and deployed our test generator in the NPR. The successful adoption of this approach by our industrial partner shows its practical applicability in industrial settings.

Our research advances the understanding of the research community about the applicability of data-driven, machine-learning-based approaches as a new technique for generating rich and high-quality test data that can be used for reliable testing of large-scale and complex systems. An important feature of our approach is that it does not rely on manually building detailed models or specifications of the system; instead, it uses the available data for learning the language model. This mitigates a serious adoption risk that many model-based test data generation approaches suffer from.

The remainder of the paper is organized as follows. We discuss related work in Section 2, followed by a description of our generic solution to generating statistically representative events in Section 3. Section 4 provides an overview of language modelling using deep neural networks. We introduce our framework for language model evaluation in Section 5 before introducing our case study, the NPR of Norway, in Section 6. Sections 7 and 8 describe the process and the setup of our experiments and report the results, respectively. The industrial applicability, generalizability and an analysis of the threats to validity are discussed in Section 9. We also discuss some practical challenges in the implementation of the overall solution in this section. We conclude the paper in Section 10.

2 Related work

Test data generation has been the topic of extensive research, with proposed approaches employing a wide range of techniques, such as combinatorial approaches (Li et al. 2016; Simos et al. 2016; Salecker et al. 2012), metaheuristic search algorithms (McMinn 2011; Khari et al. 2016; Gois et al. 2017), model-based techniques (Soltana et al. 2017; Yano et al. 2011; Ali et al. 2013), fuzzing (Li et al. 2018; Padhye et al. 2019), and machine learning algorithms (Ji et al. 2019; Kim et al. 2018; Zhou et al. 2014; Čegiň and Rástočnỳ 2020). Many of these approaches focus on increasing a measure of coverage, e.g., path coverage or input coverage, and target lower testing levels where the unit under test is small. The premise of most of this work is that one can reach strong assurances about the quality of a complicated software system by performing extensive unit testing with high coverage.

In our collaboration with industry, however, we have found this approach to be insufficient. In reality, to be able to deliver high-quality software, one needs to simulate and examine end-to-end scenarios with realistic data. Test scenarios at this level are designed by expert testers proficient in exploratory testing or are inspired by prior production failures. No amount of fuzz testing or high-coverage metaheuristic search can replace this kind of testing for a large-scale complex system. The challenge with such end-to-end exploratory testing is that simulating the test scenarios is impossible without access to realistic production-like test data. As mentioned before, due to privacy concerns related to the use of production data, statistically representative data must instead be generated or synthesized artificially.

In the past, several researchers have focused on synthesizing statistically representative data. For example, Soltana et al. (2018) extended UML with probabilistic annotations to model the probabilistic characteristics of a population and generate synthetic data from the model. The purpose of that work is to support the simulation of tax policy in Luxembourg; hence, only the (static) information relevant to the policy under simulation needs to be modelled. Although probabilistic annotations in UML enable building a statistical model that explicitly describes the conditional distributions among data attributes, such an approach does not scale up to more complex and dynamic data domains, for instance, our case study with the NPR. The events in the National Registry span around 100 different event types, and each event type has at least four properties. The distributions and joint distributions of the event types and their properties are crucial for the overall representativeness, and the number of such distributions and joint distributions grows rapidly. Building UML models requires manual effort, and for a complex and large system, this means many hours of manual labour from someone with domain expertise as well as familiarity with UML. There is no obvious way to explicitly express all these distributions as UML annotations. Analysing all the population data and creating a UML model that describes all these data fields and their relations is, by itself, a daunting manual task. Besides the complexity of the data domain, another reason why this UML approach, as proposed in (Soltana et al. 2018), does not suit our problem is that it does not accommodate dynamics in data: all the properties and annotations in the UML model are static and cannot describe the changes to the data after events happen. Therefore, this approach is not practically applicable to our problem.

Synthesizing relational databases (Chulyadyo and Leray 2018; Patki et al. 2016) is another relevant area of research. Chulyadyo and Leray (2018) use probabilistic relational models to generate synthetic spatial datasets. Patki et al. (2016), on the other hand, propose a general end-to-end synthetic data generator, named Synthetic Data Vault, to synthesize complete tables of a relational database. The resulting databases resemble the original ones both statistically and structurally. However, these approaches are only applicable to stand-alone relational databases. Employing them for distributed, heterogeneous, document-based and non-relational databases is not trivial and is not discussed by the authors. Moreover, these approaches are not designed to generate or cope with dynamic data. These are serious limitations that again make these approaches inapplicable in many practical contexts.

When de-identification is the goal, an alternative to synthetic data generation is anonymization. One benefit of anonymization is that it is suitable for generating dynamic data. With an anonymization algorithm, it is usually straightforward to create a streaming pipeline that continuously anonymizes new production data. Moreover, compared to synthetic data, anonymized data may better resemble the real data. While anonymization algorithms hide the identities of real people to a reasonable extent, they are not entirely immune to so-called linking attacks, where specific attributes of individuals, such as age, gender, and zip code, are joined to reveal a person’s identity (Bayardo and Agrawal 2005). Apart from vulnerability to linking attacks, another drawback of anonymized data is its lack of scalability. It is not easy to downscale or upscale an anonymized dataset, as anonymization algorithms usually map the real dataset to a dataset of the same size. This causes limitations for certain types of tests. For instance, performance tests may require massive amounts of data, while smaller yet representative datasets are preferable for functional integration testing. Using synthetic data gives more flexibility with respect to these concerns.

A random data generator (e.g., Cheon and Rubio-Medrano (2007)) can sample each data field independently based on the probabilistic distribution of its values. However, without taking into account joint distributions among multiple data fields and business rules, such randomized data generation does not lead to representative and valid datasets. Possible ways to solve this problem include applying filters on the generated data to exclude invalid combinations, or modelling the dependencies between the data fields and applying them when sampling the data values. However, for large-scale and complex systems, such as the NPR domain, these approaches demand tremendous manual work, making them infeasible in practice.

Another interesting area of research is Grammar-Based Test Generation (GBTG) (e.g., Hoffman et al. (2011)), where the goal is to generate test data based on a context-free grammar that describes the structure of the input data. GBTG is especially suitable for generating complex structured input data, for example, XML documents (Hoffman et al. 2009). Although GBTG can generate syntactically valid test data given a context-free grammar or an XML schema, statistical representativeness and semantic conformity to the business rules have not been its focus so far. In that respect, the approach we propose in this paper is preferable to GBTG approaches, as it not only guarantees syntactic correctness to a great extent but also provides statistical representativeness and semantic validity.

Probabilistic and stochastic algorithms have also been used for generating statistically representative population data. Markov models (Saadi et al. 2016) and Bayesian networks (Sun and Erath 2015) have shown particularly promising results. One limitation to the applicability of these approaches to our problem is that the performance of these models depends heavily on their design, which has to be done manually. Building optimal models requires deep knowledge of both the business domain and statistics and probabilistic modelling. In addition, to train a Markov model, one should first define all the states; in many complex domains, the NPR domain for example, this is infeasible given the scale of the state space.

In parallel to the aforementioned studies, deep learning techniques have been used in other domains for generating synthetic data. The most commonly used techniques among these are recurrent neural networks (Medsker and Jain 1999) and generative adversarial networks (Goodfellow et al. 2014). They have been used for synthesizing text (Sutskever et al. 2011; Rajeswar et al. 2017), images (Gregor et al. 2015; Nie et al. 2017), and sequential data (Graves 2013; Ghosh et al. 2017). These approaches have no limitations with respect to the generation of dynamic data and require no manual effort to build the statistical models. This is in contrast to the other approaches for generating statistically representative data discussed above. Therefore, deep learning approaches have been the main inspiration for our work. It is worth mentioning that these approaches, as well as ours, are computationally more expensive compared to the studies mentioned above. However, this is not an issue in our practical context, mainly because the models are generated only once and can be used for data generation for months.

Fig. 1: Abstract data model for event-based systems

Fig. 2: Data Model in a population registry

3 A Generic statistical event generator

This section describes the main concepts in our proposed solution for generating valid and statistically representative data for event-based systems. An overview of these concepts is given in the conceptual model in Fig. 3. The conceptual model is based on the abstract data model of event-based systems in Fig. 1.

3.1 Abstract data model of an event-based system

The abstract data model of event-based systems describes the data and its dynamicity as a collection of stateful entities and a collection of events that alter the states of the entities. As shown in Fig. 1, an event-based system can be seen as a collection of StatefulEntities. For instance, in the population registry domain, the population is an event-based system, and each individual person is a stateful entity. Each stateful entity (or entity for short) consists of a collection of StateDescriptors, and each StateDescriptor describes one piece of information about the entity. In the population registry domain, a StateDescriptor about a person could be the person’s residential address. When an Event occurs to an entity, it alters the entity by altering one or more of its StateDescriptors. Note that the StateDescriptor has a boolean attribute isApplicable, which allows the data model to accommodate historical information about an entity: a StateDescriptor with isApplicable:True means that this StateDescriptor instance contains the current value of this piece of information about the entity; a StateDescriptor with isApplicable:False means that this piece of information is not currently applicable and hence historical. In the population registry domain, when a person moves to a new address, a relocation event happens; the isApplicable attribute of the old address is set to false, and a new entry is added to the person’s list of addresses as the current and applicable residential address of the person.
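For illustration, the following minimal Python sketch captures these concepts; the class names mirror Fig. 1, while the attribute names and the example address values are illustrative assumptions rather than part of the original model.

from dataclasses import dataclass, field
from typing import List

@dataclass
class StateDescriptor:
    """One piece of information about an entity, e.g., a residential address."""
    kind: str                   # e.g., "Address"
    value: dict                 # the information content
    is_applicable: bool = True  # False means the value is historical

@dataclass
class StatefulEntity:
    """An entity, e.g., a person, described by a collection of state descriptors."""
    descriptors: List[StateDescriptor] = field(default_factory=list)

    def apply_event(self, kind: str, new_value: dict) -> None:
        # An event alters the entity: the current descriptor of this kind
        # becomes historical, and a new, applicable descriptor is added.
        for d in self.descriptors:
            if d.kind == kind and d.is_applicable:
                d.is_applicable = False
        self.descriptors.append(StateDescriptor(kind, new_value))

# Example: a relocation event in the population registry domain
person = StatefulEntity([StateDescriptor("Address", {"city": "Oslo"})])
person.apply_event("Address", {"city": "Bergen"})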

Fig. 3: Conceptual Model of the Event Generator

Figure 2 illustrates a realization of the abstract data model in the population registry domain. As shown in this figure, a Person is a StatefulEntity and is described by PersonalInfo, which is a realization of the StateDescriptor. We use different types of PersonalInfo, for example, BirthInfo, Name, Address, and MaritalRelation, to describe the state of a Person. Different types of LifeEvents alter the state of the person by altering the corresponding type or types of PersonalInfo. Birth, Marriage, Relocation and ChangeName, shown in Fig. 2, are examples of LifeEvents, which is a realization of Event in the abstract data model.

3.2 Conceptual model of event generator and event specification

Based on the abstract data model of the event-based system, we propose a conceptual model of an event generator, as shown in Fig. 3. At the heart of the event generator is a StatisticalModel, which captures the statistical characteristics of the events and of the states of the involved entities. The StatisticalModel is sampled to produce EventSpecifications. An EventSpecification is an abstraction over an event in an event-based system and contains only a subset of the information of an event. Note that domain-specific knowledge and testing expertise may be necessary to identify which information fields to include in an event specification.

We choose to make such an abstraction and to build the StatisticalModel on it, instead of on the complete event and entity data, for two reasons. Firstly, in most cases, not all the information of an event and all the states of an entity are of equal importance in the context of integration testing. For instance, in the population registry domain, some free-text information, such as the street address, is of little interest in integration testing. An event specification omits information of lower importance and includes only the most relevant properties that need to be statistically representative when generating test data for integration testing. Secondly, from a privacy and confidentiality perspective, some information fields are more sensitive than others. Data that contain such information may not be available for training the statistical model, as access, storage and usage of such sensitive data are regulated by privacy regulations. For instance, in the population registry domain, the name of a person is sensitive information as it can, in many cases, uniquely identify a person. Such sensitive information must, therefore, be excluded from event specifications. However, in domains where these concerns are not applicable, such a distinction may not be necessary, and events and event specifications can be identical.

Here is an example of a simplified event specification in the population registry domain:

Example 1

(Example EventSpecification): A single woman born in 1991 gets married to a divorced man born in 1981.

Note that the actual event would normally refer to the people involved in the event by their personal identification numbers: e.g., “Person with Id M gets married to Person with Id N”. However, since a personal identification number is generally a random sequence of digits, it does not carry any interesting statistical information. Therefore, in the formation of event specifications, we replace the identities with basic information that carries statistical meaning for the model to learn.

3.3 Event generation

Algorithm 1: Event Generator

Algorithm 1 shows the event generation process. The algorithm takes four inputs: (1) \(P^{(t)}\), the state of the system at time t, which is essentially composed of the states of all the stateful entities in the system at time t; (2) the statistical model, M; (3) a collection of constraints, C; and (4) the number of events to generate, N. The algorithm returns a collection of events E and the updated state of the system \(P^{(t+1)}\).

In Algorithm 1, each iteration in the for loop in lines 3-7 generates an event e and adds it to E. Event generation in each iteration involves two steps. In the first step, an event specification \(e_{spec}\) is sampled from M, in Line 4. To construct a fully specified event, other details that are not captured by the StatisticalModel must be added to this event specification. This is the second step of event generation and is done by calling createEvent in Line 5. Based on the information provided by the \(e_{spec}\), the createEvent function fills out the missing information by random sampling from a valid data range, while making sure that the resulting event conforms to the business constraints C.
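The loop can be sketched in Python as follows; sample_spec, create_event, and apply_event are hypothetical stand-ins for the model's sampling step, the createEvent function, and the state update, and the toy sampler and constraint at the end are purely illustrative.

import random

def generate_events(state, sample_spec, constraints, n_events):
    # Sketch of Algorithm 1: sample event specifications, complete them into
    # full events, and update the state of the event-based system.
    events = []
    for _ in range(n_events):
        e_spec = sample_spec()                     # step 1: sample a specification from M
        event = create_event(e_spec, constraints)  # step 2: fill in the missing details
        events.append(event)
        state = apply_event(state, event)          # each event alters the system state
    return events, state

def create_event(e_spec, constraints):
    # Hypothetical createEvent: add details not captured by the statistical
    # model by sampling from a valid range until the constraints C are met.
    while True:
        event = {**e_spec, "event_year": random.randint(2010, 2024)}
        if all(check(event) for check in constraints):
            return event

def apply_event(state, event):
    # Hypothetical state update: record the event in the subject's history.
    state.setdefault(event["subject"], []).append(event)
    return state

# Toy usage: a hard-coded sampler stands in for the trained StatisticalModel
toy_sampler = lambda: {"subject": "person-1", "type": "relocation"}
no_future_events = lambda e: e["event_year"] <= 2024
events, state = generate_events({}, toy_sampler, [no_future_events], n_events=3)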

From the previous example of event specification, a simplified concrete synthetic event can be generated, with the added details in angle brackets:

Example 2

(Example generated event):

A single woman <named Beautiful Flower>, born on <01-02-> 1991 <with ID xxxxxxxxxxx>, gets married to a divorced man <named Green Tree> born on <02-01-> 1981 <with ID xxxxxxxxxxx>

In the design of our solution, the event generator creates event specifications based on the current state of entities, in particular, individuals within the NPR domain. One crucial aspect to note is that the timestamp of a newly generated event corresponds to the moment when this synthetic event is created; this way, the sequential logic of the events that can happen to a person is ensured. For example, if a death event is created for a person who is married, and whose record therefore includes a marriage date, that marriage date must be chronologically earlier than the date of the generated death event. This constraint ensures the logical consistency of the generated data.

With very low probability, the model may generate event specifications that contain a future date in the person-status part, for example, a person with civil status married and a marriage date that lies in the future. However, such event specifications are syntactically invalid and are therefore excluded during the validity check.

The work presented in this paper focuses on the use of deep language modelling techniques for providing a statistical model and generating event specifications from it. The createEvent function, in contrast, requires a domain-specific implementation and will not be discussed in depth in this paper.

Even though Example 1 is expressed in English, we do not recommend training a statistical model on natural language sentences. Instead, as shown in Section 6, we recommend defining a domain-specific formal language with a dedicated vocabulary and a deterministic syntax to describe the events and the states of entities in the domain. The trained language model serves as the StatisticalModel in the EventGenerator.

4 Language modelling techniques

A language model is a probabilistic model that captures the probability of the appearance of each sentence in a language (e.g., what is the probability of seeing the sentence “the lazy dog barked loudly” in English?), and the probability for a given word (or a sequence of words) to follow a given sequence of words (e.g., what is the probability of seeing the word “barked” after seeing the sequence “the lazy dog”?) (Goldberg 2017).
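In standard notation, a language model factors the probability of a sentence \(w_1 \dots w_n\) into conditional next-word probabilities:

\(P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})\)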

Language modelling is the task of building a language model from a corpus, i.e., an extensive collection of example texts from a target language. Given a sequence of words, the language model can then be used to predict the next word, or to assign a probability to a word sequence or a sentence that follows the given sequence.

In machine learning terminology, the model trained in this manner is a generative model, and is intended to be used for sampling or generating data points. This differs from predictive models, where the model is used for predicting a label for a data point, for instance, in a classification task.

Language modelling has been a topic of extensive research within the field of Natural Language Processing (NLP). Generating sentences from a language model can be accomplished using a technique called autoregressive generation (Jurafsky 2019). As illustrated in Fig. 4, the autoregressive generation procedure starts by feeding the language model a start-of-sentence token, denoted as <S>, as the input, and letting the language model predict the most probable word as the subsequent token, as the output. Appending the output from one step to its input forms the input for the next step of the process. This process is repeated until the language model produces an end-of-sentence token, denoted as <E>, at which point a complete sentence is generated.
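For illustration, the following Python sketch mirrors this procedure; next_token is a hypothetical stand-in for the language model's prediction step.

def generate_sentence(next_token, max_len=200):
    # Autoregressive generation: repeatedly feed the growing sequence back
    # into the model until the end-of-sentence token <E> is produced.
    tokens = ["<S>"]                       # start with the start-of-sentence token
    while tokens[-1] != "<E>" and len(tokens) < max_len:
        tokens.append(next_token(tokens))  # the model predicts the next token
    return tokens[1:-1]                    # drop the <S> and <E> markers

# Toy stand-in for a trained model that always completes the same sentence
toy_model = lambda toks: {1: "the", 2: "lazy", 3: "dog", 4: "barked"}.get(len(toks), "<E>")
print(" ".join(generate_sentence(toy_model)))   # -> the lazy dog barked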

Fig. 4: Autoregressive Generation Procedure

Traditional language modelling techniques, such as N-grams, are based on a Markov assumption, meaning that the conditional probability of a word depends only on a fixed number (\(N-1\) for an N-gram model) of previous words, which limits the model's ability to capture statistical relations in long sequences. Language models based on deep neural networks, on the contrary, make no such assumption and are capable of capturing complex statistical relations in text sequences of arbitrary length (Jurafsky 2019), and are therefore more suitable for our work.
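Concretely, the Markov assumption of an N-gram model truncates the conditioning history to the previous \(N-1\) words, i.e., \(P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1})\), whereas a neural language model conditions on the full prefix.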

The following three families of deep learning algorithms, which have been successfully applied for text generation purposes, are the most relevant to our use case: Recurrent Neural Networks (RNN) (Medsker and Jain 1999), Variational Autoencoder networks (VAE) (Bowman et al. 2015) and Generative Adversarial Networks (GAN) (Goodfellow et al. 2014).

Within each family of deep learning algorithms, there are many variants. We experiment with one of the most popular and successful variants from each family. In particular, we use Character-Level RNN (Char-RNN) (Karpath 2019), RNN-based VAE (Bowman et al. 2015) and LatextGAN (Donahue and Rumshisky 2018). Details of these algorithms are presented in Appendices A, B and C.

5 Evaluation framework

To evaluate the effectiveness of our statistical event generator, we propose a framework for evaluating the statistical representativeness of the generated data. Furthermore, for the generated data to be production-like, it is also important to evaluate whether the generated data are valid with respect to domain-specific syntax and business rules. This evaluation framework addresses the effectiveness with respect to the test data need for representativeness, as presented in Section 1. The other two needs, namely, artificial and dynamic test data, are guaranteed by construction and through our design choices.

Two design choices in our solution combine to meet the requirement for artificial data, which in turn is required to eliminate leakage of sensitive information. First, we use the concept of an event specification, which is an abstract specification of an event, leaving out details that may contain sensitive information. Second, with language modelling, the model does not learn or memorise individual events but learns statistics. In fact, we verified this for the language models that we trained for our case study: using the trained language model, we generated the same number of data records as in the training data and found that no data record in the generated data is identical to any data record in the original data. As a result, no individual event specification generated from the model duplicates any event specification extracted from real data, yet the generated data as a whole is statistically representative of reality.

As shown in Algorithm 1, each generated event updates the state of the event-based system. By constantly generating events and updating the entities, we can achieve dynamicity. Note that the need for dynamic data generation is not in conflict with the need for statistically representative data. In particular, by starting from an empty or a statistically representative collection of stateful entities and generating events that are statistically representative, we can create a statistically representative collection of entities that maintains its statistical representativeness while being dynamic.

The evaluation framework described in this section introduces three types of quantitative metrics. To implement the evaluation framework for each application domain, each metric has to be defined in the context of that application domain, as we have done for the Norwegian National Population Registry in Section 6. These metrics are syntactic validity, representativeness, and semantic validity as described below.

5.1 Syntactic validity

We define the percentage of generated data that are syntactically valid with respect to the syntax and the grammar of the domain-specific language as the syntactic validity rate. Syntactic validity can be evaluated with the grammar of the formal language.

5.2 Representativeness

We measure the representativeness of the generated data by measuring the similarity between the distributions (and joint distributions) of the information fields in the generated data and in the real data. Let’s consider the population registry domain as an example. The information about a person in the population registry has many data fields, including name, gender, birth date, address, marital status, family relations, etc. The events in the population registry also have many data fields; event type is one of them. There are other fields that are specific to the type of the event. A marriage event, for instance, has information fields for the spouse’s name, id, and birth date. A relocation event, on the other hand, has information fields for the new address, the relocation date, etc. For the generated data to be representative of the real data, the distributions of all these information fields in the generated data should be similar to those in the real data. Similarly, the joint distributions of combinations of the fields, e.g., the joint distribution of a person’s birth year and gender, must be similar in generated and real data. We use the Jensen-Shannon divergence (JSD) metric (Fuglede and Topsoe 2004) to measure distribution similarity. This metric is equal to zero only if the two probability distributions are identical; it grows to an upper bound, which is 1.0 in our implementation, as the two distributions diverge. An alternative metric is Kullback-Leibler divergence (KLD) (Andrew 2003). The advantage of JSD over KLD is that JSD is symmetric and bounded by an upper and a lower limit, so it is easier to use for comparison.
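As an illustration, the sketch below computes the JSD between two discrete distributions with SciPy; the example distributions are made up and do not come from the NPR data. Note that SciPy's jensenshannon function returns the square root of the divergence, so it is squared to obtain the divergence itself, which with base-2 logarithms lies in [0, 1].

import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical distributions of one information field (e.g., event type)
# in the real data and in the generated data.
real      = np.array([0.45, 0.30, 0.15, 0.10])
generated = np.array([0.42, 0.33, 0.14, 0.11])

# jensenshannon returns the Jensen-Shannon distance (square root of the
# divergence); squaring gives the JSD, bounded by 1.0 with base 2.
jsd = jensenshannon(real, generated, base=2) ** 2
print(f"JSD = {jsd:.4f}")   # 0 means the two distributions are identical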

5.3 Semantic validity

We define the semantic validity rate as the percentage of the generated data that conforms to the constraints of the application domain. These constraints can be specified in any logic language. Each constraint has a condition and an expected result. We use Python in Example 3 to illustrate the format of the constraints: given a logical expression as the condition, the semantic validity (the boolean value \(semantic\_validity\)) equals the evaluation result of a logical expression. A logical expression is made up of a number of terms, where each term specifies a bound on the value of a data field. The data fields, which are parts of an EventSpecification, describe either the event or the entity. The terms in an expression are related using logical connectors (i.e., and, or, and not). A constraint with such a definition specifies a relationship between two or more data fields of an EventSpecification.

Example 3

(Constraint definition)

figure b

The following is an example of a constraint in the population registry domain:

Example 4

(Example of a constraint for marriage event):

figure c

This constraint specifies that if an event is of type marriage, then the person to whom this event happens should be older than 18 years, should have a civil status that is neither married nor partnership, and should have no registered spouse or partner in the national registry. In addition, the event should state that the new spouse is older than 18 years.
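Since the original listing is only shown as a figure, the following Python sketch reconstructs this constraint from the description above; the field names of the event specification are illustrative assumptions, not the actual Steve132 fields.

def marriage_constraint(spec: dict) -> bool:
    # Condition: the event is a marriage. Expected result: the subject is an
    # adult, currently neither married nor in a partnership, has no registered
    # spouse or partner, and the stated new spouse is also an adult.
    condition = spec["event_type"] == "marriage"
    expected = (
        spec["subject_age"] > 18
        and spec["subject_civil_status"] not in ("married", "partnership")
        and not spec["subject_has_registered_partner"]
        and spec["spouse_age"] > 18
    )
    # The constraint holds when the condition implies the expected result
    return (not condition) or expected

# Example: a semantically valid marriage event specification
spec = {"event_type": "marriage", "subject_age": 29,
        "subject_civil_status": "unmarried",
        "subject_has_registered_partner": False, "spouse_age": 39}
print(marriage_constraint(spec))   # True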

With such logical expressions of the constraints, we can evaluate the semantic validity of any generated data point (i.e., synthesised event specification). The validity rate with respect to each constraint is the ratio of the number of data points that are valid for this constraint, to the total number of data points. Given a set of constraints, the aggregated or total validity rate is calculated as the number of total data points for which all of the constraints are valid divided by the total number of data points.
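Expressed as formulas, for a set of generated data points \(D\), a single constraint \(c\), and a constraint set \(C\):

\(\text{validity rate}(c) = \frac{|\{d \in D : d \text{ satisfies } c\}|}{|D|}\), \(\quad\) \(\text{total validity rate}(C) = \frac{|\{d \in D : d \text{ satisfies every } c \in C\}|}{|D|}\)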

Note that due to the complexity of real-world domains and applications, it is usually not possible to exhaustively check every constraint in a domain. We hypothesise that high conformance to a representative subset of the constraints can be a good indicator of how well the model has learned semantics and the business rules of the domain. Therefore, a high semantic validity rate based on evaluating a subset of the constraints implies that the majority of the other constraints will hold for the majority of the generated data as well.

6 Case study

To evaluate the effectiveness of our proposed test data generation approach, we implement our solution in collaboration with our industrial partner, the Norwegian National Population Registry (NPR), to generate production-like test data to support the integration testing within NPR and the consumers of its services and data. NPR collects, stores, and manages the personal data of all the inhabitants in Norway and releases electronic personal information to more than 2000 organisations, which are referred to as data consumers. Currently, the software system in NPR is undergoing a modernisation process, and so are the software systems of its data consumers. An effective approach to end-to-end system integration testing is critical in a situation like this, where a large number of complex systems interact. To ensure the effectiveness and efficiency of the integration testing among NPR and its data consumers, NPR provides a shared test environment with a test instance of the population registry, which disseminates data to the nationwide consumer systems that are connected to this environment.

The choice of building a shared test environment may seem arbitrary, but it is, in fact, essential for simulating realistic test scenarios and guaranteeing their correct execution. Suppose that the tax administration wants to execute a test scenario to verify the calculation of annual income taxes. This calculation requires fetching relevant information from other organisations, for instance, the welfare system. If the welfare system used for testing does not have access to the same population registry data, it would most likely fail to respond correctly to requests from the tax calculation system under test.

The goal of our collaboration with the NPR is to provide synthetically generated, dynamic, and production-like data to populate the instance of the population registry in the shared test environment. To apply our proposed solution and synthetically generate a population similar to the Norwegian population, we first design a domain-specific language to express the event specifications in the NPR domain.

In the remainder of this section, we first present this domain-specific language, then explain the syntactic and semantic validity rules based on the domain-specific language and the business constraints of the NPR domain. This helps us use the proposed evaluation framework to evaluate the effectiveness of our test-generation approach within our case study. Note that the evaluation of the representativeness of the generated data, namely, using the JSD metrics to evaluate distribution similarity, is generic and independent of the design of the domain-specific language or the business constraints of the domain. Therefore, it is not discussed separately in this section.

6.1 Domain-specific formal language - Steve132

We frame the problem of generating synthetic event specifications as the problem of generating sentences from a formal language. Each event specification can be seen as a sentence from this language. We define the vocabulary of this language as the set of all characters that can appear in an event specification. A valid sentence from the language of event specifications in the NPR domain is 132 characters long and has a specific structure. We name this formal language Steve132, a contraction of St(ate)Eve(nt)132.

Figure 5 depicts the structure of a sentence in Steve132. The example sentence in Figure 5 expresses the example event specification in Example 1. The first two characters in the sequence (a sentence is a sequence of characters) specify the event type (in this case, a marriage event). The rest of the sequence is divided into two parts: the first part specifies the state of the person (in terms of the birth year, birth month, gender, civil status, etc.) who is the subject of the event, and the second part specifies the details of the event (in this case, information of the spouse and other details, such as where and when the marriage took place).

Fig. 5: Steve132 event specification

6.2 Syntactic validity

We define syntactic validity for Steve132 as follows. A sequence of characters is syntactically valid if:

  • the sequence is 132 characters long, and

  • the character at each position of the sequence has a valid value designated for this position.

Violating either of these two rules renders a sequence syntactically invalid for Steve132. In the NPR domain, each piece of information has its range of valid values. For example, the gender of a person can only take the values male, female or unknown, denoted respectively as M, F and U in Steve132. This also applies to information that is described via multiple characters. The event type, for instance, is expressed as a two-digit number at the beginning of a Steve132 sentence. Therefore, in a valid event specification, the first two characters are digits.

As the name suggests, all event specifications in the Steve132 format have the same length (132 characters). This is achieved by designating a fixed position to each piece of information, or information field, in a Steve132 sequence. Some information fields are event-specific and are only meaningful in the context of specific types of events. For event types where an information field is irrelevant, the absence of its value is marked by zeros or whitespaces. A zero indicates the absence of a numerical value, while a whitespace indicates the absence of an alphabetical value.
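For illustration, such a check can be sketched as follows; the field positions and value sets shown are assumptions made for the example, not the actual Steve132 layout.

import string

# Hypothetical excerpt of the Steve132 layout: each field maps a position
# range to the set of characters allowed at those positions.
FIELD_RULES = {
    "event_type": (slice(0, 2), set(string.digits)),      # two-digit event type
    "gender":     (slice(10, 11), {"M", "F", "U", " "}),   # assumed position
}

def is_syntactically_valid(sequence: str) -> bool:
    # A sequence is valid if it is exactly 132 characters long and every
    # checked position holds a value designated for that position.
    if len(sequence) != 132:
        return False
    for positions, allowed in FIELD_RULES.values():
        if any(ch not in allowed for ch in sequence[positions]):
            return False
    return True

print(is_syntactically_valid("11" + " " * 8 + "F" + "0" * 121))   # True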

6.3 Semantic validity

In addition to syntax and vocabulary, event specifications must be semantically valid: they should conform to the constraints of the domain. For example, in a marriage event, the two spouses should have a civil status eligible for marriage, that is, not married and not in partnership.

We now take a closer look at the example sequence in Figure 5. This is a specification of a marriage event, with event type 11 at the first two positions of the sequence. The subject of this event is a female born in 1991 with a civil status of unmarried. The event details part presents the information of the spouse, a male born in 1981 with a civil status of divorced. Based on the presented information, this sequence is syntactically valid: all the information fields have valid values. Furthermore, this sequence is semantically valid: the birth years are meaningful, and the civil statuses of both spouses are eligible for a marriage event.

In the NPR domain, the business constraints are specific to event types, and one type of event may have multiple relevant constraints. We evaluate the business constraints of eight of the most important event types by calculating the percentage of the generated data that conforms with the relevant constraints, which we define as the semantic validity rate. For the generated data of each of these event types, we first calculate the semantic validity rate for each constraint, and then the semantic validity rate for all the constraints related to this event type. The latter is called total conformity.

The following are the constraints that we evaluated in our experiments for these eight event types. For Marriage events, we check three constraints: (1) that the subject person has a civil status that allows marriage, i.e., not already married or registered as a cohabitant, (2) that the subject person does not have a registered spouse or partner, and (3) that the generated event contains partner information (e.g., year of birth and gender of the spouse). For Death events, we check that the generated events specify a date of death. For “Relocation-within-a-Municipality” events, we check that the generated event contains Municipality information, and the “From-Municipality” field is left blank. For a “Relocation-between-Municipalities” event, both the Municipality and the “From-Municipality” fields must be filled out. For Immigrate events, we check that the generated event contains an “Immigrate-from-country” but no “Emigrate-to-country”. For Emigrate events, we check that the generated event contains an “Emigrate-to-country” but no “Immigrate-from-country”. The generated Birth events should provide the birth year of the mother and the birthplace, “Birth-municipality”. Finally, we check that the generated Change-name events contain information about the new name.

For all these types of events except for the Immigrate events, we also check the “ID-in-use” information, which indicates whether an ID number is already in use in the Registry or not. For the Immigrate events, some of the immigrants may have existing IDs from before if, for example, they temporarily resided in Norway before, while others may not have one. Therefore, for the Immigrate events, the “ID-in-use” information is not constrained to be either true or false. For the Birth events, we check that the ID is not in use. This is because, in a birth event, the new-born person should be assigned a new ID, and this ID should not be in use before. For all the other types of events, we check that the “ID-in-use” property should be true.

7 Experiments

Using our case study, introduced in the previous section, we designed a set of experiments to evaluate the effectiveness of the application of deep language models for generating production-like data, and to answer the following research questions:

  • RQ1: To what extent is the generated data valid with respect to the language syntax?

  • RQ2: To what extent is the generated data representative of the real data?

  • RQ3: To what extent does the generated data conform to the constraints of the domain?

  • RQ4: To what extent is test data generation using deep language models better than random data generation?

  • RQ5: To what extent can our proposed approach address test data needs in a large-scale industrial system?

We train language models with the three deep learning algorithms (presented in Appendices A, B and C), and use the framework proposed in Section 5 for evaluation. In addition, we use a random data generator that generates random and syntactically valid Steve132 sequences as a comparison baseline.

7.1 Training data

We obtained the data for training the language models from NPR. We collected the events from production that were registered in a period of two months and transformed them into the Steve132 format, which yielded 192 thousand Steve132 sequences. Complying with the NPR data security regulations and for privacy protection reasons, we used anonymized data instead of raw production data to form the Steve132 sequences. Note that the anonymized data is available to engineers at NPR only for internal testing and cannot be shared with external consumers. This is in contrast with synthetic data, which can be shared with external parties. The anonymization process is a standard and automated process performed by a dedicated data processing team within NPR. The algorithms used for anonymization preserve, to a great extent, the statistical properties of the production data. The amount of anonymized data available to us is equal to the real population data. However, we chose to use only two months of life events in our experiments to keep the size and duration of the experiments manageable for this part of our research.

Conventionally, when developing deep learning algorithms, the available data are randomly split into three non-overlapping subsets: training, validation, and testing datasets, each of which has different usage in the development process. The training dataset is used to train the model. The validation dataset can be used to evaluate the model’s performance during the training process and to determine when to stop the training. It is also used to compare models from multiple training processes and to select the best one. This may result in the selection of a model that is biased toward the validation dataset, and hence, the need for the testing dataset, which is used to provide an unbiased evaluation of the selected model. In order to avoid any other unwanted bias, for example, due to the chronological order of the samples, randomized algorithms are usually used for splitting the data into these three subsets. This data splitting is especially prevalent for deep learning models for classification and prediction tasks, for example, sentiment analysis and next-word prediction in NLP. Another widely adopted practice is batch processing, where data points are fed to the training and evaluation algorithms in batches, as opposed to being processed one by one. The size of the batch is an important hyperparameter of the training process. Once the entire training data is presented to the model, a training epoch is completed. It is very common for a model to go through multiple epochs for training.

In our experiments, we randomly split the data into only two subsets: the training and the validation subsets. We use the training dataset to train models and the validation dataset to evaluate the training progress along the way. During training, we feed the data to each algorithm in batches of varying sizes and for varying numbers of epochs, as detailed in the subsequent subsections. We do not need a separate testing dataset to combat bias because the models in our experiments are language models for generative tasks. In contrast with classification models, the goal of language models for generative tasks is to generate text sequences that are as if they have come from the same corpus with which the language models were trained. The effectiveness of such models should be measured in terms of their ability to generate similar data to their training data. Therefore, instead of evaluating models with the testing dataset, we evaluate and select the models by using them to generate data and comparing the statistical properties of the generated data with those of the original dataset.

In our training process, we regularly evaluate each model with validation data (one validation batch after every 5 to 10 training batches). After each training epoch, we compute the epoch’s average loss for both the training and validation data. By monitoring the average epoch losses for both training and validation data, we make sure that the models do not overfit (Hawkins 2004). An overfitted model learns the training data and its noise too closely, making it incapable of generalizing to new data that it has not been exposed to. Typically, it returns a low training loss but a much higher loss for validation and testing data. Unlike for the other algorithms, for GANs there is no widely accepted definition of overfitting or algorithm for detecting it (Webster et al. 2019). Moreover, the training of GANs faces other challenges, namely, mode collapse and difficulty in evaluation, as discussed in Appendix C.

For the Char-RNN and VAE, we split the data into 80%/20% training and validation datasets, respectively. For the LatextGAN experiment, we do not split the data but evaluate the model along the training process in different ways, as described in the LatextGAN experiment subsection below.

7.2 Char-RNN experiments

The Char-RNN algorithm in our experiment is based on a PyTorch Char-RNN implementation (char-rnn.pytorch 2019). In our final model, we configure the network with two layers, each with 100 Gated Recurrent Unit (GRU) (Chung et al. 2014) units. This implementation uses the Adam optimizer (Kingma and Ba 2014). The ReLU activation function is used for the GRU units, and we train the RNN network with a sequence-to-sequence loss. The network has 141 thousand parameters and was trained for 28 epochs, with a learning rate of 0.001 and a batch size of 100. The total training time for this experiment was about 250 minutes. At the end of the training, the loss value had reached a plateau. The generation time for 200 thousand sequences (event specifications) is less than 10 minutes.
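The sketch below illustrates, in PyTorch, a character-level RNN with the layer and unit counts reported above; the vocabulary size, the embedding dimension, and the training snippet are illustrative assumptions rather than the exact implementation used.

import torch
import torch.nn as nn

class CharRNN(nn.Module):
    # Two GRU layers with 100 hidden units each, predicting the next character.
    def __init__(self, vocab_size, hidden_size=100, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        output, hidden = self.gru(self.embed(x), hidden)
        return self.out(output), hidden            # logits over the next character

vocab_size = 64                                    # assumed Steve132 character vocabulary
model = CharRNN(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()                  # next-character (sequence-to-sequence) loss

# One illustrative training step on a random batch of 132-character sequences
batch = torch.randint(0, vocab_size, (100, 132))
logits, _ = model(batch[:, :-1])
loss = criterion(logits.reshape(-1, vocab_size), batch[:, 1:].reshape(-1))
loss.backward()
optimizer.step()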

7.3 VAE experiments

The VAE algorithm in our experiment is based on a PyTorch VAE implementation (Variational autoencoder 2019). The final network has a single-layer RNN with 128 LSTM units as the encoder and another single-layer RNN with 256 LSTM units as the decoder. The dimension of the latent vector is 128. The ReLU activation function is used for the LSTM units, and we train the VAE network with a sequence-to-sequence loss. The number of parameters of this configuration is 511 thousand, more than three times the number of parameters in the Char-RNN experiment. The learning rate was set to 0.0001, and the batch size to 16. The lowest loss was recorded after 190 epochs and 430 minutes of training. At the end of the training, the loss value had reached a plateau. Similar to the Char-RNN experiment, the generation time for 200 thousand sequences is less than 10 minutes.

7.4 GANs experiments

The LatextGAN algorithm in our experiment is based on (Donahue and Rumshisky 2018), and our PyTorch implementation is based on (Latextgan-pytorch 2019). In the LatextGAN model, for the RNN autoencoder networks, we reuse the configuration of the encoder and decoder networks from the VAE experiments: the RNN encoder is a single-layer RNN with 128 Long Short-Term Memory (LSTM) (Sundermeyer et al. 2012) units, and the RNN decoder is another single-layer RNN with 256 LSTM units. The ReLU activation function is used for the LSTM units in the RNN-based autoencoder elements and for the ResNet units in the GAN element. The autoencoder element is trained with a sequence-to-sequence loss. The training of the GAN element uses a minimax loss function, also known as GAN loss. For both the generator and discriminator networks, we use ResNet (He et al. 2016) with 10 layers and 128 units in each layer. In this configuration, the autoencoder networks together have 492 thousand parameters, and the GAN networks together have 660 thousand parameters. Compared to the other experiments, the LatextGAN model has the largest number of parameters.

The detailed training process of the LatextGAN model is presented in Appendix D.

7.5 General setup

For all the experiments above, we implement the training pipeline so that it allows us to configure hyperparameters easily, including the network architecture parameters, such as layers and number of units per layer, and the training parameters, such as the batch size, the optimizer algorithm, and the learning rate. The training pipeline also allows us to monitor and record various metrics during the training process, for example, training and validation losses and training gradients. From these metrics, we determine the progress of the training process and decide when to stop it.

We employ an empirical approach to selecting and optimizing hyperparameters. With the help of the training pipeline, we select a starting point for the search, considering the relation between the number of model parameters and the amount of training data available; this is a common approach to hyperparameter tuning in the machine learning community. We then search for an optimal hyperparameter combination by training multiple models with different combinations of hyperparameters in a grid search around the selected starting point. The evaluation of generative models differs from that of traditional discriminative models: beyond the model loss, generative models are often evaluated by the data they generate. For each algorithm, we therefore chose the best-performing configuration and hyperparameters based on both the model loss and the quality of the generated data, and used it to train the final model.
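The sketch below illustrates such a grid search; train_model and evaluate are hypothetical entry points into our training pipeline, and the value ranges are only an example of a grid around a chosen starting point.

```python
from itertools import product

# Illustrative grid around an empirically chosen starting point.
grid = {
    "hidden_size": [50, 100, 200],
    "n_layers": [1, 2, 3],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [50, 100, 200],
}

results = []
for hidden, layers, lr, batch in product(*grid.values()):
    config = dict(hidden_size=hidden, n_layers=layers, learning_rate=lr, batch_size=batch)
    model, val_loss = train_model(config)      # hypothetical call into the training pipeline
    sample_quality = evaluate(model)           # e.g., validity and JSD of a generated sample
    results.append((config, val_loss, sample_quality))

# The final configuration is the one with the best trade-off between loss and sample quality.
```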

We ran our experiments on an Ubuntu machine with one GeForce GTX 1080 graphics card. After training, we generated 200 thousand synthetic sequences from each model. The number of parameters and the number of training epochs of the three models are summarized in Table 1.

Table 1 Model Training Summary

7.6 Random data generator

With the well-defined Steve132 syntax, a random data generator can generate syntactically valid Steve132 life events in a straightforward way. Since the Steve132 language defines event specifications as fixed-length sequences with a designated value range for each information field, random values can be sampled independently from the valid range of each information field and then concatenated in the designated order to form one complete sequence. In this way, the generated sequences are guaranteed to be syntactically valid.

Furthermore, if the discrete probability distribution of the values for each information field is provided, weighted random sampling (Rajan et al. 1989) can be used to generate random values according to these distributions. In weighted random sampling, instead of selecting each value with the same probability, the probability of a value being selected is determined by its relative weight or its probability in the overall distribution. In this way, information fields in the generated sequences are guaranteed to have distributions similar to reality.

However, this method does not take the correlations between information fields into account, which is very important for statistical representativeness. Moreover, the semantic validity of the generated data is not guaranteed. We implemented such a weighted random data generator for Steve132 and called it RanGen. RanGen takes as input the same data used for training the language models and computes the distribution of each information field from it. The distributions are then used as weights to sample random values for each information field before they are concatenated into one Steve132 sequence.
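A minimal sketch of this procedure is given below; the field layout in field_slices is a hypothetical simplification of the real Steve132 layout.

```python
import numpy as np

# Illustrative field layout; the real Steve132 format has many more fields.
field_slices = {"event_type": slice(0, 2), "birth_year": slice(2, 6)}

def learn_field_distributions(training_sequences):
    """For each information field, compute the empirical distribution of its values."""
    distributions = {}
    for name, sl in field_slices.items():
        values, counts = np.unique([seq[sl] for seq in training_sequences], return_counts=True)
        distributions[name] = (values, counts / counts.sum())
    return distributions

def rangen_sample(distributions, rng=np.random.default_rng()):
    """Sample each field independently with its empirical weights and concatenate them."""
    parts = []
    for name in field_slices:                  # fields concatenated in their designated order
        values, probs = distributions[name]
        parts.append(str(rng.choice(values, p=probs)))
    return "".join(parts)
```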

We generate 200 thousand sequences with RanGen and use them as a benchmark for demonstrating the effectiveness of language modelling techniques in generating semantically valid and statistically representative data.

8 Evaluation of results

In this section, we report and discuss the results of our experiments, and answer the research questions listed in Section 7.

8.1 RQ1: To what extent is the generated data valid with respect to the language syntax?

Table 2 summarizes the syntactic validity results for each deep language model and for RanGen. Each row of the table is dedicated to one data field in Steve132 and reports the percentage of the data that are valid with respect to the validity rules for that data field.

The last row of the table shows the percentage of the data where the values of all data fields are valid. It is worth mentioning that the validity rates for the original data used for training for all the information fields are 100%. This is because NPR validates every event before registering it into the system.

For the LatextGAN and VAE models, the length of the generated sequence is fixed by the model architecture; therefore, all data generated by these two models have valid sequence lengths. For Char-RNN, on the other hand, as discussed in Appendix A, the generated sequences have varying lengths. Nevertheless, \(98\%\) of the data generated by our Char-RNN model have the expected sequence length. For the validity of the entire sequence, the Char-RNN model achieves a \(96.07\%\) validity rate and outperforms the LatextGAN and VAE models by far. As expected, the RanGen data are \(100\%\) syntactically valid, as discussed in Section 7.6.
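For illustration, such field-level syntax checks can be expressed as shown below; the field positions and rules are hypothetical simplifications of the actual Steve132 validity rules.

```python
EXPECTED_LENGTH = 132

# Hypothetical field positions and per-field validity rules.
FIELD_RULES = {
    "event_type": (slice(0, 2), lambda v: v.isdigit()),
    "birth_year": (slice(2, 6), lambda v: v.isdigit() and 1850 <= int(v) <= 2025),
}

def field_validity(sequence):
    """Return per-field validity flags plus an overall flag for the whole sequence."""
    flags = {"length": len(sequence) == EXPECTED_LENGTH}
    for name, (sl, rule) in FIELD_RULES.items():
        flags[name] = len(sequence) >= sl.stop and rule(sequence[sl])
    return flags, all(flags.values())
```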

Table 2 Validity Rates

8.2 RQ2: To what extent is the generated data representative of the real data?

To answer this question, we compare the statistical properties of the generated data and the original data, more specifically, the distributions of a number of data fields and a number of joint distributions among data fields. To compare two distributions, we compute their JSD using a base-2 logarithm. Tables 3 and 4 present the JSDs for single data field distributions and for joint distributions over several combinations of data fields, respectively. In most cases in these two tables, the JSDs for LatextGAN and VAE are an order of magnitude larger than those for Char-RNN. We consider an order-of-magnitude difference substantial; it implies a much larger divergence from the original distributions in the data generated by LatextGAN and VAE.
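A minimal sketch of this computation is given below; applying the same helper to tuples of field values yields the joint-distribution JSDs reported in Table 4.

```python
import numpy as np
from collections import Counter

def jsd(p, q):
    """Jensen-Shannon divergence with a base-2 logarithm (0 = identical, 1 = disjoint)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def field_jsd(original_values, generated_values):
    """Align the two empirical distributions on a common support before comparing them."""
    support = sorted(set(original_values) | set(generated_values))
    orig, gen = Counter(original_values), Counter(generated_values)
    return jsd([orig[v] for v in support], [gen[v] for v in support])
```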

Table 3 JSD Comparison

In Table 3, we present JSDs of 10 data fields. The first six data fields are general and appear in events of all types. The last four data fields are specific to certain types of events. In sequences of other event types, these data fields take either the value 0 or white space. For instance, municipality is only relevant for relocation-within-a-Municipality and relocation-between-Municipalities. The distributions of these four data fields over the entire dataset are different from their distributions over the subset of data containing only the relevant event type. Therefore, we compute two JSDs for each of these data fields: one for the entire dataset and one for the subset of the data specific to the relevant event type.

Fig. 6 Event Type Distribution comparison. The tick marks on the x-axis represent the event types. Although the values on many of the event types are very small, this reflects the real distribution in production

The JSDs in the first part of Table 3, concerning the six general data fields, are generally small, meaning that all models have preserved the statistical properties to a great extent. One exception is Event Type, for which the JSD is on the \(10^{-1}\) level in the case of LatextGAN and VAE. We conclude that these models have not learnt the distribution of the event types, which is perhaps the most important distribution, very well. For most of the properties, the JSDs for the Char-RNN model are the smallest, meaning that Char-RNN has most accurately learnt the statistical properties of the original data. As expected, RanGen data have JSD values close to 0 since all data fields are independently sampled with exactly the same distributions as in the original data. Figure 6 compares the distribution of event types in the original data, Char-RNN generated data, and VAE generated data. The distribution of the event types from Char-RNN resembles that of the original data very closely, while the distribution from VAE deviates significantly from the original data.

The second part of the table presents the JSDs for the four data fields that are only relevant to specific event types. As mentioned above, for these data fields, we compute two JSDs, one over the entire dataset and one over the relevant subset of the data. For instance, the JSDs reported in the Municipality\(^{\textrm{a}}\) row were computed using only Steve132 sequences in which the event type is relocation-within-a-Municipality or relocation-between-Municipalities, with this filtering applied to both the generated data and the original data. The results in Table 3 show that the JSDs over the relevant subset of the data are larger than their counterparts over the entire dataset. This happens because, for non-relevant event types, the event-type-specific data fields take default values (i.e., zeros or white spaces) in all sequences, making their distributions over the entire dataset more homogeneous. This bias in the training data, in turn, results in generated distributions that, while closely resembling the distribution over the entire dataset, are not very similar to the distributions of the values over the relevant subset of the data.

For LatextGAN and VAE, the JSDs over the relevant subset are relatively large for all four of these data fields, meaning that these models, if used for data generation, would produce a considerable share of invalid and non-representative data. In contrast, the corresponding JSDs for the Char-RNN model are all reasonably small (less than 0.1). VAE, in particular, has a few very large JSDs (e.g., 0.6661), which lowers the overall representativeness of its generated data; compare this to 0.0678, the largest JSD value observed for Char-RNN. This means that the Char-RNN model has learnt the event-type-specific rules and the dependencies among the information fields involved. In the case of RanGen, on the other hand, where such dependencies and constraints are not explicitly implemented, the JSD values over the relevant subset of the data are very high, denoting a significant deviation from the original data.

Table 4 Joint Distribution JSD comparison

Table 4 presents the JSDs of six of the most important joint distributions for the three models and RanGen. The “Event Type, Municipality Code” joint distribution provides an overview of the types of events that happen over the geography of the country, i.e., over its different municipalities. The “Event Type, Birth Year, Gender, Civil Status” joint distribution provides an overview of the types of events that happen over the demography of the country. The remaining four joint distributions provide overviews of how the data fields correlate with each other within one event; they represent Birth, Marriage, Immigration, and Emigration events.

As expected, the JSD values for joint distributions are higher than those for single property distributions, demonstrating a larger divergence between the distributions. This is because learning the correlations between data fields is more complex than learning each data field individually. Here, Char-RNN, with all its JSD scores under 0.05, is still the best-performing model. The JSD values for LatextGAN, VAE, and RanGen data are generally high, which means the joint distributions in the generated data bear little similarity to those of the original data. This is expected for RanGen since its data generation is based on the assumption that all the properties are statistically independent. For LatextGAN and VAE, this indicates that these two models did not learn the correlations between data fields well and, essentially, that they are not performing much better than a random data generator.

Table 5 Semantic validity

8.3 RQ3: To what extent does the generated data conform to the constraints of the application domain?

With the constraints described in Section 6.3, we evaluate the semantic validity rate for eight event types. The evaluation results are summarized in Table 5. Again, not surprisingly, the conformity rates of the original data are \(100\%\) for all the constraints. For the LatextGAN and VAE models, semantic validity varies from event type to event type, whereas the Char-RNN model performs best overall for all eight event types, with semantic validity rates above \(98\%\) for all the constraints and all the event types. For one constraint (namely, the “Has municipality” constraint for the relocation-between-municipalities event), the LatextGAN model has a higher conformity percentage than the Char-RNN model, and it reaches \(100\%\) conformity for two of the constraints; nevertheless, its total semantic validity rate is lower than that of Char-RNN. The LatextGAN model has especially low semantic validity rates, below \(50\%\), for marriage and emigration events. The VAE model has equal or lower percentages than the Char-RNN model for all the constraints and low total semantic validity rates, below \(70\%\), for immigration and birth events; however, none of its conformity percentages falls below \(50\%\).

In comparison, the semantic validity rates for RanGen data are generally very low. The highest rate among these eight event types is still lower than \(50\%\), and five of them are lower than \(10\%\). This shows that RanGen data do not meet the business constraints well. In other words, if we were to use RanGen for data generation, more than \(90\%\) of the generated data would have to be discarded.
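As an illustration, a single constraint check might look as follows; the event-type codes and field positions are hypothetical, and the actual constraints from Section 6.3 are richer.

```python
# Hypothetical semantic constraint: relocation events must carry a municipality code.
RELOCATION_EVENT_TYPES = {"03", "04"}   # illustrative event-type codes
MUNICIPALITY = slice(20, 24)            # illustrative field position

def has_municipality(sequence):
    if sequence[0:2] not in RELOCATION_EVENT_TYPES:
        return True                     # the constraint only applies to relocation events
    return sequence[MUNICIPALITY].strip() not in ("", "0000")

def semantic_validity_rate(sequences, constraint):
    """Fraction of generated sequences that satisfy one business constraint."""
    return sum(constraint(s) for s in sequences) / len(sequences)
```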

8.4 RQ4: To what extent is test data generation using deep language models better than random data generation?

As we see from the results in Tables 2, 3, 4, and 5, although RanGen generates \(100\%\) syntactically valid data, the representativeness of its generated data is low compared to the deep learning models. Furthermore, the semantic validity rates for the RanGen data are, in general, much lower than those for the data generated by the deep learning models. While LatextGAN and VAE do not always outperform RanGen, Char-RNN consistently does.

8.5 RQ5: To what extent can our proposed approach address test data needs in a large-scale industrial system?

In addition to the quality of the generated data extensively discussed above, successful industrial adoption of the data generation solution needs a language model that can be seamlessly integrated with the rest of the data generation pipeline, a deep learning algorithm that is stable in terms of consistently producing good results, and an implementation that has a reasonable computational cost. Below, we discuss considerations and trade-offs related to the implementation of our proposed data generation solution within our industrial case study.

8.5.1 Seamless integration

The experiments reported in this paper focus on building models that could be used in Line 4 of Algorithm 1 to sample synthetic data. The synthetic data generated from these models should, therefore, be usable by the rest of the algorithm. The results of our experiments show that even for the best model (i.e., the Char-RNN model), a small percentage of the synthetic events is invalid. Therefore, it is practically necessary that the downstream components that take the generated data as input have mechanisms to tolerate errors and inconsistencies in the data. One option is to train a separate classifier to distinguish between valid and invalid data; such a classifier should be designed to have zero false positives (i.e., zero chance of classifying invalid data as valid). In our current implementation of Algorithm 1, we have chosen the simpler approach of making the downstream components more robust, e.g., through extensive exception handling. This approach, in addition, allows us to collect the negative training examples required for training such a classifier. One should also note that it is very unlikely that completely error-free models can be built for a domain as complex as ours; a degree of error is inherent in almost all machine learning models. Hence, our general advice is to design the downstream components that interact with machine learning models to be as robust and resilient as possible.
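A minimal sketch of such a robust downstream step is shown below; parse_event and apply_event are hypothetical stand-ins for the actual downstream components of our pipeline.

```python
import logging

rejected = []   # negative examples, collectable for training a future validity classifier

def consume(event, parse_event, apply_event):
    """Apply a synthetic event; invalid events are logged and set aside instead of crashing."""
    try:
        parsed = parse_event(event)     # hypothetical parser enforcing syntax and semantics
        apply_event(parsed)             # hypothetical downstream action (e.g., update the test registry)
    except (ValueError, KeyError) as err:
        logging.warning("Discarding invalid synthetic event: %s", err)
        rejected.append(event)
```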

Table 6 Training and Data Generation Time

8.5.2 Stability of the language model

A population evolves over time, and so do its dynamics, i.e., the frequency and distribution of the events that happen to it. As a result, the models we generate today might not reflect reality after a few months or a few years, depending on the speed of the changes in the real population. To address this, new models should be trained every few months, using freshly collected training data from production. Therefore, it is important to choose a deep learning algorithm that is stable (Hardt et al. 2016) and has low sensitivity to changes in the training data, i.e., can perform equally well (in terms of the criteria in our evaluation framework) when trained with a different training dataset. A learning algorithm is said to be stable if it produces consistent predictions with respect to small perturbations of the training samples (Sun 2015).

The best way to assess the stability of a deep learning algorithm is to experiment with it using datasets of varying size, quality, format, and statistical properties. In our experiments with Char-RNN, we have been able to consistently produce high-performing models using four different datasets with NPR data (we have not reported the results of all those experiments in this paper). This observation strongly suggests that Char-RNN is a stable algorithm for generating synthetic data.

8.5.3 Computational cost

Table 6 summarizes the computational costs reported in Section 7. The training time for all three models is on the order of hundreds of minutes, which is affordable given a retraining interval of several months, as discussed earlier in this section. The data generation time for all three models is a few minutes when generating 200 thousand events. By generating data in batches, storing them in data storage, and then serving them to the downstream components from the data storage, we obtain a solution that provides a continuous stream of synthetic events; the overhead of data generation is, therefore, negligible. To sum up, in terms of computational cost, all three models are suitable in our industrial setting, with the Char-RNN model being slightly better than the other two.
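The sketch below illustrates this batch-and-serve pattern; the storage format and the generate_batch call are assumptions rather than the actual NPR pipeline.

```python
import json
from pathlib import Path

STORE = Path("synthetic_events.jsonl")   # illustrative on-disk event store

def refill(generate_batch, batch_size=200_000):
    """Generate a batch offline and append it to the store."""
    with STORE.open("a") as f:
        for event in generate_batch(batch_size):   # hypothetical call into the trained model
            f.write(json.dumps({"event": event}) + "\n")

def stream_events():
    """Serve stored events one by one to the downstream test components."""
    with STORE.open() as f:
        for line in f:
            yield json.loads(line)["event"]
```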

Note that RanGen (not included in Table 6) does not require training. However, as shown in Table 5, more than \(90\%\) of the samples from RanGen are invalid. Feeding such a large amount of invalid data into the pipeline and leaving it to the downstream components to discard them incurs substantial computational cost and slows down the pipeline. Manually implementing business logic to filter out invalid data before feeding them into the pipeline, on the other hand, is a daunting task that requires expert knowledge. As a result, even though RanGen may seem computationally less expensive, in practice it is not a cost-effective option.

At the time of writing this paper, the Char-RNN model has been integrated into our test data generator at NPR. The fact that the NPR has implemented our approach in their production pipeline is a significant testament to the practical applicability and usefulness of the approach presented in this paper.

9 Discussion

This section analyzes the threats to validity, discusses practical challenges, and considers the generalization of the proposed solution.

9.1 Threats to validity

We identified a set of threats that could affect the validity of our experimental results.

9.1.1 Threats to construct validity

Due to the complexity of the NPR domain, our evaluation criteria did not include all the possible distributions when evaluating representativeness or all the possible constraints when evaluating semantic validity. This could bias our evaluation of the performance of the models. However, the distributions and constraints we selected are representative of the ones that are important to check in integration testing scenarios. Therefore, we consider the impact of this threat on our application to be negligible.

9.1.2 Threats to internal validity

Based on the results of our experiments in the NPR domain, we recommend utilizing the Char-RNN algorithm for implementing the language model. This is because Char-RNN outperforms the other algorithms in our specific case study and is easier to train.

The training data we used in our experiments were collected from anonymized NPR data over a two-month period. The statistical properties of this dataset may differ from those of the production data in the NPR domain due to the relatively short collection period. One potential threat related to the amount of training data is that it may not be large enough to capture all the possible variations in the production data. However, we consider this amount of data acceptable in this work for the following reasons.

  1. Data access limitation: Due to the specificity of the NPR domain, we do not have access to an unlimited amount of data, and a two-month period of data was what we were provided to support this work. However, our work is a research initiative, and larger models could be trained with larger datasets that potentially cover production variations more extensively. Note that a larger dataset may require training a larger model; however, based on our experiment results, we expect Char-RNN to remain the best choice.

  2. Expert Guidance: Domain experts from NPR, who collaborated with our research team, confirmed that a two-month dataset can effectively capture the vast majority of event types in the NPR domain. While it may not encompass every conceivable combination of data fields that could occur in the real world, it provides a rich and representative variety of events, aligning with the primary objective of evaluating the learning capabilities of deep learning algorithms for our specific purpose.

  3. Balancing Diversity and Practicality: A two-month dataset offers a diverse set of events while remaining practical in terms of hardware requirements and computational costs. This balance is essential for conducting extensive model experiments, including network design, hyperparameter tuning, and model optimization. The manageable data size allows for efficient experimentation and analysis, which is crucial in the research and development phase. As mentioned above, a larger model could be trained on a larger dataset, with a potentially more diverse set of events. With a different dataset, the training process has to be repeated, for instance, to tune the hyperparameters. However, based on our experiments, we still expect to be able to train a Char-RNN model with performance comparable to the one reported in this paper with respect to the statistical representativeness and the syntactic and semantic validity of the generated data.

  4. Adaptation to Evolving Data: Importantly, the statistical properties of event data in the national registry domain can evolve over time due to various factors, such as changing regulations or societal trends. To reflect this evolving reality, our model must be periodically retrained. While the specific interval for retraining can be subject to discussion, our choice of a two-month data duration serves as a benchmark for evaluating the costs associated with model retraining in real-world industrial applications. Moreover, as discussed in Section 8.5.2, with Char-RNN, we have been able to consistently produce high-performing models, as shown in our experiments with four different datasets from NPR. Given this observation, we believe that regularly retraining a model with a new dataset will continue to result in high-quality synthetic data.

9.1.3 Threats to external validity

In this paper, the evaluation was conducted in a single, longitudinal (almost three years in duration) case study. This approach has both strengths and limitations with respect to threats to external validity. Unlike smaller “experiments” where proposed technologies are demonstrated on toy examples, our case study allows us to claim that the proposed solution scales up to practical applications within testing and that it meets the actual needs of the users. However, the use case is, in fact, larger than the NPR since all the public consumers of personal data can now also benefit from this research. Furthermore, we have no reason to believe that the proposed framework cannot be reused successfully in other countries with similar, event-based national population registries. We believe that we have demonstrated strong evidence of external validity within this domain.

The extent to which our approach is valid and useful in other domains remains an open question. In general, to apply this solution to a different domain, the following conditions must hold:

  • It must be feasible to define a formal language for the domain with sufficient expressiveness and conciseness in order to represent the input domain accurately and efficiently in a form that is appropriate for building language-based deep learning models.

  • Data on the actual events that change the state of the system (the population) need to be accessible to the ML engineer, preferably via some sort of queueing system.

  • Constraints of the domain must be identified so that quantitative evaluation metrics can be defined according to our proposed evaluation framework.

Event-based systems find widespread application across various domains, where they play a pivotal role in monitoring and responding to real-time or near-real-time events and state changes. These domains include finance and trading, social media and networking, logistics and supply chain, as well as healthcare, to name a few.

To apply our approach effectively within these domains, we propose a structured methodology:

  • Domain-Specific Language (DSL) Definition: The initial step involves creating a DSL tailored to the unique characteristics and requirements of the domain. This DSL should encapsulate the essential inputs and events specific to the domain and serve as a foundation for the subsequent steps.

  • Building a Training Corpus: Encode actual event data from the domain with the designed DSL to build a training corpus. The event data should be of sufficient size and cover all production scenarios so that the resulting corpus is representative of production data.

  • Language Model Training: We recommend Char-RNN for the language model based on the evaluation results in this paper. Train the model with the training corpus.

  • Test Data Generation: Utilizing the trained language model, employ Algorithm 1, as detailed in Section 3.3, to generate test data.

By following this systematic approach, we can effectively apply our solution to event-based systems across diverse domains to facilitate production-like test data generation that empowers effective and efficient testing.

9.2 Domain knowledge and manual efforts

Applying our solution requires domain knowledge in different phases of the process. Preparing training data and evaluating semantic validity with respect to business constraints necessitate a deep understanding of the domain in question. While implementing domain-specific solutions and collaborating with experts requires effort and resources, this investment is justified by the substantial benefits it brings to production-like test data generation.

Applying our solution also involves a certain amount of manual work.

  • Feature identification: Identifying the relevant information fields within the dataset is a critical step that requires domain-specific knowledge and collaboration with domain experts to arrive at a list of data fields. However, this is a one-time effort.

  • Automated transformation: Once the relevant information fields have been identified, transforming the raw, document-based data into the training corpus can be automated programmatically. While implementing this transformation requires effort, it is a one-time activity that can be reused for subsequent training data preparation, and the automation ensures consistency and efficiency in this process.

It is essential to note that data preparation for deep learning model training, or data engineering, is an ordinary and necessary step in many machine learning applications. The process involves tasks such as data cleaning, normalization, and feature extraction, which are essential to ensure the quality and suitability of the data for modelling. In our case, we extend these practices to create a domain-specific language and automate data transformation into the desired format.

We want to point out that, in our experience, manually creating a statistical model involves considerably more manual work and is more error-prone. Creating UML models likewise requires at least as much manual work, as well as a deep understanding and detailed knowledge of the domain.

9.3 Remaining practical challenges

Although the data generator with the Char-RNN model is already generating abstract events for the downstream test environment, some practical challenges remain in the overall solution. The first question is how fast the statistical properties will vary over time and, hence, how often the model needs to be retrained. To address this challenge, a monitoring mechanism needs to be put in place to track the statistical deviation of the generated data from the real registry data over time. The evaluation framework proposed in Section 5 is a good candidate for monitoring and measuring this deviation.
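As an illustration, such a monitor could reuse the JSD-based comparison from our evaluation framework, as sketched below; the threshold is an assumption and would have to be calibrated per data field.

```python
DRIFT_THRESHOLD = 0.05   # illustrative per-field JSD threshold for triggering retraining

def needs_retraining(real_fields, generated_fields, jsd_fn):
    """real_fields / generated_fields: dict mapping field name -> list of observed values."""
    drifts = {name: jsd_fn(real_fields[name], generated_fields[name]) for name in real_fields}
    return any(d > DRIFT_THRESHOLD for d in drifts.values()), drifts
```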

Another challenge is that the real registry data contain much more information than what the Steve132 format captures. In particular, there are state and event data fields that cannot be encapsulated in a sequence of 132 characters. One straightforward workaround is to sample the missing information from more basic statistical distributions in the data mapper component. However, this approach can compromise the representativeness of the generated data. To adequately meet this challenge, we are working on designing a more expressive format that will be capable of capturing more of the registry data and providing richer generated data with the same level of representativeness.

10 Conclusion and future work

In this paper, we have proposed a solution for generating production-like test data for event-based systems using deep learning language modelling techniques. Within the context of our case study, we have defined a domain-specific language to express the event specifications and have experimented with three deep learning algorithms that have proven successful in natural language text generation. Analysis of the generated test data demonstrates that these deep learning models are, to varying extents, capable of generating representative data that are structurally valid and conform to the constraints of the domain. Among the algorithms that we experimented with, the Character-level Recurrent Neural Network (Char-RNN) outperforms the others in terms of validity, representativeness, and conformity with domain-specific constraints. The Char-RNN model developed in this paper is already fully deployed within the test data generation pipeline at our industrial partner, the Norwegian Population Registry (NPR), and provides test data for integration testing between NPR and its data consumers. Our practical implementation effort has shed light on new areas for further research, in particular the need for a richer event specification language. Furthermore, a user experience study is required to measure the satisfaction of the data consumers with the generated data.