
HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection

arXiv:2408.02927v1 [cs.LG] 6 Aug 2024

Yuxin Wang, Sichuan University, Chengdu, China
Duanyu Feng, Sichuan University, Chengdu, China
Yongfu Dai, Sichuan University, Chengdu, China
Zhengyu Chen, Wuhan University, Wuhan, China
Jimin Huang, The Fin AI, Singapore
Sophia Ananiadou, The University of Manchester, Manchester, UK
Qianqian Xie*, The Fin AI, Singapore
Hao Wang*, Sichuan University, Chengdu, China

Abstract

Data serves as the fundamental foundation for advancing deep learning, and tabular data, presented
in a structured format, is especially conducive to modeling. However, even in the era of LLMs,
obtaining tabular data from sensitive domains remains a challenge due to privacy and copyright
concerns. Hence, exploring how to effectively use models like LLMs to generate realistic,
privacy-preserving synthetic tabular data is urgent. In this paper, we take a step forward in
exploring LLMs for tabular data synthesis and privacy protection by introducing HARMONIC, a new
framework for tabular data generation and evaluation. In the generation part of our framework,
unlike previous methods that rely on continued pre-training of small-scale LLMs, we explore
larger-scale LLMs with fine-tuning to generate tabular data and enhance privacy. Based on the idea
of the k-nearest neighbors algorithm, an instruction fine-tuning dataset is constructed to inspire
LLMs to discover inter-row relationships. Through fine-tuning, LLMs are trained to remember the
format and connections of the data rather than the data itself, which reduces the risk of privacy
leakage. In the evaluation part of our framework, we develop a privacy risk metric, DLT, for
LLM-based synthetic data generation, as well as a performance evaluation metric, LLE, for
downstream LLM tasks. Our experiments find that this generation framework achieves performance
equivalent to existing methods with better privacy, and they also demonstrate the value of our
evaluation framework for measuring the effectiveness and privacy risks of synthetic data in LLM
scenarios.

* Co-Corresponding Author.

Preprint. Under review.


1 Introduction

In the age of deep learning, tabular data is a predominant data format and a key element for building
more effective algorithms to solve specific applications in various fields [1, 2]. However, in many
sensitive domains such as business [3], healthcare [4], and governmental operations [5], there are
significant limitations on the acquisition and use of tabular data. Tabular data in these domains
involves personal privacy, business secrets, or state secrets. The collection and use of such data are
strictly regulated by laws and regulations, and compliance with relevant data protection requirements
is necessary. Unauthorized use or disclosure may result in serious privacy infringement or business
losses. Therefore, generating synthetic data that remains effective for modeling while preserving
privacy has long been a critical research area in tabular data synthesis [6, 7, 8].
Traditionally, tabular data synthesis relied on methods like GANs [9, 10, 11], VAEs [12, 13], and
Diffusion Models [14, 15, 16, 17]. These techniques, built on mathematical foundations and complex
frameworks, significantly advanced the field. However, the rise of Large Language Models (LLMs)
with their impressive ability to generate realistic data has shifted the paradigm. Methods like GReaT
[18] and Tabula [6] leverage LLMs for faster synthesis by converting tables to natural language
and predicting the next data token. They often utilize smaller pre-trained models like GPT-2 [19]
for efficiency. Despite their advantages, LLMs introduce significant privacy concerns [20, 21].
These models may potentially leak sensitive information from the training data they are exposed to.
Therefore, a crucial area of exploration lies in developing strategies to mitigate these privacy risks
while harnessing the power of LLMs for tabular data synthesis.
To Harness LLMs fOr Tabular Data SyNthesis and PrIvacy ProteCtion, we develop a new framework,
HARMONIC (code available at https://github.com/Wendy619/HARMONIC), with tabular data generation and
evaluation on LLMs. For the tabular data generation
framework, we use existing larger-scale LLMs to leverage their understanding abilities for generating
tabular data while ensuring privacy. It is based on the idea of k-nearest neighbor classification (kNN)
[22], which lets the LLMs see the relationship between multiple similar rows and construct the
structural tabular synthetic data format. With this format, we obtain the instruction-tuning datasets
that retain more structural information for LLMs to enhance the ability to generate synthetic data
through fine-tuning, while avoiding the forced memorization of data with pre-training. Meanwhile,
to comprehensively assess the effectiveness and privacy of synthetic data generated by LLMs, our
framework introduces two novel metrics: DLT (Data Leakage Test) and LLE (LLM Efficacy).
DLT quantifies the privacy risk of the data synthesized by LLMs, while LLE evaluates the
effectiveness of the synthetic data in downstream LLM tasks. Evaluating effectiveness on
downstream LLM tasks reflects the growing application of LLMs across fields, where machine
learning-based evaluations alone are no longer sufficient.
Using our evaluation framework, we assess four datasets commonly used for classification tasks in
tabular data synthesis. The results show that synthetic data generated with HARMONIC performs
comparably to existing methods in machine learning and excels in downstream tasks and privacy
assessments in LLMs. Crucially, HARMONIC’s evaluation suggests that traditional synthetic data
methods may be unsuitable for downstream LLM tasks and that pretraining-based synthetic data
poses significant privacy risks.
The main contributions of this study can be summarized as follows: 1) We recognize that it is
crucial not only to focus on the strong data generation ability of LLMs in this era, but also to
pay attention to the potential privacy risks they may bring. 2) We develop a framework, HARMONIC,
for synthesizing tabular data based on LLMs. The framework aims to minimize the risk of data
leakage while ensuring the effectiveness of data synthesis with LLMs. 3) Under the HARMONIC
framework, a set of metrics is proposed for evaluating the effectiveness of synthetic tabular data
in downstream LLM tasks and its privacy risk.

2 Related work

Tabular Data Synthesis. Prior to the rise of Large Language Models (LLMs), synthetic tabular
data generation primarily relied on machine learning or classical neural network frameworks. These
methods can be broadly categorized into three groups: Simple Augmentation, Generative Adversarial
Networks (GANs), and Diffusion Models. Techniques like SMOTE [23] exemplify Simple Augmenta-
tion, leveraging linear interpolation for data resampling. While effective for structured data, SMOTE
overlooks semantic information. Building on GANs, CTGAN [9] introduces a conditional generator
and adapts a Variational Autoencoder (VAE) for tabular data (TVAE). CTAB-GAN [10] tackles data
imbalance and long-tail issues. TabDDPM [14] serves as a prominent benchmark for Diffusion-based
methods, with TABSYN [15] offering faster synthesis compared to other such techniques. However,
most of these methods utilize one-hot encoding for categorical data, which can exacerbate the "curse
of dimensionality" for high-cardinality variables and fail to capture contextual information [18, 6].
LLMs have emerged as a compelling approach for synthetic data generation due to their exceptional
capabilities in producing high-quality, human-like data. LLM-based methods commonly employ
a pre-training paradigm. Real tabular data is converted into text format and fed into the LLM for
learning. GreaT [18] exemplifies this approach, converting each tabular feature into the format "X
is Y" and feeding the text into GPT-2 [19] for fine-tuning. Tabula [6] introduces a tabular data
synthesizer leveraging an LLM framework without pre-trained weights. It prioritizes faster training
speed by simplifying token sequences to "X Y". REaLTabFormer [24] presents a transformer-based
framework for generating both non-relational and relational tabular data. It treats each tabular
sample as a sequence with dependencies, akin to a sentence, learning conditional distributions to
sequentially generate complete samples. While LLM-based methods often outperform machine
learning approaches due to their ability to leverage contextual information in text entries, limitations
exist. Processing table data entry-by-entry hinders LLMs from fully exploiting relational information
between samples. Furthermore, inherent security risks associated with data leakage plague LLMs
[20]. Pre-training-like fine-tuning can make them vulnerable, potentially allowing an attacker with
knowledge of one or two feature values in a real entry to retrieve the entire real data record.
Tabular Data Synthesis Evaluation. Existing evaluation methods for synthetic data, such as the MLE
benchmarking system proposed by Xu et al. [9], primarily focus on assessing its performance as
training data for machine learning models. However, as Kotelnikov et al. [14] argue, relying on
weak classifiers for evaluation becomes outdated in light of the capabilities of advanced models like
CatBoost [25]. This underscores the need for more sophisticated evaluation techniques, especially
considering the widespread adoption of LLMs in downstream applications [26].
Current privacy metrics for synthetic data, such as Distance to Closest Record (DCR) [10] and
the NewRowSynthesis metric from SDMetrics [27], solely analyze the distance between synthetic
data and real data. While these distance-based approaches provide valuable insights, they fall short
when dealing with Large Language Models (LLMs). LLMs are particularly susceptible to data
leakage due to their complex nature and training on massive datasets [20]. However, existing privacy
metrics based solely on tabular data feature distances fail to capture the unique learning and inference
mechanisms of LLMs, which operate at the semantic and generative probability levels of embeddings.
Consequently, these methods lack intuitive indicators of privacy leakage specific to LLMs [28].

3 Our Framework: HARMONIC

This section presents the HARMONIC framework for tabular data synthesis powered by LLMs,
encompassing both generation and evaluation modules.

3.1 Synthetic Tabular Data Generation Framework

In this section, we present our approach to fine-tuning pre-trained LLMs for the generation of
synthetic tabular data, which comprises three key stages: (1) Constructing the instruction dataset:
build an instruction fine-tuning dataset for fine-tuning the generator model and a prompt dataset
to facilitate data generation. (2) Instruction tuning: feed the instruction fine-tuning dataset
into a pre-trained LLM for fine-tuning, as illustrated in Figure 1. (3) Sampling: generate
synthetic tabular data by sampling from the fine-tuned LLM, as depicted in Figure 2. Below, we
describe the entire process, encompassing the construction of the instruction dataset, model
fine-tuning, and the implementation of sampling.

Figure 1: After applying the kNN algorithm to the original table, we obtain n sets of k + 1 data
points. Each set is structured according to the template shown in the gray table at the bottom left.
These datasets are then encoded into a single instruction using text encoding, with the features of
each table data shuffled, as shown in the white box above (a). Finally, the encoded fine-tuning dataset
is input into the pre-trained LLM for fine-tuning (b).

3.1.1 Construct Instruction Dataset


Construct a fine-tuning dataset using kNN. Our approach leverages kNN to enable LLMs to generate
synthetic data that resembles limited real data. The goal is to exploit the in-context (few-shot)
learning ability of LLMs to mine information from the most relevant table samples.
Specifically, this process involves finding the k nearest neighbors for each training sample,
creating a set of k + 1 data points. To improve the quality of the generated synthetic data, a
filtering step is necessary: for each set of k + 1 data points, if more than half of the k input
data points have labels different from that of the corresponding target data point, the set is
discarded. Ultimately, this filtering process yields n sets of k + 1 data points. A constant value
of k = 5 was used throughout our experimental setup.
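For concreteness, the following is a minimal Python sketch of this construction step, assuming a
pandas DataFrame train_df with a label column; the function names and one-hot preprocessing are
illustrative assumptions, not the authors' released code.

import pandas as pd
from sklearn.neighbors import NearestNeighbors

def build_knn_sets(train_df: pd.DataFrame, label_col: str, k: int = 5):
    # kNN runs on a purely numeric view; one-hot encode categorical features.
    features = pd.get_dummies(train_df.drop(columns=[label_col]))
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features.to_numpy(dtype=float))
    _, idx = nn.kneighbors(features.to_numpy(dtype=float))  # row i: sample i plus its k neighbors

    kept = []
    for neighbors in idx:
        anchor, rest = neighbors[0], neighbors[1:]
        anchor_label = train_df.iloc[anchor][label_col]
        disagree = sum(train_df.iloc[j][label_col] != anchor_label for j in rest)
        # Discard the set if more than half of the k input points carry a
        # label different from the anchor's label.
        if disagree <= k / 2:
            kept.append(list(neighbors))
    return kept  # n sets of k+1 row indices into train_df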
Data format engineering. Since LLMs are designed as sequence-to-sequence models, feeding tabular
data into an LLM requires converting the structured data into a textual format. A straightforward
approach would be to directly input a data structure readable by a programming language, such as a
Pandas DataFrame for Python, line-separated JSON, or HTML table markup [1]. In our method, each
table entry is converted into a JSON dictionary format, preserving the original table structure
and enabling the model to understand the semantics of each value.
For a table entry s_i with feature names f_1, f_2, ..., f_m, where the value of its j-th feature
is v_{i,j}, the JSON-formatted data t_i corresponding to the table entry s_i is defined as follows:

$$t_{i,j} = [f_j : v_{i,j}], \quad \forall i \in \{1, \dots, n(k+1)\},\ j \in \{1, \dots, m\}, \tag{1}$$

$$t_i = \{t_{i,1}, t_{i,2}, \dots, t_{i,m}\}, \quad \forall i \in \{1, \dots, n(k+1)\}. \tag{2}$$

We concatenate k JSON-formatted data entries sequentially, incorporating prompts to elucidate the
fine-tuning task and contextualize the data. The label JSON-formatted data entry serves as the
reference answer. In addition, when converting a tabular feature vector into a sequence using the
text encoding scheme, we inadvertently introduce pseudo-positional information into the
transformed tabular data sample. However, there is no inherent spatial ordering among features in
tabular datasets [29]. To restore feature order independence, we randomly shuffle the order of
features within each complete JSON-formatted data entry t_i using a permutation. This results in a
new sequence where the order of features is randomized, ensuring that the model learns to be
invariant to feature order. A template for this instruction fine-tuning dataset is shown in
Figure 1; for illustrative examples, please refer to Appendix A.5.
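A minimal sketch of this encoding with random feature permutation is shown below; it assumes
pandas Series rows and an abbreviated task prompt relative to the full template in Appendix A.5.

import json
import random

def encode_row(row) -> str:
    # t_i = {f_j: v_{i,j}} with the features randomly permuted (Eqs. 1-2),
    # so no spurious positional ordering is baked into the sequence.
    items = [(name, str(value)) for name, value in row.items()]
    random.shuffle(items)
    return json.dumps(dict(items))

def build_instruction(neighbor_rows, anchor_row, task_prompt: str) -> dict:
    # The k neighbors become the INPUT examples; the anchor is the reference OUTPUT.
    examples = "\n".join(
        f"Example {i + 1}: {encode_row(r)}" for i, r in enumerate(neighbor_rows)
    )
    return {"INPUT": f"{task_prompt}\n{examples}\nGenerate one sample:",
            "OUTPUT": encode_row(anchor_row)}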

Figure 2: The sampling step involves inputting a prompt, shown within the white box in the upper
left corner (a), into the fine-tuned pretrained LLM. This results in a textual output (b), which is then
converted into a table using pattern matching (c).

Construct prompt dataset for generation. To generate synthetic data, we need to construct a prompt
dataset consistent in format with the fine-tuning dataset. There are three key differences between
the prompt dataset and the fine-tuning dataset: (1) The prompts used for generating data remain
consistent with those used during fine-tuning, except for the OUTPUT field, as the output is the
synthetic data that the model needs to generate without a reference answer. (2) Each set of k real
data points in the prompt dataset is randomly resampled from the real data, unlike in the
fine-tuning dataset; this prevents the model from reproducing the original real data and requires
no filtering operation. (3) The size of the prompt dataset should be larger than the number of
synthetic data samples required, as each prompt can generate only one piece of synthetic data.
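A hedged sketch of the prompt-dataset construction follows, reusing encode_row from the sketch
above; names are illustrative.

import random

def build_prompt_dataset(train_df, task_prompt: str, n_prompts: int, k: int = 5):
    # k real rows are resampled at random (no kNN, no filtering), and no
    # OUTPUT field is attached: the model must generate the sample itself.
    prompts = []
    for _ in range(n_prompts):
        picks = random.sample(range(len(train_df)), k)
        examples = "\n".join(
            f"Example {i + 1}: {encode_row(train_df.iloc[j])}"
            for i, j in enumerate(picks)
        )
        prompts.append(f"{task_prompt}\n{examples}\nGenerate one sample:")
    return prompts  # over-provision: each prompt yields at most one synthetic row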

3.1.2 Instruction Tuning


We then fine-tune the LLM for the synthetic data generation task using the instruction dataset we
constructed. Unlike pretraining-based LLMs for this task, our aim is to prevent the LLM from
memorizing the original tabular data in the dataset. After tokenizing our instruction dataset, the
resulting token embeddings of one sample for the INPUT and OUTPUT are denoted as emb(X) =
(x1 , . . . , xl ) and emb(Y) = (y1 , . . . , yq ), respectively. Here, l and q represent the lengths of the
INPUT and OUTPUT, respectively. The objective of our fine-tuning strategy is to maximize the
probability of generating the correct output sequence given a prompt describing the task and k
input real data points:

$$p(ft) = p(\mathrm{emb}(Y) \mid \mathrm{emb}(X)) = p(y_1, \dots, y_q \mid x_1, \dots, x_l) = \prod_{j=1}^{q} p(y_j \mid x_1, \dots, x_l, y_1, \dots, y_{j-1}) \tag{3}$$

The LLM is trained by optimizing its parameters to maximize $\prod_{ft \in FT} p(ft)$, which
computes the loss only on the OUTPUT and avoids learning the real data in the INPUT, thereby
protecting privacy.
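In common PyTorch/Hugging Face causal-LM training, such an objective is implemented by masking
INPUT positions with the ignore index -100 so that they contribute no loss. A minimal sketch of
this label masking, offered as an illustration rather than the authors' exact implementation:

IGNORE_INDEX = -100  # positions with this label are excluded from the cross-entropy loss

def make_labels(token_ids: list[int], input_len: int) -> list[int]:
    # token_ids = tokenized INPUT followed by OUTPUT; input_len = number of
    # INPUT tokens. Only OUTPUT tokens remain as training targets, so the k
    # real rows in the prompt are never directly memorized via the loss.
    labels = list(token_ids)
    labels[:input_len] = [IGNORE_INDEX] * input_len
    return labels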

3.1.3 Sampling
We denote the fine-tuned LLM as the generator G. Each data point in the prompt dataset is fed
into G, yielding the distribution of subsequent tokens conditioned on the known input sequence.
To generate the next token with more diversity and protect privacy, we adopt a weighted sampling
strategy incorporating a temperature coefficient T . We set the default temperature coefficient T to
0.7. After generation, we utilize pattern-matching algorithms, as described in [30], to reconvert the
generated textual feature representations into a dataframe format, resulting in the final synthetic
tabular dataset.
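A hedged sketch of this sampling and parsing step, assuming a Hugging Face model and tokenizer;
the regex-based JSON extraction is a simplification of the pattern matching described above.

import json, re
import pandas as pd
import torch

@torch.no_grad()
def sample_rows(model, tokenizer, prompts, temperature: float = 0.7) -> pd.DataFrame:
    records = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, do_sample=True, temperature=temperature,
                             max_new_tokens=512)
        # Decode only the newly generated tokens, not the prompt.
        text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        match = re.search(r"\{.*\}", text, re.DOTALL)  # first JSON-like object
        if match:
            try:
                records.append(json.loads(match.group(0)))
            except json.JSONDecodeError:
                continue  # drop malformed generations
    return pd.DataFrame(records)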

3.2 Synthetic Tabular Data Evaluation Framework

We introduce two new metrics to evaluate the quality and privacy of synthetic data for LLM-based
synthesis methods: LLM Efficacy (LLE) and Data Leakage Test (DLT).

3.2.1 LLE: LLM Efficacy
With the development of LLMs, we believe that evaluating the quality of synthetic data using weak
classifiers is losing its practical value and credibility. More and more people are concerned with
performance of synthetic data as a training set for state-of-the-art methods [14]. Recent research
exploring the application of LLMs to tabular data processing has yielded significant advancements,
with potential to rival or even surpass state-of-the-art machine learning approaches [31]. Therefore,
we propose using synthetic data to fine-tune a pretrained LLM and then evaluate the fine-tuned LLM
on the real test set. We refer to this as LLM Efficacy (LLE). We choose LLaMA-2-7b-chat [32] as
the base model to compute the LLE.
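A minimal sketch of the LLE computation, assuming a predict_fn that wraps the LLM fine-tuned on
synthetic data and returns a label string per test instance; the weighted F1 score matches the
setup described in Section 4.1.

from sklearn.metrics import f1_score

def llm_efficacy(predict_fn, test_texts, test_labels) -> float:
    # predict_fn prompts the fine-tuned LLM with one text-encoded test row
    # and returns its predicted label string.
    preds = [predict_fn(text) for text in test_texts]
    return f1_score(test_labels, preds, average="weighted")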

3.2.2 DLT: Data Leakage Test


The metrics Distance to Closest Record (DCR) [10] and SDMetrics [27] focus on measuring the
"distance" between synthetic data and real data, without taking into account the extent to which the
generator itself leaks data. Research indicates that LLMs are susceptible to data leakage issues to
varying degrees [20]. Attacks on LLMs of synthetic data generator can potentially extract complete
training data, leading to severe privacy breaches. To address this, we propose a new metric for
quantifying privacy protection, named the Data Leakage Test (DLT), inspired by the work of
Skywork [33]. This metric measures the extent to which a generator leaks real data, thereby reflecting
the privacy level of the synthetic data. The DLT computes the perplexity of the generator on both
synthetic and real data to determine its data generation tendencies.
To compute the DLT, we first feed the training data into the generator to calculate the perplexity
(ppl) of each sample, then average these scores to obtain the ppl on the training data, referred
to as <ppl-on-train>. We then feed the synthetic data into the generator to obtain the ppl on
synthetic data, referred to as <ppl-on-syn>. The DLT value is computed by subtracting <ppl-on-syn>
from <ppl-on-train>. A larger DLT value indicates better privacy preservation of the original data
by the generator, whereas a smaller value indicates weaker privacy preservation. The DLT is
computed as follows, where P(x) denotes the probability of generating a sentence x and N denotes
its length in tokens:

$$\mathrm{DLT} = \mathrm{PPL}(D_{\mathrm{train}}) - \mathrm{PPL}(D_{\mathrm{syn}}) \tag{4}$$

$$\mathrm{PPL}(D_{\mathrm{split}}) = \frac{1}{|D_{\mathrm{split}}|} \sum_{x \in D_{\mathrm{split}}} P(x)^{-\frac{1}{N}} = \frac{1}{|D_{\mathrm{split}}|} \sum_{x \in D_{\mathrm{split}}} 2^{\mathrm{CrossEntropy}(x)} \tag{5}$$
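A hedged sketch of the DLT computation follows, deriving per-sample perplexity from the
generator's average token cross-entropy (in the base-2 form of Eq. 5); model handling is
simplified relative to a production evaluation harness.

import math
import torch

@torch.no_grad()
def avg_perplexity(model, tokenizer, texts) -> float:
    ppls = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        # Mean token cross-entropy in nats, from the standard causal-LM loss.
        loss = model(input_ids=ids, labels=ids).loss.item()
        ppls.append(2 ** (loss / math.log(2)))  # equals exp(loss); base-2 form of Eq. (5)
    return sum(ppls) / len(ppls)

def dlt(generator, tokenizer, train_texts, syn_texts) -> float:
    # DLT = <ppl-on-train> - <ppl-on-syn>; a larger (less negative) value means
    # the generator fits its training rows less tightly, i.e. better privacy.
    return (avg_perplexity(generator, tokenizer, train_texts)
            - avg_perplexity(generator, tokenizer, syn_texts))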

4 Experiment
In this section, we select four real-world datasets to compare the performance of HARMONIC with
various types of data synthesis methods. The comparison is conducted from two perspectives: the
effectiveness of the synthesized data and its privacy.

4.1 Experimental Setup

Datasets. To evaluate the proposed method, we utilize four real-world datasets from various
domains: GM (German [34]), AD (Adult Income [35]), DI (Diabetes), and BU (Buddy); their sources
are listed in Appendix A.1. All are open-source datasets and do not contain any personal
information such as names, phone numbers, or addresses. These datasets differ in size, feature
types, and number of features, ranging from fewer than 1,000 to tens of thousands of samples. Some
datasets include only numerical features, while others contain both numerical and categorical
features. We divided each dataset into training, validation, and test sets in approximately a
7:1:2 ratio. All models were trained on the same training data samples.
Baselines. There are numerous synthetic methods for generating tabular data. Based on the classifi-
cation approach discussed previously, we selected the most representative methods as our baselines.
SMOTE [23] is a simple interpolation method proposed for oversampling minority classes and can
also be used for generating synthetic data. TVAE [9] is a state-of-the-art method for tabular data gener-
ation based on VAE. CTAB-GAN [10] is a GAN-based model that performs exceptionally well across
a diverse set of benchmarks. TabDDPM [14] serves as a famous benchmark for Diffusion-based
Methods. TABSYN [15] achieves faster synthesis compared to other diffusion-based techniques.
GReaT [18] and REaLTabFormer [24] are SOTA tabular data synthesizers based on LLMs, to be
precise, both are based on GPT-2 [19]. The code for these methods can be found on GitHub.
Metrics. For the effectiveness of synthetic data, we evaluate it using our proposed LLE metric.
Specifically, we convert the synthetic tabular data into the text format required for specific classifica-
tion tasks, then feed this data into a pre-trained LLM for fine-tuning. The fine-tuned model is then
tested using a test set that has been similarly converted into the corresponding text format, and the
weighted average F1 score is obtained. We also fine-tune the pre-trained LLM using the training set
of real data and evaluate it on the test set. The synthetic data is considered to have practical value if
the performance of the fine-tuned model using synthetic data is on par with or better than that using
real data.
For the privacy of synthetic data, we use three metrics: DCR [10], NRS (NewRowSynthesis) [27], and
our proposed DLT. All three metrics are positively correlated with privacy, meaning that higher
values indicate stronger privacy.
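As a reference point, one common formulation of DCR computes, for each synthetic row, the minimum
distance to any real training row; the sketch below uses L2 distance on a one-hot numeric view and
averages over the synthetic set. Exact definitions vary across papers, so this is an illustration
under stated assumptions rather than the precise metric implementation used here.

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def mean_dcr(real_df: pd.DataFrame, syn_df: pd.DataFrame) -> float:
    # Shared one-hot encoding so real and synthetic rows live in one space.
    both = pd.get_dummies(pd.concat([real_df, syn_df], ignore_index=True))
    real = both.iloc[:len(real_df)].to_numpy(dtype=float)
    syn = both.iloc[len(real_df):].to_numpy(dtype=float)
    dists, _ = NearestNeighbors(n_neighbors=1).fit(real).kneighbors(syn)
    return float(np.mean(dists))  # mean distance to the closest real record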
Implementation Details. Our approach allows for the selection of any pre-trained generative LLM
that supports fine-tuning, such as GPT-2 [19], LLaMA-2-7b-chat [32], or Mistral [36], as the base
model; detailed fine-tuning configurations are given in Appendix B. By default, our method opts
for LLaMA-2-7b-chat [32] as the base model due to its rich pre-training corpus, which yields
stronger language understanding than GPT-2 [19] and enables it to learn the fine-tuning task more
efficiently. Users can also switch base models according to their specific requirements.
Considering the time cost of the entire experiment, we choose LoRA [37] parameter-efficient
fine-tuning instead of full-parameter adjustment.

4.2 The Effectiveness of Synthetic Data

Our experimental results offer compelling evidence that the synthetic data generated by our
method can effectively serve as a substitute for real data in downstream tasks. This finding
aligns with the growing recognition that traditional Machine Learning Efficacy (MLE) metrics may
not be well-suited for evaluating the effectiveness of synthetic data used with modern LLMs. Relying
solely on MLE metrics can be misleading when evaluating LLMs, potentially leading to inaccurate
conclusions.
Table 1: The results for effectiveness. The best results are marked in bold, the second-best results are
underlined. All results are averages over 3 trials with different random seeds.
Dataset Metric Original   HARMONIC  SMOTE     TVAE      CTAB      TabDDPM   TABSYN    GReaT     RTF
GM      MLE    0.50±0.00  0.55±0.03 0.64±0.02 0.61±0.02 0.57±0.02 0.64±0.01 0.63±0.02 0.44±0.03 0.65±0.01
GM      LLE    0.71±0.00  0.64±0.03 0.67±0.04 0.69±0.03 0.71±0.02 0.67±0.05 0.72±0.02 0.55±0.11 0.69±0.03
AD      MLE    0.61±0.00  0.67±0.02 0.75±0.00 0.74±0.00 0.73±0.01 0.74±0.00 0.73±0.01 0.73±0.01 0.76±0.00
AD      LLE    0.81±0.00  0.80±0.02 0.84±0.01 0.83±0.01 0.83±0.00 0.83±0.00 0.81±0.02 0.82±0.02 0.85±0.00
DI      MLE    0.56±0.00  0.46±0.02 0.72±0.03 0.71±0.02 0.67±0.02 0.71±0.02 0.68±0.03 0.45±0.03 0.66±0.03
DI      LLE    0.70±0.00  0.75±0.00 0.69±0.04 0.72±0.04 0.62±0.09 0.72±0.03 0.77±0.01 0.71±0.03 0.70±0.04
BU      MLE    0.38±0.00  0.27±0.03 0.25±0.02 0.27±0.03 0.26±0.01 0.27±0.01 0.26±0.01 0.24±0.03 0.26±0.00
BU      LLE    0.88±0.00  0.82±0.03 0.85±0.04 0.86±0.01 0.82±0.02 0.85±0.01 0.86±0.01 0.81±0.03 0.70±0.14

Therefore, we primarily base our analysis and evaluation on the LLM Efficacy (LLE) metric. This met-
ric provides a more nuanced assessment of the quality and effectiveness of synthetic data specifically
for LLM-based tasks. Table 1 summarizes the weighted average F1 scores achieved on classification
tasks using the LLaMA-2-7b-chat model. Each value in the table represents the average F1 score ob-
tained across three independent runs of the synthetic data generation process, using different random
seeds to ensure robustness. The results presented in Table 1 (LLE) demonstrate the effectiveness of
our method. While our method surpasses the real training set on the DI dataset, its performance on
the remaining three datasets falls slightly short. However, the average decrease compared to the real
data benchmark is less than 5%, which falls within an acceptable range for practical applications.
Notably, even TABSYN, which boasts the best overall performance among the compared methods, only
outperforms the real training set on two datasets (GM and DI). Furthermore, our method exhibits
a distinct advantage in terms of stability. Compared to other prominent LLM-based methods like
GReaT and RTF (REaLTabFormer), our synthetic data generation process produces results with a
significantly lower standard deviation. This indicates that our method generates data with greater
consistency and reliability, leading to more predictable performance in downstream LLM tasks.
In conclusion, while our method may not achieve the absolute highest performance on every dataset,
the results presented in this section overwhelmingly support its potential as a viable substitute for
real data. The synthetic data generated by our method demonstrates both effectiveness and stability,
making it a valuable tool for various LLM-based applications.

4.3 The Privacy of Synthetic Data

The experimental results demonstrate that our method prioritizes privacy in the synthetic data
generation. This is particularly beneficial in situations where disclosing real data is not feasible due
to privacy concerns. In such scenarios, our synthetic data serves as a reliable and secure substitute for
real data, allowing downstream tasks to proceed without compromising sensitive information.
Table 2 presents three key privacy metric scores to quantify the effectiveness of our method. Analyzing
the results in Table 2, it’s evident that our method surpasses or comes in a close second for almost
all datasets across all three metrics. This translates to demonstrably stronger privacy protection
compared to existing methods.
Table 2: The results for privacy. The best results are marked in bold, the second-best results are
underlined. Each dataset has three metrics, and in all cases, higher values are better.
Dataset Metric HARMONIC SMOTE TVAE CTAB TabDDPM TABSYN GReaT  RTF
GM      NRS    1.00     1.00  1.00 1.00 1.00    1.00   1.00   1.00
GM      DCR    8.08     2.77  4.09 5.36 2.21    3.98   5.84   4.60
GM      DLT    -0.16    —     —    —    —       —      -2.14  -22.04
AD      NRS    1.00     0.95  1.00 1.00 1.00    1.00   1.00   1.00
AD      DCR    2.47     0.16  0.49 0.82 0.50    0.86   1.51   0.57
AD      DLT    -0.98    —     —    —    —       —      -0.67  -163.71
DI      NRS    1.00     1.00  1.00 1.00 1.00    1.00   1.00   1.00
DI      DCR    0.44     0.28  0.33 0.72 0.21    1.37   1.36   0.36
DI      DLT    -0.37    —     —    —    —       —      -0.44  -42.46
BU      NRS    1.00     0.93  1.00 1.00 0.99    1.00   1.00   1.00
BU      DCR    2.52     0.15  0.66 0.70 0.18    1.38   8.30   0.38
BU      DLT    -0.34    —     —    —    —       —      -2.22  -41.13

However, the privacy benefits go beyond the quantitative metrics. The design of our method inherently
offers superior security. An attacker attempting to reconstruct a single real data record would need
knowledge of nearly the entire set of k real data records (typically set to 5). This includes knowing the
sequence of each feature within a record and the specific order of these k samples. This significantly
raises the bar for attackers compared to methods like GReaT, which exposes a vulnerability where an
attacker with knowledge of just one or two feature values in a real record can potentially reconstruct
the entire record.

5 Conclusion
In this paper, we introduce HARMONIC, a novel framework that leverages the power of LLMs for
synthesizing tabular data while addressing privacy concerns. Through instruction fine-tuning,
HARMONIC enables LLMs to capture both the internal feature relationships within individual data
points and the broader connections between samples. Recognizing the crucial importance of privacy,
we propose DLT specifically for detecting privacy leakage in LLM-based synthesis. Extensive
evaluations across four real-world classification datasets showcase HARMONIC's ability to achieve
this crucial balance of effectiveness and privacy: it demonstrably offers robust privacy
protection while preserving the effectiveness of the synthetic data.

Limitations. Compared to other methods, our approach requires a longer processing time because of
the larger LLMs involved. In addition, LLMs are less sensitive to numerical data and are better
suited to classification than regression tasks; as a result, our current work focuses solely on
tabular data used for classification tasks.

References
[1] Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Qi, Scott Nickleach, Diego
Socolinsky, Srinivasan Sengamedu, and Christos Faloutsos. Large language models on tabular data–a
survey. arXiv preprint arXiv:2402.17944, 2024.
[2] Weizheng Lu, Jiaming Zhang, Jing Zhang, and Yueguo Chen. Large language model for table processing:
A survey, 2024.
[3] Alejandro Mottini, Alix Lheritier, and Rodrigo Acuna-Agost. Airline passenger name record generation
using generative adversarial networks. arXiv preprint arXiv:1807.06657, 2018.
[4] Richard J Chen, Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, and Faisal Mahmood. Synthetic data
in machine learning for medicine and healthcare. Nature Biomedical Engineering, 5(6):493–497, 2021.
[5] Chaeyoon Jeong, Sundong Kim, Jaewoo Park, and Yeonsoo Choi. Customs import declaration datasets.
arXiv preprint arXiv:2208.02484, 2022.
[6] Zilong Zhao, Robert Birke, and Lydia Chen. Tabula: Harnessing language models for tabular data synthesis.
arXiv preprint arXiv:2310.12746, 2023.
[7] Qinyi Liu, Mohammad Khalil, Jelena Jovanovic, and Ronas Shakya. Scaling while privacy preserving: A
comprehensive synthetic tabular data generation and evaluation in learning analytics. In Proceedings of the
14th Learning Analytics and Knowledge Conference, pages 620–631, 2024.
[8] Alycia N Carey, Karuna Bhaila, Kennedy Edemacu, and Xintao Wu. Dp-tabicl: In-context learning with
differentially private tabular data. arXiv preprint arXiv:2403.05681, 2024.
[9] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data
using conditional gan. Advances in neural information processing systems, 32, 2019.
[10] Zilong Zhao, Aditya Kunar, Robert Birke, and Lydia Y Chen. Ctab-gan: Effective table data synthesizing.
In Asian Conference on Machine Learning, pages 97–112. PMLR, 2021.
[11] Bingyang Wen, Yupeng Cao, Fan Yang, Koduvayur Subbalakshmi, and Rajarathnam Chandramouli.
Causal-tgan: Modeling tabular data using causally-aware gan. In ICLR Workshop on Deep Generative
Models for Highly Structured Data, 2022.
[12] Syed Mahir Tazwar, Max Knobbout, Enrique Hortal Quesada, and Mirela Popa. Tab-vae: A novel vae for
generating synthetic tabular data.
[13] Patricia A Apellániz, Juan Parras, and Santiago Zazo. An improved tabular data generator with vae-gmm
integration. arXiv preprint arXiv:2404.08434, 2024.
[14] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. Tabddpm: Modelling tabular
data with diffusion models. In International Conference on Machine Learning, pages 17564–17579.
PMLR, 2023.
[15] Hengrui Zhang, Jiani Zhang, Balasubramaniam Srinivasan, Zhengyuan Shen, Xiao Qin, Christos Faloutsos,
Huzefa Rangwala, and George Karypis. Mixed-type tabular data synthesis with score-based diffusion in
latent space. arXiv preprint arXiv:2310.09656, 2023.
[16] Tongyu Liu, Ju Fan, Nan Tang, Guoliang Li, and Xiaoyong Du. Controllable tabular data synthesis using
diffusion models. Proceedings of the ACM on Management of Data, 2(1):1–29, 2024.
[17] Timur Sattarov, Marco Schreyer, and Damian Borth. Findiff: Diffusion models for financial tabular data
generation. In Proceedings of the Fourth ACM International Conference on AI in Finance, pages 64–72,
2023.
[18] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language
models are realistic tabular data generators. arXiv preprint arXiv:2210.06280, 2022.
[19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[20] Biwei Yan, Kun Li, Minghui Xu, Yueyan Dong, Yue Zhang, Zhaochun Ren, and Xiuzheng Cheng. On
protecting the data privacy of large language models (llms): A survey. arXiv preprint arXiv:2403.05156,
2024.
[21] Bishwas Mandal, George Amariucai, and Shuangqing Wei. Initial exploration of zero-shot privacy utility
tradeoffs in tabular data using gpt-4. arXiv preprint arXiv:2404.05047, 2024.

[22] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information
theory, 13(1):21–27, 1967.
[23] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic
minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
[24] Aivin V Solatorio and Olivier Dupriez. Realtabformer: Generating realistic relational and tabular data
using transformers. arXiv preprint arXiv:2302.02041, 2023.
[25] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin.
Catboost: unbiased boosting with categorical features. Advances in neural information processing systems,
31, 2018.
[26] Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Alejandro Lopez-
Lira, and Hao Wang. Empowering many, biasing a few: Generalist credit scoring through large language
models. arXiv preprint arXiv:2310.00566, 2023.
[27] DataCebo, Inc. Synthetic Data Metrics, 10 2023. Version 0.12.0.
[28] Jeffrey G Wang, Jason Wang, Marvin Li, and Seth Neel. Pandora’s white-box: Increased training data
leakage in open llms. arXiv preprint arXiv:2402.17012, 2024.
[29] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji
Kasneci. Deep neural networks and tabular data: A survey. arXiv preprint arXiv:2110.01889, 2021.
[30] Alfred V. Aho. Algorithms for finding patterns in strings. In J. van Leeuwen, editor,
Handbook of Theoretical Computer Science, Volume A, pages 257–297. Elsevier Science Publishers, 1990.
[31] Jiahuan Yan, Bo Zheng, Hongxia Xu, Yiheng Zhu, Danny Chen, Jimeng Sun, Jian Wu, and Jintai Chen.
Making pre-trained language models great on tabular prediction. arXiv preprint arXiv:2403.01841, 2024.
[32] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and
fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[33] Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Wei-
wei Lü, Rui Hu, et al. Skywork: A more open bilingual foundation model. arXiv preprint arXiv:2310.19341,
2023.
[34] Hans Hofmann. Statlog (German Credit Data). UCI Machine Learning Repository, 1994. DOI:
https://doi.org/10.24432/C5NC77.
[35] Ron Kohavi et al. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Kdd,
volume 96, pages 202–207, 1996.
[36] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego
de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud,
Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and
William El Sayed. Mistral 7b, 2023.
[37] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and
Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,
2021.

A Datasets Details
A.1 Data Source

We list the sources of our datasets in Table 3, all of which are obtained from publicly accessible and
reputable websites.

Dataset URL
German https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data
Adult Income https://archive.ics.uci.edu/dataset/2/adult
Diabetes https://www.openml.org/search?type=data&sort=runs&id=37&status=active
Buddy https://www.kaggle.com/datasets/akash14/adopt-a-buddy

Table 3: URLs for real-world datasets of the experiments

A.2 Data Description

Additionally, we record various statistical details for each dataset in Table 4.


German. The German dataset classifies people as good or bad credit risks described by a set of
attributes including status of existing checking account, duration in month, credit history, purpose
and more.
Adult Income. The US Adult income dataset was extracted by Barry Becker from the 1994 US
Census Database. The dataset consists of anonymous information such as occupation, age, native
country, race, capital gain, capital loss, education, work class and more. Each row is labelled as either
having a salary greater than ">50K" or "<=50K".
Diabetes. The Diabetes dataset originates from the National Institute of Diabetes and Digestive and
Kidney Diseases. This dataset comprises medical features including the number of times pregnant,
diastolic blood pressure, body mass index, age, among other variables. The label indicates whether
the individual has diabetes or not.
Buddy. The Buddy dataset originates from the HackerEarth Machine Learning Challenge—Adopt a
Buddy. The dataset consists of parameters such as: a unique ID assigned to each animal that is up for
adoption, date on which they arrived at the shelter, their physical attributes such as color, length and
height, among other factors. The labels in this dataset denote the breed of the animals.

Dataset Domain # Samples # Num # Cat Tasks # Classes


German Financial 1000 7 13 Classification 2
Adult Income Social 32561 6 8 Classification 2
Diabetes Medical 768 8 0 Classification 2
Buddy Nature 18834 4 5 Multi-Class 3

Table 4: Dataset Statistics. # Samples denotes the number of samples in each dataset. # Num and #
Cat columns indicate numbers of numerical and categorical features in each dataset.

A.3 Data Preprocessing

To maintain consistency in formatting, we converted all four datasets into CSV files. Additionally,
the other datasets underwent the following preprocessing steps:
German. The original label "status" with a value of "1" was converted to "0", and the original label
"status" with a value of "2" was converted to "1".

Adult Income. The original label "class" with a value of "<=50K" was converted to "0", and the
original label "class" with a value of ">50K" was converted to "1".
Diabetes. The diabetes dataset was used without any additional preprocessing.
Buddy. The original "issue_date" and "listing_date," which were represented in the "date_time"
format, have been replaced with a timestamp format.
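A minimal sketch of this conversion with pandas, assuming Unix-second timestamps; the helper name
is hypothetical and the column names follow the description above.

import pandas as pd

def dates_to_timestamps(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: convert the two date columns to Unix seconds.
    for col in ["issue_date", "listing_date"]:
        df[col] = (pd.to_datetime(df[col]) - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
    return df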

A.4 Data Field

The instruction fine-tuning dataset is provided in JSON format and contains the following
attributes. A specific instance of INPUT and OUTPUT can be found in Appendix A.5.
{
id: [integer] The unique identifier for each instance
conversations: [
{
from: [string] "human"
value: [string] the INPUT text for LLM fine-tuning
},
{
from: [string] "assistant"
value: [string] the OUTPUT text for LLM fine-tuning
}
]
}

A.5 Data Instance

To illustrate the data format used for fine-tuning both the generator and downstream tasks, we present
a complete data instance from the German dataset as an example, shown in Table 5 and Table 6
respectively.

B Experimental Details

B.1 Parameter Selection

Considering the time cost of the entire experiment, we did not tune hyperparameters separately for
each dataset. By conducting experiments on the validation set and combining them with empirical
settings, we unified the hyperparameters of the fine-tuning process. In the fine-tuning stage, we
choose LoRA [37] parameter-efficient fine-tuning instead of full-parameter adjustment.
We fine-tune the LLaMA-2-7b-chat model for each dataset for 5 epochs with a batch size of 16. We
utilize the AdamW optimizer for the proposed generative models, with the learning rate 3 × 10−4 .
For the sampling step, we use 3 random seeds in the data generation stage for each dataset, specifically
1234, 1235, and 1236. We set the temperature parameter T to 0.7 for all experiments and datasets.
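A hedged sketch of this setup with Hugging Face PEFT is shown below. The epochs (5), batch size
(16), optimizer (AdamW), and learning rate (3e-4) follow the text; the LoRA rank, alpha, and
target modules are illustrative assumptions, as they are not stated here.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,  # assumed values, not stated in the paper
    target_modules=["q_proj", "v_proj"],    # assumed values, not stated in the paper
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
# Training then uses AdamW, lr=3e-4, 5 epochs, batch size 16, as stated above.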
We sample new synthetic data using the prompt dataset for generation (Sec. 3.1.1), starting with a
task description and five random real samples (see an example in Appendix A.5). We generated
synthetic datasets for German and Diabetes with the same number of samples as their respective
training sets. For the Adult Income and Buddy datasets, where the training sets are larger, exceeding
10,000 samples, we generated 5,000 samples due to the extended time required for sampling with our
method.
For the MLE metric, we employ logistic regression, decision tree, MLP, and random forest models.
For the LLE metric, the downstream LLaMA-2-7b-chat model is fine-tuned for 5 epochs with a
learning rate of 1 × 10−4 and a batch size of 32. The random seed is fixed when fine-tuning the
downstream model. See an example of instruction data for downstream tasks in Appendix A.5.

INPUT: Here are 5 tabular data about user credit scores, each containing 20 columns of features
and 1 column of labels, where the ’status’ column is a binary classification label. I will transmit the
data to you in JSON format. Please generate an approximate sample based on these 5 examples.\n
Example one: {"Present employment since": "A75", "Credit amount": "11816", "Credit history":
"A30", "Purpose": "A49", "Duration in month": "45", "Other installment plans": "A143", "Age in
years": "29", "Savings account/bonds": "A61", "status": "1", "foreign worker": "A201", "Number
of people being liable to provide maintenance for": "1", "Number of existing credits at this bank":
"2", "Installment rate in percentage of disposable income": "2", "Housing": "A151", "Property":
"A123", "Present residence since": "4", "Telephone": "A191", "Other debtors / guarantors": "A101",
"Job": "A173", "Status of existing checking account": "A11", "Personal status and sex": "A93"}.\n
Example two: {"Housing": "A151", "Personal status and sex": "A92", "Credit amount": "6416",
"Job": "A173", "Property": "A124", "Purpose": "A49", "status": "1", "Number of people being liable
to provide maintenance for": "1", "Number of existing credits at this bank": "1", "Present employment
since": "A75", "Other installment plans": "A143", "Installment rate in percentage of disposable
income": "4", "Present residence since": "3", "Status of existing checking account": "A12", "Savings
account/bonds": "A61", "Telephone": "A191", "Other debtors / guarantors": "A101", "Age in years":
"59", "Duration in month": "48", "Credit history": "A31", "foreign worker": "A201"}.\n Example
three: {"Housing": "A151", "Installment rate in percentage of disposable income": "4", "Age in
years": "31", "Duration in month": "24", "foreign worker": "A201", "Number of people being liable
to provide maintenance for": "1", "Other installment plans": "A143", "Savings account/bonds":
"A61", "Present employment since": "A73", "Credit history": "A31", "Status of existing checking
account": "A11", "Job": "A173", "Telephone": "A192", "Number of existing credits at this bank":
"1", "status": "1", "Personal status and sex": "A93", "Credit amount": "3161", "Other debtors /
guarantors": "A101", "Purpose": "A49", "Property": "A122", "Present residence since": "2"}.\n
Example four: {"Purpose": "A49", "Number of people being liable to provide maintenance for":
"1", "Housing": "A151", "Age in years": "26", "Savings account/bonds": "A62", "Other installment
plans": "A143", "Present employment since": "A73", "Telephone": "A191", "Installment rate in
percentage of disposable income": "4", "Duration in month": "30", "Number of existing credits
at this bank": "2", "Personal status and sex": "A92", "Present residence since": "4", "Status of
existing checking account": "A12", "Job": "A172", "Credit history": "A30", "Property": "A123",
"Other debtors / guarantors": "A101", "status": "1", "Credit amount": "4280", "foreign worker":
"A201"}.\n Example five: {"Present employment since": "A74", "Credit amount": "3566", "Duration
in month": "48", "foreign worker": "A201", "Other debtors / guarantors": "A101", "Other installment
plans": "A143", "Number of existing credits at this bank": "1", "Number of people being liable to
provide maintenance for": "1", "Credit history": "A31", "Housing": "A152", "Present residence
since": "2", "Installment rate in percentage of disposable income": "4", "Savings account/bonds":
"A62", "Telephone": "A191", "status": "0", "Job": "A173", "Purpose": "A49", "Age in years":
"30", "Personal status and sex": "A93", "Property": "A123", "Status of existing checking account":
"A12"}.\n Generate one sample:
OUTPUT: {"Present residence since": "4", "Credit amount": "7685", "Age in years": "37", "Other
installment plans": "A143", "Status of existing checking account": "A11", "Housing": "A151",
"Credit history": "A31", "Duration in month": "48", "Property": "A123", "Purpose": "A49", "Other
debtors / guarantors": "A103", "Present employment since": "A74", "Installment rate in percentage
of disposable income": "2", "Job": "A173", "Savings account/bonds": "A61", "Telephone": "A191",
"Number of people being liable to provide maintenance for": "1", "Number of existing credits at this
bank": "1", "Personal status and sex": "A92", "foreign worker": "A201", "status": "1"}.

Table 5: An instance of the instruction data for the generator

B.2 Experimental Environment

Our hardware setup includes 4 NVIDIA A100-40GB GPUs. The system has 1 TB system RAM, and
runs on an AMD EPYC 7742 processor with 64 cores, using the Ubuntu 22.04 operating system.

C Additional results

The following presents the results of the ablation study. We conducted comparative experiments
using the German and Diabetes datasets.

INPUT: Evaluate the creditworthiness of a customer with the following financial profile. Respond
with only either ’good’ or ’bad’. \n Text: ’The state of Status of existing checking account is bigger
than 0 DM but smaller than 200 DM, The state of Duration in month is 36, The state of Credit history
is delay in paying off in the past, The state of Purpose is car (new), The state of Credit amount is
1873, The state of Savings account or bonds is bigger than 100 smaller than 500 DM, The state
of Present employment since is bigger than 1 smaller than 4 years, The state of Installment rate in
percentage of disposable income is 2, The state of Personal status and sex is male and single, The
state of Other debtors or guarantors is none, The state of Present residence since is 2, The state of
Property is unknown or no property, The state of Age in years is 29, The state of Other installment
plans is none, The state of Housing is for free, The state of Number of existing credits at this bank is
1.0, The state of Job is management or self-employed or highly qualified employee or officer, The
state of Number of people being liable to provide maintenance for is 1, The state of Telephone is yes,
registered under the customers name, The state of foreign worker is yes.’\n Answer:
OUTPUT: "bad"

Table 6: An instance of the instruction data for downstream tasks

C.1 Filter operation

Experimental results demonstrate that the filtering step can enhance the quality of synthetic data.
As shown in Table 7, the LLE values decrease without filtering, particularly for the German dataset.
This is likely due to incorrect labels in the generated synthetic data. Additionally, privacy slightly
diminishes without the filtering step, though the difference is minimal. These findings indicate that
the filtering step is effective.

Table 7: The results of whether to filter data after kNN, where "w/o fil" means not to filter data, and
"with fil" means to filter data, which is our original method. Each dataset has five metrics, and in all
cases, higher values are better.

Dataset Filter    MLE        LLE        NRS   DCR   DLT
GM      w/o fil   0.56±0.06  0.59±0.03  1.00  7.97  -0.17
GM      with fil  0.55±0.03  0.64±0.03  1.00  8.08  -0.16
DI      w/o fil   0.56±0.06  0.74±0.01  1.00  0.44  -0.38
DI      with fil  0.46±0.02  0.75±0.00  1.00  0.44  -0.37

C.2 Random feature order permutation

Experiments indicate that permuting features can enhance the privacy of synthetic data. As shown
in the last two columns of Table 8, there is a significant reduction in both the DCR and DLT values
when features are not permuted. Concurrently, the generated numerical columns tend to produce
repeated values, which may also contribute to the decrease in the LLE metric. Overall, these results
underscore the necessity of shuffling features.

Table 8: The results of whether to shuffle features, where "w/o pm" means not to shuffle the features,
and "with pm" means to shuffle the features, which is our original method. Each dataset has five
metrics, and in all cases, higher values are better.

Dataset Permutation  MLE        LLE        NRS   DCR   DLT
GM      w/o pm       0.56±0.04  0.63±0.05  1.00  7.20  -0.58
GM      with pm      0.55±0.03  0.64±0.03  1.00  8.08  -0.16
DI      w/o pm       0.50±0.06  0.70±0.03  1.00  0.42  -0.67
DI      with pm      0.46±0.02  0.75±0.00  1.00  0.44  -0.37

D Ethics Statement
The dataset used in this study is based on open-source data and can be further modified. We thoroughly
reviewed and verified the data to ensure it does not contain any personally identifiable information
or offensive content. Additionally, we conducted manual audits to ensure there are no sensitive
details. Therefore, we believe the dataset is secure and its use in the research is ethically sound and
appropriate for the purposes of this study.

