Targeted Data Generation: Finding and Fixing Model Weaknesses

The paper discusses how to generate right datasets for testing NLP models to reveal their weaknesses.

Uploaded by

thebazshah

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

32 views13 pages

Targeted Data Generation: Finding and Fixing Model Weaknesses

The paper discusses how to generate right datasets for testing NLP models to reveal their weaknesses.

Uploaded by

thebazshah

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Targeted Data Generation: Finding and Fixing Model Weaknesses

Zexue He∗ Marco Tulio Ribeiro Fereshte Khani

UC San Diego Microsoft Microsoft
La Jolla, CA, USA Redmond, WA, USA Redmond, WA, USA
zehe@[Link] marcotcr@[Link] fkhani@[Link]

Abstract even if data collection is adversarial (Kiela et al.,

2021), especially when subgroups are not imme-
Even when aggregate accuracy is high, state- diately obvious or salient to humans. Therefore it
of-the-art NLP models often fail systematically
arXiv:2305.17804v1 [[Link]] 28 May 2023

helps little in addressing these weaknesses. Tools

on specific subgroups of data, resulting in un-
fair outcomes and eroding user trust. Addi- for discovering challenging subgroups still require
tional data collection may not help in address- human creativity and effort (Rajani et al., 2022).
ing these weaknesses, as such challenging sub- Khani and Ribeiro (2023); Ribeiro and Lundberg
groups may be unknown to users, and under- (2022) show that experts are able to improve exist-
represented in the existing and new data. We ing subgroups via careful data augmentation with
propose Targeted Data Generation (TDG), a large language models (LLMs), but finding such
framework that automatically identifies chal-
challenging subgroups still requires human inge-
lenging subgroups, and generates new data for
those subgroups using large language models nuity. Perhaps more importantly, they find that
(LLMs) with a human in the loop. TDG esti- naively augmenting certain subgroups can drasti-
mates the expected benefit and potential harm cally hurt other subgroups and overall performance
of data augmentation for each subgroup, and (Ribeiro and Lundberg, 2022). Hence, the chal-
selects the ones most likely to improve within- lenge is not only to find challenging subgroups, but
group performance without hurting overall per- also to determine which subgroups are amenable
formance. In our experiments, TDG1 signifi- to data augmentation, and how to augment them
cantly improves the accuracy on challenging
effectively.
subgroups for state-of-the-art sentiment anal-
ysis and natural language inference models, In this work, we propose Targeted Data Gener-
while also improving overall test accuracy. ation (TDG), a framework to automatically iden-
tify challenging subgroups that can benefit from
1 Introduction more data, and then generate that data with LLMs
Despite very high accuracy, state-of-the-art NLP (Figure 1). Given a target model, TDG clusters val-
models still exhibit systematic failures on specific idation data into potential challenging subgroups.
subgroups of data. For example, Rajani et al. We then use held-out data to estimate how much
(2022) found that a 95%-accurate sentiment analy- each subgroup would benefit from more data, and
sis model did much worse on club reviews (90%) how much additional data would hurt performance
and movie theater reviews (85%), while Stuart-Ulin in other regions. Finally, having identified chal-
(2018) notes how a commercial chatbot avoids any lenging subgroups amenable to data augmentation,
engagement on topics that even mention Islam or we use GPT-3 (Brown et al., 2020) coupled with
the middle east. The existence of these challenging local subgroup models to generate new data, so as
subgroups can lead to unfair outcomes, erode user to improve subgroup performance while remaining
trust, and ultimately limit deployment of models, faithful to the original data distribution.
even when aggregate accuracy is very high. We evaluate TDG on three tasks: sentiment anal-
One possible solution is to collect or generate ysis (SST), paraphrase detection (QQP), and nat-
more data. However, the additional data may still ural language inference (MNLI). We evaluate var-
under-sample from specific challenging subgroups, ious clustering techniques, and find that cluster-
∗ ing based on the target model’s own representation
Work done during the internship at Microsoft.
1
Codes and collected data will be released in https:// yields the clusters most amenable to data augmenta-
[Link]/ZexueHe/TDG. tion (with the exception of QQP, where our analysis
Automatic Subgroup Discovery Subgroup Augmentation with LLM
Identify challenging Clusters LLM generation in under-performing regions.
Clustering Result 1 LLM

LLM
Data C LLM
Clustering Result2
Different High Generalization
Clustering Low Interference
Strategies

Clustering Result3

Test
In-group Accuracy Overall Accuracy

Figure 1: Illustration of the Targeted Data Generation (TDG) pipeline. In the automatic subgroup discovery stage,
TDG identifies challenging clusters that can benefit from additional data while minimizing potential negative impacts
on performance in other regions (i.e., high generalization (GC) and low interference (IC), as defined in Section 2.1).
In the subgroup augmentation with LLM stage, TDG utilizes GPT-3 to generate additional examples for identified
challenging clusters.

indicates label noise would make data augmenta- i.e., more data would generalize and improve per-
tion ineffective). Finally, augmenting these clusters formance on ctest , without hurting performance on
with GPT-3 results in significant improvements on Dtest .
correspondent test clusters, and also small improve-
ments on overall accuracy. 2.1 Generalization and Interference, in
Context
2 Targeted Data Generation Given the context of (Dtrain , M) and a target clus-
ter c, we obtain a new model M′ by training on a
Let M be a target model trained on a training
mixture of Dtrain and ctrain (following Ribeiro and
dataset Dtrain , and let Dtest be a held-out test dataset.
Lundberg (2022)), which effectively upweights ex-
We assume access to a validation dataset Dval ,
amples from c as a surrogate for data augmenta-
which we use to identify and evaluate challeng-
tion. We use two statistics to evaluate whether c is
ing subgroups. We cluster Dval into k disjoint
amenable to data augmentation: Generalization in
clusters, C = {c1 , c2 , . . . , ck }, using some clus-
Context (GC) and Interference in Context (IC).
tering technique (we explore various options in
Section 2.2, and drop the subscript when talking Definition 2.1 (Generalization in Context). We
about a single cluster, for clarity). We divide Dval say a cluster c generalizes in the context of the
randomly into two halves, so that each cluster is current model M and dataset D if more training
divided into ctrain and ctest ( cval can be further di- on it leads to better performance on hidden exam-
vided from ctrain if necessary), to simulate the effect ples from the same cluster. Formally, we define
of data augmentation and its impact on the same Generalization in Context (GC) as
subgroup. We say a cluster c is a challenging clus-
ter if the target model M performs much worse GC(c) = Acc(M′ , cval ) − Acc(M, cval )
on it than on the overall validation dataset, i.e.,
Acc(M, ctrain ∪ cval ) << Acc(M, Dval ). GC measures how much the target model can
Given a challenging cluster c, our goal is to iden- learn from more data from the cluster, and whether
tify whether it is amenable to data augmentation, that learning transfers to unseen data from the same
cluster. A high GC indicates that the cluster is chal- aggregate GC and IC over all clusters by taking the
lenging but not hopeless, and that data augmenta- average:
tion could help improve performance. A low GC
indicates that the cluster is either already saturated k
X GC(ci )
by existing data or too hard for the model to learn, GC(C) = (1)
k
such that more data from the cluster does not help. i=1
For example, if the clustering is random, we would
k
expect a low GC, as training on a random subset X IC(ci )
IC(C) = (2)
of data would not improve performance on another k
i=1
random subset. Conversely, if the clustering is
based on some meaningful feature that the model
struggles with, (such as club reviews (Rajani et al., 2.2 Automatic Subgroup Discovery
2022)), we would expect a high GC, as training on We use different representation spaces for cluster-
more data from the cluster would help the model ing, using increasing amounts of information about
overcome its weakness. the task, the model, and the labels. The example is
Definition 2.2 (Interference in Context). We say shown in Figure 2.
a cluster c interferes with the original data if aug-
menting it leads to worse performance on the origi- Agnostic clustering We do not use any informa-
nal data. We could similarly evaluate interference tion about the task, the model, or the labels, and
with other clusters, but for now we restrict our- instead use general-purpose embeddings, such as
selves to having the original model and dataset as the embeddings extracted from Sentence-BERT
the context. Formally, we define Interference in implemented in sentence-transformers (Reimers
Context (IC) as and Gurevych, 2019), to cluster the validation data.
This kind of representations might capture some
patterns that the target model cannot currently rep-
IC(c) = Acc(M, Dval ) − Acc(M′ , Dval ) resent well, and that augmenting these clusters
would teach the target model new concepts or rela-
A high IC indicates that the cluster is incom- tions.
patible with the original data, and that data aug-
mentation would degrade overall performance. A Task-based clustering We use the target model’s
low IC indicates that the cluster is either similar own representation from the second-to-last layer
to the original data, or sufficiently different but to cluster the validation data. This kind of repre-
not conflicting, such that data augmentation would sentations reflects how the target model perceives
not hurt overall performance. For example, if c the data, and might group together examples that
is label-imbalanced and D is label-balanced, we the model considers similar or difficult. We expect
would expect a high IC, as training on more data that if the model relies on spurious correlations or
from c might bias the model towards a certain la- heuristics, these might show up in the representa-
bel and hurt performance on D. Conversely, if c tion and get clustered together. Augmenting these
and D are from different domains but share some clusters would force the model to learn more robust
common concepts, we would expect a low IC, as features or strategies.
training on more data from c would not confuse the
model on D. A negative IC indicates that augment- Task-based + label information We use the
ing c actually improves performance on D, which same representation as task-based clustering, but
could happen if D is small and the model has not with the constraint that all examples in a cluster
saturated it yet, or if there is some domain shift must have the same label (similar to Sohoni et al.
between Dtest and Dtrain which augmentation helps (2020)). While this creates clusters that are clearly
to bridge. label-imbalanced, we expect that examples close
in the target representation will also tend to have
Aggregate statistics To summarize, GC mea- the same label, and thus this clustering technique
sures whether a cluster benefits from more data, should yield clusters with very low or very high
while IC measures whether augmenting that cluster error rate (the latter are good candidates as chal-
would hurt performance on the original dataset. We lenging clusters).
(a) (b) (c)

Figure 2: Example illustration of cluster results on binary classification from different clustering methods. Data
points from binary categories are identified by dots and squares. Errors are shown in red. (a) Agnostic clustering
where positive and negative data points are mixed together; (b) Task-based clustering where most points of one
category are located at one side of the decision boundary of model M (being separable by M) and positive/negative
points are mixed in clusters; (c) Task-based clustering + label information: besides being separable, data points with
the same label can be clustered together.

Selecting clusters for augmentation Given a RoBERTa-large models for MNLI and QQP on the
budget of k clusters we can augment, we evaluate official training corpora released in GLUE bench-
the clustering representations using the aggregate mark to match the best Transformer performance.2
GC and IC statistics of their top-k clusters ranked They are regarded as the target model M in each
by error rate, resulting a set of clusters Ck . In other task. We randomly divide the validation data into
words, we choose a representation that yields the two half sets: a dev set, used for automatic sub-
most augmentable clusters without hurting overall group discovery, and a devtest set, used exclusively
performance, as formalized in Equation 3. for evaluation. Therefore, SST has dev size of 436,
MNLI dev has size of 4,908, and QQP has dev
Ck∗ = arg max[GC(Ck ) − IC(Ck )] (3) size of 20,215. We run each experiment five times
Ck with different random seeds and report the average
2.3 Subgroup Augmentation with LLMs scores.
In order to augment those top challenging clusters 3.1 Automatic Subgroup Discovery
Ck∗ , we follow the work of Khani and Ribeiro (2023)
We conduct clustering methods on the dev set of
to use GPT-3 to create similar in-cluster examples,
each task. We assign the closest cluster to each
with a human in the loop to provide labels. We
instance in the devtest set, such that each cluster
finetune a small local model on each cluster’s data
in dev has an aligned counterpart for evaluation.
and use the disagreement between that model and
We run each clustering method five times using
the current version of M′ to rank GPT-3 generated
different random seeds and select the clustering
examples, stopping the process once the current
results with the best Silhouette scores (Rousseeuw,
version of the cluster’s model mostly agrees with
1987).
the current version of M′ . Intuitively, when M′
and the cluster’s model converge on cluster data, Comparison of clustering representations We
M′ has learned to generalize to the data in this clus- present the error rates of discovered clusters for
ter (thus fulfilling the requirment of GC), and the SST and MNLI in Figures 3 and 4. For both tasks,
original D used when updating M′ should prevent errors were randomly distributed accross clusters
high interference. produced by agnostic clustering, which indicates
that the clusters are not aligned with model behav-
3 Experiments iors and weaknesses, as also confirmed by the low
Setup We evaluate the effectiveness of TDG GC and IC scores. In contrast, task-based clus-
on three tasks from the GLUE benchmark: The tering (with or without label information) results
Stanford Sentiment Treebank (SST), MultiNLI in a large contingent of clusters with zero or few
Matched (MNLI-m) and Quora Question Pairs 2
Following Bowman et al. (2015); Yanaka et al. (2019),
(QQP). We train a bert-base model for SST and we use the binarized version of MNLI
(a) Agnostic clustering; GC=0.0064; (b) Task-based clustering; GC=0.011; (c) Task-based + label information;
IC=0.0000 IC=-0.0002 GC=0.1319, IC=0.19298

Figure 3: Error distribution of clusters obtained from three clustering methods on SST. Cluster number k=35. For
random clustering: GC=-0.0010, IC=0.0000

(a) Agnostic clustering; GC=0.0013; (b) Task-based clustering; GC=0.028; (c) Task-based + label information;
IC=0.0011 IC=-0.0017 GC=0.0434, IC=0.0023

Figure 4: Error distribution of clusters obtained from three clustering methods on MNLI. Cluster number k=100.
For random clustering: GC=-0.0007, IC=0.0002

errors (i.e. most successes are clustered together), as duplicate. In this case, TDG correctly identifies
and a few clusters with higher error rates. Using a case where subgroup data augmentation is un-
label information yields clusters of either all errors likely to be effective, and other solutions (e.g. data
or all successes, which results in high Generaliza- cleaning) should be pursued. We do not report any
tion in Context scores, but also high Interference in QQP results from now on.
Context scores. Both are likely due to label imbal-
ance, as we would expect such scores from simply 3.2 Subgroup Augmentation with LLMs
shifting the likelihood of predicting the cluster la- Based on the high-GC and low-IC clusters discov-
bel. This analysis thus indicates that task-based ered in previous step, we conduct augmentation
clustering without labels yields the clusters that targeted on those clusters with large language mod-
are most amenable to augmentation, since clusters els with human in the loop.
have positive generalization and near-zero interfer-
ence scores. We use these clusters in subsequent Human Participants We recruited 12 users to
results. label GPT-3 generated data in the subgroup aug-
mentation step. All users are from academia or
QQP All clusterings on QQP (not shown) had industry (with IRB approval) and have experience
very high interference scores, and thus were not working with AI-based natural language genera-
deemed suitable for augmentation by TDG. Indeed, tion systems (e.g. GPT-3). Each user was assigned
when we piloted data augmentation procedures on a high-error cluster discovered in the automatic
these clusters, we saw no tangible benefits. Man- subgroup discovery step (2 from SST and 10 from
ual inspection of clusters indicates that QQP has MNLI), and asked to label GPT-3 generations. We
high label noise (which would explain interfer- use the original sentences from the cluster as the
ence), such that pairs with the same phenomena initial prompt. Sentences that users labeled differ-
are often labeled differently, e.g. the pair (“What ently from the model’s prediction were added to the
makes life worth living?”, “Is life worth it?”) is augmented set. We allocated 90 minutes for user
labeled as not-duplicate, while (“Why is Deadpool labeling, with more information in the Appendix
so overrated”, “Is Deadpool overrated”) is labeled 9.1.
Baselines We compare TDG to the following ablations, as the average in-cluster accuracy has
previous works that aim to improve subgroup per- been increased from 81.45% to 83.60% on SST
formance: (1) Reweighing (Sohoni et al., 2020), and from 60.57% to 65.03% on MNLI, which is
which addresses hidden stratification caused by higher than any baseline models. Additionally, we
dataset imbalance by optimizing the per-cluster also observed that adding TDG data from all clus-
worst-case performance. In our experiments, we ters can improve all clusters by an average of 4.28%
use the same Group Distributionally Robust Op- (from 60.57% to 64.85%) on MNLI and an average
timization (GDRO) introduced in their work on of 1.55% (from 81.45% to 83.00%) on SST, which
each cluster as the fine-tuning objective. (2) Para- is also higher than all baseline models. Note that
phrasing where we use Parrot (Damodaran, 2021), the accuracy of every single cluster in TDG(all) is
a T5-based paraphrase model, to generate similar better than the target model. For some challenging
examples of data points in clusters as an augmenta- clusters, augmentation on their own (TDG(single))
tion. The size of the final fine-tune set is the same may yield better results, due to potential interfer-
as TDG for a fair comparison. ence between clusters (see Appendix 9.2 for more
details).
One cluster at a time v.s. simultaneous aug-
mentation Each participant augmented a single Improvement in overall devtest We observed
cluster, and we report these results as TDG(single), an improvement in overall performance on the de-
noting that for these we only measure in-cluster vtest set with TDG(all), with an increase of 0.55%
performance. We further pool the data from all on SST and 0.16% on MNLI. This suggests that
participants (TDG(all)) to test the improvements improving challenging clusters has the potential to
on each cluster as well as performance on the over- improve the model at a global level, while neither
all test set (devtest). In each experiment, in order baselines were able to achieve this. We notice the
to avoid the issue of catastrophic forgetting (Mc- improvement on the devtest set is not as significant
Closkey and Cohen, 1989), we randomly sampled as the improvement on individual low-performed
training data with the same frequency as TDG aug- groups. This is likely due to the fact that these
mented data in the fine-tuning process3 . vulnerable groups are usually minorities and their
representation in the devtest set is small (e.g., the
SST average size of the 10 clusters in MNLI experiment
Model is just 88 whereas the devtest has size of 4,908),
1st 2nd Avg Cluster devtest
diluting the impact of the improvement.
BERT-base 81.74 81.13 81.45 93.77
Ablation Analysis We evaluate the following
Reweighing 78.7 82.03 80.37 93.49
Paraphrasing 77.61 82.42 80.02 92.26 variations of TDG to test the effectiveness of each
step:
TDG (single) 83.8 83.39 83.60 -
TDG (all) 82.61 83.39 83.00 94.32 • Automatic Subgroup Discovery Only in
which the fine-tuning data is created by using
Table 1: Accuracy of TDG v.s. baselines tested on top-2
the same clusters as TDG but without augmen-
error clusters and left-out devtest set of SST. BERT-base
is the target model M.
tation and adding the same number of random
samples from the training data, to test the error
discovery step.
Improvement in challenging subgroups Table 1
• Subgroup Augmentation with LLM Only
and Table 2 show the results of all baselines, as well
in which the fine-tuning data is created by us-
as TDG(single) and the aggregated TDG(all), on
ing n random samples from the dev set (n is
the SST and MNLI tasks, respectively. For both
the number of total sentences in challenging
tasks, augmenting individual clusters with TDG
clusters used in TDG) and applying subgroup
tends to be more effective than all baselines and
augmentation with GPT-3, to test the effective-
3
In MNLI experiment, due to the high interference among ness of the augmentation. Augmentation ends
clusters, we adjust the weights of training samples and col-
lected responses when combining all data points for TDG(all)
once the same number of augmented data as
in fine-tuning (i.e., we set portions of original samples:user TDG is reached.
responses = 2:1). In SST, all responses are combined without
any adjustment. We see that fine-tuning with clusters alone can
MNLI
Model
1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th Avg Cluster devtest
RoBERTa-Large 51.85 53.57 53.85 54.84 55.56 58.82 65.71 66.56 68.75 76.19 60.57 93.46
Reweighing 51.85 53.57 30.77 58.06 55.56 58.82 68.57 65.91 68.75 73.81 58.57 93.46
Paraphrasing 51.85 42.86 53.85 54.84 44.44 58.82 65.71 65.91 68.75 26.19 53.32 86.45
TDG (single) 51.85 53.57 61.54 67.74 66.67 64.71 65.71 75.68 66.67 76.19 65.03 -
TDG (all) 59.26 53.57 64.28 61.29 55.56 64.71 74.28 68.18 68.75 78.57 64.85 93.62

Table 2: Accuracy of different models tested on top-10 high-error clusters and left-out devtest set of MNLI.

Model
SST 4 Related Work
1st 2nd Avg Cluster devtest
BERT-base 81.74 81.13 81.45 93.77 Recent research in machine learning has focused
Automatic Subgroup on enhancing the robust performance of models by
78.70 82.20 80.45 93.89
Discovery only identifying challenging subgroups and improving
Subgroup Augmentation
with LLM only
79.42 78.42 78.91 93.17 their performance.
TDG (single) 83.80 83.39 83.60 - Discovering Challenging Subgroups Several stud-
TDG (all) 82.61 83.39 83.00 94.32 ies, such as d’Eon et al. (2022) and Rajani et al.
(2022), focus on identifying challenging subgroups
Table 3: Accuracy of different ablations of TDG on
in the data. However, these works primarily fo-
top-2 high-error clusters in SST. BERT-base is the target
model M. cus on discovering general low-performing regions
in embedding space and do not address strategies
for improving these regions. In contrast, our work
aims to identify challenging subgroups that are also
improve performance on certain clusters when the amenable to improvement through data augmenta-
size is sufficient (e.g., 2nd in SST), but it can also tion using language models.
lead to over-fitting and reduced performance (e.g., Improving Performance of Known Subgroups
1st in SST). Additionally, subgroup augmentation Other studies, such as Thakur et al. (2021); Yoo
on randomly sampled clusters results in a decrease et al. (2021); He et al. (2021), focus on augmenting
in performance not only in low-performing areas, data from known subgroups or patterns. However,
but also overall on the devtest set. Without the au- it can be challenging to apply these methods in
tomatic subgroup discovery, the GPT-3 augmented scenarios where the challenging subgroups are not
sentences may introduce more noise rather than known a priori. Another stream of work focuses
benefits, which verifies the bottleneck of previous on model testing and debugging, which involves
work (Ribeiro and Lundberg, 2022) and empha- creating human-generated data points and testing
sizes the importance of the automatic subrgoup them on the model. Methods such as CheckList
discovery. (Ribeiro et al., 2020) and DynaBench (Kiela et al.,
2021) generate test cases from pre-defined topics
and templates, while AdaTest (Ribeiro and Lund-
Interpretation of low-performed groups In this berg, 2022) uses pre-trained language models to
section, we present some examples from the high- generate more tests that are similar to the human-
error groups discovered in automatic subgroup dis- created examples. Although these methods show
covery. We also provide readable interpretations promising results in improving the performance
for the clusters as shown in Table 4. Our automatic of challenging subgroups, it is not clear how to
subgroup discovery is able to identify meaningful provide the first data points from a challenging sub-
errors, such as mis-identifying the dominant sen- group. Finding such data points was the main focus
timent from a mixture of sentiments in SST, or of our work, where we showed how to find data
errors related to different language tones in MNLI. points that are suitable for further augmentation.
Furthermore, we also notice complex patterns in Model-based Approaches Another approach for
reasoning is identified, such as Factivity and Mono- enhancing the performance of challenging sub-
tonicity, which are recognized challenges in Super- groups is to develop new training strategies.
GLUE Diagnostic tasks. Sagawa et al. (2019) minimize the worst group ac-
Cluster: Having multiple sentiments and one is dominating than the rest Label Prediction
On the heels of the ring comes a similarly morose and humorless horror movie that,
positive negative
SST although flawed , is to be commended for its straight-ahead approach to creepiness .
Another one of those estrogen overdose movies like "divine secrets of the ya ya sisterhood ”
positive negative
except that the writing , acting and character development are a lot better .
Cluster: Having same meaning. Formal Tone v.s. Casual Tone Label Prediction
Sentence1: Do you think I should be concerned? not
entailment
Sentence2: Do you think it is a problem enatilment
Sentence1: He seemed too self-assured. not
entailment
Sentence2: He is very cocky enatilment
Cluster: One v.s. All Label Prediction
Sentence1: Pray be seated, mademoiselle. not
entailment
Sentence2: Please, everyone be seated. enatilment
MNLI
Sentence1: Similar conclusions have been reached by legal studies in a dozen states including Florida. not
entailment
Sentence2: Similar conclusions have been seen across the world. enatilment
Cluster: Suspicion v.s. Fact Label Prediction
Sentence1: The analysis also addresses the various alternatives to the final rule which were considered,
including differing compliance or reporting requirements, use of performance rather than design standards, not
entailment
and an exemption for small entities from coverage of the rule. enatilment
Sentence2: The rule is subject to change."
Sentence1: In the depths of the Cold War, many Americans suspected Communists
not
had infiltrated Washington and were about to subvert our democracy. entailment
enatilment
Sentence2: Communists infiltrated Washington during the Cold War.

Table 4: Interpretation about discovered high-error clusters. Each cluster is shown with two errors.

curacy when subgroups are known a priori, Khani efficient than previous HITL-based methods that
et al. (2019) add variance of loss to the optimiza- either require domain experts or require more ex-
tion function, and Liu et al. (2021) train the model tensive human input. In this paper, we purpose-
twice, one with every data point and once more fully chose state-of-the-art (SOTA) models that
with the ones that have high losses. Sohoni et al. are already very good. However, our work shows
(2020) discovered subgroups and then change the that even such models still exhibit coherent lower-
training function to improve the accuracy. Chang- performance groups that can be further improved
ing the training function usually improves the accu- with targeted data collection.
racy of challenging subgroups, but at the expense
of decreasing accuracy in other subgroups or the 5 Conclusion
overall accuracy. In contrast, our work increases
In this work, we presented a thorough analysis
the performance of challenging subgroups while
of error distribution among different groups and
also increasing the overall accuracy.
introduced Targeted Data Generation (TDG), a
Data Augmentation with Human-in-The-Loop framework that automatically identifies challenging
Recent works note that Human-in-The-Loop groups that are amenable to improvement through
(HITL) based augmentation offers unique bene- data augmentation using large language models
fits over automatic data augmentation, such as ad- (LLMs) without negatively impacting overall accu-
dressing dataset design flaws (Fanton et al., 2021), racy. Our experiments with state-of-the-art mod-
improving performance for minority groups (Sri- els demonstrate that TDG is able to improve in-
vastava et al., 2020), and avoiding syntactic and group performance by 2-13% while also increas-
semantic distortions in the text (Anaby-Tavor et al., ing overall accuracy. Furthermore, TDG was able
2020). to improve performance for every single selected
We want to point out that TDG is orthogonal cluster without interference, indicating its potential
to non-HITL augmentation (i.e. they can be used as a reliable approach for a new data collection
together). In addition, TDG’s use of LLM to gen- framework. As LLMs continue to advance and are
erate augmentations for specific data groups helps trained on more diverse and large corpora, TDG
reduce the human effort – TDG only requires min- represents a promising approach for addressing the
imal human effort for validation, making it more weaknesses of simpler models.
6 Ethic Considerations interface. We are appreciative of the insightful
suggestions provided by folks from the Microsoft
In this paper, we propose a method for automat-
Office of Applied Research. Special thanks go to
ically identifying groups of data that are under-
Brent Hecht, Aaron Halfaker, and Yujin Kim for
performing due to a lack of training examples.
their generous contributions of time and support in
It is important to note that these underperform-
our user studies. We would also like to express our
ing groups may be related to marginalized demo-
thanks to all the participants from the University of
graphic groups, which may be underrepresented in
California San Diego for their active involvement
the data. By identifying these groups, our work is
in the user studies.
able to reveal potential discriminatory behaviors
in NLP models and facilitate bias mitigation by
augmenting these underrepresented groups. How- References
ever, there is also the risk that malicious actors may
exploit this information and create adversarial ex- Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich,
amples that further bias the model. To address this Amir Kantor, George Kour, Segev Shlomov, Naama
Tepper, and Naama Zwerdling. 2020. Do not have
concern, we suggest involving the user audience or enough data? deep learning to the rescue! In Pro-
implementing fairness regulations in the interactive ceedings of the AAAI Conference on Artificial Intelli-
procedure to prevent such behaviors. Finally, it’s gence, volume 34, pages 7383–7390.
worth noting that our model relies heavily on large
Samuel R. Bowman, Gabor Angeli, Christopher Potts,
language models to improve the performance of and Christopher D. Manning. 2015. A large anno-
challenging groups as a result if some groups are tated corpus for learning natural language inference.
not represented in LLMs our method is unable to In Proceedings of the 2015 Conference on Empiri-
increase their performance. cal Methods in Natural Language Processing, pages
632–642, Lisbon, Portugal. Association for Compu-
7 Limitations tational Linguistics.

One limitation of our approach is that we aggre- Tom B Brown, Benjamin Mann, Nick Ryder, Melanie
gated IC and GC measurements over clusters dur- Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
ing the automatic subgroup discovery process, but Askell, et al. 2020. Language models are
we did not fully consider the relationships between few-shot learners. arxiv 2020. arXiv preprint
clusters. A more comprehensive strategy for uti- arXiv:2005.14165, 4.
lizing beneficial relationships and a more precise
approach to potential conflicts between clusters Prithiviraj Damodaran. 2021. Parrot: Paraphrase gener-
ation for nlu.
could lead to further improvements in overall per-
formance. Additionally, our MNLI experiments Greg d’Eon, Jason d’Eon, James R Wright, and Kevin
were conducted on large dataset that had multiple Leyton-Brown. 2022. The spotlight: A general
clusters with errors. We chose to focus on the top- method for discovering systematic errors in deep
10 clusters with the most errors due to limitations learning models. In 2022 ACM Conference on Fair-
ness, Accountability, and Transparency, pages 1962–
in resources for running a user study. While TDG 1981.
on top-K clusters has demonstrated effectiveness in
improving performance, there is still the potential Margherita Fanton, Helena Bonaldi, Serra Sinem
for further improvements by working on a larger Tekiroğlu, and Marco Guerini. 2021. Human-in-the-
loop for data collection: a multi-target counter narra-
number of clusters. At the same time, we empha- tive dataset to fight online hate speech. In Proceed-
size that TDG should be used as the last step to ings of the 59th Annual Meeting of the Association for
improve performance in low-performing groups Computational Linguistics and the 11th International
(clusters with high errors). If these groups are nu- Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 3226–3240, Online.
merous, it means the model is likely under-trained,
Association for Computational Linguistics.
and other techniques (e.g. better data/modeling)
should be applied first. Zexue He, Bodhisattwa Prasad Majumder, and Julian
McAuley. 2021. Detect and perturb: Neutral rewrit-
8 Acknowledgements ing of biased and sensitive text via gradient-based
decoding. In Findings of the Association for Com-
We would like to thank Scott Lundberg for his kind putational Linguistics: EMNLP 2021, pages 4173–
assistance in designing and implementing the user 4181.
Fereshte Khani, Aditi Raghunathan, and Percy Liang. Nimit Sohoni, Jared Dunnmon, Geoffrey Angus, Al-
2019. Maximum weighted loss discrepancy. arXiv bert Gu, and Christopher Ré. 2020. No subclass left
preprint arXiv:1906.03518. behind: Fine-grained robustness in coarse-grained
classification problems. Advances in Neural Infor-
Fereshte Khani and Marco Tulio Ribeiro. 2023. Collab- mation Processing Systems, 33:19339–19352.
orative development of nlp models. arXiv preprint
arXiv:2305.12219. Megha Srivastava, Tatsunori Hashimoto, and Percy
Liang. 2020. Robustness to spurious correlations
Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh via human annotations. In International Conference
Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vid- on Machine Learning, pages 9109–9119. PMLR.
gen, Grusha Prasad, Amanpreet Singh, Pratik Ring-
shia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Chloe Rose Stuart-Ulin. 2018. Microsoft’s politically
Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit correct chatbot is even worse than its racist one.
Bansal, Christopher Potts, and Adina Williams. 2021. Quartz Ideas, 31.
Dynabench: Rethinking benchmarking in NLP. In
Proceedings of the 2021 Conference of the North Nandan Thakur, Nils Reimers, Johannes Daxenberger,
American Chapter of the Association for Computa- and Iryna Gurevych. 2021. Augmented sbert: Data
tional Linguistics: Human Language Technologies, augmentation method for improving bi-encoders for
pages 4110–4124, Online. Association for Computa- pairwise sentence scoring tasks. In Proceedings of
tional Linguistics. the 2021 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Human Language Technologies, pages 296–310.
Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy
Liang, and Chelsea Finn. 2021. Just train twice: Im- Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Ken-
proving group robustness without training group in- taro Inui, Satoshi Sekine, Lasha Abzianidze, and
formation. In International Conference on Machine Johan Bos. 2019. Can neural networks understand
Learning, pages 6781–6792. PMLR. monotonicity reasoning? In Proceedings of the 2019
ACL Workshop BlackboxNLP: Analyzing and Inter-
Michael McCloskey and Neal J Cohen. 1989. Catas-
preting Neural Networks for NLP, pages 31–40, Flo-
trophic interference in connectionist networks: The
rence, Italy. Association for Computational Linguis-
sequential learning problem. In Psychology of learn-
tics.
ing and motivation, volume 24, pages 109–165. Else-
vier. Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-
Nazneen Rajani, Weixin Liang, Lingjiao Chen, Meg Woo Lee, and Woomyeong Park. 2021. Gpt3mix:
Mitchell, and James Zou. 2022. Seal: Interactive Leveraging large-scale language models for text aug-
tool for systematic error analysis and labeling. arXiv mentation. arXiv preprint arXiv:2104.08826.
preprint arXiv:2210.05839.
Nils Reimers and Iryna Gurevych. 2019. Sentence-bert:
Sentence embeddings using siamese bert-networks.
In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing. Associa-
tion for Computational Linguistics.
Marco Tulio Ribeiro and Scott Lundberg. 2022. Adap-
tive testing and debugging of NLP models. In Pro-
ceedings of the 60th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1: Long
Papers), pages 3253–3267, Dublin, Ireland. Associa-
tion for Computational Linguistics.
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin,
and Sameer Singh. 2020. Beyond accuracy: Behav-
ioral testing of nlp models with checklist. In Proceed-
ings of the 58th Annual Meeting of the Association
for Computational Linguistics, pages 4902–4912.
Peter J Rousseeuw. 1987. Silhouettes: a graphical aid
to the interpretation and validation of cluster analysis.
Journal of computational and applied mathematics,
20:53–65.
Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto,
and Percy Liang. 2019. Distributionally robust neural
networks. In International Conference on Learning
Representations.
9 Appendix model prediction (i.e. the bar under the “Creative”
turns red), add it.
9.1 Human-In-The-Loop Details We ask that each user clicks on the “Update
User Interface The goal of our user study is to global” button at least once during their study ses-
find bugs in the target model. To find bugs easier, sion to ensure that they continue to find meaningful
we provide the following user interface to our users, bugs in the updated model.
as shown in Figure 5. The interface is linked with
the back-end global and local models. 9.2 Analysis on Relationships Between
The UI enables the following actions through the Clusters
bottoms: We observe that sometimes fine-tuning the model
with TDG(all) augmented data on individual clus-
• Suggest: click to use the current sentence list ters can lead to improved performance on certain
as a prompt for GPT-3 to generate similar ex- clusters and worse performance on others. This
amples; suggests that there may be relationships between
clusters, such as mutual benefit or conflict.
• Add: allows users to add a sentence from the
One conjecture is data points may have multiple
generated examples to the current list;
patterns shared with different sentences, therefore,
• Update global: trains the global model us- belonging to multiple clusters. Each individual
ing the concatenation of a random sample of TDG is just working on one of them. Combining
sentences from training set and sentences in and fine-tuning together can cumulative the perfor-
current list; mance. For example, MNLI example “S1: Pray
be seated, mademoiselle. S2: Please, everyone be
• Update local: trains the local model using the seated.” can have both the patterns of the cross-
sentences in the current list, lingual entailment and the monotonicity. Another
conjecture for conflicting clusters is that the pat-
• Creative: indicates whether the local and terns within one cluster may be contradictory to
global models make different decisions. A those in another cluster. For example, in sentiment
red color indicates disagreement while green classification, sentences mentioning “American” in
indicates no disagreement. technology topics may conflict with sentences men-
tioning “American” in international relationship
• Rename: Users can rename their clusters to
topics. Such conflicts may be solved by simply
an interpretable name if they’d like to.
adding similar examples. Therefore, fine-tuning
In Figure 6, we show an example of adding a sen- these conflicting clusters together may negatively
tence to a subcluster and renaming it. impact the performance of one or both clusters.

User Study Introduction Our user study con-

sists of two parts. In the first part, users will read
the initial sentences displayed on the user inter-
face, which are the clustering results from the TDG
automatic subgroup discovery stage. They can fur-
ther categorize them into smaller sub-clusters if
they notice finer-grained groups within the current
cluster.
In the second part, users can add more bugs to
the cluster or sub-cluster by first clicking on the
“Suggest” button to request GPT-3 to generate more
similar examples. They will then review the sug-
gestions and add valid examples according to the
following criteria: (1) if the local model’s predic-
tion is incorrect (i.e. the text after “should be” is
wrong), correct it and add it; or (2) if the global
model’s prediction differs from the correct local
Figure 5: User interface used in our user study
Figure 6: Examples of potential operations.

Paths Over Graph Knowledge Graph Empowered LLM Reasoning
No ratings yet
Paths Over Graph Knowledge Graph Empowered LLM Reasoning
18 pages
Simple Is Effective Role of Graphs and Llms in KG Retrieval Augmented Generation
No ratings yet
Simple Is Effective Role of Graphs and Llms in KG Retrieval Augmented Generation
29 pages
Challenges and Applications of Large Language Models: Desi GN Behavior
No ratings yet
Challenges and Applications of Large Language Models: Desi GN Behavior
72 pages
Llms 1 15
No ratings yet
Llms 1 15
15 pages
Report-Machine-Learning-101 - 1 32
No ratings yet
Report-Machine-Learning-101 - 1 32
1 page
Softcot: Soft Chain-Of-Thought For Efficient Reasoning With Llms
No ratings yet
Softcot: Soft Chain-Of-Thought For Efficient Reasoning With Llms
13 pages
Thudi2025 - MixMin Finding Data Mixtures Via Convex Minimizat
No ratings yet
Thudi2025 - MixMin Finding Data Mixtures Via Convex Minimizat
17 pages
Stochastic Backpropagation and Approximate Inference in Deep Generative Models
No ratings yet
Stochastic Backpropagation and Approximate Inference in Deep Generative Models
14 pages
Exploring The Impact of Prompt Engineering On Large
No ratings yet
Exploring The Impact of Prompt Engineering On Large
15 pages
2017 Effective Statistical Methods For Big Data Analytics
No ratings yet
2017 Effective Statistical Methods For Big Data Analytics
20 pages
Survey of Graph Retrieval Augmented Generation For Customized Llms
No ratings yet
Survey of Graph Retrieval Augmented Generation For Customized Llms
27 pages
Challenges in The Deployment and Operation of Machine Learning in Practice
No ratings yet
Challenges in The Deployment and Operation of Machine Learning in Practice
15 pages
20
No ratings yet
20
19 pages
Generating Textual Data with GENERE
No ratings yet
Generating Textual Data with GENERE
9 pages
Literature Review 1. Computational Methods For The Analysis of Learning and Knowledge Building Communities
No ratings yet
Literature Review 1. Computational Methods For The Analysis of Learning and Knowledge Building Communities
13 pages
Applsci 14 05975
No ratings yet
Applsci 14 05975
13 pages
Fine Tuning vs. Retrieval Augmented Generation For Less Popular Knowledge
No ratings yet
Fine Tuning vs. Retrieval Augmented Generation For Less Popular Knowledge
11 pages
Non-IID Data in FL-Survey With Taxonomy-Metrics
No ratings yet
Non-IID Data in FL-Survey With Taxonomy-Metrics
25 pages
Analysis of Continual Learning Models For Intrusio
No ratings yet
Analysis of Continual Learning Models For Intrusio
22 pages
L - G L L M: Earned Rule Augmented Eneration FOR Arge Anguage Odels
No ratings yet
L - G L L M: Earned Rule Augmented Eneration FOR Arge Anguage Odels
22 pages
Enterprise Benchmarks For Large Language Model Evaluation 4lxl1jzv6cpd
No ratings yet
Enterprise Benchmarks For Large Language Model Evaluation 4lxl1jzv6cpd
15 pages
Assignment m2 Machine Learning Final
No ratings yet
Assignment m2 Machine Learning Final
5 pages
Toward Goal Oriented LLM Prompting
No ratings yet
Toward Goal Oriented LLM Prompting
9 pages
HARE - HumAn Priors - Key To Small Language Model Efficiency
No ratings yet
HARE - HumAn Priors - Key To Small Language Model Efficiency
10 pages
Data Security in LLM
No ratings yet
Data Security in LLM
15 pages
End-To-End Generation of Multiple-Choice Questions Using Text-To-Text Transfer Transformer Models
No ratings yet
End-To-End Generation of Multiple-Choice Questions Using Text-To-Text Transfer Transformer Models
12 pages
Automatic Prompt Optimization Via Heuristic Search
No ratings yet
Automatic Prompt Optimization Via Heuristic Search
19 pages
Challenge Design Guidelines
No ratings yet
Challenge Design Guidelines
39 pages
Report Format
No ratings yet
Report Format
20 pages
Jiang 等 - 2024 - RAGraph a General Retrieval-Augmented Graph Learning Framework
No ratings yet
Jiang 等 - 2024 - RAGraph a General Retrieval-Augmented Graph Learning Framework
31 pages
Chapter Introduction
No ratings yet
Chapter Introduction
7 pages
Mathematics 13 02210
No ratings yet
Mathematics 13 02210
29 pages
CSE445 NSU Week - 1
No ratings yet
CSE445 NSU Week - 1
28 pages
Journal Pre-Proof: Data Science and Management
No ratings yet
Journal Pre-Proof: Data Science and Management
28 pages
Non-Autoregressive Generative Models For Reranking Recommendation
No ratings yet
Non-Autoregressive Generative Models For Reranking Recommendation
10 pages
Python Machine Learning Basics
No ratings yet
Python Machine Learning Basics
37 pages
Information Fusion: Sciencedirect
No ratings yet
Information Fusion: Sciencedirect
55 pages
Answer
No ratings yet
Answer
4 pages
Memorization Vs Generalization Quantifying Data Le
No ratings yet
Memorization Vs Generalization Quantifying Data Le
11 pages
AWS Cert AI Practitioner
No ratings yet
AWS Cert AI Practitioner
352 pages
Zoo Machine Learning
No ratings yet
Zoo Machine Learning
50 pages
DataGemma FullPaper
No ratings yet
DataGemma FullPaper
39 pages
2024 Acl-Long 778
No ratings yet
2024 Acl-Long 778
15 pages
Causal Interpretability For Machine Learning
No ratings yet
Causal Interpretability For Machine Learning
16 pages
2024.findings Emnlp.523
No ratings yet
2024.findings Emnlp.523
13 pages
01 Intro 2
No ratings yet
01 Intro 2
67 pages
Agarwal Does Data Repair Lead To Fair Models Curating Contextually Fair WACV 2022 Paper
No ratings yet
Agarwal Does Data Repair Lead To Fair Models Curating Contextually Fair WACV 2022 Paper
10 pages
Unit 6 Machine Learning Algorithms - AI CBSE
No ratings yet
Unit 6 Machine Learning Algorithms - AI CBSE
1 page
Application of Data Augmentation On Deep Learning
No ratings yet
Application of Data Augmentation On Deep Learning
13 pages
5.11 MLBasics-Challenges
No ratings yet
5.11 MLBasics-Challenges
20 pages
Managing The Unknown in Machine Learning Definitions Related - 2024 - Neurocom
No ratings yet
Managing The Unknown in Machine Learning Definitions Related - 2024 - Neurocom
19 pages
Author Name Title Paper/Submission ID Submitted by Submission Date Total Pages Document Type
No ratings yet
Author Name Title Paper/Submission ID Submitted by Submission Date Total Pages Document Type
8 pages
Machine-Assisted Quantitizing Designs - Augmenting Humanities and Social Sciences With Artificial Intelligence
No ratings yet
Machine-Assisted Quantitizing Designs - Augmenting Humanities and Social Sciences With Artificial Intelligence
46 pages
Privacy Amplification by Structured Subsampling For Deep Differentially Private Time Series Forecasting
No ratings yet
Privacy Amplification by Structured Subsampling For Deep Differentially Private Time Series Forecasting
56 pages
Modern Machine Learning Uses
No ratings yet
Modern Machine Learning Uses
11 pages
Lec 12
No ratings yet
Lec 12
15 pages
Hogwarts Sols
No ratings yet
Hogwarts Sols
8 pages
02 Ai Project Cycle Important Questions
No ratings yet
02 Ai Project Cycle Important Questions
24 pages
ML Mid1 Notes
No ratings yet
ML Mid1 Notes
45 pages
Essential Data Mining Concepts and Techniques
No ratings yet
Essential Data Mining Concepts and Techniques
13 pages
K-Means Clustering for Customer Segmentation
No ratings yet
K-Means Clustering for Customer Segmentation
22 pages
Paper Coca Cola
No ratings yet
Paper Coca Cola
11 pages
Digital Image Classification
No ratings yet
Digital Image Classification
33 pages
Machine Learning Exam Review Guide
No ratings yet
Machine Learning Exam Review Guide
32 pages
Interactive Machine Learning For Health Informatics
No ratings yet
Interactive Machine Learning For Health Informatics
13 pages
23ma1305 - Computational Statistics Multivariate Normal Distribution
No ratings yet
23ma1305 - Computational Statistics Multivariate Normal Distribution
17 pages
Analyzing Social Media's Impact On Younger Generations Using Unsupervised Learning
No ratings yet
Analyzing Social Media's Impact On Younger Generations Using Unsupervised Learning
17 pages
Data Science & Machine Learning Course
No ratings yet
Data Science & Machine Learning Course
12 pages
Machine Learning
100% (5)
Machine Learning
35 pages
Airbnb (Air Bed and Breakfast) Listing Analysis TH
No ratings yet
Airbnb (Air Bed and Breakfast) Listing Analysis TH
24 pages
Determining Clusters
No ratings yet
Determining Clusters
4 pages
Drift-Aware Methodology For Anomaly Detection in S
No ratings yet
Drift-Aware Methodology For Anomaly Detection in S
13 pages
Bacteria Colony Counter
No ratings yet
Bacteria Colony Counter
7 pages
MVS Notes
No ratings yet
MVS Notes
51 pages
Data Mining For CRM
No ratings yet
Data Mining For CRM
14 pages
IML-IITKGP - Assignment 8 Solution
No ratings yet
IML-IITKGP - Assignment 8 Solution
8 pages
Implementing K-Means Clustering Algorithm Using Mapreduce Paradigm
No ratings yet
Implementing K-Means Clustering Algorithm Using Mapreduce Paradigm
5 pages
Data Warehousing & Mining Overview
No ratings yet
Data Warehousing & Mining Overview
4 pages
Lab Manual Dbscan
No ratings yet
Lab Manual Dbscan
6 pages
Balanced K-Means Revisited-7
No ratings yet
Balanced K-Means Revisited-7
2 pages
A Comparative Analysis of Data Mining Methods and Hierarchical Linear Modeling Using Pisa 2018 Data
No ratings yet
A Comparative Analysis of Data Mining Methods and Hierarchical Linear Modeling Using Pisa 2018 Data
16 pages
(Ebook PDF) Introduction To Business Data Mining 1st Editioninstant Download
100% (4)
(Ebook PDF) Introduction To Business Data Mining 1st Editioninstant Download
44 pages
Cluster Analysis in Psychology
No ratings yet
Cluster Analysis in Psychology
55 pages
(Download PDF) Advanced Data Analytics Using Python With Architectural Patterns Text and Image Classification and Optimization Techniques 2Nd Edition Sayan Mukhopadhyay Full Chapter PDF
100% (23)
(Download PDF) Advanced Data Analytics Using Python With Architectural Patterns Text and Image Classification and Optimization Techniques 2Nd Edition Sayan Mukhopadhyay Full Chapter PDF
70 pages
AI Lab Manual
No ratings yet
AI Lab Manual
25 pages
RapidMiner Minibook
No ratings yet
RapidMiner Minibook
121 pages
Segmentation of Stock Trading Customers According To Potential Value
No ratings yet
Segmentation of Stock Trading Customers According To Potential Value
7 pages
Restructured Syllabi For M. Sc. Tech. Course
No ratings yet
Restructured Syllabi For M. Sc. Tech. Course
30 pages
Data Preprocessing in Machine Learning
No ratings yet
Data Preprocessing in Machine Learning
35 pages

Targeted Data Generation: Finding and Fixing Model Weaknesses

Uploaded by

Targeted Data Generation: Finding and Fixing Model Weaknesses

Uploaded by

Targeted Data Generation: Finding and Fixing Model Weaknesses

Zexue He∗ Marco Tulio Ribeiro Fereshte Khani

Abstract even if data collection is adversarial (Kiela et al.,

helps little in addressing these weaknesses. Tools

User Study Introduction Our user study con-

You might also like