Targeted Data Generation: Finding and Fixing Model Weaknesses
Figure 1: Illustration of the Targeted Data Generation (TDG) pipeline. In the automatic subgroup discovery stage,
TDG identifies challenging clusters that can benefit from additional data while minimizing potential negative impacts
on performance in other regions (i.e., high generalization (GC) and low interference (IC), as defined in Section 2.1).
In the subgroup augmentation with LLM stage, TDG utilizes GPT-3 to generate additional examples for identified
challenging clusters.
indicates label noise would make data augmentation ineffective). Finally, augmenting these clusters with GPT-3 results in significant improvements on the corresponding test clusters, and also small improvements in overall accuracy.

2 Targeted Data Generation

Let M be a target model trained on a training dataset D_train, and let D_test be a held-out test dataset. We assume access to a validation dataset D_val, which we use to identify and evaluate challenging subgroups. We cluster D_val into k disjoint clusters, C = {c_1, c_2, ..., c_k}, using some clustering technique (we explore various options in Section 2.2, and drop the subscript when talking about a single cluster, for clarity). We divide D_val randomly into two halves, so that each cluster is divided into c_train and c_test (c_val can be further divided from c_train if necessary), to simulate the effect of data augmentation and its impact on the same subgroup. We say a cluster c is a challenging cluster if the target model M performs much worse on it than on the overall validation dataset, i.e., Acc(M, c_train ∪ c_val) << Acc(M, D_val).

Given a challenging cluster c, our goal is to identify whether it is amenable to data augmentation, i.e., whether more data would generalize and improve performance on c_test without hurting performance on D_test.

2.1 Generalization and Interference, in Context

Given the context of (D_train, M) and a target cluster c, we obtain a new model M′ by training on a mixture of D_train and c_train (following Ribeiro and Lundberg (2022)), which effectively upweights examples from c as a surrogate for data augmentation. We use two statistics to evaluate whether c is amenable to data augmentation: Generalization in Context (GC) and Interference in Context (IC).

Definition 2.1 (Generalization in Context). We say a cluster c generalizes in the context of the current model M and dataset D if more training on it leads to better performance on hidden examples from the same cluster. Formally, we define Generalization in Context (GC) as

GC(c) = Acc(M′, c_val) − Acc(M, c_val)

GC measures how much the target model can learn from more data from the cluster, and whether that learning transfers to unseen data from the same cluster. A high GC indicates that the cluster is challenging but not hopeless, and that data augmentation could help improve performance. A low GC indicates that the cluster is either already saturated by existing data or too hard for the model to learn, such that more data from the cluster does not help. For example, if the clustering is random, we would expect a low GC, as training on a random subset of data would not improve performance on another random subset. Conversely, if the clustering is based on some meaningful feature that the model struggles with (such as club reviews (Rajani et al., 2022)), we would expect a high GC, as training on more data from the cluster would help the model overcome its weakness.

Definition 2.2 (Interference in Context). We say a cluster c interferes with the original data if augmenting it leads to worse performance on the original data. We could similarly evaluate interference with other clusters, but for now we restrict ourselves to having the original model and dataset as the context. Formally, we define Interference in Context (IC) as

IC(c) = Acc(M, D_val) − Acc(M′, D_val)

A high IC indicates that the cluster is incompatible with the original data, and that data augmentation would degrade overall performance. A low IC indicates that the cluster is either similar to the original data, or sufficiently different but not conflicting, such that data augmentation would not hurt overall performance. For example, if c is label-imbalanced and D is label-balanced, we would expect a high IC, as training on more data from c might bias the model towards a certain label and hurt performance on D. Conversely, if c and D are from different domains but share some common concepts, we would expect a low IC, as training on more data from c would not confuse the model on D. A negative IC indicates that augmenting c actually improves performance on D, which could happen if D is small and the model has not saturated it yet, or if there is some domain shift between D_test and D_train which augmentation helps to bridge.

Aggregate statistics To summarize, GC measures whether a cluster benefits from more data, while IC measures whether augmenting that cluster would hurt performance on the original dataset. We aggregate GC and IC over all clusters by taking the average:

GC(C) = (1/k) Σ_{i=1}^{k} GC(c_i)    (1)

IC(C) = (1/k) Σ_{i=1}^{k} IC(c_i)    (2)
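To make the two statistics concrete, the following is a minimal Python sketch of GC, IC, and their aggregates (Equations 1-2); the accuracy function and the retrained model M′ are assumed to come from the surrounding training pipeline and are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of the GC/IC statistics (Section 2.1). `accuracy`,
# `model` (M), and `model_prime` (M') are hypothetical stand-ins for
# the actual training/evaluation pipeline.

def gc(accuracy, model, model_prime, c_val):
    """Generalization in Context: accuracy gain on held-out cluster data."""
    return accuracy(model_prime, c_val) - accuracy(model, c_val)

def ic(accuracy, model, model_prime, d_val):
    """Interference in Context: accuracy drop on the original validation set."""
    return accuracy(model, d_val) - accuracy(model_prime, d_val)

def aggregate(per_cluster_scores):
    """Equations (1)-(2): average GC or IC over all k clusters."""
    return sum(per_cluster_scores) / len(per_cluster_scores)
```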
2.2 Automatic Subgroup Discovery

We use different representation spaces for clustering, using increasing amounts of information about the task, the model, and the labels. An example is shown in Figure 2.

Agnostic clustering We do not use any information about the task, the model, or the labels, and instead use general-purpose embeddings, such as the embeddings extracted from Sentence-BERT as implemented in sentence-transformers (Reimers and Gurevych, 2019), to cluster the validation data. This kind of representation might capture patterns that the target model cannot currently represent well, such that augmenting these clusters would teach the target model new concepts or relations.

Task-based clustering We use the target model's own representation from the second-to-last layer to cluster the validation data. This kind of representation reflects how the target model perceives the data, and might group together examples that the model considers similar or difficult. We expect that if the model relies on spurious correlations or heuristics, these might show up in the representation and get clustered together. Augmenting these clusters would force the model to learn more robust features or strategies.

Task-based + label information We use the same representation as task-based clustering, but with the constraint that all examples in a cluster must have the same label (similar to Sohoni et al. (2020)). While this creates clusters that are clearly label-imbalanced, we expect that examples close in the target representation will also tend to have the same label, and thus this clustering technique should yield clusters with very low or very high error rates (the latter are good candidates as challenging clusters).
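As an illustration of the first two representation choices, here is a sketch using off-the-shelf libraries; the checkpoint names and k-means settings are our assumptions, not the authors' exact configuration.

```python
# Sketch of agnostic vs. task-based clustering (Section 2.2).
# Checkpoint names are illustrative assumptions.
import torch
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

def agnostic_clusters(texts, k):
    # General-purpose embeddings: no task, model, or label information.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    return KMeans(n_clusters=k, n_init=10).fit_predict(encoder.encode(texts))

def task_based_clusters(texts, k, model_name="bert-base-uncased"):
    # The target model's own second-to-last layer, mean-pooled per example.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    reps = []
    with torch.no_grad():
        for t in texts:
            out = model(**tok(t, return_tensors="pt", truncation=True))
            reps.append(out.hidden_states[-2].mean(dim=1).squeeze(0))
    return KMeans(n_clusters=k, n_init=10).fit_predict(torch.stack(reps).numpy())
```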
Figure 2: Illustration of clustering results on binary classification from different clustering methods. Data points from the two categories are identified by dots and squares; errors are shown in red. (a) Agnostic clustering, where positive and negative data points are mixed together; (b) task-based clustering, where most points of one category lie on one side of the decision boundary of model M (i.e., they are separable by M), but positive and negative points are mixed within clusters; (c) task-based clustering + label information: besides being separable, data points with the same label are clustered together.
Selecting clusters for augmentation Given a budget of k clusters we can augment, we evaluate the clustering representations using the aggregate GC and IC statistics of their top-k clusters ranked by error rate, resulting in a set of clusters C_k. In other words, we choose the representation that yields the most augmentable clusters without hurting overall performance, as formalized in Equation 3:

C_k* = argmax_{C_k} [GC(C_k) − IC(C_k)]    (3)
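A sketch of this selection rule follows, under the assumption that each candidate clustering carries per-cluster error rates and GC/IC scores (all names are hypothetical):

```python
# Sketch of Equation (3): choose the clustering whose top-k error-ranked
# clusters maximize aggregate GC minus aggregate IC.

def select_clusters(candidate_clusterings, k):
    """Each candidate is a dict: cluster id -> (error_rate, gc, ic)."""
    def score(clustering):
        top_k = sorted(clustering, key=lambda c: clustering[c][0], reverse=True)[:k]
        avg_gc = sum(clustering[c][1] for c in top_k) / k
        avg_ic = sum(clustering[c][2] for c in top_k) / k
        return avg_gc - avg_ic
    best = max(candidate_clusterings, key=score)
    return sorted(best, key=lambda c: best[c][0], reverse=True)[:k]
```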
2.3 Subgroup Augmentation with LLMs

In order to augment the top challenging clusters C_k*, we follow the work of Khani and Ribeiro (2023) and use GPT-3 to create similar in-cluster examples, with a human in the loop to provide labels. We finetune a small local model on each cluster's data and use the disagreement between that model and the current version of M′ to rank GPT-3 generated examples, stopping the process once the current version of the cluster's model mostly agrees with the current version of M′. Intuitively, when M′ and the cluster's model converge on cluster data, M′ has learned to generalize to the data in this cluster (thus fulfilling the requirement of GC), and the original D used when updating M′ should prevent high interference.
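The loop below sketches this procedure; every helper (GPT-3 generation, fine-tuning, prediction, human labeling) is a hypothetical stand-in rather than the authors' implementation.

```python
# Sketch of the disagreement-ranked augmentation loop (Section 2.3).
# new_small_model, generate_similar, finetune, predict, and human_label
# are hypothetical stand-ins.
import random

def augment_cluster(cluster_data, m_prime, train_data, agree_threshold=0.95):
    augmented = []
    local = finetune(new_small_model(), cluster_data)        # cluster expert
    while True:
        candidates = generate_similar(cluster_data + augmented)   # GPT-3 call
        # Surface the candidates where the two models disagree.
        ranked = sorted(candidates,
                        key=lambda x: predict(local, x) != predict(m_prime, x),
                        reverse=True)
        labeled = human_label(ranked)                         # human in the loop
        augmented += labeled
        # Mix in original training data when updating M' (limits interference).
        m_prime = finetune(m_prime, labeled + random.sample(train_data, len(labeled)))
        local = finetune(local, labeled)
        agreement = sum(predict(local, x) == predict(m_prime, x)
                        for x in cluster_data) / len(cluster_data)
        if agreement >= agree_threshold:                      # models converge
            return augmented, m_prime
```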
3 Experiments

Setup We evaluate the effectiveness of TDG on three tasks from the GLUE benchmark: the Stanford Sentiment Treebank (SST), MultiNLI Matched (MNLI-m), and Quora Question Pairs (QQP). We train a bert-base model for SST and RoBERTa-large models for MNLI and QQP on the official training corpora released in the GLUE benchmark, matching the best Transformer performance.² These are regarded as the target model M in each task. We randomly divide the validation data into two halves: a dev set, used for automatic subgroup discovery, and a devtest set, used exclusively for evaluation. As a result, SST has a dev size of 436, MNLI a dev size of 4,908, and QQP a dev size of 20,215. We run each experiment five times with different random seeds and report the average scores.

² Following Bowman et al. (2015) and Yanaka et al. (2019), we use the binarized version of MNLI.

3.1 Automatic Subgroup Discovery

We run the clustering methods on the dev set of each task. We assign the closest cluster to each instance in the devtest set, such that each cluster in dev has an aligned counterpart for evaluation. We run each clustering method five times using different random seeds and select the clustering result with the best Silhouette score (Rousseeuw, 1987).
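A sketch of this selection step with scikit-learn (settings are assumed, not the paper's exact configuration):

```python
# Sketch of Section 3.1: keep the k-means run with the best Silhouette
# score, then map devtest instances onto the chosen dev clusters.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_clustering(dev_embeddings, k, n_runs=5):
    best, best_score = None, -2.0  # silhouette scores lie in [-1, 1]
    for seed in range(n_runs):
        km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(dev_embeddings)
        score = silhouette_score(dev_embeddings, km.labels_)
        if score > best_score:
            best, best_score = km, score
    return best

# Usage: clusters = best_clustering(dev_emb, k=35)
#        devtest_ids = clusters.predict(devtest_emb)  # nearest dev cluster
```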
Figure 3: Error distribution of clusters obtained from three clustering methods on SST (cluster number k=35). (a) Agnostic clustering: GC=0.0064, IC=0.0000; (b) task-based clustering: GC=0.011, IC=-0.0002; (c) task-based + label information: GC=0.1319, IC=0.19298. For random clustering: GC=-0.0010, IC=0.0000.

Figure 4: Error distribution of clusters obtained from three clustering methods on MNLI (cluster number k=100). (a) Agnostic clustering: GC=0.0013, IC=0.0011; (b) task-based clustering: GC=0.028, IC=-0.0017; (c) task-based + label information: GC=0.0434, IC=0.0023. For random clustering: GC=-0.0007, IC=0.0002.
Comparison of clustering representations We present the error rates of discovered clusters for SST and MNLI in Figures 3 and 4. For both tasks, errors were randomly distributed across the clusters produced by agnostic clustering, which indicates that the clusters are not aligned with model behaviors and weaknesses, as also confirmed by the low GC and IC scores. In contrast, task-based clustering (with or without label information) results in a large contingent of clusters with zero or few errors (i.e., most successes are clustered together), and a few clusters with higher error rates. Using label information yields clusters of either all errors or all successes, which results in high Generalization in Context scores, but also high Interference in Context scores. Both are likely due to label imbalance, as we would expect such scores from simply shifting the likelihood of predicting the cluster label. This analysis thus indicates that task-based clustering without labels yields the clusters that are most amenable to augmentation, since its clusters have positive generalization and near-zero interference scores. We use these clusters in subsequent results.

QQP All clusterings on QQP (not shown) had very high interference scores, and thus were not deemed suitable for augmentation by TDG. Indeed, when we piloted data augmentation procedures on these clusters, we saw no tangible benefits. Manual inspection of the clusters indicates that QQP has high label noise (which would explain the interference), such that pairs exhibiting the same phenomena are often labeled differently, e.g. the pair ("What makes life worth living?", "Is life worth it?") is labeled as not-duplicate, while ("Why is Deadpool so overrated", "Is Deadpool overrated") is labeled as duplicate. In this case, TDG correctly identifies a setting where subgroup data augmentation is unlikely to be effective, and other solutions (e.g. data cleaning) should be pursued. We do not report any further QQP results.

3.2 Subgroup Augmentation with LLMs

Based on the high-GC and low-IC clusters discovered in the previous step, we conduct augmentation targeted at those clusters with large language models and a human in the loop.

Human Participants We recruited 12 users to label GPT-3 generated data in the subgroup augmentation step. All users are from academia or industry (with IRB approval) and have experience working with AI-based natural language generation systems (e.g. GPT-3). Each user was assigned a high-error cluster discovered in the automatic subgroup discovery step (2 from SST and 10 from MNLI), and asked to label GPT-3 generations. We use the original sentences from the cluster as the initial prompt. Sentences that users labeled differently from the model's prediction were added to the augmented set. We allocated 90 minutes for user labeling; more information is given in Appendix 9.1.
Baselines We compare TDG to the following previous works that aim to improve subgroup performance: (1) Reweighing (Sohoni et al., 2020), which addresses hidden stratification caused by dataset imbalance by optimizing the per-cluster worst-case performance; in our experiments, we use the same Group Distributionally Robust Optimization (GDRO) introduced in their work on each cluster as the fine-tuning objective. (2) Paraphrasing, where we use Parrot (Damodaran, 2021), a T5-based paraphrase model, to generate similar examples of the data points in clusters as an augmentation; the size of the final fine-tuning set is the same as for TDG, for a fair comparison.

One cluster at a time vs. simultaneous augmentation Each participant augmented a single cluster, and we report these results as TDG(single), noting that for these we only measure in-cluster performance. We further pool the data from all participants (TDG(all)) to test the improvements on each cluster as well as performance on the overall test set (devtest). In each experiment, in order to avoid the issue of catastrophic forgetting (McCloskey and Cohen, 1989), we randomly sampled training data with the same frequency as the TDG-augmented data in the fine-tuning process (sketched after Table 1).³

³ In the MNLI experiment, due to the high interference among clusters, we adjust the weights of training samples and collected responses when combining all data points for TDG(all) in fine-tuning (i.e., we set the ratio of original samples to user responses to 2:1). In SST, all responses are combined without any adjustment.

Model        | SST: 1st | 2nd   | Avg Cluster | devtest
BERT-base    | 81.74    | 81.13 | 81.45       | 93.77
Reweighing   | 78.70    | 82.03 | 80.37       | 93.49
Paraphrasing | 77.61    | 82.42 | 80.02       | 92.26
TDG (single) | 83.80    | 83.39 | 83.60       | -
TDG (all)    | 82.61    | 83.39 | 83.00       | 94.32

Table 1: Accuracy of TDG vs. baselines tested on the top-2 error clusters and the left-out devtest set of SST. BERT-base is the target model M.
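A minimal sketch of the fine-tuning mixture described above; the 1:1 default (2:1 original-to-response for MNLI, per footnote 3) follows the paper's setup, while the helper name is ours.

```python
# Sketch: mix randomly sampled original training data with the augmented
# data at a fixed ratio to limit catastrophic forgetting.
import random

def build_finetune_set(train_data, augmented, original_per_augmented=1.0):
    n_original = int(len(augmented) * original_per_augmented)
    return random.sample(train_data, n_original) + list(augmented)
```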
Improvement in challenging subgroups Table 1 and Table 2 show the results of all baselines, as well as TDG(single) and the aggregated TDG(all), on the SST and MNLI tasks, respectively. For both tasks, augmenting individual clusters with TDG tends to be more effective than all baselines and ablations: the average in-cluster accuracy increased from 81.45% to 83.60% on SST and from 60.57% to 65.03% on MNLI, higher than any baseline model. Additionally, we observed that adding TDG data from all clusters improves all clusters by an average of 4.28% (from 60.57% to 64.85%) on MNLI and an average of 1.55% (from 81.45% to 83.00%) on SST, which is also higher than all baseline models. Note that the accuracy of every single cluster under TDG(all) is better than under the target model. For some challenging clusters, augmentation on their own (TDG(single)) may yield better results, due to potential interference between clusters (see Appendix 9.2 for more details).

Improvement in overall devtest We observed an improvement in overall performance on the devtest set with TDG(all), with an increase of 0.55% on SST and 0.16% on MNLI. This suggests that improving challenging clusters has the potential to improve the model at a global level, while neither baseline was able to achieve this. We notice the improvement on the devtest set is not as significant as the improvement on individual low-performing groups. This is likely because these vulnerable groups are usually minorities whose representation in the devtest set is small (e.g., the average size of the 10 clusters in the MNLI experiment is just 88, whereas the devtest set has size 4,908), diluting the impact of the improvement.

Ablation Analysis We evaluate the following variations of TDG to test the effectiveness of each step (a construction sketch follows the list):

• Automatic Subgroup Discovery Only, in which the fine-tuning data is created by using the same clusters as TDG but, instead of augmentation, adding the same number of random samples from the training data, to test the error discovery step.

• Subgroup Augmentation with LLM Only, in which the fine-tuning data is created by using n random samples from the dev set (n is the total number of sentences in the challenging clusters used in TDG) and applying subgroup augmentation with GPT-3, to test the effectiveness of the augmentation. Augmentation ends once the same number of augmented data points as TDG is reached.
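A sketch of how the two ablation fine-tuning sets could be constructed (helper names hypothetical):

```python
# Sketch of the two ablations. augment_with_gpt3 is a hypothetical
# stand-in for the Section 2.3 augmentation procedure.
import random

def discovery_only(cluster_examples, train_data):
    # Same clusters as TDG, but random training samples replace the
    # GPT-3 augmentation.
    return cluster_examples + random.sample(train_data, len(cluster_examples))

def augmentation_only(dev_set, n_seed, n_augmented):
    # Random dev samples as seeds, augmented until the TDG budget is hit.
    seeds = random.sample(dev_set, n_seed)
    return seeds + augment_with_gpt3(seeds, max_new=n_augmented)
```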
Model         | MNLI: 1st | 2nd   | 3rd   | 4th   | 5th   | 6th   | 7th   | 8th   | 9th   | 10th  | Avg Cluster | devtest
RoBERTa-Large | 51.85     | 53.57 | 53.85 | 54.84 | 55.56 | 58.82 | 65.71 | 66.56 | 68.75 | 76.19 | 60.57       | 93.46
Reweighing    | 51.85     | 53.57 | 30.77 | 58.06 | 55.56 | 58.82 | 68.57 | 65.91 | 68.75 | 73.81 | 58.57       | 93.46
Paraphrasing  | 51.85     | 42.86 | 53.85 | 54.84 | 44.44 | 58.82 | 65.71 | 65.91 | 68.75 | 26.19 | 53.32       | 86.45
TDG (single)  | 51.85     | 53.57 | 61.54 | 67.74 | 66.67 | 64.71 | 65.71 | 75.68 | 66.67 | 76.19 | 65.03       | -
TDG (all)     | 59.26     | 53.57 | 64.28 | 61.29 | 55.56 | 64.71 | 74.28 | 68.18 | 68.75 | 78.57 | 64.85       | 93.62

Table 2: Accuracy of different models tested on the top-10 high-error clusters and the left-out devtest set of MNLI.
Model                               | SST: 1st | 2nd   | Avg Cluster | devtest
BERT-base                           | 81.74    | 81.13 | 81.45       | 93.77
Automatic Subgroup Discovery only   | 78.70    | 82.20 | 80.45       | 93.89
Subgroup Augmentation with LLM only | 79.42    | 78.42 | 78.91       | 93.17
TDG (single)                        | 83.80    | 83.39 | 83.60       | -
TDG (all)                           | 82.61    | 83.39 | 83.00       | 94.32

Table 3: Accuracy of different ablations of TDG on the top-2 high-error clusters in SST. BERT-base is the target model M.

We see that fine-tuning with the clusters alone can improve performance on certain clusters when the size is sufficient (e.g., the 2nd cluster in SST), but it can also lead to over-fitting and reduced performance (e.g., the 1st cluster in SST). Additionally, subgroup augmentation on randomly sampled clusters results in a decrease in performance not only in low-performing areas, but also overall on the devtest set. Without automatic subgroup discovery, the GPT-3 augmented sentences may introduce more noise than benefit, which verifies the bottleneck of previous work (Ribeiro and Lundberg, 2022) and emphasizes the importance of the automatic subgroup discovery.

Interpretation of low-performing groups In this section, we present some examples from the high-error groups discovered by automatic subgroup discovery, along with readable interpretations of the clusters, as shown in Table 4. Our automatic subgroup discovery is able to identify meaningful errors, such as mis-identifying the dominant sentiment in a mixture of sentiments in SST, or errors related to different language tones in MNLI. Furthermore, we also notice that complex reasoning patterns are identified, such as Factivity and Monotonicity, which are recognized challenges in the SuperGLUE diagnostic tasks.
SST — Cluster: Multiple sentiments, with one dominating the rest (Label: positive; Prediction: negative)
• "On the heels of the ring comes a similarly morose and humorless horror movie that, although flawed, is to be commended for its straight-ahead approach to creepiness."
• "Another one of those estrogen overdose movies like 'divine secrets of the ya ya sisterhood', except that the writing, acting and character development are a lot better."

MNLI — Cluster: Same meaning, formal tone vs. casual tone (Label: entailment; Prediction: not entailment)
• Sentence1: "Do you think I should be concerned?" Sentence2: "Do you think it is a problem"
• Sentence1: "He seemed too self-assured." Sentence2: "He is very cocky"

MNLI — Cluster: One vs. all (Label: entailment; Prediction: not entailment)
• Sentence1: "Pray be seated, mademoiselle." Sentence2: "Please, everyone be seated."
• Sentence1: "Similar conclusions have been reached by legal studies in a dozen states including Florida." Sentence2: "Similar conclusions have been seen across the world."

MNLI — Cluster: Suspicion vs. fact (Label: entailment; Prediction: not entailment)
• Sentence1: "The analysis also addresses the various alternatives to the final rule which were considered, including differing compliance or reporting requirements, use of performance rather than design standards, and an exemption for small entities from coverage of the rule." Sentence2: "The rule is subject to change."
• Sentence1: "In the depths of the Cold War, many Americans suspected Communists had infiltrated Washington and were about to subvert our democracy." Sentence2: "Communists infiltrated Washington during the Cold War."

Table 4: Interpretation of discovered high-error clusters. Each cluster is shown with two errors.
4 Related Work

Recent research in machine learning has focused on enhancing the robust performance of models by identifying challenging subgroups and improving their performance.

Discovering Challenging Subgroups Several studies, such as d'Eon et al. (2022) and Rajani et al. (2022), focus on identifying challenging subgroups in the data. However, these works primarily focus on discovering general low-performing regions in embedding space and do not address strategies for improving these regions. In contrast, our work aims to identify challenging subgroups that are also amenable to improvement through data augmentation using language models.

Improving Performance of Known Subgroups Other studies, such as Thakur et al. (2021); Yoo et al. (2021); He et al. (2021), focus on augmenting data from known subgroups or patterns. However, it can be challenging to apply these methods in scenarios where the challenging subgroups are not known a priori. Another stream of work focuses on model testing and debugging, which involves creating human-generated data points and testing them on the model. Methods such as CheckList (Ribeiro et al., 2020) and DynaBench (Kiela et al., 2021) generate test cases from pre-defined topics and templates, while AdaTest (Ribeiro and Lundberg, 2022) uses pre-trained language models to generate more tests that are similar to the human-created examples. Although these methods show promising results in improving the performance of challenging subgroups, it is not clear how to provide the first data points from a challenging subgroup. Finding such data points was the main focus of our work, where we showed how to find data points that are suitable for further augmentation.

Model-based Approaches Another approach for enhancing the performance of challenging subgroups is to develop new training strategies. Sagawa et al. (2019) minimize the worst-group loss when subgroups are known a priori, Khani et al. (2019) add the variance of the loss to the optimization function, and Liu et al. (2021) train the model twice, once with every data point and once more with the points that have high losses. Sohoni et al. (2020) discover subgroups and then change the training function to improve accuracy. Changing the training function usually improves the accuracy of challenging subgroups, but at the expense of decreasing accuracy in other subgroups or overall. In contrast, our work increases the performance of challenging subgroups while also increasing the overall accuracy.

Data Augmentation with Human-in-The-Loop Recent works note that Human-in-The-Loop (HITL) based augmentation offers unique benefits over automatic data augmentation, such as addressing dataset design flaws (Fanton et al., 2021), improving performance for minority groups (Srivastava et al., 2020), and avoiding syntactic and semantic distortions in the text (Anaby-Tavor et al., 2020). We want to point out that TDG is orthogonal to non-HITL augmentation (i.e., they can be used together). In addition, TDG's use of LLMs to generate augmentations for specific data groups helps reduce the human effort: TDG only requires minimal human effort for validation, making it more efficient than previous HITL-based methods that either require domain experts or more extensive human input. In this paper, we purposefully chose state-of-the-art (SOTA) models that are already very good. However, our work shows that even such models still exhibit coherent lower-performance groups that can be further improved with targeted data collection.

5 Conclusion

In this work, we presented a thorough analysis of the error distribution among different groups and introduced Targeted Data Generation (TDG), a framework that automatically identifies challenging groups that are amenable to improvement through data augmentation using large language models (LLMs), without negatively impacting overall accuracy. Our experiments with state-of-the-art models demonstrate that TDG is able to improve in-group performance by 2-13% while also increasing overall accuracy. Furthermore, TDG was able to improve performance for every single selected cluster without interference, indicating its potential as a reliable approach for a new data collection framework. As LLMs continue to advance and are trained on more diverse and larger corpora, TDG represents a promising approach for addressing the weaknesses of simpler models.
6 Ethical Considerations

In this paper, we propose a method for automatically identifying groups of data that are underperforming due to a lack of training examples. It is important to note that these underperforming groups may be related to marginalized demographic groups, which may be underrepresented in the data. By identifying these groups, our work is able to reveal potential discriminatory behaviors in NLP models and facilitate bias mitigation by augmenting these underrepresented groups. However, there is also the risk that malicious actors may exploit this information and create adversarial examples that further bias the model. To address this concern, we suggest involving the user audience or implementing fairness regulations in the interactive procedure to prevent such behaviors. Finally, it is worth noting that our method relies heavily on large language models to improve the performance of challenging groups; as a result, if some groups are not represented in LLMs, our method is unable to increase their performance.
7 Limitations

One limitation of our approach is that we aggregated the IC and GC measurements over clusters during the automatic subgroup discovery process, but we did not fully consider the relationships between clusters. A more comprehensive strategy for utilizing beneficial relationships, and a more precise approach to potential conflicts between clusters, could lead to further improvements in overall performance. Additionally, our MNLI experiments were conducted on a large dataset that had multiple clusters with errors. We chose to focus on the top-10 clusters with the most errors due to limitations in resources for running a user study. While TDG on the top-K clusters has demonstrated effectiveness in improving performance, there is still the potential for further improvements by working on a larger number of clusters. At the same time, we emphasize that TDG should be used as the last step to improve performance in low-performing groups (clusters with high errors). If these groups are numerous, the model is likely under-trained, and other techniques (e.g. better data or modeling) should be applied first.

8 Acknowledgements

We would like to thank Scott Lundberg for his kind assistance in designing and implementing the user interface. We are appreciative of the insightful suggestions provided by folks from the Microsoft Office of Applied Research. Special thanks go to Brent Hecht, Aaron Halfaker, and Yujin Kim for their generous contributions of time and support in our user studies. We would also like to express our thanks to all the participants from the University of California San Diego for their active involvement in the user studies.

References

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7383–7390.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Prithiviraj Damodaran. 2021. Parrot: Paraphrase generation for NLU.

Greg d'Eon, Jason d'Eon, James R. Wright, and Kevin Leyton-Brown. 2022. The spotlight: A general method for discovering systematic errors in deep learning models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1962–1981.

Margherita Fanton, Helena Bonaldi, Serra Sinem Tekiroğlu, and Marco Guerini. 2021. Human-in-the-loop for data collection: a multi-target counter narrative dataset to fight online hate speech. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3226–3240, Online. Association for Computational Linguistics.

Zexue He, Bodhisattwa Prasad Majumder, and Julian McAuley. 2021. Detect and perturb: Neutral rewriting of biased and sensitive text via gradient-based decoding. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4173–4181.
Fereshte Khani, Aditi Raghunathan, and Percy Liang. 2019. Maximum weighted loss discrepancy. arXiv preprint arXiv:1906.03518.

Fereshte Khani and Marco Tulio Ribeiro. 2023. Collaborative development of NLP models. arXiv preprint arXiv:2305.12219.

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. 2021. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110–4124, Online. Association for Computational Linguistics.

Evan Z. Liu, Behzad Haghgoo, Annie S. Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. 2021. Just train twice: Improving group robustness without training group information. In International Conference on Machine Learning, pages 6781–6792. PMLR.

Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier.

Nazneen Rajani, Weixin Liang, Lingjiao Chen, Meg Mitchell, and James Zou. 2022. SEAL: Interactive tool for systematic error analysis and labeling. arXiv preprint arXiv:2210.05839.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Marco Tulio Ribeiro and Scott Lundberg. 2022. Adaptive testing and debugging of NLP models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3253–3267, Dublin, Ireland. Association for Computational Linguistics.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912.

Peter J. Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. 2019. Distributionally robust neural networks. In International Conference on Learning Representations.

Nimit Sohoni, Jared Dunnmon, Geoffrey Angus, Albert Gu, and Christopher Ré. 2020. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. Advances in Neural Information Processing Systems, 33:19339–19352.

Megha Srivastava, Tatsunori Hashimoto, and Percy Liang. 2020. Robustness to spurious correlations via human annotations. In International Conference on Machine Learning, pages 9109–9119. PMLR.

Chloe Rose Stuart-Ulin. 2018. Microsoft's politically correct chatbot is even worse than its racist one. Quartz Ideas, 31.

Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. 2021. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 296–310.

Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui, Satoshi Sekine, Lasha Abzianidze, and Johan Bos. 2019. Can neural networks understand monotonicity reasoning? In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 31–40, Florence, Italy. Association for Computational Linguistics.

Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyeong Park. 2021. GPT3Mix: Leveraging large-scale language models for text augmentation. arXiv preprint arXiv:2104.08826.
9 Appendix

9.1 Human-In-The-Loop Details

User Interface The goal of our user study is to find bugs in the target model. To make bugs easier to find, we provide our users with the user interface shown in Figure 5. The interface is linked with the back-end global and local models. The UI enables the following actions through its buttons:

• Suggest: click to use the current sentence list as a prompt for GPT-3 to generate similar examples;

• Add: allows users to add a sentence from the generated examples to the current list;

• Update global: trains the global model using the concatenation of a random sample of sentences from the training set and the sentences in the current list;

• Update local: trains the local model using the sentences in the current list;

• Creative: indicates whether the local and global models make different decisions. A red color indicates disagreement, while green indicates no disagreement;

• Rename: users can rename their clusters to an interpretable name if they'd like to.

In Figure 6, we show an example of adding a sentence to a subcluster and renaming it. When a user's label differs from the model prediction (i.e. the bar under "Creative" turns red), the user adds the sentence. We ask that each user clicks on the "Update global" button at least once during their study session to ensure that they continue to find meaningful bugs in the updated model.

9.2 Analysis on Relationships Between Clusters

We observe that sometimes fine-tuning the model with TDG(all) augmented data on individual clusters can lead to improved performance on certain clusters and worse performance on others. This suggests that there may be relationships between clusters, such as mutual benefit or conflict.

One conjecture is that data points may share multiple patterns with different sentences and therefore belong to multiple clusters; each individual TDG run works on only one of them, so combining the data and fine-tuning together can accumulate the performance gains. For example, the MNLI example "S1: Pray be seated, mademoiselle. S2: Please, everyone be seated." can exhibit both the cross-lingual entailment pattern and the monotonicity pattern. Another conjecture, for conflicting clusters, is that the patterns within one cluster may be contradictory to those in another cluster. For example, in sentiment classification, sentences mentioning "American" in technology topics may conflict with sentences mentioning "American" in international-relations topics. Such conflicts may be solved by simply adding similar examples. Therefore, fine-tuning these conflicting clusters together may negatively impact the performance of one or both clusters.
tence to a subcluster and renaming it. impact the performance of one or both clusters.