
Your Large Language Models Are Leaving Fingerprints

Hope McGovern∗†   Rickard Stureborg†   Yoshi Suhara†   Dimitris Alikaniotis

Cambridge Computer Lab   Duke University   NVIDIA   Grammarly
Grammarly

arXiv:2405.14057v1 [cs.CL] 22 May 2024

Abstract

It has been shown that finetuned transformers and other supervised detectors effectively distinguish between human and machine-generated text in some situations (Li et al., 2023), but we find that even simple classifiers on top of n-gram and part-of-speech features can achieve very robust performance on both in- and out-of-domain data. To understand how this is possible, we analyze machine-generated output text in five datasets, finding that LLMs possess unique fingerprints which manifest as slight differences in the frequency of certain lexical and morphosyntactic features. We show how to visualize such fingerprints, describe how they can be used to detect machine-generated text, and find that they are even robust across textual domains. We find that fingerprints are often persistent across models in the same model family (e.g. llama-13b vs. llama-65b) and that models fine-tuned for chat are easier to detect than standard language models, indicating that LLM fingerprints may be directly induced by the training data.

1 Introduction

Large language models (LLMs) produce text often indistinguishable from human-authored text to human judges (Clark et al., 2021). This unfortunately allows potential misuses such as academic plagiarism (Westfall, 2023) and the dissemination of disinformation (Barnett, 2023), which has therefore prompted interest in generated text detection (GTD). We conduct linguistic analysis on five popular published datasets for GTD, showing that the machine-generated content in each shows linguistic markers in aggregate which make it relatively easy to separate from human content.

These discrepancies, which we call a model's "fingerprint", are consistent enough across domains and within model families that we find we can treat each LLM as if it were a unique author with a distinct writing style. To do so, we use two well-founded methods from the field of Author Identification (AID) for a closed set of authors: one using handcrafted n-gram features and another using neural features extracted from pre-trained BERT embeddings, and training a simple machine learning classifier on those features.

Paper         Best Reported Model    GradientBoost
              F1      AUROC          F1      AUROC
Deepfake      –       0.99           94.7    94.3
HC3           99.82   –              96.7    99.6
Ghostbuster   99.91   1.00           98      98
OUTFOX        96.9    –              98.7    98.7

Table 1: Best reported classifier performances (deep neural networks) versus a decision-tree model with n-gram features. Best-reported classifier models are from four recent papers which release labeled datasets for GTD (Li et al., 2023; Guo et al., 2023; Verma et al., 2023; Koike et al., 2023). The GradientBoost (Friedman, 2001) classifier uses a combination of character-, word-, and POS-n-gram features. GradientBoost models achieve impressively comparable accuracy, even outperforming the best reported model on the OUTFOX benchmark.

As shown in Table 1, the performance of the simple classifier is surprisingly comparable to more complex neural methods, even in a multi-class setting, successfully distinguishing between, e.g., human-, ChatGPT-, and LLaMA-generated text (cf. Table 2). It also proves robust in cross-domain experiments (cf. Figure 2) and to some adversarial attacks which hinder other detectors (Table 3). Furthermore, we present evidence that instruction-tuning and prompting can manipulate this fingerprint, but not remove it (cf. Figure 4).

In this paper, we empirically uncover and characterize the fingerprints of individual LLMs and of model families through a series of comprehensive analyses, and present a new perspective on LLM-content detection as authorship identification.

∗ Corresponding Author. Email: [email protected].
† Work done while at Grammarly.
[Figure 1: radial plots of POS-tag frequency distributions (0–20% scale), one per model, arranged by family: OpenAI (davinci-002, davinci-003, gpt-3.5-turbo), Eleuther (gpt-j, gpt-neox), Llama (llama-13b, llama-30b, llama-65b), GLM (glm-130b), and FLAN (t5-small, t5-large, t5-xxl).]

Figure 1: Visualization of the fingerprints. We plot frequencies of each part-of-speech (POS) class from the
output of several models, sorted by model family. Within each family, the shapes (distributions) look mostly similar
regardless of model size. Each radial plot is shown at the same 0% to 20% frequency scale, with POS tags sorted
from most to least common among human-written outputs. Jagged/bumpy shapes indicate the fingerprint is more
distinct from human distributions. POS is just one component of the full ‘fingerprint’ we investigate.
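The per-model profiles plotted in Figure 1 amount to relative POS-tag frequencies over each model's outputs. A minimal sketch of that computation follows (our own illustration, assuming spaCy's en_core_web_sm tagger; the radial plotting itself is omitted and is not part of the released code).

```python
# Minimal sketch: POS-tag frequency profile of a collection of texts,
# the quantity plotted per model in Figure 1. Assumes spaCy's
# en_core_web_sm tagger; plotting (e.g. a polar/radial chart) is omitted.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner", "parser", "lemmatizer"])

def pos_profile(texts):
    """Return {POS tag: relative frequency} over all tokens in `texts`."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(tok.pos_ for tok in doc)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}

# profile = pos_profile(model_outputs)  # one profile per model (or family)
```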

2 Methodology

2.1 Fingerprint Features

We use three feature sets: word n-grams (n ∈ [2, 4]), which we expect to be useful in capturing domain-specific vocabulary, but also in capturing function words, which are known to be highly effective for authorship identification; character n-grams (n ∈ [3, 5]), which we intuitively expect to capture subword information broadly aligning with the byte-pair encoding (BPE) tokenization of many models; and part-of-speech (POS) n-grams (n ∈ [2, 4]), which should capture domain-agnostic information about writing style.

2.2 Classifiers

We use a GradientBoost classifier implemented in the Sklearn library (Pedregosa et al., 2011). The hyperparameters for the classifier were found through grid search, though no extensive hyperparameter sweeps were carried out; this classifier works well out-of-the-box.¹ Initial experiments used a range of ML classifiers, including SVC and logistic regression, which exhibited similar performance on our data. If the classes are very imbalanced, we downsample the majority class in order to obtain a balanced dataset (n = 5000).

¹ Further hyperparameter tuning could improve classifier performance, but we are primarily interested in exploring why such a simple classifier performs well in the first place.
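To make this pipeline concrete, the sketch below assembles the three n-gram feature sets and the GradientBoost classifier with scikit-learn. It is a minimal illustration under stated assumptions (spaCy's en_core_web_sm tagger for POS tags, default classifier settings, and the 2,000-features-per-set cap from Appendix D.1), not the authors' released implementation.

```python
# Minimal sketch of the Section 2 feature pipeline: character-, word-, and
# POS-n-gram counts feeding a GradientBoost classifier.
import spacy
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

nlp = spacy.load("en_core_web_sm", disable=["ner", "parser", "lemmatizer"])

def to_pos(text: str) -> str:
    # Map a document to its sequence of coarse POS tags so that a word-level
    # vectorizer over the result yields POS n-gram counts.
    return " ".join(tok.pos_ for tok in nlp(text))

features = FeatureUnion([
    ("char", CountVectorizer(analyzer="char", ngram_range=(3, 5), max_features=2000)),
    ("word", CountVectorizer(analyzer="word", ngram_range=(2, 4), max_features=2000)),
    ("pos",  CountVectorizer(analyzer="word", ngram_range=(2, 4), max_features=2000,
                             preprocessor=to_pos, token_pattern=r"\S+")),
])

detector = Pipeline([
    ("features", features),
    ("clf", GradientBoostingClassifier()),  # hyperparameters: see Appendix D.1
])

# texts: list[str]; labels: 0 = human, 1 = machine (or model IDs for AID)
# detector.fit(train_texts, train_labels)
# predictions = detector.predict(test_texts)
```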
2.3 Data

We use five publicly available machine-generated text detection datasets both for fingerprint analysis and as training data for supervised sequence classifiers: OUTFOX (Koike et al., 2023), DeepfakeTextDetect (Li et al., 2023), the Human Comparison Corpus (Guo et al., 2023), Ghostbuster (Verma et al., 2023), and the M4 dataset (Wang et al., 2024b). We refer to these as "Outfox", "Deepfake", "HC3", "Ghostbuster", and "M4" in this work, respectively. A summary of all datasets used, including their domain coverage and underlying base model(s), may be seen in Table 15.

3 Results and Analysis

We conduct a series of analyses of LLM fingerprints, finding that they are predictive of which model authored a text, consistent across domains and within model families (and even somewhat robust to adversarial attacks), and susceptible to modification by fine-tuning. Additional results are presented in the Appendix.

Dataset      Provenance  F1
Ghostbuster  Human       0.934
             ChatGPT     0.960
             Flan T5     0.927
             Average     0.940
Outfox       Human       0.877
             ChatGPT     0.936
             Claude      0.920
             Average     0.911

Table 2: F1 scores for each class as the positive class after training under a multiclass classification setting. Note that even for top models ChatGPT and Claude, our simple n-gram based classifier performs very well (0.936 and 0.920 on the Outfox data). To compare with binary classification results, F1 scores are computed for each class by setting that class as the 'positive' one.

[Figure 2: grouped bar chart of in-domain versus out-of-domain F1, with 95% confidence intervals, for the largest model of each of the 7 model families in the Deepfake benchmark.]

Figure 2: F1 score of GTD on in-domain versus out-of-domain test sets for the largest model of each model family in the Deepfake benchmark. We find no statistically significant drop in performance when testing on these 7 models' outputs. 95% confidence intervals are computed through bootstrap sampling at n = 10,000.
We visualize fingerprints by looking at differences in the distributions of various linguistic properties. In Figure 1, we report part-of-speech tag distributions of data generated by different models on the same Deepfake data domains.² In Appendix B we also include analysis of named entity tags, constituency types, and top-k most frequent tokens. The strength of the fingerprint is more evident across some axes than others, and there are, of course, more dimensions of linguistic analysis that could theoretically be applied to uncover model fingerprints.

Distinct patterns emerge when comparing the fingerprints of models within the same family to those of models from different families. The degree of similarity within families varies between families; for example, LLaMA models exhibit a particularly uniform fingerprint across model sizes, while BigScience models (cf. Appendix B) look markedly different.

² We choose to report POS results in the main paper as POS directly maps to one feature set in our classification experiments, whereas we do not directly use named entity categories, constituency types, or top-k words as features.

3.1 Author Identification

We find that fingerprints are useful not just for GTD, but also for predicting which model generated a given text. Table 2 shows that in a multiclass classification setting, these n-gram features allow strong performance for author identification (AID). This aligns well with previous work in AID, where linguistic and stylometric features have proven highly effective time and time again (He et al., 2024).

3.2 Robustness

Given the nature of the fingerprint features, one might expect that a shift in domain substantially impacts the performance of GTD using these fingerprints. However, in Figure 2 we find no evidence that performance deteriorates on out-of-domain test sets, while Figure 3 shows a clear deterioration on out-of-model test sets. That is, fingerprints are robust across domains but are unique to each model. Even further, fingerprints are somewhat robust to adversarial attacks (Table 3).

[Figure 3: grouped bar chart of the average drop in Human Recall, Machine Recall, F1, and AUROC for out-of-domain versus out-of-model test sets.]

Figure 3: Average drop in performance on various metrics when testing on out-of-domain text (blue) versus a held-out generative model (brown). Note that recall of the machine-generated text drops significantly when testing on an unseen model's output, while changing the domain has no impact on this metric.
3.3 Altering Fingerprints

Some adversarial attacks seem to potentially alter fingerprints (Table 3), though qualitatively they also change the readability of the final text. Alternatively, instruction tuning through reinforcement learning or supervised fine-tuning is a potential method to purposefully alter model fingerprints. Figure 4 shows the potential of altering fingerprints by comparing a chat model's fingerprint to that of its base model.

CLS        Unattacked  DIPPER  OUTFOX
Ling + GB  0.992       0.761   0.967
Bert + GB  0.992       0.848   0.980
Bert + LR  0.998       0.811   0.992
Deepfake   0.549       0.661   0.764
GPTZero    0.407       0.637   0.692

Table 3: F1 scores before and after adversarial attacks by DIPPER and OUTFOX. GB stands for GradientBoost, LR for Logistic Regression; Ling denotes linguistic features and Bert denotes BERT-based embeddings as features. Note that our models are robust to OUTFOX attacks, while they deteriorate slightly under DIPPER (though still showing relatively strong F1 scores). BERT-based features are more robust to DIPPER than our linguistic features.

[Figure 4: bar chart of the absolute difference in POS-tag frequencies from human text for llama-13b versus llama-13b-chat, by part-of-speech tag.]

Figure 4: Absolute difference in POS tag frequencies as compared with human text. Chat models are slightly more similar to the frequency profile of humans, yet are easier to detect than base models. This demonstrates that a fingerprint "closer" to the human POS-tag distribution does not make a model less detectable. Further, fine-tuning models for chat clearly alters their fingerprint despite no change in model architecture.

4 Related Work

A common approach to machine-generated text detection is to train a supervised binary classifier on labeled data (Guo et al., 2023; Koike et al., 2023; Li et al., 2023). Li et al. (2023) proposed a variety of classification testbeds, finding that pre-trained language models perform the best. While n-gram frequencies have often been used for author identification, only a few recent works examine hand-crafted features or stylometrics in machine-generated text detection (Zaitsu and Jin, 2023). One example is GLTR (Gehrmann et al., 2019): a system that uses top-k words to highlight text spans, visually aiding humans in spotting AI-written text themselves.

Some of our findings support recent work: Zaitsu and Jin (2023) find stylometric methods effective for detection in Japanese text, even compared with SOTA models. Additionally, Gehrmann et al. (2019) have proposed a feature extractor for feature-based, statistical classification of machine-generated text, and Petukhova et al. (2024) find a combination of fine-tuned neural features and hand-crafted linguistic features effective for GTD on the M4 dataset as part of the SemEval-2024 task on machine-generated text detection (Wang et al., 2024a).

For linguistic analysis, Li et al. (2023) analyze their corpus DeepfakeTextDetect, but report differences in POS-tag distributions between human and machine data as insignificant when considering all models and domains in aggregate; however, they do find these distributions begin to diverge when considering a subset of models or domains. We demonstrate that these differences extend to every publicly available machine text detection dataset, prove largely consistent within model families, and are very powerful features for training a robust machine-generated text detection classifier.

5 Conclusion

We demonstrate that in five popular datasets for machine-generated text detection, n-gram features are highly effective for GTD. We uncover that LLMs have unique writing styles that can be captured in lexical and syntactic features, which we characterize as "fingerprints", and show that they are generally unique to a model family. We also show some evidence that fingerprints can be modified with further fine-tuning, suggesting they may be removable.

Limitations

• Text length: We examine outputs of approximately 300-500 words in length. Shorter texts may be difficult to fingerprint or may not provide enough signal.

• Prompting: We do not explore prompting methods in any exhaustive or fine-grained manner, although we note that the datasets we analyze themselves used a variety of prompting methods to collect data. Still, we acknowledge that the choice of prompt has been shown to have a significant impact on the output of an LLM.
• Model choice limitations: We constrain ourselves to the data and models released as part of text detection corpora, which means that there are some very good models we simply did not have the data to test, e.g. GPT-4.

• Generation uncertainty: For our instruction-tuning fingerprint experiment, we only fine-tune 7b and 13b variants of Llama-2. These are relatively small models, and there is no guarantee that our methods would work for larger models or different instruction-tuning regimes.

• Reflection on real-world use-case: Analyzing fingerprints in research benchmark datasets is most likely not reflective of the true difficulty of deepfake text detection in the wild. For one thing, people don't tend to use LLMs to write entire articles or essays. A more likely scenario for, e.g., academic plagiarism is starting from an LLM-generated paragraph and making sentence-level rewrites. As this is analogous to a paraphrase attack like DIPPER (Krishna et al., 2023), we expect that it would degrade our classifiers' performance.

Ethics Statement

This research indicates that detecting machine-generated text is easy. However, we want to stress that this does not necessarily mean machine-detection is a high-confidence task. Using a single model prediction about one single written text to determine whether or not it was human-written should be evaluated on a different basis than average accuracy, given the potential harms of false positives or false negatives. For example, teachers may wish to use tools to determine if students have cheated on exams or homework using LLMs. We discourage teachers from trusting predictions by any classifier until more investigation is done into the confidence models have for any individual text.

References

Sofia Barnett. 2023. ChatGPT Is Making Universities Rethink Plagiarism. Wired.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. ArXiv, abs/2004.05150.

Shuyang Cai and Wanyun Cui. 2023. Evade ChatGPT detectors via a single space. ArXiv, abs/2307.02599.

Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. 2021. All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics.

Alex Franklin, Maggie, Meg Benner, Natalie Rambis, Perpetual Baffour, Ryan Holbrook, Scott Crossley, and ulrichboser. 2022. Feedback Prize - Predicting Effective Arguments.

Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232.

Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. 2019. GLTR: Statistical detection and visualization of generated text. In Annual Meeting of the Association for Computational Linguistics.

Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. arXiv.

Xie He, Arash Habibi Lashkari, Nikhill Vombatkere, and Dilli Prasad Sharma. 2024. Authorship attribution methods, challenges, and future research directions: A comprehensive survey. Information, 15(3):131.

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. 2023. A watermark for large language models. In International Conference on Machine Learning.

Ryuto Koike, Masahiro Kaneko, and Naoaki Okazaki. 2023. OUTFOX: LLM-generated essay detection through in-context learning with adversarially generated examples. ArXiv, abs/2307.11729.

Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. 2023. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense.

Yafu Li, Qintong Li, Leyang Cui, Wei Bi, Longyue Wang, Linyi Yang, Shuming Shi, and Yue Zhang. 2023. Deepfake Text Detection in the Wild. arXiv.

Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023. DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. arXiv.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Kseniia Petukhova, Roman Kazakov, and Ekaterina Kochmar. 2024. PetKaz at SemEval-2024 Task 8: Can linguistics capture the specifics of LLM-generated text?

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv.

Abel Salinas and Fred Morstatter. 2024. The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance. ArXiv, abs/2401.03729.

Rafael A. Rivera Soto, Olivia Elizabeth Miano, Juanita Ordoñez, Barry Y. Chen, Aleem Khan, Marcus Bishop, and Nicholas Andrews. 2021. Learning universal authorship representations. In Conference on Empirical Methods in Natural Language Processing.

Vivek Kumar Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. 2023. Ghostbuster: Detecting text ghostwritten by large language models. ArXiv, abs/2305.15047.

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, and Preslav Nakov. 2024a. SemEval-2024 Task 8: Multigenerator, multidomain, and multilingual black-box machine-generated text detection. In Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico.

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Toru Sasaki, Thomas Arnold, Alham Aji, Nizar Habash, Iryna Gurevych, and Preslav Nakov. 2024b. M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1369–1407, St. Julian's, Malta. Association for Computational Linguistics.

Chris Westfall. 2023. Educators Battle Plagiarism As 89% Of Students Admit To Using OpenAI's ChatGPT For Homework.

Wataru Zaitsu and Mingzhe Jin. 2023. Distinguishing ChatGPT(-3.5, -4)-generated and human-written papers through Japanese stylometric analysis. PLOS ONE, 18(8):e0288453.
A Discussion

A note about prompting. Different prompting methods have quite a large effect on the output generation and quality of LLMs (Salinas and Morstatter, 2024). While it is possible that different prompting methods could then have a large effect on the measured fingerprint of a single output, at the dataset level a little prompting variation does not seem to be too detrimental. Specifically, every dataset we tested used a combination of prompting techniques (continuation, topic-based, chain-of-thought, etc.), and we still find LLM style markers which are useful for classification. Furthermore, we show in Table 3 that both the linguistic-feature and neural-feature GB classifiers are robust to OUTFOX, an adversarial technique that uses different prompts to try to fool a detector.

A.1 What to Do About Fingerprints

We hope that the knowledge that LLM fingerprints can be exploited for quick, accurate, explainable classifications will encourage more research into other straightforward and trustworthy methods of machine-text detection. In Appendix F, we detail the negative results from a Reinforcement Learning with Human Feedback (RLHF) setup in an attempt to remove the fingerprints. Our inconclusive RLHF experiments raise the question of how intrinsic the fingerprints are and whether they can be removed at all. If they cannot, then fingerprints may remain a strong classification signal for the task until other training procedures prove to produce fingerprint-less models.

Perhaps the most interesting insight from our work is that LLMs may be considered analogous to
human authors for the purpose of deepfake detection. We hope that this insight will bridge two research
avenues and help to develop a more robust theory of LLM generation.

Finally, we encourage those releasing datasets for machine-generated text detection to benchmark
against simple, feature-based classifiers.
B Fingerprint Characterization

[Figure 5: radial plots of POS-tag frequency distributions for three additional model families: FLAN (t5-base, t5-small, t5-large, t5-xl, t5-xxl), BigScience (t0-3b, bloom-7b, t0-11b), and OPT (opt-125m, opt_350m, opt_1.3b, max_1.3b, opt_2.7b, opt_6.7b, opt_13b, iml_30b, opt_30b).]

Figure 5: Additional visualizations of fingerprints. Note that the POS tag distributions of OPT models are less similar to one another than those we observe within other model families. Further investigation could examine what causes these differences, since model size does not seem to be a factor for the FLAN models.
[Figure 6: bar charts of Jensen-Shannon divergence (human vs. model) for each model, grouped by family: (a) BigScience (t0_3b, bloom_7b, t0_11b), (b) EleutherAI (gpt_j, gpt_neox), (c) Flan (t5_base, t5_small, t5_xl, t5_xxl, t5_large), (d) LLaMA (llama-30B, llama-13B, llama-65B), (e) OpenAI (davinci-002, davinci-003, gpt-3.5-turbo), (f) OPT (350M, 13B, iml_30B, 125M, 2.7B, max_1.3B, 1.3B, 6.7B, 30B).]

Figure 6: Fingerprint characterization of Deepfake data by model and family. We report the Jensen-Shannon
Divergence of human vs. model for each model in each model family in the Deepfake data across four categories.
Columns from left to right: constituency type, named entity tag, POS tag, top-k word frequency. We omit the
GLM family in this visualization as there is only one model (130B) available. We do include the GLM results in
table format in Table 4. Like in Figure 1, some model families exhibit remarkably consistent fingerprints within
families, e.g. LLaMa, Flan, and OpenAI. OPT and EleutherAI in particular have less distinguishable fingerprints
within family.

[Figure 7: bar chart of Jensen-Shannon divergence by fingerprint axis (constituency type, named entity tag, POS tag, top-k token frequency) for chatgpt, davinci, and flan on Outfox data.]

Figure 7: Fingerprint characterization of Outfox data by model. Columns from left to right: constituency type, named entity tag, POS tag, top-k word frequency. We note again that ChatGPT and davinci, being in the same OpenAI model family, have very similar fingerprints, whereas Flan's fingerprint differs substantially. Note that this davinci fingerprint looks different from the Deepfake davinci fingerprint, showing that there is some domain dependence to fingerprints, while underscoring the point that, regardless of domain, individual models of the same family do produce similar-sounding texts.
model const ne pos top-k token
3.5-turbo 0.104062 0.121635 0.055647 0.065913
davinci-002 0.071464 0.085834 0.032347 0.048616
davinci-003 0.101841 0.101650 0.048786 0.050924
t0_11b 0.099334 0.116554 0.083307 0.046207
t0_3b 0.096778 0.096023 0.081568 0.044135
bloom_7b 0.113688 0.145603 0.096391 0.062792
llama_13B 0.084033 0.076458 0.055965 0.052204
llama_30B 0.083770 0.071776 0.055118 0.052044
llama_65B 0.083919 0.081378 0.052662 0.052289
GLM130B 0.092373 0.079987 0.055011 0.058290
gpt_j 0.091120 0.099940 0.079746 0.062767
gpt_neox 0.139291 0.126545 0.098795 0.150069
iml_30b 0.090410 0.101563 0.074169 0.056793
max_1.3b 0.074518 0.083632 0.052747 0.042955
opt_1.3b 0.076921 0.088855 0.056709 0.046881
opt_125m 0.066651 0.088566 0.047990 0.038409
opt_13b 0.191208 0.110141 0.162453 0.108174
opt_2.7b 0.095768 0.099351 0.075968 0.052447
opt_30b 0.129652 0.101543 0.106709 0.069312
opt_350m 0.134956 0.149431 0.120094 0.049036
opt_6.7b 0.077973 0.102048 0.060880 0.044933
t5_base 0.094128 0.078521 0.075754 0.054816
t5_large 0.093037 0.069552 0.075520 0.057391
t5_small 0.096973 0.090165 0.076765 0.053622
t5_xl 0.095449 0.076024 0.072617 0.050033
t5_xxl 0.090850 0.074514 0.068368 0.050967

Table 4: Jensen Shannon Divergence of human vs. model across fingerprint axes on Deepfake data. Here we
report in full the fingerprint measurements across constituency type (‘const’), named entity type (‘ne’), part-of-
speech (‘pos’), and top-k token frequency (‘top-k token’). Horizontal lines delineate model families.

model const ne pos top-k token


chatgpt 0.210140 0.109395 0.181909 0.101498
davinci 0.206403 0.104278 0.187878 0.093806
flan 0.107426 0.073845 0.095178 0.044870

Table 5: Jensen Shannon Divergence of human vs. model across fingerprint axes on Outfox data. Here we
report in full the fingerprint measurements across constituency type (‘const’), named entity type (‘ne’), part-of-
speech (‘pos’), and top-k token frequency (‘top-k token’).
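The divergence values in Tables 4 and 5 are Jensen-Shannon divergences between a human and a model frequency distribution over one fingerprint axis. A minimal sketch for the POS axis follows (our own illustration; the paper does not specify the log base, so base 2 is assumed here).

```python
# Minimal sketch of the per-axis fingerprint measurement behind Tables 4-5:
# Jensen-Shannon divergence between human and model POS-tag distributions.
from collections import Counter
import numpy as np

def distribution(tag_sequences, vocab):
    # Relative frequency of each tag in `vocab` over a set of tagged texts.
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    freqs = np.array([counts.get(tag, 0) for tag in vocab], dtype=float)
    return freqs / freqs.sum()

def js_divergence(p, q):
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) = 0 by convention
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# human_tags / model_tags: lists of POS-tag sequences (e.g. from spaCy)
# vocab = sorted({t for seq in human_tags + model_tags for t in seq})
# jsd = js_divergence(distribution(human_tags, vocab),
#                     distribution(model_tags, vocab))
```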

C Additional Experiments
Fingerprints are “genetic”: often consistent within a model family. A particularly interesting finding is that our classifier is better at detecting out-of-domain texts from the same model than it is at detecting in-domain texts from a different model. In other words, Flan T5 “sounds” like T5 whether it is generating news stories or fan fiction. In Figure 3 we report the results of taking a classifier trained on data from one model and measuring the drop in machine recall when it is evaluated on either an in-domain test set from a different model or an out-of-domain test set from the same model. The classifier is far more effective in the latter case, and the 95% confidence interval computed with bootstrapping tells us that this is not due to particular oddities of the data splits we used, but that it is a meaningful result.
We also explicitly test how well a classifier trained on one model generalizes to (1) other models in
the same family and (2) other model families. We find that, on average, the drop in machine recall value
(out of 1) from in-domain data to other models in the same family is only 0.01, while the drop to other
families is 0.62. We report these results in Table 7.
AID methods are generally effective for LLM text detection in publicly available datasets.
(a) Linguistic feature ablation on Deepfake, considering data produced by gpt-j-6b across all domains.

Feature Sets       F1    AUROC
character n-grams  0.96  0.99
word n-grams       0.77  0.82
POS-tag n-grams    0.88  0.95
all                0.95  0.95

(b) Linguistic feature ablation on Outfox data for content generated by Flan T5 in the one domain present in Outfox (student essay).

Feature Sets       F1    AUROC
character n-grams  0.96  1.00
word n-grams       0.86  0.94
POS-tag n-grams    0.88  0.95
all                0.95  0.95

Table 6: Linguistic feature ablation. For two single-model subsets of our data, we compare F1 and AUROC for character-, word-, or POS n-gram features alone against a combination of the three. In both cases, each feature set clearly carries some signal, although word n-grams individually perform the worst. This is likely because word-level features are highly influenced by topic and therefore may not be the most useful feature for helping the model generalize across text domains. Interestingly, character n-grams by themselves appear to be stronger than the combined feature set. This warrants more exploration, but we posit this is because the range of character n-grams captures high-frequency subword tokens in a given model's vocabulary.

Average drop in performance


Experiment HRec MRec F1 AUC
Same Family | Different Domain −0.03 −0.01 −0.02 0.00
Different Family | Same Domain 0.00 −0.62 −0.21 −0.44

Table 7: Models exhibit individual writing styles which are more similar across domains than across model families. We report the average drop in performance of a GradientBoost binary classifier trained on Deepfake data. In 7 independent trials, we train a classifier on a randomly selected model and compare its performance on the in-domain test set to: (1) data from a model in the same family but in a held-out domain, and (2) data from a model in a different family but in the same domains present in the train set (this is made possible by the fact that Deepfake is multi-parallel). The performance drop is low over data from a model in the same family, and high over data from a model in a different family. The drop in human recall is small but not 0, as the human data is shuffled and downsampled, so the exact same set of prompts is not seen in every trial.

Domain In-domain As held-out


Finance 0.979 0.751
Medicine 0.996 0.941
Open QA 0.996 0.832
Reddit eli5 0.999 0.829
Wiki csai 0.957 0.739
Average 0.985 0.818
All 0.983 —

Table 8: Out-of-domain F1 results on HC3 data by domain. For each domain in the left-hand column, we train two models: (1) one where the training data covers all domains and the held-out test set is in this domain, and (2) one where the training data covers all domains except this one, and the held-out test set is in this domain. The final row denotes an experiment in which we train on all domains mixed and test on held-out data which is all in-domain. Some combinations of domains are better for generalizing to unseen domains, which is a noted phenomenon in AID as well (Soto et al., 2021); however, in general, out-of-domain performance on HC3 is not as strong as it is for Deepfake data (Figure 2). This could potentially be improved by increasing the number of samples seen during training (which in each of these experiments was 5000).
Model    Accuracy  F1     AUROC  HumanRec  MachineRec  AvgRec
ChatGPT  0.946     0.949  0.996  1.000     0.891       0.946
BloomZ   0.992     0.992  1.000  1.000     0.984       0.992
Dolly    0.714     0.763  0.845  0.913     0.512       0.712
Cohere   0.776     0.723  0.938  0.579     0.976       0.778
Davinci  0.868     0.882  0.977  0.980     0.754       0.867

Table 9: BERT features + GradientBoost classifier on M4 data. Here we take 5000 training samples, extract BERT pre-trained embeddings for them using Hugging Face's feature-extraction pipeline, and train a GradientBoost classifier. We then test on in-domain data from a held-out test set of 500 examples. Dolly and Cohere fool the classifier the most; for Dolly, low machine recall is the cause of its poor performance, perhaps indicating high linguistic variation in the model's generations (i.e. it is easier to pinpoint what a 'human' text sounds like than what a 'machine' text sounds like). The opposite is true for Cohere: our model is able to find distinctive markers of machine text, but the human data is less distinctive.
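The caption above refers to pre-trained BERT embeddings extracted with Hugging Face and fed to a GradientBoost classifier. A rough equivalent is sketched below (our own illustration, calling the model directly rather than through the feature-extraction pipeline; the bert-base-uncased checkpoint and mean pooling over token states are assumptions, not reported settings).

```python
# Rough sketch of the BERT-feature variant in Table 9 (illustrative only).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.ensemble import GradientBoostingClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def embed(texts):
    feats = []
    for text in texts:
        inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        hidden = bert(**inputs).last_hidden_state             # [1, seq_len, 768]
        feats.append(hidden.mean(dim=1).squeeze(0).numpy())   # mean-pool tokens
    return np.vstack(feats)

# clf = GradientBoostingClassifier().fit(embed(train_texts), train_labels)
# preds = clf.predict(embed(test_texts))
```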

Model Accuracy F1 AUROC HumanRec MachineRec AvgRec


ChatGPT 0.852 0.871 0.993 0.705 0.998 0.851
BloomZ 0.996 0.996 1.000 0.991 1.000 0.996
Dolly 0.707 0.768 0.875 0.443 0.971 0.707
Cohere 0.551 0.421 0.636 0.776 0.326 0.551
Davinci 0.767 0.810 0.974 0.540 0.995 0.767

Table 10: Linguistic features + GradientBoost classifier on M4 data. Here we take 5000 training samples, extract linguistic features (char-, word-, and POS n-grams) from them, and train a GradientBoost classifier. We then test on in-domain data from a held-out test set of 500 examples. Dolly and Cohere again fool the classifier the most; however, whereas in Table 9 Dolly exhibits low machine recall and Cohere exhibits high machine recall, here the opposite is true. This warrants further exploration. Possibly, a combined approach of neural and linguistic features such as in Petukhova et al. (2024) would be more robust here.

In the most simplistic setup (cf. Table 12), binary classification of “human” versus “machine” where the data comes from one model in one text domain, our GradientBoost classifier with linguistic features achieves upwards of 0.94 AUROC on every test.

We also compare classifier performance on data produced by a single model but in a mixture of domains, shown in Table 12. While performance is generally robust, M4's Cohere data is a clear outlier, only achieving 0.64 AUROC, perhaps indicating that it shows a high degree of linguistic diversity across domain-specific generations. However, in a comparison of linguistic vs. neural features on the M4 dataset, we find that the GB classifier with BERT features can achieve high performance on Cohere (cf. Table 14).

In a multi-class setup (Table 2), where the classifier must distinguish between “human”, “model 1”, and “model 2”, it achieves an average F1 of 0.94 and 0.91 for the Ghostbuster and Outfox datasets, respectively.
All of the above are calculated using unseen data from the same distribution as the training data, but we
see strong out-of-domain performance as well in Figure 2. Choosing the largest model in each model
family available in Deepfake’s data, we calculate the F1 score for both an in-domain and out-of-domain
test set. Through multiple independent trials and our bootstrapped confidence intervals, we see that F1
on unseen domains has a larger variance, but that overall the difference in F1 score for in-domain and
out-of-domain performance is not statistically significant because the confidence intervals overlap. In
other words, linguistic features + ML classifiers do generalize well to data in unseen domains.
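The bootstrapped confidence intervals referenced here and in Figure 2 can be computed by resampling the test set with replacement; a minimal sketch follows (our own illustration, assuming sklearn's f1_score and the n = 10,000 resamples stated in the Figure 2 caption).

```python
# Sketch of a bootstrap confidence interval for F1, as used for Figure 2.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```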

D Implementation Details

D.1 GradientBoost
Parameters: learning rate of 0.2, 90 estimators, max depth of 8, max features 'sqrt', subsample ratio of 0.8, random state 10, minimum samples per leaf 30, and minimum samples to split 400. These hyperparameters were optimized using Sklearn's grid search function. Features: char n-grams (2,4), word n-grams (3,5), POS n-grams (3,5). Maximum 2000 features for each feature set.
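For reference, these settings map onto scikit-learn's GradientBoostingClassifier parameter names roughly as follows (a sketch of the configuration, not the released training script).

```python
# Sketch of the reported GradientBoost configuration in scikit-learn terms.
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    learning_rate=0.2,
    n_estimators=90,
    max_depth=8,
    max_features="sqrt",
    subsample=0.8,
    min_samples_leaf=30,
    min_samples_split=400,
    random_state=10,
)
```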
Dataset Base Model Domain F1 AUROC HumanRec MachineRec AvgRec
Domain-Specific gpt-j-6b cmv 0.98 0.98 0.97 1.00 0.98
eli5 0.94 0.94 0.98 0.90 0.94
hswag 0.96 0.98 1.00 0.97 0.98
roct 0.99 0.99 1.00 0.98 0.99
sci_gen 0.99 0.99 0.99 0.98 0.99
squad 0.91 0.97 1.00 0.95 0.97
tldr 0.97 0.96 0.97 0.95 0.96
xsum 0.95 0.95 0.95 0.95 0.95
yelp 0.95 0.94 0.99 0.89 0.94
wp 0.98 0.98 1.00 0.96 0.98
Mixed Domains gpt-j-6b mixed 0.95 0.95 0.97 0.93 0.95
gpt-3.5-turbo mixed 0.90 0.97 0.91 0.91 0.91
flan-t5-xxl mixed 0.91 0.95 0.91 0.79 0.85
opt 30b mixed 0.98 0.99 0.99 0.93 0.96
llama 65B mixed 0.92 0.93 0.94 0.75 0.85
glm 130B mixed 0.94 0.95 0.98 0.78 0.88
davinci-003 mixed 0.82 0.92 0.80 0.86 0.83
Mixed Model Set OpenAI GPT mixed 0.70 0.73 0.64 0.82 0.73
Meta Llama* mixed 0.81 0.82 0.79 0.84 0.82
GLM-130B* mixed 0.86 0.86 0.89 0.83 0.86
Google FLAN-T5* mixed 0.83 0.85 0.77 0.93 0.85
Facebook OPT* mixed 0.86 0.87 0.87 0.86 0.87
BigScience mixed 0.79 0.81 0.74 0.88 0.81
EleutherAI mixed 0.95 0.95 0.96 0.94 0.95
Ghostbuster gpt-3.5-turbo Reuters 0.98 0.98 0.98 0.99 0.98
essay 0.98 0.97 0.99 0.96 0.97
wp 0.98 0.99 0.99 0.98 0.99
claude Reuters 0.97 0.96 0.99 0.94 0.96
essay 0.98 0.98 0.97 0.99 0.98
wp 0.95 0.95 0.94 0.97 0.95
HC3 gpt-3.5-turbo eli5 1.00 1.00 1.00 1.00 1.00
open_qa 0.97 1.00 0.96 0.98 0.97
wiki_csai 0.97 0.99 0.96 0.98 0.97
medicine 0.99 1.00 0.99 0.99 0.99
finance 0.97 0.99 0.99 0.96 0.97
OUTFOX gpt-3.5-turbo essay 0.95 0.94 1.00 0.89 0.94
text-DaVinci-003 essay 0.98 0.98 1.00 0.95 0.98
flan_t5_xxl essay 0.95 0.96 0.93 0.98 0.96

Table 11: Full In-Domain Results with GradientBoost classifier. Here we report classifier metrics for four
datasets, broken down by model, model family, and domain.
Dataset      Model             AUROC
Deepfake     gpt-j-6b          96.8 ± 1.8
HC3          gpt-3.5-turbo     99.6 ± 0.46
Ghostbuster  gpt-3.5-turbo     98 ± 0.81
             claude            96.3 ± 1.2
Outfox       gpt-3.5-turbo     94.0
             text-davinci-003  98.0
             flan-t5-xxl       96.0

(a) Domain-Specific results: AUROC for binary classification where "model" data is from a single model in a single domain. Where we have data for more than one domain per model, we average them, reporting mean and standard deviation. Outfox data only has one domain (essay), so there is no standard deviation to report.

Dataset   Model          AUROC
Deepfake  gpt-3.5-turbo  0.97
          flan-t5-xxl    0.95
          opt 30B        0.99
          llama 65B      0.93
          glm 130B       0.95
          davinci-003    0.92
M4        gpt-3.5-turbo  0.99
          BloomZ         1.0
          Dolly          0.88
          Cohere         0.64
          Davinci        0.97
HC3       gpt-3.5-turbo  1.0

(b) Mixed Domain results: For those datasets for which we have data from a single model in multiple domains, we report AUROC on a held-out test set.

Table 12: Single and mixed domain results with linguistic features and GradientBoost classifier for our datasets of interest.

Dataset Model AUROC


Deepfake gpt-3.5-turbo 0.97
flan-t5-xxl 0.95
opt 30B 0.99
llama 65B 0.93
glm 130 B 0.95
davinci-003 0.92
M4 gpt-3.5-turbo 0.99
BloomZ 1.0
Dolly 0.88
Cohere 0.64
Davinci 0.97
HC3 gpt-3.5-turbo 1.0

Table 13: Classifier performance on data produced by a single model in a mixture of domains. We report
classifier (linguistic + GradientBoost) performance as AUROC for Deepfake, M4, and HC3 datasets, which all
include data from one model across a variety of domains (Outfox only includes the essay domain). The classifier
proves broadly robust, except for Cohere data. These results are extracted from the full table in Table 11.

Model Ling+GB BERT+GB


ChatGPT 0.993 0.996
BloomZ 0.993 1.000
Dolly 0.875 0.845
Cohere 0.636 0.938
Davinci 0.974 0.977

Table 14: Linguistic vs. BERT features. We report AUROC for binary classification on M4 data, comparing
linguistic and BERT features with the same machine learning classifier. For most models, the performance is
comparable, with BERT features slightly edging out linguistic features. Cohere is the outlier, proving to be a difficult
classification task for the linguistic set to capture. However, BERT features remain robust to Cohere data.
Dataset Base Model/Family Domain Human Machine
Domain-Specific gpt-j-6b cmv 509 636
eli5 952 863
hswag 1000 868
roct 999 833
sci_gen 950 529
squad 686 718
tldr 772 588
xsum 997 913
yelp 984 856
wp 940 784
Total 8789 7588
Mixed Model Set OpenAI GPT mixed 67k 67k
Meta Llama mixed 37k 37k
GLM-130B mixed 9k 9k
Google FLAN-T5 mixed 47k 47k
Facebook OPT mixed 80k 80k
BigScience mixed 27k 27k
EleutherAI mixed 14k 14k
Total 282k 282k
Ghostbuster gpt-3.5-turbo Reuters 500 500
essay 1000 1000
wp 500 500
Total 2000 2000
HC3 gpt-3.5-turbo eli5 17.1k 17.1k
open_qa 1.19k 1.19k
wiki_csai 842 842
medicine 1.25k 1.25k
finance 3.93k 3.93k
Total 24.3k 24.3k
OUTFOX gpt-3.5-turbo essay 15k 15k
text-davinci-003 essay 15k 15k
flan_t5_xxl essay 15k 15k
Total 46k 46k

Table 15: Dataset statistics (number of documents) for publicly available machine-generated text detection datasets.

E Dataset Information

E.1 Outfox

Outfox is a parallel human-machine dataset built on the Kaggle Feedback Prize dataset (Franklin et al.,
2022) and contains approximately 15,000 essay problem statements and human-written essays, ranging
in provenance from 6th to 12th grade native-speaking students in the United States. For each problem
statement, there is also an essay generated by each of three LLMs: ChatGPT (gpt-3.5-turbo-0613),
GPT-3.5 (text-davinci-003), and Flan (FLAN-T5-XXL). Each example contains an instruction prompt
(“Given the following problem statement, please write an essay in 320 words with a clear opinion.”), a
problem statement (“Explain the benefits of participating in extracurricular activities and how they can
help students succeed in both school and life. Use personal experiences and examples to support your
argument.”), the text of the essay, and a binary label for human or machine authorship.
While we conduct fingerprint analysis on the whole dataset, we use only the human-written subset of
the Outfox data as a training corpus for our fine-tuning setup; given an instruction prompt and problem
statement, we fine-tune our LLMs of interest to produce text which minimises cross-entropy loss when
compared with the original human-written response to the same problem statement. We withhold a test-set
of human-written examples from training to be used for evaluation.
E.2 Ghostbuster
Verma et al. (2023) provide three new datasets for evaluating AI-generated text detection in creative
writing, news, and student essays. Using prompts scraped from the subreddit r/WritingPrompts, the
Reuters 50-50 authorship identification dataset, and student essays from the online source IvyPanda, they
obtained ChatGPT- and Claude-generated responses and made efforts to maintain consistency in length
with human-authored content in each domain.

E.3 HC3
We also analyze data from Guo et al. (2023), which includes questions from publicly available datasets
and wiki sources with human- and ChatGPT-generated responses based on instructions and additional
context. The resulting corpus comprises 24,322 English and 12,853 Chinese questions, of which we only
use the English split.

E.4 Deepfake
The Deepfake corpus is a comprehensive dataset designed for benchmarking machine-generated content
detection in real-world scenarios (Li et al., 2023). It contains approximately 9,000 human examples across
10 text domains, each paired with machine outputs from 27 models (e.g. GPT-3.5-turbo, text-davinci-002)
from 7 different model families (e.g. OpenAI), producing several testbeds designed for examining a
detector’s sensitivity to model provenance and text domain. Each example contains the text, binary label
denoting human or machine, and the source information – which domain, model, and prompting method
were used.
We reserve the whole Deepfake dataset as an evaluation corpus to allow us to examine the robustness
of our proposed methods across different models, model families, and domains. For generation, we take
the first 30 tokens (as split by whitespace) of an example as the problem statement and prepend it with a
simple continuation instruction prompt: “Read the following instruction and generate appropriate text.
### Continue the following text:”.
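A minimal sketch of this prompt construction is below (our own illustration; the instruction string is quoted from the text above).

```python
# Sketch of the continuation-prompt construction for Deepfake examples:
# the first 30 whitespace-split tokens become the problem statement,
# prefixed with the instruction quoted above.
INSTRUCTION = ("Read the following instruction and generate appropriate text. "
               "### Continue the following text:")

def make_prompt(example_text: str, n_tokens: int = 30) -> str:
    problem_statement = " ".join(example_text.split()[:n_tokens])
    return f"{INSTRUCTION} {problem_statement}"
```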
Training Data. We primarily use the Deepfake and Outfox data for training classifiers to analyze
different aspects of the LLM fingerprints. They are both conveniently multi-parallel: they contain N model
responses for each human text sample in the dataset. This has the benefit of removing some uncertainty
from our classifier results. Performance on the human class is often identical across trials, as the human
data is often identical. This allows a controlled test of how our classifier deals with the machine text
samples. Additionally, Deepfake's different testbeds provide convenient, parallel domain and model (or model family) data splits. Specifically, we use the mixed model sets and the model-specific, domain-specific testbeds from Deepfake.

F RLHF Negative Result


F.1 Related Works
Watermarking. Unlike the work of Kirchenbauer et al. (2023) on watermarking LLMs, the fingerprints we describe are not intentionally added after the fact in order to make a model more detectable, but are naturally imprinted on the model during its training/fine-tuning process.

Adversarial Attacks. While this work is related to adversarial research in that we attempt to create data which will degrade the performance of a classifier, we have a distinctly different goal than that of adversarial attacking; we are not interested in fooling a detector so much as we are in exploring what it is that keeps machine-generated data from being truly indistinguishable from that of humans. A variety of adversarial attacks have proved incredibly effective against AI-text detectors (Cai and Cui, 2023; Koike et al., 2023). Recognizing the shortcomings of existing detection methods, some have proposed creating more robust classifiers by augmenting training corpora with more challenging examples. Recently, Koike et al. (2023) proposed using perturbed (e.g. paraphrased) text to augment training data.

Reinforcement Learning. Having uncovered evidence of unique LLM writing styles, we designed an experimental setup to try to remove the fingerprints with Reinforcement Learning with Human Feedback (RLHF). The results of this experiment were inconclusive, but we report them in summary below. For this setup, we use a mixture of Deepfake and Outfox data for both instruction fine-tuning and RLHF training data.

F.2 Generation
For some experiments, we use LLaMA models and their instruction-tuned counterparts to generate
machine responses to human prompts in the Outfox and Deepfake datasets. Outfox’s examples are already
formatted with an instruction prompt (e.g. “Given the following problem statement, please write an
essay in 320 words with a clear opinion.”) and explicit problem statement (e.g.“Explain the benefits of
participating in extracurricular activities and how they can help students succeed in both school and life.
Use personal experiences and examples to support your argument.”), whereas Deepfake’s are not. To
remedy this, for every example in Deepfake, we take the first 30 tokens (as split by whitespace) of the
example text as the problem statement and prepend it with a simple continuation instruction prompt:
“Read the following instruction and generate appropriate text. ### Continue the following text:”. Further
information about generations, including model sampling parameters, may be found in Appendix E.4. We
conduct minimal pre- and post-processing on our input as well as the generated outputs, only removing
newlines and cleaning up whitespace.

F.3 Negative Results


Here we describe an experimental setup devised to remove a given model’s fingerprint with reinforcement
learning.
RLHF is a popular training methodology that uses reinforcement learning principles to encourage
models to produce text with desired attributes. Specifically, it optimizes model behavior with human
feedback, often facilitated by crowd-sourced annotators or domain experts (Ouyang et al., 2022). Direct
Preference Optimization (DPO) (Rafailov et al., 2023) is a recent training paradigm for implementing
RLHF, which eliminates the need for fine-tuning a reward model and instead directly uses the language
model for optimizing the reward function by solving a classification problem on human preference data.
For three recent LLMs (LLaMA-7b, LLaMA-13b, and Falcon-7b), we performed a supervised fine-
tuning step to adapt them to the domain of the training data, which is student-generated essays from
Outfox. Once the SFT models are trained, we use them as a starting point for DPO training.3
In a DPO setup, the model is exposed to a triplet of prompt, chosen_response and
rejected_response, where the model learns to optimize the marginal reward of the chosen response
over the rejected response. Intuitively, if the SFT or DPO process has been successful in removing
the LLM fingerprint, we expect to see classifier performance drop – the machine responses are more
difficult to distinguish from human responses because they exhibit more similar distributions of linguistic
features than the responses of the base model do. To quantify our results, we used our GradientBoost
classifier, a finetuned Longformer (Beltagy et al., 2020) model released by Li et al. (2023), and DetectGPT
as a zero-shot detection method (Mitchell et al., 2023).
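For reference, the objective optimized in the DPO stage is the loss of Rafailov et al. (2023); writing π_θ for the policy being trained, π_ref for the frozen SFT model, D for the preference data, and (x, y_w, y_l) for the prompt, chosen response, and rejected response:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
\]

where σ is the logistic function and β controls how strongly the policy is kept close to the reference model.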
After significant tuning, retraining, and optimizing, we did not consistently see an appreciable decrease
in classifier performance after the successive stages of fine-tuning.

F.4 Limitations of RLHF


We acknowledge that using RLHF to align model outputs with human data necessarily amplifies any
subconscious biases which may be present in the human training data. It is well known that certain
demographics are over-represented in the creation of user-generated content on the internet, and as our
proposed method focuses exclusively on the syntax and lexical choice of LLMs – not semantic content –
we therefore have not included any filtering for offensive content.

3 We do this because best practices for DPO from Rafailov et al. (2023) suggest that models should first be adapted to the text domain via fine-tuning before being trained with reinforcement learning.
