Your Large Language Models Are Leaving Fingerprints
Figure 1: Visualization of the fingerprints. We plot frequencies of each part-of-speech (POS) class from the
output of several models, sorted by model family. Within each family, the shapes (distributions) look mostly similar
regardless of model size. Each radial plot is shown at the same 0% to 20% frequency scale, with POS tags sorted
from most to least common among human-written outputs. Jagged/bumpy shapes indicate the fingerprint is more
distinct from human distributions. POS is just one component of the full ‘fingerprint’ we investigate.
ses, and present a new perspective of LLM-content detection as authorship identification.

2 Methodology

2.1 Fingerprint Features

We visualize fingerprints by looking at differences in the distributions of various linguistic properties. In Figure 1, we report part-of-speech tag distributions of data generated by different models on the same Deepfake data domains. In Appendix B we also include analysis of named entity tags, constituency types, and top-k most frequent tokens. The strength of the fingerprint is more evident along some axes than others, and there are, of course, more dimensions of linguistic analysis that could theoretically be applied to uncover model fingerprints.

Distinct patterns emerge when comparing the fingerprints of models within the same family with those of models from different families. The degree of similarity within families can vary between families; for example, LLaMA models exhibit a particularly uniform fingerprint across model sizes.

ML classifiers, including SVC and logistic regression. These exhibited close or similar performance on our data. If the classes are very imbalanced, we downsample the majority class in order to obtain a balanced dataset (n = 5000).

[Table 2 values recovered from the extraction: Outfox Human 0.877, ChatGPT 0.936; Average 0.940.]

Table 2: F1 scores for each class as the positive class after training under a multiclass classification setting. Note that even for the top models ChatGPT and Claude, our simple n-gram based classifier performs very well (0.936 and 0.920 on the Outfox data). To compare with binary classification results, F1 scores are computed for each class by setting that class as the ‘positive’ class.

Figure 2: F1 score of GTD on in-domain versus out-of-domain test sets for the largest model of each model family in the Deepfake benchmark. We find no statistically significant drop in performance when testing on these 7 models’ outputs. 95% confidence intervals are computed through bootstrap sampling at n = 10,000.

3.2 Robustness

Given the nature of the fingerprint features, one might expect that a shift in domain substantially impacts the performance of GTD using these fingerprints. However, in Figure 2 we find no evidence that performance deteriorates on out-of-domain test sets, while Figure 3 shows a clear deterioration on out-of-model test sets. That is, fingerprints are robust across domains but unique to each model. Even further, fingerprints are somewhat robust to adversarial attacks (Table 3).
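To make the fingerprint features concrete, the sketch below computes a POS-tag frequency distribution of the kind plotted in Figure 1 for a set of model outputs. The use of spaCy and the en_core_web_sm tagger is an illustrative assumption; the paper does not specify which tagger was used.

```python
# Minimal sketch of a POS-tag "fingerprint": count coarse POS tags over a set of
# texts and normalize to relative frequencies. spaCy is an assumption here.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_fingerprint(texts):
    """Return a dict mapping POS tag -> relative frequency over all texts."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(tok.pos_ for tok in doc if not tok.is_space)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Comparing two models then amounts to comparing these distributions, e.g.
# human_fp = pos_fingerprint(human_texts); model_fp = pos_fingerprint(model_texts)
```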
Figure 4: Absolute difference in POS tag frequencies as compared with human text. Chat models are slightly more similar to the frequency profile of humans, but are easier to detect than base models. This demonstrates that a fingerprint “closer” to the human distribution of POS tags does not indicate that the model is less detectable. Further, fine-tuning models for chat clearly alters their fingerprint despite no change in model architecture.

4 Related Work

A common approach to machine-generated text detection is to train a supervised binary classifier on labeled data (Guo et al., 2023; Koike et al., 2023; Li et al., 2023). Li et al. (2023) proposed a variety of classification testbeds, finding that pre-trained language models perform the best. While n-gram frequencies have often been used for author identification, only a few recent works examine hand-crafted features or stylometrics in machine-generated text detection (Zaitsu and Jin, 2023). One example is

5 Conclusion

We demonstrate that in five popular datasets for machine-generated text detection, n-gram features are highly effective for GTD. We uncover that LLMs have unique writing styles that can be captured in lexical and syntactic features, which we characterize as “fingerprints”, and show that they are generally unique to a model family. We also show some evidence that fingerprints can be modified with further fine-tuning, suggesting they may be removable.

Limitations

• Text length: We examine outputs of approximately 300-500 words in length. Shorter texts may be difficult to fingerprint or may not provide enough signal.

• Prompting: We do not explore prompting methods in any exhaustive or fine-grained manner, although we do note that we conduct analysis on datasets that have used a variety of
prompting methods themselves to collect data. Still, we acknowledge that the choice of prompt has been shown to have a significant impact on the output of an LLM.

• Model choice limitations: We constrain ourselves to the data and models released as part of text detection corpora, which means that there are some very good models we simply did not have the data to test, e.g. GPT-4 data.

• Generation uncertainty: For our instruction-tuning fingerprint experiment, we only fine-tune the 7b and 13b variants of Llama-2. These are relatively small models, and there is no guarantee that our methods would work for larger models or different instruction-tuning regimes.

• Reflection on real-world use-case: Analyzing fingerprints in research benchmark datasets is most likely not reflective of the true difficulty of deepfake text detection in the wild. For one thing, people don’t tend to use LLMs to write entire articles, essays, etc. A more likely scenario for, e.g., academic plagiarism is starting from an LLM-generated paragraph and making sentence-level rewrites. As this is analogous to a paraphrase attack like DIPPER (Krishna et al., 2023), we expect that it would degrade our classifiers’ performance.

Ethics Statement

This research indicates that detecting machine-generated text is easy. However, we want to stress that this does not necessarily mean machine detection is a high-confidence task. Using a single model prediction about one single written text to determine whether or not it was human-written should be evaluated on a different basis than average accuracy, given the potential harms of false positives or false negatives. For example, teachers may wish to use tools to determine if students have cheated on exams or homework using LLMs. We discourage teachers from trusting predictions by any classifier until more investigation is done into the confidence models have for any individual text.

References

Sofia Barnett. 2023. ChatGPT Is Making Universities Rethink Plagiarism. Wired.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. ArXiv, abs/2004.05150.

Shuyang Cai and Wanyun Cui. 2023. Evade ChatGPT detectors via a single space. ArXiv, abs/2307.02599.

Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. 2021. All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics.

Alex Franklin, Maggie, Meg Benner, Natalie Rambis, Perpetual Baffour, Ryan Holbrook, Scott Crossley, and ulrichboser. 2022. Feedback Prize - Predicting Effective Arguments.

Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232.

Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. 2019. GLTR: Statistical detection and visualization of generated text. In Annual Meeting of the Association for Computational Linguistics.

Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. arXiv preprint.

Xie He, Arash Habibi Lashkari, Nikhill Vombatkere, and Dilli Prasad Sharma. 2024. Authorship attribution methods, challenges, and future research directions: A comprehensive survey. Information, 15(3):131.

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. 2023. A watermark for large language models. In International Conference on Machine Learning.

Ryuto Koike, Masahiro Kaneko, and Naoaki Okazaki. 2023. OUTFOX: LLM-generated essay detection through in-context learning with adversarially generated examples. ArXiv, abs/2307.11729.

Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. 2023. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense.

Yafu Li, Qintong Li, Leyang Cui, Wei Bi, Longyue Wang, Linyi Yang, Shuming Shi, and Yue Zhang. 2023. Deepfake Text Detection in the Wild. arXiv preprint.

Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023. DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. arXiv preprint.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Kseniia Petukhova, Roman Kazakov, and Ekaterina Kochmar. 2024. PetKaz at SemEval-2024 Task 8: Can linguistics capture the specifics of LLM-generated text?

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv preprint.

Abel Salinas and Fred Morstatter. 2024. The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance. ArXiv, abs/2401.03729.

Rafael A. Rivera Soto, Olivia Elizabeth Miano, Juanita Ordoñez, Barry Y. Chen, Aleem Khan, Marcus Bishop, and Nicholas Andrews. 2021. Learning universal authorship representations. In Conference on Empirical Methods in Natural Language Processing.

Vivek Kumar Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. 2023. Ghostbuster: Detecting text ghostwritten by large language models. ArXiv, abs/2305.15047.

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, and Preslav Nakov. 2024a. SemEval-2024 Task 8: Multigenerator, multidomain, and multilingual black-box machine-generated text detection. In Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval 2024, Mexico City, Mexico.

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Toru Sasaki, Thomas Arnold, Alham Aji, Nizar Habash, Iryna Gurevych, and Preslav Nakov. 2024b. M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1369–1407, St. Julian’s, Malta. Association for Computational Linguistics.

Chris Westfall. 2023. Educators Battle Plagiarism As 89% Of Students Admit To Using OpenAI’s ChatGPT For Homework.

Wataru Zaitsu and Mingzhe Jin. 2023. Distinguishing ChatGPT(-3.5, -4)-generated and human-written papers through Japanese stylometric analysis. PLOS ONE, 18(8):e0288453.
A Discussion
A note about prompting. Different prompting methods have quite a large effect on the output generation and quality of LLMs (Salinas and Morstatter, 2024). While it is possible that different prompting methods could therefore have a large effect on the measured fingerprint of a single output, at the dataset level a little prompting variation seems not to be too detrimental. Specifically, every dataset we tested used a combination of prompting techniques (continuation, topic-based, chain-of-thought, etc.), and we still find LLM style markers which are useful for classification. Furthermore, we show in Table 3 that both linguistic and neural features with a GB classifier are robust to OUTFOX, an adversarial technique that uses different prompts to try to fool a detector.
We hope that the knowledge that LLM fingerprints can be exploited for quick, accurate, explainable classification will encourage more research into other straightforward and trustworthy methods of machine-text detection. In Appendix F, we detail the negative results from a Reinforcement Learning from Human Feedback (RLHF) setup in an attempt to remove the fingerprints. Our inconclusive RLHF experiments raise the question of how intrinsic the fingerprints are and whether they can be removed at all. If they cannot, then fingerprints may remain a strong classification signal for the task until other training procedures prove to produce fingerprint-less models.
Perhaps the most interesting insight from our work is that LLMs may be considered analogous to
human authors for the purpose of deepfake detection. We hope that this insight will bridge two research
avenues and help to develop a more robust theory of LLM generation.
Finally, we encourage those releasing datasets for machine-generated text detection to benchmark
against simple, feature-based classifiers.
B Fingerprint Characterization
Figure 5: Additional visualizations of fingerprints. Note that the POS tag distributions of OPT models are less similar to one another than we observe within other model families. Further investigations could examine what causes these differences, since model size does not appear to be a factor for the FLAN models.
Figure 6: Fingerprint characterization of Deepfake data by model and family. We report the Jensen-Shannon
Divergence of human vs. model for each model in each model family in the Deepfake data across four categories.
Columns from left to right: constituency type, named entity tag, POS tag, top-k word frequency. We omit the
GLM family in this visualization as there is only one model (130B) available. We do include the GLM results in
table format in Table 4. Like in Figure 1, some model families exhibit remarkably consistent fingerprints within
families, e.g. LLaMa, Flan, and OpenAI. OPT and EleutherAI in particular have less distinguishable fingerprints
within family.
Figure 7: Fingerprint characterization of Outfox data by model. Columns from left to right: constituency type, named entity tag, POS tag, top-k word frequency. We note again that ChatGPT and davinci, being in the same OpenAI model family, have very similar fingerprints, whereas Flan’s fingerprint differs substantially. Note that this davinci fingerprint looks different from the Deepfake davinci fingerprint, showing that there is some domain dependence to fingerprints, while underscoring the point that, regardless of domain, individual models of the same family produce similar-sounding texts.
model const ne pos top-k token
3.5-turbo 0.104062 0.121635 0.055647 0.065913
davinci-002 0.071464 0.085834 0.032347 0.048616
davinci-003 0.101841 0.101650 0.048786 0.050924
t0_11b 0.099334 0.116554 0.083307 0.046207
t0_3b 0.096778 0.096023 0.081568 0.044135
bloom_7b 0.113688 0.145603 0.096391 0.062792
llama_13B 0.084033 0.076458 0.055965 0.052204
llama_30B 0.083770 0.071776 0.055118 0.052044
llama_65B 0.083919 0.081378 0.052662 0.052289
GLM130B 0.092373 0.079987 0.055011 0.058290
gpt_j 0.091120 0.099940 0.079746 0.062767
gpt_neox 0.139291 0.126545 0.098795 0.150069
iml_30b 0.090410 0.101563 0.074169 0.056793
max_1.3b 0.074518 0.083632 0.052747 0.042955
opt_1.3b 0.076921 0.088855 0.056709 0.046881
opt_125m 0.066651 0.088566 0.047990 0.038409
opt_13b 0.191208 0.110141 0.162453 0.108174
opt_2.7b 0.095768 0.099351 0.075968 0.052447
opt_30b 0.129652 0.101543 0.106709 0.069312
opt_350m 0.134956 0.149431 0.120094 0.049036
opt_6.7b 0.077973 0.102048 0.060880 0.044933
t5_base 0.094128 0.078521 0.075754 0.054816
t5_large 0.093037 0.069552 0.075520 0.057391
t5_small 0.096973 0.090165 0.076765 0.053622
t5_xl 0.095449 0.076024 0.072617 0.050033
t5_xxl 0.090850 0.074514 0.068368 0.050967
Table 4: Jensen Shannon Divergence of human vs. model across fingerprint axes on Deepfake data. Here we
report in full the fingerprint measurements across constituency type (‘const’), named entity type (‘ne’), part-of-
speech (‘pos’), and top-k token frequency (‘top-k token’). Horizontal lines delineate model families.
Table 5: Jensen Shannon Divergence of human vs. model across fingerprint axes on Outfox data. Here we
report in full the fingerprint measurements across constituency type (‘const’), named entity type (‘ne’), part-of-
speech (‘pos’), and top-k token frequency (‘top-k token’).
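For reference, the divergences in Tables 4 and 5 can be computed from a pair of fingerprint distributions (e.g. human vs. model POS-tag frequencies) roughly as in the sketch below. This is illustrative rather than the authors' code and assumes frequency dictionaries of the kind produced in the earlier POS-fingerprint sketch.

```python
# Illustrative Jensen-Shannon divergence between two frequency distributions
# over the same fingerprint axis (POS tags, NE tags, constituency types, ...).
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(p: dict, q: dict) -> float:
    """JS divergence between two frequency dicts keyed by category."""
    keys = sorted(set(p) | set(q))
    p_vec = np.array([p.get(k, 0.0) for k in keys])
    q_vec = np.array([q.get(k, 0.0) for k in keys])
    # scipy returns the JS *distance* (the square root of the divergence),
    # so square it to obtain the divergence itself.
    return jensenshannon(p_vec, q_vec, base=2) ** 2
```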
C Additional Experiments
Fingerprints are “genetic”: often consistent within a model family. A particularly interesting finding is that our classifier is better at detecting out-of-domain texts from the same model than it is at detecting in-domain texts from a different model. In other words, Flan T5 “sounds” like T5 whether it is generating news stories or fan fiction. In Figure 3 we report the results of taking a classifier trained on data from one model and measuring the drop in machine recall when it is evaluated on either an in-domain test set from a different model or an out-of-domain test set from the same model. The classifier is far more effective in the latter case, and the 95% confidence interval computed with bootstrapping tells us that this is not due to particular oddities of the data splits we used, but that it is a meaningful result.
We also explicitly test how well a classifier trained on one model generalizes to (1) other models in
the same family and (2) other model families. We find that, on average, the drop in machine recall value
(out of 1) from in-domain data to other models in the same family is only 0.01, while the drop to other
families is 0.62. We report these results in Table 7.
AID methods are generally effective for LLM text detection in publicly available datasets.
In the most simplistic setup (cf. Table 12), binary classification of “human” or “machine” where

(a) Linguistic feature ablation on Deepfake, considering data produced by gpt-j-6b across all domains.

Feature Sets        F1    AUROC
character n-grams   0.96  0.99
word n-grams        0.77  0.82
POS-tag n-grams     0.88  0.95
all                 0.95  0.95

(b) Linguistic feature ablation on Outfox data for content generated by Flan T5 in the one domain present in Outfox (student essay).

Feature Sets        F1    AUROC
character n-grams   0.96  1.00
word n-grams        0.86  0.94
POS-tag n-grams     0.88  0.95
all                 0.95  0.95
Table 6: Linguistic feature ablation. For two single-model subsets of our data, we compare F1 and AUROC for character-, word-, or POS n-gram features alone against a combination of the three. In both cases, each feature set clearly carries some signal, although word n-grams individually perform the worst. This is likely because word-level features are highly influenced by topic and therefore may not be the most useful feature for helping the model generalize across text domains. Interestingly, character n-grams by themselves appear to be stronger than the combined feature set. This warrants more exploration, but we posit this is because the range of character n-grams captures high-frequency subword tokens in a given model’s vocabulary.
Table 7: Models exhibit individual writing styles which are more similar across domains than across model families. We report the average drop in performance of a GradientBoost binary classifier trained on Deepfake data. In 7 independent trials, we train a classifier on a randomly selected model and compare its performance on the in-domain test set to: (1) data from a model in the same family but in a held-out domain, and (2) data from a model in a different family but in the same domains present in the train set (this is made possible by the fact that Deepfake is multi-parallel). The performance drop is low over data from a model in the same family, and high over data from a model in a different family. The drop in human recall is small but not 0, as the human data is shuffled and downsampled, so the exact same set of prompts is not seen in every trial.
Table 8: Out-of-domain F1 results on HC3 data by domain. For each domain in the left-hand column, we train two models: (1) one where the training data covers all domains and the held-out test set is in this domain, and (2) one where the training data covers all domains except this domain, and the held-out test set is in this domain. The final row denotes an experiment in which we train on all domains mixed and test on held-out data which is all in-domain. Some combinations of domains are better for generalizing to unseen domains, which is a noted phenomenon in AID as well (Soto et al., 2021); however, in general, out-of-domain performance on HC3 is not as strong as it is for Deepfake data (Figure 2). This could potentially be improved by increasing the number of samples seen during training (which in each of these experiments was 5000).
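A minimal sketch of the two training regimes compared in Table 8 is given below, assuming feature vectors have already been extracted and grouped per domain; the helper and its argument names are hypothetical rather than the authors' implementation.

```python
# Compare in-domain vs. out-of-domain training for one target domain.
from sklearn.metrics import f1_score

def domain_transfer_f1(splits, target_domain, make_classifier):
    """splits: dict mapping domain -> (X_train, y_train, X_test, y_test),
    where X are already-extracted feature vectors."""
    X_test, y_test = splits[target_domain][2], splits[target_domain][3]

    def fit_and_eval(domains):
        X = [x for d in domains for x in splits[d][0]]
        y = [label for d in domains for label in splits[d][1]]
        clf = make_classifier()
        clf.fit(X, y)
        return f1_score(y_test, clf.predict(X_test))

    all_domains = list(splits)
    return {
        "train on all domains": fit_and_eval(all_domains),
        "train without target domain": fit_and_eval(
            [d for d in all_domains if d != target_domain]),
    }
```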
Model Accuracy F1 AUROC HumanRec MachineRec AvgRec
ChatGPT 0.946 0.949 0.996 1.000 0.891 0.946
BloomZ 0.992 0.992 1.000 1.000 0.984 0.992
Dolly 0.714 0.763 0.845 0.913 0.512 0.712
Cohere 0.776 0.723 0.938 0.579 0.976 0.778
Davinci 0.868 0.882 0.977 0.980 0.754 0.867
Table 9: BERT features + GradientBoost classifier on M4 data. Here we take 5,000 training samples, extract BERT pre-trained embeddings for them using Hugging Face’s feature-extraction pipeline, and train a GradientBoost classifier. We then test on in-domain data from a held-out test set of 500 examples. Dolly and Cohere fool the classifier the most; for Dolly, the low machine recall value is the cause of its low performance, perhaps indicating high linguistic variation in the model’s generations (i.e. it is easier to pinpoint what a ‘human’ text sounds like than what a ‘machine’ text sounds like). The opposite is true for Cohere: our model is able to find distinctive markers of machine text, but the human data is indistinctive.
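The BERT-feature setup of Table 9 might look roughly like the sketch below. The paper only states that Hugging Face feature extraction and a GradientBoost classifier were used; the bert-base-uncased checkpoint, mean pooling over tokens, and the direct AutoModel call are assumptions made here for illustration.

```python
# Sketch: document embeddings from BERT's last hidden state, then GradientBoost.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import GradientBoostingClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

@torch.no_grad()
def embed(texts, batch_size=16):
    """Mean-pooled BERT embeddings, one vector per document."""
    vecs = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                          max_length=512, return_tensors="pt")
        hidden = model(**batch).last_hidden_state          # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)        # ignore padding tokens
        vecs.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
    return np.vstack(vecs)

# train_texts / train_labels are placeholders for the 5,000 sampled examples.
clf = GradientBoostingClassifier()
clf.fit(embed(train_texts), train_labels)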
Table 10: Linguistic features + GradientBoost classifier on M4 data. Here we take 5,000 training samples, extract linguistic features (char-, word-, and POS n-grams) from them, and train a GradientBoost classifier. We then test on in-domain data from a held-out test set of 500 examples. Dolly and Cohere fool the classifier the most; in Table 9, Dolly exhibits a low machine recall value whereas Cohere exhibits high machine recall, and in this case the opposite is true. This warrants further exploration. Possibly, a combined approach of neural and linguistic features such as in Petukhova et al. (2024) would be more robust here.
data is from one model in one text-domain, our GradientBoost classifier with linguistic features achieves
upwards of 0.94 AUROC on every test.
We also compare the classifier performance on data produced by a single model, but in multiple domains
mixed, seen in Table 12. While performance, in general, is robust, M4’s Cohere data is a clear outlier,
only achieving 0.64 AUROC, perhaps indicating that it shows a high degree of linguistic diversity across
domain-specific generations. However, in a comparison of linguistic vs. neural features on the M4 dataset,
we find that the GB classifier can achieve high performance on Cohere (cf. Table 14).
In a multi-class setup (Table 2), where the classifier must distinguish between “human”, “model 1”, and
“model 2”, it achieves an average of 0.94 and 0.91 for the Ghostbuster and Outfox datasets, respectively.
All of the above are calculated using unseen data from the same distribution as the training data, but we
see strong out-of-domain performance as well in Figure 2. Choosing the largest model in each model
family available in Deepfake’s data, we calculate the F1 score for both an in-domain and out-of-domain
test set. Through multiple independent trials and our bootstrapped confidence intervals, we see that F1
on unseen domains has a larger variance, but that overall the difference in F1 score for in-domain and
out-of-domain performance is not statistically significant because the confidence intervals overlap. In
other words, linguistic features + ML classifiers do generalize well to data in unseen domains.
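For completeness, the bootstrap confidence intervals referenced above (n = 10,000 resamples, as in Figure 2) can be computed along these lines; the function is an illustrative sketch rather than the authors' implementation.

```python
# Bootstrap 95% confidence interval for F1 by resampling test predictions.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # sample with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(scores)), (float(lo), float(hi))
```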
D Implementation Details
D.1 GradientBoost
Parameters: learning rate 0.2, number of estimators 90, max depth 8, max features ’sqrt’, subsample ratio 0.8, random state 10, minimum samples per leaf 30, and minimum samples to split 400. These hyperparameters were optimized using scikit-learn’s grid search. Features: char n-grams (2,4), word n-grams (3,5), POS n-grams (3,5), with a maximum of 2,000 features per feature set.
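Assembled with scikit-learn, the configuration above corresponds roughly to the following sketch. The choice of CountVectorizer and of spaCy for producing POS-tag sequences are assumptions; only the hyperparameter values and n-gram ranges come from the description above.

```python
# Sketch of the linguistic-feature + GradientBoost classifier described in D.1.
import spacy
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "lemmatizer"])

def to_pos_sequences(texts):
    """Replace each document by its space-joined POS tags, e.g. 'DET NOUN VERB ...'."""
    return [" ".join(tok.pos_ for tok in doc) for doc in nlp.pipe(list(texts))]

features = FeatureUnion([
    ("char", CountVectorizer(analyzer="char", ngram_range=(2, 4), max_features=2000)),
    ("word", CountVectorizer(analyzer="word", ngram_range=(3, 5), max_features=2000)),
    ("pos", Pipeline([
        ("tags", FunctionTransformer(to_pos_sequences)),
        ("vec", CountVectorizer(analyzer="word", ngram_range=(3, 5), max_features=2000)),
    ])),
])

clf = Pipeline([
    ("features", features),
    ("gb", GradientBoostingClassifier(
        learning_rate=0.2, n_estimators=90, max_depth=8, max_features="sqrt",
        subsample=0.8, random_state=10, min_samples_leaf=30, min_samples_split=400,
    )),
])
# Usage: clf.fit(train_texts, train_labels); clf.predict(test_texts)
```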
Dataset Base Model Domain F1 AUROC HumanRec MachineRec AvgRec
Domain-Specific gpt-j-6b cmv 0.98 0.98 0.97 1.00 0.98
eli5 0.94 0.94 0.98 0.90 0.94
hswag 0.96 0.98 1.00 0.97 0.98
roct 0.99 0.99 1.00 0.98 0.99
sci_gen 0.99 0.99 0.99 0.98 0.99
squad 0.91 0.97 1.00 0.95 0.97
tldr 0.97 0.96 0.97 0.95 0.96
xsum 0.95 0.95 0.95 0.95 0.95
yelp 0.95 0.94 0.99 0.89 0.94
wp 0.98 0.98 1.00 0.96 0.98
Mixed Domains gpt-j-6b mixed 0.95 0.95 0.97 0.93 0.95
gpt-3.5-turbo mixed 0.90 0.97 0.91 0.91 0.91
flan-t5-xxl mixed 0.91 0.95 0.91 0.79 0.85
opt 30b mixed 0.98 0.99 0.99 0.93 0.96
llama 65B mixed 0.92 0.93 0.94 0.75 0.85
glm 130B mixed 0.94 0.95 0.98 0.78 0.88
davinci-003 mixed 0.82 0.92 0.80 0.86 0.83
Mixed Model Set OpenAI GPT mixed 0.70 0.73 0.64 0.82 0.73
Meta Llama* mixed 0.81 0.82 0.79 0.84 0.82
GLM-130B* mixed 0.86 0.86 0.89 0.83 0.86
Google FLAN-T5* mixed 0.83 0.85 0.77 0.93 0.85
Facebook OPT* mixed 0.86 0.87 0.87 0.86 0.87
BigScience mixed 0.79 0.81 0.74 0.88 0.81
EleutherAI mixed 0.95 0.95 0.96 0.94 0.95
Ghostbuster gpt-3.5-turbo Reuters 0.98 0.98 0.98 0.99 0.98
essay 0.98 0.97 0.99 0.96 0.97
wp 0.98 0.99 0.99 0.98 0.99
claude Reuters 0.97 0.96 0.99 0.94 0.96
essay 0.98 0.98 0.97 0.99 0.98
wp 0.95 0.95 0.94 0.97 0.95
HC3 gpt-3.5-turbo eli5 1.00 1.00 1.00 1.00 1.00
open_qa 0.97 1.00 0.96 0.98 0.97
wiki_csai 0.97 0.99 0.96 0.98 0.97
medicine 0.99 1.00 0.99 0.99 0.99
finance 0.97 0.99 0.99 0.96 0.97
OUTFOX gpt-3.5-turbo essay 0.95 0.94 1.00 0.89 0.94
text-DaVinci-003 essay 0.98 0.98 1.00 0.95 0.98
flan_t5_xxl essay 0.95 0.96 0.93 0.98 0.96
Table 11: Full In-Domain Results with GradientBoost classifier. Here we report classifier metrics for four
datasets, broken down by model, model family, and domain.
(a) Domain-Specific results: AUROC for binary classification where “model” data is from a single model in a single domain. Where we have data for more than one domain per model, we average them, reporting mean and standard deviation. Outfox data only has one domain (essay), so there is no standard deviation to report.

Dataset      Model              AUROC
Deepfake     gpt-j-6b           96.8 ± 1.8
HC3          gpt-3.5-turbo      99.6 ± 0.46
Ghostbuster  gpt-3.5-turbo      98 ± 0.81
             claude             96.3 ± 1.2
Outfox       gpt-3.5-turbo      94.0
             text-davinci-003   98.0
             flan-t5-xxl        96.0

(b) Mixed Domain results: For those datasets for which we have data from a single model in multiple domains, we report AUROC on a held-out test set.

Dataset      Model              AUROC
Deepfake     gpt-3.5-turbo      0.97
             flan-t5-xxl        0.95
             opt 30B            0.99
             llama 65B          0.93
             glm 130B           0.95
             davinci-003        0.92
M4           gpt-3.5-turbo      0.99
             BloomZ             1.0
             Dolly              0.88
             Cohere             0.64
             Davinci            0.97
HC3          gpt-3.5-turbo      1.0

Table 12: Single- and mixed-domain results with linguistic features and a GradientBoost classifier for our datasets of interest.
Table 13: Classifier performance on data produced by a single model in a mixture of domains. We report
classifier (linguistic + GradientBoost) performance as AUROC for Deepfake, M4, and HC3 datasets, which all
include data from one model across a variety of domains (Outfox only includes the essay domain). The classifier
proves broadly robust, except for Cohere data. These results are extracted from the full table in Table 11.
Table 14: Linguistic vs. BERT features. We report AUROC for binary classification on M4 data, comparing
linguistic and BERT features with the same machine learning classifier. For most models, the performance is
comparable, with BERT features slightly edging out linguistic features. Cohere is the outlier, proving to be a difficult
classification task for the linguistic set to capture. However, BERT features remain robust to Cohere data.
Dataset Base Model/Family Domain Human Machine
Domain-Specific gpt-j-6b cmv 509 636
eli5 952 863
hswag 1000 868
roct 999 833
sci_gen 950 529
squad 686 718
tldr 772 588
xsum 997 913
yelp 984 856
wp 940 784
Total 8789 7588
Mixed Model Set OpenAI GPT mixed 67k 67k
Meta Llama mixed 37k 37k
GLM-130B mixed 9k 9k
Google FLAN-T5 mixed 47k 47k
Facebook OPT mixed 80k 80k
BigScience mixed 27k 27k
EleutherAI mixed 14k 14k
Total 282k 282k
Ghostbuster gpt-3.5-turbo Reuters 500 500
essay 1000 1000
wp 500 500
Total 2000 2000
HC3 gpt-3.5-turbo eli5 17.1k 17.1k
open_qa 1.19k 1.19k
wiki_csai 842 842
medicine 1.25k 1.25k
finance 3.93k 3.93k
Total 24.3k 24.3k
OUTFOX gpt-3.5-turbo essay 15k 15k
text-davinci-003 essay 15k 15k
flan_t5_xxl essay 15k 15k
Total 46k 46k
Table 15: Dataset statistics (number of documents) for publicly available machine-generated text detection datasets.
E Dataset Information
E.1 Outfox
Outfox is a parallel human-machine dataset built on the Kaggle Feedback Prize dataset (Franklin et al.,
2022) and contains approximately 15,000 essay problem statements and human-written essays, ranging
in provenance from 6th to 12th grade native-speaking students in the United States. For each problem
statement, there is also an essay generated by each of three LLMs: ChatGPT (gpt-3.5-turbo-0613),
GPT-3.5 (text-davinci-003), and Flan (FLAN-T5-XXL). Each example contains an instruction prompt
(“Given the following problem statement, please write an essay in 320 words with a clear opinion.”), a
problem statement (“Explain the benefits of participating in extracurricular activities and how they can
help students succeed in both school and life. Use personal experiences and examples to support your
argument.”), the text of the essay, and a binary label for human or machine authorship.
While we conduct fingerprint analysis on the whole dataset, we use only the human-written subset of the Outfox data as a training corpus for our fine-tuning setup; given an instruction prompt and problem statement, we fine-tune our LLMs of interest to produce text which minimizes cross-entropy loss when compared with the original human-written response to the same problem statement. We withhold a test set of human-written examples from training to be used for evaluation.
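A minimal sketch of how an Outfox example could be formatted for this fine-tuning setup is given below, assuming a causal LM (e.g. the Llama-2 7b tokenizer) and the Hugging Face tokenizer API; the field names and prompt joining are illustrative, not the authors' exact formatting.

```python
# Sketch: build (input_ids, labels) so that cross-entropy is computed only on the
# human-written essay, given the instruction prompt and problem statement.
from transformers import AutoTokenizer

# Gated checkpoint; any causal-LM tokenizer would work for this illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def build_example(instruction: str, problem_statement: str, essay: str, max_len: int = 1024):
    prompt = f"{instruction}\n\n{problem_statement}\n\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(essay, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]
    input_ids = (prompt_ids + target_ids)[:max_len]
    # Mask the prompt tokens (-100) so the loss covers only the essay tokens.
    labels = ([-100] * len(prompt_ids) + target_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```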
E.2 Ghostbuster
Verma et al. (2023) provide three new datasets for evaluating AI-generated text detection in creative
writing, news, and student essays. Using prompts scraped from the subreddit r/WritingPrompts, the
Reuters 50-50 authorship identification dataset, and student essays from the online source IvyPanda, they
obtained ChatGPT- and Claude-generated responses and made efforts to maintain consistency in length
with human-authored content in each domain.
E.3 HC3
We also analyze data from Guo et al. (2023), which includes questions from publicly available datasets
and wiki sources with human- and ChatGPT-generated responses based on instructions and additional
context. The resulting corpus comprises 24,322 English and 12,853 Chinese questions, of which we only
use the English split.
E.4 Deepfake
The Deepfake corpus is a comprehensive dataset designed for benchmarking machine-generated content
detection in real-world scenarios (Li et al., 2023). It contains approximately 9,000 human examples across
10 text domains, each paired with machine outputs from 27 models (e.g. GPT-3.5-turbo, text-davinci-002)
from 7 different model families (e.g. OpenAI), producing several testbeds designed for examining a
detector’s sensitivity to model provenance and text domain. Each example contains the text, binary label
denoting human or machine, and the source information – which domain, model, and prompting method
were used.
We reserve the whole Deepfake dataset as an evaluation corpus to allow us to examine the robustness
of our proposed methods across different models, model families, and domains. For generation, we take
the first 30 tokens (as split by whitespace) of an example as the problem statement and prepend it with a
simple continuation instruction prompt: “Read the following instruction and generate appropriate text.
### Continue the following text:”.
Training Data. We primarily use the Deepfake and Outfox data for training classifiers to analyze
different aspects of the LLM fingerprints. They are both conveniently multi-parallel: they contain N model
responses for each human text sample in the dataset. This has the benefit of removing some uncertainty
from our classifier results. Performance on the human class is often identical across trials, as the human
data is often identical. This allows a controlled test of how our classifier deals with the machine text
samples. Additionally, the different testbeds in Deepfake provide convenient, parallel domain and model (or model family) data splits. Specifically, we use the mixed model sets and the model-specific, domain-specific testbeds from Deepfake.
F.2 Generation
For some experiments, we use LLaMA models and their instruction-tuned counterparts to generate
machine responses to human prompts in the Outfox and Deepfake datasets. Outfox’s examples are already
formatted with an instruction prompt (e.g. “Given the following problem statement, please write an essay in 320 words with a clear opinion.”) and an explicit problem statement (e.g. “Explain the benefits of
participating in extracurricular activities and how they can help students succeed in both school and life.
Use personal experiences and examples to support your argument.”), whereas Deepfake’s are not. To
remedy this, for every example in Deepfake, we take the first 30 tokens (as split by whitespace) of the
example text as the problem statement and prepend it with a simple continuation instruction prompt:
“Read the following instruction and generate appropriate text. ### Continue the following text:”. Further
information about generations, including model sampling parameters, may be found in Appendix E.4. We
conduct minimal pre- and post-processing on our input as well as the generated outputs, only removing
newlines and cleaning up whitespace.
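As a concrete illustration of the prompt construction described above (the exact joining of instruction and problem statement is an assumption):

```python
# Build a Deepfake continuation prompt: the first 30 whitespace-separated tokens
# of the example text become the problem statement, prefixed by the fixed
# instruction quoted above. The function name is illustrative.
INSTRUCTION = ("Read the following instruction and generate appropriate text. "
               "### Continue the following text:")

def make_continuation_prompt(example_text: str, n_tokens: int = 30) -> str:
    problem_statement = " ".join(example_text.split()[:n_tokens])
    return f"{INSTRUCTION} {problem_statement}"
```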
3 We do this because best practices for DPO from Rafailov et al. (2023) suggest that models should first be adapted to the text domain via fine-tuning before being trained with reinforcement learning.