Empirical Evidence of Large Language Model's Influence On Human Spoken Communication
Center for Adaptive Rationality, Max Planck Institute for Human Development, Germany
∗ Corresponding author: [email protected]
† These authors contributed equally to this work.
Abstract
From the invention of writing [1] and the printing press [2, 3], to television [4, 5] and social
media [6], human history is punctuated by major innovations in communication technology,
which fundamentally altered how ideas spread and reshaped our culture [7, 8]. Recent chat-
bots powered by generative artificial intelligence constitute a novel medium that encodes
cultural patterns in their neural representations and disseminates them in conversations
with hundreds of millions of people [9]. Understanding whether these patterns transmit into
human language, and ultimately shape human culture, is a fundamental question. While
fully quantifying the causal impact of a chatbot like ChatGPT on human culture is very
challenging, a lexical shift in human spoken communication may offer an early indicator
of such a broad phenomenon. Here, we apply econometric causal inference techniques [10] to
740,249 hours of human discourse from 360,445 YouTube academic talks and 771,591 con-
versational podcast episodes across multiple disciplines. We detect a measurable and abrupt
increase in the use of words preferentially generated by ChatGPT—such as delve, compre-
hend, boast, swift, and meticulous—after its release. These findings suggest a scenario where
machines, originally trained on human data and subsequently exhibiting their own cultural
traits, can, in turn, measurably reshape human culture. This marks the beginning of a closed
cultural feedback loop in which cultural traits circulate bidirectionally between humans and
machines [11]. Our results motivate further research into the evolution of human-machine
culture, and raise concerns over the erosion of linguistic and cultural diversity, and the risks
of scalable manipulation.
Introduction
Human culture has long been transformed by communication technologies [12, 13]. Writing
enabled complex social organizations [1]. The printing press subsequently democratized
knowledge, catalyzing economic and social revolutions [12, 3]. Cinema provided a shared
visual experience [12], and social media has since embedded humanity into a globalized
network [6, 14]. Now, generative AI—particularly Large Language Models (LLMs) like
Figure 1: (A) AI-driven linguistic patterns may potentially spread through diverse pathways (illustrated with delve), from writing a document to spontaneous conversation. (B, C) Datasets and transcription pipelines: 740,249 hours (7.35 billion words) of transcribed speech from talks on academic YouTube channels and from podcast episodes spanning Business, Education, Science & Technology, Sports, and Religion.
Specifically, we collected extensive data on human verbal communication from the In-
ternet and conducted a quasi-experiment to estimate the population-level causal impact of
ChatGPT’s release (Figs. 1B and C; see Constructing datasets of human spoken communication in Methods for details).
(Figure 2 panels: (A) Word frequency for delve in academic YouTube talks. (C) Top words preferred by ChatGPT.)
To begin, we created a dataset comprising 360,445 YouTube
videos focused on academic conversations. We chose this domain because the literature has demonstrated ChatGPT’s influence on academic papers [24, 25], including the increased use of delve, leading us to expect a similar effect on verbal communication. We later expanded our analysis to include 771,591 podcast episodes to investigate the influence in other conversational domains and in spontaneous communication, ensuring that the effect did not primarily stem from reading ChatGPT-generated scripts.
We then examined changes in word usage in human verbal communication before and
Figure 3: Trend changes of top GPT words in academic YouTube talks. (A) Linear
regression of the frequency of videos containing top GPT words and the corresponding synthetic
controls. Dots represent monthly aggregated frequencies, while lines depict temporal trends
modeled with a change point marked by the vertical dashed line. Words such as comprehend,
boast, and swift show a significant increase in usage, similar to delve. (B) Relationship between
the GPT scores of words (x-axis) and the magnitude of changes in their frequency trends (y-axis).
The bars represent the 95% highest-density interval of the posterior distribution, and the shaded
region in the inset chart represents the 95% confidence interval of a Gaussian process regression.
Focusing on the top 20 GPT words (i.e., the rightmost portion of the inset chart), we observe a set
of words that show an increase in usage of approximately 25% to 50% per year.
after the release of ChatGPT, focusing on GPT words (i.e., words strongly preferred by
the LLM), as demonstrated in Fig. 2A. To accomplish this, we quantified ChatGPT’s word
preferences by editing human-written texts with ChatGPT and comparing the word frequency
distributions of the original texts to those of the ChatGPT-edited versions (Fig. 2B; see Measuring word
preferences of large language models in Methods for details). The log-odds ratio derived
from the two versions (hereinafter, the GPT score) highlights ChatGPT’s strong preference
for delve, which was consistent across different versions of ChatGPT (Supplementary
Fig. S6), along with other words associated with ChatGPT [24, 25] (Fig. 2C).
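To illustrate how such a score can be computed, the following minimal Python sketch derives a smoothed log-odds ratio for each word from token counts of paired human-written and ChatGPT-edited corpora; the function name and the additive smoothing constant are illustrative assumptions, not the exact implementation described in Methods.

import math
from collections import Counter

def gpt_scores(human_tokens, gpt_tokens, alpha=1.0):
    # Smoothed log-odds ratio per word: positive values mark words that
    # are more prevalent in ChatGPT-edited text than in the human original.
    human_counts, gpt_counts = Counter(human_tokens), Counter(gpt_tokens)
    n_human, n_gpt = sum(human_counts.values()), sum(gpt_counts.values())
    scores = {}
    for word in set(human_counts) | set(gpt_counts):
        p_gpt = (gpt_counts[word] + alpha) / (n_gpt + 2 * alpha)
        p_human = (human_counts[word] + alpha) / (n_human + 2 * alpha)
        scores[word] = math.log(p_gpt / (1 - p_gpt)) - math.log(p_human / (1 - p_human))
    return scores

Under this construction, a word like delve receives a large positive score whenever ChatGPT inserts it far more often than the human authors did.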
To estimate the causal influence of ChatGPT’s release on spoken communication, we
employed the synthetic control method [10]. This approach aims to infer the counterfac-
tual usage of “treated” words—those with high GPT scores—in the absence of ChatGPT’s
introduction. It assumes that “untreated” words—those with near-zero GPT scores—whose
pre-release usage patterns resemble the treated word’s can project its counterfactual usage. For
this purpose, the method constructs a synthetic control for a treated word by taking the
convex combination of multiple untreated “donor” words that best matches its pre-release
usage (Fig. 2D). To test the robustness of our results, we employed alternative donor selec-
tion methods, including random sampling (see Estimating causal influence of ChatGPT in
Methods for details).
$$\log_{10}(y_{w,t}) = \alpha + \alpha_{\mathrm{GPT}}\, d_{\mathrm{GPT}} + \beta\, t + \beta_{\mathrm{Post}}\, d_{\mathrm{Post}}\,(t - T_{\mathrm{event}}) + \beta_{\mathrm{GPT,Post}}\, d_{\mathrm{GPT}}\, d_{\mathrm{Post}}\,(t - T_{\mathrm{event}}) + \varepsilon_{w,t},$$
where
$$d_{\mathrm{Post}} = \begin{cases} 1 & \text{if } t > T_{\mathrm{event}} \\ 0 & \text{otherwise} \end{cases} \quad \text{for } t \in [T_{\mathrm{start}}, T_{\mathrm{end}}],$$
and
$$d_{\mathrm{GPT}} = \begin{cases} 0 & \text{if } w \text{ is a synthetic control} \\ 1 & \text{if } w \text{ is a GPT word.} \end{cases} \tag{1}$$
In this model, log10 (yw,t ) represents the log relative frequency of YouTube videos containing
the word w in a specific time period t (sampled on a monthly basis, see Constructing datasets
of human spoken communication), and Tevent is set corresponding to the release of ChatGPT
(marked by the vertical dashed line in Fig. 3A). The coefficient β thus captures the
temporal trend common to the GPT word and the synthetic control prior to
the introduction of ChatGPT, while βPost captures the change in the temporal trend of
the synthetic control following the introduction of ChatGPT. βGPT,Post measures how the
change in trend between the pre- and post-ChatGPT periods differs between the
GPT word and the synthetic control, providing an estimate of ChatGPT’s causal impact on
word usage. Fig. 3A illustrates this effect as the difference between the orange and green
lines to the right of the vertical line. These results indicate that the linguistic shift in
academic communications after the release of ChatGPT also occurred for other words, such
as comprehend, boast, swift, and meticulous. This is clearly illustrated by the posterior
distribution of βGPT,Post shown in Supplementary Fig. S3.
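To make the regression design concrete, here is a minimal sketch of how the model in Eq. (1) could be fit by ordinary least squares; since we report posterior distributions, the actual estimation is Bayesian, so this frequentist version, and every variable name in it, should be read as an illustrative assumption.

import numpy as np
import statsmodels.api as sm

def fit_trend_change(y_gpt, y_ctrl, t, t_event):
    # Stack the GPT word's and its synthetic control's monthly log10
    # relative frequencies into a single interrupted-trend regression.
    y = np.concatenate([y_gpt, y_ctrl])
    tt = np.concatenate([t, t]).astype(float)
    d_gpt = np.concatenate([np.ones(len(t)), np.zeros(len(t))])
    d_post = (tt > t_event).astype(float)
    X = np.column_stack([
        tt,                               # beta: shared pre-release trend
        d_post * (tt - t_event),          # beta_Post: control's post-release trend change
        d_gpt,                            # level offset of the GPT word
        d_gpt * d_post * (tt - t_event),  # beta_GPT,Post: differential trend change
    ])
    return sm.OLS(y, sm.add_constant(X)).fit()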
Fig. 3B reports the estimated increase in usage frequency after the release of ChatGPT
(βGPT,Post ) for the set of words with the highest GPT score (x-axis). Many of the top GPT
words exhibit an annual growth in usage frequency of about 25% to 50%, yet for others, usage
has not changed significantly after ChatGPT’s release. This suggests that words preferred by
ChatGPT are not necessarily adopted into human verbal communication. However, we did
not observe any top GPT word exhibiting a significant decrease in usage, and, as can be seen
from the inset figure, words showing a significant increase in usage are primarily observed
among those with high GPT scores, i.e., words that ChatGPT outputs more frequently
than humans. This allows us to conclude that the increase in usage of GPT words after the
release date of ChatGPT has been influenced by interactions with the LLM.
used across domains (e.g., ChatGPT’s word preferences exhibit distinct occurrence patterns
depending on the domain in which they are used; see Fig. 5b).
These findings indicate that AI-driven linguistic shifts are not confined to structured,
scripted communication. Instead, they suggest a plausible deeper process of linguistic adap-
tation, where certain words become embedded in everyday speech rather than simply being
imitated. We discuss the implications of these findings in the following section.
Discussion
Our findings provide the first large-scale empirical evidence that AI-driven language shifts
are propagating beyond written text into spontaneous spoken communication. By analyzing
over 740,000 hours of transcribed YouTube academic talks and podcast episodes, we reveal
a measurable surge in words preferred by ChatGPT—including delve, boast, and swift—with
a causal association to its public release. This linguistic influence is not confined to domains
where LLM-generated text is integrated by early adopters—such as academia, science, and
technology—but it is spreading to other domains, such as education and business. Beyond
diversity in domains, this shift is also not limited to one mode of communication, such
as scripted or formal speech—often featured in YouTube academic talks—but extends to
unscripted, conversational, and spontaneous discourse characteristic of podcast episodes.
However, the mechanisms underlying this adoption remain unknown. It could be the
result of direct imitation [51], cognitive ease [52], deeper integration into human
thinking processes [53], or a combination of all three. The uptake of words preferred by
LLMs in real-time human-human interactions suggests a deeper cognitive process at play—
potentially an internalization of AI-driven linguistic patterns. Notably, language is known to
influence human cognition [54], and it now underpins modern machine intelligence as well.
Next-generation LLMs appear to move beyond ingesting large-scale text corpora; they also
internalize structured reasoning processes—such as step-by-step logical inference [20, 21]—
that are themselves embedded in language. This opens the possibility that those words
preferred by LLMs encode modes of thinking that can be internalized by humans, beyond a
mere word-adoption phenomenon. Understanding how such AI-preferred patterns become
woven into human cognition represents a new frontier for psycholinguistics and cognitive
science.
Yet, regardless of the precise mechanism—be it imitation, cognitive ease, or deeper
integration—this measurable shift marks a precedent: machines trained on human culture
are now generating cultural traits that humans adopt, effectively closing a cultural feedback
loop. The shifts we documented suggest a scenario in which AI-generated cultural traits are
measurably and irreversibly embedded in and reshaping human cultural evolution.
An immediate consequence is that these traits no longer remain confined to interactions
between humans and AI systems but instead can diffuse further throughout human-human
communication. For example, given the unprecedented scale of this shift, it is plausible that
the influence extends beyond 1-to-1 interactions with ChatGPT. The cultural traits can
now permeate discourse even among individuals who have never interacted directly with the
LLM, as we observed in spontaneous conversations of podcast episodes where these words
were mentioned. As a novel medium, generative AI provides unprecedented leverage over
cultural evolution, raising the prospect of unpredictable societal change.
A central concern of this development is cultural homogenization. If AI systems dis-
proportionately favor specific cultural traits, they may accelerate the erosion of cultural
diversity [55]. Compounding this threat is the fact that future AI models will train on data
increasingly dominated by AI-driven traits, further amplified by human adoption, thereby
reinforcing homogeneity in a self-perpetuating cycle. This phenomenon also poses a
hazard to AI itself. As specific patterns become monopolized, the risk of model collapse [56]
rises through a new pathway: even incorporating humans into the loop of training models
might not provide the required data diversity. This not only threatens cultural diversity but
also reduces the societal benefits of AI systems.
After such a cultural singularity—where the lines between human and machine cul-
ture [11] become increasingly blurred—long-standing norms of idea exchange, authority,
and social identity may also be altered, with direct implications for social dynamics. For
instance, language has functioned as a potent instrument of social distinction [57, 58]. While
LLMs can help level linguistic barriers [59, 60], especially for non-native speakers seeking to
communicate in formal English, they may also generate new biases. It is conceivable that
certain words preferred by LLMs, like delve, could become stereotypically associated with
lower skill or intellectual authority, thus reshaping perceptions of credibility and competence.
While our method provided causal evidence of AI-driven shifts in language use, language
itself—although central—is only one facet of the complexity of human culture. Future re-
search must explore parallel phenomena in other facets not only to mitigate risks but also
to understand the unprecedented ways in which AI is becoming a source of cultural inno-
vations [61, 11]. This urgency is amplified by the deepening interplay between geopolitical
interests and AI development [62], which may reshape cultural narratives in unforeseen
ways. From influencing public discourse to redefining social cohesion, our next question is
no longer whether machines influence us, but how profoundly and through which channels.
Methods
Here, we provide details about the experimental design and the key metrics used to quantify
the influence of LLMs on human spoken communication. We begin by describing the pipeline
to construct datasets of YouTube talks and podcasts, including the criteria for preprocessing
and filtering. We also explain the method for quantifying the GPT score, which measures
differences in word usage between human-written and ChatGPT-edited texts. Then, we
detail the analysis design used in our research to examine the causal influence of ChatGPT
on human verbal communication.
podcast series, as of April 2024. We then applied broad filters based on duration and
publication date, selecting episodes released between January 1, 2017, and the penultimate
quarter of 2024.
Next, we downloaded the selected podcasts, randomly selecting 6,000 podcasts per quarter
for each category to ensure a robust sample. Here, we used the pre-labeled categories given
by the podcast provider, PodcastIndex, which we mapped into broader, more general
categories for clarity. A small fraction of the episodes could not be retrieved or were not
available in MP3 format, so they were discarded. As detailed in later sections, the number of
usable podcasts decreases throughout our filtering pipeline. Thus, if any category resulted
in fewer than 3,500 transcripts per quarter after filtering, we re-ran the sampling to meet
this threshold.
across different time points. For example, in the YouTube transcripts, we found an unnatural
increase in the frequency of the filler word um starting around May 2020, which we
found difficult to attribute to an actual increase in speakers’ usage of the word. It is more
plausible that YouTube switched to a transcription model that transcribes fillers verbatim;
we therefore conducted our own transcription process to avoid this potential source of bias. Lastly,
we again applied the language detection library [64] to both datasets to filter out transcripts
from videos and podcasts that were judged as non-English by Whisper and transcribed into
other languages.
Given the above, we acknowledge that our datasets are not all-encompassing. Here, the
data collection procedure was designed to construct a dataset that can capture the influence
of LLMs on spoken language in a systematic manner. While recognizing the potential
presence of noise or missing data, we believe that the datasets are sufficiently comprehensive
to represent the phenomena without introducing bias.
Preprocessing
We preprocessed the obtained transcripts to capture essential changes in word frequency by
removing noise and highlighting relevant patterns. We followed a systematic procedure (a code sketch follows the list):
1. Tokenization: The text is divided into individual tokens (words) for processing.
2. Normalization: All words are converted to lowercase to ensure uniformity and avoid
duplication due to case differences.
3. Stop word removal: Commonly used words that do not carry significant semantic
meaning, such as and, the, and is, are removed. The list of stop words used in this
process is sourced from the Natural Language Toolkit (NLTK) library [69], which
provides a standard set of English stop words.
4. Non-alphabetic filtering: Words containing non-alphabetic characters are excluded,
ensuring only standard words are retained.
5. Length filtering: Words with fewer than three characters are removed to eliminate
overly short and potentially uninformative tokens.
6. Stemming: Words are reduced to their root forms using the Porter stemming algo-
rithm [70]. This algorithm applies a series of heuristic rules to iteratively strip suffixes
from words (e.g., running to run). Stemming reduces the total vocabulary size, facili-
tating more efficient analysis.
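A minimal Python sketch of these six steps, assuming NLTK’s standard English tokenizer, stop-word list [69], and Porter stemmer [70]; the function name and step grouping are ours:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)       # tokenizer models
nltk.download("stopwords", quiet=True)   # standard English stop words

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(transcript):
    # Steps 1-2: tokenize and lowercase.
    tokens = nltk.word_tokenize(transcript.lower())
    # Steps 3-5: drop stop words, non-alphabetic tokens, and tokens
    # shorter than three characters.
    kept = [t for t in tokens if t.isalpha() and len(t) >= 3 and t not in STOP_WORDS]
    # Step 6: reduce each word to its root form (e.g., running -> run).
    return [STEMMER.stem(t) for t in kept]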
For subsequent analyses, we calculated the log relative frequency of YouTube videos
and podcast episodes containing each presented word, sampled monthly. Older data may
exhibit different word usage trends due to factors such as the relatively low number of
YouTube videos and podcast episodes. Hence, we analyzed data spanning the four years before
the initial release of ChatGPT on November 30, 2022. Additionally, due to the timing
of data collection, the analysis includes data up to May 31, 2024, which is 18 months
post-release. We employed log frequency to facilitate trend interpretation within this early
diffusion phase, using Laplace smoothing [71] to account for zero counts, which helps detect
emerging patterns that may initially exhibit exponential growth.
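A short sketch of this transformation on monthly counts; the placement of the smoothing constants is our assumption, since only the use of Laplace smoothing [71] is specified:

import numpy as np

def monthly_log_rel_freq(n_containing, n_total, alpha=1.0):
    # Log10 relative frequency of videos/episodes containing a word in a
    # given month, Laplace-smoothed so that zero-count months stay finite.
    n_containing = np.asarray(n_containing, dtype=float)
    n_total = np.asarray(n_total, dtype=float)
    return np.log10((n_containing + alpha) / (n_total + 2 * alpha))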
Positive LOR values indicate higher prevalence in ChatGPT-edited texts, while negative
values suggest human-associated usage. This metric was computed independently across all
dataset–model–prompt combinations.
a GPT score that marginalizes over uncertainties in model usage patterns. Given the unknown
true distribution of ChatGPT’s usage across datasets (D), models (M), and prompts
(P), we adopted a Bayesian hierarchical model with non-informative Dirichlet priors,
$$P(D) \sim \mathrm{Dirichlet}(\mathbf{1}), \qquad P(M \mid D) \sim \mathrm{Dirichlet}(\mathbf{1}), \qquad P(P \mid D, M) \sim \mathrm{Dirichlet}(\mathbf{1}),$$
where $\mathbf{1}$ denotes flat priors for each parameter. The joint distribution $P(D, M, P)$ was
computed as
$$P(D, M, P) = P(D) \cdot P(M \mid D) \cdot P(P \mid D, M).$$
For each of 1,000 Monte Carlo samples from this prior distribution, we calculated weighted
word probabilities:
$$\hat{p}_w = \sum_{d \in D} \sum_{m \in M} \sum_{p \in P} p_w^{(d,m,p)} \cdot \lambda(d, m, p),$$
Rationale for Bayesian weighting The Dirichlet prior structure reflects maximum
entropy assumptions about potential correlations between datasets, models, and prompts.
By sampling from the joint prior distribution, we emulate the variability expected under
real-world deployment scenarios where specific GPT-family model, dataset, and prompt
combinations are not systematically favored. The resulting GPT scores thus represent robust
centrality estimates of word preferences across plausible usage distributions.
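As a sketch of this Monte Carlo procedure (array shapes and function names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

def sample_joint_weights(n_d, n_m, n_p):
    # Draw lambda(d, m, p) = P(d) * P(m | d) * P(p | d, m) with flat
    # Dirichlet(1) priors at every level of the hierarchy.
    p_d = rng.dirichlet(np.ones(n_d))
    p_m_given_d = rng.dirichlet(np.ones(n_m), size=n_d)
    p_p_given_dm = rng.dirichlet(np.ones(n_p), size=(n_d, n_m))
    return p_d[:, None, None] * p_m_given_d[:, :, None] * p_p_given_dm

def weighted_word_prob(p_w, n_samples=1000):
    # p_w: array of shape (n_d, n_m, n_p) holding one word's probability
    # under each dataset-model-prompt combination; returns Monte Carlo
    # samples of the weighted probability p_hat_w.
    return np.array([
        float((sample_joint_weights(*p_w.shape) * p_w).sum())
        for _ in range(n_samples)
    ])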
Donor selection
For each treated word, the synthetic control method requires the selection of a set of donor
words that are used to build the synthetic control. Following recommendations in the
literature [72], and given the nature of the optimization algorithm (we specifically used
sequential least squares programming), we restricted the number of donor words to
approximately 100. To avoid arbitrary selection and to assess the robustness of the results,
we developed three different approaches to select the words in the donor pool: untreated,
synonym, and random. The results from the last two approaches are included in the Sup-
plementary Material.
Synonym donors This approach consisted of selecting the 100 words that are most
semantically similar to the treated word without restricting the GPT score to be close to
zero. We then constructed synthetic controls as a convex combination of these words.
Random donors This approach involved randomly sampling 100 words from all avail-
able words. It was set to examine the generalizability of the results to the case with minimal
assumptions.
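A minimal sketch of the weight-fitting step shared by all three donor strategies, assuming SciPy’s SLSQP optimizer as named above (variable names are ours):

import numpy as np
from scipy.optimize import minimize

def fit_synthetic_control(treated_pre, donors_pre):
    # treated_pre: (T,) pre-release usage series of the treated word.
    # donors_pre: (T, n_donors) pre-release usage series of donor words.
    n_donors = donors_pre.shape[1]
    w0 = np.full(n_donors, 1.0 / n_donors)
    result = minimize(
        lambda w: np.mean((treated_pre - donors_pre @ w) ** 2),
        w0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_donors,
        # Convex combination: non-negative weights summing to one.
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x

The fitted weights are then applied to the donors’ post-release series to project the counterfactual usage of the treated word.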
Placebo test
We assessed the significance of the causal effect using a placebo test based on the ratio of
post- to pre-treatment mean squared prediction error (MSPE) [74]. This approach assumes
that, in the absence of an effect, the ratio of post- to pre-treatment MSPE between the
treated word and its synthetic control would be similar for words that are actually treated
and for those that are not treated. To test this null hypothesis, we employed a permutation
test where, given a treated word, we applied the synthetic control method to each word in
the donor pool and computed the ratio of post- to pre-treatment MSPE. We then reject
the null hypothesis of no effect if the ratio for the treated word falls within the top five percent
of the distribution of ratios. Additionally, to examine whether the change in usage of the
target GPT word is attributable to the ChatGPT release, we performed an in-time placebo
test. Specifically, following [10], we compared the MSPE ratio with those of fake treatments,
which were set every three months over the two years preceding the ChatGPT release.
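A sketch of the MSPE-ratio statistic and the resulting permutation-style p-value (function names are illustrative):

import numpy as np

def mspe_ratio(actual, synthetic, post_mask):
    # Ratio of post- to pre-treatment mean squared prediction error.
    err = (np.asarray(actual) - np.asarray(synthetic)) ** 2
    post_mask = np.asarray(post_mask, dtype=bool)
    return err[post_mask].mean() / err[~post_mask].mean()

def placebo_p_value(treated_ratio, donor_ratios):
    # Share of placebo (donor) ratios at least as large as the treated
    # word's ratio; we reject the null when this falls below 0.05.
    donor_ratios = np.asarray(donor_ratios)
    return (donor_ratios >= treated_ratio).mean()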
Acknowledgements
H.Y. is supported in part by Japan Science and Technology Agency (JST) PRESTO Grant
Number JPMJPR246B. E.L.-L. is funded by the Deutsche Forschungsgemeinschaft (DFG),
(a) Word preferences when revising arXiv abstracts. (b) Word preferences of revisions by GPT-4. (Axes show log odds ratios for words such as delve, underscore, crucial, swiftly, and necessity.)
Word preference patterns exhibited notable stability across GPT-family models (Fig. 5A),
suggesting these biases emerge from intrinsic characteristics of the training pipeline rather
than version-specific training. However, specifically for delve, we found a decreasing preference
in newer models. While the odds ratio for delve in GPT-3.5-turbo– and GPT-4–revised
arXiv abstracts exceeded 300:1 relative to human texts, this ratio declined to approximately
100:1 for GPT-4-turbo and 40:1 for GPT-4o, indicating iterative mitigation of this stylistic
artifact.
LOR magnitudes varied substantially across source corpora (Fig. 5B). Analysis of log-
probability distributions (Supplementary Fig. S6) revealed this variability stems primarily
from baseline differences in human word choices. For instance, while humans rarely use un-
derscore in essays, GPT revisions introduced this term frequently across domains including
essays.
Focusing on scientific abstracts (arXiv, bioRxiv, Nature) and GPT-family models (GPT-
3.5-turbo, GPT-4, GPT-4-turbo, GPT-4o), we computed a weighted GPT score by marginal-
izing over model, prompt, and dataset combinations. Here, delve emerged as the most
strongly overused term (LOR > 4), followed by underscore, comprehend, bolster, boast,
swift, inquiry, meticulous, pinpoint, and groundbreak (LOR > 2.5) (Fig. 2).
References
[1] Jack Goody and Ian Watt. The consequences of literacy. Comparative Studies in Society
and History, 5(3):304–345, 1963.
[2] Jeremiah Dittmar. Information technology and economic change: The impact of the
printing press. The Quarterly Journal of Economics, 126(3):1133–1172, 2011.
[3] Jared Rubin. Printing and Protestants: An empirical test of the role of printing in the
Reformation. Review of Economics and Statistics, 96(2):270–286, 2014.
[4] Robert Putnam. Bowling Alone: The Collapse and Revival of American Community.
Simon & Schuster, New York, USA, 2000.
[5] Robert Jensen and Emily Oster. The power of TV: Cable television and women’s status
in India. The Quarterly Journal of Economics, 124(3):1057–1094, 2009.
[6] Soroush Vosoughi, Deb Roy, and Sinan Aral. The spread of true and false news online.
Science, 359(6380):1146–1151, 2018.
[7] Everett M. Rogers. Diffusion of innovations. Simon & Schuster, New York, USA, 5th
edition, 2003.
[8] Cristian Jara-Figueroa, Amy Z. Yu, and César A. Hidalgo. How the medium shapes the
message: Printing and the rise of the arts and sciences. PLOS ONE, 14(2):e0205771,
2019.
[9] Andrew Ross Sorkin and Sarah Kessler. Sam Altman on Microsoft, Trump
and Musk. New York Times: https://www.nytimes.com/2024/12/14/business/
dealbook/sam-altman-dealbook.html, December 2024.
[10] Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Synthetic control methods for
comparative case studies: Estimating the effect of California’s tobacco control program.
Journal of the American Statistical Association, 105(490):493–505, 2010.
[12] Marshall McLuhan. Understanding Media: The Extensions of Man. McGraw-Hill, New
York, USA, 1964.
[13] Joseph Henrich. The Secret of Our Success: How Culture Is Driving Human Evolu-
tion, Domesticating Our Species, and Making Us Smarter. Princeton University Press,
Princeton, USA, 2016.
[14] Neil F. Johnson, Nicolas Velásquez, Nicholas Johnson Restrepo, Rhys Leahy, Nicholas
Gabriel, Sara El Oud, Minzhang Zheng, Pedro Manrique, Stefan Wuchty, and Yonatan
Lupu. The online competition between pro- and anti-vaccination views. Nature,
582(7811):230–233, 2020.
[15] Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenen-
baum, and Evelina Fedorenko. Dissociating language and thought in large language
models. Trends in Cognitive Sciences, 28(6):517–540, 2024.
[16] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand-
hini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.
Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato,
Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural
Information Processing Systems, volume 33, pages 1877–1901, Online, 2020. Curran
Associates.
[18] Esin Durmus, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, An-
ton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph,
Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared
Kaplan, Jack Clark, and Deep Ganguli. Towards measuring the representation of sub-
jective global opinions in language models. In Proceedings of the 1st Conference on
Language Modeling, pages 1–27, Philadelphia, USA, 2024. OpenReview.
[19] Marcel Binz, Elif Akata, Matthias Bethge, Franziska Brändle, Fred Callaway, Julian
Coda-Forno, Peter Dayan, Can Demircan, Maria K. Eckstein, Noémi Élteto, Thomas L.
Griffiths, Susanne Haridi, Akshay K. Jagadish, Ji-An Li, Alexander Kipnis, Sreejan
Kumar, Tobias Ludwig, Marvin Mathony, Marcelo G. Mattar, Alireza Modirshanechi,
Surabhi S. Nath, Joshua C. Peterson, Milena Rmus, Evan M. Russek, Tankred Saanum,
Natalia Scharfenberg, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi,
Xin Sui, Mirko Thalmann, Fabian J. Theis, Vuong Truong, Vishaal Udandarao, Kon-
stantinos Voudouris, Robert Wilson, Kristin Witte, Shuchen Wu, Dirk Wulff, Huadong
Xiong, and Eric Schulz. Centaur: A foundation model of human cognition. arXiv,
2410.20268:1–141, 2024.
[20] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia,
Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reason-
ing in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle
Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Sys-
tems, volume 35, pages 24824–24837, New Orleans, USA, 2022. Curran Associates.
[21] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang,
Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai
Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu,
Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao,
Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang
Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li,
H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo
Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang
Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai
Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang
Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang,
Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang,
Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang,
Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan
Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou,
Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei,
Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun
Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang,
Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang,
Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha
Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia
Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao,
Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He,
Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan
Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo,
Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang,
Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan,
Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang,
Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu,
Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu
Zhang, and Zhen Zhang. DeepSeek-R1: Incentivizing reasoning capability in LLMs via
reinforcement learning. arXiv, 2501.12948:1–22, 2025.
[22] Thomas H Costello, Gordon Pennycook, and David G Rand. Durably reducing con-
spiracy beliefs through dialogues with AI. Science, 385(6714):eadq1814, 2024.
[23] Robert Boyd and Peter J Richerson. Culture and the evolutionary process. University
of Chicago Press, Chicago, USA, 1988.
[24] Weixin Liang, Yaohui Zhang, Zhengxuan Wu, Haley Lepp, Wenlong Ji, Xuandong
Zhao, Hancheng Cao, Sheng Liu, Siyu He, Zhi Huang, Diyi Yang, Christopher Potts,
Christopher D Manning, and James Y. Zou. Mapping the increasing use of LLMs in
scientific papers. In Proceedings of the 1st Conference on Language Modeling, pages
1–27, Philadelphia, USA, 2024. OpenReview.
[25] Dmitry Kobak, Rita González Márquez, Emöke-Ágnes Horvát, and Jan Lause. Delv-
ing into ChatGPT usage in academic writing through excess vocabulary. arXiv,
2406.07016:1–13, 2024.
[26] Mingmeng Geng and Roberto Trotta. Is ChatGPT transforming academics’ writing
style? In Proceedings of the ICML 2024 Workshop on the Next Generation of AI
Safety, pages 1–14, Vienna, Austria, 2024. OpenReview.
[27] Krystal Hu. ChatGPT sets record for fastest-growing user base
– analyst note. Reuters: https://www.reuters.com/technology/
chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/,
February 2023.
[28] Andy Extance. ChatGPT has entered the classroom: how LLMs could transform
education. Nature, 623(7987):474–477, 2023.
[29] Bahar Memarian and Tenzin Doleck. ChatGPT in education: Methods, potentials, and
limitations. Computers in Human Behavior: Artificial Humans, 1(2):100022, 2023.
[32] Andreas Liesenfeld, Alianda Lopez, and Mark Dingemanse. Opening up ChatGPT:
Tracking openness, transparency, and accountability in instruction-tuned text gener-
ators. In Minha Lee, Cosmin Munteanu, Martin Porcheron, Johanne Trippas, and
Sarah Theres Völkel, editors, Proceedings of the 5th International Conference on Con-
versational User Interfaces, pages 47:1–47:6, Eindhoven, The Netherlands, 2023. ACM.
[33] Billy Perrigo. OpenAI used Kenyan workers on less than $2 per hour to make ChatGPT
less toxic. Time: https://time.com/6247678/openai-chatgpt-kenya-workers/, Jan-
uary 2023.
[34] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud,
Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori
Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abili-
ties of large language models. Transactions on Machine Learning Research, 2022(08):1–
30, 2022.
[35] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Prox-
imal policy optimization algorithms. arXiv, 1707.06347:1–12, 2017.
[36] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela
Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schul-
man, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter
Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models
to follow instructions with human feedback. In Sanmi Koyejo, S. Mohamed, A. Agar-
wal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information
Processing Systems, volume 35, pages 27730–27744, New Orleans, USA, 2022. Curran
Associates.
[37] Alberto Acerbi and Joseph M Stubbersfield. Large language models show human-like
content biases in transmission chain experiments. Proceedings of the National Academy
of Sciences, 120(44):e2313790120, 2023.
[38] Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. AI
generates covertly racist decisions about people based on their dialect. Nature,
633(8028):147–154, 2024.
[39] Steffen Herbold, Annette Hautli-Janisz, Ute Heuer, Zlata Kikteva, and Alexander
Trautsch. A large-scale comparison of human-written versus ChatGPT-generated es-
says. Scientific Reports, 13(18617):1–11, 2023.
[40] Peter S. Park, Philipp Schoenegger, and Chongyang Zhu. Diminished diversity-of-
thought in a standard large language model. Behavior Research Methods, 56:5754–5770,
2024.
[41] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell,
Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. John-
ston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver
Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards un-
derstanding sycophancy in language models. In Proceedings of the 12th International
Conference on Learning Representations, pages 1–33, Vienna, Austria, 2024. OpenRe-
view.
[42] Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori
Hashimoto. Whose opinions do language models reflect? In Andreas Krause, Emma
Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett,
editors, Proceedings of the 40th International Conference on Machine Learning, pages
29971–30004, Honolulu, USA, 2023. Proceedings of Machine Learning Research.
[43] Jochen Hartmann, Jasper Schwenzow, and Maximilian Witte. The political ideol-
ogy of conversational AI: Converging evidence on ChatGPT’s pro-environmental, left-
libertarian orientation. arXiv, 2301.01768:1–21, 2023.
[44] M. Keith Chen. The effect of language on economic behavior: Evidence from savings
rates, health behaviors, and retirement assets. American Economic Review, 103(2):690–
731, 2013.
[45] Patricia M. Greenfield. The changing psychology of culture from 1800 through 2000.
Psychological Science, 24(9):1722–1731, 2013.
[46] Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. From pretraining
data to language models to downstream tasks: Tracking the trails of political biases
leading to unfair NLP models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki
Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Com-
putational Linguistics, pages 11737–11762, Toronto, Canada, 2023. ACL.
[47] Tom S. Juzek and Zina B. Ward. Why does ChatGPT “delve” so much? Exploring
the sources of lexical overrepresentation in large language models. In Owen Rambow,
Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven
Schockaert, editors, Proceedings of the 31st International Conference on Computational
Linguistics, pages 6397–6411, Abu Dhabi, UAE, 2025. ACL.
[48] Matthew Gentzkow and Jesse M. Shapiro. What drives media slant? Evidence from
U.S. daily newspapers. Econometrica, 78(1):35–71, 2010.
[49] Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K.
Gray, Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig,
Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative
analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.
[50] Vittorio Tantucci and Aiqing Wang. Shifts in conversational engagement: Evidence
from 20 years of British English corpora. Applied Linguistics, 2024.
[51] Martin J. Pickering and Simon Garrod. Toward a mechanistic psychology of dialogue.
Behavioral and Brain Sciences, 27(02), 2004.
[52] Alexandra A. Cleland and Martin J. Pickering. The use of lexical and syntactic infor-
mation in language production: Evidence from the priming of noun-phrase structure.
Journal of Memory and Language, 49(2):214–230, 2003.
[53] Peter Gordon. Numerical cognition without words: Evidence from Amazonia. Science,
306(5695):496–499, 2004.
[54] Lera Boroditsky. How language shapes thought. Scientific American, 304(2):62–65,
2011.
[55] Jason W. Burton, Ezequiel Lopez-Lopez, Shahar Hechtlinger, Zoe Rahwan, Samuel
Aeschbach, Michiel A. Bakker, Joshua A. Becker, Aleks Berditchevskaia, Julian Berger,
Levin Brinkmann, Lucie Flek, Stefan M. Herzog, Saffron Huang, Sayash Kapoor,
Arvind Narayanan, Anne-Marie Nussberger, Taha Yasseri, Pietro Nickl, Abdullah Al-
maatouq, Ulrike Hahn, Ralf H. J. M. Kurvers, Susan Leavy, Iyad Rahwan, Divya
Siddarth, Alice Siu, Anita W. Woolley, Dirk U. Wulff, and Ralph Hertwig. How
large language models can reshape collective intelligence. Nature Human Behaviour,
8(9):1643–1655, 2024.
[56] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross J. Anderson,
and Yarin Gal. AI models collapse when trained on recursively generated data. Nature,
631(8022):755–759, 2024.
[57] Pierre Bourdieu. Language and Symbolic Power. Harvard University Press, Cambridge,
USA, 1991.
[58] William Labov. The Social Stratification of English in New York City. Cambridge
University Press, Cambridge, UK, 2nd edition, 2006.
[59] Shakked Noy and Whitney Zhang. Experimental evidence on the productivity effects
of generative artificial intelligence. Science, 381(6654):187–192, 2023.
[60] Xiaofei Wang, Hayley M. Sanders, Yuchen Liu, Kennarey Seang, Bach Xuan Tran,
Atanas G. Atanasov, Yue Qiu, Shenglan Tang, Josip Car, Ya Xing Wang, Tien Yin
Wong, Yih-Chung Tham, and Kevin C. Chung. ChatGPT: Promise and challenges for
deployment in low- and middle-income countries. The Lancet Regional Health - Western
Pacific, 41:100905, 2023.
[61] Levin Brinkmann, Deniz Gezerli, Kira von Kleist, Thomas Franz Müller, Iyad Rahwan,
and Niccolo Pescetelli. Hybrid social learning in human-algorithm cultural transmis-
sion. Philosophical Transactions of the Royal Society A: Mathematical, Physical and
Engineering Sciences, 380(2227):20200426, 2022.
[62] Yasheng Huang. Why US–China relations are too important to be left to politicians.
Nature, 631(8022):736–739, 2024.
[64] Shuyo Nakatani. Language detection library for Java, 2010. https://www.slideshare.
net/slideshow/language-detection-library-for-java/6014274 (Accessed on July
31, 2024).
[65] Alexis Plaquet and Hervé Bredin. Powerset multi-class cross entropy loss for neural
speaker diarization. In Naomi Harte, Julie Carson-Berndsen, and Gareth Jones, editors,
Proceedings of the 24th Interspeech Conference, pages 3222–3226, Dublin, Ireland, 2023.
ISCA.
[66] Hervé Bredin. pyannote.audio 2.1 speaker diarization pipeline: Principle, benchmark,
and recipe. In Naomi Harte, Julie Carson-Berndsen, and Gareth Jones, editors, Pro-
ceedings of the 24th Interspeech Conference, pages 1983–1987, Dublin, Ireland, 2023.
ISCA.
[67] Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. WhisperX: Time-
accurate speech transcription of long-form audio. In Naomi Harte, Julie Carson-
Berndsen, and Gareth Jones, editors, Proceedings of the 24th Interspeech Conference,
pages 4489–4493, Dublin, Ireland, 2023. ISCA.
[68] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and
Ilya Sutskever. Robust speech recognition via large-scale weak supervision. arXiv,
2212.04356:1–28, 2022.
[69] Steven Bird. NLTK: The natural language toolkit. In Nicoletta Calzolari, Claire Cardie,
and Pierre Isabelle, editors, Proceedings of the 44th Annual Meeting of the Association
for Computational Linguistics, pages 69–72, Sydney, Australia, 2006. ACL.
[70] Martin F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[72] Alberto Abadie and Jaume Vives-i-Bastida. Synthetic controls in action. arXiv,
2203.06279:1–39, 2022.
[73] Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word
representations in vector space. In Proceedings of the 1st International Conference on
Learning Representations, pages 1–9, Scottsdale, USA, 2013. OpenReview.
[74] Bruno Ferman and Cristine Pinto. Placebo tests for synthetic controls. Technical
Report 78079, University Library of Munich, Germany, 2017.
Supplementary Materials
Materials and Methods
Datasets to compute Log Odds Ratios of human and LLM word usage
arXiv The arXiv dataset contains abstracts from research papers that were published
on the arXiv website. We used the arXiv API8 to extract 150 papers from five different
categories, namely Computer Science, Electrical Engineering and Systems Science, Mathe-
matics, Physics, and Statistics, each month from 2019 to 2022. All categories were further
divided into 133 subcategories, and we gathered 7,182 abstracts in total.
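For illustration, a minimal query against the arXiv API; the category code, paging values, and use of the feedparser library are examples rather than our exact collection script:

import urllib.parse

import feedparser  # parses the Atom feed returned by the arXiv API

def fetch_arxiv_abstracts(category="cs.CL", start=0, max_results=50):
    # Query one category page; each returned entry's `summary` field
    # holds the paper abstract.
    query = urllib.parse.urlencode({
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    })
    feed = feedparser.parse(f"http://export.arxiv.org/api/query?{query}")
    return [entry.summary for entry in feed.entries]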
bioRxiv The bioRxiv abstracts were gathered using the bioRxiv API.9 We ran a brute-force
query from the start to the end of each month for four years (2019–2022). We collected 60
papers per month, which amounted to 2,880 papers in total. Together, arXiv and bioRxiv accounted for
roughly 10,000 unique abstracts.
Nature Nature abstracts were collected using the search engine on the Nature website.
The process involved querying for up to 20 pages, each displaying 50 results. Our goal was to
collect between 7,000 and 10,000 abstracts to match a comparable size of our other datasets.
To achieve this, we ran a query without specific search terms, focusing on publications from
2019 to 2023. The results were sorted in two ways: ascending and descending per year. This
dual sorting approach yielded over 8,000 unique abstracts, sufficient for our purposes.
Emails The email dataset was sourced from the publicly available Enron email dataset
on Kaggle.10 The original dataset contains 500,000 emails generated by employees of the
Enron Corporation. However, for our use case, we randomly sampled 10,000 emails and
processed them into a dataset. The emails were sent between 2000 and 2001, well before the
introduction of ChatGPT.
Essays This dataset comprises student essays collected from The Hewlett Foundation:
Automated Essay Scoring challenge on Kaggle.11 The goal of this challenge was to develop
an automated scoring algorithm for essays. All of the essays in the challenge, which was
released in 2012, were composed by students. Similar to other datasets, we sampled 10,000
essays for our analysis.
News The original dataset consisted of 210,000 news headlines from 2012 to 2022 from
HuffPost. The dataset is available on Kaggle with 42 different news categories.12 For our
purposes, we sampled 10,000 short descriptions of the news articles that served the purpose
of news abstracts.
Wikipedia We utilized the Wikipedia API13 to pull the articles from Wikipedia. One
restriction of the API is that it only returns the article title and article ID. Because of
duplicate articles, we queried 30,000 of them. After eliminating duplicates, we extracted
the articles’ dates of publication, keeping just those released between
2019 and 2022. We then used the page titles to randomly collect the content of 10,000 articles
from this chosen subset.
8 https://info.arxiv.org/help/api/index.html
9 https://api.biorxiv.org/
10 https://www.kaggle.com/datasets/wcukierski/enron-email-dataset
11 https://www.kaggle.com/competitions/asap-aes
12 https://www.kaggle.com/datasets/rmisra/news-category-dataset
13 https://www.mediawiki.org/wiki/API:Main_page
Supplementary Figures
Figure S1: Robustness of the findings presented in Fig. 2A across different donor
selection strategies. (Left: synonym donors, Right: random donors) The observed frequency
increase of delve in academic YouTube talks after ChatGPT’s release remains consistent when
employing different donor words for synthetic control construction.
Figure S3: Posterior distribution of the event-related coefficient βw,event,GPT for top
GPT words following ChatGPT’s release. The distribution quantifies the distinct trend
change in word frequency specifically attributed to GPT words, compared to synthetic controls,
after the release date. The shaded regions represent the 95% credible intervals, demonstrating
the robustness of the observed linguistic shifts in multiple words.
Figure S4: Trend changes of top GPT words across different podcast topics. The syn-
thetic controls were constructed from untreated donors, and the same model as (1) was applied.
While Science and Technology tended to exhibit the most significant increases in the frequency
of GPT words, some GPT words (e.g., boast, swift, and meticulous) show similar trends across
podcast topics.
System prompt:
You are a great research assistant who is asked to analyze YouTube data. You
will be provided a list of YouTube channels as well as a target information.
Please select the best channel that seems to be owned by the target.
Importantly, please do not add explanations or comments other than the
selected channel name. If there is no appropriate channel, please return N/A.
# Candidates
Figure S5: Prompt provided to gpt-3.5-turbo-0125 to pick the most plausible channel among
query results from the YouTube API.
(Figure S6 panels: log probability of selected words, including encompass, align, and comprehend, across datasets (arXiv, bioRxiv, Nature Abstracts, Emails, Essays, News, Wikipedia) for gpt3.5-turbo, gpt4, gpt4-turbo, gpt4o, and human text.)
Figure S6: Log probabilities of human and LLM-revised text. We calculated the log-
probability of a word appearing in human-authored text and its appearance in a version of
the same text revised by different LLMs. Each colored point represents the log-probability
for a specific combination of model, dataset, and prompt. The log-probability for the original
human-authored text is shown in gray. Some LLM calls failed due to various reasons, such as
policy violations. Consequently, the corresponding human-authored texts were removed from
the dataset, introducing slight variations in the associated probabilities, even though the source
dataset remained identical.
(Figure S7 panels: log odds ratio per word by model (gpt3.5-turbo, gpt4, gpt4-turbo, gpt4o), stratified by dataset: arXiv, bioRxiv, Nature Abstracts, Emails, Essays, News, Wikipedia.)
Figure S7: Log-Odds ratios (LORs) of words in human vs. LLM-revised text. We
calculated the LOR of a word appearing in human-authored text compared to its appearance in
a version revised by an LLM. Displayed here are the 19 words with the highest average LOR
across all datasets, models, and prompts. The data are stratified by dataset and model, with
error bars representing the standard error associated with the three prompts analyzed.