
arXiv:2409.01754v3 [cs.CY] 8 Jul 2025

Empirical evidence of Large Language Model’s influence on human spoken communication

Hiromu Yakura ∗†a, Ezequiel Lopez-Lopez †b, Levin Brinkmann †a, Ignacio Serna a, Prateek Gupta a, Ivan Soraperra a, and Iyad Rahwan a

a Center for Humans and Machines, Max Planck Institute for Human Development, Germany
b Center for Adaptive Rationality, Max Planck Institute for Human Development, Germany

Abstract
From the invention of writing [1] and the printing press [2, 3], to television [4, 5] and social
media [6], human history is punctuated by major innovations in communication technology,
which fundamentally altered how ideas spread and reshaped our culture [7, 8]. Recent chat-
bots powered by generative artificial intelligence constitute a novel medium that encodes
cultural patterns in their neural representations and disseminates them in conversations
with hundreds of millions of people [9]. Understanding whether these patterns transmit into
human language, and ultimately shape human culture, is a fundamental question. While
fully quantifying the causal impact of a chatbot like ChatGPT on human culture is very
challenging, a lexicographic shift in human spoken communication may offer an early indicator
of such a broad phenomenon. Here, we apply econometric causal inference techniques [10] to
740,249 hours of human discourse from 360,445 YouTube academic talks and 771,591 con-
versational podcast episodes across multiple disciplines. We detect a measurable and abrupt
increase in the use of words preferentially generated by ChatGPT—such as delve, compre-
hend, boast, swift, and meticulous—after its release. These findings suggest a scenario where
machines, originally trained on human data and subsequently exhibiting their own cultural
traits, can, in turn, measurably reshape human culture. This marks the beginning of a closed
cultural feedback loop in which cultural traits circulate bidirectionally between humans and
machines [11]. Our results motivate further research into the evolution of human-machine
culture, and raise concerns over the erosion of linguistic and cultural diversity, and the risks
of scalable manipulation.

Introduction
Human culture has long been transformed by communication technologies [12, 13]. Writing
enabled complex social organizations [1]. The printing press subsequently democratized
knowledge, catalyzing economic and social revolutions [12, 3]. Cinema provided a shared
visual experience [12], and social media has since embedded humanity into a globalized
network [6, 14]. Now, generative AI—particularly Large Language Models (LLMs) like
∗ Corresponding author: [email protected]
† These authors contributed equally to this work.


ChatGPT—emerges as a novel medium, prompting the fundamental question: how will it shape culture?
Generative AI functions as a medium that encodes and transmits cultural patterns. In
particular, LLMs trained on extensive corpora do not merely store text verbatim; rather,
they internalize and reproduce broader structures—linguistic conventions [15], factual knowl-
edge [16], lexicographical biases [17], moral values [18], patterns of human behavior [19], rea-
soning strategies [20, 21], and persuasive techniques [22]. As users engage with these models,
such cultural patterns may be reinforced and perpetuated. Scholars of cultural evolution
have long examined how the differential transmission and reproduction of cultural traits
shape societal change [23]. Notably, LLMs are known to exhibit distinctive linguistic char-
acteristics, such as a preference for certain words, styles, and text structures [24, 25, 26]. If
humans prove receptive to the patterns generated by AI systems, this could mark a profound
shift on the trajectory of cultural evolution.
In this study, we present large-scale empirical, causal evidence that humans are reproduc-
ing cultural traits exhibited by ChatGPT, based on analyses of longitudinal corpora of tran-
scriptions from human verbal communication. With more than 100 million users in its first
two months [27], ChatGPT’s rapid proliferation has enabled machine-generated content to
permeate diverse domains—from education to scientific writing [28, 29, 30, 24, 26]—thereby
creating an unprecedented opportunity to investigate this phenomenon as it happens. As
machine-generated outputs measurably reshape human cultural evolution, the challenge is
not whether AI systems influence society, but how profoundly and in which direction.

Comparing Word Preferences of Humans and LLMs


ChatGPT, trained on vast amounts of data, learns from the cumulative textual knowledge of
humanity—spanning books, websites, forums, Wikipedia, and other publicly available cor-
pora [31]. However, beyond this broad pretraining, ChatGPT is fine-tuned using proprietary
techniques with limited transparency regarding the annotator demographics and the rein-
forcement learning process [32, 33]. As a result, its final behavior is not a mere reflection of
human language but an emergent phenomenon [34]—one shaped by statistical learning [16],
reinforcement dynamics [35], and alignment objectives [36]. The outcome is a linguistic and
behavioral profile that, while rooted in human language, exhibits systematic biases that dis-
tinguish it from organic human communication [37, 38]. When performing tasks, ChatGPT
exhibits a persistent preference for normative and socially desirable communication patterns,
such as politeness, neutrality, and conflict avoidance—traits that are conventionally empha-
sized in professional settings [39, 40]. For instance, when generating an email, it defaults
to structured and formal etiquette; in conversational exchanges, it maintains a consistently
polite and conciliatory tone, avoids direct confrontation [41], and adheres to mainstream
social norms [39, 42, 43]. These tendencies reveal a self-consistent communicative style that
differentiates ChatGPT’s language from spontaneous human interaction.
Linguistic preferences are a key aspect of cultural behavior. Just as humans develop
distinctive word choices shaped by social and historical contexts [44, 45], ChatGPT also
demonstrates systematic lexical biases that reflect its training data and optimization pro-
cess [46, 38]. One striking example is its preference for the word delve—a term it frequently
selects over alternatives such as explore or examine, highlighting its distinct lexicographic
tendencies [25, 47].
Such systematic preferences can be quantitatively examined through word frequency
analysis, a well-established method for studying linguistic and cultural patterns [48, 49, 50].
Prior research has used word frequencies to track ideological shifts in media [48], the spread
of new grammar [49], and changes in conversational norms over time [50]. By applying
this approach to ChatGPT’s outputs, we can rigorously compare its linguistic tendencies to
those observed in human communication, revealing the extent to which AI-driven linguistic
patterns diverge from or influence human cultural patterns (Fig. 1A).

[Figure 1 graphic. Panel A, "AI-driven linguistic patterns may potentially spread through diverse pathways": after interacting with ChatGPT, traits such as delve can enter writing a document, preparing a talk, and spontaneous conversation, propagating across first, second, and third generations of transmission. Panel B, "Number of Transcripts for YouTube and Podcast Categories": 360,445 talks from academic channels (56% conversation, 44% speech) and 771,591 podcast episodes with conversation (Business 249,949; Education 136,247; Science & Technology 117,080; Sports 130,558; Religion 137,757), totaling 7.35 billion words. Panel C, "Transcription pipelines": 740,249 hours of YouTube talks and podcasts, filtered by English title/category, duration (> 20 min / > 15 min; < 99th percentile, i.e., 3 h / < 5.5 hours), diarization, and dialogue detection before transcription.]
Figure 1: Determining the reciprocal transmission of cultural traits between humans and machines through large-scale data on verbal communication. (A) Potential pathways for integrating AI-driven cultural traits, such as those generated by ChatGPT, into human communication. (B) Composition of the human verbal communication data, comprising 360,445 YouTube videos of academic talks and 771,591 podcast episodes, systematically collected to assess the influence of ChatGPT. We initially focused on academic communication on YouTube, as the influence of the LLM on academic written discourse has been confirmed in the literature. (C) Transcription pipelines prepared for robust analysis of changes in word usage, yielding over 7.35 billion transcribed words. For podcasts, only those featuring dialogue were used to examine the influence on spontaneous communication.

Specifically, we collected extensive data on human verbal communication from the In-
ternet and conducted a quasi-experiment to estimate the population-level causal impact of
ChatGPT’s release (Figs. 1B and C; see Constructing datasets of human spoken communication in Methods for details).

[Figure 2 graphic. Panel A, "Word frequency for delve in academic YouTube talks": time series for the GPT word versus its synthetic control. Panel B, "Measuring word-wise preferences of ChatGPT": human-written abstracts from arXiv are edited by ChatGPT with the prompt "Your task is to proofread the provided sentence for grammatical accuracy. Ensure that the corrections introduce minimal distortion to the original content."; word-level frequencies of both versions yield a log-odds ratio per word. Panel C, "Top words preferred by ChatGPT". Panel D, "Construction of synthetic controls": donor words are combined with optimized weights (e.g., 0.13, 0.10, 0.07) to maximize pre-treatment similarity to delve.]
Figure 2: Quantifying ChatGPT’s word preferences and their influence on human communication. (A) Temporal trends in the usage of the word delve in academic YouTube talks, showing a significant increase after the release of ChatGPT in contrast to its synthetic control (p = 0.010). (B) Method for measuring word preferences of ChatGPT, in which we compared the word frequency distributions of human-written texts and their ChatGPT-edited versions. The log-odds ratio derived from these distributions, termed the GPT score, quantifies the extent to which a word is peculiar to ChatGPT. (C) Distinctive words associated with ChatGPT, termed the GPT words, measured by GPT scores. ChatGPT’s strong preference for delve was consistent across different versions. (D) Construction of a synthetic control with a similar pre-release pattern through a convex combination of donor words, used to infer causal influence.

To begin, we created a dataset comprising 360,445 YouTube
videos focused on academic conversations. This is because the literature has demonstrated
the influence of ChatGPT on academic papers [24, 25], including the increased use of delve,
from which we could expect a similar effect on verbal communication. We later expanded
our analysis to include 771,591 podcast episodes to investigate the influence in other conver-
sational domains and spontaneous communication, ensuring that the effect did not primarily
stem from reading ChatGPT-generated scripts.
We then examined changes in word usage in human verbal communication before and

[Figure 3 graphic. Panel A, "Changing trend of word usages upon ChatGPT release"; Panel B, "Changes of word usages associated with ChatGPT’s preference". Legend: synthetic control versus GPT word.]

Figure 3: Trend changes of top GPT words in academic YouTube talks. (A) Linear regression of the frequency of videos containing top GPT words and the corresponding synthetic controls. Dots represent monthly aggregated frequencies while lines depict temporal trends modeled with a change point marked by the vertical dashed line. Words such as comprehend, boast, and swift show a significant increase in usage, similarly to delve. (B) Relationship between the GPT scores of words (x-axis) and the magnitude of changes in their frequency trends (y-axis). The bars represent the 95% high-density interval of the posterior distribution, and the shaded region in the inset chart represents the 95% confidence interval of a Gaussian process regression. Focusing on the top 20 GPT words (i.e., the rightmost portion of the inset chart), we observe a set of words that show an increase in usage of approximately 25% to 50% per year.

after the release of ChatGPT, focusing on GPT words (i.e., words strongly preferred by
the LLM), as demonstrated in Fig. 2A. To accomplish this, we quantified ChatGPT’s word
preferences by editing human-written texts with ChatGPT and comparing their word fre-
quency distributions to those of the ChatGPT-edited versions (Fig. 2B; see Measuring word
preferences of large language models in Methods for details). The log-odds ratio derived
from the two versions (hereinafter GPT score) highlights ChatGPT’s strong preference
for delve, which was consistent between different versions of ChatGPT (Supplementary
Fig. S6), along with other words associated with ChatGPT [24, 25] (Fig. 2C).
To estimate the causal influence of ChatGPT’s release on spoken communication, we
employed the synthetic control method [10]. This approach aims to infer the counterfac-
tual usage of “treated” words—those with high GPT scores—in the absence of ChatGPT’s
introduction. It assumes that the “untreated” words—those with near-zero GPT scores—
that displayed similar pre-release patterns would project the counterfactual usage. For
this purpose, the method constructs a synthetic control for a treated word by taking the
convex combination of multiple untreated “donor” words that best matches its pre-release
usage (Fig. 2D). To test the robustness of our results, we employed alternative donor selec-
tion methods, including random sampling (see Estimating causal influence of ChatGPT in
Methods for details).

Relationship between ChatGPT’s word preferences and human adoption
As shown in Fig. 2A, we started by examining the changes in the usage of delve, the word
for which ChatGPT exhibited the strongest preference. The placebo test (see Estimating
causal influence of ChatGPT in Methods for details) revealed a significant increase in word
frequency (p = 0.010) compared to its synthetic control. This was consistent for other
donor selection strategies (Supple-
mentary Fig. S1). Furthermore, the discrepancy between delve and its synthetic control was
maximized when we specified November 30, 2022, as the treatment date, compared to other
dates (Supplementary Fig. S2). This finding suggests that the increase in the usage of delve
in academic YouTube talks coincided with the release of ChatGPT.
Fig. 3 extends this analysis across a broader range of GPT words. Here, we employed the
following continuous piecewise linear regression to model the magnitude of trend changes
across different words:

\[
\log_{10}(y_{w,t}) = \alpha + \beta\, t + \beta_{\mathrm{Post}}\, d_{\mathrm{Post}}\, t + \beta_{\mathrm{GPT,Post}}\, d_{\mathrm{GPT}}\, d_{\mathrm{Post}}\, t + \epsilon_t
\quad \text{for } t \in [T_{\mathrm{start}}, T_{\mathrm{end}}],
\tag{1}
\]
where
\[
d_{\mathrm{Post}} = \begin{cases} 1 & \text{if } t > T_{\mathrm{event}} \\ 0 & \text{otherwise} \end{cases}
\qquad \text{and} \qquad
d_{\mathrm{GPT}} = \begin{cases} 0 & \text{if } w \text{ is a synthetic control} \\ 1 & \text{if } w \text{ is a GPT word.} \end{cases}
\]

In this model, log10 (yw,t ) represents the log relative frequency of YouTube videos containing
the word w in a specific time period t (sampled on a monthly basis, see Constructing datasets
of human spoken communication), and Tevent is set corresponding to the release of ChatGPT
(marked by the vertical dotted lines in Fig. 3A). Thereby, the coefficient β captures the
temporal trend commonly observed for the GPT word and the synthetic control prior to
the introduction of ChatGPT, while βPost captures the change in the temporal trend of
the synthetic control following the introduction of ChatGPT. βGPT,Post measures how the
change in trend before (pre-) and after (post-) ChatGPT’s introduction differs between the
GPT word and the synthetic control, providing an estimate of ChatGPT’s causal impact on
word usage. Fig. 3A illustrates this effect as the difference between the orange and green
lines to the right of the vertical line. These results indicate that the linguistic shift in
academic communications after the release of ChatGPT also occurred for other words, such
as comprehend, boast, swift, and meticulous. This is clearly illustrated by the posterior
distribution of βGPT,Post shown in Supplementary Fig. S3.
Fig. 3B reports the estimated increase in usage frequency after the release of ChatGPT
(βGPT,Post ) for the set of words with the highest GPT score (x-axis). Many of the top GPT
words exhibit an annual growth in usage frequency of about 25% to 50%, yet for others, usage
has not changed significantly after ChatGPT’s release. This suggests that words preferred by
ChatGPT are not necessarily adopted into human verbal communication. However, we did
not observe any top GPT word exhibiting a significant decrease in usage, and, as can be seen
from the inset figure, words showing a significant increase in usage are primarily observed
among those with high GPT scores, i.e., words that ChatGPT outputs more frequently
than humans. This allows us to conclude that the increase in usage of GPT words after the
release date of ChatGPT has been influenced by interactions with the LLM.

[Figure 4 graphic: frequency of delve (GPT word) versus its synthetic control in each podcast category.]

Figure 4: Word frequency of delve in English-language podcast conversations across multiple topics. After the release of ChatGPT, delve usage increases significantly in Science & Technology, Business, and Education podcasts (p < 0.05), but not in Religion & Spirituality (p = 0.29) or Sports (p = 0.66). These trends mirror patterns observed in YouTube academic talks (Fig. 2A) yet extend to the more spontaneous domain of podcast conversations. Synthetic controls highlight that the effect is domain-dependent, suggesting variable levels of AI-driven linguistic shifts across different fields.

Influence on spontaneous communication in multiple domains
The above results pose the question: is the influence of the LLM on verbal communication ob-
served only in academic contexts? One possibility is that, despite 56% of YouTube academic
talks containing conversational elements (see Constructing datasets of human spoken com-
munication in Methods), these linguistic shifts stem primarily from direct machine-to-human
transmission—such as speakers reading from ChatGPT-co-authored materials—rather than
reflecting an internalized cultural trait. To ascertain the influence of ChatGPT outside
the academic context and beyond scripted speech, we analyzed 771,591 podcast episodes
spanning diverse topics. We then found a clear post-ChatGPT increase in the usage of
delve—the word with the highest GPT score—rising in all categories except Sports (Fig. 4).
However, the effect was statistically significant when compared to the synthetic controls
in Science & Technology (p = 0.03), Business (p = 0.04), and Education (p = 0.04),
but not in Religion & Spirituality (p = 0.29) or Sports (p = 0.66), suggesting that lin-
guistic shifts do not spread evenly across domains. Instead, STEM-adjacent fields appear
to adopt the word preferred by ChatGPT more readily. Beyond topical differences, we
also observed that ChatGPT’s influence extends across communication formats. Despite
differences in structure—academic talks that tend to be formal versus more spontaneous
podcast conversations—similar linguistic trends emerged. Delve increased most in topics
that are arguably closer to academia, while words like swift and meticulous saw broader adop-
tion across all domains, mirroring trends in YouTube talks (see Supplementary Fig. S4).
This suggests a two-phase diffusion process, where the words preferred by the LLM first
gain traction in fields with greater exposure to content influenced by ChatGPT, such as in
academic environments, before filtering into more informal discourse. At the same time, dif-
ferent words may follow different adoption pathways, reflecting variations in how language is

used across domains (e.g., ChatGPT’s word preferences exhibit distinct occurrence patterns
depending on the domain in which they are used; see Fig. 5b).
These findings indicate that AI-driven linguistic shifts are not confined to structured,
scripted communication. Instead, they suggest a plausible deeper process of linguistic adap-
tation, where certain words become embedded in everyday speech rather than simply being
imitated. We discuss the implications of these findings in the following section.

Discussion
Our findings provide the first large-scale empirical evidence that AI-driven language shifts
are propagating beyond written text into spontaneous spoken communication. By analyzing
over 740,000 hours of transcribed YouTube academic talks and podcast episodes, we reveal
a measurable surge in words preferred by ChatGPT—including delve, boast, and swift—with
a causal association to its public release. This linguistic influence is not confined to domains
where LLM-generated text is integrated by early adopters—such as academia, science, and
technology—but it is spreading to other domains, such as education and business. Beyond
diversity in domains, this shift is also not limited to one mode of communication, such
as scripted or formal speech—often featured in YouTube academic talks—but extends to
unscripted, conversational, and spontaneous discourse characteristic of podcast episodes.
However, the mechanisms underlying this adoption remain unknown. The shift could result
from direct imitation [51], cognitive ease [52], deeper integration into human thinking
processes [53], or a combination of all three.
LLMs in real-time human-human interactions suggests a deeper cognitive process at play—
potentially an internalization of AI-driven linguistic patterns. Notably, language is known to
influence human cognition [54], and it now underpins modern machine intelligence as well.
Next-generation LLMs appear to move beyond ingesting large-scale text corpora; they also
internalize structured reasoning processes—such as step-by-step logical inference [20, 21]—
that are themselves embedded in language. This opens the possibility that those words
preferred by LLMs encode modes of thinking that can be internalized by humans, beyond a
mere word-adoption phenomenon. Understanding how such AI-preferred patterns become
woven into human cognition represents a new frontier for psycholinguistics and cognitive
science.
Yet, regardless of the precise mechanism—be it imitation, cognitive ease, or deeper
integration—this measurable shift marks a precedent: machines trained on human culture
are now generating cultural traits that humans adopt, effectively closing a cultural feedback
loop. The shifts we documented suggest a scenario in which AI-generated cultural traits are
measurably and irreversibly embedded in, and are reshaping, human cultural evolution.
An immediate consequence is that these traits no longer remain confined to interactions
between humans and AI systems but instead can diffuse further throughout human-human
communication. For example, given the unprecedented scale of this shift, it is plausible that
the influence extends beyond 1-to-1 interactions with ChatGPT. The cultural traits can
now permeate discourse even among individuals who have never interacted directly with the
LLM, as we observed in spontaneous conversations of podcast episodes where these words
were mentioned. As a novel medium, generative AI provides unprecedented leverage over
cultural evolution, raising the prospect of unforeseen societal changes.
A central concern of this development is cultural homogenization. If AI systems dis-
proportionately favor specific cultural traits, they may accelerate the erosion of cultural
diversity [55]. Compounding this threat is the fact that future AI models will train on data
increasingly dominated by AI-driven traits, further amplified by human adoption, thereby
reinforcing homogeneity in a self-perpetuating cycle. This phenomenon poses an additional
hazard to AI itself: as specific patterns become monopolized, the risk of model collapse [56]
rises through a new pathway, since even incorporating humans into the loop of training models

might not provide the required data diversity. This not only threatens cultural diversity but
also reduces the societal benefits of AI systems.
After such a cultural singularity—where the lines between human and machine cul-
ture [11] become increasingly blurred—long-standing norms of idea exchange, authority,
and social identity may also be altered, with direct implications for social dynamics. For
instance, language has functioned as a potent instrument of social distinction [57, 58]. While
LLMs can help level linguistic barriers [59, 60], especially for non-native speakers seeking to
communicate in formal English, they may also generate new biases. It is conceivable that
certain words preferred by LLMs, like delve, could become stereotypically associated with
lower skill or intellectual authority, thus reshaping perceptions of credibility and competence.
While our method provided causal evidence of AI-driven shifts in language use, language
itself—although central—is only one facet of the complexity of human culture. Future re-
search must explore parallel phenomena in other facets not only to mitigate risks but also
to understand the unprecedented ways in which AI is becoming a source of cultural inno-
vations [61, 11]. This urgency is amplified by the deepening interplay between geopolitical
interests and AI development [62], which may reshape cultural narratives in unforeseen
ways. From influencing public discourse to redefining social cohesion, our next question is
no longer whether machines influence us, but how profoundly and through which channels.

Methods
Here, we provide details about the experimental design and the key metrics used to quantify
the influence of LLMs on human spoken communication. We begin by describing the pipeline
to construct datasets of YouTube talks and podcasts, including the criteria for preprocessing
and filtering. We also explain the method for quantifying the GPT score, which measures the
difference in the usage of words between human- and ChatGPT-edited texts. Then, we
detail the analysis design used in our research to examine the causal influence of ChatGPT
on human verbal communication.

Constructing datasets of human spoken communication


We first constructed a dataset of YouTube transcriptions related to academic communica-
tion, where the influence of LLMs on written communication has been observed. Additionally, to
verify whether the influence observed in the YouTube dataset can also be confirmed in other
domains, we collected podcasts and transcribed them.

YouTube data collection


To systematically collect academic talk videos, we first cataloged 20,622 research institutes
from the Research Organization Registry [63] as of May 13, 2024. Institutes not identified
as active educational entities were omitted to minimize the inclusion of non-educational
content, such as videos from corporate or inactive channels. Subsequently, we queried the
YouTube API (https://developers.google.com/youtube/v3/docs/search/list) with each institute name and its country name to list relevant channels and
provided the results to gpt-3.5-turbo-0125 as input to pick the most plausible channel
using the prompt shown in Supplementary Fig. S5. The identified channels were then used
to compile a list of 2,958,103 videos through a subsequent YouTube API query.

Podcast data collection


We collected podcast transcripts across five categories: Business, Education, Religion &
Spirituality, Science & Technology, and Sports. To ensure a sufficiently large dataset for
analysis, we first compiled a list of podcast episodes using a database of over 4 million

podcast series (Taddy Podcast API: https://taddy.org/developers/podcast-api/bulk-download-podcastseries), as of April 2024 (snapshot: https://archive.org/details/podcastseries-2024-04-09T03-19-15.140Z.txt). We then applied broad filters based on duration and
publication date, selecting episodes released between January 1, 2017, and the penultimate
quarter of 2024.
Next, we downloaded the selected podcasts, randomly selecting 6,000 podcasts per quar-
ter for each category to ensure a robust sample. Here, we used pre-labeled categories given
by the podcast provider, PodcastIndex (https://podcastindex-org.github.io/docs-api), which were mapped into broader, more general categories for clarity (following the Apple Podcasts categories: https://podcasters.apple.com/support/1691-apple-podcasts-categories). A small fraction of the episodes could not be retrieved or were not
available in MP3 format, so they were discarded. As detailed in later sections, the number of
usable podcasts decreases throughout our filtering pipeline. Thus, if any category resulted
in fewer than 3,500 transcripts per quarter after filtering, we re-ran the sampling to meet
this threshold.

Filtering and transcription


To maximize the number of transcripts we can obtain with limited GPU resources, we
implemented the following filtering criteria. First, using a language detection library [64], we
excluded YouTube videos with non-English titles, as they are less likely to contain English
communication, and podcasts not labeled with English as their language. Additionally,
we removed videos shorter than 20 minutes (the YouTube API categorizes videos under 20 minutes as short) and podcasts shorter than 15 minutes,
since they often include non-speech content such as promotional material. Furthermore, we
excluded YouTube videos that exceed the 99th percentile in duration (approximately 3.0
hours) to avoid noise and unnecessary GPU use, such as a 5-hour graduation ceremony. For
the podcasts, we ruled out episodes over 20,000 seconds (approximately 5.5 hours) for the
same reason, given that we could not pre-determine the population due to the following
filtering process.
With the podcast dataset, we aimed to specifically analyze the influence of LLMs on
spontaneous communication. Therefore, we employed an additional filter to retain only
conversational episodes. This involved analyzing a 10-minute slice extracted from the middle of each
episode to detect the number of unique speakers and their speech boundaries. We ap-
plied speaker diarization—a process that partitions audio into segments labeled by speaker
identity—using the pyannote library [65, 66]. To classify an episode as conversational, we re-
quired at least two distinct speakers and four or more exchanges (alternating turns) between
them, ensuring dynamic dialogue.
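A minimal sketch of this conversational filter is shown below, assuming the pyannote.audio 3.x pipeline API; the model identifier and the turn-counting rule over time-ordered segments are our illustrative choices, not necessarily the exact implementation.

```python
# Sketch: classify a 10-minute audio slice as conversational (assumed setup).
from pyannote.audio import Pipeline

# Hypothetical model choice; loading may require a Hugging Face access token.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

def is_conversational(slice_path, min_speakers=2, min_turns=4):
    """True if the slice has >= 2 distinct speakers and >= 4 alternating turns."""
    diarization = pipeline(slice_path)  # slice extraction omitted for brevity
    # Time-ordered (start, speaker) pairs for every diarized segment.
    segments = sorted(
        (turn.start, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    )
    speakers = {spk for _, spk in segments}
    # An "exchange" is a change of speaker between consecutive segments.
    turns = sum(1 for (_, a), (_, b) in zip(segments, segments[1:]) if a != b)
    return len(speakers) >= min_speakers and turns >= min_turns
```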
This process required us to re-run the sampling of podcasts, as mentioned above, be-
cause the ratio of drop-off significantly varied across the categories. Religion & Spirituality
had the highest drop-off, with only 16.2% of collected files making it to full transcription,
reflecting a high prevalence of monologues, which required a larger initial dataset. Science
& Technology also saw a considerable drop, with only 40.8% of collected files reaching tran-
scription, while other categories showed approximately 60% to 75% retention rate. We also
note that we applied the same speaker diarization technique to conduct a post-hoc analysis
of the conversation ratio in the YouTube videos, which yielded a result of 56%, as illustrated
in Fig. 1B.
The transcription of the collected data was performed using the large-v3 model of Whis-
perX [67], a faster version of the Whisper speech-to-text model [68]. We employed batch
processing with the model, achieving an average transcription speed of approximately 2
minutes per hour of audio with an Nvidia A100 GPU. Here, we opted to run the transcrip-
tion process ourselves rather than using the pre-existing transcript data from YouTube or
other podcast platforms, due to the possibility that they have switched transcription models

across different time points. For example, in the YouTube transcripts, we found an unnat-
ural increase in the frequency of the filler word um starting around May 2020, which we
found difficult to attribute to an actual increase in speakers’ usage of the word. It is more
plausible that YouTube switched to a transcription model that transcribes fillers verbatim,
and thus, we conducted the transcription process ourselves to avoid this potential source of bias. Lastly,
we applied the language detection library [64] again to both datasets to filter out transcripts
from videos and podcasts that Whisper judged to be non-English and transcribed into
other languages.
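For illustration, a minimal WhisperX transcription call could look like the following; the batch size and compute type are assumptions, not the settings reported above.

```python
# Sketch: batched transcription with WhisperX's large-v3 model (assumed args).
import whisperx

device = "cuda"  # the authors report using an Nvidia A100 GPU
model = whisperx.load_model("large-v3", device, compute_type="float16")

audio = whisperx.load_audio("episode.mp3")
result = model.transcribe(audio, batch_size=16)

# Keep only output that Whisper itself identified as English.
if result["language"] == "en":
    transcript = " ".join(seg["text"].strip() for seg in result["segments"])
```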
Given the above, we acknowledge that our datasets are not all-encompassing. Rather, the
data collection procedure was designed to capture the influence of LLMs on spoken language
in a systematic manner. While recognizing the potential presence of noise or missing data,
we believe that the datasets are sufficiently comprehensive to represent the phenomena
without introducing bias.

Preprocessing
We preprocessed the obtained transcripts to capture essential changes in word frequency by
removing noise and highlighting relevant patterns. We followed a systematic procedure (a code sketch follows the list):
1. Tokenization: The text is divided into individual tokens (words) for processing.
2. Normalization: All words are converted to lowercase to ensure uniformity and avoid
duplication due to case differences.
3. Stop word removal: Commonly used words that do not carry significant semantic
meaning, such as and, the, and is, are removed. The list of stop words used in this
process is sourced from the Natural Language Toolkit (NLTK) library [69], which
provides a standard set of English stop words.
4. Non-alphabetic filtering: Words containing non-alphabetic characters are excluded,
ensuring only standard words are retained.
5. Length filtering: Words with fewer than three characters are removed to eliminate
overly short and potentially uninformative tokens.
6. Stemming: Words are reduced to their root forms using the Porter stemming algo-
rithm [70]. This algorithm applies a series of heuristic rules to iteratively strip suffixes
from words (e.g., running to run). Stemming reduces the total vocabulary size, facili-
tating more efficient analysis.
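The six steps above amount to a short pipeline; the sketch below is one way to realize it with NLTK, where whitespace tokenization stands in for whichever tokenizer was actually used.

```python
# Sketch of the preprocessing pipeline (NLTK stop words + Porter stemmer).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    tokens = text.split()                                # 1. tokenization
    tokens = [t.lower() for t in tokens]                 # 2. normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop word removal
    tokens = [t for t in tokens if t.isalpha()]          # 4. non-alphabetic filtering
    tokens = [t for t in tokens if len(t) >= 3]          # 5. length filtering
    return [STEMMER.stem(t) for t in tokens]             # 6. stemming

# e.g., "delving" -> "delv", "meticulous" -> "meticul"
```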
For subsequent analyses, we calculated the log relative frequency of YouTube videos
and podcast episodes containing each presented word, sampled monthly. Older data may
exhibit different word usage trends due to factors such as the relatively low number of
YouTube videos and podcast episodes (for example, the number of videos uploaded in 2015
in our dataset is less than half of those uploaded in 2023). Hence, we analyzed data spanning four years before
the initial release of ChatGPT on November 30, 2022. Additionally, due to the timing
of data collection, the analysis includes data up to May 31, 2024, which is 18 months
post-release. We employed log frequency to facilitate trend interpretation within this early
diffusion phase, using Laplace smoothing [71] to account for zero counts, which helps detect
emerging patterns that may initially exhibit exponential growth.
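As a concrete illustration, the monthly series can be computed as follows; the count dictionaries below are hypothetical stand-ins for our aggregation step.

```python
# Sketch: monthly log10 relative document frequency with Laplace smoothing.
import math

def log_relative_frequency(docs_with_word_per_month, docs_per_month):
    """Both arguments map a month key (e.g., '2023-01') to document counts."""
    series = {}
    for month, n_total in docs_per_month.items():
        n_word = docs_with_word_per_month.get(month, 0)
        # Laplace smoothing avoids log(0) in months where the word is absent.
        series[month] = math.log10((n_word + 1) / (n_total + 1))
    return series
```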

Measuring word preferences of large language models


We investigated the word preferences of commonly used LLMs by prompting various models
to edit a diverse set of human-authored texts. Building on prior research [24, 25], we analyzed
differences in word frequencies between original human-written texts and their LLM-edited
versions. Our analysis spans a wide range of human texts, prompts, and models, enabling
the computation of an aggregated GPT score.

Creation of contrastive datasets


We compiled datasets from diverse sources, all predating the introduction of ChatGPT.
These included 7,182 abstracts from arXiv (2019–2022) using the arXiv API, 2,880 ab-
stracts from bioRxiv (2019–2022) via the bioRxiv API, over 8,000 abstracts from Nature
(2019–2023) collected through its search engine, and 10,000 samples each from the Enron
email dataset (2000–2001), Hewlett Foundation student essays (2012), HuffPost news de-
scriptions (2012–2022), and Wikipedia articles (2019–2022). Detailed dataset creation steps
are provided in the Supplementary Information materials.
To assess how prompts influence model word preferences, we used three standard prompts
across all datasets and models:
• Prompt 1: Please polish this text: {text}
• Prompt 2: Can you improve this text: {text}
• Prompt 3: Please rephrase this text to improve its clarity: {text}
As Prompt 3 frequently altered the content of emails, we extended the prompt to include:
“It’s an email, so please don’t change the structure of the text.” in that spe-
cific case.
We preprocessed both the original human texts and their LLM revisions using the same
procedure applied to transcript datasets of YouTube videos and podcasts. For robustness,
we considered only words appearing in at least one per mille (0.1%) of all documents, either in
the source or edited texts, and excluded prompt-related words (rephrase, polish, dear, text,
certainly, subject, readable, clarity, enhance, version, title) that were frequently repeated
in the LLM’s responses. Also, our analysis included multiple GPT-family models (GPT-
3.5-turbo, GPT-4, GPT-4-turbo, and GPT-4o), allowing us to investigate across different
versions of LLMs.

Log-odds ratio estimation


To identify words preferentially associated with LLMs, we computed log-odds ratios com-
paring word frequencies in human-authored and ChatGPT-edited corpora. For each word
w, we estimated its document frequency in human (phuman ) and ChatGPT (pGPT ) corpora
using Laplace smoothing to mitigate zero-count issues:
\[
p_w = \frac{\text{number of documents containing word } w + 1}{\text{total documents} + 1}.
\]
The log-odds transformation was applied to these smoothed probabilities:
\[
\text{log-odds}(p) = \ln\!\left(\frac{p}{1 - p}\right),
\]
yielding the log-odds ratio (LOR) for each word:
\[
\mathrm{LOR}_w = \text{log-odds}(p_{w,\mathrm{GPT}}) - \text{log-odds}(p_{w,\mathrm{human}}).
\]

Positive LOR values indicate higher prevalence in ChatGPT-edited texts, while negative
values suggest human-associated usage. This metric was computed independently across all
dataset–model–prompt combinations.
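A sketch of this computation, assuming document-frequency dictionaries for the two corpora:

```python
# Sketch: log-odds ratio of a word between ChatGPT-edited and human corpora.
import math

def smoothed_p(n_docs_with_word, n_docs_total):
    # Laplace-smoothed document frequency.
    return (n_docs_with_word + 1) / (n_docs_total + 1)

def log_odds(p):
    return math.log(p / (1 - p))

def lor(word, counts_gpt, n_gpt, counts_human, n_human):
    p_gpt = smoothed_p(counts_gpt.get(word, 0), n_gpt)
    p_human = smoothed_p(counts_human.get(word, 0), n_human)
    return log_odds(p_gpt) - log_odds(p_human)  # > 0: ChatGPT-associated
```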

Calculation of a weighted GPT score


To quantify systematic word preferences across GPT-family models (GPT-3.5-turbo, GPT-4,
GPT-4-turbo, and GPT-4o) and scientific abstracts (arXiv, bioRxiv, Nature), we developed

a GPT score that marginalizes over uncertainties in model usage patterns. Given the un-
known true distribution of ChatGPT’s usage across datasets (D), models (M ), and prompts
(P ), we adopted a Bayesian hierarchical model with non-informative Dirichlet priors:

\[
P(D) \sim \mathrm{Dirichlet}(\mathbf{1}), \qquad
P(M \mid D) \sim \mathrm{Dirichlet}(\mathbf{1}), \qquad
P(P \mid D, M) \sim \mathrm{Dirichlet}(\mathbf{1}),
\]
where \(\mathbf{1}\) denotes flat priors for each parameter. The joint distribution \(P(D, M, P)\) was computed as:
\[
P(D, M, P) = P(D) \cdot P(M \mid D) \cdot P(P \mid D, M).
\]
For each of 1000 Monte Carlo samples from this prior distribution, we calculated weighted word probabilities:
\[
\hat{p}_w = \sum_{d \in D} \sum_{m \in M} \sum_{p \in P} p_w^{(d,m,p)} \cdot \lambda(d, m, p),
\]
where \(\lambda(d, m, p) \propto P(D = d, M = m, P = p)\) and \(D, M, P\) represent datasets, models, and prompts, respectively. The GPT score for each word was defined as the median LOR across all samples, with uncertainty quantified via 95% percentile intervals. This approach propagates epistemic uncertainty about ChatGPT’s usage patterns into the final score estimates.
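The following sketch illustrates the Monte Carlo marginalization with NumPy; for brevity it weights per-combination LORs directly, whereas the score above is defined via weighted word probabilities, and the array shapes are assumptions.

```python
# Sketch: GPT score via Dirichlet-weighted Monte Carlo aggregation.
import numpy as np

rng = np.random.default_rng(0)

def gpt_score(lor_grid, n_samples=1000):
    """lor_grid: hypothetical (n_d, n_m, n_p) array with one LOR per
    dataset x model x prompt combination for a single word."""
    n_d, n_m, n_p = lor_grid.shape
    draws = []
    for _ in range(n_samples):
        p_d = rng.dirichlet(np.ones(n_d))                   # P(D)
        p_m = rng.dirichlet(np.ones(n_m), size=n_d)         # P(M | D)
        p_p = rng.dirichlet(np.ones(n_p), size=(n_d, n_m))  # P(P | D, M)
        # lambda(d, m, p) = P(D) * P(M | D) * P(P | D, M)
        lam = p_d[:, None, None] * p_m[:, :, None] * p_p
        draws.append(np.sum(lam * lor_grid))
    draws = np.asarray(draws)
    # Median as point estimate; 95% percentile interval as uncertainty.
    return np.median(draws), np.percentile(draws, [2.5, 97.5])
```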

Rationale for Bayesian weighting The Dirichlet prior structure reflects maximum
entropy assumptions about potential correlations between datasets, models, and prompts.
By sampling from the joint prior distribution, we emulate the variability expected under
real-world deployment scenarios where specific GPT-family model, dataset, and prompt
combinations are not systematically favored. The resulting GPT scores thus represent robust
centrality estimates of word preferences across plausible usage distributions.

Estimating causal influence of ChatGPT


To assess ChatGPT’s causal impact on human verbal communication, we employed the
synthetic control method [10]. This method allows us to estimate the usage pattern of a
“treated” GPT word (i.e., a word with a high GPT score) in the counterfactual scenario
where ChatGPT was never deployed. This is built on the assumption that words sharing
similar pre-release usage patterns would have continued exhibiting comparable patterns in
the absence of the release. Thus, the inferred post-release usage of “treated” words can be
derived by comparing them to the “untreated” words—those with near-zero GPT scores—
that displayed similar pre-release patterns. Here, the method constructs a synthetic control
for a treated word by forming a convex combination of multiple untreated “donor” words
that closely align with its usage prior to the release, and it predicts the counterfactual
pattern by calculating the post-release usage of this combination.
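A sketch of the weight-fitting step, using SciPy's SLSQP optimizer (the algorithm named under Donor selection below); the array layout is an assumption.

```python
# Sketch: fit convex donor weights reproducing the treated word's pre-release series.
import numpy as np
from scipy.optimize import minimize

def fit_synthetic_control(treated_pre, donors_pre):
    """treated_pre: (T,) pre-release series; donors_pre: (T, K) donor series."""
    n_donors = donors_pre.shape[1]

    def loss(w):
        # Squared pre-release gap between the treated word and the combination.
        return np.sum((treated_pre - donors_pre @ w) ** 2)

    result = minimize(
        loss,
        x0=np.full(n_donors, 1.0 / n_donors),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_donors,  # non-negative weights
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # convexity
    )
    return result.x

# Counterfactual post-release usage: donors_post @ fit_synthetic_control(...)
```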

Donor selection
For each treated word, the synthetic control method requires the selection of a set of donor
words that are used to build the synthetic control. Following recommendations in the liter-
ature [72] and because of the nature of the optimization algorithm, in which we specifically
used sequential least squares programming, we restricted the number of donor words to
approximately 100. To avoid arbitrary selection and to assess the robustness of the results,
we developed three different approaches to select the words in the donor pool: untreated,
synonym, and random. The results from the last two approaches are included in the Sup-
plementary Material.

Untreated donors To estimate a proper counterfactual usage of a GPT word, we need
to include words that are not preferentially used by ChatGPT, i.e., words that are arguably
not treated by the release of ChatGPT. We first selected the 10% of words whose GPT
scores were closest to zero, i.e., words that are neither over nor under-represented in GPT
rewriting of human text. As a second step, we restricted our selection further by selecting
the 100 words that are the closest to the treated word in the word2vec semantic space, built
using the Google News dataset [73].
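This two-step selection could be sketched as follows with gensim and the pretrained Google News vectors; gpt_scores is a hypothetical word-to-score mapping.

```python
# Sketch: select ~100 untreated donors for a treated word (assumed inputs).
from gensim.models import KeyedVectors

# Assumes the pretrained Google News word2vec file is available locally.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def untreated_donors(treated, gpt_scores, pool_frac=0.10, n_donors=100):
    # Step 1: the 10% of words whose GPT score is closest to zero.
    ranked = sorted(gpt_scores, key=lambda w: abs(gpt_scores[w]))
    pool = [w for w in ranked[: int(len(ranked) * pool_frac)] if w in kv]
    # Step 2: among those, the words nearest to the treated word in word2vec space.
    pool.sort(key=lambda w: kv.similarity(treated, w), reverse=True)
    return pool[:n_donors]
```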

Synonym donors This approach consisted of selecting the 100 words that are most
semantically similar to the treated word without restricting the GPT score to be close to
zero. We then constructed synthetic controls as a convex combination of these words.

Random donors This approach involved randomly sampling 100 words from all avail-
able words. It was set to examine the generalizability of the results to the case with minimal
assumptions.

Placebo test
We assessed the significance of the causal effect using a placebo test based on the ratio of
post- to pre-treatment mean squared prediction error (MSPE) [74]. This approach assumes
that, in the absence of an effect, the ratio of post- to pre-treatment MSPE between the
treated word and its synthetic control would be similar for words that are actually treated
and for those that are not treated. To test this null hypothesis, we employed a permutation
test where, given a treated word, we applied the synthetic control method to each word in
the donor pool and computed the ratio of post- to pre-treatment MSPE. We then reject
the null hypothesis of no effect if the ratio for the treated word falls within the top five percent
of the distribution of ratios. Additionally, to examine whether the change in usage of the
target GPT word is attributable to the ChatGPT release, we performed an in-time placebo
test. Specifically, following [10], we compared the MSPE ratio with those of fake treatments,
which were set every three months over the two years preceding the ChatGPT release.
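In code, the test amounts to comparing one MSPE ratio against a null distribution of placebo ratios; the (1 + k)/(1 + n) p-value convention below is one common choice, assumed rather than prescribed here.

```python
# Sketch: MSPE-ratio placebo test for one treated word.
import numpy as np

def mspe_ratio(actual, synthetic, t_event):
    """actual, synthetic: aligned monthly series; t_event: index of the release."""
    pre = np.mean((actual[:t_event] - synthetic[:t_event]) ** 2)
    post = np.mean((actual[t_event:] - synthetic[t_event:]) ** 2)
    return post / pre

def placebo_p_value(treated_ratio, placebo_ratios):
    # Fraction of placebo ratios at least as extreme as the treated word's.
    ratios = np.asarray(placebo_ratios)
    return (1 + np.sum(ratios >= treated_ratio)) / (1 + len(ratios))
```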

Difference-in-differences linear regression model analysis


We employed a hierarchical Bayesian Gaussian regression to model the frequency transition
of a specific word in our datasets. Our model represented in Eqn. 1 allows the estimation
of the influence of the introduction of ChatGPT on the usage of GPT words in a difference-
in-differences approach. Again, β captures the time trend observed for the GPT word and
the synthetic control prior to the introduction of ChatGPT. In contrast, βPost then captures
the change in the usage of the synthetic control after the introduction of ChatGPT, and
βGPT,Post captures how the pre- and post-introduction change differs between the GPT word
and the synthetic control. Note that t is a continuous variable representing time in years,
centered at Tstart; the β coefficients therefore measure changes in log10 (yw,t) over a
one-year period. α is the intercept, and the error term ϵ is normally distributed
with mean zero and standard deviation σ. We used a half-Cauchy prior for σ while using normal
distribution priors for all other parameters. Parameters were estimated using MCMC with
sampling conducted across four chains using STAN’s no-U-turn sampler, with 1,000 samples
per chain.
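As an illustration, Eqn. 1 can be written down directly in a probabilistic programming language; the sketch below uses PyMC as a stand-in for the Stan implementation described above (both default to the no-U-turn sampler), with synthetic placeholder data and prior scales chosen purely for illustration.

```python
# Sketch: Bayesian difference-in-differences model for Eqn. 1 (PyMC stand-in).
import numpy as np
import pymc as pm

# Hypothetical prepared inputs: monthly time (years since T_start), indicators,
# and log10 relative frequencies stacked for the GPT word and its control.
rng = np.random.default_rng(0)
t = np.tile(np.arange(72) / 12.0, 2)        # six years of months, two series
d_post = (t > 4.0).astype(float)            # release four years into the window
d_gpt = np.repeat([1.0, 0.0], 72)           # GPT word vs. synthetic control
log_y = rng.normal(-3.0, 0.1, size=144)     # placeholder observations

with pm.Model():
    alpha = pm.Normal("alpha", mu=0.0, sigma=10.0)
    beta = pm.Normal("beta", mu=0.0, sigma=10.0)
    beta_post = pm.Normal("beta_post", mu=0.0, sigma=10.0)
    beta_gpt_post = pm.Normal("beta_gpt_post", mu=0.0, sigma=10.0)
    sigma = pm.HalfCauchy("sigma", beta=1.0)

    mu = (alpha + beta * t + beta_post * d_post * t
          + beta_gpt_post * d_gpt * d_post * t)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=log_y)

    idata = pm.sample(draws=1000, chains=4)  # NUTS by default
```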

Acknowledgements
H.Y. is supported in part by Japan Science and Technology Agency (JST) PRESTO Grant
Number JPMJPR246B. E.L.-L. is funded by the Deutsche Forschungsgemeinschaft (DFG), project number 458366841 (POLTOOLS - Assisting behavioral science and evidence-based policy making using online machine tools).

Appendix A Word preferences of Large Language Models
Large language models (LLMs) from the GPT family systematically alter word frequencies
when revising human-written text [24, 25]. To quantify these word preferences, we computed
log-odds ratios (LORs) comparing word frequencies in human-authored texts and their GPT-
revised counterparts. We systematically evaluated the sensitivity of this effect to model
version, prompting, and source dataset (Supplementary Figs. S6 and S7).

[Figure 5 graphic. Panel (a), "Word preferences when revising arXiv abstracts": log-odds ratios (0–6) per word, grouped by model (gpt3.5-turbo, gpt4, gpt4-turbo, gpt4o). Panel (b), "Word preferences of revisions by GPT-4": log-odds ratios per word, grouped by dataset (arXiv, bioRxiv, Nature Abstracts, Emails, Essays, News, Wikipedia). The words shown include delve, underscore, and swiftly.]

Figure 5: Log-odds ratios (LORs) of words in human versus LLM-revised text. (a) Word preferences remain relatively stable across different models within the same family when revising arXiv abstracts. (b) Substantial variations in LORs emerge when examining revisions by GPT-4 across different datasets.

Word preference patterns exhibited notable stability across GPT-family models (Fig. 5A),
suggesting these biases emerge from intrinsic characteristics of the training pipeline rather
than version-specific training. However, specifically for delve, we found a decreasing prefer-
ence in newer models. While the odds ratio for delve in GPT-3.5-turbo– and GPT-4–revised
arXiv abstracts exceeded 300:1 relative to human texts, this ratio declined to approximately
100:1 for GPT-4-turbo and 40:1 for GPT-4o, indicating iterative mitigation of this stylistic
artifact.
LOR magnitudes varied substantially across source corpora (Fig. 5B). Analysis of log-
probability distributions (Supplementary Fig. S6) revealed this variability stems primarily
from baseline differences in human word choices. For instance, while humans rarely use un-
derscore in essays, GPT revisions introduced this term frequently across domains including
essays.
Focusing on scientific abstracts (arXiv, bioRxiv, Nature) and GPT-family models (GPT-
3.5-turbo, GPT-4, GPT-4-turbo, GPT-4o), we computed a weighted GPT score by marginal-
izing over model, prompt, and dataset combinations. Here, delve emerged as the most
strongly overused term (LOR > 4), followed by underscore, comprehend, bolster, boast,
swift, inquiry, meticulous, pinpoint, and groundbreak (LOR > 2.5) (Figure 2).

References
[1] Jack Goody and Ian Watt. The consequences of literacy. Comparative Studies in Society
and History, 5(3):304–345, 1963.

[2] Jeremiah Dittmar. Information technology and economic change: The impact of the
printing press. The Quarterly Journal of Economics, 126(3):1133–1172, 2011.

[3] Jared Rubin. Printing and protestants: An empirical test of the role of printing in the
reformation. Review of Economics and Statistics, 96(2):270–286, 2014.

[4] Robert Putnam. Bowling Alone: The Collapse and Revival of American Community.
Simon & Schuster, New York, USA, 2000.

[5] Robert Jensen and Emily Oster. The power of TV: Cable television and women’s status
in India. The Quarterly Journal of Economics, 124(3):1057–1094, 2009.

[6] Soroush Vosoughi, Deb Roy, and Sinan Aral. The spread of true and false news online.
Science, 359(6380):1146–1151, 2018.

[7] Everett M. Rogers. Diffusion of innovations. Simon & Schuster, New York, USA, 5th
edition, 2003.

[8] Cristian Jara-Figueroa, Amy Z. Yu, and César A. Hidalgo. How the medium shapes the
message: Printing and the rise of the arts and sciences. PLOS ONE, 14(2):e0205771,
2019.

[9] Andrew Ross Sorkin and Sarah Kessler. Sam Altman on Microsoft, Trump
and Musk. New York Times: https://www.nytimes.com/2024/12/14/business/
dealbook/sam-altman-dealbook.html, December 2024.

[10] Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Synthetic control methods for
comparative case studies: Estimating the effect of California’s tobacco control program.
Journal of the American Statistical Association, 105(490):493–505, 2010.

[11] Levin Brinkmann, Fabian Baumann, Jean-François Bonnefon, Maxime Derex,


Thomas F. Müller, Anne-Marie Nussberger, Agnieszka Czaplicka, Alberto Acerbi,
Thomas L. Griffiths, Joseph Henrich, Joel Z. Leibo, Richard McElreath, Pierre-Yves
Oudeyer, Jonathan Stray, and Iyad Rahwan. Machine culture. Nature Human Be-
haviour, 7(11):1855–1868, 2023.

[12] Marshall McLuhan. Understanding Media: The Extensions of Man. McGraw-Hill, New
York, USA, 1964.

[13] Joseph Henrich. The Secret of Our Success: How Culture Is Driving Human Evolu-
tion, Domesticating Our Species, and Making Us Smarter. Princeton University Press,
Princeton, USA, 2016.

[14] Neil F. Johnson, Nicolas Velásquez, Nicholas Johnson Restrepo, Rhys Leahy, Nicholas
Gabriel, Sara El Oud, Minzhang Zheng, Pedro Manrique, Stefan Wuchty, and Yonatan
Lupu. The online competition between pro- and anti-vaccination views. Nature,
582(7811):230–233, 2020.

[15] Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenen-
baum, and Evelina Fedorenko. Dissociating language and thought in large language
models. Trends in Cognitive Sciences, 28(6):517–540, 2024.

[16] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand-
hini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.
Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato,
Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural
Information Processing Systems, volume 33, pages 1877–1901, Online, 2020. Curran
Associates.

[17] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret


Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In
Madeleine Clare Elish, William Isaac, and Richard S. Zemel, editors, Proceedings of the
2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623,
Online, 2021. ACM.

[18] Esin Durmus, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, An-
ton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph,
Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared
Kaplan, Jack Clark, and Deep Ganguli. Towards measuring the representation of sub-
jective global opinions in language models. In Proceedings of the 1st Conference on
Language Modeling, pages 1–27, Philadelphia, USA, 2024. OpenReview.

[19] Marcel Binz, Elif Akata, Matthias Bethge, Franziska Brändle, Fred Callaway, Julian
Coda-Forno, Peter Dayan, Can Demircan, Maria K. Eckstein, Noémi Élteto, Thomas L.
Griffiths, Susanne Haridi, Akshay K. Jagadish, Ji-An Li, Alexander Kipnis, Sreejan
Kumar, Tobias Ludwig, Marvin Mathony, Marcelo G. Mattar, Alireza Modirshanechi,
Surabhi S. Nath, Joshua C. Peterson, Milena Rmus, Evan M. Russek, Tankred Saanum,
Natalia Scharfenberg, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi,
Xin Sui, Mirko Thalmann, Fabian J. Theis, Vuong Truong, Vishaal Udandarao, Kon-
stantinos Voudouris, Robert Wilson, Kristin Witte, Shuchen Wu, Dirk Wulff, Huadong
Xiong, and Eric Schulz. Centaur: A foundation model of human cognition. arXiv,
2410.20268:1–141, 2024.

[20] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia,
Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reason-
ing in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle
Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Sys-
tems, volume 35, pages 24824–24837, New Orleans, USA, 2022. Curran Associates.

[21] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang,
Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai
Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu,
Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao,
Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang
Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li,
H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo
Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang
Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai
Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang
Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang,
Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang,
Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang,
Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan

Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou,
Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei,
Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun
Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang,
Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang,
Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha
Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia
Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao,
Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He,
Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan
Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo,
Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang,
Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan,
Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang,
Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu,
Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu
Zhang, and Zhen Zhang. DeepSeek-R1: Incentivizing reasoning capability in LLMs via
reinforcement learning. arXiv, 2501.12948:1–22, 2025.

[22] Thomas H Costello, Gordon Pennycook, and David G Rand. Durably reducing con-
spiracy beliefs through dialogues with AI. Science, 385(6714):eadq1814, 2024.

[23] Robert Boyd and Peter J Richerson. Culture and the evolutionary process. University
of Chicago Press, Chicago, USA, 1988.

[24] Weixin Liang, Yaohui Zhang, Zhengxuan Wu, Haley Lepp, Wenlong Ji, Xuandong
Zhao, Hancheng Cao, Sheng Liu, Siyu He, Zhi Huang, Diyi Yang, Christopher Potts,
Christopher D Manning, and James Y. Zou. Mapping the increasing use of LLMs in
scientific papers. In Proceedings of the 1st Conference on Language Modeling, pages
1–27, Philadelphia, USA, 2024. OpenReview.

[25] Dmitry Kobak, Rita González Márquez, Emöke-Ágnes Horvát, and Jan Lause. Delv-
ing into ChatGPT usage in academic writing through excess vocabulary. arXiv,
2406.07016:1–13, 2024.

[26] Mingmeng Geng and Roberto Trotta. Is ChatGPT transforming academics’ writing
style? In Proceedings of the ICML 2024 Workshop on the Next Generation of AI
Safety, pages 1–14, Vienna, Austria, 2024. OpenReview.

[27] Krystal Hu. ChatGPT sets record for fastest-growing user base
– analyst note. Reuters: https://www.reuters.com/technology/
chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/,
February 2023.

[28] Andy Extance. ChatGPT has entered the classroom: how LLMs could transform
education. Nature, 623(7987):474–477, 2023.

[29] Bahar Memarian and Tenzin Doleck. ChatGPT in education: Methods, potentials, and
limitations. Computers in Human Behavior: Artificial Humans, 1(2):100022, 2023.

[30] Chris Stokel-Walker. ChatGPT listed as author on research papers. Nature,
613(7945):620–621, 2023.

[31] OpenAI. GPT-4 technical report. arXiv, 2303.08774:1–100, 2023.



[32] Andreas Liesenfeld, Alianda Lopez, and Mark Dingemanse. Opening up ChatGPT:
Tracking openness, transparency, and accountability in instruction-tuned text gener-
ators. In Minha Lee, Cosmin Munteanu, Martin Porcheron, Johanne Trippas, and
Sarah Theres Völkel, editors, Proceedings of the 5th International Conference on Con-
versational User Interfaces, pages 47:1–47:6, Eindhoven, The Netherlands, 2023. ACM.

[33] Billy Perrigo. OpenAI used Kenyan workers on less than $2 per hour to make ChatGPT
less toxic. Time: https://time.com/6247678/openai-chatgpt-kenya-workers/, Jan-
uary 2023.

[34] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud,
Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori
Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abili-
ties of large language models. Transactions on Machine Learning Research, 2022(08):1–
30, 2022.

[35] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Prox-
imal policy optimization algorithms. arXiv, 1707.06347:1–12, 2017.

[36] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela
Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schul-
man, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter
Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models
to follow instructions with human feedback. In Sanmi Koyejo, S. Mohamed, A. Agar-
wal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information
Processing Systems, volume 35, pages 27730–27744, New Orleans, USA, 2022. Curran
Associates.

[37] Alberto Acerbi and Joseph M Stubbersfield. Large language models show human-like
content biases in transmission chain experiments. Proceedings of the National Academy
of Sciences, 120(44):e2313790120, 2023.

[38] Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. AI
generates covertly racist decisions about people based on their dialect. Nature,
633(8028):147–154, 2024.

[39] Steffen Herbold, Annette Hautli-Janisz, Ute Heuer, Zlata Kikteva, and Alexander
Trautsch. A large-scale comparison of human-written versus ChatGPT-generated es-
says. Scientific Reports, 13(18617):1–11, 2023.

[40] Peter S. Park, Philipp Schoenegger, and Chongyang Zhu. Diminished diversity-of-
thought in a standard large language model. Behavior Research Methods, 56:5754–5770,
2024.

[41] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell,
Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. John-
ston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver
Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards un-
derstanding sycophancy in language models. In Proceedings of the 12th International
Conference on Learning Representations, pages 1–33, Vienna, Austria, 2024. OpenRe-
view.

[42] Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori
Hashimoto. Whose opinions do language models reflect? In Andreas Krause, Emma
Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett,
editors, Proceedings of the 40th International Conference on Machine Learning, pages
29971–30004, Honolulu, USA, 2023. Proceedings of Machine Learning Research.

[43] Jochen Hartmann, Jasper Schwenzow, and Maximilian Witte. The political ideol-
ogy of conversational AI: Converging evidence on ChatGPT’s pro-environmental, left-
libertarian orientation. arXiv, 2301.01768:1–21, 2023.

[44] M. Keith Chen. The effect of language on economic behavior: Evidence from savings
rates, health behaviors, and retirement assets. American Economic Review, 103(2):690–
731, 2013.

[45] Patricia M. Greenfield. The changing psychology of culture from 1800 through 2000.
Psychological Science, 24(9):1722–1731, 2013.

[46] Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. From pretraining
data to language models to downstream tasks: Tracking the trails of political biases
leading to unfair NLP models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki
Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Com-
putational Linguistics, pages 11737–11762, Toronto, Canada, 2023. ACL.

[47] Tom S. Juzek and Zina B. Ward. Why does ChatGPT “delve” so much? Exploring
the sources of lexical overrepresentation in large language models. In Owen Rambow,
Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven
Schockaert, editors, Proceedings of the 31st International Conference on Computational
Linguistics, pages 6397–6411, Abu Dhabi, UAE, 2025. ACL.

[48] Matthew Gentzkow and Jesse M. Shapiro. What drives media slant? Evidence from
U.S. daily newspapers. Econometrica, 78(1):35–71, 2010.

[49] Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K.
Gray, Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig,
Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative
analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.

[50] Vittorio Tantucci and Aiqing Wang. Shifts in conversational engagement: Evidence
from 20 years of British English corpora. Applied Linguistics, 2024.

[51] Martin J. Pickering and Simon Garrod. Toward a mechanistic psychology of dialogue.
Behavioral and Brain Sciences, 27(02), 2004.

[52] Alexandra A. Cleland and Martin J. Pickering. The use of lexical and syntactic infor-
mation in language production: Evidence from the priming of noun-phrase structure.
Journal of Memory and Language, 49(2):214–230, 2003.

[53] Peter Gordon. Numerical cognition without words: Evidence from Amazonia. Science,
306(5695):496–499, 2004.

[54] Lera Boroditsky. How language shapes thought. Scientific American, 304(2):62–65,
2011.

[55] Jason W. Burton, Ezequiel Lopez-Lopez, Shahar Hechtlinger, Zoe Rahwan, Samuel
Aeschbach, Michiel A. Bakker, Joshua A. Becker, Aleks Berditchevskaia, Julian Berger,
Levin Brinkmann, Lucie Flek, Stefan M. Herzog, Saffron Huang, Sayash Kapoor,
Arvind Narayanan, Anne-Marie Nussberger, Taha Yasseri, Pietro Nickl, Abdullah Al-
maatouq, Ulrike Hahn, Ralf H. J. M. Kurvers, Susan Leavy, Iyad Rahwan, Divya
Siddarth, Alice Siu, Anita W. Woolley, Dirk U. Wulff, and Ralph Hertwig. How
large language models can reshape collective intelligence. Nature Human Behaviour,
8(9):1643–1655, 2024.

[56] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross J. Anderson,
and Yarin Gal. AI models collapse when trained on recursively generated data. Nature,
631(8022):755–759, 2024.

[57] Pierre Bourdieu. Language and Symbolic Power. Harvard University Press, Cambridge,
USA, 1991.

[58] William Labov. The Social Stratification of English in New York City. Cambridge
University Press, Cambridge, UK, 2 edition, 2006.

[59] Shakked Noy and Whitney Zhang. Experimental evidence on the productivity effects
of generative artificial intelligence. Science, 381(6654):187–192, 2023.

[60] Xiaofei Wang, Hayley M. Sanders, Yuchen Liu, Kennarey Seang, Bach Xuan Tran,
Atanas G. Atanasov, Yue Qiu, Shenglan Tang, Josip Car, Ya Xing Wang, Tien Yin
Wong, Yih-Chung Tham, and Kevin C. Chung. ChatGPT: Promise and challenges for
deployment in low- and middle-income countries. The Lancet Regional Health - Western
Pacific, 41:100905, 2023.

[61] Levin Brinkmann, Deniz Gezerli, Kira von Kleist, Thomas Franz Müller, Iyad Rahwan,
and Niccolo Pescetelli. Hybrid social learning in human-algorithm cultural transmis-
sion. Philosophical Transactions of the Royal Society A: Mathematical, Physical and
Engineering Sciences, 380(2227):20200426, 2022.

[62] Yasheng Huang. Why US–China relations are too important to be left to politicians.
Nature, 631(8022):736–739, 2024.

[63] Research Organization Registry. ROR Data, 2024. https://doi.org/10.5281/zenodo.11186879.

[64] Shuyo Nakatani. Language detection library for Java, 2010. https://www.slideshare.
net/slideshow/language-detection-library-for-java/6014274 (Accessed on July
31, 2024).

[65] Alexis Plaquet and Hervé Bredin. Powerset multi-class cross entropy loss for neural
speaker diarization. In Naomi Harte, Julie Carson-Berndsen, and Gareth Jones, editors,
Proceedings of the 24th Interspeech Conference, pages 3222–3226, Dublin, Ireland, 2023.
ISCA.

[66] Hervé Bredin. pyannote.audio 2.1 speaker diarization pipeline: Principle, benchmark,
and recipe. In Naomi Harte, Julie Carson-Berndsen, and Gareth Jones, editors, Pro-
ceedings of the 24th Interspeech Conference, pages 1983–1987, Dublin, Ireland, 2023.
ISCA.

[67] Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. WhisperX: Time-
accurate speech transcription of long-form audio. In Naomi Harte, Julie Carson-
Berndsen, and Gareth Jones, editors, Proceedings of the 24th Interspeech Conference,
pages 4489–4493, Dublin, Ireland, 2023. ISCA.

[68] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and
Ilya Sutskever. Robust speech recognition via large-scale weak supervision. arXiv,
2212.04356:1–28, 2022.

[69] Steven Bird. NLTK: The natural language toolkit. In Nicoletta Calzolari, Claire Cardie,
and Pierre Isabelle, editors, Proceedings of the 44th Annual Meeting of the Association
for Computational Linguistics, pages 69–72, Sydney, Australia, 2006. ACL.

[70] Martin F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[71] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to
Information Retrieval. Cambridge University Press, Cambridge, UK, 2008.

[72] Alberto Abadie and Jaume Vives-i-Bastida. Synthetic controls in action. arXiv,
2203.06279:1–39, 2022.

[73] Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word
representations in vector space. In Proceedings of the 1st International Conference on
Learning Representations, pages 1–9, Scottsdale, USA, 2013. OpenReview.

[74] Bruno Ferman and Cristine Pinto. Placebo tests for synthetic controls. Technical
Report 78079, University Library of Munich, Germany, 2017.

Supplementary Materials
Materials and Methods
Datasets to compute Log Odds Ratios of human and LLM word usage
arXiv The arXiv dataset contains abstracts of research papers published on arXiv. We
used the arXiv API^8 to extract 150 papers per month from 2019 to 2022, drawn from five
categories: Computer Science, Electrical Engineering and Systems Science, Mathematics,
Physics, and Statistics. These five categories are further divided into 133 subcategories,
and we gathered 7,182 abstracts in total.
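
As an illustration, the following is a minimal sketch of such a monthly harvest via the
public arXiv API; the subcategory, date handling, and use of the feedparser library are
our own assumptions rather than the exact pipeline used.

```python
import calendar

import feedparser  # third-party Atom/RSS parser: pip install feedparser

API = "http://export.arxiv.org/api/query"

def fetch_month(category: str, year: int, month: int, max_results: int = 150):
    """Fetch up to `max_results` abstracts submitted in a given month."""
    # submittedDate ranges use the YYYYMMDDHHMM format (GMT).
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year}{month:02d}010000"
    end = f"{year}{month:02d}{last_day}2359"
    query = (
        f"search_query=cat:{category}+AND+submittedDate:[{start}+TO+{end}]"
        f"&start=0&max_results={max_results}"
    )
    feed = feedparser.parse(f"{API}?{query}")
    return [entry.summary for entry in feed.entries]  # summary = abstract

# Example: abstracts from one Computer Science subcategory, March 2021.
abstracts = fetch_month("cs.CL", 2021, 3)
```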

bioRxiv The bioRxiv abstracts were gathered using the bioRxiv API.^9 We ran a brute-force
query over each month from 2019 to 2022, collecting 60 papers per month, or 2,880 papers
in total. Together, the arXiv and bioRxiv datasets accounted for 10,000 unique abstracts.
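
A corresponding sketch for bioRxiv, assuming the public details endpoint (which returns
records in a "collection" field), might look as follows; the 60-paper cap mirrors the
sampling described above.

```python
import requests

def fetch_biorxiv(start: str, end: str, n: int = 60) -> list:
    """Fetch up to `n` abstracts posted between two ISO dates."""
    # The trailing 0 is the pagination cursor; one call suffices here
    # because only the first 60 records per month are kept.
    url = f"https://api.biorxiv.org/details/biorxiv/{start}/{end}/0"
    records = requests.get(url, timeout=30).json().get("collection", [])
    return [rec["abstract"] for rec in records[:n]]

abstracts = fetch_biorxiv("2021-03-01", "2021-03-31")
```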

Nature Nature abstracts were collected using the search engine on the Nature website,
querying up to 20 pages of 50 results each. Our goal was to collect between 7,000 and
10,000 abstracts, a size comparable to our other datasets. To achieve this, we ran a query
without specific search terms, restricted to publications from 2019 to 2023, and sorted
the results both ascending and descending by year. This dual sorting approach yielded over
8,000 unique abstracts, sufficient for our purposes.

Emails The email dataset was sourced from the publicly available Enron email dataset on
Kaggle.^10 The original dataset contains 500,000 emails written by employees of the Enron
Corporation between 2000 and 2001, well before the introduction of ChatGPT. For our use
case, we randomly sampled 10,000 of these emails.

Essays This dataset comprises student essays collected from The Hewlett Foundation:
Automated Essay Scoring challenge on Kaggle,^11 released in 2012 with the goal of
developing an automated essay-scoring algorithm. All essays in the challenge were composed
by students. As with the other datasets, we sampled 10,000 essays for our analysis.

News The original dataset consists of 210,000 news headlines published by HuffPost from
2012 to 2022 and is available on Kaggle with 42 news categories.^12 For our purposes, we
sampled 10,000 of the articles' short descriptions, which served as news abstracts.
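
The Emails, Essays, and News datasets were each downsampled to 10,000 items. A minimal
sketch of this step is shown below; the file and column names are hypothetical
placeholders for the respective Kaggle downloads.

```python
import pandas as pd

def sample_texts(path: str, column: str, n: int = 10_000, seed: int = 0) -> pd.Series:
    """Draw a reproducible random sample of `n` texts from one corpus."""
    if path.endswith(".json"):
        df = pd.read_json(path, lines=True)  # the HuffPost dump is JSON Lines
    else:
        df = pd.read_csv(path)
    return df[column].dropna().sample(n=n, random_state=seed)

emails = sample_texts("enron_emails.csv", "message")
essays = sample_texts("student_essays.csv", "essay")
news = sample_texts("huffpost_news.json", "short_description")
```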

Wikipedia We used the Wikipedia API^13 to pull articles from Wikipedia. One restriction of
the API is that it returns only the article title and article ID. Because of duplicate
articles, we had to query 30,000 of them. After eliminating duplicates, we extracted the
articles' publication dates and kept only those published between 2019 and 2022. From this
subset, we used the page titles to randomly collect the content of 10,000 articles.
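
A sketch of this two-step process, using the MediaWiki API's random-page listing (which
returns only titles and page IDs) and its plain-text extracts, is given below; batch sizes
and error handling are simplified for illustration.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def random_titles(batch: int = 500) -> set:
    """List random main-namespace articles; a set drops duplicate titles."""
    params = {"action": "query", "list": "random", "rnnamespace": 0,
              "rnlimit": batch, "format": "json"}
    data = requests.get(API, params=params, timeout=30).json()
    return {page["title"] for page in data["query"]["random"]}

def page_text(title: str) -> str:
    """Fetch the plain-text content of one article by title."""
    params = {"action": "query", "prop": "extracts", "explaintext": 1,
              "titles": title, "format": "json"}
    pages = requests.get(API, params=params, timeout=30).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")
```
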
^8 https://info.arxiv.org/help/api/index.html
^9 https://api.biorxiv.org/
^10 https://www.kaggle.com/datasets/wcukierski/enron-email-dataset
^11 https://www.kaggle.com/competitions/asap-aes
^12 https://www.kaggle.com/datasets/rmisra/news-category-dataset
^13 https://www.mediawiki.org/wiki/API:Main_page

Supplementary Figures

[Two panels: synonym donors (left), random donors (right).]

Figure S1: Robustness of the findings presented in Fig. 2A across different donor
selection strategies. (Left: synonym donors, Right: random donors) The observed frequency
increase of delve in academic YouTube talks after ChatGPT’s release remains consistent when
employing different donor words for synthetic control construction.

[Three panels: neutral donors, synonym donors, random donors.]

Figure S2: Specificity of the increase in the post/pre-ratio of MSPE to ChatGPT's release
date. The difference between the frequency of delve and its synthetic control (i.e., the
MSPE) is elevated as the treatment date is set closer to the actual ChatGPT release date.
Since the difference remained high after the release date, the ratio is influenced by how
well the synthetic controls approximate pre-treatment error trends. Therefore, the ratio
does not necessarily peak at the right end, as in the case of random donors.
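
For intuition, the sketch below shows the generic synthetic-control logic behind this
ratio (an illustrative simplification, not the exact estimator used here): donor weights
are constrained to be non-negative and to sum to one, fitted on the pre-treatment window,
and the MSPE is then compared after versus before treatment.

```python
import numpy as np
from scipy.optimize import minimize

def post_pre_mspe_ratio(treated: np.ndarray, donors: np.ndarray, t0: int) -> float:
    """treated: (T,) word-frequency series; donors: (T, J); t0: treatment index."""
    J = donors.shape[1]
    # Pre-treatment fit of convex donor weights.
    loss = lambda w: np.mean((treated[:t0] - donors[:t0] @ w) ** 2)
    res = minimize(loss, np.full(J, 1.0 / J),
                   bounds=[(0.0, 1.0)] * J,
                   constraints=({"type": "eq", "fun": lambda w: w.sum() - 1.0},))
    synth = donors @ res.x
    mspe_pre = np.mean((treated[:t0] - synth[:t0]) ** 2)
    mspe_post = np.mean((treated[t0:] - synth[t0:]) ** 2)
    return mspe_post / mspe_pre
```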

[Three panels: neutral donors, synonym donors, random donors.]

Figure S3: Posterior distribution of the event-related coefficient βw,event,GPT for top
GPT words following ChatGPT’s release. The distribution quantifies the distinct trend
change in word frequency specifically attributed to GPT words, compared to synthetic controls,
after the release date. The shaded regions represent the 95% credible intervals, demonstrating
the robustness of the observed linguistic shifts across multiple words.

[Five panels: Business, Education, Religion and Spirituality, Science and Technology,
Sports.]

Figure S4: Trend changes of top GPT words across different podcast topics. The synthetic
controls were constructed from untreated donors, and the same model as in (1) was applied.
While Science and Technology tended to exhibit the most significant increases in the
frequency of GPT words, some GPT words (e.g., boast, swift, and meticulous) show similar
trends across podcast topics.

System prompt:
You are a great research assistant who is asked to analyze YouTube data. You
will be provided a list of YouTube channels as well as a target information.
Please select the best channel that seems to be owned by the target.
Importantly, please do not add explanations or comments other than the
selected channel name. If there is no appropriate channel, please return N/A.

Example input prompt:


# Institution

Name: Max Planck Institute for Human Development


Address: Berlin, Germany

# Candidates

Title: Max Planck Institute for Human Development


Description: The Max Planck Institute for Human Development (MPIB),
which was founded in 1963, is dedicated to the study of human ...
---
Title: IMPRS LIFE
Description: The International Max Planck Research School on the Life
Course (LIFE) is a joint international PhD Program of the Max Planck ...
---
Title: Behavioral Insights Bicocca
Description: BIB-Behavioral Insights Bicocca is a new research center focused
on the behavioral analysis of public policies and public ...

Figure S5: Prompt provided to gpt-3.5-turbo-0125 to pick the most plausible channel among
the query results from the YouTube API.

[Figure S6 image: 19 panels, one per word (delve, underscore, crucial, swiftly, necessity,
heightened, meticulous, unwavering, amidst, notable, showcase, boast, significant, pivot,
renowned, intricate, encompass, align, comprehend), each plotting log probability (y-axis)
against dataset (x-axis: arXiv, bioRxiv, Nature Abstracts, Emails, Essays, News,
Wikipedia), with series for gpt3.5-turbo, gpt4, gpt4-turbo, gpt4o, and human.]

Figure S6: Log probabilities of human and LLM-revised text. We calculated the
log-probability of a word appearing in human-authored text and in a version of the same
text revised by different LLMs. Each colored point represents the log-probability for a
specific combination of model, dataset, and prompt. The log-probability for the original
human-authored text is shown in gray. Some LLM calls failed for various reasons, such as
policy violations. Consequently, the corresponding human-authored texts were removed from
the dataset, introducing slight variations in the associated probabilities even though the
source dataset remained identical.

[Figure S7 image: 19 panels for the same words as Figure S6, each plotting the Log Odds
Ratio (y-axis, roughly 0 to 6) against model (x-axis: gpt3.5-turbo, gpt4, gpt4-turbo,
gpt4o), with one series per dataset (arXiv, bioRxiv, Nature Abstracts, Emails, Essays,
News, Wikipedia).]

Figure S7: Log-Odds ratios (LORs) of words in human vs. LLM-revised text. We
calculated the LOR of a word appearing in human-authored text compared to its appearance in
a version revised by an LLM. Displayed here are the 19 words with the highest average LOR
across all datasets, models, and prompts. The data are stratified by dataset and model, with
error bars representing the standard error associated with the three prompts analyzed.
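
For reference, a minimal sketch of the LOR computation with add-one smoothing is shown
below; the counts in the example are invented for illustration.

```python
import math

def log_odds_ratio(count_llm, total_llm, count_human, total_human):
    """LOR of a word occurring in LLM-revised vs. human-authored texts."""
    p_llm = (count_llm + 1) / (total_llm + 2)      # add-one smoothing
    p_human = (count_human + 1) / (total_human + 2)
    return math.log(p_llm / (1 - p_llm)) - math.log(p_human / (1 - p_human))

# e.g., a word in 120 of 10,000 LLM-revised texts vs. 5 of 10,000 human
# texts yields a strongly positive LOR.
print(log_odds_ratio(120, 10_000, 5, 10_000))
```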
