Showing posts with label research blogging. Show all posts

Tuesday, November 23, 2010

purple pain and a gene called 'straightjacket'

Dr. Kevin Mitchell, a neuroscientist at the Smurfit Institute of Genetics, Trinity College Dublin, posted at his excellent blog Wiring the Brain about a weird, interesting study* that points to a possible genetic explanation of synaesthesia** (e.g., hearing a word and experiencing the color red). The authors were studying pain mechanisms in fruit flies (turns out the mechanisms are similar to us mammals, whuddathunk?). Once they identified a particular gene they dubbed straightjacket*** which is "involved in modulating neurotransmission," they systematically deleted it in test flies and discovered that the test subjects**** no longer processed the pain stimuli, even though the stimuli were still following the pathway. In Mitchell's words:

Somehow, deletion of CACNA2D3 alters connectivity within the thalamus or from thalamus to cortex in a way that precludes transmission of the signal to the pain matrix areas. This is where the story really gets interesting. While they did not observe responses of the pain matrix areas in response to painful stimuli, they did observe something very unexpected – responses of the visual and auditory areas of the cortex! What’s more, they observed similar responses to tactile stimuli administered to the whiskers. Whatever is going on clearly affects more than just the pain circuitry (emphasis added).

So, if I understand this, they turned off the ability to recognize pain, but when they administered painful stimuli (heat), the test subjects had visual, auditory, and tactile experiences. Imagine putting a flame to your hand and seeing purple. Pretty frikkin awesome. Dr. Mitchell's post does more justice to this complex study; I just thought it was awesome.


*Geez! Take a look at the author list of the publication. Do you have a place for 12th author on YOUR CV?

**FYI: Synaesthesia is NOT the same thing as sound symbolism, necessarily. True synaesthesia is a rare phenomenon that appears to have biophysical roots. Sound symbolism is mostly hippie-dippy bullshit exploited by marketing professionals to sell stuff.

***I have no clue why they called it this, but it's a hell of a lot more awesome than CACNA2D3.

****There were multiple studies referenced, some involving fruit flies, some involving mice, and it wasn't clear to me which evidence came from which studies, so I have chosen to use the cover term "test subjects."

ResearchBlogging.org
Neely GG, Hess A, Costigan M, Keene AC, Goulas S, Langeslag M, Griffin RS, Belfer I, Dai F, Smith SB, Diatchenko L, Gupta V, Xia CP, Amann S, Kreitz S, Heindl-Erdmann C, Wolz S, Ly CV, Arora S, Sarangi R, Dan D, Novatchkova M, Rosenzweig M, Gibson DG, Truong D, Schramek D, Zoranovic T, Cronin SJ, Angjeli B, Brune K, Dietzl G, Maixner W, Meixner A, Thomas W, Pospisilik JA, Alenius M, Kress M, Subramaniam S, Garrity PA, Bellen HJ, Woolf CJ, & Penninger JM (2010). A Genome-wide Drosophila Screen for Heat Nociception Identifies α2δ3 as an Evolutionarily Conserved Pain Gene. Cell, 143 (4), 628-38 PMID: 21074052

Friday, October 1, 2010

do boys need more language help than girls?

No.

UPDATE: Much thanks to Dorothy Bishop, Professor of Developmental Neuropsychology, Department of Experimental Psychology, University of Oxford for emailing me a copy of the original paper. I am reading it now and hope to post a more substantive review of the actual article later. For now, I've added just a few points in orange below.

But that's the conclusion of the anonymous journalist/stenographer from Science Daily who wrote the recent story Building Language Skills More Critical for Boys Than Girls, Research Suggests. The author states, "Developing language skills appears to be more important for boys than girls in helping them to develop self-control and, ultimately, succeed in school."

Unfortunately I cannot find the original article (citation below) freely available, so all I have to go on is the brief description from the Science Daily piece:

The researchers examined data on children as they aged from 1 to 3 and their mothers who participated in the National Early Head Start Research and Evaluation study. As with previous research, Vallotton and Ayoub found that language skills -- specifically the building of vocabulary -- help children regulate their emotions and behavior and that boys lag behind girls in both language skills and self-regulation.


What was surprising, Vallotton said, was that language skills seemed so much more important to the regulation of boys' behavior. While girls overall seemed to have a more natural ability to control themselves and focus, boys with a strong vocabulary showed a dramatic increase in this ability to self-regulate -- even doing as well in this regard as girls with a strong vocabulary (emphasis added).

I cannot speak directly to the methodology without access to the original article. My guess is that there was some attempt to qualitatively correlate scores on vocabulary tests to either records of bad behavior or observed behavior. I could be wrong.

UPDATE: They measured two linguistic features, talkativeness and vocabulary, in 120 kids aged 14 months, 24 months, and 36 months: "Mother–child dyads were videotaped at home for 10 min in a semi-structured play task ... Every vocalization by mothers and children was transcribed ... a trained observer used the Bayley Behavior Rating Scale (BBRS; Bayley, 1993) to rate the child’s ability to self-regulate. Children were rated on each of seven items which included behaviors such as their ability to maintain attention on the tasks, their degree of negativity, and their adaptation to changes in testing materials."

But I'm skeptical about the claims in Science Daily because it strikes me as the sort of thing that would take years of studying and dozens of researchers to come to any definite conclusions about (UPDATE: I remain skeptical about the Science Daily claims, but those are distinct from the claims in the original article). Yet we have just this one study. It also draws a causal connection between a language skill (vocabulary) and a non-language behavior (emotion and "self-regulation"). It is extremely difficult, under even the best circumstances, to do that. And even when this is done, there are typically teams of neuroscientists using fMRIs and such involved. I mean no disrespect to the authors of the study. They are both accomplished professors of psychology, a very important and challenging field. But they are not, as far as I can tell, either neuroscientists or psycholinguists. The second author, Catherine Ayoub, appears to have a specialty in "Legal mental health issues with children" (see PDF here).

UPDATE: According to the original article, there are well established empirical methods for judging a child's "expressive language".

This seems to be a case of over-interpretation with the intent of building actionable policy directives. I understand and sympathize with the impulse to translate scientific research into something directly useful that a teacher can implement today. Look, all you have to do is help boys build their vocabulary and they will behave themselves better! Unfortunately, it is rarely wise to make that leap so quickly. I suspect there is no there there.

UPDATE: There certainly is something here. I'll need more time to digest the methods and results to comment further.

ResearchBlogging.org
Vallotton, C., & Ayoub, C. (2010). Use your words: The role of language in the development of toddlers’ self-regulation Early Childhood Research Quarterly DOI: 10.1016/j.ecresq.2010.09.002

Sunday, September 26, 2010

pullum bait

Jeremy Porter decided to adapt Strunk and White's infamous Elements of Style for Tweeting here. C'mon Geoffrey, you know you wanna respond... I dare ya...

Wednesday, September 1, 2010

the largest whorfian study EVER! (and why it matters)

Let me take the ball Mark Liberman threw on Monday and run with it a bit. Liberman posted a thorough discussion of Fausey and Boroditsky's neo-Whorfian paper English and Spanish speakers remember causal agents differently. Specifically, he invited readers to carefully examine the methodology of the experiments themselves, and not just focus on the conclusions. It turns out that a few years ago another set of neo-Whorfians, Jürgen Bohnemeyer and company, published a paper that addressed similar methodological concerns:

Ways to go: Methodological considerations in Whorfian studies on motion events. (With S. Eisenbeiss and B. Narasimhan) Colchester: University of Essex, Department of Language and Linguistics (Essex Research Reports in Linguistics 50: 1-19). 2006.

This paper addressed experiments involving motion events like rolling and falling whereas Fausey and Boroditsky's work addressed agentivity like breaking and popping, but there's enough overlap to warrant some comparison, particularly since the Bohnemeyer et al. paper specifically addresses methodology wrt Whorfian experiments.

But before I get into the details, let me state clearly why I think this is important. In other posts, I have dismissed popular lingo-topics like language evolution as outside the mainstream of linguistics because they don't bear directly on what I consider to be the center of the linguistics universe: How the brain does language. But linguistic relativity (aka, the Whorfian hypothesis) is one of the great questions of linguistics and cognitive science precisely because it bears directly on the question of how the brain does language. And we're only just now developing the proper tools and methodologies to study the question with scientific rigor. It may turn out that language does not affect other cognitive processes or the effect is minor. I don't care. I just want to know one way or the other. And it's work like Bohnemeyer's and Boroditsky's that will lead us to knowing, eventually.

Now the fun stuff.

Thursday, June 10, 2010

Is Arabic The Least Positive Language? (hint, no) ... sigh

Sometimes bad science reporting is a function of bad science. Garbage in, garbage out.

There's been some buzz about new research regarding the bias of negative and positive words in English as well as cross linguistically. I have refrained from commenting because it sounded like typical bad reporting and misunderstanding of academic research. Then Andrew Sullivan got involved. Sigh. Sullivan has his strengths and weaknesses as a blogger. His strength shone brightly last summer when he helped publicize the Iranian green movement. His weakness, however, peeps out anytime he blogs about anything remotely related to science or academics (see HERE and HERE). His most recent silliness has the title The English Language Is An Optimist. His megaphone is so big, I feel someone must clear up the foggy facts and murky interpretations currently being disseminated.

To begin, the research under question is from Rozin et al., U Penn psychologists who appear to be focused on emotion research (full citation below). As far as I can tell, no linguists were involved (and boy oh boy, they should have been. Ya know, Penn has a linguistics department that is, let's just say, above average). The basic point of the research cited is this: Positive events are more common (more tokens), but negative events are more differentiated (more types). Sullivan simply posts a quote from another blog which regurgitates the research as if it were true with no critical analysis on anyone's part. I will offer the much needed critical analysis here.

Here are the four facts about English that everyone seems to find so fascinating:

Wednesday, May 26, 2010

yeah right ctd.

Thanks to Twitter #linguistics, I discovered that Hebrew University grad student Oren Tsur will be in DC next week presenting a paper on automatic detection of sarcasm in product reviews (see here and here for reactions). I've posted on sarcasm before (see here and here) so I'm curious. The conference is the 4th Int'l AAAI Conference on Weblogs and Social Media at GW and it looks rather interesting (the first interesting thing to happen in Foggy Bottom since Watergate?). I might could take some PTO and check it out.

Tsur's work can be found here: A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews (pdf).

FYI: while Tsur's work relies solely on written words, Joseph Tepperman et al. from USC work on sarcasm in voice recognition: “YEAH RIGHT”: SARCASM RECOGNITION FOR SPOKEN DIALOGUE SYSTEMS (pdf).

Monday, May 24, 2010

The Politics of Publishing


Let's talk about class warfare in academics, shall we? I just read a nice little article on speech production from Cognition and while I enjoyed it, I couldn't help but wonder how it got published because it was rather lightweight. To be fair, Cognition published it as a "Brief article" so it was meant to be short*; nonetheless, it had the feel of a grad student poster, not a publication. You might argue that this is the point of a "Brief article", but I will argue that similar content would likely not have been published had it not been recognizably associated with a well known scholar. Despite the precautions of blind reviews, it is not uncommon for a linguistics reviewer to have a pretty good idea of who authored or co-authored a paper, simply because linguistics is a small field, and the sub-fields even smaller. Most scholars have easy-to-recognize methodologies, content areas, or style that acts almost as a scholarly fingerprint. I don't mean to be mean-spirited, and I hope this doesn't come across that way, but minus the second author's fingerprint, I don't see this paper getting accepted.

But first, let's look at the paper itself: A purple giraffe is faster than a purple elephant: Inconsistent phonology affects determiner selection in English (full citation below).

Tuesday, April 6, 2010

John’s grandmother feeds the monkey every morning

There's a brief and shallow puff piece out discussing new research about differences in how the brain processes word order versus inflection with the absurd title Languages use different parts of the brain. Even if you know nothing about linguistics you can quickly determine that the title is absurd because the article itself admits that the study involved used only ONE language! This was not a cross-linguistic study. It says nothing about what parts of the brain different languages use. The author makes the leap of logic of assuming that (A) because languages can be typed according to their morphology (fusional, agglutinating, etc.), (B) languages that are predominantly agglutinating must therefore be processed differently than fusional languages. Nope. The study did not show this.

The research paper which spawned this puff piece is Dissociating neural subsystems for grammar by contrasting word order and inflection by Aaron J. Newman, Ted Supalla, Peter Hauser, Elissa L. Newport, and Daphne Bavelier, but it's behind a paywall, of course. As far as I can tell from the abstract, the researchers used sign language stimuli to discover that sentences which relied on word order to convey case information activated different patterns in the brain than sentences using inflections (which the puff piece quaintly calls "tags"). From the abstract:

During functional (f)MRI, native signers viewed sentences that used only word order and sentences that included inflectional morphology. The two sentence types activated an overlapping network of brain regions, but with differential patterns. Word order sentences activated left-lateralized areas involved in working memory and lexical access, including the dorsolateral prefrontal cortex, the inferior frontal gyrus, the inferior parietal lobe, and the middle temporal gyrus. In contrast, inflectional morphology sentences activated areas involved in building and analyzing combinatorial structure, including bilateral inferior frontal and anterior temporal regions as well as the basal ganglia and medial temporal/limbic areas. These findings suggest that for a given linguistic function, neural recruitment may depend upon the cognitive resources required to process specific types of linguistic cues. (emphasis added)

The final sentence of the abstract is compelling as it makes a claim about neural recruitment and cognitive resources. NOT about different languages using different parts of the brain! There are some respected linguists on the author list, so I suspect the paper is worth reading (if they would let me, that is!). But the original puff piece did provide two of the stimuli:
  • John’s grandmother feeds the monkey every morning.
  • The prison warden says all juveniles will be pardoned tomorrow.
Psycholinguistic stimuli are often funny because they need to be constructed to contain very specific features, so I can forgive them these awkward sentences, but really? They couldn't have gramma feeding a dog? It had to be a monkey? Hmmmmm. Probably has something to do with the inflections for nouns, but c'mon, a monkey? Sounds downright lewd.

Wednesday, February 17, 2010

A Constraint Based Approach To Figure Skating

While perhaps not quite a pure crash blossom, this headline caught me off guard:


Honestly, my first reaction was to wonder if there was a new scoring system (yes, there is) and what was wrong with the old one (bias and collusion). In other words, what was broken and how was it improved? Of course, there's another meaning of fixed -- 'to cheat.'  In other words, are figure skating outcomes rigged by cheating?  Were this headline from any other publication than the increasingly dumbed down Slate, I'd assume the ambiguity was intentional, but with Slate these days, you just never know. Note that there are at least two other senses for the word fixed: to spay/neuter a pet and to have sufficient amount of something like money (British English as in 'You Kev mate, you fixed for goin' out later? HT Urban Dictionary). With at least 4 senses to choose from, no wonder I was a tad confused.

But how did my super duper human language processing system resolve this?

Sunday, February 7, 2010

Dolphin-Bikes and The Iconicity Effect

Since the journal Cognition typically allows free online access to its current volume, I was able to read a recent paper on a topic that I've always found interesting: the role of embodied experience in language processing. The basic question is, how does our size and shape and orientation as human beings affect our language? Think about a creature that's physically very different from us, like jellyfish or bacteria or dolphins. Now imagine those creatures magically had the same cognitive capacity that we do.

Would our language system work for them or would it necessarily have to be different? 

Friday, January 15, 2010

Blue Meat and Clever Research

Cognitive Daily reviews some really clever research on synesthesia, the phenomenon of associating words with colors, as well as other multimodal associations (not to be confused with its poor cousin sound symbolism). For example, there are people who will experience seeing the color blue when they hear the word meat (the actual word-color associations are not fixed or predictable, as far as I know). There is neuroscience research suggesting that people who experience this have some sort of overlap in processing areas for the word-color pairs (read an excellent roundup of the research here at NeuroLogica Blog). But this is a difficult area to study because there are so few true synesthetes and their experiences are inconsistent.

Bargary et al. 2009 wanted to discover when the color association was triggered in the time course of lexical recognition. Exactly how were they going to track that?

Sunday, January 10, 2010

More Russian Illusions Than I

Colin Phillips gave a nice plenary talk at the LSA this afternoon on the role grammatical illusions can play in studying the online processing of sentences (was it just me, or did his English accent seem more pronounced than usual? Was this a social register effect or am I off my rocker?).  He drew a really nice parallel with optical illusions and the value they have added to the study of vision.  The point is that there are some sentences that seem perfectly grammatical at first, but upon reflection, are completely incoherent. For example:
  • More people have been to Russia than I have.
Most native speakers of English will read this sentence and be perfectly happy, but re-read it a few times. Do you see the incoherence? It's incoherent because ...

Sunday, December 20, 2009

SEX! TORTURE! BANANA!

Do some words grab your attention more than others because of their semantic content? If I want to get the attention of 12 screaming kids, would I be better off yelling "SEX!" or "EGGPLANT!"?

This was the topic (kinda) of a study recently reviewed by the excellent Cognitive Daily blog: Huang, Y., Baddeley, A., & Young, A. (2008). Attentional capture by emotional stimuli is modulated by semantic processing. Journal of Experimental Psychology: Human Perception and Performance, 34 (2), 328-339 DOI: 10.1037/0096-1523.34.2.328.

The study used an interesting methodology: rapid serial visual presentation, or RSVP, which involves showing participants a random stream of stimuli, flashing by one every tenth of a second. Whiz bang! That's a lot of flashing. Let Cognitive Daily explain:

Typically if you're asked to spot two items in an RSVP presentation, you'll miss the second one if it occurs between about 2/10 and 4/10 of a second after the first one, but not sooner or later. This phenomenon is called Attentional Blink -- a blind spot caused by the temporary distraction of seeing the first item... Their streams were simply random strings of letters and digits, with two words embedded in each stream. Then they asked students to look for words naming fruit as they flashed by. If a fruit word appeared, it was always the second word in a stream. The key was in the first word: half the time, this first word was a neutral word like bus, vest, bowl, tool, elbow, or tower, and half the time it was an emotional word like rape, grief, torture, failure or morgue. So a sequence might look like this:

  • JW34KA
  • QPLX12
  • MC15KW
  • 083FLB
  • TORTURE
  • S21L0C
  • DJW09S
  • BANANA
  • 3LW8Z9
  • XOWL01
And so on. The first word acts as a distractor: the students are looking for fruit words, but this is always a non-fruit word. The question is, are emotional words more distracting?
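To make the design concrete, here's a rough sketch of how such an RSVP trial stream could be generated. This is my own illustration, not the study's actual materials: the word lists, stream length, and position constraints are all made-up placeholders.

```python
import random
import string

def make_rsvp_stream(distractor, target, length=10, seed=None):
    """Build one RSVP trial: `length` random alphanumeric strings, with a
    distractor word embedded first and a target (fruit) word embedded later."""
    rng = random.Random(seed)

    def junk():
        # Random 6-character string of letters and digits, like JW34KA above.
        return "".join(rng.choices(string.ascii_uppercase + string.digits, k=6))

    stream = [junk() for _ in range(length)]
    d_pos = rng.randrange(1, length - 3)          # distractor somewhere early
    t_pos = rng.randrange(d_pos + 1, length - 1)  # target always after it
    stream[d_pos] = distractor.upper()
    stream[t_pos] = target.upper()
    return stream

# Hypothetical word lists standing in for the study's materials:
emotional = ["torture", "grief", "morgue"]
neutral = ["bus", "vest", "tower"]
fruits = ["banana", "apple", "mango"]

trial = make_rsvp_stream(random.choice(emotional), random.choice(fruits))
print(trial)
```

The only design constraint the quote commits to is that the fruit word is always the second embedded word, which the `t_pos` bound enforces.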

The result of the experiments was ...

Sunday, December 13, 2009

On Linguistic Fingerprinting

Can an author's writing style be defined by the frequency of unique words in their writings? According to physicist Sebastian Bernhardsson, the answer is yes. He found a couple of interesting facts: 1) the more we write, the more we repeat words and 2) the rate of repetition (or rate of change) seems to be unique to individual authors (creating a "linguistic fingerprint"... literally his words, not mine). Let me walk through his claims and findings, just a bit.

Bernhardsson et al. are in press with a corpus linguistics study which compared rates of unique words between short and long form writing (short stories vs. novels vs. corpora). I stumbled onto this research earlier this week when a BBC News title caught my eye: Rare words 'author's fingerprint': Analyses of classic authors' works provide a way to "linguistically fingerprint" them, researchers say.

The idea of linguistically fingerprinting authors has been around for a while. In some ways it acted as a loss leader decades ago, piquing interest in the use of corpora and statistical methods to study language, and now there is even a whole journal called Literary and Linguistic Computing. Plus, there is an established practice of forensic linguistics where linguistic methods are used to establish authorship of critical legal documents.

However, Bernhardsson makes a bold claim. He claims that the process of writing (a cognitively complex process) can be described as the process of pulling chunks out of a large meta-book which shows the same statistical regularities as an author's real work (he hedges on this a bit, of course). I always shiver when I run across a non-linguist jumping head first into linguistics making bold claims like this, but I also recognize that Bernhardsson and his co-authors are pretty smart folks, so I gave them the benefit of the doubt and skimmed one of their two available papers (freely available here).
  • The meta book and size-dependent properties of written language. Authors: Sebastian Bernhardsson, Luis Enrique Correa da Rocha, Petter Minnhagen. New Journal of Physics (2009), accepted.
First, I concentrated on the first section because the paper goes in a different direction that was not necessary for me to cover (and has lots of scary algorithms; it is Sunday and I do want to watch football, hehe). What they did was count the number of words in a text, then count the number of unique words (this is a classic type/token distinction). Here's what they found:

When the length of a text is increased, the number of different words is also increased. However, the average usage of a specific word is not constant, but increases as well. That is, we tend to repeat the words more when writing a longer text. One might argue that this is because we have a limited vocabulary and when writing more words the probability to repeat an old word increases. But, at the same time, a contradictory argument could be that the scenery and plot, described for example in a novel, are often broader in a longer text, leading to a wider use of ones vocabulary. There is probably some truth in both statements but the empirical data seem to suggest that the dependence of N (types) on M (tokens) reflects a more general property of an authors language. (my emphasis and additions).

First, let's make sure we get what the authors did. We have to use words more than once, right? I've already repeated the word "we" in just the last two sentences. And we repeat words like "the" and "of" all the time. We have to. So there are types of words, like "the", but there are also the number of times those words get repeated (tokens). It's pretty straightforward to simply count the total number of words in a story, then count the total number of types of words, giving us a ratio. For example, let's say we have a short story by Author X with 1000 words in it (= tokens). Then we count how many times each word is repeated and we find that there are only 250 unique words (= types). This means there is a ratio of 1000/250, or 100/25 (for comparison's sake I'm using this ratio). This means that only 25% of the words are unique, which also means that, on average, a word is repeated 4 times in this story.

Now let's take a novel by Author X with 100,000 words (= tokens). After counting repetitions we find it has 11,000 unique words. Our token/type ratio = 100,000/11,000, or 100/11. This means that only 11% of the words are unique, which means, on average, a word gets repeated about 9 times. That's higher than in the short story. Words are being repeated more in the novel. Now let's imagine we take all of Author X's written work, put it together into a single corpus, and repeat the process, and discover that the ratio is 100/7 (on average, a word gets repeated about 14 times).
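For anyone who wants to double-check the arithmetic, here's the same calculation in a few lines of Python, using the hypothetical Author X numbers from the walkthrough (these are illustrative figures, not real corpus counts):

```python
# Type/token arithmetic from the walkthrough above.

def ttr_stats(tokens, types):
    """Return (percent of words that are unique, average uses per word)."""
    return 100.0 * types / tokens, tokens / types

# Short story: 1,000 tokens, 250 types
pct, reps = ttr_stats(1000, 250)
print(pct, reps)  # 25.0% unique; each word used 4.0 times on average

# Novel: 100,000 tokens, 11,000 types
pct, reps = ttr_stats(100_000, 11_000)
print(round(pct, 1), round(reps, 1))  # 11.0% unique; ~9.1 uses per word
```

Same numbers, same conclusion: the longer the text, the more each word gets reused.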

UPDATE: whoa, my maths was off a bit the first time I did this. That'll teach me to write a blog post while watching Indy crush Denver. Sorry, eh.

This is what the authors found: "The curve shows a decreasing rate of adding new words which means that N grows slower than linear (α less than 1)."

They discovered something potentially even more interesting: the rate of change between these ratios is unique to each author. Here is their graph from the article (H = Thomas Hardy, M = Herman Melville, and L = D.H. Lawrence):
FIG. 1: The number of different words, N, as a function of the total number of words, M, for the authors Hardy, Melville and Lawrence. The data represents a collection of books by each author. The inset shows the exponent α = ln N / ln M as a function of M for each author.

Their conclusions about the meta-book and linguistic fingerprint:

These findings lead us towards the meta book concept : The writing of a text can be described by a process where the author pulls a piece of text out of a large mother book (the meta book) and puts it down on paper. This meta book is an imaginary infinite book which gives a representation of the word frequency characteristics of everything that a certain author could ever think of writing. This has nothing to do with semantics and the actual meaning of what is written, but rather to the extent of the vocabulary, the level and type of education and the personal preferences of an author. The fact that people have such different backgrounds, together with the seemingly different behavior of the function N(M) for the different authors, opens up for the speculation that every person has its own and unique meta book, in which case it can be seen as a fingerprint of an author. (my emphasis)

They are quick to point out that this finding says nothing about the semantic content of the writings. So what does it say? I admit I had a hard time seeing any conclusion about cognition or the writing process. Even while finding the methodology interesting, I'm just not at all sure what it really says about the human brain and language, if anything at all. The speculation that "every person has their own unique meta book" is bold. Unfortunately, it is also almost entirely untestable. Keep in mind that this research had zero psycholinguistic component. They were just counting words on pages. I'd caution against drawing any conclusions about the human language system based solely on this work. (I should note that I skipped one of the most interesting findings: the section of work doesn't matter, simply the size. Meaning, they took random chunks from their corpora and found the same patterns, if I understood that part correctly.) Which begs the question: why is this being published in a physics journal? It's appearing in the New Journal of Physics, and a quick perusal of articles from previous editions doesn't show anything remotely similar to this work (no surprise).

I'm a fan of corpus linguistics, but I'm also a fan of caution. I'm not convinced any conclusions about the psycholinguistics of the complex writing process can be drawn from this work. Not as yet. But interesting, nonetheless.

FYI: it's easy enough to fact check some of these results using freely available tools, namely KWIC Concordance. This tool will take any text and count the total tokens and number of repeats for us. I did this for Melville's Bartleby, the Scrivener and Moby Dick. I got text versions of each from Project Gutenberg, then ran the wordlist function within KWIC. Here are my results:

Bartleby
Total Tokens: 18111
Total Types: 3462
Type-Token Ratio: 0.191155

Moby Dick
Total Tokens: 221912
Total Types: 17354
Type-Token Ratio: 0.078202

Bartleby = 0.191155
Moby Dick = 0.078202

Yep, the short story Bartleby has a higher proportion of unique words than the longer Moby Dick. FYI, this is a weak test simply because the tokens are not stemmed, meaning morphological variants are treated as different words. I don't know if this is consistent with Bernhardsson's methodology or not.
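If you'd rather skip the GUI concordancer, the same counts can be reproduced in a few lines of Python. A crude regex tokenizer won't match KWIC's numbers exactly, and like my check above it does no stemming. We can also compute the paper's exponent α = ln N / ln M directly from the KWIC counts reported above:

```python
import math
import re

def type_token_stats(text):
    # Crude tokenizer: lowercase word forms, no stemming, so morphological
    # variants count as distinct types. Counts will differ slightly from KWIC's.
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(tokens), len(set(tokens))

def alpha(n_tokens, n_types):
    # The paper's exponent: alpha = ln N / ln M (types vs. tokens).
    return math.log(n_types) / math.log(n_tokens)

# Using the KWIC counts reported above:
print(alpha(18111, 3462))    # Bartleby
print(alpha(221912, 17354))  # Moby Dick: lower alpha, i.e. sublinear growth

sample = "the cat sat on the mat and the cat slept"
print(type_token_stats(sample))  # (10, 7)
```

Reassuringly, the α values from my KWIC counts come out below 1 and decrease from the short story to the novel, which is at least consistent with the paper's "N grows slower than linear" claim.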


Wednesday, December 2, 2009

Thinking Words (part 1)

(image from make-noise.com)

I’d like to present a brief lesson in contemporary linguistic research with the goal of showing that we live in a marvelous age of quick and ready research tools freely available to even the most humble of internet users. Hence, a little effort goes a long way. My point is that when we make claims about language usage (and by "we" I mostly mean those of us who present our claims about language to the public via the interwebz) we need not make such claims based on our intuitions and emotions; rather, we can perform a little due diligence in a way that linguistic pontificators of the past simply could not. And bully for us.

My subject for today’s Full Liberman is this classic example of language mavenry from Prospect magazine: Words that think for us by Edward Skidelsky, lecturer in philosophy at Exeter University (HT Arts and Letters Daily). In this article, Skidelsky laments the following “linguistic shift”:

No words are more typical of our moral culture than “inappropriate” and “unacceptable.” They seem bland, gentle even, yet they carry the full force of official power. When you hear them, you feel that you are being tied up with little pieces of soft string. Inappropriate and unacceptable began their modern careers in the 1980s as part of the jargon of political correctness. They have more or less replaced a number of older, more exact terms: coarse, tactless, vulgar, lewd. They encompass most of what would formerly have been called “improper” or “indecent.”…“Inappropriate” and “unacceptable” are the catchwords of a moralism that dare not speak its name. They hide all measure of righteous fury behind the mask of bureaucratic neutrality. For the sake of our own humanity, we should strike them from our vocabulary.


UPDATE: A very lively discussion of the meaning of the words in question (something I largely ignore) has broken out on Language Log here.

This article makes four testable linguistic claims:
  1. The words inappropriate and unacceptable have increased in frequency over the last couple of decades.
  2. This frequency increase is due to their replacing other words: coarse, tactless, vulgar, lewd, improper, and indecent.
  3. These other words are “older.”
  4. These other words are “more exact.”
With a little investigation using entirely freely available online linguistics tools, we can easily fact check each of these claims. In the interest of time, I'll answer the first two together.

First and Second -- Has the frequency of inappropriate and unacceptable increased since the 1980s? & have they replaced the other words?

In order to quickly get some data, I took this to mean the frequency of the first two words has increased while the frequency of the other words has decreased since the 1980s (is this an unfair interpretation? In any case, that's how I operationalized my methodology). Thanks to Mark Davies' excellent resource, the TIME Corpus of American English (100 million words, 1923-2006, requires registration, but it's free), we can quickly get a snapshot of each word's frequency of usage for the last 9 decades (not bad, huh? Thanks Mark!!).

Caveat: raw frequency is a poor data point by itself. What we really need is a way to compare apples to apples and oranges to oranges, and the problem we have is different sized corpora for each decade. Fear not, Davies does this work for us. His handy dandy interface allows us to report frequency per million, thus giving us comparable frequencies across different decades.
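The normalization itself is trivial arithmetic; here's a sketch with made-up numbers (my own, purely for illustration) showing why per-million rates matter when decade subcorpora differ in size.

```python
def per_million(raw_count, corpus_size):
    # Normalize a raw hit count to a per-million-words rate so that
    # decades with different corpus sizes are directly comparable.
    return raw_count / corpus_size * 1_000_000

# Hypothetical counts: the word looks twice as common by raw count in
# the bigger decade, but the per-million rate tells the opposite story.
a = per_million(150, 7_500_000)    # smaller decade: 20 per million
b = per_million(300, 30_000_000)   # bigger decade: 10 per million
```

This is exactly the apples-to-apples comparison Davies' interface reports for you.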

Using the TIME corpus, I found the frequency per million of each word per decade. Then I entered that data into a spreadsheet and used Excel 2007 to create a line graph of these frequencies.

Here's the relevant data:


And here's the graph:

UPDATE (2hrs after original post): original graph was confusing (same graph, just confusing labels) so I fixed it.

What this shows us is that both inappropriate and unacceptable do in fact show a rise in frequency (consistent with Skidelsky's claim), but starting in the 1960s, not 1980s. However, unacceptable shows a more recent dramatic decline, which is inconsistent with his claim. Lewd actually made a bit of a comeback in the 1990s (thank you Mr. Clinton?), but has since dropped back (it's a bit of a jumpy word, isn't it?). The other words do seem to be falling off in usage, consistent with Skidelsky's claim. So the picture is not quite what Skidelsky thinks it is, though he does seem to be on to something.

UPDATE: See myl's plot of this same data (but grouping the words as Skidelsky does) here which suggests that "'coarse', 'tactless', 'vulgar' etc. declined until WWII and then stayed about the same, perhaps with an additional decline in past decade; while 'inappropriate' and 'unacceptable' rose gradually from the 1930s to 1970 or so, and then leveled off. " The plot does suggest that we could view the two groups as having roughly inverted frequency, somewhat conforming to Skidelsky's hunch.

Third -- Are these other four words “older”?

Unfortunately, I am no longer affiliated with a university and therefore have no access to the OED (I've decided not to pay the $295 for their individual subscription. Condemn me if you must). If anyone would care to look those up and post them in comments, I'd be happy to update. Most of these words have multiple senses, and the question is: when did the most relevant sense enter usage? For that, the OED is most valuable. Again, you can do that work for me, or send me a check for $295.

However, a simple search of the Merriam Webster online dictionary gives us a quick answer:

unacceptable = 15th century
inappropriate = 1804
coarse = 14th century
tactless = circa 1847
vulgar = 14th century
lewd = 14th century
improper = 15th century
indecent = circa 1587

This data suggests these eight words fall into roughly two groups:

A -- words that entered the language around the 19th century
  • Set A = inappropriate, tactless
B -- words that entered the language around the 15th-16th centuries
  • Set B = unacceptable, coarse, vulgar, lewd, improper, indecent
This grouping does not conform to Skidelsky’s assumption that inappropriate & unacceptable fall together in a newer class and the others in an older class.


UPDATE: much thanks to commenter panoptical, who provides the following OED dates, which appear to largely confirm the Merriam Webster dates, with the notable exception of lewd, which dates back to Old English, it seems...does have a certain Beowulf ring to it, doesn't it?

unacceptable: 1483
inappropriate: 1804
coarse: 1424
tactless: 1847
vulgar: 1391
lewd: c890
improper: 1531
indecent: 1563
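As a quick sanity check on the grouping, here's a sketch that splits the words by the OED dates above (treating "c890" as 890; the 1800 cutoff is my own choice, not Skidelsky's).

```python
# First-attestation dates as given above (OED, via commenter panoptical;
# "c890" simplified to 890).
dates = {
    "unacceptable": 1483, "inappropriate": 1804, "coarse": 1424,
    "tactless": 1847, "vulgar": 1391, "lewd": 890,
    "improper": 1531, "indecent": 1563,
}

# Split off the genuinely recent words. Note that unacceptable lands
# in the "old" group, against Skidelsky's framing.
newer = {w for w, y in dates.items() if y >= 1800}
older = set(dates) - newer
```

Any reasonable cutoff gives the same split: only inappropriate and tactless are 19th-century arrivals.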


Fourth -- Are the other words "more exact"?

Finding a way to empirically test this is a challenge I will take up in later post (you can see Wordnet coming, can't you?). It will require teasing apart senses and relationships between senses (oh my, I wish I had the OED right now...).

Tuesday, November 24, 2009

Abracadabra! I Win!

(image from Slate.com)

I tend to avoid Slate.com these days because, frankly, I typically find myself scoffing at some idiot article they've published that promotes such a ridiculous mis-reading of academic research that it's hardly worth finishing... like this one from today: A Better Way to Fight With Your Husband which linked to this article: The Healthiest Way To Fight With Your Husband. It's a classic piece of idiot journalism worthy of a Full Liberman* if only it weren't so trivial and obvious as to be beneath the man, so I'll take a crack at it.

The big point is that fabulous new research from real-life scholars (psychologists no less, and they're almost like scientists) proves that women should use particular words when yelling at their husbands (the experiment used heterosexual married couples). Pretty awesome, ain't it! Just use the right words, and like a magic key you can unlock the mysteries of the brain and make it do what you please (okay, I'm starting to exaggerate, but less than you might think).

First let's look at the way the academic article is summarized in the puff piece that Slate linked to:

A new study of married couples, however, has found physiological evidence for one technique to diffuse tension: choosing the right fighting words. Couples who used analytical language, such as “think,” “understand,” “because,” or “reason,” during heated arguments were able to keep important stress-related chemicals in check, according to research published in the latest issue of the journal Health Psychology. Cytokines are inflammatory chemicals that spike during periods of prolonged tension and can lower your immunity and lead to early frailty, Type 2 diabetes, arthritis, and some cancers. The authors noted a curious gender twist in their results. Husbands benefitted from their wives’ measured language, but a man’s carefully chosen words had little effect on a woman’s cytokine balance.

To be fair, here is a passage from the authors' abstract of the original article:

Effects of word use were not mediated by ruminative thoughts after conflict. Although both men and women benefited from their own cognitive engagement, only husbands' IL-6 patterns were affected by spouses' engagement. Conclusion: In accord with research demonstrating the value of cognitive processing in emotional disclosure, this research suggests that productive communication patterns may help mitigate the adverse effects of relationship conflict on inflammatory dysregulation.

And here is a passage from this interview with the first author, Jennifer Graham, Penn State assistant professor of biobehavioral health:

"We specifically looked at words that are linked with cognitive processing in other research and which have been predictive of health in studies where people express emotion about stressful events," explained Graham. "These are words like 'think,' 'because,' 'reason' (and) 'why' that suggest people are either making sense of the conflict or at least thinking about it in a deep way."

For the study, the 42 couples made two separate overnight visits over two weeks.

"We found that, controlling for depressed mood, individuals who showed more evidence of cognitive discussion during their fights showed smaller increases in both Il-6 and TNF-alpha cytokines over a 24-hour period," said Graham, whose findings appear in the current issue of Health Psychology.

During their first visit, couples had a neutral, fairly supportive discussion with their spouse. But during the second visit, couples focused on the topic of greatest contention between them.

"An interviewer figured out ahead of time what made the man and woman most upset in terms of their relationship, and we gave each person a turn to talk about that issue," said Graham.

Researchers measured the levels of cytokines before and after the two visits and used linguistic software to determine the percentage of certain types of words from a transcript of the conversation. (my italics)

The researchers' results suggest that people who used more cognitive words during the fight showed a smaller increase in the Il-6 and TNF-alpha. Cognitive words used during the neutral discussion had no effect on the cytokines.

When they averaged the couples' cognitive words during the fight, they found a low average translated into a steeper increase in the husbands' Il-6 over time. There were no effects on the TNF-alpha. However, neither couple's nor spouse's cognitive word use predicted changes in wives' Il-6, or TNF-alpha levels for either wives or husbands.

Graham speculates that women may be more adept at communication and perhaps their cognitive word use had a bigger impact on their husbands. Wives also were more likely than husbands to use cognitive words.

Well, thank gawd they used fancy computers to count cognitive words! After reading these three descriptions, it was clear to me that the original work is likely flawed. I don't have access to the original study, unfortunately, but taken together, the abstract and the first author's interview suggest to me that it makes the same mistake most non-linguists make: they assume the linguistics part is easy and don't put enough effort into it. Dr. Graham's initial claim in the interview jumps out at me: "We specifically looked at words that are linked with cognitive processing in other research..."

Hmm? Words that are "linked with cognitive processing?" What does this mean? I would love to see the references page to follow-up on this "other research." Graham later refers to these as "cognitive words." They are alternately referred to as analytical language, measured language, conflict-resolution words, and cerebral words. From the puff piece and the interview we have five examples:
  1. because
  2. reason
  3. why
  4. think
  5. understand
Huh? One conjunction, one interrogative, and three verbs of cognition. Hmmm. Is there any intuitive reason to believe that "because" is "linked with cognitive processing" in some special way that other words are not? Is it the fact that it grammatically links clauses? Many words do this. Are the verbs on the list simply because they are verbs of cognition? Are run and jump less "linked with cognition" because they are verbs of motion? I would have to speculate on what this "other research" discovered about the magical properties of the special words that make them the key to brain chemicals. Abracadabra! Poof! Also, it's not at all clear to me why they averaged the couples' frequency count. What is this average supposed to tell us?
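For what it's worth, the word-percentage measure itself is easy to mimic. Here's a toy sketch using only the five example words quoted in the article; the study's actual word list and its counting software are not public here, so everything below is purely illustrative.

```python
import re

# The five example "cognitive words" named in the article. The study's
# full list is unknown to me; this tiny set is a stand-in.
COGNITIVE = {"think", "because", "reason", "why", "understand"}

def cognitive_word_pct(transcript):
    # Percentage of tokens in the transcript that hit the word list,
    # the kind of number the "linguistic software" would report.
    tokens = re.findall(r"[a-z']+", transcript.lower())
    hits = sum(1 for t in tokens if t in COGNITIVE)
    return 100.0 * hits / len(tokens) if tokens else 0.0

pct = cognitive_word_pct("I think we fight because you never listen")
# 2 hits out of 8 tokens = 25.0
```

Which is exactly my point: the measure is just surface counting. Nothing in it touches syntax, sense, or who the word is aimed at.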

However, the puff piece makes the leap into idiotsville all by itself:

"The study is significant because it’s one of the first to link language with biological markers and show what kinds of words help sparring couples rather than just recommending they “communicate more,” explains James Pennebaker, chair of the department of psychology at the University of Texas-Austin, who has studied the role of language on relationships." (my italics).

Nope. No link. Just a transcript. Given the study's methodology of counting words in a transcript, at no point could they possibly have been able to show any causal relationship between a particular word's utterance and the levels of a particular chemical in a person's brain.

The puff piece authors pull the classic journalist's trick of "being fair" by adding actual linguist Deborah Tannen's skepticism of the "link" between particular words and particular chemicals, but they abandon all skepticism just a few sentences later and end with a bang: "Even when it seems like he is ignoring you, your words may be having an effect—at least on a chemical level," says Graham.

Sigh.

*I'm going to start using the term "The Full Liberman" to refer to Mark Liberman's excellent manner of debunking bad journalism (see here and here for examples).

UPDATE (11/28/09): A nice summary of Full Liberman's at LL here.

Saturday, June 27, 2009

Adam's Tongue (pt 3)

(classic depiction of Saussure's arbitrariness of the sign claim)

This is the third in a series of posts detailing my notes and thoughts about the book Adam's Tongue as I prepare to lead a book discussion meeting July 6, 2009 in the DC metro area (see my first post here and second here).

Ch 3 - Thinking Like Engineers

I've spent the last 5 years working in natural language processing and with engineers, and I agree there is something very valuable in a linguist "thinking like an engineer," so I was curious from the start about this chapter. But I was also wary, because the Chomskyan syntacticians also "think like engineers," and I believe they have led linguistics down a garden path of false starts and flawed theories for 40 years. So I read on cautiously.
  • DB notes that he came into linguistics via pidgins and creoles and they bear on his thinking about language evolution. But does this bias him too, like the man who has a hammer and sees everything as a nail? We shall see.
  • DB says there's no syntax when we try to speak with people who don't share our language (p 39) because we don't know enough of the language; the foreign words just pop out as we grope for them. Now, I certainly defer to DB's far greater expertise in pidgin & creole formation, but this thought experiment of his does not jibe with my own experiences. Like many travelers, I've had this exact experience in places like Guangzhou, China, and Prague, but I don't think the foreign words "just popped out" quite as randomly as he suggests. I'm tending to side with Slobin here.
  • He claims that protowords must not have had any internal morphological structure (41) because early language users would have had no rules defining that structure. On its face, this makes sense; nonetheless it raises the question: which came first, the word or the morphology? Is it not plausible that some neurologically based process for seeking internal structure in sounds developed prior to the advent of words? I just don't know.
  • The boom vocalization of the Campbell's monkey occurs 30 seconds before the alarm (42). My first reaction: wow! this is stretching the limits of transitional probabilities, isn't it? Can we plausibly claim that an association between sounds 30 seconds apart is neurologically feasible?
  • DB claims these booms are not modifiers (p42) because the boom "cancels out" the alarm. I'd have to review the literature on these booms carefully, but my first reaction is: does it really cancel the alarm? If I understand the context, it simply means "not an immediate threat (but still a threat)". That's not a cancellation. It's more like epistemic modality: "there MIGHT be danger."
  • Page 44 -- The gavagai problem restated.
  • Confused: I'm confused by DB's claim on page 45 that "words combine as separate units -- they never blend. They're atoms, not mudballs." I'm not sure what he means. Blending and combining are different, in that blending suggests some elements of both previous words/calls are preserved in the new word/call. This happens all the time in contemporary linguistic change (classic example: motel blends motor + hotel, preserving bits of each's morphology as well as blending the semantics). But I suspect DB is not referencing that. So what is he referencing?
  • He makes a nice distinction between ACSs and Language: ACSs are primarily for manipulation of behavior while language is primarily for information sharing. I have no clue if this is really true, but if yes, it's a good point (p 47).
  • He writes "language units are symbolic because they're designed to convey information." A nice follow-up to the difference point above, but it raises the question: what is "information"? Any answer which supports DB would have to couch a definition in abstraction, right? E.g., information is a conceptualization that is independent from direct reference.
  • DB makes a bold claim on page 52 that strikes at the heart of post-Saussurean linguistics: displacement is a more important factor to language evolution than arbitrariness. But it's worth noting that both are functions of abstraction, so perhaps this is just another version of his previous point that the jump to abstract thought is the key.
On to chapter 4 -- Singing Apes....
