Saturday, September 17, 2011

`I' vs `the'

In the Sept 3-9th edition of New Scientist, James Pennebaker discusses individual variation in the frequency with which we use pronouns and other small words, and he considers what this metric might say about our personalities and relationships. The paper version (p. 45) has a graph entitled "The real word count" with the caption "The 20 most frequently used words in the English language, across both spoken and written texts." The graph shows that "I" is the most common word, followed closely by "the".

This prompted the following query by Mike Scott to the Corpora mailing list:
"I wrote to the author, James Pennemaker of the U of Texas, about this, expressing my surprise at the pronoun I having greater frequency than THE, as even in the spoken-only section of the BNC (10m words) we find I occurring only just over half as often as THE. His data contains a mix of spoken and written with a large amount of blog data. He reports that with all his studies in the USA and Mexico, "people always use more I more than THE. It's never close." Can anyone help here, clearing up the position? Someone with access to a really top quality corpus, more up to date and representative than the BNC? "
An interesting discussion followed, which you can see in the list archives.

Adam Kilgarriff remarked that it all depends on text type, which is certainly true and belies the universal nature of both Pennebaker's claim and the article's graph. He goes on to say that "Asking for a more representative corpus won't help because we all have different ideas about what it should be representative of."

Thereafter commenters presented counts from a variety of corpora, which led Marc Brysbaert to observe:

Maybe we can turn the question around and use the “the/I” ratio as an index of how socially vs. description oriented a corpus is? Here is a summary of the data I have at hand.


COCA (academic)
COCA (newspapers)
Google (books)
COCA (magazines)
American blogs
COCA (fiction)
COCA (television programs)
Shakespearean plays
SUBTLEX (film subtitles)

Ken Litkowski added the following:

This discussion has focused on only one aspect of James Pennebaker's work, the 'I' frequency, and perhaps not as much on his many contributions to content analysis, which may have even more relevance to discussions on this list.

Kyle Dent of Xerox has recently performed an analysis of 2400 tweets, with the aim of classifying them into "Questions" and "Not Questions". He developed an elaborate NLP system to deal with these tweets. He kindly provided me with these data, so that I could examine them with my content analysis program to see how well they could be analyzed without all the NLP superstructure. I happened to run a first analysis at the time of this thread. It simply compares the two sets as a whole.

The corpus size is 31,000 words (hardly the stature of BNC, COCA, or OEC). But, curiously, both "i" and "the" hold the top two frequency positions in both:

Set                "the"    "I"
Questions            400    327
Not Questions        437    575

Wow! Could this be a classification signature? Although this is not likely, various other statistics in various combinations generated in the program may very well be. So, here we have a micro-genre analysis that confirms the other comments on this thread, much like the Known Similarity Corpora of Adam Kilgarriff (15 years ago!).

Sentiment analysis is an emerging field, but is currently dominated by heavy NLP techniques. I would suggest that techniques from content analysis might provide a nice complement.
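
For the tweet counts quoted above, the "the"/"I" ratio that Brysbaert proposes can be worked out directly. What follows is only a minimal sketch in Python: the numbers are copied from Litkowski's table, while the names `counts`, `the_i_ratio` and `ratios` are made up for illustration and are not part of anyone's actual program.

```python
# Counts copied from the tweet table quoted above ("the" and "I" per set).
counts = {
    "Questions": {"the": 400, "I": 327},
    "Not Questions": {"the": 437, "I": 575},
}


def the_i_ratio(row):
    """Return the "the"/"I" ratio for one row of counts (hypothetical helper)."""
    return row["the"] / row["I"]


ratios = {name: the_i_ratio(row) for name, row in counts.items()}

for name, ratio in ratios.items():
    # A value above 1 means "the" outnumbers "I"; below 1 means the reverse.
    print(f"{name}: the/I = {ratio:.2f}")
```

On these figures the ratio is about 1.22 for Questions and 0.76 for Not Questions, so "the" leads in one set and "I" in the other.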

As I said, you can read the whole discussion in the list archives.


Faldone said...

My first guess would be that he is looking for the strings "i" and "the" with no attempt to eliminate the occurrences of those strings embedded in words. In that sentence there are seven occurrences of "i" and three of "the". I would have gotten another "the" if there had been the word "there" in there somewhere.
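
[The difference between matching raw strings and matching whole words is easy to see in a few lines of Python. This is only an illustrative sketch, not a reconstruction of how Pennebaker's counts were produced; the function names, the simple letters-and-apostrophes tokenizer, and the sample sentence are all assumptions made for the example.]

```python
import re


def substring_count(text, target):
    """Count every occurrence of target, including those inside longer words."""
    return text.lower().count(target.lower())


def word_count(text, target):
    """Count occurrences of target only where it stands as a whole word."""
    # Assumed tokenizer: runs of letters and apostrophes, case-folded.
    tokens = re.findall(r"[a-z']+", text.lower())
    return tokens.count(target.lower())


# Made-up sample sentence, chosen because several words contain "the" or "i".
sample = "I think the other theory is there in the thesis."

for target in ("i", "the"):
    print(target, substring_count(sample, target), word_count(sample, target))
```

[In this sample, substring matching finds "the" inside "other", "theory", "there" and "thesis" as well as the two real articles, which is the kind of inflation the comment has in mind.]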

Brett said...

It's a possibility, but one hopes that he has addressed this. Indeed the results from very social spoken corpora match his results.

Faldone said...

One has to wonder about the transcriptions of the spoken corpora. Google ngrams has "the" hovering around 5% in all subsets and "I" indistinguishable from 0%. Would the spoken corpora transcribe /dʌ/ or /nʌ/ as "the"?

Brett said...

"Would the spoken corpora transcribe /dʌ/ or /nʌ/ as "the"? "

It would really depend on the corpus and its purpose. The BNC includes articles: de, ze, t', ta, na, th', & nu, but these make up only a few hundred words.

Michael Vnuk said...

Marc Brysbaert's table reads as an "I"/"the" ratio to me.

Brett said...

Yes, it does seem that he's got it backwards.