This prompted the following query by Mike Scott to the Corpora mailing list:
"I wrote to the author, James Pennemaker of the U of Texas, about this, expressing my surprise at the pronoun I having greater frequency than THE, as even in the spoken-only section of the BNC (10m words) we find I occurring only just over half as often as THE. His data contains a mix of spoken and written with a large amount of blog data. He reports that with all his studies in the USA and Mexico, "people always use more I more than THE. It's never close." Can anyone help here, clearing up the position? Someone with access to a really top quality corpus, more up to date and representative than the BNC? "An interesting discussion followed, which you can see in the list archives.
Adam Kilgarriff remarked that it all depends on text type, which is certainly true and belies the universal nature of both Pennebaker's claim and the graph in the article. He goes on to say that "Asking for a more representative corpus won't help because we all have different ideas about what it should be representative of."
Thereafter commenters presented counts from a variety of corpora, which led Marc Brysbaert to observe:
Maybe we can turn the question around and use the “the/I” ratio as an index of how socially vs. description oriented a corpus is? Here is a summary of the data I have at hand.
COCA (television programs)
SUBTLEX (film subtitles)
This discussion has focused on only one aspect of James Pennebaker's work, the 'I' frequency, and perhaps not as much on his many contributions to content analysis, which may have even more relevance to discussions on this list.
Kyle Dent of Xerox has recently performed an analysis of 2400 tweets, with the aim of classifying them into "Questions" and "Not Questions". He developed an elaborate NLP system to deal with these tweets. He kindly provided me with these data, so that I could examine them with my content analysis program to see how well they could be analyzed without all the NLP superstructure. I happened to run a first analysis at the time of this thread. It simply compares the two sets as a whole.
The corpus size is 31,000 words (hardly the stature of BNC, COCA, or OEC). But, curiously, "I" and "the" hold the top two frequency positions in both sets:
Set            "the"  "I"
Questions      400    327
Not Questions  437    575
Wow! Could this be a classification signature? Although this is not likely, various other statistics in various combinations generated in the program may very well be. So, here we have a micro-genre analysis that confirms the other comments on this thread, much like the Known Similarity Corpora of Adam Kilgarriff (15 years ago!).
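As a rough illustration of the "the/I" ratio that Marc Brysbaert proposes as an index, the counts in the table above can be turned into ratios with a few lines of Python. This is only a sketch: the dictionary layout and the function name `the_i_ratio` are assumptions made for this example, and the numbers are the four counts quoted above, nothing more.

```python
# Counts of "the" and "I" copied from the table above (Questions vs. Not Questions).
counts = {
    "Questions": {"the": 400, "I": 327},
    "Not Questions": {"the": 437, "I": 575},
}


def the_i_ratio(c):
    """Return the frequency of "the" divided by the frequency of "I"."""
    return c["the"] / c["I"]


# One ratio per set; values above 1 mean "the" is more frequent than "I".
ratios = {name: the_i_ratio(c) for name, c in counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: the/I = {ratio:.2f}")
```

On these counts the ratio is above 1 for the Questions set and below 1 for the Not Questions set, which is the contrast the table shows. With a sample this small, it should be read as a description of the two sets rather than as evidence of a reliable classifier.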
Sentiment analysis is an emerging field, but is currently dominated by heavy NLP techniques. I would suggest that techniques from content analysis might provide a nice complement.
As I said, you can read the whole discussion in the list archives.
My first guess would be that he is looking for the strings "i" and "the" with no attempt to eliminate the occurrences of those strings embedded in words. In that sentence there are seven occurrences of "i" and three of "the". I would have gotten another "the" if there had been the word "there" in there somewhere.
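The difference between the two counting methods this comment describes can be sketched in Python. This is a minimal illustration, not a reconstruction of how Pennebaker's counts were actually produced: the function names, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re


def substring_count(text, target):
    """Count every occurrence of target, including those embedded in longer words."""
    return text.lower().count(target.lower())


def word_count(text, target):
    """Count target only where it stands alone as a whole word."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token == target.lower())


# Hypothetical sample sentence, chosen so that both strings appear inside other words.
sample = "I think there is nothing in the other thing."

for target in ("i", "the"):
    print(target, substring_count(sample, target), word_count(sample, target))
```

In the sample sentence, substring matching finds "i" inside "think", "is", "nothing", "in" and "thing", and "the" inside "there" and "other". Whole-word matching finds each word once.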
It's a possibility, but one hopes that he has addressed this. Indeed the results from very social spoken corpora match his results.
One has to wonder about the transcriptions of the spoken corpora. Google ngrams has "the" hovering around 5% in all subsets and "I" indistinguishable from 0%. Would the spoken corpora transcribe /dʌ/ or /nʌ/ as "the"?
"Would the spoken corpora transcribe /dʌ/ or /nʌ/ as "the"? "
I would really depend on the corpora and its purpose. The BNC includes articles: de, ze, t', ta, na, th', & nu, but these make up only a few hundred words.
Marc Brysbaert's table reads as an "I"/"the" ratio to me.
Yes, it does seem that he's got it backwards.