Thursday, July 12, 2007

Idioms: differences between corpora

One of the arguments that came up yesterday was that idioms are simply not reflected in the frequency counts because the corpora don't reflect the type of language use in which these idioms would usually occur. Michael Stout suggests something similar in his comment.

It is certainly true that particular words, expressions, and forms vary considerably in their distribution and frequency on a scale that is often seen as ranging from spoken/informal language to written/academic language. (In fact, this is probably better seen as a multidimensional space, but that's another issue.) Indeed, if we look at fuck, we can see that it is strikingly common in spoken conversation, occurring 136 times pmw in the BNC vs. 1.63 times pmw in the academic subcorpus.

The following look at "hit the jackpot" should show the range in frequency of these types of idioms:
  • From yesterday, we have 0.32 occurrences per million words in the BNC. Looking at the subcorpora, we have a high of 1.88 pmw in News.
  • In the Time corpus, we have 0.8 pmw with a high of 2.0 pmw in the 1940s.
  • The MICASE corpus that Michael mentioned in his comment yesterday has zero instances of "hit the jackpot" in 1,848,364 words. MICASE is unscripted American speech at universities, mostly in lectures and academic discussions.
  • The Enron e-mail corpus has 18 occurrences of "hit the jackpot" in 96.3 million words (about 0.2 pmw, though a number of them are duplicates).
  • The first release of the ANC is 11 million words. "Jackpot" occurs 3 times, but "hit the jackpot" does not occur.
  • The million-word Brown corpus, which is an early US written-text corpus, has no instances of "jackpot".
  • The Corpus of Spoken, Professional American-English does not have any instances of "jackpot" in its sample of 42,739 words.
  • Nor does the 2-million-word US talk TV corpus at Lextutor.
  • I don't have access to the Wellington Corpus of Spoken New Zealand English, but if anyone else does, please let me know the results. I'll bet that they are all in the same range. [Update: July 23. Bernadette Vine, who manages the corpus, was kind enough to do a search for me. She reports that there are no instances of 'hit the jackpot' in the one-million-word corpus.]
    [Update2: Oct 9, 2011. Some new corpora have become available, so I've added them below.]
  • In the Corpus of Contemporary American English, we have a high of 0.61 pmw in magazines and a low of 0.05 pmw in academic writing.
  • In the Google Books corpus, the frequency throughout the 20th century maxes out at 0.06 pmw.
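
For anyone who wants to check the arithmetic behind these figures, here is a minimal sketch of the per-million-words (pmw) normalization. The function names `per_million` and `expected_hits` are my own, made up for illustration; the counts and corpus sizes are the ones quoted in the list above. The second function simply asks how many hits a corpus of a given size would yield if the idiom occurred at a given rate, assuming (and this is only an assumption) that the rate carries over from one corpus to another.

```python
def per_million(count: int, corpus_words: int) -> float:
    """Normalized frequency: occurrences per million words (pmw)."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return count / corpus_words * 1_000_000


def expected_hits(rate_pmw: float, corpus_words: int) -> float:
    """Hits we would expect in a corpus of this size at the given pmw rate.

    Assumes the rate transfers between corpora, which is exactly
    what is in question, so treat the result as a rough guide only.
    """
    return rate_pmw * corpus_words / 1_000_000


# Figures quoted in the list above
enron_pmw = per_million(18, 96_300_000)           # Enron e-mail corpus
micase_pmw = per_million(0, 1_848_364)            # MICASE: zero instances
micase_expected = expected_hits(0.32, 1_848_364)  # at the overall BNC rate

print(f"Enron: {enron_pmw:.2f} pmw")
print(f"MICASE: {micase_pmw:.2f} pmw")
print(f"Expected MICASE hits at the BNC rate: {micase_expected:.2f}")
```

At the overall BNC rate of 0.32 pmw, a corpus the size of MICASE would be expected to contain fewer than one instance, so the zero counts in the smaller corpora are consistent with the low rates elsewhere rather than evidence of anything unusual.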

So, overall, nothing goes above 2.0 pmw and most are much lower. I would be very surprised, then, to find a situation in which "hit the jackpot" occurs at, say, about 30 pmw or more, at least not one that is going to be relevant to many learners of English.
Tomorrow, more about what exactly "common" would mean for a learner (hint: see the previous paragraph).
In this series: Idioms, Where's the cutoff, & Interpreting the frequencies

2 comments:

reineke said...

http://subtlexus.lexique.org/

Worth checking out: this corpus is based on movie and sitcom scripts.

jackpot:

189 occurrences in the entire corpus of 51 million words
3.7 occurrences per million words
The word occurs in 148 movies (out of some 8,000)

Bingo 12.55 occurrences per million words
0.17 in the BNC

The f word 378 occurrences per million words

Some other (hopefully) interesting comparisons here:

http://learnalanguageortwo.blogspot.com/2009/04/corpora-comparison-by-frequency.html

Brett said...

Thanks for the data.

The full idiom, "hit the jackpot", occurs there 51 times, or once per million words, right in the middle of our established range.