Wednesday, July 11, 2007


Language teachers are overly enamoured of idioms. My pointing out, over on the TESL-L list, that a series of "common gambling idioms" is not at all common began a raveled thread of comments questioning the value of corpus data and supporting the teaching of idioms.

Here's what I posted. The numbers are the occurrences per million words, first in the British National Corpus, and second in the Time corpus (in peak decade).
  • hit the jackpot: 0.32 (2.0 in 1940s)
  • on a roll: 0.30 (2.21 in 1990s)
  • ace in the hole: 0.04 (0.08 in 1940s)
  • Bingo!: 0.17 (0.64 in 1990s)
  • play(s/ed/ing) [somebody's] cards close to [somebody's] chest: 0.07 (0.06 in 1960s)
  • wild card: 0.54 (1.38 in 1990s)
  • shoot the works: 0 (0.80 in 1930s)
  • put(s/ting) * money down: 0.05 (0.11 in 1990s)
  • beginner's luck: 0.04 (0.32 in 1960s)

To give you some context, anathema, which is about the 23,800th most common word in the British National Corpus, occurs 1.42 times per million words. In other words, unless a learner of English has a huge vocabulary, there are lots and lots and lots and lots and lots of more useful things teachers can be teaching them than gambling idioms (or almost any other idiom, for that matter).
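For anyone who wants to check the comparison, here is a minimal sketch in Python. It assumes only the BNC figures listed above and the 1.42 figure for anathema; the dictionary keys are shortened labels and the function name is purely illustrative.

```python
# BNC occurrences per million words, copied from the list in the post.
# Keys are shortened labels for the idioms, not search strings.
bnc_per_million = {
    "hit the jackpot": 0.32,
    "on a roll": 0.30,
    "ace in the hole": 0.04,
    "Bingo!": 0.17,
    "play cards close to chest": 0.07,
    "wild card": 0.54,
    "shoot the works": 0.0,
    "put money down": 0.05,
    "beginner's luck": 0.04,
}

# Benchmark from the post: "anathema" in the BNC.
ANATHEMA_PER_MILLION = 1.42


def rarer_than(benchmark, frequencies):
    """Return the items whose frequency is below the benchmark, sorted by name."""
    return sorted(name for name, freq in frequencies.items() if freq < benchmark)


rarer = rarer_than(ANATHEMA_PER_MILLION, bnc_per_million)
most_common = max(bnc_per_million, key=bnc_per_million.get)
ratio = ANATHEMA_PER_MILLION / bnc_per_million[most_common]

print(f"{len(rarer)} of {len(bnc_per_million)} idioms are rarer than 'anathema'")
print(f"'anathema' is about {ratio:.1f} times as frequent as '{most_common}'")
```

On these figures, even the most frequent idiom in the list falls well short of the benchmark.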

The following is a sample of the responses. Over the next few days I'll try to untangle some of them.

  • "So I think we can safely say that there are times that word frequency lists can be misleading."

  • "I have noted, at first with some dismay, the rabid attacks on any form of linguistic sophistication. Apparently, our foreign students have far better things to do than learn the subtleties of the language they are studying."

  • "Comments have been made on teaching idioms. Idioms are of utmost necessity in using and understanding English."

  • "I have never said the word anathema, partly cos I'm not sure how to say it. But the gambling idiomatic terms turn up frequently - maybe once a month for each, in colloquial speech in NZ, so they should/could be taught."

  • "None of the correspondents have suggested that they have any difficulty recognising or understanding “hit the jackpot”, yet the low level of occurrence of the expression in corpora suggests that it should be so unfamiliar as to cause difficulty even to native speakers (I could imagine that there are many native speakers of English who would have problems with “anathema”). If the expression is so uncommon, how do we all know it?

    "One possible explanation is that the corpora that we have available seriously misrepresent the language we encounter in our daily lives."

Followups: Differences between the corpora, Where's the cutoff, & Interpreting the frequencies


Anonymous said...

I think your explanation is quite possible. Out of curiosity I checked the MiCASE and found only one occurrence of Bingo. I did a little more checking and found out some more. This is probably old news to you but I'll go on anyway.

The MiCASE corpus contains only data coming from university faculty, students and staff. It's meant to represent academic spoken English and the spoken English of mostly middle-class English speakers. The BNC as you know is more demographically balanced but the spoken data accounts for only 10% of the corpus. There is another corpus called London-Lund but I wasn't able to do a search using it.

It seems to me that the biggest problem lies with data collection. I'm not convinced that anyone has really found a way to collect truly representative samples of spoken language across contexts.

Reineke said...

The number of occurrences within the Subtlexus corpus of some 51 million words:

"hit the jackpot" = 51
"on a roll" = 101
"ace in the hole" = 23
"Bingo!" = 169
"wild card" = 19
"shoot the works" = 13

“The” 1,501,908 occurrences, 10,000 per million words
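As a rough sketch, the raw idiom counts above can be put on the per-million scale used in the post. This assumes a corpus size of exactly 51 million words, which is only an approximation of "some 51 million"; the names below are illustrative.

```python
# Assumption: the corpus is treated as exactly 51 million words,
# an approximation of the "some 51 million" quoted in the comment.
CORPUS_SIZE_WORDS = 51_000_000

# Raw occurrence counts as listed in the comment.
raw_counts = {
    "hit the jackpot": 51,
    "on a roll": 101,
    "ace in the hole": 23,
    "Bingo!": 169,
    "wild card": 19,
    "shoot the works": 13,
}


def per_million(count, corpus_size=CORPUS_SIZE_WORDS):
    """Convert a raw count into occurrences per million words."""
    return count / corpus_size * 1_000_000


# Rounded to two decimals, matching the precision of the figures in the post.
rates = {idiom: round(per_million(count), 2) for idiom, count in raw_counts.items()}

for idiom, rate in rates.items():
    print(f"{idiom}: {rate} per million words")
```

Under that assumption, "hit the jackpot" comes out at about 1 per million and "Bingo!" at a little over 3 per million.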

1 DELICATE, ELUSIVE b: difficult to understand or perceive

Unknown said...


for those of you who are interested, we are finding that word frequencies based on television subtitles explain word recognition times much better, even for native speakers. The main difference between television language and written language is that television language is skewed much more to a core of high frequency words. In our subtlexUS corpus we find that 4,500 word forms account for 95% of the words used in films, whereas for written texts you need almost 10,000 words to reach this criterion.

best, marc