Wednesday, July 18, 2007

Idioms: interpreting the frequencies

I suppose this isn't really specific to idioms; it would apply to any vocabulary item.
As I wrote before, one response to my explanation about idioms was,
"None of the correspondents have suggested that they have any difficulty recognising or understanding 'hit the jackpot', yet the low level of occurrence of the expression in corpora suggests that it should be so unfamiliar as to cause difficulty even to native speakers."
I'm afraid this doesn't show up a problem with the corpora themselves, but it might go some way to explaining why language teachers seem to be so loath to use corpus data: they don't understand what it tells them.
It wouldn't be unusual for a native speaker of English to encounter language that occurs with the frequency of "hit the jackpot" a number of times per month. That's because native speakers of English tend to encounter millions of words each month. The recent Mehl paper in Science suggests that we speak on average something like 16,000 words per day. Presumably, we're doing much of that in conversation with others, often more than one person, so let's put our conversational word count at 40,000 per day spoken and heard.
Then there's TV. I don't have average numbers, but after looking at a few transcripts, it looks like 7,000 words per hour might be a reasonable estimate. According to Nielsen, the average American spends 4.5 hours per day watching TV, so we can add another 30,000 words or so to our count, which now totals 70,000.
I have no data on how much people write, but I suspect it's very little. In terms of reading, I can find no adult data, but 5th-grade children read about 5,300 words per day, bringing our total daily word exposure to roughly 75,300 or 2,290,000 words per month. There are likely other sources of input that I have omitted, but this should be sufficient to make the point.
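For anyone who wants to redo the tally, here is a minimal sketch in Python. The names are my own, the TV figure is rounded to 30,000 as in the text, and the 30.4-day month is an assumption (it is roughly what the 2,290,000 figure implies).

```python
# Rough daily word exposure, using the estimates quoted in this post.
CONVERSATION_WORDS_PER_DAY = 40_000  # spoken plus heard
TV_WORDS_PER_DAY = 30_000            # 7,000 words/hour x 4.5 hours, rounded down as in the text
READING_WORDS_PER_DAY = 5_300        # 5th-grade figure, standing in for adults
DAYS_PER_MONTH = 30.4                # assumption: average month length


def daily_exposure():
    """Total words encountered per day across the three sources."""
    return CONVERSATION_WORDS_PER_DAY + TV_WORDS_PER_DAY + READING_WORDS_PER_DAY


def monthly_exposure():
    """Daily total scaled up to an average month."""
    return daily_exposure() * DAYS_PER_MONTH


if __name__ == "__main__":
    print(f"daily:   {daily_exposure():,}")
    print(f"monthly: {monthly_exposure():,.0f}")
```

Changing any one constant (say, fewer TV hours) shows how sensitive the monthly total is to each guess.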
At the previously established rate of 0.18 to 2.0 occurrences pmw, we could expect to encounter "hit the jackpot" anywhere from once every couple of months to four or five times a month. If you're about my age, you've probably heard it about 900 times in your life. So, contrary to the above writer's conclusion, it's not at all surprising that we know it. But would you be surprised to hear that my six-year-old son doesn't? (I just asked him [update: May 25, 2009. He's almost eight and he still says he doesn't know. update 2: Oct 9, 2011: 10 and still unfamiliar.]).
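The conversion from a corpus rate to monthly encounters is simple enough to sketch in Python. The function name and constant here are mine, purely for illustration, and the sketch takes the 2,290,000 words-per-month estimate at face value.

```python
def expected_encounters(rate_pmw, words_exposed):
    """Expected encounters of an item that occurs rate_pmw times per million words."""
    return rate_pmw * words_exposed / 1_000_000


MONTHLY_WORDS = 2_290_000  # this post's rough estimate of monthly input

low = expected_encounters(0.18, MONTHLY_WORDS)   # lowest corpus rate quoted
high = expected_encounters(2.0, MONTHLY_WORDS)   # highest corpus rate quoted

print(f"{low:.2f} to {high:.2f} encounters per month")
# prints "0.41 to 4.58 encounters per month"
```

Plugging in a learner's much smaller monthly input in place of `MONTHLY_WORDS` shows how quickly the same item becomes a rare event.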
In contrast to native speakers, our learners don't get anything like 2.3 million words of input a month. And what input they do get is degraded by the fact that they don't understand much of it. Thus, what seems very common to us is quite rare for learners. Somehow, though, it's hard to get many language teachers to accept this. They refuse to believe that idioms are not common, but as we saw recently, anything below about 30 occurrences pmw should be considered low frequency.
There are many factors that can skew our perception of a word's commonality. Psychologists have taken this issue much more seriously than have language teachers/applied linguists and have evolved a number of measures. These include:
  • number of letters/phonemes/syllables
  • written/spoken frequency
  • range/keyness/burstiness
  • subjective familiarity rating
  • concreteness rating
  • imageability rating
  • meaningfulness
  • average age of acquisition
  • word category (noun, verb, adj, etc.)
  • affixation
  • status (colloquial/dialect/alien, etc.)
  • semantic grouping
It would obviously be too onerous to consider all of these constantly in our teaching, but it might not be a bad thing to know about and understand each measure.
Earlier posts in this series: Idioms, Differences between the corpora, & Where's the cutoff


Rick Sprague said...

Idioms are almost always composed of quite common words, while non-idiomatic lexical items of comparable frequency are by definition uncommon. This makes idioms seem more familiar, and therefore more common than they are.

Also, I think we're culturally conditioned to pursue the meaning of an unfamiliar idiom, because the familiarity of its component words makes it seem colloquial and we have a strong motivation to understand colloquial usage. In contrast, low frequency non-idiomatic items (such as "anathema") are regarded by the masses as highfalutin, pedantic, etc., and we're either neutral about investigating the meaning (low payoff) or even antipathetic (don't want to appear pretentious). If we preferentially acquire idioms, this invokes something like the Frequency Illusion to make them seem more common than they are.

Brett said...

I think there's likely a good deal of truth in that, Rick. I think there's also something in idioms that makes them easy to learn, for native speakers at least. First of all, as you say, all the words are known. Secondly, they've become idiomatic because they were particularly graphic, humorous, sonorous, or otherwise remarkable.

reineke said...


Regarding native input, some interesting information can be found here:

The link

The blog is mine; there is too much to post in a simple comment!

BTW, thanks for this post, it was rather interesting!

Scott Whitney said...

One good justification for teaching idioms is also common to teaching phrasal verbs. I have taught deaf students, who have many issues in common with second language learners, including the need for the 2,000 most common words, difficulties with function words, and difficulties with idioms. One HUGE need is to teach students to read chunks of text rather than reading word for word. Phrasal verbs and idioms both require reading the entire phrase or idiom in order to get the meaning. That being said, phrasal verbs may have more utility both in frequency of use and in teaching to read chunks of text.

Brett Reynolds said...

It would probably be more useful to use the most frequent ngrams.