Thursday, August 17, 2006

What's the frequency, Jack?

Frequency information can be a great boon to language learners. If I were going to learn Spanish, for example, I'd want to know the most common, say, 500 words, at least for starters. But this information isn't always easy for language learners to access. In fact, it's often pretty difficult for linguists to compile.

To begin with, there's the question of what constitutes a word. Are jump, jumps, and jumped one word or three? Is jump (the action of propelling yourself up into the air) the same word as the barrier that you hurl yourself over or the ramp that you ski off? What about jumper (a person who jumps) or jumper (a sweater or a pinafore, depending on whether you're GBish or USian)? How many words do we have now?

Next is the problem of deciding what a representative corpus of the language would look like. Do we look at only spoken language, only written, or both (and if both, in what ratio)? Do we include only language produced after a certain time: perhaps 1980? Do we include only certain flavours of English (Australian or Canadian)? What genres do we include? If it's spoken, can it be scripted or must it be spontaneous? And how big must our corpus be? Clearly a corpus of a mere hundred words would be of almost no help at all, but is a million words enough? What about 500 million?

Another large problem is simply going about actually counting all this stuff. What we need is a machine like the one in Dr. Seuss's Sleep Book with balls that drop and a chap who counts them. Unfortunately, what we have is humans and computers. Humans are smart and, given the right instructions, can usually do a fairly reliable job of dealing with the question of what a word is. Unfortunately, we have short attention spans and are much slower than computers.

Some early frequency counts in English were done by humans, for example Michael West's A General Service List of English Words (Longman, Harlow, Essex, 1953). But most recent counts are done by computers, they being great at this kind of tedious task. Unfortunately, computers have serious problems deciding what a word is (see jump, above). They also don't deal very well with spelling mistakes or spoken language (though they're getting better at the latter).
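To see what I mean, here's a toy sketch (in Python, on a made-up sentence) of the kind of naive counting a computer does out of the box: jump, jumps, jumped, and jumper all end up as separate entries, which may or may not be what you want.

```python
from collections import Counter
import re

# A deliberately naive count, to show the "what is a word?" problem:
# each inflected form gets its own tally, and jumper is split off too.
text = "The jumper jumps. He jumped the jump and jumps again."

# Lowercase everything and treat any run of letters as a "word".
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)

print(counts.most_common(5))
```

Collapsing those four forms into one headword (lemmatisation) is exactly the part that's easy for a human and hard to get a machine to do reliably.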

This becomes even more difficult when you're dealing with a language, such as Japanese, which doesn't have spaces between the words, and in which it's legal (though often non-standard) to write words in a variety of ways. For example, the following are all the same word (hikkoshi = move house): 引っ越し, 引越し, 引越, 引っ越, ひっこし, and each of them has at least 300,000 Google hits. This was a huge problem for me when I was compiling a corpus for a series of Japanese graded readers, about which I'm presenting next week at the CAJLE conference in Toronto, but that's another posting for another time.
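One workaround for the spelling-variant problem, assuming the text has already been segmented into words (itself the hard part in Japanese), is a hand-compiled table that folds every attested spelling into one canonical form before counting. The sketch below uses the hikkoshi forms from above and a made-up four-token "corpus"; a real tool would need a morphological analyser or a very much larger table.

```python
from collections import Counter

# Hypothetical variant table: each attested spelling maps to one
# canonical form. These five entries cover the hikkoshi example only.
VARIANTS = {
    "引っ越し": "引っ越し",
    "引越し": "引っ越し",
    "引越": "引っ越し",
    "引っ越": "引っ越し",
    "ひっこし": "引っ越し",
}

def normalise(token):
    # Fall back to the token itself if it isn't in the table.
    return VARIANTS.get(token, token)

# A toy pre-segmented "corpus" with three spellings of the same word.
tokens = ["引越し", "ひっこし", "引っ越し", "東京"]
counts = Counter(normalise(t) for t in tokens)

print(counts)
```

Without the normalisation step, those three tokens would be counted as three different words and each would look rarer than the word actually is.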

Many people have attacked these problems from different angles. For English, some of the most useful results are:
  1. Word Frequencies in Written and Spoken English by Geoffrey Leech, Paul Rayson, Andrew Wilson
  2. The Academic Word List by Averil Coxhead
  3. The BNC word family lists by Paul Nation
I should also mention Tom Cobb's wonderful Compleat Lexical Tutor, which provides tools based on frequency counts for English and French.

For other languages, Routledge has recently begun releasing frequency dictionaries. Those currently available include a Spanish one by Mark Davies (the series editor, along with Paul Rayson, and the designer of VIEW).
