Thursday, June 21, 2007

On lexicons, gaps, Zipf and exponents

While I won't spend more time dwelling on the numerical problems that I brought up in my previous comment on Brown's lexicon article in the Globe, I would like to note an interesting feature which does deal with numbers.

Part I:
"If Little Princess is an average child, she'll know 6,000 root-word meanings by the end of Grade 2. That's okay, but nothing special: At that point, the top 25 per cent of children already know twice as many words as the lowest 25 per cent, and the gap grows exponentially."

I don't think that Brown means exponentially in the literal sense. Such a literal meaning would be described by the formula:

$G_y = G_b^y$

where $G_y$ is the gap in year $y$, $y$ is the number of years that have elapsed, and $G_b$ is the gap in the base year. This kind of exponential growth would soon have adults knowing hundreds of thousands of words, far beyond what any reasonable estimate claims.
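To get a feel for how quickly the literal reading runs away, here is a rough sketch of the formula in Python. It assumes the gap is a ratio and takes the base gap to be 2, from the "twice as many words" in the quote; the function name `literal_gap` is made up for illustration, and none of this is Brown's own calculation.

```python
def literal_gap(base_gap: float, years: int) -> float:
    """Gap after `years` years under the literal formula G_y = G_b ** y."""
    return base_gap ** years


# Assumption: the base gap is the ratio 2 ("twice as many words").
BASE_GAP = 2.0

for years in (1, 2, 5, 10):
    print(f"after {years:>2} years the gap would be a factor of {literal_gap(BASE_GAP, years):,.0f}")
```

Under these assumptions the gap passes a factor of a thousand within ten years, which is the sort of runaway growth that no vocabulary estimate supports.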

Part II:

"But by (graduation) the foundation of her so-called mind has hardened. Limited by early lexical laxity, the average North American adult knows only 30,000 to 60,000 words, out of a potential "working vocabulary" of 700,000. If only Little Princess had learned more words earlier! If only you were a better parent!"

Brown is right about something: People learn fewer and fewer words as they grow older. But how much has it to do with calcification of the synapses and how much with the frequency of words? This brings us to our second equation of the day.

Empirical research has found that in English, the frequencies of the roughly 1,000 most frequently used words are approximately proportional to $1/n^s$, where $n$ is the word's frequency rank and $s$ seems to be just slightly more than 1.

This second equation describes a distribution that is known as Zipf's Law after George Zipf (not to be confused with the ill-fated George Zipp). In other words, the second most common word is about 1/2 as frequent as the first (which happens to be the), the third most common is only about 1/3 as frequent as the, and the fourth most common is only about 1/4 as frequent.

Now this isn't exponential change either, but it sure means that the frequency of words drops off pretty darn fast. Consider then a child's chances of meeting and learning their 3000th word compared to an adult's chance of meeting and learning their 60,000th word. At less than 1/60,000 the frequency of the, you could easily be well into middle age before you even encounter the word for the first time. If you're lucky enough to encounter it again before you retire, you're unlikely to recall that first meeting when you do. In other words, it's mainly the inherent distribution of words that makes it so hard to learn new words as you get older.
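For the curious, the drop-off can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exponent is taken as exactly 1, the power law is extrapolated far past the roughly 1,000 top words it was observed for, and the function name `relative_frequency` is invented for the example.

```python
def relative_frequency(rank: int, s: float = 1.0) -> float:
    """Frequency of the word at `rank`, relative to the rank-1 word, under Zipf's Law."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return rank ** -s


# Assumption: s = 1.0, and the law still holds at ranks 3,000 and 60,000.
for rank in (1, 2, 3, 4, 3000, 60000):
    print(f"rank {rank:>6}: {relative_frequency(rank):.6f} of the top word's frequency")

ratio = relative_frequency(3000) / relative_frequency(60000)
print(f"rank 3000 is about {ratio:.0f} times as frequent as rank 60000")
```

Passing a slightly larger exponent, say `s=1.1`, makes the rare words rarer still and widens the difference between the child's 3000th word and the adult's 60,000th.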