Tuesday, March 01, 2011

Vocabulary, reading, and hockey sticks

Yesterday, the Modern Language Journal published new online content, among which was an interesting paper entitled "The Percentage of Words Known in a Text and Reading Comprehension" by Norbert Schmitt, Xiangying Jiang, and William Grabe. (The preprint is available for free on Schmitt's website.) It's a very readable article, clear in its description of the procedure, and well worth reading.

The issue they take up is the possibility of a vocabulary threshold in reading. The idea is that readers who know only, say, 90% of the words in a text would largely be at a loss when it came to understanding it. But in texts where the reader knows more than a certain share of the words (proposed thresholds tend to start at about 95% and go as high as 99%), comprehension would be much easier, leading to a hockey-stick-shaped graph that perhaps rises linearly but then takes a sharp curve upwards, like this:


One reason for this might be that below the threshold, vocabulary guessing would be impossible (you can get a sense of why this might be in this older post), but that above the threshold, guessing might kick in. 

This study, however, found that this is not the case. Rather, the authors found this relationship:


In other words, there's no threshold. Generally, the more vocabulary you know, the better your comprehension will be. Note that these are averages across 661 language learners. At the individual level, things are much noisier, and in fact the authors caution that "with so much variation, it would be difficult to predict any individual’s comprehension from only their vocabulary coverage."

One final note is that even though the relationship between vocabulary coverage and comprehension is roughly linear, the relationship between vocabulary knowledge and comprehension is not. It is relatively easy to learn a few hundred words of English, which will get you 50% vocabulary coverage. To get from there to 90% requires a couple of thousand more, and the four percentage points between 95% and 99% likely take another five to eight thousand words on top of that. This is a Zipfian distribution that looks sort of like the blue line of this graph (from Wikimedia Commons):
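The diminishing returns described above can be roughly sketched in a few lines of Python. This is only an illustration under an idealized assumption: that word frequency is exactly proportional to 1/rank over a fixed vocabulary. The function names, the `vocab_size` of 50,000, and the `exponent` of 1.0 are all assumptions made for the sketch, not values from the paper, and the absolute counts it prints will not match the figures for real English quoted above. What it does show is the shape: each additional slice of coverage costs more words than the last.

```python
from itertools import accumulate


def zipf_coverage(vocab_size: int, exponent: float = 1.0) -> list[float]:
    """Cumulative share of running text covered by the top-k words.

    Assumes an idealized Zipfian model in which the word at a given
    rank has frequency proportional to 1 / rank**exponent.
    Element k-1 of the result is the coverage from knowing the k most
    frequent words.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [running / total for running in accumulate(weights)]


def words_needed(coverage: list[float], target: float) -> int:
    """Smallest number of top-ranked words whose coverage reaches target."""
    for k, share in enumerate(coverage, start=1):
        if share >= target:
            return k
    raise ValueError("target coverage is not reachable with this vocabulary")


# Hypothetical vocabulary size, chosen only for illustration.
coverage = zipf_coverage(vocab_size=50_000)
for target in (0.50, 0.90, 0.95, 0.99):
    needed = words_needed(coverage, target)
    print(f"{target:.0%} coverage needs about {needed} words")
```

In this toy model the first half of the coverage comes from a tiny fraction of the vocabulary, while the last few percentage points require a large share of it, which is the same pattern the graph illustrates.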

1 comment:

swan-tower said...

This reminds me of an exercise my sister used while teaching English in Japan: she gave her students "The Jabberwocky" to read. Despite not knowing a bunch of the words (and despite not knowing that a bunch of the words were pure nonsense), they still managed to accurately summarize the plot. She did it to prove to them that they could understand things even if they missed some of the words, and it worked very well.