Thursday, June 17, 2010

Words as species

The most recent edition of Reading in a Foreign Language is a festschrift for Paul Nation. Paul was one of my professors at TUJ and apart from being a very practical and productive researcher, he's also one of the friendliest and most personable folks you'd want to meet.

There's much worth reading there, but one piece that caught my attention was the Meara and Olmos Alcoy paper.

Here's the abstract:
This paper addresses the issue of how we might be able to assess productive vocabulary size in second language learners. It discusses some previous attempts to develop measures of this sort, and argues that a fresh approach is needed in order to overcome some persistent problems that dog research in this area. The paper argues that there might be some similarities between assessing productive vocabularies—where many of the words known by learners do not actually appear in the material we can extract them from—and counting animals in the natural environment. If this is so, then there might be a case for adapting the capture-recapture methods developed by ecologists to measure animal populations. The paper reports a preliminary attempt to develop this analogy.
And from the body:

Ecologists have developed a number of methods that allow them to resolve this problem. All of these methods rely on capturing a small number of animals, and then extrapolating this basic count to an estimate of the actual number of animals that could have been caught. The basic approach is known as the capture-recapture methodology, first developed by Petersen (1896), and further developed by Lincoln (1930). In this approach, we first develop a way of capturing the animals we are interested in, and standardise it. Suppose, for example, that we want to count the number of fish in a river. We could identify a suitable stretch of river to investigate, and then distribute traps that will catch the fish without harming them. We leave the traps out for a set time, overnight, for instance, and count the number of fish that we have trapped. We then mark these animals in a way that will allow us to identify them, before releasing them back into the wild. The next night, we carry out the same counting exercise, enumerating the fish trapped overnight. This gives us three numbers: We have N, the number of fish captured on Day 1; M, the number of fish captured on Day 2; and X, the number of fish that were captured on both occasions. Petersen argued that it was possible to extrapolate from these figures to the total number of fish in the stretch of river. Petersen’s estimate is calculated as follows:

E = \frac{(NM)}{X}

What they do then is take English-speaking learners of Spanish and get two samples of writing from them. Taking the types of words produced as being analogous to individual animals, they then plug the results into the formula above, with N being the number of types produced in the first piece of writing, M being the number of types produced in the second, and X being the number of types that appear in both pieces.
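To make the bookkeeping concrete, here's a minimal Python sketch of how two writing samples might be turned into N, M and X and fed into the formula, with X taken (as in Petersen's definition quoted above) as the number of types that show up in both samples. The function names and the crude regex tokenizer are my own assumptions for illustration, not the procedure from the paper, which would presumably need more care about what counts as a type.

```python
import re


def word_types(text):
    """Return the set of distinct word forms (types) in a text.

    Assumption: a type is a lowercased run of word characters,
    with no lemmatization or spelling normalization.
    """
    return set(re.findall(r"\w+", text.lower()))


def petersen_estimate(n, m, x):
    """Petersen's estimate E = N * M / X."""
    if x == 0:
        raise ValueError("no overlap between samples; estimate is undefined")
    return n * m / x


def estimate_from_texts(first, second):
    """Treat each text as one 'trapping session' and estimate total types."""
    types_1 = word_types(first)
    types_2 = word_types(second)
    shared = types_1 & types_2  # types 'caught' on both occasions
    return petersen_estimate(len(types_1), len(types_2), len(shared))


# Toy samples, invented for illustration only.
sample_1 = "el gato come pescado"
sample_2 = "el perro come carne"
print(estimate_from_texts(sample_1, sample_2))
```

Each toy sample has 4 types and they share 2 ("el" and "come"), so the estimate is 4 × 4 / 2 = 8.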

It's an interesting idea, but it's got a number of problems.

First, the number of types produced is a function of the number of tokens, but the relationship isn't linear. The Petersen estimate they use is consequently also largely a function of the number of tokens, and because of the non-linear relationship, simply dividing by X will not normalize the estimates. In other words, the larger the sample, the larger the estimate will be.
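A small simulation can illustrate the sample-size effect. This is only a sketch under assumptions of my own: a made-up vocabulary of 5,000 words whose probabilities fall off as 1/rank (a rough Zipf-like stand-in for real word frequencies), and "writing samples" that are just random draws from it. None of these numbers come from the paper.

```python
import random


def zipf_weights(vocab_size):
    """Weights proportional to 1/rank (assumed stand-in for word frequencies)."""
    return [1.0 / rank for rank in range(1, vocab_size + 1)]


def sample_types(weights, n_tokens, rng):
    """Draw n_tokens word ids with replacement and return the distinct ids."""
    ids = range(len(weights))
    return set(rng.choices(ids, weights=weights, k=n_tokens))


def petersen_for_sample_size(weights, n_tokens, rng):
    """Petersen estimate from two independent samples of n_tokens each."""
    first = sample_types(weights, n_tokens, rng)
    second = sample_types(weights, n_tokens, rng)
    shared = first & second
    if not shared:
        raise ValueError("no shared types; estimate is undefined")
    return len(first) * len(second) / len(shared)


def estimates_by_size(vocab_size=5000, sizes=(100, 1000, 10000), seed=0):
    """Estimate the same hypothetical vocabulary from samples of each size."""
    rng = random.Random(seed)
    weights = zipf_weights(vocab_size)
    return [petersen_for_sample_size(weights, n, rng) for n in sizes]


sizes = (100, 1000, 10000)
for size, est in zip(sizes, estimates_by_size(sizes=sizes)):
    print(f"{size:>6} tokens per sample -> estimated vocabulary {est:.0f}")
```

The underlying vocabulary is the same in every run, so any change in the estimate comes purely from how much text was sampled.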

Another issue is that they mention that the metaphorical trap is just sampling from a section of the river. In fish studies, the result of the estimate might be multiplied by the length of the river to come up with the total population. In the Spanish situation, they seem to have ignored this multiplier. But then again, we have no idea how long the Spanish river is, and that strikes me as the crux of the situation.

But in fact, I don't think the formula can be applied here at all. From what I can discover, the derivation appears to be based on the idea that the proportion of marked animals that are re-caught \frac{X}{N} should equal the proportion of the total population that is caught \frac{M}{E}.

 \frac{X}{N}=\frac{M}{E} is equivalent to

\frac{E}{N}=\frac{M}{X}  and when you isolate E, you get

E = \frac{NM}{X}

This assumes that you're as likely to catch any individual fish as any other. That's probably not exactly true of fish, since some will be more timid and others more outgoing, but it's blatantly false with words. You're almost certain to capture "the", but your chances of catching "meagre" are... meagre.
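Here's a rough sketch of how much the equal-catchability assumption matters. It compares two imaginary 5,000-word vocabularies: one where every word is equally likely to be "caught", and one where probability falls off as 1/rank. The vocabulary size, the sample size and the 1/rank distribution are all assumptions of mine for the sake of illustration.

```python
import random


def petersen_from_draws(weights, n_tokens, rng):
    """Petersen estimate from two independent samples of n_tokens draws each."""
    ids = range(len(weights))
    first = set(rng.choices(ids, weights=weights, k=n_tokens))
    second = set(rng.choices(ids, weights=weights, k=n_tokens))
    shared = first & second
    if not shared:
        raise ValueError("no shared types; estimate is undefined")
    return len(first) * len(second) / len(shared)


def compare_catchability(vocab_size=5000, n_tokens=2000, seed=0):
    """Return (estimate with equal catchability, estimate with skewed)."""
    rng = random.Random(seed)
    equal = [1.0] * vocab_size  # every word equally catchable
    skewed = [1.0 / rank for rank in range(1, vocab_size + 1)]  # 'the' vs 'meagre'
    return (
        petersen_from_draws(equal, n_tokens, rng),
        petersen_from_draws(skewed, n_tokens, rng),
    )


equal_est, skewed_est = compare_catchability()
print(f"true size 5000, equal catchability:  {equal_est:.0f}")
print(f"true size 5000, skewed catchability: {skewed_est:.0f}")
```

When the assumption holds, the estimate should land near the true size. With skewed probabilities the same common words keep turning up in both samples, which inflates X and drags the estimate well below the true size.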

Though the title of the paper implied the metaphor of words as species, which I believe is the correct approach, I think the metaphor actually used was words as members of a species, which is what Petersen was interested in. Looking around more, I found a 1976 paper which does take the words-as-species approach: B. Efron and R. Thisted. (1976). Estimating the number of unseen species: how many words did Shakespeare know? Biometrika, 63, 435–467.

The problem of guessing how many species are in a biosystem from a small sample that doesn't capture all species is, apparently, a classic problem, and a great deal of work has been done on it. Unfortunately, the approaches taken are beyond my mathematical ability to understand. For those with more ability, a recent survey of approaches, with a useful bibliography, is Gandolfi, A. and Sastri, C.C.A. (2004) Nonparametric Estimations about Species Not Observed in a Random Sample. Milan Journal of Mathematics, 72, 81–105.

This topic coincidentally popped up on Language Log recently, where Mark Liberman linked to course notes addressing the math behind some of the simpler estimates.
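As a rough illustration of what the simpler estimates in this family look like, here is a sketch of two standard ones: the Good-Turing estimate of the probability that the next token is a type not yet seen (the number of types seen exactly once, divided by the number of tokens), and the Chao1 lower-bound estimate of the total number of types. Picking these two is my own assumption; they are not necessarily the ones covered in the linked notes or in the papers above, and the function names are invented.

```python
from collections import Counter


def type_counts(tokens):
    """Return (n_tokens, n_types, f1, f2), where f1 and f2 are the numbers
    of types seen exactly once and exactly twice."""
    counts = Counter(tokens)
    if not counts:
        raise ValueError("need at least one token")
    freq_of_freq = Counter(counts.values())
    n_tokens = sum(counts.values())
    return n_tokens, len(counts), freq_of_freq[1], freq_of_freq[2]


def good_turing_unseen_mass(tokens):
    """Good-Turing estimate of the chance the next token is a new type: f1 / n."""
    n_tokens, _, f1, _ = type_counts(tokens)
    return f1 / n_tokens


def chao1(tokens):
    """Chao1 estimate of total types: S_obs + f1^2 / (2 * f2).

    When no type occurs exactly twice, fall back to the bias-corrected
    form S_obs + f1 * (f1 - 1) / 2 to avoid dividing by zero.
    """
    _, n_types, f1, f2 = type_counts(tokens)
    if f2 == 0:
        return n_types + f1 * (f1 - 1) / 2
    return n_types + f1 ** 2 / (2 * f2)


# Toy 'text' of single-letter words, invented for illustration.
tokens = list("aaabbcd")
print(good_turing_unseen_mass(tokens))
print(chao1(tokens))
```

In the toy text there are 7 tokens and 4 types, two of them seen once and one seen twice, so the unseen mass is 2/7 and Chao1 gives 4 + 4/2 = 6.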
