Thursday, August 09, 2007

Defining words (or not)

Another new journal has cropped up: ELR Journal. ELR stands for empirical language research.

In the inaugural issue, Yasunori Nishima at the University of Birmingham (Google has apparently never heard of him) has a paper entitled "A Corpus-Driven Approach to Genre Analysis: The Reinvestigation of Academic, Newspaper and Literary Texts". The paper reads more like the work of a student learning to use the tools of the trade than like a real contribution to the field. It's mostly a rehash of existing work with different corpora, none of it done in a particularly novel way, and none of it challenging any existing results or even testing anything questionable.

Much of the paper is built around word counts and frequencies. Surprisingly, though, Nishima doesn't find it useful to tell us what he means by "word". When he discusses the most frequent words, does he count "run" (e.g., I went for a run) and "run" (e.g., You run well) as two instances of one word or as two distinct words? What if we add in "runner", "running", "ran", "runs", etc.? Who knows? Nishima says he got his frequency information from Adam Kilgarriff (though there is no proper citation). Presumably, he means he got it here, but did he use the lemmatised list or the unlemmatised one? He doesn't say. At least we can guess that he is counting either unique word forms or lemmas.

But then he seems to conflate two senses of "word" when he compares his frequency data to arguments made by Paul Nation. As I have discussed before, Nation feels that we should consider only the 2000 most frequent word families to be high-frequency. Note, however, that Nation is explicitly talking about word families, while, as far as I can tell, Nishima isn't.
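To see how much the choice of counting unit matters, here is a minimal Python sketch. It is only an illustration: the LEMMAS and FAMILIES tables are hypothetical toy mappings I made up for the "run" example, not anything taken from Kilgarriff's lists or Nation's word-family lists, and the sketch ignores part of speech, so noun "run" and verb "run" are lumped together even when counting lemmas.

```python
from collections import Counter

# Hypothetical toy mappings, hand-made for this illustration only.
# Inflected form -> lemma (part of speech is ignored here).
LEMMAS = {"ran": "run", "runs": "run", "running": "run", "runners": "runner"}
# Lemma -> head of its word family (derived forms join the base word).
FAMILIES = {"runner": "run"}


def count_words(tokens, unit="form"):
    """Count tokens by word form, by lemma, or by word family."""
    if unit not in ("form", "lemma", "family"):
        raise ValueError(f"unknown counting unit: {unit}")
    counts = Counter()
    for token in tokens:
        key = token.lower()
        if unit in ("lemma", "family"):
            key = LEMMAS.get(key, key)    # collapse inflections
        if unit == "family":
            key = FAMILIES.get(key, key)  # collapse derivations too
        counts[key] += 1
    return counts


tokens = ("I went for a run and you run well "
          "but the runner ran while running").split()

print(count_words(tokens, "form")["run"])    # 2
print(count_words(tokens, "lemma")["run"])   # 4
print(count_words(tokens, "family")["run"])  # 5
```

The same fifteen tokens give "run" a count of 2, 4, or 5 depending on the definition, which is why a frequency ranking based on forms or lemmas can't be compared directly with a cutoff stated in word families.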

Though Nishima is merely the most recent linguist to ignore this terminological conundrum, his is a rather flagrant and troubling oversight, largely because the paper doesn't even show an awareness of the issue. How could you be doing research with words and not give a second thought to what a word is?

Not an auspicious start for the journal.

