English, Jack

Of of of

2019-01-09T20:44:00.000-05:00

In response to:

Pullum, G. K. (2018). Intuition and decidability in grammar and number theory. In K + K = 120: Papers dedicated to László Kálmán and András Kornai on the occasion of their 60th birthdays (pp. 1–10).

"Given a CFG for English, therefore, we could use a fully general algorithm to find out (in cubic time) whether, for example, there is a grammatical string with ofofof as a substring. (I think there probably is, but I leave the exercise of constructing one for the reader to pursue in idle moments.)"

I have constructed a sentence with three consecutive instances of of.

We informed the people you were thinking of of, of course, not only your specific intentions, but also, your general vision.
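For the curious, here is a toy-scale sketch of how such a decision procedure can work. One standard way to ask whether a context-free grammar generates any string containing a given substring is to pair the grammar with a small finite-state machine that tracks consecutive instances of of, and then check which (nonterminal, start state, end state) triples are reachable. The grammar below is a hypothetical toy in Chomsky normal form, not a grammar of English, and the naive fixed-point loop is not the optimized cubic-time algorithm Pullum has in mind; the names BINARY, LEXICAL, step, and has_ofofof are all mine.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (illustration only).
BINARY = {
    ("S", ("NP", "VP")),
    ("VP", ("VNP", "PP")),    # informed [the people ...] [of ...]
    ("VNP", ("V", "NP")),
    ("NP", ("DET", "N")),
    ("NP", ("NP", "RC")),     # the people [you thought of]
    ("RC", ("NP", "VGAP")),
    ("VGAP", ("VT", "P")),    # thought of (stranded preposition)
    ("PP", ("P", "NP")),
    ("NP", ("ADV", "NP")),    # [of course] plans
    ("ADV", ("P", "CRS")),
}
LEXICAL = {
    ("NP", "we"), ("NP", "you"), ("NP", "plans"),
    ("V", "informed"), ("VT", "thought"),
    ("DET", "the"), ("N", "people"),
    ("P", "of"), ("CRS", "course"),
}

def step(state, token):
    """Automaton over tokens: state = consecutive 'of's seen so far; 3 is absorbing."""
    if state == 3:
        return 3
    return state + 1 if token == "of" else 0

def has_ofofof(binary, lexical, start="S"):
    """True if the grammar derives some string containing 'of of of'."""
    states = range(4)
    # (A, p, q) means: A derives some string that moves the automaton from p to q.
    reach = {(a, p, step(p, tok)) for a, tok in lexical for p in states}
    changed = True
    while changed:
        changed = False
        for a, (b, c) in binary:
            for p, r, q in product(states, repeat=3):
                if (a, p, q) not in reach and (b, p, r) in reach and (c, r, q) in reach:
                    reach.add((a, p, q))
                    changed = True
    return (start, 0, 3) in reach

print(has_ofofof(BINARY, LEXICAL))
```

In this toy grammar the answer depends on the rule that lets of course attach in front of a noun phrase; drop that rule and the same check comes back negative.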

Contact magazine, Vol 40(4) now available

2014-12-04T16:52:00.000-05:00

Get your free copy here.

Useful examples for language learners

2014-11-06T06:29:00.001-05:00

The odd choices of example sentences that sometimes show up in these "teach yourself to speak..." type books, and in phrase books, have been rightly mocked in the past. In fact, the subtext of this blog's title references just such a phrase book.

Recently, Radiolab ran a program called translation, and started each segment with Robert Krulwich imitating a language lesson LP…with the twist of it being an LP that helps us to learn Robert’s imaginary native tongue, "Luden". The phrases chosen start out a little strangely (e.g., my mother wrote the best poem) and then get progressively more fanciful and bizarre. It all makes sense in the context of the program, and I really encourage you to listen to the whole thing.

I was reminded of this because I've started to study Old English. Now phrases that are useful for learning a modern language, phrases like what time is it and what does this mean are really quite pointless for learning Old English, because you're never going to speak it with anyone. Instead, you use it to read old texts. The result is that you get to study examples like this:

for þan iċ hine sweorde swebban nelle

therefore I will not kill him with a sword

and

þū scealt yfelum dēaðe sweltan

you must die by a wretched death

Good times!

On meeting 'otiose' twice again

2014-10-06T12:30:00.000-05:00

I asked Mark Liberman to have a look at what I wrote yesterday since I was struggling to get my head around the probabilities. He was kind enough to write the following guest post:

Maybe a better way of thinking about it is this:

Say the probability that word w_i will be selected at random from a collection of text is P(w_i). Then assuming independence, the probability that the next word will NOT be w_i is (1-P(w_i)), and the probability of failing to find w_i in N successive draws is

(1-P(w_i))^N

If P(w_i) is 1/10^7 (one in ten million), and N is 1000, then we get

(1-(1/10^7))^1000

which is 0.9999. So if we take notice of a rare-ish (P = 1/10000000) word, and draw 1,000 other words at random looking to see it again, then 9,999 times out of 10,000, we'll fail to find the moderately rare word we were waiting for. And if we draw 10,000 additional words instead of 1,000, the probability of failure is still

(1-(1/10^7))^10000 = 0.999

so we're still gonna fail 999 times out of a thousand.

But the thing is, Rare Words Are Common. That is, a large proportion of word tokens belong to relatively rare types. So suppose that there are 10,000 other words of approximately equal rareness, and every time we see one of them, we set a subconscious process to watch for recurrences of that word within the next thousand instances.

If we do this a thousand times, then the chances of failure (for a thousand instances of noting a rare word and looking for it to occur again) become

((1-(1/10^7))^1000)^1000 = about 0.9

((1-(1/10^7))^10000)^1000 = about 0.368

So if you do enough reading for these conditions to be satisfied once a day, you should expect to have this experience several times a week.
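The four figures above can be re-run with a few lines of Python. This is only a sketch of the same independence assumption; the function name p_no_repeat and its parameters are illustrative, not anything from the original calculation.

```python
def p_no_repeat(p_word, window, watches=1):
    """Probability that none of `watches` independent watches sees the word
    again within `window` random draws, assuming independent draws."""
    return (1.0 - p_word) ** (window * watches)

P = 1e-7  # one in ten million

for window, watches in [(1_000, 1), (10_000, 1), (1_000, 1_000), (10_000, 1_000)]:
    fail = p_no_repeat(P, window, watches)
    print(f"window={window:>6}  watches={watches:>5}  P(no repeat)={fail:.4f}")
```

The last two rows are the ones that matter: once there are a thousand watches running, the chance of never seeing a repeat drops to roughly 0.9 and then to roughly 0.37.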

Now, none of this reasoning really applies, because you aren't picking words at random from a well-mixed urn, you're reading them in order in coherent text. And words in coherent text are far from independent Bernoulli trials -- when a rare word appears, the probability that it will appear again before long in the same text is massively increased by topic effects (and to a lesser extent style and priming effects). But this just means that the experience should be more common rather than less common -- unless you insist that the texts be separate and on different topics, and so forth, in which case it gets complicated.

But still, I think that the real puzzle is not why you had this apparently odd experience, but why we occasionally notice the kinds of coincidences that are in fact rather common.

This is not an unimportant question, since it has a lot to do with the genesis of superstition (and probably science, for that matter...)

The above is a guest post by Mark Liberman.

On meeting 'otiose' twice in a day

2014-10-05T05:36:00.002-05:00

Well, not in the same day, but certainly within a 24-hour period. As I was lying in bed last night, reading Charles Mann's 1493, I came across the phrase the otiose Percy on p. 78.

As of this morning, I've read to p. 90, so that's about 4,500 words later. I also read a few NY Times articles, adding perhaps another 1,200 words. And then I set about to edit an article for Contact, the TESL Ontario magazine for which I'm the editor. Almost immediately, I came across a quote from David Crystal in which he wonders,

whether the presence of a global language will eliminate the demand for world translation services, or whether the economics of automatic translation will so undercut the cost of global language learning that the latter will become otiose.

So that's twice in about 6,000 words. What are the chances of this? [See the update]

Well, not liking to leave rhetorical questions hanging, I set out to see. The adjective otiose (roughly meaning "of no value or use") occurs at a rate of 0.03 times per million words in COCA. That's about once every 32 million words or so. Google Books put it at a higher rate: one in 10 million. So let's say my chances of meeting it are about 1 in 15 million words (0.00000667%).

As I say, we're looking at a window of 6,000 words, so we divide 6,000 by 15,000,000 to get 0.0004. Multiply this by the original odds to get a 1 in 37.5 billion (0.0000000027%) chance of meeting otiose once in a span of 6,000 words.

My odds of doing that twice in 6,000 words (assuming these are independent events, which is more or less right) are simply the odds of doing it once squared. So, that's about 1 in 1.4 sextillion.
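For anyone who wants to check the arithmetic, here is a sketch that repeats the steps exactly as described above, using exact fractions. It follows the reasoning as stated rather than endorsing it, and the variable names are mine.

```python
from fractions import Fraction

rate = Fraction(1, 15_000_000)   # assumed chance that any given word is "otiose"
window = 6_000                   # words read between the two sightings

share = window * rate            # 6,000 divided by 15,000,000
once = share * rate              # "multiply this by the original odds"
twice = once ** 2                # "the odds of doing it once squared"

print(f"window share:   {float(share)}")
print(f"once:  1 in {float(1 / once):.4g}")
print(f"twice: 1 in {float(1 / twice):.4g}")
```

Using Fraction keeps the intermediate values exact, so any surprise in the output comes from the reasoning and not from rounding.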

So that's, like, wow!

But.

That's the odds of me doing it in this particular stretch of 6,000, not the odds of some person doing it some time.

Let's say the average English reader reads 10,000 words per day and there are about 300 million people who read English. That's about 3 trillion words per day. So (and this is where I lose confidence that I'm getting the reasoning right) the odds of some English speaker reading otiose twice in the span of 6,000 words are about one in 468,750,000 each day or about one in 1.28 million per year.

Given that, I think it's most likely that no other person has ever accomplished this feat.

Except you.

A title misparsed

2014-09-02T16:23:00.003-05:00

This morning, I was reading this article at New Statesman, when I came across the following:

Yet surely, when night after night atrocities are served up to us as entertainment, it's worth some anxiety. We become clockwork oranges if we accept all this pop culture without asking what's in it.

The plural clockwork oranges suddenly threw into sharp relief the title of Burgess's book A clockwork orange. For some reason that I am unable to articulate now, if I ever was aware of it, I had always parsed that title like this:

That is to say, I took orange to be a postpositive modifier of clockwork (like proof positive, governor general, the city proper, etc.) instead of clockwork as an attributive modifier of orange, like this:

This was, I must admit, an odd and, even to me, puzzling title, but then it's an odd and puzzling book, so I just rolled with it. As I say, it was the plural oranges that made me see the light: adjectives don't do plurals.

I somehow overlooked the frequency of clockwork as a modifier, which should have tipped me off: in COCA, almost 40% of all instances of clockwork are attributive modifiers. Another thing that I was aware of, but which just seemed like more of the weirdness, is that clockwork is rarely--but sometimes--countable, so a clockwork is kinda weird, but not totally beyond the pale.

Perhaps one thing that pushed me to the first analysis was the stress pattern. Usually, in an NP with a noun as modifier, the modifier gets the main stress in the NP. It's a

FAculty office, not faculty OFfice,
SOCcer ball, not soccer BALL, and
poLICE officers, not police OFficers.

My impression is that people tend to say a clockwork ORANGE, rather than a CLOCKwork orange. This is the same pattern you get with postpositive modifiers like proof POsitive.

Whatever the reason, what really impressed me is how decades of misapprehension can be overcome by a single choice example.

Antedating "determinative"

2014-08-19T07:13:00.001-05:00

The OED gives:

b. Gram. determinative adjective, determinative pronoun, etc. (see quots.); determinative compound = tatpurusha n.

1921   E. Sapir Lang. vi. 135   The words of the typical suffixing languages (Turkish, Eskimo, Nootka) are ‘determinative’ formations, each added element determining the form of the whole anew.

1924   H. E. Palmer Gram. Spoken Eng. ii. 24   To group with the pronouns all determinative adjectives..shortening the term to determinatives.

1933   L. Bloomfield Language xiv. 235   One can..distinguish..determinative (attributive or subordinative) compounds (Sanskrit tatpurusha).

1961   R. B. Long Sentence & its Parts 486   The, a, and every are exceptional among the determinative pronouns in requiring stated heads.

Today, I was reading Kellner's Historical outlines of English syntax from 1892 and came across the following on pp. 113–114 (emphasis added):

In Old English the possessive pronoun, or, as the French say, "pronominal adjective," expresses only the conception of belonging and possession ; it is a real adjective, and does not convey, as at present, the idea of determination. If, therefore, Old English authors want to make such nouns determinative, they add the definite article :

"hæleð min se leofa" (my dear warrior). —Elene, 511.
"ðu eart dohtor min seo dyreste" (thou art my dearest daughter). —Juliana, 193.

§179. In Middle English the possessive pronoun apparently has a determinative meaning (as in Modern English, Modern German, and Modern French); therefore its connection with the definite article is made superfluous, while the indefinite article is quite impossible. Hence arises a certain embarrassment with regard to one case which the language cannot do without.

Suppose we want to say "she is in a castle belonging to her," where it is of no importance whatever, either to the speaker or hearer, to know whether "she" has got more than one castle how could the English of the Middle period put it? The French of the same age said still "un sien castel," but that was no longer possible in English.

§180. We should expect the genitive of the personal pronoun ("of me," &c., as in Modern German)—and there may have been a time when this use prevailed—but, so far as I know, the language decided in favour of the more complicated construction "of mine, of thine," &c.

This was, in all probability, brought about by the analogy of the very numerous cases in which the indeterminative noun connected with mine, &c., had a really partitive sense (cf. the examples below), and, further, by the remembrance of the old construction with the possessive pronoun.

And later:

Later on, the possessive pronoun apparently implies a determinative meaning (as in Modern German and Modern French) ; therefore its connection with the definite article is made superfluous, while the indefinite article is quite impossible. Instead of the old construction we find henceforth what may be termed the genitive pseudo-partitive. See above, 178–180.

Proscribing, narrowly

2014-07-07T07:24:00.001-05:00

Over at the NYT, Alexander Nazaryan has a rather strident article about "The fallacy of balanced literacy." Therein, he writes, "balanced literacy is an especially irresponsible approach, given that New York State has adopted the federal Common Core standards, which skew toward a narrowly proscribed list of texts, many of them nonfiction." [Now changed to narrowly prescribed.]

These texts are prescribed. That is, they're imposed, not declared unacceptable or invalid. Nevertheless, the Google Books corpus suggests narrowly proscribed is a new and growing phrase.
So, I'm curious: was this simply a typo, or did he have in mind some metaphor of narrowing down by proscription? Or was it something else?

Thinking like a freak

2014-06-23T08:48:00.001-05:00

I listen to the Freakonomics Radio podcast from time to time, and back in May they aired an episode called "the three hardest words...," which, purportedly, were I don't know. The premise was that people hate to admit ignorance and so they hardly ever say, "I don't know."

Except that in most corpus studies, the head-and-shoulders most common, number one, top-of-the-heap three-word string in English is I don't know. (It's a three-word string, not four, since -n't is an inflectional suffix, not just a contraction as is taught in elementary schools, but that's another issue.) For instance, in the 3-grams list from the Corpus of Contemporary American English, I don't know is by far the most frequent 3-gram with 199,110 instances (second is one of the at 167,785). In business meetings, we find the same results. Consider table 3.10 on p. 59 of this book, or table 5.8 on p. 183 of this paper.
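If you want to try this on text of your own, a 3-gram count is easy to sketch. The tokenizer below keeps don't as a single word, in line with the three-word analysis; the sample text is made up for illustration and has nothing to do with the COCA figures, and the function name top_trigrams is mine.

```python
import re
from collections import Counter

def top_trigrams(text, n=3):
    """Return the n most frequent three-word strings in text."""
    # Keep apostrophes inside tokens so that "don't" counts as one word.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    return Counter(" ".join(t) for t in trigrams).most_common(n)

# Hypothetical sample text, for illustration only.
sample = (
    "I don't know if it works. I don't know what he said. "
    "One of the reasons is that I don't know."
)
print(top_trigrams(sample, 1))
```

This simple version lets 3-grams run across sentence boundaries and only recognizes the straight apostrophe, so a real corpus count would need more careful tokenizing.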

Now, these are not mostly "I don't know (period)." Far more commonly, they're "I don't know if..." "I don't know what..." etc., which can often be used as a signal of disagreement rather than as an admission of ignorance. Nevertheless, the data stands in rather stark contradiction to the freaky claim. It looks pretty silly to be saying people should fess up to their ignorance, while basing the argument on a point on which you're so ignorant that you assert the most common phrase is the least (or at least the hardest).

(If you're interested in other freaky foolishness, see Joseph Heath's recent post on their simplistic view of the UK medical system.)

Audio and the OED

2014-05-16T16:32:00.002-05:00

As I mentioned, Schwa Fire is now out, and I've been quite enjoying it. Arika Okrent (whose name I have inexplicably misread for years as Akira) has written an article called "Ghost voices" about preserving audio-tape recordings of our all-too-impermanent voices, dialects, and languages. As I was reading it, it occurred to me that the OED should include audio recordings of the quotations it uses. These should be in the dialect, and where possible the actual voice, of the original author.

Schwa Fire

2014-05-16T16:23:00.001-05:00

Back in November, 2013, there was a proposal on Kickstarter for a new language magazine. I chipped in to sponsor it and ended up on the editorial panel as a result. The first issue is now out.

Issue 1, Season 1

May 16, 2014 • Schwa Fire

The golden age of language journalism begins now. In this inaugural issue, Arika Okrent tells the story of 5,700 hours of Yiddish recordings that were almost lost ("Ghost Voices"), and Russell Cobb writes about Americans' fondness for the Englishes we used to speak and what that fondness obscures ("The Way We Talked"). Michael Erard describes and defends "language journalism," and Robert Lane Greene provides a lesson on the languages of love ("Wooing in Danish"). Also included: an English homophone puzzle.

When "syndrome" is a final "s"

2014-05-15T07:41:00.003-05:00

1982 gave us the acronym AIDS formed from acquired immune deficiency syndrome. This is pronounced /eɪdz/. The fact that the final S is pronounced /z/ is notable, since a final s is typically pronounced /s/ (e.g., bus) unless it is an inflectional morpheme (e.g., dogs). There are cases such as news and lens, in which a final s is pronounced /z/, but the -s in news was originally a plural morpheme. That leaves lens, which comes from the Latin word for lentil. Apparently, it was pronounced /leːns/ in Latin, so why it has a final /z/ in English is something of a mystery to me. I cannot find another example of an English noun with a final s pronounced /z/.

This brings us back to AIDS. Presumably, this final /z/ was influenced by the homographs aids, the noun, and aids, the verb. But then in 2003 we got SARS. There is no English word sar, so there is no preexisting homograph from which to analogously get /sɑɹz/, but that is the only pronunciation I've ever heard. I've never heard anyone say /sɑɹs/. So this seems to be an extension of the AIDS analogy to aids.

And now today we have MERS. On CBC's Metro Morning this morning, Matt Galloway started out pronouncing it /mɜɹs/, which initially threw me. I'd been mentally pronouncing it with a final /z/, and indeed Galloway finished up with /mɜɹz/. (I couldn't tell what the person he was interviewing was saying, but I suspect she was using the /z/ form, given his shift.)

So perhaps we have a new rule developing: acronym-final s for syndrome is pronounced /z/.

As I was looking around while writing this post, I found that at least one other person has taken note of the pronunciation of MERS.

[John Wells points out "Latin fifth-declension nouns in -es have final noninflectional /z/ in English, too: species, series... Why that should apply to MERS is a further question, which I cannot answer: but compare Mars."]

It's turtles round and round

2014-04-22T08:07:00.003-05:00

Part I
I've been trying to understand categories better, and one of the books I've been reading in pursuit of this goal is George Lakoff's Women, Fire, and Dangerous Things. In fact, a few nights ago, I fell asleep reading it, and it must have stirred something in my mind because the next morning in the shower, it occurred to me that perhaps categories are just a distraction and it's really properties we should be looking at.
The category of red things is just a human convenience. But red is a property. Almost immediately, though, I realized that red is a category of electromagnetic waves and that electromagnetic waves are themselves a category. And from there, well, it's turtles all the way down. I set the idea aside as I dried myself and got ready to leave home.

Part II
When I got to work, our Blackboard system was down, so I couldn't do the grading I had intended to do. Distractedly I opened the Simple English Wiktionary and saw that somebody had edited the entry for the preposition pace. The change was an improvement on what I thought was a rather odd previous definition. Going through the history, though, I noticed that the older definition was one I had provided. Curious about what I had been thinking, I went to the OED's entry for pace. There I found the following example:

1995 Computers & Humanities 29 404/1, I do not believe, pace Peirce and Derrida, that it is signs all the way down.

This struck me as a huge coincidence. The expression shows up in the Corpus of Contemporary American English about once per 150 million words. I had encountered it, or a variation thereof, twice in a morning.

Part III
I looked up the expression and found that Wikipedia has an entry (linked above). One of the citations listed there is to John (Haj) Ross's 1967 linguistics dissertation Constraints on Variables in Syntax. I followed the link and opened up his dissertation, which indeed contains the story with the line "It's turtles all the way down."
On the next page were the Acknowledgements, which list a number of linguists, but on page x, Ross writes,

This thesis is an integral part of a larger theory of grammar which George Lakoff and I have been collaborating on for the past several years.

This is, of course, the same Lakoff whose book I had been reading the night before.

Looking to the futurate

2014-04-11T10:54:00.003-05:00

The verb look has been used to talk about the future for a long time. Perhaps the most common use is in the expression look forward to (something). This use may be based on the metaphor that time is a landscape we move through. As such, our future should be visible to us. This is probably the same metaphor that underlies the use of go for the future in expressions like we're going to get to that in a moment.

Despite its venerable history, futurate look began a significant upsurge in about 1980, particularly in the looking + to infinitive construction.
I noticed that this seems to be particularly common with are. Especially, you are. And even more specifically if you are. By this time we are looking at a small minority of the cases. But I wondered if it might tell us something about the meaning of the looking futurate as opposed to the going futurate. When I started looking at various corpus genres, though, things got a little too complex. Maybe you have some thoughts to add.

The opacity of etymology

2014-04-08T11:20:00.002-05:00

The word disseminate is a familiar one. It appears hundreds of times on my hard drive and in well over 20 email messages I've read or written in the last five years. But until today, I had never seen the seeds in the word.

We often use a plant metaphor to talk about words. Morphology is a branch of linguistics just as plant morphology is a branch of biology. Both sciences talk of roots and stems, but in linguistics, seeds aren't part of the metaphor.

The root of disseminate, though, is semin or semen, from the Latin word meaning seed. Dissemination is the spreading of seeds. Semin is also the root of the word seminary, literally a seed plot, but now metaphorically used to mean a place to train priests. This is also where seminar comes from. I hadn't connected up these words either.

Disseminate appears in a passage that my level-8 class is studying. It's a passage that I've been over many times with other classes, and I have had to explain the word before. But never before have I made this connection.

How could this be?

On the flip side of this are cases of people seeing connections where there are none. Consider the regular flare-ups about the word niggardly based on a mistaken perception that it's based on a racial slur. There are so many opportunities for false positives. The string semen, for instance, shows up in a variety of words such as basement and horsemen, and nobody would ever think it referred to seeds.
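The point about false positives is easy to see with a plain substring search, which has no idea where morpheme boundaries fall. A minimal sketch, using only the words mentioned in this post (the function name hidden_in is mine):

```python
def hidden_in(fragment, words):
    """Return the words that contain fragment as a raw letter string."""
    return [w for w in words if fragment in w.lower()]

words = ["basement", "horsemen", "disseminate", "seminary", "seminar"]

print(hidden_in("semen", words))   # accidental matches
print(hidden_in("semin", words))   # the etymologically related words
```

The search finds both kinds of match with equal ease; telling them apart takes knowledge of the words' histories, which is exactly what the letters alone don't give you.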

How does this noticing thing work?

Contact Magazine summer issue

2013-09-06T09:42:00.001-05:00

The summer issue is now available here. Check it out.

New issue of Contact available

2013-06-28T12:07:00.002-05:00

The newest issue of TESL Ontario's Contact magazine is now available. This is the research symposium issue, based on talks given each fall at TESL Ontario's conference. The issue is edited by Hedy McGarrell and David Wood. The authors are: Dianne Larsen-Freeman, Antonella Valeo, Farahnaz Faez, Douglas Fleming, Alister Cumming, Robert Kohls, Hedy McGarrell, David Wood, and Randy Appel.

Please check it out.

Then: the meaning changes

2013-05-27T20:28:00.001-05:00

A student wrote to me asking about the meaning and placement of then in this sentence.

The product is then designed with this target price in mind.

I explained that then could be resultative (meaning something like "so that" or "in that case") or simply temporal (meaning roughly "next" or "after that"). I also pointed out that it could occur initially, medially (in a number of places), or finally:

1 The product is 2 designed with this target price in mind 3.

Then, it occurred to me that position 1 could only be interpreted as having the temporal meaning, position 2 was ambiguous, and position 3 strongly suggests the resultative meaning.

Is it just me?

New issues of Contact available

2013-03-05T21:46:00.001-05:00

I'm now into my second year as editor of TESL Ontario's Contact magazine. You can download issues for free as far back as 2000. I've been the editor for just over a year now (volumes 38 & 39). We just put out the conference issue today, as we do every spring. I hope you enjoy it. Please leave me a comment to let me know what you thought.

Education First's English Proficiency Index

2012-10-29T11:00:00.001-05:00

Last year, I blogged about the first English Proficiency Index here. Education First has now released their second index, and they're getting some good publicity out of it (I'm linking to them). The same problems as last year occur: self selection of participants, and need for a computer with internet access to do the test. There's also no data on test reliability and no validity arguments. Still, with 1.7 million participants over three years in more than 50 countries, you wouldn't want to completely dismiss the results either.

One of the interesting things was the levels of adult English learners in English-speaking countries, which can be found on p. 34. Canada and Australia have "high proficiency" learners, while the UK's are "mid proficiency" and the US's "low proficiency". There is some attempt to explain this in terms of language teaching infrastructure and other variables. Also check out the regional info-graphics.

ER-Central.com

2012-10-25T08:49:00.001-05:00

Another new and potentially useful site:

Extensive Reading Central is a not-for-profit organization dedicated to developing an Extensive Reading and Extensive Listening approach to foreign and second language learning. It was started by Dr. Rob Waring of Notre Dame Seishin University, Okayama, Japan and Dr. Charles Browne of Meiji Gakuin University, Tokyo, Japan as a free service to the EFL community.

This site will consolidate a number of existing sites, including Rob Waring's pages robwaring.org/er/ and Tom Robb's site extensivereading.net. The ER foundation, though, will remain a separate site.

Writes Rob,

Our hope is that it will act as the go-to place for ER/EL. Please take a look. It has information for teachers, authors and publishers. There are how-to's, 90 videos about ER, links, downloads, and lots more - take a look. There's a student page where we'll add some student stuff in the future. We also want to allow the publishers to make announcements about their products, ask for help with say piloting materials and so on. They are part of our community too.

We'll be adding a discussion section, advice center and so on in the coming weeks. We also hope to have pages in different languages. All in time.

ER-Central will open an Amazon Affiliate bookstore soon. Note that the site has cost a lot of money to build and any earnings from this will pay for the development costs and go towards paying the annual bandwidth costs ($650 per year). We expect to make a huge loss each year. However, if it does turn a profit, it will be used to buy books for disadvantaged schools. We hope we don't have to resort to advertising.

So how can you help?

If you have any handouts, presentations, worksheets, webpages etc you wish to add to the site, please use the upload form on the top page to send stuff to me and I'll link it in.

Spread the word on Facebook, Twitter or however you like

Help translate some pages

Find dead links and send new ones

Make comments on pages

Leave announcements about ER events etc.

Buy books from the store (available in a week or so)

Leave suggestions for features you want - we'll try to add them

The site is in Wordpress (which both Charlie and I are still learning). If anyone can help design the site or help us to add images and various widgets, let me know and I'll add you as an administrator.

Tutela

2012-10-25T08:31:00.000-05:00

Tutela has been in beta for about a year but had its official launch during the TESL Canada conference. From their terms of use:

Tutela.ca (“Service”) is a Canadian not-for-profit online repository and community for ESL (English as a Second Language) and FSL (French as a Second Language) professionals registered to use the Service (“Users”). As a repository, the Service provides Users with access to ESL and FSL materials including classroom materials, lesson plans, assessment information, and reusable learning objects. As a community, the Service enables Users to share materials, discover new approaches, locate solutions, and network including through the use of the online meeting and webinar conferencing capabilities of the Service (“Conferencing”). The Service is supported by funding from Citizenship and Immigration Canada (“CIC”) and is owned and operated by Citadel Rock Online Communities Inc. (“Citadel”). (my bold)

I'm not really sure how this works. Tutela itself is not for profit, but Citadel is a privately owned corporation, for profit as far as I can tell. It has received a number of grants from the federal government. Just so as you know...

More Google Ngrams 2.0 POS tagging

2012-10-20T09:03:00.000-05:00

As I wrote yesterday, there are some strange tagging decisions concerning determinatives in this corpus. It seems though, as I added in an update, that these are largely the fault of the Part-of-Speech Tagging guidelines for the Penn Treebank Project.

The problems, though, are not limited to determinatives. Subordinators are also affected. The words that, whether, and if (e.g., They told me that it was OK and They asked me whether/if it was OK) are tagged as _ADP_ (short for ADPOSITION, a more inclusive term for prepositions.) The guidelines say:

"We make no explicit distinction between prepositions and subordinating conjunctions. (The distinction is not lost, however - a preposition is an IN that precedes a noun phrase or a prepositional phrase, and a subordinate conjunction is an IN that precedes a clause).
The preposition "to" has its own special tag TO."

This makes good sense for words like because, after, and since, which have been treated as "subordinating conjunctions" but really are prepositions and function as the heads of preposition phrases. It doesn't work, though, with that, whether, and if, which function as markers of subordination and not heads. Consider the difference between these two clauses:

Language Learner Literature Award Winners

2012-10-20T08:05:00.000-05:00

Somehow I missed the announcement, but the 2012 LLL Award Winners for books published in 2011 have been announced.

Google Ngrams 2.0 and POS tagging

2012-10-19T09:56:00.001-05:00

As Ben Zimmer blogged yesterday, there's a new and improved version of the Google Ngram viewer. The "improved" bit has a number of elements, but one is POS tagging. This is a wonderful thing, and I'm inordinately happy about it. Unfortunately, there are some very odd quirks to deal with.

The subordinators that, whether, and if (e.g., she asked me whether/if I'd be able to go; he told me that he'd be able to go) are tagged as _ADP_ (adposition, a more general term for prepositions). I've never seen such a classification, and it strikes me as deeply strange.

The second quirk is their list of _DET_ (determinatives, or what they call "determiners"). I'm happy to report that they do not include the dependent possessive pronouns (my, your, his, her, our, etc.). These are tagged as _PRON_.

It seems that the following words are tagged as determiners at least part of the time:

a	100%
all	100%
an	100%
another	100%
any	100%
both	100%
each	100%
every	100%
some	100%
the	100%
these	100%
this	100%
those	100%
whatever	100%
which	100%
whichever	94%
no	88%
neither	56%
that	43%
either	42%
what	3%

Apart from these, many and much are usually tagged _ADJ_ (and _ADV_ for much), but less than 1% of the time they get tagged _DET_. I believe all cases tagged _ADJ_ should have been _DET_, so I have no idea what distinction is being made here.

I believe that which is generally a determinative, but in relative uses, it's usually a pronoun:

They could be late, which would be a problem. [pron]

They could be late, in which case we would have a problem. [det]

The following words are generally considered determiners (at least in some cases) and yet they are never tagged as such in the corpus:

few, fewer, fewest, last, least, less, little, more, most

Some other words which are determinatives in some accounts but not here are:

a few, a little, anyone, anything, certain, enough, everything, none, said, us, we, you

Finally, the list above doesn't appear to be exhaustive as I cannot get to 100% when dividing by _DET_, even when I include upper case first letters and all upper case. I wonder what I'm missing.

[Update: It seems that these oddities are based mostly in the Part-of-Speech Tagging guidelines for the Penn Treebank Project (3rd Revision, 2nd Printing) by Beatrice Santorini.

This category includes the articles a(n), every, no and the, the indefinite determiners another, any and some, each, either (as in either way), neither (as in neither decision), that, these, this and those, and instances of all and both when they do not precede a determiner or possessive pronoun (as in all roads or both times). (Instances of all or both that do precede a determiner or possessive pronoun are tagged as predeterminers (PDT).) Since any noun phrase can contain at most one determiner, the fact that such can occur together with a determiner (as in the only such case) means that it should be tagged as an adjective (JJ), unless it precedes a determiner, as in such a good time, in which case it is a predeterminer (PDT).

This explains the missing m determiners, but it doesn't explain why a small subset of many and much are tagged as _DET_. ]

[Update 2: The instances of many_DET are mostly of the form many a, for example many a day went by. Under the Penn system, this is a predeterminer. Thanks to Slav Petrov for solving this puzzle!]
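To see how the pieces fit together, here is a sketch of the kind of collapse from fine-grained Penn-style tags to the coarse tags the viewer shows. The mapping table is an assumption modeled on the widely used universal part-of-speech tagset, not something taken from Google's documentation, and the example taggings are illustrative rather than drawn from the corpus.

```python
# Assumed mapping from Penn Treebank tags to coarse tags (illustrative subset).
PENN_TO_COARSE = {
    "DT": "DET",     # determiner
    "PDT": "DET",    # predeterminer, as in "many a day"
    "WDT": "DET",    # wh-determiner
    "IN": "ADP",     # preposition or subordinating conjunction
    "JJ": "ADJ",     # adjective
    "RB": "ADV",     # adverb
    "PRP$": "PRON",  # possessive pronoun
}

def coarse(penn_tag):
    """Collapse a Penn-style tag to a coarse tag; unknown tags map to 'X'."""
    return PENN_TO_COARSE.get(penn_tag, "X")

# Hypothetical taggings that would produce the patterns described above.
examples = [
    ("that", "IN"), ("whether", "IN"), ("if", "IN"),
    ("many", "JJ"), ("many", "PDT"),
    ("my", "PRP$"), ("the", "DT"),
]
for word, tag in examples:
    print(f"{word}_{coarse(tag)}  (from {tag})")
```

Under this assumption, the subordinators end up as _ADP_ because Penn folds them into IN, and many shows up as _DET_ only in the predeterminer case.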