Saturday, October 20, 2012

More Google Ngrams 2.0 POS tagging

As I wrote yesterday, there are some strange tagging decisions concerning determinatives in this corpus. It seems, though, as I added in an update, that these are largely the fault of the Part-of-Speech Tagging Guidelines for the Penn Treebank Project.

The problems, though, are not limited to determinatives. Subordinators are also affected. The words that, whether, and if (e.g., They told me that it was OK and They asked me whether/if it was OK) are tagged as _ADP_ (short for ADPOSITION, a more inclusive term that covers both prepositions and postpositions). The guidelines say:
"We make no explicit distinction between prepositions and subordinating conjunctions. (The distinction is not lost, however - a preposition is an IN that precedes a noun phrase or a prepositional phrase, and a subordinate conjunction is an IN that precedes a clause).
The preposition "to" has its own special tag TO."
This makes good sense for words like because, after, and since, which have traditionally been treated as "subordinating conjunctions" but really are prepositions and function as the heads of preposition phrases. It doesn't work, though, with that, whether, and if, which function as markers of subordination and not as heads. Consider the difference between these two clauses:

