tag:blogger.com,1999:blog-318304972024-03-13T08:27:30.768-05:00English, JackSecond thoughts on English and how she's taughtBretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.comBlogger577125tag:blogger.com,1999:blog-31830497.post-52917831911872590242019-01-09T20:44:00.000-05:002019-01-11T06:58:43.097-05:00Of of of<span lang="EN-CA" style="mso-ansi-language: EN-CA;">In response to:</span><br />
<br />
<div style="margin-left: 24pt; text-indent: -24.0pt;">
Pullum, G. K. (2018). Intuition and decidability in grammar and number theory. In <i>K + K = 120: Papers dedicated to László Kálmán and András Kornai on the occasion of their 60th birthdays</i> (pp. 1–10).<br />
<br /></div>
"Given a CFG for English, therefore, we could use a fully general algorithm to find out (in<br />
cubic time) whether, for example, there is a grammatical string with ofofof as a substring. (I think there probably is, but I leave the exercise of constructing one for the reader to pursue in idle moments.)"<br />
<br />
<span lang="EN-CA" style="mso-ansi-language: EN-CA;">I have constructed a sentence with three consecutive instances of <i>of</i>.</span><br />
<br />
<ul>
<li><i><span lang="EN-CA" style="mso-ansi-language: EN-CA;">We informed
the people you were thinking of of, of course, not only your specific
intentions, but also, your general vision.</span></i></li>
</ul>
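As a quick sanity check on the word sequence only (it says nothing about grammaticality, which is the hard part of Pullum's exercise), here is a minimal Python sketch. The tokenizer and the helper name `longest_run` are my own assumptions for illustration; punctuation is simply ignored.

```python
import re

sentence = (
    "We informed the people you were thinking of of, of course, "
    "not only your specific intentions, but also, your general vision."
)

# Crude tokenizer (an assumption): keep runs of letters and apostrophes,
# so commas and periods are dropped.
tokens = re.findall(r"[A-Za-z']+", sentence.lower())


def longest_run(tokens, word):
    """Length of the longest run of consecutive tokens equal to `word`."""
    best = run = 0
    for token in tokens:
        run = run + 1 if token == word else 0
        best = max(best, run)
    return best


print(longest_run(tokens, "of"))
```

The comma between the second and third <i>of</i> is discarded by the tokenizer, so the check treats the three as adjacent words.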
Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com1tag:blogger.com,1999:blog-31830497.post-76431020259116537842014-12-04T16:52:00.000-05:002014-12-04T16:52:07.650-05:00Contact magazine, Vol 40(4) now availableGet your free copy <a href="http://www.teslontario.net/publication/contact-magazine" target="_blank">here</a>.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-87967534317430343722014-11-06T06:29:00.001-05:002014-11-06T06:30:22.633-05:00Useful examples for language learnersThe odd choices of example sentences that sometimes show up in these "teach yourself to speak..." type books along with phrase books has been rightly mocked in the past. In fact, the subtext of this blog's title references <a href="https://en.wikipedia.org/wiki/English_As_She_Is_Spoke" target="_blank">just such a phrase book</a>.<br />
<a name='more'></a><br />
Recently, Radiolab ran a program called <a href="http://www.radiolab.org/story/translation/" target="_blank">translation</a>, and started each segment with Robert Krulwich imitating a language lesson LP…with the twist of
it being an LP that helps us to learn Robert’s imaginary native tongue, "Luden". The phrases chosen start out a little strangely (e.g., <i>my mother wrote the best poem</i>) and then get progressively more fanciful and bizarre. It all makes sense in the context of the program, and I really encourage you to listen to the whole thing. <br />
<br />
I was reminded of this because I've started to study Old English. Now phrases that are useful for learning a modern language, phrases like <i>what time is it</i> and <i>what does this mean</i>, are really quite pointless for learning Old English, because you're never going to speak it with anyone. Instead, you use it to read old texts. The result is that you get to study examples like this:<br />
<br />
<div style="background-color: white; color: black; font-family: arial; font-size: 20px; margin-bottom: 1.5em; margin-left: 1.5em; margin-right: 1.5em; margin-top: 1.5em; text-align: center;">
<div id="qa">
for þan iċ hine sweorde swebban nelle<br />
<hr id="answer" style="background-color: #cccccc; margin-bottom: 1em; margin-left: 1em; margin-right: 1em; margin-top: 1em;" />
therefore I will not kill him with a sword</div>
</div>
<br />
and<br />
<br />
<div style="background-color: white; color: black; font-family: arial; font-size: 20px; margin-bottom: 1.5em; margin-left: 1.5em; margin-right: 1.5em; margin-top: 1.5em; text-align: center;">
<div id="qa">
þū scealt yfelum dēaðe sweltan<br />
<hr id="answer" style="background-color: #cccccc; margin-bottom: 1em; margin-left: 1em; margin-right: 1em; margin-top: 1em;" />
you must die by a wretched death</div>
</div>
<br />
Good times!Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com1tag:blogger.com,1999:blog-31830497.post-72685113751730972892014-10-06T12:30:00.000-05:002014-10-06T12:30:00.616-05:00On meeting 'otiose' twice againI asked <a href="http://www.ling.upenn.edu/~myl/" target="_blank">Mark Liberman</a> to have a look at <a href="http://english-jack.blogspot.co.uk/2014/10/on-meeting-otiose-twice-in-day.html" target="_blank">what I wrote yesterday</a> since I was struggling to get my head around the probabilities. He was kind enough to write the following guest post:<br />
<hr />
Maybe a better way of thinking about it is this:<br />
<div>
<br /></div>
<div>
Say
the probability that word w_i will be selected at random from a
collection of text is P(w_i). Then assuming independence, the
probability that the next word will NOT be w_i is (1-P(w_i)), and the
probability of failing to find w_i in N successive draws is</div>
<div>
<br /></div>
<div>
(1-P(w_i))^N</div>
<div>
<br /></div>
<div>
If P(w_i) is 1/10^7 (one in ten million), and N is 1000, then we get</div>
<div>
<br /></div>
<div>
(1-(1/10^7))^1000</div>
<div>
<br /></div>
<div>
which
is 0.9999. So if we take notice of a rare-ish (P = 1/10000000) word,
and draw 1,000 other words at random looking to see it again, then 9,999
times out of 10,000, we'll fail to find the moderately rare word we
were waiting for. And if we draw 10,000 additional words instead of
1,000, the probability of failure is still</div>
<div>
<br /></div>
<div>
(1-(1/10^7))^10000 = 0.999</div>
<div>
<br /></div>
<div>
so we're still gonna fail 999 times out of a thousand.</div>
<div>
<br /></div>
<div>
But
the thing is, Rare Words Are Common. That is, a large proportion of
word tokens belong to relatively rare types. So suppose that there are
10,000 other words of approximately equal rareness, and every time we
see one of them, we set a subconscious process to watch for recurrences
of that word within the next thousand instances.</div>
<div>
<br /></div>
<div>
If
we do this a thousand times, then the chances of failure (for a
thousand instances of noting a rare word and looking for it to occur
again) become </div>
<div>
<br /></div>
<div>
((1-(1/10^7))^1000)^1000 = about 0.9</div>
<div>
or</div>
<div>
((1-(1/10^7))^10000)^1000 = about 0.368</div>
<div>
<br /></div>
<div>
So
if you do enough reading for these conditions to be satisfied once a
day, you should expect to have this experience several times a week.</div>
<div>
<br /></div>
<div>
Now,
none of this reasoning really applies, because you aren't picking words
at random from a well-mixed urn, you're reading them in order in
coherent text. And words in coherent text are far from independent
Bernoulli trials -- when a rare word appears, the probability that it
will appear again before long in the same text is massively increased by
topic effects (and to a lesser extent style and priming effects). But
this just means that the experience should be more common rather than
less common -- unless you insist that the texts be separate and on
different topics, and so forth, in which case it gets complicated.</div>
<div>
<br /></div>
<div>
But
still, I think that the real puzzle is not why you had this apparently
odd experience, but why we occasionally notice the kinds of
coincidences that are in fact rather common.</div>
<div>
<br /></div>
<div>
This
is not an unimportant question, since it has a lot to do with the
genesis of superstition (and probably science, for that matter...)</div>
<hr />
The above is a guest post by <a href="http://www.ling.upenn.edu/~myl/" target="_blank">Mark Liberman</a>.
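The figures in the guest post can be rerun with a short Python sketch. It assumes exactly what the post assumes (independent draws, a word probability of one in ten million); the function and variable names are mine, chosen for illustration.

```python
def miss_probability(p: float, draws: int) -> float:
    """Chance of never drawing a word of probability p in `draws` independent draws."""
    return (1.0 - p) ** draws


p = 1e-7  # one in ten million, the value used in the guest post

# One rare word noticed, then watched for over the next 1,000 or 10,000 words.
one_watch_1k = miss_probability(p, 1_000)
one_watch_10k = miss_probability(p, 10_000)

# A thousand such watches, all of which must fail.
thousand_watches_1k = one_watch_1k ** 1_000
thousand_watches_10k = one_watch_10k ** 1_000

print(f"{one_watch_1k:.4f}  {one_watch_10k:.4f}")
print(f"{thousand_watches_1k:.3f}  {thousand_watches_10k:.3f}")

# If the 10,000-word condition is met once a day, the expected number of
# days per week with at least one recurrence is:
print(f"{7 * (1 - thousand_watches_10k):.1f}")
```

The last line is one way of reading "several times a week": seven days times the daily chance of at least one success.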
Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-19482284986385430402014-10-05T05:36:00.002-05:002014-10-07T08:42:00.721-05:00On meeting 'otiose' twice in a dayWell, not in the same day, but certainly within a 24-hour period. As I was lying in bed last night, reading Charles Mann's <i>1493</i>, I came across the phrase <i>the <u>otiose</u> Percy </i>on p. 78.<br />
<br />
As of this morning, I've read to p. 90, so that's about 4,500 words later. I also read a few NY Times articles, adding perhaps another 1,200 words. And then I set about to edit an article for <i><a href="http://www.teslontario.net/publication/contact-magazine" target="_blank">Contact</a>, </i>the TESL Ontario magazine for which I'm the editor. Almost immediately, I came across a quote from David Crystal in which he wonders,<br />
<blockquote class="tr_bq">
whether the presence of a global language will eliminate the demand for world translation services, or whether the economics of automatic translation will so undercut the cost of global language learning that the latter will become <u>otiose</u>.</blockquote>
<a name='more'></a>So that's twice in about 6,000 words. What are the chances of this? [See the <a href="http://english-jack.blogspot.co.uk/2014/10/on-meeting-otiose-twice-again.html" target="_blank">update</a>]<br />
<br />
Well, not liking to leave rhetorical questions hanging, I set out to see. The adjective <i>otiose </i>(roughly meaning "of no value or use") occurs at a rate of 0.03 times per million words <a href="http://corpus.byu.edu/coca/?c=coca&q=33444656" target="_blank">in COCA</a>. That's about once every 32 million words or so. Google Books put it at a higher rate: one in 10 million. So let's say my chances of meeting it are about 1 in 15 million words (0.00000667%).<br />
<br />
As I say, we're looking at a window of 6,000 words, so we divide 6,000 by 15,000,000 to get 0.0004. Multiply this by the original odds to get a 1 in 37.5 billion (0.0000000027%) chance of meeting <i>otiose </i>once in a span of 6,000 words.<br />
<br />
My odds of doing that twice in 6,000 words (assuming these are independent events, which is more or less right) are simply the odds of doing it once squared. So, that's about 1 in 1.4 sextillion.<br />
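For anyone who wants to rerun the multiplication, here is a minimal Python sketch of the steps as described: window over rate, times the rate, then squared. It follows the post's own steps rather than a textbook binomial calculation, and the variable names are assumptions made for illustration.

```python
from fractions import Fraction

rate = Fraction(1, 15_000_000)   # assumed chance that any given word is "otiose"
window = 6_000                   # words read in the span

# Step 1: the window as a fraction of 15 million words.
window_fraction = Fraction(window, 15_000_000)

# Step 2: multiply by the original odds, as the post does.
once = window_fraction * rate

# Step 3: square it for two meetings, assuming independence.
twice = once ** 2

print(float(window_fraction))
print(f"1 in {int(1 / once):,}")
print(f"1 in {float(1 / twice):.3e}")
```

Exact fractions are used so that no rounding creeps in before the final division.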
<br />
So that's, like, wow!<br />
<br />
But.<br />
<br />
That's the odds of <b>me</b> doing it in this particular stretch of 6,000, not the odds of some person doing it some time.<br />
<br />
Let's say the average English reader reads 10,000 words per day and there are about 300 million people who read English. That's about 3 trillion words per day. So<span class="st">—</span>this is where I lose confidence that I'm getting the reasoning right<span class="st">—</span>the odds of some English speaker reading <i>otiose </i>twice in the span of 6,000 words are about one in 468,750,000 each day or one in about 1.28 million per year.<br />
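The scaling step can be sketched the same way. The inputs are the ones stated in the post (10,000 words per reader per day, 300 million readers, and the squared single-reader odds); dividing the odds by the daily word count is the post's method, reproduced here without any claim that it is the right model. All names are illustrative assumptions.

```python
readers = 300_000_000
words_per_reader_per_day = 10_000
words_per_day = readers * words_per_reader_per_day

# Single-reader odds of two meetings in one 6,000-word span, following the
# post's steps: ((6,000 / 15,000,000) * (1 / 15,000,000)) squared.
one_in_single_reader = 37_500_000_000 ** 2

# The post's scaling: divide the odds by the words read per day, then by 365.
one_in_per_day = one_in_single_reader // words_per_day
one_in_per_year = one_in_per_day / 365

print(f"{words_per_day:,} words per day")
print(f"1 in {one_in_per_day:,} per day")
print(f"1 in {one_in_per_year:,.0f} per year")
```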
<br />
Given that, I think it's most likely that no other person has ever accomplished this feat.<br />
<br />
Except you.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-34677483668064298602014-09-02T16:23:00.003-05:002014-09-02T16:23:39.644-05:00A title misparsedThis morning, I was reading <a href="http://www.newstatesman.com/future-proof/2014/08/tropes-vs-anita-sarkeesian-passing-anti-feminist-nonsense-critique" target="_blank">this article</a> at <i>New Statesman,</i> when I came across the following:<br />
<blockquote class="tr_bq">
Yet surely, when night after night atrocities are served up to us as
entertainment, it's worth some anxiety. We become clockwork oranges if
we accept all this pop culture without asking what's in it.</blockquote>
The plural <i>clockwork oranges</i> suddenly threw into sharp relief the title of Burgess's book <i>A clockwork orange. </i>For some reason that I am unable to articulate now, if I ever was aware of it, I had always parsed that title like this:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-uO7J9v9LDA4/VAXq4Lw_BSI/AAAAAAAAAdI/ZjrlTwpUYWU/s1600/syntax_tree(16).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-uO7J9v9LDA4/VAXq4Lw_BSI/AAAAAAAAAdI/ZjrlTwpUYWU/s1600/syntax_tree(16).png" /></a></div>
That is to say, I took <i>orange </i>to be a postpositive modifier of <i>clockwork </i>(like <i>proof <u>positive</u>, governor <u>general</u>, the city <u>proper</u>, </i>etc.)<i> </i>instead of <i>clockwork </i>as an attributive modifier of <i>orange, </i>like this:<i><br /></i><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<i><a href="http://1.bp.blogspot.com/-v6FqJKSLsbw/VAXrTbmGsdI/AAAAAAAAAdQ/Bwr_feHoiIU/s1600/syntax_tree(17).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-v6FqJKSLsbw/VAXrTbmGsdI/AAAAAAAAAdQ/Bwr_feHoiIU/s1600/syntax_tree(17).png" /></a></i></div>
This was, I must admit, an odd and, even to me, puzzling title, but then it's an odd and puzzling book, so I just rolled with it. As I say, it was the plural <i>oranges </i>that made me see the light: adjectives don't do plurals.<br />
<br />
I somehow overlooked the frequency of <i>clockwork </i>as a modifier, which should have tipped me off: in COCA, almost 40% of all instances of <i>clockwork</i> are attributive modifiers. Another thing that I was aware of, but which just seemed like more of the weirdness, is that <i>clockwork</i> is rarely--but sometimes--countable, so <i>a clockwork </i>is kinda weird, but not totally beyond the pale.<br />
<br />
Perhaps one thing that pushed me to the first analysis was the stress pattern. Usually, in an NP with a noun as modifier, the modifier gets the main stress in the NP. It's a <br />
<ul>
<li><i><b>FA</b>culty <b>of</b>fice</i>, not <i><b>fa</b>culty <b>OF</b>fice</i>, </li>
<li><i><b>SOC</b>cer <b>ball</b>, </i>not <i><b>soc</b>cer <b>BALL</b>, </i>and <i> </i></li>
<li><i>po<b>LICE</b> <b>of</b>ficers, </i>not <i>po<b>lice</b> <b>OF</b>ficers.</i> </li>
</ul>
My impression is that people tend to say <i>a <b>clock</b>work <b>ORANGE</b></i>, rather than <i>a <b>CLOCK</b>work <b>orange. </b></i>This is the same pattern you get with postpositive modifiers like <i><b>proof</b> <u><b>PO</b>sitive</u>.</i><br />
<br />
Whatever the reason, what really impressed me is how decades of misapprehension can be overcome by a single choice example.<i> </i>Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com7tag:blogger.com,1999:blog-31830497.post-72306340675352926122014-08-19T07:13:00.001-05:002014-08-19T07:13:28.470-05:00Antedating "determinative"The OED gives:
<blockquote class="tr_bq">
<h3>
<b>b.</b> Gram. determinative adjective, determinative pronoun, etc. (see quots.); determinative compound = <a href="http://www.oed.com.rap.ocls.ca/view/Entry/198095#eid19025346" rel="198095" rev="/view/Entry/198095#eid19025346" target="_blank">tatpurusha n.</a></h3>
</blockquote>
<div>
<div>
<blockquote>
<div>
1921 E. Sapir <i><a href="https://www.blogger.com/null" rel="0029091">Lang.</a></i> vi. 135
The words of the typical suffixing languages (Turkish, Eskimo,
Nootka) are ‘determinative’ formations, each added element determining
the form of the whole anew.</div>
<div>
1924 H. E. Palmer <i><a href="https://www.blogger.com/null" rel="0081990">Gram. Spoken Eng.</a></i> ii. 24
To group with the pronouns all determinative adjectives..shortening the term to <i>determinatives</i>.</div>
<div>
1933 L. Bloomfield <i><a href="https://www.blogger.com/null" rel="0012790">Language</a></i> xiv. 235
One can..distinguish..<wbr></wbr>determinative (attributive or subordinative) compounds (Sanskrit <i>tatpurusha</i>).</div>
<div>
1961 R. B. Long <i><a href="https://www.blogger.com/null" rel="0066502">Sentence & its Parts</a></i> 486
<i>The, a</i>, and <i>every</i> are exceptional among the determinative pronouns in requiring stated heads.</div>
</blockquote>
Today, I was reading Kellner's <a href="https://archive.org/details/historicaloutlin00kelliala" target="_blank"><i>Historical outlines of English syntax</i></a> from 1892 and came across the following on pp. 113–114 (emphasis added):<br />
<br />
<div style="margin-left: 40px;">
In Old English the possessive pronoun, or, as the French say, "pronominal adjective," expresses only the conception of belonging and possession ; it is a real adjective, and does not convey, as at present, the idea of <b>determination</b>. If, therefore, Old English authors want to make such nouns <b>determinative</b>, they add the definite article : </div>
<blockquote class="tr_bq">
<div style="margin-left: 40px;">
"hæleð min se leofa" (my dear warrior). <i>—Elene</i>, 511.<br />
"ðu eart dohtor min seo dyreste" (thou art my dearest daughter). —<i>Juliana</i>, 193.
</div>
</blockquote>
<div style="margin-left: 40px;">
§179. In Middle English the possessive pronoun apparently has a <b>determinative</b> meaning (as in Modern English, Modern German, and Modern French) ; therefore its connection with the definite article is made superfluous, while the indefinite article is quite impossible. Hence arises a certain
embarrassment with regard to one case which the language cannot do
without. </div>
<div style="margin-left: 40px;">
</div>
<div style="margin-left: 40px;">
Suppose we want to say "she is in a castle belonging to her,"
where it is of no importance whatever, either to the speaker or
hearer, to know whether "she" has got more than one castle, how could the
English of the Middle period put it? The French of the same age said still "un sien castel," but that was no longer possible in English.</div>
<div style="margin-left: 40px;">
</div>
<div style="margin-left: 40px;">
<br />
§180. We should expect the genitive of the personal pronoun ("of me," &c., as in Modern German)—and there may have been a time when this use prevailed—but, so far as I know, the language decided in favour of the more complicated construction "of mine, of thine," &c.<br />
<br />
This was, in all probability, brought about by the analogy of the very numerous cases in which the <b>indeterminative</b> noun connected with mine, &c., had a really partitive sense (cf. the examples below), and, further, by the remembrance of the old construction with the possessive pronoun.<br />
</div>
And later:<br />
<br />
<div style="margin-left: 40px;">
Later on, the possessive pronoun apparently implies a <b>determinative</b> meaning (as in Modern German and Modern French) ; therefore its connection with the definite article is made superfluous, while the indefinite article is quite impossible. Instead of the old construction we find henceforth what may be termed the genitive pseudo-partitive. See above, 178–180.</div>
</div>
</div>
Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-11595583415547500232014-07-07T07:24:00.001-05:002014-07-07T11:35:15.204-05:00Proscribing, narrowlyOver at the NYT, Alexander Nazaryan has a rather strident article about "<a href="http://www.nytimes.com/2014/07/07/opinion/the-fallacy-of-balanced-literacy.html" target="_blank">The fallacy of balanced literacy</a>." Therein, he writes, "balanced literacy is an especially irresponsible
approach, given that New York State has adopted the federal Common Core
standards, which skew toward a narrowly proscribed list of texts,
many of them nonfiction." [Now changed to <i>narrowly prescribed.</i>]<br />
<br />
These texts are prescribed. That is, they're
imposed, not declared unacceptable or invalid. Nevertheless, the Google Books corpus suggests <i>narrowly proscribed </i>is a new and growing phrase.
<iframe frameborder="0" height="500" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=narrowly+prescribed%2Cnarrowly+proscribed&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cnarrowly%20prescribed%3B%2Cc0%3B.t1%3B%2Cnarrowly%20proscribed%3B%2Cc0" vspace="0" width="600"></iframe><br />
So, I'm curious: was this simply a typo, or did he have in mind some
metaphor of narrowing down by proscription? Or was it something else?Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-27983662844795935942014-06-23T08:48:00.001-05:002014-06-23T08:48:43.621-05:00Thinking like a freakI listen to the Freakonomics Radio podcast from time to time, and back in May they aired an episode called "<a href="http://freakonomics.com/2014/05/15/the-three-hardest-words-in-the-english-language-a-new-freakonomics-radio-podcast/" target="_blank">the three hardest words</a>...," which, purportedly, were <i>I don't know</i>. The premise was that people hate to admit ignorance and so they hardly ever say, "I don't know."<br />
<br />
Except that in most corpus studies, the head-and-shoulders most common, number one, top-of-the-heap three-word string in English is <i>I don't know</i>. (It's a three-word string, not four, since -<i>n't</i> is an inflectional suffix, not just a contraction as is taught in elementary schools, but that's another issue.) For instance, in the <a href="http://www.ngrams.info/download_coca.asp" target="_blank">3-grams list</a> from the Corpus of Contemporary American English, <i>I don't know </i>is by far the most frequent 3-gram with 199,110 instances (second is <i>one of the </i>at 167,785). In business meetings, we find the same results. Consider table 3.10 on p. 59 of <a href="http://books.google.ca/books?id=A7qsmaIvVTwC&lpg=PA39&ots=TCJ6Z76c_j&dq=corpus%20of%20business%20meetings&pg=PA59#v=onepage&q&f=false" target="_blank">this book</a>, or table 5.8 on p. 183 of <a href="http://etheses.nottingham.ac.uk/1893/" target="_blank">this paper</a>.<br />
<br />Now, these are not mostly "I don't know (period)." Far more commonly, they're "I don't know if..." "I don't know what..." etc., which can often be used as a signal of disagreement rather than as an admission of ignorance. Nevertheless, the data stands in rather stark contradiction to the freaky claim. It looks pretty silly to be saying people should fess up to their ignorance, while basing the argument on a point on which you're so ignorant that you assert the most common phrase is the least (or at least the hardest).<br />
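For readers unfamiliar with how such 3-gram counts are made, here is a minimal Python sketch. The sample text is invented purely for illustration (it is not corpus data), and the tokenizer is an assumption: it keeps <i>don't</i> as a single token, in line with treating -<i>n't</i> as a suffix rather than a separate word.

```python
import re
from collections import Counter


def trigrams(text):
    """Count three-word strings, keeping forms like "don't" as one token."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Note: this simple version lets 3-grams run across sentence boundaries.
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


# Made-up sample, not drawn from COCA or any other corpus.
sample = (
    "I don't know if it works. I don't know what he said. "
    "I know one of the answers."
)

counts = trigrams(sample)
print(counts.most_common(1))
```

A tokenizer that split <i>don't</i> into <i>do</i> and <i>n't</i> would turn <i>I don't know</i> into a 4-gram, which is why the choice matters for the counts quoted above.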
<br />
(If you're interested in other freaky foolishness, see Joseph Heath's <a href="http://induecourse.ca/think-like-a-jackass/" target="_blank">recent post</a> on their simplistic view of the UK medical system.)Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-3135841532575389382014-05-16T16:32:00.002-05:002014-05-16T16:32:48.193-05:00Audio and the OEDAs I mentioned, <i>Schwa Fire</i> is now out, and I've been quite enjoying it. Arika Okrent (whose name I have inexplicably misread for years as Akira) has written an article called "Ghost voices" about preserving audio-tape recordings of our all-too-impermanent voices, dialects, and languages. As I was reading it, it occurred to me that the <i>OED</i> should include audio recordings of the quotations it uses. These should be in the dialect, and where possible the actual voice, of the original author.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-26503196423103728262014-05-16T16:23:00.001-05:002014-05-16T16:23:39.718-05:00Schwa FireBack in November, 2013, there was a proposal on Kickstarter for a new language magazine. I chipped in to sponsor it and ended up on the editorial panel as a result. The first issue is now out.<br /><br />
Issue 1, Season 1<br /><br />May 16, 2014• <a href="http://schwa-fire.com/" target="_blank">Schwa Fire</a><br /><br />The golden age of language journalism begins now. In this inaugural issue, Arika Okrent tells the story of 5,700 hours of Yiddish recordings that were almost lost ("Ghost Voices"), and Russell Cobb writes about Americans' fondness for the Englishes we used to speak and what that fondness obscures ("The Way We Talked"). Michael Erard describes and defends "language journalism," and Robert Lane Greene provides a lesson on the languages of love ("Wooing in Danish"). Also included: an English homophone puzzle.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-37708227545855317872014-05-15T07:41:00.003-05:002014-05-15T08:30:17.179-05:00When "syndrome" is a final "s"1982 gave us the acronym <i>AIDS</i> formed from acquired immune deficiency syndrome. This is pronounced /eɪdz/. The fact that the final <i>S</i> is pronounced /z/ is notable, since a final <i>s</i> is typically pronounced /s/ (e.g., <i>bus</i>) unless it is an inflectional morpheme (e.g., <i>dogs</i>). There are cases such as <i>news </i>and <i>lens, </i>in which a final <i>s </i>is pronounced /z/, but the <i>-s </i>in <i>news </i>was originally a plural morpheme. That leaves <i>lens, </i>which comes from the Latin word for <i>lentil. </i><a href="https://en.wiktionary.org/wiki/lens#Latin" target="_blank">Apparently</a>, it was pronounced /leːns/ in Latin, so why it has a final /z/ in English is something of a mystery to me. I cannot find another example of an English noun with a final <i>s </i>pronounced /z/.<br />
<br />
This brings us back to <i>AIDS</i>. Presumably, this final /z/ was influenced by the homographs <i>aids,</i> the noun, and <i>aids, </i>the verb. But then in 2003 we got <i>SARS</i>. There is no English word <i>sar</i>, so there is no preexisting homograph from which to analogously get /sɑɹz/, but that is the only pronunciation I've ever heard. I've never heard anyone say /sɑɹs/. So this seems to be an extension of the <i>AIDS </i>analogy to <i>aids.</i><br />
<br />
And now today we have <i>MERS</i>. On CBC's Metro Morning this morning, Matt Galloway started out pronouncing it /mɜɹs/, which initially threw me. I'd been mentally pronouncing it with a final /z/, and indeed Galloway finished up with /mɜɹz/. (I couldn't tell what the person he was interviewing was saying, but I suspect she was using the /z/ form, given his shift.)<br />
<br />
So perhaps we have a new rule developing: acronym-final<i> s</i> for <i>syndrome </i>is pronounced /z/.<br />
<br />
As I was looking around writing this post, it appears that at least <a href="http://crofsblogs.typepad.com/h5n1/2014/05/pronouncing-mers.html" target="_blank">one other person</a> has taken note of the pronunciation of <i>MERS</i>.<br />
<br />
[<a href="https://www.blogger.com/profile/13684304410735867148" target="_blank">John Wells</a> points out "Latin fifth-declension nouns in -<i>es</i> have final
noninflectional /z/ in English, too: <i>species</i>, <i>series</i>... Why that should
apply to <i>MERS</i> is a further question, which I cannot answer: but compare <i>
Mars</i>."] Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com6tag:blogger.com,1999:blog-31830497.post-70308333726950723442014-04-22T08:07:00.003-05:002014-04-22T08:07:34.145-05:00It's turtles round and round<b>Part I</b><br />
I've been trying to understand categories better, and one of the books I've been reading in pursuit of this goal is George Lakoff's <i><a href="https://en.wikipedia.org/wiki/Women,_Fire,_and_Dangerous_Things" target="_blank">Women, Fire, and Dangerous Things</a>.</i> In fact, a few nights ago, I fell asleep reading it, and it must have stirred something in my mind because the next morning in the shower, it occurred to me that perhaps categories are just a distraction and it's really properties we should be looking at.<br />
The category of red things is just a human convenience. But red is a property. Almost immediately, though, I realized that red is a category of electromagnetic waves and that electromagnetic waves are themselves a category. And from there, well, it's <a href="https://en.wikipedia.org/wiki/Turtles_all_the_way_down" target="_blank">turtles all the way down</a>. I set the idea aside as I dried myself and got ready to leave home.<br />
<b><br /></b>
<b>Part II</b><br />
When I got to work, our Blackboard system was down, so I couldn't do the grading I had intended to do. Distractedly I opened the Simple English Wiktionary and saw that somebody had edited the entry for the preposition <i><a href="https://simple.wiktionary.org/wiki/pace#Preposition" target="_blank">pace</a>. </i>The change was an improvement on what I thought was a rather odd previous definition. Going through the history, though, I noticed that the older definition was one I had provided. Curious about what I had been thinking, I went to the OED's entry for <i>pace. </i>There I found the following example:<br />
<blockquote class="tr_bq">
<span class="noIndent" id="eid73822130">1995 <em><a class="sourcePopup" href="https://www.blogger.com/null" rel="0046815">Computers & Humanities</a></em> <strong>29</strong> 404/1,</span>
I do not believe, <em>pace</em> Peirce and Derrida, that <span style="background-color: white;"><span style="color: red;">it is signs all the way down</span></span>.</blockquote>
This struck me as a huge coincidence. The expression shows up in the Corpus of Contemporary American English about once per 150 million words. I had encountered it, or a variation thereof, twice in a morning.<br />
<b><br /></b>
<b>Part III</b><br />
I looked up the expression and found that Wikipedia has an entry (linked above). One of the citations listed there is to John (Haj) Ross's 1967 linguistics dissertation <i>Constraints on Variables in Syntax</i>. I followed the link and opened up his dissertation, which indeed contains the story with the line "It's turtles all the way down."<br />
On the next page were the Acknowledgements, which list a number of linguists, but on page x, Ross writes,<br />
<blockquote class="tr_bq">
This thesis is an integral part of a larger theory of grammar which George Lakoff and I have been collaborating on for the past several years.</blockquote>
This is, of course, the same Lakoff whose book I had been reading the night before.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com1tag:blogger.com,1999:blog-31830497.post-9431637329201834912014-04-11T10:54:00.003-05:002014-04-11T10:56:19.131-05:00Looking to the futurateThe verb <i>look</i> has been used to talk about the future for a long time. Perhaps the most common use is in the expression <i>look forward to </i>(something). This use may be based on the metaphor that time is a landscape we move through. As such, our future should be visible to us. This is probably the same metaphor that underlies the use of <i>go </i>for the future in expressions like <i>we're going to get to that in a moment</i>.<br />
<br />
Despite its venerable history, <a href="https://en.wiktionary.org/wiki/futurate" target="_blank">futurate</a> <i>look</i> began a significant upsurge in about 1980, particularly in the <i>looking + to </i>infinitive construction.<br />
<iframe frameborder="0" height="260" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=looking+to+_VERB_&year_start=1900&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Clooking%20to%20_VERB_%3B%2Cc0" vspace="0" width="600"></iframe>
I noticed that this seems to be particularly common with <i>are</i>.
<iframe frameborder="0" height="260" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=be_INF+looking+to&year_start=1900&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t3%3B%2Cbe_INF%20looking%20to%3B%2Cc0%3B%2Cs0%3B%3Bare%20looking%20to%3B%2Cc0%3B%3Bwas%20looking%20to%3B%2Cc0%3B%3Bwere%20looking%20to%3B%2Cc0%3B%3Bis%20looking%20to%3B%2Cc0%3B%3Bbe%20looking%20to%3B%2Cc0%3B%3Bbeen%20looking%20to%3B%2Cc0%3B%3Bam%20looking%20to%3B%2Cc0" vspace="0" width="600"></iframe>
Especially, <i>you are</i>.
<iframe name="ngram_chart" src="https://books.google.com/ngrams/interactive_chart?content=*+are+looking+to&year_start=1900&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t2%3B%2C*%20are%20looking%20to%3B%2Cc0%3B%2Cs0%3B%3Byou%20are%20looking%20to%3B%2Cc0%3B%3Bwho%20are%20looking%20to%3B%2Cc0%3B%3Bwe%20are%20looking%20to%3B%2Cc0%3B%3Bthey%20are%20looking%20to%3B%2Cc0%3B%3Bpeople%20are%20looking%20to%3B%2Cc0%3B%3BWe%20are%20looking%20to%3B%2Cc0%3B%3Band%20are%20looking%20to%3B%2Cc0%3B%3BThey%20are%20looking%20to%3B%2Cc0%3B%3Bcompanies%20are%20looking%20to%3B%2Cc0%3B%3Bthat%20are%20looking%20to%3B%2Cc0" width=600 height=260 marginwidth=0 marginheight=0 hspace=0 vspace=0 frameborder=0 scrolling=no></iframe>
And even more specifically <i>if you are</i>.
<iframe name="ngram_chart" src="https://books.google.com/ngrams/interactive_chart?content=*+you+are+looking+to&year_start=1900&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t2%3B%2C*%20you%20are%20looking%20to%3B%2Cc0%3B%2Cs0%3B%3BIf%20you%20are%20looking%20to%3B%2Cc0%3B%3Bif%20you%20are%20looking%20to%3B%2Cc0%3B%3Bwell%20you%20are%20looking%20to%3B%2Cc0%3B%3Bthat%20you%20are%20looking%20to%3B%2Cc0%3B%3Bwhen%20you%20are%20looking%20to%3B%2Cc0%3B%3Band%20you%20are%20looking%20to%3B%2Cc0%3B%3Bwhat%20you%20are%20looking%20to%3B%2Cc0%3B%3BWhether%20you%20are%20looking%20to%3B%2Cc0%3B%3Bwhether%20you%20are%20looking%20to%3B%2Cc0%3B%3BWhen%20you%20are%20looking%20to%3B%2Cc0" width=600 height=260 marginwidth=0 marginheight=0 hspace=0 vspace=0 frameborder=0 scrolling=no></iframe>
By this time we are looking at a small minority of the cases. But I wondered if it might tell us something about the meaning of the <i>looking</i> futurate as opposed to the <i>going</i> futurate. When I started looking at various corpus genres, though, things got a little too complex. Maybe you have some thoughts to add.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-42673070709211278212014-04-08T11:20:00.002-05:002014-04-08T11:20:41.494-05:00The opacity of etymologyThe word <i>disseminate </i>is a familiar one. It appears hundreds of times on my hard drive and in well over 20 email messages I've read or written in the last five years. But until today, I had never seen the seeds in the word<i>.</i><br />
<br />
We often use a plant metaphor to talk about words. Morphology is a branch of linguistics just as plant morphology is a branch of biology. Both sciences talk of roots and stems, but in linguistics, seeds aren't part of the metaphor.<br />
<br />
The root of <i>disseminate </i>though is <i>semin </i>or <i>semen, </i>from the Latin word meaning seed. Dissemination is the spreading of seeds. <i>Semin </i>is also the root of the word <i>seminary, </i>literally a seed plot, but now metaphorically used to mean a place to train priests. This is also where <i>seminar</i> comes from. I hadn't connected up these words either.<br />
<br />
<i>Disseminate</i> appears in a passage that my level-8 class is studying. It's a passage that I've been over many times with other classes, and I have had to explain the word before. But never before have I made this connection.<br />
<br />
How could this be?<br />
<br />
On the flip side of this are cases of people seeing connections where there are none. Consider the <a href="https://en.wikipedia.org/wiki/Controversies_about_the_word_%22niggardly%22" target="_blank">regular flare-ups</a> about the word <i>niggardly</i>, which stem from a mistaken perception that it's based on a racial slur. There are so many opportunities for false positives. The string <i>semen, </i>for instance, shows up in a variety of words such as <i>basement </i>and <i>horsemen, </i>and nobody would ever think it referred to seeds.<br />
<br />
How does this noticing thing work?Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com3tag:blogger.com,1999:blog-31830497.post-1057520678739170032013-09-06T09:42:00.001-05:002013-09-06T09:42:11.217-05:00Contact Magazine summer issueThe summer issue is now available <a href="http://www.teslontario.net/uploads/publications/contact/ContactSummer2013.pdf" target="_blank">here</a>. Check it out.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-28238638935418011062013-06-28T12:07:00.002-05:002013-06-28T12:07:32.858-05:00New issue of Contact availableThe newest issue of TESL Ontario's <i>Contact</i> magazine is <a href="http://www.teslontario.net/uploads/publications/researchsymposium/ResearchSymposium2013.pdf" target="_blank">now available</a>. This is the research symposium issue, based on talks given each fall at TESL Ontario's conference. The issue is edited by Hedy McGarrell and David Wood. The authors are: Dianne Larsen-Freeman, Antonella Valeo, Farahnaz Faez, Douglas Fleming, Alister Cumming, Robert Kohls, Hedy McGarrell, David Wood, and Randy Appel.<br />
<br />
Please, check it out.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-90851166437977911012013-05-27T20:28:00.001-05:002013-05-27T20:28:39.507-05:00Then: the meaning changesA student wrote to me asking about the meaning and placement of <i>then</i> in this sentence.<div>
<blockquote class="tr_bq">
<i>The product is then designed with this target price in mind.</i></blockquote>
I explained that <i>then</i> could be resultative (meaning something like "so that" or "in that case") or simply temporal (meaning roughly "next" or "after that"). I also pointed out that it could occur initially, medially (in a number of places), or finally:<br />
<blockquote class="tr_bq">
<u>1</u><i> The product is </i><u>2</u> <i>designed with this target price in mind </i><u>3</u><i>.</i></blockquote>
Then, it occurred to me that position 1 could only be interpreted as temporal, position 2 was ambiguous, and position 3 strongly suggested the resultative meaning.<br />
<br />
Is it just me?</div>
Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com4tag:blogger.com,1999:blog-31830497.post-1248698162645985122013-03-05T21:46:00.001-05:002013-03-05T21:46:29.529-05:00New issues of Contact availableI'm now into my second year as editor of TESL Ontario's <a href="http://www.teslontario.net/publication/contact-magazine" target="_blank"><i>Contact</i> magazine</a> (volumes 38 & 39). You can download issues for free as far back as 2000. We just put out the <a href="http://www.teslontario.net/uploads/publications/contact/ContactSpring2013.pdf" target="_blank">conference issue</a> today, as we do every spring. I hope you enjoy it. Please leave me a comment to let me know what you think.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-63392273719928835812012-10-29T11:00:00.001-05:002013-07-17T15:24:47.074-05:00Education First's English Proficiency IndexLast year, I blogged about the first English Proficiency Index <a href="http://english-jack.blogspot.ca/2011/07/studies-about-international-english.html" target="_blank">here</a>. Education First has now released their <a href="http://www.ef.com/__/~/media/efcom/epi/2012/full_reports/ef-epi-2012-report-master-lr-2" target="_blank">second index</a>, and they're getting some good publicity out of it (I'm linking to them). The same problems as last year recur: self-selection of participants and the need for a computer with internet access to take the test. There's also no data on test reliability and no validity arguments. Still, with 1.7 million participants over three years in more than 50 countries, you wouldn't want to completely dismiss the results either.<br />
<br />
<a name='more'></a>One of the interesting things was the levels of adult English learners in English-speaking countries, which can be found on p. 34. Canada and Australia have "high proficiency" learners, while the UK's are "mid proficiency" and the US's "low proficiency". There is some attempt to explain this in terms of language teaching infrastructure and other variables. Also check out the regional <a href="http://www.ef.com/epi/downloads/" target="_blank">info-graphics</a>.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com2tag:blogger.com,1999:blog-31830497.post-29861676027427954102012-10-25T08:49:00.001-05:002012-10-25T08:49:25.479-05:00ER-Central.com<div class="tr_bq">
</div>
<br />
Another new and potentially useful site:<br />
<blockquote class="tr_bq">
<a href="http://er-central.com/">Extensive Reading Central</a> is a not-for-profit organization dedicated to developing an Extensive Reading and Extensive Listening approach to foreign and second language learning. It was started by Dr. Rob Waring of Notre Dame Seishin University, Okayama, Japan and Dr. Charles Browne of Meiji Gakuin University, Tokyo, Japan as a free service to the EFL community.</blockquote>
<br />
<a name='more'></a>This site will consolidate a number of existing sites, including Rob Waring's pages <a href="http://robwaring.org/er/">robwaring.org/er/</a> and Tom Robb's site <a href="http://extensivereading.net/">extensivereading.net</a>. The <a href="http://erfoundation.org/" target="_blank">ER foundation</a>, though, will remain a separate site.<br />
<br />
Writes Rob,<br />
<br />
<blockquote>
Our hope is that it will act as the go-to place for ER/EL. Please take a look. It has information for teachers, authors and publishers. There are how-to's, 90 videos about ER, links, downloads, and lots more - take a look. There's a student page where we'll add some student stuff in the future. We also want to allow the publishers to make announcements about their products, ask for help with, say, piloting materials, and so on. They are part of our community too. </blockquote>
<blockquote>
We'll be adding a discussion section, advice center and so on in the coming weeks. We also hope to have pages in different languages. All in time. </blockquote>
<blockquote>
ER-Central will open an Amazon Affiliate bookstore soon. Note that the site has cost a lot of money to build and any earnings from this will pay for the development costs and go towards paying the annual bandwidth costs ($650 per year). We expect to make a huge loss each year. However, if it does turn a profit, it will be used to buy books for disadvantaged schools. We hope we don't have to resort to advertising. </blockquote>
<blockquote>
So how can you help?<br /><ol>
<li>If you have any handouts, presentations, worksheets, webpages, etc. you wish to add to the site, please use the upload form on the top page to send stuff to me and I'll link it in.</li>
<li>Spread the word on <a href="https://www.facebook.com/pages/ER-Central/390953537620090" target="_blank">Facebook</a>, Twitter or however you like </li>
<li>Help translate some pages</li>
<li>Find dead links and send new ones</li>
<li>Make comments on pages</li>
<li>Leave announcements about ER events etc.</li>
<li>Buy books from the store (available in a week or so)</li>
<li>Leave suggestions for features you want - we'll try to add them</li>
<li>The site is in WordPress (which both Charlie and I are still learning). If anyone can help design the site or help us add images and various widgets, let me know and I'll add you as an administrator. </li>
</ol>
</blockquote>
Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-61281154522735336402012-10-25T08:31:00.000-05:002012-10-25T08:31:36.501-05:00TutelaTutela has been in beta for about a year but had its official launch during the TESL Canada conference. From their terms of use:<br />
<blockquote class="tr_bq">
<a href="http://tutela.ca/">Tutela.ca</a> (“Service”) is a Canadian not-for-profit online repository and community for ESL (English as a Second Language) and FSL (French as a Second Language) professionals registered to use the Service (“Users”). As a repository, the Service provides Users with access to ESL and FSL materials including classroom materials, lesson plans, assessment information, and reusable learning objects. As a community, the Service enables Users to share materials, discover new approaches, locate solutions, and network including through the use of the online meeting and webinar conferencing capabilities of the Service (“Conferencing”). The Service is supported by funding from Citizenship and Immigration Canada (“CIC”) and is <b>owned and operated by Citadel Rock Online Communities Inc.</b> (“Citadel”). (my bold)</blockquote>
I'm not really sure how this works. Tutela itself is not for profit, but <a href="http://venturebeatprofiles.com/company/profile/citadel-rock-online-communities" target="_blank">Citadel</a> is a privately owned corporation, for profit as far as I can tell. It has received a <a href="http://www.cic.gc.ca/english/disclosure/grants/2011-Q1/g-118.asp" target="_blank">number</a> of <a href="http://www.cic.gc.ca/english/disclosure/grants/2012-Q2/g-022.asp" target="_blank">grants</a> from the federal government. Just so as you know...Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-81648344070130063462012-10-20T09:03:00.000-05:002012-10-20T09:03:35.002-05:00More Google Ngrams 2.0 POS tagging<br />
As I <a href="http://english-jack.blogspot.ca/2012/10/google-ngrams-20-and-pos-tagging.html" target="_blank">wrote yesterday</a>, there are some strange tagging decisions concerning determinatives in this corpus. It seems, though, as I added in an update, that these are largely the fault of the <a href="ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz">Part-of-Speech Tagging guidelines</a> for the Penn Treebank Project.<br />
<br />
The problems, though, are not limited to determinatives. Subordinators are also affected. The words <i>that, whether, </i>and <i>if</i> (e.g., <i>They told me <u>that</u> it was OK </i>and <i>They asked me <u>whether/if</u> it was OK</i>) are tagged as _ADP_ (short for <span style="font-size: xx-small;">ADPOSITION</span>, a more inclusive term for prepositions). The guidelines say:<br />
<blockquote>
"We make no explicit distinction between prepositions and subordinating conjunctions. (The distinction is not lost, however - a preposition is an IN that precedes a noun phrase or a prepositional phrase, and a subordinate conjunction is an IN that precedes a clause).<br />The preposition "to" has its own special tag TO."</blockquote>
This makes good sense for words like <i>because, after, </i>and <i>since, </i>which have been treated as "subordinating conjunctions" but <a href="http://english-jack.blogspot.ca/2007/05/bain-on-prepositions.html" target="_blank">really are prepositions</a> and function as the heads of preposition phrases. It doesn't work, though, with <i>that, whether, </i>and<i> if, </i>which function as markers of subordination and not heads. Consider the difference between these two clauses:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-YX3BQNZDz7A/UIKu7f6H_qI/AAAAAAAAAVs/uIpBuL5_plA/s1600/stgraph.png.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="82" src="http://4.bp.blogspot.com/-YX3BQNZDz7A/UIKu7f6H_qI/AAAAAAAAAVs/uIpBuL5_plA/s400/stgraph.png.png" width="400" /></a></div>
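The rule in the guidelines quoted above can be written out as a few lines of Python, which makes its limits easier to see. This is only a rough sketch: the category labels (NP, PP, S), the function names, and the example words are hand-assigned for illustration, and the IN-to-ADP step is my assumption about how the single Penn tag ends up as _ADP_ in the Ngram data. Nothing here calls a real tagger or parser.

```python
# Sketch of the quoted Penn guideline: one tag (IN) covers both classes,
# and the distinction is read off whatever follows the word.
# All labels and examples below are hand-assigned, hypothetical inputs.

PHRASAL = {"NP", "PP"}  # noun phrase or prepositional phrase
CLAUSAL = {"S"}         # clause


def classify_in(following: str) -> str:
    """Classify an IN-tagged word by the category of what follows it."""
    if following in PHRASAL:
        return "preposition"
    if following in CLAUSAL:
        return "subordinating conjunction"
    return "unclear"


def coarse_tag(penn_tag: str) -> str:
    """Collapse a Penn tag to a coarse tag (assumed mapping: IN -> ADP)."""
    return {"IN": "ADP"}.get(penn_tag, penn_tag)


examples = [
    ("after", "NP"),   # after [the meeting]
    ("after", "S"),    # after [they left]
    ("that", "S"),     # They told me that [it was OK]
    ("whether", "S"),  # They asked me whether [it was OK]
]

for word, following in examples:
    print(word, coarse_tag("IN"), classify_in(following))
```

Under this rule, <i>after</i> before a clause and <i>that</i> before a clause come out with the same label, so the head-versus-marker difference never enters the picture.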
Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-37239931478640935132012-10-20T08:05:00.000-05:002012-10-20T08:05:27.547-05:00Language Learner Literature Award WinnersSomehow I missed the announcement, but the 2012 LLL Award Winners for books published in 2011 <a href="http://erfoundation.org/wordpress/?page_id=214" target="_blank">have been announced</a>.Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0tag:blogger.com,1999:blog-31830497.post-20236147253446149882012-10-19T09:56:00.001-05:002012-10-20T18:19:50.773-05:00Google Ngrams 2.0 and POS taggingAs Ben Zimmer <a href="http://languagelog.ldc.upenn.edu/nll/?p=4258" target="_blank">blogged yesterday</a>, there's a new and improved version of the <a href="http://books.google.com/ngrams" target="_blank">Google Ngram viewer</a>. The "improved" bit has a number of elements, but one is POS tagging. This is a wonderful thing, and I'm inordinately happy about it. Unfortunately, there are some very odd quirks to deal with.<br />
<br />
<a name='more'></a>The subordinators <i>that, whether, </i>and <i>if</i> (e.g., <i>she asked me <u>whether</u></i>/<i><u>if</u> I'd be able to go</i>; <i>he told me <u>that</u> he'd be able to go</i>) are tagged as _ADP_ (adposition, a more general term for prepositions). I've never seen such a classification, and it strikes me as deeply strange.<br />
<br />
The second quirk is their list of _DET_ (determinatives, or what they call "determiners"). I'm happy to report that they do not include the dependent possessive pronouns (<i>my, your, his, her, our, </i>etc.). These are tagged as _PRON_.<br />
<br />
<div class="p1">
It seems that the following words are tagged as determiners at least part of the time:</div>
<div class="p2">
<br /></div>
<table cellpadding="0" cellspacing="0" class="t1" style="width: 137.0px;">
<tbody>
<tr>
<td class="td1" valign="middle"><div class="p1">
a</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
all</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
an</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
another</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
any</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
both</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
each</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
every</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
some</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
the</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
these</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
this</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
those</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
whatever</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
which</div>
</td>
<td class="td2" valign="middle"><div class="p3">
100%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
whichever</div>
</td>
<td class="td2" valign="middle"><div class="p3">
94%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
no</div>
</td>
<td class="td2" valign="middle"><div class="p3">
88%</div>
</td></tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
neither</div>
</td>
<td class="td2" valign="middle"><div class="p3">
56%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
that</div>
</td>
<td class="td2" valign="middle"><div class="p3">
43%</div>
</td>
</tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
either</div>
</td>
<td class="td2" valign="middle"><div class="p3">
42%</div>
</td></tr>
<tr>
<td class="td1" valign="middle"><div class="p1">
what</div>
</td>
<td class="td2" valign="middle"><div class="p3">
3%</div>
</td>
</tr>
</tbody>
</table>
<div class="p2">
<br /></div>
<div class="p1">
Apart from these, <i>many</i> and <i>much </i>are usually tagged _ADJ_ (and _ADV_ for <i>much</i>), but less than 1% of the time they get tagged _DET_. I believe all cases tagged _ADJ_ should have been _DET_, so I have no idea what distinction is being made here.</div>
<div class="p1">
<br /></div>
<div class="p1">
I believe that <i>which </i>is generally a determinative, but in relative uses, it's usually a pronoun:</div>
<div class="p1">
<i>They could be late, <u>which</u> would be a problem.</i> [pron]</div>
<div class="p1">
<i>They could be late, in <u>which</u> case we would have a problem.</i> [det]</div>
<div class="p1">
<br /></div>
<div class="p1">
The following words are generally considered determiners (at least in some cases) and yet they are never tagged as such in the corpus:</div>
<div class="p2">
<br /></div>
<div class="p1">
<i>few, fewer, fewest, last, least, less, little, more, most</i></div>
<div class="p1">
<br /></div>
<div class="p1">
Some other words which are determinatives in some accounts but not here are:</div>
<div class="p1">
</div>
<div class="p1">
<i>a few, </i><i>a little, </i><i>anyone, </i><i>anything, </i><i>certain, </i><i>enough, </i><i>everything, </i><i>none, </i><i>said, </i><i>us, </i><i>we, </i><i>you</i></div>
<br />
<div class="p1">
Finally, the list above doesn't appear to be exhaustive as I cannot get to 100% when dividing by _DET_, even when I include upper case first letters and all upper case. I wonder what I'm missing.<br />
<br />
[<span style="color: red;">Update:</span> It seems that these oddities are based mostly on the <a href="ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz" target="_blank">Part-of-Speech Tagging guidelines</a> for the Penn Treebank Project (3rd Revision, 2nd Printing) by Beatrice Santorini.<br />
<blockquote class="tr_bq">
This category includes the articles <i>a</i>(<i>n</i>),<i> every, no </i>and<i> the,</i> the indefinite determiners <i>another, any </i>and <i>some, each, either</i> (as in <i>either way</i>), <i>neither</i> (as in <i>neither decision</i>), <i>that, these, this</i> and <i>those, </i>and instances of <i>all</i> and <i>both</i> when they do not precede a determiner or possessive pronoun (as in <i>all roads</i> or <i>both times</i>). (Instances of <i>all</i> or <i>both</i> that do precede a determiner or possessive pronoun are tagged as predeterminers (PDT).) Since any noun phrase can contain at most one determiner, the fact that such can occur together with a determiner (as in <i>the only such case</i>) means that it should be tagged as an adjective (JJ), unless it precedes a determiner, as in <i>such a good time, </i>in which case it is a predeterminer (PDT).</blockquote>
This explains the missing <i>m </i>determiners, but it doesn't explain why a small subset of <i>many</i> and <i>much</i> are tagged as _DET_. ]<br />
<br />
[<span style="color: red;">Update 2</span>: The instances of <i>many_</i>DET are mostly of the form <i>many a, </i>for example <i>many a day went by. </i>Under the Penn system, this is a predeterminer. Thanks to Slav Petrov for solving this puzzle!]</div>
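The two updates fit together if the coarse tags are produced by collapsing the finer Penn tags. Here is a minimal Python sketch of that idea. The mapping table is my assumption (I have not seen the one actually used for the corpus), the name `PENN_TO_COARSE` is made up, and the Penn tags on the example phrases are assigned by hand from the guidelines quoted above and from Update 2.

```python
# Assumed many-to-one mapping from Penn Treebank tags to the coarse tags
# seen in the Ngram data. Hypothetical table, for illustration only.
PENN_TO_COARSE = {
    "DT": "DET",     # determiner
    "PDT": "DET",    # predeterminer
    "WDT": "DET",    # wh-determiner
    "JJ": "ADJ",     # adjective
    "PRP$": "PRON",  # possessive pronoun
}


def coarse(penn_tag: str) -> str:
    """Return the coarse tag for a Penn tag, or the tag itself if unmapped."""
    return PENN_TO_COARSE.get(penn_tag, penn_tag)


# (phrase, target word, hand-assigned Penn tag)
examples = [
    ("many days", "many", "JJ"),
    ("many a day went by", "many", "PDT"),
    ("all roads", "all", "DT"),
    ("the only such case", "such", "JJ"),
    ("such a good time", "such", "PDT"),
]

for phrase, word, penn in examples:
    print(f"{phrase!r}: {word}_{coarse(penn)}")
```

On these assumptions, ordinary <i>many</i> surfaces as _ADJ_ and only the <i>many a</i> cases surface as _DET_, which is the pattern described in Update 2.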
Bretthttp://www.blogger.com/profile/02870575277556244419noreply@blogger.com0