English, Jack: On meeting 'otiose' twice again

Monday, October 06, 2014

I asked Mark Liberman to have a look at what I wrote yesterday since I was struggling to get my head around the probabilities. He was kind enough to write the following guest post:

Maybe a better way of thinking about it is this:

Say the probability that word w_i will be selected at random from a collection of text is P(w_i). Then assuming independence, the probability that the next word will NOT be w_i is (1-P(w_i)), and the probability of failing to find w_i in N successive draws is

(1-P(w_i))^N

If P(w_i) is 1/10^7 (one in ten million), and N is 1000, then we get

(1-(1/10^7))^1000

which is about 0.9999. So if we take notice of a rare-ish (P = 1/10,000,000) word, and draw 1,000 other words at random looking to see it again, then 9,999 times out of 10,000, we'll fail to find the moderately rare word we were waiting for. And if we draw 10,000 additional words instead of 1,000, the probability of failure is still

(1-(1/10^7))^10000 = 0.999

so we're still gonna fail 999 times out of a thousand.
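
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above. The function name miss_probability is only an illustrative label (it is not from the original discussion), and the sketch assumes the same independent-draws model.

```python
def miss_probability(p, n):
    """Chance of NOT drawing a word with per-draw probability p
    in n independent draws: (1 - p) ** n."""
    return (1.0 - p) ** n


p = 1 / 10**7  # one in ten million, as in the text

for n in (1_000, 10_000):
    # Print the failure probability for each number of draws.
    print(f"N = {n:>6}: P(miss) = {miss_probability(p, n):.4f}")
```

The two printed values should agree, to rounding, with the 0.9999 and 0.999 quoted above.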

But the thing is, Rare Words Are Common. That is, a large proportion of word tokens belong to relatively rare types. So suppose that there are 10,000 other words of approximately equal rareness, and every time we see one of them, we set a subconscious process to watch for recurrences of that word within the next thousand instances.

If we do this a thousand times, then the chances of failure (for a thousand instances of noting a rare word and looking for it to occur again) become, for watch windows of 1,000 and 10,000 words respectively,

((1-(1/10^7))^1000)^1000 = about 0.9

((1-(1/10^7))^10000)^1000 = about 0.368
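
The same figures can be reproduced with a few lines of Python. This is a rough sketch under the independence assumption used so far; the names all_watches_fail, window and watches are invented here for illustration.

```python
def all_watches_fail(p, window, watches):
    """Chance that every one of `watches` independent watches fails,
    where each watch looks through `window` random draws for a word
    of per-draw probability p."""
    single_watch_fails = (1.0 - p) ** window
    return single_watch_fails ** watches


p = 1 / 10**7  # per-draw probability of each rare word

for window in (1_000, 10_000):
    fail = all_watches_fail(p, window, watches=1_000)
    # 1 - fail is the chance that at least one watched word recurs.
    print(f"window = {window:>6}: P(all fail) = {fail:.3f}, "
          f"P(at least one recurrence) = {1 - fail:.3f}")
```

The second case is just (1 - 1/10^7) raised to the power 10^7, which is very close to 1/e, and that is where the 0.368 comes from.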

So if you do enough reading for these conditions to be satisfied once a day, you should expect to have this experience several times a week.
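
To put a rough number on "several times a week", here is a small sketch. It assumes (my reading, not something stated explicitly) that the daily conditions are a thousand watches with the 10,000-word window, and that the days are independent. All variable names are illustrative.

```python
p = 1 / 10**7            # per-draw probability of each rare word
window = 10_000          # assumed: words scanned per watch
watches_per_day = 1_000  # assumed: rare words noted per day

# Chance that a single watch succeeds (the word turns up again).
p_hit_one_watch = 1.0 - (1.0 - p) ** window

# Chance that a given day contains at least one successful watch.
p_day_with_hit = 1.0 - (1.0 - p_hit_one_watch) ** watches_per_day

# Expected number of days per week with at least one recurrence,
# and expected total number of recurrences per week.
days_per_week_with_hit = 7 * p_day_with_hit
expected_hits_per_week = 7 * watches_per_day * p_hit_one_watch

print(f"P(at least one recurrence in a day) = {p_day_with_hit:.3f}")
print(f"Days per week with a recurrence    = {days_per_week_with_hit:.1f}")
print(f"Expected recurrences per week      = {expected_hits_per_week:.1f}")
```

Under these assumptions the per-day chance comes out a bit under two in three, so either way of counting lands in "several times a week" territory.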

Now, none of this reasoning really applies, because you aren't picking words at random from a well-mixed urn, you're reading them in order in coherent text. And words in coherent text are far from independent Bernoulli trials -- when a rare word appears, the probability that it will appear again before long in the same text is massively increased by topic effects (and to a lesser extent style and priming effects). But this just means that the experience should be more common rather than less common -- unless you insist that the texts be separate and on different topics, and so forth, in which case it gets complicated.

But still, I think that the real puzzle is not why you had this apparently odd experience, but why we only occasionally notice the kinds of coincidences that are in fact rather common.

This is not an unimportant question, since it has a lot to do with the genesis of superstition (and probably science, for that matter...)

The above is a guest post by Mark Liberman.