Saturday, February 26, 2011

Enumerating the semicolon

The semicolon has an interesting history, which is traced rather well by Paul Collins in his 2008 Slate article. I thought I'd go back and have another look at some of the numbers, which can now be calculated with somewhat better reliability. 

This graph shows the rise and fall of the semicolon in English. Its popularity peaked right around 1800 at just shy of 1% of the words. That is to say, there were almost 10,000 semicolons per million words. That's a little more common than it is today.

Frequency of semicolon use (Source: Google Books Ngram viewer)
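
For anyone who wants to move between the two units used here (percent of words and instances per million words), the conversion can be sketched in a few lines of Python. The function names are made up for illustration and have nothing to do with the Ngram viewer itself.

```python
def percent_to_per_million(percent: float) -> float:
    """Convert a share of all words (in percent) to instances per million words."""
    return percent * 1_000_000 / 100


def per_million_to_percent(rate: float) -> float:
    """Convert instances per million words back to a percentage of all words."""
    return rate * 100 / 1_000_000


# The peak around 1800: just shy of 1% of the words.
print(percent_to_per_million(1.0))  # 10000.0
```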

The Ngram viewer is fantastic for the sheer number of texts it includes, but there are some problems, mainly related to not being able to see what's going on behind the scenes. These are exacerbated by the fact that we're working with a punctuation mark instead of a word, because Google proper doesn't allow punctuation searches. So if you want to get examples of semicolon use from the early 1500s, for example, you're out of luck.

But there are also good reasons to worry about the accuracy. See that little blip at the far right where, after 200 years of solid decline, there appears to be a resurgence? Well, restrict your search to just American English or just British English, and there's no uptick. So, is it the Canadians and the Australians who are fuelling this trend, or is the graph wrong? There's no way to know.

Then there's the question of how the semicolon is being used. My guess would be that over the last 200 years, the semicolon has been used more and more as a kind of super comma to delimit list items that themselves contain commas, especially in academic writing where multiple studies are listed this way (e.g., Moon, 2009; Yazdi, 1992). Unfortunately, none of my usual tools allow me to test this diachronically. In fact, both the COCA and COHA interfaces seem to have some kind of glitch when it comes to semicolons. I've notified Mark Davies, so I'll update this if he fixes it soon. [now fixed; update here]

All I'm left with, then, is the BNC. I searched there for a random set of 100 lines with a semicolon, finding 144 semicolons, of which 95 were list separators. The remaining 49 were clausal separators (e.g., They were triumphant; they had won the case.). Semicolons there occur at a frequency of about 1,000 instances per million words, which is a little more frequent than the most common of the words from the Academic Word List (e.g., area occurs about 800 times per million words), but it's much less frequent than the peak of the graph above.
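
The arithmetic behind those BNC numbers can be laid out as a quick Python sketch. The counts are the ones from my sample; the helper name and the idea of scaling the per-million figure by the sample shares are my own assumptions, and with only 100 lines the resulting split is a rough estimate at best.

```python
def semicolon_breakdown(total: int, list_separators: int, rate_per_million: float) -> dict:
    """Split a sample of semicolons by function and scale an overall rate accordingly.

    Assumes the sample is representative, which a 100-line sample may not be.
    """
    clausal = total - list_separators
    list_share = list_separators / total
    clausal_share = clausal / total
    return {
        "clausal": clausal,
        "list_share": list_share,
        "clausal_share": clausal_share,
        # Rough per-million estimates for each use (an extrapolation, not a measurement)
        "list_per_million": rate_per_million * list_share,
        "clausal_per_million": rate_per_million * clausal_share,
    }


sample = semicolon_breakdown(total=144, list_separators=95, rate_per_million=1000)
print(f"list separators: {sample['list_share']:.1%}")
print(f"clausal separators: {sample['clausal_share']:.1%}")

# How far the BNC figure sits below the roughly 10,000-per-million peak around 1800
peak_ratio = 10_000 / 1000
print(f"peak is about {peak_ratio:.0f} times the BNC rate")
```

On these numbers, about two thirds of the semicolons in the sample are list separators, and the peak rate in the graph is about ten times the BNC rate.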

1 comment:

Rae said...

I believe that I am responsible for the slight uptick on the right of the graph; I love semicolons!