John Markoff has an interesting article in yesterday's New York Times about the legal uses of natural language processing. Apparently, companies have begun using software to scan documents and e-mails for information relevant to legal cases rather than pay for hundreds of thousands of lawyer hours.
It all seemed very plausible until I ran into this sentence: “'You tend to split a lot fewer infinitives when you think the F.B.I. might be reading your mail,' said Steve Roberts, Cataphora’s chief technology officer.”
I don't know if he's speaking metaphorically or literally, but I have a tough time believing that people are more careful to apply the prescriptive rules they have been taught when they think they might be under investigation. I hope I'm wrong, not because I hope people split fewer infinitives, but because I hope the technology officer of an NLP company would be opposed to language bullshit. [Update, March 8: See clarification from Keith Schon below.]
Yes, that is a little strange - 'Hey, let's not use split infinitives because the FBI folks might be watching'? - come on.
The article does say '[a] shift in an author’s e-mail style, from breezy to unusually formal, can raise a red flag about illegal activity.' - which the infinitive thing is supposed to illustrate. The obvious implication is, split infinitives = breezy; fewer split infinitives = unusually formal.
I would refrain from making any judgment yet, though, because these kinds of observations are generally based on solid facts. Maybe they do see fewer split infinitives in dubious communications - in that particular kind of criminal investigation. That is, 'You tend to split a lot fewer ...' means 'These (semi-educated?) criminals tend to ...'.
But then, I don't know. But I am intrigued; I wait for more on this to come.
I work for Cataphora, and I was in the room when the "split infinitives" quote was said. It wasn't meant as a literal statement. The discussion was about how the level of formality in written communications often changes when the writer thinks the communication may be read by people other than the intended parties. We have seen this in real-world data, both because people watch what they write to a greater degree, and because they go back and "clean up" their existing data by deleting embarrassing or incriminating outbursts. This matters to us because it is one of the things we look for in criminal and civil investigations, as a way to locate important events.
With that said, we don't have data specifically about split-infinitives. That was just an off-the-cuff quip, which unfortunately may not come through in the article.
Also, to be clear, Cataphora is not primarily an NLP company. We're primarily interested in human behavior modeling. We do use NLP techniques, but we do a lot of other things as well, including social network analysis and various forms of clustering, as well as a lot of proprietary techniques that don't fit neatly into any of these.
Manager, Core Technology Group, Cataphora
Thanks for clearing that up, Keith! Glad to hear that's how it went.