This article is from the series Meditations from the Track Changes Column
In the course of copyediting, I often find it useful to nose around in (aka research) what great authors of the past did. The sorts of points I seek insights into include examples of word usage, what preposition a verb most often takes, whether to use a comma in “Yes, sir”, and other subtleties of punctuation.
To aid myself, I’ve accumulated a small library of fine literature in plain text format, currently numbering twenty-seven books, and including works by Charles Dickens, F. Scott Fitzgerald, Henry James, H.L. Mencken, Edith Wharton, P.G. Wodehouse, and a translation of the Bible in modern English. These are in the public domain, acquired from Project Gutenberg.
I’ve hoped to add more volumes and more authors to this collection, except that, masterpieces though they may be, these books are venerably old from the standpoint of contemporary publishing practice, and many styles that were current in Dickens’s day, or even Wodehouse’s more recent era, are not those of today. Newer books are more difficult to come by, at least legally. The truth is, I don’t know where to get them illegally, either. I’d love to have plain text versions of Updike, Wallace, DeLillo, and even the likes of Hemingway and Steinbeck, plus a number of non-fiction texts, but I’m unlikely to ever get them, short of scanning them with an optical character reader myself (which I ain’t gonna do), because they are carefully guarded. (And I don’t mean to suggest that I would want them illegally, for I am a respecter of copyrights.)
Much of the same information, drawn from books published up to the year 2000, is available from Google Books, particularly using Ngrams, but specific examples require more digging and clicking. Sometimes the effort yields useful examples, but it can also be a pain and more trouble than it’s worth.
Plain text files are searchable using standard Unix-type commands or programs written in a programming language such as Perl (my personal favorite), which allow me to filter and format the results any way I wish. Therefore, using my skills as a former software engineer, I’ve devised a number of tools to get at information.
To use one of the examples above, I find that in this collection “Yes, sir” (with a comma) occurs 224 times (spoken most frequently by Jeeves to Bertie Wooster in P.G. Wodehouse’s books), and only once without the comma, likewise in a conversation between Jeeves and Bertie Wooster in the midst of many others that do have the comma — so doubtless a copyediting oversight! I conclude from the data thus obtained that it’s best to use the comma in dialogue that contains words that follow the model “Yes, sir”. (Many patterns fit the model.)
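A search like this one can be sketched in a few lines. The version below is only an illustration in Python, not the Perl tooling described above; the function names, the regular expression, and the assumption that the library is a directory of .txt files are all hypothetical.

```python
import re
from pathlib import Path

# Hypothetical pattern: "Yes", an optional comma, whitespace (including a
# line break), then "sir" as a whole word. Matching is case-sensitive.
YES_SIR = re.compile(r"\bYes(,?)\s+sir\b")


def count_yes_sir(text):
    """Return (with_comma, without_comma) counts for "Yes, sir" / "Yes sir"."""
    with_comma = without_comma = 0
    for match in YES_SIR.finditer(text):
        if match.group(1):
            with_comma += 1
        else:
            without_comma += 1
    return with_comma, without_comma


def count_in_library(directory):
    """Sum the counts over every .txt file in a directory (assumed layout)."""
    total_with = total_without = 0
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        with_comma, without_comma = count_yes_sir(text)
        total_with += with_comma
        total_without += without_comma
    return total_with, total_without


sample = '"Yes, sir," said Jeeves. "Yes sir?" "Yes,\nsir."'
print(count_yes_sir(sample))  # (2, 1)
```

Allowing any whitespace between the two words matters in plain text files, where a phrase can be split across a line break.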
Recently I wondered about the average word length (in letters per word) within a book. This information is easily obtained by counting the characters and dividing by the number of words. There’s a Unix tool, wc(1), to get the numbers, and a script can gather them and do the dividing. The result is not precisely accurate, because the Unix tool counts as a word every group of characters separated by whitespace, so punctuation and numbers and various oddities skew the number correspondingly. But as averages across a library of books with the same constraints, the figures are good enough for comparison, which I imagine is why the tool was written near the dawn of the Unix era.
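The calculation can be approximated as follows. This is a Python sketch rather than the actual script, and it assumes the character count comes from wc -c (bytes, with spaces and line breaks included) and the word count from wc -w; the function names are hypothetical. If the count is taken that way, the whitespace adds roughly one character per word to the average, so a stricter letters-only variant is shown for comparison.

```python
def average_word_length(text):
    """Approximate `wc -c` divided by `wc -w` for a string of text."""
    n_chars = len(text.encode("utf-8"))  # wc -c counts bytes, whitespace included
    n_words = len(text.split())          # wc -w counts whitespace-separated runs
    if n_words == 0:
        raise ValueError("text contains no words")
    return n_chars / n_words


def letters_only_average(text):
    """Stricter variant: alphabetic characters only, per whitespace-separated word."""
    n_words = len(text.split())
    if n_words == 0:
        raise ValueError("text contains no words")
    n_letters = sum(ch.isalpha() for ch in text)
    return n_letters / n_words


line = "He was not wont to bloviate.\n"
print(round(average_word_length(line), 3))   # 4.833
print(round(letters_only_average(line), 3))  # 3.667
```

Either measure works for comparing books, provided the same one is applied to every book in the library.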
The range from author to author and book to book is not as broad as you might think. A calculation to several decimal places is in order. My script calculates to fifteen decimal places, but about three places seems to be adequate for discussion purposes.
So take a guess: what do you think the range would be among these highly literate authors? The shortest average among all of them is the Bible at 5.377 letters per word (lpw). The modern book with the shortest words is (believe it or not) Charles Dickens’s Great Expectations, with an average of 5.514 lpw, and the longest is 6.121 lpw by H.L. Mencken, who may have had the largest vocabulary of any English-speaking person who ever lived. Amusingly, the book with that count is titled: Damn! A Book of Calumny. Apparently the man even knew how to cuss in words of more than four letters.
From that analysis we see that the range from shortest to longest average word length is well less than a letter per word. Sounds about right to me.
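The arithmetic behind that range, using only the three averages quoted above, can be checked with a short Python snippet (the dictionary layout is just for illustration):

```python
# Averages quoted above, in letters per word (lpw).
averages = {
    "the Bible (modern English translation)": 5.377,
    "Great Expectations": 5.514,
    "Damn! A Book of Calumny": 6.121,
}

shortest = min(averages, key=averages.get)
longest = max(averages, key=averages.get)
spread = averages[longest] - averages[shortest]

print(f"{spread:.3f}")  # 0.744
```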
Recently I edited one of the most horrendously bloated books I’ve ever laid eyes on. The author was a thesaurus diver, determined to seek out the longest and least common word in every possible case. It’s no exaggeration to say that in one out of three instances he used the more obscure words incorrectly. My task became an arduous one of consulting the dictionary, mind-reading, and replacing incorrect and rare words with ones his readers (as few as they will be, mostly his relatives) would be likely to recognize. In time it dawned on me that this guy may have used longer words on average than any author I’ve ever encountered — which made me wonder: How much longer? So I saved the document’s body text to a plain text file and made calculations as described above. (It was a very long book, too, over 500 manuscript pages.) The number I came out with was 6.821 lpw, vastly longer than H.L. Mencken’s erudite habit. (Most importantly, Mencken used and spelled all the words right, as his monumental three-volume work The American Language demonstrates conclusively.)
My favorite sentence from this editing job, said in regard to one of the author’s primary subjects of discussion, says:
He was not wont to bloviate.
Wont means inclined, and to bloviate means to speak verbosely and windily. How ironic that such was not the author’s own inclination, and that at six words in a book where sentences of thirty to sixty words are legion, it was also likely the shortest sentence in the book.
If the author was trying to impress readers that he’s smart, then bzzzt! Big mistake! No person, no matter how intellectual, actually talks like that. What he left instead was quite the opposite impression.
In contrast, the very next project I worked on was written by an author who describes himself as dyslexic and unable to read until after he left school. He has the vocabulary of a fourth-grader (through no fault of his own), and the average word length on his project came out to 4.401 lpw. It was the longest work of fiction I’ve ever edited, by about 20 percent. But it took me far less time to edit it than the previous book.
 Speaking of subtle things, did you notice that the previous sentence contains a subtlety of punctuation? And have you ever noticed that the spelling of subtle is subtle?
 The form wc(1) is standard Unix man (manual) page syntax.
 Aka the Unix epoch, which began at midnight Coordinated Universal Time (UTC) on Thursday, January 1, 1970, and from which many computer programs count time in seconds. I don’t know when wc(1) was created, but I’m rather certain that, given its nature and typical use, it had to be among the first tools that Ken Thompson and Dennis Ritchie provided when they created Unix.