Thursday, July 31, 2008

Numbers Used to Quantify Writing Styles

At the end of May, Microsoft gave up in the race with Google to digitize all the world’s printed books and magazines. The motivation for these ambitious projects is to allow search engine technology to reach inside the pages of every book in print. These corporations envision a future with no need to actually thumb through a book at a store or library in search of information. Search engines will perform the task much faster and return the exact location inside of any book for any text or keyword a user is seeking.

Whether such a future is a good or bad one for writers and publishers remains to be seen. Publishers have been less than enthusiastic about these digitization projects because of fears of copyright infringement and potential loss of sales. My own view as a writer is that exposure is a good thing. Obscurity is a greater threat to a writer’s livelihood than copyright infringement.

Microsoft’s explanation for abandoning the project came in classic corporate-speak. On a Microsoft blog, Satya Nadella, senior vice president for search, portal, and advertising, wrote:

“Given the evolution of the Web and our strategy, we believe the next generation of search is about the development of an underlying, sustainable business model for the search engine, consumer, and content partner. For example, this past Wednesday we announced our strategy to focus on verticals with high commercial intent, such as travel, and offer users cash back on their purchases from our advertisers.”

If I ever need to write a parody of a corporate memo, these sentences are a good starting point. Just how many buzzwords (evolution, development, strategy, sustainable, verticals, model, etc.) are packed into these two sentences of 65 words? With digitized books, questions like that can actually be answered. In fact, Google and Microsoft have been playing catch-up with Amazon’s “Search Inside the Book” program, launched in 2004. Publishers are now encouraged to submit electronic copies of printed books for Amazon to digitize and make searchable.

But Amazon’s “Search Inside the Book” program allows for more than just keyword searches. You can now look at statistics about the writing style of a book before you purchase it. I checked Amazon’s listing for my book, The Two Headed Quarter, and discovered all sorts of numerical facts that I did not know. My book has 672,289 characters arranged into 99,104 words. As expected in a book about deceptive numbers, the word “number” appears frequently, 567 times, but it is only the second most frequently occurring word. I had no idea that the most frequently used word in my book, appearing 709 times, is “years.”

But most fascinating is the statistical summary of my writing style that Amazon provides. Potential buyers can view the “readability” and “complexity” analysis of the text. My book rates as highly readable with a Flesch-Kincaid index of 7.2, meaning that you only need a 7th grade education to read it. According to Amazon, 77% of all other books are harder to read, and in the category of personal finance books, 93% are more difficult to read. According to the complexity analysis, the easy reading arises in large part from a simple writing style averaging only 8.6 words per sentence.
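For readers curious how sentence length feeds into the index, here is a minimal sketch of the standard published Flesch-Kincaid grade-level formula. Whether Amazon uses exactly this formula is an assumption, and the word, sentence, and syllable counts below are hypothetical, not taken from the book.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level computed from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical text: 1,000 words and 1,500 syllables in both cases.
long_sentences = flesch_kincaid_grade(1000, 50, 1500)    # 20 words per sentence
short_sentences = flesch_kincaid_grade(1000, 100, 1500)  # 10 words per sentence

print(round(long_sentences, 2), round(short_sentences, 2))
```

With the vocabulary held fixed, cutting the average sentence from 20 words to 10 lowers the grade level by 0.39 × 10 = 3.9, so the sentence count has a large effect on the result.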

Now that last number threw me. I opened my book to some random pages and found myself hard-pressed to find sentences as short as 8 to 9 words, let alone enough sentences of fewer than 8 words to make 8.6 the average. Amazon’s number for my average words per sentence just didn’t seem reasonable to me. I decided to run my own check using Microsoft Word’s analysis tools on some chapters from the original manuscript. I had never done that kind of analysis on my writing before, which shows you how much I think about readability statistics when I’m writing. The results:

Chapter 1: 18.0 words per sentence, Flesch-Kincaid index 10.1
Chapter 2: 17.2 words per sentence, Flesch-Kincaid index 10.7
Chapter 5: 18.9 words per sentence, Flesch-Kincaid index 11.4
Chapter 8: 19.3 words per sentence, Flesch-Kincaid index 10.8

This means the readability of my writing is consistently at a 10th to 11th grade level (not 7th grade) and about average in complexity. According to Rudolf Flesch, co-inventor of the Flesch-Kincaid scale, the average sentence found in reading material runs 17 words.
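A rough consistency check is possible with the numbers quoted so far. It assumes both tools use the standard Flesch-Kincaid formula, in which each extra word per sentence adds 0.39 to the grade level, and it treats the four sampled chapters as representative by taking simple unweighted averages. Both are assumptions, not facts from Amazon or Microsoft.

```python
# Figures quoted in the post.
chapter_words_per_sentence = [18.0, 17.2, 18.9, 19.3]
chapter_grades = [10.1, 10.7, 11.4, 10.8]
amazon_words_per_sentence = 8.6
amazon_grade = 7.2

# Assumption: unweighted means over the four sampled chapters.
mean_wps = sum(chapter_words_per_sentence) / len(chapter_words_per_sentence)
mean_grade = sum(chapter_grades) / len(chapter_grades)

# Sentence-length coefficient in the standard Flesch-Kincaid formula.
SENTENCE_LENGTH_WEIGHT = 0.39

observed_gap = mean_grade - amazon_grade
predicted_gap = SENTENCE_LENGTH_WEIGHT * (mean_wps - amazon_words_per_sentence)

print(f"observed gap:  {observed_gap:.2f} grade levels")
print(f"predicted gap: {predicted_gap:.2f} grade levels")
```

Both gaps come out between 3.5 and 4 grade levels. Under these assumptions, the difference in sentence length alone is enough to account for most of the distance between Amazon’s grade level and Word’s.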

So how can Amazon’s statistics be so skewed? I don’t know, but I have a good idea. Again, my book is about numbers and there are many numbers in the book. There are figures and tables with numbers, along with worked examples containing numbers that allow readers to figure out numerical answers to many common financial questions. Many of the numbers in the book contain decimal points. My hypothesis is that the software Amazon uses to perform readability analysis cannot tell the difference between a period and a decimal point. After all, the two are the same character. Only an actual reader who can interpret the meaning of a sentence knows that a decimal point inside a number does not terminate a sentence.
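The hypothesis is easy to illustrate. The sketch below is not Amazon’s software, whose workings are unknown; it is a hypothetical naive sentence counter compared with one that ignores periods followed by a digit. The sample text is invented for the illustration, and words are counted by whitespace in both cases to keep the comparison simple.

```python
import re

# Naive rule: every period, exclamation point, or question mark ends a sentence.
NAIVE_END = r"[.!?]"
# Decimal-aware rule: a period followed by a digit is treated as a decimal point.
AWARE_END = r"[.!?](?!\d)"

# Invented sample text with the kind of decimals found in a finance book.
SAMPLE = (
    "A rate of 6.5 percent on a balance of 1250.75 dollars adds 81.30 "
    "dollars in interest each year. At 4.25 percent the cost falls to "
    "53.16 dollars."
)


def count_sentences(text, end_pattern):
    """Count the non-empty pieces left after splitting on end_pattern."""
    return sum(1 for piece in re.split(end_pattern, text) if piece.strip())


def words_per_sentence(text, end_pattern):
    """Average whitespace-separated words per detected sentence."""
    return len(text.split()) / count_sentences(text, end_pattern)


print(f"naive:         {words_per_sentence(SAMPLE, NAIVE_END):.1f} words per sentence")
print(f"decimal-aware: {words_per_sentence(SAMPLE, AWARE_END):.1f} words per sentence")
```

On this sample the naive rule finds 7 “sentences” where a reader sees 2, dropping the average from 13.5 words per sentence to about 3.9. The decimal-aware rule is still crude (abbreviations such as “e.g.” would fool it), but it shows how a text dense with decimals could cut a measured average roughly in half or worse.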

I find it ironic that a book about deceptive numbers has an Amazon listing with a bunch of deceptive numbers to characterize its readability. It is another example of companies spending time and resources to compile and publish useless numerical information. By the way, according to Amazon’s site, my book has 3,348 words per ounce, and if you buy it you will receive 4,715 words for each dollar you spend.

Joseph Ganem is a physicist and author of the award-winning The Two Headed Quarter: How to See Through Deceptive Numbers and Save Money on Everything You Buy
