Trends in classics: Can a machine review the reviews of books about books?

This is it. The big day. It’s what we’ve all been waiting for. In this, the third instalment of our series on the BMCR, we will finally get to the fireworks factory.

That is, we will finally start talking about the reviews themselves. More specifically, we’re going to use the reviews as a proxy for how good we think the last few decades of classical scholarship have been. Do the big-name publishing houses deserve a reputation for quality? Are books getting better or worse? Partial answers to these questions and more can be found after the jump!

Before we get to the bright colours and insightful analysis, you might want an explanation of what you will be seeing. I wrote a simple sentiment analysis algorithm, which I then applied to the text of the ca. 8000 BMCR reviews that were written in English. It operates at the level of individual sentences, determining whether each sentence offers praise of the book, criticism of it, or is simply neutral (which in this context usually means describing the contents of the book rather than commenting on quality). It also weights the scores by how strong the praise or criticism seems to be. We can then add up the scores for all of the sentences in the review (-1 for criticism, +1 for praise, 0 for neutral, and so on) to arrive at a single number for how positive or negative the review is. It does all this with ______ accuracy, where the blank could be filled by either ‘sufficient’ or ‘questionable’, depending on how generous you feel.(1) If you want a single number, by F1 score it gets things right ca. 70% of the time. That isn’t a great success rate, and there are several obvious ways to improve it, but there is no compelling reason to do so at the moment. It is accurate enough to find the trends we are looking for over a sample this large.
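To make the mechanics concrete, here is a minimal sketch of a lexicon-based scorer of the kind described above. The word lists, the helper names (`score_sentence`, `score_review`), and the intensifier boost are all illustrative stand-ins; the actual algorithm is not published here.

```python
import re

# Toy lexicons: illustrative stand-ins, not the real word lists.
PRAISE = {"excellent", "insightful", "valuable", "clear", "useful"}
CRITICISM = {"flawed", "confusing", "disappointing", "weak", "unconvincing"}
INTENSIFIERS = {"extremely", "totally", "really", "very"}

def score_sentence(sentence):
    """Return +1 for praise, -1 for criticism, 0 for neutral,
    with intensifying adverbs boosting the magnitude."""
    words = re.findall(r"[a-z]+", sentence.lower())
    boost = 1 + sum(w in INTENSIFIERS for w in words)
    praise = sum(w in PRAISE for w in words)
    criticism = sum(w in CRITICISM for w in words)
    if praise > criticism:
        return boost
    if criticism > praise:
        return -boost
    return 0

def score_review(text):
    """Split a review into sentences and sum the per-sentence scores
    into a single number for the whole review."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return sum(score_sentence(s) for s in sentences)

print(score_review("The argument is extremely clear. The index is weak."))  # 1
```

With lexicons this small, most real sentences would score as neutral; the point is only the shape of the approach: classify each sentence, weight by intensifiers, and sum.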

So, how do all of these books rate?

[Figure: histogram of BMCR sentiment scores. Both the mean and median are positive.]

Scores improve along the horizontal axis. Height of the columns indicates how many reviews got that score.

Pretty well. And they cover a fairly narrow distribution. A full 50% of the book reviews have scores between +2 and +7, with the lowest score at -22,(2) and the highest at +52, which is probably the result of a couple of reviewers who really like their intensifying adverbs, like ‘extremely’, ‘totally’, or, um… ‘really’. In terms of averages, the median score for the whole population is 4 (silver) and the mean is 4.96 (beige). On the whole, reviews tend to be positive: only 520 out of 8000 reviews had negative final scores (meaning they found more to criticize than to praise).

So that is what is happening across the whole sample, but we can break it down in a few ways. The next graph is interesting mostly for what it doesn’t show. I’ve broken down the reviews into books published by large publishers (more than 100 of their books reviewed by the BMCR), medium-sized publishers (between 20 and 100 books reviewed), and small publishers (the rest). I was surprised to find that there was virtually no difference between the three groups. I would have expected the OUPs and Harvards of the world to be able to attract a higher calibre of manuscript.

[Figure: box plots of sentiment scores by publisher size. The three distributions are mostly the same.]

Outside of archaeological publications, my sense is that box plots aren’t that common in classics. If you’ve never seen one before, here is how to read this one: the dark line shows the average (mean) for that group; the box covers the middle 50% of observations; outliers are drawn as circles; and the width of each box scales with the number of observations in its group. As you can see, our three groups are pretty similar.
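For readers who prefer a definition in code, here is a sketch of the statistics a box plot summarizes, using only Python's standard library. The function name `box_stats` and the conventional 1.5 × IQR whisker rule for flagging outliers are my assumptions; the post doesn't say which rule its plots use.

```python
import statistics

def box_stats(scores):
    """Compute the pieces a box plot draws: the quartile box,
    whisker limits under the common 1.5 * IQR rule, and the
    outliers that get plotted as individual circles."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "q1": q1, "median": median, "q3": q3,
        "mean": statistics.fmean(scores),
        "outliers": [s for s in scores if s < lo or s > hi],
    }

print(box_stats([1, 2, 3, 4, 5, 6, 7, 8, 30]))
```

On that toy sample the box runs from 2.5 to 7.5, the median line sits at 5, and the 30 falls outside the whiskers, so it would be drawn as a lone circle.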

In the next plot, each of the eleven largest publishing houses gets its own box. We can see a little more variation here.

[Figure: box plots of sentiment scores for the largest publishers.]

Well, that’s just Ducky!

So it seems like having some fake Texan butchering Cicero on C-SPAN isn’t the only thing diluting the Princeton brand.

Before I did this, if you had asked me which publisher consistently puts out the best books, I would not have guessed Duckworth. I don’t think Routledge would have been my second choice either, but there you have it. It is worth noting that Duckworth is among the smallest members of this fraternity, with only 116 of their books getting reviewed over the past 20 years.

[Editor’s note: the researcher may be surprised, but I’m less so. Since Duckworth and Routledge put out a lot of teaching-level books, their books invite less critical engagement than a theory-heavy Cal book.]

Since I’m coming to the end of my allotted thousand words, let’s take another look at how things have changed over time.

[Figure: box plots of the change in review sentiment over time.]

I wonder what might have soured everyone’s mood over there at the right….?

Once again, we see the strangeness of the current half-decade. Things hold almost eerily steady between 1996 and 2009, only to turn much more negative from 2010 to the present. Or, more accurately, we became more guarded and sparing in our praise. Or perhaps, as the number of books increased, they started to get worse. You can choose your own explanation—for now at least. I reserve the right to come back and pedantically insist on my explanation, once I decide what it is.

I was going to finish this off with a joke about how the tenor of reviews changes based on the time of year, but actually it doesn’t vary at all over the whole of the sample. It’s just a flat expanse where the average and range never vary. But rather than end on a down note, let’s marvel at how much more fun the 90s were. Back then, the tone (and number) of reviews fluctuated wildly from month to month:

[Figure: month-by-month analysis of reviews in the 1990s.]

May: “School’s out, I love everyone!” June: “It is too hot, and I hate this book.” December: “Heck no, I’m not reviewing a book over winter break. Take a hike!”

Just to make it clear, I don’t think there’s anything significant going on here. This graph was just for fun.

[Ed’s note: as always, please leave questions/comments below!]


  1. If you want to be more specific, it detects praise with 65% precision and 70% recall, which means that 65% of the sentences it says are praising the book actually are praising it, and it finds 70% of the actual sentences of praise. It finds criticism with 76% precision, but only 60% recall, probably because reviewers are usually circumspect and oblique with their criticism. Scores were calculated on a random sample of 500 sentences which I assessed by hand. It is worth noting that at least four of those sentences proved very difficult even for me to classify, as a human with a PhD in the discipline.
  2.  No, I’m not going to tell you which book got the lowest score. It almost certainly isn’t a bad book. It just highlights one of the difficulties of doing sentiment analysis of book reviews in classics. A lot of these are reviews of books about books (about books in some cases). It’s very difficult for a computer to know the difference between a bad book, and a good book written about a bad book. (I’ll reveal that the book with the lowest score was about an ancient author who is not highly regarded.)
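Incidentally, the ‘ca. 70%’ F1 figure quoted in the main text can be reconstructed from the precision and recall numbers in note 1, since F1 is simply their harmonic mean:

```python
def f1(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.65, 0.70), 2))  # praise: 0.67
print(round(f1(0.76, 0.60), 2))  # criticism: 0.67
```

Both classes land at about 0.67, which squares with the ‘ca. 70%’ rounding in the main text.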

Posted by J. for our anonymous researcher.

Trends in classics: Can a machine review the reviews of books about books? is licensed under a Creative Commons Attribution 4.0 International License.

