In the last TLG post, I took a look at intertextuality. This week, it’s corpus statistics. I hope everyone is feeling nicely mathematical today.
As usual, I’m assuming that you’ve logged in and are starting from the main TLG search page.
The New TLG’s statistics section is both very complex and fairly simple at the same time. When you click on the ‘Statistics’ link on the main page, you begin at the full corpus page. This page offers the option to refine the full corpus by date and by excluding certain texts (such as texts of uncertain authorship), but is not the page you want to use for a single author. I’ll get to that page later in the post.
As you can see, there are a lot of graphs, including a very pretty animated word cloud (it changes as you change your date range). The ‘summary’ sidebar tells you the size of your chosen corpus in texts, words, and authors (this also changes as you change the date range); the best-represented authors in your chosen date range; the largest works in the range; and the most frequent words in your range (both by dictionary entry and by form). The same information is provided in graphical form in the graphs, as well as in tabular form if you click ‘(more)’ in the summary bar. Both the graphs and the tables can be downloaded.
So all in all, this is an impressive collection of data.
That being said, I didn’t feel like I learned much from the full-corpus statistics. For example, λέγω, ἔχω, πᾶς, and γίγνομαι are the most common words in Greek texts of the 2nd century CE. Of course they are – I’d be more surprised not to see them (and indeed the same four are the most frequent across all Greek texts). Similarly, if you’ve ever stared at a bookcase full of Loebs, Teubners, or OCTs, the graphs of authors and how much they wrote probably won’t surprise you (there’s a lot of Plato and Plutarch). So the full-corpus information isn’t that enlightening unless you’re truly new to Greek texts. I certainly remember being shocked by how much Plato there was when I was first reading Greek.
The author statistics are much more useful, and I suspect that this is really what the TLG was expecting when they made these tools. I’ll explain a few potential uses of these tools as I go.
You can get to individual author statistics by entering the author’s name into the search field (as on the main page, the TLG will give you suggestions after you input three letters). Today’s example author is Thucydides. For authors who have multiple works, you can choose to search ‘all’ works, or to refine your search to a single work (going back to last week’s example, Plato’s Republic). While this is useful, I really hope that the TLG will allow multiple search inputs, since I can easily imagine wanting to see statistics on multiple related texts by the same author. Right now, you have to do those individually and combine the tables.
The author statistics page is identical to the full-corpus page in format, but some of the graphs are different. The graphs about most common words and word forms are the same, as are the total numbers of words in your chosen corpus. All of the fields that relate to authors are gone, because you only have one author. Instead, you get information about ‘over-’ and ‘under-represented’ words. Taken together, these tools let you learn an author’s favorite words, which are visualized in your word cloud.
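If you do end up combining tables by hand, a few lines of Python can do the adding for you. This is only a sketch: the word counts below are made up, the function name is my own, and I'm assuming you have already turned each downloaded table into a simple word-to-count mapping (the TLG's actual download format may need some cleanup first).

```python
from collections import Counter


def combine_counts(*tables):
    """Add up word counts from several per-work tables.

    Each table is assumed to be a mapping of word -> raw count.
    """
    total = Counter()
    for table in tables:
        total.update(table)  # Counter.update adds counts rather than replacing them
    return total


# Hypothetical counts for two works by the same author (not real TLG data).
work_a = {"λέγω": 120, "ἔχω": 80, "πόλις": 40}
work_b = {"λέγω": 60, "πόλις": 25, "ναῦς": 10}

combined = combine_counts(work_a, work_b)
print(combined.most_common(3))
```

Note that this only makes sense for raw counts. Ratios or percentages from two separate tables can't simply be added together; you would need to recompute them from the combined counts.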
You’ll notice right away that while some of the overall most common words are still pretty common in Thucydides (ἔχω, and to a lesser extent λέγω), others are totally new: ναῦς, for example, and πόλις (who knew that ships were as important as cities?). Ἀθηναῖος is probably not surprising, but the absence of a Sparta word (like Λακεδαιμόνιος or Σπάρτη) surprised me. In fact, even Corinth makes it!
To the right of the word cloud, the same information is represented in a graph with a few extras. In the graph, the red line represents the same information that you see in the word cloud. The yellow line tells you how many times that word would be expected to appear in the author, based on the rest of the corpus. And the blue line takes that ratio to give you an under/over-represented line.
Confused? Let’s take an example. If your full Greek corpus is 1 million words, and Thucydides’ History contains 10,000 words, then Thucydides makes up 1% of the full corpus. So if you are looking at any given word, you’d expect Thucydides to account for 1% of that word’s total occurrences in the corpus. This gives you a base value.
If the word Ἀθηναῖος appears 1000 times across the whole TLG corpus (obviously I am undercounting, but it’s just an example), then you’d expect it to appear 10 times in Thucydides (1% x 1000 = 10). Now you can check against the real counts. Let’s say Thucydides actually uses Ἀθηναῖος 900 times. Then that word is overrepresented, because it appears 90 (!) times as often as you’d expect. This ratio (90:1, actual to expected) is significant, and now that you’ve found it you can start to investigate why Thucydides might use Ἀθηναῖος so often.
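For anyone who prefers to see the arithmetic spelled out, here is the same toy example as a short Python sketch. The function names are my own, and the numbers are the invented ones from above, not real TLG counts. It also follows the simplified calculation in this post; I'm not claiming this is exactly how the TLG computes its expected values.

```python
def expected_count(corpus_word_total, author_size, corpus_size):
    """How often a word 'should' appear in the author if it were
    spread evenly across the whole corpus."""
    return corpus_word_total * author_size / corpus_size


def representation_ratio(actual, expected):
    """Above 1 means over-represented; below 1 means under-represented."""
    return actual / expected


# Invented numbers from the example in the text.
corpus_size = 1_000_000    # words in the whole corpus
author_size = 10_000       # words in Thucydides
corpus_word_total = 1_000  # occurrences of the word in the whole corpus
actual = 900               # occurrences of the word in Thucydides

expected = expected_count(corpus_word_total, author_size, corpus_size)
ratio = representation_ratio(actual, expected)

print(f"expected: {expected}, actual: {actual}, ratio: {ratio}")
# expected: 10.0, actual: 900, ratio: 90.0
```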
To help you visualize these ratios, you can download the TLG’s charts in various formats (pdf, jpg, png, svg) or print. To access these options, click on the hamburger menu:
Here’s an example of the Thucydides graph blown up:
TIP: It’s harder than it seems to line up the axes with the bars. The words go with the bar to their right – i.e., if you look at the interstitial white space as a “bar” on a bar graph, they go with the bar AFTER the line they touch. If you download the chart rather than view it online, you can see this more easily by extending lines (in Adobe, Photoshop, or your preferred image viewer) from the gray guidelines at the bottom to the top of the graph.
Our Thucydides graph shows that he uses πᾶς and λέγω less than expected, even though λέγω is one of the most common words in the History. He does not use πόλεμος more than expected, even though he’s writing a book about war.
I’ll continue with the author statistics page in my next post, but I hope that you can already see how useful it is, both for discovering new information and for checking philological information. In that next post, I’ll also discuss some further uses of the statistics tools and some of the potential methodological problems you should be aware of.