In my last post, I covered the basics of the New TLG’s Statistics tool. I focused mainly on the author vs. full-corpus statistics. In this post, I finish up with an overview of the remaining information in the author search and delve into the final search option: the lemma statistics.

In the last post, we were looking at words in Thucydides. These words could be grouped into words that Thucydides uses more or less than the *TLG* corpus as a whole, as well as visualized as a word cloud of larger and smaller words. But those are only the top row of potential data visualizations. The other graphs show most frequent words (by form), most frequent lemmata (by dictionary entry), most over-represented lemmata (in total, in relation to the author’s century, and in relation to the *TLG* corpus) and unique lemmata (only in that author). Here’s what the graphs look like on the screen; remember that each one can be downloaded by clicking the hamburger menu (the three parallel bars in the top right corner).

The same information is available in list/tabular form (there are only two columns, so either term works for me). You can see the tables *either* by clicking ‘*(more)*’ in the Summary box at the top of the page, *or* by clicking on the relevant menus directly beneath it. In the screenshot above, you can see the menus in smaller format on the left, or enlarged on the right.

If you do click ‘*(more)*’ in the Summary box, it will change to ‘*(less)*’; you can click on ‘*(less)*’ to contract the menu(s).

So you can think of the *more/less* button almost as a toggle. You can have multiple menus open in *more* view, but you’ll sacrifice vertical space and see less information. The menus expand to a default size of approximately 12 rows; as you open more, your screen will lengthen and you’ll need to scroll down to see all of the entries. I found it easier to read one menu at a time.

**TIP:** The menus and graphs offer the same information presented in different ways, so you don’t need to look at both. You should choose the option that helps you understand the data best: are you a tables person, or a graphs person?

**TIP 2:** If you start out looking at the graphs, you can use the page icon to see the tabular data in a popup window. Similarly, clicking the magnifying glass opens the graph in a larger format via popup window. The lightbulb takes you to the help file.

The word lists are very helpful if you need to be precise with your numbers. For example, if we look at the graph of overrepresented lemmata, it’s hard to figure out exactly what the numbers are.

There are two ways to get a hard number if you need it. First, if you have the graph open, you can hover over the column in question and it will give you a count.

But those hover figures aren’t static. Second, if you need to compare multiple columns, the tables are a better bet. Only the top 100 words are included.

The corpus vs. century statistics require a bit of math. Each lemma has three numbers attached. The first is the **actual count** in the corpus you’re searching. So if you’re searching a single work, that will be a different number than if you’re searching *all* the author’s works. The second, or *Corpus*, count is how many times that word is expected to appear. The *TLG* gets that number by searching for every instance of that word in every text and calculating a rate of x per 1,000 words. They then take the number of words in *your selected author* and cross-multiply. If you’re researching Solon, maybe that number stays at x. If you’re working on Plato, that number is x × y, where y is the size of your selected corpus divided by 1,000. So let’s say that Plato has 10,000 words. In that case, y is 10 (10,000/1,000), and the **expected number** of times that Plato will use your word is 10x. You can compare that number to the actual number to see how much the word is over/under-represented. Finally, the *Century* count is similar to the *Corpus* count, but instead of comparing your author to the entire *TLG*, it looks only at the author’s century. So now let’s say that in the 5th century BCE, this word is used z times per 1,000 words. Plato’s multiplier y is still 10 (author words / 1,000), so his century count is 10z, and we can use that number to compare Plato to his contemporaries.
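If it helps to see the cross-multiplication written out, here is a minimal sketch in Python. It reflects my reading of the calculation, not the *TLG*'s own code; the function name and the two rates are made up for illustration, and only the 10,000-word figure comes from the example above.

```python
def expected_count(rate_per_1000: float, author_words: int) -> float:
    """Scale a per-1,000-word rate up to the size of the selected author or work."""
    return rate_per_1000 * author_words / 1000


# Hypothetical rates, for illustration only (not taken from the TLG):
corpus_rate = 2.0       # x: uses per 1,000 words across the whole corpus
century_rate = 3.0      # z: uses per 1,000 words within the author's century
author_words = 10_000   # the round figure used for Plato above, so y = 10

expected_corpus = expected_count(corpus_rate, author_words)    # 10x
expected_century = expected_count(century_rate, author_words)  # 10z

print(expected_corpus, expected_century)
```

The same multiplier (author words / 1,000) is applied to both rates; only the rate changes between the *Corpus* and *Century* counts.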

As a final note on the graphs, I should add that although each graph has three columns, you can click on the legend to make some of the columns disappear. So if you’re *only* interested in Thucydides’ actual vs. expected word use, you can eliminate the *‘expected (century)’* column. It’s still there in gray if you need it; it’s just not showing up right now.

The last type of search is based on lemmata. This is probably the most familiar type of search: looking for a single word across a large corpus. In the Old *TLG*, these results were presented in a list; in the *New TLG*, you can visualize this data in various ways.

The *TLG* divides its statistics according to *highest use by author*, *highest use by work*, *distribution by century*, *geographic distribution*, and *relative distribution by author* and *century*. The layout of the *lemma* statistics page is exactly the same as the *author* and the *full corpus* pages.

We’re looking for *Romulus*, which is why the word spikes in the Second Sophistic, and again in the era of the Byzantine encyclopedists. Because I think that the categories in the lemma statistics are fairly straightforward, I’m going to use these charts to discuss some methodological questions.

**TIP:** For ‘relative distribution by author’, the graphs use the authors’ TLG numbers: make your life easier and use the tables, which give you the same information but collect it under the authors’ names.

So let’s take a look at these numbers. The structure is the same as the structure for the *author statistics* option. First, next to the author’s name, is the total number of appearances of our lemma (Romulus) in that author. Plutarch has 146. The next line tells you how many times that word is expected to appear. Because Romulus only appears 880 times (according to the *Summary* menu), and the *TLG* has millions of words, those numbers are pretty low: all less than 10. The ‘expected’ numbers aren’t the same because each author takes up a different percentage of the total *TLG* corpus, and the *expected* value is based on the author’s percentage share of that corpus. Finally, the last line is the expected number based on the century. This line is a weighted average: based on how popular the term is in a given era, the *expected* number will go up or down. Here we see some interesting (if not surprising) results: Plutarch’s and Dionysius’ *century* numbers are much higher (49, 32) than their *expected* numbers, which tells us that Romulus was a popular topic between the first century BCE and the second century CE. In contrast, the 3rd-century Dio has a *century* value of only 3, roughly the same as his *expected* value.
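To turn a pair of these numbers into an "x times as often" figure, you divide the actual count by the expected one. Here is a small sketch using Plutarch's figures from the table; the function name is my own, and the *TLG* doesn't report this ratio itself.

```python
def representation_ratio(actual: int, expected: float) -> float:
    """How many times more (or less) often a lemma appears than expected."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return actual / expected


plutarch_actual = 146   # appearances of Romulus in Plutarch, from the table
plutarch_century = 49   # Plutarch's century-based expected count, same table

ratio = representation_ratio(plutarch_actual, plutarch_century)
print(f"{ratio:.1f}")   # roughly 3 times the century-based expectation
```

The same division works with the corpus-based *expected* value; a result above 1 means over-represented, below 1 under-represented.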

What would you use these numbers for? You might want to use these tools if you’re trying to trace the history of a single word or group of words over time. For example, you may want to trace the word ἀνδρεῖα over time to investigate whether Greeks talked about manliness more or less in a given era. But there are a few problems with this procedure from a statistical perspective:

- Your distributions aren’t random. A work on ethics and moral philosophy will include ἀνδρεῖα more than a history, which will include ἀνδρεῖα more than Sappho, etc. It would be great if you could sort the data by genre (since the *TLG* categorizes its authors/works by genre), but right now that isn’t a feature.
- Your sample isn’t representative. It’s a convenience sample: it is drawn from what’s present, rather than from a scientific effort to randomly sample ancient Greek literature/language.

Let’s look at our Romulus example again. Based on the numbers we have, you *can’t* draw the conclusion that people in Plutarch’s era were more interested in the Roman foundation legend than people in Dio’s era. It’s a commonplace (and yes, I’m guilty of it, too), but we can only say that of *surviving* works. We can’t make assumptions about lost works. We also can’t say that Plutarch is three or eighteen times as interested in Romulus as other writers of his day or Greek literature in general (working off of the *‘expected’* and *‘century’* values). Obviously, in the *Life of Romulus*, you’re going to talk about Romulus; to learn more about Plutarch’s specific usage, you’ll need to look more specifically at where those references appear in context. The same is true of Dionysius’ *Roman Antiquities*, Dio’s books on early Rome, and Zonaras’ summaries of Dio.

You *can* draw the conclusion that Plutarch, Dionysius, Dio, Zonaras, etc. wrote books on Romulus – which is true – and you could speculate that they were interested in Roman foundation legends and/or Roman history. This wouldn’t be a bad guess, but it isn’t that enlightening.

You can also notice that the five authors who appear first in the ‘*relative distribution*’ section make up about a quarter of all appearances of Romulus in Greek. That’s an interesting point that could be pursued further: why aren’t other authors interested in Rome’s founder? When does Romulus appear in these other authors? Is it Romulus the founder of Rome or Romulus Augustulus?

You might be tempted to think that a virtue, like ἀνδρεῖα, would be different than a name. But the principle is the same. The TLG statistics collect a lot of information for you; some of it has to be used with caution.

~J