In my last few posts on the New TLG, I’ve mostly covered old tools that have been updated. In this one and the next, I’m going to tackle the TLG’s new statistical analysis tools. If you’re looking for pretty graphs, those will be in the next post. If, on the other hand, you like highlighting, this post is for you!
First, the obvious question: what is an N-gram and why do you care?
N-grams are groups of words that appear together (or close enough) in a text. The N is the mathematical notation for <insert # here>. So there are bigrams, trigrams, etc., and they all fall under the general category of N-gram. Computers can compare hundreds of texts to see where two texts share the same words at a certain level (often set by the user; in this case, set by the TLG). All of this is another way of saying that N-grams use computers to search for intertextuality.
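For readers who want to see the idea in miniature, here is a rough sketch of the simplest possible version: exact, contiguous word runs shared between two word lists. This is not how the TLG does it (its matching is looser, as described later in this post). The function names and English sample sentences are invented for illustration.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def shared_ngrams(text_a, text_b, n):
    """Return the n-grams that occur in both word lists."""
    return set(ngrams(text_a, n)) & set(ngrams(text_b, n))


# Invented sample sentences, not quotations from any author.
a = "the king sent his cavalry to the city".split()
b = "then the king sent his envoys to the city".split()

for match in sorted(shared_ngrams(a, b, 3)):
    print(" ".join(match))
```

With n set to 3, this reports the trigrams the two sentences have in common; raising n makes the test stricter.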
If you’re a good programmer, you can make your computer do this for you. If you’re not (and you’re interested in Greek), you can use the TLG.
The TLG settings allow for two types of intertextuality searches. If you go straight to the ‘N-grams’ menu on the main page, you’re limited to comparing two authors. But if you activate ‘N-grams’ while you’re browsing within a given text (see Part 1), you can look for parallels between that text and the rest of the TLG corpus.
For this post, I’m focusing on the one-to-one comparison. If you’d like examples of the author-to-corpus comparison, please let us know!
Once you click on the ‘N-Grams’ option at the top of the main page, you can choose your authors (‘source texts’). Choosing these authors is very similar to using the ‘basic’ or ‘advanced’ search functions: after three letters, the TLG will start offering options for autofill. Once you’ve chosen your author, a dropdown menu appears and you can choose which of that author’s work(s) you want to use. You aren’t forced to limit yourself to one! For my first search, I chose Herodotus and all of Dionysius.
I had two reasons for this choice: first, I knew that Dionysius praised Herodotus in the rhetorical works, so I was sure of getting results. But also, since he does praise Herodotus, I wanted to see whether his word choice was similar in the Roman Antiquities (despite the different subject matter — and I should thank Daniele Miano for the idea!). As it turns out, the answer was a bit more complicated.
The results are displayed in a relatively inflexible format (for those of you who remember the old TLG search, it’s similar, but even more rigid). You have the option to display either 20 or 40 hits per page, but there are no other options (so the 100+ option of the old TLG has disappeared). But there’s no hit count, either – you just have to keep on hitting ‘next’ to get to the end. I found this frustrating, especially thinking of old TLG searches where I could choose to see fewer lines of context but more hits on the page.
Your options for context are both more and less limited. The dropdown menus allow you to choose between 1-3 lines of context – but if a passage catches your eye, you can use the box with the arrow to see the whole passage in the author page, or the magnifying glass to see it in a pop-up window. Even better, you can click on the two-square icon:
This icon leads you to a side-by-side comparison (‘parallel browsing’). The TLG then highlights the similarities for you, in what is probably the most useful aspect of the tool. The image above shows an extreme example: it’s Dionysius quoting Herodotus. But the tool also works on smaller parallels:
If you look carefully, you’ll see that these matches aren’t exact. They’re examples of Dionysius and Herodotus using a similar lexicon, rather than intertexts in a strict sense. You can also see that the TLG has been pretty clever in choosing matches: compound verbs are matched with the base verb, all words are matched by lemma rather than conjugated form, and only content words ‘count’ — particles are largely ignored.
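To make that kind of matching concrete, here is a minimal sketch of comparing texts by lemma and by content words only. Everything in it is an assumption made for illustration: the tiny lemma table, the list of function words, and the two sample phrases are hand-built (and transliterated to keep the toy data easy to type). None of it reflects the TLG’s actual data or code.

```python
# Hypothetical lemma table: inflected or compound form -> base lemma.
LEMMAS = {
    "errexe": "rhegnymi",     # aorist indicative
    "rhexai": "rhegnymi",     # aorist infinitive
    "anarrexai": "rhegnymi",  # compound verb folded into the base verb
    "phonen": "phone",        # accusative of the noun
}

# Hypothetical stop list: particles, conjunctions and the article.
FUNCTION_WORDS = {"de", "men", "gar", "kai", "ten"}


def content_lemmas(words):
    """Drop function words, then map each remaining word to its lemma."""
    return [LEMMAS.get(w, w) for w in words if w not in FUNCTION_WORDS]


def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def shared(a, b, n=2):
    """Return the lemma n-grams that both word lists have in common."""
    return set(ngrams(content_lemmas(a), n)) & set(ngrams(content_lemmas(b), n))


# Invented phrases for illustration, not quotations.
phrase_a = ["errexe", "de", "phonen"]
phrase_b = ["rhexai", "gar", "ten", "phonen"]

print(shared(phrase_a, phrase_b))
```

The two phrases share no surface form at all, yet they match once the particles are dropped and the verbs are reduced to a common lemma. That is also why the matches need reading: the looser the comparison, the more candidates it throws up.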
Finally, the Parallel Browsing page also lets you compare two different editions of the same author (I didn’t test this) and browse your chosen authors side-by-side without the highlights (‘browse two texts’) – in this case, the highlighting is just the normal TLG highlighting.
I think it’s worth making the larger point that these tools are just that: tools. Finding the matches is pretty easy for a computer. But it takes a trained classicist to sort and interpret those results. While I don’t think that anyone reading this blog would question that statement, there are certainly loud voices out there who’d claim otherwise. So I think it’s worth taking up two examples of why you need to actually read your results.
Example one: context is king
Here are two examples of shorter matches. #40 is, I think, a good match: it shows how well the TLG parsing tool works (ἔρρηξε is matched to ῥεξαι), as well as the importance of content words (Dionysius’ preposition and article are highlighted as part of his phrase, but not significant for the results). To see whether this match is a true ‘intertext’, you’d need to go in and read more closely. The Herodotean passage is about Croesus’ son, who’d never spoken a word – but who miraculously speaks to save his father’s life. Dionysius is telling the story of how Rome conquered Gabii. So both are stories about the takeover of a powerful city. But Dionysius’ story inverts Herodotus: it’s about a prominent citizen who’s put to death and is unable to speak from shock. Is it an intertext? Maybe! (I admit that I’m not sure how common the phrase is, and researching that would be the next step. Anyone who likes the idea: acknowledge me in the footnote, and it’s yours.)
#39 doesn’t work. Aside from cavalry, what are the similarities? In Herodotus, Cyrus offers rewards to the first of his cavalry to storm Sardis. In Dionysius, Tarquin wants to double the pre-existing cavalry. So πρῶτος and πρότερος need more careful consideration.
All this is to say that you need to be trained in textual criticism and to think about why/whether potential similarities make sense. Otherwise, they’re just words — and sometimes, two descriptions of the same event sound similar.
Example two: this is why we punctuate
Yes, I know. Ancient people didn’t, and the TLG’s N-grams tool (perhaps recognizing that fact) completely ignores punctuation. A lot of the time, that’s probably helpful. Sometimes, it leads to false results.
For this example, I compared Plato’s Republic to Thucydides, thinking that there would be few results. I checked all of the matches (there were 100), and most of them were fairly weak. I’m only showing the first three hits as an example.
The lemmatizer seems to have choked a bit on λέγω (I empathize), since it counted two appearances of the same word in different forms as two separate ‘hits’. By doing that, it made the number of potential similarities really, really broad – because now any three-word stretch that includes ‘first’ and ‘say/choose/etc’ is going to show up as a potential match. If you look at result #3, this double-hit isn’t confined to Plato (#3 is one of the closer results, actually). And because one of the words included is ‘said’, the shift across punctuation means that monologue is equated with dialogue.
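The punctuation problem is easy to reproduce in a toy setting. The sketch below is my own illustration in English, with invented sentences and function names. It says nothing about how the TLG is actually implemented. It compares two ways of collecting trigrams: one that throws punctuation away, and one that refuses to let a trigram cross a sentence break.

```python
import re


def words(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def ngrams_ignoring_punctuation(text, n):
    """Collect n-grams over the whole text, as if it had no punctuation."""
    return set(ngrams(words(text), n))


def ngrams_within_sentences(text, n):
    """Collect n-grams sentence by sentence, never crossing a break."""
    found = set()
    for sentence in re.split(r"[.;?!]", text):
        found.update(ngrams(words(sentence), n))
    return found


# Invented sentences: one piece of narration, one piece of dialogue.
monologue = "He said first that the ships were ready"
dialogue = "So he said. First, tell me your name."

loose = ngrams_ignoring_punctuation(monologue, 3) & ngrams_ignoring_punctuation(dialogue, 3)
strict = ngrams_within_sentences(monologue, 3) & ngrams_within_sentences(dialogue, 3)
print("ignoring punctuation:", loose)
print("within sentences:", strict)
```

The loose comparison finds a shared ‘he said first’, even though in the dialogue those words straddle a full stop and belong to two different speakers’ turns. The strict comparison finds nothing. Neither setting is right in every case, which is the point: someone still has to read the hit.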
I don’t want to finish with you thinking that the N-grams tool is useless. On the contrary: it makes an argument about intertexts both easier (by helping you locate intertexts faster) and better (by helping you determine whether your potential intertext/collocation is rare). But don’t mistake the tool for the end result. And when people ask you, ‘Can’t computers do that now?’, explain why.