In this post, we pick up on our discussion of using the Tesserae project for intertextuality. At the start, we want to acknowledge the generosity of Neil Coffee, the project lead. He was very quick to respond to our questions about best practices for the site and shared a forthcoming article on some of his results with silver Latin poetry. Thanks, Neil!
After reading through some of Tesserae’s research, we decided to expand on the topic of using digital tools for linguistic analysis. Tesserae was one of the first projects to attempt this in any language, and it’s an ongoing project — there are plans for further refinements to at least some of the tools, but probably not the interface as a whole.
Because intertextuality is a difficult topic, I’m going to start with a slightly more detailed reminder about the previous post than I usually do. Intertextuality, in its most basic form, is textual reuse and allusion. In other words, when we search for an intertext, we are trying to see how Author A responds to, challenges, reworks, or appropriates Author B. In the last post, we discussed an obvious case: when an author clearly takes a whole sentence (or at least a phrase) from another author. For the most part, these textual borrowings are well known.
The goal of the Tesserae project is to find more subtle textual borrowings, such as the use of the same subset of words by two different authors. In this way, Tesserae offers a tool for discourse analysis as much as for intertextuality in the traditional sense. The tool tries to group words of similar meaning across multiple authors.
An example might help clarify these terms. Consider the following group of words: gurney, examination, sterilization, biome, suture, instrument, cardiac. What do all of them have in common? If you said “hospital” (or something similar), that’s because you recognize them as belonging to a specific type of discourse: a way of speaking about things that is specific to a professional field, a time or place, or an in-group.
Not all of the words are specific to the discourse. For example, “examination” could be found in a classroom context as well, and “instrument” could refer to music. But when you put these more common words in with the group of medically specific words, like “suture”, you mentally categorize them as medical terminology.
Tesserae does something similar with Latin texts. At the current stage of the project, a substantial amount of user-side manual checking of all results is still required. That is, you can’t take the raw data the computer gives you at face value; you need to look it over yourself and draw conclusions about which results are valid. As a result, this tool is aimed at advanced researchers. You will need to be confident in your ability to judge the significance of a potential intertext to use the tool most effectively. Don’t worry — there’s an example of this below!
We ended our last post by discussing how to sort the results of a search. The default is to sort by “score”, the computer-generated estimate of the significance of the match.
The potential scores, as far as we can tell, range from 1 to 12, and are based on matches of 1-3 words (you can set some sensitivity options by using the advanced search features, which we’ll discuss in part 3). Words that are less common in Latin are given more weight than very common words, like et or non. Similarly, words that appear to be collocated (joined more closely in space, and therefore a potential thought unit) are given more weight than words that are further apart. The scale is not linear: a score of 9 is not necessarily as much worse than a 10 as it is better than an 8. You will have to decide what these scores mean for your author(s).
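Tesserae does not publish its exact formula on the results page, so here is a purely illustrative Python sketch of the two weighting ideas just described: rarer shared words raise the score, and wider spacing between the matched words lowers it. The function name, frequency values, and weighting scheme are all invented for demonstration.

```python
import math

def sketch_score(matched_words, freq, source_dist, target_dist):
    """Toy scoring function (not Tesserae's actual formula).

    matched_words: shared stems between the two passages
    freq: corpus frequency for each stem (0-1, higher = more common)
    source_dist / target_dist: spacing of the matched words in each passage
    """
    # Rarity: rare stems (low corpus frequency) contribute more weight.
    rarity = sum(math.log(1.0 / freq[w]) for w in matched_words)
    # Proximity: tightly collocated words count for more, so divide
    # by the combined spread of the match in both passages.
    spread = source_dist + target_dist
    return rarity / spread

# Invented frequencies: "et" is far more common than "arma", so a
# match on "arma" scores higher at the same spacing.
freqs = {"et": 0.05, "arma": 0.001}
print(sketch_score(["arma"], freqs, 2, 2) > sketch_score(["et"], freqs, 2, 2))
```

The same function also shows the distance effect: calling it with smaller `source_dist` and `target_dist` values yields a higher score for the same words.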
As we mentioned last time, you will want to look through the entire set of matches to determine the relevance of a given result, and some of the more obvious intertexts are ranked quite low (our quotation from last time, remember, is in the 100s and 300s!).
The score that the computer gives any given match doesn’t necessarily correlate with the strength of the intertext. Rather, it is the computer’s best guess at the likelihood that these particular words would occur in both authors, based on how common the Latin stems are. Because hic, haec, hoc and opus are all common words, and sum is a stop word (so common that it is excluded from searching), our example appears lower in the list than others. That doesn’t mean that it’s not an intertext, or that the program is untrustworthy; it means that the computer can’t replace a trained classicist. This is why you have to manually check your results to ensure that they are relevant. The program makes your job easier by sifting through several hundred lines of Latin and picking out the 600 matches that are most likely to be relevant, a job that would ordinarily take several years of intensive study. Now that mechanical task is automated, and you can focus on deeper analysis.
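To make the stop-word idea concrete, here is a toy Python sketch of how excluding very common stems changes what counts as a match. The stop list and the stem lists are invented and far smaller than any real one; Tesserae’s actual lists and matching logic will differ.

```python
# Hypothetical miniature stop list: stems so common that they are
# excluded before any matching happens. Illustrative only.
STOP_WORDS = {"sum", "et", "qui", "in", "est"}

def content_stems(stems):
    """Drop stop words, keeping only stems that can count as matches."""
    return [s for s in stems if s not in STOP_WORDS]

# Two invented phrases sharing three stems, one of which ("est",
# a form of sum) is on the stop list and so never matches.
phrase_a = ["hoc", "opus", "est"]
phrase_b = ["opus", "sum", "hoc"]
shared = set(content_stems(phrase_a)) & set(content_stems(phrase_b))
print(sorted(shared))  # only the non-stop-word stems survive
```

Because the stop word is filtered out before comparison, a real parallel built partly on very common words ends up with fewer countable matches, which is one reason a genuine intertext can land surprisingly far down the list.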
In the last post, we promised an explanation of why you might want to focus on source locus vs. target locus. Mostly, this is a question of your research interests: the information is the same regardless of which you choose. If you are mostly interested in the influence of Vergil on later authors, it will make more sense to focus on Vergil by moving through his work from start to finish. In that case, you will want to sort by source locus. If, on the other hand, you’re more interested in Vergil’s (and others’) influence on Ovid, it will make more sense to sort by target. This arranges your material in such a way that you can see which lines of Ovid are most likely to be indebted to Vergilian influence.
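The difference between the two sort orders can be sketched in Python. The record layout, field names, and loci below are all hypothetical; the point is simply that the same result set can be walked through in source order or in target order.

```python
# Hypothetical match records: each carries a source locus (Vergil),
# a target locus (Ovid), and a score. Loci as (work, book, line)
# tuples sort naturally in reading order.
matches = [
    {"source": ("Aen.", 4, 10), "target": ("Met.", 1, 5), "score": 7},
    {"source": ("Aen.", 1, 3),  "target": ("Met.", 2, 8), "score": 9},
]

by_source = sorted(matches, key=lambda m: m["source"])  # walk through Vergil
by_target = sorted(matches, key=lambda m: m["target"])  # walk through Ovid

print(by_source[0]["source"])  # first match in Vergil's reading order
print(by_target[0]["target"])  # first match in Ovid's reading order
```

The underlying data is identical in both orderings, which is exactly why the choice comes down to whether your research follows the source author or the target author.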
For poetry, the choice between the two is probably not going to be a huge deal. Poetic books are relatively short, and you can limit your search to individual books. But prose works on Tesserae are grouped in much larger chunks. Livy, for example, is only searchable by decade. Although you can choose to sort either by increasing chapters (that is, 1, 2, 3, 4, etc.) or decreasing chapters (that is, 4, 3, 2, 1, etc.), a decade is still an immense amount of Livy to sort through. Unless you’re interested primarily in books 1 or 10, sorting isn’t going to help much.
On the other hand, if you are interested in Livy’s potential intertexts with Vergil, you can choose to search through the Aeneid book by book. In this case, it may be more manageable to sort by Vergilian lines or by relevance (with the caveats from above). Remember that you will still have to manually check your results!
If you need to work with prose texts on Tesserae, we strongly recommend using the advanced features to get more specific and more likely intertexts, and to cut down on the number of false positives that you will have to examine. But there are other interesting things that you can do with a working knowledge of the advanced features. We’ll go through them in the next post; we’re ending this one with a brief taster of how they can yield more interesting results.
The screenshot below shows a comparison of Vergil with himself. The results you see were obtained by raising the requirements for significance substantially above the default. We chose the maximum number of stop words (a term we’ll explain in the next post), lowered the textual distance, and changed the unit of comparison from line to phrase. Here are the top results (by score) from comparing Aeneid 10 and 11:
What we start to see here is the specific way that Vergil puts together language. We’ll discuss more of that in our next post!