Help with Research: Using Tesserae for intertextuality, Part 3

In our last Tesserae post, I promised an explanation of Tesserae’s advanced features. These are mostly aimed at limiting the number of hits an individual search will come up with, which is useful because all potential matches need to be checked. It’s much easier to check 100 matches than to check 600 or even 1600! But I also want to highlight two advanced features that really advance the way that we can computationally analyze Latin: by using similarity metrics for sound effects and by connecting words that are semantically similar. In this post, I will discuss the second of these; my last (but not the last!) Tesserae post will discuss sound effects.

The most important concept for understanding how to best use Tesserae’s advanced features, as well as how the site works as a whole, is the concept of stop words. A stop word is a word (occasionally a phrase) that is so common that it conveys little meaning and can therefore be ignored without damaging the results of your search. In English, common examples of stop words are articles (the, an), conjunctions (and, but), and many adverbs (really, very). These words aren’t meaningless, but they are unlikely to carry much weight. You expect to see them all the time, so it isn’t impressive or interesting when two texts share them.

Although Latin has many fewer words than English, it too has stop words: non, et, and sed are examples. Because Tesserae is designed for use on texts, sometimes you can even specify stop words for individual authors: for example, one could argue that ego is a stop word/phrase in Cicero, but is not in Ovid. (Or maybe that’s just being mean to Cicero…)

Because stop words are words that you are specifically excluding from your search, there is an inverse relationship between stop words and hits: by increasing the number of stop words, you will decrease the number of hits. For most users, I recommend maximizing the number of stop words that you use. Going back to the previous example of comparing Aeneid 10 and 11, the number of hits with the default number of stop words (10) is 1263; with 50 stop words, there are 667 (about half the number!); with 100 stop words, there are 526; and with 200 (the maximum), there are only 339, just over a quarter of the original. You thus save yourself a lot of labor.
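To make the inverse relationship concrete, here is a minimal sketch in Python. It is not Tesserae’s actual code: the miniature corpus, the two sample lines, and the function names `build_stoplist` and `shared_words` are all invented for illustration, and I am assuming that a stop list is simply the N most frequent words in whatever text you count.

```python
from collections import Counter

def build_stoplist(tokens, size):
    """Return the `size` most frequent tokens as a set (a toy stop list)."""
    return {word for word, _ in Counter(tokens).most_common(size)}

def shared_words(line_a, line_b, stoplist):
    """Words that two lines have in common once stop words are ignored."""
    return (set(line_a) & set(line_b)) - stoplist

# Hypothetical miniature corpus (illustration only, not Tesserae data).
# Frequencies: et = 4, non = 3, sed = 2, and arma, virum, cano = 1 each.
corpus = "et non et sed et non arma et non sed virum cano".split()

# Two invented "lines" to compare.
line_a = "arma et non cano".split()
line_b = "et arma non virum".split()

# As the stop list grows, fewer shared words survive as potential matches.
for size in (0, 1, 2):
    stoplist = build_stoplist(corpus, size)
    print(size, sorted(shared_words(line_a, line_b, stoplist)))
```

With no stop words, all three shared words count as matches; once et and non are on the stop list, only arma is left, which is the word a human reader would have cared about anyway.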

Of course, one could argue that your results would be worse. So let’s compare the top 5 hits at 10 stop words and at 200:

[Image: Vergil-to-Vergil comparison with 200 stop words on Tesserae]

[Image: Vergil-to-Vergil comparison with 10 stop words on Tesserae]

The top image shows 200 stop words, while the bottom image shows 10. As you can see, the top 3 are exactly the same, increasing the chances that these phrases are significant. Meanwhile, the 4th result of the bottom (10 stop word) search has dropped out at 200 stop words, suggesting that it is less significant.

The Advanced Search features offer a variety of other useful ways to limit your search. For example, in addition to increasing the number of stop words, you can choose how the computer decides what a stop word is. The default setting is corpus, which means “all Latin literature on Tesserae.” But this can be changed using the Stoplist basis menu:

[Image: the Stoplist basis menu on Tesserae]

Target and Source refer back to the Target and Source texts that are used as bases for comparison (see our initial Tesserae post for further details). That is, if you were comparing Vergil (target) to Cicero (source), you would be able to limit specifically Ciceronian language (like crudelissimus or even pulcher) by selecting source in the Stoplist basis menu. You can also use only the two texts you’re interested in (target + source) as the basis for your stop words.

Rerunning our 200-word stop list with target/source comparison slimmed the results down even further, to 112. And some of what happened might surprise you:

[Image: Vergil-to-Vergil comparison with 200 stop words, target + source basis, on Tesserae]

The top three hits are still the same, but result #4 has come back in from the 10 stop-word list. This means that while the words mors and minor are relatively common in Latin as a whole, they are rare enough to be significant in these two books of Vergil. That is valuable information!
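Here is a small sketch of why the basis matters. Again, this is a toy and not Tesserae’s implementation: the word lists are invented, the function name `stoplist` is my own, and I am assuming that each basis just means "count word frequencies in this set of texts and stop the most frequent ones."

```python
from collections import Counter

def stoplist(tokens, size):
    """The `size` most frequent tokens in the chosen basis (toy version)."""
    return {word for word, _ in Counter(tokens).most_common(size)}

# Invented word lists (illustration only). In the two texts being compared,
# "mors" is rare; in the rest of the imaginary corpus it is very common.
target = "arma arma arma et et mors".split()
source = "arma arma et et et mors".split()
rest_of_corpus = "mors mors mors mors mors mors et et et et et arma".split()

# The four choices in the Stoplist basis menu, modelled as four token pools.
bases = {
    "corpus": target + source + rest_of_corpus,
    "target": target,
    "source": source,
    "target + source": target + source,
}

# Build a two-word stop list from each basis and compare them.
stoplists = {name: stoplist(tokens, 2) for name, tokens in bases.items()}
for name, words in stoplists.items():
    print(f"{name}: {sorted(words)}")
```

In this toy data, mors lands on the stop list when the whole corpus is the basis, but stays searchable when the basis is target + source, which mirrors how a result can drop out under one basis and come back under another.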

The Advanced Search features let you do other things besides fiddle with stop words.

[Image: Tesserae Advanced Search features]

Let’s go through these features one by one. The Unit feature lets you choose whether you want to look at similarities by line (more useful for poetry) or by phrase (useful in both poetry and prose). You can also decide whether to Score words based on the word itself, or on its stem. If you choose word, the program will search for the exact match, i.e. the same inflected form of the word. Although this is the default, I would recommend performing searches based on stem, which tries to match dictionary entries of the word regardless of the number/case/tense/voice/etc. As an example, using the word score, sapientiae would not match sapientiam, even though they are clearly the same word to a human eye. The stem score would match them as inflected forms of the same dictionary entry.

Scoring by stem will give you more results, but those results will also be more accurate: Latin is an inflected language, after all! But, as a word of caution, Latin words that can come from two different dictionary entries (such as bella from bellum or bellus) can yield false matches. As anyone who has taught a first-year Latin class understands, this is confusing to early-stage Latinists too! And again, this is why you need to review your results, rather than counting on numbers alone.
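The difference between word and stem matching can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how Tesserae works internally: the `LEMMAS` table is a tiny, hand-made, deliberately incomplete stand-in for a real Latin lemmatizer, and the function names are my own.

```python
# Hypothetical, hand-made lookup from inflected form to possible dictionary
# entries. A real lemmatizer would cover vastly more forms than this.
LEMMAS = {
    "sapientiae": {"sapientia"},
    "sapientiam": {"sapientia"},
    "bella": {"bellum", "bellus"},  # ambiguous: "wars" or "pretty"
    "bellae": {"bellus"},
}

def word_match(a, b):
    """Word matching: only the identical inflected form counts."""
    return a == b

def stem_match(a, b):
    """Stem matching: forms match if they share any possible dictionary entry.

    Forms missing from the table fall back to matching only themselves.
    """
    return bool(LEMMAS.get(a, {a}) & LEMMAS.get(b, {b}))

print(word_match("sapientiae", "sapientiam"))  # different inflected forms
print(stem_match("sapientiae", "sapientiam"))  # same dictionary entry
print(stem_match("bella", "bellae"))           # matched via bellus
```

The last line is the cautionary case: if bella means "wars" in your passage, the match with bellae is a false one, and only a human reader can tell.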

If you want to minimize the number of hits that appear in your results, you can drop scores below a certain number (6 is the default). These numbers, as we’ve said before, give you an idea of how likely it is that the matches are significant, and are calculated on a non-linear basis. That is, 7 is not twice as relevant as 6, or 120% as relevant, and the step from 6 to 7 is not necessarily the same size as the step from 7 to 8. As a reminder, you can see the score of each match in the rightmost column:

[Image: screenshot of Tesserae search results]

Although Tesserae’s relevance scores currently run up to level 12, the cutoffs that you can set are more limited. You cannot, for example, exclude all matches except the level-11 matches.

[Image: the drop scores menu on Tesserae]

If you want to see only the results that the computer thinks are most likely to be relevant, you should choose 9 (which will show you only matches at level 9 and above). If, on the other hand, you want to see all of the matches, regardless of their likely relevance, you should choose the top option of no cutoff.
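In code terms, the cutoff is just a filter on the score column. The sketch below is my own illustration, with invented hits and scores and a made-up function name `apply_cutoff`; it assumes that a hit scoring exactly at the cutoff is kept, since the menu talks about dropping scores below a number.

```python
def apply_cutoff(hits, cutoff=None):
    """Keep hits scoring at or above `cutoff`; None means no cutoff."""
    if cutoff is None:
        return list(hits)
    return [hit for hit in hits if hit[1] >= cutoff]

# Invented (phrase, score) pairs for illustration only.
hits = [
    ("phrase A", 11),
    ("phrase B", 9),
    ("phrase C", 7),
    ("phrase D", 6),
    ("phrase E", 4),
]

print(len(apply_cutoff(hits)))     # no cutoff: every hit is shown
print(len(apply_cutoff(hits, 6)))  # the default cutoff
print(len(apply_cutoff(hits, 9)))  # only the highest-scoring hits
```

Note that the filter can only set a floor. There is no way to ask for one score band on its own, which matches the limitation described above.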

With this feature, as is true of most of Tesserae’s features, the choice you make is largely determined by your goals for using Tesserae. In this post, I’ve focused mainly on limiting your search output. To close off this post, I think it’s worth explaining why. The nature of a tool like Tesserae is that it offers a sort-of shortcut. It computationally tries to reproduce the experience of being so familiar with a text that you can hear or see echoes of it in new texts that you read. For most of us, this takes years of study; using Tesserae can help you become equally familiar with a new work or a new author in a shorter timeframe.

The caveat, of course, is that computers cast a wide net, and sometimes the results that they come up with are uninteresting or wrong. As I’ve been emphasizing, you need to go through them and decide which results are real matches and which are false matches (either because they’re too common to be important or because they’ve matched incorrect forms, like bella). Although it’s certainly easier to go through 6000 results than it is to go through an entire work, you shouldn’t underestimate the amount of intellectual labor that goes into this manual check. If there are ways that can help you weed out the results that are least likely to be significant, that saves your brainpower for making judgment calls on the cases that aren’t so clear (it’s basically a way to avoid decision fatigue). Simplifying and focusing will help your scholarship, and you can always return to the broader search at a later date.


~j.
