This notebook will help you explore a corpus with Spyral Notebooks (it's part of the Art of Literary Text Analysis with Spyral Notebooks). In particular, we'll look at:
- creating a reusable corpus
- exploring term frequencies
- exploring term distributions
- exploring term collocates
- exploring text
- next steps
Creating a Reusable Corpus
Most of the code examples until now have created a transient corpus, in other words one that is created and used immediately, but not reused. In what follows we will show how we can create a corpus and reuse it elsewhere in the Spyral notebook.
It's useful to understand that when we create a corpus we actually get a promise for a corpus that will be delivered later. This is the nature of some JavaScript code that runs asynchronously, in other words where things happen at different times. One way to deal with this situation would be to use something like this syntax (which we won't explain in detail since it's not the syntax we'll use):
var helloworld; loadCorpus("Hello world!").then(function(corpus) { helloworld = corpus; });
Instead we'll use a convenience function of Corpus called assign, which takes as an argument a string name that is the variable name that will be used in subsequent code blocks.
loadCorpus("Hello world!").assign("helloworld");
Much easier, yes? There are two crucial things to remember:
- assign takes a string argument (with quotes) even though later we use the same word as a variable name (without quotes)
- any references to the named corpus should occur in subsequent code blocks (in order to ensure that the corpus has been created)
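To make the idea of a promise more concrete, here is a minimal sketch in plain JavaScript. It does not use Spyral at all: fakeLoadCorpus is a hypothetical stand-in that resolves after a short delay, and it only illustrates why a value delivered by a promise is not available on the very next line.

```javascript
// Hypothetical stand-in for an asynchronous loader (NOT the Spyral API):
// it returns a promise that resolves to a plain object a moment later.
function fakeLoadCorpus(text) {
  return new Promise(function (resolve) {
    setTimeout(function () { resolve({ text: text }); }, 10);
  });
}

var helloworld; // declared now, filled in later

var pending = fakeLoadCorpus("Hello world!").then(function (corpus) {
  helloworld = corpus;
  return corpus;
});

// This line runs before the promise has resolved, so nothing is there yet.
var readyImmediately = helloworld !== undefined;
console.log("ready immediately?", readyImmediately);

// Code that depends on the corpus has to wait for the promise.
pending.then(function (corpus) {
  console.log("ready later:", corpus.text);
});
```

This is the same timing issue that motivates placing references to an assigned corpus in a later code block.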
We can combine the code from the previous notebook Creating a Corpus with Plain Text with this new concept of creating a reusable corpus.
loadCorpus("https://gist.githubusercontent.com/sgsinclair/84c9da05e9e142af30779cc91440e8c1/raw/goldbug.txt", { inputRemoveUntil: 'THE GOLD-BUG', inputRemoveFrom: 'FOUR BEASTS IN ONE' }).assign("goldbug"); // this line is the new part
By running this code in its own code block we ensure that subsequent code blocks will have access to our new goldbug corpus variable. (But don't forget that this first code block needs to be executed before subsequent blocks have access to the variable – it may help to start by running all code blocks.) In the code block below we reuse the goldbug corpus to show a summary of the corpus.
goldbug.tool('summary'); // we are reusing the goldbug corpus
Exploring Term Frequencies
So, we have a reusable corpus in Spyral, now what? Sometimes we come to a corpus with pre-existing questions, sometimes we just want to start exploring. We can begin with simple term frequencies: what are the most frequent terms in the document? To start exploring, we can use the Cirrus word cloud visualization – we can use the tool function of our corpus and specify Cirrus as the tool to use.
goldbug.tool('cirrus');
The layout of the terms will change, but the relative size of terms will express frequency in the document. We can notice a few things immediately that are fairly distinctive of prose literature, for instance the high frequency of the term "said" (often used to report speeches by characters). Another typical characteristic of literature is the high frequency of character names ("legrand", "jupiter", "massa", etc.). More noticeable here are other terms such as "bug", "parchment" and "skull".
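For readers curious about what a term frequency count involves, the following is a rough sketch in plain JavaScript. The tokenization rule (lowercase runs of letters, with an optional apostrophe part) and the sample sentence are assumptions made for illustration; this is not how Cirrus or Voyant actually tokenizes text.

```javascript
// Count how often each term occurs in a text (a naive sketch, not Voyant's tokenizer).
function termFrequencies(text) {
  // Lowercase, then keep runs of letters, optionally with an apostrophe part ("jupiter's").
  const tokens = text.toLowerCase().match(/[a-z]+(?:'[a-z]+)?/g) || [];
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  // Sort by descending frequency, then alphabetically for stable output.
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0])
  );
}

// Invented sample sentence, used only for illustration.
const sample = "The bug, said Legrand, is a gold bug; the bug is heavy.";
const topTerms = termFrequencies(sample).slice(0, 3);
console.log(topTerms);
```

Note that very common words such as "the" and "is" rank highly in a raw count like this one.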
The word cloud is a very useful way to get a quick overview of high frequency terms, but it's also fairly limited. For instance, we are limited to the high frequency terms, we can't search for other terms, and we get no other information than term frequency. To dig a bit deeper we can invoke the Corpus Terms tool.
goldbug.tool('corpusterms');
The top frequency terms are indicated here (albeit in a less visually interesting way than with Cirrus), but we also have access to additional functionality. Because this is a corpus with just one document, some of the columns are less relevant (like Trend, which shows the distribution of terms across several documents). However, we do have the possibility of scrolling down the list to see additional terms.
One thing we might notice is that there's an entry for "jupiter" (one of the character names that appears capitalized in the text), but also entries for "jupiter's" and "jup" (and possibly others). In fact, we can ask what terms appear that have the prefix "jup*".
In the code below we demonstrate the syntax for providing a query to a requested tool – the general form is something like this (where the options are defined by Corpus Terms):
corpus.tool('toolname', { options } );
The syntax of the actual query is defined by the search box (you can hover over the question mark in the search box for more information).
We also demonstrate here the use of the height and width parameters when embedding a tool (we don't need to take up the full size given to a tool by default).
goldbug.tool('corpusterms', {query: "^jup*,jup*", width: "400px", height: "250px"});
The results can be broken down into the two query parts (separated by a comma):
- jup*: the total frequency of all terms that start with jup
- ^jup*: the frequency of each term that starts with jup ("jupiter", "jupiter's" and "jup")
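The difference between the two query forms can be sketched in plain JavaScript as follows. The prefixQuery function and its parsing rules are hypothetical, and the counts are invented rather than taken from the corpus; the point is only to contrast a collapsed total with a per-term listing.

```javascript
// Naive sketch of the two query forms; the parsing here is an assumption,
// not Voyant's actual query engine.
function prefixQuery(counts, query) {
  const expand = query.startsWith("^"); // "^jup*" lists each matching term
  const prefix = query.replace(/^\^/, "").replace(/\*$/, "");
  const matches = Object.entries(counts).filter(([term]) => term.startsWith(prefix));
  if (expand) {
    return matches.map(([term, freq]) => ({ term, freq }));
  }
  // "jup*" collapses all matching terms into a single total.
  const total = matches.reduce((sum, [, freq]) => sum + freq, 0);
  return [{ term: query, freq: total }];
}

// Invented counts for illustration only (not taken from the corpus).
const inventedCounts = { jupiter: 5, "jupiter's": 2, jup: 3, legrand: 4 };
const collapsed = prefixQuery(inventedCounts, "jup*");
const expanded = prefixQuery(inventedCounts, "^jup*");
console.log(collapsed, expanded);
```

With these invented counts, the collapsed form yields one row whose frequency is the sum of the three rows in the expanded form.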
Exploring Term Distributions
Beyond term frequencies it can be interesting to explore how evenly terms are distributed in a corpus or in a document – is a given term more present at the beginning, middle or end? As with term frequencies, we have access to a convenient visualization for term distributions called Trends. Since our corpus is composed of a single document, Trends will show term distributions within that document (as opposed to term distributions across several documents in a corpus).
goldbug.tool('trends');
By default Trends will show the distribution of the top 5 terms in our document. Because we are in single document mode, we see the raw frequencies for 10 equal segments of text: Spyral divides the text into 10 equal parts and then counts frequencies in each part. We can use the "Segments" slider to get a finer-grained look at distribution, dividing the document into 20 segments, for instance.
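The segmenting idea can be sketched in plain JavaScript like this. The segmentCounts function is hypothetical, the token sequence is invented, and the way positions are mapped onto segments (especially when the text doesn't divide evenly) is an assumption of this sketch rather than a description of what Trends does internally.

```javascript
// Split a token list into equal segments and count a term in each one.
// How remainders are assigned is an assumption of this sketch.
function segmentCounts(tokens, term, segments) {
  const counts = new Array(segments).fill(0);
  tokens.forEach((token, i) => {
    if (token === term) {
      // Map the token position proportionally onto a segment index.
      const segment = Math.floor((i * segments) / tokens.length);
      counts[segment] += 1;
    }
  });
  return counts;
}

// Invented token sequence for illustration only.
const demoTokens = ["bug", "gold", "bug", "tree", "skull", "tree", "skull", "skull", "bug", "gold"];
const bugBySegment = segmentCounts(demoTokens, "bug", 5);
const skullBySegment = segmentCounts(demoTokens, "skull", 5);
console.log(bugBySegment, skullBySegment);
```

Increasing the number of segments gives a finer view, at the cost of smaller counts in each segment.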
We can click on the terms in the top legend to toggle them on and off. If we click (hide) "bug", "massa", and "said" we can see more clearly that the character names "legrand" and "jupiter" are used during the first half of the document much more than in the second half – something worth exploring perhaps. We can produce a more specific graph in a way similar to what we did for Corpus Terms, though this time we're using configuration options for Trends.
goldbug.tool('trends', {query: "jupiter,legrand"});
Another useful visualization for distribution is TermsRadio since it shows more terms than the Trends tool. The TermsRadio is intended to scroll across the distributions, but we can request to see all of the bins at once using the bins and visibleBins parameters.
goldbug.tool('termsradio', {bins: 10, visibleBins: 10, height: '600px'});
This tool reveals some potentially interesting things that we hadn't seen before. For instance, the word "parchment" is very high frequency, but limited to one portion of the text. Segment 9 also shows an intriguing set of words about words: "word", "characters", "word".
This tool is useful, but we're still somewhat limited to the highest frequency terms. However, we can request a grid of all the Document Terms. In contrast to the Corpus Terms seen earlier, Document Terms also shows distribution graphs on the right for each word.
goldbug.tool("DocumentTerms", {width: "400px", height: "300px"});
Exploring Term Collocates
We have looked at term frequencies and term distributions, but we have yet to explore how words are used in context. To do that we explore what are called collocates or terms that are co-located in proximity to terms of interest. As we have previously done, we can start with an interesting visualization and then drill down more from there. The visualization is called TermsBerry and it serves two primary purposes:
- just as with Cirrus, it shows the top frequency terms in our corpus
- when you hover over a term it shows information about other terms that occur nearby (collocates)
goldbug.tool('TermsBerry', {width: "600px", height: "600px"});
Hovering over different words to see collocates: hours of fun :). A somewhat similar tool is Collocates Graph, though it shows fewer terms and emphasizes the network of terms: the larger the labels ("nodes" in network terminology) the more frequent is the term and the thicker the line ("edge" in network graph terminology) the more times the terms co-occur.
goldbug.tool('CollocatesGraph', {width: "500px", height: "500px"});
The green terms are keywords and the red terms are collocates. You can fetch more collocates by increasing the value of the Context slider.
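As a rough illustration of what counting collocates within a context window means, here is a sketch in plain JavaScript. The collocates function and the token sequence are invented for illustration, and the windowing rule is an assumption; the actual tools may compute and weight collocates differently.

```javascript
// Count the terms found within `context` words of each occurrence of a keyword.
function collocates(tokens, keyword, context) {
  const counts = new Map();
  tokens.forEach((token, i) => {
    if (token !== keyword) return;
    const start = Math.max(0, i - context);
    const end = Math.min(tokens.length - 1, i + context);
    for (let j = start; j <= end; j++) {
      if (j === i) continue; // skip the keyword occurrence itself
      counts.set(tokens[j], (counts.get(tokens[j]) || 0) + 1);
    }
  });
  return counts;
}

// Invented token sequence for illustration only.
const sampleTokens = ["yes", "massa", "will", "said", "jupiter", "no", "massa", "will"];
const narrow = collocates(sampleTokens, "massa", 1);
const wide = collocates(sampleTokens, "massa", 2);
console.log(narrow, wide);
```

In this sketch a wider context finds more distinct collocates, which is the effect one would expect from raising the Context value.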
One term we might want to consider in more detail is "massa". In fact, we can right-click (or ctrl-click) on the term's label in the graph above and choose "Centralize". Or, of course, we can generate a new tool to do the same.
// FIXME: this doesn't yet work
goldbug.tool('CollocatesGraph', {centralize: "massa"})
A graph like this certainly helps to identify words that occur close to a term of interest, and in this case we notice several non-English words (or words that aren't standard English): dis, data, fer, ting, etc. If we've read the text there's no mystery here, but sometimes exploratory text analysis is used when there's too much text to read or when we want to determine what parts of the text are worth reading in more detail. If we want a sense of how "massa" is used, we can request the Contexts tool, which provides words to the left and right of each occurrence of our keyword.
goldbug.tool('Contexts', {query: "massa"})
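The idea behind a keyword-in-context listing can be sketched in plain JavaScript as follows. The keywordInContext function, its whitespace-based word splitting and the sample sentence are all assumptions made for illustration, not the Contexts tool's own logic.

```javascript
// List each occurrence of a keyword with `width` words of context on each side.
function keywordInContext(text, keyword, width) {
  const words = text.split(/\s+/).filter(Boolean);
  const rows = [];
  words.forEach((word, i) => {
    // Compare case-insensitively and ignore surrounding punctuation.
    const normalized = word.toLowerCase().replace(/^[^a-z]+|[^a-z]+$/g, "");
    if (normalized === keyword) {
      rows.push({
        left: words.slice(Math.max(0, i - width), i).join(" "),
        term: word,
        right: words.slice(i + 1, i + 1 + width).join(" "),
      });
    }
  });
  return rows;
}

// Invented sentence for illustration only.
const kwicRows = keywordInContext(
  "Yes, Massa Will, the bug is heavy. No, massa, it is gold.",
  "massa",
  2
);
console.log(kwicRows);
```

Each row keeps the left context, the keyword as it appears in the text, and the right context, which is the same three-part layout as a row in a contexts listing.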
The Contexts tool has a slider to adjust the amount of context that is shown for each occurrence, though we are still limited by the space available in the rows. If we want a true sense of the word's context, we can click on the plus sign in the left-most column, which will expand that particular occurrence. For instance, here's what we see in the first occurrence:
In these excursions he was usually accompanied by an old negro, called Jupiter, who had been manumitted before the reverses of the family, but who could be induced, neither by threats nor by promises, to abandon what he considered his right of attendance upon the footsteps of his young “Massa Will.” It is not improbable that the relatives of Legrand, conceiving him to be somewhat unsettled in intellect, had contrived to instil this obstinacy into Jupiter, with a view to the supervision and guardianship of the wanderer.
What we learn from this is that the character who says "Massa Will" is Jupiter and we suppose that the other words are attempts to convey Jupiter's dialect (as we see clearly in subsequent occurrences). It's worth mentioning that "Will" here is a character name, but we would likely have missed that since "will" is also a common verb and noun.
If we want to see groupings of words with "massa" we can use the Phrases tool or the more visual WordTree tool.
goldbug.tool('WordTree', {query: "massa"})
As with many tools, the size of the words matters, and we can see that to the left there are multiple occurrences of "No, Massa", "Why, Massa" and "Yes, Massa" (hover over a term for more information and click on it to expand or collapse it). To the right, as we've already seen, we have "Massa Will" and several other forms following a comma.
Exploring Text
In this simple example of exploring Poe's "Gold Bug" we've looked at various ways of studying term frequencies, distributions and collocates, but we haven't actually read the text (unless you read it at the end of the previous notebook, Creating a Corpus with Plain Text). Whether one reads a text first and then analyzes, or analyzes and then reads, or analyzes and reads at the same time, will depend on many factors, including the nature of the corpus and the nature of one's motivations for using Spyral. In any case, it's simple enough to start reading!
goldbug.tool('Reader');
Next Steps
Here are some exercises to try, based on the contents of this notebook:
- the "Gold Bug" is partly about cryptography: what keywords in Corpus Terms might help us identify and study that?
- the term "bug" is distributed unevenly: when (distribution) and how (collocates) is it used?
- create a new corpus from a different short story and start exploring it in the same way we did here
If you're working sequentially through the Art of Literary Text Analysis with Spyral Notebooks, the next notebook is Introducing Tables.