Creating a Corpus in a Spyral Notebook

This notebook will help you create a corpus with Spyral Notebooks (it's part of the Art of Literary Text Analysis with Spyral Notebooks). In particular, we'll look at:

creating a corpus in Voyant
creating a corpus from strings
creating a corpus from one or more URLs
    - creating an XML corpus
    - creating a plain text corpus
next steps

In text analysis (both with and without computers) a corpus refers to a collection (or body) of texts. A corpus can be a single poem, a set of plays, all English novels from the nineteenth century, or even a large collection of tweets. How you define documents within a corpus might change depending on your needs. For instance, it might be worth grouping all documents from a decade into one single document, giving ten documents for a corpus that spans a century.

We will consider three ways of creating a corpus in Spyral, using Voyant, strings, and URLs. Of these, using URLs is often preferable since it's the most reliable way for others to access the same data (assuming the contents of the URL don't move). But that's not always possible, and sometimes uploading files is necessary. As we will see, it may also be possible to use a data service like Gist or Pastebin to host certain formats of text at a URL.

Creating a Corpus in Voyant

Most of the techniques below describe ways of using Spyral to create a corpus programmatically, in other words with a bit of code. But it’s also possible to create a corpus using the usual Voyant interface and then to refer to that corpus in a Spyral Notebook (more about that below). Creating a corpus programmatically provides more flexibility (including options that may not be available in the user interface) and it also ensures that your Spyral Notebook is fully self-contained, which makes it more portable since it can run on different instances more easily. Then again, using the Voyant interface can be simpler and more familiar.

You might experiment first with the Voyant interface technique for creating a corpus and then rewrite parts of your Spyral notebook to use URLs instead if you're looking to share your notebook with others. If you use Voyant to create your corpus, you can still describe the process in detail in your Spyral Notebook so that others may try to reproduce or adapt your work using the same options.

When you first open the home page of Voyant (be sure to be on the same server as your instance of Spyral) you can begin by clicking on the options icon.

There’s a fairly dizzying number of options available (all of which are also available when you create your corpus programmatically) and you’re strongly encouraged to refer to the Voyant documentation on creating a corpus.

Once you’ve defined the input options you can create the corpus either by filling in the text box (with text or with one or more URLs, one per line) or by clicking the upload button.

Inputting Text or File Upload

Once your corpus is loaded (assuming everything worked properly) you’ll see the regular Voyant interface, but we’re actually interested in the URL bar where part of the address should now contain a valid corpus ID that we can use in Spyral on the same server.

Selecting Corpus ID

Copy the corpus ID (only) into the clipboard (it will be different from the one in the screenshot above but it should have a similar format). Now we can use that ID in Spyral: try pasting the ID into the code below, replacing “austen” (but be sure to leave the quotes in place). Then click the run button in the left toolbar.

loadCorpus("austen")

This corpus (austen) has 8 documents with 781,763 total words and 15,368 unique word forms.
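
If you'd rather not pick the ID out of the address by eye, a few lines of JavaScript can do it. This is only a sketch: the address is a made-up example, and it assumes the ID appears as the value of a corpus parameter in the address, which may not hold on every server. It prints with console.log so that it also runs outside Spyral.

```javascript
// Hypothetical address copied from the browser after creating a corpus
var address = "https://voyant-tools.org/?corpus=austen";

// Assumption: the corpus ID is the value of the "corpus" query parameter
var corpusId = new URL(address).searchParams.get("corpus");

console.log(corpusId); // austen
// In Spyral, the ID could then be used directly: loadCorpus(corpusId);
```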

Creating a Corpus from Strings

Above we looked at using the regular Voyant interface to create a corpus, but we can also create a corpus directly and programmatically in Spyral. One very simple way to create a corpus in Spyral is with a set of one or more strings. This isn't especially practical for longer texts, and there are certain limitations since the strings need to be defined using either single or double quotes, usually on a single line.

Here's an example where the string "Hello Spyral!" is used directly to create a corpus:

loadCorpus("Hello Spyral!");

This corpus (20281564ba51492a0ab0565ce2b0a3c9) has 1 documents with 2 total words and 2 unique word forms.

In JavaScript (used by Spyral), a string can be defined in several ways, but the most common are with single or double quotes. In the example below, we assign the string to a variable name: a name that we choose (following certain rules) and that we want to make both short and meaningful. We also use the keyword var, which isn't strictly needed but is good programming practice because it declares the variable's scope (where the variable is recognized).

This is functionally equivalent to the previous code (note the two forward slashes (//), which start a comment that's ignored by the computer).

var greeting = "Hello Spyral!"; // declare a string into a variable
loadCorpus(greeting); // create a corpus with our variable

This corpus (20281564ba51492a0ab0565ce2b0a3c9) has 1 documents with 2 total words and 2 unique word forms.
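
To make the point about single and double quotes concrete, here is a small sketch in plain JavaScript (the variable names are invented for illustration, and console.log is used so the snippet also runs outside Spyral). Both kinds of quotes produce the same string, and choosing one kind lets the other appear inside the string without any special treatment.

```javascript
var doubleQuoted = "Hello Spyral!"; // string defined with double quotes
var singleQuoted = 'Hello Spyral!'; // same string defined with single quotes

console.log(doubleQuoted === singleQuoted); // true

// single quotes on the outside allow double quotes on the inside
var spoken = 'She said "Hello Spyral!"';
console.log(spoken);
```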

We can also create a corpus with two or more strings by using an array argument. An array is an ordered list of values; in this case the values are strings. Just as with strings, there are multiple ways of creating an array, but one of the most common is by using square brackets.

var greeting = "Hello Spyral!"; // declare a first string
var reply = "Well hello there!"; // declare a second string
var texts = [greeting, reply]; // combine the two strings into an array
loadCorpus(texts); // same as new Corpus([greeting, reply]).show();
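
Since arrays come up often when building a corpus, here is a short sketch of a few basic array operations in plain JavaScript. The third string is invented for illustration, and console.log is used so the snippet also runs outside Spyral.

```javascript
var texts = ["Hello Spyral!", "Well hello there!"]; // an array of two strings

console.log(texts.length); // 2
console.log(texts[0]); // Hello Spyral! (counting starts at zero)

texts.push("Goodbye for now!"); // add a third string to the end
console.log(texts.length); // 3
```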

Again, it's fairly impractical to work with strings of any length in JavaScript. There are some techniques for multiline strings, but it's safest to combine strings together with the plus sign (concatenation). In the example below we also demonstrate escaping the double quote that's part of the string (but that's also used to define the start and end of the string).

var greeting = "And then she asked \"What do " + // join two strings with the plus sign
    "you mean?\" in that voice."; // note the use of \" to escape the double quote
show(greeting);

And then she asked "What do you mean?" in that voice.
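
For reference, here is a sketch of two of the other techniques for longer strings, written in plain JavaScript with console.log so it also runs outside Spyral. A template literal (delimited by backticks) can span several lines and keeps the line break as part of the string, while joining an array of pieces gives you control over the separator. Whether either suits your text is a judgment call; concatenation remains the approach used in this notebook.

```javascript
// a template literal keeps the line break and needs no escaping of double quotes
var multiline = `And then she asked "What do
you mean?" in that voice.`;
console.log(multiline.split("\n").length); // 2 (the string contains two lines)

// joining an array of pieces with a space as the separator
var pieces = ["And then she asked \"What do", "you mean?\" in that voice."];
var joined = pieces.join(" ");
console.log(joined); // And then she asked "What do you mean?" in that voice.
```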

One situation where creating a corpus from a string can be useful is when we want to process a text and then feed it back in as a new corpus. The code snippet below demonstrates a few techniques that we won't explain in detail yet, but it's just a demonstration of what's possible:

// create a corpus with a URL (hermeneuti.ca) and get the plain text
loadCorpus("http://hermeneuti.ca").text().then(text => {
    // modify the plain text to replace "the" by "*THE*"
    text = text.replace(/\bthe\b/g, "*THE*");
    // create a new corpus (with a string) and embed the reader tool
    loadCorpus(text).tool('reader');
})
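
To see what the replacement step does on its own, here is a sketch that applies the same pattern to a short sample sentence (invented for illustration) in plain JavaScript. The \b markers restrict matches to whole words, the g flag replaces every match rather than only the first, and adding the i flag would make the match ignore case.

```javascript
var sample = "The cat and the dog heard the other bark.";

// whole-word, case-sensitive: "The" and the "the" inside "other" are left alone
var marked = sample.replace(/\bthe\b/g, "*THE*");
console.log(marked); // The cat and *THE* dog heard *THE* other bark.

// adding the i flag also catches the capitalized "The"
var markedAnyCase = sample.replace(/\bthe\b/gi, "*THE*");
console.log(markedAnyCase); // *THE* cat and *THE* dog heard *THE* other bark.
```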

Creating a Corpus from URLs

The preferred way of creating a corpus in Spyral is to use URLs. The benefits are that the source texts can be accessed by other people and the corpus can be more easily re-created if it's ever lost or damaged. The disadvantage is that the content at a URL can change or disappear. We've already seen several examples, and we can use either a single URL or an array of URLs (just as we did with strings).

// one URL as a single string argument
loadCorpus("http://hermeneuti.ca").summary(); 

// an array of URLs as an argument
loadCorpus(["http://hermeneuti.ca", "https://en.wikipedia.org/wiki/Voyant_Tools"]).summary()

// a variable to contain URLs that are then used as a single argument
var urls = ["http://hermeneuti.ca", "https://en.wikipedia.org/wiki/Voyant_Tools"];
loadCorpus(urls).summary();
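
When several documents live at the same location, the array of URLs can be built rather than typed out. The sketch below does this in plain JavaScript; the base address and filenames are hypothetical placeholders, so substitute real ones before passing the result to loadCorpus.

```javascript
// Hypothetical location and filenames, invented for illustration
var base = "https://example.org/texts/";
var files = ["chapter1.txt", "chapter2.txt", "chapter3.txt"];

// map builds a new array by applying the function to each filename
var urls = files.map(file => base + file);

console.log(urls.length); // 3
console.log(urls[0]); // https://example.org/texts/chapter1.txt
// In Spyral, with real addresses: loadCorpus(urls).summary();
```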

In addition to specifying one or more URLs it's also possible to specify parameters to be used for preparing the corpus. These parameters are the same as the ones used in the Voyant options interface (though there are even more options available programmatically in Spyral). Many of the options will depend on the format of the documents in your corpus. For a fuller list of options see the Corpus documentation.

Creating an XML Corpus

URLs can point to documents in a variety of formats supported by Spyral, including HTML, XML, RTF, MS Word, MS Excel, OpenOffice and compressed zip files containing documents. The XML handling of Spyral is particularly powerful as you can define very precisely how to treat each part of the document. For example, if we visit http://www.cbc.ca/cmlink/rss-topstories we find an XML document that is also an RSS feed, that is, a collection of items. Because the root element of this document is <rss>, Spyral assumes by default that you want each item to be its own document.

loadCorpus("http://www.cbc.ca/cmlink/rss-topstories"); // treated as an RSS feed by default

This corpus has 20 documents with 2,467 total words and 866 unique word forms. Created now.

However, we can tell Spyral to treat the document differently. For instance, we might want all of the items to be combined into one document. We can do that by providing a second argument to loadCorpus. The first argument specifies the source of the documents; the optional second argument specifies processing parameters.

loadCorpus("http://www.cbc.ca/cmlink/rss-topstories", {
    inputFormat: 'xml', // force XML (not RSS)
    xmlContentXpath: '//item/description' // grab item description for content
});

This corpus has 1 document with 1,716 total words and 691 unique word forms. Created now.

Now we have all the description text in just one document.

Let's look more closely at the syntax.

loadCorpus( "string", {
  name: "value",
  name: "value"
} );

The first argument is a string, the URL of our source (http://www.cbc.ca/cmlink/rss-topstories). The second argument is a data type called an object literal. It is declared with curly brackets that enclose a set of name: value pairs (note that the name doesn't need to be quoted, but a value that is a string does). Pairs are separated by commas (with no comma needed after the last one). It's easy to make a mistake by forgetting a comma or a curly bracket, but you can experiment with the examples below and try your own.

var oneValue = {inputFormat: 'xml'};
var twoValues = {inputFormat: 'xml', xmlContentXpath: '//item/description'};
var twoValuesBis = { // exact same as above, just formatted differently
    inputFormat: 'xml',
    xmlContentXpath: '//item/description'
};
show(twoValuesBis.inputFormat); // example of using an object's value

xml
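
Here is a further sketch of working with object literals in plain JavaScript. The names in it are invented for illustration and are not Spyral options. It shows that string values are quoted while numbers and booleans are written without quotes, and that a value can be read with either dot or bracket notation.

```javascript
// Invented names for illustration only; these are not Spyral options
var settings = {label: "demo", count: 3, enabled: true};

console.log(settings.label); // demo (dot notation)
console.log(settings["count"]); // 3 (bracket notation, with the name quoted)
console.log(typeof settings.count); // number (no quotes were needed for 3)

// Object.keys lists the names in the object
var names = Object.keys(settings);
console.log(names.length); // 3
```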

Creating a Plain Text Corpus

Above we considered how to create a corpus from a URL with XML content. Now let's consider an example with plain text, namely Edgar Allan Poe's short story "The Gold Bug." Several editions of the text exist online (and we could use them with a URL), but we will demonstrate using the Project Gutenberg edition. One of the first things to know is that Project Gutenberg sometimes blocks Spyral from fetching URLs from its servers, so we will start by making a copy of the document at another URL. We can outline our procedure as follows:

find the document we want in plain text from Project Gutenberg
make a copy (legally) of the document in a Gist
determine which parts of the document are relevant
create a corpus with parameters that define the relevant document

The "Gold Bug" is part of Volume 1 of the Works of Edgar Allan Poe on Project Gutenberg. We have a choice of formats at this location, but for our example we will choose the Plain Text UTF-8. We have found our document.

Next we will copy the contents of the page (not the URL, the full text) into the clipboard and then paste it into the main (bigger) text box in a new Gist. Let's also specify a filename, such as goldbug.txt. Finally, we'll click the "Create public gist" button. (Again, the Project Gutenberg license allows this kind of copying of content.)

After clicking "Create public gist" you will see a page that contains a button for the raw text – that's the URL we want (click on the "raw" button and then copy and paste the URL from the address bar of your browser).

You may be doing this with your own text, or you can use the URL from the "Gold Bug" (yes, it's a bit long):

https://gist.githubusercontent.com/sgsinclair/84c9da05e9e142af30779cc91440e8c1/raw/goldbug.txt

We could use the URL right away to create a Spyral corpus, but let's remember that this document is a volume that contains several works and we only want "The Gold Bug." In order to isolate the part we want, we can scroll to the start of the short story where we find this:

the moon, or in the eye of the spectator, but must be looked for in
something (an atmosphere?) existing about the moon.

THE GOLD-BUG

          What ho! what ho! this fellow is dancing mad!
               He hath been bitten by the Tarantula.
                    _--All in the Wrong._

MANY years ago, I contracted an intimacy with a Mr. William Legrand.

The text "about the moon" is the end of the previous text, so we can say that we want to start with "THE GOLD-BUG" (capitalized, which is not the case elsewhere like in the table of contents at the top). We'll do a very similar operation to find the end of the text that we want:

required a dozen--who shall tell?”

FOUR BEASTS IN ONE--THE HOMO-CAMELEOPARD

                     Chacun a ses vertus.
                        --_Crebillon’s Xerxes._

ANTIOCHUS EPIPHANES is very generally looked upon as the Gog of the

So we know we want to remove all the text until "THE GOLD-BUG" and then all the text starting with "FOUR BEASTS IN ONE". Note that we don't need to specify many options (like inputFormat), because Spyral is fairly good at determining default values.

loadCorpus("https://gist.githubusercontent.com/sgsinclair/84c9da05e9e142af30779cc91440e8c1/raw/goldbug.txt", {
    inputRemoveUntil: 'THE GOLD-BUG',
    inputRemoveFrom: 'FOUR BEASTS IN ONE'
}).tool('reader'); // let's look at the reader to ensure we have the right text
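
To get a feel for what this trimming amounts to, here is a sketch that imitates it on a tiny stand-in string in plain JavaScript. The helper trimBetween is hypothetical and not part of Spyral, and the sketch assumes that the start marker itself is kept and that everything from the end marker onward is dropped, as described above. It says nothing about how Spyral applies these options internally.

```javascript
// Tiny stand-in for the full volume (shortened and invented for illustration)
var volume = "end of the previous text\n\nTHE GOLD-BUG\n\nMANY years ago\n\n" +
    "FOUR BEASTS IN ONE\n\nANTIOCHUS EPIPHANES";

// Hypothetical helper: keep the text from startMarker up to (not including) endMarker
function trimBetween(text, startMarker, endMarker) {
    var start = text.indexOf(startMarker);
    if (start === -1) {
        start = 0; // start marker not found: keep from the beginning
    }
    var end = text.indexOf(endMarker, start);
    if (end === -1) {
        end = text.length; // end marker not found: keep through the end
    }
    return text.slice(start, end);
}

var story = trimBetween(volume, "THE GOLD-BUG", "FOUR BEASTS IN ONE");
console.log(story);
```

Running a check like this on a small sample first can help confirm that the markers you picked are distinctive enough before you apply them to the full text.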

Next Steps

Here are some exercises to try, based on the contents of this notebook:

use the Voyant interface to create a new corpus with different file formats such as PDF and MS Excel
create a new corpus with 3 URLs
create a new corpus from a different text in Project Gutenberg (be sure to create a copy and use the URL in Gist)

If you're working sequentially through the Art of Literary Text Analysis with Spyral Notebooks, the next notebook is Exploring a Smaller Corpus.