Textual Analysis Lab Series

Lab: Counting Words

Employing Loops, ArrayLists, and Strings


In this lab you will enhance your previous WordCounter program to perform some very simple computational textual analysis of the book, in the style of the digital humanities. (Or, to put it more simply: word counting.)

Parsing Words in a Line

As a first exercise in analyzing words in a text, let's calculate approximately how many words there are in a line in this book. We could start by counting how many words there are in the first line.
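One possible way to count the words in a single line is sketched below. It assumes that a "word" is any run of characters separated by whitespace, so punctuation stays attached to its word. The class name LineWordCount and the method countWords are made up for this illustration and are not part of the lab's starter code; your instructor may expect you to find the words with a loop over the characters instead.

```java
// A sketch only: LineWordCount and countWords are illustrative names,
// not part of the lab's starter code.
class LineWordCount {

    // Counts whitespace-separated tokens; punctuation stays attached to words.
    static int countWords(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return 0;   // split() on an empty string would report one token
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String line = "summons to Odessa in the case of the Trepoff murder, of his clearing up";
        System.out.println("Length of this line is " + line.length() + ".");
        System.out.println("There are " + countWords(line) + " words in the line.");
    }
}
```

Trimming first matters: without it, a line that begins with spaces would produce an empty leading token and the count would be one too high.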

Average Words per Line

We might wonder whether the number of words on a single line is representative of the book as a whole. Let's calculate the average number of words per line over the extended quote (20 lines or so) you printed in the previous lab.
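The averaging step might look something like the sketch below. It assumes the lines of your extended quote are already stored in a list of Strings; the three short lines in main are placeholders, not text from any book, and the names AverageWordsPerLine and averageWordsPerLine are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// A sketch only: class and method names are illustrative.
class AverageWordsPerLine {

    // Counts whitespace-separated tokens in one line.
    static int countWords(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Returns total words divided by number of lines, or 0.0 for an empty list.
    static double averageWordsPerLine(List<String> lines) {
        if (lines.isEmpty()) {
            return 0.0;
        }
        int totalWords = 0;
        for (String line : lines) {
            totalWords += countWords(line);
        }
        // Cast before dividing; int / int would throw away the fraction.
        return (double) totalWords / lines.size();
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("one two three");          // placeholder lines standing in
        lines.add("four five");              // for the extended quote
        lines.add("six seven eight nine");
        System.out.println("On average, about " + averageWordsPerLine(lines)
                + " words per line.");
    }
}
```

The cast to double is the detail most worth noticing: dividing two int values in Java performs integer division, which would report 12 rather than 12.275.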

Average Word Length

The number of words per line is really about how the book is formatted, and doesn't say anything about the author's word choices or the target audience for the book. It might be more meaningful to look at the average word length.
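As a rough sketch, the average word length can be computed by adding up the length of every word and dividing by the number of words. The version below again assumes that words are separated by whitespace, which means attached punctuation such as a trailing comma is counted as part of the word; whether to strip punctuation first is a design decision left to you. The names AverageWordLength and averageWordLength are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// A sketch only: class and method names are illustrative.
class AverageWordLength {

    // Returns total characters in all words divided by the number of words,
    // or 0.0 if the lines contain no words at all.
    static double averageWordLength(List<String> lines) {
        int totalChars = 0;
        int totalWords = 0;
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {      // a blank line yields one empty token
                    totalWords++;
                    totalChars += word.length();
                }
            }
        }
        return totalWords == 0 ? 0.0 : (double) totalChars / totalWords;
    }

    public static void main(String[] args) {
        // Placeholder lines, not text from any book.
        List<String> lines = Arrays.asList("a bb ccc", "dddd");
        System.out.println("Words are about " + averageWordLength(lines)
                + " characters long.");
    }
}
```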

At this point, your program prints several pieces of information about the book you chose. Your output might look something like:

    Welcome to the Word Counting Mini-Lab.
    The first line in SherlockHolmes.txt is: 
       Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
    The length of the first line is 76.  There are 12 words in the first line.
    Skipping 60 lines: ............................................................
    Line 62: 'summons to Odessa in the case of the Trepoff murder, of his clearing up'
    Length of this line is 71.  There are 14 words in the line.
    On average, SherlockHolmes.txt has about 12.275 words per line.
    Words are about 3.9898167 characters long.

Analyzing the Whole Work

Is This Meaningful Analysis?

Counting words and computing the average word length may not seem to provide any useful information, but you will see some interesting results if you compare works written in different centuries, in different genres, in different languages, or for different audiences. For example, if you compare Shakespeare's Hamlet (1603), Jane Austen's Pride and Prejudice (1813), Frederick Douglass's speech "What to the Slave Is the Fourth of July?" (1852), and Lewis Carroll's Alice's Adventures in Wonderland (1865), which was written for children, you will see interesting differences in vocabulary. You can download any of these, or others, from Project Gutenberg. Choose books that are very different from each other, so that you can compare them along the way or at the end. (Don't forget to download the Plain Text versions.)

Zip and Submit Your Program

You will be adding to this program, but submitting it at this point will allow you to get feedback before you submit a final version.

*Alternative Loop for Reading Entire File

Rather than repeat the code to get the next line in both the initialization and step parts of a traditional for loop, experienced Java programmers will often write a while loop that combines getting the next line and testing that it is not null in a single, more complex expression. This works because an assignment in Java is itself an expression whose value is the value that was assigned.

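// reader and nextLine are assumed to be declared earlier in your program
// (as a WordReader and a String, respectively).
reader = new WordReader(filename);
// The inner parentheses assign the next line to nextLine; the result of
// that assignment is then compared to null to decide whether to continue.
while ( (nextLine = reader.getNextLine()) != null )
{
    // Do something in the loop.
}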
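If you want to try this idiom outside the lab, here is a self-contained sketch that uses the standard library's BufferedReader in place of WordReader, since its readLine method also returns null when there are no more lines. This is a stand-in for illustration, not the lab's WordReader class, and the names WholeFileLoop and countLinesAndWords are invented. To keep the example runnable without a file, it reads from a String; to read a real file you could wrap a FileReader instead of the StringReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// A sketch only: BufferedReader stands in for the lab's WordReader class.
class WholeFileLoop {

    // Returns a two-element array: {number of lines, number of words}.
    static int[] countLinesAndWords(String text) {
        int lineCount = 0;
        int wordCount = 0;
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String nextLine;
            // Read a line and test it for null in one expression.
            while ((nextLine = reader.readLine()) != null) {
                lineCount++;
                String trimmed = nextLine.trim();
                if (!trimmed.isEmpty()) {
                    wordCount += trimmed.split("\\s+").length;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new int[] { lineCount, wordCount };
    }

    public static void main(String[] args) {
        // Placeholder text, not taken from any book.
        int[] counts = countLinesAndWords("first line here\n\nthird line");
        System.out.println(counts[0] + " lines, " + counts[1] + " words");
    }
}
```

Note that the blank line in the middle still counts as a line, which will pull the average words per line down; whether to skip blank lines is up to you.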