This set of exercises is a follow-up to the Word Counting Lab. In that project you read in lines from a work of literature and then calculated some basic statistics based on the number of words and number of lines in the text. In these exercises, you will be doing additional analysis of the text.
Add a visual delimiter (e.g., -------------- or === New Exercises ===) to your output to show the "shifting of gears" from the previous exercises to the new exercises. (You may want to add something similar in a comment in your code, too, to set the two sets of exercises off from each other.)
In the previous lab, you calculated the average word length in a text. In the next set of exercises, you will do more interesting textual analysis.
This question is easy to answer, and doesn't even require that you step through and separately read all the lines of the file. Read the WordReader class documentation to find a method that returns a list of all the words in the document. Make sure to read the Method Details, not just the Summary, to fully understand what this method will return. How can you determine how many distinct words the author uses? Remember that the ArrayList class has a size method.
The output might look something like this:
SomeFile.txt has 8083 distinct words.
NOTE: The words in the list of distinct words returned by the reader object are always converted to lower case so that differences in case don't mark words as being different. For example, "The" and "the" are the same word and will only appear as "the" in the list of words. This means, though, that words that are usually capitalized, such as proper names, will look odd; "Gutenberg" will become "gutenberg," for example.
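The whole exercise reduces to asking the word list for its size. A minimal sketch of the idea, where a hard-coded list stands in for whatever the WordReader method actually returns (the real method name is in the WordReader documentation):

```java
import java.util.List;

public class DistinctWords {
    // Returns how many distinct words there are, given the list of
    // distinct words. In the lab, this list would come from the
    // WordReader method you found in the documentation.
    static int countDistinct(List<String> distinctWords) {
        return distinctWords.size();   // one entry per distinct word
    }

    public static void main(String[] args) {
        // Stand-in data; the real list comes from the reader object.
        List<String> words = List.of("the", "cat", "sat", "mat");
        System.out.println("SomeFile.txt has "
                + countDistinct(words) + " distinct words.");
    }
}
```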
Another way to measure the range of vocabulary in a text is to count how many words appear only once. Create a variable to sum up the "singletons" (words that appear just once). Read the WordReader class documentation to find a method that will tell you how many times a given word occurs in the text. Loop through the word list you retrieved in the previous step; whenever you come across a word that appears just once in the text, increment your summing variable. (How will you increment your summing variable only if the word count is 1?) Outside the loop, report on how many singletons there are in this document.
The output might look something like this:
There are 3544 words that occur only once in SomeFile.txt.
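The loop described above might be sketched like this. To keep the sketch self-contained, the per-word occurrence counts come from a Map; in the lab you would instead ask the reader object, using the counting method you found in the WordReader documentation:

```java
import java.util.List;
import java.util.Map;

public class Singletons {
    // Counts how many of the distinct words occur exactly once.
    // (In the lab, replace counts.get(word) with a call to the
    // reader object's word-counting method.)
    static int countSingletons(List<String> distinctWords,
                               Map<String, Integer> counts) {
        int singletons = 0;
        for (String word : distinctWords) {
            if (counts.get(word) == 1) {   // increment only when the count is 1
                singletons++;
            }
        }
        return singletons;
    }

    public static void main(String[] args) {
        // Stand-in data; the real list and counts come from the reader.
        List<String> distinct = List.of("the", "cat", "sat");
        Map<String, Integer> counts = Map.of("the", 2, "cat", 1, "sat", 1);
        System.out.println("There are " + countSingletons(distinct, counts)
                + " words that occur only once in SomeFile.txt.");
    }
}
```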
Yet another way to measure the complexity of the vocabulary in a text is to measure the use of long, generally sophisticated, words. You can do this by looping through the list of distinct words, checking whether they are longer than 14 characters, and incrementing a counter if they are. To make it more interesting, print the long words as you find them. You may want to do this using System.out.print rather than println, putting a space rather than a newline between words. Then, after the loop, use a println to get to the next line and another println to print out the total count. (You may also want to print an "intro" phrase before you enter the for loop, as in the example below.)
Sample output:
Words that are more than 14 characters long in SomeFile.txt: conventionalities inconsequential characteristics characteristics improbabilities characteristics indistinguishable accomplishments accomplishments disproportionately representations MERCHANTABILITY unenforceability
SomeFile.txt has 13 words that are more than 14 characters long.
NOTE 1: Some of the long words you see might not be from the original text. For example, doing this exercise in a Project Gutenberg version of Alice in Wonderland results in three words longer than 14 characters, none of which actually come from the book. All three are in the Project Gutenberg license information ("representations," "MERCHANTABILITY," and "unenforceability").
NOTE 2: The words in the list of distinct words returned by the reader object are not in the original order from the file, which makes it harder to tell which words come from the actual text and which come from the extra licensing information provided by Project Gutenberg.
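One way the print-as-you-go loop might look, using print inside the loop and println after it, with a hard-coded list standing in for the reader's distinct-word list:

```java
import java.util.List;

public class LongWords {
    // Prints every word longer than 14 characters on one line,
    // then returns how many there were.
    static int reportLongWords(List<String> distinctWords) {
        int longCount = 0;
        System.out.print("Words that are more than 14 characters long:");
        for (String word : distinctWords) {
            if (word.length() > 14) {
                System.out.print(" " + word);   // space, not newline, between words
                longCount++;
            }
        }
        System.out.println();                   // finish the line of long words
        return longCount;
    }

    public static void main(String[] args) {
        // Stand-in data; the real list comes from the reader object.
        List<String> words = List.of("conventionalities", "cat",
                                     "indistinguishable");
        int n = reportLongWords(words);
        System.out.println("This text has " + n
                + " words that are more than 14 characters long.");
    }
}
```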
Optional Challenge Exercise: (Do this only after completing the other exercises in this lab.) If you want to see the list of words in their proper order (optional!), you can find them by reading in every line in the file, breaking the line into words (as you did when you calculated the number of words), and then checking all the words and incrementing the counter. If you do this, don't forget to start by getting a new WordReader object so that you start reading lines from the beginning of the file.
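The line-by-line version could be sketched as below. Hard-coded strings stand in for the lines a fresh WordReader would hand you one at a time; splitting on whitespace is one simple way to break a line into words (the previous lab's technique may differ in detail):

```java
public class LongWordsInOrder {
    // Splits one line into words and prints any longer than 14
    // characters, returning how many were found on that line.
    // Because lines are processed in file order, the long words
    // come out in their original order.
    static int longWordsInLine(String line) {
        int count = 0;
        for (String word : line.trim().split("\\s+")) {  // break line on whitespace
            if (word.length() > 14) {
                System.out.print(word + " ");
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in lines; in the lab, each would come from a new
        // WordReader started at the beginning of the file.
        String[] lines = { "an indistinguishable figure",
                           "stood by the lighthouse" };
        int total = 0;
        for (String line : lines) {
            total += longWordsInLine(line);
        }
        System.out.println("\nTotal: " + total);
    }
}
```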
Another interesting analysis is how often an author uses a particular word, including variations. Loop through the full list of words checking to find words that contain a particular base word, like "house" or "time", and printing the ones that do. (Or you could try "cat" if you don't have any luck with "house" or "time", although many words will contain that three-letter sequence even if they have nothing to do with cats!)
Tip: Read the String Methods quick reference to find a method that will be useful.
It is even more interesting to know how many times each of those words occurs. As you print a word that contains the base, ask the reader object how often this word occurs.
Sample output:
Words containing 'house': housemaid (1); house (120); household (8); farmhouse (1); houses (13); housekeeper (3); westhouse (2); warehouse (1); outhouse (1); lighthouse (1);
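A sketch of the matching-and-counting loop, in the "word (count);" style of the sample output. The String method contains does the matching; the occurrence counts come from a Map here to keep the example self-contained, whereas in the lab you would ask the reader object:

```java
import java.util.List;
import java.util.Map;

public class BaseWords {
    // Prints each distinct word containing the base, with its
    // occurrence count, and returns how many matching words there were.
    static int reportWordsContaining(String base, List<String> distinctWords,
                                     Map<String, Integer> counts) {
        int matches = 0;
        System.out.print("Words containing '" + base + "': ");
        for (String word : distinctWords) {
            if (word.contains(base)) {   // the useful String method
                System.out.print(word + " (" + counts.get(word) + "); ");
                matches++;
            }
        }
        System.out.println();
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in data; the real list and counts come from the reader.
        List<String> distinct = List.of("house", "houses", "mouse", "lighthouse");
        Map<String, Integer> counts = Map.of("house", 120, "houses", 13,
                                             "mouse", 4, "lighthouse", 1);
        reportWordsContaining("house", distinct, counts);
    }
}
```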
When you have your code working, you could experiment with words that might be more specific to your book, like "detect" or "chess" or "freedom" or "dance."
Reminder: Many of these exercises are more interesting if you compare results from a couple of different books from Project Gutenberg, choosing books that are very different from each other, so that you can compare them along the way or at the end. (Don't forget to download Plain Text versions.)
Another variation on the word frequency analysis would be to measure how often various characters (as in personages, not letters) are mentioned in the book. This is a way of quantitatively measuring the relative importance or prominence of various characters.
Sample output:
There are 467 references to Holmes in SherlockHolmes.txt.
There are 81 references to Watson in SherlockHolmes.txt.
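One way to total a character's references, assuming you want variations (possessives and the like) included: sum the counts of every distinct word containing the name. The name is lower-cased because, as noted earlier, the reader's word list is all lower case. Again, a Map stands in for the reader object's counting method:

```java
import java.util.List;
import java.util.Map;

public class CharacterMentions {
    // Totals the occurrence counts of every distinct word containing
    // the character's name, so variations like "holmes's" are included.
    // (In the lab, ask the reader object for each word's count.)
    static int referencesTo(String name, List<String> distinctWords,
                            Map<String, Integer> counts) {
        String base = name.toLowerCase();  // reader's word list is lower case
        int references = 0;
        for (String word : distinctWords) {
            if (word.contains(base)) {
                references += counts.get(word);
            }
        }
        return references;
    }

    public static void main(String[] args) {
        // Stand-in data; the real list and counts come from the reader.
        List<String> distinct = List.of("holmes", "holmes's", "watson");
        Map<String, Integer> counts = Map.of("holmes", 400,
                                             "holmes's", 67, "watson", 81);
        System.out.println("There are " + referencesTo("Holmes", distinct, counts)
                + " references to Holmes in SherlockHolmes.txt.");
    }
}
```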
You will be adding to this program in the Word Frequency project, but submitting it at this point will allow you to get feedback before you submit a final version.
Before submitting, rename your project to something like YourName_Lab1. (Do this from the Mac Finder or Windows Explorer, not from within BlueJ.) This will help whoever grades it, since they will receive a dozen or more projects with similar names.