Textual Analysis Lab Series
Word Frequency Project
Finding Maximum Values
This set of exercises is a follow-up to the Word Counting labs.
In these exercises, you will be doing additional textual analysis,
finding the most frequently used words in the text.
Add these exercises to your existing project, but print an extra newline,
a string of dashes (e.g., --------------), or another visual delimiter
(e.g., === New Exercises ===) to your output to set the new exercises off
from the old.
(You may want to add something similar in a comment in your code, too, to set
the two sets of exercises off from each other.)
Frequency Analysis
You have already identified how many words occur only once in the text, but
what word occurs most frequently? This is an example of a classic Extreme
Value problem (find the minimum or maximum value in a list).
-
Find the most frequent word.
To find a minimum or maximum value, you go through a list comparing
new items to whatever minimum or maximum you have identified so
far, changing to a new extreme value when you find a value that is
smaller or larger (depending on which extreme you are looking for).
The tricky part of this algorithm is selecting an initial
candidate. One approach is to assume for the moment that the
first one is the extreme value and then look for any that
are more extreme. The other approach is to pick a value that is
extreme in the other direction as the starting point. For example,
if we're looking for the largest value in a list of positive
integers, we could start by
pretending that a negative number is the maximum; any value in the
list will be larger than that.
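The two initialization strategies above can be sketched in plain Java. The hard-coded array here is just a stand-in for whatever list you are searching:

```java
public class MaxDemo {
    public static void main(String[] args) {
        int[] values = {3, 17, 9, 42, 5};

        // Approach 1: assume the first element is the maximum so far,
        // then look for any element that is larger.
        int max = values[0];
        for (int i = 1; i < values.length; i++) {
            if (values[i] > max) {
                max = values[i];
            }
        }

        // Approach 2: start with a value that is extreme in the other
        // direction; any element in the list beats it.
        int max2 = Integer.MIN_VALUE;
        for (int value : values) {
            if (value > max2) {
                max2 = value;
            }
        }

        System.out.println(max + " " + max2);
    }
}
```

Both loops arrive at the same answer; the second approach also works when the list might be processed an element at a time without knowing its first value in advance.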
Implementation:
-
You will want two variables: one to store the most frequent word
found (so far), and one to store how many times it occurs (its
frequency count).
A frequency count will always be 0 or a positive number, so
initializing it to a "dummy" maximum value of -1 will work.
You can initialize the most frequent word to null.
-
Then go
through the list of words in the document. If you come across a
word that has a higher frequency count than the maximum so far,
set the most frequent word variable to the current word and the
maximum frequency count to the count associated with it.
-
When you get to the end of the list, the word that is in the most
frequent word variable will, indeed, be the most frequently used
word, and its count will be in the frequency count variable.
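A minimal sketch of these steps in Java. In your project the words and their counts come from your WordReader-based word list; the HashMap here is a hard-coded stand-in so the example runs on its own:

```java
import java.util.HashMap;
import java.util.Map;

public class MostFrequent {
    public static void main(String[] args) {
        // Stand-in for the counts your project builds from the text.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5816);
        counts.put("and", 3089);
        counts.put("holmes", 467);

        String mostFrequentWord = null;  // no candidate yet
        int maxCount = -1;               // dummy maximum: any real count beats it

        // Compare each word's count to the maximum found so far.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > maxCount) {
                maxCount = entry.getValue();
                mostFrequentWord = entry.getKey();
            }
        }

        System.out.println("The most frequently-used word is " + mostFrequentWord
                + ". It occurs " + maxCount + " times.");
    }
}
```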
Your output might look something like this:
The most frequently-used word is the. It occurs 5816 times.
What is the most frequently used word in your book? Is there any
difference when you look at different books?
Question: What if there are
two words that are the most frequently used? This is
unlikely to happen in this case, but it often happens in other
"Find the Minimum/Maximum Value" cases. Your code would find
one of the two most frequently used words, depending on whether
you used < or <= for your check.
Can you tell whether your code would find the first maximum
value or the last one, if there were multiple values with the
same maximum frequency?
-
Find the most frequent word(s), including multiples.
Although it is highly unlikely that the text has two equally most
frequently used words, let's change the algorithm to find all of
them (if there were multiples). We'll do this for two reasons: it
is good practice for other situations where that could easily
happen, and it will set you up for the next exercise.
Implementation:
- Construct an
ArrayList
in which you can store the frequently-used
word(s).
- Next, create two loops to step through all the distinct words,
one after the other. (Not nested.) In the first loop, find the
frequency count for the most frequently used word or words,
as you did before,
although this time you don't have to keep track of what the most
frequently used word is. In the second loop, find all words with
that frequency and add them to your list.
- After the second loop,
print "all" the words, along with their frequency counts, just as
you did for the words containing "house" (or "time") in the
previous Textual Analysis lab,
although in this case you will probably get only one word.
Sample output:
Most frequently-used word(s): the (5816)
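The two-pass approach above can be sketched as follows. Again, the HashMap is a hard-coded stand-in for the word counts your project builds; note the two loops run one after the other, not nested:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MostFrequentWords {
    public static void main(String[] args) {
        // Stand-in for the counts your project builds from the text.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5816);
        counts.put("and", 3089);
        counts.put("holmes", 467);

        // First loop: find the highest frequency count, without
        // tracking which word it belongs to.
        int maxCount = -1;
        for (int count : counts.values()) {
            if (count > maxCount) {
                maxCount = count;
            }
        }

        // Second loop: collect every word that has that count.
        List<String> mostFrequent = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == maxCount) {
                mostFrequent.add(entry.getKey());
            }
        }

        for (String word : mostFrequent) {
            System.out.println("Most frequently-used word(s): " + word
                    + " (" + maxCount + ")");
        }
    }
}
```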
-
Find the top 40.
The most frequently used word is seldom interesting; in fact, it
is usually "the." It would be more interesting to identify the 40
most frequently used words.
To do this will require wrapping the two loops you just wrote in an
outer loop, so that you find not just the most frequently used
words, but the 40 most frequently used words.
Implementation:
- After the construction of your
ArrayList
in which you stored frequently used
words, create a new variable called upperBound
and initialize it to a number greater than the frequency of the
most frequent word. For example, if the most frequent word is
"the" and if it occurs 5800 times, then 6000 would be an appriate
initial value for upperBound
.
- Next, create a loop that iterates as many times as the
number of top words you want to find. In other words, if you want
to identify the 40 most frequently used words, you would create a
loop that repeats 40 times (similar to the loop you created in the
Word Counting mini-lab).
-
Put your two loops from before
within the braces for your new loop, adjusting the
indentation. (The initialization of your frequency count variable
to -1 should also be inside the outer loop, with the two inner
loops.)
Tip: BlueJ has an "Auto-layout"
option under the Edit menu that will clean up indentation for
you. It can also sometimes show you when your program logic
doesn't match what you think it does, because the indentation
will not be what you expect.
-
Modify the first loop to check for two things: whether
the current word's frequency count is higher than the "most
frequent" count and whether the count is less than
the upper bound.
-
After the second loop (but still nested within your count-to-40
loop), adjust your upperBound to be the frequency count you just
used. For example, if your initial upper bound was 6000 and you
found the word "the" with a frequency of 5800, you would now set
the upper bound to 5800 so that in the next pass through the outer
loop you can find the most frequently used word(s) below 5800 (the
second most frequently used). If the second most frequently used
word is "and", and it occurs 3100 times, you would then set the
upper bound to 3100 to find the most frequently used word(s) below
that. And so on.
-
After the outer loop ends,
print the 40 most frequently used words,
along with their frequency counts, just as
you did before. You may find it neater to print each word and its
frequency count on a new line, rather than putting them all on the
same line.
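Putting the pieces together, the outer loop with a shrinking upper bound can be sketched like this. The HashMap is again a stand-in for your project's counts, and the break guards against running out of distinct counts in this small example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopForty {
    public static void main(String[] args) {
        // Stand-in for the counts your project builds from the text.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5816);
        counts.put("and", 3089);
        counts.put("i", 3038);
        counts.put("holmes", 467);
        counts.put("upon", 467);

        List<String> topWords = new ArrayList<>();
        int upperBound = Integer.MAX_VALUE;  // greater than any real frequency

        for (int i = 0; i < 40; i++) {
            // First inner loop: highest count strictly below upperBound.
            int maxCount = -1;               // re-initialized each pass
            for (int count : counts.values()) {
                if (count > maxCount && count < upperBound) {
                    maxCount = count;
                }
            }
            if (maxCount == -1) {
                break;  // fewer distinct counts than passes requested
            }

            // Second inner loop: collect all words with that count.
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() == maxCount) {
                    topWords.add(entry.getKey());
                    System.out.println(entry.getKey() + " (" + maxCount + ")");
                }
            }

            upperBound = maxCount;  // next pass looks below this count
        }
    }
}
```

With this sample data the words print in descending order of frequency, and the tied words "holmes" and "upon" both appear in the same pass.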
Below is some sample output. Note that the word "I" appears as
"i"; this is because the WordReader
object converts
all the words in the text to lower-case in order to find multiple
occurrences of words regardless of capitalization.
The 40 most frequently-used word(s) are:
the (5816)
and (3089)
i (3038)
...
holmes (467)
upon (467)
...
Note: You might actually identify
slightly more than 40 words. Why? You are looping 40 times,
finding 40 different frequency counts, but it's
possible that one or more of those frequency counts had
multiple words with that count, as in the sample output above.
-
Download two or more additional books from
Project
Gutenberg, choosing books that are very different from each
other. (Don't forget to download Plain Text versions.) Run your
program several times, changing the filename each time to a
different book. Copy and paste your results into a document, so
that you can compare results. Are there any interesting
differences you see that point to fundamental differences in the
tone, themes, or target audience for the different books? For
example, there might be significant differences in the range of
vocabulary or use of long words. The prevalence of masculine
vs. feminine pronouns might be distinctly different. What
other differences do you notice? (If you want to share your
analysis, put the document in the same folder as your program with
a name that makes it easy to spot, like "Textual Analysis Results.")
Optional: Account for "ties"
Adjust your loop so that you adjust your loop counter by the
number of words found rather than by 1, each iteration through the
loop. To do this, you will probably want to introduce a new
counter variable that is always re-initialized to 0 inside
the outer loop. You can then increment it
every time you add a word to the overall list. After the second
inner loop, adjust the outer loop counter by this amount. So, if
you found two words with the greatest frequency, the loop counter
would go from 0 to 2, rather than 0 to 1, after you printed the two
most common words.
You might still generate more than 40 words if there are
multiple words in 40th place, but you won't generate more than 40
because of ties in other places.
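The tie-aware counter adjustment can be sketched like this. A while loop makes it clear that the counter advances by the number of words found rather than by 1; the small HashMap with a deliberate tie stands in for your project's counts:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopWordsWithTies {
    public static void main(String[] args) {
        // Stand-in counts, with a tie at 467.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5816);
        counts.put("holmes", 467);
        counts.put("upon", 467);

        List<String> topWords = new ArrayList<>();
        int upperBound = Integer.MAX_VALUE;
        int i = 0;
        while (i < 3) {                    // aiming for the top 3 words
            int maxCount = -1;
            for (int count : counts.values()) {
                if (count > maxCount && count < upperBound) {
                    maxCount = count;
                }
            }
            if (maxCount == -1) {
                break;                     // ran out of distinct counts
            }

            int wordsFound = 0;            // re-initialized to 0 every pass
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() == maxCount) {
                    topWords.add(entry.getKey());
                    wordsFound++;
                }
            }

            upperBound = maxCount;
            i += wordsFound;               // ties advance the counter by more than 1
        }

        System.out.println(topWords.size() + " words found");
    }
}
```

Here the first pass finds one word ("the") and advances the counter to 1; the second pass finds the two tied words and advances it to 3, ending the loop.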
Style and Documentation
- Be sure that you have updated
the class documentation at the top of the file.
Focus
on what the program does, rather than how it does it.
Include your name and the date as well as the names of anyone from whom
you received help.
- Double-check that you have used meaningful variable names,
provided internal comments where appropriate, and have
clean indentation. (BlueJ's "Edit > Auto-layout" function is
useful for this.)
Zip and Submit Your Program.
Submit your completed program to Kit.
- In preparation for submitting your program, rename the
folder containing it to
YourName_Lab1
. (Do
this from the Mac Finder or Windows Explorer, not from within BlueJ.)
This will help whoever grades it
when they receive a dozen or more projects with similar names.
- Zip up the folder.
This will create a single, compressed version
of your project that can be submitted to
Kit.
-
Submit your zipped up folder to Kit.