Textual Analysis Lab Series
Word Frequency Project
Finding Maximum Values
This set of exercises is a follow-up to the Word Counting labs.
In these exercises, you will be doing additional textual analysis,
finding the most frequently used words in the text.
Add these exercises to your existing project, but print an extra newline,
a string of dashes (e.g., --------------), or another visual delimiter
(e.g., === New Exercises ===) to your output to set the new exercises off
from the old.
(You may want to add something similar in a comment in your code, too, to set
the two sets of exercises off from each other.)
Frequency Analysis
You have already identified how many words occur only once in the text, but
what word occurs most frequently? This is an example of a classic Extreme
Value problem (find the minimum or maximum value in a list).
-
Find the most frequent word.
To find a minimum or maximum value, you go through a list comparing
new items to whatever minimum or maximum you have identified so
far, changing to a new extreme value when you find a value that is
smaller or larger (depending on which extreme you are looking for).
The tricky part of this algorithm is selecting an initial
candidate. One approach is to assume for the moment that the
first one is the extreme value and then look for any that
are more extreme. The other approach is to pick a value that is
extreme in the other direction as the starting point. For example,
if we're looking for the largest value in a list of positive
integers, we could start by
pretending that a negative number is the maximum; any value in the
list will be larger than that.
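The two initialization strategies above can be sketched in plain Java. The hard-coded array here is just a stand-in for whatever list you are searching:

```java
public class MaxDemo {
    public static void main(String[] args) {
        int[] values = {3, 17, 9, 42, 5};

        // Approach 1: assume the first element is the maximum so far,
        // then look for any element that is larger.
        int max = values[0];
        for (int i = 1; i < values.length; i++) {
            if (values[i] > max) {
                max = values[i];
            }
        }

        // Approach 2: start with a value that is extreme in the other
        // direction; any element in the list beats it.
        int max2 = Integer.MIN_VALUE;
        for (int value : values) {
            if (value > max2) {
                max2 = value;
            }
        }

        System.out.println(max + " " + max2);
    }
}
```

Both loops arrive at the same answer; the second approach also works when the list might be processed an element at a time without knowing its first value in advance.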
Implementation:
-
You will want two variables: one to store the most frequent word
found (so far), and one to store how many times it occurs (its
frequency count).
A frequency count will always be 0 or a positive number, so
initializing it to a "dummy" maximum value of -1 will work.
You can initialize the most frequent word to null.
-
Then go
through the list of words in the document. If you come across a
word that has a higher frequency count than the maximum so far,
set the most frequent word variable to the current word and the
maximum frequency count to the count associated with it.
-
When you get to the end of the list, the word that is in the most
frequent word variable will, indeed, be the most frequently used
word, and its count will be in the frequency count variable.
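A minimal sketch of these steps in Java. In your project the words and their counts come from your WordReader-based word list; the HashMap here is a hard-coded stand-in so the example runs on its own:

```java
import java.util.HashMap;
import java.util.Map;

public class MostFrequent {
    public static void main(String[] args) {
        // Stand-in for the counts your project builds from the text.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5816);
        counts.put("and", 3089);
        counts.put("holmes", 467);

        String mostFrequentWord = null;  // no candidate yet
        int maxCount = -1;               // dummy maximum: any real count beats it

        // Compare each word's count to the maximum found so far.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > maxCount) {
                maxCount = entry.getValue();
                mostFrequentWord = entry.getKey();
            }
        }

        System.out.println("The most frequently-used word is " + mostFrequentWord
                + ". It occurs " + maxCount + " times.");
    }
}
```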
Your output might look something like this:
The most frequently-used word is the. It occurs 5816 times.
What is the most frequently used word in your book? Is there any
difference when you look at different books?
Question: What if there are
two words that are the most frequently used? This is
unlikely to happen in this case, but it often happens in other
"Find the Minimum/Maximum Value" cases. Your code would find
one of the two most frequently used words, depending on whether
you used < or <= for your check.
Can you tell whether your code would find the first maximum
value or the last one, if there were multiple values with the
same maximum frequency?
-
Find the most frequent word(s), including multiples.
Although it is highly unlikely that the text has two equally most
frequently used words, let's change the algorithm to find all of
them (if there were multiples). We'll do this for two reasons: it
is good practice for other situations where that could easily
happen, and it will set you up for the next exercise.
Implementation:
- Construct an
ArrayList
in which you can store the frequently-used
word(s).
- Next, create two loops to step through all the distinct words,
one after the other. (Not nested.) In the first loop, find the
frequency count for the most frequently used word or words,
as you did before,
although this time you don't have to keep track of what the most
frequently used word is. In the second loop, find all words with
that frequency and add them to your list.
- After the second loop,
print "all" the words, along with their frequency counts, just as
you did for the words containing "house" (or "time") in the
previous Textual Analysis lab,
although in this case you will probably get only one word.
Sample output:
Most frequently-used word(s): the (5816)
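The two-pass approach above can be sketched as follows. Again, the HashMap is a hard-coded stand-in for the word counts your project builds; note the two loops run one after the other, not nested:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MostFrequentWords {
    public static void main(String[] args) {
        // Stand-in for the counts your project builds from the text.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5816);
        counts.put("and", 3089);
        counts.put("holmes", 467);

        // First loop: find the highest frequency count, without
        // tracking which word it belongs to.
        int maxCount = -1;
        for (int count : counts.values()) {
            if (count > maxCount) {
                maxCount = count;
            }
        }

        // Second loop: collect every word that has that count.
        List<String> mostFrequent = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == maxCount) {
                mostFrequent.add(entry.getKey());
            }
        }

        for (String word : mostFrequent) {
            System.out.println("Most frequently-used word(s): " + word
                    + " (" + maxCount + ")");
        }
    }
}
```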
-
Find the top 40.
The most frequently used word is seldom interesting; in fact, it
is usually "the." It would be more interesting to identify the 40
most frequently used words.
To do this will require wrapping the two loops you just wrote in an
outer loop, so that you find not just the most frequently used
words, but the 40 most frequently used words.
Implementation:
- After the construction of your
ArrayList
in which you stored frequently used
words, create a new variable called upperBound
and initialize it to a number greater than the frequency of the
most frequent word. For example, if the most frequent word is
"the" and if it occurs 5800 times, then 6000 would be an appriate
initial value for upperBound
.
- Next, create a loop that iterates as many times as the
number of top words you want to find. In other words, if you want
to identify the 40 most frequently used words, you would create a
loop that repeats 40 times (similar to the loop you created in the
Word Counting mini-lab).
-
Put your two loops from before
within the braces for your new loop, adjusting the
indentation. (The initialization of your frequency count variable
to -1 should also be inside the outer loop, with the two inner
loops.)
Tip: BlueJ has an "Auto-layout"
option under the Edit menu that will clean up indentation for
you. It can also sometimes show you when your program logic
doesn't match what you think it does, because the indentation
will not be what you expect.
-
Modify the first loop to check for two things: whether
the current word's frequency count is higher than the "most
frequent" count and whether the count is less than
the upper bound.
-
After the second loop (but still nested within your count-to-40
loop), adjust your upperBound to be the frequency count you just
used. For example, if your initial upper bound was 6000 and you
found the word "the" with a frequency of 5800, you would now set
the upper bound to 5800 so that in the next pass through the outer
loop you can find the most frequently used word(s) below 5800 (the
second most frequently used). If the second most frequently used
word is "and", and it occurs 3100 times, you would then set the
upper bound to 3100 to find the most frequently used word(s) below
that. And so on.
-
After the outer loop ends,
print the 40 most frequently used words,
along with their frequency counts, just as
you did before. You may find it neater to print each word and its
frequency count on a new line, rather than putting them all on the
same line.
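Putting the pieces together, the outer loop with a shrinking upper bound can be sketched like this. The HashMap is again a stand-in for your project's counts, and the break guards against running out of distinct counts in this small example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopForty {
    public static void main(String[] args) {
        // Stand-in for the counts your project builds from the text.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5816);
        counts.put("and", 3089);
        counts.put("i", 3038);
        counts.put("holmes", 467);
        counts.put("upon", 467);

        List<String> topWords = new ArrayList<>();
        int upperBound = Integer.MAX_VALUE;  // greater than any real frequency

        for (int i = 0; i < 40; i++) {
            // First inner loop: highest count strictly below upperBound.
            int maxCount = -1;               // re-initialized each pass
            for (int count : counts.values()) {
                if (count > maxCount && count < upperBound) {
                    maxCount = count;
                }
            }
            if (maxCount == -1) {
                break;  // fewer distinct counts than passes requested
            }

            // Second inner loop: collect all words with that count.
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() == maxCount) {
                    topWords.add(entry.getKey());
                    System.out.println(entry.getKey() + " (" + maxCount + ")");
                }
            }

            upperBound = maxCount;  // next pass looks below this count
        }
    }
}
```

With this sample data the words print in descending order of frequency, and the tied words "holmes" and "upon" both appear in the same pass.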
Below is some sample output. Note that the word "I" appears as
"i"; this is because the WordReader
object converts
all the words in the text to lower-case in order to find multiple
occurrences of words regardless of capitalization.
The 40 most frequently-used word(s) are:
the (5816)
and (3089)
i (3038)
...
holmes (467)
upon (467)
...
Note: You might actually identify
slightly more than 40 words. Why? You are looping 40 times,
finding 40 different frequency counts, but it's
possible that one or more of those frequency counts had
multiple words with that count, as in the sample output above.
-
Download two or more additional books from
Project
Gutenberg, choosing books that are very different from each
other. (Don't forget to download Plain Text versions.) Run your
program several times, changing the filename each time to a
different book. Copy and paste your results into a document, so
that you can compare results. Are there any interesting
differences you see that point to fundamental differences in the
tone, themes, or target audience for the different books? For
example, there might be significant differences in the range of
vocabulary or use of long words. The prevalence of masculine
vs. feminine pronouns might be distinctly different. What
other differences do you notice? (If you want to share your
analysis, put the document in the same folder as your program with
a name that makes it easy to spot, like "Textual Analysis Results.")
Optional: Account for "ties"
Adjust your loop so that you adjust your loop counter by the
number of words found rather than by 1, each iteration through the
loop. To do this, you will probably want to introduce a new
counter variable that is always re-initialized to 0 inside
the outer loop. You can then increment it
every time you add a word to the overall list. After the second
inner loop, adjust the outer loop counter by this amount. So, if
you found two words with the greatest frequency, the loop counter
would go from 0 to 2, rather than 0 to 1, after you printed the two
most common words.
You might still generate more than 40 words if there are
multiple words in 40th place, but you won't generate more than 40
because of ties in other places.
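The tie-aware counter adjustment can be sketched like this. A while loop makes it clear that the counter advances by the number of words found rather than by 1; the small HashMap with a deliberate tie stands in for your project's counts:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopWordsWithTies {
    public static void main(String[] args) {
        // Stand-in counts, with a tie at 467.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5816);
        counts.put("holmes", 467);
        counts.put("upon", 467);

        List<String> topWords = new ArrayList<>();
        int upperBound = Integer.MAX_VALUE;
        int i = 0;
        while (i < 3) {                    // aiming for the top 3 words
            int maxCount = -1;
            for (int count : counts.values()) {
                if (count > maxCount && count < upperBound) {
                    maxCount = count;
                }
            }
            if (maxCount == -1) {
                break;                     // ran out of distinct counts
            }

            int wordsFound = 0;            // re-initialized to 0 every pass
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() == maxCount) {
                    topWords.add(entry.getKey());
                    wordsFound++;
                }
            }

            upperBound = maxCount;
            i += wordsFound;               // ties advance the counter by more than 1
        }

        System.out.println(topWords.size() + " words found");
    }
}
```

Here the first pass finds one word ("the") and advances the counter to 1; the second pass finds the two tied words and advances it to 3, ending the loop.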
Style and Documentation
- Be sure that you have updated
the class documentation at the top of the file.
Focus
on what the program does, rather than how it does it.
Include your name and the date as well as the names of anyone from whom
you received help.
- Double-check that you have used meaningful variable names,
provided internal comments where appropriate, and have
clean indentation. (BlueJ's "Edit > Auto-layout" function is
useful for this.)
Zip and Submit Your Program.
Submit your completed program to Kit.
- In preparation for submitting your program, rename the
folder containing it to
YourName_Lab1
. (Do
this from the Mac Finder or Windows Explorer, not from within BlueJ.)
This will help whoever grades it
when they receive a dozen or more projects with similar names.
- Zip up the folder.
This will create a single, compressed version
of your project that can be submitted to
Kit.
-
Submit your zipped up folder to Kit.