This set of exercises is a follow-up to the Word Counting Lab. In that project you read in lines from a work of literature and then calculated some basic statistics based on the number of words and number of lines in the text. In these exercises, you will be doing additional analysis of the text.
Add a visual delimiter (e.g., -------------- or === New Exercises ===) to your output to show the "shifting of gears" from the previous exercises to the new exercises. (You may want to add something similar in a comment in your code, too, to set the two sets of exercises off from each other.)
In the previous lab, you calculated the average word length in a text. In the next set of exercises, you will do more interesting textual analysis.
This question is easy to answer, and doesn't even require that you step through and separately read all the lines of the file. Read the WordReader class documentation to find a method that returns a list of all the words in the document. Make sure to read the Method Details, not just the Summary, to fully understand what this method will return. How can you determine how many distinct words the author uses? Remember that the ArrayList class has a size method.
The output might look something like this:
SomeFile.txt has 8083 distinct words.
NOTE: The words in the list of distinct words returned by the reader object are always converted to lower case so that differences in case don't mark words as being different. For example, "The" and "the" are the same word and will only appear as "the" in the list of words. This means, though, that words that are usually capitalized, such as proper names, will look odd; "Gutenberg" will become "gutenberg," for example.
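The whole exercise reduces to asking the word list for its size. A minimal sketch of the idea, where a hard-coded list stands in for whatever the WordReader method actually returns (the real method name is in the WordReader documentation):

```java
import java.util.List;

public class DistinctWords {
    // Returns how many distinct words there are, given the list of
    // distinct words. In the lab, this list would come from the
    // WordReader method you found in the documentation.
    static int countDistinct(List<String> distinctWords) {
        return distinctWords.size();   // one entry per distinct word
    }

    public static void main(String[] args) {
        // Stand-in data; the real list comes from the reader object.
        List<String> words = List.of("the", "cat", "sat", "mat");
        System.out.println("SomeFile.txt has "
                + countDistinct(words) + " distinct words.");
    }
}
```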
Another way to measure the range of vocabulary in a text is to count how many words appear only once. Create a variable to sum up the "singletons" (words that appear just once). Read the WordReader class documentation to find a method that will tell you how many times a given word occurs in the text. Loop through the word list you retrieved in the previous step; whenever you come across a word that appears just once in the text, increment your summing variable. (How will you increment your summing variable only if the word count is 1?) Outside the loop, report on how many singletons there are in this document.
The output might look something like this:
There are 3544 words that occur only once in SomeFile.txt.
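The loop described above might be sketched like this. To keep the sketch self-contained, the per-word occurrence counts come from a Map; in the lab you would instead ask the reader object, using the counting method you found in the WordReader documentation:

```java
import java.util.List;
import java.util.Map;

public class Singletons {
    // Counts how many of the distinct words occur exactly once.
    // (In the lab, replace counts.get(word) with a call to the
    // reader object's word-counting method.)
    static int countSingletons(List<String> distinctWords,
                               Map<String, Integer> counts) {
        int singletons = 0;
        for (String word : distinctWords) {
            if (counts.get(word) == 1) {   // increment only when the count is 1
                singletons++;
            }
        }
        return singletons;
    }

    public static void main(String[] args) {
        // Stand-in data; the real list and counts come from the reader.
        List<String> distinct = List.of("the", "cat", "sat");
        Map<String, Integer> counts = Map.of("the", 2, "cat", 1, "sat", 1);
        System.out.println("There are " + countSingletons(distinct, counts)
                + " words that occur only once in SomeFile.txt.");
    }
}
```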
Yet another way to measure the complexity of the vocabulary in a text is to measure the use of long, generally sophisticated, words. You can do this by looping through the list of distinct words, checking whether they are longer than 14 characters, and incrementing a counter if they are. To make it more interesting, print the long words as you find them. You may want to do this using System.out.print rather than println, putting a space rather than a newline between words. Then, after the loop, use a println to get to the next line and another println to print out the total count. (You may also want to print an "intro" phrase before you enter the for loop, as in the example below.)
Sample output:
Words that are more than 14 characters long in SomeFile.txt: conventionalities inconsequential characteristics characteristics improbabilities characteristics indistinguishable accomplishments accomplishments disproportionately representations MERCHANTABILITY unenforceability
SomeFile.txt has 13 words that are more than 14 characters long.
NOTE 1: Some of the long words you see might not be from the original text. For example, doing this exercise in a Project Gutenberg version of Alice in Wonderland results in three words longer than 14 characters, none of which actually come from the book. All three are in the Project Gutenberg license information ("representations," "MERCHANTABILITY," and "unenforceability").
NOTE 2: The words in the list of distinct words returned by the reader object are not in the original order from the file, which makes it harder to tell which words come from the actual text and which come from the extra licensing information provided by Project Gutenberg.
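One way the print-as-you-go loop might look, using print inside the loop and println after it, with a hard-coded list standing in for the reader's distinct-word list:

```java
import java.util.List;

public class LongWords {
    // Prints every word longer than 14 characters on one line,
    // then returns how many there were.
    static int reportLongWords(List<String> distinctWords) {
        int longCount = 0;
        System.out.print("Words that are more than 14 characters long:");
        for (String word : distinctWords) {
            if (word.length() > 14) {
                System.out.print(" " + word);   // space, not newline, between words
                longCount++;
            }
        }
        System.out.println();                   // finish the line of long words
        return longCount;
    }

    public static void main(String[] args) {
        // Stand-in data; the real list comes from the reader object.
        List<String> words = List.of("conventionalities", "cat",
                                     "indistinguishable");
        int n = reportLongWords(words);
        System.out.println("This text has " + n
                + " words that are more than 14 characters long.");
    }
}
```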
Optional Challenge Exercise: (Do this only after completing the other exercises in this lab.) If you want to see the list of words in their proper order (optional!), you can find them by reading in every line in the file, breaking the line into words (as you did when you calculated the number of words), and then checking all the words and incrementing the counter. If you do this, don't forget to start by getting a new WordReader object so that you start reading lines from the beginning of the file.
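The line-by-line version could be sketched as below. Hard-coded strings stand in for the lines a fresh WordReader would hand you one at a time; splitting on whitespace is one simple way to break a line into words (the previous lab's technique may differ in detail):

```java
public class LongWordsInOrder {
    // Splits one line into words and prints any longer than 14
    // characters, returning how many were found on that line.
    // Because lines are processed in file order, the long words
    // come out in their original order.
    static int longWordsInLine(String line) {
        int count = 0;
        for (String word : line.trim().split("\\s+")) {  // break line on whitespace
            if (word.length() > 14) {
                System.out.print(word + " ");
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in lines; in the lab, each would come from a new
        // WordReader started at the beginning of the file.
        String[] lines = { "an indistinguishable figure",
                           "stood by the lighthouse" };
        int total = 0;
        for (String line : lines) {
            total += longWordsInLine(line);
        }
        System.out.println("\nTotal: " + total);
    }
}
```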
Another interesting analysis is how often an author uses a particular word, including variations. Loop through the full list of words checking to find words that contain a particular base word, like "house" or "time", and printing the ones that do. (Or you could try "cat" if you don't have any luck with "house" or "time", although many words will contain that three-letter sequence even if they have nothing to do with cats!)
Tip: Read the String Methods quick reference to find a method that will be useful.
It is even more interesting to know how many times each of those words occurs. As you print a word that contains the base, ask the reader object how often this word occurs.
Sample output:
Words containing 'house': housemaid (1); house (120); household (8); farmhouse (1); houses (13); housekeeper (3); westhouse (2); warehouse (1); outhouse (1); lighthouse (1);
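A sketch of the matching-and-counting loop, in the "word (count);" style of the sample output. The String method contains does the matching; the occurrence counts come from a Map here to keep the example self-contained, whereas in the lab you would ask the reader object:

```java
import java.util.List;
import java.util.Map;

public class BaseWords {
    // Prints each distinct word containing the base, with its
    // occurrence count, and returns how many matching words there were.
    static int reportWordsContaining(String base, List<String> distinctWords,
                                     Map<String, Integer> counts) {
        int matches = 0;
        System.out.print("Words containing '" + base + "': ");
        for (String word : distinctWords) {
            if (word.contains(base)) {   // the useful String method
                System.out.print(word + " (" + counts.get(word) + "); ");
                matches++;
            }
        }
        System.out.println();
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in data; the real list and counts come from the reader.
        List<String> distinct = List.of("house", "houses", "mouse", "lighthouse");
        Map<String, Integer> counts = Map.of("house", 120, "houses", 13,
                                             "mouse", 4, "lighthouse", 1);
        reportWordsContaining("house", distinct, counts);
    }
}
```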
When you have your code working, you could experiment with words that might be more specific to your book, like "detect" or "chess" or "freedom" or "dance."
Reminder: Many of these exercises are more interesting if you compare results from a couple of different books from Project Gutenberg, choosing books that are very different from each other, so that you can compare them along the way or at the end. (Don't forget to download Plain Text versions.)
Another variation on the word frequency analysis would be to measure how often various characters (as in personages, not letters) are mentioned in the book. This is a way of quantitatively measuring the relative importance or prominence of various characters.
Sample output:
There are 467 references to Holmes in SherlockHolmes.txt.
There are 81 references to Watson in SherlockHolmes.txt.
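One way to total a character's references, assuming you want variations (possessives and the like) included: sum the counts of every distinct word containing the name. The name is lower-cased because, as noted earlier, the reader's word list is all lower case. Again, a Map stands in for the reader object's counting method:

```java
import java.util.List;
import java.util.Map;

public class CharacterMentions {
    // Totals the occurrence counts of every distinct word containing
    // the character's name, so variations like "holmes's" are included.
    // (In the lab, ask the reader object for each word's count.)
    static int referencesTo(String name, List<String> distinctWords,
                            Map<String, Integer> counts) {
        String base = name.toLowerCase();  // reader's word list is lower case
        int references = 0;
        for (String word : distinctWords) {
            if (word.contains(base)) {
                references += counts.get(word);
            }
        }
        return references;
    }

    public static void main(String[] args) {
        // Stand-in data; the real list and counts come from the reader.
        List<String> distinct = List.of("holmes", "holmes's", "watson");
        Map<String, Integer> counts = Map.of("holmes", 400,
                                             "holmes's", 67, "watson", 81);
        System.out.println("There are " + referencesTo("Holmes", distinct, counts)
                + " references to Holmes in SherlockHolmes.txt.");
    }
}
```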
You will be adding to this program in the Word Frequency project, but submitting it at this point will allow you to get feedback before you submit a final version.
Before submitting, rename your project to something like YourName_Lab1. (Do this from the Mac Finder or Windows Explorer, not from within BlueJ.) This will help whoever grades it, since they will receive a dozen or more projects with similar names.