In this lab, you will write a program that will report each n-letter substring that appears in a list of strings, as well as where it appears in each of the strings, where n is to be determined by the user. This is a process that is useful in finding shared substrings in DNA sequences.
For example, suppose we have a list of strings: acgtacgtacgtacgt, cgtacgtacgtacgt, gtacgtacgtacgtacgtacgt, and tacgtacgtacgt, and we are looking for 4-letter substrings. The first substring we find, acgt, appears 4 times in the first string - at positions 0, 4, 8, and 12. It also appears 3 times in the second string - at positions 3, 7, and 11, 5 times in the third string - at positions 2, 6, 10, 14, and 18, and 3 times in the fourth string - at positions 1, 5, and 9. Your program will report these 15 occurrences of the substring acgt. It will then report the 12 occurrences of cgta, the next substring encountered, and will continue with the occurrences of gtac, etc.
There are a number of files provided to get you started. To begin, download the following files and create a project in Eclipse (or BlueJ) containing them.
DNAData.txt - This is a file of partial DNA sequences, taken from a set of viruses. A DNA sequence consists of a GI identifier, a description, and then a sequence of nucleotides. These nucleotide sequences are the strings that you will be indexing.
TestData.txt - This file has a format similar to DNAData.txt, but can be used to easily check the correctness of your implementation.
DNADataReader.java - This file contains code to read DNA sequence data in from a file, such as TestData.txt or DNAData.txt.
ValidatedInputReader.java - This file may be used for input of integers and strings.
Debug.java - This file may be useful for testing and debugging your program.
The implementation of this program has been divided into smaller tasks (stages), which should make testing easier.
TestDNAReader that you will
use to test the early stages of this program as you develop it.
This class should have a main method that asks the user
for a filename, using the ValidatedInputReader class.
It will then create a DNADataReader object with the
file name entered by the user. It should then call the
readData method, which for now simply prints some
information about each sequence it reads in.
Testing early and often: If you test your program at every step along the way, then when something does not go as you expect you will have a good idea where the problem lies. Testing early and often at this stage could mean:
- Create a
mainthat just prints a simple string and test that.- Use the
Debugclass for debugging output: turn on debugging (Debug.turnOn();)at the top of yourmainmethod, and then useDebug.printlnrather thanSystem.out.println.- Construct your
ValidatedInputReaderobject, ask the user for a filename, and test this by printing the filename.- Construct a
for the appropriate filename, and test that theDNADataReaderobjectreadDatamethod does what you expect. (Note that the initial version ofDNADataReaderwill only produce debugging output if you have turned debugging on.)
Create a DNASequence class for DNA sequence objects.
(What attributes should a DNA sequence object have? How will those
attributes get initialized? What should the constructor do, and what
parameters does it need?)
This class
should allow the user to access the sequence information, but users
should not be allowed to modify it. (So, what methods should the class
have?) The DNASequence class should also have
a toString method that will return all of the sequence
information (GI, description and nucleotide sequence) as a nicely
formatted string.
DNADataReader class. This class
will be used to read the data from the two text files, but it is
incomplete. Read the complete class to understand the nature of the
changes you will be making. You will probably then find it easiest to
make the changes "bottom-up": make the suggested changes in the
lowest-level method first, then edit the method that calls it to
correspond to the changes you just made, etc..
(A figure illustrating a list of DNASequence objects, such
as you are asked to create in DNADataReader, can be found
in Figure 3
at the end of this page.)
TestDNAReader class so that it saves what is
returned from the readData method into an
ArrayList. Finally, it should iterate through
the ArrayList and print out the data for each DNA
sequence. Compare your output with the data file you used. If your
output agrees with the file, you're ready to continue.
The overall goal of this project is to report on the locations where
various substrings can be found in a set of DNA sequences. To keep
track of locations,
create a Location class that will hold a pair of
integers indicating (a) the DNA sequence string, and (b) the
starting location of the substring within the sequence string.
For instance, in the example at the top of this lab, the
substring acgt occurs at locations 0, 4, 8, and 12 in the
first DNA Sequence string (string 0) and at locations 3, 7, and 11
in the second (string 1). The Location objects
representing these locations would consist of the following
(string, location) pairs:
(0, 0), (0, 4), (0, 8), (0, 12), (1, 3), (1, 7), and (1, 11).
Assume that Location objects will be created only
after both the sequence number and starting location are known.
What should the constructor look like, and what parameters does it need?
Assume also that the attributes of Location objects are
not changed after the object is created. What does that tell you
about what methods are needed or not needed? In addition, you should
create a toString method that returns the contents of a
string in the (string, location) format used above, e.g.,
(0, 12).
To keep track of the locations associated with a number of different substrings, it will be useful to have a class that bundles a particular substring along with the list of its locations.
Give this class an appropriate name, such as
SubstringLocs. What attributes should this class have?
Assume that objects of your new class will be created with the
substring value specified but before the locations have been
determined.
(Alternative: A different, valid assumption would be that
each new object is created with a substring and a single, initial
location for the list.)
What should the constructor do, and what parameters does it need?
Should this class have methods that let client code change the
substring? Should it have methods that let client code add,
remove, or change locations in the list of locations? What about
a toString method? One simple toString
format for the object in Figure 2 would be
acgt: (0, 0) (0, 4) (0, 8) (0, 12) (1, 3) (1, 7)
Testing: Add code to yourmainmethod that tests yourLocationandSubstringLocsclasses as you develop them. You can testtoString()methods by calling them explicitly or by using the objects inSystem.out.printstatements. For example,System.out.println(new Location(3, 4));tests both theLocationconstructor and itstoStringmethod.
Indexer class that will be used to find all of the
n-letter substrings and their locations within a set of
DNA sequences.Indexer class should do the
following:
DNADataReader object with the
user-specified information.ArrayList containing the DNA sequences.
(Where will you get it? From which object?)ArrayList that will hold the
SubstringAndLocs objects.run method to the Indexer class.
This method should scan through the nucleotide sequences, looking at
all the different substrings of the length specified by the user and
building up a list of substrings and their locations, as illustrated
in Figure 4 below. When a given substring is
encountered for the first time, it should be added to the list along
with its location. If the substring was already in the list, its
location should be added to the appropriate location list.
(You may find it helpful to create a private helper method that,
for a given substring, finds the associated substring-and-locations
object in the list. It would also, of course, indicate when
the given substring is not yet represented in the list.)
printResults method to the Indexer
class. The method should print a list of each substring found,
along with the locations at which it appears.
Indexer object, tells it to run and prints the
results..txt or .docx file.)