In this lab, you will write a program that will report each n-letter substring that appears in a list of strings, as well as where it appears in each of the strings, where n is to be determined by the user. This is a process that is useful in finding shared substrings in DNA sequences.
For example, suppose we have a list of strings: acgtacgtacgtacgt, cgtacgtacgtacgt, gtacgtacgtacgtacgtacgt, and tacgtacgtacgt, and we are looking for 4-letter substrings. The first substring we find, acgt, appears 4 times in the first string - at positions 0, 4, 8, and 12. It also appears 3 times in the second string - at positions 3, 7, and 11, 5 times in the third string - at positions 2, 6, 10, 14, and 18, and 3 times in the fourth string - at positions 1, 5, and 9. Your program will report these 15 occurrences of the substring acgt. It will then report the 12 occurrences of cgta, the next substring encountered, and will continue with the occurrences of gtac, etc.
There are a number of files provided to get you started. To begin, download the following files and create a project in Eclipse (or BlueJ) containing them.
DNAData.txt - This is a file of partial DNA sequences, taken from a set of viruses. A DNA sequence consists of a GI identifier, a description, and then a sequence of nucleotides. These nucleotide sequences are the strings that you will be indexing.
TestData.txt - This file has a format similar to DNAData.txt, but can be used to easily check the correctness of your implementation.
DNADataReader.java - This file contains code to read DNA sequence data in from a file, such as TestData.txt or DNAData.txt.
ValidatedInputReader.java - This file may be used for input of integers and strings.
Debug.java - This file may be useful for testing and debugging your program.
The implementation of this program has been divided into smaller tasks (stages), which should make testing easier.
TestDNAReaderthat you will use to test the early stages of this program as you develop it. This class should have a
mainmethod that asks the user for a filename, using the
ValidatedInputReaderclass. It will then create a
DNADataReaderobject with the file name entered by the user. It should then call the
readDatamethod, which for now simply prints some information about each sequence it reads in.
Testing early and often: If you test your program at every step along the way, then when something does not go as you expect you will have a good idea where the problem lies. Testing early and often at this stage could mean:
- Create a
mainthat just prints a simple string and test that.
- Use the
Debugclass for debugging output: turn on debugging (
Debug.turnOn();)at the top of your
mainmethod, and then use
- Construct your
ValidatedInputReaderobject, ask the user for a filename, and test this by printing the filename.
- Construct afor the appropriate filename, and test that the
readDatamethod does what you expect. (Note that the initial version of
DNADataReaderwill only produce debugging output if you have turned debugging on.)
DNASequence class for DNA sequence objects.
(What attributes should a DNA sequence object have? How will those
attributes get initialized? What should the constructor do, and what
parameters does it need?)
should allow the user to access the sequence information, but users
should not be allowed to modify it. (So, what methods should the class
DNASequence class should also have
toString method that will return all of the sequence
information (GI, description and nucleotide sequence) as a nicely
DNADataReaderclass. This class will be used to read the data from the two text files, but it is incomplete. Read the complete class to understand the nature of the changes you will be making. You will probably then find it easiest to make the changes "bottom-up": make the suggested changes in the lowest-level method first, then edit the method that calls it to correspond to the changes you just made, etc.. (A figure illustrating a list of
DNASequenceobjects, such as you are asked to create in
DNADataReader, can be found in Figure 3 at the end of this page.)
TestDNAReaderclass so that it saves what is returned from the
readDatamethod into an
ArrayList. Finally, it should iterate through the
ArrayListand print out the data for each DNA sequence. Compare your output with the data file you used. If your output agrees with the file, you're ready to continue.
The overall goal of this project is to report on the locations where
various substrings can be found in a set of DNA sequences. To keep
track of locations,
Location class that will hold a pair of
integers indicating (a) the DNA sequence string, and (b) the
starting location of the substring within the sequence string.
For instance, in the example at the top of this lab, the
substring acgt occurs at locations 0, 4, 8, and 12 in the
first DNA Sequence string (string 0) and at locations 3, 7, and 11
in the second (string 1). The
representing these locations would consist of the following
(string, location) pairs:
(0, 0), (0, 4), (0, 8), (0, 12), (1, 3), (1, 7), and (1, 11).
Location objects will be created only
after both the sequence number and starting location are known.
What should the constructor look like, and what parameters does it need?
Assume also that the attributes of
Location objects are
not changed after the object is created. What does that tell you
about what methods are needed or not needed? In addition, you should
toString method that returns the contents of a
string in the (string, location) format used above, e.g.,
To keep track of the locations associated with a number of different substrings, it will be useful to have a class that bundles a particular substring along with the list of its locations.
Give this class an appropriate name, such as
SubstringLocs. What attributes should this class have?
Assume that objects of your new class will be created with the
substring value specified but before the locations have been
(Alternative: A different, valid assumption would be that
each new object is created with a substring and a single, initial
location for the list.)
What should the constructor do, and what parameters does it need?
Should this class have methods that let client code change the
substring? Should it have methods that let client code add,
remove, or change locations in the list of locations? What about
toString method? One simple
format for the object in Figure 2 would be
acgt: (0, 0) (0, 4) (0, 8) (0, 12) (1, 3) (1, 7)
Testing: Add code to your
mainmethod that tests your
SubstringLocsclasses as you develop them. You can test
toString()methods by calling them explicitly or by using the objects in
System.out.printstatements. For example,System.out.println(new Location(3, 4));tests both the
Locationconstructor and its
Indexerclass that will be used to find all of the n-letter substrings and their locations within a set of DNA sequences.
Indexerclass should do the following:
DNADataReaderobject with the user-specified information.
ArrayListcontaining the DNA sequences. (Where will you get it? From which object?)
ArrayListthat will hold the
runmethod to the
Indexerclass. This method should scan through the nucleotide sequences, looking at all the different substrings of the length specified by the user and building up a list of substrings and their locations, as illustrated in Figure 4 below. When a given substring is encountered for the first time, it should be added to the list along with its location. If the substring was already in the list, its location should be added to the appropriate location list. (You may find it helpful to create a private helper method that, for a given substring, finds the associated substring-and-locations object in the list. It would also, of course, indicate when the given substring is not yet represented in the list.)
printResultsmethod to the
Indexerclass. The method should print a list of each substring found, along with the locations at which it appears.
Indexerobject, tells it to run and prints the results.