In this lab, you will write a program that will report each n-letter substring that appears in a list of strings, as well as where it appears in each of the strings, where n is to be determined by the user.
For example, suppose we have a list of strings: abcdabcdabcdabcd, bcdabcdabcdabcd, cdabcdabcdabcdabcdabcd, and dabcdabcdabcd, and we are looking for 4-letter substrings. The first substring we find, abcd, appears 4 times in the first string - at positions 0, 4, 8, and 12. It also appears 3 times in the second string - at positions 3, 7, and 11, 5 times in the third string - at positions 2, 6, 10, 14, and 18, and 3 times in the fourth string - at positions 1, 5, and 9. Your program will report these 15 occurrences of the substring abcd. It will then report the 12 occurrences of bcda, the next substring encountered, and will continue with the occurrences of cdab, etc.
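The counts in the example above can be checked with a short sketch. The class and method names here (SubstringPositions, positionsOf) are hypothetical and not part of the provided files; the sketch only shows one way to find every starting position of a known substring, including overlapping ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not one of the provided lab files.
class SubstringPositions {

    // Returns every index at which target starts in text (overlaps included).
    static List<Integer> positionsOf(String text, String target) {
        List<Integer> positions = new ArrayList<>();
        if (target.isEmpty()) {
            return positions; // an empty target has no meaningful positions
        }
        int from = text.indexOf(target);
        while (from >= 0) {
            positions.add(from);
            // Resume the search one character later so overlapping matches are found.
            from = text.indexOf(target, from + 1);
        }
        return positions;
    }

    public static void main(String[] args) {
        String[] strings = {
            "abcdabcdabcdabcd",
            "bcdabcdabcdabcd",
            "cdabcdabcdabcdabcdabcd",
            "dabcdabcdabcd"
        };
        int total = 0;
        for (int i = 0; i < strings.length; i++) {
            List<Integer> positions = positionsOf(strings[i], "abcd");
            total += positions.size();
            System.out.println("string " + i + ": " + positions);
        }
        System.out.println("total occurrences of abcd: " + total); // 15 for this data
    }
}
```

Note that this looks up one substring that is already known. The program you write must discover the substrings itself as it scans the strings.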
There are a number of files provided to get you started. To begin, download the following files and create a project in Eclipse containing them.
DNAData.txt - This is a file of partial DNA sequences, taken from a set of viruses. A DNA sequence consists of a GI identifier, a description, and then a sequence of nucleotides. These nucleotide sequences are the strings that you will be indexing.
TestData.txt - This file has a format similar to DNAData.txt, but can be used to easily check the correctness of your implementation.
DNASeqReader.java - This file is a superclass which gets extended for reading in DNA sequences with a specific format.
DNADataReader.java - This file contains code to read DNA sequence data in from a file, such as TestData.txt or DNAData.txt.
ValidatedInputReader.java - This file may be used for input of integers and strings.
Debug.java - This file may be useful for testing and debugging your program.
The implementation of this program has been divided into smaller tasks (stages), which should make testing easier.
Write a DNASequence class for DNA sequence objects. This class should allow the user to access the sequence information, but not to modify it. It should also have a toString method that returns the entire sequence (GI, description, and nucleotide sequence) as a nicely formatted string.
Complete the DNADataReader class. This class will be used to read the data from the two text files, but it is incomplete. Notice that there are places where it asks you to add, replace, or change code. Follow those instructions.
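A read-only DNASequence class of the kind described above might look like the following minimal sketch. The field types, accessor names, and constructor signature are assumptions made for illustration. Check the provided DNADataReader.java to see how it expects to construct sequence objects, and adjust accordingly.

```java
// A minimal sketch of a read-only DNASequence class.
// Field types, accessor names, and the constructor signature are assumptions,
// not taken from the provided lab files.
final class DNASequence {
    private final String gi;          // GI identifier
    private final String description; // free-text description
    private final String nucleotides; // the nucleotide string to be indexed

    DNASequence(String gi, String description, String nucleotides) {
        this.gi = gi;
        this.description = description;
        this.nucleotides = nucleotides;
    }

    // Accessors only: there are no setters, so the data cannot be modified.
    String getGI() {
        return gi;
    }

    String getDescription() {
        return description;
    }

    String getNucleotides() {
        return nucleotides;
    }

    @Override
    public String toString() {
        return "GI: " + gi + "\n"
             + "Description: " + description + "\n"
             + "Sequence: " + nucleotides;
    }
}
```

Declaring the fields private and final, and providing no setters, is one simple way to let users read the information without being able to change it.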
Write a class called TestDNAReader to test that you can correctly read in the data. This class should have a main method that gets the filename and the number of lines in the nucleotide sequence from the user. (You may use the ValidatedInputReader class for this.) It will then create a DNADataReader object with the information specified by the user. It should save what is returned from the readData method into an ArrayList. Finally, it should iterate through the ArrayList and print out the data for each DNA sequence. Compare your output with the data file you used. If your output agrees with the file, you're ready to continue.
Write a Location class that will hold a pair of integers. Location objects will be used to store occurrences of substrings. For instance, in the example above, the substring abcd would occur at Locations (0, 0), (0, 4), (0, 8), (0, 12), (1, 3), (1, 7), ..., and (3, 9). The first number in the pair is the index of the string (i.e., which string it is), and the second number is the position at which the substring starts in that string. So, in the first string (which would be at index 0 in an ArrayList), the substring appears starting at positions 0, 4, 8, and 12.
Write a class that holds a substring and an ArrayList of Locations. This ArrayList will hold the locations of the substring in the sequences from the data file. Give this class an appropriate name, such as SubstringAndLocs.
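The two classes just described could be sketched as follows. The field names, method names, and toString formats are assumptions chosen for illustration; only the class names Location and SubstringAndLocs come from the text above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of a Location: a pair (index of the string, starting position).
// Field and method names are assumptions, not from the provided files.
final class Location {
    private final int sequenceIndex;
    private final int position;

    Location(int sequenceIndex, int position) {
        this.sequenceIndex = sequenceIndex;
        this.position = position;
    }

    int getSequenceIndex() {
        return sequenceIndex;
    }

    int getPosition() {
        return position;
    }

    @Override
    public String toString() {
        return "(" + sequenceIndex + ", " + position + ")";
    }
}

// Sketch of a class pairing one substring with all of its Locations.
class SubstringAndLocs {
    private final String substring;
    private final ArrayList<Location> locations = new ArrayList<>();

    SubstringAndLocs(String substring) {
        this.substring = substring;
    }

    String getSubstring() {
        return substring;
    }

    void addLocation(Location location) {
        locations.add(location);
    }

    // Read-only view, so callers cannot change the list behind the object's back.
    List<Location> getLocations() {
        return Collections.unmodifiableList(locations);
    }

    @Override
    public String toString() {
        return substring + ": " + locations;
    }
}
```

For example, after adding Locations (0, 0) and (0, 4) to a SubstringAndLocs for abcd, its toString would return `abcd: [(0, 0), (0, 4)]`.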
Reminder: You may use the template for a main class and template for a normal class if you want, to get started.
Write an Indexer class that will be used to find all of the n-letter substrings and their locations within a set of DNA sequences.
The Indexer class should do the following:
Create a DNADataReader object with the user-specified information.
Create the ArrayList that will hold the substrings and their Locations.
Add a run method to the Indexer class. This method should scan through the nucleotide sequences. When a new substring of the specified length is encountered, it should be added to the appropriate ArrayList, together with the location at which it appeared. When a duplicate substring is found, the location should be added to the respective ArrayList of Locations for that substring.
Add a printResults method to the Indexer class. The method should print a list of each substring found, along with the locations at which it appears.
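The scanning step described above can be sketched as follows. To stay runnable without the provided reader classes, this sketch works on plain strings rather than DNASequence objects. IndexerSketch, Loc, and Entry are hypothetical stand-ins for Indexer, Location, and SubstringAndLocs, so treat it as an illustration of the idea and not as the required design.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the lab's Indexer, operating on plain strings.
class IndexerSketch {

    // Stand-in for Location: (index of the string, start position within it).
    static final class Loc {
        final int sequenceIndex;
        final int position;

        Loc(int sequenceIndex, int position) {
            this.sequenceIndex = sequenceIndex;
            this.position = position;
        }

        @Override
        public String toString() {
            return "(" + sequenceIndex + ", " + position + ")";
        }
    }

    // Stand-in for SubstringAndLocs: one substring plus all of its locations.
    static final class Entry {
        final String substring;
        final ArrayList<Loc> locs = new ArrayList<>();

        Entry(String substring) {
            this.substring = substring;
        }
    }

    private final List<String> sequences;
    private final int n;
    private final ArrayList<Entry> entries = new ArrayList<>();

    IndexerSketch(List<String> sequences, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.sequences = sequences;
        this.n = n;
    }

    // Scans every n-letter window of every sequence and records its location.
    void run() {
        for (int s = 0; s < sequences.size(); s++) {
            String seq = sequences.get(s);
            for (int start = 0; start + n <= seq.length(); start++) {
                String sub = seq.substring(start, start + n);
                findOrAdd(sub).locs.add(new Loc(s, start));
            }
        }
    }

    // Returns the entry for sub, creating it if this substring is new.
    private Entry findOrAdd(String sub) {
        for (Entry e : entries) {
            if (e.substring.equals(sub)) {
                return e;
            }
        }
        Entry created = new Entry(sub);
        entries.add(created);
        return created;
    }

    List<Entry> getEntries() {
        return entries;
    }

    void printResults() {
        for (Entry e : entries) {
            System.out.println(e.substring + ": " + e.locs);
        }
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        data.add("abcdabcdabcdabcd");
        data.add("bcdabcdabcdabcd");
        data.add("cdabcdabcdabcdabcdabcd");
        data.add("dabcdabcdabcd");
        IndexerSketch indexer = new IndexerSketch(data, 4);
        indexer.run();
        indexer.printResults();
    }
}
```

The linear search in findOrAdd follows the ArrayList-based description above, so entries stay in the order in which substrings are first encountered. On large inputs it becomes slow, since every window is compared against every substring seen so far.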
Write a main class that creates an Indexer object and tells it to run, and then to print the results.
Print the .java files you created for this program. You should also print the output obtained when running your program with the DNAData.txt file, using substrings of length 7. You may choose to print only the substrings that occur more than once. Alternatively, you may create a zip file from your project and email the zip file to me. Be sure that it includes the .java files and the output!