Indexing Substrings

Pamela Cutter
Kalamazoo College


In this lab, you will write a program that will report each n-letter substring that appears in a list of strings, as well as where it appears in each of the strings, where n is to be determined by the user. This is a process that is useful in finding shared substrings in DNA sequences.

For example, suppose we have a list of strings: acgtacgtacgtacgt, cgtacgtacgtacgt, gtacgtacgtacgtacgtacgt, and tacgtacgtacgt, and we are looking for 4-letter substrings. The first substring we find, acgt, appears 4 times in the first string - at positions 0, 4, 8, and 12. It also appears 3 times in the second string - at positions 3, 7, and 11, 5 times in the third string - at positions 2, 6, 10, 14, and 18, and 3 times in the fourth string - at positions 1, 5, and 9. Your program will report these 15 occurrences of the substring acgt. It will then report the 12 occurrences of cgta, the next substring encountered, and will continue with the occurrences of gtac, etc.


Getting Started

There are a number of files provided to get you started. To begin, download the following files and create a project in Eclipse (or BlueJ) containing them.

DNAData.txt - This is a file of partial DNA sequences, taken from a set of viruses. A DNA sequence consists of a GI identifier, a description, and then a sequence of nucleotides. These nucleotide sequences are the strings that you will be indexing.

TestData.txt - This file has a format similar to DNAData.txt, but can be used to easily check the correctness of your implementation.

DNADataReader.java - This file contains code to read DNA sequence data in from a file, such as TestData.txt or DNAData.txt.

ValidatedInputReader.java - This file may be used for input of integers and strings.

Debug.java - This file may be useful for testing and debugging your program.


Implementation

The implementation of this program has been divided into smaller tasks (stages), which should make testing easier.

Stage 1: Reading in the Data


Stage 2: Create additional classes

Testing: Add code to your main method that tests your Location and SubstringLocs classes as you develop them. You can test toString() methods by calling them explicitly or by using the objects in System.out.print statements. For example,
     System.out.println(new Location(3, 4));
tests both the Location constructor and its toString method.

Stage 3: Indexing


Stage 4: Submission

picture of SubstringLocs list
Figure 4
picture of Sequence list
Figure 3