COMP210 - Project #4

PROJECT #4 BLAST
By Pamela Cutter • Edited by Sandino Vargas-Pérez

BACKGROUND

DNA analysis is a subject that is in the news almost every day, whether it be a new advance in medical research, a criminal trial, or some other application. Individuals are identified by their DNA sequence. Individuals have been exonerated from crimes that they have been accused of, and even, in some cases, exonerated from crimes they have served prison time for committing, based on DNA evidence. We have a new tool for Forensic (Legal) Investigation that was not widely available to law enforcement officials and legal experts as recently as 20 years ago.

Investigation of DNA sequences is also playing a major role in medical research. During the 2003 SARS epidemic... it was through DNA identification that the virus causing the disease was identified. With this knowledge, although scientists were not able to find a cure for the disease, they were able to implement an aggressive prevention program. In other cases it is possible to alter the genetic make-up of certain organisms to develop cures or control mechanisms for a disease, e.g. most insulin for the control of diabetes is manufactured using DNA-related technology.

DNA is common to all life - it is contained within all living cells and is continuously replicated as cells divide and when new life is created. This replication process is so accurate that mistakes are rarely made. Despite the similarities of genomes among all humans, roughly 0.1% of the 3 billion nucleotides (characters from the 4-letter DNA alphabet of Alanine, Cytosene, Guanine, Thymine), or 3 million bases, are different between any two individuals. While this leaves room for 4 ^3,000,000 different genomes, the basic long DNA sequence is roughly the same for all humans. However, an analysis of the genomes between humans and fruit flies has also revealed that many genes in humans and fruit flies are also similar. Some human genes also show similarity with worms, plants, and even bacteria. This raises several perplexing questions: How do we decide whether two strings of DNA characters represent the same gene or different genes? How do we decide whether two genomes represent the same species or different species?

Millions of unique DNA sequences have been found as a result of extracting samples from cells in laboratories around the world. These are stored in a central database that individuals can access. This makes it possible to compare DNA sequences to the 'known' sequences from the database. There are a number of exact, accurate, and powerful techniques to align sequences (placing sequences opposite each other in a biologically meaningful way), but many of these are too time consuming to be practical. By using heuristics to obtain statistically significant results, the BLAST (Basic Local Alignment Search Tool) algorithm takes minutes instead of days to search for matches.

BLAST is an important tool used by biologists worldwide to compare DNA and protein sequences, and to infer functional and evolutionary relationships between them. It is available to any individual free of cost on the World Wide Web (http://www.ncbi.nlm.nih.gov/BLAST/). It is accessed every day by thousands of researchers and students who are comparing 'their' sequence of DNA characters or their protein sequence to the sequences in the database for identification or to discover similarities within the existing database entries. If a search results in a 'hit', or 'match', and there are often hundreds, the researcher is shown the likelihood that the match is not just a random occurrence, the number of terms that match exactly, the number of terms that are acceptable substitutions, the number of gaps (extra characters) within the match, and the actual alignment. Read more

Basics of the BLAST Algorithm

Suppose that we have a query DNA sequence Q and we wish to find those sequences from a particular database that are very similar to our sequence Q. We will input our sequence Q into the BLAST algorithm, along with some threshold, T, that will be used as a level of what makes a "good" score when comparing two words, or string segments. (By "good" we mean the lowest value of a segment score that is unlikely to happen by chance.)

There are three major algorithmic steps that the BLAST program uses to detect biologically significant DNA sequence similarities: compile a list of high scoring words, scan the database for hits, and extend the hits. There are two approaches to scanning the database: one involves the use of a hash table; the other involves the use of a deterministic finite automaton. In this project, we will only explore the hash table approach.

Compiling a list of high scoring words

For each contiguous sequence of w nucleotides (i.e., substring of w characters) in a DNA query sequence, a list of high scoring words is created. Typically w >= 11, so we will take w = 11, and restrict our discussion to 11-letter words. This list of high scoring words will contain 11-letter words that score above a given input threshold, T, when compared to this word. This will allow non-exact words to trigger an extension of the hit.

Looking at a shorter example, if we were to take a 4-letter word, say ACCT, and assign a score of +1 to identical characters, and a score of -1 to mismatched characters, ACCT would score +4 with itself, but as soon as there is one mismatched character, such as in ACCA, the highest score we could obtain would be +2. There would in fact be a total of twelve 4-letter words that attain the score of +2 with ACCT. So, if our input threshold T is +2, our list of high scoring words would have 13 words in it.

Scanning the database for hits

The database is indexed and a hash table is created to contain all of the "significant" (not overly common) 11-letter words found in the sequences and their locations within the database. If all possible 11-letter words were in the table, there would be 4¹¹ = 4,194,304 words. Typically there are only a few thousand words in the table. It can be noted that DNA sequences are highly non-random, so some words just won't appear or will be so common that they would not be useful for identification purposes. The words that are used in the table are thus "filtered" to include only the most significant words.

Now, given this hash table and a query sequence, each 11-letter word from the query sequence, as well as all of the high scoring words close to this word, are hashed. If any of these words are found in the table, there is a hit, and then there is an attempt made to extend this hit at the specified locations in the database sequences.

Extending the hits

Once a hit with a word in the hash table has been obtained, the first location indicated in the database is found, and an attempt is made to try to extend the 11 letters in both directions to make a longer sequence that achieves a score higher than some prescribed cutoff score, S. This works by starting with the hit, then continuing to extend the sequence in both directions until the score falls some distance X below the maximum score attained up to that point. Once the extension has been stopped, if the total score for the extended sequence is above S, the sequence is saved for reporting in the results.

Suppose we have found that some 11-letter word from a query sequence, such as GACATAGCAAC, is in the hash table of significant words from a database, and suppose we have been given a target score, S. We look in our hash table at the entry corresponding to our good word to find where we should look to start extending our sequence. Suppose we find the following, where the top row contains a portion of the database sequence, and the bottom row contains the query sequence. The 11-letter word in bold was found to have a score of +7 with the word from the database sequence (using +1 for identical characters, -1 for mismatched characters), which, we will assume for this example, is above the threshold T.

A C T G G C C T A G A C C T A G C T A C T A A G T

C C T G A C A T A G A C A T A G C A A C T A T C T

Now, we could clearly continue to extend our sequence to include both the TA before it and the CTA after it, as these will both increase the score. But, depending on what the scoring level S was, and how much we are allowed to drop below the maximum thus far, extending it further might cause our score to fall too low. Once we have extended as far as possible, and have attained a score above S, we then save this long sequence to be reported in the final analysis.

THE PROJECT

In this project, you will be using a real data set and will implement the BLAST program on a smaller scale, looking for exact matches among sequences. We are using virus data obtained from NCBI (National Center for Biotechnology Information). Each virus DNA sequence we will use contains just the first 1020 nucleotides in the sequence. Many of these sequences are much longer than that; several were shorter, and have been padded with x's so that they are all the same length.

We will be using the following modified algorithm:

Index the database: Read the virus DNA sequences from a data file; choose an appropriate data structure to hold this information. Create a hash table to contain all of the n-letter words found in the sequences, and their locations within the database (i.e., which sequence and the starting position(s) of this word in the sequence).
Scan the query sequence: For each n-letter word in the query sequence, check if it is in the hash table. If it is in the table, attempt to extend this hit.
Extend the hits: Once we have a obtained a hit with a word in the hash table, go to the first location indicated in a database sequence, and begin to try to extend the match to make a longer matching sequence. (Note, with this algorithm, we are just looking for exact matches between sequences.) When we get to a mismatch, stop extending. Once the extension has been stopped, check if the length of the match is longer than some prescribed value. If it is, and the match is not part of a longer match already being saved, save this match for reporting.
Report the results: For each query sequence, display the number of significant matches (matches longer than the prescribed value) with virus sequences, as well information about each match. This information should at least include which virus(es) contained matches, as well as the starting location(s) and length of each match within the virus, and the matched substrings.

Design Questions

What are the objects involved in this project? (May include such things as DNA sequence, query sequence, word, database, and hash table.)
What data structure will you use to hold the virus DNA sequences?
Will this data structure be part of a class? If so, what are the responsibilities of this class?
What class should include the hash table?
What information do you need to save when you find a match? Should you create a new type of object for this?
Where should the user specify filenames for the DNA sequence data and the query sequence data? Where should they specify n, the substring length to use as the words? Where should they specify the length to use for significant matches?
Which parts of your Indexing Substrings project might you be able to build from?

Remember the Principle of Data Abstraction: implementation details should be hidden from the user.

Possible Design

The following is one possible design. A description of the classes used is given below. You may choose to implement this design or your own design.

DNASequence: This class represents a DNA sequence - it contains the GI, the description, and the (partial) DNA sequence.
Location: This class represents a pairing of a sequence number and a position within that sequence.
Database: This class contains the DNA sequence data, it contains the HashMap that holds all possible n-letter words contained in this DNA data, it has a method to find matches of substrings of a query sequence, a method to extend those matches, and possibly a method to display the matches.
MatchElement: This class stores information for each match longer than a given threshold.
BLASTMain: This class contains the main method which creates and runs the application.
BLASTApp: This class gets the DNA data and the query data from the data readers. It has a run method that iterates through each query and displays any matches with substrings of the DNA data that are longer than a given threshold.
DNADataReader: This class reads the DNA sequence data from a file, and puts it into an appropriate data structure.
QueryReader: This class reads the query data from a file, and puts it into an appropriate data structure.

Available Files

You may choose to use the following files, or you can make your own:

Design ideas and implementation hints:

Java files for input and reading from files:

Data Files

DNAData.txt: This is the same data file used in the Indexing Substrings project.
viruses_100.fasta: Contains the complete genomes of 100 viruses. (This is a large file!)
Queries.txt: Contains the queries for which we want to find matches.
testData.txt: Contains a small number of short sequences that can be used for testing.
testQueries.txt: Contains a few short queries for testing purposes.
Output11_15.txt: Output obtained when using viruses_100.fasta, Queries.txt, and looking for matches at least 15 characters long.

Acknowledgments: This project was written by Pamela Cutter. The background information was modified from "Searching DNA Strands", an educational module written by Pamela Cutter, Carl Leinbach and Kah Loon Ng, submitted to the DIMACS Educational Module Series, Rutgers University.

EVALUATION

Project #4 is worth 30 points. This project will submitted in groups of 3 students. The following items describe the criteria we will use to evaluate your project:

Object Oriented Desing/Abstraction (5 points):
- Appropiate use of Database, MatchElement, BLASTMain, BLASTApp, DNASequence, Location classes, and other classes you might create 2.5 pts.
- Correct use of data abstraction: declaring objects, using accurate access modifiers, encapsulation, interfaces (if applicable), and so on 2.5 pts.
- Note: This section has to do with correct use of the concepts of Object Oriented Programming we have applied during class.
Implementation (20 points):
- Reading virus data and query data into structure 2.5 pts.
- Creating HashMap of n-letter words 3 pts.
- Scan query and find matches: check if n-letter words are in HashMap 7 pts.
- Extend matches (avoiding repeated matches) 5 pts.
- Report results (see Output11_15.txt for format guide) 2.5 pts.
Internal Documentation and Style (5 points):
- Appropriate internal documentation at the top of all files 2 pts.
- Appropriate internal comments, variable names, indentation in all files 3 pts.
Note: Make sure these classes are properly implemented, you have clean code, and the appropriate amount of comments.

Submit your completed program through Kit, under Project 4: BLAST.

Have fun! And if you have any questions remember my email, my office hours, and the collaboration center are here for you!

← back to the schedule

PROJECT #4 BLAST By Pamela Cutter • Edited by Sandino Vargas-Pérez