BLAST Programming Project

Pamela Cutter
Kalamazoo College

Background

DNA analysis is a subject that is in the news almost every day, whether it be a new advance in medical research, a criminal trial, or some other application. Individuals are identified by their DNA sequence. Individuals have been exonerated from crimes that they have been accused of, and even, in some cases, exonerated from crimes they have served prison time for committing, based on DNA evidence. We have a new tool for Forensic (Legal) Investigation that was not widely available to law enforcement officials and legal experts as recently as 20 years ago.

Investigation of DNA sequences is also playing a major role in medical research. During the 2003 SARS epidemic it was through DNA identification that the virus causing the disease was identified. With this knowledge, although scientists were not able to find a cure for the disease, they were able to implement an aggressive prevention program. In other cases it is possible to alter the genetic make-up of certain organisms to develop cures or control mechanisms for a disease, e.g. most insulin for the control of diabetes is manufactured using DNA-related technology.

DNA is common to all life - it is contained within all living cells and is continuously replicated as cells divide and when new life is created. This replication process is so accurate that mistakes are rarely made. Despite the similarities of genomes among all humans, roughly 0.1% of the 3 billion nucleotides (characters from the 4-letter DNA alphabet of Alanine, Cytosene, Guanine, Thymine), or 3 million bases, are different between any two individuals. While this leaves room for 4 3,000,000 different genomes, the basic long DNA sequence is roughly the same for all humans. However, an analysis of the genomes between humans and fruit flies has also revealed that many genes in humans and fruit flies are also similar. Some human genes also show similarity with worms, plants, and even bacteria. This raises several perplexing questions: How do we decide whether two strings of DNA characters represent the same gene or different genes? How do we decide whether two genomes represent the same species or different species?

Millions of unique DNA sequences have been found as a result of extracting samples from cells in laboratories around the world. These are stored in a central database that individuals can access. This makes it possible to compare DNA sequences to the 'known' sequences from the database. There are a number of exact, accurate, and powerful techniques to align sequences (placing sequences opposite each other in a biologically meaningful way), but many of these are too time consuming to be practical. By using heuristics to obtain statistically significant results, the BLAST (Basic Local Alignment Search Tool) algorithm takes minutes instead of days to search for matches.

BLAST is an important tool used by biologists worldwide to compare DNA and protein sequences, and to infer functional and evolutionary relationships between them. It is available to any individual free of cost on the World Wide Web (http://www.ncbi.nlm.nih.gov/BLAST/). It is accessed every day by thousands of researchers and students who are comparing 'their' sequence of DNA characters or their protein sequence to the sequences in the database for identification or to discover similarities within the existing database entries. If a search results in a 'hit', or 'match', and there are often hundreds, the researcher is shown the likelihood that the match is not just a random occurrence, the number of terms that match exactly, the number of terms that are acceptable substitutions, the number of gaps (extra characters) within the match, and the actual alignment.

The Basics of the BLAST Algorithm

Suppose that we have a query DNA sequence Q and we wish to find those sequences from a particular database that are very similar to our sequence Q. We will input our sequence Q into the BLAST algorithm, along with some threshold, T, that will be used as a level of what makes a "good" score when comparing two words, or string segments. (By "good" we mean the lowest value of a segment score that is unlikely to happen by chance.)

There are three major algorithmic steps that the BLAST program uses to detect biologically significant DNA sequence similarities: compile a list of high scoring words, scan the database for hits, and extend the hits. There are two approaches to scanning the database: one involves the use of a hash table; the other involves the use of a deterministic finite automaton. In this project, we will only explore the hash table approach.

Compiling a list of high scoring words

For each contiguous sequence of w nucleotides (i.e., substring of w characters) in a DNA query sequence, a list of high scoring words is created. Typically w >= 11, so we will take w = 11, and restrict our discussion to 11-letter words. This list of high scoring words will contain 11-letter words that score above a given input threshold, T, when compared to this word. This will allow non-exact words to trigger an extension of the hit.

Looking at a shorter example, if we were to take a 4-letter word, say ACCT, and assign a score of +1 to identical characters, and a score of -1 to mismatched characters, ACCT would score +4 with itself, but as soon as there is one mismatched character, such as in ACCA, the highest score we could obtain would be +2. There would in fact be a total of twelve 4-letter words that attain the score of +2 with ACCT. So, if our input threshold T is +2, our list of high scoring words would have 13 words in it.

Scanning the database for hits

The database is indexed and a hash table is created to contain all of the "significant" (not overly common) 11-letter words found in the sequences and their locations within the database. If all possible 11-letter words were in the table, there would be 411 = 4,194,304 words. Typically there are only a few thousand words in the table. It can be noted that DNA sequences are highly non-random, so some words just won't appear or will be so common that they would not be useful for identification purposes. The words that are used in the table are thus "filtered" to include only the most significant words.

Now, given this hash table and a query sequence, each 11-letter word from the query sequence, as well as all of the high scoring words close to this word, are hashed. If any of these words are found in the table, there is a hit, and then there is an attempt made to extend this hit at the specified locations in the database sequences.

Extending the hits

Once a hit with a word in the hash table has been obtained, the first location indicated in the database is found, and an attempt is made to try to extend the 11 letters in both directions to make a longer sequence that achieves a score higher than some prescribed cutoff score, S. This works by starting with the hit, then continuing to extend the sequence in both directions until the score falls some distance X below the maximum score attained up to that point. Once the extension has been stopped, if the total score for the extended sequence is above S, the sequence is saved for reporting in the results.

Suppose we have found that some 11-letter word from a query sequence, such as GACATAGCAAC, is in the hash table of significant words from a database, and suppose we have been given a target score, S. We look in our hash table at the entry corresponding to our good word to find where we should look to start extending our sequence. Suppose we find the following, where the top row contains a portion of the database sequence, and the bottom row contains the query sequence. The 11-letter word in bold was found to have a score of +7 with the word from the database sequence (using +1 for identical characters, -1 for mismatched characters), which, we will assume for this example, is above the threshold T.

Now, we could clearly continue to extend our sequence to include both the TA before it and the CTA after it, as these will both increase the score. But, depending on what the scoring level S was, and how much we are allowed to drop below the maximum thus far, extending it further might cause our score to fall too low. Once we have extended as far as possible, and have attained a score above S, we then save this long sequence to be reported in the final analysis.


The Project

In this project, you will be using a real data set and will implement the BLAST program on a smaller scale, looking for exact matches among sequences. We are using virus data obtained from NCBI (National Center for Biotechnology Information). Each virus DNA sequence we will use contains just the first 1020 nucleotides in the sequence. Many of these sequences are much longer than that; several were shorter, and have been padded with x's so that they are all the same length.

We will be using the following modified algorithm:

Design Questions

Remember the Principle of Data Abstraction: implementation details should be hidden from the user. If you are creating some kind of siumlation, any implementation details and data structures should be hidden from users of the simulation. Users should be able to create a simulation and just tell it to run.

One Possibility

Design

The following is one possible design. A description of the classes used is given below. You may choose to implement this design or your own design.

Available Files

You may choose to use the following files, or you may make your own.

Design ideas and implementation hints:

Java files for input and reading from files:

Data Files