PROJECT #4 BLAST
By Pamela Cutter • Edited by Sandino Vargas-Pérez
BACKGROUND
DNA analysis is a subject that is in the news almost every day, whether
it be a new advance in medical research, a criminal trial, or some other
application. Individuals are identified by
their DNA sequence. Individuals have been exonerated from crimes that
they have been accused of, and even, in some cases, exonerated from
crimes they have served prison
time for committing, based on DNA evidence. We have a new tool for
Forensic (Legal) Investigation that was not widely available to law
enforcement officials and legal experts as recently as 20 years ago.
Investigation of DNA sequences is also playing a major role in medical
research. During the 2003 SARS epidemic... it was through DNA
identification that the virus causing the disease was identified. With
this knowledge, although
scientists were not able to find a cure for the disease, they were able
to implement an aggressive
prevention program. In other cases it is possible to alter the genetic
make-up of certain organisms to develop cures or control mechanisms for
a disease, e.g. most insulin for the control of diabetes is manufactured
using DNA-related technology.
DNA is common to all life - it is contained
within all living cells and is continuously replicated as cells divide
and when new life is created. This replication process is so accurate
that mistakes are rarely made. Despite the similarities of genomes
among all humans, roughly 0.1% of the 3 billion nucleotides (characters
from the 4-letter DNA alphabet of Alanine, Cytosene,
Guanine, Thymine), or 3 million bases, are different
between any two individuals. While this leaves room for 4
3,000,000 different genomes, the basic long DNA sequence is
roughly the same for all humans. However, an analysis of the genomes
between humans and fruit flies has also revealed that many genes in
humans and fruit flies are also similar. Some human genes also show
similarity with worms, plants, and even bacteria. This raises several
perplexing questions: How do we decide whether two strings of DNA
characters represent the same gene or different genes? How do we decide
whether two genomes represent the same species or different species?
Millions of unique DNA sequences have been found as a result of
extracting samples from cells in laboratories around the world. These
are stored in a central database that individuals can access. This
makes it possible to compare DNA sequences to
the 'known' sequences from the database. There are a number of exact,
accurate, and powerful techniques to align sequences (placing sequences
opposite each other in a biologically meaningful way), but many of these
are too time consuming to be practical. By using heuristics to
obtain statistically significant results, the
BLAST (Basic Local Alignment Search
Tool) algorithm takes minutes instead of days to search for matches.
BLAST is an important tool used by biologists worldwide to compare DNA
and protein sequences, and to infer functional and
evolutionary relationships between them. It is available to any
individual free of cost on the World Wide Web
(http://www.ncbi.nlm.nih.gov/BLAST/).
It is accessed every day by thousands of
researchers and students who are comparing 'their' sequence of
DNA characters or their protein sequence to the sequences in the
database for identification or to discover similarities within the
existing database entries.
If a search results in a 'hit', or 'match', and there are often hundreds, the researcher is shown
the likelihood that the match is not just a random occurrence, the
number of terms that match exactly, the number of terms that are
acceptable substitutions, the number of gaps (extra characters) within the match, and the actual alignment.
Read more
Basics of the BLAST Algorithm
Suppose that we have a query DNA sequence Q and we wish to find those sequences from a particular database that are very similar to our sequence Q. We will input our sequence Q into the BLAST algorithm, along with some threshold, T, that will be used as a level of what makes a "good" score when comparing two words, or string segments. (By "good" we mean the lowest value of a segment score that is unlikely to happen by chance.)
There are three major algorithmic steps that the BLAST program uses to detect biologically significant DNA sequence similarities: compile a list of high scoring words, scan the database for hits, and extend the hits. There are two approaches to scanning the database: one involves the use of a hash table; the other involves the use of a deterministic finite automaton. In this project, we will only explore the hash table approach.
Compiling a list of high scoring words
For each contiguous sequence of w
nucleotides (i.e., substring of
w
characters) in a
DNA query sequence, a list of high scoring words is created. Typically
w >= 11
, so we
will take w = 11
, and restrict our discussion to 11-letter words. This
list of high scoring words will contain 11-letter words that score above a given
input threshold, T
, when compared to this word. This will allow
non-exact words to trigger an extension of the hit.
Looking at a shorter example, if we were to take a 4-letter word, say
ACCT
, and assign a score of +1 to identical characters, and a
score of -1 to mismatched characters, ACCT
would score +4 with
itself, but as soon as there is one mismatched character, such as in
ACCA
, the highest score we could obtain would be +2. There would
in fact be a total of twelve 4-letter words that attain the score of +2
with ACCT
. So, if our input threshold T
is +2, our list
of high scoring words would have 13 words in it.
Scanning the database for hits
The database is indexed and a hash table is created to contain all of the "significant" (not overly common) 11-letter words found in the sequences and their locations within the database. If all possible 11-letter words were in the table, there would be 411 = 4,194,304 words. Typically there are only a few thousand words in the table. It can be noted that DNA sequences are highly non-random, so some words just won't appear or will be so common that they would not be useful for identification purposes. The words that are used in the table are thus "filtered" to include only the most significant words.
Now, given this hash table and a query sequence, each 11-letter word from the query sequence, as well as all of the high scoring words close to this word, are hashed. If any of these words are found in the table, there is a hit, and then there is an attempt made to extend this hit at the specified locations in the database sequences.
Extending the hits
Once a hit with a word in the hash table has been obtained, the first
location indicated in the database is found, and an attempt is made to
try to extend the 11 letters in both directions to make a longer
sequence that achieves a score higher than some prescribed cutoff score,
S
. This works by starting with the hit, then continuing to
extend the sequence in both directions until the score falls some
distance X
below the maximum score attained up to that point.
Once the extension has been stopped, if the total score for the extended
sequence is above S
, the sequence is saved for reporting in the
results.
Suppose we have found that some 11-letter word from a query sequence,
such as GACATAGCAAC
, is in the hash table of significant words from a
database, and suppose we have been given a target score, S
. We
look in our hash table at the entry corresponding to our good word to
find where we should look to start extending our sequence. Suppose we
find the following, where the top row contains a portion of the
database sequence, and the bottom row contains the query sequence. The
11-letter word in bold was found to have a score of +7 with the word
from the database sequence (using +1 for identical characters, -1 for
mismatched characters), which, we will assume for this example, is above
the threshold T
.
A C T G G C C T A G A C C T A G C T A C T A A G T
C C T G A C A T A G A C A T A G C A A C T A T C T
Now, we could clearly continue to extend our sequence to include both
the TA before it and the CTA after it, as these will both increase the
score. But, depending on what the scoring level S
was, and how
much we are allowed to drop below the maximum thus far, extending it
further might cause our score to fall too low. Once we have extended as
far as possible, and have attained a score above S
, we then save
this long sequence to be reported in the final analysis.
THE PROJECT
In this project, you will be using a real data set and will implement the
BLAST program on a smaller scale, looking for
exact matches among sequences. We are using virus data obtained from
NCBI (National Center for Biotechnology Information). Each virus DNA sequence we will use contains just the first 1020
nucleotides in the sequence. Many of these sequences are much
longer than that; several were shorter, and have been padded with
x
's so
that they are all the same length.
We will be using the following modified algorithm:
- Index the database: Read the virus DNA sequences from a data file;
choose an appropriate data structure to hold this information. Create a hash table
to contain all of the
n
-letter words found in the sequences, and their locations within the database (i.e., which sequence and the starting position(s) of this word in the sequence). - Scan the query sequence: For each
n
-letter word in the query sequence, check if it is in the hash table. If it is in the table, attempt to extend this hit. - Extend the hits: Once we have a obtained a hit with a word in the hash table, go to the first location indicated in a database sequence, and begin to try to extend the match to make a longer matching sequence. (Note, with this algorithm, we are just looking for exact matches between sequences.) When we get to a mismatch, stop extending. Once the extension has been stopped, check if the length of the match is longer than some prescribed value. If it is, and the match is not part of a longer match already being saved, save this match for reporting.
- Report the results: For each query sequence, display the number of significant matches (matches longer than the prescribed value) with virus sequences, as well information about each match. This information should at least include which virus(es) contained matches, as well as the starting location(s) and length of each match within the virus, and the matched substrings.
Design Questions
- What are the objects involved in this project? (May include such things as DNA sequence, query sequence, word, database, and hash table.)
- What data structure will you use to hold the virus DNA sequences?
- Will this data structure be part of a class? If so, what are the responsibilities of this class?
- What class should include the hash table?
- What information do you need to save when you find a match? Should you create a new type of object for this?
- Where should the user specify filenames for the DNA sequence data
and the query sequence data? Where should they specify
n
, the substring length to use as the words? Where should they specify the length to use for significant matches? - Which parts of your Indexing Substrings project might you be able to build from?
Remember the Principle of Data Abstraction: implementation details should be hidden from the user.
Possible Design
The following is one possible design. A description of the classes used is given below. You may choose to implement this design or your own design.
DNASequence
: This class represents a DNA sequence - it contains the GI, the description, and the (partial) DNA sequence.Location
: This class represents a pairing of a sequence number and a position within that sequence.Database
: This class contains the DNA sequence data, it contains theHashMap
that holds all possible n-letter words contained in this DNA data, it has a method to find matches of substrings of a query sequence, a method to extend those matches, and possibly a method to display the matches.MatchElement
: This class stores information for each match longer than a given threshold.BLASTMain
: This class contains the main method which creates and runs the application.BLASTApp
: This class gets the DNA data and the query data from the data readers. It has a run method that iterates through each query and displays any matches with substrings of the DNA data that are longer than a given threshold.DNADataReader
: This class reads the DNA sequence data from a file, and puts it into an appropriate data structure.QueryReader
: This class reads the query data from a file, and puts it into an appropriate data structure.
Available Files
You may choose to use the following files, or you can make your own:
Design ideas and implementation hints:
Java files for input and reading from files:
Data Files
- DNAData.txt: This is the same data file used in the Indexing Substrings project.
- viruses_100.fasta: Contains the complete genomes of 100 viruses. (This is a large file!)
- Queries.txt: Contains the queries for which we want to find matches.
- testData.txt: Contains a small number of short sequences that can be used for testing.
- testQueries.txt: Contains a few short queries for testing purposes.
- Output11_15.txt:
Output obtained when using
viruses_100.fasta
,Queries.txt
, and looking for matches at least 15 characters long.
EVALUATION
Project #4 is worth 30 points. This project will submitted in groups of 3 students. The following items describe the criteria we will use to evaluate your project:
- Object Oriented Desing/Abstraction (5 points):
- Appropiate use of
Database
,MatchElement
,BLASTMain
,BLASTApp
,DNASequence
,Location
classes, and other classes you might create 2.5 pts. - Correct use of data abstraction: declaring objects, using accurate access modifiers, encapsulation, interfaces (if applicable), and so on 2.5 pts.
- Note: This section has to do with correct use of the concepts of Object Oriented Programming we have applied during class.
- Appropiate use of
- Implementation (20 points):
- Reading virus data and query data into structure 2.5 pts.
- Creating
HashMap
of n-letter words 3 pts. - Scan query and find matches: check if n-letter words are in
HashMap
7 pts. - Extend matches (avoiding repeated matches) 5 pts.
- Report results (see Output11_15.txt for format guide) 2.5 pts.
- Internal Documentation and Style (5 points):
- Appropriate internal documentation at the top of all files 2 pts.
- Appropriate internal comments, variable names, indentation in all files 3 pts.
- Note: Make sure these classes are properly implemented, you have clean code, and the appropriate amount of comments.
Submit your completed program through Kit, under Project 4: BLAST.
Have fun! And if you have any questions remember my email, my office hours, and the collaboration center are here for you!
← back to the schedule