### Survey - University of Washington

BLAST, PSI-BLAST and position-specific scoring matrices
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
Outline
• Responses from last class
• Revision
• BLAST
• PSI-BLAST
• Position specific scoring matrices (PSSMs)
• Python
One-minute responses
• Please explain the null and alternative hypothesis again.
• Liked giving examples on the statistical concepts.
• Sometimes the class is boring because you are using only the projector.
• Python session was good, but too fast.
• The Python is difficult because it is different from what we learned before.
• The problem is how to use sys in Python. I hope you give lots of examples for the sys command.
• Please be available for consultation over the weekend on the assignment.
• Does BLAST use p-values to decide which alignments to consider?
Revision
• What is a distribution?
– A mathematical function whose
values sum to 1.
• If you roll a single die many times
and make a histogram of the
resulting values, what kind of
distribution will you observe?
– Uniform
• If you compare a protein
sequence to many, randomly
shuffled protein sequences and
make a histogram of the resulting
scores, what kind of distribution
will you observe?
– Extreme value distribution
• What is the definition of “null
hypothesis”?
– A statistical model of the situation
that we are not interested in.
• What is the opposite of the null
hypothesis?
– The alternative hypothesis.
• What is the name of the
estimated probability of
observing the data, assuming that
it was generated according to the
null hypothesis?
– p-value
• How do you decide what p-value
threshold to use?
– Consider the costs associated with
making a mistake.
Significance of scores
[Figure: the sequences HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT and LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE are passed to a sequence alignment algorithm, which returns a score of 45.]
Low score = unrelated
High score = homologs
How high is high enough?
Database searching
[Figure: a query and a sequence database are passed to a sequence comparison algorithm, which returns the targets ranked by score.]
How long does DP take?
[Figure: dynamic programming matrix, with the query sequence of length n along one axis and the target sequence of length m along the other.]
• There are nm entries in the matrix.
• Each entry requires a constant number c of operations.
• The total number of required operations is approximately nmc.
• We say that the algorithm is “order nm” or “O(nm).”
How long does DP take?
• Say that your query is 200 amino acids long.
• You are searching a database that contains a million
proteins.
• If their average length is 200, then you have to fill in 200 × 200 × 1,000,000 = 4 × 10^10 DP entries.
• If it takes only 10 operations to fill in each cell, then you still have to do 4 × 10^11 operations.
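The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and do not come from the lecture.

```python
# Back-of-the-envelope cost of a dynamic programming database search.
query_length = 200          # amino acids in the query
num_targets = 1_000_000     # proteins in the database
avg_target_length = 200     # average target length
ops_per_entry = 10          # assumed constant cost of filling one DP entry

dp_entries = query_length * avg_target_length * num_targets
total_ops = dp_entries * ops_per_entry

print(f"DP entries: {dp_entries:.0e}")   # 4e+10
print(f"Operations: {total_ops:.0e}")    # 4e+11
```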
BLAST
• DP is O(nm); BLAST is O(m).
• Fundamental innovation: employ a data
structure to index the query sequence.
• The data structure allows you to look up
entries in a table in O(1) time.
[Figure: “Does my length-n sequence contain the subsequence “GTR”?”]
• Naive method: scan the sequence, O(n).
• Improved method: hash table or search tree lookup, O(1).
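The difference between scanning and indexing can be sketched in Python, where a dictionary plays the role of the hash table. The function names and the word length of 3 are assumptions made for illustration; they are not taken from the BLAST source code.

```python
from collections import defaultdict

def build_word_index(sequence, word_length=3):
    """Map every length-k word to the list of positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_length + 1):
        index[sequence[start:start + word_length]].append(start)
    return index

def contains_naive(sequence, word):
    """Scan every start position: O(n) work for each question."""
    return any(sequence[i:i + len(word)] == word
               for i in range(len(sequence) - len(word) + 1))

sequence = "HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT"
index = build_word_index(sequence)   # built once, O(n)

print(contains_naive(sequence, "GTR"))  # False, after scanning the sequence
print("GTR" in index)                   # False, after one hash lookup
print(index.get("KVL"))                 # [17]
```

Building the index costs O(n) once; after that, each "does this word occur?" question is answered in expected constant time.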
BLAST
[Figure sequence: a plot with the query sequence on one axis and the target sequence on the other, next to a list of words in the query and similar words.]
• For each word in the target sequence, ask: “Does this target word appear in the query word list?”
• The list answers, for example: “Yes, at position 34 in the query sequence.” The hit is marked with an x in the plot.
• Repeating the lookup for every target word yields a scatter of hits.
• “These two hits are on the diagonal and close to each other, so let’s try to connect them.”
• Assign a score to each hit (the figure shows example values of 0.005 and 0.27).
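The lookup-and-connect steps can be sketched as follows. This is a simplification under stated assumptions: it matches exact words only (real BLAST also looks up similar words), and the function names and toy sequences are hypothetical.

```python
from collections import defaultdict

def find_hits(query, target, word_length=3):
    """Return (query_pos, target_pos) pairs where a target word occurs in the query."""
    index = defaultdict(list)
    for q in range(len(query) - word_length + 1):
        index[query[q:q + word_length]].append(q)
    hits = []
    for t in range(len(target) - word_length + 1):
        for q in index.get(target[t:t + word_length], []):
            hits.append((q, t))
    return hits

def group_by_diagonal(hits):
    """Hits with the same (target_pos - query_pos) lie on the same diagonal."""
    diagonals = defaultdict(list)
    for q, t in hits:
        diagonals[t - q].append((q, t))
    return dict(diagonals)

query = "ACDEFGHIK"
target = "WWACDEYGHIKWW"   # the query with one substitution (F -> Y), padded with W
hits = find_hits(query, target)
print(hits)                     # [(0, 2), (1, 3), (5, 7), (6, 8)]
print(group_by_diagonal(hits))  # all four hits fall on diagonal 2
```

Hits that share a diagonal and sit close together are the candidates that would be joined into a longer alignment.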
BLAST
• “The central idea of the BLAST algorithm is
that a statistically significant alignment is likely
to contain a high-scoring pair of aligned
words.”
• The initial word threshold T is the most
important parameter.
• Low T = high sensitivity, long compute.
• High T = low sensitivity, quick compute.
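To see why T trades sensitivity for speed, the sketch below enumerates the "similar words" for one query word at two thresholds. It assumes a six-letter excerpt of the BLOSUM62 matrix so that the example stays small; the function names are hypothetical, and real BLAST uses the full 20-letter alphabet.

```python
from itertools import product

ALPHABET = "RNQGHF"
# Upper triangle of BLOSUM62, restricted to six amino acids for brevity.
PAIR_SCORES = {
    "RR": 5, "RN": 0, "RQ": 1, "RG": -2, "RH": 0, "RF": -3,
    "NN": 6, "NQ": 0, "NG": 0, "NH": 1, "NF": -3,
    "QQ": 5, "QG": -2, "QH": 0, "QF": -3,
    "GG": 6, "GH": -2, "GF": -3,
    "HH": 8, "HF": -1,
    "FF": 6,
}

def pair_score(a, b):
    """Look up a symmetric substitution score."""
    return PAIR_SCORES.get(a + b, PAIR_SCORES.get(b + a))

def neighborhood(word, threshold):
    """All words over ALPHABET that score at least `threshold` against `word`."""
    words = []
    for candidate in product(ALPHABET, repeat=len(word)):
        score = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            words.append("".join(candidate))
    return words

high_t = neighborhood("RNR", 16)   # only the word itself reaches 5 + 6 + 5 = 16
low_t = neighborhood("RNR", 11)    # a lower T admits more similar words
print(high_t)
print(len(low_t), low_t)
```

A lower T puts more words in the lookup list, so more target words produce hits: more chances to find a true homolog, and more candidate hits to process.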
When does BLAST fail?
[Figure: comparison matrix with the sequence ERFEKAYKELIFEMAVNVMF along the vertical axis and the sequence ECEIRQFLFIQRESARKEACATGTYREKKMDPELIVLVIWICPQFEQLEMRAMWIHAKJEVIUENAQCVIYTMQEPFCII along the horizontal axis.]
• BLAST works by joining together short regions
of high similarity.
• Therefore, BLAST will fail to detect long
regions of low similarity.
Summary of BLAST
• Dynamic programming is O(nm), where n is the length of the
query and m is the size of the database.
• BLAST is O(m).
• BLAST produces an index of the query sequence that allows
fast matching to the database.
• Relative to Smith-Waterman, BLAST can produce false
negatives; i.e., homologs that BLAST fails to detect.
BLAST
[Figure: a query is searched with BLAST against a sequence database, which returns homologs.]
Position-specific iterated BLAST
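[Figure: the query is searched with BLAST against a sequence database; the homologs that are found are used to build a position-specific scoring matrix (PSSM), a statistical model of the protein family, which is then used for the next search.]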
Position-specific scoring matrix
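• A PSSM is an n by m matrix, where n is the size of the alphabet, and m is the length of the sequence.
• The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position.

Rows are amino acids; columns are positions in the query sequence.

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| A | -1 | -2 | -1 | 0 | -1 | -2 | 0 | -2 |
| R | 5 | 0 | 5 | -2 | 1 | -3 | -2 | 0 |
| N | 0 | 6 | 0 | 0 | 0 | -3 | 0 | 1 |
| D | -2 | 1 | -2 | -1 | 0 | -3 | -1 | -1 |
| C | -3 | -3 | -3 | -3 | -3 | -2 | -3 | -3 |
| Q | 1 | 0 | 1 | -2 | 5 | -3 | -2 | 0 |
| E | 0 | 0 | 0 | -2 | 2 | -3 | -2 | 0 |
| G | -2 | 0 | -2 | 6 | -2 | -3 | 6 | -2 |
| H | 0 | 1 | 0 | -2 | 0 | -1 | -2 | 8 |
| I | -3 | -3 | -3 | -4 | -3 | 0 | -4 | -3 |
| L | -2 | -3 | -2 | -4 | -2 | 0 | -4 | -3 |
| K | 2 | 0 | 2 | -2 | 1 | -3 | -2 | -1 |
| M | -1 | -2 | -1 | -3 | 0 | 0 | -3 | -2 |
| F | -3 | -3 | -3 | -3 | -3 | 6 | -3 | -1 |
| P | -2 | -2 | -2 | -2 | -1 | -4 | -2 | -2 |
| S | -1 | 1 | -1 | 0 | 0 | -2 | 0 | -1 |
| T | -1 | 0 | -1 | -2 | -1 | -2 | -2 | -2 |
| W | -3 | -4 | -3 | -2 | -2 | 1 | -2 | -2 |
| Y | -2 | -2 | -2 | -3 | -1 | 3 | -3 | 2 |
| V | -3 | -3 | -3 | -3 | -2 | -1 | -3 | -3 |

“K” at position 3 gets a score of 2.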
Position-specific scoring matrix
• This PSSM assigns the sequence NMFWAFGH a score of 0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12.
• What score does this PSSM assign to KRPGHFLA?
• 2 + 0 + -2 + 6 + 0 + 6 + -4 + -2 = 6
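The two worked scores can be reproduced with a short Python sketch. The PSSM values are copied from the slides; the dictionary layout and the function name `pssm_score` are choices made for this illustration.

```python
# PSSM from the slides: one row per amino acid, one entry per query position.
PSSM = {
    "A": [-1, -2, -1, 0, -1, -2, 0, -2],
    "R": [5, 0, 5, -2, 1, -3, -2, 0],
    "N": [0, 6, 0, 0, 0, -3, 0, 1],
    "D": [-2, 1, -2, -1, 0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [1, 0, 1, -2, 5, -3, -2, 0],
    "E": [0, 0, 0, -2, 2, -3, -2, 0],
    "G": [-2, 0, -2, 6, -2, -3, 6, -2],
    "H": [0, 1, 0, -2, 0, -1, -2, 8],
    "I": [-3, -3, -3, -4, -3, 0, -4, -3],
    "L": [-2, -3, -2, -4, -2, 0, -4, -3],
    "K": [2, 0, 2, -2, 1, -3, -2, -1],
    "M": [-1, -2, -1, -3, 0, 0, -3, -2],
    "F": [-3, -3, -3, -3, -3, 6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1, 1, -1, 0, 0, -2, 0, -1],
    "T": [-1, 0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2, 1, -2, -2],
    "Y": [-2, -2, -2, -3, -1, 3, -3, 2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}

def pssm_score(pssm, sequence):
    """Sum the PSSM entries for each residue at its position (no gaps)."""
    length = len(next(iter(pssm.values())))
    if len(sequence) != length:
        raise ValueError("sequence length must equal the PSSM length")
    return sum(pssm[residue][pos] for pos, residue in enumerate(sequence))

print(pssm_score(PSSM, "NMFWAFGH"))  # 12
print(pssm_score(PSSM, "KRPGHFLA"))  # 6
```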
How PSI-BLAST makes PSSMs
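Position-specific iterated BLAST
[Figure: the PSI-BLAST cycle, linking the query, the PSSM, the BLAST search of the sequence database, and the multiple alignment. A question mark marks the step that is explained next.]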
Creating a PSSM from 1 sequence
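[Figure: the query RNRGQFGH (length L = 8) and the 20 by 20 BLOSUM62 matrix are combined into a 20 by L PSSM.]
• Column j of the PSSM is the BLOSUM62 row for the query residue at position j. For example, columns 1 and 3 are both the BLOSUM62 row for R.

|   | R | N | R | G | Q | F | G | H |
|---|---|---|---|---|---|---|---|---|
| A | -1 | -2 | -1 | 0 | -1 | -2 | 0 | -2 |
| R | 5 | 0 | 5 | -2 | 1 | -3 | -2 | 0 |
| N | 0 | 6 | 0 | 0 | 0 | -3 | 0 | 1 |
| D | -2 | 1 | -2 | -1 | 0 | -3 | -1 | -1 |
| C | -3 | -3 | -3 | -3 | -3 | -2 | -3 | -3 |
| Q | 1 | 0 | 1 | -2 | 5 | -3 | -2 | 0 |
| E | 0 | 0 | 0 | -2 | 2 | -3 | -2 | 0 |
| G | -2 | 0 | -2 | 6 | -2 | -3 | 6 | -2 |
| H | 0 | 1 | 0 | -2 | 0 | -1 | -2 | 8 |
| I | -3 | -3 | -3 | -4 | -3 | 0 | -4 | -3 |
| L | -2 | -3 | -2 | -4 | -2 | 0 | -4 | -3 |
| K | 2 | 0 | 2 | -2 | 1 | -3 | -2 | -1 |
| M | -1 | -2 | -1 | -3 | 0 | 0 | -3 | -2 |
| F | -3 | -3 | -3 | -3 | -3 | 6 | -3 | -1 |
| P | -2 | -2 | -2 | -2 | -1 | -4 | -2 | -2 |
| S | -1 | 1 | -1 | 0 | 0 | -2 | 0 | -1 |
| T | -1 | 0 | -1 | -2 | -1 | -2 | -2 | -2 |
| W | -3 | -4 | -3 | -2 | -2 | 1 | -2 | -2 |
| Y | -2 | -2 | -2 | -3 | -1 | 3 | -3 | 2 |
| V | -3 | -3 | -3 | -3 | -2 | -1 | -3 | -3 |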
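Building a PSSM from a single sequence amounts to copying one substitution-matrix row per query position. The sketch below assumes a three-letter excerpt of BLOSUM62 (R, N, G) to stay short; the function name and data layout are hypothetical.

```python
ALPHABET = "RNG"
# Excerpt of BLOSUM62 restricted to three amino acids (illustration only).
SUBSTITUTION = {
    "R": {"R": 5, "N": 0, "G": -2},
    "N": {"R": 0, "N": 6, "G": 0},
    "G": {"R": -2, "N": 0, "G": 6},
}

def pssm_from_sequence(query, substitution, alphabet):
    """Return {letter: [score at position 1, ..., score at position L]}."""
    return {letter: [substitution[residue][letter] for residue in query]
            for letter in alphabet}

pssm = pssm_from_sequence("RNRG", SUBSTITUTION, ALPHABET)
for letter in ALPHABET:
    print(letter, pssm[letter])
# R [5, 0, 5, -2]
# N [0, 6, 0, 0]
# G [-2, 0, -2, 6]
```

These values agree with the R, N and G rows in the first four columns of the PSSM for RNRGQFGH.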
Position-specific iterated BLAST
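[Figure: the PSI-BLAST cycle again (query, PSSM, BLAST search of the sequence database, multiple alignment), with a question mark on the step that is explained next.]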
Creating a PSSM from multiple sequences
• Discard columns that contain gaps in the
query.
• For each column C
– Compute relative sequence weights
– Compute PSSM entries, taking into account
• Observed residues in this column
• Sequence weights
• Substitution matrix
Alignment before discarding columns:
EEFG----SVDGLVNNA
QKYG----RLDVMINNA
RRLG----TLNVLVNNA
GGIG----PVD-LVNNA
KALG----GFNVIVNNA
ARFG----KID-LIPNA
FEPEGPEKGMWGLVNNA
AQLK----TVDVLINGA

Alignment after discarding columns:
EEFGSVDGLVNNA
QKYGRLDVMINNA
RRLGTLNVLVNNA
GGIGPVD-LVNNA
KALGGFNVIVNNA
ARFGKID-LIPNA
FEPEGMWGLVNNA
AQLKTVDVLINGA
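The first step, discarding columns that contain gaps in the query, can be sketched as below. The sketch assumes that the first row of the alignment is the query; the function name is hypothetical.

```python
def drop_query_gap_columns(alignment, gap="-"):
    """Remove every column in which the query (assumed to be row 0) has a gap."""
    query = alignment[0]
    keep = [i for i, residue in enumerate(query) if residue != gap]
    return ["".join(row[i] for i in keep) for row in alignment]

alignment = [
    "EEFG----SVDGLVNNA",
    "QKYG----RLDVMINNA",
    "RRLG----TLNVLVNNA",
    "GGIG----PVD-LVNNA",
    "KALG----GFNVIVNNA",
    "ARFG----KID-LIPNA",
    "FEPEGPEKGMWGLVNNA",
    "AQLK----TVDVLINGA",
]

trimmed = drop_query_gap_columns(alignment)
for row in trimmed:
    print(row)
```

Gaps in other rows (for example in GGIGPVD-LVNNA) are kept, because only the query decides which columns survive.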
Compute sequence weights
| Sequence | Weight |
|---|---|
| EEFGSVDGLVNNA | 1.2 |
| QKYGRLDVMINNA | 1.2 |
| RRLGTLNVLVNNA | 0.8 |
| GGIGPVDLLVNNA | 0.8 |
| KALGGFNVIVNNA | 1.1 |
| ARFGKIDTLIPNA | 0.9 |
| FEPEGMWGLVNNA | 1.1 |
| AQLKTVDVLINGA | 1.3 |
• Low weights are
assigned to redundant
sequences.
• High weights are
assigned to unique
sequences.
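The slides do not say how the weights are computed. One widely used scheme that has the stated property is position-based weighting (Henikoff and Henikoff), sketched below as an assumption rather than as the method behind the numbers on the slide. Scaling the weights to an average of 1 is also a choice made for this example.

```python
from collections import Counter

def position_based_weights(alignment):
    """Position-based sequence weights, scaled so that the average weight is 1."""
    num_seqs = len(alignment)
    weights = [0.0] * num_seqs
    for column in zip(*alignment):
        counts = Counter(column)
        distinct = len(counts)
        for i, residue in enumerate(column):
            # Each column's unit of weight is split evenly among the distinct
            # residues, then among the sequences that carry each residue.
            weights[i] += 1.0 / (distinct * counts[residue])
    total = sum(weights)
    return [w * num_seqs / total for w in weights]

# Two identical sequences share their weight; the unique one gets more.
toy = position_based_weights(["AAA", "AAA", "CCC"])
print(toy)

slide_alignment = [
    "EEFGSVDGLVNNA", "QKYGRLDVMINNA", "RRLGTLNVLVNNA", "GGIGPVDLLVNNA",
    "KALGGFNVIVNNA", "ARFGKIDTLIPNA", "FEPEGMWGLVNNA", "AQLKTVDVLINGA",
]
print([round(w, 2) for w in position_based_weights(slide_alignment)])
```

The weights printed for the slide alignment need not match the slide's values, since the lecture may use a different scheme.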
Compute PSSM entries
[Figure: the aligned sequences, their sequence weights (1.2, 1.2, 0.8, 0.8, 1.1, 0.9, 1.1, 1.3), and the BLOSUM62 matrix are combined to produce the PSSM.]
Position-specific iterated BLAST
[Figure: the complete PSI-BLAST cycle: query, PSSM, BLAST search of the sequence database, multiple alignment, and back to the PSSM.]
Summary of PSI-BLAST
• PSI-BLAST builds a model of the query sequence and its close
homologs.
• Instead of comparing a target sequence to the query, each
target is compared to the model.
• The PSI-BLAST model is called a position-specific scoring
matrix (PSSM).
• The PSSM can be constructed from a collection of targets
aligned to the query sequence.
• PSI-BLAST is more accurate than BLAST, in particular more sensitive to distant homologs.
Sample problem #1
• Given:
– a file containing a sequence of amino acids
• Return:
– the amino acid counts
./compute-counts.py seq1.txt
seq1.txt.
A 5
C 2
D 3
E 1
F 6
G 0
H 0
I 2
K 2
L 8
M 1
N 5
P 7
Q 1
R 1
S 2
T 5
V 6
W 3
Y 8
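One possible solution sketch for sample problem #1 follows. It assumes the input file holds a plain sequence, possibly spread over several lines; the function names are hypothetical, and `sys.argv[1]` is the file name given on the command line.

```python
import sys
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def count_amino_acids(sequence):
    """Return {amino acid: count} for all 20 amino acids, zeros included."""
    counts = Counter(sequence.upper())
    return {aa: counts[aa] for aa in AMINO_ACIDS}

def read_sequence(path):
    """Read a sequence file, joining lines and dropping whitespace."""
    with open(path) as handle:
        return "".join(handle.read().split())

# Run only when called as: ./compute-counts.py seq1.txt
if __name__ == "__main__" and len(sys.argv) == 2:
    for aa, count in count_amino_acids(read_sequence(sys.argv[1])).items():
        print(aa, count)
```

`sys.argv` is the list of command-line words: `sys.argv[0]` is the script name and `sys.argv[1]` is the first argument, here the sequence file.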
Sample problem #2
• Given:
– a pseudocount weight
– a file containing amino acid frequencies
– a file containing a sequence of amino acids
• Return:
– the summed amino acid counts and pseudocounts
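A possible sketch for sample problem #2 follows. It rests on two assumptions that the slides do not state: the pseudocount for an amino acid is the weight times its background frequency, and the frequency file has one "letter frequency" pair per line. All function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parse_frequencies(lines):
    """Parse lines such as 'A 0.05' into {amino acid: frequency} (assumed format)."""
    freqs = {}
    for line in lines:
        if line.strip():
            aa, value = line.split()
            freqs[aa] = float(value)
    return freqs

def counts_with_pseudocounts(sequence, frequencies, weight):
    """Observed count plus weight * background frequency, per amino acid."""
    counts = Counter(sequence)
    return {aa: counts[aa] + weight * frequencies.get(aa, 0.0)
            for aa in AMINO_ACIDS}

# Toy input: uniform background frequencies and a pseudocount weight of 10.
frequencies = parse_frequencies(f"{aa} 0.05" for aa in AMINO_ACIDS)
summed = counts_with_pseudocounts("AAC", frequencies, weight=10)
print(summed["A"], summed["C"], summed["W"])
```

With these toy inputs every amino acid receives a pseudocount of about 0.5, so residues absent from the sequence still end up with a non-zero value.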
Sample problem #3
• Given:
– a pseudocount weight
– a file containing amino acid frequencies
– a file containing a sequence of amino acids
• Return:
– the normalized summed amino acid counts and
pseudocounts
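A possible sketch for sample problem #3 follows, under the same assumption as before that a pseudocount is the weight times the background frequency. Normalizing here means dividing each summed value by the total so that the results add up to 1. The function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def normalized_counts(sequence, frequencies, weight):
    """Counts plus pseudocounts, divided by their total so they sum to 1."""
    counts = Counter(sequence)
    summed = {aa: counts[aa] + weight * frequencies.get(aa, 0.0)
              for aa in AMINO_ACIDS}
    total = sum(summed.values())
    return {aa: value / total for aa, value in summed.items()}

# Toy input: uniform background frequencies and a pseudocount weight of 10.
uniform = {aa: 0.05 for aa in AMINO_ACIDS}
probabilities = normalized_counts("AAC", uniform, weight=10)
print(round(probabilities["A"], 4))
print(round(sum(probabilities.values()), 4))
```

In the toy example the total is 3 observed residues plus 10 pseudocounts, so A receives (2 + 0.5) / 13.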