Tutorial - Wright State University

Report
OCCBIO 2006 Tutorial
Fundamental Concepts of
Bioinformatics
Michael L. Raymer
Computer Science, Biomedical Sciences
Wright State University
Bioinformatics Research Group
Part I – Background
Some basics of molecular biology,
and some of the fundamental
problems facing bioinformaticians
The Central Dogma of molecular biology
OCCBIO 2006 – Fundamental Bioinformatics
3
DNA structure and base pairing
 Polymer of:
• Ribose sugar
• Phosphate
• Nitrogenous base
 Four bases
• A, C, G, T
 Base pairing
• A—T
• G—C
OCCBIO 2006 – Fundamental Bioinformatics
4
DNA is an information carrying molecule
 Arranged into 23
chromosome pairs in the
nucleus of each cell
 Genes: coding
information
• < 5% of all DNA
• Instructions for protein
synthesis
• Directions on when and
where to synthesize
proteins (regulatory
regions)
OCCBIO 2006 – Fundamental Bioinformatics
5
The Genetic Code
 Redundancy/robustness
•
•
•
•
Synonymous codons
Dual strands
Diploidy
Amino acid structure (?)
OCCBIO 2006 – Fundamental Bioinformatics
6
Transcription
DNA
transcription

RNA
translation

Protein
Messenger RNA
 Carries
instructions
for a protein
outside of the
nucleus to the
ribosome
 The ribosome
is a protein
complex that
synthesizes
new proteins
OCCBIO 2006 – Fundamental Bioinformatics
8
Prokaryotic gene structure
Yeast RNA Polymerase II
Darst et al. in 1991 (Cell 66, pp 121-128)
5' UTR
3' UTR
5'
3'
Coding
Operator: regulation
Stop Codon
Promoter: RNA polymerase binding
OCCBIO 2006 – Fundamental Bioinformatics
9
Regulation of transcription
 Energy budget
 Cellular differentiation & tissue function
From W. Becker, L. Kleinsmith, and J. Hardin, The World of the Cell, Fourth Edition. Copyright © Addison Wesley Longman, Inc.
OCCBIO 2006 – Fundamental Bioinformatics
10
Bioinformatics problems
 Shotgun sequencing
 Sequence alignment & multiple alignment
• Database searches
 Phylogenetic tree induction
 Protein structure determination, modeling, and
prediction
 Ligand screening and docking
 Many, many more
OCCBIO 2006 – Fundamental Bioinformatics
11
Bioinformatics data
 DNA sequence information
• Genome projects, etc.
 mRNA expression information
• Microarrays, SAGE
 Metabolite concentrations
• Mass Spec., NMR Spec., etc.
 Protein sequence information
 Protein structure information
• X-Ray Crystallography
OCCBIO 2006 – Fundamental Bioinformatics
12
Part II – Obtaining Sequences
Sanger Sequencing
Primer Walking
Shotgun Approaches
Fragment Assembly Algorithms
Outline




PCR
Sanger Sequencing
Primer Walking
Shotgun Sequencing
• Models
• Algorithms
• Analysis
OCCBIO 2006 – Fundamental Bioinformatics
14
Polymerase chain reaction (PCR)
OCCBIO 2006 – Fundamental Bioinformatics
15
Gel electrophoresis
OCCBIO 2006 – Fundamental Bioinformatics
16
Sanger sequencing
OCCBIO 2006 – Fundamental Bioinformatics
17
Limitations to sequencing
 You must have a primer of known sequence to
initiate PCR
 Only about 1000nts can be sequenced in a
single reaction
 The sequencing process is slow, so it is
beneficial to do as much in parallel as possible
• Primer hopping
• Shotgun approach
OCCBIO 2006 – Fundamental Bioinformatics
18
Shotgun Sequencing
OCCBIO 2006 – Fundamental Bioinformatics
19
The Ideal Case
 Find maximal overlaps between fragments:
ACCGT
CGTGC
TTAC
TACCGT
--ACCGT-----CGTGC
TTAC-----TACCGT—
TTACCGTGC
Consensus
sequence
determined by vote
OCCBIO 2006 – Fundamental Bioinformatics
20
Quality Metrics
 The coverage at position i of the target or
consensus sequence is the number of fragments
that overlap that position
Target:
No coverage
 Two contigs
OCCBIO 2006 – Fundamental Bioinformatics
21
Quality Metrics
 Linkage – the degree of overlap between
fragments
Target:
Perfect coverage, poor average linkage
poor minimum linkage
OCCBIO 2006 – Fundamental Bioinformatics
22
Real World Complications
 Base call errors
 Chimeric fragments, contamination (e.g. from
the vector)
--ACCGT-----CGTGC
TTAC-----TGCCGT—
TTACCGTGC
Base Call Error
--ACC-GT-----CAGTGC
TTAC------TACC-GT—
TTACC-GTGC
Insertion Error
OCCBIO 2006 – Fundamental Bioinformatics
--ACCGT-----CGTGC
TTAC-----TAC-GT—
TTACCGTGC
Deletion Error
23
Unknown Orientation
A fragment can come
from either strand
CACGT
ACGT
ACTACG
GTACT
ACTGA
CTGA
OCCBIO 2006 – Fundamental Bioinformatics






CACGT
-ACGT
--CGTAGT
-----AGTAC
--------ACTGA
---------CTGA
24
Repeats
 Direct repeats
A
X
A
X
B
C
OCCBIO 2006 – Fundamental Bioinformatics
X
X
C
B
X
D
X
D
25
Repeats
 Direct repeats
A
X
A
X
B
D
Y
Y
OCCBIO 2006 – Fundamental Bioinformatics
C
C
X
X
D
B
Y
E
Y
E
26
Repeats
 Inverted repeats
X
X
X
X
OCCBIO 2006 – Fundamental Bioinformatics
27
Sequence Alignment Models
 Shortest common superstring
• Input: A collection, F, of strings (fragments)
• Output: A shortest possible string S such that for
every f  F, S is a superstring of f.
 Example:
• F = {ACT, CTA, AGT}
• S = ACTAGT
OCCBIO 2006 – Fundamental Bioinformatics
28
Problems with the SCS model
x
x
x




x´
Directionality of fragments must be known
No consideration of coverage
Some simple consideration of linkage
No consideration of base call errors
OCCBIO 2006 – Fundamental Bioinformatics
29
Reconstruction
 Deals with errors and unknown orientation
 Definitions
• f is an approximate substring of S at error level 
when ds(f, S)    | f |
Match = 0
• ds = substring edit distance:
 Reconstruction
Mismatch = 1
Gap = 1
• Input: A collection, F, of strings, and a tolerance
level, 
• Output: Shortest possible string, S, such that for
every f  F : minds  f , S , ds  f , S    f
OCCBIO 2006 – Fundamental Bioinformatics
30
Reconstruction Example
 Input:
 Output:
F = {ATCAT, GTCG, CGAG, TACCA}
 = 0.25
ATGAT
------CGAC
-CGAG
----TACCA
ACGATACGAC
ATCAT
GTCG
ds(CGAG, ACGATACGAC) = 1
= 0.25  4
So this output is OK for  = 0.25
OCCBIO 2006 – Fundamental Bioinformatics
31
Gaps in Reconstruction
 Reconstruction allows gaps in fragments:
AT-GA----ATCGATAGAC
OCCBIO 2006 – Fundamental Bioinformatics
ds = 1
32
Limitations of Reconstruction





Models errors and unknown orientation
Doesn’t handle repeats
Doesn’t model coverage
Only handles linkage in a very simple way
Always produces a single contig
OCCBIO 2006 – Fundamental Bioinformatics
33
Contigs
 Sometimes you just can’t put all of the
fragments together into one contiguous
sequence:
No way to tell how
much sequence is
missing between
them.
OCCBIO 2006 – Fundamental Bioinformatics
?
No way to tell the
order of these two
contigs.
34
Multicontig
 Definitions
• A layout, L, is a multiple alignment of the fragments

Columns numbered from 1 to |L |
• Endpoints of a fragment: l(f) and r(f)
• An overlap is a link is no other fragment completely
covers the overlap
Link
OCCBIO 2006 – Fundamental Bioinformatics
Not a link
35
Multicontig
 More definitions
• The size of a link is the number of overlapping
positions
ACGTATAGCATGA
GTA
CATGATCA
ACGTATAG
GATCA
A link of size 5
• The weakest link is the smallest link in the layout
• A t-contig has a weakest link of size t
• A collection, F, admits a t-contig if a t-contig can be
constructed from the fragments in F
OCCBIO 2006 – Fundamental Bioinformatics
36
Perfect Multicontig
 Input: F, and t
 Output: a minimum number of collections, Ci,
such that every Ci admits a t-contig
Let F = {GTAC, TAATG, TGTAA}
t=3
t=1
--TAATG
TGTAA--
TGTAA------TAATG--------GTAC
GTAC
OCCBIO 2006 – Fundamental Bioinformatics
37
Handling errors in Multicontig
 The image of a fragment is the portion of the
consensus sequence, S, corresponding to the
fragment in the layout
 S is an -consensus for a collection of
fragments when the edit distance from each
fragment, f, and its image is at most   | f |
TATAGCATCAT
CGTC
CATGATCA
ACGGATAG
GTCCA
ACGTATAGCATGATCA
OCCBIO 2006 – Fundamental Bioinformatics
An -consensus
for  = 0.4
38
Definition of Multicontig
 Input: A collection, F , of strings, an integer t 
0, and an error tolerance  between 0 and 1
 Output: A partition of F into the minimum
number of collections Ci such that every Ci
admits a t-contig with an -consensus
OCCBIO 2006 – Fundamental Bioinformatics
39
Example of Multicontig
 Let  = 0.4, t = 3
TATAGCATCAT
ACGTC
CATGATCAG
ACGGATAG
GTCCAG
ACGTATAGCATGATCAG
OCCBIO 2006 – Fundamental Bioinformatics
40
Algorithms
 Most of the algorithms to solve the fragment
assembly problem are based on a graph model
 A graph, G, is a collection of edges, e, and
vertices, v.
• Directed or undirected
• Weighted or unweighted
 We will discuss
representations and
other issues shortly…
OCCBIO 2006 – Fundamental Bioinformatics
A directed,
unweighted
graph
41
The Maximum Overlap Graph
 The text calls it an overlap multigraph
 Each directed edge, (u,v) is weighted with the
length of the maximal overlap between a suffix
of u and a prefix of v
TACGA
a
1
2
ACCC
CTAAAG
b
1
c
1
1
d
OCCBIO 2006 – Fundamental Bioinformatics
GACA
0-weight
edges
omitted!
42
Paths and Layouts
 The path dbc leads to the alignment:
GACA----------ACCC----------CTAAAG
TACGA
a
1
2
ACCC
CTAAAG
c
1
1
d
OCCBIO 2006 – Fundamental Bioinformatics
b
1
GACA
43
Superstrings
 Every path that covers every node is a
superstring
 Zero weight edges result in alignments like:
GACA-----------GCCC------------TTAAAG
 Higher weights produce more overlap, and thus
shorter strings
 The shortest common superstring is the highest
weight path that covers every node
OCCBIO 2006 – Fundamental Bioinformatics
44
Graph formulation of SCS
 Input: A weighted, directed graph
 Output: The highest-weight path that touches
every node of the graph
Does this problem sound familiar?
OCCBIO 2006 – Fundamental Bioinformatics
45
The Greedy Algorithm
Algorithm greedy
Sort edges in decreasing weight order
For each edge in this order
If the edge does not form a cycle
and the edge does not start or end at
the same node as another edge in the set
then
add the edge to the current set
End for
End Algorithm
Figure 4.16, page 125
OCCBIO 2006 – Fundamental Bioinformatics
46
Greedy Example
7
4
5
2
2
2
6
3
1
OCCBIO 2006 – Fundamental Bioinformatics
47
Greedy does not always find the best path
GCC
2
ATGC
0
2
TGCAT
3
OCCBIO 2006 – Fundamental Bioinformatics
48
Tools for Shotgun Sequencing
OCCBIO 2006 – Fundamental Bioinformatics
49
Common Difficulty
 Each of these problems is a method for
modeling fragment assembly
 Each of these problems is provably intractable
 How?
OCCBIO 2006 – Fundamental Bioinformatics
50
Embedding problems

Suppose I told you that I had found a clever
way to model the TSP as a shortest common
superstring problem
•
•

Paths between cities are represented as fragments
The shortest path is the shortest common
superstring of the fragments
If this is true, then there are only two
possibilities:
1. This problem is just as intractable as TSP
2. TSP is actually a tractable problem!
OCCBIO 2006 – Fundamental Bioinformatics
51
NP-Complete Problems
 There is a collection of problems that computer
scientists believe to be intractable
• TSP is one of them
 Each of them has been modeled as one or more
of the other NP-complete problems
 If you solve one, you solve them all
 A problem, p, is NP-hard if you can model one
of these NP-complete problems as an instance
of p
OCCBIO 2006 – Fundamental Bioinformatics
52
NP-Completeness
NP
Subset sum
3-SAT
OCCBIO 2006 – Fundamental Bioinformatics
TSP
P
53
P = NP?
NP
Subset sum
3-SAT
NP
P
OCCBIO 2006 – Fundamental Bioinformatics
54
Part III – Sequence Alignments
Needleman-Wunsch
Smith-Waterman
Dynamic Programming
Why align sequences?
 The draft human genome is available
 Automated gene finding is possible
 Gene: AGTACGTATCGTATAGCGTAA
• What does it do?
 One approach: Is there a similar gene in
another species?
• Align sequences with known genes
• Find the gene with the “best” match
OCCBIO 2006 – Fundamental Bioinformatics
56
Comparing two sequences
 Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
 Indels are difficult, must align sequences:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
OCCBIO 2006 – Fundamental Bioinformatics
57
Scoring a sequence alignment
 Match score:
+1
 Mismatch score:
+0
 Gap penalty:
–1
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
 Matches: 18 × (+1)
 Mismatches: 2 × 0
 Gaps: 7 × (– 1)
OCCBIO 2006 – Fundamental Bioinformatics
Score = +11
58
DNA Replication
 Prior to cell division, all the
genetic instructions must be
“copied” so that each new cell
will have a complete set
 DNA polymerase is the enzyme
that copies DNA
• Synthesizes in the 5' to 3' direction
OCCBIO 2006 – Fundamental Bioinformatics
59
Over time, genes accumulate mutations
 Environmental factors
• Radiation
• Oxidation
 Mistakes in replication or
repair
 Deletions, Duplications
 Insertions
 Inversions
 Point mutations
OCCBIO 2006 – Fundamental Bioinformatics
60
Deletions
 Codon deletion:
ACG ATA GCG TAT GTA TAG CCG…
• Effect depends on the protein, position, etc.
• Almost always deleterious
• Sometimes lethal
 Frame shift mutation:
ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…
• Almost always lethal
OCCBIO 2006 – Fundamental Bioinformatics
61
Indels
 Comparing two genes it is generally impossible
to tell if an indel is an insertion in one gene, or
a deletion in another, unless ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
OCCBIO 2006 – Fundamental Bioinformatics
62
Origination and length penalties
 We want to find alignments that are
evolutionarily likely.
 Which of the following alignments seems more
likely to you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

 We can achieve this by penalizing more for a
new gap, than for extending an existing gap
OCCBIO 2006 – Fundamental Bioinformatics
63
Scoring a sequence alignment (2)
 Match/mismatch score:
+1/+0
 Origination/length penalty:
–2/–1
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT




Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (–2)
Length: 7 × (–1)
OCCBIO 2006 – Fundamental Bioinformatics
Score = +7
64
How can we find an optimal alignment?
 Finding the alignment is computationally hard:
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG—CATCGTC--T-ATCT
 C(27,7) gap positions = ~888,000 possibilities
 It’s possible, as long as we don’t repeat our
work!
 Dynamic programming: The Needleman &
Wunsch algorithm
OCCBIO 2006 – Fundamental Bioinformatics
65
What is the optimal alignment?
 ACTCG
ACAGTAG
 Match: +1
 Mismatch: 0
 Gap: –1
OCCBIO 2006 – Fundamental Bioinformatics
66
Needleman-Wunsch: Step 1
 Each sequence along one axis
 Mismatch penalty multiples in first row/column
 0 in [1,1] (or [0,0] for the CS-minded)
A
C
A
G
T
A
G
0
-1
-2
-3
-4
-5
-6
-7
A
-1
1
OCCBIO 2006 – Fundamental Bioinformatics
C
-2
T
-3
C
-4
G
-5
67
Needleman-Wunsch: Step 2
 Vertical/Horiz. move: Score + (simple) gap penalty
 Diagonal move: Score + match/mismatch score
 Take the MAX of the three possibilities
A
C
A
G
T
A
G
0
-1
-2
-3
-4
-5
-6
-7
A
-1
1
OCCBIO 2006 – Fundamental Bioinformatics
C
-2
T
-3
C
-4
G
-5
68
Needleman-Wunsch: Step 2 (cont’d)
 Fill out the rest of the table likewise…
a
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
OCCBIO 2006 – Fundamental Bioinformatics
t
-2
0
c
-3
-1
g
-4
-2
-5
-3
69
Needleman-Wunsch: Step 2 (cont’d)
 Fill out the rest of the table likewise…
a
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
 The optimal alignment score is calculated in the
lower-right corner
OCCBIO 2006 – Fundamental Bioinformatics
70
But what is the optimal alignment
 To reconstruct the optimal alignment, we must
determine of where the MAX at each step came
from…
a
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
OCCBIO 2006 – Fundamental Bioinformatics
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
71
A path corresponds to an alignment

= GAP in top sequence

= GAP in left sequence

= ALIGN both positions
 One path from the previous table:
 Corresponding alignment (start at the end):
AC--TCG
ACAGTAG
OCCBIO 2006 – Fundamental Bioinformatics
Score = +2
72
Practice Problem
 Find an optimal alignment for these two
sequences:
GCGGTT
GCGT
 Match: +1
 Mismatch: 0
 Gap: –1 g
c
g
t
g
0
-1
-2
-3
-4
OCCBIO 2006 – Fundamental Bioinformatics
c
-1
g
-2
g
-3
t
-4
t
-5
-6
73
Practice Problem
 Find an optimal alignment for these two
sequences:
GCGGTT
GCGT
g
c
g
g
t
t
g
c
g
t
0
-1
-2
-3
-4
-1
1
0
-1
-2
-2
0
2
1
0
-3
-1
1
3
2
GCGGTT
GCG-TOCCBIO 2006 – Fundamental Bioinformatics
-4
-2
0
2
3
-5
-3
-1
1
3
-6
-4
-2
0
2
Score = +2
74
Semi-global alignment
 Suppose we are aligning:
GCG
GGCG
 Which do you prefer?
G-CG
-GCG
GGCG
GGCG
g
g
g
c
g
0
-1
-2
-3
-4
c
-1
1
0
-1
-2
g
-2
0
1
1
0
-3
-1
1
1
2
 Semi-global alignment allows gaps at the ends
for free.
OCCBIO 2006 – Fundamental Bioinformatics
75
Semi-global alignment
 Semi-global alignment allows gaps at the ends
for free.
g
g
g
c
g
0
0
0
0
0
c
0
1
1
0
1
g
0
0
1
2
1
0
1
1
1
3
 Initialize first row and column to all 0’s
 Allow free horizontal/vertical moves in last
row and column
OCCBIO 2006 – Fundamental Bioinformatics
76
Local alignment
 Global alignments – score the entire alignment
 Semi-global alignments – allow unscored gaps
at the beginning or end of either sequence
 Local alignment – find the best matching
subsequence
 CGATG
AAATGGA
 This is achieved by allowing a 4th alternative at
each position in the table: zero.
OCCBIO 2006 – Fundamental Bioinformatics
77
Local alignment
 Mismatch = –1 this time
c
a
a
a
t
g
g
a
0
-1
-2
-3
-4
-5
-6
-7
g
-1
0
0
0
0
0
0
0
a
-2
0
0
0
0
1
1
0
t
-3
0
1
1
0
0
0
2
g
-4
0
0
0
2
1
0
1
-5
0
0
0
1
3
2
1
CGATG
AAATGGA
OCCBIO 2006 – Fundamental Bioinformatics
78
Optimal Substructure in Alignments
 Consider the alignment:
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
 Is it true that the alignment in the boxed region
must be optimal?
OCCBIO 2006 – Fundamental Bioinformatics
79
A Greedy Strategy
 Consider this pair of sequences
GAGC
GAP = 1
CAGC
 Greedy Approach:
G or G or
C
 Leads to
GAGC-----CAGC
Match = +1
G
Better:
OCCBIO 2006 – Fundamental Bioinformatics
Mismatch = 2
GACG
CACG
80
Breaking apart the problem
 Suppose we are aligning:
ACTCG
ACAGTAG
 First position choices:
A
A
+1
CTCG
CAGTAG
A
-
-1
CTCG
ACAGTAG
A
-1
ACTCG
CAGTAG
OCCBIO 2006 – Fundamental Bioinformatics
81
A Recursive Approach to Alignment
 Choose the best alignment based on these three
possibilities:
align(seq1, seq2) {
if (both sequences empty) {return 0;}
if (one string empty) {
return(gapscore * num chars in nonempty seq);
else {
score1 = score(firstchar(seq1),firstchar(seq2))
+ align(tail(seq1), tail(seq2));
score2 = align(tail(seq1), seq2) + gapscore;
score3 = align(seq1, tail(seq2) + gapscore;
return(min(score1, score2, score3));
}
}
}
OCCBIO 2006 – Fundamental Bioinformatics
82
Time Complexity of RecurseAlign
 What is the recurrence equation for the time
needed by RecurseAlign?
T (n)  3T (n  1)  3
3
n
3
3
3
…
3
3
9
3
27
n
3
OCCBIO 2006 – Fundamental Bioinformatics
83
RecurseAlign repeats its work
A
C
G
T
A
T
C
G
C
G
T
A
T
A
G
A
T
G
C
T
C
T
C
G
OCCBIO 2006 – Fundamental Bioinformatics
84
Dynamic Programming
 Remember all the subproblem answers along the way:
a
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
 This is possible for any problem that exhibits
optimal substructure
OCCBIO 2006 – Fundamental Bioinformatics
85
Saving Space
 Note that we can throw away the previous rows
of the table as we fill it in:
a
This row
is based
only on
this one
a
c
a
g
t
a
g
OCCBIO 2006 – Fundamental Bioinformatics
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
86
Saving Space (2)
 Each row of the table contains the scores for
aligning a prefix of the left-hand sequence with
all prefixes of the top sequence:
a
Scores for
aligning
aca with
all
prefixes of
actcg
a
c
a
g
t
a
g
OCCBIO 2006 – Fundamental Bioinformatics
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
87
Divide and Conquer
 By using a recursive approach, we can use only
two rows of the matrix at a time:
• Choose the middle character of the top sequence, i
• Find out where i aligns to the bottom sequence

Needs two vectors of scores
• Recursively align the sequences before and after the
fixed positions
i
ACGCTATGCTCATAG
CGACGCTCATCG
OCCBIO 2006 – Fundamental Bioinformatics
88
Finding where i lines up
 Find out where i aligns to the bottom sequence

Needs two vectors of scores
i
s: ACGCTATGCTCATAG
t: CGACGCTCATCG
 Assuming i lines up with a character:
alignscore = align(ACGCTAT, prefix(t)) + score(G, char from t)
+ align(CTCATAG, suffix(t))
 Which character is best?
• Can quickly find out the score for aligning ACGCTAT with
every prefix of t.
OCCBIO 2006 – Fundamental Bioinformatics
89
Finding where i lines up
 But, i may also line up with a gap
i
s: ACGCTATGCTCATAG
t: CGACGCTCATCG
 Assuming i lines up with a gap:
alignscore = align(ACGCTAT, prefix(t)) + gapscore
+ align(CTCATAG, suffix(t))
OCCBIO 2006 – Fundamental Bioinformatics
90
Recursive Call
 Fix the best position for I
 Call align recursively for the prefixes and
suffixes:
i
s: ACGCTATGCTCATAG
t:
CGACGCTCATCG
OCCBIO 2006 – Fundamental Bioinformatics
91
Complexity
 Let len(s) = m and len(t) = n
i
 Space: 2m
s: ACGCTATGCTCATAG
 Time:
t:
CGACGCTCATCG
• Each call to build
similarity vector = m´n´
• First call + recursive call:
mn mn
m
T m, n  

T ,
2
2
2
j

m
j T ,n 

2

j

 m n  m j  m( n  j )
 2m n
OCCBIO 2006 – Fundamental Bioinformatics
92
General Gap Penalties
 Suppose we are no longer using simple gap
penalties:
• Origination = −2
• Length = −1
 Consider the last position of the alignment for
ACGTA with ACG
 We can’t determine the score for
G
or
G
unless we know the previous positions!
OCCBIO 2006 – Fundamental Bioinformatics
93
Scoring Blocks
 Now we must score a block at a time
A A C --- A TATCCG A C T AC
A C T ACC T ------ C G C --
 A block is a pair of characters, or a maximal
group of gaps paired with characters
 To score a position, we need to either start a
new block or add it to a previous block
OCCBIO 2006 – Fundamental Bioinformatics
94
The Algorithm
 Three tables
• a – scores for alignments ending in char-char blocks
• b – scores for alignments ending in gaps in the top
sequence (s)
• c – scores for alignments ending in gaps in the left
sequence (t)
 Scores no longer depend on only three
positions, because we can put any number of
gaps into the last block
OCCBIO 2006 – Fundamental Bioinformatics
95
The Recurrences
ai  1, j  1

ai, j   p i, j   maxbi  1, j  1
ci  1, j  1

ai, j  k   wk , for1  k  j
bi, j   max
ci, j  k   wk , for1  k  j
ai  k , j   wk , for1  k  i
ci, j   max
bi  k , j   wk , for1  k  i
OCCBIO 2006 – Fundamental Bioinformatics
96
The Optimal Alignment
 The optimal alignment is found by looking at
the maximal value in the lower right of all three
arrays
 The algorithm runs in O(n3) time
• Uses O(n2) space
OCCBIO 2006 – Fundamental Bioinformatics
97
Part IV – Database Searches
BLAST
Search statistics
Database Searching
 How can we find a particular short sequence in
a database of sequences (or one HUGE
sequence)?
 Problem is identical to local sequence
alignment, but on a much larger scale.
 We must also have some idea of the
significance of a database hit.
• Databases always return some kind of hit, how
much attention should be paid to the result?
OCCBIO 2006 – Fundamental Bioinformatics
99
BLAST
 BLAST – Basic Local Alignment Search Tool
 An approximation of the Needleman & Wunsch
algorithm
 Sacrifices some search sensitivity for speed
OCCBIO 2006 – Fundamental Bioinformatics
100
Scoring Matrices
 DNA
 Proteins
• Identity
• Transition/Transversion
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
A
2
-2
0
0
-2
0
0
1
-1
-1
-2
-1
-1
-4
1
1
1
-6
-3
0
• PAM
• BLOSUM
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
6
0
-1
-4
1
-1
-3
2
-2
-3
3
0
-4
0
0
-1
2
-4
-2
2
2
-4
1
1
0
2
-2
-3
1
-2
-4
-1
1
0
-4
-2
-2
4
-5
2
3
1
1
-2
-4
0
-3
-6
-1
0
0
-7
-4
-2
4
-5
-5
-3
-3
-2
-6
-5
-5
-4
-3
0
-2
-8
0
-2
4
2
-1
3
-2
-2
1
-1
-5
0
-1
-1
-5
-4
-2
4
0
1
-2
-3
0
-2
-5
-1
0
0
-7
-4
-2
5
-2
-3
-4
-2
-3
-5
-1
1
0
-7
-5
-1
6
-2
-2
0
-2
-2
0
-1
-1
-3
0
-2
5
2
-2
2
1
-2
-1
0
-5
-1
4
6
-3
4
2
-3
-3
-2
-2
-1
2
5
0
-5
-1
0
0
-3
-4
-2
6
0
-2
-2
-1
-4
-2
2
9
-5
-3
-2
0
7
-1
6
1
0
-6
-5
-1
3
1
-2
-3
-1
3
-5
-3
0
17
0
-6
10
2
4
OCCBIO 2006 – Fundamental Bioinformatics
101
The BLAST algorithm
 Break the search sequence into words
• W = 3 for proteins, W = 12 for DNA
MCGPFILGTYC
CGP
MCG, CGP, GPF, PFI, FIL,
ILG, LGT, GTY, TYC
MCG
 Include in the search all words that score above
a certain value (T) for any search word
MCG
MCT
MCN
…
CGP
MGP
CTP
…
OCCBIO 2006 – Fundamental Bioinformatics
…
This list can be
computed in linear
time
102
The Blast Algorithm (2)
 Search for the words in the database
• Word locations can be precomputed and indexed
• Searching for a short string in a long string

Regular expression matching: FSA
 HSP (High Scoring Pair) = A match between a
query word and the database
 Find a “hit”: Two non-overlapping HSP’s on a
diagonal within distance A
 Extend the hit until the score falls below a
threshold value, X
OCCBIO 2006 – Fundamental Bioinformatics
103
OCCBIO 2006 – Fundamental Bioinformatics
104
Results from a BLAST search
OCCBIO 2006 – Fundamental Bioinformatics
105
Search Significance Scores
 A search will always return some hits.
 How can we determine how “unusual” a
particular alignment score is?
• ORF’s

Assumptions
OCCBIO 2006 – Fundamental Bioinformatics
106
Assessing significance requires a distribution
Frequency
 I have an apple of diameter 5”. Is that unusual?
Diameter (cm)
OCCBIO 2006 – Fundamental Bioinformatics
107
Is a match significant?
• Scoring system
• Database
• Sequence to search for
Length
 Composition

Frequency
 Match scores for aligning my sequence with
random sequences.
 Depends on:
Match score
 How do we determine the random sequences?
OCCBIO 2006 – Fundamental Bioinformatics
108
Generating “random” sequences
 Random uniform model:
P(G) = P(A) = P(C) = P(T) = 0.25
• Doesn’t reflect nature
 Use sequences from a database
• Might have genuine homology

We want unrelated sequences
 Random shuffling of sequences
• Preserves composition
• Removes true homology
OCCBIO 2006 – Fundamental Bioinformatics
109
What distribution do we expect to see?
 The mean of n random (i.i.d.) events tends
towards a Gaussian distribution.
• Example: Throw n dice and compute the mean.
• Distribution of means:
n=2
OCCBIO 2006 – Fundamental Bioinformatics
n = 1000
110
The extreme value distribution
 This means that if we get the match scores for
our sequence with n other sequences, the mean
would follow a Gaussian distribution.
 The maximum of n (i.i.d.) random events tends
towards the extreme value distribution as n
grows large.
OCCBIO 2006 – Fundamental Bioinformatics
111
Comparing distributions
Extreme Value:
Gaussian:

f x  
1


e
x

e
e

x 

OCCBIO 2006 – Fundamental Bioinformatics

1
f x  
e
 2
  x   2
2 2
112
Determining P-values
 If we can estimate  and , then we can
determine, for a given match score x, the
probability that a random match with score x or
greater would have occurred in the database.
 For sequence matches, a scoring system and
database can be parameterized by two
parameters, K and , related to  and .
• It would be nice if we could compare hit
significance without regard to the database and
scoring system used!
OCCBIO 2006 – Fundamental Bioinformatics
113
Bit Scores
 The expected number of hits with score  S is:
E = Kmn e s
• Where m and n are the sequence lengths
 Normalize the raw score using:
S 
S  ln K
ln 2
 Obtains a “bit score” S’, with a standard set of
units.
S
 The new E-value is: E  mn 2
OCCBIO 2006 – Fundamental Bioinformatics
114
P values and E values
 Blast reports E-values
 E = 5, E = 10 versus P = 0.993 and P = 0.99995
 When E < 0.01 P-values and E-values are
nearly identical
OCCBIO 2006 – Fundamental Bioinformatics
115
BLAST parameters
 Lowering the neighborhood word threshold (T)
allows more distantly related sequences to be
found, at the expense of increased noise in the
results set.
 Raising the segment extension cutoff (X)
returns longer extensions for each hit.
 Changing the minimum E-value changes the
threshold for reporting a hit.
OCCBIO 2006 – Fundamental Bioinformatics
116
Part V – Phylogenies
Preliminaries
Distance-based methods
Parsimony Methods
Phylogenetic Trees
 Hypothesis about the relationship between
organisms
 Can be rooted or unrooted
A
B
B
C
D
E
Time
A
E
C
D
Root
OCCBIO 2006 – Fundamental Bioinformatics
118
Tree proliferation

2n  3!
N R  n2
2 n  2!
NU

2n  5!
 n 3
2 n  3!
Species Number of Rooted Trees
Number of Unrooted Trees
2
1
1
3
3
1
4
15
3
5
105
15
6
34,459,425
2,027,025
7
213,458,046,767,875
7,905,853,580,625
8
8,200,794,532,637,891,559,375
221,643,095,476,699,771,875
OCCBIO 2006 – Fundamental Bioinformatics
119
Molecular phylogenetics
 Specific genomic
sequence variations
(alleles) are much
more reliable than
phenotypic
characteristics
 More than one gene
should be considered
OCCBIO 2006 – Fundamental Bioinformatics
120
An ongoing didactic
 Pheneticists tend to prefer distance based
metrics, as they emphasize relationships among
data sets, rather than the paths they have taken
to arrive at their current states.
 Cladists are generally more interested in
evolutionary pathways, and tend to prefer more
evolutionarily based approaches such as
maximum parsimony.
OCCBIO 2006 – Fundamental Bioinformatics
121
Distance matrix methods
Species
B
A
9
B
–
C
–
D
–
C
8
11
–
–
D
12
15
10
–
E
15
18
13
5
OCCBIO 2006 – Fundamental Bioinformatics
122
UPGMA
 Similar to average-link clustering
 Merge the closest two groups
• Replace the distances for the new, merged group
with the average of the distance for the previous
two groups
 Repeat until all species are joined
OCCBIO 2006 – Fundamental Bioinformatics
123
UPGMA Step 1
Species
B
A
9
B
–
C
–
D
–
C
8
11
–
–
D
12
15
10
–
E
15
18
13
5
Merge D & E
D
Species
B
A
9
B
–
C
–
C
8
11
–
DE
OCCBIO 2006 – Fundamental Bioinformatics
E
13.5 16.5 11.5
124
UPGMA Step 2
Species
B
A
9
B
–
C
–
C
8
11
–
DE
Merge A & C
A
C
D
E
13.5 16.5 11.5
Species
AC
DE
OCCBIO 2006 – Fundamental Bioinformatics
B
10
AC
–
16.5 12.5
125
UPGMA Steps 3 & 4
Species
AC
DE
B
10
AC
–
Merge B & AC
A
C
B
D
E
16.5 12.5
Merge ABC & DE
A
C
B
D
E
(((A,C)B)(D,E))
OCCBIO 2006 – Fundamental Bioinformatics
126
Parsimony approaches
 Belong to the broader class of character based
methods of phylogenetics
 Emphasize simpler, and thus more likely
evolutionary pathways
I: GCGGACG
II: GTGGACG
A
(C or T)
C
I
(C or T)
T
II
OCCBIO 2006 – Fundamental Bioinformatics
C
I
T
II
127
Informative and uninformative sites
Position
Seq 1
2
3
4
5
6
1
G
G
G
G
G
G
2
G
G
G
A
G
T
3
G
G
A
T
A
G
4
G
A
T
C
A
T
For positions 5 & 6, it is
possible to select more
parsimonious trees –
those that invoke less
substitutions.
OCCBIO 2006 – Fundamental Bioinformatics
128
Parsimony methods
 Enumerate all possible trees
 Note the number of substitutions events
invoked by each possible tree
• Can be weighted by transition/transversion
probabilities, etc.
 Select the most parsimonious
OCCBIO 2006 – Fundamental Bioinformatics
129
Branch and Bound methods
 Key problem – number of possible trees grows
enormous as the number of species gets large
 Branch and bound – a technique that allows
large numbers of candidate trees to be rapidly
disregarded
 Requires a “good guess” at the cost of the best
tree
OCCBIO 2006 – Fundamental Bioinformatics
130
Branch and Bound for TSP
 Find a minimum cost
round-trip path that
visits each intermediate
city exactly once
 NP-complete
 Greedy approach:
A,G,E,F,B,D,C,A
= 251
OCCBIO 2006 – Fundamental Bioinformatics
A
93
20
D
82
B
59
31
57
12
G
17
C
46
82
35
E
68
F
15
131
Search all possible paths
A
93
20
All paths
D
82
B
59
31
57
12
G
17
C
46
82
35
AG (20)
E
68
F
AB (46)
AC (93)
15
AGF (88)
AGFB
AGFE
Best estimate: 251
OCCBIO 2006 – Fundamental Bioinformatics
AGE (55)
AGFC
ACB (175)
ACD
ACF
ACBE (257)

132
Parsimony – Branch and Bound
 Use the UPGMA tree for an initial best estimate
of the minimum cost (most parsimonious) tree
 Use branch and bound to explore all feasible
trees
 Replace the best estimate as better trees are
found
 Choose the most parsimonious
OCCBIO 2006 – Fundamental Bioinformatics
133
Parsimony example
Position
Seq 1
2
3
4
5
6
1
G
G
G
G
G
G
2
G
G
G
A
G
T
3
G
G
A
T
A
G
4
G
A
T
C
A
T
Position 5:
All trees
(1,2) [0]
(1,3) [1]
(1,4) [1]
Etc.
OCCBIO 2006 – Fundamental Bioinformatics
134
Part VI – Aligning protein sequences
PAM matrices
BLOSUM matrices
Sequence Alignments Revisited
 Scoring nucleotide sequence alignments was
easier
• Match score
• Possibly different scores for transitions and
transversions
 For amino acids, there are many more possible
substitutions
 How do we score which substitutions are highly
penalized and which are moderately penalized?
• Physical and chemical characteristics
• Empirical methods
OCCBIO 2006 – Fundamental Bioinformatics
136
Scoring Mismatches
 Physical and chemical characteristics
• V  I – Both small, both hydrophobic,
conservative substitution, small penalty
• V  K – Small  large, hydrophobic  charged,
large penalty
• Requires some expert knowledge and judgement
 Empirical methods
• How often does the substitution V  I occur in
proteins that are known to be related?

Scoring matrices: PAM and BLOSUM
OCCBIO 2006 – Fundamental Bioinformatics
137
PAM matrices
 PAM = “Point Accepted Mutation” interested
only in mutations that have been “accepted” by
natural selection
 Starts with a multiple sequence alignment of
very similar (>85% identity) proteins. Assumed
to be homologous
 Compute the relative mutability, mi, of each
amino acid
• e.g. mA = how many times was alanine substituted
with anything else?
OCCBIO 2006 – Fundamental Bioinformatics
138
Relative mutability
 ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
 Across all pairs of sequences, there are 28
A  X substitutions
 There are 10 ALA residues, so mA = 2.8
OCCBIO 2006 – Fundamental Bioinformatics
139
Pam Matrices, cont’d
 Construct a phylogenetic tree for the sequences
in the alignment
ACGCTAFKI
AG
GCGCTAFKI
AG
GCGCTGFKI
FG,A = 3
IL
ACGCTAFKL
AL
GCGCTLFKI
CS
ASGCTAFKL
GA
ACACTAFKL
 Calculate substitution frequences FX,X
 Substitutions may have occurred either way, so
A  G also counts as G  A.
OCCBIO 2006 – Fundamental Bioinformatics
140
Mutation Probabilities
 Mi,j represents the probability of J  I
substitution.
M ij 
m j Fij
ACGCTAFKI
 Fij
AG
GCGCTAFKI
i
AG
GCGCTGFKI
 M G, A
IL
ACGCTAFKL
AL
GCGCTLFKI
CS
ASGCTAFKL
GA
ACACTAFKL
2.7  3

= 2.025
4
OCCBIO 2006 – Fundamental Bioinformatics
141
The PAM matrix
 The entries, Ri,j are the Mi,j values divided by
the frequency of occurrence, fi, of residue i.
 fG = 10 GLY / 63 residues = 0.1587
 RG,A = log(2.025/0.1587) = log(12.760) = 1.106
 The log is taken so that we can add, rather than
multiply entries to get compound probabilities.
 Log-odds matrix
 Diagonal entries are 1– mj
OCCBIO 2006 – Fundamental Bioinformatics
142
Interpretation of PAM matrices
 PAM-1 – one substitution per 100 residues (a
PAM unit of time)
 Multiply them together to get PAM-100, etc.
 “Suppose I start with a given polypeptide
sequence M at time t, and observe the
evolutionary changes in the sequence until 1% of
all amino acid residues have undergone
substitutions at time t+n. Let the new sequence at
time t+n be called M’. What is the probability that
a residue of type j in M will be replaced by i in
M’?”
OCCBIO 2006 – Fundamental Bioinformatics
143
PAM matrix considerations
 If Mi,j is very small, we may not have a large
enough sample to estimate the real probability.
When we multiply the PAM matrices many
times, the error is magnified.
 PAM-1 – similar sequences, PAM-1000 very
dissimilar sequences
OCCBIO 2006 – Fundamental Bioinformatics
144
BLOSUM matrix
 Starts by clustering proteins by similarity
 Avoids problems with small probabilities by
using averages over clusters
 Numbering works opposite
• BLOSUM-62 is appropriate for sequences of about
62% identity, while BLOSUM-80 is appropriate for
more similar sequences.
OCCBIO 2006 – Fundamental Bioinformatics
145
Part VII – Protein Structure
Preliminaries
Lattice Models
Protein Folding Algorithms
Illustrations from: C Branden and J Tooze, Introduction to Protein Structure, 2 nd ed. Garland Pub. ISBN 0815302703
The many functions of proteins






Mechanoenzymes: myosin, actin
Rhodopsin: allows vision
Globins: transport oxygen
Antibodies: immune system
Enzymes: pepsin, renin, carboxypeptidase A
Receptors: transmit messages through
membranes
 Vitelogenin: molecular velcro
• And hundreds of thousands more…
OCCBIO 2006 – Fundamental Bioinformatics
147
Proteins are chains of amino acids
 Polymer – a molecule composed of repeating units
OCCBIO 2006 – Fundamental Bioinformatics
148
Amino acid composition
 Basic Amino Acid
Structure:
• The side chain, R,
varies for each of
the 20 amino acids
Side chain
R
H
N C C
H
Amino
group
OCCBIO 2006 – Fundamental Bioinformatics
O
H
OH
Carboxyl
group
149
The Peptide Bond
 Dehydration synthesis
 Repeating backbone: N–C –C –N–C –C
O
O
• Convention – start at amino terminus and proceed
to carboxy terminus
OCCBIO 2006 – Fundamental Bioinformatics
150
Peptidyl polymers
 A few amino acids in a chain are called a
polypeptide. A protein is usually composed of
50 to 400+ amino acids.
 Since part of the amino acid is lost during
dehydration synthesis, we call the units of a
protein amino acid residues.
carbonyl
carbon
OCCBIO 2006 – Fundamental Bioinformatics
amide
nitrogen
151
Side chain properties
 Recall that the electronegativity of carbon is at
about the middle of the scale for light elements
• Carbon does not make hydrogen bonds with water
easily – hydrophobic
• O and N are generally more likely than C to h-bond
to water – hydrophilic
 We group the amino acids into three general
groups:
• Hydrophobic
• Charged (positive/basic & negative/acidic)
• Polar
OCCBIO 2006 – Fundamental Bioinformatics
152
The Hydrophobic Amino Acids
Proline severely
limits allowable
conformations!
OCCBIO 2006 – Fundamental Bioinformatics
153
The Charged Amino Acids
OCCBIO 2006 – Fundamental Bioinformatics
154
The Polar Amino Acids
OCCBIO 2006 – Fundamental Bioinformatics
155
More Polar Amino Acids
And then there’s…
OCCBIO 2006 – Fundamental Bioinformatics
156
Planarity of the peptide bond
Psi () – the
angle of
rotation about
the C-C bond.
Phi () – the
angle of
rotation about
the N-C bond.
The planar bond angles and bond
lengths are fixed.
OCCBIO 2006 – Fundamental Bioinformatics
157
Phi and psi
  =  = 180° is
extended
conformation
  : C to N–H
  : C=O to C
OCCBIO 2006 – Fundamental Bioinformatics
C=O
C
N–H
158
The Ramachandran Plot
Observed
(non-glycine)
Calculated
Observed
(glycine)
 G. N. Ramachandran – first calculations of
sterically allowed regions of phi and psi
 Note the structural importance of glycine
OCCBIO 2006 – Fundamental Bioinformatics
159
Primary & Secondary Structure
 Primary structure = the linear sequence of
amino acids comprising a protein:
AGVGTVPMTAYGNDIQYYGQVT…
 Secondary structure
• Regular patterns of hydrogen bonding in proteins
result in two patterns that emerge in nearly every
protein structure known: the -helix and the
-sheet
• The location of direction of these periodic,
repeating structures is known as the secondary
structure of the protein
OCCBIO 2006 – Fundamental Bioinformatics
160
The alpha helix

 60°
OCCBIO 2006 – Fundamental Bioinformatics
161
Properties of the alpha helix
     60°
 Hydrogen bonds
between C=O of
residue n, and
NH of residue
n+4
 3.6 residues/turn
 1.5 Å/residue rise
 100°/residue turn
OCCBIO 2006 – Fundamental Bioinformatics
162
Properties of -helices
 4 – 40+ residues in length
 Often amphipathic or “dual-natured”
• Half hydrophobic and half hydrophilic
• Mostly when surface-exposed
 If we examine many -helices,
we find trends…
• Helix formers: Ala, Glu, Leu,
Met
• Helix breakers: Pro, Gly, Tyr,
Ser
OCCBIO 2006 – Fundamental Bioinformatics
163
The beta strand (& sheet)
   135°
  +135°
OCCBIO 2006 – Fundamental Bioinformatics
164
Properties of beta sheets
 Formed of stretches of 5-10 residues in
extended conformation
 Pleated – each C a bit
above or below the previous
 Parallel/aniparallel,
contiguous/non-contiguous
OCCBIO 2006 – Fundamental Bioinformatics
165
Parallel and anti-parallel -sheets
 Anti-parallel is slightly energetically favored
Anti-parallel
OCCBIO 2006 – Fundamental Bioinformatics
Parallel
166
Turns and Loops
 Secondary structure elements are connected by
regions of turns and loops
 Turns – short regions
of non-, non-
conformation
 Loops – larger stretches with no secondary
structure. Often disordered.
• “Random coil”
• Sequences vary much more than secondary
structure regions
OCCBIO 2006 – Fundamental Bioinformatics
167
Levels of Protein
Structure
 Secondary structure
elements combine to
form tertiary structure
 Quaternary structure
occurs in multienzyme
complexes
• Many proteins are
active only as
homodimers,
homotetramers, etc.
Disulfide Bonds
 Two cyteines in
close proximity
will form a
covalent bond
 Disulfide bond,
disulfide bridge,
or dicysteine
bond.
 Significantly
stabilizes tertiary
structure.
OCCBIO 2006 – Fundamental Bioinformatics
169
Protein Structure Examples
OCCBIO 2006 – Fundamental Bioinformatics
170
Determining Protein Structure
 There are O(100,000) distinct proteins in the
human proteome.
 3D structures have been determined for 14,000
proteins, from all organisms
• Includes duplicates with different ligands bound,
etc.
 Coordinates are determined by X-ray
crystallography
OCCBIO 2006 – Fundamental Bioinformatics
171
X-Ray Crystallography
~0.5mm
• The crystal is a mosaic of millions of copies
of the protein.
• As much as 70% is solvent (water)!
• May take months (and a “green” thumb) to
grow.
OCCBIO 2006 – Fundamental Bioinformatics
172
X-Ray diffraction
 Image is averaged
over:
• Space (many copies)
• Time (of the diffraction
experiment)
OCCBIO 2006 – Fundamental Bioinformatics
173
Electron Density Maps
 Resolution is
dependent on the
quality/regularity
of the crystal
 R-factor is a
measure of
“leftover” electron
density
 Solvent fitting
 Refinement
OCCBIO 2006 – Fundamental Bioinformatics
174
The Protein Data Bank
 http://www.rcsb.org/pdb/
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
ATOM
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
N
CA
C
O
CB
N
CA
C
O
N
CA
C
O
CB
CG1
CG2
ALA
ALA
ALA
ALA
ALA
GLY
GLY
GLY
GLY
VAL
VAL
VAL
VAL
VAL
VAL
VAL
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
1
1
1
1
1
2
2
2
2
3
3
3
3
3
3
3
OCCBIO 2006 – Fundamental Bioinformatics
22.382
22.957
23.572
23.948
23.932
23.656
24.216
25.653
26.258
26.213
27.594
28.569
28.429
27.834
29.259
26.811
47.782
47.648
46.251
45.688
48.787
45.723
44.393
44.308
45.296
43.110
42.879
43.613
43.444
41.363
41.013
40.649
112.975
111.613
111.545
112.603
111.380
110.336
110.087
110.579
110.994
110.521
110.975
110.055
108.822
110.979
111.404
111.850
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
24.09
22.40
21.32
21.54
22.79
19.17
17.35
16.49
15.35
16.21
16.02
15.69
16.43
16.66
17.35
17.03
3APR
3APR
3APR
3APR
3APR
3APR
3APR
3APR
3APR
3APR
3APR
3APR
3APR
3APR
3APR
3APR
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
175
Views of a protein
Wireframe
OCCBIO 2006 – Fundamental Bioinformatics
Ball and stick
176
Views of a protein
Spacefill
Cartoon
CPK colors
Carbon =
green, black,
or grey
Nitrogen =
blue
Oxygen = red
Sulfur =
yellow
Hydrogen =
white
OCCBIO 2006 – Fundamental Bioinformatics
177
The Protein Folding Problem
 Central question of molecular biology:
“Given a particular sequence of amino acid
residues (primary structure), what will the
tertiary/quaternary structure of the resulting
protein be?”
 Input: AAVIKYGCAL…
Output: 11, 22…
= backbone conformation:
(no side chains yet)
OCCBIO 2006 – Fundamental Bioinformatics
178
Forces driving protein folding
 It is believed that hydrophobic collapse is a key
driving force for protein folding
• Hydrophobic core
• Polar surface interacting with solvent




Minimum volume (no cavities)
Disulfide bond formation stabilizes
Hydrogen bonds
Polar and electrostatic interactions
OCCBIO 2006 – Fundamental Bioinformatics
179
Folding help
 Proteins are, in fact, only marginally stable
• Native state is typically only 5 to 10 kcal/mole more
stable than the unfolded form
 Many proteins help in folding
• Protein disulfide isomerase – catalyzes shuffling of
disulfide bonds
• Chaperones – break up aggregates and (in theory)
unfold misfolded proteins
OCCBIO 2006 – Fundamental Bioinformatics
180
The Hydrophobic Core
 Hemoglobin A is the protein in red blood cells
(erythrocytes) responsible for binding oxygen.
 The mutation E6V in the  chain places a
hydrophobic Val on the surface of hemoglobin
 The resulting “sticky patch” causes hemoglobin
S to agglutinate (stick together) and form fibers
which deform the red blood cell and do not
carry oxygen efficiently
 Sickle cell anemia was the first identified
molecular disease
OCCBIO 2006 – Fundamental Bioinformatics
181
Sickle Cell Anemia
Sequestering hydrophobic residues in
the protein core protects proteins from
hydrophobic agglutination.
OCCBIO 2006 – Fundamental Bioinformatics
182
Computational Problems in Protein Folding
 Two key questions:
• Evaluation – how can we tell a correctly-folded
protein from an incorrectly folded protein?
H-bonds, electrostatics, hydrophobic effect, etc.
 Derive a function, see how well it does on “real” proteins

• Optimization – once we get an evaluation function,
can we optimize it?
Simulated annealing/monte carlo
 EC
 Heuristics
 We’ll talk more about these methods later…

OCCBIO 2006 – Fundamental Bioinformatics
183
Fold Optimization
 Simple lattice models (HPmodels)
• Two types of residues:
hydrophobic and polar
• 2-D or 3-D lattice
• The only force is hydrophobic
collapse
• Score = number of HH
contacts
OCCBIO 2006 – Fundamental Bioinformatics
184
Scoring Lattice Models
 H/P model scoring: count noncovalent
hydrophobic interactions.
 Sometimes:
• Penalize for buried polar or surface hydrophobic
residues
OCCBIO 2006 – Fundamental Bioinformatics
185
What can we do with lattice models?
 For smaller polypeptides, exhaustive search can
be used
• Looking at the “best” fold, even in such a simple
model, can teach us interesting things about the
protein folding process
 For larger chains, other optimization and search
methods must be used
• Greedy, branch and bound
• Evolutionary computing, simulated annealing
• Graph theoretical methods
OCCBIO 2006 – Fundamental Bioinformatics
186
Learning from Lattice Models
 The “hydrophobic zipper” effect:
Ken Dill ~ 1997
OCCBIO 2006 – Fundamental Bioinformatics
187
Representing a lattice model
 Absolute directions
• UURRDLDRRU
 Relative directions
• LFRFRRLLFFL
• Advantage, we can’t have UD or RL in absolute
• Only three directions: LRF
 What about bumps? LFRRR
• Bad score
• Use a better representation
OCCBIO 2006 – Fundamental Bioinformatics
188
Preference-order representation
 Each position has two “preferences”
• If it can’t have either of the two, it will take the
“least favorite” path if possible
 Example: {LR},{FL},{RL},
{FR},{RL},{RL},{FR},{RF}
 Can still cause bumps:
{LF},{FR},{RL},{FL},
{RL},{FL},{RF},{RL},
{FL}
OCCBIO 2006 – Fundamental Bioinformatics
189
More realistic models
 Higher resolution lattices (45° lattice, etc.)
 Off-lattice models
• Local moves
• Optimization/search methods and /
representations
Greedy search
 Branch and bound
 EC, Monte Carlo, simulated annealing, etc.

OCCBIO 2006 – Fundamental Bioinformatics
190
The Other Half of the Picture
 Now that we have a more realistic off-lattice
model, we need a better energy function to
evaluate a conformation (fold).
 Theoretical force field:
• G = Gvan der Waals + Gh-bonds + Gsolvent +
Gcoulomb
 Empirical force fields
• Start with a database
• Look at neighboring residues – similar to known
protein folds?
OCCBIO 2006 – Fundamental Bioinformatics
191
Threading: Fold recognition
 Given:
• Sequence:
IVACIVSTEYDVMKAAR…
• A database of molecular
coordinates
 Map the sequence onto
each fold
 Evaluate
• Objective 1: improve
scoring function
• Objective 2: folding
OCCBIO 2006 – Fundamental Bioinformatics
192
Secondary Structure Prediction
AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…
AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…
OCCBIO 2006 – Fundamental Bioinformatics
193
Secondary Structure Prediction

Easier than folding
•

Current algorithms can prediction secondary
structure with 70-80% accuracy
Chou, P.Y. & Fasman, G.D. (1974).
Biochemistry, 13, 211-222.
•

Based on frequencies of occurrence of residues in
helices and sheets
PhD – Neural network based
•
•
Uses a multiple sequence alignment
Rost & Sander, Proteins, 1994 , 19, 55-72
OCCBIO 2006 – Fundamental Bioinformatics
194
Chou-Fasman Parameters
Name
Alanine
Arginine
Aspartic Acid
Asparagine
Cysteine
Glutamic Acid
Glutamine
Glycine
Histidine
Isoleucine
Leucine
Lysine
Methionine
Phenylalanine
Proline
Serine
Threonine
Tryptophan
Tyrosine
Valine
Abbrv
A
R
D
N
C
E
Q
G
H
I
L
K
M
F
P
S
T
W
Y
V
P(a)
142
98
101
67
70
151
111
57
100
108
121
114
145
113
57
77
83
108
69
106
OCCBIO 2006 – Fundamental Bioinformatics
P(b) P(turn)
83
66
93
95
54
146
89
156
119
119
37
74
110
98
75
156
87
95
160
47
130
59
74
101
105
60
138
60
55
152
75
143
119
96
137
96
147
114
170
50
f(i)
0.06
0.07
0.147
0.161
0.149
0.056
0.074
0.102
0.14
0.043
0.061
0.055
0.068
0.059
0.102
0.12
0.086
0.077
0.082
0.062
f(i+1)
0.076
0.106
0.11
0.083
0.05
0.06
0.098
0.085
0.047
0.034
0.025
0.115
0.082
0.041
0.301
0.139
0.108
0.013
0.065
0.048
f(i+2)
0.035
0.099
0.179
0.191
0.117
0.077
0.037
0.19
0.093
0.013
0.036
0.072
0.014
0.065
0.034
0.125
0.065
0.064
0.114
0.028
f(i+3)
0.058
0.085
0.081
0.091
0.128
0.064
0.098
0.152
0.054
0.056
0.07
0.095
0.055
0.065
0.068
0.106
0.079
0.167
0.125
0.053
195
Chou-Fasman Algorithm
 Identify -helices
• 4 out of 6 contiguous amino acids that have P(a) >
100
• Extend the region until 4 amino acids with P(a) <
100 found
• Compute P(a) and P(b); If the region is >5
residues and P(a) > P(b) identify as a helix
 Repeat for -sheets [use P(b)]
 If an  and a  region overlap, the overlapping
region is predicted according to P(a) and
P(b)
OCCBIO 2006 – Fundamental Bioinformatics
196
Chou-Fasman, cont’d
 Identify hairpin turns:
• P(t) = f(i) of the residue  f(i+1) of the next residue
 f(i+2) of the following residue  f(i+3) of the
residue at position (i+3)
• Predict a hairpin turn starting at positions where:
P(t) > 0.000075
 The average P(turn) for the four residues > 100
 P(a) < P(turn) > P(b) for the four residues

 Accuracy  60-65%
OCCBIO 2006 – Fundamental Bioinformatics
197
Chou-Fasman Example
 CAENKLDHVRGPTCILFMTWYNDGP
 CAENKL – Potential helix (!C and !N)

Residues with P(a) < 100: RNCGPSTY
• Extend: When we reach RGPT, we must stop
• CAENKLDHV: P(a) = 972, P(b) = 843
• Declare alpha helix
 Identifying a hairpin turn
• VRGP: P(t) = 0.000085
• Average P(turn) = 113.25

Avg P(a) = 79.5, Avg P(b) = 98.25
OCCBIO 2006 – Fundamental Bioinformatics
198
Other topics?
 Tools and languages
 Forensic DNA
 Microarray analysis
OCCBIO 2006 – Fundamental Bioinformatics
199

similar documents