Module 5

Structure-based Evidence for Function
(TIGRfam, Pfam and PDB)
TIGRfams are protein families
categorized by functional role
Concept: HMMs
• HMM: A Hidden Markov Model is a probabilistic model built from
observed sequences of proteins of known function. The profile HMM is used
to score the alignment of the query amino acid sequence against other proteins
based on amino acid identity and position.
A concrete example of an HMM:
Consider two friends, Alice and Bob, who live far apart from each other and who talk together daily over
the telephone about what they did that day. Bob is only interested in three activities: walking in the park,
shopping, and cleaning his apartment. The choice of what to do is determined exclusively by the weather
on a given day. Alice has no definite information about the weather where Bob lives, but she knows
general trends. Based on what Bob tells her he did each day, Alice tries to guess what the weather must
have been like.
Alice believes that the weather operates as a discrete Markov chain (system in various states that can
change randomly). There are two states, "Rainy" and "Sunny", but she cannot observe them directly, that
is, they are hidden from her. On each day, there is a certain chance that Bob will perform one of the
following activities, depending on the weather: "walk", "shop", or "clean". Since Bob tells Alice about his
activities, those are the observations. The entire system is that of a hidden Markov model (HMM).
Alice knows the general weather trends in the area, and what Bob likes to do on average. In other words,
the parameters of the HMM are known.
http://en.wikipedia.org/wiki/Hidden_Markov_model
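The Alice and Bob example can be sketched in code. The probabilities below are illustrative assumptions (the slide gives no numbers), and `viterbi` is a hypothetical helper name. Given the activities Bob reports, it returns the most likely sequence of hidden weather states.

```python
# Illustrative parameters only: the slide does not give numeric values.
STATES = ("Rainy", "Sunny")
START = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def viterbi(observations):
    """Return the most probable hidden-state path for a non-empty observation list."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Pick the predecessor state that gives the most probable path into s.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


print(viterbi(["walk", "shop", "clean"]))
```

A profile HMM applies the same idea to proteins: roughly, positions in the family alignment play the role of the hidden states and amino acids play the role of the observations.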
TIGRfams:
Haft et al. (2001)
Nucleic Acids Research 29: 41-43.
Search TIGRFAM database
Change Database to “TIGRFAMS”
Change Scope to GLOBAL
Change E-value cutoff to “0.01”
Enter protein sequence in FASTA
format in the box
• Click on “Start HMM search”
• Then wait…
RESULTS:
Only hits with
positive Score &
E-value ≤ 10⁻³
should be recorded
Score and E-value
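The recording rule (positive Score and an E-value at or below 10⁻³) can be written as a small check. This is a minimal sketch: `is_recordable` is a hypothetical helper, the hit labels and numbers are made up for illustration, and treating the cutoff as inclusive is an assumption.

```python
E_VALUE_CUTOFF = 1e-3  # threshold stated on the slide


def is_recordable(score, e_value):
    """True when a hit has a positive score and an E-value at or below the cutoff."""
    return score > 0 and e_value <= E_VALUE_CUTOFF


# Hypothetical hits as (label, score, E-value); these are not real search results.
hits = [
    ("hit_A", 152.3, 2.1e-45),  # positive score, tiny E-value: record
    ("hit_B", 12.7, 4.0e-3),    # E-value above the cutoff: skip
    ("hit_C", -8.4, 5.0e-4),    # negative score: skip
]
recordable = [label for label, score, e_value in hits if is_recordable(score, e_value)]
print(recordable)
```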
• Enter the TIGRfam number (format: TIGRXXXXX) from the ‘Model’ column into the
imgACT lab notebook, in the box for the significant TIGRfam hit
• Enter the TIGRfam name from the ‘Description’ column into the notebook
• NOTE: If the full name is cut off in the ‘Description’ column, go to
http://cmr.jcvi.org/cgi-bin/CMR/shared/MakeFrontPages.cgi?page=text_search&crumbs=searches
• Enter the Score and E-value into the notebook as well
“Click”
To obtain full TIGRfam name:
Then what?
Full name
Complete description
TIGRfam Results in imgACT Notebook
Terms to Know for Pfam
•
Domain: A structural unit which can be found in
multiple protein contexts.
e.g., zinc finger, leucine zipper
•
Family: A collection of related proteins containing
the same domain.
e.g., immunoglobulins, CD4, MHC, TCR, etc.
•
Clan: A collection of multiple protein families. The
relationship may be defined by similarity of sequence,
structure, or profile-HMM.
e.g., ATPase functioning in ETC
vs.
ATPase functioning in DNA replication.
provided in your
notebook.
You know the Drill!
amino acid sequence
Change E-value to 0.001
“Click”
WAIT…this can sometimes take a while
RESULTS!
Graphic view of
domain organization
Notice there may be two types of results
Significant and insignificant matches.
Only investigate significant matches.
NOTE: An insignificant match may still have
an acceptable E-value, but Pfam flags the
result as insignificant because the
alignment is very short.
If you do not have any significant matches, note this in your notebook by creating
a COMMENTS section and entering “No significant hits”.
Be sure your search criteria were accurate (e.g., E-value of 0.001).
Investigate SIGNIFICANT matches
Click on [Show] to view the “pairwise
alignment” for the Pfam match
Copy/paste this pair-wise alignment into
How do I interpret the alignment?
Top row (#HMM): all capital letters indicate
conserved residues in the HMM consensus sequence.
Middle row (#MATCH): identical or functionally
conserved (similar) amino acids
Bottom row (#SEQ): query sequence aligned to
HMM representing the domain/family
Legend for #MATCH
• Upper case = identical match
(conserved and high frequency)
• Lower case = identical match
(conserved but low frequency)
• + symbol = functionally similar
(e.g., aspartic acid vs. glutamic acid)
• Space = no match
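The legend can be applied character by character to a #MATCH row. The sketch below assumes the row has been copied as a plain string; `classify_match_char` and the toy row are hypothetical, not output from a real Pfam search.

```python
from collections import Counter


def classify_match_char(ch):
    """Map one character of the #MATCH row to its meaning in the legend."""
    if ch == " ":
        return "no match"
    if ch == "+":
        return "functionally similar"
    if ch.isupper():
        return "identical, high frequency"
    if ch.islower():
        return "identical, low frequency"
    return "unrecognized"


toy_match_row = "C+ic G"  # made-up example row
summary = Counter(classify_match_char(ch) for ch in toy_match_row)
for meaning, count in summary.items():
    print(f"{meaning}: {count}")
```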
What is an HMM consensus sequence?
The HMM consensus sequence
& open in new tab
On Pfam family
summary page, click on
“Alignments”
Full: Total number of sequences in
database that have been categorized
into this Pfam family
Seed: Number of sequences within
multiple sequence alignment
representing architectural variations
within a single Pfam family
What does this mean?
Architecture Diversity
• Domain organization within context of full protein
Leave default settings and
press the [View] button
A new window will pop up as shown:
Click on [Start Jalview] button to view
the multiple sequence alignment
Another new window will pop up as shown:
TOO MANY COLORS!
Let’s make the view more manageable by simplifying the colors. . .
Select “Percentage
NOTE: Take the
time to browse
other color
schemes to learn
protein.
This view reveals the amount of conservation in your amino acid sequence.
Dark = highest frequency
Light = lower frequency
Pay special
attention to
BOTTOM graph:
Consensus
sequence for
protein family
This consensus sequence is used to construct the HMM
Letters show
which amino acids
occur most
frequently at that
position.
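To make "most frequent amino acid at each position" concrete, here is a minimal sketch that derives a consensus from a toy alignment. The sequences are invented, `consensus` is a hypothetical helper, and Jalview's own calculation may treat gaps and ties differently.

```python
from collections import Counter


def consensus(alignment):
    """For each column, return (most frequent residue, its fraction of the column)."""
    result = []
    for column in zip(*alignment):  # walk the alignment column by column
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


# Toy alignment: four made-up sequences of equal length.
toy_alignment = ["MKCA", "MKCG", "MRCA", "MKCA"]
for position, (residue, fraction) in enumerate(consensus(toy_alignment), start=1):
    print(position, residue, f"{fraction:.0%}")
```

Columns where the fraction is 100% correspond to the darkest positions and the tallest bars in the consensus graph.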
What else do I need for my notebook?
Pfam name and Pfam number
Pfam number
Abbreviated
Pfam name
Full
Pfam name
Copy/paste full &
abbreviated Pfam name
as well as Pfam number
Note: Pay Attention to possible 3D Image
• You may see a 3D image when you
• If you see this image, then this is your
first clue that you should expect to have
significant hits in the PDB search (next
section of this module).
• If you don’t see an image, then this
suggests no structure has yet been
solved for proteins containing the domain
identified by Pfam.
What else do I need for my notebook?
HMM Logo
On Summary page, click on
“HMM logo”
SAVE this image in .png
format and insert into your
notebook.
How do we interpret the HMM Logo?
HMM Logo:
-- Highly conserved amino acids are represented by wide letters
-- Amino acids with a high frequency of occurrence in the
alignment used to generate the HMM consensus sequence are
represented by tall letters
What else do I need for my notebook?
Clan name and number
Use key words from Pfam
family name for clan search
Click BROWSE to search
for clan information
Investigate possible clans
based on key word search
from Pfam family description.
clan information.
Abbreviated
Clan name
Clan number
Full
Clan name
NOTE: Not all Pfam families
belong to a clan. If no clan is
found, enter “None found” in
Tells you which Pfam families belong to this clan. If the
Pfam family to which your protein belongs is not in this
list, then your protein is NOT a member of this clan.
What else do I need for my notebook?
Key functional residues
You have THREE key tools to assist you in identifying the
KEY FUNCTIONAL RESIDUES of your protein.
Tool #1: Pairwise Alignment
Tool #2: HMM Logo
Tool #3: Jalview consensus
How do we identify key functional residues?
• Capital letter in #MATCH line
• Tall, wide letter in HMM logo
• Tall bar in graphical depiction of consensus sequence
How do we report key functional residues
in the notebook?
Formula:
AA(start+HMM#-1)
Example:
C(47+8-1) = C54
HMM#
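The notation formula can be checked with a one-line function. This sketch assumes that "start" is the position in your full protein where the aligned region begins and that "HMM#" is the residue's position counted within the alignment; the slide does not define either term, so that reading is an assumption, and `residue_label` is a hypothetical helper.

```python
def residue_label(amino_acid, start, hmm_number):
    """Apply AA(start + HMM# - 1) and return the notebook notation, e.g. 'C54'."""
    # Subtract 1 because the first aligned residue already sits at position `start`.
    return f"{amino_acid}{start + hmm_number - 1}"


# Worked example from the slide: C(47+8-1)
print(residue_label("C", 47, 8))
```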
SUMMARY: Identifying key functional residues
1. Use the HMM pair-wise alignment to
identify possible key functional residues.
2. Use the HMM Logo and Jalview alignment
tools to verify key functional residues.
3. Scan the entire amino acid sequence and
record all key functional residues using
proper notation.
Recording results in your Lab Notebook
Scroll
down
REPEAT procedure for all significant Pfam hits
3 hits = 3 notebook entries
PDB
Protein Data Bank
o Worldwide depository for three-dimensional
structures of large biological molecules, including
proteins and nucleic acids
o Contains information about structure such as. . .
• sequence details
• atomic coordinates
• crystallization conditions
• 3-D structure neighbors
• derived geometric data
• structure factors
• 3-D images
Berman et al. (2003)
Nature Structural Biology 10: 980.
provided in your
notebook.
Select “Sequence (Blast/Fasta)” option
Change E-value
cut off to 0.001
protein sequence into query box
to initiate search
Results of PDB Search
Search hits listed by ascending E-value
Scroll
down
Evaluating PDB Results
Assess quality of the alignment: Is the E-value less than 10⁻³?
Is a significant proportion of the protein aligned?
If so, good hit.
(Hint: compare alignment length to total length)
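The two questions above can be combined into a quick check. This is a sketch under stated assumptions: the slide does not say what counts as a "significant proportion", so the `min_coverage` default of 0.5 is a placeholder to adjust, and both function names are hypothetical.

```python
def alignment_coverage(alignment_length, total_length):
    """Fraction of the full protein covered by the alignment."""
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    return alignment_length / total_length


def is_good_hit(e_value, alignment_length, total_length, min_coverage=0.5):
    """E-value below 10^-3 and coverage at or above an assumed threshold."""
    coverage = alignment_coverage(alignment_length, total_length)
    return e_value < 1e-3 and coverage >= min_coverage


# Made-up numbers: 240 of 300 residues aligned, E-value 1e-20.
print(alignment_coverage(240, 300))
print(is_good_hit(1e-20, 240, 300))
```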
PDB CODE
Thumbnail of 3D structure.
Click on it to get a high-resolution image for notebook.
PDB NAME
Citation
Alignment
and statistics
Recording results in your Lab Notebook