Automated Function Prediction
How function is defined
Why Gene Ontology
Methods for protein function prediction
End points
• A) You find a new protein
• B) You sequence the whole genome of your
favorite organism
• Obtained gene(s) should be annotated
• A can be solved manually. B needs automatic methods
How function is defined
Functional description as text
Linking gene to Key Words (UniProt)
Linking gene to Gene Ontology
Linking gene to Signalling Pathways or
Biochemical Pathways (KEGG)
Why Gene Ontology (GO)
• GO is currently a popular standard in
gene annotation
• GO categories represent gene functions
• Creates a union of genes in the same process
• Easy summary for genes with similar function
Why Gene Ontology (GO)
• 3 sub-parts: Biological Process, Molecular
Function, Cellular Component (localization)
– Molecular Function => chemical activity
– Biological Process => biology, cellular process
– Cellular Component => location of the gene product
• Hierarchical structure
– Categories with very precise function
– Categories with less precise function
– Categories with very broad function
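Because of the hierarchy, a gene annotated to a precise category also belongs, implicitly, to every broader category above it. A minimal sketch of that propagation, assuming a toy parent map; the category names are illustrative placeholders, not real GO terms or identifiers:

```python
# Toy hierarchy: child category -> list of broader parent categories.
# Names are placeholders for illustration, not real GO terms.
PARENTS = {
    "very_precise": ["less_precise"],
    "less_precise": ["very_broad"],
    "very_broad": [],
}


def with_ancestors(category, parents=PARENTS):
    """Return the category together with every broader category above it."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a category can be reachable through several parents
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


implied = with_ancestors("very_precise")
```

Here `implied` holds the precise category plus its two broader ancestors, which is why a summary at a broad level can still include genes annotated only at a precise level.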
How GO helps
• End user: Summary categories for genes with
various functions
• Computer programs: Classifier algorithms can
be taught to predict the categories for genes
Understanding GO
• AmiGO server (http://amigo.geneontology.org/cgi-bin/amigo/go.cgi)
Function Prediction: What can we use
to predict function
Sequence homology (BLAST result list)
Phylogenetic tree of sequences
Protein Domains (PFAM domains)
Short sequence patterns – motifs
Sequence features (secondary structure, low complexity, etc.)
Sequence Homology Methods
• Do a BLAST search with a query sequence
• Collect GO classes for genes in the BLAST
result list
• Give a weight to each BLAST hit
– often -log(E-value)
• Combine the scores from the genes that
belong to the same GO class
• Report the top-scoring / significant GO classes
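The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the hit list and annotation table are made-up inputs (no BLAST output is parsed), each hit is weighted by the negative base-10 logarithm of its E-value, and the class labels are placeholders rather than real GO identifiers.

```python
import math
from collections import defaultdict


def score_go_classes(hits, annotations, min_evalue=1e-180):
    """Sum hit weights per GO class.

    hits:        list of (gene_id, e_value) pairs from a search result list
    annotations: dict gene_id -> set of GO classes (hypothetical input)
    """
    scores = defaultdict(float)
    for gene_id, e_value in hits:
        # Weight = -log10(E-value); clamp so that an E-value of 0 does not fail.
        weight = -math.log10(max(e_value, min_evalue))
        if weight <= 0:
            continue  # E-value >= 1: treat the hit as uninformative
        for go_class in annotations.get(gene_id, ()):
            scores[go_class] += weight
    # Best-scoring classes first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up example data.
hits = [("geneA", 1e-50), ("geneB", 1e-20), ("geneC", 1e-5)]
annotations = {
    "geneA": {"class_X"},
    "geneB": {"class_X", "class_Y"},
    "geneC": {"class_Z"},
}
ranked = score_go_classes(hits, annotations)
```

Summing weights is the simplest way to combine scores; the programs listed on the next slide each use their own scoring scheme.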
Sequence Homology Methods
• Simple methods
• Programs
– BLAST2GO (http://www.blast2go.com/b2ghome)
– GOTCHA (http://www.compbio.dundee.ac.uk/gotcha/gotcha.php)
– ARGOT (http://www.medcomp.medicina.unipd.it/Argot2/form.php)
– PFP (http://kiharalab.org/web/pfp.php)
Phylogenetic tree methods
• Create the pairwise distances for the set of sequences
• Do a hierarchical clustering of genes
• Map the known GO functions onto the cluster tree
• Look for unknown genes in a cluster with many
genes from the same GO class
• Report the top-scoring / significant GO classes
• More => http://genome.cshlp.org/content/8/3/163.full
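A rough sketch of the clustering idea, assuming the pairwise distances are already computed. It uses naive single-linkage clustering with a distance cut-off as a stand-in for a real phylogenetic tree, and the gene names, distances and class labels are invented for illustration.

```python
from collections import Counter


def single_linkage_clusters(names, distance, threshold):
    """Merge the closest clusters until the closest pair exceeds the threshold.

    distance: dict frozenset({gene1, gene2}) -> pairwise distance
    """
    clusters = [{name} for name in names]

    def cluster_distance(a, b):
        return min(distance[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        pairs = [
            (cluster_distance(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        ]
        best, i, j = min(pairs)
        if best > threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


def predict_from_cluster(query, clusters, known_go):
    """List (GO class, fraction of annotated cluster neighbours carrying it)."""
    cluster = next(c for c in clusters if query in c)
    neighbours = [g for g in cluster if g != query and g in known_go]
    counts = Counter(go for g in neighbours for go in known_go[g])
    return [(go, n / len(neighbours)) for go, n in counts.most_common()]


# Made-up distances between one unknown gene and three annotated genes.
names = ["query", "g1", "g2", "g3"]
raw = {("query", "g1"): 0.10, ("query", "g2"): 0.20, ("g1", "g2"): 0.15,
       ("query", "g3"): 0.90, ("g1", "g3"): 0.80, ("g2", "g3"): 0.85}
distance = {frozenset(pair): d for pair, d in raw.items()}
known_go = {"g1": {"class_X"}, "g2": {"class_X", "class_Y"}, "g3": {"class_Z"}}

clusters = single_linkage_clusters(names, distance, threshold=0.3)
prediction = predict_from_cluster("query", clusters, known_go)
```

The unknown gene ends up in a cluster with g1 and g2, so the class they share is reported first. Methods based on a real tree reason about its structure rather than a flat cut-off.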
Phylogenetic tree methods
• These should outperform sequence homology
methods (CAFA 2011?)
• Require a set of related genes
• Often much heavier calculations
• Programs:
– SIFTER
Prediction with Protein domains
• Identify the protein domains present in the query
protein (Pfam)
• Map the functions that are linked to the domains to
your query sequence
• Programs: InterProScan + Pfam2GO
• Drawbacks:
– The mapping is the same in plants, mammals and bacteria
– Many domains to specific function
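The mapping step amounts to a table lookup. A minimal sketch, assuming the domains of the query have already been detected; the domain names, class labels and the mapping table are hypothetical, whereas a real table would be loaded from a domain-to-GO mapping file:

```python
# Hypothetical domain -> GO class table, for illustration only.
DOMAIN_TO_GO = {
    "domain_1": {"class_X", "class_Y"},
    "domain_2": {"class_Z"},
}


def go_from_domains(domains, mapping=DOMAIN_TO_GO):
    """Return dict GO class -> list of query domains that support it."""
    predicted = {}
    for domain in domains:
        # Domains without an entry in the table contribute nothing.
        for go_class in mapping.get(domain, ()):
            predicted.setdefault(go_class, []).append(domain)
    return predicted


# Domains found in a made-up query protein; "domain_3" has no mapping.
predicted = go_from_domains(["domain_1", "domain_2", "domain_3"])
```

Note that the lookup never consults the species of the query, which is the first drawback listed above.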
Prediction with Protein domains
• Benefits:
– Can create annotation from separate domains
– Similar sequences do not have to be in the database
• Programs (?): InterProScan
• Drawbacks:
– The mapping is the same in plants, mammals and bacteria
– Many domains to specific function
Prediction with patterns and motifs
• Same principle as before, but we look for
sequence patterns and motifs
• Map the functions that are linked to patterns
to your query sequence
• Programs:
– InterProScan
– IBM BioDictionary (http://cbcsrv.watson.ibm.com/Tpa.html)
• Drawbacks and benefits approximately the same as before
Prediction with sequence features
• Again same principle as before
• We look at sequence features (see picture)
• These are given as input to a classifier
algorithm (Support Vector Machine)
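To make "sequence features" concrete, here is a small sketch that turns a protein sequence into a fixed-length numeric vector of the kind a classifier takes as input. The chosen features (length, Shannon entropy as a crude low-complexity indicator, amino-acid composition) are assumptions for illustration; they are not the feature set of any particular program, and no classifier is trained here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(sequence):
    """Return [length, entropy, fraction of each of the 20 amino acids]."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    counts = Counter(sequence)
    length = len(sequence)
    # Low entropy suggests a low-complexity (repetitive) sequence.
    entropy = -sum((n / length) * math.log2(n / length) for n in counts.values())
    composition = [counts[aa] / length for aa in AMINO_ACIDS]
    return [float(length), entropy] + composition


features = sequence_features("ACAC")
```

Every sequence maps to a vector of the same length (22 values here), so sequences with no detectable similarity to each other can still be compared by the classifier.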
Prediction with sequence features
• Benefits:
– No actual seq. similarity needed
– Info collected from vague similarities
– Use of classifier => feature weighting
• Program: FFPred (http://bioinf.cs.ucl.ac.uk/ffpred/)
• Drawbacks:
– Calculations probably quite heavy
– No use of nearby sequence similarities (domains etc.)
Our contribution: PANNZER
• Use BLAST result list
• Add Taxonomic information
• Score GO classes using a score that takes the
frequency of the GO class in the sequence database into account
• Method is used to predict:
– GO Classes
– Description line
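Why the database frequency matters can be shown with a generic enrichment test. The sketch below is not the PANNZER score, which these slides do not specify. It assumes a hypergeometric tail probability as one common way to ask whether a GO class occurs in the hit list more often than its database frequency would suggest, and all counts are invented.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k genes of a GO class among n hits,
    when K of the N database genes carry that class."""
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    )
    return favourable / total


# Made-up counts: 4 of 5 hits carry the class, in a database of 100 genes.
p_rare = hypergeom_tail(k=4, n=5, K=10, N=100)    # class is rare in the database
p_common = hypergeom_tail(k=4, n=5, K=60, N=100)  # class is common in the database
```

The same 4 out of 5 hits give a far smaller probability for the rare class than for the common one, so a frequency-aware score ranks the rare class as the more significant prediction.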
Our contribution: PANNZER
• Benefits:
– Taking the species taxonomy into account
– Improved use of statistics
• Not public yet
Our contribution: No Name Yet
• Take PFAM domain predictions, BLAST
similarities and Taxonomic information
• Feed this to feature selection and to classifier
• …Wait…
• Method is used to predict GO-classes
• Not public + testing is ongoing
• These methods are increasingly needed
• Some methods exist
• Unfortunately no clear evaluation (my opinion)
• Remember: These are predictions. No certain
info until they are tested in the wet lab…
