AN OVERVIEW OF GENE STRUCTURE & FUNCTION PREDICTION
Marcus Chibucos, Ph.D.
University of Maryland School of Medicine
June 2014

Overview & goals
• Understand
  1. How we predict presence & structure of coding & non-coding genes in the genome
  2. How we know what a gene product does & how evidence is used to support this
• When searching databases like FungiDB or InterPro, understand the meaning of terms like: protein motif, domain, ortholog, HMM, EC, GO annotation, and so forth
• Learn fundamentals with prokaryotes
• Overview of eukaryotes

GENE STRUCTURAL ANNOTATION

What is a gene model?
Yandell and Ence (2012) Nature Reviews Genetics. 13:329-342.

Fundamental methods of pattern detection
• Intrinsic (ab initio/de novo, “from the beginning”)
  • Uses only DNA sequence & the inherent patterns within it
  • Canonical features like start & stop codons
• Extrinsic
  • Uses additional sources of evidence
  • Homologous proteins
  • mRNA (ESTs, RNA-Seq)
  • Synteny

PROKARYOTIC STRUCTURAL ANNOTATION

Prokaryotic gene structure
[Figure: DNA with promoter, RBS, start (ATG) and stop (TAG); mRNA with RBS, start (AUG) and stop (UAG); the open reading frame (ORF) runs from start to stop]
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

Start with DNA sequence
DNA sequence has 6 translation frames
• 3 on forward strand, 3 on reverse strand

Graphical display of 6-frame translation
Each horizontal bar represents one of the translation frames.
• Tall vertical lines represent translation stops (TAG, TAA, TGA).
• Short vertical lines represent translation starts (ATG, GTG, TTG).
[Figure: six-frame display with several start-to-stop spans highlighted] These are examples of the many ORFs in this graphic.
• What is an “ORF”?
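The six-frame display above can be approximated in a few lines of code. The following is a minimal sketch, not the method of any particular gene finder: it treats an ORF as a stretch that runs from the first start codon (ATG, GTG, TTG) after a stop to the next in-frame stop (TAG, TAA, TGA), and reports coordinates on the forward strand. The function names and the `min_codons` parameter are hypothetical.

```python
STOPS = {"TAG", "TAA", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons=1):
    """Yield (frame, start, end) for each start-to-stop ORF on one strand.

    Coordinates are 0-based, half-open, and include the stop codon.
    min_codons counts codons from the start codon up to (not including) the stop.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i  # first start codon since the previous stop
            elif codon in STOPS:
                if start is not None and (i - start) // 3 >= min_codons:
                    yield frame + 1, start, i + 3
                start = None  # a stop always resets the search


def six_frame_orfs(seq, min_codons=1):
    """Return ORFs from all six frames as (frame_label, start, end) tuples."""
    seq = seq.upper()
    n = len(seq)
    found = [(f"+{frame}", s, e) for frame, s, e in orfs_in_strand(seq, min_codons)]
    # Reverse-strand hits are converted back to forward-strand coordinates.
    for frame, s, e in orfs_in_strand(reverse_complement(seq), min_codons):
        found.append((f"-{frame}", n - e, n - s))
    return found


if __name__ == "__main__":
    for orf in six_frame_orfs("CCATGAAATAGCC"):
        print(orf)
```

Raising `min_codons` plays the role of the minimum length cut-off used when choosing candidate ORFs; real gene finders then score these candidates rather than accepting them all.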
Prokaryotic gene finders
• Glimmer
  • http://www.cbcb.umd.edu/software/glimmer
  • prok and euk versions
• Prodigal
  • http://prodigal.ornl.gov
• GeneMark
  • http://exon.gatech.edu
  • prok and euk versions
• EasyGene
  • http://www.cbs.dtu.dk/services/EasyGene
• Many others exist (or have existed...)

Glimmer
• Uses interpolated Markov models (IMMs) to predict which ORFs in a genome contain real genes.
• Glimmer compares nucleotide patterns found in a training set of genes known (or believed) to be real with the nucleotide patterns of ORFs across the whole genome. ORFs whose patterns resemble those of the training genes are themselves considered real.
• Using Glimmer is a two-part process
  1. Train Glimmer with genes from the sequenced organism that are known, or strongly believed, to be real.
  2. Run trained Glimmer against the entire genome sequence.
• This is how most ab initio gene predictors work, including eukaryotic predictors such as Augustus, GeneID, SNAP, and others.
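The train-then-score idea can be illustrated with a much simpler model than Glimmer actually uses. The sketch below assumes a fixed-order Markov chain (context length `k`) with pseudocounts and scores a candidate as a log-odds ratio against a uniform 0.25 background; Glimmer's interpolated model combines several context lengths, which this sketch does not attempt. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train_markov(training_genes, k=3, pseudocount=1.0):
    """Estimate P(next base | previous k bases) from trusted gene sequences."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0))
    for gene in training_genes:
        gene = gene.upper()
        for i in range(k, len(gene)):
            context, nxt = gene[i - k:i], gene[i]
            # Skip windows containing ambiguous characters such as N.
            if nxt in BASES and set(context) <= set(BASES):
                counts[context][nxt] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values()) + 4 * pseudocount
        model[context] = {b: (c[b] + pseudocount) / total for b in BASES}
    return model


def log_odds(orf, model, k=3):
    """Sum of log2(P_model / 0.25) over the ORF; unseen contexts contribute 0."""
    orf = orf.upper()
    score = 0.0
    for i in range(k, len(orf)):
        p = model.get(orf[i - k:i], {}).get(orf[i], 0.25)
        score += math.log2(p / 0.25)
    return score


if __name__ == "__main__":
    model = train_markov(["ATGATGATGATG" * 3])
    print(log_odds("ATGATGATG", model))  # resembles the training genes
    print(log_odds("CCCCCCCCC", model))  # no resemblance
```

ORFs with high positive scores resemble the training genes; a score threshold (not shown, and organism dependent) would separate predicted genes from rejected ORFs.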
Gathering the training set
• Using verified, published sequences is ideal… but not always possible
• Minimum needed is 250 kb of total sequence
• BLAST translated ORFs against a protein database (slow)
  • Keep only very strong matches
• Gather long non-overlapping ORFs (fast)
• Many more complex strategies exist, especially for eukaryotes
[Figure: ORFs labeled “these” (gathered) and “not these” (not gathered)]

Training Glimmer
• All k-mers of size 5-8 in the training sequence are tracked
• Frequency of each nucleotide following any given k-mer is recorded
• This data set is used to build a statistical model that provides the probability that any given nucleotide will follow any given k-mer
• This model is used to score the ORFs in the genome
• ORFs whose nucleotide/k-mer patterns match the model are predicted to be real genes

Candidate ORFs
[Figure: six-frame display, frames +3, +2, +1, -1, -2, -3]
• Choose a minimum length cut-off
• Blue ORFs meet this minimum
• Each blue ORF will be scored against the model built from the training genes
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

Categorizing ORFs as genes or not
[Figure: six-frame display, frames +3, +2, +1, -1, -2, -3]
• Some ORFs will score well against the model (green)
• Some will not (red)
• Green ORFs will be retained as predicted genes (blue arrows depicted along the DNA molecule in black at the bottom of the figure)

Potential problems to watch for
• False positives
  • An ORF is predicted to be a gene, but really isn’t
  • May result in overlaps
• False negatives
  • An ORF is not predicted to be a gene, but really is
  • May result in “gaps” in feature predictions
• Wrong start site chosen
  • Most genes have multiple candidate start codons near the beginning, so it can be hard to determine which is the true one

Is one of these a false positive?
[Figure: six-frame display with two heavily overlapping predicted genes]
Probably. Genes don’t generally overlap to this extent in prokaryotes.
• What about eukaryotes?

Is this a false negative?
[Figure: six-frame display with a large region lacking predicted genes]
Probably. There are not large regions without gene content in prokaryotes.
• What about eukaryotes?
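The two warning signs above, heavy overlaps and large gene-free gaps, are easy to flag automatically for later inspection. This is a minimal sketch under stated assumptions: genes are given as hypothetical `(name, start, end)` tuples with 0-based half-open coordinates, strand is ignored, and the `max_overlap` and `max_gap` thresholds are illustrative values rather than recommended settings.

```python
def flag_overlaps_and_gaps(genes, max_overlap=50, max_gap=1000):
    """Return (overlaps, gaps) that deserve a closer look.

    overlaps: (gene_a, gene_b, shared_bp) for pairs sharing more than max_overlap bp
    gaps:     (gap_start, gap_end) for gene-free stretches longer than max_gap bp
    """
    genes = sorted(genes, key=lambda g: g[1])
    overlaps, gaps = [], []
    furthest_end = None
    for idx, (name, start, end) in enumerate(genes):
        # Compare with later genes until one starts beyond this gene's end.
        for other, o_start, o_end in genes[idx + 1:]:
            if o_start >= end:
                break
            shared = min(end, o_end) - o_start
            if shared > max_overlap:
                overlaps.append((name, other, shared))
        # A gap is measured from the furthest gene end seen so far.
        if furthest_end is not None and start - furthest_end > max_gap:
            gaps.append((furthest_end, start))
        furthest_end = end if furthest_end is None else max(furthest_end, end)
    return overlaps, gaps


if __name__ == "__main__":
    predicted = [("a", 0, 900), ("b", 600, 1500), ("c", 1490, 2000), ("d", 4000, 5000)]
    print(flag_overlaps_and_gaps(predicted))
```

A flagged overlap is only a candidate false positive and a flagged gap only a candidate false negative; both still need evidence (matches to proteins or HMMs) before a model is removed or added.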
Why might this happen?
If a region of DNA differs in composition from the rest of the genome, gene finders will score its ORFs poorly even when they are real genes. Different composition may come about in many ways; one common way is lateral (or horizontal) transfer, e.g. phage integration, transposition, et cetera.

Translation start sites
• Start site frequency: ATG >> GTG >> TTG
• Ribosome binding site (RBS): AG-rich sequence 5-11 bp upstream of the start codon
• Similarity to match proteins, in BER & multiple alignments
• Example below shows the beginning of a BER alignment. (DNA sequence reads down in columns for each codon.) Homology starts exactly at the first atg (current chosen start, aa #1). There is a favorable RBS (gagggaga) beginning 9 bp upstream of this atg. There is no reason to consider the ttg, and no justification for moving to the second atg (this would cut off some similarity, and it does not have an RBS).
[Figure: BER alignment showing 3 possible start sites, the RBS upstream of the chosen start, the BER match, and this ORF’s upstream boundary]

Overlap analysis
When two ORFs overlap (boxed areas), the one without similarity to anything (another protein, an HMM, etc.) is removed. If neither matches anything, other considerations such as presence in a putative operon and potential start codon quality are weighed. Small regions of overlap are allowed (circle).

Interevidence regions
Areas of the genome with no genes, and areas within genes without any kind of evidence (no match to another protein, HMM, etc.; such regions may include an entire gene in the case of “hypothetical proteins”), are translated in all 6 frames and searched against a non-redundant protein database.
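Two of the start-site criteria from the “Translation start sites” slide (codon preference ATG >> GTG >> TTG, and an AG-rich stretch 5-11 bp upstream) can be turned into a rough ranking of candidate starts. The sketch below is illustrative only: the numeric weights are invented for the example, the RBS window is assumed to be the 7 bases lying 5 to 11 positions upstream of the start codon, and the third criterion (similarity to match proteins) is not modeled at all.

```python
# Hypothetical weights reflecting only the ordering ATG >> GTG >> TTG.
START_WEIGHT = {"ATG": 3.0, "GTG": 2.0, "TTG": 1.0}


def rank_starts(seq, orf_start, orf_end, rbs_weight=2.0):
    """Rank in-frame candidate start codons within seq[orf_start:orf_end].

    Returns (score, position, codon) tuples, best first. The score adds the
    codon weight to rbs_weight times the A/G fraction of the upstream window.
    """
    seq = seq.upper()
    ranked = []
    for pos in range(orf_start, orf_end - 2, 3):
        codon = seq[pos:pos + 3]
        if codon not in START_WEIGHT:
            continue
        # Bases located 11 down to 5 positions upstream of the start codon.
        window = seq[max(0, pos - 11):max(0, pos - 4)]
        ag_fraction = sum(b in "AG" for b in window) / len(window) if window else 0.0
        ranked.append((START_WEIGHT[codon] + rbs_weight * ag_fraction, pos, codon))
    return sorted(ranked, reverse=True)


if __name__ == "__main__":
    # Toy sequence: an AG-rich stretch precedes the first ATG but not the second.
    toy = "CCC" + "GAGGGAG" + "CCCC" + "ATG" + "CCCCCCCCC" + "ATG" + "CCCTAA"
    for candidate in rank_starts(toy, 14, len(toy)):
        print(candidate)
```

In the toy sequence the first ATG wins because of its upstream AG-rich window, which mirrors the reasoning in the BER example; in practice the alignment evidence usually decides.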
It’s not just about proteins
• Can predict many genes beyond protein-coding ones

Manatee genome viewer
http://manatee.igs.umaryland.edu/
http://manatee.sourceforge.net/igs/index.shtml

Artemis gene model curation tool
http://www.sanger.ac.uk/resources/software/artemis/

EUKARYOTIC STRUCTURAL ANNOTATION

Eukaryotic gene structure prediction
…now things get more complicated

Gene finder evaluation
• Sensitivity (Sn) measures false negatives
  • The fraction of a known reference feature that is predicted by a gene predictor = TP / (TP + FN)
• Specificity (Sp) measures false positives
  • The fraction of the prediction that overlaps a known reference feature = TP / (TP + FP)

Example (exon level, compared with the real gene model):
Prediction                            Sn               Sp
All three exons correct               3/(3+0) = 1.0    3/(3+0) = 1.0
One extra exon (false positive)       1.0              0.75
One exon missed (false negative)      0.67             1.0

• Assessed at different levels
  – Base
  – Exon (pictured above)
  – Transcript
  – Gene

Intrinsic (ab initio) success rates
• Prokaryotic: very good (>95% correct)
• Eukaryotic: not so good (~50% correct, shown below)
http://bioinf.uni-greifswald.de/augustus/accuracy (accessed May 2013)

Complexities of eukaryotic gene finding
• Large eukaryote genomes have low coding density compared to prokaryotes, where nearly all long ORFs encode genes
• Genomic repeats
• Non-canonical (non-ATG) start codons
• Splicing (exons & introns) and alternative splicing (40-50% of genes)
• Pseudogenes
• Long genes or short genes
• Long introns
• Non-canonical introns
• UTR introns
• Overlapping genes on opposite strands
• Nested genes overlapping on the same strand or within an intron
• Polycistronic peptide-coding genes
  • One mRNA codes for several very short (~11 aa) peptides… regulatory function
• Even if you have some RNA evidence (helpful), transcription is not always active
  • Multiple biological conditions are required

Masking repeats is essential
• RepeatMasker (http://www.repeatmasker.org) finds interspersed
repeats & low complexity DNA sequences by comparing DNA sequence to curated, genome-specific libraries
  • Simple repeats: 1-5 bp duplications such as A, CA, CGG
  • Tandem repeats: 100-200 bases, found at centromeres & telomeres
  • Segmental duplications: 10-300 kilobase blocks copied to another genomic region
  • Interspersed repeats
    • Processed pseudogenes, retrotranscripts (short interspersed elements, SINEs): non-functional copies of RNA genes reintegrated into the genome via reverse transcriptase
    • DNA transposons
    • Retrovirus retrotransposons
    • Non-retrovirus retrotransposons (long interspersed elements, LINEs)
• ~50% of human genomic DNA is currently masked
• RepeatModeler searches for repeats ab initio and can find previously uncharacterized repeats

Repeats yield similarities in nonhomologous regions
[Figure: similarity between GENE1 and GENE2 using unmasked genomic DNA versus using masked genomic DNA]
Alkes L. Price, Neil C. Jones and Pavel A. Pevzner (June 28, 2005) http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt

Predicted genes that are actually repeats
[Figure: gene predictors run on masked genomic DNA give no models over the repeats; run on unmasked genomic DNA they give predicted models that coincide with repeats]

Factors affecting gene predictor results
• Underlying algorithm
• Program parameters
• Training set (number and quality of models)
• Extrinsic data (expression data, protein/genome alignment)

Predictor / evidence source              Training set 1   Training set 2
GeneMark-ES (self training)              9,024            9,024
Augustus trained on Fungus               8,694            9,011
Augustus with “optimize” step            8,503            8,920
SNAP trained on Fungus                   7,335            7,955
GlimmerHMM trained on Fungus             10,313           11,894
Scipio alignments with other Fungi       10,691           10,691
Trinity assemblies GMAP aligned          9,527            9,527
Trinity (Jaccard clip option on)         10,023           10,023
GLEAN consensus                          8,705            9,123

Which model is “correct”?
[Figure: models from three different predictors/conditions, the consensus model, and protein alignments]

We rely on certain conventions
• Rules are based on gene composition & signal
• First, what is the basic structure of a gene?
  • Coding region (exon) is inside the ORF of one reading frame
  • All exons are on the same strand for a given gene
  • Exons within a gene can have different reading frames
• Inherent frequency patterns exist…

Dimer frequency distribution
• Dimer frequency in protein sequence is not evenly distributed and is organism specific
  • Some amino acids “prefer” to be next to one another
  • Most dicodons are biased toward either coding or non-coding, not neutral
• Expected frequency of a dimer
  • If random = 0.25% (1/20 * 1/20)
  • If a dimer has lower than expected frequency, a protein is less likely to contain it… and it follows that a sequence that does contain it is less likely to lie in a coding region
• Example: In the human genome, AAA AAA appears 1% of the time in coding regions and 5% of the time in non-coding regions

Splicing
• Find all GT/AG donor/acceptor sites
• Score with a position-specific scoring matrix (PSSM) model
http://en.wikipedia.org/wiki/File:Pre-mRNA_to_mRNA.svg
[Figure: intron showing the splice donor, branch point, polypyrimidine tract, and splice acceptor]
Modified from: http://en.wikipedia.org/wiki/File:Intron_miguelferig.jpg

Position Specific Scoring Matrix (PSSM)
Let’s say you look at 5 splice donor (GU) sites:
AUCGUCGC
UCAGUGGC
CUCGUCCC
GUCGUUAC
CACGUCUA

Counts of each nucleotide at each position:
Position   1  2  3  4  5  6  7  8
A          1  1  1  0  0  0  1  1
G          1  0  0  5  0  1  2  0
C          2  1  4  0  0  3  1  4
U          1  3  0  0  5  1  1  0

Gene finders use this information to predict where gene features are. For this to work, one must have confirmed splice sites to use for training. These are not always available for new genomes… and some splice sites are non-canonical… and some genes are alternatively spliced… so it can become somewhat complex.

Translation start prediction
• Position-specific scoring matrix (PSSM)
• Certain nucleotides tend to occur at particular positions around the start site (ATG), and others do not
• Such biased nucleotide distribution is the basis for translation start prediction
Figure courtesy of Sucheta Tripathy http://www.slideshare.net/tsucheta/29th-june2011

Mathematical model
• F_i(X): freq.
of X (A, G, C, T) in position i
• Score string by Σ_i log(F_i(X) / 0.25)
Figure courtesy of Sucheta Tripathy http://www.slideshare.net/tsucheta/29th-june2011

Pattern-based exon & gene prediction
• Assess different criteria
  • Coding region inside ORF (start & stop, no interrupting stops)
  • Dimer frequency
  • Coding score
  • Donor site score
  • Acceptor site score
• Other factors to consider
  • GC content
  • Exon length distribution
  • Polymerase II promoter elements (GC box, CCAAT box, TATA region)
  • Ribosome binding site
  • Polyadenylation signal upstream of the poly-A cleavage site
  • Termination signal downstream of the poly-A cleavage site

Example of ab initio gene predictor flow
http://genome.crg.es/software/geneid/

Confirming a predicted gene with cDNA
26 exons!
http://pasa.sourceforge.net/

Extrinsic evidence & manual curation
• Expression data
  • EST (expressed sequence tag) sequences
  • RNA-seq reads
    • mRNA → cDNA
    • High-throughput sequencing
    • Align reads to genome sequence
• Homology-based approaches
  • Protein (or expression data) sequences from other organisms
  • Nucleic acid conservation via tblastx or many other methods
  • Ortholog mapping/synteny
• Experimentally confirmed gene products & gene families
• Manual curation is often done by experts in a domain

RNA-seq of transcripts as evidence for gene models
mRNA → cDNA → reads:
GCTAATGCGAAGTCCTAGACCAGATTGAC
ATGCGATGCAGCTGACGCTGGCTAATGCG
CGCATAGCCAGATGACCATGATGCGATGC
TGACAGATTAGACAGTAGGACAGATAGAC
… many millions of reads
1. Gene model is confirmed by transcript information.
2. Part of the gene model is confirmed, but the exons predicted in the middle do not have transcript evidence. Does this mean they are not real? Not necessarily.
3. Transcript sequencing allows for novel gene detection. There is transcript evidence for the presence of a gene (or at least transcription) in an area of the genome without a gene model currently predicted.
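The position-specific scoring from the “Mathematical model” slide earlier in this section can be written out directly. The sketch below builds per-position frequencies F_i(X) from the five example splice donor sites and scores a string as the sum over positions of log(F_i(X) / 0.25). The pseudocount is an assumption added here so that a nucleotide never seen at a position does not produce log(0); the function names are hypothetical.

```python
import math


def build_pssm(sites, alphabet="ACGU", pseudocount=0.5):
    """Return one {nucleotide: frequency} dict per alignment column."""
    total = len(sites) + pseudocount * len(alphabet)
    pssm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pssm.append({x: (column.count(x) + pseudocount) / total for x in alphabet})
    return pssm


def log_odds_score(seq, pssm, background=0.25):
    """Sum over positions of log(F_i(X) / background); seq must match pssm length."""
    return sum(math.log(pssm[i][x] / background) for i, x in enumerate(seq))


# The five example donor sites, written in the RNA alphabet.
donor_sites = ["AUCGUCGC", "UCAGUGGC", "CUCGUCCC", "GUCGUUAC", "CACGUCUA"]
donor_pssm = build_pssm(donor_sites)

if __name__ == "__main__":
    print(log_odds_score("CUCGUCGC", donor_pssm))  # keeps the invariant GU
    print(log_odds_score("CUCAACGC", donor_pssm))  # GU replaced by AA
```

A candidate that keeps the invariant GU at positions 4-5 scores higher than one that does not, which is the behavior a gene finder relies on when ranking possible donor sites.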
[Figure: reads mapped to genome with gene models; regions numbered 1, 2, and 3]

Splice boundaries and alternate transcripts
• Some reads will span the intron/exon boundaries
• Allows for verification of gene models
• Observation of alternate transcripts
http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

Multiple genome alignment & conservation

Experimentally based manual curation
• We have experimentally characterized protein
• What do I know about this gene family?
• What do I know about genes in general?
  • No introns in multiples of three, short introns, et cetera

Leverage comparative genomics
Arnaud, et al. (2010) Nucleic Acids Res. 38(Database issue): D420-7.

Gather models for ab initio training set
• Get models verified via expression, homology, or manual curation
  • Use manually curated genes from your organism
  • Generate a preliminary ab initio model set and then do a homology search at Swiss-Prot, retaining the most-conserved genes
  • Use CEGMA (Core Eukaryotic Genes Mapping Approach) to predict highly conserved genes
  • Align proteins from related organisms to your genome with a splice-aware aligner, thus creating models with exon boundaries that have homologs
  • Align RNA-seq reads or ESTs to your genome to create or update existing models
  • Use models with multiple sources & remove highly similar ones
OR
• Use a pre-existing training set related to your organism
  • For example, I could use chicken if I am studying finch
  • Many software packages provide parameter files for common organisms

Run gene finder as online or stand alone
• Augustus web has text & graphical output
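The advice to “use models with multiple sources & remove highly similar ones” when gathering a training set can be sketched as a simple filter. Everything specific here is an assumption made for illustration: models are hypothetical dictionaries with `id`, `protein`, and `evidence` keys, similarity is approximated by Jaccard overlap of 3-mers (a crude stand-in for a proper alignment or clustering tool), and the thresholds are arbitrary.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_training_models(models, min_sources=2, max_similarity=0.7, k=3):
    """Keep well-supported models, dropping near-duplicates of ones already kept.

    Models with the most evidence sources are considered first, so the
    best-supported member of a group of similar models is the one retained.
    """
    kept = []
    for model in sorted(models, key=lambda m: len(m["evidence"]), reverse=True):
        if len(model["evidence"]) < min_sources:
            continue
        profile = kmers(model["protein"], k)
        if all(jaccard(profile, kmers(other["protein"], k)) <= max_similarity
               for other in kept):
            kept.append(model)
    return kept


if __name__ == "__main__":
    candidates = [
        {"id": "g1", "protein": "MKTAYIAKQRQISFVKSHFSRQ", "evidence": {"rnaseq", "protein"}},
        {"id": "g2", "protein": "MKTAYIAKQRQISFVKSHFSRQ", "evidence": {"rnaseq", "protein", "curated"}},
        {"id": "g3", "protein": "MSDNELLVWGGPPHHTTRRAA", "evidence": {"est", "protein"}},
        {"id": "g4", "protein": "MAAAGGGLLLKKKPPPDDDEE", "evidence": {"est"}},
    ]
    print([m["id"] for m in select_training_models(candidates)])
```

Removing near-identical models matters because a training set dominated by one gene family would bias the statistical model toward that family's composition.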
Predictions stored in GFF3 or GFF2 or GTF format

RNA-seq can show differential expression of alternative transcripts

Combiners
• Incorporate multiple evidence types including ab initio predictions, expression data, and homology; these usually perform the best
  • Glean
  • Evidence Modeler (EVM)
  • Jigsaw
  • Maker (actually a whole pipeline that can be used online)
  • PASA (combines predicted structures with expression data)
  • And more…
• Note that many ab initio predictors, for example Augustus, incorporate other data types such as protein alignments or expression data

One example, the Glean combiner
• Glean paper at http://genomebiology.com/2007/8/1/R13
• Top track below is a statistically derived combination of the ones below it

Example of annotation pipeline
• Fungal Genome Annotation Standard Operating Procedure (SOP) at JGI
  • Repeat masking
  • Mapping ESTs (BLAT) from the organism and publicly available proteins from related taxa (BLASTx)
  • Ab initio (FGENESH, GeneMark), homology-based (FGENESH+, Genewise seeded by BLASTx against nr), and EST-based (EST_map) gene prediction
  • EST clustering to improve gene models
  • Filtering overlapping gene models based on protein homology and EST support to derive the “best” model
  • Non-coding genes with tRNAscan-SE
  • …ready for functional annotation
http://genome.jgi.doe.gov/programs/fungi/FungalGenomeAnnotationSOP.pdf

nGASP – the nematode genome annotation assessment project
http://www.biomedcentral.com/1471-2105/9/549

Take home message
• Intrinsic & extrinsic prediction methods
• Intrinsic gene finders need high-quality training datasets in order to produce good predictions
• “Correct” gene predictions are a moving target
  • Note the steady decrease in the number of predicted genes as the human genome is further curated
• Gene finders & gene finding pipelines produce predictions, which must be verified and refined; do not take them at face value
• The more pieces of high-quality evidence you add to the process, the better
• In eukaryotes
especially, there is not necessarily only one correct model

PROTEIN FUNCTIONAL ANNOTATION

Annotation defined
• annotate – to make or furnish critical or explanatory notes or comment. -- Merriam-Webster dictionary
• genome annotation – the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes. -- Lincoln Stein, PMID 11433356
• Gene Ontology (GO) annotation – the process of assigning GO terms to gene products… according to two general principles: first, annotations should be attributed to a source; second, each annotation should indicate the evidence on which it is based. -- http://www.geneontology.org

What do our predicted genes do?
• What we would like:
  • Experimental knowledge of function
    • Literature curation
    • Perform experiment
  • Not possible for all proteins in most organisms (not even close in most)
• What we actually have:
  • Sequence similarity
    • Similarity to motifs, domains, or whole sequences
    • Protein not DNA for finding function
    • Shared sequence can imply shared function
• All sequence-based annotations are putative until proven experimentally

Basic set of protein annotations
• protein name - descriptive common name for the protein
  • e.g. “ribokinase”
• gene symbol - mnemonic abbreviation for the gene
  • e.g. “recA”
• EC number - only applicable to enzymes
  • e.g. 184.108.40.206
• role - what the protein is doing in the cell and why
  • e.g. “amino acid biosynthesis”
• supporting evidence
  • accession numbers of BER and HMM matches
  • TmHMM, SignalP, LipoP
  • whatever information you used to make the annotation
• unique identifier
  • e.g.
locus ids

Alignments/Families/Motifs
• pairwise alignments – two proteins’ amino acid sequences aligned next to each other so that the maximum number of amino acids match
• multiple alignments – 3 or more amino acid sequences aligned to each other so that the maximum number of amino acids match in each column
  – more meaningful than pairwise alignments, since several proteins are much less likely than two to share sequence similarity by chance alone. Such shared similarity is therefore more likely to indicate shared function.
• protein families – clusters of proteins that all share sequence similarity and presumably similar function
  – may be modeled by various statistical techniques
• motifs – short regions of amino acid sequence shared by many proteins
  • transmembrane regions
  • active sites
  • signal peptides

Important terms to understand
• homologs
  • two sequences have evolved from the same common ancestor
  • they may or may not share the same function
  • two proteins are either homologs of each other or they are not. A protein cannot be more, or less, homologous to one protein than to another.
• orthologs
  • a type of homolog where the two sequences are in different species that arose from a common ancestor. The speciation event itself created the two copies of the sequence.
  • orthologs often, but not always, share the same function
• paralogs
  • a type of homolog where the two sequences have arisen due to a gene duplication within one species
  • paralogs initially have the same function (just after the duplication), but over time one copy is free to evolve new functions while the other copy maintains the original function. This process is called “neofunctionalization”.
• xenologs
  • a type of homolog where the two sequences have arisen due to lateral (or horizontal) transfer
[Figure: an ancestral gene; speciation leads to orthologs; duplication leads to paralogs; lateral transfer to a different species makes xenologs; one paralog evolves a new function]
“neofunctionalization” – the duplicated gene/protein develops a new function

Pairwise alignments
• There are numerous tools available for pairwise alignments
  • NCBI BLAST resources
  • FASTA searches
  • Many more
• At IGS we use a tool called BER (BLAST-extend-repraze) that combines BLAST and Smith-Waterman approaches
• Actually much of bioinformatics is based on reusing tools in new and creative ways…

BER
[Figure: BER workflow]
1. BLAST: the genome’s protein set is searched against a non-redundant protein database.
2. Significant hits (using a liberal cutoff) are put into a mini-db for each protein (mini-db for protein #1, #2, #3, ..., #3000).
3. The query protein is extended by 300 nt.
4. Modified Smith-Waterman alignment: the extended query protein is aligned against its mini database, giving the BER alignment.

BER Alignment
…to look through in-frame stop codons and across frameshifts to determine if similarity continues

Extensions in BER
The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein.
[Figure: ORFxxxxx from end5 to end3 with a 300 bp extension on each side; search protein aligned to match protein]
• normal full length match
• FS (!): similarity extending through a frameshift upstream or downstream into extensions
• PM (*): similarity extending in the same frame through a stop codon

How do you know when an alignment is good enough to determine function?
• Good question!
No easy answer…
• Generally, you want a minimum of 40%-50% identity over the full lengths of both query and match with conservation of all important structural and catalytic sites
• However, some information can be gained from partial alignments
  • Domains
  • Motifs
• BEWARE OF TRANSITIVE ANNOTATION ERRORS

Pitfalls of transitive annotation
Transitive annotation is the process of passing annotation from one protein (or gene) to another based on sequence similarity:
A → B, B → C, C → D
A’s name has passed to D from A through several intermediates.
- This is fine if A is similar to D.
- This is NOT fine if A is NOT similar to D.
• Current public datasets are full of such errors
Transitive annotation errors are easy to make and happen often.
• A good way to avoid transitive annotation errors is to require that in a pairwise match, the match annotation must be trusted
• Be conservative
  • Err on the side of not making an annotation, when possibly you should, rather than making an annotation when probably you shouldn’t.

Trusted annotations
• It is important to know what proteins in our search database are characterized.
  • proteins marked as characterized from public databases
    • Gene Ontology repository (more on this later)
    • GenBank (only recently began)
    • UniProt
      • proteins at “protein existence level 1”
      • proteins with literature reference tags indicating characterization

UniProt
http://www.uniprot.org
• Swiss-Prot
  • European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB)
  • all entries manually curated
  • http://www.expasy.ch/sprot
  • annotation includes
    • links to references
    • coordinates of protein features
    • links to cross-referenced databases
• TrEMBL
  • EBI and SIB
  • entries have not been manually curated
  • once they are curated, accessions remain the same but entries move into Swiss-Prot
  • http://www.expasy.ch/sprot
• Protein Information Resource (PIR)
  • http://pir.georgetown.edu

Enzyme Commission
Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the Reactions they Catalyse
• not sequence based
• categorized collection of enzymatic reactions
• reactions have accession numbers indicating the type of reaction, for example EC 220.127.116.11
• http://www.chem.qmul.ac.uk/iubmb/enzyme/
• http://www.expasy.ch/enzyme/

EC number hierarchy
• All ECs starting with #1 are some kind of oxidoreductase
• Further numbers narrow the specificity of the type of enzyme
• A four-position EC number describes one particular reaction

Example entry for one specific enzyme

Metabolic pathway databases
• KEGG
  • http://www.genome.jp/kegg/
• MetaCyc/BioCyc
  • http://metacyc.org/
  • http://www.biocyc.org/
• BRENDA
  • http://www.brenda-enzymes.info/

Hidden Markov models (HMMs)
• Statistical model of the patterns of amino acids in a multiple alignment of proteins (called the “seed”) which share sequence and, presumably, functional similarity
• Two sets routinely used for protein functional annotation
  • TIGRFAMs (www.tigr.org/TIGRFAMs/)
  • Pfam
(pfam.sanger.ac.uk)
• Each TIGRFAM model is assigned to a category which describes the type of functional relationship the proteins in the model have to each other
  – Equivalog - one specific function, e.g. “ribokinase”
  – Subfamily - group of related functions generally with different substrate specificities, e.g. “carbohydrate kinase”
  – Superfamily - different specific functions that are related in a very general way, e.g. “kinase”
  – Domain - not necessarily the full length of the protein; contains one functional part or structural feature of a protein; may be fairly specific or very general, e.g. “ATP-binding domain”

Annotation attached to HMMs
• Functionally specific HMMs have specific annotations
  – TIGR00433 (accession number for the model)
    • name: biotin synthase
    • category: equivalog
    • EC: 2.8.1.6
    • gene symbol: bioB
    • Roles:
      – biotin biosynthesis (TIGR role 77 / GO:0009102)
      – biotin synthase activity (GO:0004076)
• Functionally general HMMs have general annotations
  – PF04055
    • name: radical SAM domain protein
    • category: domain
    • EC: not applicable
    • gene symbol: not applicable
    • Roles:
      – enzymes of unknown specificity (TIGR role 703)
      – catalytic activity (GO:0003824)
      – metabolism (GO:0008152)

HMM building
[Figure: proteins from many species → alignments of functionally related proteins act as training sets for HMM building → statistical model specific to a family of proteins, generally found across many species]
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

HMM scores
• When a protein is searched against an HMM it receives a bit score and an e-value indicating the significance of the match
• The person building the HMM will search the new HMM against a protein database and decide on the trusted (T) and noise (N) cutoff scores
• The search protein’s score is compared with the trusted and noise cutoff scores attached to the HMM
  • proteins scoring above the trusted
cutoff can be assumed to be members of the family
  • proteins scoring below the noise cutoff can be assumed NOT to be members of the family
  • proteins scoring between the trusted and noise cutoffs may or may not be members of the family
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

HMM databases
[Figure: proteins from many species → alignments of functionally related proteins act as training sets for HMM building → statistical model, with trusted (T) and noise (N) cutoffs, specific to a family of proteins generally found across many species → add this model to the database]
• Database of HMM models, each specific to one protein family and/or functional level
• Examples: Pfam and TIGRFAM

The cutoff scores attached to HMMs are sometimes high, sometimes low, and sometimes even negative. There is no inherent meaning in how high or low a cutoff score is; the important thing is the query protein’s score relative to the trusted and noise scores.
[Figure: four score scales from -50 to 100 marking the noise cutoff (N), the trusted cutoff (T), and the protein’s score (P)]
• …above trusted: the protein is a member of the family the HMM models
• …below noise: the protein is not a member of the family the HMM models
• …in between noise and trusted: the protein MAY be a member of the family the HMM models
• …above trusted and some or all scores are negative: the protein is a member of the family the HMM models

Orthologous groups
• COGs – have not been updated in a long time
• eggNOG – newer, more complete
[Figure: bi-directional best BLAST hits (1, 2, 3) linking proteins A, B, and C]

Motif searches
• PROSITE - http://www.expasy.ch/prosite/
  – “consists of documentation entries describing protein domains, families and functional sites as well as associated patterns to identify them.”
• Center for Biological Sequence Analysis http://www.cbs.dtu.dk/
  – Protein Sorting (7 tools)
    • SignalP finds potential secreted proteins
    • LipoP finds potential lipoproteins
    • TargetP predicts subcellular location of proteins
  – Protein function and structure (9 tools)
    • TmHMM finds
potential membrane spans
  – Post-translational modifications (14 tools)
  – Immunological features (9 tools)
  – Gene finding and splice sites (9 tools)
  – DNA microarray analysis (2 tools)
  – Small molecules (2 tools)

One-stop shopping - InterPro
• InterPro
  • Brings together multiple databases of HMM, motif, and domain information.
  • Excellent annotation and documentation
  • http://www.ebi.ac.uk/interpro/

Making annotations
• Use the information from the evidence sources to decide what the gene/protein is doing
• Assign annotations that are appropriate to your knowledge
  • Name
  • EC number
  • Role
  • Etc.

TIGR roles
Main categories:
• Amino acid biosynthesis
• Purines, pyrimidines, nucleosides, and nucleotides
• Fatty acid and phospholipid metabolism
• Biosynthesis of cofactors, prosthetic groups, and carriers
• Central intermediary metabolism
• Energy metabolism
• Transport and binding proteins
• DNA metabolism
• Transcription
• Protein synthesis
• Protein fate
• Regulatory functions
• Signal transduction
• Cell envelope
• Cellular processes
• Other categories
• Unknown
• Hypothetical
• Disrupted reading frame
• Unclassified (not a real role)
Each main category has several subcategories.

Names (and other annotations) should reflect knowledge
• specific function
  – Example: “adenylosuccinate lyase”, purB, 4.3.2.2
• varying knowledge about substrate specificity
  – A good example: ABC transporters
    • ribose ABC transporter
    • sugar ABC transporter
    • ABC transporter
  – choosing the name at the appropriate level of specificity requires careful evaluation of the evidence, looking for specific characterized matches and HMMs.
• family designation - no gene symbol, partial EC
  – “Cbby family protein”
  – “carbohydrate kinase, FGGY family”
• hypotheticals
  – “hypothetical protein”
  – “conserved hypothetical protein”

ONTOLOGIES

Names can be problematic….
• ….because humans do not always use precise and consistent terminology
• Our language is riddled with
  • Synonyms – different names for the same thing
  • Homonyms – different things with the same name
• This makes data mining/query difficult
• What name should you assign?
• What name should you use when you search UniProt or NCBI or any other database?

Synonyms
• Within any domain do people use precise & consistent language?
• Take biologists, for example…
• Mutually understood concepts – DNA, RNA, protein
• Translation & protein synthesis
• Synonym: one thing, more than one name
• Enzyme Commission reactions
• Standardized id, official name & alternative names
http://www.expasy.ch/enzyme/126.96.36.199

Homonyms
• Different things known by same name
• Common in biology
• Sporulation
• Vascular (plant vasculature, i.e. xylem & phloem, or vascular smooth muscle, i.e. blood vessels?)

Standardization with controlled vocabularies (CVs)
• An official list of precisely defined terms used to classify information & facilitate its retrieval
  • Flat list
  • Thesaurus
  • Catalog
• Benefits of CVs
– Allow standardized descriptions
– Synonyms & homonyms addressed
– Can be cross-referenced externally
– Facilitate electronic searching
A CV can be “…used to index and retrieve a body of literature in a bibliographic, factual, or other database.
An example is the MeSH controlled vocabulary used in MEDLINE and other MEDLARS databases of the NLM.”
http://www.nlm.nih.gov/nichsr/hta101/ta101014.html

Ontology: CV with defined relationships
• Formalizes knowledge of subject with precise textual definitions
• Networked terms; child more specific (“granular”) than parent
[Figure: National Drug File]

An example is the Gene Ontology with three controlled vocabularies
• Molecular Function
  • What the gene product is doing
• Biological Process
  • Why the gene product is doing what it does
• Cellular component
  • Where a gene product is doing what it does

The Gene Ontology
• A good example of a biological ontology
• Relationships among networked, defined terms
• Vascular terms shown with relationships

Example: a GO annotation
• Associating GO term with gene product (GP)
  • GP has function (6-phosphofructokinase activity)
  • GP participates in process (glycolysis)
  • GP is located in part of cell (cytoplasm)
• Linking GO term to GP asserts it has that attribute
  • Based on literature or computational methods
• Always involves:
  • Learning something about gene product
  • Selecting appropriate GO term
  • Providing appropriate evidence code
  • Citing reference [preferably open access]
  • Entering information into GO annotation file

Annotation becomes a series of ids linked to other proteins/genes/features
This protein is integral to the plasma membrane and is part of an ATP-binding cassette (ABC) transporter complex. It functions as part of a transporter to accomplish the transport of sulfate across the plasma membrane using ATP hydrolysis as an energy source.
=
• GO:0005887
• GO:0008272
• GO:0015419
• GO:0043190

[Figure: a GO term entry, labeled with:]
• Term name
• GO ID (unique numerical identifier)
• Synonyms for searching, alt. names, misspellings…
• GO slim
• Precise textual definition that describes some aspect of the biology of the gene product
• Definition reference
• Ontology relationships (next page)

Genomes can be compared
• High-level biological process terms used to compare Plasmodium and Saccharomyces (made by “slimming”)
MJ Gardner, et al. (2002) Nature 419:498-511

EVIDENCE

The importance of evidence tracking
• The process of functional annotation involves assessing available evidence and reaching a conclusion about what you think the protein is doing in the cell and why.
• Functional annotations should only be as specific as the supporting evidence allows.
• All evidence that led to the annotation conclusions that were made must be stored.
• In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.
I conclude that you are a cat. Why?
- You look like other cats I know
- I heard you meow and purr
I conclude that you code for a protein kinase. Why?
- You look like other protein kinases I know
- You have been observed to add phosphate to proteins

Knowledge & annotation specificity
• How much can we accurately say?
[Figure: Available evidence for three genes; Corresponding GO annotations]

Types of Evidence
• Experiments (often considered the best evidence)
• Pairwise/multiple alignments
• HMM/domain matches scoring above trusted cutoff
• Metabolic Pathway analysis
• Match to an ortholog group (COG, eggNOG)
• Motifs

The Evidence Ontology (ECO)
• ECO terms have standardized definitions, references & synonyms
• Allows standardizing evidence description and searching by evidence type
• Can filter by evidence type & do other things!
• GO evidence codes are subset of ECO

[Figure: ECO roots and combinatorial term]

The big picture: Evidence and sequence repositories

Experimental pathway
(Many ECO terms describe outputs of processes in aggregate)
• Can be described by particular evidence term
• Researcher performs experiment...
• Publish…
• ...annotate
• Term from ontology or other descriptive vocabulary
• Protein name, id, et cetera
• Author does analysis & arrives at conclusions, publishes these in paper
• A different person reads about the conclusion, interprets & makes annotation
• How associating a piece of information with a protein was performed. Who did the work, a person or a computer?
• Annotation stored in sequence repository
• Not shown: protein ID, date, name, et cetera
• The annotation with a term from a descriptive vocabulary/ontology such as the Gene Ontology
• The protein sequence
• The evidence term used to support the decision to associate the term from the descriptive vocabulary with that particular protein

The overall experimental annotation flow
• Pure evidence term
• Pure assertion method term
• Evidence x assertion method cross product

Similarity pathway
• Sequence database
• Protein sequence of interest
• Evidence term
• …researcher or computer performs search
• Resulting alignment with match protein
• Interpretation is made and annotation is made: “Protein 1 & 2 look same, so protein 1 has same function as protein 2…”
• Assertion method
• The assertion
• Store in sequence repository
• Evidence x assertion method cross product
• Same evidence, but we know who or what did the asserting…

The overall similarity annotation flow
• Pure evidence term
• Pure assertion method term
• Evidence x assertion method cross product

CONCLUDING REMARKS

The big picture: an example pipeline
• DNA Sequence (assembly, masking)
• Gene Prediction
• Predicted protein coding genes
• Automated start site and gene overlap correction
• RNA finding: tRNAscan-SE, RNAmmer, homology searches
• Predicted RNA Genes
• MySQL database using the Chado schema
• Genome viewer/editor
• Translation
• Flat files of annotation information
• Searches:
  • Pairwise BER searches against UniRef100
  • HMM searches against Pfam and TIGRFAM
  • Motif searches with LipoP, TMHMM, PROSITE
  • NCBI COGs
  • PRIAM profiles
• Automatic Annotation using the evidence hierarchy of Pfunc

Some concluding themes…
• The best annotation comes from looking at multiple sources of evidence
• It is important to track and check the evidence used in an annotation
• Do not assume the annotation you see on a protein is correct unless it comes from a trusted source
• Always err on the side of under-annotating rather than over-annotating
• Consider using UniProt (UniRef) for searches, not NCBI nr, simply for the depth of information it provides.
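
As a closing illustration of keeping annotations only as specific as the evidence allows, the trusted/noise cutoff rule from the HMM slides can be sketched in Python. This is a minimal sketch and not part of the original slides: the names `Membership` and `classify_hmm_hit` are hypothetical, and the handling of a score exactly equal to a cutoff (equal to trusted counts as a member, equal to noise counts as uncertain) is an assumption, since the slides only describe scores above, below, and in-between.

```python
from enum import Enum


class Membership(Enum):
    MEMBER = "member of the family"
    UNCERTAIN = "may or may not be a member"
    NOT_MEMBER = "not a member of the family"


def classify_hmm_hit(score: float, trusted: float, noise: float) -> Membership:
    """Compare a protein's HMM score with the model's own cutoffs.

    Only the position of the score relative to the two cutoffs matters;
    the absolute values (high, low, or negative) carry no meaning.
    """
    if noise > trusted:
        raise ValueError("noise cutoff must not exceed trusted cutoff")
    if score >= trusted:  # assumption: a tie with trusted counts as a member
        return Membership.MEMBER
    if score < noise:
        return Membership.NOT_MEMBER
    return Membership.UNCERTAIN  # between noise and trusted


# Hypothetical scores and cutoffs, chosen only for illustration.
print(classify_hmm_hit(120.0, trusted=100.0, noise=0.0).value)
print(classify_hmm_hit(50.0, trusted=100.0, noise=0.0).value)
print(classify_hmm_hit(-60.0, trusted=100.0, noise=-50.0).value)
print(classify_hmm_hit(-20.0, trusted=-50.0, noise=-80.0).value)
```

The last call shows the negative-cutoff case: a score of -20 is still above a trusted cutoff of -50, so the protein is treated as a family member. A hit classed as uncertain would call for a less specific name or for additional evidence before annotating.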