Chains & Nets, Conservation & Function

Report
CS173
Lecture 12: Chains & Nets, Conservation & Function
MW 11:00-12:15 in Beckman B302
Prof: Gill Bejerano
TAs: Jim Notwell & Harendra Guturu
http://cs173.stanford.edu [BejeranoWinter12/13]
1
Announcements
• HW2 Due Today
• As are project assignments
• Coming monday 2/25 lecture has been moved to LK101
(building next door – we’ll post instructions)
http://cs173.stanford.edu [BejeranoWinter12/13]
2
Inferring Genomic Mutations
From Alignments of Genomes
http://cs173.stanford.edu [BejeranoWinter12/13]
3
Terminology
Orthologs : Genes related via speciation (e.g. C,M,H3)
Paralogs: Genes related through duplication (e.g. H1,H2,H3)
Homologs: Genes that share a common origin
(e.g. C,M,H1,H2,H3)
Gene tree
single
ancestral
gene
Species tree
Speciation
Duplication
Loss
http://cs173.stanford.edu [BejeranoWinter12/13]
4
• What?
• Compare whole genomes
• Compare two genomes
• Within (intra) species
• Between (inter) species
• Compare genome to itself
• Why?
• Comparison reveals functional and neutral regions
• Homologous regions most often have similar functions
• Modification of functional regions can reveal
• Disease susceptibility
• Adaptation
• How?
http://cs173.stanford.edu [BejeranoWinter12/13]
5
Every Genome is Different
DNA Replication is imperfect – between individuals of the
same species, even between the cells of an individual.
junk
functional
...ACGTACGACTGACTAGCATCGACTACGA...
chicken
egg
TT
CAT
...ACGTACGACTGACTAGCATCGACTACGA...
“anything
goes”
many changes
are not tolerated
chicken
http://cs173.stanford.edu [BejeranoWinter12/13]
6
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Similarity is often measured using “%id”, or percent identity
%id = number of matching bases / number of alignment columns
Where
Every alignment column is a match / mismatch / indel base
Where indel = insertion or deletion (requires an outgroup to resolve)
What to expect from genome comparisons?
human
lizard
Objective: find local alignment blocks, that are
likely homologous (share common origin)
O(mn) examine the full matrix using DP
O(m+n) heuristics based on seeding + extension
trades sensitivity for speed
http://cs173.stanford.edu [BejeranoWinter12/13]
8
“Raw” (B)lastz track (no longer displayed)
Alignment = homologous regions
Protease Regulatory Subunit 3
9
Chaining co-linear alignment blocks
human
lizard
Objective: find local alignment blocks, that are
likely homologous (share common origin)
Chaining strings together co-linear blocks in the
target genome to which we are comparing.
Double lines when there is unalignable sequence
in the other species. Single lines when there isn’t.
http://cs173.stanford.edu [BejeranoWinter12/13]
10
Reference genome perspective,
The Use of an Outgroup
Outgroup Sequence
A
B
C
Mouse Sequence
D
E
A
B
C
B’
D
E
Human Sequence
A
B
C
D
E
In Human Browser
Implicit
Human
sequence
Mouse
chains
B’
…
D
…
D
In Mouse Browser
E
E
Implicit
Mouse
sequence
Human
chains
…
…
D
E
11
Gap Types: Single vs Double sided
Ancestral Sequence
A
B
C
D
E
Human Sequence
A
B
C
D
E
Mouse Sequence
A
B
C
B’
D
E
In Human Browser
Implicit
Human
sequence
Mouse
chains
B’
…
D
…
D
In Mouse Browser
E
E
Implicit
Mouse
sequence
Human
chains
…
…
D
E
12
Conservation Track Documentation
http://cs173.stanford.edu [BejeranoWinter12/13]
13
Chains join together related local alignments
likely ortholog
likely paralogs
shared domain?
Protease Regulatory Subunit 3
http://cs173.stanford.edu [BejeranoWinter12/13]
14
Note: repeats are a nuisance
mouse
human
If, for example, human and mouse have each 10,000 copies
of the same repeat:
We will obtain and need to output 108 alignments of all these
copies to each other.
Note that for the sake of this comparison interspersed repeats
and simple repeats are equal nuisances.
Also note that simple repeats, but not interspersed repeats,
violate the assumption that similar sequences are homologous.
Solution:
1 Discover all repetitive sequences in each genome.
2 Mask them when doing genome to genome comparison.
3 Chain your alignments.
4 Add back to the alignments only repeat matches that lie within
pre-computed chains.
http://cs173.stanford.edu [BejeranoWinter12/13]
15
Chains
• a chain is a sequence of gapless aligned blocks, where there must be
no overlaps of blocks' target or query coords within the chain.
• Within a chain, target and query coords are monotonically nondecreasing. (i.e. always increasing or flat)
• double-sided gaps are a new capability (blastz can't do that) that
allow extremely long chains to be constructed.
• not just orthologs, but paralogs too, can result in good chains.
but that's useful!
• chains should be symmetrical -- e.g. swap human-mouse -> mousehuman chains, and you should get approx. the same chains as if you
chain swapped mouse-human blastz alignments.
• chained blastz alignments are not single-coverage in either target or
query unless some subsequent filtering (like netting) is done.
• chain tracks can contain massive pileups when a piece of the target
aligns well to many places in the query. Common causes of this
include insufficient masking of repeats and high-copy-number genes
(or paralogs).
[Angie Hinrichs, UCSC wiki]
http://cs173.stanford.edu [BejeranoWinter12/13]
16
Before and After Chaining
http://cs173.stanford.edu [BejeranoWinter12/13]
17
Chaining Algorithm
Input - blocks of gapless alignments from blastz
Dynamic program based on the recurrence relationship:
score(Bi) = max(score(Bj) + match(Bi) - gap(Bi, Bj))
j<i
Uses Miller’s KD-tree algorithm to minimize which parts of dynamic
programming graph to traverse. Timing is O(N logN), where N is
number of blocks (which is in hundreds of thousands)
See [Kent et al, 2003]
“Evolution's cauldron: Duplication,
deletion, and rearrangement in the
mouse and human genomes”
http://cs173.stanford.edu [BejeranoWinter12/13]
18
Netting Alignments
Commonly multiple mouse alignments can be found for a particular
human region, particularly for coding regions.
Net finds best match mouse match for each human region.
Highest scoring chains are used first.
Lower scoring chains fill in gaps within chains inducing a natural
hierarchy.
http://cs173.stanford.edu [BejeranoWinter12/13]
19
Net highlights rearrangements
A large gap in the top level of the net is filled by an
inversion containing two genes. Numerous smaller
gaps are filled in by local duplications and processed
pseudo-genes.
http://cs173.stanford.edu [BejeranoWinter12/13]
20
A Rearrangement Hot Spot
Rearrangements are not evenly distributed. Roughly 5%
of the genome is in hot spots of rearrangements such as
this one. This 350,000 base region is between two very
long chains on chromosome 7.
http://cs173.stanford.edu [BejeranoWinter12/13]
21
Nets attempt to capture the ortholog
(they also hide everything else)
http://cs173.stanford.edu [BejeranoWinter12/13]
22
Retroposed Genes and Pseudogenes
Pseudogenes (“dead genes”):
Genomic sequences that
resemble (originated from) genes
that no longer make proteins.
Retrogenes (“retrotranscribed”):
Protein coding RNA that was
reverse transcribed and inserted
back into the genome.
The RNA can be grabbed at any
stage (partial/full transcript,
before/during/after all introns are
spliced).
http://cs173.stanford.edu [BejeranoWinter12/13]
23
Useful in finding pseudogenes
gene
pred.
Ensembl and Fgenesh++ automatic gene predictions
confounded by numerous processed pseudogenes.
Domain structure of resulting predicted protein must
be interesting!
http://cs173.stanford.edu [BejeranoWinter12/13]
24
Nets/chains can reveal retrogenes (and when they jumped in!)
http://cs173.stanford.edu [BejeranoWinter12/13]
25
Nets
• a net is a hierarchical collection of chains, with the highest-scoring
non-overlapping chains on top, and their gaps filled in where possible
by lower-scoring chains, for several levels.
• a net is single-coverage for target but not for query.
• because it's single-coverage in the target, it's no longer symmetrical.
• the netter has two outputs, one of which we usually ignore: the targetcentric net in query coordinates. The reciprocal best process uses
that output: the query-referenced (but target-centric / target singlecov) net is turned back into component chains, and then those are
netted to get single coverage in the query too; the two outputs of that
netting are reciprocal-best in query and target coords. Reciprocalbest nets are symmetrical again.
• nets do a good job of filtering out massive pileups by collapsing them
down to (usually) a single level.
• GB: for human inspection always prefer looking at the chains!
[Angie Hinrichs, UCSC wiki]
http://cs173.stanford.edu [BejeranoWinter12/13]
26
Before and After Netting
http://cs173.stanford.edu [BejeranoWinter12/13]
27
Convert / LiftOver
"LiftOver chains" are actually chains extracted from nets, or chains
filtered by the netting process.
LiftOver – batch utility
http://cs173.stanford.edu [BejeranoWinter12/13]
28
What nets can’t show, but chains will
http://cs173.stanford.edu [BejeranoWinter12/13]
29
Same Region…
same in all
the other fish
http://cs173.stanford.edu [BejeranoWinter12/13]
30
Drawbacks
• Inversions not handled optimally
Chains
> > > > chr1 > > >
> > > > chr1 > > >
< < < < chr5 < < < <
< < < < chr1 < < < <
Nets
> > > > chr1 > > >
> > > > chr1 > > >
< < < < chr5 < < < <
31
Drawbacks
• High copy number genes can break orthology
32
Self Chain reveals paralogs
(self net is
meaningless)
http://cs173.stanford.edu [BejeranoWinter12/13]
33
Conservation and Function
http://cs173.stanford.edu [BejeranoWinter12/13]
34
Evolution = Mutation + Selection
Mistakes can happen during DNA replication. Mistakes are
oblivious to DNA segment function. But then selection kicks in.
junk
functional
...ACGTACGACTGACTAGCATCGACTACGA...
chicken
egg
TT
CAT
...ACGTACGACTGACTAGCATCGACTACGA...
“anything
goes”
many changes
are not tolerated
chicken
Conservation implies function!
(But what function?)
http://cs173.stanford.edu [BejeranoWinter12/13]
35
Vertebrates: what to sequence?
, Stickleback
Which species to compare to?
Too close and purifying selection
will be largely indistinguishable
from the neutral rate.
Too far and many functional orthologs
will diverge beyond our ability
too far
to accurately align them.
sweet spot
too close
, Lizard
, Opossum
 you are here
[Human Molecular Genetics, 3rd Edition]
http://cs173.stanford.edu [BejeranoWinter12/13]
36
Searching Near And Far
Search too near (eg human to chimp or orang above) and you cannot
distinguish neutral sequence from sequence under purifying selection.
Search further still (eg mouse) and the two distributions pry apart.
But now you’ve lost younger functional sequences born after the split.
Ie, conservation implies function, but
lack of conservation does NOT imply lack of function!
http://cs173.stanford.edu [BejeranoWinter12/13]
37
Phylogenetic Shadowing
, Stickleback
, Lizard
, Opossum
“too close” can actually be a boon
if you have enough closely related genomes
too close
 you are here
http://cs173.stanford.edu [BejeranoWinter12/13]
38
PhastCons Conserved Elements
http://cs173.stanford.edu [BejeranoWinter12/13]
Distant homologies
When species diverge too much (e.g. chicken and beyond above),
confident alignments can no longer be detected at the DNA level.
E.g.: all SPI1 and SLC39A13 exons are there in chicken & fish.
http://cs173.stanford.edu [BejeranoWinter12/13]
40
Distant homologies search strategies
Here it is much better to search
a gene model from species A
(e.g human)
against the genome of species
B (e.g. chicken)
This is a search of amino acids
in all their possible codons into
a gene structure with unknown
exon – intron structure.
(eg TBLASTN, translated BLAT)
http://cs173.stanford.edu [BejeranoWinter12/13]
41
Distant homologies
Find the most distantly related
genes using gene models in
both species:
1 search amino acids sequences
against each other. (eg using
BLASTP).
2 Map your hits back to the two
respective genomes, anchored
on the amino acid alignment
(respecting any exon-intron
gene body structure change).
3 Examine co-linear homology of
flanking genes to try and call
orthologs from paralogs.
http://cs173.stanford.edu [BejeranoWinter12/13]
42
RNA homology searches
1 Define a mathematical construct that describe potential homologs.
2 Go search for them (efficiently!). 3 Examine genomic context.
http://cs173.stanford.edu [BejeranoWinter12/13]
43
Enhancer remote homologs
Enhancer =
Gene regulatory sequences in general are the most challenging
to search for:
• Individual binding sites are very flexible.
• Gaps between binding sites may evolve (semi) neutrally,
making DNA alignment seeding particularly frail.
• Binding site gain/loss and shuffling may or may not be
allowed – we need a better understanding of underlying logic.
http://cs173.stanford.edu [BejeranoWinter12/13]
44
Exceptionally Old Enhancers Exist
But how many of these really exist?
http://cs173.stanford.edu [BejeranoWinter12/13]
45
Ultraconservation: No known function requires this much conservation
*
?
CDS
*
ncRNA
*
*
*
TFBS
seq.
http://cs173.stanford.edu [BejeranoWinter12/13]
46
“Gene” Finding III: Comparative Genomics
http://cs173.stanford.edu [BejeranoWinter12/13]
47
The challenge: map code to output
3*109
1013 cells
letters
genome
person
Ultimately we sequence genomes, and study their function
in detail to understand genome to phenotype relationships:
• Minus side: Genomic contribution to disease
• Plus side: Adaptation and speciation
To be continued…
http://cs173.stanford.edu [BejeranoWinter12/13]
48

similar documents