Genome sequencing and assembly

Mayo/UIUC Summer Course in Computational Biology
Session Outline
Planning a genome sequencing project
Assembly strategies and algorithms
Assessing the quality of the assembly
Assessing the quality of the assemblers
Genome annotation
Genome sequencing
Schematic overview of genome assembly. (a) DNA is
collected from the biological sample and sequenced.
(b) The output from the sequencer consists of many
billions of short, unordered DNA fragments from
random positions in the genome. (c) The short
fragments are compared with each other to discover
how they overlap. (d) The overlap relationships are
captured in a large assembly graph shown as nodes
representing kmers or reads, with edges drawn
between overlapping kmers or reads. (e) The
assembly graph is refined to correct errors and
simplify into the initial set of contigs, shown as large
ovals connected by edges. (f) Finally, mates, markers
and other long-range information are used to order
and orient the initial contigs into large scaffolds, as
shown as thin black lines connecting the initial
Schatz et al. Genome Biology 2012 13:243
Planning a genome sequencing project
How large is my genome?
How much of it is repetitive, and what is the repeat
size distribution?
Is a good quality genome of a related species
What will be my strategy for performing the assembly?
How large is my genome?
The size of the genome can be estimated from the
ploidy of the organism and the DNA content per cell
This will affect:
» How many reads will be required to attain sufficient
coverage (typically 10x to 100x)
» What sequencing technology to use
» What computational resources will be needed
Repetitive sequences
Most common source of assembly errors
If sequencing technology produces reads > repeat
size, impact is much smaller
Most common solution: generate mate pairs with
spacing > largest known repeat
Assemblies can collapse around repetitive sequences.
Salzberg S L , and Yorke J A Bioinformatics 2005;21:43204321
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions,
please email: [email protected]
Mis-assembly of repetitive sequence
Schatz M C et al. Brief Bioinform 2013;14:213-224
Genome(s) from related species
Preferably of good quality, with large reliable scaffolds
Help guiding the assembly of the target species
Help verifying the completeness of the assembly
Can themselves be improved in some cases
But to be used with caution – can cause errors when
architectures are different!
Strategies for assembly
The sequencing approaches and assembly strategies
are interdependent!
» E.g., for bacterial genome assembly, can generate 454
sequence reads and assemble with Newbler, or
generate Illumina reads and assemble with Velvet
» Optimal sequencing strategies very different for a
SOAPdenovo or an ALLPATHS-LG assembly
Typical sequencing strategies
Bacterial genome:
» Shotgun or mate-pair >500nt reads from 454 machine at 25x coverage
(~150,000 reads for 3 MB genome), assembly with Newbler
» PacBio CLR sequences at 200x coverage, self-correction and/or hybrid
correction and assembly using Celera Assembler or PBJelly
Vertebrate genome:
» Combination paired-end (180 nt fragments) and mate-pair (1, 3 and 10
kb libraries) 100 nt reads from Illumina machine at 100x coverage (~1B
reads for 1 GB genome), assembly with ALLPATHS-LG
Mate-pair library preparation from 454 (left) and Illumina (right)
Additional useful data
Fosmid libraries
» End sequencing adds long-range contiguity information
» Pooled fosmids (~5000) can often be assembled more efficiently
Moleculo libraries
» New technology acquired by Illumina, allows generation of fully
assembled 10 kb sequences
Pacbio reads
» Provide 1-3 kb reads, but need parallel coverage by Illumina data
for error correction
Assembly strategies and algorithms
In all cases, start with cleanup and error correction of
raw reads
For long reads (>500 nt), Overlap/Layout/Consensus
(OLC) algorithms work best
For short reads, De Bruijn graph-based assemblers
are most widely used
Cleaning up the data
Trim reads with low quality calls
Remove short reads
Correct errors:
» Find all distinct k-mers (typically k=15) in
input data
» Plot coverage distribution
» Correct low-coverage k-mers to match highcoverage
» Part of several assemblers, also standalone Quake of khmer programs
Main entity: read
Relationship between reads: overlap
OLC assembly steps
Calculate overlays
» Can use BLAST-like method, but finding common kmers more efficient
Assemble layout graph, try to simplify graph and
remove nodes (reads)
Generate consensus from the alignments between
reads (overlays)
Some OLC-based assemblers
Celera Assembler with the Best Overlap Graph
» Designed for Sanger sequences, but works with 454
and error-corrected PacBio reads
Newbler, a.k.a. GS de novo Assembler
» Designed for 454 sequences, but works with Sanger
De Bruijn graphs - concept
Converting reads to a De Bruijn graph
Reads are 7 nt long
Graph with k=3
Deduced sequence (main branch)
DBG implementation in the Velvet assembler
Examples of DBG-based assemblers
EULER (P. Pevzner), the first assembler to use DBG
Velvet (D. Zerbino), a popular choice for small
SOAPdenovo (BGI), widely used by BGI and for
relatively unstructured assemblies
ALLPATHS-LG, probably the most reliable assembler
for large genomes (but with strict input requirements)
Anatomy of a WGS Assembly
STS-mapped Scaffolds
Read pair (mates)
Gap (mean & std. dev. Known)
Reads (of several haplotypes)
External “Reads”
Pairs Give Order & Orientation
Assembly without pairs results
in contigs whose order and
orientation are not known.
Consensus (15- 30Kbp)
Pairs, especially groups of corroborating
ones, link the contigs into scaffolds where
the size of gaps is well characterized.
Mean & Std.Dev.
is known
Assembly gaps
Physical gaps
Sequencing gaps
sequencing gap - we know the order and orientation of the contigs and have at
least one clone spanning the gap
physical gap - no information known about the adjacent contigs, nor about the DNA
spanning the gap
Repeats often split genome into contigs
Contig derived from unique sequences
Reads from multiple repeats
collapse into artefactual contig
Handling repeats
1. Repeat detection
pre-assembly: find fragments that belong to repeats
statistically (most existing assemblers)
repeat database (RepeatMasker)
Reputer, RepeatMasker
"unhappy" mate-pairs (too close, too far, mis-oriented)
during assembly: detect "tangles" indicative of repeats (Pevzner,
Tang, Waterman 2001)
post-assembly: find repetitive regions and potential misassemblies.
2. Repeat resolution
find DNA fragments belonging to the repeat
determine correct tiling across the repeat
Obtain long reads spanning repeats
How good is my assembly?
How much total sequence is in the assembly relative to
estimated genome size?
How many pieces, and what is their size distribution?
Are the contigs assembled correctly?
Are the scaffolds connected in the right order /
How were the repeats handled?
Are all the genes I expected in the assembly?
N50: the most common measure of assembly
N50 = length of the shortest
contig in a set making up 50%
of the total assembly length
Order and orientation of contigs – more errors in one assembly than in another
CEGMA: conserved eukaryotic gene sets
From Ian Korf’s group, UC Davis
Mapping Core Eukaryotic Genes
Coverage is indicative of quality
and completeness of assembly
Even the best genomes are not perfect
There is no such thing as a “perfect” assembler (results from GAGE competition)
The computational demands and effectiveness of assemblers are very different
Assessing assembly strategies
Assemblathon (UC Davis and UC Santa Cruz)
» Provide challenging datasets to assemble in open competition (synthetic for
edition 1, real for edition 2)
» Assess competitor assemblies by many different metrics
» Publish extensive reports
GAGE (U. of Maryland and Johns Hopkins)
» Select datasets associated with known high-quality genomes
» Run a set of open source assemblers with parameter sweeps on these
» Compare the results, publish in scholarly Journals with complete
documentation of parameters
Some advice on running assemblies
Perform parameter sweeps
» Use many different values of key parameters, especially k-mer size for DBG
assemblers, and evaluate the output (some assemblers can do this
Try different subsets of the data
» Sometimes libraries are of poor quality and degrade the quality of the assembly
» Artefacts in the data (e.g. PCR duplicates, homopolymer runs, …) can also
badly affect output quality
Try more than one assembler
» There is no such thing as “the best” assembler
Genome annotation
A genome sequence is useless without annotation
Three steps in genome annotation:
» Find features not associated with protein-coding genes (e.g.
tRNA, rRNA, snRNA, SINE/LINE, miRNA precursors)
» Build models for protein-coding genes, including exons,
coding regions, regulatory regions
» Associate biologically relevant information with the genome
features and genes
Methods for genome annotation
Ab initio, i.e. based on sequence alone
» INFERNAL/rFAM (RNA genes), miRBase (miRNAs),
RepeatMasker (repeat families), many gene prediction
algorithms (e.g. AUGUSTUS, Glimmer, GeneMark, …)
» Require transcriptome data for the target organism (the more
the better)
» Align cDNA sequences to assembled genome and generate
gene models: TopHat/Cufflinks, Scripture
Methods for biological annotation
BLAST of gene models against protein databases
» Sequence similarity to known proteins
InterProScan of predicted proteins against databases
of protein domains (Pfam, Prosite, HAMAP,
Mapping against Gene Ontology terms (BLAST2GO)
MAKER, integration framework for genome
MAKER runs many software
tools on the assembled
genome and collates the
For this slide deck I “borrowed” figures and slides from
many publications, Web pages and presentations by
» M. Schatz, S. Salzberg, K. Bradnam, K. Krampis, D.
Zerbino, J. J. Cook, M. Pop, G. Sutton
Thank you!

similar documents