Metagenomics - Stanford AI Lab

Metagenomics and the microbiome
What is metagenomics?
 Looking at microorganisms via genomic sequencing
rather than culturing
 Environmental use case: ag, biofuels, pollution
 Health use case: The human microbiome
 You = 10^13 of your own cells + 10^14 bacterial cells
 More actionable genomics
Why care about microbiome?
 Diagnostic or modulatory implications in:
 Obesity, Diabetes, Fatigue, Pain disorders
 Anxiety, Depression, Autism
 Antibiotic resistant bacteria
 IBD and other gut disorders
 Cardiac function, cancer
Diseases and the microbiome
Source: The human microbiome: at the interface of health and disease. Nature Reviews Genetics
Why care about microbiome?
Publications containing ‘microbiome’ by date on Science Direct
Goal 1: Composition
Source: The human microbiome: at the interface of health and disease, Nature Reviews Genetics
Diversity measures
 Alpha diversity: how diverse is this population?
Simpson’s index, Shannon’s index, etc
 Difference in alpha diversity before and after
 Beta diversity: Taxonomical similarity between 2
 Finding compositional associations between disease
cohort and microbial makeup
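The diversity measures above can be sketched in a few lines. This is a minimal illustration, not a method taken from the slides: it assumes the Gini-Simpson form of Simpson's index, uses Bray-Curtis dissimilarity as one common choice of beta diversity, and runs on made-up read counts per taxon.

```python
import math

def shannon_index(counts):
    """Shannon's index H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i^2): chance two random reads differ in taxon."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples listed in the same taxon order."""
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1.0 - 2.0 * shared / (sum(counts_a) + sum(counts_b))

# Hypothetical read counts per taxon for two samples (made-up numbers).
sample_a = [50, 30, 15, 5]    # dominated by one taxon
sample_b = [25, 25, 25, 25]   # perfectly even

print("alpha A:", shannon_index(sample_a), simpson_index(sample_a))
print("alpha B:", shannon_index(sample_b), simpson_index(sample_b))
print("beta A vs B:", bray_curtis(sample_a, sample_b))
```

The even sample scores higher on both alpha measures, and the beta value is 0 for identical samples and 1 for samples that share no taxa.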
Sequencing for diversity
 Pyrosequencing the 16S ribosomal RNA subunit
 < 10 taxa appear in > 95% of people in HMP
 Recall the implicated diseases. This looks like the GWAS
picture: common disease, small effect size + common
disease, rare variant
Goal 2: Functional profiling
Source: The human microbiome: at the interface of health and disease. Nature Reviews Genetics
Functional profiling
 Current: Which genes are present and are being
 In development: proteomics, metabolomics
Sequencing for function
 Whole microbiome sequencing
 Avoids primer biases and is more kingdom agnostic
 Assembly is hard, especially where reference
genomes don’t exist
Two big problems
 Can’t understand the body without understanding
the microbiome
 Can’t understand the microbiome by only looking at bacteria
 Read fragment assembly is very very hard in
Kingdom-Agnostic Metagenomics
The players in your body
 Your cells
 Metabolites
 Bacteria
 Bacteriophages
 Other viruses
 Fungi
That’s not complexity
Source: A comprehensive map of the toll‐like receptor signaling network. Molecular Systems Biology
Prokaryotic virome: bacteriophages
 Infect prokaryotic bacteria
 Transfer genetic material among prokaryotic
 Rapidly evolving
 Put constant selection pressure on bacterial
Bacteriophages: deep sequencing results
 60% of sequences dissimilar from all sequence
 More than 80% come from 3 families
 Little intrapersonal variation
 Large interpersonal variation, even among relatives
 Diet affects community structure
 Antibiotic resistance genes found in viral material
Bacteriophages and function
 Cross the intestinal barrier possibly affecting
systemic immune response
 Adhere to mucin glycoproteins potentially causing
immune response in gut epithelium
 IBD/Crohn’s: relative increase in Caudovirales
 Affect bacterial composition and/or host directly
Eukaryotic virome
 Fecal samples from healthy children show a complex
community of typically pathogenic viruses
 Includes plant RNA viruses from food
 Anelloviruses and circoviruses present in nearly
100% by age 5, likely from industrial ag
Eukaryotic viruses and function
 Simian immunodeficiency virus (SIV) experiment showed
enteric virome expansion
 Increased gut permeability and caused intestinal
lining inflammation
 Acute diarrhea subjects showed novel viruses and
highly divergent viruses with less than 35%
similarity to catalogued viruses at amino acid level
 Fungi, protozoa, and helminths (worms)
 No experiments conducted with sampling to
saturation, much more work to be done
 18S sequencing showed 66 genera of fungi in gut
and fungi were found in 100% of samples
 Most subjects had less than 10 genera
 But high fungal diversity is bad: increases in IBD,
increases with antibiotic usage
But it’s very hard
 Amplicon-based methods don’t work well for viruses
 Heterogeneous sample-prep is required
 Large differences in genome sizes from a few kb in
viruses to 100+Mb in fungi
 Small genomes+divergence require lots of coverage
to get contigs
Getting the whole picture
Source: Meta'omic Analytic Techniques for Studying the Intestinal Microbiome. Gastroenterology.
The assembly problem
Isn’t assembly easy?
 Recall: 500-1000 species of bacteria in the gut, but
about 30 of them make up 99% of composition
 33% of bacterial microbiome not well-represented
in reference databases, > 60% for bacteriophages
 Coverage C = LN/G: mean number of reads per base
 L=read length, N=number of reads, G=genome size
 Problem: with 2nd-gen WMS technologies, L is low
and G is astronomical or unknown
 Thus, “full or sometimes even adequate coverage
may be unattainable”
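To make the coverage argument concrete, here is a small sketch using the standard relation C = LN/G. The fraction-covered estimate assumes the idealized Poisson (Lander-Waterman) model, and all numbers are hypothetical, chosen only to show how coverage collapses for a rare community member.

```python
import math

def mean_coverage(read_length, num_reads, genome_size):
    """Mean number of reads covering each base: C = L * N / G."""
    return read_length * num_reads / genome_size

def fraction_covered(coverage):
    """Idealized (Poisson / Lander-Waterman) fraction of bases hit by >= 1 read."""
    return 1.0 - math.exp(-coverage)

def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads N required to reach a target mean coverage C."""
    return math.ceil(target_coverage * genome_size / read_length)

# Hypothetical numbers: 100 bp reads, one 5 Mb bacterial genome.
L, N, G = 100, 1_000_000, 5_000_000
c_isolate = mean_coverage(L, N, G)
# Same run, but the organism contributes only 0.1% of the reads in a mixed sample.
c_rare = mean_coverage(L, N * 0.001, G)

print("isolate:", c_isolate, fraction_covered(c_isolate))
print("rare member:", c_rare, fraction_covered(c_rare))
print("reads for 20x:", reads_needed(20, L, G))
```

Under these assumptions the isolate is sequenced at 20x, while the rare member sits at about 0.02x, so only around 2% of its bases are expected to be seen at all.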
Source: A primer on metagenomics
Sequence length and discovery
Source: A primer on metagenomics
All is not lost
Can use rarefaction curves to estimate our coverage
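A rarefaction curve can be approximated by repeated subsampling. The sketch below assumes each read has already been assigned a taxon label, and the community composition is invented for illustration. It reports the mean number of distinct taxa observed at each sequencing depth.

```python
import random

def rarefaction_curve(taxon_labels, depths, trials=20, seed=0):
    """Mean number of distinct taxa seen when subsampling reads without replacement."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        richness = [len(set(rng.sample(taxon_labels, depth))) for _ in range(trials)]
        curve.append((depth, sum(richness) / trials))
    return curve

# Hypothetical community: one taxon label per read, with a long tail of rare taxa.
reads = ["A"] * 600 + ["B"] * 300 + ["C"] * 80 + ["D"] * 15 + ["E"] * 5

for depth, richness in rarefaction_curve(reads, [10, 50, 100, 500, 1000]):
    print(depth, round(richness, 2))
```

If the curve has flattened by the deepest subsample, additional sequencing is unlikely to reveal many new taxa. If it is still climbing, the sample is under-covered.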
All is not lost
 For composition analysis the phylogenetic marker
regions (18S, 16S) work pretty well
 For functional analysis: can still find ORFs fairly
reliably and can be aligned to homologs in
 Barring this, clustering and motif-finding yield some
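As an illustration of why ORF finding still works without a reference genome, here is a toy scanner. It is a deliberate simplification and not a production gene finder: it looks only at the forward strand, assumes ATG starts and the three standard stop codons, and the `min_codons` threshold is an arbitrary illustrative parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_codons=30):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; the length check counts
    codons from the ATG up to, but not including, the stop.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs

# Made-up contig: 2 bp of padding, ATG, four GCT codons, a stop, 2 bp of padding.
contig = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
print(find_orfs(contig, min_codons=3))
```

A real pipeline would also scan the reverse complement and then align the translated ORFs against known protein families.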
Different sequencing approaches?
 Single-cell microfluidics in the future
 Now: hybrid long/short read approaches.
“finishing” with Sanger sequencing
 Pacific Biosciences SMRT approach
 SMRT errors are random, unbiased
 De novo assembly is 99.999% concordant with
reference genomes
assembly algorithm
Select longest reads as seeds
Use seed reads to recruit short reads
Assemble using off-the-shelf assembly tools
Refine assembly using sequencer metadata
Source: Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods
Seed selection
 Order reads according to length
 Considering reads above length L ~ 6kb
 Rough end-pair align reads until ~20x coverage is
 17.7k seed reads, averaging 7.2kb in length, already
at 86.9% accuracy compared to reference
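The seed-selection idea can be sketched as a greedy pass over read lengths. This is a simplified stand-in rather than the published procedure: it counts bases against an assumed genome size instead of aligning reads, and the function name and parameters are hypothetical.

```python
def select_seed_reads(read_lengths, genome_size, target_coverage=20.0, min_length=6000):
    """Pick the longest reads as seeds until they give ~target_coverage of the genome.

    Stops early if the remaining reads are shorter than min_length.
    Returns the chosen read lengths and the coverage they provide.
    """
    seeds = []
    total_bases = 0
    for length in sorted(read_lengths, reverse=True):   # longest first
        if length < min_length or total_bases >= target_coverage * genome_size:
            break
        seeds.append(length)
        total_bases += length
    return seeds, total_bases / genome_size

# Made-up read lengths (bp) and a deliberately tiny "genome" so the numbers are easy to follow.
lengths = [10000, 8000, 7000, 6500, 3000, 2000]
seeds, coverage = select_seed_reads(lengths, genome_size=1000, target_coverage=20)
print(seeds, coverage)
```

With these toy inputs the three longest reads already exceed the 20x target, so the 6.5 kb read is never used as a seed.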
Recruiting short reads
 Align all reads to the seed reads
 Each read can be mapped to multiple seed reads,
controlled by the -bestn parameter
 -bestn must be chosen so that the coverage of seeds +
short aligned reads is about equal to the expected
coverage of the sequenced genome
 Use MSA and consensus to error correct long reads
 Result is 17.2k reads of length 5.7kb with 99.9% accuracy
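The consensus step can be illustrated with a per-column majority vote. This sketch assumes the recruited reads have already been aligned to the seed and padded with gap characters, which is the hard part a real aligner handles. The sequences are invented, and this is not the pipeline's actual consensus code.

```python
from collections import Counter

def column_consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over equal-length, already-aligned reads."""
    consensus = []
    for column in zip(*aligned_reads):
        base, _ = Counter(column).most_common(1)[0]   # most frequent symbol in this column
        if base != gap:                               # drop columns where a gap wins
            consensus.append(base)
    return "".join(consensus)

# Hypothetical alignment: a noisy seed read plus three recruited reads.
alignment = [
    "ACGATG-A",   # seed: one substitution and one deletion
    "ACGTTGCA",
    "ACGTTGCA",
    "ACCTTGCA",   # recruited read with its own substitution
]
print(column_consensus(alignment))
```

Because the errors fall in different columns, each one is outvoted and the consensus recovers `ACGTTGCA`, which is the property that random, unbiased errors make possible.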
Overlap layout consensus assembly
Source: Overview of Genome Assembly Algorithms. Ntino Krampis.
 Use Quiver algorithm which looks at raw physical
data from sequencer
 Uses an HMM and observed data to classify
base calls as genuine or spurious
 Do a final consensus alignment, conditioned on
Quiver’s probabilities
 Final result: 17.2k reads, length of 5.7kb, accuracy
of 99.999506%
 Most of the cells in your body aren’t yours
 But looking at bacteria alone is insufficient
 Expanding our view causes us to look for needles in
haystacks which is beyond most conventional
 Motif-finding and hybrid approaches will work until
3rd gen sequencing arrives
Cho, Ilseung, and Martin J. Blaser. "The human microbiome: at the interface of
health and disease." Nature Reviews Genetics 13.4 (2012): 260-270.
Wooley, John C., Adam Godzik, and Iddo Friedberg. "A primer on metagenomics."
PLoS computational biology 6.2 (2010): e1000667.
Chin, Chen-Shan, et al. "Nonhybrid, finished microbial genome assemblies from
long-read SMRT sequencing data." Nature methods 10.6 (2013): 563-569.
Human Microbiome Project Consortium. "Structure, function and diversity of the
healthy human microbiome." Nature 486.7402 (2012): 207-214.
Norman, Jason M., Scott A. Handley, and Herbert W. Virgin. "Kingdom-agnostic
metagenomics and the importance of complete characterization of enteric
microbial communities." Gastroenterology 146.6 (2014): 1459-1469.
Morgan, X. C., and C. Huttenhower. "Meta'omic Analytic Techniques for Studying
the Intestinal Microbiome." Gastroenterology (2014).
