PowerPoint presentation - The Genome Analysis Centre

Added value of whole-genome sequence data
to genomic predictions in dairy cattle
Rianne van Binsbergen (1,2), Mario Calus (1), Chris Schrooten (3), Fred van Eeuwijk (2), Roel Veerkamp (1), Marco Bink (2)
(1) Animal Breeding & Genetics Centre, Wageningen UR (NL)
(2) Biometris, Wageningen UR (NL)
(3) CRV (cattle breeding company), Arnhem (NL)
Genomic Prediction in agricultural species
Reference population:
1) Estimate effects for each SNP (w)
2) Generate a prediction equation that combines all the marker genotypes with their effects to predict the breeding value of each individual
Apply prediction equation to a group of individuals that have genotypes but not phenotypes
→ Estimated genomic breeding values
→ Select the best individuals for breeding
Each SNP represented by a variable (x), which takes the values
0 [A A]
1 [A B]
2 [B B]
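As a minimal sketch of the coding and prediction equation described above, the Python snippet below maps genotypes to allele counts and sums genotype times effect over all SNPs for each individual. The genotype strings, the effect values in `w` and the helper names `encode` and `predict_gebv` are illustrative assumptions, not data or code from the study.

```python
import numpy as np

# Allele-count coding from the slide: A A -> 0, A B -> 1, B B -> 2
CODING = {"AA": 0, "AB": 1, "BA": 1, "BB": 2}

def encode(genotypes):
    """Turn per-individual lists of genotype strings into a 0/1/2 matrix."""
    return np.array([[CODING[g] for g in row] for row in genotypes], dtype=float)

def predict_gebv(x, w):
    """Prediction equation: sum over SNPs of genotype (x) times SNP effect (w)."""
    return x @ w

# Hypothetical candidates with genotypes but no phenotypes (3 SNPs each)
candidates = [["AA", "AB", "BB"], ["BB", "BB", "AA"], ["AB", "AB", "AB"]]
# Hypothetical SNP effects, as if estimated in the reference population
w = np.array([0.5, -0.2, 0.1])

x = encode(candidates)
gebv = predict_gebv(x, w)
best = int(np.argmax(gebv))  # index of the candidate with the highest GEBV
```

In practice the effects `w` come from a model fitted on the reference population; here they are fixed by hand only to show how the equation is applied to candidates.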
Advantages:
• Select at early age (before phenotypes available)
• Save costs to phenotype candidates
• Increase accuracy of predicted Breeding Values
Goddard & Hayes (2009)
Nature Reviews Genetics 10:381
One seminal paper on Genomic Prediction
Simulation Study
• Dense marker maps
• SNP markers at 1cM density
Prediction Accuracy
• Least Squares method: 0.32
• Genomic BLUP method: 0.73
• Bayesian methods (A, B): 0.85
Conclusion:
“selection on genetic values predicted
from markers could substantially increase
the rate of genetic gain in animals and
plants, especially if combined with
reproductive techniques to shorten the
generation interval”
Another (seminal) paper on Genomic Prediction
“In the case of whole-genome sequence data,
the polymorphisms that are causing the
genetic differences between the individuals are
among those being analyzed.”
Higher accuracy in genomic predictions since causal mutation is included (assumption)
→ No dependency on LD
→ Persistency across generations
→ Genomic prediction across breeds
Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps
T. H. E. Meuwissen, B. J. Hayes and M. E. Goddard
“Only few SNPs were useful for
predicting the trait [because they
were in linkage disequilibrium (LD)
with mutations causing variation in
the trait] while many SNPs were
not useful.”
Genomic predictions from whole-genome sequence data
✓ Tremendous increase in number of SNPs (more noise)
✓ Large (sequence) data are required
Solution
✓ Sequence core set of individuals (e.g. founders)
✓ Impute whole-genome sequence genotypes of other individuals
Accuracy of imputation to
whole-genome sequence data
was generally high for
imputation from 777K SNP
panel
Van Binsbergen, et al. Genet Sel Evol
2014 (in press)
This presentation:
First results of genomic prediction with imputed whole-genome
sequence data for 5503 bulls with accurate phenotypes
Dataset: SNP genotypes & trait phenotypes
5503 Holstein Friesian bulls: 777K SNP genotypes (Illumina BovineHD BeadChip)
1000 bull genomes project: 429 bulls (multiple breeds), 28M SNP genotypes
Imputation - Beagle v4 software
5503 Holstein Friesian bulls: 12M SNP genotypes
• MAF > 0.005
• Imputation accuracy > 0.05
De-regressed progeny based proofs (DRP [1]) and associated effective daughter contributions (EDC [2])
• Somatic cell score (SCS)
• Interval between first and last insemination (IFL)
• Protein yield (PY)
[1] VanRaden et al. 2009 (J Dairy Sci)
[2] VanRaden and Wiggans 1991 (J Dairy Sci)
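The two SNP filters listed for the imputed data (MAF > 0.005 and imputation accuracy > 0.05) can be sketched as a boolean mask over SNP columns. This is only an illustration: the toy genotype matrix, the per-SNP accuracy vector `acc` and the function name `filter_snps` are assumptions, and how the imputation accuracy itself was measured is not specified here.

```python
import numpy as np

def filter_snps(x, imputation_accuracy, maf_min=0.005, acc_min=0.05):
    """Return a boolean mask of SNPs that pass both thresholds.

    x: (individuals x SNPs) matrix of 0/1/2 allele counts
    imputation_accuracy: one accuracy value per SNP (hypothetical input)
    """
    freq = x.mean(axis=0) / 2.0          # frequency of the counted allele
    maf = np.minimum(freq, 1.0 - freq)   # minor allele frequency
    return (maf > maf_min) & (imputation_accuracy > acc_min)

# Toy data: 4 animals, 4 SNPs
x = np.array([[0, 0, 2, 1],
              [0, 1, 2, 1],
              [0, 2, 2, 0],
              [0, 1, 2, 2]], dtype=float)
acc = np.array([0.90, 0.80, 0.95, 0.01])

keep = filter_snps(x, acc)
x_filtered = x[:, keep]  # SNP 1 and SNP 3 are monomorphic, SNP 4 is poorly imputed
```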
Prediction reliability
= squared correlation between original phenotype (DRP) and estimated genetic values (GEBV)
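A minimal sketch of this definition, assuming plain (unweighted) Pearson correlation computed over the validation animals. The numbers and the function name `prediction_reliability` are made up for illustration.

```python
import numpy as np

def prediction_reliability(drp, gebv):
    """Squared Pearson correlation between DRP and GEBV."""
    r = np.corrcoef(drp, gebv)[0, 1]
    return r ** 2

# Hypothetical values for four validation bulls
drp = np.array([1.0, 2.0, 3.0, 4.0])
gebv = np.array([1.1, 1.9, 3.2, 3.8])
rel = prediction_reliability(drp, gebv)
```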
5503 Holstein Friesian bulls, same split for both genotype sets:
777K SNP genotypes (Illumina BovineHD BeadChip)
  training population: 4322 old bulls
  validation population: 1181 young bulls
12M SNP genotypes (MAF > 0.005, imputation accuracy > 0.05)
  training population: 4322 old bulls
  validation population: 1181 young bulls
Validation population
✓ Youngest bulls with EDC ≥ 0
✓ Mainly sons of bulls in training population
✓ Mimics breeding practice
Genomic prediction – 2 methods
GBLUP
✓ Genome-enabled best linear unbiased prediction
✓ Distribution of QTL effects assumed to be close to infinitesimal model (all SNPs equally small effect)
✓ Build a genomic relationship matrix to model variance-covariance structure
BSSVS
✓ Bayes stochastic search variable selection
✓ Large number of SNPs with tiny (close to zero) and a few SNPs with moderate effects (= mixture of two Normal distributions)
Implementation via Markov chain Monte Carlo (MCMC) simulation algorithms (computer intensive)
3 chains of 60,000 cycles (10,000 cycles burn-in)
Calus M (2014). Right-hand-side updating for fast computing of genomic breeding values. Genetics Selection Evolution 46(1): 24.
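To make the GBLUP idea concrete, the sketch below builds a genomic relationship matrix and uses it to predict validation animals from training phenotypes. It is a sketch under stated assumptions, not the implementation used for these results: the scaling of G follows the commonly used form Z Z' / (2 Σ p(1 − p)), the variance ratio `lam` is treated as known, records are unweighted (no EDC weights), and all data are simulated.

```python
import numpy as np

def genomic_relationship_matrix(x):
    """G = Z Z' / (2 * sum(p * (1 - p))), with Z the allele counts centred by 2p."""
    p = x.mean(axis=0) / 2.0          # allele frequency per SNP
    z = x - 2.0 * p                   # centred genotypes
    return z @ z.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_predict(g, y_train, train_idx, val_idx, lam):
    """Predict genetic values of validation animals from training phenotypes.

    lam: ratio of residual to genetic variance (assumed known in this sketch).
    """
    g_tt = g[np.ix_(train_idx, train_idx)]   # relationships among training animals
    g_vt = g[np.ix_(val_idx, train_idx)]     # validation-by-training relationships
    mu = y_train.mean()                      # simple estimate of the overall mean
    alpha = np.linalg.solve(g_tt + lam * np.eye(len(train_idx)), y_train - mu)
    return mu + g_vt @ alpha

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(20, 200)).astype(float)  # simulated 0/1/2 genotypes
y = rng.normal(size=15)                                # simulated training phenotypes
g = genomic_relationship_matrix(x)
pred = gblup_predict(g, y, np.arange(15), np.arange(15, 20), lam=1.0)
```

Because G has one row and column per animal, its size does not grow with the number of SNPs; only the cost of building it does.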
Computation
            GBLUP            BSSVS (per MCMC chain)
777K SNP    HPC, 1 node      Windows, 1 CPU
            ~3 hours         ~5 days
            ~32 GB RAM       ~1.6 GB RAM
12M SNP     HPC, 12 nodes    HPC, 1 node
            ~6 hours         ~50 days
            ~600 GB RAM      ~32 GB RAM
BSSVS: 3 chains of 60,000 cycles (10,000 cycles burn-in)
Windows 7 Enterprise desktop pc: 32 CPU, 8 GB RAM/CPU (clock speed 2.60 GHz)
HPC Linux cluster: normal nodes, 64 GB/node (2.60 GHz); 2 fat nodes, 1 TB RAM/node (2.20 GHz)
Results: Prediction Reliability
[Figure: bar chart of reliability (axis 0.0 to 0.6) for SCS, IFL and PY; series: BovineHD GBLUP, BovineHD BSSVS, Sequence GBLUP, Sequence BSSVS *]
BSSVS: Average over 3 chains of 60,000 cycles (10,000 cycles burn-in)
* Based on 45,000 cycles
BSSVS: Convergence & SNP effects
[Figure: trace of variance of SNP effects and Bayes Factor for SNP effects, shown for 777K SNP and 12M SNP]
3 chains of 60,000 cycles (10,000 cycles burn-in)
Sequence: 45,000 cycles
Suitability of BSSVS model?
✓ Large number of SNPs with tiny and a few SNPs with moderate effects
● Sequence data: Really large number of SNPs with tiny effects → Captures too much signal?
✓ Another Bayesian Prediction Model: Bayes-C
● Large number of SNPs with NO effect and a few SNPs with moderate effects
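The difference between the two priors can be illustrated by drawing SNP effects from each. The sketch below is an assumption-laden illustration, not the parameterisation used in the actual software: the mixing proportion `pi`, the standard deviations and the function name `sample_effects` are hypothetical.

```python
import numpy as np

def sample_effects(n_snps, pi, sd_large, sd_small, rng):
    """Draw SNP effects from a two-component mixture.

    pi: proportion of SNPs in the 'moderate effect' component
    sd_small > 0  -> BSSVS-like prior (mixture of two Normal distributions)
    sd_small == 0 -> Bayes-C-like prior (remaining SNPs have exactly zero effect)
    """
    in_model = rng.random(n_snps) < pi
    sd = np.where(in_model, sd_large, sd_small)
    return rng.normal(0.0, 1.0, n_snps) * sd, in_model

rng = np.random.default_rng(1)
n = 100_000
bssvs_w, _ = sample_effects(n, pi=0.001, sd_large=1.0, sd_small=0.01, rng=rng)
bayesc_w, in_model = sample_effects(n, pi=0.001, sd_large=1.0, sd_small=0.0, rng=rng)

# Under the BSSVS-like prior every SNP carries some (tiny) effect, so the total
# contribution of the "tiny" class grows with the number of SNPs; under the
# Bayes-C-like prior that class contributes nothing.
n_nonzero_bssvs = int(np.count_nonzero(bssvs_w))
n_nonzero_bayesc = int(np.count_nonzero(bayesc_w))
```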
Concentrate on single chromosome (BTA 6)
MCMC convergence
[Figure: MCMC convergence panels for BSSVS and Bayes-C, shown for 777K SNP and 12M SNP]
Concentrate on single chromosome (BTA 6)
Signal of QTL effects
[Figure: QTL signal panels for BSSVS and Bayes-C, shown for 777K SNP and 12M SNP]
Reliability estimates
            BovineHD    Sequence
BSSVS       0.328       0.324
Bayes-C     0.328       0.325
Conclusions
✓ Genomic prediction using sequence data becomes reality
● However, sequence data requires intensive computation
→ Need for faster algorithms
✓ Use of Sequence Data did not improve Prediction reliability
● Convergence issues with BSSVS
→ Longer chains may yield better results
✓ BSSVS slightly better compared to GBLUP
✓ Preliminary results BTA6 hint that Bayes-C method may work better (than BSSVS) for sequence data
Next Steps: Did we bet on the wrong horse - named BSSVS?
✓ Review choice of priors in BSSVS model.
✓ Apply Bayes-C model to whole genome sequence data
Thanks!
Acknowledgments
1000 bull genomes project
(www.1000bullgenomes.com)
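De-regressed proofs (DRP)
$DRP = PA + (EBV - PA) \ast [\ldots]$
PA: parent average; EBV: estimated breeding value; the remaining factor is based on the effective daughter contribution
Effective daughter contribution (EDC)
$EDC = k \cdot REL / (1 - REL)$, with $k = (4 - h^2)/h^2$
REL: published reliability of EBV
$EDC_{prog} = EDC_{EBV} - EDC_{PA}$
$EDC_{PA}$: based on reliability of parents, $REL_{PA} = (REL_{sire} + REL_{dam})/4$
VanRaden et al. 2009 (J Dairy Sci)
VanRaden and Wiggans 1991 (J Dairy Sci)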
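A small worked example of the EDC arithmetic, reading the slide as EDC = k · REL / (1 − REL) with k = (4 − h²)/h², parent-average reliability as (REL_sire + REL_dam)/4, and the progeny contribution as the EDC of the EBV minus the EDC of the parent average. The last step is an interpretation of the garbled slide, and the heritability and reliability values are hypothetical.

```python
import math

def edc_from_reliability(rel, h2):
    """EDC = k * REL / (1 - REL), with k = (4 - h2) / h2."""
    if not 0.0 <= rel < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    k = (4.0 - h2) / h2
    return k * rel / (1.0 - rel)

def parent_average_reliability(rel_sire, rel_dam):
    """Reliability of the parent average from the parents' reliabilities."""
    return (rel_sire + rel_dam) / 4.0

h2 = 0.25                                          # hypothetical heritability, so k = 15
edc_ebv = edc_from_reliability(0.5, h2)            # published reliability of EBV = 0.5
rel_pa = parent_average_reliability(0.9, 0.3)      # hypothetical sire and dam reliabilities
edc_pa = edc_from_reliability(rel_pa, h2)
edc_prog = edc_ebv - edc_pa                        # assumed reading: contribution from progeny
```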
