Bioinformatics, Translational Informatics & Personalized Medicine

Bioinformatics, Translational Bioinformatics, Personalized Medicine
Uma Chandran, MSIS, PhD
Department of Biomedical Informatics
University of Pittsburgh
[email protected]
412-648-9326
07/17/2013
Outline of lecture
• What is Bioinformatics?
– Examples of bioinformatics
– Past to present
• What is translational bioinformatics?
• Personalized Medicine
– Bioinformatics and Personalized Medicine
What is Bioinformatics?
• http://en.wikipedia.org/wiki/Bioinformatics
• Application of
information technology to
molecular biology
• Databases
• Algorithms
• Statistical techniques
Bioinformatics examples
• Sequence analysis
• Genome annotation
• Evolutionary biology
• Literature analysis
• Analysis of gene expression
• Analysis of regulation
• Analysis of protein expression
• Analysis of mutations in cancer
• Comparative genomics
• Systems biology
• Image analysis
• Protein structure prediction
From Wikipedia
Early Bioinformatics
• Robert Ledley and
Margaret Dayhoff
– First bioinformaticians
– Using IBM 7090 and
punch card analyzed
amino acid structure of
proteins
– Created amino acid
scoring matrix
– Protein evolution
– Protein sequence
alignment
http://blog.openhelix.eu/?p=1078
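To make the idea of a scoring matrix and sequence alignment concrete, here is a minimal sketch of a global alignment score (the Needleman-Wunsch dynamic programming method, a later technique and not the original punch card analysis). The match, mismatch and gap values are illustrative assumptions; a real protein alignment would use an amino acid substitution matrix instead of a flat match/mismatch score.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Scores are toy values (assumptions), not a real substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACT"))
```

Swapping the flat match/mismatch score for a lookup in a substitution table is the only change needed to score protein sequences with a matrix of the kind described above.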
Sequence analysis
• Databases to store
sequence info
– Phage Φ-X174
sequenced in 1977
– GenBank
• 30,000 organisms
• 143 billion base pairs
– BLAST program for
sequence searching
• Algorithms, databases,
software tools
Evolutionary biology
• Compare relationships
between organisms by
comparing
– DNA sequences
– Now whole genomes
• Can even find single base
changes, duplication,
insertions, deletions
• Uses advanced
algorithms, programs and
computational resources
Literature mining
• Millions of articles in the
literature
• How to find meaningful
information
– Natural language processing
techniques
• Example
– Typing p53 or PTEN into PubMed retrieves 1000s of publications
– How to summarize all the information for a particular gene
– Function, disease, mutations, drugs
– iHOP database creates a network between genes and proteins for 30,000 genes
Genome annotation
• Marking genes and
other features in DNA
• Algorithms, software
Bioinformatics
• Interdisciplinary discipline
– Gene/proteins/function/ - Biologist
– In Cancer – Physician/Scientist/Biologist
– Algorithms, for example, BLAST – Math/CS
– Separate Signal from Noise, Diff gene expression,
correlation with disease – Statistician
– Tools, Software, Databases – Software developers,
programmers
• Aim to make sense of biological data
Translational bioinformatics
• Translational = benchside to bedside
– Bringing discoveries made at the benchside to clinical use
• “the development of storage, analytic, and interpretive methods to
optimize the transformation of increasingly voluminous biomedical
data into proactive, predictive, preventative, and participatory
health. Translational bioinformatics includes research on the
development of novel techniques for the integration of biological and
clinical data and the evolution of clinical informatics methodology to
encompass biological observations. The end product of translational
bioinformatics is newly found knowledge from these integrative
efforts that can be disseminated to a variety of stakeholders,
including biomedical scientists, clinicians, and patients.”
Atul Butte, JAMIA 2008;15:709-714 doi:10.1197
Central dogma
• DNA is transcribed to
RNA
• RNA is translated to
protein
• Many regulatory
processes control these
steps
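The two steps of the central dogma can be sketched in a few lines. This is a minimal illustration only: the codon table below is a small subset of the standard genetic code (an assumption made to keep the example short), the input is taken to be the coding strand, and unknown codons are reported as "X".

```python
# Small subset of the standard genetic code; "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "UGG": "W",
    "GAA": "E", "AAA": "K", "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(coding_strand_dna):
    """DNA (coding strand) to mRNA: same sequence with U in place of T."""
    return coding_strand_dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first base until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGTTTGGCTAA")
    print(mrna, translate(mrna))
```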
Molecular Biology Primer
• 20,000 genes
• Many transcripts, many proteins
• More than 20,000 proteins
• Southern, Northern, Western Blots
Biological questions
• DNA
• Mutation
– Are there any mutations
• Sickle cell anemia
• Cystic fibrosis
• Hemophilia
• Other diseases such as diabetes, cancer ??
– Polymorphisms
• Variation in the
population
DNA amplification
• Are there regions of
amplification or
deletions that correlate
with disease
– If so, what genes are
present in these regions
– HER2 amplification in
breast cancer
– EGFR mutations in lung
cancer
RNA
• RNA
– DNA is transcribed to RNA
– Approximately 20K genes
• RNA levels will differ in
different conditions
– Liver, kidney, cancer,
normal, treatment etc
– Diagnostic or prognostic microRNA levels
– lncRNAs
– Splicing differences
– mRNA
Clinical questions
• DNA level
– Are there mutations or polymorphisms between different cancer
patient groups
• Good outcome v bad outcome
• Early stage vs late stage
• Therapy responders v non-responders
• Examples: Renal cell, prostate cancer etc
• RNA
– Are there specific transcripts (mRNA, microRNA) that are up or
down and form a signature for outcome, disease and response
– 1000s of studies
– Consortia projects
• TCGA – The Cancer Genome Atlas projects
• Profile 500 samples of each cancer for DNA, RNA changes
Molecular Biology Primer
• 20,000 genes
• Many transcripts, many proteins
• More than 20,000 proteins
• Southern, Northern, Western Blots
Base pairing
• Microarray and
Northern/Southern blots
– Exploit the ability of
nucleotides to hybridize to
each other
– Base pairing
– Complementary bases
• A :T (U)
• G: C
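The base-pairing rule that blots and microarrays exploit can be written down directly. This is a minimal sketch: the function names and the example sequences are illustrative assumptions, and real hybridization also depends on temperature, salt and partial matches, which are ignored here.

```python
# Watson-Crick pairs for DNA (A:T, G:C).
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the strand that would base-pair with seq, written 5' to 3'."""
    return "".join(PAIRS[base] for base in reversed(seq.upper()))


def can_hybridize(probe, target):
    """True if the probe is perfectly complementary to some part of target."""
    return reverse_complement(probe) in target.upper()


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(can_hybridize("GCAT", "TTATGCTT"))
```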
Northern: sensitivity and dynamic range low
How are these changes measured
• Example: Northern blot (measure RNA)
– http://www.youtube.com/watch?v=KfHZFyADnNg
– Workflow of Northern blot
• Key points
– mRNA run on gel – separated by size
– transferred to a membrane – immobilized
– Have a hypothesis – for example studying RNA level for BRCA
in normal and cancer
– Only the probe for one mRNA or transcript is labeled or tagged
– Probe is prepared and labeled with radioactivity
– Hybridized to the membrane, then exposed to X-ray film
– Only that mRNA is detected and quantitated
Microarrays
• Solid surface
– Many different technologies
• Affy, Illumina, Agilent
– Probes are synthesized on the
solid surface
• Synthesized using proprietary
technology
– Probes are selected using
proprietary algorithms
– RNA (or DNA) is in solution
– RNA is labeled or tagged
– Hybridized to the chip
– Tagged RNA is quantitated
– Compare between conditions
Affymetrix
Need for computational methods
• Data Management
– Each file for a chip experiment is
large
• 100 MB x 10 = 1 GB
• Generates Gigabytes of data
• Data preprocessing
– Convert raw image into signal
values
• Data analysis
– 1000s of genes (or SNPs) and few
samples
– How to find differences between
samples
– What statistical methods to use?
– Like finding needle in a haystack
How to analyze?
Normal
Tumor
Noise reduction
Background subtraction
Normalization
Samples (columns) by genes (rows); each sample column is headed "2"
name | id | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
Rab geranylgeranyltransferase, alpha subunit | 100_g_at | 231.5 | 250 | 369.7 | 217.5 | 489 | 228 | 336.3 | 363.2 | 381.7 | 373.2 | 263.8 | 302.8
mitogen-activated protein kinase 3 | 1000_at | 477.9 | 662.7 | 589.9 | 883.8 | 395.5 | 979.5 | 420.4 | 457.8 | 389.1 | 495.7 | 346.3 | 482.5
tyrosine kinase with immunoglobulin and epidermal growth factor homology domains | 1001_at | 47.4 | 150.7 | 15.2 | 86 | 128.1 | 62.7 | 131.8 | 54.4 | 59.6 | 116.4 | 32.7 | 22.2
Burkitt lymphoma receptor 1, GTP binding protein (chemokine (C-X-C motif) receptor 5) | 1004_at | 87 | 114.4 | 220 | 104.5 | 185.7 | 175.2 | 170.8 | 186.5 | 223.6 | 42.7 | 93.4 | 115.1
dual specificity phosphatase 1 | 1005_at | 593.5 | 887.4 | 299.3 | 1324.8 | 132.4 | 831.8 | 173 | 117.5 | 112.6 | 241.5 | 153.9 | 212.2
-- | 1008_f_at | 3205.4 | 1582.4 | 5618.8 | 3589.1 | 1401.2 | 2951.4 | 1910.3 | 1217.8 | 1195.2 | 2928.7 | 1305.9 | 589.6
dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 4 | 101_at | 93.5 | 29.3 | 33.5 | 32.7 | 24.1 | 17.2 | 47.6 | 100.4 | 19.3 | 20.4 | 111.7 | 78
tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, epsilon polypeptide | 1011_s_at | 717.6 | 426.6 | 61.7 | 468 | 285.5 | 276.8 | 154.9 | 242.9 | 166.7 | 257.3 | 283.9 | 390.8
-- | 1017_at | 33.1 | 173.1 | 82.8 | 213.7 | 132.6 | 393.6 | 57.5 | 183.4 | 237.2 | 81.2 | 103 | 104.4
wingless-type MMTV integration site family, member 10B | 1019_g_at | 199.2 | 310.4 | 215.4 | 393.7 | 156.9 | 307.1 | 187.1 | 184.6 | 204.8 | 290.2 | 154 | 172.2
calcium and integrin binding 1 (calmyrin) | 1020_s_at | 852 | 207.9 | 272.7 | 243.5 | 592.4 | 227.2 | 651.7 | 643.9 | 517.6 | 478.8 | 742 | 1099.3
interferon, gamma | 1021_at | 14.6 | 58.4 | 161.5 | 11.3 | 18.4 | 36.1 | 4.2 | 40.6 | 14.3 | 122.7 | 6.9 | 43.6
collagen, type XI, alpha 2 | 1026_s_at | 122 | 198.8 | 192.6 | 194.6 | 53.7 | 341.8 | 37 | 88.2 | 224.5 | 194.4 | 134.5 | 107.6
topoisomerase (DNA) III alpha | 1028_at | 123.7 | 153.5 | 195.2 | 238.8 | 126.6 | 145.3 | 115 | 145.5 | 198.8 | 166.8 | 101.5 | 117.1
thrombospondin 4 | 103_at | 11.5 | 33.8 | 31 | 96 | 26.1 | 41.1 | 19.3 | 34.3 | 76.2 | 32.5 | 28 | 12
topoisomerase (DNA) I | 1030_s_at | 837.2 | 817.4 | 936.4 | 662.3 | 939.3 | 708.1 | 890.5 | 1006.6 | 698 | 742.3 | 838 | 1093.6
interleukin 8 receptor, beta | 1032_at | 275.6 | 515.3 | 620 | 381.3 | 417.4 | 408.3 | 332.4 | 435.6 | 394.7 | 366.7 | 308.6 | 499.9
interleukin 8 receptor, beta | 1033_g_at | 156.4 | 125.1 | 264.9 | 168.7 | 33.7 | 112.6 | 127.7 | 28.9 | 23.5 | 56.1 | 38.9 | 94
tissue inhibitor of metalloproteinase 3 (Sorsby fundus dystrophy, pseudoinflammatory) | 1034_at | 267.9 | 390.1 | 507.2 | 390.7 | 273.3 | 512.9 | 301.3 | 187.7 | 216.6 | 255.8 | 180.6 | 160.7
tissue inhibitor of metalloproteinase 3 (Sorsby fundus dystrophy, pseudoinflammatory) | 1035_g_at | 391 | 331.8 | 556.1 | 186.1 | 196.6 | 350.1 | 167.2 | 77.7 | 372.3 | 427.7 | 110.8 | 195
interferon gamma receptor 1 | 1038_s_at | 290.7 | 235.6 | 93.9 | 200.4 | 267.1 | 231.5 | 313.8 | 243.7 | 185.4 | 183.5 | 317.2 | 333.4
hypoxia-inducible factor 1, alpha subunit (basic helix-loop-helix transcription factor) | 1039_s_at | 309.5 | 120.3 | 332.2 | 94.9 | 96.7 | 103.5 | 278.4 | 146.8 | 87.2 | 523.9 | 216.6 | 242.8
POU domain, class 6, transcription factor 1 | 104_at | 80.6 | 170.7 | 139.4 | 140.8 | 178.5 | 182.4 | 124.1 | 94.9 | 148.6 | 115.2 | 74.4 | 54.5
ephrin-A5 | 1041_at | 96.1 | 81.9 | 332.3 | 53.3 | 10.2 | 57.5 | 13 | 36.6 | 100.8 | 77.5 | 57.3 | 48.6
E2F transcription factor 5, p130-binding | 1044_s_at | 130.6 | 94.8 | 175.1 | 210.3 | 125.3 | 143.5 | 118.7 | 129.5 | 91 | 15.7 | 109.8 | 169.4
-- | 1047_s_at | 95.4 | 1055.9 | 368.2 | 170.5 | 146.4 | 99.2 | 103.5 | 166.3 | 221.5 | 190.3 | 70.9 | 87.4
melan-A | 1051_g_at | 14.1 | 18.8 | 48.9 | 23 | 62.4 | 120.9 | 19.3 | 28.8 | 20.6 | 149.8 | 12.8 | 9.7
CCAAT/enhancer binding protein (C/EBP), delta | 1052_s_at | 2091.5 | 2732.8 | 2984.6 | 1157.3 | 3959.9 | 1280.4 | 4129.2 | 2391.4 | 4279.4 | 1673.5 | 4456 | 4965.1
replication factor C (activator 1) 2, 40kDa | 1053_at | 168.5 | 17.1 | 30.1 | 99 | 55.6 | 34.9 | 86.2 | 82.4 | 10.2 | 245.3 | 54.8 | 100.4
Data Analysis
Data analysis
• Class discovery
– Are there novel subclasses
within data?
• Class comparison
– How are tumor and normal
different in expression?
– Which SNPs are different?
• Class prediction
– Predict class of new sample
• Advanced pathway
Analysis
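For class comparison, a common starting point is a per-gene log2 fold change together with a t-statistic. The sketch below uses Welch's t-statistic, chosen here as one reasonable option and not a method prescribed by the lecture; the function name and the input values are illustrative assumptions.

```python
import math
from statistics import mean, variance


def compare_classes(normal, tumor):
    """Return (log2 fold change, Welch t-statistic) for one gene.

    normal and tumor are positive expression values, at least two each.
    The t-statistic is None when both groups have zero variance.
    """
    log_normal = [math.log2(x) for x in normal]
    log_tumor = [math.log2(x) for x in tumor]
    fold_change = mean(log_tumor) - mean(log_normal)
    standard_error = math.sqrt(
        variance(log_normal) / len(log_normal)
        + variance(log_tumor) / len(log_tumor)
    )
    t_stat = fold_change / standard_error if standard_error > 0 else None
    return fold_change, t_stat


if __name__ == "__main__":
    print(compare_classes([2.0, 4.0, 8.0], [8.0, 16.0, 32.0]))
```

With thousands of genes and few samples, the same test is repeated thousands of times, so the resulting p-values need a multiple-testing correction before any gene is called different.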
Pathway Analysis
Analytic methods – many studies,
many methods
Dupuy and Simon, JNCI; 2007
SNPs to detect Copy Number
changes
(Figure: SNP array copy number profile with regions labeled amplification, diploid, amplification, deletion)
Hagenkord et al; Modern Pathology, 21:599
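A simplified way to think about copy number calls is the log2 ratio of tumor signal to normal signal at each locus. The sketch below is an assumption-laden illustration and not the method of the cited paper: the thresholds of +0.3 and -0.3 are arbitrary example cutoffs, and real SNP array analysis also uses allele ratios and segmentation across neighbouring probes.

```python
import math


def call_copy_number(tumor_signal, normal_signal, gain=0.3, loss=-0.3):
    """Classify one locus from the log2 ratio of tumor to normal signal.

    gain and loss are hypothetical thresholds chosen for illustration.
    """
    log_ratio = math.log2(tumor_signal / normal_signal)
    if log_ratio >= gain:
        return "amplification"
    if log_ratio <= loss:
        return "deletion"
    return "diploid"


if __name__ == "__main__":
    for tumor, normal in [(200.0, 100.0), (100.0, 100.0), (50.0, 100.0)]:
        print(tumor, normal, call_copy_number(tumor, normal))
```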
What is personalized medicine
• Personalized medicine is the tailoring of medical
treatment to the individual characteristics of each
patient.
• Based on scientific breakthroughs in
understanding of how a person’s unique
molecular and genetic profile makes them
susceptible to certain diseases.
• ability to predict which medical treatments will
be safe and effective for each patient, and which
ones will not be.
From ageofpersonalizedmedicine.org
Personalized Medicine
From ageofpersonalizedmedicine.org
Personalized Medicine
From Fernald et al; Bioinformatics, 13: 1741
Examples of personalized medicine
• Breast cancer
– 30% of patients over express HER2
– Treated with Herceptin
– Oncotype Dx: gene expression predicting
recurrence
• Cardiovascular
– Patient response to warfarin, the blood thinner
– Response determined by polymorphisms in CYP
genes
Personalized Medicine
• Examples of personalized medicine resulted
from studies that generate
– Lots of data
– Rely on bioinformatics methods to discover these
associations
• Oncotype Dx:
– Gene expression studies of large number of patients
• CYP polymorphisms
– Discover single nucleotide polymorphisms in patient
populations and their association with response
» Initial studies done with PCR methods
Personalized Medicine
• Current examples are few in numbers
• Making personalized medicine a reality
– Generate the data
– Discover the associations
– Find targeted therapies
– Genome sequencing prices are dropping
– Large scale genome information is coming:
• 1000 Genomes
• TCGA
• ICGC
• Also possible to commercially sequence a person’s genome
• Processing all this data and translating these discoveries
into medical practice poses many challenges
Bioinformatics challenges in
personalized medicine
• Processing large scale robust genomic data
• Interpreting the functional impact of variants
• Integrating data to relate complex interactions
with phenotypes
• Translating into medical practice
Fernald et al; Bioinformatics: 13: 1741
Era of Personalized medicine
• Shift from microarrays to Next Gen
Sequencing
Central dogma
• DNA is transcribed to
RNA
• RNA is translated to
protein
• Many regulatory
processes control these
steps
Next Gen Sequencing
• Directly sequence DNA to determine
– SNPs
– Copy number (CN)
– Expression, mRNA, microRNA
– Protein binding sites
– Methylation
• Initial steps depend not only on hybridization (base pairing or
complementarity) but also on DNA synthesis
• Bioinformatics is extremely challenging
Next Gen Sequencing
NGS in personalized medicine
• Whole genome sequencing
– Sequence genomes and find variants (1000 Genomes
Project)
• Find variants associated with disease phenotype
• Sequence exomes only
– Find coding region variants associated with
phenotypes
• RNA seq
– RNA sequence signatures associated with phenotype
Microarrays v NGS RNA Seq
• Microarrays
– Restricted to probes on chips
– Only transcripts with probes
– File sizes in MBs to GB
– Algorithms, methods
– Typically done on PCs
– Storage on hard drives
• NGS RNA Seq
– No predetermined probes
– Can detect everything that is sequenced
– More applications than microarray
– Very large file sizes
– Computationally very intensive
– Clusters, supercomputers
– Large scale storage solutions
Microarrays v RNA seq Expression Analysis
• Microarrays
– Dynamic range is low
– Statistics to determine expression based on signal
– Many methods in the last 10 years
• RNA seq
– Dynamic range is high
• Based on reads
– Statistics based on counts
• Affected by read length, total number of transcripts, lack of replicates
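Because RNA seq expression is based on read counts, raw counts are usually rescaled before samples or genes are compared. The sketch below shows two widely used rescalings, counts per million (CPM) and reads per kilobase per million (RPKM); they are given here as examples of count-based normalization and are not named in the lecture. The gene names and numbers are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw read counts by the total number of reads in the sample."""
    total = sum(counts.values())
    return {gene: n * 1e6 / total for gene, n in counts.items()}


def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {
        gene: n * 1e9 / (total * lengths_bp[gene])
        for gene, n in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical genes: geneB is three times longer and collects three
    # times as many reads, so its length-normalized expression is the same.
    counts = {"geneA": 500, "geneB": 1500}
    lengths = {"geneA": 1000, "geneB": 3000}
    print(counts_per_million(counts))
    print(rpkm(counts, lengths))
```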
Read mapping Alignment
• De novo assembly
• Mapping to reference genome
– Based on complementarity of a given 35-nucleotide read to the
entire genome
– Computationally intensive
• Millions of 35 bp reads have to be searched against
the reference and aligned specifically to a given region
– Large file sizes
• Sequence files in the TB
• Aligned files (BAM files)
– Several hundred GB
Reference genome
Sequence variation
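Mapping a read to a reference, while tolerating a limited number of mismatches, can be sketched with a k-mer index ("seed and extend"). This is a toy illustration under stated assumptions: the reference and read are made-up sequences, the function names are hypothetical, and production aligners use compressed indexes and also handle insertions and deletions.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted (position, mismatches) pairs where the read aligns.

    Non-overlapping seeds of length k are looked up exactly; each seed hit
    is then extended to the full read and mismatches are counted.
    """
    hits = set()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


if __name__ == "__main__":
    reference = "TTGACCGATAGCAGGTCAATCG"  # made-up sequence
    index = build_index(reference, 4)
    print(map_read("CCGATAGC", reference, index, 4))  # exact copy of 4..11
    print(map_read("CCGATAGG", reference, index, 4))  # last base changed
```

A mismatch found this way is only a candidate variant: it may be a true SNP or a sequencing error, which is why the mismatch tolerance and the error rate of the technology matter.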
Bioinformatics challenges in
personalized medicine
• Processing large scale robust genomic data
– Suppose we want to identify DNA variants associated with disease
• Which technology
• How much data
• How to analyze the data
• How to identify variants
• Each genome can have millions of variants
• 300,000 new variants, i.e., not in existing databases
– Will have to separate error from true variants
– 1 error per 100 kb can lead to 30,000 errors in a single experiment
• Why do these errors happen?
Fernald et al; Bioinformatics: 13: 1741
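The figure of 30,000 errors follows from simple arithmetic, shown below. The genome size of about 3 billion base pairs is an assumption added here (it is the approximate size of the human genome and is not stated on the slide); the error rate and the count of new variants are taken from the slide.

```python
GENOME_SIZE_BP = 3_000_000_000   # assumed approximate human genome size
BASES_PER_ERROR = 100_000        # 1 error per 100 kb, from the slide
NOVEL_VARIANTS = 300_000         # new variants per genome, from the slide


def expected_errors(genome_size_bp, bases_per_error):
    """Number of erroneous base calls expected across one genome."""
    return genome_size_bp // bases_per_error


if __name__ == "__main__":
    errors = expected_errors(GENOME_SIZE_BP, BASES_PER_ERROR)
    print(errors)
    # If every error showed up as a novel call, this share of the novel
    # variants would be artifacts and not true variants.
    print(errors / NOVEL_VARIANTS)
```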
Bioinformatics Challenges
• Data
– Which technology to use
• Each technology has different error rates: Ion Torrent (higher error rate), SOLiD, Illumina
• Speed of generation of data – Ion Torrent is faster
• Application – whole genome, exome or targeted exome
– Large amounts of data
• Each whole genome sequencing experiment can generate TB of data
• Where to store – patient privacy
• Servers, locations, networking
• Analysis
– Algorithms, speed, accuracy
• BLAST is not good for WGS
• Other new algorithms
– Speed of analysis
• Alignment can take days
– Alignment relies on matches between sequence and reference genome
• How many mismatches to tolerate
• True mismatch or error: sequencing error, or true mismatch (is it a SNP?)
• Quality of reference genome
• Sample sizes – how many samples to sequence to discover the association with disease
Bioinformatics Challenges
• Technology
– Ion Torrent, SOLiD, Illumina
– Each has its own error rates
– Speed of data generation
– Dependent on application – WGS or exome
• Analysis
– Algorithms, speed, accuracy
• Speed of analysis
– Alignment can take days
• Alignment relies on matches between sequence and reference
genome
– How many mismatches to tolerate
– True mismatch or error: sequencing error, or true mismatch (is it a SNP?)
• Quality of reference genome
From Mark Boguski’s presentation at the IOM, July 19, 2011
Molecular Diagnostics using NGS
From Mark Boguski’s presentation at the IOM, July 19, 2011
NGS Bioinformatics - medicine
• Infrastructure
– Storage, backup, archive
– Where – HIPAA compliant?
– Network
• How to move data
• Analysis
– Methods – statistics, annotation
– Computing resources
– How many samples can be handled at a time?
– Time to report
NGS and bioinformatics
Next Gen Sequencing
From Mark Boguski’s presentation at the IOM, July 19, 2011
