Lecture2 - Newcastle University Staff Publishing

Data first vs Hypothesis first
Alan Ward
Hypothesis driven approach
• Look at the data we have
• Formulate an hypothesis about ..
• Do experiments to test the hypothesis
• As a byproduct, collect more data
Weinberg R (2010) Point: Hypotheses first. Nature 464, 678
Data driven approach
• Identify a system of interest
• Identify an approach to measure/describe attributes of the system
• Collect and organise the data
Golub T (2010) Counterpoint: Data first. Nature 464, 679
“Reports that say that something hasn't
happened are always interesting to me,
because as we know,
there are known knowns; there are things we know
that we know. There are known unknowns; that is
to say, there are things that we now know we don't
know. But there are also unknown unknowns –
there are things we do not know we don't know.”
—United States Secretary of Defense, Donald
Rumsfeld
The Black Swan: The Impact of the Highly
Improbable. Nassim Taleb
Hypothesis driven: enzyme activity research
[Figure labels: feedback inhibition, allosteric regulation; known, unknown]
Non-coding short RNAs
Transcriptional regulation Inducers and repressors
Breadth first vs Depth first
A slice up and down
A slice across
Observation has always been part of biology, as in the imatinib example (Golub, 2010), but DNA sequencing technology has revolutionized observational data collection.
Weinberg (2010) argues that ‘cheap sequencing’ on a massive scale means too much funding goes to data collection.
He doesn’t argue it, but you might also spend all your time managing the data (Marx, 2013).
Marx, V (2013) Biology: The big challenges of big data. Nature 498, 255–260
Depth first or breadth first
Two different strategies for computer search algorithms
Which is better?
That depends heavily on the structure of the search tree and on the number and location of solutions.
If you know a solution is not far from the root of the tree, a breadth first search (BFS) might be better.
If the tree is very deep and solutions are rare, a depth first search (DFS) might rootle around forever, whereas BFS could be faster.
If the tree is very wide, a BFS might need too much memory, so it might be completely impractical.
If solutions are frequent but located deep in the tree, BFS could be impractical.
If the search tree is very deep, you will need to restrict the search depth for DFS anyway.
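The trade-off between the two strategies can be tried out on a toy tree. The following is only a minimal sketch: the tree (node n has children 2n and 2n+1, six levels in total) and the function names are assumptions made for illustration, not part of the lecture.

```python
from collections import deque

def bfs(root, children, is_goal):
    """Breadth first: explore level by level. Returns (goal, nodes_visited)."""
    queue = deque([root])
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        if is_goal(node):
            return node, visited
        queue.extend(children(node))
    return None, visited

def dfs(root, children, is_goal, max_depth):
    """Depth first with a depth limit. Returns (goal, nodes_visited)."""
    stack = [(root, 0)]
    visited = 0
    while stack:
        node, depth = stack.pop()
        visited += 1
        if is_goal(node):
            return node, visited
        if depth < max_depth:
            # reversed() so that the leftmost child is explored first
            stack.extend((c, depth + 1) for c in reversed(children(node)))
    return None, visited

# Hypothetical toy tree: nodes 1..63, node n has children 2n and 2n+1.
def children(n):
    return [2 * n, 2 * n + 1] if n < 32 else []

# Solution close to the root (node 3): BFS visits far fewer nodes.
shallow_bfs = bfs(1, children, lambda n: n == 3)
shallow_dfs = dfs(1, children, lambda n: n == 3, max_depth=5)

# Solution deep in the tree (node 32): DFS visits far fewer nodes.
deep_bfs = bfs(1, children, lambda n: n == 32)
deep_dfs = dfs(1, children, lambda n: n == 32, max_depth=5)
```

In this toy case BFS reaches node 3 after 3 visits while DFS needs 33, and DFS reaches node 32 after 6 visits while BFS needs 32. The BFS queue also grows with the width of each level, which is the memory cost mentioned above.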
dbEST release 130101 EST database
Summary by Organism - 01 January 2013
• Number of public entries: 74,186,692
• Homo sapiens (human): 8,704,790
• Mus musculus + domesticus (mouse): 4,853,570
• Zea mays (maize): 2,019,137
• Sus scrofa (pig): 1,669,337
• Bos taurus (cattle): 1,559,495
• Arabidopsis thaliana (thale cress): 1,529,700
• Danio rerio (zebrafish): 1,488,275
• Glycine max (soybean): 1,461,722
• Triticum aestivum (wheat): 1,286,372
• Xenopus (Silurana) tropicalis (western clawed frog): 1,271,480
• Oryza sativa (rice): 1,253,557
• Ciona intestinalis: 1,205,674
• Rattus norvegicus + sp. (rat): 1,162,136
• Drosophila melanogaster (fruit fly): 821,005
• …..
• Salmonella enterica subsp. enterica serovar Typhi: 217
• Mycobacterium smegmatis str. MC2 155: 30
• Mycobacterium tuberculosis: 30
dbEST references
Boguski, MS, Lowe, TMJ, Tolstoshev, CM (1993) dbEST - database for expressed sequence tags. Nature Genetics 4, 332-333
Boguski, MS (1994) Gene discovery in dbEST. Science 265, 1993-4
Boguski, MS (1995) The turning point in genome research. Trends in Biochemical Sciences 20, 295-6
Nagaraj, S (2007) A hitchhiker's guide to expressed sequence tag (EST) analysis. Briefings in Bioinformatics 8, 6-21
Why DNA?
An example:
Species and strain identification in prokaryotes
• DNA:DNA similarity
• MLEE (MultiLocus Enzyme Electrophoresis)
• MLST (MultiLocus Sequence Typing)
• ANI (Average Nucleotide Identity)
Defining species
The modern concept of species dates back to:
Mayr, E. (1942) Systematics and the Origin of Species (Columbia Univ. Press, New York)
Biological species concept: Species are groups
of actually or potentially interbreeding natural
populations, which are reproductively isolated
from other such groups
de Queiroz K (2005) Ernst Mayr and the modern concept of species.
Proc Natl Acad Sci U S A. 102 Suppl 1: 6600-7.
Bacterial species
Bacteria do not interbreed in the same way so
defining species in bacteria remained an exercise in
clustering organisms with similar, initially
phenotypic, characters
Stanier RY. Adaptation, evolutionary and physiological: Or
Darwinism among the microorganisms. In: Davies R, Gale EF,
editors. Adaptation in Microorganisms, Third Symposium of the
Society for General Microbiology. Cambridge: Cambridge
University Press; 1953
Goldner M (2007) The genius of Roger Stanier
Can J Infect Dis Med Microbiol 18, 193–194
DNA:DNA similarity
From the 1960s there was a consensus that all taxonomic
information about a bacterium is incorporated in the
complete nucleotide sequence of its genome
Wayne et al. (1987) correlated the measured similarity of the DNA of two strains with the species as then defined and concluded that:
A DNA:DNA similarity of about 70% or greater, together with a ΔTm of 5°C or less (both values are important), marks the boundary of a group of strains which belong to the same species
Wayne, L. G., Brenner, D. J., Colwell, R. R., Grimont, P. A. D., Kandler, O.,
Krichevsky, M. I., Moore, L. H., Moore, W. E. C., Murray, R. G. E. & other authors
(1987). Report of the ad hoc committee on reconciliation of approaches to
bacterial systematics. Int J Syst Bacteriol 37, 463–464.
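As a minimal sketch, the two-part criterion can be written as a single check. It assumes the usual reading of Wayne et al. (1987): two strains fall in the same species when DNA:DNA similarity is about 70% or greater and ΔTm is 5°C or less. The function and parameter names are hypothetical.

```python
def same_species(ddh_percent, delta_tm_c, min_ddh=70.0, max_delta_tm=5.0):
    """Apply both thresholds; either one failing puts the strains in different species.

    ddh_percent: DNA:DNA similarity between the two strains, in percent.
    delta_tm_c:  difference in melting temperature (ΔTm), in degrees Celsius.
    The default cut-offs are the assumed Wayne et al. (1987) values.
    """
    return ddh_percent >= min_ddh and delta_tm_c <= max_delta_tm

# Illustrative values only, not measurements from the lecture.
pair_a = same_species(85.0, 2.0)   # high similarity, small ΔTm
pair_b = same_species(85.0, 7.0)   # high similarity, but ΔTm too large
pair_c = same_species(60.0, 2.0)   # small ΔTm, but similarity too low
```

Only the first pair passes, which is the point of requiring both values rather than the similarity alone.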
DNA-DNA similarity
Measuring DNA similarity by hybridisation is not the same as measuring DNA sequence similarity, and it is done using a number of different techniques
% Similarity
• De Ley – rate of renaturation
• Ezaki – microplate binding
ΔTm
• DNA melting
• Elution from hydroxyapatite
The methods are not robust and few labs can do them:
Stackebrandt et al. (2002) Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 52, 1043-1047
Melting Temperature analysis
DNA Melting
Using real-time PCR and SYBR Green for DNA melt curve analysis
Gonzalez, JM & Saiz-Jimenez, C (2005) A simple fluorimetric method for the estimation of
DNA–DNA relatedness between closely related microorganisms by thermal denaturation
temperatures. Extremophiles 9, 75–79
ΔTm determination
Exactly the same melting program, but this time the DNA from Organism 1 and Organism 2 has been mixed, denatured and then renatured at the optimum temperature for renaturation, T_OR, calculated from the %GC (T_OR = 0.51 × %GC + 47.0), before adding SYBR Green and melting
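The renaturation temperature formula above is simple enough to compute directly. This is a sketch only: the helper names are hypothetical, the result is assumed to be in °C, and in practice %GC would come from the genome rather than from a short example string.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)

def optimal_renaturation_temp(gc_pct):
    """T_OR = 0.51 * %GC + 47.0, the formula quoted on the slide."""
    return 0.51 * gc_pct + 47.0

# Illustrative sequence only: 4 of its 8 bases are G or C, so %GC is 50.
example_gc = gc_percent("ATGCATGC")
example_tor = optimal_renaturation_temp(example_gc)  # about 72.5
```

When two organisms differ in %GC, some convention is needed for which value to use; the slide does not say, so the sketch leaves that choice to the caller.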
Disadvantages of DNA-DNA similarity
Because DNA:DNA hybridisation compares the whole
genome it has remained the “Gold standard” for
species delineation but it has several disadvantages:
It requires large amounts of high quality DNA
The methods are difficult to do
Different methods can give different results
Reciprocal measurements can be very different
(amount of A binding to B is different from amount of B binding to A)
The experimental measurement has to be made between 2 strains, so obtaining DNA-DNA similarity for 5 strains requires 20 experimental determinations (every pair, measured in both directions), and if a 6th strain needs to be compared another 5 pairs, i.e. 10 more determinations, are needed
You can’t build an incremental database
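The experimental cost of pairwise comparison can be counted explicitly. This sketch assumes that the figure of 20 determinations for 5 strains counts every pair in both directions, because reciprocal values can differ; the function names are hypothetical.

```python
def determinations(n_strains, reciprocal=True):
    """Number of pairwise measurements needed to compare n strains.

    With reciprocal=True each pair is measured in both directions
    (A against B and B against A); otherwise each pair is measured once.
    """
    ordered_pairs = n_strains * (n_strains - 1)
    return ordered_pairs if reciprocal else ordered_pairs // 2

def extra_for_new_strain(n_existing, reciprocal=True):
    """Additional measurements when one more strain joins the comparison."""
    return determinations(n_existing + 1, reciprocal) - determinations(n_existing, reciprocal)

five_strains = determinations(5)                      # 20
sixth_strain_pairs = extra_for_new_strain(5, False)   # 5 new pairs
sixth_strain_both = extra_for_new_strain(5, True)     # 10 if measured both ways
hundred_strains = determinations(100)                 # 9900
```

The count grows with the square of the number of strains, and every new strain needs fresh experiments against all existing ones, whereas a sequence-based measure is computed from data already stored.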
Multilocus Enzyme Electrophoresis
MLEE
Selander, RK, Caugant, DA, Ochman, H, Musser, JM, Gilmour, MN and Whittam, TS (1986) Methods of multilocus enzyme electrophoresis for bacterial population genetics and systematics. Appl. Environ. Microbiol. 51, 873-884
Multilocus sequence typing
MLST
Staphylococcus aureus
Maiden, MCJ, Bygraves, JA, Feil, E, Morelli, G, Russell, JE, Urwin, R, Zhang,
Q, Zhou, J, Zurth, K, Caugant, DA, Feavers, IM, Achtman, M, and Spratt, BG
(1998) Multilocus sequence typing: A portable approach to the
identification of clones within populations of pathogenic microorganisms.
Proc. Natl. Acad. Sci. USA 95, 3140–3145
Multilocus sequence typing
MLST
• Portable
• Unambiguous
• Reproducible
• Cumulative
• Scalable
The traditional method of data reduction is publication: results are summarized in peer-reviewed journals.
Publications include only the most important
results, from experiments that may have been
performed over many years.
The published paper is a concise compilation of the
data, an interpretation of the results, and a
comparison with results obtained by others.
A significant fraction of experiments from academic laboratories cannot be repeated in industry (Begley & Ellis, 2012), reflecting inadequate description of experiments performed on different equipment and on biological samples that were produced with disparate methods.
Begley CG & Ellis LM (2012) Drug development: Raise standards for preclinical cancer research. Nature 483, 531–3
In the 1990s NCBI scanned the literature for sequences and manually typed them into the database.
In 1991 the GenBank On-line Service ran on a Solbourne 5/800 running OS/MP 4.0C. The database work was done on a Sun network 4/490 server and workstations running SunOS UNIX version 4.1. The GenBank database was maintained on the Sybase relational database management system (RDBMS). Software was developed in the C language.
Benson, DA, Cavanaugh, M, Clark, K, Karsch-Mizrachi, I, Lipman, DJ, Ostell, J and Sayers, EW (2013) GenBank. Nucleic Acids Research 41, D36–D42
