### THE EARLY HISTORY OF BAYESIAN STATISTICS Tom Leonard

```REFERENCES:
A personal history of Bayesian Statistics (2014)
Wiley Interdisciplinary Reviews, Comput Stat, 6:80-115
with link to remaining chapters (from 1972) on my website
www.thomashoskynsleonard.co.uk
Refers to technical material in my book
Bayesian Methods: An Analysis for Statisticians and
Interdisciplinary Researchers
(1999, with John S.J. Hsu) Cambridge University Press
Self-published on my website
Slides prepared by Thomas Tallis
Among competing (plausible) hypotheses, the hypothesis with the
fewest assumptions should be selected. (WILLIAM OF OCKHAM)
In other words: Keep things simple, and cut out extraneous
information
OCCAM’S RAZOR (William of Ockham, c1287-1347)
FOR EXAMPLE:
Use parameter parsimonious sampling models which depend upon
low numbers of unknown parameters (e.g. which minimise AIC
or DIC)
Contrasts with:
‘A model should be as big as an elephant’ (Leonard ‘Jimmie’
Savage, 1954, Lindley, 1983)
Agrees with:
‘The greater the amount of information the less you actually
know’ (Toby Mitchell, c 1980)
Related to:
E.T. Jaynes’ extremely valuable idea (1957 and 1968) of choosing
the ‘maximum entropy’ prior distribution when only p summaries
of the prior information are specified.
Pascal
Fermat
Blaise Pascal (1623-1662) formulated ‘Pascal’s Wager’ by reference to the notion of
subjective probability.
Pascal corresponded with Pierre de Fermat about the potential development of
probability theory.
In 1654, Pascal and De Fermat (1601 or 1607-1665) together solved the problem of
‘points’ or ‘division of stakes’.
In 1657, Christiaan Huygens discussed the Pascal-De Fermat debate, in
De ratiociniis in ludo aleae
Daniel Bernoulli (1700-1782) Swiss mathematician, physicist
and physician.
Formalised subjective view of probability, decision
making and risk.
Introduced concept of EXPECTED UTILITY in 1738 in
historic paper published in St Petersburg
Used the St PETERSBURG PARADOX to justify
maximising expected utility
(where the expected reward from the specified
betting scheme is infinite, but most punters would
only want to place a small bet on the outcome
because of the high probability of a low return).
David Hume F.R.S.E (1711-1776)
Educated (from age 12) at University of Edinburgh
between 1723 and 1725
Sceptical views about causality in 1739-41 trilogy
Questionable cause fallacy: the false assumption that
correlation proves causality
Subjective probability discussed in Ch 6 of his 1748 book
Author of is-ought problem or Hume’s guillotine
Significant difference between descriptive statements
(about what is) and prescriptive statements (about what
ought to be)
Not obvious how to get from descriptive statements to
prescriptive ones
Hume’s Law: You can’t derive an ought from an is
‘A midget on the shoulders of giants like Hume and
Huygens’ (Tom Leonard, 2014)
Rev. Thomas Bayes (1701-1763)
Studied for Presbyterian Ministry at University of
Edinburgh between 1719 and about 1722.
Probably derived continuous version of ‘Bayes’
Theorem’ during the 1740s while a wealthy, well-connected
minister in Tunbridge Wells, with a serious
demeanour and happy disposition.
The Notebook of Thomas Bayes (1747-1760) contains a
section on probabilities.
In his tract In defence of Isaac Newton (1736, printed by
John Noon), sold for a shilling, Bayes writes,
To suspect Isaac Newton of the mean design of seeking
reputation among the ignorant by venting unintelligible
notions, and defending them by artful cunning and
cunning artistry, is what no man is capable of doing.
Rev. Richard Price F.R.S. (1723-1791)
Moral philosopher, inductive thinker, and political
activist in support of American Revolution.
In 1763, Richard Price published Bayes’ paper ‘An
Essay towards solving a Problem in the Doctrine of
Chances’, posthumously, in the Philosophical Transactions
of the Royal Society of London.
Bayes solved a complicated ‘Ball tossing problem’
involving n non-independent trials and with
applications in life assurance. His mathematical
solution was brilliant, but counterintuitive.
He posed this as a special case of:
Obscurely Worded General Problem:
Given the number of times (n) an unknown event has happened
and failed, REQUIRED the chance that the probability (ξ) of its
happening in a single trial lies somewhere between any two degrees
of probability that can be named?
A further special case (n=50 independent Bernoulli trials; see Bayes’
Appendix):
If you fail to win a lottery on n=50 occasions, with equal chance ξ of
winning on each occasion, then what is the chance that your
probability ξ of winning it on the 51st attempt lies between 0.001 and
0.01?
A young Bayesette
VERY SPECIAL CASE (n=1)
If a mother’s first baby is a girl, then what is the chance that the probability ξ that her
second baby is a boy lies between 0.5 and 1?
Note that probability (girl on first birth, given ξ ) = 1-ξ
Therefore LIKELIHOOD FUNCTION OF ξ is
L (ξ, given girl on first birth) = 1-ξ for 0< ξ <1
In general, the likelihood of the unknown parameters is the assumed sampling density
or probability mass function of the observations but expressed as a function of the
unknown parameters, given the observations actually observed.
LEONARD ‘JIMMIE’ SAVAGE (1917-1971)
Initiated the ‘Savageous’ philosophy of Bayesian
Statistics
Posterior Information = Prior Information +
Sampling Information. ($$$)
A Bayesian is somebody who tries to represent his prior
information about ξ by a probability distribution on ξ
BAYES THEOREM (Continuous case):
POSTERIOR DENSITY = K x PRIOR DENSITY x LIKELIHOOD
where K can be calculated by noting that posterior
density integrates to unity across the parameter space.
However, in his 1763 paper, Bayes assumed a uniform
prior distribution on (0,1) for ξ, in which case
POSTERIOR DENSITY = K x LIKELIHOOD
In preceding very special case,
Posterior density of ξ, given girl on first birth
= 2(1-ξ) (0<ξ<1) (*)
[Figure: POSTERIOR DENSITY OF ξ, with density on the vertical axis and ξ on the horizontal axis]
Posterior mean of ξ = predictive probability
that next baby is a boy = 1/3
and
P (0.5 < ξ < 1, given girl on first birth) = 1/4
If first n babies are girls,
then predictive probability that next baby
is a boy is 1/(n+2)
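A sketch of the calculation behind these values, assuming (as in Bayes’ 1763 paper) a uniform
prior on (0,1) and n independent births that are all girls. The intermediate steps and the
Beta label are filled in here for illustration only:
  Likelihood:            L(ξ) = (1-ξ)^n for 0<ξ<1
  Normalising constant:  1/K = integral from 0 to 1 of (1-ξ)^n dξ = 1/(n+1), so K = n+1
  Posterior density:     (n+1)(1-ξ)^n, i.e. a Beta(1, n+1) density
  Posterior mean:        integral from 0 to 1 of ξ(n+1)(1-ξ)^n dξ = 1/(n+2)
  Interval probability:  P(a<ξ<b, given data) = (1-a)^(n+1) - (1-b)^(n+1)
With n=1 this gives K=2, posterior mean 1/3, and P(0.5<ξ<1) = (0.5)^2 = 1/4.
Applied to the lottery case with n=50 losses, the same interval formula gives
P(0.001<ξ<0.01) = 0.999^51 - 0.99^51, which is roughly 0.35.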
Le Marquis Pierre-Simon de Laplace (1749-1827)
French Astronomer, Mathematician, and Politician
Minister in Napoleon’s Government
FOUNDING FATHER OF BAYESIAN STATISTICS AND DATA
ANALYSIS
In 1774, his Memoir on the Probability of the Causes of
Events included a Bayesian analysis of the causes of
events.
In 1812, his Analytic Theory of Probabilities contained a
number of detailed statistical analyses.
He introduced a general version of Bayes’ theorem that
includes the discrete and multiparameter cases.
Applied it to ANALYZE DATA in celestial mathematics,
MEDICAL STATISTICS, reliability and jurisprudence.
Developed LAPLACE’S APPROXIMATION to multidimensional integrals
and LAPLACE TRANSFORMS (moment generating
functions)
Adam Smith (1723-1790)
Political economist.
The Wealth of Nations, 1776
Rejected the idea that:
Demand must be related to utility
i.e. the more useful a thing is, and the more
satisfaction it gives, the more people would be
willing to pay for it.
THE PARADOX OF DIAMONDS AND WATER
Water is necessary for life, and yet very cheap
Diamonds have little utility, and are yet very
costly.
Smith thereby concluded that willingness to
pay is not related to utility.
Adam Smith proposed using interval bounds
for probabilities, rather than precisely specified
subjective probabilities
Jeremy Bentham (1748-1832)
British philosopher, jurist and social reformer.
Regarded by some as the father of modern utilitarianism,
and by others, in the context of banking, insurance, and
speculation, as the founder of the subjectivist, Bayesian
approach to decision making.
(Bentham’s approach to subjective probability is an earlier
version of the exact, linear approach recommended as being
rational by Tversky and Kahneman).
Introduction to Principles of Morals and Legislation, 1780
GREATEST HAPPINESS PRINCIPLE:
It is the greatest happiness of the greatest number which is
the principle of right or wrong.
Classification of 12 pains and 14 pleasures by which we may
test the happiness factor of any action.
Formalised set of criteria for measuring the extent of pain
or pleasure that any decision will create. Reviewed concept
of punishment, and whether a particular punishment will
create more pain or pleasure for society.
Bentham applied similar ideas to monetary economics.
Augustus De Morgan (1806-71)
Anglo-Indian mathematician, statistician and
spiritualist. Appointed to Chair of Mathematics at
University of London (later UCL) in 1828
See his Essay on Probabilities (1838)
De Morgan further developed Bayes’s and Laplace’s
approach to INVERSE PROBABILITY...
Posterior probabilities when the prior distribution
is uniform.
Somewhat arbitrary e.g. a uniform prior for
a non-linear transformation of the parameter
will give different posterior.
Uniform priors over a continuous unbounded
parameter space are improper, but can, though not
always, yield meaningful proper posteriors.
De Morgan sought to justify uniform prior by
Laplace’s Principle of Insufficient Reason
Florence Nightingale (1820-1910)
Nurse and statistician
For remainder of 19th century
(A) Many statistical scientists (e.g. Gauss, Edgeworth, Galton) thought
Bayesian
(B) Inverse probabilities remained the main methodology for statistical
inference. Fisher dabbled with them in the early 20th century and discarded
them because of the arbitrariness in the choice of uniform prior.
(C) Emphasis seemed to shift somewhat to numerical and graphical
summaries of data,
e.g. London Cholera epidemic map (1832) and Crimean War (Florence
Nightingale, e.g. pie charts)
Sir Francis Galton (1822-1911)
English geneticist, statistician and
polymath, a truly great man of science
In 1877 built machine called GALTON
QUINCUNX
Used simulations while attempting to
calculate posterior distribution
Galton encouraged use of Bayes Theorem
Informative conjugate analysis for normal
distribution developed around that time.
Charles Sanders Peirce (1839-1914)
American philosopher, logician, mathematician and
scientist.
‘The father of pragmatism’
Emphasised that objective statistical conclusions
can only be hoped for if the data result from a
randomised experiment.
Was the first scientist to elicit subjective
probabilities in experimental psychology.
Alfred Dreyfus
(9 October 1859 – 12 July 1935)
French Military Officer
1894 TRIAL OF MILLENNIUM
Dreyfus tried for treason
Bizarrely justified subjective ‘probability’ of
forgery.
Falsely convicted of transmitting military secrets
to Germany.
Probability related to possible coincidences
concerning frequencies of symbols in the code.
‘SIMILAR PROBLEMS OCCUR TODAY WHENEVER
STATISTICAL EVIDENCE AND SUBJECTIVE
PROBABILITIES ARE INTRODUCED INTO
EVIDENCE’
David H. Kaye, Minnesota Law Review
(2007)
O.J. Simpson murder case, Adams Rape Case,
Sally Clark Cot Death Case
the threat to civil liberties. Yale University Press
Frank Ramsey (1903-1930)
British mathematician, philosopher and economist
1926 papers on subjective probability and utility were
encouraged by the economist John Maynard Keynes
His work on subjective probability and its elicitation
satisfied Charles Peirce’s empirical test.
Used by experimental psychologists and recognised in
1944 by Von Neumann and Morgenstern, in their book The
Theory of Games and Economic Behaviour
Famously used utility theory to judge ‘how much of its
wealth a nation should spend’
Close friend of philosopher Ludwig Wittgenstein whose
works he translated
Never stay up on the barren heights of cleverness, but
come down into the green valleys of silliness
Sir Ronald Fisher (1890-1962)
Highly eccentric English statistician, evolutionary
biologist, geneticist and eugenicist
One of the chief architects of neo-Darwinian synthesis
Galton Professor of Eugenics at UCL (1933-43)
Argued with Karl Pearson, e.g. about who should teach
which course.
Dabbled with Bayesian inference and inverse
probability, then argued vehemently against it
because of its dependence on prior, e.g. the choice of
‘vague’ so-called ignorance prior.
Introduced FIDUCIAL INFERENCE in paper in Annals of
Eugenics (1935). Disputed by Neyman and shown by
Lindley in 1958 to violate Kolmogorov’s addition laws
of probability.
John Maynard Keynes (1883-1946)
Baron Keynes of Tilton
Cambridge Economist
Employed expected utility in 1936 in Chapter 12 of
The General Theory of Employment, Interest and
Money.
Keynesian Economics has fundamentally affected
the theory and practice of modern macroeconomics,
and influenced the policies of governments until
about 1979, when the ideas of Milton Friedman, who
also used expected utility, took over.
Sir Harold Jeffreys F.R.S. (1891-1989)
Cambridge-based Mathematician, Statistician, Geologist
and Astronomer
The Theory of Probability (1939)
Anticipated the Anglo-American Bayesian Revival of the 1960s
Led by Rudolf Kalman, Raiffa and Schlaifer, Mosteller and
Wallace, Box and Tiao, John Aitchison F.R.S.E and Dennis
Lindley.
INCLUDED: Invariance priors: vague priors which refer to
the determinant of Fisher’s Information and yield
posterior distributions which are invariant under
non-linear transformations of the parameters.
Approximate Bayes intervals (also approximate confidence
intervals) centred on the maximum likelihood estimate,
which also refer to the likelihood dispersion.
Andrey Kolmogorov (1903-1987)
Pre-eminent Russian Mathematician and Probabilist
Introduced concept of Bayesian sufficiency in his paper on
the statistical estimation of the law of Gauss in 1942
Kolmogorov’s Extension Theorem constrains us to only
defining our probability distributions on measurable
subsets of the parameter space or sample space
(i.e. those which are elements of an appropriate
sigma-field, such as a Borel field)
Alan Turing (1912-1954)
Irving Jack Good (1916-2009)
Alan Turing: Gay icon and martyr, father of machine intelligence, modern computer
science and artificial intelligence. Also the father of modern Bayesian applied
statistics.
Jack Good: cryptanalyst, mathematician, statistician and philosopher.
While solving the Nazi codes at Bletchley Park, Turing and Good used various
pioneering, effectively Bayesian procedures including
• Empirical alternatives to Bayes factors as measures of evidence
• Effectively Bayesian sequential analysis and decision-tree analysis
• Shrinkage estimators for multinomial cell probabilities, which smooth the relative
frequencies of the letters in the German code towards a common value,