Guidance for Statisticians when Reporting Biomarker Studies

Report
Biomarkism: taming the
revolution?
May 12th 2014
PSI Conference
David Lovell
St George’s Medical School
University of London
Henry Gray
Edward Jenner
Plan of contribution
• Different types of biomarkers
• Biomarkers (the journal) and statistical
guidelines
• Some personal editorial/refereeing experiences
• Challenges
• Epigenetics
• Discussion points
Editorial Board Member
• Biomarkers
• Mutagenesis
• Toxicology in Vitro (until 2008)
• Refereeing for numerous journals (20+) over
last 10-15 years
Biomarkers was started in 1996 by John Timbrell (UoL School
of Pharmacy)
The journal Biomarkers brings together all aspects of the
rapidly growing field of biomarker research, encompassing
their various uses and applications in one essential source.
Manuscripts can describe biomarkers measured in humans or
other animals in vivo or in vitro.
FDA definition of Biological Marker
(biomarker)
“A characteristic that is objectively measured and
evaluated as an indicator of normal biologic
processes, pathogenic processes, or pharmacologic
responses to a therapeutic intervention.”
"Biomarkers and Surrogate Endpoints: Preferred
Definitions and Conceptual Framework"
Biomarkers Definitions Working Group (2001)
• Biomarkers of exposure: covering detection and
measurement of internal exposure to drugs and other
chemicals;
• Biomarkers of response: including measures of endogenous
substances or parameters indicative of pathological or
biochemical changes both toxicodynamic and
pharmacodynamic, resulting from exposure to drugs and other
chemicals;
• Biomarkers of susceptibility: including genetic factors which
alter susceptibility to drugs and other chemicals ;
• Biomarkers of disease: covering measurement of endogenous
substances or parameters indicative of a disease process and
the use of pharmacodynamic and genetic markers in evidence-based laboratory medicine and treatment (markers of efficacy)
• 13 (at least) other journals with Biomarkers in the
title
• 8 issues/year
• >400 papers received in 2012 and 2013
• Approx 225 referees in 2013
• Rejection rates have increased from 51% in 2009 to
73% in 2013.
• Year average impact factor stable at 2.230
• Geographical spread
Country/Year: U.S., China, India, Italy, U.K., Germany, Brazil, Turkey
2009 2010: 54 54 33 44 25 27 20 12 14 22 13 20 9 9 5 7
2011: 32 54 23 17 14 20 9
2012: 36 108 35 22 11 8 8
2013: 16 122 26 17 16 5 15 7
27 24
The paper
"Biomarkers expects high standards but
recognizes that it needs to be vigilant as
scientific research continues to be affected by
errors in the conduct and reporting of research
and by fraudulent research. There have been
reports of the high incidence of statistical
errors, poor statistical practice, and limitations
in the designs used in papers published by
peer-review journals."
Quotes from the paper
Biomarkers, therefore, starts from the position that it is
of paramount importance that studies published in a
peer-reviewed journal should have been correctly
designed, carried out, and reported, and that the
results are provided in such a way that the
experimental and statistical methods could be
repeated. This is also important for both economic
and ethical considerations.
Transparency in terms of the availability of and access
to the original raw data is a key component for the
critical assessment of evidence-based research.
Biomarkers, at present, does not have statistical guidelines. It does, though,
have instructions to authors, which provide sensible general requirements.
Table 1 provides a non-exhaustive list of some of the guidelines available. In
many specific technical areas, such as the -omics, technical reviews of
appropriate statistical methods have been produced; the referee can
reasonably expect that an author is aware of them, applies the methods, and
cites the publications where appropriate.
Biomarkers, therefore, expects to see evidence of the
planning that went into a study and to see statistical
analyses, which make full use of the design.
Examples would be details of the statistical analysis
plan (SAP), consideration of the primary endpoint,
and whether the primary aim of the study was
hypothesis testing or hypothesis generation. A failure
to declare in the “methods” section that blinding and
randomization were carried out would be interpreted
as implying that this had not been done. Details, such
as the type of randomization, e.g., block or stratified
and methods used for blinding, should be given when
relevant.
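As an illustration of what "type of randomization" might mean in a methods section, the following is a minimal sketch of permuted-block allocation. The function name, block size, arm labels and seed are all hypothetical choices for the example, not anything prescribed by the journal.

```python
import random

def permuted_block_allocation(n_subjects, block_size=4, arms=("A", "B"), seed=2014):
    """Allocate subjects to arms in randomly permuted blocks of fixed size."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)      # fixed seed so the list can be regenerated and audited
    allocation = []
    while len(allocation) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)         # random order within each block
        allocation.extend(block)
    return allocation[:n_subjects]

allocation = permuted_block_allocation(12)
print(allocation)
```

Each complete block contains every arm equally often, so group sizes stay balanced throughout recruitment. Reporting the block size and how the list was generated is the kind of detail the paragraph above asks for.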
Biomarkers expects to see a justification of the
sample sizes used and, where relevant, the
power calculations, which were carried out as
part of the development of the SAP for both
experimental and observational studies.
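A sample size justification can be as short as the sketch below, which uses the usual normal approximation for a two-sided comparison of two group means. The standardised effect size, significance level and power are assumed inputs chosen for illustration. A real SAP would justify each of them and might use a t-based or simulation-based calculation instead.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Subjects per group for a two-sided, two-sample comparison of means.

    Normal approximation; effect_size is the difference in means in SD units.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = z.inv_cdf(power)            # quantile corresponding to the target power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(0.5))    # 63 per group under the normal approximation
print(n_per_group(0.25))   # halving the effect size roughly quadruples the requirement
```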
uncritical use of hypothesis testing and the
reporting of results as either statistically
significant (p <0.05) or, preferably, with the
exact p values is not acceptable. Statistical
significance alone is not a justification for
publication.
It is, therefore, important to note that Biomarkers
policy is that well-designed studies, which
produce negative results, are viewed
favourably for publication. This policy also
meets the ICMJE (2008) obligation to publish
negative studies.
The Refereeing Challenge
An email arrives to the ‘volunteer’:
• Invitation to Review Manuscript ID xxx
• Recently, you agreed to review Manuscript ID
xxx. A previous e-mail was sent to you four
days ago as a reminder that your review was
due. We have yet to receive your review of
this manuscript.
Indication of the problem
• 42,328 journals listed in PubMed (Biomarkers is # 29,757)
• Estimated 1 million+ papers in them a year?
• “A reasonably mature journal like Neuroimage would
hope to see between 70% and 90% of submissions
rejected.”
• The Intergovernmental Panel on Climate Change
(IPCC) latest report is based upon 73,000
publications (25% of them in Chinese); 100-fold
increase in 30 years (Economist 5/4/14)
How long does it take to referee a
paper?
• Initial read through (30-40 minutes): locate the paper in the scientific universe
(main concepts/general theme). 10-20 minutes digesting; follow up legends of
figures and tables to see if they link in with what I read in the text. Is this
groundbreaking work or junk? Can I tell? (1 hour)
• Second read through (two days later) (1 hr): more detailed; identification of main
methods, any limitations, uncertainties, unanswered questions; link text
(especially conclusions) more closely to data, analysis and results. Identify figures
that don't match text or legends. Results seem odd? Gaining confidence in view that
the paper is uninteresting and should be rejected. First draft of referee report (1 hour)
• Third reading (next day) (40 mins): concentration on areas not completely
understood and to identify key points in referee's report to justify recommending
rejection rather than resubmission. Write up report and send back to editor (20 mins);
completion of final report to editor (1 hour)
• Total 3 hours plus time taken to access websites, remember passwords etc.
Follow on
Statistics: Guidelines are given in Lovell, D.P. (2012)
BiOMARKERS 17(3) 193-200.
In brief, BiOMARKERS expects authors to be aware of the
appropriate statistical analyses that should be used in their
specific field of research and prepare their submissions
accordingly. A statistical analysis plan should be available for
the studies reported and, if required, all relevant data and
analyses must be accessible to reviewers. Authors should
indicate how datasets used in the analyses will be maintained
and be willing to make their data available to other
researchers.
The author(s) responsible for statistical design and analysis must
be indicated as a point of contact on the title page by the #
symbol. If a statistician was employed for the analysis, but is
not an author, s(he) must be identified and have agreed that
their name and email address will appear in the
acknowledgements section.
Following on
• Citations?
• Effect on journal?
From Volume 19 2014 onwards:
# ***** **** and ****** *********
are responsible for statistical
design and analysis.
• Speed of response?
• How to monitor and follow up?
Personal examples
Example 1
What goes round comes round
• Paper refereed for one journal (rank #1)
• Identical re-submission to first journal
• Asked to referee by 2nd journal. Refused.
• Asked to referee by 3rd journal. Refused.
Example 2
“You can be one of the authors”
Reviewer Comments:
“A statistician with experience in systematic review and meta-analysis should be
consulted to assist with the analysis as the description given of the statistics and the
summary statistics provided are not correct.”
Authors’ Comments:
Reviewer clearly very upset with basic statistical analysis performed.
We do not have the knowledge to perform such analysis given that the individual
papers are so heterogenous. Unusual analysis required.
Likely lower impact journal would have accepted our statement!
Authors Suggested Action:
Suggest contact Dr. Lovell and offer authorship in return for review of statically (sic)
element of paper
“We have got your name from Prof ****** ******* because we need some support to
improve a systematic review paper on the long term issues associated with
********. Paper has been accepted with major revision in the American Journal of
********.” (Impact Factor 2.516)
Example 3
“Although there is a distinct grouping of CASE samples on
the left side of the map, there are many other samples that are
located throughout the Control samples, indicating these
samples have measures for these analytes that are more
similar to Controls than CASE.
Thus these analytes are not sufficient to differentiate between
all members of the 2 groups.”
Report from company bioinformatician
[Figure: Scatterplot of PC II vs PC I, by Group (CASE, Control)]
[Figure: Scatterplot of C103 vs C3, by Group (CASE, Control)]
[Figure: Scatterplot of PC II vs PC I, by Batch (1, 2)]
[Figure: Scatterplot of PC I vs Study ID, by Batch (1, 2)]
[Figure: Scatterplot of PC I vs Study ID, by Group (CASE, Control)]
“As you can imagine I am slightly distraught but have been in contact with
both the company and the statistician to look again at the data to quantify the
effect”
“I find it interesting there are two distinct groupings unlinked to my clinical
categorisation.”
“I did send two lots of samples separated by about a year but this is the only
difference - I wonder if they separate on ID number (1001-1050 roughly went
first) 1151 onwards went second. I was expecting there to be minimal
experimental variation as this is what the company promise so this would be an
important quality control issue to flag up.”
“…there were 18 months between batches so entirely possible this explains the
split. ”
All other methods of collection and storage remained the same as far as I am
aware.
• I am essentially going to bin these results
• Thanks for pointing it out - I at least can
resend them in one go and hopefully *******
will be able to give me a discount on principle
- if not the grant can take it luckily
• I am going to withdraw all the papers
obviously
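A quick check for the kind of batch split described in Example 3 is to compare how strongly the first principal component separates samples by batch and by clinical group. The sketch below uses simulated data and plain NumPy. The sample size, number of analytes and size of the batch shift are assumptions for illustration, not the study's values.

```python
import numpy as np

def pc_scores(X, n_components=2):
    """Principal component scores from the SVD of the column-centred data."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def separation(score, labels):
    """Absolute difference in mean score between two label levels, in pooled-SD units."""
    a, b = score[labels == 0], score[labels == 1]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

rng = np.random.default_rng(0)
n, p = 100, 20                           # hypothetical: 100 samples, 20 analytes
batch = np.repeat([0, 1], n // 2)        # first half of the samples sent first
group = rng.integers(0, 2, size=n)       # case/control status, unrelated to batch
X = rng.normal(size=(n, p))
X[batch == 1] += 1.5                     # simulated shift affecting every analyte in batch 2

pc1 = pc_scores(X)[:, 0]
batch_sep = separation(pc1, batch)
group_sep = separation(pc1, group)
print(f"PC I separation by batch: {batch_sep:.2f} SD")
print(f"PC I separation by group: {group_sep:.2f} SD")
```

When the batch figure dwarfs the group figure, the dominant structure in the data is experimental rather than clinical, which is the pattern the scatterplots by batch and by study ID revealed.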
Challenges and solutions?
Fraud and forensic bioinformatics
Potti et al and Duke University
http://arxiv.org/pdf/1010.1092.pdf
After thousands of hours of investigation, three clinical trials at Duke University in
Durham, North Carolina, were suspended in late 2009 because of the irreproducibility of
the genomic 'signatures' used to select cancer therapies for patients. Journals have a duty
to help the community by maintaining reproducibility as a cornerstone of the scientific
process.
“They also noted that the internal committees responsible for protecting
patients and overseeing clinical trials lacked the expertise to review the
complex, statistics-heavy methods and data produced by experiments
involving gene expression”
“That is a theme the investigating committee has heard repeatedly. The
process of peer review relies (as it always has done) on the goodwill of
workers in the field, who have jobs of their own and frequently cannot spend
the time needed to check other people's papers in a suitably thorough
manner. (Dr McShane estimates she spent 300-400 hours reviewing the
Duke work, while Drs Baggerly and Coombes estimate they have spent
nearly 2,000 hours.) Moreover, the methods sections of papers are
supposed to provide enough information for others to replicate an
experiment, but often do not. Dodgy work will out eventually, as it is found
not to fit in with other, more reliable discoveries. But that all takes time and
money.”
Economist Sep 10th 2011 http://www.economist.com/node/21528593
Challenges to Guideline approaches
• Academic science is better than GLP science?
• “But scientific reform is needed as well. For decades,
regulatory bodies have relied on guideline studies conducted
under national and internationally agreed standards known as
Good Laboratory Practice (GLP). This governs how the
studies are planned, performed, monitored, recorded, reported
and archived. These standards are invaluable, providing a
guarantee of reliability and cross-comparability for studies on
chemical safety. But the glacial pace of consensus building and
validation required to update guidelines can leave gaping holes
that allow the approval of chemicals of questionable safety.”
http://www.nature.com/nature/journal/v464/n7292/full/4641103b.html
“Moreover, detecting BPA's effects generally requires cutting-edge biological techniques
whose results, in the eyes of regulatory bodies, carry just a fraction of the weight of those
produced by a GLP study.”
Séralini et al (2012) paper
“The Editor-in-Chief again commends the
corresponding author for his willingness and openness
in participating in this dialog. The retraction is only on
the inconclusiveness of this one paper. The journal’s
editorial policy will continue to review all manuscripts
no matter how controversial they may be. The editorial
board will continue to use this case as a reminder to be
as diligent as possible in the peer review process.”
“Ultimately, the results presented (while not incorrect) are
inconclusive, and therefore do not reach the threshold of
publication for Food and Chemical Toxicology. The peer
review process is not perfect, but it does work. The journal is
committed to getting the peer-review process right, and at times,
expediency might be sacrificed for being as thorough as possible.
The time-consuming nature is, at times, required in fairness to
both the authors and readers. Likewise, the Letters to the
Editor, both pro and con, serve as a post-publication peer-review. The back and forth between the readers and the author
has a useful and valuable place in our scientific dialog.”
FCT (2014)
“Efforts to suppress scientific findings, or the appearance of
such, erode the scientific integrity upon which the public
trust relies. The retraction by the FCT marks a significant
and destructive shift in management of the publication of
controversial scientific research. Equally troublesome is that
this retraction does not really impact how the science will be
viewed by scientists, but only how it is viewed by others
outside of the scientific community. We feel the decision to
retract a published scientific work by an editor, against the
desires of the authors, because it is “inconclusive” based on
a post hoc analysis represents a dangerous erosion of the
underpinnings of the peer-review process, and Elsevier
should carefully reconsider this decision.”
Portier et al (2014) Inconclusive Findings: Now You See Them, Now You Don’t!
Environmental Health Perspectives volume 122 February 2014
Genetics
Economist 4/1/14
Illumina this week (15/1/14) claimed to be the first company to
achieve the coveted $1,000 genome.
• Genome-wide Association Studies (GWAS)
• Next Generation Sequencing (NGS)
• Analysis of exome chip, exome sequencing data and
whole genome sequencing data.
• Haplotype mapping, analysis of structural variation,
meta-analysis and gene-environment interaction
• Qualitative differences, stable within the individual
and over time (cancer/mutation etc)
Epigenetics
Marksists
• Something revolutionary.
• Studying all the marks left in the genome that
form the basis of epigenetics suggests a new
type of scientist is about to appear: Marksists.
Epigenetics
• Epigenetic marks
• Methylation of DNA bases, histone variation
• Switch genes on or off and/or regulate them
• Epigenome
• “The epigenome comprises all of the chemical compounds that have been added to
the entirety of one’s DNA (genome), but are not part of the DNA sequence, as a
way to regulate the activity (expression) of all the genes within the genome.”
• Tens of millions of methylation sites
• Pattern variable within individual and over time
• Chip ($200) 450,000 sites
• Inter-generational and trans-generational inheritance
• Three generation test (link to reproductive toxicity)
• Male-mediated teratogenesis
• “Sins of the grandmother”
Epigenome-wide association studies (EWAS)
• The investigation of the distribution of methyl groups at
thousands of specific DNA nucleotides across the genome to
identify arrangements that are common in a disease, or
associated with variation in a trait.
• “The problem with EWAS is that there’s so much more that
can confound an outcome compared with a GWAS,” (John
Greally)
• Epigenetic signatures which were thought to result from
ageing, instead reflected the changing proportions of blood cell
types with age.
• Methods for analysing chemical patterns on DNA show
promise for explaining disease, but few results have yet been
replicated.
Examples
• Stressful home life associated with shorter telomeres in a group of 9-year-old boys
• Psychotherapy can alter methylation
• Post-traumatic stress disorder (PTSD) “unusual profiles”
• People abused as children differ from those abused as adults
• Patterns related to suicide, successful dieting, US social status
• Drugs can alter epigenome
• Holocaust survivors v. no traumatic experience
• Hungerwinter study of the Dutch Famine Birth Cohort
• Maternal nutrition around the time of conception can affect the regulatory tagging of child’s
DNA
• Marks left by: smokers, ex-smokers, food
• Diesel fumes, pesticides, arsenic ‘produce distinct patterns’
• Male mice with folate deficiencies ‘reprogram sperm’
• Markers associated with rodent stress in early life correlated with certain aversive behaviours;
same marks could be found in their offspring, and in some cases, in their offspring’s offspring
• Methylation profile involving around 400 sites gives five years' warning of the onset of breast
cancer
http://www.economist.com/news/science-andtechnology/21591547-lack-folate-diet-male-mice-reprogramstheir-sperm-ways
Skinner, M. K. (2008). What is an epigenetic
transgenerational phenotype? F3 or F2. Reprod.
Toxicol. 25, 2–6.
Male mice of great grandmothers who have been
exposed to PCB have lower sperm counts than others
whose great grandmothers weren’t (Poscar et al,
2013)
How do you design studies that control all the
confounders over three generations?
Discussion Points
• Is peer review feasible if every paper can be
published somewhere?
• Can guidelines by themselves be used to ‘police’ the
literature?
• Is it realistically possible to review multi-author/multi-disciplinary work?
• Is it possible to ensure quality prospectively or
retrospectively?
• How (or even should we be trying) to maintain
quality as the scientific output becomes increasingly
global/multi-polar?
• Should reproducibility be a pre-condition for
publication?
• ‘Black boxes’?
• Bioinformatics and/or statistics?
• Use of Check Lists?
• Perception of statistics as a tool (technical rather than
scientific)?
• “I’m a molecular biologist, we don’t need statistics”
Nature 13th February 2014
Nature Editorial 13th February 2014
“Too many researchers have an incomplete or outdated sense of what is necessary in
statistics; this is a broader problem than misuse of the P value. Among the most common
fundamental mistakes in research papers submitted to Nature, for instance, is the failure
to understand the statistical difference between technical replications and independent
experiments.”
"Department heads, lab chiefs and senior scientists need to upgrade a good working
knowledge of statistics from the ‘desirable’ column in job specifications to ‘essential’.
But that, in turn, requires universities and funders to recognize the importance of statistics
and provide for it."
"Good statistics can no longer be seen as something that makes science better — it is a
fundamental requirement, and one that can only grow in importance as funding cuts bite
and competition for resources intensifies."
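The distinction the editorial draws between technical replications and independent experiments can be shown with a few lines of NumPy. The readings below are hypothetical: three independent experiments, each measured four times. The sketch compares the standard error obtained by treating all twelve readings as independent with the one obtained from the three experiment means.

```python
import numpy as np

# Hypothetical readings: 3 independent experiments (rows) x 4 technical replicates (columns)
readings = np.array([
    [10.1,  9.9, 10.0, 10.0],
    [12.1, 11.9, 12.0, 12.0],
    [11.0, 11.1, 10.9, 11.0],
])

def standard_error(values):
    """Standard error of the mean, treating every supplied value as independent."""
    values = np.asarray(values, dtype=float).ravel()
    return values.std(ddof=1) / np.sqrt(values.size)

naive_se = standard_error(readings)                  # wrongly uses n = 12
correct_se = standard_error(readings.mean(axis=1))   # one value per experiment, n = 3

print(f"SE treating replicates as independent: {naive_se:.3f}")
print(f"SE based on experiment means:          {correct_se:.3f}")
```

With these numbers the naive standard error is less than half the correct one, so any test built on it would overstate the evidence. Technical replicates improve the precision of each experiment's value but do not add independent observations.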
"Correctable weaknesses in the design, conduct, and analysis of biomedical and public
health research studies can produce misleading results and waste valuable resources.
Small effects can be difficult to distinguish from bias introduced by study design and
analyses. An absence of detailed written protocols and poor documentation of research is
common. Information obtained might not be useful or important, and statistical precision
or power is often too low or used in a misleading way. Insufficient consideration might
be given to both previous and continuing studies. Arbitrary choice of analyses and an
overemphasis on random extremes might affect the reported findings. Several problems
relate to the research workforce, including failure to involve experienced statisticians
and methodologists, failure to train clinical researchers and laboratory scientists in
research methods and design, and the involvement of stakeholders with conflicts of
interest. Inadequate emphasis is placed on recording of research decisions and on
reproducibility of research. Finally, reward systems incentivise quantity more than
quality, and novelty more than reliability. We propose potential solutions for these
problems, including improvements in protocols and documentation, consideration of
evidence from studies in progress, standardisation of research efforts, optimisation and
training of an experienced and non-conflicted scientific workforce, and reconsideration
of scientific reward systems.”
Ioannidis et al (2014) Research: increasing value, reducing waste 2
Published Online January 8, 2014 http://dx.doi.org/10.1016/S0140-6736(13)62227-8
New Scientist Survey
• N =122 (out of 1000 stem cell researchers)
• 55% thought stem cell research is put under
more pressure than other areas of biomedical
science
http://www.newscientist.com/articleimages/mg22129623.400/1-stem-cell-scientistsreveal-unethical-work-pressures.html
New Scientist 29/3/14
http://www.newscientist.com/data/doc/article/dn25281/stemcellsurveypdf1.pdf
The importance of transparent
reporting of biomarker studies
Doug Altman
Centre for Statistics in Medicine
University of Oxford
The importance of transparent
reporting
• Research only has value if
– Study methods have validity
– Research findings are published in a usable form
• The goal should be transparency
– Should not mislead
– Should allow replication (in principle)
– Can be included in systematic review and meta-analysis
Biomarker studies:
Focus on studies of prognosis
Prognosis refers to the risk of future health outcomes
in individuals or groups with a given disease or health
condition
The study of prognosis has never been more important
– more people are living with conditions impairing
health due to improvements in life expectancy
Understanding and improving prognosis is pivotal to
the practice of clinical medicine
Prognostic information is increasingly used by
clinicians to help manage patients
Prognostic research themes
1) Fundamental prognosis research
The course of health related conditions in the context of the nature
and quality of current care
2) Prognostic factor research
Specific factors (such as biomarkers) that are associated with
prognosis
3) Prognostic model research
The development, validation, and impact of statistical models that
predict individual risk of a future outcome
4) Stratified medicine research
The use of prognostic information to help tailor treatment decisions
to an individual or group of individuals with similar characteristics
[Hemingway et al. BMJ 2013]
Prognostic factor research
Aims to identify factors associated with subsequent
clinical outcome in people with a particular disease or
health condition
Examples:
• Biological (biomarkers)
– genomic
– proteomic
– imaging
– physiological variables
• Others
– psychosocial (e.g. depression)
– ecological (e.g. area-level social deprivation)
Hamilton et al, J Transl Med 2010
Prognostic importance of a single
specific prognostic factor/marker
• A clear view of the benefit of a marker is only likely
to emerge from looking across multiple studies
– Systematic review
• We should by now know the prognostic importance
of numerous markers that have been extensively
investigated for many cancers and other diseases
– Why don’t we?
Example: p53 as a prognostic
marker in bladder cancer
Systematic review of literature
 168 published studies
 >10000 patients
“After 10 years of research, evidence is not sufficient
to conclude whether changes in P53 act as markers of
outcome in patients with bladder cancer.”
[Malats et al, Lancet Oncology 2005]
Example: Ki-67 in Breast cancer
Systematic review
• 43 studies
• >15,000 patients
• Some evidence of publication bias
“Whether these proliferation markers provide
additional prognostic information to commonly
used prognostic indices remains unclear.”
[Stuart-Harris et al, Breast 2008]
Evidence from systematic reviews
that the quality of prognostic factor
research needs to improve
Coronary disease
“Multiple types of reporting bias, and publication bias, make the
magnitude of any independent association between CRP and prognosis
among patients with stable coronary disease sufficiently uncertain that
no clinical practice recommendations can be made.”
[Hemingway et al, PLoS Med 2010]
Osteosarcoma
“93 papers were studied in depth … Only 7 papers were of sufficient
quality to analyze ... Because of heterogeneity of the studies, pooling
results is hardly possible.” [Bramer et al, Eur J Surg Oncol 2009]
Peptic ulcer perforation
“Fifty prognostic studies with 37 prognostic factors comprising a total of
29,782 patients were included in the review. The overall methodological
quality was acceptable, yet only two-thirds of the studies provided
confounder adjusted estimate” [Moller et al, Scand J Gastroenterol]
Multiple studies
• Clinical and methodological heterogeneity
– Different patient groups
– Different assays/measurement techniques
– Variation in cutpoints
– Adjustment for different other variables (or none)
… leading to
– confusion
– amplification of biases
• Results are probably not reliable even if there is
apparently a clear picture
– More studies may make things worse!
Publication bias
• “… the literature is probably cluttered with false-positive
studies that would not have been submitted
or published if the results had come out differently.”
[Simon, 2001]
“Together with the long recognized problem of publication
bias favoring studies that report positive findings, the result
may be a body of literature that is heavily influenced by
false-positive findings.”
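The mechanism behind these two quotations can be illustrated with a small simulation. Every number in it is an assumption chosen for the example: 1000 marker studies, 10% of markers with a real standardised effect of 0.3, 50 patients per group, and publication only when p < 0.05.

```python
import numpy as np

rng = np.random.default_rng(1)

n_markers = 1000      # hypothetical number of marker studies
share_true = 0.10     # assumed share of markers with a real effect
effect = 0.3          # assumed standardised effect for real markers
n_per_arm = 50        # assumed patients per group in each study

is_true = rng.random(n_markers) < share_true
se = np.sqrt(2 / n_per_arm)                        # SE of a standardised mean difference
estimate = rng.normal(np.where(is_true, effect, 0.0), se)
significant = np.abs(estimate / se) > 1.96         # two-sided test at the 5% level

published = significant                            # only 'positive' studies reach print
false_positive_share = np.mean(~is_true[published])
published_true_effect = np.mean(np.abs(estimate[published & is_true]))

print(f"Published studies: {published.sum()}")
print(f"Share of published findings that are false positives: {false_positive_share:.0%}")
print(f"Mean published effect for real markers: {published_true_effect:.2f} (true value {effect})")
```

Under these assumptions a large share of the published literature consists of false positives, and even the real markers appear stronger in print than they are. The exact figures depend entirely on the assumed inputs.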
Bcl-2
Martin et al, BJC 2003
Hemingway et al, PLoS Med 2010
83 studies of C-reactive protein in stable coronary artery disease
Prognostic factor research
Limitations
• Small samples
• Poor statistical analysis
– adjustment for known predictors, handling of continuous
variables
• Heterogeneous laboratory methods
• Lack of replication
• Poor publication practices
– Inadequate reporting
– Selective publication
• Reliable answers require better studies
– especially planned collaborative studies leading to IPD meta-analysis
Reporting guidelines
JNCI, BJC, JCO, EJC 2005
REMARK: REporting guidelines for
tumor MARKer prognostic studies
Recommended reporting elements to facilitate
– Evaluation of appropriateness & quality of study
design, methods, and analysis
– Understanding of context in which conclusions apply
– Reproducibility
– Comparisons across studies, including formal meta-analyses
REMARK checklist elements
Introduction
• Markers examined
• Study objectives
Methods
• Patients
• Specimen characteristics
• Assay methods
• Study design
• Statistical analysis methods
Results
• Data
• Analysis & presentation
Discussion
• Interpretation
• Implications
20 items in total
REMARK Item 17:
“Among reported results, provide estimated effects
with confidence intervals from an analysis in which the
marker and standard prognostic variables are included,
regardless of their statistical significance”
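For a marker fitted in a Cox or logistic model alongside standard prognostic variables, the estimated effect and its confidence interval follow directly from the coefficient and its standard error. The sketch below assumes a Wald interval on the log scale. The coefficient and standard error are hypothetical values, not results from any study cited here.

```python
from math import exp
from statistics import NormalDist

def hazard_ratio_ci(log_hr, se, level=0.95):
    """Hazard ratio with a Wald confidence interval computed on the log scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return exp(log_hr), exp(log_hr - z * se), exp(log_hr + z * se)

# Hypothetical adjusted coefficient for the marker from a multivariable model
hr, lower, upper = hazard_ratio_ci(log_hr=0.47, se=0.20)
print(f"HR {hr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Reporting the interval for every marker, significant or not, is what allows later systematic reviews to pool results.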
129 articles
36% included the marker in a multivariable model with
standard clinical variables
Vickers et al, Cancer 2008
Reporting of prognostic studies – Pre-REMARK
First 10 articles, 5 high-profile cancer journals, 2006-7

REMARK item                            Reported
Number of patients overall
  Assessed for eligibility             56%
  Excluded                             54%
Number available for analysis
  Patients                             98%
  Events                               50%
Number in univariable analysis
  Patients                             54%
  Events                               21%
Numbers in multivariable analysis
  Patients                             54%
  Events                               30%

Mallett et al, BJC 2010
Prognostic model research
A prognostic model is a formal combination of multiple
prognostic factors from which risks of a specific
endpoint can be calculated for individual patients.
Also called:
prognostic (or prediction) index
prognostic (or prediction) rule
risk (or clinical) prediction model
Prognostic model research
Uses
• Clinical practice
– Communication with patients/relatives
– Risk stratification
• Design and analysis of clinical trials
• Case mix adjustment
Prognostic model research
Major steps:
• Development: identification and combination of variables
associated with outcome
• External validation: evaluate the model’s predictive ability
in a different population
• Impact: evaluate the impact of the use of the prognostic
model on health outcomes
Published Prediction Models
111 models for prostate cancer (Shariat 2008)
102 models for traumatic brain injury (Perel 2006)
83 models for stroke (Counsell 2001)
54 models for breast cancer (Altman 2009)
43 models for type 2 diabetes (Collins 2011; van Dieren 2012)
20+ more models have since been published!
31 models for osteoporotic fracture (Steurer 2011)
Omitted FRAX due to insufficient information
29 models in reproductive medicine (Leushuis 2009)
26 models for hospital readmission (Kansagara 2011)
>25 models for length of stay after cardiac surgery (Ettema 2010)
13 models for tooth decay (Ritter 2010)
Very few of these models have been ‘validated’ in new data and compared
Prediction Models in UK Clinical Guidelines

Framingham Risk Score & QRISK2 (NICE CG67)
– 10-year CVD risk

Nottingham Prognostic Index (NICE CG80)
– Recurrence & survival in breast cancer patients

FRAX & QFracture (NICE CG146)
– 10-year osteoporotic and hip fracture risk

GRACE/PURSUIT/PREDICT/TIMI (NICE CG94)
– Adverse CV outcomes in patients with UA/NSTEMI

APGAR (NICE CG132/2)
– Newborn prognosis

SAPS & APACHE (NICE CG50)
– ICU scoring systems

Leicester Diabetes Risk Score, QDSCORE, Cambridge Risk score (NICE PH38)
– Type 2 diabetes
Model development
 Select important candidate predictors
– Avoid selection based on statistically significant
univariable associations with outcome
 Appropriately handle (acknowledge) missing data
 Fit a multivariable model
 Estimate the predictive performance
– Calibration and discrimination
– Quantify any optimism from overfitting
• Use bootstrapping (avoid randomly splitting a dataset)
 Prediction models should be presented in adequate
detail to allow predictions in individuals, either for
subsequent validation studies or in clinical practice
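The bootstrap step above can be sketched as follows. This is a minimal illustration of optimism correction for the c-index, not a recommended modelling approach: the "model" is a deliberately crude stand-in (one weight per predictor, equal to the difference in means between patients with and without the event), and all function names and the simulated data are hypothetical. In a real analysis the whole model-building process, including any predictor selection, would be repeated in each bootstrap sample.

```python
import random

def c_index(scores, outcomes):
    """Share of (event, non-event) pairs in which the event has the higher score; ties count 0.5."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    nonevents = [s for s, y in zip(scores, outcomes) if y == 0]
    total = 0.0
    for e in events:
        for n in nonevents:
            total += 1.0 if e > n else 0.5 if e == n else 0.0
    return total / (len(events) * len(nonevents))

def fit(X, y):
    """Stand-in model (assumption): weight = mean in events minus mean in non-events."""
    ev = [row for row, yi in zip(X, y) if yi == 1]
    ne = [row for row, yi in zip(X, y) if yi == 0]
    return [sum(r[j] for r in ev) / len(ev) - sum(r[j] for r in ne) / len(ne)
            for j in range(len(X[0]))]

def predict(weights, X):
    """Linear score for each patient."""
    return [sum(w * x for w, x in zip(weights, row)) for row in X]

def optimism_corrected_c(X, y, n_boot=200, seed=1):
    """Return (apparent c, mean optimism, optimism-corrected c)."""
    rng = random.Random(seed)
    apparent = c_index(predict(fit(X, y), X), y)
    n = len(y)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:                        # skip samples lacking events or non-events
            continue
        w = fit(Xb, yb)                             # refit the model in the bootstrap sample
        boot_c = c_index(predict(w, Xb), yb)        # performance in the bootstrap sample
        orig_c = c_index(predict(w, X), y)          # same model applied to the original data
        optimism.append(boot_c - orig_c)
    mean_optimism = sum(optimism) / len(optimism)
    return apparent, mean_optimism, apparent - mean_optimism

# Simulated data (hypothetical): 60 patients, 5 predictors, only the first is informative.
_rng = random.Random(0)
X = [[_rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [1 if row[0] + _rng.gauss(0, 1) > 0 else 0 for row in X]
apparent, optimism, corrected = optimism_corrected_c(X, y)
print(f"apparent c = {apparent:.3f}, optimism = {optimism:.3f}, corrected c = {corrected:.3f}")
```

Unlike a random split, every patient contributes to both model development and the estimate of optimism.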
Why do we need to
validate a model?
 Deficiencies in design of prognostic studies
 Deficiencies of standard modelling methods
 Models may not be transportable
– over-optimism because of data-dependent analysis choices
– variation in ‘case-mix’
 Performance cannot be predicted
– Need empirical demonstration of model performance
 Usefulness is determined by how well a model works
in practice, not by P values
 An important feature of validation is to provide an
unbiased estimate of prediction error
Poor reporting … and poor conduct
Reviews of published studies
 Diabetes (Collins et al, BMC Med 2011)
 Cancer (Mallett et al, BMC Med 2010)
 Kidney disease (Collins et al, J Clin Epidemiol 2012)
 General medical journals (Bouwmeester et al, PLoS
Med 2012)
 Breast cancer (Altman, Cancer Invest 2009)
 Missing data in prognosis studies (Burton, Br J
Cancer 2004)
and many more….
Conclusions from the systematic
reviews
 Poor reporting
– Number of events often difficult to identify
• Candidate predictors (and number) inadequately defined
– Insufficient information to determine events per variable (EPV)
• 40% of studies (Mallett 2010); 33% of studies (Collins 2011)
– How candidate predictors were selected
• Unclear in 25% of studies (Bouwmeester 2012)
– How the multivariable model was derived
• Unclear in 77% of studies (Mallett 2010)
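To show why both the number of events and the number of candidate predictors must be reported, here is a minimal sketch of the events per variable (EPV) calculation. The function name and the example numbers are hypothetical; the sketch assumes EPV is the number of outcome events divided by the number of candidate predictor parameters considered, not just those retained in the final model.

```python
def events_per_variable(n_events, n_candidate_parameters):
    """EPV = outcome events / candidate predictor parameters considered."""
    if n_candidate_parameters <= 0:
        raise ValueError("need at least one candidate parameter")
    return n_events / n_candidate_parameters

# Hypothetical study: 60 events, 12 candidate predictor parameters.
epv = events_per_variable(60, 12)
print(f"EPV = {epv:.1f}")
```

If either count is missing from a paper, a reader cannot reproduce this number.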
Conclusions from the systematic
reviews
 Poor reporting
– Missing data rarely mentioned
(41% Collins 2010; 45% Collins 2012)
– Missing data is often an exclusion criterion (but often not
specified)
– Complete-case analysis usually carried out
 Model often not reported in full
– intercept missing for logistic regression,
– baseline survival missing for Cox regression models
 Ranges of continuous predictors rarely reported
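A short sketch of why the intercept and the baseline survival matter: without them the regression coefficients give only relative effects, and no absolute risk can be calculated for an individual. The function names and all coefficient values below are hypothetical, and the Cox example assumes the reported baseline survival S0(t) uses the same predictor coding (no centring) as the coefficients.

```python
import math

def logistic_risk(intercept, coefs, x):
    """Predicted probability from a logistic model: 1 / (1 + exp(-(intercept + sum(b * x))))."""
    lp = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-lp))

def cox_risk(baseline_survival_t, coefs, x):
    """Predicted risk by time t from a Cox model: 1 - S0(t) ** exp(sum(b * x))."""
    lp = sum(b * v for b, v in zip(coefs, x))
    return 1.0 - baseline_survival_t ** math.exp(lp)

# Hypothetical logistic model: intercept -2.0, coefficients 0.5 and 1.0.
print(logistic_risk(-2.0, [0.5, 1.0], [2, 1]))

# Hypothetical Cox model: S0(5 years) = 0.9 and a hazard ratio of 2 for a binary predictor.
print(cox_risk(0.9, [math.log(2)], [1]))
```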
Conclusions from the systematic
reviews
 Methodological shortcomings including
– Small sample size (number of events) [EPV<10]
– Large number of candidate predictors
– Calibration rarely assessed
• 74% not done (Collins); 46% not done (Bouwmeester)
– Dichotomization of all/some continuous predictors
• 63% of studies (Collins); 70% of studies (Mallett)
– Previously published models often ignored
– Inadequate validation
• Reliance on random-split (often using an already small dataset) to
validate
 Lack of comparisons of competing models on same
dataset
– Siontis et al, BMJ (2012), Collins & Moons, BMJ (2012)
External validation
 Separate dataset (not a random split)
– Different centres (geographical validation)
– Different time period (temporal validation)
– Different case-mix
– Possibly with different definitions of predictors and outcome
 Ideally conducted by independent researchers
Evaluating model performance
 Performance of prediction models is characterised
by
– Calibration
• agreement between observed outcomes and predictions
– Often ignored
– Preferably assessed graphically
– Discrimination
• ability to distinguish between patients who do and do not
experience the event of interest
– Usually reported (c-index)
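The two measures can be illustrated on a toy example. This is a sketch for a binary outcome, with hypothetical function names and made-up data: the c-index is computed over all (event, non-event) pairs, and the observed/expected ratio is used here as one simple overall summary of calibration. A graphical assessment remains preferable.

```python
def c_index(preds, outcomes):
    """Discrimination: share of (event, non-event) pairs where the event has the higher prediction."""
    events = [p for p, y in zip(preds, outcomes) if y == 1]
    nonevents = [p for p, y in zip(preds, outcomes) if y == 0]
    total = 0.0
    for e in events:
        for n in nonevents:
            total += 1.0 if e > n else 0.5 if e == n else 0.0  # ties count one half
    return total / (len(events) * len(nonevents))

def observed_expected_ratio(preds, outcomes):
    """Overall calibration: observed events divided by the sum of predicted risks."""
    return sum(outcomes) / sum(preds)

# Made-up predictions and outcomes for five patients.
preds = [0.1, 0.2, 0.3, 0.6, 0.8]
outcomes = [0, 0, 1, 0, 1]
print(c_index(preds, outcomes))                  # 5 of 6 pairs are concordant
print(observed_expected_ratio(preds, outcomes))  # 2 observed events vs about 2 expected
```

A model can discriminate well and still be poorly calibrated, which is why reporting the c-index alone is not enough.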
Calibration plot for a scoring system for predicting postoperative nausea
and vomiting (PONV) [van den Bosch et al 2005]
Review of published validation studies
[Collins et al, 2014]
Reviewed 78 articles that evaluated 120 prediction
models in participant data that were not used to
develop the model
 16% did not report number of outcome events in the
validation dataset
 54% made no explicit mention of missing data
 67% did not report evaluating model calibration
Transparent Reporting of multivariable
models for Individual Prognosis
Or Diagnosis (TRIPOD)
 Consensus-based guidelines for improving the
quality of reporting of multivariable prediction
modelling studies
 Focus on reporting (but much attention to
methodological conduct in the long Explanation and Elaboration (E&E) paper)
 Steering group:
Gary Collins (Oxford)
Karel Moons (UMC Utrecht)
Doug Altman (Oxford)
Hans Reitsma (UMC Utrecht)
TRIPOD checklist elements
Title and abstract
Introduction
 Background & Objectives
Methods
 Source of data
 Participants
 Outcome
 Predictors
 Sample size
 Missing Data
 Statistical analysis methods
 Risk groups
 Development versus validation
Results
 Participants
 Model Development
 Model Presentation
 Model Performance
 Model Updating
Discussion
 Limitations
 Interpretation
 Implications
Other information
22 items in total
TRIPOD
 Key minimal information deemed important to
report
 Help authors, peer reviewers, editors, readers and
potential users
 Educational – providing guidance, cautioning against
particular approaches
 Improved evaluation of risk of bias (PROBAST) if more
information is reported
 Submitted for publication in March 2014
Published prognostic studies
 Poor methods are widely used
– Exploratory studies presented as if confirmatory
 We need high quality reporting so we can identify
and discard bad studies
– REMARK, TRIPOD, …
 Other initiatives…
“Across many types of research, accumulating evidence of bias
has led to increasing support for greater transparency,
especially relating to registration, publication of full protocols,
and adherence to reporting guidelines. None of these will solve
all the problems, but certainly all will help.”
Assessing risk of bias (QUIPS tool)
QUIPS: 6 bias domains
1. Participation
2. Attrition
3. Prognostic factor measurement
4. Confounding measurement and account
5. Outcome measurement
6. Analysis and reporting
Phases in prediction model research
 Development
– Predictor selection, model building
– Internal validation (evaluating optimism)
• Split sample (random): inefficient, not very useful
• Cross-validation
• Bootstrapping
 Evaluate performance (external validation)
– Temporal & geographical validation
– Independent validation (i.e. independent investigators)
 Impact study
– Does the prognostic model improve patient outcomes?
– Does the prognostic model change clinician behaviour?
– Is the prognostic model cost-effective?
Assessing performance: comparing
observations with predictions
 Comparison of observed and predicted event rates
for groups of patients (calibration)
– Can plot observed proportions of events against predicted
probabilities
– Ideally observed and predicted proportions agree over the
whole range of probabilities, and plot shows a 45° line
– Can fit model:
observed mortality = a + b × risk score
 Measures that distinguish between patients who do
or do not experience the event of interest
(discrimination)
– Discrimination is often assessed in a graph or in a table
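The calibration comparison above can be sketched as follows: patients are grouped by predicted risk, the observed proportion of events is compared with the mean predicted risk in each group, and a straight line a + b × risk is fitted. The function names and data are hypothetical, and ordinary least squares on grouped proportions is used here only as a simple stand-in for the model on the slide. Perfect agreement gives a = 0 and b = 1 (the 45° line).

```python
def calibration_groups(preds, outcomes, n_groups=10):
    """Sort by predicted risk, split into roughly equal groups, return (mean predicted, observed proportion)."""
    pairs = sorted(zip(preds, outcomes))
    n = len(pairs)
    groups = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if chunk:  # skip empty groups when n < n_groups
            mean_pred = sum(p for p, _ in chunk) / len(chunk)
            observed = sum(y for _, y in chunk) / len(chunk)
            groups.append((mean_pred, observed))
    return groups

def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up grouped results: predictions are too extreme, so the slope is below 1.
mean_predicted = [0.1, 0.3, 0.5, 0.7, 0.9]
observed_prop = [0.30, 0.40, 0.50, 0.60, 0.70]
a, b = fit_line(mean_predicted, observed_prop)
print(f"a = {a:.2f}, b = {b:.2f}")
```

Plotting the grouped points against the 45° line shows where in the risk range the model over- or under-predicts, which a single summary number cannot.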