Research and data integrity

Research Integrity:
The Importance of Data Acquisition
and Management.
Jennifer E. Van Eyk, Ph.D.
Prof. Medicine, Biol. Chem. and BME
Director, JHU Bayview Proteomics Center
Director, JHU ICTR Biomarker Development Center
JHU NHLBI Innovative Proteomics Center on Heart Failure
Principles of Research Data Integrity
• Research integrity depends on data integrity.
– Includes all aspects of collection, use, storage and
sharing of data.
• Data integrity is a shared responsibility.
– Everyone involved in the research is responsible.
– The ultimate responsibility belongs to the PI.
– However, there is a broader role and responsibility
for the institute and scientific community.
• Transparency of the research data is required.
Free and accurate information exchange is fundamental
to scientific progress.
Data integrity can be compromised numerous ways.
i) malicious proprietors,
ii) human mistakes and naivety,
iii) technical error.
Data integrity is based on accurate
and traceable:
i) collection,
ii) recording,
iii) storage,
iv) reporting.
The Consequences of Failure
Personal loss
Blocked scientific progression
Impaired technology development
Damage to the institution and sponsors
Tarnished public perception of science
Damage to or loss of patent protection
Clinical trials based on genomic
selection: Duke University
Two genomic studies from the same multi-disciplinary group (Potti and Nevins) led to three clinical trials.
All clinical trials have ultimately been suspended.
1. Papers using cancer tissue (Potti et al., N Engl J Med 2006;355:570) and cell-based approaches (Hsu et al., J Clin Oncol. 2007;25:4350;
Potti, A. et al. Nat. Med. 2006 12, 1294–1300) were published with a lot of hype.
2. Issues were raised by K. Baggerly and/or K. Coombes (M.D. Anderson Cancer Center), based on publicly available data (Annals of
Applied Statistics, 2009:3:1309 and Nat Med. 2007;13:1276), as well as by others.
3. Duke argues the mistakes were “clerical errors” that do not alter the fundamental conclusions of the papers.
4. NCI/CTEP required LMS to be tested in a blinded pre-validation study. It failed. The predictor was altered after correction for having been
carried out in two different labs. Trials had to be randomized and blinded to minimize sensitivity of the predictor to laboratory effects.
5. Due to continued concerns, NCI requests all computer code and data preprocessing in order to try to replicate the earlier finding. It
failed. Using the predictor for randomization stratification was stopped in the trial.
6. Duke carried out an internal review and reopened the trial.
7. NCI determines it is partially funding another trial based on a different paper (chemo-sensitivity). Issues were discovered with
respect to differences between the data used to build the predictor and the data used for validation. Trials were ultimately suspended.
8. Potti’s academic credentials were found to be falsified. Duke acts.
9. Duke statement indicates that, with respect to the validation studies, the sensitivity labels are wrong, the sample labels are wrong and the
gene labels are wrong, making it “wrong” in a way that could lead to assignment of patients to the wrong treatment.
10. The institute directed by co-author J. Nevins (Duke Institute for Genomic Science and Policy) is closed (“due reorganization”), as is the
Center for Applied Genomics and Technology in which Dr. Anil Potti was based. Hsu D is still publishing.
11. Papers were retracted (JCO Dec 2010;28:5229 and N Engl J Med. Mar 2011;364:1176).
12. Institute of Medicine (IOM) committee struck for independent review and recommendations.
13. FDA audit at Duke starting in 2011.
The Cancer Letter, edited and published by P. Goldberg
Learn From Mistakes
• Mentorship (do no harm)
• Oversight, training
• Verification and more verification
• Development of multiple processing pipelines
• Patience and wisdom in applying translation
• Ensure benefit to patients (do no harm)
• Being practical while avoiding potential errors
– The data
• Individual Responsibilities
– Data Management
• Data Collection
• Data Storage
– Data Interpretation
• Data interpretation and publication in a changing world of translation
– The reality of translational science
• Challenges
• Role of Core Facilities
– The role of the scientific community
• Journals
• Scientific organizations
• Round table discussion on the research at Johns Hopkins
The Data
Fundamental to research
Basis for writing papers
Important for experiment
Meet contractual/funding
Settle intellectual property
Defense against a charge
of fraud
Images from the front covers of Circulation Research – S. Elliott (Van Eyk Lab)
Individual responsibility
Data Management
Four aspects to consider before starting
your data collection:
1. Ownership
2. Collection
3. Storage/protection of confidentiality/sharing
4. Interpretation and publication
Whose data is it?
• Custody does not imply ownership.
• Custody remains with investigator (PI) but JHU
owns all data. But, others have rights.
– Funders
– Other data sources
If there is intellectual property or the research
was funded by a sponsored research
agreement with a company, who owns the data?
Data Collection
• Depends on the type of raw data
• Notebooks – day to day or specialized types of
• Images
• Generated numbers and information
Goal is to preserve raw data, transparent
processing of data, unbiased interpretation and
representation of data.
Data integrity in the digital age
“With the emergence of web-based lab notebooks, digital image “enhancement”, and
the quick and easy (and possibly dirty) generation and dissemination of colossal
amounts of data, it’s becoming increasingly clear that technology provides new
challenges to maintaining scientific integrity. In an attempt to tame the beast while it
still has its baby teeth, the US National Academy of Sciences released a report today
that provided a framework for dealing with these challenges: “Ensuring the integrity,
accessibility and stewardship of research data in the digital age.”
“One theme, that threads through many fields: the primacy of scrupulously
recorded data. Because the techniques that researchers employ to ensure the
integrity—the truth and accuracy—of their data are as varied as the fields
themselves, there are no universal procedures for achieving technical accuracy. The
term “integrity of data” also has a structural meaning, related to the data’s
preservation and presentation.”
“Broadly accepted practices for generating and analyzing research must be shown
to be reproducible in order to be credible. Other general practices include checking
and rechecking data to confirm their accuracy, validity and also submitting data and
results to peer review to ensure that the interpretation is valid.”
What evidence proves the 67 kDa band is the
same data as the 32 kDa band?
How can you show that three lanes are the same data?
Gross manipulation of blots
What image should be published?
Misrepresentation of immunogold data. The gold particles, which were actually
present in the original (left), have been enhanced in the manipulated image (right).
Note also that the background dot in the original data has been removed in the
manipulated image. Example provided by Journal of Cell Biology.
Data Forensics
• Can only "de-authenticate" an image (indicate discrepancies).
• Authentication requires access to the original data.
• The identification of a discrepancy is an allegation, and does not
mean there was an intentional falsification of data.
• The interpretation of whether any image manipulation is serious
requires familiarity with the experiment(s) and imaging
Data Forensic Tools are employed by journals in a
manner similar to tools used to detect plagiarism.
Office of Research Integrity
US Department of Health and Human Services
Data Storage, Protection & Sharing
• Raw data needs to be stored
– Lab notebooks should be stored in a safe place
– Computer files should be backed up
– Protected and limited access to computer raw data
• Samples should be saved appropriately so they will not
degrade over time.
• Data and experiment information should be available after
• Data should be retained for a reasonable period of time.
Dilemma: When a postdoctoral fellow (PDF) or clinical fellow leaves a lab
(especially if the paper is not written up), where does
the data stay?
How long?
Retain study records and records of disclosures of study information:
– For IRB clinical trial
• Retain records for 7 years after last subject completed study OR 7 years after date
of last disclosure of identifiable health information from study records.
• If research subject is a child, retain until subject reaches age of 23
– For Investigational New Drug (IND) research –
• Retain records for 2 years after marketing application approved for new drug or
until 2 years after shipment and delivery of drug for investigation use is
– For Investigational Device Exemption (IDE) research
• Retain records for 2 years after the latter of the following two dates: date on which
investigation is terminated or completed or date on which records no longer
required for purposes of supporting a premarket approval application or notice of
completion of a product development protocol.
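As a rough illustration of the IRB clinical trial rule above, the sketch below computes a candidate retention date. It is not a statement of policy: the function names are hypothetical, and it assumes (conservatively) that when several conditions apply, the latest resulting date is the one to use.

```python
from datetime import date
from typing import Optional


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def irb_retention_date(last_subject_completed: date,
                       last_disclosure: Optional[date] = None,
                       child_birth_date: Optional[date] = None) -> date:
    """Candidate date until which IRB clinical trial records are retained.

    Assumption (not stated on the slide): the latest applicable date wins.
    """
    candidates = [add_years(last_subject_completed, 7)]
    if last_disclosure is not None:
        candidates.append(add_years(last_disclosure, 7))
    if child_birth_date is not None:
        # Child subject: retain until the subject reaches age 23.
        candidates.append(add_years(child_birth_date, 23))
    return max(candidates)


if __name__ == "__main__":
    # Hypothetical dates for illustration only.
    print(irb_retention_date(date(2011, 10, 6), last_disclosure=date(2012, 3, 1)))
```

The IND and IDE rules are not sketched because their wording on the slide is incomplete.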
Provide adequate data and safety monitoring (if activity represents more than
minimal risk to participants)
Complete required training
– Human Subject Research (HSR) compliance
– Conflict of Interest (COI)
– Privacy issues
Traditional Lab Notebook
Best Practices
•Date all entries (especially important if contesting IP)
•Title and state purpose of experiment
•Describe experiment in detail
– Protocol
– Calculations
– Reagents (lot numbers, passage numbers, etc.)
– Results (everything that does and doesn’t happen)
– Print-outs, pictures, graphs, etc with links to other data storage locations.
• Record needs to be intact and permanent
– All mistakes are to be left (cross out)
– Do not remove pages
– Write in pen
– Clearly link connected experiments across time
Requirement: Need to be able to follow the development and
execution of the experiments and all of the data analysis.
Laboratory Notebooks:
Types, Advantages & Drawbacks
Bound book
• Advantages
– No lost sheets
– Proof against fraud
• Drawbacks
– Experiments entered as done, no logical order
– Cannot keep some raw data forms
Loose-leaf sheets/folders
• Advantages
– Can group by experiment, maintain order
– Easy to record data during
– More flexible to hold various types of data
• Drawbacks
– Can lose sheets, harder to prove
Electronic notebook
• Advantages
– Easy to read
– Easy to do calculations
• Drawbacks
– Must back up data, harder to prove
– Can be manipulated after the fact
Barker, Kathy. At the Bench: A Laboratory Navigator. Cold Spring Harbor: Cold Spring Harbor
Laboratory Press (2005), 90.
[Figure: examples of data types, including ELISA readouts and miRNA data]
Raw MS spectrum
Processed MS spectrum
Identifies peptides based on comparison to existing database
Peptides clustered to identify protein name
Proteins are clustered to remove name redundancy present
Proteomics uses mass spectrometers (MS) to
identify peptides and proteins. MS accurately
measures the mass of peptides and their fragments.
The observed spectrum is compared to the
theoretical masses of all known amino acid
sequences in a database, allowing assignment to a
protein with a certain probability. Quantification
can be based on the number of spectra observed
per analysis (spectral count).
A single high-accuracy MS instrument (Orbitrap
LC/MS/MS) produces in a single run C number
spectra. This amounts to ~21-25 million
spectra/yr, equal to ~1 terabyte of raw data/year.
It is challenging as science, technology and our
understanding are always advancing.
Interpretation can change but RAW data never does.
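The spectral-count quantification described above can be sketched in a few lines. This is only an illustration under stated assumptions: the protein names and assignments are made up, and a real search engine would attach a probability to each match. The last lines redo the slide's arithmetic, taking 1 terabyte as 10^12 bytes, which gives roughly 40-48 kB of raw data per spectrum.

```python
from collections import Counter
from typing import Iterable, Optional


def spectral_counts(assignments: Iterable[Optional[str]]) -> Counter:
    """Count identified spectra per protein; None means no confident match."""
    return Counter(protein for protein in assignments if protein is not None)


# Hypothetical protein assignments for six spectra from one analysis.
run = ["TNNI3", "TNNI3", None, "MYH7", "TNNI3", "MYH7"]
counts = spectral_counts(run)
print(counts["TNNI3"], counts["MYH7"])  # 3 2

# Raw data volume per spectrum implied by ~1 TB/year and 21-25 million spectra/year.
kb_per_spectrum = [1e12 / n / 1e3 for n in (21e6, 25e6)]
print([round(kb) for kb in kb_per_spectrum])  # [48, 40]
```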
[Diagram: raw data storage workflow]
IA Storage Server
• Currently Using 1.73TB
• Multiple Redundant Hard Disks
• Secured Data Center
Auditing logs; offsite independent and local back-ups; NO deletes; NO overwrites
Checks/balances: Lab (Pass); Rounding ERROR!
Backup of raw mass spectrometry data
1. Source Matches Target
i) Size and date (easy but 95% accurate)
ii) CheckSum method (time consuming but 100% accurate)
2. No file update and/or delete permitted
i) If deletion of file required, written authorization required from PI
ii) Overwriting is not possible
3. Easily accessible auditing of all activities
(who uploaded/downloaded which files, when, from where etc.)
4. Backup, backup, backup….
i) Different locations
ii) Multiple time points
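A minimal sketch of points 1 to 3 above, assuming a plain file system archive: the file is copied once, the copy is verified with a checksum, an existing target is never overwritten, and each upload is appended to an audit log. The function names, the SHA-256 choice and the JSON log format are assumptions made for illustration, not the system shown in the slides.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large raw files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_raw_file(source: Path, archive_dir: Path, audit_log: Path, user: str) -> str:
    """Copy `source` into `archive_dir` once, verify the copy, and log the upload."""
    target = archive_dir / source.name
    # Mode "xb" raises FileExistsError if the target exists: no overwrites.
    with source.open("rb") as src, target.open("xb") as dst:
        shutil.copyfileobj(src, dst)
    checksum = sha256_of(source)
    if sha256_of(target) != checksum:
        raise IOError(f"checksum mismatch for {source.name}")
    entry = {"time": time.strftime("%Y-%m-%dT%H:%M:%S"), "user": user,
             "action": "upload", "file": source.name, "sha256": checksum}
    with audit_log.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return checksum
```

Preventing deletion is a matter of storage permissions rather than application code, so it is not shown here. Back-ups at different locations could be made by repeating the same call against each archive directory.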
Data Processing Pipeline Quality Control
Ensuring integrity of data analysis at local level
• “Bookkeeping Checks”
– Treat core or, if reasonable, the entire informatics pipeline
as “black box”, then run as many “integrity tests” as
possible to verify input (i.e. original raw files) matches final
output (e.g. reports).
– Manual Spot-Checking
• Compare final outputs (e.g. from reports) for a few
select data points with same data points in original
inputs (e.g. raw files).
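The two checks above can be sketched as follows, treating the pipeline as a black box that maps raw values to reported values. The dictionary layout, function names and tolerance are hypothetical; a real pipeline would read these values from raw files and reports.

```python
import random
from typing import Dict, List


def bookkeeping_check(raw: Dict[str, float], report: Dict[str, float]) -> List[str]:
    """Return discrepancies between the pipeline input and the final output."""
    problems = []
    missing = sorted(set(raw) - set(report))
    extra = sorted(set(report) - set(raw))
    if missing:
        problems.append(f"missing from report: {missing}")
    if extra:
        problems.append(f"not in raw input: {extra}")
    return problems


def spot_check(raw: Dict[str, float], report: Dict[str, float],
               n: int = 3, tolerance: float = 1e-9, seed: int = 0) -> List[str]:
    """Compare a few randomly chosen shared data points value by value."""
    shared = sorted(set(raw) & set(report))
    rng = random.Random(seed)
    sample = rng.sample(shared, min(n, len(shared)))
    return [key for key in sample if abs(raw[key] - report[key]) > tolerance]


# Hypothetical values: one data point was dropped by the pipeline.
raw = {"scan1": 10.25, "scan2": 3.5, "scan3": 7.125}
report = {"scan1": 10.25, "scan2": 3.5}
print(bookkeeping_check(raw, report))  # ["missing from report: ['scan3']"]
```

A value that was silently rounded somewhere in the pipeline would pass the bookkeeping check but be returned by `spot_check`, which is why both are worth running.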
What is different?
[Diagram: the raw data storage and back-up workflow shown earlier, repeated for comparison]
When there is lots of data and/or fast analysis
is required….how secure is the Cloud?
• 99.9% uptime guarantee for most cloud providers
• Still, good to have local “cheap” backups (e.g. 23 computers
in the JVE lab)
• To ensure security…
– Transmission over internet can use same security as your online
banking system (not hard to do…)
– Clouds can be “VPN-enabled”, so that those cloud machines are
“behind” JHU firewall, thus benefitting from JHU firewall
– Nevertheless, best idea (for any system, cloud or not): minimize
capturing sensitive information unless absolutely required for
stats/analysis; if that’s not possible, encrypt sensitive information
& restrict access conservatively
• NB – Different rules for HIPAA protected data
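One way to act on the "minimize capturing sensitive information" advice is sketched below: keep only the fields needed for analysis and replace the subject identifier with a keyed hash. This is pseudonymization, not encryption, and it is not a claim about what HIPAA requires. The field names and the key are hypothetical, and encrypting the retained data would need a vetted cryptography library, which is outside the Python standard library.

```python
import hashlib
import hmac
from typing import Dict, Iterable


def minimize_record(record: Dict[str, str], needed: Iterable[str],
                    id_field: str, secret_key: bytes) -> Dict[str, str]:
    """Keep only the fields needed for analysis and replace the subject ID
    with a keyed hash (a pseudonym, not encryption)."""
    pseudonym = hmac.new(secret_key, record[id_field].encode("utf-8"),
                         hashlib.sha256).hexdigest()[:16]
    reduced = {field: record[field] for field in needed}
    reduced["subject"] = pseudonym
    return reduced


# Hypothetical record; the key would be stored separately with restricted access.
record = {"mrn": "000123", "name": "Jane Doe", "troponin": "0.04"}
safe = minimize_record(record, needed=["troponin"], id_field="mrn",
                       secret_key=b"example-key-kept-offline")
print(sorted(safe))  # ['subject', 'troponin']
```

The same subject always maps to the same pseudonym under a given key, so records can still be linked for statistics without carrying the name or record number into the cloud.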
Learning from Recent Warning Letters Related to Computer Validation
6th October 2011 at 11:00 to 12:00
Organizer: Dr. Ludwig Huber
This seminar will provide more than 20 examples of recent FDA warning letters and give clear
recommendations for corrective and preventive actions.
In the last couple of years the FDA has discovered serious fraud related to security and integrity of
electronic data. As a result FDA inspections look more than ever at computers, how they are validated
and how companies comply with FDA's 21 CFR Part 11. This seminar will present more than 20 related
warning letter examples together with detailed recommendations on how to avoid them.
Areas Covered in the Seminar:
FDA inspections: Preparation, conduct, follow up
The meaning of warning letters and 483 inspectional observations
Learning from an FDA presentation: “Data Integrity and Fraud – Another Looming Crisis?”
Data integrity and authenticity: FDA's new focus during inspections
Examples of recent Part 11 related 483’s and Warning Letters
Examples of recent 483s and warning letters related to computer system validation
Most obvious reasons for deviations
Responding to 483's to avoid warning letters: going through case studies
Writing corrective AND preventive action plans as follow up to 483's
Using internal audits to prepare yourself for Part 11 related FDA inspections?
Strategies and tools for compliant Part 11 implementation
The future of Part 11 and computer system validation
Interpretation and Publication
Use of core facilities
Role of collaborators
Setting standards via professional societies
Responsibilities of journals and reviewers
Translational Science - Big Science
Dealing with massive data sets,
new technologies, and
novel statistical approaches.
More Lessons from Duke
Additional expertise from outside the group is required. It is still rare to
have someone in the group with sufficient expertise to monitor across all
aspects of the project.
Massive amounts of data and software complexity.
Error introduced due to data handling and poor documentation.
Computer software may be “research grade”, highly complex and
misunderstood or used inappropriately.
If you think or figure out something is not right, admit it and track it
down and correct it.
“Sometimes the glamour (and ease) of (some) technology makes
investigators forget basic scientific
(and biological) principles”
The Cancer Letter, edited and published by P. Goldberg
What’s wrong with this MS spectrum?
Unless you are an expert you will not know. But, it is wrong.
Proteomic analysis of age dependent
nitration of rat cardiac proteins by
solution isoelectric focusing coupled
to nanoHPLC tandem mass
spectrometry. Hong SJ, Gokulrangan
G, Schöneich C. Exp Gerontol.
4 proteins and post-translationally
modified amino acid residues
were reported; all were
subsequently shown to be
Manuscript that corrected it:
Misidentification of nitrated
peptides: comments on Hong, S.J.,
Gokulrangan, G., Schöneich, C., 2007.
Exp. Gerontol. 42, 639-651. Prokai L.
Exp Gerontol. 2009; 44(6-7):367-9.
Use of core facilities:
Can they (should they) provide
required expertise?
• Concern:
– Who is responsible for analysis?
– Complex data still requires understanding of technology limitations.
– Situation is worse with emerging technologies where
data analysis is still being developed.
• Solutions
– Cores with experts (who provide support to help with data analysis,
but this is time consuming and expensive)
– New hybrid cores-academic technology development labs.
– Preservation of data transparency and storage of raw data.
– Time to learn the methods across disciplines.
– New paradigm in collaboration requires new approaches to training.
We assume the best in people.
B. Obama at Martin Luther King Memorial speech Oct 2011.
What is the role of a collaborator?
Is being naïve or inexperienced a sufficient
reason for not being responsible for data
integrity and data interpretation?
How is broad (but in-depth) experience
obtained when focused expertise is the norm?
How do you develop collaborative networks
where in-depth cross disciplinary learning and
training is intrinsic?
Role of our scientific community
Setting data standards
HUPO Proteomics Standards Initiative
The HUPO Proteomics Standards
Initiative (PSI) defines community
standards for data representation in
proteomics to facilitate data
comparison, exchange and
verification. The PSI was founded at
the HUPO meeting in Washington,
April 28-29, 2002 (Science 296,827).
The minimum information
about a proteomics experiment
MS informatics
Column chromatography
Capillary electrophoresis
Protein modifications
Protein affinity
Bioactive entities
All processed MS data can be/should be
uploaded to a public database
Role of journals
State requirements in instruction to authors
Set standards for data integrity?
Who curates?
Store raw or processed data?
How is it annotated?
Detection prior to publication?
How many DBs?
Who pays?
Have expert reviewers?
If detected, bar authors from publishing in their
journal and/or inform the author’s institute?
• If detected after publication? Enforced
As a reviewer, what is your role?
How do we train reviewers?
More lessons from Duke
The IOM “Committee on the Review of Omics-Based Tests for Predicting
Outcomes in Clinical Trials” will determine criteria important for the
analytical validation, qualification and utilization components of test
evaluation for the use of models that predict clinical outcomes from genomic
and other omic technologies. Report is due shortly.
Hopefully, IOM report will set goal posts and pathways
in the same manner as for drug trials.
What I have learnt.
Science is difficult. We are limited by our
Raw data is never wrong. We may
misinterpret it, be fooled by an incorrect
assumption, be limited by
technology/approach but intrinsically, it
is not wrong.
Your scientific reputation is based on the
quality of your data.
Data Integrity Principle: Research data integrity is essential for
advancing scientific and medical knowledge and for maintaining public
trust. Researchers are ultimately responsible.
Data Access and Sharing Principle: Research (raw and processed)
data, methods, and other information integral to publicly reported
results should be publicly accessible.
Data Stewardship Principle: Research data should be retained to
serve future use. Thus, (raw and processed) data must be documented,
referenced, and indexed in order for them to be used accurately and
Round table
Sheila Garrity (Moderator), Director, Division of Research Integrity
Allen Everett (Assoc Professor), Pediatrics - Cardiology
Kathleen Barnes (Professor), Medicine–Clinical Immunology
David Graham (Asst Professor), Molecular and Comparative
Johns Hopkins
• JHU Data Management Policy:
• Overall list of JHU policies page:
• JHU laptop encryption:
• Overall JHU IT security page:
