CDC Presentation - Institute for People and Technology

Advanced Molecular Detection
Duncan MacCannell, PhD
Georgia Tech / CDC Collaborates
March 12th, 2014
National Center for Emerging and Zoonotic Infectious Diseases
Office of the Director
Roche 454 PTP plate, Ion Torrent 314, Pacific BioSciences SMRTcells (x 3)
Devices and brand names provided for illustrative purposes only. Their use does not imply endorsement by CDC or HHS.
VOLUME OF RAW DATA
Data Acquisition/Analysis Challenges
For PulseNet USA alone:
>70,000 samples/year
× (2 to 3 GB raw sequence + 5 to 10 GB intermediate) per sample
= up to ~0.9 petabytes of raw and intermediate data/year
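The estimate above can be checked with a short calculation. This is only a sketch: it treats ">70,000" as exactly 70,000 samples and uses decimal units (1 PB = 1,000,000 GB), both of which are assumptions rather than figures from the slide. Under those assumptions the ~0.9 PB figure corresponds to the upper end of raw plus intermediate data.

```python
# Back-of-the-envelope check of the PulseNet USA data volume estimate.
# Assumptions (not from the slide): exactly 70,000 samples, decimal units.
SAMPLES_PER_YEAR = 70_000
RAW_GB = (2, 3)            # raw sequence per sample, low and high
INTERMEDIATE_GB = (5, 10)  # intermediate files per sample, low and high
GB_PER_PB = 1_000_000


def yearly_volume_pb(samples: int, gb_per_sample: float) -> float:
    """Return the yearly data volume in petabytes."""
    return samples * gb_per_sample / GB_PER_PB


raw_low = yearly_volume_pb(SAMPLES_PER_YEAR, RAW_GB[0])
raw_high = yearly_volume_pb(SAMPLES_PER_YEAR, RAW_GB[1])
total_low = yearly_volume_pb(SAMPLES_PER_YEAR, RAW_GB[0] + INTERMEDIATE_GB[0])
total_high = yearly_volume_pb(SAMPLES_PER_YEAR, RAW_GB[1] + INTERMEDIATE_GB[1])

print(f"raw only:           {raw_low:.2f} to {raw_high:.2f} PB/year")
print(f"raw + intermediate: {total_low:.2f} to {total_high:.2f} PB/year")
```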
Transmission and storage?
Is better data compression the answer?
Distributed processing and extraction?
Is full WGS the right approach for large-scale surveillance?
Any solution must balance the advantages of WGS against the costs of implementation.
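As one illustration of the compression question, a nucleotide sequence restricted to A, C, G and T can be stored in 2 bits per base instead of one 8-bit character. The sketch below is a simplified assumption, not a method named in the presentation: real sequencer output also carries ambiguous bases (N), read identifiers and per-base quality scores, which this toy encoding ignores. The function names `pack` and `unpack` are hypothetical.

```python
# Toy 2-bit packing of an A/C/G/T-only sequence (4 bases per byte).
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | _CODE[base]
        # Left-align a short final chunk so decoding is position-stable.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(_BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


if __name__ == "__main__":
    seq = "ACAATTTGTGCATAACATGTGG"
    packed = pack(seq)
    print(len(seq), "bases stored in", len(packed), "bytes")
```

The original length has to be stored alongside the packed bytes, because the last byte may be only partly used.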
Input: DNA/RNA → (Library) → NGS → (Data) → Bioinformatics → (Info.)
Source: Genomic, Amplicon, Whole sample
Workflow (NGS): Platforms, Chemistry, Perf. char., Labor/TaT, Cost
Workflow (Bioinformatics): Hardware/software, Specialized skillsets, Algorithms/pipelines, Pathogen databases, Data analysis/interpret/integration/visualization
Host/vector/pathogen/environment
…
Increasingly Universal Workflows
Established sequencing workflows for a wide range of pathogens.
A Moving Target
Rapidly evolving technology space. Changing hardware and COTS/OSS capabilities. Lots of choice, but a lack of consistent standards. BIG DATA. A new workforce and new skillsets are required.
Sample intake, Prep/staging, Extraction
Conversion, Library prep, Sequencing
File hashes/versioning, QA/QC, Reporting
Validated methods/databases, Skills/proficiency
Process logging/audit, Standards, Security
Pathogen- and application-specific, CLIA-compliant assays
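The "file hashes/versioning" and "process logging/audit" items can be illustrated with a small sketch that fingerprints a sequence file and records which pipeline version handled it. The record layout, the function names `sha256_of_file` and `audit_record`, and the version string are hypothetical; the slide does not specify a hash algorithm, and SHA-256 is used here only as a common choice.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large sequence files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_record(path, pipeline_version):
    """Build a minimal, hypothetical audit entry for one data file."""
    p = Path(path)
    return {
        "file": p.name,
        "bytes": p.stat().st_size,
        "sha256": sha256_of_file(p),
        "pipeline_version": pipeline_version,
    }


if __name__ == "__main__":
    # Demonstration on a throwaway file with made-up content.
    with tempfile.TemporaryDirectory() as tmp:
        fastq = Path(tmp) / "sample_0001.fastq"
        fastq.write_bytes(b"@read1\nACGT\n+\nIIII\n")
        print(audit_record(fastq, pipeline_version="0.1-example"))
```

Recomputing the hash later and comparing it with the stored value shows whether a file changed between pipeline steps.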
ACAATTTGTGCATAACATGTGGACAGTTTTAATCACATGTGGGTAAATAGTTGTCCACATTTGCTTTTTT
TGTCGAAAACCCTATCTCATATACAAACGACGTTTTTAGGTTTTAAAATACGTTTCGTATAAATATACAT
TTTATATTTATTAGGTTGTACATTTGTTGCGCAACCTTATTCTTTTACCATCTTAGTAAAGGAGGGACAC
CTTTGGAAAATATCTCTGATTTATGGAATAGTGCCTTAAAAGAATTAGAAAAAAAGGTAAGCAAGCCTAG
TTATGAAACATGGTTAAAATCAACAACGGCTCATAACTTGAAGAAAGACGTATTAACGATTACAGCTCCA
AATGAATTTGCTCGTGACTGGCTAGAATCTCATTACTCAGAACTTATTTCGGAAACACTATACGATTTAA
CAGGGGCAAAATTAGCAATTCGCTTTATTATTCCCCAAAGTCAATCGGAAGAGGACATTGATCTTCCTCC
AGTTAAGCGGAATCCAGCACAAGATGATTCAGCTCATTTACCACAGAGCATGTTAAATCCAAAATATACA
TTTGATACATTTGTTATCGGCTCTGGTAACCGTTTTGCCCATGCAGCTTCATTAGCTGTAGCCGAGGCGC
CAGCTAAAGCGTATAATCCACTCTTTATTTATGGGGGAGTTGGGCTTGGAAAGACGCATTTAATGCACGC
AATTGGTCATTATGTAATTGAACATAATCCAAATGCAAAAGTTGTATATTTATCATCAGAAAAATTCACG
AATGAATTTATTAACTCTATTCGTGATAATAAAGCTGTTGATTTTCGTAATAAATATCGCAACGTAGATG
Output: Information
From Sequence Data
Comparative Genomics
HR Straintyping/Subtyping
Cluster identification
Molecular evolution
Genotypic characterization
Virulence, AR, signatures
Functional annotation
Diagnostic dev/validation
Metagenomics
Pathogen identification/discovery
Culture-independent diagnostics
Microbial ecology/diversity
…
Objective, “Future-Proof” Data
Intrinsic quality metrics. Ability to back-test retrospective
sequence data in silico for genes/markers identified at a
future date.
MANY RESULTS POSSIBLE FROM A SINGLE DATASET!
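Back-testing in silico can be sketched as searching archived sequence for a marker that was only defined later. The example below is a minimal illustration using exact string matching on both strands; real pipelines typically use alignment tools that tolerate mismatches. The archived read is the first line of the sequence shown on the slide, while the marker strings and the function names are hypothetical.

```python
# Exact-match search for a marker on both strands of archived sequence.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# First line of the sequence shown on the slide, used as "archived" data.
ARCHIVED_READ = (
    "ACAATTTGTGCATAACATGTGGACAGTTTTAATCACATGTGGGTAAATAGTTGTCCACATTTGCTTTTTT"
)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_all(haystack: str, needle: str) -> list:
    """Return every 0-based start position of needle, including overlaps."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


def backtest_marker(sequence: str, marker: str) -> dict:
    """Report marker hits on the forward (+) and reverse (-) strand."""
    sequence = sequence.upper()
    marker = marker.upper()
    return {
        "+": find_all(sequence, marker),
        "-": find_all(sequence, reverse_complement(marker)),
    }


if __name__ == "__main__":
    # Hypothetical markers, chosen only to exercise both strands.
    for marker in ("ACATGTGG", "GTGGACAA"):
        print(marker, backtest_marker(ARCHIVED_READ, marker))
```

Positions reported for the reverse strand are the forward-strand coordinates at which the marker's reverse complement begins.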
WGS and Pathogen Genomics: Advantages

It’s universal…
• DNA/RNA sequencing workflows and approaches can be applied to a wide range of pathogenic organisms.

It’s fundamental…
• Genomics is a cornerstone for other “omic” approaches.
• Sequence databases are a starting point for assay development/validation.

It’s objective…
• Sequence-based methods avoid the subjectivity of phenotypic or fragment-based approaches. Volume of data → internal controls.

It’s (relatively) future proof…
• Comprehensive sequencing captures the features you know about, and those you don’t. Quality may change, but the sequence will not.
• This makes it possible to back-test future approaches/targets on the data you collect today.
WGS and Genomic Epidemiology: Limitations

It lacks standardization…
• WGS is a rapidly evolving technology space, both in terms of sequencing and analytics.
• Standards and mechanisms for data/metadata analysis, storage and exchange remain under active debate and development.

Comprehensive databases are still being built…
• Without a useful baseline understanding of pathogen features/diversity, interpretation may be limited.
• Need curated and comprehensive epi-linked reference databases.

Many analyses require specialized bioinformatics
infrastructure and staff.
• Bioinformaticians, DBAs, programmers, system administrators, etc.
• Technical and computational complexity of tasks can vary widely.

Data management, retention and release. Storage. LIMS.
Advanced Molecular Detection
Proposed $30M FY2014 budget request to:
1. Improve pathogen identification and detection
Outcome: Rapid progress toward modernizing PulseNet and other
critical lab-based surveillance systems
2. Adapt new diagnostics to meet evolving public health needs
Outcome: Enhance CDC’s ability to detect outbreaks early, develop new tests during outbreaks, and better characterize infectious disease threats
3. Help states meet future reference testing needs in a
coordinated manner
Outcome: More effective and better integrated outbreak response activities
4. Implement enhanced, sustainable, and integrated laboratory
information systems
Outcome: Labs inside and outside CDC can share information quickly and
seamlessly, including with other CDC databases, such as MicrobeNet and PulseNet
5. Develop prediction, modeling, and early recognition tools
Outcome: Better equipped to prevent, detect & respond to infectious diseases.
Diagram labels: EPI, NGS, BIOINFO, AMD
AMD Initiative: Strategic Investments (1)

Scientific Infrastructure:
• Critical laboratory and bioinformatics infrastructure at CDC, state/local PHL, and key overseas laboratories.
  • Sequencers, mass-spec, other instrumentation, reagents.
  • High performance computing, workstations.
  • Data storage, networking; data integration, knowledge management.
  • Service contracts, software licensing, etc.
AMD Initiative: Strategic Investments (2)

Workforce development:
• Training for CDC and PHL staff (bioinformatics, genomics, -omics)
• New or re-tooled fellowship programs (bioinformatics, genomics)
• Recruitment of new staff and skillsets (bioinformaticians, data scientists, lab specialists, …)
AMD Initiative: Strategic Investments (3)

Consortia, partnerships and alignment of efforts
• Academic institutions
• State, Federal (NIH, FDA, DHS, DoD, DoE/National Laboratories)
• Non-Profit/NGO
• International community
• Commercial/For-Profit
• Clinical laboratories

Pilot projects with state/local and other partners.
• Outbreak detection, investigation and response
• Leverage existing laboratory-based surveillance systems
Challenges and Opportunities for CDC/GT

Training and workforce development.
• Development of wet bench and bioinformatics curriculum for public health audiences. Scientific exchanges. Fellowship programs. MOOC-style coursework and training modules for PHL.

Bioinformatic challenges.
• Analysis and visualization of complex structured and unstructured data. Epi/lab integration. Dashboards/decision support.
• Development and standardization of deployable, CLIA-compatible bioinformatics workflows. Fieldable/portable systems.
• Machine learning and other approaches for genotypic prediction of complex microbial phenotypes (e.g., antimicrobial resistance).
• Approaches to address CIDT, e.g., accelerated metagenomic classification, lab/bioinformatics approaches for complex sample matrices.
• Tools for rapid assay design and validation from HTS data.
• Hardware-accelerated algorithms, scalable HPC (+NoSQL/Hadoop).
• …
Questions and Discussion
For more information please contact Centers for Disease Control and Prevention
1600 Clifton Road NE, Atlanta, GA 30333
Telephone: 1-800-CDC-INFO (232-4636)/TTY: 1-888-232-6348
Visit: www.cdc.gov | Contact CDC at: 1-800-CDC-INFO or www.cdc.gov/info
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the
Centers for Disease Control and Prevention.
