Hadoop - interface:systems

BigData
From Experiment to Production
Mario Vosschmidt
Consulting Systems Engineer
© 2014 NetApp, Inc. All rights reserved. NetApp Proprietary – Limited Use Only
Agenda
BigData or SmartData?
1) What is "BigData"?
2) Requirements and challenges
3) Which scenarios do we focus on?
4) What do the solution approaches look like?
5) How do I implement these solutions?
6) Summary
The Big Data Landscape
BigData
The 3V Paradigm
 Variety
 Multiple data sources
 Multiple data formats
 Velocity
 High speed processing
 Fast changing requirements
 Volume
 Huge amounts of data
 Process and persist
NetApp Confidential - Internal Use Only
Entering a New Era of Scale
Big Data Solution Portfolio
The ABCs of Big Data at NetApp
Big Data
Insight from extremely large datasets
Performance for data intensive workloads
Secure boundless data storage
Not Even to The “Peak”
[Figure: hype cycle, VISIBILITY over TIME: Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]
35 Zettabytes: Estimated size of the digital universe in 2020
5 Billion: Smart phones
30 Billion: Pieces of new content to Facebook per month
80%: Unstructured data
Big Data Vendor Landscape
A Lot of Hype and Buzz – Everyone is Jumping In
[Figure: Funding for Hadoop and NoSQL, Jan-08 to Nov-11, vertical axis 0 to 400 (source: 451 Research). Labeled funding rounds: Cloudera series B, C and D; MapR series A and B; 10gen series D; DataStax series B; Neo Technology series A; Opera Solutions series A; Platfora series A; Couchbase series C]
 Market is expected to grow from $3.2 billion
in 2010 to $16.9 billion in 2015
 Most firms are taking a pragmatic approach
 Big data is in the very early stages of maturity
 Best practices are not mature
"The Big Data market is expanding rapidly …
For technology buyers, opportunities exist to
use Big Data technology to improve
operational efficiency and to drive innovation.
Use cases are already present across
industries and geographic regions."
Dan Vesset, Vice President, IDC
IDC Big Data Survey
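The market figures quoted above ($3.2 billion in 2010, $16.9 billion in 2015) imply a compound annual growth rate. The following is a minimal sketch of that arithmetic; the function name `cagr` is chosen here for illustration and is not part of the original slides.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted on the slide: $3.2 billion (2010) to $16.9 billion (2015).
market_cagr = cagr(3.2, 16.9, 2015 - 2010)
print(f"Implied growth rate: {market_cagr:.1%} per year")
```

Under these figures the implied rate works out to a little under 40% per year.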
Data Growth Impact on Business
“Big Data” refers to datasets whose size is beyond the ability of typical tools to capture, store, manage and analyze
[Figure: Business Velocity from 2010 to 2020 along the dimensions Complexity, Speed and Volume. At the Inflection Point, either Information Becomes a Propellant to Business or Data Becomes a Burden to IT Infrastructure.]
The Big Data Opportunities
Financial Services
 Fraud detection & prevention
 Anti-money laundering
 Risk management
Government
 Law enforcement
 Counter-terrorism
 Research and Education
Manufacturing
 Supply chain optimization
 Defect tracking
 Root cause analysis
 RFID correlation
Healthcare
 Drug development
 Patient Records
 Evidence-based medicine
Why Should You Care?
It’s the Value of Your Data
 Top line revenue
– Leverage your data assets into business advantage
 Bottom line savings
– Lower the cost of compliance
– Manage ever growing data efficiently
 5 Billion Records
 Anywhere, Anytime
 Faster time to market
 50% Increase in Revenue
 Over 1PB of data
 Growth of 175% YOY
 90 days of data within 24 hours of a failure
NetApp Big Data
Why NetApp?
Practical solutions that solve today’s problems
Get Control: NetApp helps you turn your exploding data from threat to opportunity. Manage your data effectively and affordably.
Break Through: Break through the limits. With NetApp, you can take on even the most massive and complex data projects.
Gain Insight: Turn insight to action. NetApp helps you get to clarity and insight faster and more reliably.
Experience Managing Data at Scale
NetApp’s Largest Customer
100 PB
4 Customers
50 PB
10 Customers
20 PB
50 Customers
10 PB
100 Customers
NetApp Big Data Strategy
Open
Best-of-Breed
Choice
 Best of breed storage for Big Data Applications
 Built on open standards with best-in-class partnerships
 Validated with ecosystem leaders
 Complete server, network and storage “Racks”
 Delivered via trusted high-value partners
Analytics
Smart Data
Big Analytics Strategy
Smart Data
DSS / DW (traditional analytics)
 Solutions partners include IBM, Oracle, Microsoft,
ParAccel, Exasol and SAND
Big Analytics
 Enterprise class Hadoop-based solutions
 MapR, Hortonworks, Cloudera
Leverage partners to complete Big Analytics
stack
 Solutions for validated server, network and storage
Big Analytics Solutions
Data Warehouse
Fast, space-efficient backup and recovery with storage utilization up to 90%. Less raw capacity with modular scalability.
Mixed Use Database, Cubes
Optimized for IBM, Oracle and Microsoft. Simplified data management and protection. Zero downtime.
Hadoop
Enterprise class Hadoop with lower total cost of ownership and based on open standards.
The Value Proposition:
Some problems require an Enterprise Class Hadoop Solution
Enterprise Class Hadoop
Enterprise Class Hadoop
Compute Power
Packaged ready-to-deploy modular
Compute / Memory intensive Hadoop cluster
 Compute intensive applications
 Tick Data Analysis
 Extremely tight Service Level
expectations
 Severe financial consequences if the
analytic run is late
Packaged ready-to-deploy modular Hadoop
cluster
 The Data has intrinsic value $$$
 Usable capacity must expand faster than
compute
 Higher storage performance
 Real human consequences if the system fails
(Threats, treatments, financial losses)
 System has to allow for asymmetric growth
Enterprise Class Hadoop
White Box Hadoop
Values associated with early adopters of Hadoop
 Social Media Space
 Contributors to Apache
 Strong bias to JBOD
 Skeptical of ALL vendors
Bounded Compute algorithm / Memory
intensive Hadoop cluster
 Compute intensive applications
 Additional CPUs do not improve run time
 Extremely tight Service Level
expectations
 Severe financial consequences if the
analytic run is late
 Need for deeper storage per datanode
Storage Capacity
Challenges with Hadoop in Enterprise
Operations
Availability
 NameNode is a single point of failure
 Slow recovery from disk drive failure
 Requires three copies of data, larger footprint, and more storage
 Expensive process to replace failed disks online
 Limited flexibility; storage and servers tied together affects scalability
 Most common Hadoop support issue is disk drive failure
 Low cluster efficiency, higher network congestion
Implementation
 Need to keep up with fast-paced patches, projects of open source platform
 Need to decide on distribution of Hadoop
 Skills are not common
 Integration with existing IT infrastructure can be difficult
 Tuning expertise needed to make Hadoop perform optimally
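The "three copies of data" point above can be turned into a rough sizing calculation. This is a minimal sketch, assuming the HDFS default replication factor of 3; the 25% allowance for intermediate job output and the 4 TB drive size are assumptions made here for illustration, not figures from the slides.

```python
import math


def raw_capacity_tb(data_tb: float, replication: int = 3,
                    temp_overhead: float = 0.25) -> float:
    """Raw disk capacity needed to hold data_tb of user data.

    replication   -- copies kept of every block (3 is the HDFS default)
    temp_overhead -- ASSUMED extra fraction for intermediate job output
    """
    return data_tb * replication * (1.0 + temp_overhead)


def disks_needed(raw_tb: float, disk_tb: float) -> int:
    """Number of whole drives needed to provide raw_tb of raw capacity."""
    return math.ceil(raw_tb / disk_tb)


raw = raw_capacity_tb(100.0)        # 100 TB of user data (hypothetical)
print(raw, disks_needed(raw, 4.0))  # 4 TB drives (hypothetical)
```

Lowering `replication` in this sketch shows directly how much raw capacity a lower copy count would save, which is the footprint argument the slide makes.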
Cisco and NetApp Confidential. For Internal Use Only. Do Not Distribute.
Why Big Data and Analytics as a service is
important!
FlexPod Converged Infrastructure Family
FlexPod® Express
MSB/Branch Office
For smaller, less-dynamic requirements and VAR velocity
Compute Pool: Cisco UCS C-Series
Network Pool: Nexus® 3K
Storage Pool: FAS2xx0
Two fixed pod sizes
Cisco UCS Director, VMware®, and Microsoft®
FlexPod Data Center
Enterprise/Service Provider
Massively scalable shared virtual data center infrastructure
Compute Pool: Cisco UCS C-Series/B-Series
Network Pool: Nexus® 5k
Storage Pool: FAS Storage
Flexible pod sizes
FlexPod validated management and ecosystem
FlexPod Select
Big data analytics, scientific, HPC
Distinct Architectures
Dedicated Compute Nodes: Cisco UCS C-Series
Network / Direct Storage: Nexus, Catalyst®, MDS; E-Series, FAS
Reference architecture and/or designs
Application-based management
NetApp Reference Architecture
Example: FlexPod Select with Cloudera
 Converged big data platform from NetApp and Cisco for Hadoop
 Enterprise-class Hadoop: Innovative storage, servers, networking validated with leading Hadoop distributions
 Faster time to value: Prevalidated configuration accelerates deployment
 High availability: Less downtime, higher serviceability to meet tight SLAs around data applications and processes
 Flexible scaling: Independently scale servers and storage; modular design for scaling as data needs grow
Components: Cisco UCS® C-Series Rack Mount Servers, Cisco UCS Fabric Interconnect, Cisco UCS Manager, NetApp® FAS Storage Systems, NetApp E-Series Storage Array
* NetApp 50% Storage Guarantee http://www.netapp.com/us/solutions/infrastructure/virtualization/guarantee.html
FlexPod Select with Hadoop
NetApp and Cisco deliver enterprise class Hadoop for high availability, performance, scalability
Cloudera or Hortonworks Distribution of Hadoop
Architected for the enterprise
 Superior NameNode protection
 Faster recovery from failover
 Lower cluster downtime
Faster time to value
 Validated, presized configurations
 Low-latency, high-bandwidth networking
 12 DataNodes in master, 16 in expansion
Coexistence with current applications and infrastructure
 Supports existing applications from SAP, Microsoft, Oracle
 Data management and monitoring with Cloudera Manager, Cisco UCS® Manager
[Diagram: Master, Expansion, … …]
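The presized configuration above (12 DataNodes in the master unit, 16 in each expansion unit) lends itself to a small sizing helper. This is a sketch under the assumption that a cluster consists of exactly one master unit plus any number of identical expansion units; the function names are hypothetical.

```python
MASTER_DATANODES = 12     # from the slide
EXPANSION_DATANODES = 16  # from the slide


def total_datanodes(expansion_units: int) -> int:
    """DataNodes in one master unit plus the given number of expansion units."""
    if expansion_units < 0:
        raise ValueError("expansion_units must not be negative")
    return MASTER_DATANODES + EXPANSION_DATANODES * expansion_units


def expansion_units_for(target_datanodes: int) -> int:
    """Smallest number of expansion units that reaches target_datanodes."""
    extra = max(0, target_datanodes - MASTER_DATANODES)
    return -(-extra // EXPANSION_DATANODES)  # ceiling division


print(total_datanodes(2))       # master plus two expansion units
print(expansion_units_for(40))  # units needed for at least 40 DataNodes
```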
Service-Level Expectations Around Data
High-Value Time-Sensitive Problems
FlexPod Select for Hadoop with Cloudera
Accelerate time to insights
Fast deployment with validated, preconfigured reference designs
Store, process, analyze all data for new opportunities and business impact
More time to focus on data analysis rather than dealing with cluster downtime
Making the Hadoop experience better
Optimized, tuned, fully configured cluster
Hadoop integrated with storage, compute, networking
Monitoring and management tools with SANtricity® and from partners (Cloudera Manager, Cisco UCS® Manager)
High density and capacity reduce data center footprint
Reduce risk in an open ecosystem
Compatibility with existing infrastructure and applications
Best-in-class partnerships, not entire stack from one vendor
Future-proof against lock-in and benefit from evolving ecosystem
Ease of Setup and Deployment
Preconfigured, Prevalidated
Use Case Example: NetApp AutoSupport
Phone-home data representing information about the status of NetApp storage controllers
 Correlate disk latency (hot) with disk type
– 24 billion records
– 4 weeks to run query
– Hadoop implementation: 10.5 hours
 Bug detection through pattern matching
– 240 billion records: too large to run
– Hadoop implementation: 18 hours
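The first query above (correlating disk latency with disk type) has the shape of a classic map and reduce job. The following is a single-process sketch of that shape in plain Python, not the Hadoop API; the record layout `(disk_type, latency_ms)` and the sample values are hypothetical. The last lines compute the speedup implied by the slide's own figures (4 weeks versus 10.5 hours).

```python
from collections import defaultdict

# Hypothetical phone-home records: (disk_type, latency_ms).
RECORDS = [("SATA", 12.0), ("SAS", 6.0), ("SATA", 18.0),
           ("SSD", 1.0), ("SAS", 8.0)]


def map_phase(records):
    """Emit (key, (latency_sum, count)) pairs, one per input record."""
    for disk_type, latency_ms in records:
        yield disk_type, (latency_ms, 1)


def reduce_phase(pairs):
    """Sum the partial (latency_sum, count) values per key, then average."""
    totals = defaultdict(lambda: [0.0, 0])
    for key, (latency_sum, count) in pairs:
        totals[key][0] += latency_sum
        totals[key][1] += count
    return {key: s / n for key, (s, n) in totals.items()}


mean_latency = reduce_phase(map_phase(RECORDS))
print(mean_latency)

# Speedup implied by the slide: 4 weeks versus 10.5 hours.
speedup = (4 * 7 * 24) / 10.5
print(f"{speedup:.0f}x faster")
```

Emitting `(sum, count)` pairs rather than raw averages keeps the reduce step associative, which is what lets a framework combine partial results from many nodes.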
Wireless Service Provider
Telco Industry
Provides wireless voice and data services globally
NetApp Hadoop Solution
[Diagram: Agent Servers (AS) at each Remote Site, Collector Servers (CS) at the Central Site, eight DataNodes (DN) on the Hadoop Distributed File System (HDFS), Archiving & Indexing Tools, and a User Interface + Search Tool]
The solution consists of an eight-node Hadoop cluster at the core site. All the
data from the remote sites is transported over the WAN to the central site.
The data is collected, ingested, compressed and archived into the Hadoop
cluster via HDFS. The data is then categorized, put into separate containers,
and indexed based on its record-keeping tags.
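The ingest flow described above (compress, archive, categorize into containers, index by tag) can be sketched in a few lines. This is an in-memory stand-in for illustration only: the `Archive` class, the record fields `category` and `tags`, and the use of JSON plus gzip are all assumptions made here, not details of the deployed solution.

```python
import gzip
import json
from collections import defaultdict


class Archive:
    """In-memory stand-in for the HDFS-backed archive described above."""

    def __init__(self):
        self.containers = defaultdict(list)  # category -> compressed blobs
        self.index = defaultdict(list)       # tag -> (category, position)

    def ingest(self, record: dict) -> None:
        """Compress one record, file it by category, and index its tags."""
        blob = gzip.compress(json.dumps(record, sort_keys=True).encode("utf-8"))
        category = record.get("category", "uncategorized")
        position = len(self.containers[category])
        self.containers[category].append(blob)
        for tag in record.get("tags", []):
            self.index[tag].append((category, position))

    def search(self, tag: str) -> list:
        """Return the decompressed records carrying the given tag."""
        hits = []
        for category, position in self.index.get(tag, []):
            blob = self.containers[category][position]
            hits.append(json.loads(gzip.decompress(blob).decode("utf-8")))
        return hits


archive = Archive()
archive.ingest({"site": "remote-1", "category": "voice",
                "tags": ["billing", "2014-01"], "payload": "call record"})
archive.ingest({"site": "remote-2", "category": "data",
                "tags": ["2014-01"], "payload": "session record"})
print([r["site"] for r in archive.search("2014-01")])
```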
Analytics & Enterprise Apps Environment
[Diagram labels, in source order: Reporting/Dashboard/Visualization; Applications; OLAP; Analytics; ETL; Data Management; ETL; OLAP; OLTP; Storage File Systems; Mobile Devices; Location/GPS; Logs; Sensors; Applications; Other; Data Sources; Content Repositories; Shared Storage Infrastructure; Storage Data Management; NFS/sNFS/pNFS; Storage (All other storage, i.e. internal DAS)]
Bandwidth
Big Bandwidth Solutions
Full Motion Video
Scalable density and performance to ingest and simultaneously analyze UAV and satellite video data
Media Content Management
High ingest & play-out rates with support for media and entertainment workflows
Video Storage for Surveillance
High bandwidth & density supporting hundreds or thousands of HD cameras
HPC: Lustre, GPFS, BeeGFS
Massively parallel distributed file system for large scale cluster computing and O&G Seismic Processing
Big Bandwidth Solutions
Applications
Storage File Systems
Density
Performance
Reliability
Efficiency
Modularity
Flexibility
E-Series Storage
Full-Motion Video Storage Solution
Full-Motion Video Built on E-Stack (E5460 Stack)
Turnkey solution in a 40U industry-standard rack
 Single architecture for ingest, exploitation and dissemination
 1.8PB Raw Capacity
– 4000+ hours of uncompressed 720p HD video
 >20 GB/s R/W Performance, >30 GB/s Peak Performance
 Scale to multiple Petabytes in a single data container
[Diagram: High bandwidth HD Video Ingest (Satellite, UAV) into the Quantum® StorNext File System (Massively Scalable Single Data Container), with Multi-Stream Video Playout for Processing, Exploitation and Analyst Viewing]
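The "4000+ hours of uncompressed 720p HD video" figure can be sanity-checked against the 1.8 PB raw capacity. The slide does not state the video format, so this sketch assumes 1280x720 frames, 2 bytes per pixel (8-bit 4:2:2 sampling) and 60 frames per second, ignores audio and RAID overhead, and treats a petabyte as 10^15 bytes.

```python
def hours_of_video(capacity_bytes: float, width: int, height: int,
                   bytes_per_pixel: float, fps: float) -> float:
    """Hours of uncompressed video that fit into capacity_bytes."""
    bytes_per_second = width * height * bytes_per_pixel * fps
    return capacity_bytes / (bytes_per_second * 3600)


RAW_CAPACITY_BYTES = 1.8e15  # 1.8 PB raw, decimal petabytes (from the slide)

# ASSUMED format: 1280x720, 2 bytes per pixel (8-bit 4:2:2), 60 fps.
hours = hours_of_video(RAW_CAPACITY_BYTES, 1280, 720, bytes_per_pixel=2, fps=60)
print(f"about {hours:.0f} hours")
```

Under these assumptions the result lands somewhat above 4500 hours, which is consistent with the "4000+" claim; other bit depths or frame rates would shift the number.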
HPC: Lustre
 Performance to meet the needs of the world’s fastest Supercomputers
 High Bandwidth & Density
– 1.8PB & 30GB/s per 40U rack
 Highly available
– No single points of failure
– Extensive RAS features
 NetApp provided 7x24 Lustre Support
 NetApp Professional Services
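The per-rack figures above (1.8 PB and 30 GB/s per 40U rack) give a simple two-constraint sizing rule: a system needs enough racks for whichever of capacity or bandwidth is the tighter requirement. A minimal sketch follows; the workload numbers in the example are hypothetical, and the capacity figure is raw, so usable capacity after RAID protection would be lower.

```python
import math

RACK_CAPACITY_PB = 1.8     # raw capacity per 40U rack (from the slide)
RACK_BANDWIDTH_GBS = 30.0  # bandwidth per 40U rack (from the slide)


def racks_needed(capacity_pb: float, bandwidth_gbs: float) -> int:
    """Racks required to satisfy both the capacity and bandwidth targets."""
    by_capacity = math.ceil(capacity_pb / RACK_CAPACITY_PB)
    by_bandwidth = math.ceil(bandwidth_gbs / RACK_BANDWIDTH_GBS)
    return max(by_capacity, by_bandwidth)


# Hypothetical target: 10 PB raw and 200 GB/s aggregate bandwidth.
print(racks_needed(10.0, 200.0))
```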
NetApp Confidential – Limited Use
Lawrence Livermore National Lab
Sequoia: announced as the fastest supercomputer and storage combination on the planet at ISC 2012
 Supercomputer storage to support twenty thousand trillion arithmetic operations per second with access speeds up to 1 TB/sec
 55PB of usable storage
 Simulations for nuclear weapons viability
 Counter Terrorism
 Energy Security
 Understanding Climate Change
Press Release: http://www.netapp.com/us/company/news/news-rel-20110928-990734.html
Video Surveillance Storage
Enhance public safety with better physical security
 Industry trends are driving explosive storage growth
 Analog to Digital
 SD to HD
 7 days to 30+ Days
 Open Platform Solution
 Best of breed industry partners
 Flexible deployments
 Modular scalability
 99.999% uptime
Unique Out-of-Band Recording
No servers required between cameras and storage
 Saves hardware, software, licensing and footprint; very robust; saves a lot of network cabling; easy to scale.
Media Content Management
 Highly scalable digital repository
 Consolidates collaborative production
 Multi-format distribution workflows
 Industry-leading bandwidth per rack to reduce bottlenecks
 Highest capacity density to minimize power and cooling
 Single namespace for multi-petabyte repositories
 Unmatched breadth of production client support
Content Management
Big Content Solutions
File Services
Multi-application workloads
Non-disruptive operation
Integrated data protection, efficiency
Enterprise Content Repository
Infinite container
Fixed content
Non-disruptive operation
Integrated data protection, efficiency
Distributed Content Repository
Large, multi-site repository
Policy based data management
Metadata-enabled object storage
File Services
ONTAP Cluster Mode
 Heterogeneous cluster:
 A mix of controller types in a single cluster per workload needs
 Entry, mid, and high-end platforms
 Native and third-party storage (FAS and V-Series)
 Multiprotocol: NFS, pNFS, CIFS, iSCSI, FCP
 Integrated Data Protection
 Virtual storage tier:
 Match data to disk price and performance
 Manage multiple tiers in the same namespace or many
Enterprise Content Repositories
ONTAP Cluster Mode with Infinite Volume
Single large content repository
 Scales to PBs and billions of files across cluster
 Native storage efficiency
Simplified operations
 Multi-tenancy
 Simplifies application workflows
 Load balances data at ingest
 Starts small, grows granularly
High availability
 Protects against disk and hardware failures
 Snapshots & Replication for quick recovery
 Manage & Upgrade non-disruptively
Content Repository
Object Storage Insights
 Flat Namespace
 Less data management overhead
 No filesystem hierarchy
 Metadata separated
 High Metadata rates
 Not within data space
 Metadata serve as descriptors
 Can change over time
 However Data is persistent
 Less space management
 Data are replicated across Geos
 Objects referenced by ID
 Index
 Write once read many
 Similar to library
 Objects do not change
 Single writer multiple readers
 Simplified rights management
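The object storage properties listed above (flat namespace, objects referenced by ID, write once read many, metadata kept separately and allowed to change) can be illustrated with a toy in-memory store. This is a sketch for illustration only; the `ObjectStore` class and its method names are hypothetical and do not correspond to any product API.

```python
import uuid


class ObjectStore:
    """Toy flat-namespace store: immutable data, mutable metadata."""

    def __init__(self):
        self._data = {}      # object ID -> immutable bytes
        self._metadata = {}  # object ID -> descriptor dict (may change)

    def put(self, data: bytes, **metadata) -> str:
        """Store data once and return the ID used to reference it."""
        object_id = uuid.uuid4().hex
        self._data[object_id] = bytes(data)
        self._metadata[object_id] = dict(metadata)
        return object_id

    def get(self, object_id: str) -> bytes:
        """Read an object by ID; there is no path or directory lookup."""
        return self._data[object_id]

    def update_metadata(self, object_id: str, **changes) -> None:
        """Descriptors can change over time; the data itself cannot."""
        self._metadata[object_id].update(changes)

    def find(self, **criteria) -> list:
        """Return IDs of objects whose metadata matches all criteria."""
        return [oid for oid, meta in self._metadata.items()
                if all(meta.get(k) == v for k, v in criteria.items())]


store = ObjectStore()
scan_id = store.put(b"image bytes", kind="scan", status="new")
store.update_metadata(scan_id, status="reviewed")
print(store.find(kind="scan", status="reviewed") == [scan_id])
```

The class deliberately offers no method to overwrite stored bytes, which is how the sketch models the write-once behavior.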
Distributed Content Repositories
StorageGRID
Large content repository for big, unstructured data
 Billions of data sets, dozens of petabytes
Create, manage and consume content globally
 Predictable access to data independent of location
 Policy-controlled data stores at each site
Intelligent data classification and access
 Metadata-based management
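Policy-controlled, metadata-based placement as described above can be pictured as an ordered list of rules that map object metadata to the sites holding a copy. The following sketch is a generic illustration and not StorageGRID's rule syntax; the rule contents, site names and the `place` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlacementRule:
    match: dict   # metadata key/value pairs that must all be present
    sites: tuple  # sites that should hold a copy


# Hypothetical rules, evaluated in order; the last rule is a catch-all.
RULES = [
    PlacementRule({"type": "medical-image"}, ("site-a", "site-b")),
    PlacementRule({"type": "log"}, ("site-a",)),
    PlacementRule({}, ("site-b",)),
]


def place(metadata: dict, rules=RULES) -> tuple:
    """Return the sites chosen by the first rule matching the metadata."""
    for rule in rules:
        if all(metadata.get(k) == v for k, v in rule.match.items()):
            return rule.sites
    raise LookupError("no placement rule matched")


print(place({"type": "medical-image", "patient": "anonymous"}))
print(place({"type": "video"}))
```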
StorageGRID Functional Diagram
[Diagram: StorageGRID functional layers]
NAS I/O: NAS Protocols (SG 9)
Object Ingest and Retrieval: HTTP API / CDMI
Metadata Tagging and Query
Policy-Driven Data Placement
Global Object Namespace
Object-Level Data Management
Location-Transparent Distributed Object Store
Storage Systems
Media Content Repository
PNI Digital Media
 High-performance, scalable storage infrastructure built to support 17 million revenue-generating transactions annually
 100% uptime even during peak holiday access when transactions increase 6 to 10 times
 3PB of rich media data
 Consumer access to 950 million digital images
 20,000 worldwide retail locations, online fulfillment partners and in-store kiosks
 Wal-Mart Canada, Costco, Sam’s Club, Tesco, CVS/pharmacy, and Kodak
 NetApp FAS6280 and FAS3200, Data ONTAP, and FlashCache
“We’ve increased the number of retail partners we work with from 2,000 to almost 20,000 in just a few years. In the past 6 years, we’ve seen a 1,900% increase in transactions. This plus the massive increase in digital images uploaded by consumers demanded a more robust and highly scalable storage infrastructure.”
– Zach Wickes, Vice President of Technology, PNI
Health in the Cloud
 STaaS offering for healthcare providers
 Medical Image Archive Cloud
 Two sites with ~1PB each
 2TB+ local cache at each edge site
 8x growth in capacity last 12 months
 100% uptime since start of service
 “Forever” retention policies
 ~60% of customers use hybrid cloud model
 Solution offers a proven 100% uptime with automated data movement from on-premises to off-premises public clouds with “keep forever” retention policy and indefinite growth
Press Release: http://www.netapp.com/us/company/news/news-rel-20111128-36413.html
Integrated Big Data Solutions and Expertise
Planning and implementation expertise for Big Data
Turn-key solution stacks and Big Data services
Big Data System Integrators Solutions Built on
NetApp®
Reference Material
FlexPod Select
Common Architecture
Software Solution
Appliance
Solution Rack
Application Packaging
Analytics
+
Integration
Management
Efficiency
Validated Architecture
& SKUs
Infrastructure Integration
& Distribution
Services & Support
Visualization
Operational Integration
& System Integrators
Big Data Summary
 Enable enterprise customers to gain business advantage
 Practical solutions proven to reduce complexity, increase efficiency and lower cost of ownership
 Open standards based with best-in-class partnerships
For more information: http://www.netapp.com/us/company/leadership/big-data/
Next Steps - Team with the Experts
 Strategic Assessment
 Business goals
 Data growth needs
 Use case discovery (partner delivery)
 Consult
 Solution architecture and design (NetApp delivery)
 Deploy
 Installation and implementation (NetApp delivery)
 Solution implementation (partner delivery)
Support options:
Global support available from NetApp and partners
Thank You
