Big Data Security

Gopi Ramamoorthy
CISSP, CISA, CISM
Agenda
 Big Data – Quick Overview
 Big Data Ecosystem – Quick Overview
 Big Data Security – Current Options
 Big Data Security – An Efficient Way
Introduction
 What is the presentation about?
   Securing Big Data using the different available technologies and workflows, without impacting performance
 What is Big Data?
   Data sets that are too large and complex to manipulate or interrogate with standard methods or tools
 Some of the characteristics of Big Data are the 4 Vs:
   Volume
   Velocity
   Variety
   Volatility
Problem Overview
 Feeds into HDFS come from many different sources.
 The Hadoop ecosystem does not provide built-in security and vault features comparable to those of RDBMS systems.
 Many components in the ecosystem do not address security, directly or indirectly.
 Encrypting and decrypting huge amounts of data slows performance and, at times, consumes heavy resources.
 This presentation discusses building or changing infrastructure to resolve the above problems without impacting performance or response time.
Units used to measure Big Data Size
Prefix   10^n                    Symbol   Example Data Channel
Giga     10^9                    G
Tera     10^12                   T        Common with RDBMS databases
Peta     10^15 (1,000 TB)        P        User data created on an online site in a couple of hours
Exa      10^18 (1 million TB)    E        Data created on the internet every day
Zetta    10^21                   Z
Yotta    10^24                   Y
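The relationships in the units table (1 PB = 1,000 TB, 1 EB = 1 million TB) follow directly from the exponents. A minimal sketch of the arithmetic, assuming decimal (SI) prefixes as listed; the function name `convert` is illustrative only:

```python
# Decimal (SI) prefixes from the units table, as powers of ten.
SI_EXPONENTS = {"G": 9, "T": 12, "P": 15, "E": 18, "Z": 21, "Y": 24}


def convert(value, from_symbol, to_symbol):
    """Convert a quantity between two SI prefixes (for example PB to TB)."""
    # Use the integer exponent difference so upward conversions stay exact.
    diff = SI_EXPONENTS[from_symbol] - SI_EXPONENTS[to_symbol]
    if diff >= 0:
        return value * 10 ** diff
    return value / 10 ** -diff


print(convert(1, "P", "T"))  # 1000
print(convert(1, "E", "T"))  # 1000000
```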
Hadoop Ecosystem
Category                         Tool / Framework
Getting Data Into HDFS           Flume, Sqoop, Scribe, Chukwa, Kafka
Compute Frameworks               MapReduce, YARN, Weave, Cloudera SDK
Querying Data                    Pig, Hive, Impala, Java MapReduce, Hadoop Streaming, Cascading Lingual, Stinger/Tez, Hadapt, Greenplum HAWQ, Cloudera Search, Presto
NoSQL Stores                     HBase, Cassandra, Redis, Amazon SimpleDB, Voldemort, Accumulo
Hadoop in the Cloud              Amazon EMR, Hadoop on Rackspace, Hadoop on Google Cloud
Workflow Tools & Schedulers      Oozie, Azkaban, Cascading, Scalding, Lipstick
Serialization Frameworks         Avro, Trevni, Protobuf, Parquet
Monitoring Systems               Hue, Ganglia, OpenTSDB, Nagios
Applications / Platforms         Mahout, Giraph, Lily
Distributed Coordination         ZooKeeper, BookKeeper
Distributed Message Processing   Kafka, Akka, RabbitMQ
BI                               Datameer, Tableau, Pentaho, SiSense, SumoLogic
YARN-Based Frameworks            Samza, Spark, Malhar, Giraph, Storm, Hoya
Libraries & Frameworks           Kiji, Elephant Bird, Summingbird, Apache Crunch, Apache DataFu, Continuity
Data Management                  Apache Falcon
Security                         Apache Sentry, Apache Knox
Testing Frameworks               MRUnit, PigUnit
Miscellaneous                    Spark, Shark
Hadoop Ecosystem
 Core: A set of shared libraries
 HDFS: The Hadoop filesystem
 MapReduce: Parallel computation framework
 Flume: Collection and import of log and event data
 Sqoop: Imports data from relational databases
 ZooKeeper: Configuration management and coordination
 HBase: Column-oriented database on HDFS
 Hive: Data warehouse on HDFS with SQL-like access
 Pig: Higher-level programming language for Hadoop computations
 Oozie: Orchestration and workflow management
 Impala: Real-time query tool
 Mahout: A library of machine learning and data mining algorithms
Basic Security
 Network Separation
 Authentication
 Permission
 Authorization
 Management Solution
 Encryption
Efficient Security
 Data categorization
 Data masking
 Tokenization
 Do not send sensitive data to HDFS unless it is required
 Use workflows
 Separate sensitive data into another cluster
 Monitor the Hadoop ecosystem
 Deploy SIEM-style monitoring
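Masking and tokenization from the list above can both be applied before a record is sent to HDFS, so the cluster never holds the raw sensitive values. The following is a minimal sketch only: the `TokenVault` class, the `tok_` prefix, and the sample record are hypothetical, and a real vault would be a hardened, access-controlled service rather than an in-memory dictionary.

```python
import secrets


def mask(value, visible=4, fill="*"):
    """Mask all but the last `visible` characters of a string."""
    if len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


class TokenVault:
    """In-memory stand-in for a token vault (illustrative only)."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        # Reuse an existing token so joins on the tokenized column still work.
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token):
        # Only callers with access to the vault can recover the original.
        return self._value_by_token[token]


vault = TokenVault()
# Made-up sample record, not real data.
record = {"name": "Jane Doe", "ssn": "123-45-6789", "account": "9876543210"}
safe_record = {
    "name": record["name"],
    "ssn": vault.tokenize(record["ssn"]),  # random token, reversible via the vault
    "account": mask(record["account"]),    # irreversible: "******3210"
}
print(safe_record)
```

Masking is one-way and suits fields that analytics never needs in full. Tokenization keeps a reversible mapping outside the cluster, which fits fields that must occasionally be recovered.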
Big Data: Security Based on Data and Workflow
 Identify Channels and Data Sources
 Identify Data Content
 Introduce/Extend Data Classification to Bigdata
 Identify workflow
 Select Access Methods
 Select Encryption Methods
 Select Analytics tool
 Define Archive Policy
 Define Purge and Retention Policy
Must-Have Features for Security Modules/Architecture
 Key Manager
 No impact on performance
 HSM Integration and Support
 Compliance Support
 Easy to Administer and Migrate
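One reason a key manager appears on this list is key rotation: new data is written under a new key while older data stays readable under the key version it was written with. The sketch below is a toy illustration of that idea, not a description of any particular product. The `KeyManager` class and its methods are hypothetical, and a production system would keep keys in an HSM rather than in process memory.

```python
import secrets


class KeyManager:
    """Toy in-memory key manager with versioned keys (illustrative only)."""

    def __init__(self):
        self._keys = {}    # version number -> key bytes
        self._current = 0
        self.rotate()      # create version 1

    def rotate(self):
        """Generate a new 256-bit key and make it the current version."""
        self._current += 1
        self._keys[self._current] = secrets.token_bytes(32)
        return self._current

    def current(self):
        """Return (version, key) to use for new writes."""
        return self._current, self._keys[self._current]

    def get(self, version):
        """Return an older key so existing data can still be decrypted."""
        return self._keys[version]


km = KeyManager()
v1, k1 = km.current()
km.rotate()
v2, k2 = km.current()
print(v1, v2)             # 1 2
print(km.get(v1) == k1)   # True
```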
Data Categorization
 Data categorization is a well-known concept used to implement different levels of security based on the data.
 For Big Data, categorization needs to be extended to the complete data flow, from entry to end (purge).
 Implement multiple Big Data clusters based on data category.
 More on the following slides
Data Classification
 Super Sensitive
   DOB, SSN, IP, Design
 Sensitive
   Account, Address, Balance, etc.
 Confidential / Private
   Company Business Information, Vendor Information
 Public
   News Release, Public Finance Data
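The four classification levels can drive which cluster a record is routed to, which is one way to read "multiple clusters based on data category". A minimal sketch follows, assuming a record takes the highest level of any field it contains. The field names, the cluster names, and the choice to treat unknown fields as Confidential are all assumptions made for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    PUBLIC = 0
    CONFIDENTIAL = 1
    SENSITIVE = 2
    SUPER_SENSITIVE = 3


# Field-to-level mapping based on the slide's examples (field names are illustrative).
FIELD_LEVELS = {
    "dob": Level.SUPER_SENSITIVE,
    "ssn": Level.SUPER_SENSITIVE,
    "account": Level.SENSITIVE,
    "address": Level.SENSITIVE,
    "balance": Level.SENSITIVE,
    "vendor_info": Level.CONFIDENTIAL,
    "news_release": Level.PUBLIC,
}

# Hypothetical cluster names, one per category.
CLUSTER_FOR_LEVEL = {
    Level.PUBLIC: "cluster-public",
    Level.CONFIDENTIAL: "cluster-confidential",
    Level.SENSITIVE: "cluster-sensitive",
    Level.SUPER_SENSITIVE: "cluster-super-sensitive",
}


def classify(record, default=Level.CONFIDENTIAL):
    """Return the highest level among the record's fields.

    Unknown fields fall back to `default` (an assumption, chosen to fail safe).
    """
    return max((FIELD_LEVELS.get(name, default) for name in record), default=default)


def route(record):
    """Pick the target cluster for a record from its classification."""
    return CLUSTER_FOR_LEVEL[classify(record)]


print(route({"ssn": "000-00-0000", "balance": 10}))  # cluster-super-sensitive
print(route({"news_release": "Q3 results"}))         # cluster-public
```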
Big Data: Data Lifecycle
[Diagram: data sources feed Channels 1 through 8. Each channel is labeled with its own encryption method (e), access method (a), archive policy (ar), and purge/retention policy (pr). One path also shows an analyze step alongside purge/retention.]
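The per-channel labels in the lifecycle diagram (encryption, access, archive, purge/retention) can be read as a policy record attached to each channel. The sketch below shows one possible shape for such a record and how a scheduler might decide between retaining, archiving, and purging. The identifiers `e1`/`a1`, the channel names, and the day counts are made up for illustration and do not come from the slides.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChannelPolicy:
    encryption: str          # encryption method label, e.g. "e1"
    access: str              # access method label, e.g. "a1"
    archive_after_days: int  # age at which data moves to archive
    purge_after_days: int    # age at which data is purged


# Hypothetical policies; real values would come from the data classification.
POLICIES = {
    "channel-1": ChannelPolicy("e1", "a1", archive_after_days=90, purge_after_days=365),
    "channel-2": ChannelPolicy("e2", "a2", archive_after_days=30, purge_after_days=180),
}


def lifecycle_action(channel, ingested_on, today):
    """Return 'retain', 'archive', or 'purge' for data of the given age."""
    policy = POLICIES[channel]
    age_days = (today - ingested_on).days
    if age_days >= policy.purge_after_days:
        return "purge"
    if age_days >= policy.archive_after_days:
        return "archive"
    return "retain"


print(lifecycle_action("channel-1", date(2024, 1, 1), date(2024, 2, 1)))    # retain
print(lifecycle_action("channel-1", date(2024, 1, 1), date(2024, 6, 1)))    # archive
print(lifecycle_action("channel-2", date(2024, 1, 1), date(2024, 12, 31)))  # purge
```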
References and Acknowledgments
 Cloudera
 Project Rhino by Intel (open source)
 Zettaset
 Apache Projects
 Hadoop Illuminated
 IBM
 Yahoo
 Oracle
 Hortonworks
 And many more
Questions
