Bill Lisse - Central Ohio ISSA

“Big Data” A New Security Paradigm
Bill Lisse, CISSP, CISA, CIPP, PMP, G2700
• Global ISO for OCLC Online Computer Library Center, Inc.
• Over 25 years of security, audit, investigative experience
• Both U.S. Government and commercial organizations
• Financial Institutions, Manufacturing & Distribution, Healthcare
• OCLC WorldCat®
72,000 - Number of libraries represented worldwide
1,801,677,890 - Number of holdings
170 - Number of countries and territories with library holdings in WorldCat
Every 10 seconds - How often a record is added
Over 470 - Number of languages and dialects represented
Every 4 seconds - How often a request is filled through WorldCat Resource Sharing
256,514,231 - Number of bibliographic records
What Is “Big Data”?
• Logical outgrowth of increased use of virtualization
technology, cloud computing and data center consolidation
• NoSQL, defined as non-relational, distributed, and horizontally
scalable data stores
• Abandons the constraints of schemas and transactional consistency in
favor of simple usage and massive scalability in terms of data storage
and processing capabilities
• A technology that can handle big data, storing more and being able to
analyze the aggregate, at a scale beyond the reach of relational
databases (RDBMS)
• The NoSQL alternative to RDBMS ACID is sometimes described as BASE
[Basically Available, Soft state, Eventual consistency]
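The "eventual consistency" part of BASE can be illustrated with a toy model. This is a minimal sketch, assuming two in-memory replicas and a last-write-wins rule keyed on a version number; the `Replica` class and `sync` function are hypothetical names, not taken from any particular NoSQL product.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One node's copy of a key-value store: key -> (version, value)."""
    data: dict = field(default_factory=dict)

    def write(self, key, value, version):
        # Last-write-wins: keep the entry with the highest version number.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Anti-entropy pass: each replica adopts the other's newer versions."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("holdings", 100, version=1)
sync(r1, r2)                           # both replicas agree on version 1
r1.write("holdings", 101, version=2)   # accepted by r1 only
stale = r2.read("holdings")            # r2 has not seen version 2 yet
sync(r1, r2)
fresh = r2.read("holdings")            # replicas have converged
```

Between the second write and the next `sync`, a reader hitting `r2` sees the old value. That window is what an ACID database would not allow, and it matters for security controls that assume every node sees the same state.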
New Security Paradigm
• Most of us don't have the tools or processes designed to
accommodate nonlinear data growth
Traditional security tools may no longer provide value
• What Were Your Tools and Processes Designed to Do?
• How difficult is it to run a malware scan across a large NAS
volume or SAN? Would it be feasible to scan through it all
every day, as we do now, with 100K more?
• If data discovery is required to support data leak prevention
(DLP) or regulatory compliance, what are the implications?
• Some scenarios where data size could be a factor in the
proper operation of a security control are:
log parsing
file monitoring
encryption/decryption of stored data
file-based data integrity validation controls
From a security standpoint, we are all starting from scratch
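The feasibility question about daily scans comes down to simple arithmetic. The sketch below is a back-of-the-envelope estimate only; the data sizes and the 200 MB/s sustained scan throughput are hypothetical figures chosen for illustration, not measurements.

```python
def full_scan_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to read every byte once at a sustained throughput."""
    total_mb = data_tb * 1_000_000  # decimal terabytes to megabytes
    return total_mb / throughput_mb_s / 3600


def fits_daily_window(data_tb: float, throughput_mb_s: float,
                      window_hours: float = 24.0) -> bool:
    """True if a full pass completes within the scan window."""
    return full_scan_hours(data_tb, throughput_mb_s) <= window_hours


# Hypothetical volumes: a 10 TB file share versus a 1 PB (1000 TB) data store.
small = full_scan_hours(10, 200)
large = full_scan_hours(1000, 200)
```

Under these assumptions the 10 TB share takes roughly 14 hours, so a daily pass is just possible, while the 1 PB store takes about 58 days for a single pass. The same calculation applies to log parsing, integrity hashing, and bulk encryption.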
Targets and Threats
• "Eggs in one Basket" – Centralized data is a lucrative target
• Growth in research and hacker activity targeting NoSQL databases
• Oozie (Proxy) is a superuser capable of performing any operation as
any user
• Hadoop Distributed File System (HDFS) proxies are authenticated by
IP address; stealing the IP of a Proxy could allow an attacker to
extract large amounts of data quickly
• An attacker who obtains the shared "secret key" of the Name Nodes or
Data Nodes can gain access to all of the data stored in an HDFS
• Data may be transmitted over insecure transports including HSFTP,
FTP and HTTP; HDFS proxies use the HSFTP protocol for bulk data
• Tokens: Must get them all - Kerberos Ticket Granting Token;
Delegation Token; Shared Keys (if Possible); Job Token; Block Access
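Why a stolen shared key is so damaging can be shown with a generic keyed-MAC token. This is a minimal sketch, assuming tokens are an HMAC over an identifier string; it is not Hadoop's actual token format, and the key, the identifier layout, and the function names are hypothetical.

```python
import hashlib
import hmac


def issue_token(secret_key: bytes, identifier: str) -> str:
    """Sign an identifier with the shared key (illustrative format only)."""
    return hmac.new(secret_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def verify_token(secret_key: bytes, identifier: str, token: str) -> bool:
    """Recompute the MAC and compare in constant time."""
    expected = issue_token(secret_key, identifier)
    return hmac.compare_digest(expected, token)


cluster_key = b"hypothetical-shared-cluster-secret"

# A legitimate token for one user and one block.
legit = issue_token(cluster_key, "user=alice;block=42;mode=READ")

# Without the key, a guessed token is rejected.
guess = issue_token(b"wrong-key", "user=mallory;block=42;mode=READ")
rejected = verify_token(cluster_key, "user=mallory;block=42;mode=READ", guess)

# With the stolen key, the attacker can mint a valid token for any identifier.
forged = issue_token(cluster_key, "user=mallory;block=42;mode=READ")
accepted = verify_token(cluster_key, "user=mallory;block=42;mode=READ", forged)
```

The verifier cannot tell a forged token from a legitimate one, because the only thing it checks is possession of the shared key.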
Actors in Hadoop security
• User (access HDFS and Map-Reduce services)
• HDFS and Map-Reduce services (services user requests and
coordinates among themselves to perform Hadoop cluster
• Proxy service like Oozie (accesses Hadoop services on behalf of
Vulnerabilities
• Almost every NoSQL developer is in a learning mode; over a hundred
different NoSQL variants
• HDFS does not provide high availability, because an HDFS file system
instance requires one unique server, the name node (single point of failure)
• Environments can include data of mixed classifications and security
• Aggregating data from multiple sources can cause access control and data entitlement
• Aggregating data into one environment also increases the risk of data theft and accidental
• There is no such thing as vulnerability assessment or database activity
monitoring for NoSQL
• Label security is based on schema, which does not exist in NoSQL; there is
no object-level security (collection, column)
Vulnerabilities (Cont.)
• Encryption can be problematic
• data and indices need to be in clear text for analysis, requiring application designers to
augment security with masking, tokenization, and select use of encryption in the
application layer
• DoS attacks
• NoSQL Application Vulnerabilities
Connection Pollution
JSON Injection
Key Brute Force
HTTP/REST based attacks
Server-side JavaScript: Integral to many NoSQL databases
NoSQL Injection
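NoSQL and JSON injection usually work by smuggling a query operator into a field that the application expected to be a plain string. The sketch below uses a tiny in-memory stand-in for a document-store filter (equality plus a `$ne` operator); the `matches`, `login_unsafe`, and `login_safe` functions and the sample user record are hypothetical, and a real application would also store password hashes rather than plaintext.

```python
import json

USERS = [{"user": "alice", "password": "s3cret"}]


def matches(doc: dict, query: dict) -> bool:
    """Toy document filter: plain equality plus the $ne (not-equal) operator."""
    for field_name, cond in query.items():
        if isinstance(cond, dict):
            if "$ne" in cond and doc.get(field_name) == cond["$ne"]:
                return False
        elif doc.get(field_name) != cond:
            return False
    return True


def login_unsafe(body: str) -> bool:
    """Passes parsed JSON straight into the query, operators included."""
    creds = json.loads(body)
    query = {"user": creds["user"], "password": creds["password"]}
    return any(matches(doc, query) for doc in USERS)


def login_safe(body: str) -> bool:
    """Rejects any credential that is not a plain string before querying."""
    creds = json.loads(body)
    user, password = creds.get("user"), creds.get("password")
    if not isinstance(user, str) or not isinstance(password, str):
        return False
    return any(matches(doc, {"user": user, "password": password})
               for doc in USERS)


# The attacker sends an operator object instead of a password string.
attack = '{"user": "alice", "password": {"$ne": null}}'
bypassed = login_unsafe(attack)   # "password is not null" matches alice
blocked = login_safe(attack)
```

The unsafe version authenticates the attacker without a password. Type-checking inputs at the application layer closes this particular hole, which is one reason the slides place so much responsibility on application designers.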
Architecture and Design Considerations
• Define your use cases - Security requirements derived from core business
and data requirements; assess if NoSQL is still a valid solution
• Based on security requirements, decide if you should host your database(s) in your
own Data Center or on the Cloud
• Categorize use cases to see where NoSQL is a good solution and where it's not
• Define Data Security Strategy and Standards
• Data Classification is imperative
• How do we prevent bad data from getting into the NoSQL data store?
• Built-in HDFS security features such as ACLs and Kerberos used alone are
not adequate for enterprise needs
• Software running behind a firewall with inadequate security?
• Authentication
• Role Based Access Control (RBAC)
• Support for AUTHN (Authentication) and AUTHZ (Authorization)
• Some federated identity systems implemented with SAML, and environment
security measures embedded with the cloud infrastructure
• ACLs for Transactional as well as Batch Processes
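Role-based access control with separate entries for interactive users and batch processes can be sketched in a few lines. This is an illustrative model only; the role names, the resources, and the `is_allowed` function are hypothetical and not tied to any product.

```python
# Each role maps to a set of (resource, action) pairs it may perform.
ROLE_PERMISSIONS = {
    "analyst": {("sales_events", "read")},
    "etl_batch": {("sales_events", "read"), ("sales_events", "write")},
    "admin": {("sales_events", "read"), ("sales_events", "write"),
              ("audit_log", "read")},
}

# Principals include both people and batch jobs.
PRINCIPAL_ROLES = {
    "alice": {"analyst"},
    "nightly_job": {"etl_batch"},
    "bob": {"admin"},
}


def is_allowed(principal: str, resource: str, action: str) -> bool:
    """Default deny: allow only if some assigned role grants the pair."""
    roles = PRINCIPAL_ROLES.get(principal, set())
    return any((resource, action) in ROLE_PERMISSIONS.get(role, set())
               for role in roles)
```

For example, `is_allowed("alice", "sales_events", "write")` is denied while the same call for `"nightly_job"` is allowed, and an unknown principal is denied everything. Keeping the check in one function is what lets it be applied consistently to transactional and batch paths.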
Architecture and Design Considerations (Cont.)
• Defense In Depth
• The security features in Cloudera Distribution with Hadoop 3 meet the needs of
most customers because typically the cluster is accessible only to trusted
• Hadoop's current threat model assumes that users cannot:
• Have root access to cluster machines
• Have root access to shared client machines
• Read or modify packets on the network of the cluster
• Middle Tier: Act as broker in interacting with Hadoop server: Apache Hive,
Oozie etc.
• RPC Connection Security: SASL GSSAPI
• HDFS: Permissions Model
• Job Control: ACL based; includes a View ACL
• Web Interfaces: OOTB Kerberos SSL support
• HDFS and MapReduce modules should have their own users
• NoSQL DB Servers behind Firewall and Proxy
Architecture and Design Considerations (Cont.)
• Separate persistence layer to apply Authentication and ACLs in a standard
and centralized fashion
• Batch jobs and other utility scripts that access database outside the
applications should be controlled
• Logging
• Audit trails are whatever the application developer built in, so they are both
application-specific and limited in scope
• What data needs to be logged for security analytics purposes?
• What should be the log format for business vs. security logs?
• Do we need to store the security logs in a different file (a new log4j appender)
so only authorized users (admin) will have access to it?
• How would the logs work with a SIEM tool (if applicable)?
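The idea of routing security events to their own restricted file can be sketched as follows. The slide mentions a log4j appender; this example uses Python's standard `logging` module as an analogue, and the logger names, file names, and log format are assumptions made for illustration.

```python
import logging
import os
import tempfile


def build_loggers(log_dir: str):
    """Create separate business and security loggers, one file each."""
    business = logging.getLogger("app.business")
    security = logging.getLogger("app.security")
    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    for logger, filename in ((business, "business.log"),
                             (security, "security.log")):
        logger.setLevel(logging.INFO)
        logger.propagate = False  # keep events out of the root logger
        handler = logging.FileHandler(os.path.join(log_dir, filename))
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    # Restrict the security log to its owner (effective on POSIX systems).
    os.chmod(os.path.join(log_dir, "security.log"), 0o600)
    return business, security


log_dir = tempfile.mkdtemp()
business, security = build_loggers(log_dir)
business.info("order 123 shipped")
security.warning("failed login user=alice src=10.0.0.5")
```

A consistent `key=value` message style in the security log is one simple way to make the file easy for a SIEM tool to parse. Calling `build_loggers` more than once in the same process would attach duplicate handlers, so a real setup would guard against that.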
Architecture and Design Considerations (Cont.)
• If necessary, put NoSQL-stored data into separate "enclaves" to ensure that
it can be accessed by only authorized personnel
• Security infrastructure for Hadoop RPC uses Java SASL APIs
• Quality of Protection (QOP) settings can be used to enable encryption for
Hadoop RPC protocols
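The QOP levels correspond to standard SASL quality-of-protection values. In stock Apache Hadoop the level is typically chosen with the `hadoop.rpc.protection` property in core-site.xml, which accepts `authentication`, `integrity`, or `privacy`; the helper functions below are a hypothetical sketch for reasoning about those values, not part of Hadoop.

```python
# Protection level -> SASL QOP string.
SASL_QOP = {
    "authentication": "auth",      # authenticate only
    "integrity": "auth-int",       # authentication plus integrity checks
    "privacy": "auth-conf",        # authentication, integrity, and encryption
}


def sasl_qop(protection: str) -> str:
    """Translate a protection level name into its SASL QOP value."""
    key = protection.strip().lower()
    if key not in SASL_QOP:
        raise ValueError(f"unknown protection level: {protection!r}")
    return SASL_QOP[key]


def encrypts_rpc(protection: str) -> bool:
    """Only the confidentiality level encrypts RPC payloads."""
    return sasl_qop(protection) == "auth-conf"
```

Only the `privacy` level encrypts RPC traffic, and none of these settings covers HDFS block transfer, which is handled separately.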
“In nutshell, Hadoop has strong support for authentication and
authorization. On the other hand privacy and data integrity is
optionally supported when Hadoop services are accessed
through RPC and HTTP, while the actual HDFS blocks are
transferred unencrypted. Hadoop assumes network involved in
HDFS block transfer is secure and not publicly accessible for
sniffing, which is not a bad assumption for private enterprise
- Nitin Jain
Questions?
Bill Lisse
[email protected]