Big Data Open Source Software and Projects
ABDS in Summary II: Layer 5
I590 Data Science Curriculum
August 15 2014
Geoffrey Fox
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
Here are 17 functionalities. Technologies are presented in this order:
4 cross-cutting at top, then 13 in the order of the layered diagram
• Message Protocols
• Distributed Coordination
• Security & Privacy
• IaaS Management from HPC to hypervisors
• File systems
• Cluster Resource Management
• Data Transport
• SQL / NoSQL / File management
• In-memory databases & caches / Object-relational mapping / Extraction Tools
• Inter-process communication: Collectives, point-to-point, publish-subscribe
• Basic Programming model and runtime: SPMD, Streaming, MapReduce, MPI
• High level Programming
• Application and Analytics
Xen http://en.wikipedia.org/wiki/Xen supports a form of type 1 virtualization known
as paravirtualization, in which guests run a modified operating system. The guests are
modified to use a special hypercall ABI instead of certain architectural features.
Through paravirtualization, Xen can achieve high performance even on its host
architecture (x86), which has a reputation for non-cooperation with traditional
virtualization techniques.
Xen was developed at the University of Cambridge; Citrix acquired it with XenSource in
2007, and the Xen Project has been hosted by the Linux Foundation since 2013.
Responsibilities of the hypervisor include memory management and CPU scheduling
of all virtual machines ("domains"), and launching the most privileged domain
("dom0"), the only virtual machine which by default has direct access to hardware.
From dom0 the hypervisor can be managed and unprivileged domains ("domU")
can be launched.
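The dom0/domU split described above can be pictured with a small toy model. This is only an illustrative sketch: the names `Hypervisor`, `Domain`, `boot` and `launch_domu` are invented for this example and are not part of Xen or its toolstack.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    privileged: bool = False       # by default only dom0 is privileged
    hardware_access: bool = False  # direct access to physical devices


@dataclass
class Hypervisor:
    """Toy model (not Xen's API): the hypervisor boots dom0, which manages guests."""
    domains: list = field(default_factory=list)

    def boot(self) -> Domain:
        # The hypervisor itself launches the most privileged domain.
        dom0 = Domain("dom0", privileged=True, hardware_access=True)
        self.domains.append(dom0)
        return dom0

    def launch_domu(self, requester: Domain, name: str) -> Domain:
        # Unprivileged guests are started on request from dom0 only.
        if not requester.privileged:
            raise PermissionError("only dom0 may launch guest domains")
        domu = Domain(name)
        self.domains.append(domu)
        return domu


xen = Hypervisor()
dom0 = xen.boot()
guest = xen.launch_domu(dom0, "domU-1")
```

In this model a guest asking to launch another guest is refused, which mirrors the point that management goes through dom0.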
KVM, VirtualBox
• KVM http://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine is a
GPL-licensed virtualization infrastructure for the Linux kernel
that turns it into a hypervisor; it was merged into the Linux
kernel mainline in February 2007. It is listed here as type 2, although
because it runs inside the kernel it is sometimes classed as type 1
– It requires a processor with hardware virtualization extension.
• Oracle VirtualBox https://www.virtualbox.org/
http://en.wikipedia.org/wiki/VirtualBox is another well known type 2
hypervisor with GPL2 license
– Runs on many O/S
• Hyper-V http://en.wikipedia.org/wiki/Hyper-V is Microsoft's proprietary
hypervisor; it supports Windows and some variants of Linux as guests
• There must be a parent partition running
Windows Server
• OpenVZ http://openvz.org/Main_Page is GPL-licensed container software
rather than a hypervisor
• OpenVZ (Open VirtualiZation) or Open Virtuozzo is an operating system-level
virtualization technology based on the Linux kernel and operating
system. OpenVZ allows a physical server to run multiple isolated operating
system instances, known as containers, Virtual Private Servers (VPSs), or
Virtual Environments (VEs).
• Docker works well with containers
• OpenVZ is not true virtualization but really containerization, like FreeBSD jails
• Technologies like VMware and Xen are more flexible in that they virtualize
the entire machine and can run multiple operating systems and different
kernel versions.
• OpenVZ uses a single patched Linux kernel and therefore can run only Linux;
all containers share the same architecture and kernel version. However, as it
does not have the overhead of a true hypervisor, it is very fast and efficient.
• The disadvantage with this approach is the single kernel. All guests must
function with the same kernel version that the host uses.
• LXC (LinuX Containers) and Linux-Vserver are similar technologies
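The single-kernel constraint above can be stated as a simple check. This is a deliberately simplified sketch: the `Guest` type, the function names and the version strings are hypothetical, and real compatibility depends on whether the guest's userland works with the host kernel rather than on an exact string match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guest:
    name: str
    os_family: str       # e.g. "linux" or "windows"
    kernel_version: str  # kernel the guest expects (hypothetical values below)


def can_run_in_container(guest: Guest, host_kernel: str) -> bool:
    # Containers share the host's one kernel: Linux only, and the guest
    # must work with exactly the kernel the host is running.
    return guest.os_family == "linux" and guest.kernel_version == host_kernel


def can_run_in_full_vm(guest: Guest) -> bool:
    # A hypervisor virtualizes the whole machine, so each guest boots
    # its own operating system and kernel.
    return True


HOST_KERNEL = "2.6.32"  # hypothetical host kernel version
guests = [
    Guest("web", "linux", "2.6.32"),
    Guest("legacy", "linux", "2.6.18"),
    Guest("desktop", "windows", "n/a"),
]
container_ok = [g.name for g in guests if can_run_in_container(g, HOST_KERNEL)]
vm_ok = [g.name for g in guests if can_run_in_full_vm(g)]
```

Only the first guest fits the container host, while a full hypervisor could accept all three, at the cost of the extra overhead noted above.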
OpenStack, OpenNebula, CloudStack, Nimbus and Eucalyptus are all cloud or virtual machine
managers. They help users and system administrators use virtual machines.
– The big commercial public clouds have equivalent proprietary systems
OpenStack http://en.wikipedia.org/wiki/OpenStack http://www.openstack.org/ is a free
and open-source, Apache-licensed cloud computing software platform. Users
primarily deploy it as an infrastructure as a service (IaaS) solution. The technology
consists of a series of interrelated projects that control pools of processing, storage, and
networking resources throughout a data center, which users manage through a web-based dashboard, command-line tools, or a RESTful API.
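As an illustration of the RESTful route, the sketch below builds (but does not send) an HTTP request that would list virtual machines. It assumes the OpenStack Compute API conventions of a `/servers` resource and an `X-Auth-Token` header; the endpoint URL and token are placeholders, and the exact path and version prefix depend on the deployment.

```python
import urllib.request


def build_list_servers_request(compute_endpoint: str, token: str) -> urllib.request.Request:
    """Prepare a GET request for the servers collection (nothing is sent)."""
    url = compute_endpoint.rstrip("/") + "/servers"
    return urllib.request.Request(
        url,
        headers={
            "X-Auth-Token": token,         # token obtained from the identity service
            "Accept": "application/json",  # responses are JSON documents
        },
        method="GET",
    )


# Placeholder endpoint and token, for illustration only.
req = build_list_servers_request("https://cloud.example.org/compute/v2.1", "hypothetical-token")
```

Passing `req` to `urllib.request.urlopen` would issue the call against a real endpoint; the dashboard and command-line tools mentioned above ultimately make requests of this kind.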
OpenStack began in 2010 as a joint project of Rackspace Hosting and NASA. Currently, it
is managed by the OpenStack Foundation, a non-profit corporate entity established in
September 2012 to promote OpenStack software and its community. More than 200
companies have joined the project, including Arista Networks, AT&T, AMD, Avaya,
Canonical, Cisco, Dell, EMC, Ericsson, Go Daddy, Hewlett-Packard, IBM, Intel, Mellanox,
Mirantis, NEC, NetApp, Nexenta, Oracle, PLUMgrid, Red Hat, SUSE Linux and VMware.
The OpenStack community collaborates around a six-month, time-based release cycle
with frequent development milestones. During the planning phase of each release, the
community gathers for the OpenStack Design Summit to facilitate developer working sessions and to assemble plans.
The most recent OpenStack Summit, in May 2014 in Atlanta, drew 4,500 attendees, a
50% increase from the Hong Kong Summit six months earlier.
Apache CloudStack
• http://cloudstack.apache.org/
http://en.wikipedia.org/wiki/Apache_CloudStack Has a reputation for
solid software but does not have the rapid adoption of OpenStack;
unusual that the Apache solution is not the most popular!
• Came from Citrix via acquisitions
• Features include
– Built-in high-availability for hosts and VMs
– AJAX web GUI for management
– AWS API compatibility
– Hypervisor agnostic (VMware, KVM, XenServer, Xen Cloud Platform (XCP))
– Snapshot management
– Usage metering
– Network management (VLANs, security groups)
– Virtual routers, firewalls, load balancers
– Multi-role support
Eucalyptus, Nimbus
• Eucalyptus https://www.eucalyptus.com/
http://en.wikipedia.org/wiki/Eucalyptus_(software) was the top
academic project in 2009; it was commercialized and was just recently
purchased by Hewlett Packard
– Eucalyptus had both commercial and open source GPL3 tracks, but the latter was
not developed as vigorously as other open source solutions
– Perhaps the first to offer an AWS-compatible interface
• Apache-licensed Nimbus
http://www.nimbusproject.org/ was probably the most effective
academic cloud software after Eucalyptus was commercialized and
before OpenStack became popular
IaaS request popularity by year
OpenNebula
• http://en.wikipedia.org/wiki/OpenNebula
http://opennebula.org/ Apache License.
• OpenNebula orchestrates storage, network, virtualization,
monitoring, and security technologies to deploy multi-tier services
(e.g. compute clusters) as virtual machines on distributed
infrastructures, combining both data center resources and remote
cloud resources, according to allocation policies
• The toolkit includes features for integration, management,
scalability, security and accounting. It also claims standardization,
interoperability and portability, providing cloud users and
administrators with a choice of several cloud interfaces (Amazon
EC2 Query, OGF Open Cloud Computing Interface and vCloud) and
hypervisors (Xen, KVM and VMware), and can accommodate
multiple hardware and software combinations in a data center
• Good system which is strongly promoted in Europe but little used in
the USA, where it is eclipsed by OpenStack
VMware vCloud
• VMware ESX http://en.wikipedia.org/wiki/VMware_ESX is an enterprise-level computer virtualization product offered by VMware. ESX is a
component of VMware's larger offering, VMware Infrastructure, which
adds management and reliability services to the core server product.
VMware recommends that deployments running the earlier ESX
architecture migrate to the newer ESXi hypervisor architecture.
• VMware ESX and ESXi are VMware's enterprise software Type 1
hypervisors for guest virtual servers; they run on host server hardware
without an underlying operating system.
• vSphere http://en.wikipedia.org/wiki/VMware_vSphere uses VMware’s
ESXi hypervisor adding management (as in OpenStack)
• Note desktop VMware Workstation is a type 2 hypervisor
• VMware has historically been a software vendor focused on virtualization
technologies. It entered the cloud IaaS market when it launched the
VMware vCloud Hybrid Service (vCHS) into general availability in
September 2013. http://en.wikipedia.org/wiki/VCloud This allows
customers to migrate work on demand from their "internal cloud" of
cooperating VMware hypervisors to a remote VMware cloud
– This is called cloud bursting
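Cloud bursting can be sketched as a placement rule: fill the internal cloud first and send the overflow to the remote cloud. The function below is a minimal illustration of that idea only; the job names, demand units and capacity figure are hypothetical, and it does not reflect how vCHS actually schedules or migrates work.

```python
def place_jobs(job_demands: dict, internal_capacity: int) -> dict:
    """Assign jobs to the internal cloud until it is full, then burst to remote."""
    placements = {}
    used = 0
    for job, demand in job_demands.items():
        if used + demand <= internal_capacity:
            placements[job] = "internal"
            used += demand
        else:
            # Not enough room left on premises: burst this job to the remote cloud.
            placements[job] = "remote"
    return placements


# Hypothetical workload, measured in arbitrary capacity units.
jobs = {"web": 4, "batch": 6, "analytics": 5}
result = place_jobs(jobs, internal_capacity=10)
```

With these numbers the first two jobs exactly fill the internal cloud and the third is placed remotely.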
Amazon, Azure, Google Clouds
Gartner's “magic quadrant” of 28 May 2014 summarizes public clouds
Note Amazon is way ahead!
Google with GCE (Google Compute Engine) is just starting IaaS. Previously it offered
PaaS with Google App Engine
Microsoft has recently expanded Azure but is still catching up
Figure: layered diagram. Recovered labels: Dynamic Orchestration and Dataflow; Class Usages (e.g. run GPU & multicore); Applications (Control Robot; Cloud e.g. MapReduce; HPC e.g. PETSc, SAGA; Computer Science e.g. compiler tools, sensor nets, monitors); Infrastructure (Software Defined Computing (virtual clusters); Hypervisor, Bare Metal; Operating System); Software Defined, OpenFlow, GENI
Amazon Web Services AWS
• Compute: Elastic Compute Cloud (EC2) offers multitenant, fixed-size and
nonresizable, Xen-virtualized VMs without autorestart. Single-tenant VMs
are available via Dedicated Instances. There are special options for HPC,
including graphics processing units (GPUs). AWS does not have any formal
private cloud offerings, though it is willing to negotiate such deals (such as
its deal for the U.S. intelligence community cloud).
• Storage: VM storage is ephemeral. Persistence requires VM-independent
block storage (Elastic Block Store). There is an option for SSDs, as well as
storage performance guarantees (Provisioned IOPS). Object-based storage
(Simple Storage Service [S3]) is integrated with a CDN (CloudFront); there is
an option for long-term archive storage (Glacier), and AWS offers its own
cloud storage gateway appliance.
• Network: AWS offers a full range of networking options. Complex
networking and IPsec VPN is done via Amazon Virtual Private Cloud (VPC).
Third-party connectivity is via partner exchanges (AWS Direct Connect).
• Security: RBAC (Role-Based Access Control) is per-element, with customer-defined roles and exceptional control over permissions. AWS has obtained
many security and compliance-related certifications and audits.
Google Compute Engine
• Google has been operating App Engine since 2008, but did not enter the IaaS
market until the general-availability launch of GCE in December 2013.
• Compute: GCE offers multitenant, fixed-size and nonresizable, KVM-virtualized
VMs, metered by the minute. Provisioning is exceptionally fast (typically under 1
• Storage: VM storage is persistent, and there is also VM-independent block storage.
All block storage is encrypted.
• Network: Third-party private connectivity is not supported. Customers cannot
bring their own private IP addresses (although this need may possibly be addressed
by GCE's Advanced Routing features). There is no back-end load balancing.
• Security: RBAC permissions apply to the whole account.
• Google's strategy for Google Cloud Platform centers on the concept of allowing
other organizations to "run like Google" by taking Google's highly innovative
internal technology capabilities and exposing them as services that other
companies can purchase. Consequently, although Google is a late entrant to the
IaaS market, it is primarily productizing existing capabilities, rather than having to
engineer those capabilities from scratch. It will therefore be able to advance its
offering more rapidly than most competitors.
Microsoft Azure
• The Azure business was previously strictly PaaS with a Windows
and .Net focus, but Microsoft launched Azure Infrastructure
Services (which include Azure Virtual Machines and Azure Virtual
Network) into general availability in April 2013, thus entering the
cloud IaaS market.
• Compute: Azure VMs (Linux or Windows) are fixed-size, paid-by-the-VM, and Hyper-V-virtualized; they are metered by the minute.
• Storage: Block storage ("virtual hard disk") is persistent and VM-independent. Object-based cloud storage is integrated with a CDN.
• Network: There is no support for complex network topologies.
Third-party connectivity is via partner exchange (Azure
• Security: Virtual network topology limitations prevent useful
deployment of most security-related virtual appliances, such as a
perimeter intrusion detection/prevention system (IDS/IPS). RBAC
uses Azure Active Directory, but permissions are whole-account.
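The difference between per-element RBAC (as described for AWS) and whole-account permissions (as described for GCE and Azure) can be shown with two small lookup functions. This is a conceptual sketch only: the data layout, user name, action names and resource names are invented here and do not correspond to any provider's API.

```python
def allowed_whole_account(grants: dict, user: str, action: str, resource: str) -> bool:
    # Whole-account model: a grant applies to every resource in the account,
    # so the resource argument plays no part in the decision.
    return action in grants.get(user, set())


def allowed_per_element(grants: dict, user: str, action: str, resource: str) -> bool:
    # Per-element model: permissions are granted resource by resource.
    return action in grants.get((user, resource), set())


# Hypothetical grants for one user.
account_grants = {"alice": {"start", "stop"}}
element_grants = {
    ("alice", "vm-1"): {"start", "stop"},
    ("alice", "vm-2"): {"start"},
}

stop_vm2_account = allowed_whole_account(account_grants, "alice", "stop", "vm-2")
stop_vm2_element = allowed_per_element(element_grants, "alice", "stop", "vm-2")
```

Under the whole-account model the user can stop any VM once the role is granted; under the per-element model the same user can be limited to starting `vm-2` without being able to stop it.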
Google Cloud DNS & Amazon Route 53
• Google Cloud DNS
– Authoritative DNS server available as a service in Google Cloud
– The service is efficient, fault-tolerant and available globally
– This service can be used by the user hosted services in Google
Cloud or from third party applications
– https://developers.google.com/cloud-dns/what-is-cloud-dns
• Amazon Route 53
– Authoritative DNS server available as a service in Amazon AWS
– Provides a fault-tolerant, very fast DNS service.
– Like Google Cloud DNS, this service can be used by
services hosted in Amazon Cloud or by third-party applications
– The service is available in all continents except Africa
– http://aws.amazon.com/route53/
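Both services answer standard DNS queries. As a rough illustration of what a client sends to an authoritative server, the sketch below builds a query packet in the RFC 1035 wire format without sending it. The function name, the query ID and the example host name are arbitrary choices for this example, and nothing here is specific to Google Cloud DNS or Route 53.

```python
import struct


def build_dns_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a DNS question packet (RFC 1035 layout); qtype 1 is an A record."""
    # Header: ID, flags (0 = standard query, recursion not requested),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE followed by QCLASS (1 = IN, the Internet class).
    question = qname + struct.pack("!HH", qtype, 1)
    return header + question


packet = build_dns_query("www.example.com")
```

Sending `packet` over UDP to port 53 of an authoritative server for the zone would return the records it hosts; the hosted DNS services take on running those servers.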
