PPT - Big Data Open Source Software and Projects

Big Data Open Source Software
and Projects
ABDS in Summary VIII: Level 11A
I590 Data Science Curriculum
August 15 2014
Geoffrey Fox
[email protected]
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
Here are 17 functionalities. Technologies are
presented in this order
Message Protocols
Distributed Coordination:
4 Cross cutting at top
13 in order of layered diagram starting at
Security & Privacy:
IaaS Management from HPC to hypervisors:
File systems:
Cluster Resource Management:
Data Transport:
File management and
SQL / NoSQL / File management:
In-memory databases&caches / Object-relational mapping / Extraction Tools
Inter process communication Collectives, point-to-point, publish-subscribe
Basic Programming model and runtime, SPMD, Streaming, MapReduce, MPI:
High level Programming:
Application and Analytics:
• The Integrated Rule-Oriented Data System (iRODS)
is an BSD Open source http://irods.org/ data
management software in use at research organizations and government
agencies worldwide.
• iRODS is a production-level distribution aimed at deployment in mission
critical environments.
• It functions independently of storage resources and abstracts data
control away from storage devices and device location allowing users to
take control of their data.
• iRODS abstracts data services from data storage to facilitate executing
services across heterogeneous, distributed storage systems.
• iRODS empowers data stewards by partitioning policies, rules, and
services they develop from repository management.
• iRODS executes your data policies on your schedule.
• iRODS virtualizes data policy by separating data management policy
enforcement from repository management.
Scientific Data/File Formats I
• NetCDF http://en.wikipedia.org/wiki/NetCDF (Network Common
Data Form) is a set of software libraries and self-describing, machineindependent data formats that support the creation, access, and
sharing of array-oriented scientific data. The project homepage is
hosted by the University Corporation for Atmospheric Research
(UCAR). This work started in 1989 and was based on CDF
• Common Data Format (CDF)
http://en.wikipedia.org/wiki/Common_Data_Format is a library and
toolkit that was developed by the National Space Science Data Center
(NSSDC) at NASA starting in 1985. The software is an interface for the
storage and manipulation of multi-dimensional data sets.
Scientific Data/File Formats II
Hierarchical Data Format (HDF) http://en.wikipedia.org/wiki/Hierarchical_Data_Format is a set of
file formats (HDF4, HDF5) designed to store and organize large amountsof numerical data.
Originally developed at the National Center for Supercomputing Applications, it is supported by
the non-profit HDF Group, whose mission is to ensure continued development of HDF5
technologies, and the continued accessibility of data stored in HDF.
In keeping with this goal, the HDF format, libraries and associated tools are available under a
liberal, BSD-like license for general use. HDF is supported by many commercial and noncommercial software platforms, including Java, MATLAB/Scilab, Octave, IDL, Python, and R. The
freely available HDF distribution consists of the library, command-line utilities, test suite source,
Java interface, and the Java-based HDF Viewer (HDFView).
The current version, HDF5, differs significantly in design and API from the major legacy version
HDF5 simplifies the file structure to include only two major types of object:
Datasets, which are multidimensional arrays of a homogeneous type
Groups, which are container structures which can hold datasets and other groups
This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file are
even accessed using the POSIX-like syntax /path/to/resource. Metadata is stored in the form of
user-defined, named attributes attached to groups and datasets. More complex storage APIs
representing images and tables can then be built up using datasets, groups and attributes.
In addition to these advances in the file format, HDF5 includes an improved type system, and
dataspace objects which represent selections over dataset regions. The API is also object-oriented
with respect to datasets, groups, attributes, types, dataspaces and property lists.
The latest version of NetCDF, version 4, is based on HDF5.
Scientific Data/File Formats III
OPeNDAP "Open-source Project for a Network Data Access Protocol“
http://en.wikipedia.org/wiki/OPeNDAP is a data transport architecture and protocol
widely used by earth scientists. The protocol is based on HTTP and the current
specification is OPeNDAP 2.0 draft.
– OPeNDAP includes standards for encapsulating structured data, annotating the data with attributes
and adding semantics that describe the data.
– The protocol is maintained by OPeNDAP.org, a publicly funded non-profit organization that also
provides free reference implementations of OPeNDAP servers and clients.
An OPeNDAP client could be an ordinary browser or specialized visualization/analysis
An OPeNDAP client sends requests to an OPeNDAP server, and receives various types of
documents or binary data as a response. One such document is called a DDS (received
when a DDS request is sent), that describes the structure of a data set. A data set, seen
from the server side, may be a file, a collection of files or a database. Another
document type that may be received is DAS, which gives attribute values on the fields
described in the DDS. Binary data is received when the client sends a DODS request.
An OPeNDAP server can serve an arbitrarily large collection of data. Data on the server
is often in HDF or NetCDF format, but can be in any format including a user-defined
format. Compared to ordinary file transfer protocols (e.g. FTP), a major advantage using
OPeNDAP is the ability to retrieve subsets of files, and also the ability to aggregate data
from several files in one transfer operation.
OPeNDAP is widely used by governmental agencies such as NASA and NOAA to serve
satellite, weather and other observed earth science data.
Scientific Data/File Formats IV
Flexible Image Transport System http://en.wikipedia.org/wiki/FITS
is an open astronomy standard defining a digital file format useful for
storage, transmission and processing of scientific and other images.
– FITS is the most commonly used digital file format in astronomy.
– Unlike many image formats, FITS is designed specifically for scientific data and hence includes many
provisions for describing photometric and spatial calibration information, together with image
origin metadata.
The FITS format was first standardized in 1981; it has evolved gradually since then, and
the most recent version (3.0) was standardized in 2008.
FITS was designed with an eye towards long-term archival storage, and the maxim once
FITS, always FITS represents the requirement that developments to the format must be
backwards compatible.
A major feature of the FITS format is that image metadata is stored in a humanreadable ASCII header, so that an interested user can examine the headers to
investigate a file of unknown provenance.
The information in the header is designed to calculate the byte offset of some
information in the subsequent data unit to support direct access to the data cells.
Each FITS file consists of one or more headers containing ASCII card images (80
character fixed-length strings) that carry keyword/value pairs, interleaved between data
blocks. The keyword/value pairs provide information such as size, origin, coordinates,
binary data format, free-form comments, history of the data, and anything else the
creator desires: while many keywords are reserved for FITS use, the standard allows
arbitrary use of the rest of the name-space.
RCFile Hive Data Format I
• See Hive’s RCFile (Row Column File) introduced in 2011
• http://www.infomall.org/I590ABDSSoftware/Resources/ICDEworkshop-2014_XiaodongZhang.pptx
• Distributed Row-Groups among Nodes as in row store but organize by
columns in each group
• Minimize unnecessary I/O operations
– In a row group, table is partitioned by columns
– Only read needed columns from disks
• Minimize network costs in row construction
– All columns of a row are located in same HDFS block
Comparable data loading speed to Row-Store
– Only adding a vertical-partitioning operation in the data loading procedure of
• Applying efficient data compression algorithms
– Can use compression schemes used in Column-store
RCFile Hive Data Format II
Store Block 1
Store Block 2
Store Block 3
Row Group 1-3
Row Group 4-6
DataNode 1
I/O transfers
Row Group 7-9
DataNode 2
DataNode 3
Compressed Metadata
RCFile: Combined
row-stores and
Compressed Column A
101 102 103 104 105
Compressed Column B
Unnecessary network transfers (MBytes)
201 202 203 204 205
Compressed Column C
301 302 303 304 305
Compressed Column D
401 402 403 404 405
ORC Hive Data Format
• http://docs.hortonworks.com/HDPDocuments/HDP2/HDP2.0.0.2/ds_Hive/orcfile.html
• Hive’s RCFile has been the standard format for storing Hive data since 2011.
However, RCFile has limitations because it treats each column as a binary blob
without semantics.
• Hive 0.11 adds a new file format named Optimized Row Columnar (ORC) file
that uses and retains the type information from the table definition. ORC uses
type specific readers and writers that provide light weight compression
techniques such as dictionary encoding, bit packing, delta encoding, and run
length encoding -- resulting in dramatically smaller files.
• Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on
top of the lightweight compression for even smaller files. However, storage
savings are only part of the gain. ORC supports projection, which selects
subsets of the columns for reading, so that queries reading only one column
read only the required bytes.
• Furthermore, ORC files include light weight indexes that include the minimum
and maximum values for each column in each set of 10,000 rows and the
entire file. Using pushdown filters from Hive, the file reader can skip entire sets
of rows that aren’t important for this query.
• Finally, ORC works together with query vectorization work providing a high
bandwidth reader/writer interface
• http://2013.berlinbuzzwords.de/sessions/orc-file-improving-hive-data-storage
Apache Parquet IO/Format I
• https://github.com/Parquet/apache-proposal
• In its relatively short lifetime (co-founded by Twitter and Cloudera in
July 2013), Parquet has already become the de facto standard for
efficient columnar storage of Apache Hadoop data —
– with native support in Impala, Apache Hive, Apache Pig, Apache Spark,
MapReduce, Apache Tajo, Apache Drill, Apache Crunch, and Cascading,
Scalding, Kite, Presto and Shark.
• Supports data models: Apache Avro, Thrift and Google Protocol
• Based on Google Dremel paper
• http://blog.cloudera.com/blog/2014/05/congratulations-to-parquetnow-an-apache-incubator-project/
• http://www.slideshare.net/cloudera/hadoop-summit-36479635
Apache Parquet IO/Format II
• https://github.com/apache/incubator-parquet-format
• Parquet makes the advantages of compressed, efficient columnar
data representation available to any project in the Hadoop
• Parquet is built from the ground up with complex nested data
structures in mind, and uses the record shredding and assembly
algorithm described in the Dremel paper. We believe this approach
is superior to simple flattening of nested name spaces.
• Parquet is built to support very efficient compression and
encoding schemes. Multiple projects have demonstrated the
performance impact of applying the right compression and
encoding scheme to the data. Parquet allows compression
schemes to be specified on a per-column level, and is futureproofed to allow adding more encodings as they are invented and

similar documents