Optimize Hadoop Solutions

Report
Community 1.3.0
(Optimize both Yarn & Non Yarn Hadoop clusters)
Agenda
• Big Data Trends
• What is Jumbune?
• Description of Components
Big Data Trends
• ETL of data from all possible sources into the Data Lake
• Multiple execution engines: MapReduce, Spark, Hama, Storm, Giraph, etc.
• Shared cluster workers (resources)
• Resource sharing/isolation frameworks: Yarn, Mesos, etc.
Hadoop based solution life stages (as on ground): cyclic execution
[Diagram labels: Data Analyst, Business User, MapReduce Dev, Devops; Logic & Data Test, Staging Data, Production; Bad Logic?, Bad Data?, Resource Utilization?, Monitoring Needs]
Challenges in Analytical Solutions
1. No common platform across actors to detect root causes
2. Incremental imports may ingest bad data
3. Cluster resources are shared and optimal utilization is key
4. Implementing models in custom MR in initial attempts is like hitting a bull’s eye
5. Bad Logic or Bad data
Intersecting solution Lifecycle Stages
[Diagram labels: Solution Development, Bulk & Incremental Data, Quality Test, Devops]
Jumbune
“A catalyst to accelerate realization of analytical solutions”
• Data Validation
• Flow Analyzer
• Cluster Monitor
• Job Profiler
Niche offerings
• In-depth code level analysis of cluster wide flow
• Record level data violation reports
• No deployment on Workers: ultra light agent installation on Hadoop master only
• Ability to turn cluster monitoring on/off at will, which lessens resource load
• Customizable rack aware monitoring
• Correlated profiling analysis of phases, throughput and resource consumption
• Ability to work across all Hadoop Distributions
Components - Recommended Environments
• Dev: Flow Debugger, Data Validation, MR Job Profiler
• QA: Data Validation
• Stage + Perf: MR Job Profiler
• Prod: Cluster Monitoring, Data Validation
Supported Deployments
• Azure, EC2
• On Premise
• All major distributions
MapReduce Flow Debugger
• Verifies the flow of input records through the user’s MapReduce implementation
• Drill-down visualization helps developers quickly identify the problem
• Only tool that helps developers find MapReduce implementation faults without any extra coding
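To make the idea of verifying record flow concrete, here is a minimal sketch in plain Python. It is not Jumbune's implementation and does not use Hadoop: the `traced` decorator, the `run_job` driver and the word-count job are all hypothetical names made up for illustration. It only shows how counting records entering and leaving each stage can expose where input is being dropped.

```python
from collections import defaultdict

def traced(stage, counters):
    """Wrap a map or reduce function so calls (in) and emitted records (out) are counted."""
    def wrap(fn):
        def inner(*args):
            counters[stage]["in"] += 1
            out = list(fn(*args))
            counters[stage]["out"] += len(out)
            return out
        return inner
    return wrap

def run_job(lines, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce run: map, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results

counters = {"map": {"in": 0, "out": 0}, "reduce": {"in": 0, "out": 0}}

@traced("map", counters)
def word_mapper(line):
    for word in line.split():
        if word.isalpha():  # non-alphabetic tokens are silently dropped here
            yield word.lower(), 1

@traced("reduce", counters)
def sum_reducer(key, values):
    yield key, sum(values)

result = run_job(["big data big", "data 42"], word_mapper, sum_reducer)
print(counters)  # compare tokens read against records emitted per stage
print(result)
```

Comparing the five input tokens with the four records the mapper emitted points at the `isalpha` filter as the place where a record disappears, which is the kind of question a flow debugger is meant to answer without the developer adding such counters by hand.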
Data Validator
• Detects inconsistencies in data through:
– Null checks
– Data type checks
– Regular expression checks
• Generic way of specifying validation rules
• Provides record level report for found anomalies
• Currently supports HDFS as the lake file system
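The three check types above can be sketched as follows. This is an illustration only, assuming comma-separated records held in memory; the `RULES` layout, the function names and the sample data are hypothetical and do not reflect Jumbune's actual rule format or its HDFS integration.

```python
import re

# Hypothetical rule set: field index -> checks to apply to that field.
RULES = {
    0: {"not_null": True, "type": int},
    1: {"not_null": True, "regex": r"^[A-Z]{2}$"},
    2: {"type": float},
}

def validate_line(line_no, line, rules, sep=","):
    """Return (line number, field index, violation kind) tuples for one record."""
    fields = line.rstrip("\n").split(sep)
    violations = []
    for idx, rule in rules.items():
        value = fields[idx] if idx < len(fields) else ""
        if value == "":
            if rule.get("not_null"):
                violations.append((line_no, idx, "null"))
            continue  # nothing further to check on an empty field
        expected = rule.get("type")
        if expected is not None:
            try:
                expected(value)  # data type check: the value must parse
            except ValueError:
                violations.append((line_no, idx, "type"))
        pattern = rule.get("regex")
        if pattern and not re.fullmatch(pattern, value):
            violations.append((line_no, idx, "regex"))
    return violations

def validate(lines, rules):
    """Build a record level report over all lines."""
    report = []
    for line_no, line in enumerate(lines, start=1):
        report.extend(validate_line(line_no, line, rules))
    return report

sample = ["1,US,9.5", ",us,abc", "x,DE,"]
report = validate(sample, RULES)
for entry in report:
    print(entry)
```

Keeping the rules in a data structure rather than in code is what makes the specification generic: the same validator runs unchanged against a different data set once a new rule table is supplied.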
MR Job Profiling
• Per Job Phase wise
– performance for each JVM
– data flow rate
– Resource usage
• Per Job Heap sites for Mapper & Reducer
• Per Job CPU cycles for Mapper & Reducer
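As a rough illustration of a phase wise data flow rate, the sketch below divides the bytes each phase handled by the time it took. The phase names, byte counts and durations are made-up sample values, and `flow_rates` is a hypothetical helper, not part of Jumbune.

```python
MIB = 1024 * 1024

# Hypothetical per-phase measurements for one job: (bytes processed, seconds taken).
phases = {
    "map": (512 * MIB, 64.0),
    "shuffle": (256 * MIB, 16.0),
    "reduce": (128 * MIB, 32.0),
}

def flow_rates(phase_stats):
    """Return the data flow rate of each phase in MiB per second."""
    return {
        name: (byte_count / MIB) / seconds
        for name, (byte_count, seconds) in phase_stats.items()
    }

rates = flow_rates(phases)
slowest = min(rates, key=rates.get)
print(rates)
print("slowest phase:", slowest)
```

Reading the rates side by side with resource usage for the same phase is what lets a profile distinguish a phase that is slow because it is starved of resources from one that is slow because of its own logic.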
Hadoop Cluster Monitoring
• Data Centre & Rack aware nodes view of Yarn and Non Yarn Daemons
• Dynamic Interval based monitoring
• Hadoop JMX, Node Resource Statistics
• Per file, node wise replica Placement (which nodes have replicas of a given file?)
• HDFS data placement view (is HDFS balanced?)
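The "is HDFS balanced?" question can be sketched as a comparison of each DataNode's utilization with the cluster average. The node names and figures below are invented, and `utilization_report` is a hypothetical helper; the 10 percentage point default mirrors the default threshold of the HDFS balancer, but how Jumbune itself decides this is not stated here.

```python
def utilization_report(nodes, threshold_pct=10.0):
    """Return cluster utilization (%) and nodes deviating from it by more than the threshold.

    nodes maps a DataNode name to (used, capacity) in the same unit.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_pct = 100.0 * total_used / total_capacity
    outliers = {}
    for name, (used, capacity) in nodes.items():
        node_pct = 100.0 * used / capacity
        if abs(node_pct - cluster_pct) > threshold_pct:
            outliers[name] = round(node_pct - cluster_pct, 1)
    return cluster_pct, outliers

# Hypothetical DataNodes: (used GB, capacity GB).
nodes = {"dn1": (80, 100), "dn2": (50, 100), "dn3": (20, 100)}
cluster_pct, outliers = utilization_report(nodes)
print(cluster_pct, outliers)
```

An empty outlier map would mean every node sits within the threshold of the cluster average; positive values mark over-utilized nodes and negative values under-utilized ones.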
How we are building Jumbune?
Let’s Collaborate 
Website
• http://jumbune.org
Contribute
• http://github.com/impetus-opensource/jumbune
• http://jumbune.org/jira/JUM
Social
• Follow @jumbune Use #jumbune
• Jumbune Group: http://linkd.in/1mUmcYm
Forums
• Users: [email protected]
• Dev: [email protected]
• Issues: [email protected]
Downloads
• http://jumbune.org
• https://bintray.com/jumbune/downloads/jumbune
Thanks