Jumbune Community 1.3.0
(Optimize both Yarn & Non Yarn Hadoop clusters)

Agenda
• Big Data Trends
• What is Jumbune?
• Description of Components

Big Data Trends
• ETL of data from all possible sources into a data lake
• Multiple execution engines: MapReduce, Spark, Hama, Storm, Giraph, etc.
• Shared cluster workers (resources)
• Resource sharing/isolation frameworks: Yarn, Mesos, etc.

Hadoop-based solution life stages (as seen on the ground) – cyclic execution
[Diagram labels: Bad Logic?, Data Analyst, Business User, Monitoring Needs, Production, MapReduce Dev, Resource Utilization?, Devops, Staging Data, Bad Data?, Logic & Data Test]

Challenges in Analytical Solutions
1. No common platform across actors to detect root causes
2. Incremental imports may ingest bad data
3. Cluster resources are shared, so optimal utilization is key
4. Getting a model right in custom MapReduce on the first attempts is like hitting the bull's eye
5. Failures may stem from bad logic or from bad data

Intersecting Solution Lifecycle Stages
[Diagram labels: Solution Development; Bulk & Incremental Data Quality Test; Devops]

Jumbune
“A catalyst to accelerate realization of analytical solutions”
• Data Validation
• Flow Analyzer
• Cluster Monitor
• Job Profiler

Niche offerings
• In-depth, code-level analysis of cluster-wide flow
• Record-level data violation reports
• No deployment on workers: ultra-light agent installed on the Hadoop master only
• Ability to turn cluster monitoring on or off at will, which lessens resource load
• Customizable rack-aware monitoring
• Correlated profiling analysis of phases, throughput and resource consumption
• Works across all Hadoop distributions

Components - Recommended Environments
Dev
• Flow Debugger
• Data Validation
• MR Job Profiler
QA
• Data Validation
Stage + Perf
• MR Job Profiler
Prod
• Cluster Monitoring
• Data Validation

Supported Deployments
[Diagram labels: Jumbune; Azure, EC2; On Premise; All major distributions]

MapReduce Flow Debugger
• Verifies the flow of input records through the user's MapReduce implementation
• Drill-down visualization helps the developer quickly identify the problem
• The only tool that helps developers find MapReduce implementation faults without any extra coding

Data Validator
• Checks data for inconsistencies in the form of:
– Null checks
– Data type checks
– Regular expression checks
• Generic way of specifying validation rules
• Provides a record-level report of the anomalies found
• Currently supports HDFS as the lake file system

MR Job Profiling
• Per job, phase-wise:
– Performance for each JVM
– Data flow rate
– Resource usage
• Per job heap sites for Mapper & Reducer
• Per job CPU cycles for Mapper & Reducer

Hadoop Cluster Monitoring
• Data centre & rack-aware node view of Yarn and Non Yarn daemons
• Dynamic interval-based monitoring
• Hadoop JMX and node resource statistics
• Per-file, node-wise replica placement (which nodes hold replicas of a given file?)
• HDFS data placement view (is HDFS balanced?)

How are we building Jumbune?

Let's Collaborate
Website
• http://jumbune.org
Contribute
• http://github.com/impetus-opensource/jumbune
• http://jumbune.org/jira/JUM
Social
• Follow @jumbune
• Use #jumbune
• Jumbune Group: http://linkd.in/1mUmcYm
Forums
• Users: [email protected]
• Dev: [email protected]
• Issues: [email protected]
Downloads
• http://jumbune.org
• https://bintray.com/jumbune/downloads/jumbune

Thanks
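
Backup: the three check types named on the Data Validator slide (null, data type, regular expression) and its record-level report can be illustrated with a minimal sketch. This is not Jumbune's code or its rule format. The rule table, the function names (`not_null`, `is_int`, `matches`, `validate`), the comma-separated layout and the sample records are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical rule type: (rule name, predicate over one field value).
# Illustrative only; this is not Jumbune's configuration format.
Rule = Tuple[str, Callable[[str], bool]]


def not_null(value: str) -> bool:
    """Null check: the field must be non-empty after trimming."""
    return value.strip() != ""


def is_int(value: str) -> bool:
    """Data type check: the field must parse as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def matches(pattern: str) -> Callable[[str], bool]:
    """Regular expression check: the whole field must match the pattern."""
    compiled = re.compile(pattern)
    return lambda value: compiled.fullmatch(value) is not None


def validate(lines: List[str], rules: Dict[int, List[Rule]], sep: str = ",") -> List[dict]:
    """Return a record-level report: one entry per field that fails a rule."""
    violations = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split(sep)
        for index, field_rules in rules.items():
            value = fields[index] if index < len(fields) else ""
            for name, predicate in field_rules:
                if not predicate(value):
                    violations.append(
                        {"line": line_no, "field": index, "rule": name, "value": value}
                    )
                    break  # report only the first failed rule for this field
    return violations


# Assumed layout of a record: id (integer), name, date (YYYY-MM-DD).
RULES: Dict[int, List[Rule]] = {
    0: [("null", not_null), ("type:int", is_int)],
    1: [("null", not_null)],
    2: [("regex", matches(r"\d{4}-\d{2}-\d{2}"))],
}

SAMPLE = ["1,alice,2014-05-01", ",bob,2014-05-02", "x3,,05/03/2014"]

if __name__ == "__main__":
    for violation in validate(SAMPLE, RULES):
        print(violation)
```

Keeping the rules in a plain table keyed by field index mirrors the "generic way of specifying validation rules" bullet: adding a check means adding an entry, not changing the validation loop. On a real cluster the same per-record logic would run over files in HDFS instead of an in-memory list.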