Data Node - HAMS technologies

Report
1
HAMS Technologies
www.hams.co.in
[email protected]
[email protected]
[email protected]
HAMS Technologies
» Data size is continuously increasing due to
˃ Multiple source of data
˃ Easy and fast availability of data sources
» R is widely known and accepted language in field of statistical
analysis and prediction techniques. But performing an analysis
over Big data is always a challenging task.
» MapReduce is a mechanism to handle large volume of data.
Hadoop is one of open source MapReduce implementation
provided by Apache.
» R and Hadoop can be used together with the help of a
package called RHadoop.
2
HAMS Technologies
Before moving further, We can take a look of MapReduce architecture and Hadoop
implementation..
General flow in MapReduce architecture
1.
2.
3.
4.
Create a clustered network
Load the data into cluster using Map (mapper task)
Fetch the processing data with help of Map (mapper task)
Aggregate the result with Reducer ( Reducer task)
Local Data
Local Data
Map
Partial
Result-1
Map
Partial
Result-2
Reduce
Local Data
Map
Partial
Result-3
Aggregated
Result
3
HAMS Technologies
General attributes of in MapReduce architecture
1.
2.
3.
4.
Distributed file system (DFS)
Data locality
Data redundancy for fault tolerance
Map tasks applied to partitioned data it scheduled so that input blocks
are on same machine
5. Reducer tasks applied to process data partitioned by MAP task
Local Data
Local Data
Map
Partial
Result-1
Map
Partial
Result-2
Reduce
Local Data
Map
Partial
Result-3
Aggregated
Result
4
HAMS Technologies
Hadoop is an open source implementation of MapReduced architecture
maintained by Apache
Hadoop
Master
nodes
HDFS
MapReduce
Hadoop Distributed file system
Job trackers
name node/s
Job tracker node/s
Data node/s
Hive
(Hadoop
interactIVE)
Slave
nodes
Data Node
Data Node
Data Node
Data node/s
Data node/s
Data node/s
Tracker node/s
Tracker node/s
Tracker node/s
5
HAMS Technologies
» Hadoop-streaming allow to create and run MapReducde job as Mapper
and/or as Reducer.
» HDFS (Hadoop Distributed File System) is a clustered network used to
store data. HDFS contain the script to replicate and track the different data
blocks. HDFS write is show below. In same reverse manner we retrieve
data from HDFS.
I am having a file contains 3
blocks.. Where should I write
these?
hams.txt
Block-1
1
Block-2
Block-3
2
3
3
Okey, Write these
on data-node 1 ,2
and 3
Name Node
3
Data Node-1
Data Node-2
Data Node-3
Data Node-n
Data node/s
Data node/s
Data node/s
Data node/s
Tracker node/s
Tracker node/s
Tracker node/s
Tracker node/s
6
HAMS Technologies
» HIVE (Hadoop InteractIVE) support high level
function for handling hadoop framework like
hive.start(), hive.create().. .etc.
» It provided functions like hive.stream() to support
Hadoop-streaming
» DFS functions in R like DFS.put(), DFS.list()…
» To start working with HIVE
˃ Download it from https://hadoop/apache.org/core
˃ Configure files
> mapred-site.xml – to configure mapReduce
> Core-site.xml – to configure basic hadoop
> Hdfs-site.xml – for configuration related to HDFS
7
HAMS Technologies
» R and Hadoop : a package Rhadoop is available in
R-Forge
» In R prompt,
˃ Hadoop package can be loaded as
» Library(‘hive’);
˃ To start hadoop
» hive_start()
˃ Put the data, list the data
» DFS.put(‘source_data’, ‘/router_list’)
» DFS.list(‘/router_list’);
8
HAMS Technologies
» Hive stream can be initialized as
˃ hive_stream(mapper = <mapper_function_name>
reducer= <reducer_function_name>
input = <routers>
output = <routers_out> )
» Other important functions
˃
˃
˃
˃
DFS_put_object()
DFS_cat()
Hive_create()
Hive_get_parameter()
9
HAMS Technologies
Thank you
Kindly drop us a mail at below mention address for any
suggestion and clarification. We like to hear from you
HAMS Technologies
www.hams.co.in
[email protected]
[email protected]
[email protected]
10
HAMS Technologies

similar documents