
Revision 13 (Henning Blohm, 21.09.2012 11:49) → Revision 14/16 (Henning Blohm, 21.09.2012 11:55)

h1. The Hadoop add-on 

 The Hadoop add-on contains the client parts of the Cloudera Hadoop and HBase distribution, plus some integration features that are described in [[How to Hadoop]] and the related samples. 

 It is provided via the repository "z2-addons.hadoop":http://redmine.z2-environment.net/projects/z2-addons/repository/z2-addons-hadoop. 

 Because Hadoop and HBase do not offer clear client-server compatibility guarantees across versions, use the Hadoop add-on only with a matching server version.  

 h2. Version map 

 |_. add-on version |_. Hadoop/HBase version | 
 | 2.1 | CDH 4.0.1 | 

 For experimental use only, we provide a pre-configured single-node CDH 4.0.1 that is easy to install and use, via the Git repository "z2-samples.cdh4-base":http://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-cdh4-base. The samples [[Sample-hadoop-basic]] and [[Sample-hbase-full-stack-TBD]] make use of it. See [[Install prepacked CDH4]] for how to set it up. 

 Extensions that help when working with Hadoop are implemented by the modules *com.zfabrik.hadoop* and *com.zfabrik.hbase*. 

 h2. Details on *com.zfabrik.hadoop* 

 Javadocs can be found here: "Javadocs":http://www.z2-environment.net/javadoc/com.zfabrik.hadoop!2Fjava/api/index.html 

 h3. Component type com.zfabrik.hadoop.configuration 

 Components of this type provide Hadoop or HBase connectivity configuration via a component resource file "core-site.xml". No further configuration applies. 
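
 As an illustration, a minimal "core-site.xml" for such a component might look like the following sketch. The host names and the port are assumptions for a local single-node setup and are not taken from the samples; @fs.defaultFS@ and @hbase.zookeeper.quorum@ are standard Hadoop and HBase client properties, and the second one would only be needed when connecting to HBase.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed NameNode address of a local single-node installation -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <!-- Assumed ZooKeeper quorum; only relevant for HBase connectivity -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
</configuration>
```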

 h3. Component type com.zfabrik.hadoop.job 

 A Hadoop Map/Reduce job implementation. Using this component type, jobs can be scheduled and run programmatically from within the Z2 runtime.  

 Properties: 

 |_. Name |_. Value or Description| 
 |com.zfabrik.component.type|com.zfabrik.hadoop.job| 
 |component.className|The name of a class provided by the module that implements "IMapReduceJob":http://www.z2-environment.net/javadoc/com.zfabrik.hadoop!2Fjava/api/com/zfabrik/hadoop/job/IMapReduceJob.html | 
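
 Put together, the properties of a job component could look like the following sketch. It assumes the usual Z2 convention of declaring a component in a properties file; the class name @com.acme.wordcount.WordCountJob@ is hypothetical and stands for any class of the module that implements IMapReduceJob.

```properties
# Declares the component as a Hadoop Map/Reduce job
com.zfabrik.component.type=com.zfabrik.hadoop.job
# Hypothetical implementation class provided by the module
component.className=com.acme.wordcount.WordCountJob
```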


 h2. Details on *com.zfabrik.hbase* 

 This module provides additional utilities and types on top of com.zfabrik.hadoop that simplify working with HBase. 

 Javadocs can be found here: "Javadocs":http://www.z2-environment.net/javadoc/com.zfabrik.hbase!2Fjava/api/index.html 

 This module does not provide component types. It does, however, provide a narrowing interface "IHBaseMapReduceJob":http://www.z2-environment.net/javadoc/com.zfabrik.hbase!2Fjava/api/com/zfabrik/hbase/IHBaseMapReduceJob.html that extends "IMapReduceJob":http://www.z2-environment.net/javadoc/com.zfabrik.hadoop!2Fjava/api/com/zfabrik/hadoop/job/IMapReduceJob.html for Map/Reduce jobs over HBase tables.  

 See also [[Sample-hbase-full-stack-TBD]]. 

 h2. How Map/Reduce with Z2 on Hadoop works 

 When a job is prepared for execution, Hadoop stores one or more jar libraries in HDFS. To run a map, reduce, or combine task, Hadoop downloads these libraries to the local node and runs the required task from the code they provide. 

 Running a job with Z2's Hadoop integration is no different, except that a generic job library is submitted to Hadoop instead of the actual task implementations. On the node executing a task, the generic task implementations (all "here":http://www.z2-environment.net/javadoc/com.zfabrik.hadoop!2Fjava/api/com/zfabrik/hadoop/impl/package-summary.html) start an embedded Z2 runtime in-process (see "ProcessRunner":http://www.z2-environment.net/javadoc/com.zfabrik.core.api!2Fjava/api/com/zfabrik/launch/ProcessRunner.html), look up the job component that implements "IMapReduceJob":http://www.z2-environment.net/javadoc/com.zfabrik.hadoop!2Fjava/api/com/zfabrik/hadoop/job/IMapReduceJob.html, and delegate execution, in context, to the real implementation. 

 p{margin-top:3em; margin-bottom:3em}. !z2_hadoop.png! 
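
 The delegation pattern described above can be sketched in plain Java. This is a simplified model and not the add-on's code: @JobLogic@, @EmbeddedRuntime@ and @GenericMapTask@ are hypothetical stand-ins for IMapReduceJob, the embedded Z2 runtime and the generic task implementations, and the component name is made up. The point is only that the class shipped to Hadoop knows a component name, while the real logic is resolved on the task node.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class DelegationSketch {

    /** Hypothetical stand-in for the job contract; NOT the real IMapReduceJob API. */
    interface JobLogic {
        String map(String input);
    }

    /** Hypothetical stand-in for an embedded runtime that resolves components by name. */
    static final class EmbeddedRuntime {
        private final Map<String, Supplier<JobLogic>> components = new HashMap<>();

        void register(String name, Supplier<JobLogic> factory) {
            components.put(name, factory);
        }

        JobLogic lookup(String name) {
            Supplier<JobLogic> factory = components.get(name);
            if (factory == null) {
                throw new IllegalStateException("No job component named " + name);
            }
            return factory.get();
        }
    }

    /** Generic task: the only code that would be submitted to Hadoop in this model. */
    static final class GenericMapTask {
        private final EmbeddedRuntime runtime;
        private final String jobComponentName;
        private JobLogic delegate; // resolved lazily, on the node that runs the task

        GenericMapTask(EmbeddedRuntime runtime, String jobComponentName) {
            this.runtime = runtime;
            this.jobComponentName = jobComponentName;
        }

        String map(String input) {
            if (delegate == null) {
                delegate = runtime.lookup(jobComponentName);
            }
            return delegate.map(input); // delegate to the real implementation
        }
    }

    public static void main(String[] args) {
        EmbeddedRuntime runtime = new EmbeddedRuntime();
        // Hypothetical component name and trivial job logic, for illustration only
        runtime.register("samples/wordcount", () -> s -> s.toUpperCase());
        GenericMapTask task = new GenericMapTask(runtime, "samples/wordcount");
        System.out.println(task.map("hello"));
    }
}
```

 Because the generic task carries only the component name, the job code itself never has to be packaged into the library that Hadoop distributes.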

 The one catch is that a Z2 home must be available on the node running the task, and the generic implementation must be able to find it.  

 In the samples this is achieved by having the environment variable Z2_HOME point to the installation next to the Hadoop installation. In cluster setups, a Z2 core is part of the installables next to Hadoop, HBase, and others. 
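
 To check this precondition on a node, a small stand-alone test along the following lines may help. It is only a sketch: it assumes that Z2_HOME is the sole lookup mechanism, as in the samples, and it does not reproduce how the add-on itself locates the Z2 home.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

public class Z2HomeCheck {

    /** Resolves Z2_HOME from the given environment and verifies that it is a directory. */
    static Path resolveZ2Home(Map<String, String> env) {
        String home = env.get("Z2_HOME");
        if (home == null || home.trim().isEmpty()) {
            throw new IllegalStateException("Z2_HOME is not set");
        }
        Path path = Paths.get(home.trim()).toAbsolutePath().normalize();
        if (!Files.isDirectory(path)) {
            throw new IllegalStateException("Z2_HOME is not a directory: " + path);
        }
        return path;
    }

    public static void main(String[] args) {
        try {
            System.out.println("Z2 home: " + resolveZ2Home(System.getenv()));
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```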

 Only the core is required, as job updates will be retrieved from repositories automatically. 

 A specialty of the sample setups is the use of the Dev Repo (see "Workspace development using the Dev Repository":http://www.z2-environment.eu/v21doc#Workspace%20development%20using%20the%20Dev%20Repository). Because the Dev Repo is controlled by system properties and the Hadoop integration is aware of this use case, the Hadoop client connection configuration (e.g. "here":http://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hadoop-basic/revisions/master/entry/com.zfabrik.samples.hadoop-basic.wordcount/nosql/core-site.xml) can also convey a Dev Repo scan root. This makes it possible to run M/R jobs directly from the workspace. 

 h2. How to support other Hadoop versions 

 TBD