h1. A simple Hadoop with Z2 sample

This sample is an adaptation of the classical Wordcount sample in the Z2 context. It shows how Hadoop can be used from within Z2 and, in particular, how to write Map/Reduce jobs in that context.

*Note #1:* This sample is made to be run on Linux or Mac OS. It is supposedly possible to run Hadoop on Windows, but we have not been able to adapt the sample yet. A machine with 8GB of RAM should be sufficient.
*Note #2:* For your convenience, everything in this sample assumes you use Eclipse. Eclipse is of course not a prerequisite for running the software, but it makes everything much more integrated for now. Please have Eclipse ready and the Eclipsoid plugin installed. See [[How to install Eclipsoid]].

This sample is provided by the repository "z2-samples-hadoop-basic":http://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hadoop-basic.

h2. Prerequisites

This sample makes use of the [[Hadoop add-on]], which is based on Cloudera's CDH4 distribution of Hadoop. As client access is version dependent, so is the sample. To simplify this for you, a pre-configured CDH4 distribution is available from this site. Apart from its development-style configuration (i.e. no security), this is in any case how we prefer to install Hadoop and friends: just one root installation folder, one OS user, one log folder, etc.

Please follow the procedure described here: [[Install prepacked CDH4]].

For use with this sample, it is most convenient if you clone and configure the CDH4 install next to your Eclipse workspace and the sample repository clone.

h2. Setting up the sample

From here on, the sample is run like all samples, that is, following [[How to run a sample]].

Assuming everything (including the CDH4 setup) is under *install* and your workspace is in *install/workspace*, please clone "z2-samples-hadoop-basic":http://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hadoop-basic under *install* as well, either from the command line as

<pre><code class="ruby">
cd install
git clone -b master http://git.z2-environment.net/z2-samples.hadoop-basic
</code></pre>

or from within Eclipse using the Git repositories view (but make sure the folder is right next to your z2-base.core clone).

You should have an Eclipse workspace and next to it *z2-samples.hadoop-basic*, *z2-samples.cdh4-base*, and *z2-base.core*. Import all projects into your workspace.

We assume that you followed the steps in [[Install prepacked CDH4]] and Hadoop is running (we do not need HBase in this case).

h2. Running the sample

Start Z2.

This will take a short moment. When it is up, we first want to write a file into the Hadoop file system that we are later going to split into words in order to count their occurrences.

If you want to load some file you already have at hand, use the "copyFromLocal" operation to copy it into */hadoop-wordcount/input*. E.g. if the file is called *myfile.txt*, go into the CDH4 install and run

<pre><code class="ruby">
. ./env.sh
hadoop fs -copyFromLocal myfile.txt /hadoop-wordcount/input
</code></pre>

(the env.sh call is only required once per shell session). You can verify the upload with *hadoop fs -ls /hadoop-wordcount/input*.

Alternatively there is a z2Unit test (see [[How to z2Unit]]) that you can invoke to generate some input. As that is interesting in its own right, here is how it is done.

You should already have all the projects, in particular *com.zfabrik.samples.hadoop-basic.wordcount*, in your workspace. Otherwise import them from the repository you cloned previously.

Use Eclipsoid to resolve all required compile dependencies (Alt-R or click on the right Z in the toolbar), if you have not done so already.

Look for the type *WriteWordsFile* (Ctrl+Shift+T).

The method *writeWordsFile* will write a file of 100 million words in lines containing between 1 and 9 words each (but you can change that of course). Invoke it by right-clicking and choosing "Run as / JUnit test". If you want to play around with the settings, simply change them, synchronize Z2 (Alt-Y or click on the left Z in the toolbar), and rerun.

The interesting piece about this code is how it connects to HDFS:

<pre><code class="java">
...
@Test
public void writeWordsFile() throws Exception {
    // obtain the Hadoop client Configuration from the Z2 component and connect to HDFS
    FileSystem fs = FileSystem.get(IComponentsLookup.INSTANCE.lookup(WordCountMRJob.CONFIG, Configuration.class));
    // start from a clean slate: remove any previous input and recreate the parent folder
    fs.delete(WordCountMRJob.INPUT_PATH, true);
    fs.mkdirs(WordCountMRJob.INPUT_PATH.getParent());
...
</code></pre>

Here, the actual connection configuration, one of Hadoop's XML configuration files, is looked up from a Z2 component called "com.zfabrik.samples.hadoop-basic.wordcount/nosql":http://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hadoop-basic/revisions/master/show/com.zfabrik.samples.hadoop-basic.wordcount/nosql. The component type for that is defined by the Hadoop integration module "com.zfabrik.hadoop" of the [[Hadoop add-on]].
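
Spelled out with the literal component name, the same lookup reads as in the following sketch (assuming, as the sample suggests, that the constant *WordCountMRJob.CONFIG* holds exactly this component name):

<pre><code class="java">
// Sketch of the lookup pattern with the component name written out.
// Assumption: WordCountMRJob.CONFIG resolves to this very name.
Configuration cfg = IComponentsLookup.INSTANCE.lookup(
    "com.zfabrik.samples.hadoop-basic.wordcount/nosql",
    Configuration.class
);
FileSystem fs = FileSystem.get(cfg);
</code></pre>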
The purpose of this is to separate the client configuration information from the implementation that uses it. We will see another application of that below.

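As for the part of the test that actually writes the words (elided in the listing above), it might look roughly like the following sketch. This is an illustration only, not the repository's actual code; the vocabulary and loop structure are made up, and *java.io* and *java.util.Random* imports are assumed:

<pre><code class="java">
// Hypothetical generation loop (illustration only; see the repository for the real code).
// Assumes the FileSystem "fs" obtained above and the sample's input path.
try (PrintWriter out = new PrintWriter(new OutputStreamWriter(fs.create(WordCountMRJob.INPUT_PATH), "UTF-8"))) {
    String[] words = { "alpha", "beta", "gamma", "delta" }; // made-up vocabulary
    Random rnd = new Random();
    for (long written = 0; written < 100000000L; ) { // 100 million words in total
        int perLine = 1 + rnd.nextInt(9); // between 1 and 9 words per line
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < perLine; i++, written++) {
            if (i > 0) {
                line.append(' ');
            }
            line.append(words[rnd.nextInt(words.length)]);
        }
        out.println(line);
    }
}
</code></pre>
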
So now we assume you have the input file uploaded or generated in HDFS.