Henning Blohm, 20.09.2012 12:55
A simple Hadoop with Z2 sample¶
This sample adapts the classic Wordcount sample to the Z2 context. It shows how Hadoop can be used from within Z2 and, in particular, how to write Map/Reduce jobs in that context.
Note #1: This sample is made to run on Linux or Mac OS. Supposedly it is possible to run Hadoop on Windows, but we have not been able to adapt the sample yet.
Note #2: For your convenience, everything in this sample assumes you use Eclipse. Eclipse is not a prerequisite for running the software, but for now it makes everything much more integrated. Please have Eclipse ready and the Eclipsoid installed. See How to install Eclipsoid.
Prerequisites¶
This sample makes use of the Hadoop add-on that is based on Cloudera's CDH4 distribution of Hadoop. As client access is version dependent, so is the sample. To simplify this, a pre-configured CDH4 distribution is available from this site. Apart from its development-style configuration (i.e. no security), this is the way we prefer to install Hadoop and friends anyway: just one root installation folder, one OS user, one log folder, etc.
Install CDH4 from a preconfigured repository¶
This site provides a pre-configured, single-checkout, user-space installation of Cloudera's CDH4 Hadoop and HBase distributions. This page explains how to install it on your machine, which is really simple compared to the Hadoop installation procedures normally suggested.
Note #1: This will only work on Linux or Mac OS. A machine with 8GB of RAM should be sufficient.
Note #2: The repository also contains an Eclipse project file and has Eclipse launchers for most functions required.
Note #3: This setup is for educational purposes only. It has no security configured, and no one takes any liability for anything regarding its use.
In short, there are the following steps:
- Clone the repository
- Adapt your local environment
- Format HDFS
- Start and stop
Clone the repository¶
The pre-configured distribution is stored in the repository z2-samples.cdh4-base. We assume you install everything (including an Eclipse workspace, if you run the samples) in a folder named install.
cd install
git clone -b master http://git.z2-environment.net/z2-samples.cdh4-base
Adapt your environment¶
Before you can run anything, two customizations are needed:
Set important environment variables¶
There is a shell script env.sh that you should open and change as described. At the time of this writing it is required that you define your JAVA_HOME (please do, even if set elsewhere already) and the NOSQL_HOME, which is the absolute path to the folder that has the env.sh file. This script is called from elsewhere and having absolute paths in here is a safe way to make sure things will be found.
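As a rough aid, the two settings can be sanity-checked before going further. The sketch below is not part of the distribution: the function names is_abs_dir and check_env are made up here, and the paths in the comments are only placeholders for your own.

```shell
# Placeholder values only; adjust both paths to your machine:
#   export JAVA_HOME=/path/to/your/jdk
#   export NOSQL_HOME=/path/to/install/z2-samples.cdh4-base

# is_abs_dir PATH: succeeds if PATH is absolute and an existing directory.
is_abs_dir() {
  case "$1" in
    /*) [ -d "$1" ] ;;
    *)  return 1 ;;
  esac
}

# check_env: report which of the required settings still needs attention.
check_env() {
  status=0
  is_abs_dir "${JAVA_HOME:-}" || {
    echo "JAVA_HOME is not an absolute, existing directory" >&2; status=1; }
  is_abs_dir "${NOSQL_HOME:-}" || {
    echo "NOSQL_HOME is not an absolute, existing directory" >&2; status=1; }
  # NOSQL_HOME is supposed to be the folder that contains env.sh.
  [ -f "${NOSQL_HOME:-}/env.sh" ] || {
    echo "no env.sh found in NOSQL_HOME" >&2; status=1; }
  return $status
}
```

After sourcing env.sh, calling check_env should print nothing and succeed if both variables are usable.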
If you are a Subversion user, note the following: In order to run embedded z2 M/R jobs, env.sh looks for a z2 Home location next to the CDH4 checkout (see above), either in the folder core or in the folder z2-base.core. This is because the Subclipse plugin of Eclipse uses the project name ("core") as the checkout folder, while the command line client uses the folder name ("z2-base.core"). So please make sure you have a z2 Home in exactly one of these locations (depending on the Subversion client you use), or customize env.sh to set a suitable Z2_HOME variable (see also #959).
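The "exactly one of these locations" rule can be illustrated with a small lookup. This is only a sketch of the idea and not the logic that env.sh actually contains; the function name find_z2_home is hypothetical.

```shell
# find_z2_home PARENT: print the z2 Home found in PARENT, the folder that
# also holds the CDH4 checkout. Fails unless exactly one candidate exists.
find_z2_home() {
  parent=$1
  found=""
  count=0
  for candidate in core z2-base.core; do
    if [ -d "$parent/$candidate" ]; then
      found="$parent/$candidate"
      count=$((count + 1))
    fi
  done
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one of core or z2-base.core in $parent, found $count" >&2
    return 1
  fi
  echo "$found"
}

# Possible use, assuming NOSQL_HOME is already set:
#   Z2_HOME=$(find_z2_home "$NOSQL_HOME/..") && export Z2_HOME
```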
Enable password-less SSH¶
Currently this is still required for the start / stop scripts to work. This requirement may be dropped in the future.
If you have not created a unique key for SSH or have no idea what that is, run
ssh-keygen
(just keep hitting enter). Next copy that key over to the machine you want to log on to without password, i.e. localhost in this case (you can get ssh-copy-id from here if you don't have it):
ssh-copy-id <your user name>@localhost
If this fails because your SSH works differently, or if ssh refuses to log on without a password, please "ask the internet". Sorry.
All that matters is that in the end
ssh <your user name>@localhost
(substituting <your user name> with your actual user name of course) works without asking for a password.
In addition, it may help to run
ssh <your user name>@0.0.0.0
as well to make sure the host key for that (localhost) address has been verified.
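The two logins can also be checked without any risk of being left at a password prompt. The sketch below relies on the OpenSSH option BatchMode=yes, which makes ssh fail rather than ask interactively; a host key that has not been verified yet should make the check fail as well. The function names are made up for this example.

```shell
# can_ssh_without_password HOST: succeeds if the current user can log on
# to HOST without any interactive prompt.
can_ssh_without_password() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$(id -un)@$1" true
}

# check_ssh_hosts HOST...: report the result for each host.
check_ssh_hosts() {
  status=0
  for host in "$@"; do
    if can_ssh_without_password "$host"; then
      echo "ok: $host"
    else
      echo "FAILED: $host"
      status=1
    fi
  done
  return $status
}

# Possible use:
#   check_ssh_hosts localhost 0.0.0.0
```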
Formatting HDFS¶
The last step before you can start up is to prepare the local node to store data. This is done by running the format_dfs.sh script. Alternatively, you can use the Eclipse launcher of the same name.
This should complete without any questions or errors. Otherwise please verify your settings above.
Start and Stop¶
Depending on your sample requirements, you can start Hadoop (HDFS, Yarn, the History Server) or HBase (including all the Hadoop services) using the start_hadoop.sh script (or launcher) or the start_hbase.sh script (or launcher) respectively. Similarly you can stop everything with the stop scripts.
A short while after starting, running jps on the command line should show the following Java processes (and possibly others, of course):
DataNode
NodeManager
NameNode
SecondaryNameNode
JobHistoryServer
ResourceManager
and additionally those, if you run HBase:
HRegionServer
HQuorumPeer
HMaster
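Comparing the jps output against these lists by eye is easy to get wrong, so here is a small sketch that prints the expected names that are missing. It assumes the usual jps output of one "<pid> <name>" pair per line; the function and variable names are made up for this example.

```shell
# missing_daemons NAME...: read jps output on stdin and print every
# expected process name that does not appear in it.
missing_daemons() {
  running=$(awk '{print $2}')
  for name in "$@"; do
    # -x matches whole lines only, so NameNode does not match
    # SecondaryNameNode.
    printf '%s\n' "$running" | grep -qxF "$name" || echo "$name"
  done
}

HADOOP_DAEMONS="DataNode NodeManager NameNode SecondaryNameNode JobHistoryServer ResourceManager"
HBASE_DAEMONS="HRegionServer HQuorumPeer HMaster"

# Possible use (the word splitting of the lists is intended):
#   jps | missing_daemons $HADOOP_DAEMONS
#   jps | missing_daemons $HADOOP_DAEMONS $HBASE_DAEMONS
```

No output means that all expected processes were found.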
There are many other scripts in the distribution that you can use to start or stop single components. If you use them, however, please first run (in the shell):
. ./env.sh
(note the leading period)
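The leading period matters because it makes the shell read the script into the current session instead of running it as a child process, whose exported variables would be lost when it exits. The following sketch demonstrates the difference with a throwaway stand-in script; the file name env_demo.sh and the variable DEMO_HOME are made up and have nothing to do with the real env.sh.

```shell
# demo_sourcing: show that exports survive sourcing but not execution.
demo_sourcing() {
  demo_dir=$(mktemp -d)
  printf 'export DEMO_HOME=/some/path\n' > "$demo_dir/env_demo.sh"
  unset DEMO_HOME

  # Executed as a child process: the export is gone once the child exits.
  sh "$demo_dir/env_demo.sh"
  echo "after executing: ${DEMO_HOME:-unset}"

  # Sourced with the leading period: the export lands in this shell.
  . "$demo_dir/env_demo.sh"
  echo "after sourcing: ${DEMO_HOME:-unset}"

  rm -r "$demo_dir"
}
```

Calling demo_sourcing prints "after executing: unset" followed by "after sourcing: /some/path".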
If you ran the start script and it returned, here are some URLs you should check to verify everything is looking good:
- Try to reach the NameNode at http://localhost:50070
- Try to reach the Yarn ResourceManager at http://localhost:8088
and, if you are running HBase:
- Try to reach the HBase Master at http://localhost:60010
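The same checks can be run from the command line if curl is installed. This is a minimal sketch with made-up function names; it only tests whether each address answers without an HTTP error, not whether the service behind it is healthy.

```shell
# url_is_up URL: succeeds if the URL answers without an HTTP error.
# --fail turns HTTP status codes of 400 and above into a non-zero exit.
url_is_up() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# check_web_uis URL...: report the result for each URL.
check_web_uis() {
  status=0
  for url in "$@"; do
    if url_is_up "$url"; then
      echo "up: $url"
    else
      echo "DOWN: $url"
      status=1
    fi
  done
  return $status
}

# Possible use (add the last URL only if you run HBase):
#   check_web_uis http://localhost:50070 http://localhost:8088 http://localhost:60010
```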
Note: If you notice that you cannot restart, or that HBase does not seem to stop correctly, that is most likely exactly what happened: sometimes HBase processes do not stop. To make sure no process is left, use jps from the command line and kill any remaining processes.
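Picking the leftover process IDs out of the jps output can be sketched as follows. The function name leftover_pids is made up, the input is assumed to be the usual "<pid> <name>" lines, and nothing is killed by the function itself.

```shell
# leftover_pids NAME...: read jps output on stdin and print the PID of
# every process whose name is one of the given names.
leftover_pids() {
  names=" $* "
  while read -r pid name; do
    [ -n "$name" ] || continue
    case "$names" in
      *" $name "*) echo "$pid" ;;
    esac
  done
}

# Possible use, after the stop script has returned:
#   for pid in $(jps | leftover_pids HMaster HRegionServer HQuorumPeer); do
#     kill "$pid"
#   done
```

It is worth printing the PIDs first and checking them before running the kill loop.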