Sample-hbase-mail-digester » History » Version 15
Henning Blohm, 03.08.2014 12:08

h1. Sample that combines HBase with full-stack Spring and Hibernate usage

(HOLD ON FOR A SECOND - STILL WORKING ON IT)

This sample is an application that loads large mbox archive files into HBase and extracts e-mail addresses using a map-reduce job. The extracted e-mail addresses are then written to a relational database and offered for editing.

As a full-stack sample, it shows how to design a multi-module application with a service tier that can be used seamlessly from a Web app as well as from an application-level map-reduce job.

*Note*: This sample still uses v2.2 of z2, so it is crucial to specify the correct versions below.
*Note*: Due to HBase, you will need to run this on Linux or Mac OS.

Proceed to [[Sample-hbase-mail-digester#How-this-works|how this works]] if you only want to learn about the sample without running it.

h2. Install

Here is the quick guide to getting things up and running. It closely follows [[How_to_run_a_sample]] and [[Install_prepacked_CDH4]].

h3. Checkout

Create an installation folder and check out the z2 core, the HBase distribution, and the sample application:

<pre><code class="bash">
git clone -b v2.2 http://git.z2-environment.net/z2-base.core
git clone -b v2.2 http://git.z2-environment.net/z2-samples.cdh4-base
git clone -b master http://git.z2-environment.net/z2-samples.hbase-mail-digester
</code></pre>

(Note: Do not use your shared git folder, if you have one, as z2 may later inspect the projects neighboring these checkouts.)

h3. Prepare

HBase needs some minimal configuration. First, follow [[Install_prepacked_CDH4]] to configure your HBase checkout. A few of the steps described there have to be performed only once, but they cannot be skipped.

Assuming HBase has started and all processes show up as described there, one last thing has to be running before starting the actual application:

{{include(How to run Java db)}}

h2. Start

Now that all databases are up, we can start the application by running (as always):

<pre><code class="bash">
# on Linux / Mac OS:
cd z2-base.core/run/bin
./gui.sh

# on Windows:
cd z2-base.core\run\bin
gui.bat
</code></pre>

At first startup this will download a significant number of dependencies (Spring, Vaadin, etc.), so go and get yourself some coffee.

When started, go to http://localhost:8080/digester-admin. You should see this:

!start.png!

h2. Using the application

To feed data into the application, download a mail archive in mbox format, e.g. from http://tomcat.apache.org/mail/. Upload the file to the digester application as outlined on the first tab. The mails from the archive are imported into the HBase database.
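
For orientation, the import can be pictured as splitting the archive into individual messages before storing them. The following is a minimal sketch and not the sample's actual import code. It relies only on the mbox convention that each message starts with a separator line beginning with "From ". The class MboxSplitter and its method are hypothetical names chosen for illustration.

<pre><code class="java">
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the sample's code base.
class MboxSplitter {

    // Splits the lines of an mbox archive into raw messages.
    // A line starting with "From " (note the space, no colon) separates messages.
    // Body lines that would collide with the separator are expected to be
    // escaped by the archive writer (commonly as ">From ").
    static List<String> split(Iterable<String> lines) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (line.startsWith("From ")) {
                // separator: finish the previous message and start a new one
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
            // lines before the first separator are ignored
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
</code></pre>

Because the method takes an Iterable of lines, it could be fed from a line-by-line reader, so a large archive would not have to be held in memory as one string. Each resulting message would then be written to HBase by the application's own storage code.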

Once the import has finished, run the analysis map-reduce job by clicking on "Start analysis job". This job identifies mail senders and writes data such as e-mail address, full name, and number of mails to the relational database, which is powered by Derby.
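
The analysis described above can be sketched in plain Java, without Hadoop, as a "map" step that extracts the sender from a From header and a "reduce" step that counts mails per address. This is an illustration under assumptions, not the job's actual code: the class SenderCount, its fields, and the regular expression are hypothetical, and real From headers come in more variants than this pattern covers.

<pre><code class="java">
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical result type for illustration; not the sample's entity class.
class SenderCount {
    // Matches header values of the form: Full Name <user@host>
    // (the name may be enclosed in double quotes or be absent)
    private static final Pattern NAME_ADDR =
        Pattern.compile("^\\s*\"?([^\"<]*?)\"?\\s*<([^>]+)>\\s*$");

    final String address;
    final String fullName;
    int mails;

    SenderCount(String address, String fullName) {
        this.address = address;
        this.fullName = fullName;
    }

    // "Map" step: returns { lower-cased address, full name } for one From header value.
    static String[] parseSender(String fromHeader) {
        Matcher m = NAME_ADDR.matcher(fromHeader);
        if (m.matches()) {
            return new String[] { m.group(2).trim().toLowerCase(Locale.ROOT), m.group(1).trim() };
        }
        // fallback: treat the whole value as a bare address without a name
        return new String[] { fromHeader.trim().toLowerCase(Locale.ROOT), "" };
    }

    // "Reduce" step: counts mails per address, keeping the first name seen.
    static Map<String, SenderCount> count(Iterable<String> fromHeaders) {
        Map<String, SenderCount> result = new LinkedHashMap<>();
        for (String header : fromHeaders) {
            String[] sender = parseSender(header);
            SenderCount c = result.get(sender[0]);
            if (c == null) {
                c = new SenderCount(sender[0], sender[1]);
                result.put(sender[0], c);
            }
            c.mails++;
        }
        return result;
    }
}
</code></pre>

In the sample, aggregated records of this kind are what ends up in the relational database. The sketch stops at the in-memory map and leaves persistence aside.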

When there is only little data in HBase, progress reporting is rather coarse-grained. After a while you should see:

!job.png!

The work should complete shortly after. When it is done, check the extracted sender data:

!counts.png!

h2. How this works

Let's go module by module.

*com.zfabrik.samples.digester.mail.domain*

This module defines data types and storage access for e-mail messages stored in HBase. The interface "EmailRepository":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.mail.domain/java/src.api/com/zfabrik/samples/digester/mail/EMailRepository.java is the main storage access interface. Its implementation can be found at "EmailRepositoryImpl":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.mail.domain/java/src.impl/com/zfabrik/samples/impl/digester/mail/EmailRepositoryImpl.java.

The Spring bean "HBaseConnector":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.mail.domain/java/src.impl/com/zfabrik/samples/impl/digester/mail/HBaseConnector.java, used by the e-mail repository, manages HBase connections.
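
One common way for such a bean to manage a connection is to create it lazily on first use, hand the same instance to every caller, and close it on shutdown. The sketch below illustrates that general pattern only. It is not the code of HBaseConnector, it does not use the HBase API, and the class LazyConnector with its factory parameter is a hypothetical name introduced here.

<pre><code class="java">
import java.util.concurrent.Callable;

// Hypothetical holder for illustration; the actual HBaseConnector may work differently.
class LazyConnector<C extends AutoCloseable> implements AutoCloseable {
    private final Callable<C> factory; // how to open a connection (an assumption of this sketch)
    private C connection;              // opened on first use, then shared

    LazyConnector(Callable<C> factory) {
        this.factory = factory;
    }

    // Returns the shared connection, opening it on the first call.
    synchronized C get() throws Exception {
        if (connection == null) {
            connection = factory.call();
        }
        return connection;
    }

    // Closes the connection if one was opened; a later get() opens a new one.
    @Override
    public synchronized void close() throws Exception {
        if (connection != null) {
            try {
                connection.close();
            } finally {
                connection = null;
            }
        }
    }
}
</code></pre>

In a Spring context, a holder like this would typically be declared as a singleton bean with its close method registered as the destroy method, so that the repository can simply ask it for the connection.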