Sample-hbase-mail-digester » History » Version 24

Henning Blohm, 23.09.2015 22:11

h1. Sample that combines HBase with full-stack Spring and Hibernate usage

This sample consists of an application that loads large Mbox archive files into HBase and extracts email addresses using a map-reduce job. Extracted email addresses are then written to a relational database and offered for editing.

As a full-stack sample, it shows how to design a multi-module application with a service tier that can be used seamlessly from a Web app as well as from an application-level map-reduce job.

*Note*: This sample still uses v2.2 of z2, so it is crucial to check out exactly the versions specified below.
*Note*: Due to HBase, you will need to run this on Linux or Mac OS.

If you do not want to run the sample but only learn about it, proceed to [[Sample-hbase-mail-digester#How-this-works|how this works]].

Note the 

{{include(Java Version Requirements)}}

h2. Install

Here is the quick guide to getting things up and running. This closely follows [[How_to_run_a_sample]] and [[Install_prepacked_CDH4]].

h3. Checkout

Create an installation folder and check out the z2 core and the HBase distribution, as well as the sample application.

<pre><code class="bash">
git clone -b v2.2 http://git.z2-environment.net/z2-base.core
git clone -b v2.2 http://git.z2-environment.net/z2-samples.cdh4-base
git clone -b master http://git.z2-environment.net/z2-samples.hbase-mail-digester
</code></pre>

(Note: Do not use your shared git folder, if you have one, as z2 may later inspect the neighborhood of these projects.)

h3. Prepare

We need to apply some minimal configuration for HBase. First, follow [[Install_prepacked_CDH4]] to configure your HBase checkout. A few of these steps need to be taken only once, but they cannot be skipped.

Assuming HBase has started and all processes show as described, there is one last thing to get running before starting the actual application:

{{include(How_to_run_Java_DB__Derby)}}

h2. Start

Now that all databases are up, we can start the application simply by running (as always):

<pre><code class="bash">
# on Linux / Mac OS:
cd z2-base.core/run/bin
./gui.sh

# on Windows:
cd z2-base.core\run\bin
gui.bat
</code></pre>

On first startup, this will download a significant number of dependencies (Spring, Vaadin, etc.), so go and get yourself some coffee.

When started, go to http://localhost:8080/digester-admin. You should see this:

!start.png!

h2. Using the application

To feed data into the application, download a mail archive in mbox format, e.g. from http://tomcat.apache.org/mail/. Upload the file to the digester application as outlined on the first tab. Mails from mail archives are imported into the HBase database.

Once the import has finished, run the analysis map-reduce job by clicking on "Start analysis job". The purpose of this job is to identify mail senders and store some data about them, such as email address, full name, and number of mails, in the relational database powered by Derby.

When there is only a little data in HBase, progress reporting is rather coarse-grained. After a while you should see:

!job.png!

Work should be completed shortly after. When done, check the extracted sender data:

!counts.png!

h2. How this works

Let's go module by module in "z2-samples-hbase-mail-digester":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester

h4. com.zfabrik.samples.digester.mail.domain

This module defines data types and storage access for e-mail messages stored in HBase. The interface "EmailRepository":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.mail.domain/java/src.api/com/zfabrik/samples/digester/mail/EMailRepository.java is the main storage access interface. Its implementation can be found at "EmailRepositoryImpl":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.mail.domain/java/src.impl/com/zfabrik/samples/impl/digester/mail/EmailRepositoryImpl.java.

The Spring Bean "HBaseConnector":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.mail.domain/java/src.impl/com/zfabrik/samples/impl/digester/mail/HBaseConnector.java, used by the e-mail repository, manages HBase connections.

h4. com.zfabrik.samples.digester.mail.admin

This module defines the _Mails_ tab that is loaded via an extension point (see [[Vaadin add-on]] for more details on what that is) from the main tab (defined in *com.zfabrik.samples.digester.admin*, see below). It also defines the _Mail file import_ activity (another extension point). The Mails tab is defined via the component "com.zfabrik.samples.digester.mail.admin/mailListView":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.mail.admin/mailListView.properties and the Mail file import activity via the component "com.zfabrik.samples.digester.mail.admin/actionsView":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.mail.admin/actionsView.properties.

As this module interacts mainly with the e-mail repository, it imports it from "com.zfabrik.samples.digester.mail.domain/emailRepository":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.mail.domain/emailRepository.properties in its "application context":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.mail.admin/java/src.impl/META-INF/applicationContext.xml.

h4. com.zfabrik.samples.digester.business.domain

This module defines the domain types and repositories for all relational data of the application. More specifically, it defines the JPA entities for contacts and a JPA/Hibernate-implemented repository for those. Its structure closely resembles that of, for example, [[sample-spring-hibernate]].

h4. com.zfabrik.samples.digester.contacts.admin

Similar to *com.zfabrik.samples.digester.mail.admin*, this module defines a main-level tab _Contacts_ and an activity _Mail analysis_ in the components "com.zfabrik.samples.digester.contacts.admin/maintenanceView":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.contacts.admin/maintenanceView.properties and "com.zfabrik.samples.digester.contacts.admin/jobControlView":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.contacts.admin/jobControlView.properties respectively. The latter makes use of the "ContactsJob":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.contacts.jobs/java/src.api/com/zfabrik/samples/digester/contacts/jobs/ContactsJobs.java service, which allows you to kick off an analysis map-reduce job and to retrieve job status information.

h4. com.zfabrik.samples.digester.contacts.jobs

This module defines the "ContactsJob":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.contacts.jobs/java/src.api/com/zfabrik/samples/digester/contacts/jobs/ContactsJobs.java service and the actual map-reduce job "AggregatorJob":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/show/com.zfabrik.samples.digester.contacts.jobs/java/src.impl/com/zfabrik/samples/impl/digester/contacts/jobs.

The ContactsJob "implementation":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.contacts.jobs/java/src.impl/com/zfabrik/samples/impl/digester/contacts/jobs/ContactsJobsImpl.java interacts directly with Hadoop and the supporting functions of [[Hadoop Addon]] to submit the map-reduce job and, similarly, to check its status. Note the (unfortunately necessary) careful class loader hygiene implemented by using "ThreadUtil":http://www.z2-environment.net/javadoc/com.zfabrik.core.api!2Fjava/api/com/zfabrik/util/threading/ThreadUtil.html . See "Protecting Java in a Modular World – Context Classloaders":http://www.z2-environment.net/blog/2012/07/for-techies-protecting-java-in-a-modular-world-context-classloaders/ to learn why this is necessary when using libraries that make overly clever use of class loaders, such as Hadoop and HBase.

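The class loader hygiene mentioned above usually boils down to a set-and-restore pattern around the call into the third-party library. The following is only a minimal sketch in plain Java: the class @ContextClassLoaderSketch@ and its method @callWith@ are hypothetical names chosen for illustration and are not the @ThreadUtil@ API. Please consult the linked Javadoc for the actual helper.

<pre><code class="java">
import java.util.function.Supplier;

/** Hypothetical helper for illustration only; NOT the z2 ThreadUtil API. */
final class ContextClassLoaderSketch {

    private ContextClassLoaderSketch() {}

    /**
     * Runs the task with the given thread context class loader and
     * restores the previous loader afterwards, even if the task fails.
     */
    static <T> T callWith(ClassLoader loader, Supplier<T> task) {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);
        try {
            return task.get();
        } finally {
            // always undo the change so that callers are not affected
            thread.setContextClassLoader(previous);
        }
    }
}
</code></pre>

Restoring the loader in a @finally@ block is the important part: the calling module never observes a context class loader it did not set itself.
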
For the principles, please check out [[how to hadoop]]. The mapper task "AggregatorMapper":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.contacts.jobs/java/src.impl/com/zfabrik/samples/impl/digester/contacts/jobs/AggregatorMapper.java is straightforward: for every email message with sender @s@ encountered, it emits @(s.address, 1)@.

The combiner "AggregatorCombiner":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.contacts.jobs/java/src.impl/com/zfabrik/samples/impl/digester/contacts/jobs/AggregatorCombiner.java simply sums up the counts and implements

<pre>
(s.address, o=(o_1, o_2, ...)) --> (s.address, sum(o_i,i))
</pre>

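To make the map and combine steps concrete, here is a small sketch in plain Java that mimics them on an in-memory list of sender addresses. It deliberately does not use the Hadoop API, and the class @SenderCountSketch@ with its methods @map@ and @combine@ is hypothetical; the real logic lives in AggregatorMapper and AggregatorCombiner.

<pre><code class="java">
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration only; does not use the Hadoop API. */
final class SenderCountSketch {

    private SenderCountSketch() {}

    /** Map step: emits one (address, 1) pair per message sender. */
    static List<Map.Entry<String, Integer>> map(List<String> senderAddresses) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String address : senderAddresses) {
            pairs.add(new SimpleImmutableEntry<>(address, 1));
        }
        return pairs;
    }

    /** Combine step: (address, (o_1, o_2, ...)) --> (address, sum of o_i). */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }
}
</code></pre>

For example, the senders @a@, @b@, @a@ yield three pairs after the map step and the sums @a=2@, @b=1@ after the combine step.
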
The reducer "AggregatorReducer":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.contacts.jobs/java/src.impl/com/zfabrik/samples/impl/digester/contacts/jobs/AggregatorReducer.java is at the heart of what this sample intends to show.

Receiving a tuple @(s.address, o=(o_1, o_2, ...))@, it computes @(s.address, sum(o_i,i))=:(a,t)@ and then updates the contact's mail count by looking up the contact entity identified by the email address @a@ and setting its mail count to @t@.

To do so, it simply uses the ContactsRepository, which is Spring-injected via exactly the same module re-use as in the admin user interface modules. There is no difference here, although this code is running on a Hadoop data node.

Specifically, in AggregatorReducer we use Spring AspectJ configuration:

<pre><code class="java">
@Configurable
public class AggregatorReducer extends Reducer<Text, IntWritable, Text, Writable> {

	private int count;

	@Autowired
	private ContactsRepository contacts;
	// ...
</code></pre>

to inject a service that is referenced in the module's application context as

<pre><code class="xml">
<!-- import contacts repository -->
<bean id="contactsRepository" class="com.zfabrik.springframework.ComponentFactoryBean">
	<property name="componentName" value="com.zfabrik.samples.digester.business.domain/contactsRepository" />
	<property name="className" value="com.zfabrik.samples.digester.business.domain.ContactsRepository" />
</bean>
</code></pre>

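The reduce step described above (sum the counts to @t@, then set the mail count of the contact with address @a@ to @t@) can be sketched in plain Java as follows. All names here are hypothetical stand-ins: @Contact@ and @InMemoryContacts@ replace the JPA entity and the injected ContactsRepository, and creating a missing contact on the fly is an assumption of this sketch, not something the sample is confirmed to do.

<pre><code class="java">
import java.util.HashMap;
import java.util.Map;

/** Plain-Java illustration only; uses neither Hadoop nor Spring. */
final class ReduceStepSketch {

    private ReduceStepSketch() {}

    /** Hypothetical stand-in for the contact entity. */
    static final class Contact {
        final String email;
        int mailCount;

        Contact(String email) {
            this.email = email;
        }
    }

    /** Hypothetical in-memory stand-in for the injected repository. */
    static final class InMemoryContacts {
        private final Map<String, Contact> byEmail = new HashMap<>();

        // Assumption: unknown addresses get a new contact.
        Contact findOrCreate(String email) {
            return byEmail.computeIfAbsent(email, Contact::new);
        }
    }

    /** Reduce step: computes t as the sum of all counts and stores it on the contact. */
    static int reduce(String address, Iterable<Integer> counts, InMemoryContacts contacts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        contacts.findOrCreate(address).mailCount = total;
        return total;
    }
}
</code></pre>

In the actual reducer the repository call would go through the Spring-injected service, which is what lets the same service tier be reused on the Hadoop node.
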
h4. com.zfabrik.samples.digester.admin

This final module is the visual glue of the whole application. It defines the skeleton Vaadin application that loads all extension point implementations. This happens in "AdminWindowWorkArea":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.admin/java/src.impl/com/zfabrik/samples/impl/digester/web/AdminWindowWorkArea.java

<pre><code class="java">
// ...
// add all extensions
Collection<Component> exts = ExtensionComponentsUtil.getComponentyByExtensionForApplication(
  AdminApplication.current(),
  EXTENSION_POINT
);

for (Component c : exts) {
  addTab(c);
}
// ...
</code></pre>

for the top-level tabs and very similarly in "DigesterActions":https://redmine.z2-environment.net/projects/z2-samples/repository/z2-samples-hbase-mail-digester/revisions/master/entry/com.zfabrik.samples.digester.admin/java/src.impl/com/zfabrik/samples/impl/digester/web/actions/DigesterActions.java for the activities presented on the front tab:

<pre><code class="java">
// ...
// search for extensions and add them here
for (Component c : ExtensionComponentsUtil.getComponentyByExtensionForApplication(
  AdminApplication.current(),
  EXTENSION_POINT)
) {
  addComponent(c);
}
// ...
</code></pre>

The overview below summarizes module type reuse dependencies and cross-module component use (typically via application context import as above):

!modules.png!