
http://giladmanor.blogspot.com/
giladmanor@yahoo.com

Data buckets

Written by Gilad Manor

Posted on Wednesday, March 3rd 2009 at JavaWorld’s Daily Brew

For me, software development is just a nice way of saying ‘bit moving’. A
good friend of mine used to describe himself as a bit reorganizer. We
rearrange invisible magnets, he would say, setting their tiny arrows of
residual currents to point this way or that. We are a bunch of “bitniks”
and we are all about data.

Application design and development has been my main source of income for the last decade or so, and it struck me as odd that there are so few terms to describe so many kinds of data.

It occurred to me that the Eskimos have their fourteen words for snow, and they say the Bedouins have nine words to describe sand. I felt so alone. I felt a need to discover my own flavors of data. It took me a while, but then, in a single perfect moment of clarity, I realized what lay before me.

The orchestration of the moment was this: in the middle of a design meeting, with yelling and shouting all around, we were discussing optimization and performance, and spirits were high. My thoughts went back to when I first learned about application design. The fact is that when developing any business application, the first step is to determine the set of business flows that describe the scope and functionality of that application within the organization it is meant to serve.

Listing these business flows by rank and cardinality is no bother at all. The simplest evaluation I could think of is frequency of use combined with the sheer number of users that would eventually use the flow.
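As a toy illustration of that yardstick (the flows, numbers and class names below are invented for the example, not taken from any real system), ranking could be as simple as multiplying frequency of use by the number of users:

```java
import java.util.Comparator;
import java.util.List;

// Toy ranking of business flows by frequency-of-use times user count.
public class FlowRanking {

    // A flow's "cardinality" here is just uses-per-day times users.
    record Flow(String name, int usesPerDay, int users) {
        long cardinality() { return (long) usesPerDay * users; }
    }

    // Returns flow names ordered from highest cardinality to lowest.
    public static List<String> rank(List<Flow> flows) {
        return flows.stream()
                .sorted(Comparator.comparingLong(Flow::cardinality).reversed())
                .map(Flow::name)
                .toList();
    }
}
```

With invented numbers for the insurance example, selling policies ranks first, customer management second, and product maintenance a distant third.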

I thought that the categorization of the data body by the same yardstick could
provide me with the flavors of data I was looking for.
Java is a wonderful language, my favorite actually, versatile and strong. In the context of this discussion (!), Java has one drawback: the Java Virtual Machine sits far, so far away from the data it acts upon.

Unlike the hieroglyphic COBOL, Java needs special machinery to access its data. In this case, the sheer number of solutions testifies to the complexity of the problem.

It’s safe to say that there is no such thing as a free lunch. Every solution ever invented to accommodate the data access issue bears with it its own costs and complications.

Careful mapping of the data orientation by category and flavor might reduce the
friction in complex systems that depend on the availability of massive bulks of data.

And here I am getting to my point: mapping the data reduces the friction in complex systems, and that mapping needs a richer vocabulary of flavors.

The topology of data within an application


I have managed to put together five distinct data flavors, but first I will describe my case-study application and define the yardstick I use to categorize data.

My example is an application that sells insurance policies. The simplified outline of such an application would have a customer base and a product list. It would also include a process for selling insurance policies, implemented by stapling products to customers.

To make it interesting, I will refer to the use of external services and fixed
configuration.

The yardstick for data categorization is determined by measuring the cardinality of each of the application’s work flows.

It is easy to see that in the hierarchy of business flows, the main workflow is selling an insurance policy (stitching the customer to the product), followed by the workflow for managing the customer base. Far behind are the workflows for creating, versioning and maintaining the product list.

Applicative data bucket


The applicative data bucket is the body of data that is manipulated by the main business flow and has the highest rate of change. I am strictly speaking of altered (modified) data only.

In my example application, this would be the data that is handled in the policy
selling work flow. The data consists of the stitching tables between the products and
the customer. The stitching tables may also describe a single shopping cart or a
single contract with the customer.
In many cases, the data that is added or modified stays within the boundaries of a single session, so there may be no reason for cache optimization.

In cases where concurrent changes to the same data are permitted, the synchronization between sessions should be handled with great care and an understanding of the business implications. It’s important to remember that deadlock issues are ten times easier to handle from the business standpoint than from the technological one.

There are several solutions for second-level caching. To name a couple, there is the Ehcache project, which I’m using in the product I’m working on, and I have also heard about the Terracotta project.

Choosing to implement the second-level cache independently is also an option; it is an easy implementation as long as the cache stays on the same virtual machine. But scaling a cache solution to a clustered environment is a different ball game, and in this particular case my policy is to reuse the effort of others, and not waste my own resources re-implementing an existing product.
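To make the "easy implementation" concrete, here is a minimal sketch of a single-JVM cache built on ConcurrentHashMap. The loader function is a hypothetical stand-in for whatever DAO or query actually fetches the data; as noted above, this approach does not extend to a clustered environment.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal in-process cache: safe within a single JVM only.
public class SimpleCache<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // hypothetical data lookup (e.g. a DAO call)

    public SimpleCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        // computeIfAbsent loads each key at most once, even under concurrency
        return store.computeIfAbsent(key, loader);
    }

    public void invalidate(K key) {
        store.remove(key); // next get() reloads from the source
    }
}
```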

Reference data bucket, first degree


The first order of reference data is data that is used as read-only by the main business flow. Yet this referenced data is a moving target, since close-by flows (flows that are ranked closely to the main flow) change it constantly.

In the example application, the customer-base management workflows rank second. The customer base is modified intensively when additional customers are introduced to the database or when existing customers change status and details.

Having two possible concurrent sessions (the main business session and the customer-database update session) both accessing the customer data requires special attention and awareness of the business implications of the concurrent modification.

In the calculation of insurance premium rates, the payment is determined according to personal parameters and the record of each individual customer, so changes need to be communicated instantly.

Therefore, synchronization of the concurrent sessions is a must. A second-level cache, or any other innovative solution that allows live cache updates between the competing sessions, is advised. However, since the messages are sent one way, some application friction may be reduced.
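The one-way message idea can be sketched as an eviction hook: the customer-management flow announces a change, and the selling flow's cache drops the stale entry so the next read goes back to the database. The class and method names here are invented for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of one-way cache invalidation between competing sessions.
public class CustomerCache {
    private final Map<Long, String> customers = new ConcurrentHashMap<>();

    public void put(long id, String record) {
        customers.put(id, record);
    }

    public String get(long id) {
        return customers.get(id); // null means "not cached, reload from DB"
    }

    // Invoked when the customer-base management flow publishes a change event.
    public void onCustomerChanged(long id) {
        customers.remove(id); // evict; the next get() sees fresh data
    }
}
```

Because the event only travels one way (from the updating flow to the cache), the main selling flow never blocks on the customer-management session.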

Reference data bucket, second degree


The second order of reference data is data that is referenced by the main business workflow as read-only, yet is changed by other business flows at a very low frequency.
In my example application, these are the business flows that manage and maintain
the product lists.

The product list maintenance is usually handled by a few individuals in the organization, and the rate of change is very low, since the products in the list go through meticulous testing and examination before “going public”.

On top of that, the products relevant to the main business flow for selling insurance policies are the subset of products that are complete and ready to be sold.

This implies that in relation to the selling work flow, the product list is static.

The caching implementation for the product list could then be very simple. A cache pocket for the product list could be refreshed by messaging or on a time basis, while remaining static as far as the main business flow is concerned.
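A time-based refresh of such a cache pocket might look like the sketch below. The supplier is a hypothetical stand-in for the real product lookup; the selling flow only ever reads the current snapshot and is never blocked by a reload.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Time-based refresh for an effectively static product list.
public class ProductListCache {
    private volatile List<String> products;           // snapshot read by the selling flow
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public ProductListCache(Supplier<List<String>> loader, long refreshSeconds) {
        products = loader.get();                      // initial load
        scheduler.scheduleAtFixedRate(
                () -> products = loader.get(),        // periodic background reload
                refreshSeconds, refreshSeconds, TimeUnit.SECONDS);
    }

    public List<String> getProducts() {
        return products;                              // never blocks on a reload
    }

    public void shutdown() {
        scheduler.shutdownNow();
    }
}
```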

System configuration
Data that is retrieved from system configuration, property files, XML data structures
or tables belongs to the data bucket that reflects changes only when the system is
rebooted.

Relating to the example application, this bucket may contain anything from connection pool sizes to i18n (internationalization) bundles.

In my view, special attention has to be given to deciding what not to cache.

External data services


External services include a wide range of functions that have only one aspect in common: the implementation of these data sources is out of scope for the application being developed, and sometimes outside the company.

My way of approaching cache optimization for external data services is to use the same ranking method as before, but to decide for which of the services caching is irrelevant and for which it would be beneficial.

For the insurance example, I would never cache services that have a narrow scope
of relevance; a service that validates bank accounts is too volatile to cache. On the
other hand I might consider caching data for age group premium rates.

My view of best practice for optimization in the case of an external data service would be to exclude the task of updating the cached data from the thread that services the business flow. Instead, I would consider maintaining an independent thread that checks for data modifications every once in a while and updates the cached data independently.
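That independent-thread idea might be sketched as follows for the age-group premium rates mentioned above. The two suppliers (a cheap version stamp and the full rate table) are hypothetical stand-ins for calls to the real external service; the business flow reads the last known snapshot and is never delayed by the remote call.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Keeps cached external data fresh from a background thread,
// refetching only when the service reports a change.
public class PremiumRateCache {
    private volatile Map<String, Double> rates;
    private volatile String lastVersion;
    private final ScheduledExecutorService refresher =
            Executors.newSingleThreadScheduledExecutor();

    public PremiumRateCache(Supplier<String> versionCheck,
                            Supplier<Map<String, Double>> fetchRates,
                            long pollSeconds) {
        lastVersion = versionCheck.get();
        rates = fetchRates.get();                     // initial load
        refresher.scheduleAtFixedRate(() -> {
            String version = versionCheck.get();      // cheap modification check
            if (!version.equals(lastVersion)) {
                rates = fetchRates.get();             // full refetch only on change
                lastVersion = version;
            }
        }, pollSeconds, pollSeconds, TimeUnit.SECONDS);
    }

    public Double rateFor(String ageGroup) {
        return rates.get(ageGroup);                   // reads the cached snapshot
    }

    public void shutdown() {
        refresher.shutdownNow();
    }
}
```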

My Eskimo vocabulary
The basic motivation for refining the resolution in data terminology is for the
optimization of cache implementations.

There is an Eskimo saying that premature optimization is the root of all evil (no it’s not, I made that up). In most cases this is absolutely true, but I would like to argue that cache optimization is fundamental to a degree that it has to be addressed in the early stages of application design.

But in any case, I have found my little Eskimo vocabulary for data in an application
and I am happy.
