By Frank Ohlhorst
Copyright 2013 by John Wiley & Sons, Inc.
CHAPTER
Like any other technology or process, there obviously are best
practices that can be applied to the problems of Big Data. In
most cases, best practices usually arise from years of testing and
measuring results, giving them a solid foundation to build on.
However, Big Data, as it is applied today, is relatively new,
short-circuiting the tried-and-true methodology used in the past to
derive best practices. Nevertheless, best practices are presenting
themselves at a fairly accelerated rate, which means that we can still
learn from the mistakes and successes of others to define what works
best and what doesn't.
The evolutionary aspect of Big Data tends to affect best practices,
so what may be best today may not necessarily be best tomorrow.
That said, there are still some core proven techniques that can be
applied to Big Data analytics and that should withstand the test of
time. With new terms, new skill sets, new products, and new providers,
the world of Big Data analytics can seem unfamiliar, but tried-and-true
data management best practices do hold up well in this still
emerging discipline.
As with any business intelligence (BI) and/or data warehouse
initiative, it is critical to have a clear understanding of an
organization's data management requirements and a well-defined strategy
before venturing too far down the Big Data analytics path. Big Data
analytics is widely hyped, and companies in all sectors are being
flooded with new data sources and ever larger amounts of information.
Yet making a big investment to attack the Big Data problem without
first figuring out how doing so can really add value to the business is
one of the most serious missteps for would-be users.
The trick is to start from a business perspective and not get too
hung up on the technology, which may entail mediating conversations
among the chief information officer (CIO), the data scientists, and
other businesspeople to identify what the business objectives are
and what value can be derived. Defining exactly what data are available
and mapping out how an organization can best leverage the
resources is a key part of that exercise.
CIOs, IT managers, and BI and data warehouse professionals need
to examine what data are being retained, aggregated, and utilized and
compare that with what data are being thrown away. It is also critical
to consider external data sources that are currently not being tapped
but that could be a compelling addition to the mix. Even if companies
aren't sure how and when they plan to jump into Big Data analytics,
there are benefits to going through this kind of an evaluation sooner
rather than later.
Beginning the process of accumulating data also makes you better
prepared for the eventual leap to Big Data, even if you don't know
what you are going to use it for at the outset. The trick is to start
accumulating the information as soon as possible. Otherwise there
may be a missed opportunity because information may fall through
the cracks, and you may not have that rich history of information to
draw on when Big Data enters the picture.
structured and unstructured data, they need to be vigilant about
homing in on the findings that are most important to their stated
business objectives.
It is critical to avoid situations in which you end up with a process
that identifies new patterns and data relationships that offer little
value to the business process. That creates a dead spot in an analytics
matrix where patterns, though new, may not be relevant to the
questions being asked.
Successful Big Data projects tend to start with very targeted goals
and focus on smaller data sets. Only then can that success be built
upon to create a true Big Data analytics methodology that starts small
and grows after the practice has served the enterprise rather well,
allowing value to be created with little upfront investment while
preparing the company for the potential windfall of information that
can be derived from analytics.
That can be accomplished by starting with small bites (i.e., taking
individual data flows and migrating those into different systems for
converged processing). Over time, those small bites will turn into big
bites, and Big Data will be born. The ability to scale will prove
important: as data collection increases, the scale of the system will
need to grow to accommodate the data.
There are many potential reasons that Big Data analytics projects fall
short of their goals and expectations, and in some cases it is better to
know what not to do rather than knowing what to do. This leads us
to the idea of identifying worst practices, so that you can avoid
making the same mistakes that others have made in the past. It is
better to learn from the errors of others than to make your own. Some
worst practices to look out for are the following:
Many organizations make the mistake of assuming that simply deploying
a data warehousing or BI system will solve critical business
problems and deliver value. However, IT as well as BI and analytics
program managers get sold on the technology hype and
forget that business value is their first priority; data analysis
technology is just a tool used to generate that value. Instead of
blindly adopting and deploying something, Big Data analytics
proponents first need to determine the business purposes that
would be served by the technology in order to establish a business
case and only then choose and implement the right analytics
tools for the job at hand. Without a solid understanding of
business requirements, the danger is that project teams will end
up creating a Big Data disk farm that really isn't worth anything
to the organization, earning the teams an unwanted spot in the
data doghouse.
BABY STEPS
By their very nature, Big Data analytics projects involve large data
sets. But that doesn't mean that all of a company's data sources, or
all of the information within a relevant data source, will need to be
analyzed. Organizations need to identify the strategic data that will
lead to valuable analytical insights. For instance, what combination
of information can pinpoint key customer-retention factors?
Or what data are required to uncover hidden patterns in stock
market transactions? Focusing on a project's business goals in the
planning stages can help an organization home in on the exact
analytics that are required, after which it can and should look
at the data needed to meet those business goals. In some cases, this
will indeed mean including everything. In other cases, though, it
means using only a subset of the Big Data on hand.
For Twitter data, there are often big disparities among dimensions.
Hashtags are typically associated with transient or irregular phenomena,
as opposed to, for instance, the massive regularity of tweets emanating
from a big country. Because of this greater degree of within-dimension
similarity, we should treat the dimensions separately. The dimensional
application of algorithms can identify situations in which hashtags and
user names, rather than locations and time zones, dominate the list of
anomalies, indicating that there is very little similarity among the items
in each of these groups.
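As a rough illustration of treating dimensions separately, the sketch below scores every key against the baseline of its own dimension only, so that a hashtag is never compared with a country. The median-based score, the threshold, and the sample counts are all assumptions made for this example; the text does not say which algorithm is applied.

```python
from statistics import median

def robust_scores(counts):
    """Score each key by its distance from the dimension's median,
    scaled by the median absolute deviation (MAD)."""
    values = list(counts.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = mad if mad > 0 else 1.0  # guard against a zero MAD
    return {key: (value - med) / scale for key, value in counts.items()}

def anomalies_by_dimension(data, threshold=5.0):
    """Score each dimension on its own and keep keys above the threshold."""
    result = {}
    for dimension, counts in data.items():
        scores = robust_scores(counts)
        result[dimension] = sorted(k for k, s in scores.items() if s > threshold)
    return result

# Hypothetical tweet counts per key for one time window (illustrative only)
sample = {
    "hashtag": {"#a": 10, "#b": 12, "#c": 11, "#d": 9, "#breaking": 400},
    "location": {"US": 5000, "BR": 4800, "JP": 5100, "UK": 4900},
}
found = anomalies_by_dimension(sample)
print(found)
```

With these made-up counts, the volatile hashtag dimension contributes an anomaly while the regular location dimension contributes none, which is the kind of imbalance the paragraph above describes.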
Given so many anomalies, making sense of them becomes a difficult
task, creating the following questions: What could have caused
the massive upsurges in the otherwise regular traffic? What domains
are involved? Are URL shorteners and Twitter live video streaming
services involved? Sorting by the magnitude of the anomaly yields a
cursory and excessively restricted view; correlations of the anomalies
often exist within and between dimensions. There can be a great deal
of synergy among algorithms, but it may take some sort of clustering
procedure to uncover them.
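One very simple clustering procedure, offered here only as a sketch, is to group anomalies that occur close together in time, regardless of which dimension they came from. The tuple layout, the five-minute gap, and the sample events are assumptions for illustration; the text does not specify a particular clustering method.

```python
def cluster_by_time(anomalies, max_gap=300):
    """Group (timestamp, dimension, key) anomalies into clusters.

    An anomaly joins the current cluster when its timestamp is within
    max_gap seconds of the previous anomaly; otherwise it starts a new one.
    """
    clusters = []
    for item in sorted(anomalies):  # sorts by timestamp first
        if clusters and item[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(item)
        else:
            clusters.append([item])
    return clusters

# Hypothetical anomalies: (seconds since start, dimension, key)
events = [
    (1000, "hashtag", "#breaking"),
    (1120, "domain", "short.example"),
    (1200, "user", "@newsdesk"),
    (9000, "hashtag", "#other"),
]
groups = cluster_by_time(events)
for group in groups:
    print([f"{dim}:{key}" for _, dim, key in group])
```

Grouping this way puts a hashtag, a domain, and a user name that spiked together into one cluster, which a list sorted purely by magnitude would not reveal.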
In the past, Big Data analytics usually involved a compromise between
performance and accuracy. This situation was caused by the fact that
technology had to deal with large data sets that often required hours
or days to analyze and run the appropriate algorithms on. Hadoop
solved some of these problems by using clustered processing, and other
technologies have been developed that have boosted performance. Yet
real-time analytics has been mostly a dream for the typical organization,
which has been constrained by budgetary limits for storage and
processing power, two elements that Big Data devours at prodigious rates.
These constraints created a situation in which if you needed
answers fast, you would be forced to look at smaller data sets, which
could lead to less accurate results. Accuracy, in contrast, often required
the opposite approach: working with larger data sets and taking more
processing time.
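The trade-off can be made concrete with a small sampling experiment. The sketch below estimates an average from a small and a large random sample of the same data and reports the standard error of each estimate. The synthetic population, the sample sizes, and the function name are assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

def estimate_mean(values, sample_size, rng):
    """Estimate the mean from a random sample.

    Returns (estimate, standard_error); the standard error shrinks
    roughly with the square root of the sample size.
    """
    sample = rng.sample(values, sample_size)
    return mean(sample), stdev(sample) / sample_size ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable
# Synthetic "large data set": 100,000 values around 100 (illustrative only)
population = [rng.gauss(100, 20) for _ in range(100_000)]

small = estimate_mean(population, 100, rng)     # fast, less precise
large = estimate_mean(population, 10_000, rng)  # slower, more precise

print(f"n=100:    estimate={small[0]:.2f}, std. error={small[1]:.2f}")
print(f"n=10000:  estimate={large[0]:.2f}, std. error={large[1]:.2f}")
```

The larger sample touches a hundred times more data to cut the uncertainty by roughly a factor of ten, which is the speed-versus-accuracy compromise in miniature.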
As technology and innovation evolve, so do the available options.
The industry is addressing the speed-versus-accuracy problem with
analyses of complex business data. The primary advantages are as
follows: