Big Data Analytics: Turning Big Data into Big Money

By Frank Ohlhorst
Copyright 2013 by John Wiley & Sons, Inc.

CHAPTER 1

What Is Big Data?

What exactly is Big Data? At first glance, the term seems rather
vague, referring to something that is large and full of information.
That description does indeed fit the bill, yet it provides no
information on what Big Data really is.
Big Data is often described as extremely large data sets that have
grown beyond the ability to manage and analyze them with traditional
data processing tools. Searching the Web for clues reveals an almost
universal definition, shared by the majority of those promoting the
ideology of Big Data, that can be condensed into something like this:
Big Data defines a situation in which data sets have grown to such
enormous sizes that conventional information technologies can no
longer effectively handle either the size of the data set or the scale and
growth of the data set. In other words, the data set has grown so large
that it is difficult to manage and even harder to garner value out of it.
The primary difficulties are the acquisition, storage, searching, sharing,
analytics, and visualization of data.
There is much more to be said about what Big Data actually is. The
concept has evolved to include not only the size of the data set but also
the processes involved in leveraging the data. Big Data has even
become synonymous with other business concepts, such as business
intelligence, analytics, and data mining.
Paradoxically, Big Data is not that new. Although massive data sets
have been created in just the last two years, Big Data has its roots in
the scientific and medical communities, where the complex analysis of
massive amounts of data has been done for drug development, physics
modeling, and other forms of research, all of which involve large data
sets. Yet it is these very roots of the concept that have changed what
Big Data has come to be.

c01 22 October 2012; 17:52:19

THE ARRIVAL OF ANALYTICS

As analytics and research were applied to large data sets, scientists came
to the conclusion that more is better: in this case, more data, more
analysis, and more results. Researchers started to incorporate related
data sets, unstructured data, archival data, and real-time data into the
process, which in turn gave birth to what we now call Big Data.
In the business world, Big Data is all about opportunity. According
to IBM, every day we create 2.5 quintillion (2.5 × 10^18) bytes of data,
so much that 90 percent of the data in the world today has been
created in the last two years. These data come from everywhere:
sensors used to gather climate information, posts to social media sites,
digital pictures and videos posted online, transaction records of online
purchases, and cell phone GPS signals, to name just a few. That is the
catalyst for Big Data, along with the more important fact that all of
these data have intrinsic value that can be extracted using analytics,
algorithms, and other techniques.
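To put the quoted daily figure in more familiar units, the arithmetic can be sketched as follows. This is only a unit conversion of the 2.5 × 10^18 bytes attributed to IBM above; it assumes decimal (SI) units, and the variable names are illustrative.

```python
# Figure quoted in the text (attributed to IBM): bytes created per day.
BYTES_PER_DAY = 2.5e18

# Assumption: decimal (SI) units. Binary units would give slightly smaller numbers.
EXABYTE = 10**18
TERABYTE = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

exabytes_per_day = BYTES_PER_DAY / EXABYTE
terabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / TERABYTE

print(f"{exabytes_per_day:.1f} EB per day")          # 2.5 EB per day
print(f"{terabytes_per_second:.1f} TB per second")   # 28.9 TB per second
```

In other words, the quoted rate works out to roughly 29 terabytes of new data every second.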
Big Data has already proved its importance and value in several
areas. Organizations such as the National Oceanic and Atmospheric
Administration (NOAA), the National Aeronautics and Space
Administration (NASA), several pharmaceutical companies, and numerous
energy companies have amassed huge amounts of data and now
leverage Big Data technologies on a daily basis to extract value
from them.
NOAA uses Big Data approaches to aid in climate, ecosystem,
weather, and commercial research, while NASA uses Big Data for
aeronautical and other research. Pharmaceutical companies and
energy companies have leveraged Big Data for more tangible results,
such as drug testing and geophysical analysis. The New York Times has
used Big Data tools for text analysis and Web mining, while the Walt
Disney Company uses them to correlate and understand customer
behavior in all of its stores, theme parks, and Web properties.


Big Data plays another role in today's businesses: Large organizations
increasingly face the need to maintain massive amounts of
structured and unstructured data, from transaction information in
data warehouses to employee tweets, from supplier records to
regulatory filings, to comply with government regulations. That need has
been driven even more by recent court cases that have encouraged
companies to keep large quantities of documents, e-mail messages,
and other electronic communications, such as instant messaging and
Internet Protocol (IP) telephony, that may be required for e-discovery if
they face litigation.

WHERE IS THE VALUE?

Extracting value is much more easily said than done. Big Data is full of
challenges, ranging from the technical to the conceptual to the
operational, any of which can derail the ability to discover value and
leverage what Big Data is all about.
Perhaps it is best to think of Big Data in multidimensional terms, in
which four dimensions relate to the primary aspects of Big Data. These
dimensions can be defined as follows:

1. Volume. Big Data comes in one size: large. Enterprises are
awash with data, easily amassing terabytes and even petabytes
of information.
2. Variety. Big Data extends beyond structured data to include
unstructured data of all varieties: text, audio, video, click
streams, log files, and more.
3. Veracity. The massive amounts of data collected for Big Data
purposes can lead to statistical errors and misinterpretation of the
collected information. Purity of the information is critical for value.
4. Velocity. Often time sensitive, Big Data must be used as it is
streaming into the enterprise in order to maximize its value to
the business, but it must also remain available from archival
sources.

These 4Vs of Big Data lay out the path to analytics, with each
having intrinsic value in the process of discovering value.


Nevertheless, the complexity of Big Data does not end with just four
dimensions. There are other factors at work as well: the processes that
Big Data drives. These processes are a conglomeration of technologies
and analytics that are used to define the value of data sources, which
translates to actionable elements that move businesses forward.
Many of those technologies or concepts are not new but have
come to fall under the umbrella of Big Data. Best defined as analysis
categories, these technologies and concepts include the following:

Business intelligence (BI). This consists of a
broad category of applications and technologies for gathering,
storing, analyzing, and providing access to data. BI delivers
actionable information, which helps enterprise users make
better business decisions using fact-based support systems. BI
works by using an in-depth analysis of detailed business data,
provided by databases, application data, and other tangible
data sources. In some circles, BI can provide historical, current,
and predictive views of business operations.
Data mining. This is a process in which data are analyzed from
different perspectives and then turned into summary data that
are deemed useful. Data mining is normally used with data at
rest or with archival data. Data mining techniques focus on
modeling and knowledge discovery for predictive, rather than
purely descriptive, purposes, which makes them an ideal process
for uncovering new patterns from large data sets.
Statistical applications. These look at data using algorithms
based on statistical principles and normally concentrate on data
sets related to polls, censuses, and other static data sets. Statistical
applications ideally deliver sample observations that can be used
to study populated data sets for the purpose of estimating,
testing, and predictive analysis. Empirical data, such as surveys
and experimental reporting, are the primary sources for
analyzable information.
Predictive analysis. This is a subset of statistical applications in
which data sets are examined to come up with predictions,
based on trends and information gleaned from databases.
Predictive analysis tends to be big in the financial and scientific
worlds, where trending tends to drive predictions, once external
elements are added to the data set. One of the main goals of
predictive analysis is to identify the risks and opportunities for
business processes, markets, and manufacturing.
Data modeling. This is a conceptual application of analytics
in which multiple what-if scenarios can be applied via
algorithms to multiple data sets. Ideally, the modeled information
changes based on the information made available to the
algorithms, which then provide insight into the effects of the change
on the data sets. Data modeling works hand in hand with data
visualization, in which uncovering information can help with a
particular business endeavor.

The preceding analysis categories constitute only a portion of
where Big Data is headed and why it has intrinsic value to business.
That value is driven by the never-ending quest for a competitive
advantage, encouraging organizations to turn to large repositories of
corporate and external data to uncover trends, statistics, and other
actionable information to help them decide on their next move. This
has helped the concept of Big Data, along with its associated tools,
platforms, and analytics, to gain popularity with technologists and
executives alike.

MORE TO BIG DATA THAN MEETS THE EYE

The volume and overall size of the data set is only one portion of the
Big Data equation. There is a growing consensus that both
semistructured and unstructured data sources contain business-critical
information and must therefore be made accessible for both BI and
operational needs. It is also clear that the amount of relevant
unstructured business data is not only growing but will continue to
grow for the foreseeable future.
Data can be classified under several categories: structured data,
semistructured data, and unstructured data. Structured data are
normally found in traditional databases (SQL or others) where data are
organized into tables based on defined business rules. Structured data
usually prove to be the easiest type of data to work with, simply
because the data are defined and indexed, making access and filtering
easier.
Unstructured data, in contrast, normally have no BI behind them.
Unstructured data are not organized into tables and cannot be natively
used by applications or interpreted by a database. A good example of
unstructured data would be a collection of binary image files.
Semistructured data fall between unstructured and structured data.
Semistructured data do not have a formal structure like a database with
tables and relationships. However, unlike unstructured data,
semistructured data have tags or other markers to separate the elements and
provide a hierarchy of records and fields, which define the data.
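The three categories can be illustrated with a minimal sketch. It assumes an in-memory SQLite table as the structured case, a small XML document as the semistructured case, and the leading bytes of a PNG image as the unstructured case. The table, tag, and field names are hypothetical and chosen only for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows in a table with a defined schema, so filtering is a query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5)],
)
alice_total = conn.execute(
    "SELECT SUM(total) FROM orders WHERE customer = ?", ("alice",)
).fetchone()[0]

# Semistructured: no table schema, but tags mark out records and fields.
# Records may differ in shape (the second order carries an extra <note>).
xml_doc = """
<orders>
  <order id="1"><customer>alice</customer><total>20.0</total></order>
  <order id="2"><customer>bob</customer><total>35.5</total><note>gift</note></order>
</orders>
"""
root = ET.fromstring(xml_doc)
customers = [order.findtext("customer") for order in root.findall("order")]

# Unstructured: an opaque byte sequence (here, the 8-byte PNG file signature).
# Nothing in the bytes themselves names a record or a field.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

print(alice_total)       # 32.5
print(customers)         # ['alice', 'bob']
print(len(image_bytes))  # 8
```

The structured case answers a question with one query, the semistructured case needs a parser that follows the tags, and the unstructured case offers nothing to query until some other process interprets the bytes.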

DEALING WITH THE NUANCES OF BIG DATA

The handling of different types of data is converging, thanks to utilities
and applications that can process the data sets using standard XML
formats and industry-specific XML data standards (e.g., ACORD in
insurance, HL7 in health care). These XML technologies are expanding
the types of data that can be handled by Big Data analytics and
integration tools, yet the transformation capabilities of these processes are
still being strained by the complexity and volume of the data, leading
to a mismatch between the existing transformation capabilities and the
emerging needs. This is opening the door for a new type of universal
data transformation product that will allow transformations to be
defined for all classes of data (structured, semistructured, and
unstructured), without writing code, and that can be deployed to any
software application or platform architecture.
Both the definition of Big Data and the execution of the related
analytics are still in a state of flux; the tools, technologies, and
procedures continue to evolve. Yet this situation does not mean that those
who seek value from large data sets should wait. Big Data is far too
important to business processes to take a wait-and-see approach.
The real trick with Big Data is to find the best way to deal with the
varied data sources and still meet the objectives of the analytical
process. This takes a savvy approach that integrates hardware,
software, and procedures into a manageable process that delivers results
within an acceptable time frame, and it all starts with the data.


Storage is the critical element for Big Data. The data have to be
stored somewhere, readily accessible and protected. This has proved to
be an expensive challenge for many organizations, since network-based
storage, such as SANs and NAS, can be very expensive to
purchase and manage.
Storage has evolved to become one of the more pedestrian
elements in the typical data center; after all, storage technologies have
matured and have started to approach commodity status.
Nevertheless, today's enterprises are faced with evolving needs that can put a
strain on storage technologies. A case in point is the push for Big Data
analytics, a concept that brings BI capabilities to large data sets.
The Big Data analytics process demands capabilities that are
usually beyond the typical storage paradigms. Traditional storage
technologies, such as SANs, NAS, and others, cannot natively deal
with the terabytes and petabytes of unstructured information
presented by Big Data. Success with Big Data analytics demands
something more: a new way to deal with large volumes of data, a new
storage platform ideology.

AN OPEN SOURCE BRINGS FORTH TOOLS

Enter Hadoop, an open source project that offers a platform to work
with Big Data. Although Hadoop has been around for some time, more
and more businesses are just now starting to leverage its capabilities.
The Hadoop platform is designed to solve problems caused by massive
amounts of data, especially data that contain a mixture of complex
structured and unstructured data, which does not lend itself well to
being placed in tables. Hadoop works well in situations that require the
support of analytics that are deep and computationally intensive, like
clustering and targeting.
For the decision maker seeking to leverage Big Data, Hadoop
solves the most common problem associated with Big Data: storing and
accessing large amounts of data in an efficient fashion.
The intrinsic design of Hadoop allows it to run as a platform that is
able to work on a large number of machines that don't share any
memory or disks. With that in mind, it becomes easy to see how
Hadoop offers additional value: Network managers can simply buy a
whole bunch of commodity servers, slap them in a rack, and run the
Hadoop software on each one.
Hadoop also helps to remove much of the management overhead
associated with large data sets. Operationally, as an organization's data
are being loaded into a Hadoop platform, the software breaks down
the data into manageable pieces and then automatically spreads them
to different servers. The distributed nature of the data means there is
no one place to go to access the data; Hadoop keeps track of where the
data reside, and it protects the data by creating multiple copy stores.
Resiliency is enhanced, because if a server goes offline or fails, the data
can be automatically replicated from a known good copy.
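The idea of breaking data into pieces, spreading copies across servers, and repairing after a failure can be shown with a toy sketch. It is not Hadoop code and does not reproduce Hadoop's actual placement policy: the round-robin placement, the function names, the block size, and the node names are all assumptions made for illustration.

```python
def split_into_blocks(data, block_size):
    """Break the data into fixed-size pieces (the last piece may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, servers, replicas):
    """Assign each block to `replicas` distinct servers.

    Assumption: simple round-robin rotation, chosen for clarity only.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block_id: [servers[(block_id + k) % len(servers)] for k in range(replicas)]
        for block_id in range(num_blocks)
    }


def re_replicate(placement, failed, servers, replicas):
    """After one server fails, restore each block's copy count.

    Each affected block is copied from a surviving holder to a live server
    that does not already hold it.
    """
    live = [s for s in servers if s != failed]
    repaired = {}
    for block_id, holders in placement.items():
        survivors = [s for s in holders if s != failed]
        spare = [s for s in live if s not in survivors]
        while len(survivors) < replicas and spare:
            survivors.append(spare.pop(0))
        repaired[block_id] = survivors
    return repaired


data = b"x" * 1000
blocks = split_into_blocks(data, 256)
servers = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), servers, replicas=3)
repaired = re_replicate(placement, "node2", servers, replicas=3)

print(placement[0])  # ['node1', 'node2', 'node3']
print(repaired[0])   # ['node1', 'node3', 'node4']
```

The placement table plays the role of the bookkeeping described above: it records where every piece lives, so a lost copy can be rebuilt from any surviving one.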
The Hadoop paradigm goes several steps further in working
with data. Take, for example, the limitations associated with a
traditional centralized database system, which may consist of a large
disk drive connected to a server-class system and featuring multiple
processors. In that scenario, analytics is limited by the performance
of the disk and, ultimately, the number of processors that can be
brought to bear.
With a Hadoop cluster, every server in the cluster can participate in
the processing of the data by utilizing Hadoop's ability to spread the
work and the data across the cluster. In other words, an indexing job
works by sending code to each of the servers in the cluster, and each
server then operates on its own little piece of the data. The results are
then delivered back as a unified whole. With Hadoop, the process is
referred to as MapReduce, in which the code or processes are mapped
to all the servers and the results are reduced to a single set.
This process is what makes Hadoop so good at dealing with large
amounts of data: Hadoop spreads out the data and can handle complex
computational questions by harnessing all of the available cluster
processors to work in parallel.
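The map-then-reduce pattern described above can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not the Hadoop MapReduce API: each string in `chunks` stands in for the piece of data held by one server, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one server's chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each chunk would live on a different server; here they are processed in turn.
chunks = ["big data is big", "data about data", "big value"]
mapped = [map_phase(chunk) for chunk in chunks]
word_counts = reduce_phase(shuffle(mapped))

print(word_counts["big"])   # 3
print(word_counts["data"])  # 3
```

Because `map_phase` looks only at its own chunk, the calls are independent of one another, which is what lets a real cluster run them on separate servers at the same time.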

CAUTION: OBSTACLES AHEAD

Nevertheless, venturing into the world of Hadoop is not a plug-and-play
experience; there are certain prerequisites, hardware requirements, and
configuration chores that must be met to ensure success. The first step
consists of understanding and defining the analytics process. Most chief
information officers are familiar with business analytics (BA) or BI
processes and can relate to the most common process layer used: the
extract, transform, and load (ETL) layer and the critical role it plays
when building BA or BI solutions. Big Data analytics requires that
organizations choose the data to analyze, consolidate them, and then
apply aggregation methods before the data can be subjected to the ETL
process. This has to occur with large volumes of data, which can be
structured, unstructured, or from multiple sources, such as social
networks, data logs, web sites, mobile devices, and sensors.
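The order of operations in that paragraph (choose, consolidate, aggregate, then ETL) can be sketched with a small in-memory example. Everything here is an assumption made for illustration: the event records, the field names, and the plain list that stands in for a warehouse table.

```python
from collections import Counter

# Hypothetical raw events from three sources; field names are illustrative only.
web_logs = [
    {"source": "web", "user": "u1", "action": "view"},
    {"source": "web", "user": "u2", "action": "purchase"},
    {"source": "web", "user": "u1", "action": "purchase"},
]
mobile_events = [
    {"source": "mobile", "user": "u1", "action": "view"},
]
sensor_readings = [
    {"source": "sensor", "device": "d7", "temp_c": 21.5},
]

# 1. Choose: keep only the sources relevant to the question (user activity).
#    The sensor readings are deliberately left out.
chosen = [web_logs, mobile_events]

# 2. Consolidate: merge the chosen sources into one collection.
consolidated = [event for source in chosen for event in source]

# 3. Aggregate: reduce many events to one count per (user, action) pair.
aggregated = Counter((e["user"], e["action"]) for e in consolidated)

# 4. ETL: extract the aggregated rows, transform them to the target layout,
#    and load them into the destination (a list stands in for a warehouse table).
warehouse_table = []
for (user, action), count in sorted(aggregated.items()):
    warehouse_table.append(
        {"user_id": user.upper(), "action": action, "events": count}
    )

for row in warehouse_table:
    print(row)
```

Aggregating before the ETL step means the transform and load stages handle three summary rows here rather than every raw event, which is the point of doing the consolidation first when volumes are large.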
Hadoop accomplishes that by incorporating pragmatic processes
and considerations, such as a fault-tolerant clustered architecture, the
ability to move computing power closer to the data, parallel and/or
batch processing of large data sets, and an open ecosystem that
supports enterprise architecture layers from data storage to analytics
processes.
Not all enterprises require what Big Data analytics has to offer;
those that do must consider Hadoop's ability to meet the challenge.
However, Hadoop cannot accomplish everything on its own.
Enterprises will need to consider what additional Hadoop components are
needed to build a Hadoop project.
For example, a starter set of Hadoop components may consist of
the following: HDFS and HBase for data management, MapReduce
and Oozie as a processing framework, Pig and Hive as development
frameworks for developer productivity, and open source Pentaho for
BI. A pilot project does not require massive amounts of hardware. The
hardware requirements can be as simple as a pair of servers with
multiple cores, 24 or more gigabytes of RAM, and a dozen or so hard
disk drives of 2 terabytes each. This should prove sufficient to get a
pilot project off the ground.
Data managers should be forewarned that the effective
management and implementation of Hadoop requires some expertise and
experience, and if that expertise is not readily available, information
technology management should consider partnering with a service
provider that can offer full support for the Hadoop project. Such
expertise proves especially important for security; Hadoop, HDFS, and
HBase offer very little in the form of integrated security. In other
words, the data still need to be protected from compromise or theft.
All things considered, an in-house Hadoop project makes the most
sense for a pilot test of Big Data analytics capabilities. After the pilot, a
plethora of commercial and/or hosted solutions are available to those
who want to tread further into the realm of Big Data analytics.
