
Big Data Analytics: Turning Big Data into Big Money

By Frank Ohlhorst
Copyright 2013 by John Wiley & Sons, Inc.

CHAPTER

The Nuts and Bolts of Big Data

Assembling a Big Data solution is sort of like putting together an erector set. There are various pieces and elements that must be put together in the proper fashion to make sure everything works adequately, and there are almost endless combinations of configurations that can be made with the components at hand.
With Big Data, the components include platform pieces, servers, virtualization solutions, storage arrays, applications, sensors, and routing equipment. The right pieces must be picked and integrated in a fashion that offers the best performance, high efficiency, affordability, ease of management and use, and scalability.

THE STORAGE DILEMMA

Big Data consists of data sets that are too large to be acquired, handled, analyzed, or stored in an appropriate time frame using the traditional infrastructures. Big is a term relative to the size of the organization and, more important, to the scope of the IT infrastructure that's in place. The scale of Big Data directly affects the storage platform that must be put in place, and those deploying storage solutions have to understand that Big Data uses storage resources differently than the typical enterprise application does.


c06 22 October 2012; 17:58:3



These factors can make provisioning storage a complex endeavor, especially when one considers that Big Data also includes analysis; this is driven by the expectation that there will be value in all of the information a business is accumulating and a way to draw that value out.
Originally driven by the concept that storage capacity is inexpensive and constantly dropping in price, businesses have been compelled to save more data, with the hope that business intelligence (BI) can leverage the mountains of new data created every day. Organizations are also saving data that have already been analyzed, which can potentially be used for marking trends in relation to future data collections.
Aside from the ability to store more data than ever before, businesses also have access to more types of data. These data sources include Internet transactions, social networking activity, automated sensors, mobile devices, scientific instrumentation, voice over Internet protocol, and video elements. In addition to creating static data points, transactions can create a certain velocity to this data growth. For example, the extraordinary growth of social media is generating new transactions and records. But the availability of ever-expanding data sets doesn't guarantee success in the search for business value.
As data sets continue to grow with both structured and unstructured data and data analysis becomes more diverse, traditional enterprise storage system designs are becoming less able to meet the needs of Big Data. This situation has driven storage vendors to design new storage platforms that incorporate block- and file-based systems to meet the needs of Big Data and associated analytics.
Meeting the challenges posed by Big Data means focusing on some key storage ideologies and understanding how those storage design elements interact with Big Data demands, including the following:

Big Data can mean petabytes of data. Big Data storage systems must therefore be able to quickly and easily change scale to meet the growth of data collections. These storage systems will need to add capacity in modules or arrays that are transparent to users, without taking systems down. Most Big Data environments are turning to scale-out storage (the ability to increase storage performance as capacity increases) technologies to meet that criterion. The clustered architecture of scale-out storage solutions features nodes of storage capacity with embedded processing power and connectivity that can grow seamlessly, avoiding the silos of storage that traditional systems can create.
Big Data also means many large and small files. Managing the accumulation of metadata for file systems with multiple large and small files can reduce scalability and impact performance, a situation that can be a problem for traditional network-attached storage systems. Object-based storage architectures, in contrast, can allow Big Data storage systems to expand file counts into the billions without suffering the overhead problems that traditional file systems encounter. Object-based storage systems can also scale geographically, enabling large infrastructures to be spread across multiple locations.
Many types of data carry security standards that are driven by compliance laws and regulations. The data may be financial, medical, or government intelligence and may be part of an analytics set yet still be protected. While those data may not be different from what current IT managers must accommodate, Big Data analytics may need to cross-reference data that have not been commingled in the past, and this can create some new security considerations. In turn, IT managers should consider the security footing of the data stored in an array used for Big Data analytics and the people who will access the data.
In many cases, Big Data employs a real-time component, especially in use scenarios involving Web transactions or financial transactions. An example is tailoring Web advertising to each user's browsing history, which demands real-time analytics to function. Storage systems must be able to grow rapidly and still maintain performance. Latency produces stale data. That is another case in which scale-out architectures solve problems. The technology enables the cluster of storage nodes to increase in processing power and connectivity as they grow in capacity. Object-based storage systems can parallel data streams, further improving output.


Most Big Data environments need to provide high input-output operations per second (IOPS) performance, especially those used in high-performance computing environments. Virtualization of server resources, which is a common methodology used to expand compute resources without the purchase of new hardware, drives high IOPS requirements, just as it does in traditional IT environments. Those high IOPS performance requirements can be met with solid-state storage devices, which can be implemented in many different formats, ranging from simple server-based cache to all-flash-based scalable storage systems.
As businesses get a better understanding of the potential of Big Data analysis, the need to compare different data sets increases, and with it, more people are brought into the data sharing loop. The quest to create business value drives businesses to look at more ways to cross-reference different data objects from various platforms. Storage infrastructures that include global file systems can address this issue, since they allow multiple users on multiple hosts to access files from many different back-end storage systems in multiple locations.
Big Data storage infrastructures can grow very large, and that should be considered as part of the design challenge, dictating that care should be taken in the design and allowing the storage infrastructure to grow and evolve along with the analytics component of the mission. Big Data storage infrastructures also need to account for data migration challenges, at least during the start-up phase. Ideally, data migration will become something that is no longer needed in the world of Big Data, simply because the data are distributed in multiple locations.
Big Data applications often involve regulatory compliance requirements, which dictate that data must be saved for years or decades. Examples are medical information, which is often saved for the life of the patient, and financial information, which is typically saved for seven years. However, Big Data users are often saving data longer because they are part of a historical record or are used for time-based analysis. The requirement for longevity means that storage manufacturers need to include ongoing integrity checks and other long-term reliability features as well as address the need for data-in-place upgrades.
Big Data can be expensive. Given the scale at which many organizations are operating their Big Data environments, cost containment is imperative. That means more efficiency as well as less expensive components. Storage deduplication has already entered the primary storage market and, depending on the data types involved, could bring some value for Big Data storage systems. The ability to reduce capacity consumption even by a few percentage points provides a significant return on investment as data sets grow. Other Big Data storage technologies that can improve efficiencies are thin provisioning, snapshots, and cloning.
Thin provisioning operates by allocating disk storage space in a flexible manner among multiple users based on the minimum space required by each user at any given time.
Snapshots streamline access to stored data and can speed up the process of data recovery. There are two main types of storage snapshot: copy-on-write (or low-capacity) snapshot and split-mirror snapshot. Utilities are available that can automatically generate either type.
Disk cloning is copying the contents of a computer's hard drive. The contents are typically saved as a disk image file and transferred to a storage medium, which could be another computer's hard drive or removable media such as a DVD or a USB drive.
Data storage systems have evolved to include an archive component, which is important for organizations that are dealing with historical trends or long-term retention requirements. From a capacity and dollar standpoint, tape is still the most economical storage medium. Today, systems that support multiterabyte cartridges are becoming the de facto standard in many of these environments.


The biggest effect on cost containment can be traced to the use of commodity hardware. This is a good thing, since the majority of Big Data infrastructures won't be able to rely on the big iron enterprises of the past. Most of the first and largest Big Data users have built their own white-box systems on-site, which leverage a commodity-oriented, cost-saving strategy.
These examples and others have driven the trend of cost containment, and more storage products are arriving on the market that are software based and can be installed on existing systems or on common, off-the-shelf hardware. In addition, many of the same vendors are selling their software technologies as commodity appliances or partnering with hardware manufacturers to produce similar offerings. That all adds up to cost-saving strategies, which brings Big Data into the reach of smaller and smaller businesses.
Initially, Big Data implementations were designed around application-specific infrastructures, such as custom systems developed for government projects or the white-box systems engineered by large Internet service companies. Application awareness is becoming common in mainstream storage systems and should improve efficiency or performance, which fits right into the needs of a Big Data environment.
The value of Big Data and the associated analytics is trickling down to smaller organizations, which creates another challenge for those building Big Data storage infrastructures: creating smaller initial implementations that can scale yet fit into the budgets of smaller organizations.
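Of the efficiency technologies listed above, the copy-on-write snapshot is the easiest to illustrate in a few lines. The following is a minimal sketch, not a description of any vendor's implementation: the `Volume` class, its method names, and the idea of modeling blocks as list entries are all assumptions made for illustration. It shows why such a snapshot is "low-capacity": a new snapshot stores nothing, and an old block is copied only when it is about to be overwritten.

```python
class Volume:
    """Toy block volume with copy-on-write snapshots (hypothetical, for illustration)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)   # live data, one entry per block
        self.snapshots = []          # each snapshot maps block index -> preserved old value

    def snapshot(self):
        """Create a snapshot. It starts empty, so it consumes no extra capacity."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, value):
        """Overwrite a block, first preserving the old value in every
        snapshot that has not already saved a copy of that block."""
        for snap in self.snapshots:
            snap.setdefault(index, self.blocks[index])
        self.blocks[index] = value

    def read_snapshot(self, snap_id, index):
        """Read a block as it looked when the snapshot was taken."""
        snap = self.snapshots[snap_id]
        # Unchanged blocks are still read from the live volume.
        return snap.get(index, self.blocks[index])


vol = Volume(["a", "b", "c"])
s0 = vol.snapshot()
vol.write(1, "X")
s1 = vol.snapshot()
vol.write(1, "Y")
```

After these calls the live volume holds the newest data, while each snapshot still returns the block contents from the moment it was taken, and snapshot `s0` has stored only the single block that changed.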

BUILDING A PLATFORM

Like any application platform, a Big Data application platform must support all of the functionality required for any application platform, including elements such as scalability, security, availability, and continuity.


Yet Big Data application platforms are unique; they need to be able to handle massive amounts of data across multiple data stores and initiate concurrent processing to save time. This means that a Big Data platform should include built-in support for technologies such as MapReduce, integration with external Not only SQL (NoSQL) databases, parallel processing capabilities, and distributed data services. It should also make use of the new integration targets, at least from a development perspective.
Consequently, there are specific characteristics and features that a Big Data platform should offer to work effectively with Big Data analytics processes:

Most of the existing platforms for processing data were designed for handling transactional Web applications and have little support for business analytics applications. That situation has driven Hadoop to become the de facto standard for handling batch processing. However, real-time analytics is altogether different, requiring something more than Hadoop can offer. An event-processing framework needs to be in place as well. Fortunately, several technologies and processing alternatives exist on the market that can bring real-time analytics into Big Data platforms, and many major vendors, such as Oracle, HP, and IBM, are offering the hardware and software to bring real-time processing to the forefront. However, for the smaller business, that may not be a viable option because of the cost. For now, real-time processing remains a function that is provided as a service via the cloud for smaller businesses.
Transforming Big Data application development into something more mainstream may be the best way to leverage what is offered by Big Data. This means creating a built-in stack that integrates with Big Data databases from the NoSQL world and creating MapReduce frameworks such as Hadoop and distributed processing. Development should account for the existing transaction-processing and event-processing semantics that come with the handling of the real-time analytics that fit into the Big Data world.


Creating Big Data applications is very different from writing a typical CRUD application (create, retrieve, update, delete) for a centralized relational database. The primary difference is with the design of the data domain model, as well as the API and query semantics that will be used to access and process that data. Mapping is an effective approach in Big Data, where there is an impedance mismatch between different data models and sources; hence the success of MapReduce. An appropriate example is the use of object-relational mapping tools like Hibernate for building a bridge between the impedance mismatches.
Batch-processing projects are being serviced with frameworks such as Hive, which provide an SQL-like facade for handling complex batch processing with Hadoop. However, other tools are starting to show promise. An example is JPA, which provides a more standardized JEE abstraction that fits into real-time Big Data applications. Google App Engine uses DataNucleus along with Bigtable to achieve the same goal, while GigaSpaces uses OpenJPA's JPA abstraction combined with an in-memory data grid. Red Hat takes a different approach and leverages Hibernate object-grid mapping to map Big Data.
There are several choices available to abstract data, ranging from open source tools to commercial distributions of specialized products. One to pay attention to is Spring Data from SpringSource, which is a high-level abstraction tool that offers the ability to map different data stores of all kinds into one common abstraction through annotation and a plug-in approach.
Of course, one of the primary capabilities offered by abstraction tools is the ability to normalize and interpret the data into a uniform structure, which can be further worked with. The key here is to make sure that whatever abstraction technology is employed deals with current and future data sets efficiently.
A critical component of the Big Data analytics process is logic, especially business logic, which is responsible for processing the data. Currently, MapReduce reigns supreme in the realm of Big Data business logic. MapReduce was designed to handle the processing of massive amounts of data through moving the processing logic to the data and distributing the logic in parallel to all nodes. Another factor that adds to the appeal of MapReduce is that developing parallel processing code is very complex.
When designing a custom Big Data application platform, it is critical to make MapReduce and parallel execution simple. That can be accomplished by mapping the semantics into existing programming models. An example is to extend an existing model, such as Session Bean, to support the needed semantics. This makes parallel processing look like a standard invocation of single-job execution.
SQL is a great query language. However, it is limited, at least in the realm of Big Data. The problem lies in the fact that SQL relies on a schema to work properly, and Big Data, especially when it is unstructured, does not work well with schema-based queries. It is the dynamic data structure of Big Data that confounds the SQL schema-based processes. Here Big Data platforms must be able to support schema-less semantics, which in turn means that the data mapping layer would need to be extended to support document semantics. Examples are MongoDB, Couchbase, Cassandra, and the GigaSpaces document API. The key here is to make sure that Big Data application platforms support more relaxed versions of those semantics, with a focus on providing flexibility in consistency, scalability, and performance.
If the goal is to deliver the best performance and reduce latency, then one must consider using RAM-based devices and perform processing in-memory. However, for that to work effectively, Big Data platforms need to provide a seamless integration between RAM and disk-based devices in which data that are written in RAM would be synched into the disk asynchronously. Also, the platforms need to provide common abstractions that allow users the same data access API for both devices and thus make it easier to choose the right tool for the job without changing the application code.
Big Data applications (and platforms) must also be able to work with event-driven processes. With Big Data, this means there must be data awareness incorporated, which makes it easy to route messages based on data affinity and the content of the message. There also have to be controls that allow the creation of fine-grained semantics for triggering events based on data operations (such as add, delete, and update) and content, as with complex event processing.
Big Data applications consume large amounts of computer and storage resources. This has led to the use of the cloud and its elastic capabilities for running Big Data applications, which in turn can offer a more economical approach to processing Big Data jobs. To take advantage of those economics, Big Data application platforms must include built-in support for public, private, and hybrid clouds that will include seamless transitions between the various cloud platforms through integration with the available frameworks. Examples abound, such as JClouds and Cloud Bursting, which provides a hybrid model for using cloud resources as spare capacity to handle load.
The typical Big Data application stack incorporates several layers, including the database itself, the Web tier, the processing tier, caching layer, the data synchronization and distribution layer, and reporting tools. A major disadvantage for those managing Big Data applications is that each of those layers comes with different management, provisioning, monitoring, and troubleshooting tools. Add to that the inherent complexity of Big Data applications, and effective management, along with the associated maintenance, becomes difficult.
With that in mind, it becomes critical to choose a Big Data application platform that integrates the management stack with the application stack. An integrated management capability is one of the best productivity elements that can be incorporated into a Big Data platform.
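The point made above about MapReduce, that parallel processing should look like a standard single-job invocation, can be sketched with a small word count. This is an illustrative toy that runs on one machine with a thread pool; it is not Hadoop's API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=4):
    """Run the map step on every chunk in parallel, then shuffle and reduce.
    The caller sees an ordinary function call; the parallelism is hidden."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data big", "data analytics"])
```

In a real cluster the chunks would live on different nodes and the map logic would be shipped to the data, but the shape of the computation is the same: independent map calls, a grouping step, and a reduce per key.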

Building a Big Data platform is no easy chore, especially when one considers that there may be a multitude of right ways and wrong ways to do it. This is further complicated by the plethora of tools, technologies, and methodologies available. However, there is a bright side that stresses flexibility, and since Big Data is constantly evolving, flexibility will rule in building a custom platform or choosing one off the shelf.
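One of the platform characteristics described earlier, in-memory processing with data written in RAM and synchronized to disk asynchronously behind a single access API, can be sketched as a write-behind store. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, the "disk" is just a second dictionary standing in for a disk-based device, and the asynchronous step is modeled as an explicit `flush()` call so that the behavior is easy to follow.

```python
from collections import deque


class WriteBehindStore:
    """Hypothetical key-value store: writes land in RAM first and are
    persisted to a disk-like tier later."""

    def __init__(self):
        self.ram = {}            # fast in-memory tier
        self.disk = {}           # stand-in for a disk-based device
        self.pending = deque()   # writes not yet persisted, in arrival order

    def put(self, key, value):
        """Write to RAM immediately and queue the write for later persistence."""
        self.ram[key] = value
        self.pending.append((key, value))

    def get(self, key):
        """Same access call for both tiers: RAM first, then disk."""
        if key in self.ram:
            return self.ram[key]
        return self.disk[key]

    def flush(self):
        """Persist all queued writes; a real platform would run this in the
        background. Returns the number of writes applied."""
        applied = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.disk[key] = value
            applied += 1
        return applied

    def evict(self, key):
        """Free RAM for a key, persisting outstanding writes first."""
        self.flush()
        self.ram.pop(key, None)


store = WriteBehindStore()
store.put("a", 1)
```

The application code only ever calls `put` and `get`; whether a value is served from RAM or from the disk tier is a decision the store makes, which is the kind of common abstraction the text calls for.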

BRINGING STRUCTURE TO UNSTRUCTURED DATA

In its native format, a large pile of unstructured data has little value. It is burdensome in the typical enterprise, especially one that has not adopted Big Data practices to extract the value.
However, extracting value can be akin to finding a needle in a haystack, and if that haystack is spread across several farms and the needle is in pieces, it becomes even more difficult. One of the primary jobs of Big Data analytics is to piece that needle back together and organize the haystack into a single entity to speed up the search. That can be a tall order with unstructured data, a type of data that is growing in volume and size as well as complexity.
Unstructured (or uncatalogued) data can take many forms, such as historical photograph collections, audio clips, research notes, genealogy materials, and other riches hidden in various data libraries. The Big Data movement has driven methodologies to create dynamic and meaningful links among these currently unstructured information sources.
For the most part, that has resulted in the creation of metadata and methods to bring structure to unstructured data. Currently, two dominant technical and structural approaches have emerged: (1) a reliance on search technologies, and (2) a trend toward automated data categorization. Many data categorization techniques are being applied across the landscape, including taxonomies, semantics, natural language recognition, auto-categorization, "what's related" functionality, data visualization, and personalization. The idea is to provide the information that is needed to process an analytics function.


The importance of integrating structured and loosely unstructured data cannot be overstated in the world of Big Data analytics. There are a few enabling technical strategies that make it possible to sort the wheat from the chaff. For instance, there is SQL-NoSQL integration. Those using MapReduce and other schemaless frameworks have been struggling with structural data and analytics coming from the relational database management system (RDBMS) side. However, the integration of the relational and nonrelational paradigms provides the most powerful analytics by bringing together the best of both worlds.
There are several technologies that enable this integration; some of them take advantage of the processing power of MapReduce frameworks like Hadoop to perform data transformation in place, rather than doing it in a separate middle tier. Some tools combine this capability with in-place transformation at the target database as well, taking advantage of the computing capabilities of engineered machines and using change data capture to synchronize source and target, again without the overhead of a middle tier. In both cases, the overarching principle is real-time data integration, in which data changes are reflected instantly in a data warehouse, whether they originate from a MapReduce job or from a transactional system, so that downstream analytics have an accurate, timely view of reality. Others are turning to linked data and semantics, where data sets are created using linking methodologies that focus on the semantics of the data.
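The change data capture idea mentioned above can be sketched in a few lines. This is a minimal illustration rather than any product's mechanism: the event format (dictionaries with `seq`, `op`, `key`, and `row` fields) and the function name `apply_changes` are assumptions, and the change log is assumed to be ordered by sequence number. The target keeps track of the last change it applied, so only new changes are replayed and no middle tier has to recopy the full data set.

```python
def apply_changes(target, change_log, last_applied=0):
    """Replay change events newer than last_applied onto target (a dict
    keyed by primary key) and return the new high-water mark."""
    for event in change_log:
        seq = event["seq"]
        if seq <= last_applied:
            continue  # already reflected in the target
        if event["op"] in ("insert", "update"):
            target[event["key"]] = event["row"]
        elif event["op"] == "delete":
            target.pop(event["key"], None)
        last_applied = seq
    return last_applied


# Hypothetical change log captured from a transactional source system.
log = [
    {"seq": 1, "op": "insert", "key": 101, "row": {"status": "new"}},
    {"seq": 2, "op": "update", "key": 101, "row": {"status": "paid"}},
    {"seq": 3, "op": "insert", "key": 102, "row": {"status": "new"}},
    {"seq": 4, "op": "delete", "key": 102, "row": None},
]

warehouse = {}
position = apply_changes(warehouse, log[:2])      # first sync
position = apply_changes(warehouse, log, position)  # later sync replays only seq 3 and 4
```

Because each sync resumes from the stored position, the warehouse converges on the source's current state while touching only the rows that actually changed.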
This fits well into the broader notion of pointing at external sources from within a data set, which has been around for quite a long time. That ability to point to unstructured data (whether residing in the file system or some external source) merely becomes an extension of the given capabilities, in which the ability to store and process XML and XQuery natively within an RDBMS enables the combination of different degrees of structure while searching and analyzing the underlying data.
Newer semantics technologies can take this further by providing a set of formalized XML-based standards for storage, querying, and manipulation of data. Since these technologies have been focused on the Web, many businesses have not associated the process with Big Data solutions.
Most NoSQL technologies fall into the categories of key-value stores, graph databases, or document databases; the semantic Resource Description Framework (RDF) triple store creates an alternative. It is not relational in the traditional sense, but it still maintains relationships between data elements, including external ones, and does so in a flexible, extensible fashion.
A record in an RDF store is composed of a triple, consisting of subject, predicate, and object. That does not impose a relational schema on the data, which supports the addition of new elements without structural modifications to the store. In addition, the underlying system can resolve references by inferring new triples from the existing records using a rules set. This is a powerful alternative to joining relational tables to resolve references in a typical RDBMS, while also offering a more expressive way to model data than a key-value store.
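The triple model and rule-based inference described above can be sketched without any RDF library. The following is a toy illustration, not a conforming RDF implementation: triples are plain Python tuples, and the rule format (a chain of two predicates producing a third) and the names `infer` and `match` are assumptions made for the example.

```python
def infer(triples, rules):
    """Forward-chain until no new triples appear. Each rule is a tuple
    (p1, p2, p_new) meaning: (a, p1, b) and (b, p2, c) imply (a, p_new, c)."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for p1, p2, p_new in rules:
            derived = {
                (s1, p_new, o2)
                for (s1, pa, o1) in known if pa == p1
                for (s2, pb, o2) in known if pb == p2 and s2 == o1
            }
            if not derived <= known:
                known |= derived
                changed = True
    return known


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the given pattern; None is a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


facts = {
    ("alice", "worksFor", "acme"),
    ("acme", "locatedIn", "boston"),
    ("boston", "locatedIn", "usa"),
}
rules = [
    ("worksFor", "locatedIn", "basedIn"),    # an employee is based where the employer is
    ("locatedIn", "locatedIn", "locatedIn"), # location is transitive
]
closure = infer(facts, rules)
```

Adding a new kind of fact requires no schema change, only another tuple, and a question such as "where is alice based?" is answered by `match(closure, s="alice", p="basedIn")` rather than by joining tables.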
One of the most powerful aspects of semantic technology comes from the world of linguistics and natural language processing: a technique known as entity extraction. This is a powerful mechanism to extract information from unstructured data and combine it with transactional data, enabling deep analytics by bringing these worlds closer together.
Another method that brings structure to the unstructured is the text analytics tool, which is improving daily as scientists come up with new ways of making algorithms understand written text more accurately. Today's algorithms can detect names of people, organizations, and locations within seconds simply by analyzing the context in which words are used. The trend for this tool is to move toward recognition of further useful entities, such as product names, brands, events, and skills.
Entity relation extraction is another important tool: a relation that consistently connects two entities in many documents is important information in science and enterprise alike. Entity relation extraction detects new knowledge in Big Data. Other unstructured data tools are detecting sentiment in social data, integrating multiple languages, and applying text analytics to audio and video transcripts. The number of videos is growing at a constant rate, and transcripts are even more unstructured than written text because there is no punctuation.
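The core intuition behind entity relation extraction, that two entities appearing together across many documents are probably related, can be sketched with simple co-occurrence counting. This is a deliberately naive stand-in for real relation extraction, which relies on linguistic analysis: the entity list is assumed to be known in advance, matching is a case-insensitive substring test, and the function name and `min_docs` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(documents, entities, min_docs=2):
    """Count how many documents mention each pair of entities together and
    keep the pairs that co-occur in at least min_docs documents."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Sorting gives every pair a canonical order, so (A, B) and (B, A) merge.
        present = sorted(e for e in entities if e.lower() in lowered)
        counts.update(combinations(present, 2))
    return {pair: n for pair, n in counts.items() if n >= min_docs}


docs = [
    "Acme hired Alice in Boston",
    "Alice joined Acme last year",
    "Boston weather was mild",
]
pairs = cooccurring_pairs(docs, ["Acme", "Alice", "Boston"])
```

Here only the Acme and Alice pair appears together in more than one document, so it is the only candidate relation reported; a production tool would go further and label the relation itself.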

PROCESSING POWER

Analyzing Big Data can take massive amounts of processing power. There is a simple relationship between data analytics and processing power: the larger the data set and the faster that results are needed, the more processing power it takes. However, processing Big Data analytics is not a simple matter of throwing the latest and fastest processor at the problem; it is more about the ideology behind grid-computing technologies.
Big Data involves more than just distributed processing technologies, like Hadoop. It is also about faster processors, wider bandwidth communications, and larger and cheaper storage to achieve the goal of making the data consumable. That in turn drives the idea of data visualization and interface technologies, which make the results of analysis consumable by humans, and that is where the raw processing power comes to bear for analytics.
Intuitiveness comes from proper analysis, and proper analysis requires the appropriate horsepower and infrastructure to mine an appropriate data set from huge piles of data. To that end, distributed processing platforms such as Hadoop and MapReduce are gaining favor over big iron in the realm of Big Data analytics.
Perhaps the simplest argument for pursuing a distributed infrastructure is the flexibility of scale, in which more commodity hardware can just be thrown at a particular analysis project to increase performance and speed results. That distributed ideology plays well into grid processing and cloud-based services, which can be employed as needed to process data sets.
The primary thing to remember about building a processing platform for Big Data is how processing can scale. For example, many businesses start off small, with a few commodity PCs running a Hadoop-based platform, but as the amount of data and the available sources grow exponentially, the ability to process the data falls exponentially, meaning that designs must incorporate a look-ahead methodology. That is where IT professionals will need to consider available, future technologies to scale processing to their needs.
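A look-ahead methodology of the kind described above can start as simple arithmetic. The sketch below is a back-of-the-envelope planning aid under loudly stated assumptions: throughput is taken to scale linearly with node count, which real clusters only approximate, and every figure in the usage example (data volume, growth rate, per-node throughput, processing window) is a made-up illustration rather than a measurement.

```python
import math


def nodes_needed(data_tb, tb_per_node_hour, window_hours):
    """Nodes required to process data_tb within window_hours, assuming
    each node processes tb_per_node_hour and scaling is linear."""
    return math.ceil(data_tb / (tb_per_node_hour * window_hours))


def projection(initial_tb, growth_per_year, years, tb_per_node_hour, window_hours):
    """Node count for each year from 0 to years, with data compounding
    annually at growth_per_year (1.0 means doubling every year)."""
    return [
        nodes_needed(initial_tb * (1 + growth_per_year) ** year,
                     tb_per_node_hour, window_hours)
        for year in range(years + 1)
    ]


# Hypothetical inputs: 10 TB today, doubling yearly, 0.5 TB per node-hour,
# and a 4-hour processing window.
plan = projection(10, 1.0, 3, 0.5, 4)
```

With these inputs the cluster that needs 5 nodes today needs 40 within three years, which illustrates why a platform sized only for the pilot data set falls behind quickly when data compounds.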
Cloud-based solutions that offer elastic-type services are a decent way to future-proof a Big Data analytics platform, simply because of the ability of a cloud service to instantly scale to the loads placed upon it.
There is no simple answer to how to process Big Data with the technology choices available today. Nevertheless, major vendors are looking to make the choices easier by providing canned solutions that are based on appliance models, while others are building complete cloud-based Big Data solutions to meet the elastic needs of small and medium businesses looking to leverage Big Data.

CHOOSING AMONG IN-HOUSE, OUTSOURCED, OR HYBRID APPROACHES

The world of Big Data is filled with choices, so many that most IT professionals can become overwhelmed with options, technologies, and platforms. It is almost at the point at which Big Data analytics is required to choose among the various Big Data ideologies, platforms, and tools.
However, the question remains of where to start with Big Data. The answer can be found in how Big Data systems evolve or grow. In the past, working with Big Data always meant working on the scale of a dedicated data center. However, commodity hardware running platforms like Hadoop has changed that dynamic, and decreasing storage prices and open source applications have further lowered the initial cost of entry. These new dynamics allow smaller businesses to experiment with Big Data and then expand the platforms as needed as successes are built.
Once a pilot project has been constructed using open source software with commodity hardware and storage devices, IT managers can then measure how well the pilot platform meets their needs. Only after the processing needs and volume of data increase can an IT manager make a decision on where to head with a Big Data analytics platform, whether developing one in-house or turning to the cloud.
