Big Data Analytics: Turning Big Data into Big Money

By Frank Ohlhorst
Copyright 2013 by John Wiley & Sons, Inc.

CHAPTER

Bringing It
All Together

The promises offered by data-driven decision making have been widely
recognized. Businesses have been using business intelligence (BI) and
business analytics for years now, realizing the value offered by smaller
data sets and offline advanced processing. However, businesses are just
starting to realize the value of Big Data analytics, especially when
paired with real-time processing.

That has led to a growing enthusiasm for the notion of Big Data, with
businesses of all sizes starting to throw resources behind the quest to
leverage the value out of large data stores composed of structured,
semistructured, and unstructured data. Although the promises wrapped
around Big Data are very real, there is still a wide gap between its
potential and its realization.

That wide gap is highlighted by those who have successfully used the
concepts of Big Data at the outset. For example, it is estimated that
Google alone contributed $54 billion to the U.S. economy in 2009, a
significant economic effect, mostly attributed to the ability to handle
large data sets in an efficient manner.

That alone is probably reason enough for the majority of businesses to
start evaluating how Big Data analytics can affect the bottom line, and
those businesses should probably start evaluating Big Data promises
sooner rather than later.

c10 22 October 2012; 18:1:22


Delving into the value of Big Data analytics reveals that elements such
as heterogeneity, scale, timeliness, complexity, and privacy problems can
impede progress at all phases of the process that creates value from
data. The primary problem begins at the point of data acquisition, when
the data tsunami requires us to make decisions, currently in an ad hoc
manner, about what data to keep, what to discard, and how to reliably
store what we keep with the right metadata.

Adding to the confusion is that most data today are not natively stored
in a structured format; for example, tweets and blogs are weakly
structured pieces of text, while images and video are structured for
storage and display but not for semantic content and search.
Transforming such content into a structured format for later analysis is
a major challenge.

Nevertheless, the value of data explodes when they can be linked with
other data; thus data integration is a major creator of value. Since most
data are directly generated in digital format today, businesses have the
opportunity and the challenge to influence the creation of data in ways
that facilitate later linkage and to automatically link previously
created data.

Data analysis, organization, retrieval, and modeling are other
foundational challenges. Data analysis is a clear bottleneck in many
applications because of the lack of scalability of the underlying
algorithms as well as the complexity of the data that need to be
analyzed. Finally, presentation of the results and their interpretation
by nontechnical domain experts is crucial for extracting actionable
knowledge.

THE PATH TO BIG DATA

During the last three to four decades, primary data management
principles, including physical and logical independence, declarative
querying, and cost-based optimization, have created a multibillion-dollar
industry that has delivered added value to collected data. The evolution
of these technical advantages has led to the creation of BI platforms,
which have become one of the primary tenets of value extraction and
corporate decision making.

The foundation laid by BI applications and platforms has created the
ideal environment for moving into Big Data analytics. After all,
many of the concepts remain the same; it is just the data sources and the
quantity that primarily change, as well as the algorithms used to expose
the value.

That creates an opportunity in which investment in Big Data and its
associated technical elements becomes a must for many businesses. That
investment will spur further evolution of the analytical platforms in use
and will strive to create collaborative analytical solutions that look
beyond the confines of traditional analytics. In other words, appropriate
investment in Big Data will lead to a new wave of fundamental
technological advances that will be embodied in the next generations of
Big Data management and analysis platforms, products, and systems.

The time is now. Using Big Data to solve business problems and promote
research initiatives will most likely create huge economic value in the
U.S. economy for years to come, making Big Data analytics the norm for
larger organizations. However, the path to success is not easy and may
require that data scientists rethink data analysis systems in fundamental
ways.

A major investment in Big Data, properly directed, not only can result in
major scientific advances but also can lay the foundation for the next
generation of advances in science, medicine, and business. So business
leaders must ask themselves the following: Do they want to be part of the
next big thing in IT?

THE REALITIES OF THINKING BIG DATA

Today, organizations and individuals are awash in a flood of data.
Applications and computer-based tools are collecting information on an
unprecedented scale. The downside is that the data have to be managed,
which is an expensive, cumbersome process. Yet the cost of that
management can be offset by the intrinsic value offered by the data, at
least when looked at properly.

The value is derived from the data themselves. Decisions that were
previously based on guesswork or on painstakingly constructed models of
reality can now be made based on the data themselves. Such Big Data
analysis now drives nearly every aspect of our modern society, including
mobile services, retail, manufacturing, financial services, life
sciences, and physical sciences.

Certain market segments have had early success with Big Data analytics.
For example, scientific research has been revolutionized by Big Data, a
prime case being the Sloan Digital Sky Survey, which has become a central
resource for astronomers the world over.

Big Data has transformed astronomy from a field in which taking pictures
of the sky was a large part of the job to one in which the pictures are
all in a database already and the astronomer's task is to find
interesting objects and phenomena in the database.

Transformation is taking place in the biological arena as well. There is
now a well-established tradition of depositing scientific data into a
public repository and of creating public databases for use by other
scientists. In fact, there is an entire discipline of bioinformatics that
is largely devoted to the maintenance and analysis of such data. As
technology advances, particularly with the advent of next-generation
sequencing, the size and number of available experimental data sets are
increasing exponentially.

Big Data has the potential to revolutionize more than just research; the
analytics process has started to transform education as well. A recent
detailed quantitative comparison of different approaches taken by 35
charter schools in New York City has found that one of the top five
policies correlated with measurable academic effectiveness was the use of
data to guide instruction.

This example is only the tip of the iceberg; as access to data and
analytics improves and evolves, much more value can be derived. The
potential here leads to a world where authorized individuals have access
to a huge database in which every detailed measure of every student's
academic performance is stored. That data could be used to design the
most effective approaches for education, ranging from the basics, such as
reading, writing, and math, to advanced college-level courses.

A final example is the health care industry, in which everything from
insurance costs to treatment methods to drug testing can be improved with
Big Data analytics. Ultimately, Big Data in the health care industry will
lead to reduced costs and improved quality of care, which may be
attributed to making care more preventive and personalized and basing it
on more extensive (home-based) continuous monitoring.

More examples are readily available to prove that data can deliver value
well beyond one's expectations. The key issues are the analysis performed
and the goal sought. The previous examples only scratch the surface of
what Big Data means to the masses. The essential point here is to
understand the intrinsic value of Big Data analytics and extrapolate the
value as it can be applied to other circumstances.

HANDS-ON BIG DATA

The analysis of Big Data involves multiple distinct phases, each of which
introduces challenges. These phases include acquisition, extraction,
aggregation, modeling, and interpretation. However, most people focus
just on the modeling (analysis) phase.

Although that phase is crucial, it is of little use without the other
phases of the data analysis process; neglecting them can create problems
like false outcomes and uninterpretable results. The analysis is only as
good as the data provided. The problem stems from the fact that there are
poorly understood complexities in the context of multitenanted data
clusters, especially when several analyses are being run concurrently.

Many significant challenges extend beyond and underneath the modeling
phase. For example, Big Data has to be managed for context, which may
include spurious information and can be heterogeneous in nature; this is
further complicated by the lack of an upfront model. It means that data
provenance must be accounted for and that methods must be created to
handle uncertainty and error.

Perhaps the problems can be attributed to ignorance or, at the very
least, a lack of consideration for primary topics that define the Big
Data process yet are often afterthoughts. This means that questions and
analytical processes must be planned and thought out in the context of
the data provided. One has to determine what is wanted from the data and
then ask the appropriate questions to get that information.

Accomplishing that will require smarter systems as well as better support
for those making the queries, perhaps by empowering those users with
natural language tools (rather than complex mathematical algorithms) to
query the data. The key issue is the level of achievable artificial
intelligence and how much that can be relied on. Currently,
IBM's Watson is a major step toward integrating artificial intelligence
with the Big Data analytics space, yet the sheer size and complexity of
the system preclude its use for most analysts.

This means that other methodologies to empower users and analysts will
have to be created, and they must remain affordable and be simple to use.
After all, the current bottleneck with processing Big Data really has
become the number of users who are empowered to ask questions of the data
and analyze them.

THE BIG DATA PIPELINE IN DEPTH

Big Data does not arise from a vacuum (except, of course, when studying
deep space). Basically, data are recorded from a data-generating source.
Gathering data is akin to sensing and observing the world around us, from
the heart rate of a hospital patient to the contents of an air sample to
the number of Web page queries to scientific experiments that can easily
produce petabytes of data.

However, much of the data collected is of little interest and can be
filtered and compressed by many orders of magnitude, which creates a
bigger challenge: the definition of filters that do not discard useful
information. For example, suppose one data sensor reading differs
substantially from the rest. Can that be attributed to a faulty sensor,
or are the data real and worth inclusion?
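
One way to avoid discarding a possibly real reading is to flag it rather
than drop it. The following is a minimal sketch, assuming a small batch of
numeric readings and using the modified z-score (median and median
absolute deviation) as the test; the function name, the threshold of 3.5,
and the sample values are illustrative assumptions, not part of the
original text.

```python
from statistics import median


def flag_outliers(readings, threshold=3.5):
    """Return (value, is_suspect) pairs using the modified z-score.

    Readings are flagged, not dropped, so a later step can decide whether
    a suspect value reflects a faulty sensor or a real event.
    """
    med = median(readings)
    # Median absolute deviation: a spread estimate that one extreme value
    # cannot distort the way it distorts a standard deviation.
    mad = median(abs(x - med) for x in readings)
    if mad == 0:
        # At least half of the readings equal the median, so no score can
        # be computed; flag anything that differs from the median.
        return [(x, x != med) for x in readings]
    return [(x, abs(0.6745 * (x - med) / mad) > threshold) for x in readings]


# Hypothetical temperature readings with one value far from the rest.
readings = [20.1, 19.8, 20.3, 20.0, 35.7, 19.9]
for value, suspect in flag_outliers(readings):
    print(value, "suspect" if suspect else "ok")
```

Keeping the flag next to the value defers the "faulty sensor or real
event" question to a step that has more context, such as neighboring
sensors or maintenance logs.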

Further complicating the filtering process is how the sensors gather
data. Are they based on time, transactions, or other variables? Are the
sensors affected by environment or other activities? Are the sensors tied
to spatial and temporal events such as traffic movement or rainfall?

Before the data are filtered, these considerations and others must be
addressed. That may require new techniques and methodologies to process
the raw data intelligently and deliver a data set in manageable chunks
without throwing away the needle in the haystack. Further filtering
complications come with real-time processing, in which the data are in
motion and streaming on the fly, and one does not have the luxury of
being able to store the data first and process them later for reduction.
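
When the data cannot be stored first, a filter has to work from summary
statistics that are updated one value at a time. The sketch below assumes
a single numeric stream and uses Welford's online algorithm for the
running mean and variance; the class and function names, the warm-up
length, and the three-standard-deviation rule are illustrative
assumptions.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; zero until two values have been seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def stream_filter(stream, k=3.0, warmup=5):
    """Yield (value, is_suspect) one at a time without storing the stream."""
    stats = RunningStats()
    for x in stream:
        suspect = (
            stats.n >= warmup
            and stats.std > 0
            and abs(x - stats.mean) > k * stats.std
        )
        if not suspect:
            stats.update(x)  # suspect values do not shift the baseline
        yield x, suspect


for value, suspect in stream_filter([10, 11, 9, 10, 12, 10, 50, 11]):
    print(value, "suspect" if suspect else "ok")
```

Leaving suspect values out of the baseline keeps one spike from masking
the next, but it also means a genuine shift in level would keep being
flagged; which behavior is wanted depends on the sensor.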

Another challenge comes in the form of automatically generating the right
metadata to describe what data are recorded and how they
are recorded and measured. For example, in scientific experiments,
considerable detail on specific experimental conditions and procedures
may be required to be able to interpret the results correctly, and it is
important that such metadata be recorded with observational data.

When implemented properly, automated metadata acquisition systems can
minimize the need for manual processing, greatly reducing the human
burden of recording metadata. Those who are gathering data also have to
be concerned with the data provenance. Recording information about the
data at their time of creation becomes important as the data move through
the data analysis process. Accurate provenance can prevent processing
errors from rendering the subsequent analysis useless. With suitable
provenance, the subsequent processing steps that depend on a faulty step
can be quickly identified. Proving the accuracy of the data is
accomplished by generating suitable metadata that also carry the
provenance of the data through the data analysis process.
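
As a rough illustration of metadata that travel with the data, the sketch
below attaches a source, a unit, and a creation timestamp to each value
when it is captured, and appends the name of every processing step to a
history list. The `Record` type, its field names, and the sensor
identifier are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Record:
    value: float
    unit: str
    source: str        # hypothetical sensor or system identifier
    recorded_at: str   # ISO 8601 timestamp captured at creation
    history: list = field(default_factory=list)  # steps applied so far


def capture(value, unit, source):
    """Create a record with its metadata attached at acquisition time."""
    now = datetime.now(timezone.utc).isoformat()
    return Record(value, unit, source, now)


def apply_step(record, name, func, unit=None):
    """Return a new record with func applied and the step name logged."""
    return Record(
        value=func(record.value),
        unit=unit or record.unit,
        source=record.source,
        recorded_at=record.recorded_at,
        history=record.history + [name],
    )


raw = capture(98.6, "degF", "ward-3/thermometer-12")
celsius = apply_step(raw, "fahrenheit_to_celsius",
                     lambda f: (f - 32) * 5 / 9, unit="degC")
rounded = apply_step(celsius, "round_1dp", lambda c: round(c, 1))
print(rounded.value, rounded.unit, rounded.history)
```

Because each step returns a new record rather than editing the old one,
the raw value and its original metadata remain available if a later step
turns out to be faulty.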

Another step in the process consists of extracting and cleaning the data.
The information collected will frequently not be in a format ready for
analysis. For example, consider electronic health records in a medical
facility that consist of transcribed dictations from several physicians,
structured data from sensors and measurements (possibly with some
associated anomalous data), and image data such as scans. Data in this
form cannot be effectively analyzed. What is needed is an information
extraction process that draws out the required information from the
underlying sources and expresses it in a structured form suitable for
analysis.

Accomplishing that correctly is an ongoing technical challenge,
especially when the data include images (and, in the future, video). Such
extraction is highly application dependent; the information in an MRI,
for instance, is very different from what you would draw out of a
surveillance photo. The ubiquity of surveillance cameras and the
popularity of GPS-enabled mobile phones, cameras, and other portable
devices mean that rich and high-fidelity location and trajectory (i.e.,
movement in space) data can also be extracted.

Another issue is the honesty of the data. For the most part, data are
expected to be accurate, if not truthful. However, in some cases, those
who are reporting the data may choose to hide or falsify information.
For example, patients may choose to hide risky behavior, or potential
borrowers filling out loan applications may inflate income or hide
expenses. The list of ways in which data could be misinterpreted or
misreported is endless. The act of cleaning data before analysis should
include well-recognized constraints on valid data or well-understood
error models, which may be lacking in Big Data platforms.
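
Constraints on valid data can be written down as named checks and applied
to every record before analysis. The sketch below assumes loan
applications held as dictionaries; the field names, the limits, and the
sample records are hypothetical and serve only to show the pattern.

```python
def validate(record, constraints):
    """Return the names of the constraints that the record violates."""
    return [name for name, check in constraints.items() if not check(record)]


# Hypothetical constraints for a loan application; the field names and
# limits are illustrative assumptions, not taken from any real system.
CONSTRAINTS = {
    "income_non_negative": lambda r: r["annual_income"] >= 0,
    "expenses_non_negative": lambda r: r["monthly_expenses"] >= 0,
    "age_plausible": lambda r: 18 <= r["age"] <= 120,
    "expenses_within_income":
        lambda r: r["monthly_expenses"] * 12 <= r["annual_income"],
}

applications = [
    {"id": 1, "annual_income": 52000, "monthly_expenses": 2100, "age": 34},
    {"id": 2, "annual_income": -1, "monthly_expenses": 900, "age": 41},
    {"id": 3, "annual_income": 30000, "monthly_expenses": 4000, "age": 17},
]

clean, rejected = [], []
for app in applications:
    problems = validate(app, CONSTRAINTS)
    (rejected if problems else clean).append((app["id"], problems))

print("clean:", clean)
print("rejected:", rejected)
```

Checks like these catch values that are impossible or implausible. They
do not catch a plausible figure that has been deliberately misreported,
which is where an error model or cross-checking against other sources
would be needed.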

Moving data through the process requires concentration on integration,
aggregation, and representation of the data, all of which are
process-oriented steps that address the heterogeneity of the flood of
data. Here the challenge is to record the data and then place them into
some type of repository.

Data analysis is considerably more challenging than simply locating,
identifying, understanding, and citing data. For effective large-scale
analysis, all of this has to happen in a completely automated manner.
This requires differences in data structure and semantics to be expressed
in forms that are machine readable and then computer resolvable. It may
take a significant amount of work to achieve automated error-free
difference resolution.

The data preparation challenge even extends to analysis that uses only a
single data set. Here there is still the issue of suitable database
design, further complicated by the many alternative ways in which to
store the information. Particular database designs may have certain
advantages over others for analytical purposes. A case in point is the
variety in the structure of bioinformatics databases, in which
information on substantially similar entities, such as genes, is
inherently different but is represented with the same data elements.

Examples like these clearly indicate that database design is an artistic
endeavor that has to be carefully executed in the enterprise context by
professionals. When creating effective database designs, professionals
such as data scientists must have the tools to assist them in the design
process, and more important, they must develop techniques so that
databases can be used effectively in the absence of intelligent database
design.

As the data move through the process, the next step is querying the data
and then modeling them for analysis. Methods for querying and mining Big
Data are fundamentally different from traditional statistical analysis.
Big Data is often noisy, dynamic, heterogeneous, interrelated,
and untrustworthy, a very different informational source from small data
sets used for traditional statistical analysis.

Even so, noisy Big Data can be more valuable than tiny samples because
general statistics obtained from frequent patterns and correlation
analysis usually overpower individual fluctuations and often disclose
more reliable hidden patterns and knowledge. In addition, interconnected
Big Data creates large heterogeneous information networks with which
information redundancy can be explored to compensate for missing data,
cross-check conflicting cases, and validate trustworthy relationships.
Interconnected Big Data resources can disclose inherent clusters and
uncover hidden relationships and models.

Mining the data therefore requires integrated, cleaned, trustworthy, and
efficiently accessible data, backed by declarative query and mining
interfaces that feature scalable mining algorithms. All of this relies on
Big Data computing environments that are able to handle the load.
Furthermore, data mining can be used concurrently to improve the quality
and trustworthiness of the data, expose the semantics behind the data,
and provide intelligent querying functions.

Glaring examples of introduced data errors can be readily found in the
health care industry. As noted previously, it is not uncommon for
real-world medical records to have errors. Further complicating the
situation is the fact that medical records are heterogeneous and are
usually distributed in multiple systems. The result is a complex
analytics environment that lacks any type of standard nomenclature to
define its respective elements.

The value of Big Data analysis can be realized only if it can be applied
robustly under those challenging conditions. However, the knowledge
developed from that data can be used to correct errors and remove
ambiguity. An example of the use of that corrective analysis is when a
physician writes DVT as the diagnosis for a patient. This abbreviation is
commonly used for both deep vein thrombosis and diverticulitis, two very
different medical conditions. A knowledge base constructed from related
data can use associated symptoms or medications to determine which of the
two the physician meant.
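
The DVT case can be sketched as a lookup that scores each possible
expansion by how many of its associated terms also appear in the record.
This is a toy illustration only: the knowledge base below is a hand-made
placeholder with assumed term lists, not clinical guidance, and a real
system would learn such associations from data.

```python
# Toy knowledge base: terms assumed to co-occur with each expansion.
# The term lists are illustrative placeholders, not clinical guidance.
KNOWLEDGE_BASE = {
    "deep vein thrombosis": {
        "leg swelling", "calf pain", "warfarin", "heparin",
    },
    "diverticulitis": {
        "abdominal pain", "fever", "ciprofloxacin", "metronidazole",
    },
}


def disambiguate(context_terms, knowledge_base=KNOWLEDGE_BASE):
    """Pick the expansion whose associated terms overlap most with the record.

    Returns None when nothing matches or when the top score is tied, so
    ambiguous records can be routed to a person instead of being guessed.
    """
    context = {term.lower() for term in context_terms}
    scores = {name: len(terms & context)
              for name, terms in knowledge_base.items()}
    top = max(scores.values())
    winners = [name for name, score in scores.items() if score == top]
    return winners[0] if top > 0 and len(winners) == 1 else None


print(disambiguate(["Calf pain", "Heparin"]))
print(disambiguate(["fever", "abdominal pain"]))
print(disambiguate(["headache"]))
```

Returning no answer on a tie or on zero overlap is a deliberate choice in
this sketch: a wrong expansion written back into a medical record would
be worse than an unresolved abbreviation.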

It is easy to see how Big Data can enable the next generation of
interactive data analysis, which by using automation can deliver
real-time answers. This means that machine intelligence can be used in
the future to direct automatically generated queries toward Big Data, a
key capability that will extend the value of data to automatic content
creation for web sites, to populating hot lists or recommendations, and
to providing an ad hoc analysis of the value of a data set to decide
whether to store or discard it.

Achieving that goal will require scaling complex query-processing
techniques to terabytes while enabling interactive response times, and
currently this is a major challenge and an open research problem.
Nevertheless, advances are made on a regular basis, and what is a problem
today will undoubtedly be solved in the near future as processing power
increases and data become more coherent.

Solving that problem will require a technique that eliminates the lack of
coordination between the database systems that host the data and provide
SQL querying and the analytics packages that perform various forms of
non-SQL processing such as data mining and statistical analyses. Today's
analysts are impeded by a tedious process of exporting data from a
database, performing a non-SQL process, and bringing the data back. This
is a major obstacle to providing the interactive automation that was
provided by the first generation of SQL-based OLAP systems. What is
needed is a tight coupling between declarative query languages and the
functions of Big Data analytics packages that will benefit both the
expressiveness and the performance of the analysis.
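
A small-scale version of that coupling can be shown with Python's
built-in sqlite3 module, which lets a statistical routine be registered
as an SQL aggregate so that it runs inside the query instead of after an
export. The table, the column names, and the `py_median` function name
are assumptions made for this sketch; it illustrates the idea and is not
a Big Data platform.

```python
import sqlite3
from statistics import median


class Median:
    """User-defined SQL aggregate: step() runs per row, finalize() once."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
# Register the Python class under a hypothetical SQL name, one argument.
conn.create_aggregate("py_median", 1, Median)

conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("east", 10.0), ("east", 12.0), ("east", 95.0),
    ("west", 7.0), ("west", 9.0),
])

# The analytics function is called from the declarative query itself.
rows = conn.execute(
    "SELECT region, py_median(amount) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)
```

The query stays declarative (grouping and ordering are left to the
database), while the statistic is supplied by the analytics side, so
there is no export and re-import step.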

One of the most important steps in processing Big Data is the
interpretation of the data analyzed. That is where business decisions can
be formed based on the contents of the data as they relate to a business
process. The ability to analyze Big Data is of limited value if the users
cannot understand the analysis. Ultimately, a decision maker, provided
with the result of an analysis, has to interpret these results. Data
interpretation cannot happen in a vacuum. For most scenarios,
interpretation requires examining all of the assumptions and retracing
the analysis process.

An important element of interpretation comes from the understanding that
there are many possible sources of error, ranging from processing bugs to
improper analysis assumptions to results based on erroneous data, a
situation that logically prevents users from fully
ceding authority to a fully automated process run solely by the computer
system. Proper interpretation requires that the user understands and
verifies the results produced by the computer. Nevertheless, the
analytics platform should make that easy to do, which currently remains a
challenge with Big Data because of its inherent complexity.

In most cases, crucial assumptions lie behind the recorded data, and
those assumptions can taint the overall analysis. Those analyzing the
data need to be aware of these situations. Since the analytical process
involves multiple steps, assumptions can creep in at any point, making
documentation and explanation of the process especially important to
those interpreting the data. Ultimately that will lead to improved
results and will introduce self-correction into the data process as those
interpreting the data inform those writing the algorithms of their needs.

It is rarely enough to provide just the results. Rather, one must provide
supplementary information that explains how each result was derived and
what inputs it was based on. Such supplementary information is called the
provenance of the data. By studying how best to acquire, store, and query
provenance, in conjunction with using techniques to accumulate adequate
metadata, we can create an infrastructure that provides users with the
ability to interpret the analytical results and to repeat the analysis
with different assumptions, parameters, or data sets.
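
A result that carries its own provenance might look like the sketch
below, which stores the name of the analysis, the parameters, and a
fingerprint of the input next to the answer, and then repeats the
analysis under a different assumption. The helper names and the sample
analysis are hypothetical; only the standard hashlib and json modules are
used.

```python
import hashlib
import json


def run_with_provenance(analysis, data, **params):
    """Run analysis(data, **params) and return the result with provenance."""
    # Fingerprint the input so a reader can confirm which data set was used.
    payload = json.dumps(data, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()
    return {
        "result": analysis(data, **params),
        "provenance": {
            "analysis": analysis.__name__,
            "params": params,
            "input_sha256": digest,
        },
    }


def mean_above(data, threshold):
    """Hypothetical analysis: mean of the values above a threshold."""
    kept = [x for x in data if x > threshold]
    return sum(kept) / len(kept) if kept else None


data = [1, 5, 9]
first = run_with_provenance(mean_above, data, threshold=4)

# Repeat the same analysis under a different assumption by starting from
# the recorded parameters and changing only the threshold.
changed = dict(first["provenance"]["params"], threshold=0)
second = run_with_provenance(mean_above, data, **changed)

print(first["result"], second["result"])
```

Because both runs record the same input fingerprint, a reader can see
that the two answers differ only because of the changed parameter.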

BIG DATA VISUALIZATION

Systems that offer a rich palette of visualizations are important in
conveying to the users the results of the queries, using a representation
that best illustrates how data are interpreted in a particular situation.
In the past, BI systems users were normally offered tabular content
consisting of numbers and had to visualize the data relationships
themselves. However, the complexity of Big Data makes that difficult, and
graphical representations of analyzed data sets are more informative and
easier to understand.

It is usually easier for a multitude of users to collaborate on the
analytical results when they are presented in a graphical form, simply
because interpretation is removed from the formula and the users are
shown the results. Today's analysts need to present results in powerful
visualizations that assist interpretation and support user collaboration.

These visualizations should be based on interactive sources that allow
the users to click and redefine the presented elements, creating a
constructive environment where theories can be played out and other
hidden elements can be brought forward. Ideally, the interface will allow
visualizations to be affected by what-if scenarios or filtered by other
related information, such as date ranges, geographical locations, or
statistical queries.

Furthermore, with a few clicks the user should be able to go deeper into
each piece of data and understand its provenance, which is a key feature
to understanding the data. Users need to be able to not only see the
results but also understand why they are seeing those results.

Raw provenance, particularly regarding the phases in the analytics
process, is likely to be too technical for many users to grasp
completely. One alternative is to enable the users to play with the steps
in the analysis: make small changes to the process, for example, or
modify values for some parameters. The users can then view the results of
these incremental changes. By these means, the users can develop an
intuitive feeling for the analysis and also verify that it performs as
expected in corner cases, those that occur outside normal circumstances.
Accomplishing this requires the system to provide convenient facilities
for the user to specify analyses.

BIG DATA PRIVACY

Data privacy is another huge concern, one that grows as such privacy is
weighed against the power of Big Data. For electronic health records,
there are strict laws governing what can and cannot be done. For other
data, regulations, particularly in the United States, are less forceful.
However, there is great public fear about the inappropriate use of
personal data, particularly through the linking of data from multiple
sources. Managing privacy is effectively both a technical and a
sociological problem, and it must be addressed jointly from both
perspectives to realize the promise of Big Data.

Take, for example, the data gleaned from location-based services. A
situation in which new architectures require a user to share his
or her location with the service provider results in obvious privacy
concerns. Hiding the user's identity alone without hiding the location
would not properly address these privacy concerns.

An attacker or a (potentially malicious) location-based server can infer
the identity of the query source from its location information. For
example, a user's location information can be tracked through several
stationary connection points (e.g., cell towers). After a while, the user
leaves a metaphorical trail of bread crumbs that leads to a certain
residence or office location and can thereby be used to determine the
user's identity.

Several other types of private information, such as health issues (e.g.,
presence in a cancer treatment center) or religious preferences (e.g.,
presence in a church), can also be revealed by just observing anonymous
users' movement and usage pattern over time.
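
How little is needed for such an inference can be seen in a short sketch.
It assumes a trace of (hour of day, tower identifier) pairs recorded
under a pseudonym and simply reports the tower seen most often overnight;
the tower identifiers, the night window, and the function name are
invented for illustration.

```python
from collections import Counter


def likely_home_tower(pings, night_start=22, night_end=6):
    """Return the tower an anonymous device uses most often overnight.

    pings: iterable of (hour_of_day, tower_id) pairs for one pseudonym.
    Returns None when the trace has no overnight pings.
    """
    night = [tower for hour, tower in pings
             if hour >= night_start or hour < night_end]
    if not night:
        return None
    return Counter(night).most_common(1)[0][0]


# Hypothetical trace for one pseudonymous user; no name appears anywhere.
trace = [(23, "T-17"), (2, "T-17"), (9, "T-42"), (13, "T-42"),
         (15, "T-42"), (22, "T-17"), (5, "T-17"), (19, "T-08"), (1, "T-03")]
print(likely_home_tower(trace))
```

Once a probable home tower is known, matching it against an address list
is a separate and often easy step, which is why removing the name alone
does not protect the user.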

Furthermore, with the current platforms in use, it is more difficult to
hide a user's location than to hide his or her identity. This is a result
of how location-based services interact with the user. The location of
the user is needed for successful data access or data collection, but the
identity of the user is not necessary.

There are many additional challenging research problems, such as defining
the ability to share private data while limiting disclosure and ensuring
sufficient data utility in the shared data. The existing methodology of
differential privacy is an important step in the right direction, but it
unfortunately cripples the data payload too severely to be useful in most
practical cases.
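
The trade-off between privacy and utility can be made concrete with the
Laplace mechanism, a standard way of releasing a count under differential
privacy. The sketch below is a minimal illustration using only the
standard library; the function names, the true count of 1000, and the
epsilon values are assumptions chosen for the example.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng=random):
    """Release a count under epsilon-differential privacy.

    A counting query changes by at most 1 when one person is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
for epsilon in (1.0, 0.1, 0.01):
    print(epsilon, round(private_count(1000, epsilon, rng), 1))
```

A smaller epsilon means a stronger privacy guarantee and a larger noise
scale, so the released number drifts further from the true count. That
loss of accuracy is the utility cost described above.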

Real-world data are not static in nature; they get larger and change over
time, rendering the prevailing techniques almost useless, since useful
content is not revealed in any measurable amount for future analytics.
This requires a rethinking of how security for information sharing is
defined for Big Data use cases. Many online services today require us to
share private information (think of Facebook applications), but beyond
record-level access control we do not understand what it means to share
data, how the shared data can be linked, and how to give users
fine-grained control over this sharing.

Those issues will have to be worked out to preserve user security while
still providing the most robust data set for Big Data analytics.
