
Big Data Analytics: Turning Big Data into Big Money

By Frank Ohlhorst
Copyright 2013 by John Wiley & Sons, Inc.

CHAPTER

Big Data Sources

One of the biggest challenges for most organizations is finding data
sources to use as part of their analytics processes. As the name
implies, Big Data is large, but size is not the only concern. There
are several other considerations when deciding how to locate and
parse Big Data sets.

The first step is to identify usable data. While that may be obvious,
it is anything but simple. Locating the appropriate data to push through
an analytics platform can be complex and frustrating. The source must
be considered to determine whether the data set is appropriate for use.
That translates into detective work or investigative reporting.
Considerations should include the following:

Structure of the data (structured, unstructured, semistructured, table based, proprietary)
Source of the data (internal, external, private, public)
Value of the data (generic, unique, specialized)
Quality of the data (verified, static, streaming)
Storage of the data (remotely accessed, shared, dedicated platforms, portable)
Relationship of the data (superset, subset, correlated)

All of those elements and many others can affect the selection
process and can have a dramatic effect on how the raw data are
prepared ("scrubbed") before the analytics process takes place.


c05 22 October 2012; 17:57:16



In the IT realm, once a data source is located, the next step is to
import the data into an appropriate platform. That process may be as
simple as copying data onto a Hadoop cluster or as complicated as
scrubbing, indexing, and importing the data into a large SQL-type
table. That importation, or gathering of the data, is only one step in a
multistep, sometimes complex process.
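The import-and-index step can be pictured with a small sketch. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for a "large SQL-type table," and the CSV columns (day, store, amount) and the function name import_sales are hypothetical, not taken from any particular platform.

```python
import csv
import io
import sqlite3

# Hypothetical export of internal sales data; column names are illustrative.
CSV_TEXT = """day,store,amount
2012-10-01,store-01,19.99
2012-10-01,store-02,7.50
2012-10-02,store-01,5.00
"""

def import_sales(csv_text):
    """Import CSV rows into an in-memory SQLite table and index them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (day TEXT, store TEXT, amount REAL)")
    reader = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO sales VALUES (:day, :store, :amount)",
        ({"day": r["day"], "store": r["store"], "amount": float(r["amount"])}
         for r in reader),
    )
    # The index speeds up later per-store queries.
    conn.execute("CREATE INDEX idx_sales_store ON sales (store)")
    conn.commit()
    return conn

conn = import_sales(CSV_TEXT)
rows = conn.execute(
    "SELECT store, COUNT(*) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(rows)
```

A production import would target a full database or cluster rather than an in-memory table, but the sequence of creating the table, loading the rows, and building an index would likely look similar.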

Once the importation (or real-time updating) has been performed,
templates and scripts can be designed to ease further data-gathering
chores. Once the process has been designed, it becomes easier to
execute in the future.

Building a Big Data set ultimately serves one strategic purpose: to
mine the data, or dig for something of value. Mining data involves a lot
more than just running algorithms against a particular data source.
Usually, the data first have to be imported into a platform that can deal
with the data in an appropriate fashion. This means the data have to be
transformed into something accessible, queryable, and relatable.
Mining starts with a mine or, in Big Data parlance, a platform.
Ultimately, to have any value, that platform must be populated with
usable information.

HUNTING FOR DATA

Finding data for Big Data analytics is part science, part investigative
work, and part assumption. Some of the most obvious sources for
data are electronic transactions, web site logs, and sensor
information. Any data the organization gathers while doing business are
included. The idea is to locate as many data sources as possible and
bring the data into an analytics platform. Additional data can be
gathered using network taps and data replication clients. Ideally,
the more data that can be captured, the more data there will be to
work with.

Finding the internal data is the easy part of Big Data. It gets more
complicated once data considered unrelated, external, or unstructured
are brought into the equation. With that in mind, the big question with
Big Data now is, "Where do I get the data from?" This is not easily
answered; it takes some research to separate the wheat from the chaff,
knowing that the chaff may have some value as well.


Setting out to build a Big Data warehouse takes a concentrated
effort to find the appropriate data. The first step is to determine what
Big Data analytics is going to be used for. For example, is the business
looking to analyze marketing trends, predict web traffic, gauge
customer satisfaction, or achieve some other lofty goal that can be
accomplished with the current technologies?

It is this knowledge that will determine where and how to gather
Big Data. Perhaps the best way to build such knowledge is to better
understand the business analytics (BA) and business intelligence (BI)
processes to determine how large-scale data sets can be used to interact
with internal data to garner actionable results.

SETTING THE GOAL

Every project usually starts out with a goal and with objectives to reach
that goal. Big Data analytics should be no different. However, defining
the goal can be a difficult process, especially when the goal is vague
and amounts to little more than something like "using the data better."
It is imperative to define the goal before hunting for data sources, and
in many cases, proven examples of success can be the foundation for
defining a goal.

Take, for example, a retail organization. The goal for Big Data
analytics may be to increase sales, a chore that spans several business
ideologies and departments, including marketing, pricing, inventory,
advertising, and customer relations. Once there is a goal in mind, the
next step is to define the objectives, the exact means by which to reach
the goal.

For a project such as the retail example, it will be necessary
to gather information from a multitude of sources, some internal
and others external. Some of the data may have to be purchased, and
some may be available in the public domain. The key is to start
with the internal, structured data first, such as sales logs, inventory
movement, registered transactions, customer information, pricing, and
supplier interactions.

Next come the unstructured data, such as call center and support
logs, customer feedback (perhaps e-mails and other communications),
surveys, and data gathered by sensors (store traffic, parking lot usage).
The list can include many other internally tracked elements; however,
it is critical to be aware of diminishing returns on investment with the
data sourced. In other words, some log information may not be worth
the effort to gather, because it will not affect the analytics outcome.

Finally, external data must be taken into account. There is a vast
wealth of external information that can be used to calculate everything
from customer sentiments to geopolitical issues. The data that make up
the public portion of the analytics process can come from government
entities, research companies, social networking sites, and a multitude
of other sources.

For example, a business may decide to mine Twitter, Facebook, the
U.S. census, weather information, traffic pattern information, and
news archives to build a complex source of rich data. Some controls
need to be in place, and that may even include scrubbing the data
before processing (i.e., removing spurious information or invalid
elements).
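As a rough illustration of what scrubbing can involve, the sketch below drops records that are incomplete, carry an unparseable date, or duplicate an earlier record. The record layout (a "text" field and a "timestamp" field in YYYY-MM-DD form) and the function name scrub are assumptions made for the example; real external feeds will have their own formats and validity rules.

```python
from datetime import datetime

def scrub(records):
    """Drop duplicate, incomplete, or unparseable records."""
    seen = set()
    clean = []
    for rec in records:
        text = (rec.get("text") or "").strip()
        stamp = rec.get("timestamp")
        if not text or not stamp:
            continue  # incomplete record
        try:
            when = datetime.strptime(stamp, "%Y-%m-%d")
        except ValueError:
            continue  # invalid date format
        key = (text.lower(), when.date())
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        clean.append({"text": text, "date": when.date()})
    return clean

# Hypothetical external records (for instance, collected customer comments).
raw = [
    {"text": "Great service!", "timestamp": "2012-10-01"},
    {"text": "great service! ", "timestamp": "2012-10-01"},     # duplicate
    {"text": "", "timestamp": "2012-10-02"},                     # empty text
    {"text": "Long checkout line", "timestamp": "10/03/2012"},  # bad date
    {"text": "Parking lot was full", "timestamp": "2012-10-03"},
]
cleaned = scrub(raw)
print(len(cleaned))
```

Whether a rejected record should be discarded or set aside for review is a business decision; this sketch simply discards it.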

The richness of the data is the basis for predictive analytics. A
company looking to increase sales may compare population trends,
along with social sentiment, to customer feedback and satisfaction to
identify where the sales process could be improved. The data
warehouse can be used for much more after the initial processing, and
real-time data could also be integrated to identify trends as they arise.

The retail situation is only one example; there are dozens of others,
each of which may have a specific applicability to the task at hand.

BIG DATA SOURCES GROWING

Multiple sources are responsible for a growth in data that is
applicable to Big Data technology. Some of these sources represent
entirely new data sources, while others are a change in the resolution
of the existing data generated. Much of that growth can be attributed to
industry digitization of content.

With companies now turning to creating digital representations of
existing data and acquiring everything that is new, data growth rates
over the last few years have been nearly infinite, simply because most
of the businesses involved started from zero.


Many industries fall under the umbrella of new data creation and
digitization of existing data, and most are becoming appropriate
sources for Big Data resources. Those industries include the following:

Sensor data are being generated at an accelerating rate from fleet GPS
transceivers, RFID (radio-frequency identification) tag readers, smart
meters, and cell phones (call data records); these data are used to
optimize operations and drive operational BI to realize immediate
business opportunities.

The health care industry is quickly moving to electronic medical
records and images, which it wants to use for short-term public health
monitoring and long-term epidemiological research programs.

Many government agencies are digitizing public records, such as census
information, energy usage, budgets, Freedom of Information Act
documents, electoral data, and law enforcement reporting.

The entertainment industry has moved to digital recording, production,
and delivery in the past five years and is now collecting large amounts
of rich content and user viewing behaviors.

Low-cost gene sequencing (less than $1,000) can generate tens of
terabytes of information that must be analyzed to look for genetic
variations and potential treatment effectiveness.

Video surveillance is still transitioning from closed-circuit
television to Internet protocol television cameras and recording
systems that organizations want to analyze for behavioral patterns
(security and service enhancement).

For many businesses, the additional data can come from self-service
marketplaces, which record the use of affinity cards and track the sites
visited, and can be combined with social networks and location-based
metadata. This creates a goldmine of actionable consumer data for
retailers, distributors, and manufacturers of consumer packaged goods.

The legal profession is adding to the multitude of data sources,
thanks to the discovery process, which is dealing more frequently with
electronic records and requiring the digitization of paper documents
for faster indexing and improved access. Today, leading e-discovery
companies are handling terabytes or even petabytes of information
that need to be retained and reanalyzed for the full course of a legal
proceeding.

Additional information and large data sets can be found on social
media sites such as Facebook, Foursquare, and Twitter. A number of
new businesses are now building Big Data environments, based on
scale-out clusters using power-efficient multicore processors, that
leverage consumers' (conscious or unconscious) nearly continuous
streams of data about themselves (e.g., likes, locations, and opinions).

Thanks to the network effect of successful sites, the total data
generated can expand at an exponential rate. Some companies have
collected and analyzed over 4 billion data points (e.g., web site
cut-and-paste operations) since information collection started, and
within a year the process has expanded to 20 billion data points
gathered.

DIVING DEEPER INTO BIG DATA SOURCES

A change in resolution is further driving the expansion of Big Data.
Here additional data points are gathered from existing systems or
with the installation of new sensors that deliver more pieces of
information. Some examples of increased resolution can be found
in the following areas:

Thanks to the consolidation of global trading environments and the
increased use of programmed trading, the volume of transactions being
collected and analyzed is doubling or tripling. Transaction volumes also
fluctuate much faster, much wider, and much more unpredictably.
Competition among firms is creating more data, simply because
sampling for trading decisions is occurring more frequently and
at faster intervals.

The use of smart meters in energy grid systems, which shifts meter
readings from monthly to every 15 minutes, can translate into a
multithousandfold increase in data generated. Smart meter technology
extends beyond just power usage and can measure heating, cooling, and
other loads, which can be used as an indicator of household size at any
given moment.

With the advances in smartphones and connected PDAs, the primary data
generated from these devices have grown beyond caller, receiver, and
call length. Additional data are now being harvested at exponential
rates, including elements such as geographic location, text messages,
browsing history, and (thanks to the addition of accelerometers) even
motions, as well as social network posts and application use.
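
The scale of the smart meter change is easy to check with a little arithmetic. The sketch below assumes a 30-day month and one reading per interval; the function name readings_per_month is introduced here only for illustration.

```python
MINUTES_PER_DAY = 24 * 60

def readings_per_month(interval_minutes, days=30):
    """Number of meter readings in a month at a fixed reading interval."""
    return days * MINUTES_PER_DAY // interval_minutes

monthly = 1                             # one reading per month
every_15_min = readings_per_month(15)   # 30 days * 96 readings per day
increase = every_15_min // monthly
print(every_15_min, increase)
```

Under those assumptions a single meter goes from 1 reading to 2,880 readings per month, which is where a multithousandfold increase comes from before any additional measurements (heating, cooling, other loads) are counted.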

A WEALTH OF PUBLIC INFORMATION

For those looking to sample what is available for Big Data analytics,
a vast amount of data exists on the Web; some of it is free, and some
of it is available for a fee. Much of it is simply there for the taking.

If your goal is to start gathering data, it is pretty hard to beat many of
the tools that are readily available on the market. For those looking for
point-and-click simplicity, Extractiv (http://www.extractiv.com) and
Mozenda (http://www.mozenda.com) offer the ability to acquire data
from multiple sources and to search the Web for information.

Another candidate for processing data on the Web is Google Refine
(http://code.google.com/p/google-refine), a tool set that can work
with messy data, cleaning them up and then transforming them into
different formats for analytics. 80Legs (http://www.80legs.com)
specializes in gathering data from social networking sites as well as
retail and business directories.

The tools just mentioned are excellent examples for mining data
from the Web to transform them into a Big Data analytics platform.
However, gathering data is only the first of many steps. To garner
value from the data, they must be analyzed and, better yet, visualized.
Tools such as Grep (http://www.linfo.org/grep.html), Turk
(http://www.mturk.com), and BigSheets
(http://www-01.ibm.com/software/ebusiness/jstart/bigsheets) offer the
ability to analyze data. For visualization, analysts can turn to tools
such as Tableau Public (http://www.tableausoftware.com), OpenHeatMap
(http://www.openheatmap.com), and Gephi (http://www.gephi.org).

Beyond the use of discovery tools, Big Data can be found through
services and sites such as CrunchBase, the U.S. census, InfoChimps,
Kaggle, Freebase, and Timetric. Many other services offer data sets
directly for integration into Big Data processing.

The prices of some of these services are rather reasonable. For
example, you can download a million Web pages through 80Legs for less
than three dollars. Some of the top data sets can be found on commercial
sites, yet for free. An example is the Common Crawl Corpus, which has
crawl data from about five billion Web pages and is available in the ARC
file format from Amazon S3. The Google Books Ngrams is another data
set that Amazon S3 makes available for free. The file is in a
Hadoop-friendly format. For those who may be wondering, n-grams are
fixed-size sequences of consecutive items. In this case, the items are
words extracted from the Google Books corpus. The n specifies the number
of elements in the sequence, so a five-gram contains five words or
characters.
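
For readers who want to see n-grams concretely, here is a minimal sketch that extracts word five-grams from a sentence. It is not the method used to build the Google Books data set; the helper name ngrams and the sample sentence are illustrative assumptions.

```python
def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "the quick brown fox jumps over the lazy dog".split()
five_grams = ngrams(words, 5)
print(five_grams[0])
print(len(five_grams))
```

A nine-word sentence yields five five-grams, the first being the opening five words. Passing a string instead of a list of words would produce character n-grams with the same function.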

Many more data sets are available from Amazon S3, and it is
definitely worth visiting http://aws.amazon.com/publicdatasets/ to
track these down. Another site to visit for a listing of public data sets is
http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public,
a treasure trove of links to data sets and information related
to those data sets.

GETTING STARTED WITH BIG DATA ACQUISITION

Barriers to Big Data adoption are generally cultural rather than
technological. In particular, many organizations fail to implement Big
Data programs because they are unable to appreciate how data analytics
can improve their core business. One of the most common triggers for Big
Data development is a data explosion that makes existing data sets
very large and increasingly difficult to manage with conventional
database management tools.

As these data sets grow in size (typically ranging from several
terabytes to multiple petabytes), businesses face the challenge of
capturing, managing, and analyzing the data in an acceptable time
frame. Getting started involves several steps, starting with training.
Training is a prerequisite for understanding the paradigm shift that Big
Data offers. Without that insider knowledge, it becomes difficult to
explain and communicate the value of data, especially when the data
are public in nature. Next on the list is the integration of development
and operations teams (known as DevOps), the people most likely to
deal with the burdens of storing and transforming the data into
something usable.

Much of the process of moving forward will lie with the business
executives and decision makers, who will also need to be brought up
to speed on the value of Big Data. The advantages must be explained in
a fashion that makes sense to the business operations, which in turn
means that IT pros are going to have to do some legwork. To get
started, it proves helpful to pursue a few ideologies:

Identify a problem that business leaders can understand and
relate to and that commands their attention.

Do not focus exclusively on the technical data management
challenge. Be sure to allocate resources to understand the uses
for the data within the business.

Define the questions that must be answered to meet the business
objective, and only then focus on discovering the necessary data.

Understand the tools available to merge the data and the business
process so that the result of the data analysis is more actionable.

Build a scalable infrastructure that can handle growth of the
data. Good analysis requires enough computing power to pull in
and analyze data. Many people get discouraged because when
they start the analytic process, it is slow and laborious.

Identify technologies that you can trust. A dizzying variety of
open source Big Data software technologies are available, and
many are likely to disappear within a few years. Find one that
has professional vendor support, or be prepared to take on
permanent maintenance of the technology as well as the
solution in the long run. Hadoop seems to be attracting a lot of
mainstream vendor support.

Choose a technology that fits the problem. Hadoop is best for
large but relatively simple data set filtering, converting, sorting,
and analysis. It is also good for sifting through large volumes of
text. It is not really useful for ongoing persistent data
management, especially if structural consistency and transactional
integrity are required.

Be aware of changing data formats and changing data needs.
For instance, a common problem faced by organizations seeking
to use BI solutions to manage marketing campaigns is that those
campaigns can be very specifically focused, requiring an analysis
of data structures that may be in play for only a month or two.
Using conventional relational database management system
techniques, it can take several weeks for database administrators
to get a data warehouse ready to accept the changed
data, by which time the campaign is nearly over. A MapReduce
solution, such as one built on a Hadoop framework, can reduce
those weeks to a day or two. Thus it is not just volume but also
variety that can drive Big Data adoption.
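
To give a feel for why a MapReduce approach copes well with simple filtering and counting over loosely structured text, the sketch below imitates the map, shuffle, and reduce phases in plain Python on a single machine. It does not use Hadoop or any Hadoop API; the function names and the sample campaign log lines are hypothetical and chosen only to show the shape of the technique.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all values for one key."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical campaign log lines.
lines = ["click email click", "email open", "click"]
counts = word_count(lines)
print(counts)
```

Because the map step needs no predefined schema, a change in the incoming format usually means editing one small function rather than redesigning a warehouse table, which is the kind of flexibility the campaign example above relies on.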

ONGOING GROWTH, NO END IN SIGHT

Data creation is occurring at a record rate. In fact, the research firm
IDC's Digital Universe Study predicts that between 2009 and 2020,
digital data will grow 44-fold, to 35 zettabytes per year. It is also
important to recognize that much of this data explosion is the result of
an explosion in devices located at the periphery of the network,
including embedded sensors, smartphones, and tablet computers. All
of these data create new opportunities for data analytics in human
genomics, health care, oil and gas, search, surveillance, finance, and
many other areas.
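
The growth figures quoted above can be unpacked with simple arithmetic. The sketch below assumes steady compound growth across the 11 years from 2009 to 2020, which is a simplification the forecast itself does not state; the function names are illustrative.

```python
def implied_baseline(final_value, fold):
    """Starting value implied by a final value and a growth multiple."""
    return final_value / fold

def annual_growth_rate(fold, years):
    """Constant yearly growth rate that compounds to the given multiple."""
    return fold ** (1 / years) - 1

baseline_zb = implied_baseline(35, 44)            # zettabytes in 2009
rate = annual_growth_rate(44, 2020 - 2009)        # fraction per year
print(round(baseline_zb, 2), round(rate * 100, 1))
```

Under that assumption, a 44-fold rise to 35 zettabytes implies a 2009 starting point of roughly 0.8 zettabytes and growth of about 41 percent per year.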

