By Frank Ohlhorst
Copyright 2013 by John Wiley & Sons, Inc.
CHAPTER
Structure of the data (structured, unstructured, semistructured,
table based, proprietary)
Source of the data (internal, external, private, public)
Value of the data (generic, unique, specialized)
Quality of the data (verified, static, streaming)
Storage of the data (remotely accessed, shared, dedicated platforms,
portable)
Relationship of the data (superset, subset, correlated)
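
The six properties listed above can be recorded as a small catalog entry for each data source. The following is only a minimal sketch in Python: the class name DataSourceProfile, the DIMENSIONS table, and the labels chosen for the web site logs example are assumptions made for illustration, not part of any product or of the original text.

```python
from dataclasses import dataclass

# Allowed labels for each property, copied from the list above.
DIMENSIONS = {
    "structure": {"structured", "unstructured", "semistructured",
                  "table based", "proprietary"},
    "source": {"internal", "external", "private", "public"},
    "value": {"generic", "unique", "specialized"},
    "quality": {"verified", "static", "streaming"},
    "storage": {"remotely accessed", "shared", "dedicated platforms",
                "portable"},
    "relationship": {"superset", "subset", "correlated"},
}


@dataclass(frozen=True)
class DataSourceProfile:
    """Hypothetical catalog entry describing one data source."""
    name: str
    structure: str
    source: str
    value: str
    quality: str
    storage: str
    relationship: str

    def __post_init__(self):
        # Reject any label that is not in the list for its property.
        for dimension, allowed in DIMENSIONS.items():
            label = getattr(self, dimension)
            if label not in allowed:
                raise ValueError(
                    f"{dimension}={label!r} is not one of {sorted(allowed)}"
                )


# Example entry; the labels are illustrative guesses, not measured facts.
web_logs = DataSourceProfile(
    name="web site logs",
    structure="semistructured",
    source="internal",
    value="generic",
    quality="streaming",
    storage="shared",
    relationship="correlated",
)
```

Validating the labels when the entry is created keeps the catalog consistent, which makes it easier to filter sources later (for example, all external, unstructured sources).
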
Finding data for Big Data analytics is part science, part investigative
work, and part assumption. Some of the most obvious sources for data
are electronic transactions, web site logs, and sensor information. Any
data the organization gathers while doing business are included. The
idea is to locate as many data sources as possible and bring the data
into an analytics platform. Additional data can be gathered using
network taps and data replication clients. Ideally, the more data that
can be captured, the more data there will be to work with.
Finding the internal data is the easy part of Big Data. It gets more
complicated once data considered unrelated, external, or unstructured
are brought into the equation. With that in mind, the big question with
Big Data now is, Where do I get the data from? This is not easily
answered; it takes some research to separate the wheat from the chaff,
knowing that the chaff may have some value as well.
SETTING THE GOAL
The list can include many other internally tracked elements; however,
it is critical to be aware of diminishing returns on investment with the
data sourced. In other words, some log information may not be worth
the effort to gather, because it will not affect the analytics outcome.
Finally, external data must be taken into account. There is a vast
wealth of external information that can be used to calculate everything
from customer sentiments to geopolitical issues. The data that make up
the public portion of the analytics process can come from government
entities, research companies, social networking sites, and a multitude
of other sources.
For example, a business may decide to mine Twitter, Facebook, the
U.S. census, weather information, traffic pattern information, and
news archives to build a complex source of rich data. Some controls
need to be in place, and that may even include scrubbing the data
before processing (i.e., removing spurious information or invalid
elements).
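
Scrubbing of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular tool: the function name scrub, the field names user and text, and the sample records are all hypothetical. Here an "invalid element" is taken to be a record with a missing or empty required field, and "spurious information" is taken to be an exact duplicate of a record already kept.

```python
def scrub(records, required_fields):
    """Split records into (clean, rejected) lists.

    Assumes every record is a dict with string keys and hashable values.
    """
    clean, rejected = [], []
    seen = set()
    for record in records:
        # Normalize: strip surrounding whitespace from string values.
        normalized = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # Invalid element: a required field is missing or empty.
        if any(normalized.get(field) in (None, "") for field in required_fields):
            rejected.append(record)
            continue
        # Spurious information: duplicate of a record already kept.
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            rejected.append(record)
            continue
        seen.add(fingerprint)
        clean.append(normalized)
    return clean, rejected


# Hypothetical sample of social media posts.
sample = [
    {"user": " alice ", "text": "Great service"},
    {"user": "alice", "text": "Great service"},  # duplicate once normalized
    {"user": "", "text": "no author"},           # missing required field
    {"user": "bob", "text": "Slow delivery"},
]
clean, rejected = scrub(sample, required_fields=("user", "text"))
```

Returning the rejected records instead of silently dropping them keeps the controls auditable: the discarded data can be reviewed in case the chaff turns out to have value.
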
The richness of the data is the basis for predictive analytics. A
company looking to increase sales may compare population trends,
along with social sentiment, to customer feedback and satisfaction to
identify where the sales process could be improved. The data warehouse
can be used for much more after the initial processing, and real-time
data could also be integrated to identify trends as they arise.
The retail situation is only one example; there are dozens of others,
each of which may have a specific applicability to the task at hand.
Multiple sources are responsible for a growth in data that is
applicable to Big Data technology. Some of these sources represent
entirely new data sources, while others are a change in the resolution
of the existing data generated. Much of that growth can be attributed to
industry digitization of content.
With companies now turning to creating digital representations of
existing data and acquiring everything that is new, data growth rates
over the last few years have been nearly infinite, simply because most
of the businesses involved started from zero.
For those looking to sample what is available for Big Data analytics,
a vast amount of data exists on the Web; some of it is free, and some
of it is available for a fee. Much of it is simply there for the taking.
If your goal is to start gathering data, it is pretty hard to beat many of
the tools that are readily available on the market. For those looking for
point-and-click simplicity, Extractiv (http://www.extractiv.com) and
Mozenda (http://www.mozenda.com) offer the ability to acquire data
from multiple sources and to search the Web for information.
Another candidate for processing data on the Web is Google Refine
(http://code.google.com/p/google-refine), a tool set that can work
with messy data, cleaning them up and then transforming them into
different formats for analytics. 80Legs (http://www.80legs.com)
specializes in gathering data from social networking sites as well as
retail and business directories.
The tools just mentioned are excellent examples of ways to mine data
from the Web and bring them into a Big Data analytics platform.
However, gathering data is only the first of many steps. To garner
value from the data, they must be analyzed and, better yet, visualized.
Tools such as Grep (http://www.linfo.org/grep.html), Turk
(http://www.mturk.com), and BigSheets
(http://www-01.ibm.com/software/ebusiness/jstart/bigsheets) offer the
ability to analyze data. For visualization, analysts can turn to tools
such as Tableau Public (http://www.tableausoftware.com), OpenHeatMap
(http://www.openheatmap.com), and Gephi (http://www.gephi.org).
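
As a rough illustration of the kind of pattern-based analysis a tool like Grep performs, the sketch below filters web site log lines with a regular expression in Python. It is not the Grep utility itself; the helper name grep, the sample log lines, and the pattern are assumptions made for the example.

```python
import re


def grep(pattern, lines, ignore_case=False):
    """Return the lines that contain a match for the regular expression."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [line for line in lines if regex.search(line)]


# Hypothetical web server log lines, invented for this example.
log_lines = [
    '10.0.0.1 - - [01/Jan/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [01/Jan/2013:10:00:05 +0000] "GET /missing HTTP/1.1" 404 210',
    '10.0.0.1 - - [01/Jan/2013:10:00:09 +0000] "GET /old-page HTTP/1.1" 404 210',
]

# Keep only requests that returned HTTP status 404.
not_found = grep(r'" 404 ', log_lines)
```

A filter like this is often the first analysis step on log data: it reduces a large file to the handful of lines relevant to one question before any heavier processing or visualization.
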
Beyond the use of discovery tools, Big Data can be found through
services and sites such as CrunchBase, the U.S. census, InfoChimps,
Kaggle, Freebase, and Timetric. Many other services offer data sets
directly for integration into Big Data processing.
The prices of some of these services are rather reasonable. For
example, you can download a million Web pages through 80Legs for less
than three dollars. Some of the top data sets can be found on commercial
sites, yet for free. An example is the Common Crawl Corpus, which has
crawl data from about five billion Web pages and is available in the ARC
file format from Amazon S3. The Google Books Ngrams is another data
set that Amazon S3 makes available for free. The file is in a
Hadoop-friendly format. For those who may be wondering, n-grams are
fixed-size sequences of consecutive items. In this case, the items are
words extracted from the Google Books corpus. The n specifies the number
of elements in the sequence, so a five-gram contains five words or
characters.
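
The idea of an n-gram can be made concrete with a short Python sketch. The function name ngrams and the sample sentence are assumptions chosen for illustration; the code does not read the Google Books Ngrams files themselves, it only shows how n-grams are formed from a sequence of items.

```python
def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n across the sequence, one step at a time.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word n-grams: the items are words.
words = "the quick brown fox jumps over the lazy dog".split()
five_grams = ngrams(words, 5)

# Character n-grams: the items are single characters.
char_bigrams = ngrams("data", 2)
```

A sequence shorter than n yields an empty list, and a sequence of k items yields k - n + 1 n-grams, so the nine-word sentence above produces five five-grams.
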
Many more data sets are available from Amazon S3, and it is
definitely worth visiting http://aws.amazon.com/publicdatasets/ to
track these down. Another site to visit for a listing of public data sets is
http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public,
a treasure trove of links to data sets and information related
to those data sets.
Barriers to Big Data adoption are generally cultural rather than
technological. In particular, many organizations fail to implement Big Data
programs because they are unable to appreciate how data analytics can
improve their core business. One of the most common triggers for Big
Data development is a data explosion that makes existing data sets
very large and increasingly difficult to manage with conventional
database management tools.
As these data sets grow in size, typically ranging from several
terabytes to multiple petabytes, businesses face the challenge of