Georg Krempl, University Magdeburg, Germany, georg.krempl@iti.cs.uni-magdeburg.de
Indre Zliobaite, Aalto University and HIIT, Finland, indre.zliobaite@aalto.fi
Dariusz Brzezinski, Poznan U. of Technology, Poland, dariusz.brzezinski@cs.put.poznan.pl
Eyke Hullermeier, University of Paderborn, Germany, eyke@upb.de
Mark Last, Ben-Gurion U. of the Negev, Israel, mlast@bgu.ac.il
Vincent Lemaire, Orange Labs, France, vincent.lemaire@orange.com
Tino Noack, TU Cottbus, Germany, noacktin@tu-cottbus.de
Ammar Shaker, University of Paderborn, Germany, ammar.shaker@upb.de
Sonja Sievi, Astrium Space Transportation, Germany, sonja.sievi@astrium.eads.net
Myra Spiliopoulou, University Magdeburg, Germany, myra@iti.cs.uni-magdeburg.de
Jerzy Stefanowski, Poznan U. of Technology, Poland, jerzy.stefanowski@cs.put.poznan.pl
¹ http://sites.google.com/site/realstream2013
tion. Examples are subpopulations that can be identified and show distinct, trackable evolutionary patterns. In the latter case, no such patterns exist and drift occurs seemingly at random. An example of the latter is fickle concept drift.

Stream mining approaches in general address the challenges posed by the volume, velocity and volatility of data. However, in real-world applications these three challenges often coincide with other, to date insufficiently considered ones.

The next sections discuss eight identified challenges for data stream mining, providing illustrations with real-world application examples, and formulating suggestions for forthcoming research.

Figure 1: CRISP cycle with data stream research challenges.
ABSTRACT

Tumblr, one of the most popular microblogging platforms, has gained momentum recently. It is reported to have 166.4 million users and 73.4 billion posts as of January 2014. While many articles about Tumblr have been published in the major press, there has been little scholarly work so far. In this paper, we provide a pioneering analysis of Tumblr from a variety of aspects. We study the social network structure among Tumblr users, analyze its user-generated content, and describe reblogging patterns to analyze its user behavior. We aim to provide a comprehensive statistical overview of Tumblr and compare it with other popular social services, including the blogosphere, Twitter and Facebook, in answering a couple of key questions: What is Tumblr? How is Tumblr different from other social media networks? In short, we find that Tumblr has richer content than other microblogging platforms, and that it contains hybrid characteristics of social networking, the traditional blogosphere, and social media. This work serves as an early snapshot of Tumblr that later work can leverage.

1. INTRODUCTION

Tumblr, one of the most prevalent microblogging sites, has become phenomenal in recent years, and it was acquired by Yahoo! in 2013. By mid-January 2014, Tumblr had 166.4 million users and 73.4 billion posts¹. It is reported to be the most popular social site among the young generation, as half of Tumblr's visitors are under 25 years old². Tumblr is ranked as the 16th most popular site in the United States, where it is the 2nd most dominant blogging site, the 2nd largest microblogging service, and the 5th most prevalent social site³. In contrast to the momentum Tumblr has gained in the recent press, little academic research has been conducted on this burgeoning social service. Naturally, questions arise: What is Tumblr? What is the difference between Tumblr and other blogging or social media sites?

Traditional blogging sites, such as Blogspot⁴ and Live Journal⁵, have high-quality content but little social interaction. Nardi et al. [17] investigated blogging as a form of personal communication and expression, and showed that the vast majority of blog posts are written by ordinary people with a small audience. On the contrary, popular social networking sites like Facebook⁶ have richer social interactions but lower-quality content compared with the blogosphere. Since most social interactions are either unpublished or less meaningful for the majority of the public audience, it is natural for Facebook users to form different communities or social circles. Microblogging services, which sit between traditional blogging and online social networking services, have intermediate-quality content and intermediate social interactions. Twitter⁷, which is the largest microblogging site, limits each post to 140 characters, and the Twitter following relationship is not reciprocal: a Twitter user does not need to follow back if the user is followed by another. As a result, Twitter is considered a new social medium [11], and short messages can be broadcast to a Twitter user's followers in real time.

Tumblr is also positioned as a microblogging platform. Tumblr users can follow another user without being followed back, which forms a non-reciprocal social network; a Tumblr post can be re-broadcast by a user to his or her own followers via reblogging. But unlike Twitter, Tumblr has no length limit for each post, and Tumblr also supports multimedia posts, such as images, audio or video. With these differences in mind, are the social network, user-generated content, or user behavior on Tumblr dramatically different from other social media sites?

In this paper, we provide a statistical overview of Tumblr from assorted aspects. We study the social network structure among Tumblr users and compare its network properties with those of other commonly used networks. Meanwhile, we study content generated in Tumblr and examine the content generation patterns. One step further, we also analyze how a blog post is reblogged and propagated through a network, both topologically and temporally. Our study shows that Tumblr provides hybrid microblogging services: it contains dual characteristics of both social media and traditional blogging. Meanwhile, surprising patterns surface. We describe these intriguing findings and provide insights, which hopefully can be leveraged by other researchers to understand more about this new form of social media.

2. TUMBLR AT FIRST SIGHT

Tumblr is ranked the second largest microblogging service, right after Twitter, with over 166.4 million users and 73.4 billion posts by January 2014. Registration is easy: one can sign up for the Tumblr service with a valid email address within 30 seconds. Once signed in to Tumblr, a user can follow other users. Unlike on Facebook, connections in Tumblr do not require mutual confirmation. Hence the social network in Tumblr is unidirectional.

¹ http://www.tumblr.com/about
² http://www.webcitation.org/64UXrbl8H
³ http://www.alexa.com/topsites/countries/US
⁴ http://blogspot.com
⁵ http://livejournal.com
⁶ http://facebook.com
⁷ http://twitter.com
edges. Within this social graph, 41.40% of nodes have 0 in-degree, and the maximum in-degree of a node is 4.06 million. By contrast, 12.74% of nodes have 0 out-degree, and the maximum out-degree of a node is 155.5k. The most popular Tumblr users include equipo¹¹, instagram¹², and woodendreams¹³. This indicates the media characteristic of Tumblr: the most popular user has an audience of more than 4 million, while more than 40% of users are purely audience since they don't have any followers.

Figure 3(a) shows the distribution of in-degrees as the blue curve and that of out-degrees as the red curve, where the y-axis is the complementary cumulative distribution function (CCDF): the probability that an account has at least k in-degrees or out-degrees, i.e., P(K >= k). It is observed that Tumblr users' in-degree follows a power-law distribution with exponent 2.19, which is quite similar to the power-law exponent of Twitter at 2.28 [11] or that of traditional blogs at 2.38 [21]. This also agrees with the earlier empirical observation that most social networks have a power-law exponent between 2 and 3 [6].

With regard to the out-degree distribution, we notice that the red curve has a big drop when out-degree is around 5000, since there was a limit that ordinary Tumblr users can follow at most 5000 other users. Tumblr users' out-degree does not follow a power-law distribution, which is similar to the blogosphere of traditional blogging [21].

If we explore users' in-degree and out-degree together, we can generate the normalized 3-D histogram in Figure 3(b). As both in-degree and out-degree follow heavy-tailed distributions, we zoom in only on those users who have fewer than 2^10 in-degrees and out-degrees. There is a positive correlation between in-degree and out-degree, as shown by the dominance of the diagonal bars. In aggregate, a user with low in-degree tends to have low out-degree as well, even though some nodes, especially the most popular ones, have very imbalanced in-degree and out-degree.

Reciprocity. Since Tumblr is a directed network, we would like to examine the reciprocity of the graph. We derive the backbone of the Tumblr network by keeping only the reciprocal connections, i.e., user a follows b and vice versa. Let r-graph denote the corresponding reciprocal graph. We found that 29.03% of Tumblr user pairs have a reciprocal relationship, which is higher than the 22.1% reciprocity on Twitter [11] and the 3% reciprocity on the blogosphere [21], indicating a stronger interaction between users in the network. Figure 3(c) shows the distribution of degrees in the r-graph. There is a turning point due to the Tumblr limit of 5000 followees for ordinary users. The degree distribution of reciprocal relationships on Tumblr does not follow a power law, since the curve is mostly convex, similar to the pattern reported for Facebook [22].

Meanwhile, it has been observed that one's degree is correlated with the degree of one's friends. This is also called degree correlation or degree assortativity [18; 19]. Over the derived r-graph, we obtain a correlation of 0.106 between the terminal nodes of reciprocal connections, reconfirming the positive degree assortativity reported for Twitter [11]. Nevertheless, compared with the strong social network Facebook, Tumblr's degree assortativity is weaker (0.106 vs. 0.226).

Degree of Separation. The small-world phenomenon is almost universal among social networks. With this huge Tumblr network, we are able to validate the well-known six degrees of separation as well. Figure 4 displays the distribution of the shortest paths in the network. To approximate the distribution, we randomly sample 60,000 nodes as seeds and calculate for each seed the shortest paths to other nodes. It is observed that the distribution of path lengths reaches its mode, with the highest probability, at 4 hops, and has a median of 5 hops. On average, the distance between two connected nodes is 4.7. Even though the longest shortest path in the approximation has 29 hops, 90% of shortest paths are within 5.4 hops. All these numbers are close to those reported for Facebook and Twitter, yet significantly smaller than those obtained over the blogosphere and instant messenger networks [13].

Component Size. The previous result shows that those users who are connected have a small average distance. It relies on the assumption that most users are connected to each other, which we now confirm. Because the Tumblr graph is directed, we compute all weakly connected components by ignoring the direction of edges. It turns out that the giant connected component (GCC) encompasses 99.61% of nodes in the graph. Over the derived r-graph, 97.55% of nodes reside in the corresponding GCC. This finding suggests that the whole graph is almost one connected component, and almost all users can reach others through just a few hops.

To give a concrete picture, we summarize commonly used network statistics in Table 1. The numbers from other popular social networks (blogosphere, Twitter, Facebook, and MSN) are also included for comparison. From this compact view, it is clear that traditional blogs yield a significantly different network structure. Tumblr, even though originally proposed for blogging, yields a network structure that is more similar to Twitter and Facebook.

¹¹ http://equipo.tumblr.com
¹² http://instagram.tumblr.com
¹³ http://woodendreams.tumblr.com
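The degree and reciprocity statistics above can be computed from any directed follower edge list. The following is a minimal sketch and not the authors' code: the function names (`degree_counts`, `ccdf`, `reciprocity`) and the toy edge list are assumptions made for illustration. Reciprocity is computed here as the fraction of connected user pairs that follow each other, which is one possible reading of "user pairs" in the text.

```python
from collections import Counter


def degree_counts(edges, nodes):
    """Return (in_degree, out_degree) dicts for a directed edge list.

    An edge (a, b) means "a follows b": b gains one in-degree
    (a follower) and a gains one out-degree (a followee).
    """
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for a, b in edges:
        out_deg[a] += 1
        in_deg[b] += 1
    return in_deg, out_deg


def ccdf(degrees):
    """Map each observed degree k to the empirical P(K >= k)."""
    n = len(degrees)
    hist = Counter(degrees)
    result = {}
    at_least = n  # number of nodes with degree >= current k
    for k in sorted(hist):
        result[k] = at_least / n
        at_least -= hist[k]
    return result


def reciprocity(edges):
    """Fraction of connected user pairs whose follow link is mutual."""
    edge_set = set(edges)
    # Unordered pairs joined by at least one edge (self-loops ignored).
    pairs = {frozenset(e) for e in edge_set if e[0] != e[1]}
    mutual = 0
    for pair in pairs:
        a, b = tuple(pair)
        if (a, b) in edge_set and (b, a) in edge_set:
            mutual += 1
    return mutual / len(pairs) if pairs else 0.0


# Hypothetical toy graph: a and b follow each other, a follows c, d follows a.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")]
toy_in, toy_out = degree_counts(toy_edges, toy_nodes)
print(ccdf(list(toy_in.values())))
print(reciprocity(toy_edges))
```

On real data the CCDF would normally be plotted on log-log axes, as in Figure 3(a), where a power law appears as a roughly straight line.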
Figure 3: (a) in/out degree distribution (CCDF versus InDegree or OutDegree); (b) in/out degree correlation (Percentage of Users, with InDegree = 2^X and OutDegree = 2^Y); (c) degree distribution in r-graph (CCDF versus InDegree, same as OutDegree)
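The degree-of-separation and component-size measurements described in the text can be approximated along the following lines. This is a hedged sketch rather than the authors' implementation: `bfs_distances`, `sampled_path_lengths`, `giant_component_fraction`, the adjacency-dict input format and the toy graph are all assumptions. In particular, the breadth-first search below follows edges in their follow direction, whereas the text does not say how edge direction was treated when computing shortest paths.

```python
import random
from collections import Counter, deque


def bfs_distances(adj, source):
    """Hop counts from source to every node reachable along the edges in adj."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def sampled_path_lengths(adj, num_seeds, rng):
    """Histogram of shortest-path lengths from randomly sampled seed nodes.

    Assumes every node appears as a key of adj (possibly with no followees).
    """
    nodes = sorted(adj)
    seeds = rng.sample(nodes, min(num_seeds, len(nodes)))
    hist = Counter()
    for seed in seeds:
        for target, hops in bfs_distances(adj, seed).items():
            if target != seed:
                hist[hops] += 1
    return hist


def giant_component_fraction(adj):
    """Share of nodes in the largest weakly connected component."""
    # Ignore edge direction by adding every edge in both directions.
    undirected = {u: set() for u in adj}
    for u, followees in adj.items():
        for v in followees:
            undirected.setdefault(u, set()).add(v)
            undirected.setdefault(v, set()).add(u)
    seen = set()
    largest = 0
    for start in undirected:
        if start in seen:
            continue
        component = bfs_distances(undirected, start)
        seen.update(component)
        largest = max(largest, len(component))
    return largest / len(undirected) if undirected else 0.0


# Hypothetical toy graph: a -> b -> c -> a, d -> a, and an isolated node e.
toy_adj = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"], "e": []}
toy_hist = sampled_path_lengths(toy_adj, 5, random.Random(0))
total = sum(toy_hist.values())
print({hops: count / total for hops, count in sorted(toy_hist.items())})
print(giant_component_fraction(toy_adj))
```

Sampling seeds rather than running a search from every node is what makes the estimate tractable on a graph of this size; the mode, median and mean path length can then be read off the resulting histogram.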
Figure 4: Shortest Path Distribution (PDF and CDF versus Shortest Path Length)

4. TUMBLR AS BLOGOSPHERE FOR CONTENT GENERATION

As Tumblr was initially proposed for the purpose of blogging, here we analyze its user-generated content. As described earlier, photo and text posts account for more than 92% of total posts. Hence, we concentrate only on these two types of posts. A text post may contain URLs, quotes or a raw message. In this study, we are mainly interested in the authentic content generated by users. Hence, we extract raw messages as the content of each text post, by removing quotes and URLs. Similarly, photo posts contain 3 categories of information: photo URL, quote photo caption, and raw photo caption. While the photo URL might contain lots of additional meta information, it would require tremendous effort to analyze all images in Tumblr. Hence, we focus on raw photo captions as the content of each photo post. We end up with two content datasets: one of text posts, and the other of photo captions.

What's the effect of no length limit for post? Both Tumblr and Twitter are considered microblogging platforms, yet there is one key difference: Tumblr has no length limit while Twitter enforces a strict limit of 140 bytes for each tweet. How does this key difference affect user posting behavior?

It has been reported that the average length of posts on Twitter is 67.9 bytes and the median is 60 bytes¹⁴. Corresponding statistics for Tumblr are shown in Table 2. For the text post dataset, the average length is 426.7 bytes and the median is 87 bytes, both of which, as expected, are longer than those of Twitter. Keep in mind that Tumblr's numbers are obtained after removing all quotes, photos and URLs, which further discounts the discrepancy between Tumblr and Twitter. The big gap between mean and median is due to a small percentage of extremely long posts. For instance, the longest text post is 446K bytes in our sampled dataset. As for photo captions, naturally we expect them to be much shorter than text posts. The average length is around 64.3 bytes, but the median is only 29 bytes. Although photo posts are dominant in Tumblr, the numbers of text posts and photo captions in Table 2 are comparable, because the majority of photo posts don't contain any raw photo captions.

Table 2: Statistics of User Generated Contents

                     Text Post Dataset   Photo Caption Dataset
# Posts              21.5 M              26.3 M
Mean Post Length     426.7 Bytes         64.3 Bytes
Median Post Length   87 Bytes            29 Bytes
Max Post Length      446.0 K Bytes       485.5 K Bytes

A further related question: is the 140-byte limit sensible? We plot the post length distribution of the text post dataset, zooming into lengths below 280 bytes, in Figure 5. About 24.48% of posts exceed 140 bytes, which indicates that roughly one quarter of posts would have to be rewritten in a more compact form if the limit were enforced on Tumblr.

Blending all the numbers above together, we can see at least two types of posts: one is more like posting a reference (URL or photo) with added information or short comments, the other is authentic user-generated content as in traditional blogging. In other words, Tumblr is a mix of both types of posts, and its no-length-limit policy encourages its users to post longer high-quality content directly.

What are people talking about? Because there is no length limit on Tumblr, blog posts tend to be more meaningful, which allows us to run topic analysis over the two datasets to get an overview of the content. We run LDA [4] with 100 topics on both datasets, and showcase several topics and their corresponding keywords in Tables 3 and 4, which also clearly show the high quality of textual content on Tumblr. Medical, Pets, Pop Music and Sports are interests shared across the 2 datasets, although representative topical keywords might differ even for the same topic. Finance and Internet attract enough attention only in text posts, while only photo posts show a significant amount of interest in the Photography and Scenery topics. We want to emphasize that most of these keywords are semantically meaningful and representative of the topics.

Table 3: Topical Keywords from Text Post Dataset

Topic       Topical Keywords
Pop Music   music song listen iframe band album lyrics video guitar
Sports      game play team win video cookie ball football top sims fun beat league
Internet    internet computer laptop google search online site facebook drop website app mobile iphone
Pets        big dog cat animal pet animals bear tiny small deal puppy
Medical     anxiety pain hospital mental panic cancer depression brain stress medical
Finance     money pay store loan online interest buying bank apply card credit

Table 4: Topical Keywords from Photo Caption Dataset

Who are the major contributors of contents? There are two potential hypotheses. 1) One supposes that socially popular users post more. This is derived from the fact that popular users are followed by many users, so blogging is one way to attract more audience as followers. Meanwhile, it might be true that blogging is an incentive for celebrities to interact with or reward their followers. 2) The other assumes that long-term users (in terms of registration time) post more, since they are accustomed to the service and are more likely to have their own focused communities or social circles. These peer interactions encourage them to generate more authentic content to share with others.

Do socially popular users or long-term users generate more contents? To answer this question, we choose a fixed time window of two weeks in August 2013 and examine how frequently each user blogs on Tumblr. We sort all users by their in-degree (or duration since registration) and then partition them into 10 equi-width bins. For each bin, we calculate the average blogging frequency. For easy comparison, we take the maximal value over all bins as 1 and normalize the other bins relative to it. The results are displayed in Figure 6, where the x-axis from left to right indicates increasing in-degree (or decreasing duration time). For brevity, we show only the results for the text post dataset, as similar patterns were observed over photo captions.

Figure 6: Correlation of Post Frequency with User In-degree or Duration Time since Registration

The patterns are strong in both figures. Users with higher in-degree tend to post more, in terms of both mean and median. One caveat is that what we observe and report here is merely correlation, and it does not imply causality. Here we draw the conservative conclusion that social popularity is highly positively correlated with user blog frequency. A similar positive correlation is also observed on Twitter [11].

In contrast, the pattern in terms of user registration time was beyond our imagination until we drew the figure. Surprisingly, users who registered either earliest or latest tend to post less frequently. Those in between are inclined to post more frequently. Evidently, our initial hypothesis about the incentive for new users to blog more is invalid. There could be different explanations in hindsight. Rather than guessing at the underlying explanation, we leave this phenomenon as an open question for future researchers.

For reference, we also look at the average post length of users, because it has been adopted as a simple metric to approximate the quality of blog posts [1]. The corresponding correlations are plotted in Figure 7. In terms of post length, the tail users in social networks are the winners. Meanwhile, long-term or recently joined users tend to post longer blogs. This pattern is exactly the opposite of that for post frequency. That is, the more frequently one blogs, the shorter the blog posts are, and less frequent bloggers tend to have longer posts. This is plausible considering that each individual has limited time and resources. We even changed the post length to the maximum for each individual user rather than the average, but the pattern remained.

Figure 7: Correlation of Post Length with User In-degree or Duration Time since Registration

In summary, without the post length limitation, Tumblr users are inclined to write longer blogs, leading to higher-quality user-generated content, which can be leveraged for topic analysis. The social celebrities (those with a large number of followers) are the main contributors of content, which is similar to Twitter [24]. Surprisingly, long-term users and recently registered users tend to blog less frequently. Post length in general has a negative correlation with post frequency: the more frequently one posts, the shorter those posts tend to be.

5. TUMBLR FOR INFORMATION PROPAGATION

Tumblr offers one feature which is missing in traditional blog services: reblog. Once a user posts a blog, other users on Tumblr can reblog it to comment or broadcast to their own followers. This enables information to be propagated through the network. In this section, we examine the reblogging patterns in Tumblr. We examine all blog posts uploaded within the first 2 weeks, and count reblog events in the subsequent 2 weeks right after each blog is posted, so that there is no bias from the time window selection in our blog data.

Who are reblogging? First, we would like to understand which users tend to reblog more. Those who reblog frequently serve as information transmitters. As in the previous section, we examine the correlation of reblogging behavior with users' in-degree. As shown in Figure 8, social celebrities, who are the major source of content, reblog a lot more than other users. This reblogging is propagated further through their huge number of followers. Hence, they serve as both content contributors and information transmitters. On the other hand, users who registered earlier reblog more as well. The socially popular and long-term users are the backbone of the Tumblr network, making it a vibrant community for information propagation and sharing.

Reblog size distribution. Once a blog is posted, it can be reblogged by others. Those reblogs can be reblogged even further,

[...]

of reblog cascade involving few reblog events. Yet, within a time window of two weeks, the maximum cascade could reach 116.6K. To get a detailed understanding of reblog cascades, we zoom into the short head and plot the CCDF up to a reblog cascade size of 20 in Figure 9. It is observed that only about 19.32% of reblog cascades have size greater than 10. By contrast, only 1% of retweet cascades have size larger than 10 [11]. The reblog cascades in Tumblr tend to be larger than retweet cascades in Twitter.

Reblog depth distribution. As shown in previous sections, almost any pair of users is connected through a few hops. How many hops does a blog take to propagate to another user in reality? To answer this, we look at the reblog cascade depth, the maximum number of nodes to pass in order to reach a leaf node from the root node in the reblog cascade structure. Note that reblog depth and size are different. A cascade of depth 2 can involve hundreds of nodes if every other node in the cascade reblogs the same root node.

Figure 10 plots the distribution of the number of hops: again, the reblog cascade depth distribution follows a power law according to the PDF; when zooming into the CCDF, we observe that only

¹⁴ http://www.quora.com/Twitter-1/What-is-the-average-length-of-a-tweet
which leads to a tree structure, which is called reblog cascade, with 9.21% of reblog cascades have depth larger than 6. That is, major-
the rst author being the root node. The reblog cascade size indi- ity of cascades can reach just few hops, which is consistent with the
cates the number of reblog actions that have been involved in the ndings reported over Twitter [3]. Actually, 53.31% of cascades in
cascade. Figure 9 plots the distribution of reblog cascade sizes. Tumblr have depth 2. Nevertheless, the maximum depth among all
Not surprisingly, it follows a power-law distribution, with majority cascades can reach 241 based on two week data. This looks un-
[Figure 8: normalized reblog frequency (mean and median) against in-degree from low to high, and against registration time from early to late.]
[Figure 9: PDF and CCDF of reblog cascade size.]

is posted, the less likely it would be reblogged. 75.03% of first reblogs arrive within the first hour after a blog is posted, and 95.84% of first reblogs appear within one day. By comparison, it has been reported that on Twitter half of retweeting occurs within an hour and 75% within a day [11]. In short, Tumblr reblogs have a strong bias toward recency, and information propagation on Tumblr is fast.
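The recency percentages quoted above are points on the empirical CDF of the lag between a post and its first reblog. The following is a minimal sketch of that computation. The function name and the toy lag values are hypothetical and are not taken from the Tumblr data.

```python
HOUR = 3600          # seconds in one hour
DAY = 24 * HOUR      # seconds in one day


def fraction_within(lags_seconds, limit_seconds):
    """Empirical CDF value: share of first-reblog lags that are <= limit_seconds."""
    if not lags_seconds:
        raise ValueError("need at least one lag")
    hits = sum(1 for lag in lags_seconds if lag <= limit_seconds)
    return hits / len(lags_seconds)


# Toy lags in seconds (hypothetical values, for illustration only).
toy_lags = [30, 600, 1800, 7200, 100000]
within_hour = fraction_within(toy_lags, HOUR)
within_day = fraction_within(toy_lags, DAY)
```

Evaluating the same function at 1 minute, 10 minutes, 1 hour, 1 day and 1 week would give the points needed for a lag-time CDF plot.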
[Figure 10: PDF and CCDF of reblog cascade depth.]
Figure 11: Cascade Structure Distribution up to Size 5. The percentage at the top is the coverage of cascade structure.
Figure 12: Distribution of Time Lag between a Blog and its First Reblog. [CDF of the lag time of the first reblog; x-axis ticks at 1m, 10m, 1h, 1d, 1w.]
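The distinction drawn earlier between cascade size and cascade depth can be made concrete with a small sketch. It assumes that a cascade is available as a list of (parent post, reblog post) pairs forming a tree, that size counts reblog actions (non-root nodes), and that depth counts the nodes on the longest root-to-leaf path, so a root with only direct reblogs has depth 2. The function name and the input format are hypothetical.

```python
from collections import defaultdict


def cascade_size_and_depth(root, reblog_edges):
    """Return (size, depth) of the reblog cascade rooted at `root`.

    reblog_edges: iterable of (parent_post, reblog_post) pairs forming a tree.
    size  = number of reblog actions reachable from the root.
    depth = number of nodes on the longest root-to-leaf path (assumed reading).
    """
    children = defaultdict(list)
    for parent, child in reblog_edges:
        children[parent].append(child)

    size = 0
    depth = 1                      # the root alone counts as one node
    stack = [(root, 1)]            # iterative traversal avoids recursion limits
    while stack:
        node, level = stack.pop()
        depth = max(depth, level)
        for child in children[node]:
            size += 1
            stack.append((child, level + 1))
    return size, depth


# A wide but shallow cascade: 100 users reblog the root post directly.
wide = [("root", f"user{i}") for i in range(100)]
# A narrow but deep cascade: each reblog is reblogged once.
chain = [("root", "a"), ("a", "b"), ("b", "c")]

wide_result = cascade_size_and_depth("root", wide)
chain_result = cascade_size_and_depth("root", chain)
```

The wide cascade has a large size at depth 2, while the chain has a small size but a larger depth, which is the contrast the text describes.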
The majority of reblog cascades are tiny in terms of both size and depth, though extreme ones are not uncommon. It is relatively easier to propagate a message wide but shallow rather than deep, suggesting the priority for influence maximization or information propagation.

Compared with Twitter, Tumblr is more vibrant and faster in terms of reblogs and interactions. Tumblr reblogs have a strong bias toward recency. Approximately 3/4 of the first reblogs occur within the first hour, and 95.84% appear within one day.

This snapshot research is by no means complete. There are several directions in which to extend this work. First, some patterns described here are correlations. They do not illustrate the underlying mechanism. It is imperative to differentiate correlation from causality [2] so that we can better understand user behavior. Second, it is observed that Tumblr is very popular among young users, with half of Tumblr's visitor base being under 25 years old. Why is that so? We need to combine content analysis, social network analysis, and user profiles to find out. In addition, since more than 70% of Tumblr posts are images, it is necessary to go beyond photo captions and analyze image content together with other meta information.

8. REFERENCES
[1] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Identifying the influential bloggers in a community. In Proceedings of WSDM, 2008.
[2] A. Anagnostopoulos, R. Kumar, and M. Mahdian. Influence and correlation in social networks. In Proceedings of KDD, 2008.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on twitter. In Proceedings of WSDM, 2011.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[5] S. Chang, V. Kumar, E. Gilbert, and L. Terveen. Specialization, homophily, and gender in a social curation site: Findings from pinterest. In Proceedings of The 17th ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW'14, 2014.
[6] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. arXiv, 706, 2007.
[7] E. Gilbert, S. Bakhshi, S. Chang, and L. Terveen. i need to try this!: A statistical overview of pinterest. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2013.
[10] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In WebKDD/SNA-KDD '07, pages 56-65, New York, NY, USA, 2007. ACM.
[11] H. Kwak, C. Lee, H. Park, and S. B. Moon. What is twitter, a social network or a news media. In Proceedings of 19th International World Wide Web Conference (WWW), 2010.
[12] C. Lampe, N. Ellison, and C. Steinfield. A familiar face(book): Profile elements as signals in an online social network. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2007.
[13] J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. In WWW '08: Proceeding of the 17th international conference on World Wide Web, pages 915-924, New York, NY, USA, 2008. ACM.
[14] M. McGlohon, J. Leskovec, C. Faloutsos, M. Hurst, and N. S. Glance. Finding patterns in blog shapes and blog evolution. In Proceedings of ICWSM, 2007.
[15] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 29-42, New York, NY, USA, 2007. ACM.
[16] S. Mittal, N. Gupta, P. Dewan, and P. Kumaraguru. The pin-bang theory: Discovering the pinterest world. arXiv preprint arXiv:1307.4952, 2013.
[17] B. Nardi, D. J. Schiano, S. Gumbrecht, and L. Swartz. Why we blog. Commun. ACM, 47(12):41-46, 2004.
[18] M. E. J. Newman. Assortative mixing in networks. Physical Review Letters, 89(20):208701, 2002.
[19] M. E. J. Newman. Mixing patterns in networks. Physical Review E, 67(2):026126, 2003.
[20] R. Ottoni, J. P. Pesce, D. Las Casas, G. Franciscani, P. Kumaraguru, and V. Almeida. Ladies first: Analyzing gender roles and behaviors in pinterest. In Proceedings of ICWSM, 2013.
[21] X. Shi, B. Tseng, and L. A. Adamic. Looking at the blogosphere topology through different lenses. In Proceedings of ICWSM, 2007.
[22] J. Ugander, B. Karrer, L. Backstrom, and C. Marlow. The anatomy of the facebook social graph. arXiv preprint arXiv:1111.4503, 2011.
[23] J. Weng, E.-P. Lim, J. Jiang, and Q. He. Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of WSDM, pages 1137-1146, 2010.
[24] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on twitter. In Proceedings of WWW, 2011.
SIGKDD Explorations, Volume 16, Issue 1, Page 30
The streaming computational model is considered one of the most widely used models for processing and analyzing massive data. Streaming data processing helps the decision-making process in real time. A data stream is defined as follows.

DEFINITION 1. A data stream is an infinite sequence of elements

$$S = \{(X_1, T_1), \ldots, (X_j, T_j), \ldots\} \qquad (1)$$

Each element is a pair $(X_j, T_j)$, where $X_j = (x_1, x_2, \ldots, x_d)$ is a d-dimensional vector arriving at time stamp $T_j$. The time stamp is defined over a discrete domain with a total order. There are two types of time stamps: an explicit time stamp is generated when data arrive; an implicit time stamp is assigned by some data stream processing system.

Streaming data has the following fundamental characteristics. First, data arrives continuously. Second, streaming data evolves over time. Third, streaming data is noisy and corrupted. Fourth, timely interfering is important. From the characteristics of streaming data and the data stream model, data stream processing and mining pose the following challenges. First, as streaming data arrives rapidly, the techniques for processing and analyzing streaming data must keep up with the data rate to prevent the loss of important information as well as to avoid data redundancy. Second, as the speed of streaming data is very high, the data volume exceeds the processing capacity of existing systems. Third, the value of data decreases over time, and the recent streaming data is sufficient for many applications. Therefore, one can only capture and process the data as soon as it is generated.

2.1 Change Detection: Definitions and Notation
This section presents concepts and a classification of changes and change detection methods. To develop a change detection method, we should understand what a change is.

DEFINITION 2. Change is defined as the difference in the state of an object or phenomenon over time and/or space [52; 1].

From the system point of view, change is the process of transition from one state of a system to another. In other words, a change can be defined as the difference between an earlier state and a later state. An important distinction between change and difference is that a change refers to a transition in the state of an object or a phenomenon over time, while a difference means the dissimilarity in the characteristics of two objects. A change can reflect a short-term trend or a long-term trend. For example, a stock analyst may be interested in the short-term change of the stock price.

Change detection is defined as the process of identifying differences in the state of an object or phenomenon by observing it at different times [54]. In the above definition, a change is detected on the basis of differences of an object at different times, without considering the differences of an object across locations in space. In many real-world applications, changes can occur in terms of both time and space. For example, multiple spatial-temporal data streams representing triples (latitude, longitude, time) are created in traffic information systems using GPS [23]. Hence, change detection can be defined as follows.

DEFINITION 3. Change detection is the process of identifying differences in the state of an object or phenomenon by observing it at different times and/or different locations in space.

A distinction between concept drift detection and change detection is that concept drift detection focuses on labeled data, while change detection can deal with both labeled and unlabeled data. Change analysis both detects and explains the change. Hido et al. [26] proposed a method for change analysis by using supervised learning.

DEFINITION 4. Change point detection is identifying time points at which properties of time series data change [32].

Depending on the specific application, change detection can be referred to by different terms such as burst detection, outlier detection, or anomaly detection. Burst detection is a special kind of change detection. A burst is a period on a stream with an aggregated sum exceeding a threshold [31]. Outlier detection is a special kind of change detection. Anomaly detection can be seen as a special type of change detection in streaming data.

To find a solution to the problem of change detection, we should consider the aspects of change of the system in which we want to detect changes. As shown in [52], the aspects of change that must be considered include the subject of change, type of change, cause of change, effect of change, response to change, temporal issues, and spatial issues. In particular, to design an algorithm for detecting changes in sensor streaming data, the major questions we need to answer include: What is the system in which the changes need to be detected? What are the principles used to model the problem? What is the data type? What are the constraints of the problem? What is the physical subject of change? What is the meaning of the change to the user? How should one respond and react to this change? How can this change be visualized?

A change detection method can fall into one of two types: batch change detection and sequential change detection. Given a sequence of N observations $x_1, \ldots, x_N$, where N is invariant, the task of a batch change detection method is to decide whether a change occurs at some point in the sequence by using all N available observations. When the arriving speed of data is too high, batch change detection is suitable. In other words, a change detection method using the two adjacent windows model will be used. However, the drawback of the batch change detection method is that its running time is very large when detecting changes in a large amount of data. In contrast, the sequential change detection problem is based on the observations so far. If no change is detected, the next observation is processed. Whenever a change is detected, the change detector is reset.

Change detection methods can be classified into the following approaches: threshold-based change detection; state-based change detection; trend-based change detection.

A change detection algorithm should meet three main requirements [37]: accuracy, promptness, and online operation. The algorithm should detect as many actual change points as possible and generate as few false alarms as possible. The algorithm should detect a change point as early as possible. The algorithm should be efficient enough for a real-time environment.

Change detection in data streams allows us to identify time-evolving trends and time-evolving patterns. Research issues on mining changes in data streams include modeling and representation of changes, change-adaptive mining methods, and interactive exploration of changes [19]. Change detection plays an important role in the field of data stream analysis. Since a change in the model may convey interesting time-dependent information and knowledge, the change of the data stream can be used for understanding the nature of several applications. Basically, interesting research problems on mining changes in data streams can be classified into three categories: modeling and representation of changes, mining methods, and interactive exploration of changes. A change detection algorithm can be used as a sub-procedure in many other data stream mining algorithms in order to deal with the changing data in data streams [28; 4]. A definition of change detection for streaming data is given as follows.

DEFINITION 5. Change detection is the process of segment-
Figure 2: One-shot distributed change detection models. [Panels: (a) distributed detection without decision fusion; (b) distributed detection with decision fusion. Diagram labels: local decisions u1 and u2, global decision.]
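The decision-fusion variant named in the Figure 2 caption can be illustrated with a short sketch in which each local detector emits a binary decision u_i and a fusion step combines them into one global decision. The source does not state which fusion rule is used. The k-out-of-n vote below, the simple threshold test for the local decisions, and all function names are assumptions made only for illustration.

```python
from typing import Sequence


def local_decision(observation: float, threshold: float) -> int:
    """Local detector (assumed rule): report 1 if the observation exceeds the threshold."""
    return int(observation > threshold)


def fuse_decisions(local_decisions: Sequence[int], k: int = 1) -> int:
    """Global decision (assumed k-out-of-n rule): 1 if at least k detectors report a change."""
    if not 1 <= k <= len(local_decisions):
        raise ValueError("k must be between 1 and the number of detectors")
    return int(sum(local_decisions) >= k)


# Two local detectors, as in the figure: u1 sees no change, u2 sees a change.
u = [local_decision(0.2, threshold=1.0), local_decision(3.5, threshold=1.0)]
global_or = fuse_decisions(u, k=1)    # OR rule: any detector suffices
global_and = fuse_decisions(u, k=2)   # AND rule: all detectors must agree
```

Choosing k trades off false alarms against missed changes: a small k reacts to any single detector, while a large k requires agreement.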
Figure 3: Continuous distributed change detection models. [Panels: (a) distributed continuous detection without decision fusion; (b) distributed continuous detection with decision fusion. Diagram labels: phenomenon, local decisions u1 and u2, global decision.]
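The sequential scheme described in Section 2.1 (process one observation at a time and reset the detector whenever a change is detected) can be sketched with two adjacent windows. The source does not specify a test statistic. The difference of window means, the window length, the threshold and the function name are assumptions chosen only to keep the example small.

```python
from collections import deque


def detect_changes(stream, window=5, threshold=2.0):
    """Return the indices at which a change is flagged.

    A reference window is compared with the adjacent current window; a change
    is flagged when their means differ by more than `threshold` (assumed rule).
    """
    reference = deque(maxlen=window)
    current = deque(maxlen=window)
    change_points = []
    for index, value in enumerate(stream):
        if len(reference) < window:
            reference.append(value)        # still filling the reference window
            continue
        current.append(value)
        if len(current) < window:
            continue                       # still filling the current window
        gap = abs(sum(current) / window - sum(reference) / window)
        if gap > threshold:
            change_points.append(index)
            reference.clear()              # reset the detector after a change
            current.clear()
        else:
            # No change: slide both windows forward by one observation.
            reference.append(current.popleft())
    return change_points


step_stream = [0] * 10 + [5] * 10          # level shift at index 10
flat_stream = [1.0] * 30                   # no change at all
step_changes = detect_changes(step_stream, window=5, threshold=2.0)
flat_changes = detect_changes(flat_stream, window=5, threshold=2.0)
```

The shift at index 10 is flagged a few observations later, once enough new values have entered the current window. This delay is the promptness cost of using a wider window to suppress false alarms.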
Beng Chin Ooi, Kian-Lee Tan, Quoc Trung Tran, James W. L. Yip,
Gang Chen#, Zheng Jye Ling, Thi Nguyen, Anthony K. H. Tung, Meihui Zhang
National University of Singapore, National University Health System, #Zhejiang University
{ooibc, tankl, tqtrung, thi, atung, zmeihui}@comp.nus.edu.sg
{james yip, zheng jye ling}@nuhs.edu.sg, #cg@zju.edu.cn
have any predefined class labels and might need to ask Hadoop Distributed File System (HDFS) and a key-value storage
security experts to provide the class labels for some sample system, ES2 [8]) for both unstructured and structured data. The
cases. next layer (which is the security layer) enables users to protect
data privacy by encryption. The third layer (which is the
Lastly, data in different domains (e.g., healthcare, distributed processing layer) provides a distributed processing
telecommunication, home security) is expected to grow infrastructure called E3 [9] that supports different parallel
dramatically in the years ahead [26]. For instance, patients processing logics such as MapReduce [11], Directed Acyclic
in intensive care units are constantly being monitored, and Graph (DAG) and SQL. The top layer (which is the analytics
their historical records have to be retained. This can easily layer) exploits the contextual crowd intelligence for big data
result in hundreds of millions of (historical) records of analytics. The details of this layer are shown in Figure 1. In
patients. As another example, during a mass casualty Figure 2, KB is the knowledge base and iCrowd is the component
disaster (e.g., SARS, H5N1), there is an overwhelming that interacts with the domain experts. Different components of
number of patients who have to be monitored and tracked, the analytics layer (e.g., scalable machine learning algorithms) can
and information about each patient is huge by itself. process their data with the most appropriate data processing model
Furthermore, streaming data arrive continuously, e.g., new and their computations will be automatically executed in parallel
data from the real-time data feed are constantly being by the lower layers.
inserted. Hence, the system in healthcare setting must In the remaining of this paper, we focus only on the analytics layer.
provide the real-time predictions, e.g., predicting the For more details of the other layers of the epiC system, please refer
survival of patients in the next 6 hours. to [1; 10; 19].
The three above mentioned aspects call for a new generation of
intelligent DBMSs that can provide effective solutions for big data 4. RESEARCH PROBLEMS
analytics. Our proposition of exploiting contextual crowd In this section, we elaborate on the research problems that we need
intelligence is, we believe, a big step towards this goal. to address in order to build an intelligent system for big data
analytics.
3.2 Contextual Data Management
The central theme of crowd intelligence is to get domain experts 4.1 Asking Experts The Right Questions
engaged as both the participants to fine tune the system and the Given a large volume of data and a limited amount of time that
end-users of the system. Figure 1 presents an intelligent system domain experts can participate in building the systems, we need
that exploits contextual crowd intelligence for big data analytics. to ask the experts the right questions. In the context of healthcare
The system first builds a knowledge base that will be subsequently analytics, we plan to ask the following domain knowledge from
used for the analytics tasks based on historical data, domain doctors.
knowledge from SMEs (e.g., doctors), and other sources such as
general clinical guidelines. Each source contributes to build some Labelings. The system asks doctors to label tuples that the
weak classifiers. The system needs to combine these classifiers system has low confidence in performing the prediction
to derive a final classifier that achieves a high level of accuracy for task. There are two important issues here. First, doctors
prediction purposes. The system also needs to go through several have different levels of confidence when answering different
iterations of interaction with the experts to refine, for example, the questions, i.e., doctors are reluctant to assess patient profiles
final classifier. As such, the experts participate in the entire that they do not have specialties. Second, since there is so
process in fine tuning the system and decide on the best much information about patients, selecting the relevant
practices. When real-time data or feed arrives, the system feature of each patient to present to the doctors in order not
performs the prediction on-the-fly and alerts the experts to overwhelm them is also a major issue.
immediately. Hence, the experts become the end-users of the In essence, what we need is a diverse set of labeled patients
system. that covers the whole data space as much as possible. One
We have developed the epiC system [1; 10; 19] to support large possible solution is to group similar patient profiles together
scale data processing, and are extending it to support healthcare and show these groups to doctors. The purpose is to let the
analytics. Figure 2 shows the software stack of epiC. At the doctors select the groups of patients that they are
bottom, the storage layer supports different storage systems (e.g., comfortable in providing the labels. In addition, for each
Historical data
Classifier from Classifier from Questions
SMEs data MR DAG SQL
Distributed
Classifier from Classifier from Processing Layer
general guidelines other sources
E3
Derive
Update SMEs
Real-time data/feed
Trusted Data Service Security Layer
Predictor Knowledge
Base
Figure 1: Contextual crowd intelligence for big data analytics. Figure 2: The software stack of epiC for big data analyt-
ics.
group, we present only the features for which the patients in the group have similar values. In this way, we can avoid overwhelming the doctors with information. Note that, in some cases, we need to perform hierarchical clustering to reduce the number of patients shown to the doctors each time. Selecting the right clustering algorithms and developing effective visualization tools to present patients' profiles are important here.

Rules/Hypotheses. The system collects the expert rules/hypotheses that the doctors used to do the labeling. For example, to predict the risk of unplanned patient readmissions, the doctors suggested a hypothesis that social factors and the status of the diseases are important risk indicators for readmission. The system would then verify or adjust these hypotheses and report back to the doctors with evidence to support or reject their hypotheses. Such interactions are beneficial to both the system and the doctors.

Inferred implicit knowledge. The system can also infer implicit and valuable knowledge based on the answers/reactions of the domain experts. For instance, if the doctors label two patients who belong to a given cluster differently, then the system can adjust the distance function used to compute the similarity between two patients, and thus infer which features are more important. Such knowledge is implicit, as the doctors themselves may not be aware of it.

We can also ask the same kind of questions for the analytics tasks in other domains. For instance, to predict malicious SMSs, we need to select a small set of messages (by utilizing some clustering algorithms) and ask the experts to provide labels for these samples. We also collect rules and heuristics that the experts utilize to label the SMSs.

4.2 Extracting Domain Entities From Unstructured Data

Feature selection is very important for any machine learning task and can greatly affect the algorithm's quality. Processing doctors' notes to extract important features is an essential step for healthcare analytics problems. There are several state-of-the-art Natural Language Processing (NLP) engines for processing clinical documents, such as MedLEE [14] and cTAKES [27]. These engines process clinical notes, identifying types of clinical entities (e.g., medications, diseases, procedures, lab tests) from various medical dictionaries (a.k.a. knowledge base), such as the Unified Medical Language System (UMLS) [6]. We now discuss several problems raised by the nature of the unstructured data and the incompleteness of the knowledge base, and subsequently discuss a hybrid human-machine approach to solve these problems. The discussion uses the following running example. We run cTAKES on the doctor's note of patient 1 (in Table 1), and obtain the following clinical entities: (1) diseases: IHD (Ischemic Heart Disease) and DM; (2) medications: GTN and Metformin; and (3) laboratory test: HbA1c.

Ambiguous mentions. In many cases, a mention in the free text may refer to different domain entities. For instance, in the running example, DM may refer to two different diseases: Dystrophy Myotonic and Diabetes Mellitus. We note that this problem is not uncommon, as doctors tend to use abbreviations in their notes. For example, CCF refers to either Congestive heart failure or Carotid-Cavernous Fistula; PID refers to either Prolapsed Intervertebral Disc or Pelvic Inflammatory Disease. There are also cases where only a human, but not the machine, can understand the meaning of some mentions in the text. For example, assume that we are extracting the social factor of the patients in Table 1. It is rather easy to extract the social factor for patient 1, since the text contains the phrase "stays with son". However, it is challenging, if not impossible, for the machine to extract the social factor for patient 2. The reason is that the paragraph contains several different keywords relating to the social factor, such as "single", "no child", "live with friend", "sheltered home", and "next-of-kin".

Incomplete knowledge base. The knowledge base is incomplete for the following reasons. First, the terms used in the doctors' notes could be specific to a country or a particular hospital, whereas the existing knowledge bases may only cover the universal ones. Thus, these terms do not exist in the dictionary. One example is the term HL in our running example, which refers to the Hyperlipidemia disease but is not captured in UMLS. Second, the relationships between entities covered in existing medical knowledge bases (like UMLS) are far from complete. In the running example, the fact that the medication Metformin is used to treat Diabetes Mellitus (DM) is also missing in UMLS. The relationships that exist between domain entities can be used to derive implicit and useful information. For instance, from the laboratory result of the lab test HbA1c, we can infer whether the DM condition is well-controlled (i.e., the relationship between a disease and a lab test).
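The ambiguity and incompleteness problems described above suggest a simple triage rule for a hybrid human-machine workflow: a mention with exactly one candidate entity can be resolved automatically, while ambiguous or unknown mentions are routed to a human expert. The following is a minimal sketch of that idea only. The dictionary `KNOWLEDGE_BASE` and the function `triage_mentions` are hypothetical names introduced for illustration; they are not part of cTAKES or UMLS, and the entries are limited to the abbreviations quoted in this section.

```python
# Toy knowledge base: abbreviation -> candidate entities.
# Entries are copied from the examples in the text; a real system would
# query a medical dictionary instead (assumption for illustration).
KNOWLEDGE_BASE = {
    "IHD": ["Ischemic Heart Disease"],
    "DM": ["Dystrophy Myotonic", "Diabetes Mellitus"],
    "CCF": ["Congestive heart failure", "Carotid-Cavernous Fistula"],
    "PID": ["Prolapsed Intervertebral Disc", "Pelvic Inflammatory Disease"],
}


def triage_mentions(mentions, knowledge_base=KNOWLEDGE_BASE):
    """Split mentions into resolved, ambiguous, and unknown groups."""
    resolved, ambiguous, unknown = {}, {}, []
    for mention in mentions:
        candidates = knowledge_base.get(mention, [])
        if len(candidates) == 1:
            # Exactly one candidate: the machine can resolve it.
            resolved[mention] = candidates[0]
        elif len(candidates) > 1:
            # Several candidates: ask a domain expert to disambiguate.
            ambiguous[mention] = candidates
        else:
            # Not in the dictionary: ask a domain expert to define it.
            unknown.append(mention)
    return resolved, ambiguous, unknown


resolved, ambiguous, unknown = triage_mentions(["IHD", "DM", "HL"])
print("resolved:", resolved)
print("ambiguous:", ambiguous)
print("unknown:", unknown)
```

In this sketch, only the `ambiguous` and `unknown` groups would be shown to the doctors, which keeps the number of questions small.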
[5] National university health system. http://www.nuhs.edu.sg/.
[6] Unified medical language system. http://www.nlm.nih.gov/research/umls/.
[7] N. Allaudeen, J. L. Schnipper, E. J. Orav, R. M. Wachter, and A. R. Vidyarthi. Inability of providers to predict unplanned readmissions. J Gen Intern Med, 26(7):771–776.
[8] Y. Cao, C. Chen, F. Guo, D. Jiang, Y. Lin, B. C. Ooi, H. T. Vo, S. Wu, and Q. Xu. Es2: A cloud data storage system for supporting both oltp and olap. In ICDE, pages 291–302, 2011.
[9] G. Chen, K. Chen, D. Jiang, B. C. Ooi, L. Shi, H. T. Vo, and S. Wu. E3: an elastic execution engine for scalable data processing. JIP, 20(1):65–76, 2012.
[10] G. Chen, H. Jagadish, D. Jiang, D. Maier, B. Ooi, K. Tan, and W. Tan. Federation in cloud data management: Challenges and opportunities. TDKE, 2014.
[11] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1), Jan. 2008.
[12] J. Fan, M. Lu, B. C. Ooi, W.-C. Tan, and M. Zhang. A hybrid machine-crowdsourcing system for matching web tables. In ICDE, pages 976–987, 2014.
[13] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: answering queries with crowdsourcing. In SIGMOD Conference, pages 61–72, 2011.
[14] C. Friedman, P. O. Alderson, J. H. Austin, J. J. Cimino, and S. B. Johnson. A general natural-language text processor for clinical radiology. JAMIA, 1(2):161–174, 1994.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: An update. SIGKDD Explorations, 11(1), 2009.
[16] J. Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[17] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In ICML, pages 417–424, 2006.
[22] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. Cdas: a crowdsourcing data analytics system. PVLDB, 5(10):1040–1051, 2012.
[23] A. Marcus, E. Wu, S. Madden, and R. C. Miller. Crowdsourced databases: Query processing with people. In CIDR, pages 211–214, 2011.
[24] A. G. Parameswaran, H. Park, H. Garcia-Molina, N. Polyzotis, and J. Widom. Deco: declarative crowdsourcing. In CIKM, pages 1203–1212, 2012.
[25] S. Perera, A. Sheth, K. Thirunarayan, S. Nair, and N. Shah. Challenges in understanding clinical notes: Why nlp engines fall short and where background knowledge can help. In CIKM Workshop, 2013.
[26] W. Raghupathi and V. Raghupathi. Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2014.
[27] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. K. Schuler, and C. G. Chute. Mayo clinical text analysis and knowledge extraction system (ctakes): architecture, component evaluation and applications. JAMIA, 17(5):507–513, 2010.
[28] T. K. Sean Goldberg, Daisy Zhe Wang. Castle: Crowd-assisted system for textual labeling & extraction. HCOM, 2013.
[29] B. Settles. Active learning literature survey. Technical report, University of Wisconsin–Madison, 2010.
[30] M. Stonebraker, D. Bruckner, I. Ilyas, G. Beskales, M. Cherniack, S. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The data tamer system. In CIDR, 2013.
[31] J. Sun, J. Hu, D. Luo, M. Markatou, F. Wang, S. Edabollahi, S. E. Steinhubl, Z. Daar, and W. F. Stewart. Combining knowledge and data driven insights for identifying risk factors using electronic health records. In AMIA, 2012.
[32] J. Wiens and J. Guttag. Active learning applied to patient-adaptive heartbeat classification. In NIPS, pages 2442–2450, 2010.
On Power Law Distributions in Large-scale Taxonomies

Rohit Babbar, Cornelia Metzig, Ioannis Partalas, Eric Gaussier and Massih-Reza Amini
Université Grenoble Alpes, CNRS
F-38000 Grenoble, France
rohit.babbar@imag.fr, cornelia.metzig@imag.fr, ioannis.partalas@imag.fr, eric.gaussier@imag.fr, massih-reza.amini@imag.fr
ABSTRACT

In many large-scale physical and social complex systems, fat-tailed distributions occur, for which different generating mechanisms have been proposed. In this paper, we study models of generating power law distributions in the evolution of large-scale taxonomies such as the Open Directory Project, which consist of websites assigned to one of tens of thousands of categories. The categories in such taxonomies are arranged in tree or DAG structured configurations having parent-child relations among them. We first quantitatively analyse the formation process of such taxonomies, which leads to power law distributions as the stationary distributions. In the context of designing classifiers for large-scale taxonomies, which automatically assign unseen documents to leaf-level categories, we highlight how the fat-tailed nature of these distributions can be leveraged to analytically study the space complexity of such classifiers. Empirical evaluation of the space complexity on publicly available datasets demonstrates the applicability of our approach.

1. INTRODUCTION

With the tremendous growth of data on the web from various sources such as social networks, online business services and news networks, structuring the data into conceptual taxonomies leads to better scalability, interpretability and visualization. Yahoo! directory, the open directory project (ODP) and Wikipedia are prominent examples of such web-scale taxonomies. The Medical Subject Heading hierarchy of the National Library of Medicine is another instance of a large-scale taxonomy in the domain of life sciences. These taxonomies consist of classes arranged in a hierarchical structure with parent-child relations among them and can be in the form of a rooted tree or a directed acyclic graph. ODP for instance, which is in the form of a rooted tree, lists over 5 million websites distributed among close to 1 million categories and is maintained by close to 100,000 human editors. Wikipedia, on the other hand, represents a more complicated directed graph taxonomy structure consisting of over a million categories. In this context, large-scale hierarchical classification deals with the task of automatically assigning labels to unseen documents from a set of target classes which are represented by the leaf level nodes in the hierarchy.

In this work, we study the distribution of data and the hierarchy tree in large-scale taxonomies with the goal of modelling the process of their evolution. This is undertaken by a quantitative study of the evolution of large-scale taxonomies using models of preferential attachment, based on the famous model proposed by Yule [33], and showing that throughout the growth process, the taxonomy exhibits a fat-tailed distribution. We apply this reasoning to both category sizes and tree connectivity in a simple joint model.

Formally, a random variable $X$ is defined to follow a power law distribution if, for some positive constant $a$, the complementary cumulative distribution is given as follows:

$$P(X > x) \propto x^{-a} \qquad (1)$$

Power law distributions, or more generally fat-tailed distributions that decay slower than Gaussians, are found in a wide variety of physical and social complex systems, ranging from city population and distribution of wealth to citations of scientific articles [23]. They are also found in network connectivity, where the internet and Wikipedia are prominent examples [27; 7]. Our analysis in the context of large-scale web taxonomies leads to a better understanding of such large-scale data, and is also leveraged in order to present a concrete analysis of space complexity for hierarchical classification schemes. Due to the ever increasing scale of training data size in terms of the number of documents, feature set size and number of target classes, the space complexity of the trained classifiers plays a crucial role in the applicability of classification systems in many applications of practical importance.

The space complexity analysis presented in this paper provides an analytical comparison of the trained model for hierarchical and flat classification, which can be used to select the appropriate model a priori for the classification problem at hand, without actually having to train any models. Exploiting the power law nature of taxonomies to study the training time complexity for hierarchical Support Vector Machines has been performed in [32; 19]. The authors therein justify the power law assumption only empirically, unlike our analysis in Section 3, wherein we describe the generative process of large-scale web taxonomies more concretely, in the context of similar processes studied in other models. Despite the important insights of [32; 19], space complexity has not been treated formally so far.

The remainder of this paper is as follows. Related work on reporting power law distributions and on large scale hierarchical classification is presented in Section 2. In Section 3, we recall important growth models and quantitatively justify the formation of power laws as they are found in hi-
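The definition in Equation 1 can be made concrete with a small numerical experiment. The sketch below is illustrative only and is not taken from the paper: it assumes a pure power law above a known lower cut-off `x_min`, draws samples by inverse-transform sampling, and recovers the exponent $a$ with the standard maximum-likelihood estimator for that setting. The function names, the sample size, and the value $a = 1.5$ are arbitrary choices made for illustration.

```python
import math
import random


def sample_power_law(n, a, x_min=1.0, seed=0):
    """Draw n samples with P(X > x) = (x / x_min) ** (-a) for x >= x_min."""
    rng = random.Random(seed)
    # Inverse-transform sampling: 1 - u lies in (0, 1], so the power is finite.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / a) for _ in range(n)]


def empirical_ccdf(samples, x):
    """Fraction of samples strictly greater than x."""
    return sum(1 for s in samples if s > x) / len(samples)


def estimate_exponent(samples, x_min=1.0):
    """Maximum-likelihood estimate of a, assuming a known cut-off x_min."""
    tail = [s for s in samples if s >= x_min]
    return len(tail) / sum(math.log(s / x_min) for s in tail)


samples = sample_power_law(n=20000, a=1.5)
a_hat = estimate_exponent(samples)
print(f"estimated exponent a: {a_hat:.3f}")
print(f"empirical P(X > 10):  {empirical_ccdf(samples, 10.0):.4f}")
print(f"model P(X > 10):      {10.0 ** -1.5:.4f}")
```

On real taxonomy data the cut-off is not known in advance and the tail is only approximately a power law, so this estimator should be read as a first check rather than a full fitting procedure.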
density $p(N_i)$, then also follows a power law with exponent $(\beta + 1)$, i.e. $p(N_i) \propto N_i^{-(\beta+1)}$.

Two of our empirical findings are a power law for both the complementary cumulative category size distribution and the counter-cumulative in-degree distribution, shown in Figures 1 and 2, for the LSHTC2-DMOZ dataset, which is a subset of ODP. The dataset⁴ contains 394,000 websites and 27,785 categories. The number of categories at each level of the hierarchy is shown in Figure 3.

Figure 1: Category size vs rank distribution for the LSHTC2-DMOZ dataset. (Axes: category size $N$ against # of categories with $N_i > N$; fitted exponent $\beta = 1.1$.)

[Figure 2: plot of # of categories with $dg_i > dg$; fitted exponent $\gamma = 1.9$.]

Figure 3: Number of categories at each level in the hierarchy of the LSHTC2-DMOZ database. (Axes: level against # categories.)

Yule's model describes a system that grows in two quantities, in elements and in classes to which the elements are assigned. It assumes that for a system having $\kappa$ classes, the probability that a new element will be assigned to a certain class $i$ is proportional to its current size,

$$p(i) = \frac{N_i}{\sum_{i'=1}^{\kappa} N_{i'}} \qquad (2)$$

It further assumes that for every $m$ elements that are added to the pre-existing classes in the system, a new class of size 1 is created⁵.

The described system is constantly growing in terms of elements and classes, so strictly speaking, a stationary state does not exist [20]. However, a stationary distribution, the so-called Yule distribution, has been derived using the approach of the master equation with similar approximations by [26; 23; 17]. Here, we follow Newman [23], who considers as one time-step the duration between the creation of two consecutive classes. From this it follows that the average number of elements per class is always $m + 1$, and the system contains $\kappa(m + 1)$ elements at a moment where the number of classes is $\kappa$. Let $p_{N,\kappa}$ denote the fraction of classes having $N$ elements when the total number of classes is $\kappa$. Between two successive time instances, the probability for a given pre-existing class $i$ of size $N_i$ to gain a new element is $m N_i / (\kappa(m + 1))$. Since there are $\kappa\, p_{N,\kappa}$ classes of size $N$, the expected number of such classes which gain a new element (and grow to size $(N + 1)$) is given by:

$$\frac{m N}{\kappa (m+1)} \, \kappa \, p_{N,\kappa} = \frac{m}{m+1} \, N \, p_{N,\kappa} \qquad (3)$$

Balancing the classes that enter and leave size $N$ when one new class is added gives the master equation:

$$(\kappa + 1)\, p_{N,(\kappa+1)} = \kappa\, p_{N,\kappa} + \frac{m}{m+1} \left[ (N-1)\, p_{(N-1),\kappa} - N\, p_{N,\kappa} \right] \qquad (4)$$

The first term on the right-hand side of Equation 4 corresponds to classes with $N$ documents when the number of classes is $\kappa$. The second term corresponds to the contribution from classes of size $(N - 1)$ which have grown to size $N$; this is shown by the left arrow (pointing rightwards) in Figure 4. The last term corresponds to the decrease resulting from classes which have gained an element and have become of size $(N + 1)$; this is shown by the right arrow (pointing rightwards) in Figure 4. The equation for the class of size 1 is given by:

$$(\kappa + 1)\, p_{1,(\kappa+1)} = \kappa\, p_{1,\kappa} + 1 - \frac{m}{m+1}\, p_{1,\kappa} \qquad (5)$$

Figure 4: Illustration of Equation 4. Individual classes grow constantly, i.e., move to the right over time, as indicated by arrows. A stationary distribution means that the height of each bar remains constant. (Axes: class size, with bins $N-1$, $N$, $N+1$, against number of classes.)

Table 1: Summary of notation used in Section 3

Variables
  $N_i$ : Number of elements in class $i$
  $dg_i$ : Number of subclasses of class $i$
  $d_i$ : Number of features of class $i$
  $\kappa$ : Total number of classes
  $DG$ : Total number of in-degrees (= subcategories)
  $p_{N,\kappa}$ : Fraction of classes having $N$ elements when the total number of classes is $\kappa$
Constants
  $m$ : Number of elements added to the system after which a new class is added
  $w \in [0, 1]$ : Probability that attachment of subcategories is preferential
Indices
  $i$ : Index for the class

As the number of classes (and therefore the number of elements $\kappa(m + 1)$) in the system increases, the probability that a new element is classified into a class of size $N$, given by Equation 3, is assumed to remain constant and independent of $\kappa$. Under this hypothesis, the stationary distribution for class sizes can be determined by solving Equation 4 and using Equation 5 as the initial condition. This is given by

$$p_N = (1 + 1/m)\, B(N, 2 + 1/m) \qquad (6)$$

where $B(\cdot, \cdot)$ is the beta function. Equation 6 has been termed the Yule distribution [26]. Written for a continuous variable $N$, it has a power law tail:

$$p(N) \propto N^{-2 - \frac{1}{m}}$$

From the above equation, the exponent of the density function is between 2 and 3. Its cumulative size distribution $P(N_k > N)$, as given by Equation 1, has an exponent given by

$$\beta = (1 + (1/m)) \qquad (7)$$

which is between 1 and 2. The higher the frequency $1/m$ at which new classes are introduced, the bigger $\beta$ becomes, and the lower the average class size. This exponent is stable over time although the taxonomy is constantly growing.

3.2 Preferential attachment models for networks and trees

A similar model has been formulated for network growth by Barabási and Albert [2], which explains the formation of a power law distribution in the connectivity degree of nodes. It assumes that the networks grow in terms of nodes and edges, and that every newly added node to the system connects with a fixed number of edges to existing nodes. Attachment is again preferential, i.e. the probability for a newly added node $i$ to connect to a certain existing node $j$ is proportional to the number of existing edges of node $j$.

A node in the Barabási-Albert (BA) model corresponds to a class in Yule's model, and a new edge to two newly assigned elements. Every added edge counts both towards the degree of an existing node $j$ and towards that of the newly added node $i$. For this reason the existing nodes $j$ and the newly added node $i$ always grow by the same number of edges, implying $m = 1$ and consequently $\beta = 2$ in the BA-model, independently of the number of edges that each new node creates.

The seminal BA-model has been extended in many ways. For hierarchical taxonomies, we use a preferential attachment model for trees by [17]. The authors considered growth via directed edges, and explain power law formation in the in-degree, i.e. the edges directed from children to parent in a tree structure. In contrast to the BA-model, newly added nodes and existing nodes do not increase their in-degree by the same amount, since new nodes start with an in-degree of 0. Leaf nodes thus cannot attract attachment of nodes, and preferential attachment alone cannot lead to a power law. A small random term ensures that some nodes attach to existing ones independently of their degree, which is analogous to the start of a new class in the Yule model. The probability $v$ that a new node attaches as a child to the existing node $i$ with in-degree $dg_i$ becomes

$$v(i) = w\, \frac{d_i - 1}{DG} + (1 - w)\, \frac{1}{DG}, \qquad (8)$$

where $DG$ is the size of the system measured in the total number of in-degrees. $w \in [0, 1]$ denotes the probability that the attachment is preferential, and $(1 - w)$ the probability that it is random to any node, independently of their numbers of in-degrees. As has been done for the Yule process [26; 23; 14; 29], the stationary distribution is again derived via the master Equation 4. The exponent of the asymptotic power law in the in-degree distribution is $\gamma = 1 + 1/w$. This model is suitable to explain scaling properties of the tree or network structure of large-scale web taxonomies, which have also been analysed empirically, for instance for subcategories of Wikipedia [7]. It has also been applied to directory trees in [14].

3.3 Model for hierarchical web taxonomies

We now apply these models to large-scale web taxonomies like DMOZ. Empirically, we uncovered two scaling laws: (a)

⁴ http://lshtc.iit.demokritos.gr/LSHTC2 datasets
⁵ The initial size may be generalized to other small sizes; for instance Tessone et al. consider entrant classes with size drawn from a truncated power law [29].
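As a numerical sanity check on the Yule distribution of Section 3, one can test whether Equation 6 is a fixed point of the master equation. Setting $p_{N,(\kappa+1)} = p_{N,\kappa} = p_N$ in Equations 4 and 5 gives the stationary conditions $p_N = \frac{m}{m+1}\left[(N-1)\,p_{N-1} - N\,p_N\right]$ and $p_1 = 1 - \frac{m}{m+1}\,p_1$. The sketch below evaluates Equation 6 with the beta function and checks these conditions, the normalisation, and the tail exponent. It is an illustration under stated assumptions, not code from the paper: the value $m = 4$ and the names `beta_fn` and `yule_pmf` are chosen here for the example.

```python
import math


def beta_fn(a, b):
    """Beta function B(a, b), computed via log-gamma for numerical stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def yule_pmf(n, m):
    """Stationary fraction p_N of classes with N elements (Equation 6)."""
    return (1.0 + 1.0 / m) * beta_fn(n, 2.0 + 1.0 / m)


m = 4.0  # illustrative value: one new class per four added elements
ratio = m / (m + 1.0)

# Stationary form of Equation 5: p_1 = 1 - (m / (m + 1)) * p_1.
p1 = yule_pmf(1, m)
residual_p1 = abs(p1 - (1.0 - ratio * p1))

# Stationary form of Equation 4:
# p_N = (m / (m + 1)) * ((N - 1) * p_{N-1} - N * p_N).
max_residual = max(
    abs(
        yule_pmf(n, m)
        - ratio * ((n - 1) * yule_pmf(n - 1, m) - n * yule_pmf(n, m))
    )
    for n in range(2, 200)
)

# The fractions p_N should sum to one (truncated sum over class sizes).
total = sum(yule_pmf(n, m) for n in range(1, 100_001))

# Slope of log p_N against log N far in the tail; the text gives
# a density exponent of 2 + 1/m, i.e. a slope of -(2 + 1/m).
n_big = 10_000
slope = (
    math.log(yule_pmf(2 * n_big, m)) - math.log(yule_pmf(n_big, m))
) / math.log(2.0)

print(f"residual for size 1:   {residual_p1:.2e}")
print(f"max residual, N >= 2:  {max_residual:.2e}")
print(f"sum of p_N:            {total:.6f}")
print(f"tail slope:            {slope:.4f} (theory {-(2.0 + 1.0 / m):.4f})")
```

The same check can be repeated for other values of $m$ to see the cumulative exponent $\beta = 1 + 1/m$ of Equation 7 move between 1 and 2.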
... referees. At the same time, the categories are not a mere set, but form a tree structure, which itself grows in two quantities: in the number of nodes (categories) and in the number of in-degrees of nodes (child-to-parent links, i.e. subcategory-to-category links). Based on the rules for voluntary referees of the DMOZ on how to classify websites, we propose a simple combined description of the process. Altogether, the database grows in three quantities:

(i) Growth in websites. New websites are assigned into categories i, with probability $p(i) \propto N_i$ (Figure 5). This assignment happens independently of the hierarchy level of the category. However, only leaf categories may receive documents.

Figure 5: (i): A website is assigned to existing categories with $p(i) \propto N_i$.

(ii) Growth in categories. With probability 1/m, the referees assign a website into a newly created category, at any level of the hierarchy (Figure 6). This assumption would suffice to create a power law in the category size distribution, but since a tree structure among categories exists, we also assume that the event of category creation attaches at particular places to the tree structure. The probability v(i) that a category is created as the child of a certain parent category i can depend in addition on the in-degree $d_i$ of that category (see Equation 9).

Figure 6: (ii): Growth in categories is equivalent to growth of the tree structure in terms of in-degrees.

(iii) Growth in children categories. Finally, the hierarchy may also grow in terms of levels, since with a certain probability $(1 - w)$, new children categories are assigned independently of the number of children, i.e. ...

Figure 7: (iii): Growth in children categories.

Equation 8, with its category-specific term set to 1, would suffice to explain the power laws in the in-degrees $dg_i$ and in the category sizes $N_i$. To link the two processes more plausibly, it can be assumed that the second term in Equation 9, denoting the assignment of new first children, depends on the size $N_i$ of parent categories through the ratio $N_i / N$ (Equation 10), since this is closer to the rules by which the referees create new categories, but it is not essential for the explanation of the power laws. It reflects that the bigger a leaf category, the higher the probability that referees create a child category when assigning a new website to it.

To summarize, the central idea of this joint model is to consider two measures for the size of a category: the number of its websites $N_i$ (which governs the preferential attachment of new websites), and its in-degree, i.e. the number of its children $dg_i$, which governs the preferential attachment of new categories. To explain the power law in the category sizes, assumptions (i) and (ii) are the requirements. For the power law in the number of in-degrees, assumptions (ii) and (iii) are the requirements. The two empirically found exponents, 1.1 and 1.9, yield a frequency of new categories 1/m = 0.1 and a frequency of new in-degrees $(1 - w) = 0.9$.

3.4 Other interpretations
Instead of assuming in Equations 9 and 10 that referees decide to open a single child category, it is more realistic to assume that an existing category is restructured, i.e. one or several child categories are created, and websites are moved into these new categories such that the parent category contains fewer websites or even none at all. If one of the new children categories inherits all websites of the parent category (see Figure 8), the Yule model applies directly. If the websites are partitioned differently, the model contains effective shrinking of categories. This is not described by the Yule model, and the master Equation 4 considers only growing categories. However, it has been shown [29; 21] that models including shrinking categories also lead to the formation of power laws. Further generalizations compatible with power law formation are that new categories do not necessarily start with one document, and that the frequency of new categories does not need to be constant.
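The growth steps (i) to (iii) can be illustrated with a small simulation. The sketch below is not the authors' code: the function name `simulate_taxonomy` is hypothetical, the restriction that only leaf categories receive websites is ignored for brevity, and step (iii) is assumed to pick a parent uniformly at random, since Equation 9 is not reproduced here. The defaults m = 10 and w = 0.1 follow the frequencies 1/m = 0.1 and (1 - w) = 0.9 quoted in the text.

```python
import random


def simulate_taxonomy(n_websites, m=10, w=0.1, seed=0):
    """Toy version of growth steps (i)-(iii); returns (sizes, children)."""
    rng = random.Random(seed)
    sizes = [1]        # sizes[i]: number of websites N_i in category i
    children = [0]     # children[i]: in-degree (number of child categories) of i
    site_owner = [0]   # one entry per website, so rng.choice samples p(i) ~ N_i
    child_owner = []   # one entry per child link, so rng.choice samples ~ in-degree
    for _ in range(n_websites - 1):
        if rng.random() < 1.0 / m:
            # (ii) the website opens a new category, attached somewhere in the tree
            if child_owner and rng.random() < w:
                # preferential attachment in the in-degree of the parent
                parent = rng.choice(child_owner)
            else:
                # (iii) parent chosen independently of its in-degree
                # (uniform choice is an ASSUMPTION of this sketch)
                parent = rng.randrange(len(sizes))
            children[parent] += 1
            child_owner.append(parent)
            sizes.append(1)
            children.append(0)
            cat = len(sizes) - 1
        else:
            # (i) the website joins an existing category with p(i) ~ N_i
            cat = rng.choice(site_owner)
            sizes[cat] += 1
        site_owner.append(cat)
    return sizes, children


if __name__ == "__main__":
    sizes, children = simulate_taxonomy(20000)
    print(len(sizes), "categories; largest holds", max(sizes), "websites")
```

Sorting `sizes` and `children` in decreasing order and plotting them against rank on log-log axes should give roughly straight lines if the toy process behaves like the model described above. The exact slopes depend on the simplifications made here.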
Figure 8: Model without and with shrinking categories. In the left figure, a child category inherits all the elements of its parent and takes its place in the size distribution.

Figure 11: Number of features vs rank distribution. (x-axis: category size in features d.)

Figure 10: Heaps' law: number of distinct words vs. number of words, and vs. number of documents. (Left panel: nb of features vs. nb of words, fitted exponent 0.59. Right panel: nb of features vs. nb of docs in collection, fitted exponent 0.53.)
5. SPACE COMPLEXITY OF LARGE-SCALE HIERARCHICAL CLASSIFICATION
Fat-tailed distributions in large-scale web taxonomies highlight the underlying structure and semantics, which are useful to visualize important properties of the data, especially in big data scenarios. In this section we focus on applications in the context of large-scale hierarchical classification, wherein the fit of a power law distribution to such taxonomies can be leveraged to concretely analyse the space complexity of large-scale hierarchical classifiers, in the context of a generic linear classifier deployed in a top-down hierarchical cascade.
In the following sections we first formally present the task of hierarchical classification and then proceed to the space complexity analysis for large-scale systems. Finally, we empirically validate the derived bounds.

5.1 Hierarchical Classification
In single-label multi-class hierarchical classification, the training set can be represented by $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$. In the context of text classification, $x^{(i)} \in \mathcal{X}$ denotes the vector representation of document i in an input space $\mathcal{X} \subseteq \mathbb{R}^d$. The hierarchy in the form of a rooted tree is given by $G = (V, E)$, where $V \supseteq Y$ denotes the set of nodes of G, and E denotes the set of edges with parent-to-child orientation. The leaves of the tree, which usually form the set of target classes, are given by $Y = \{u \in V : \nexists v \in V, (u, v) \in E\}$. Assuming that there are K classes, the label $y^{(i)} \in Y$ represents the class associated with the instance $x^{(i)}$. The hierarchical relationship among categories implies a transition from generalization to specialization as one traverses any path from the root towards the leaves. This implies that the documents which are assigned to a particular leaf also belong to the inner nodes on the path from the root to that leaf node.

5.2 Space Complexity
The prediction speed for large-scale classification is crucial for its application in many scenarios of practical importance. It has been shown in [32; 3] that hierarchical classifiers are usually faster at training and test time than flat classifiers. However, given the large physical memory of modern systems, what also matters in practice is the size of the trained model with respect to the available physical memory. We, therefore, compare the space complexity of hierarchical and flat methods, which governs the size of the trained model in large-scale classification. The goal of this analysis is to determine the conditions under which the size of the hierarchically trained linear model is lower than that of the flat model.
As a prototypical classifier, we use a linear classifier of the form $w^T x$, which can be obtained using standard algorithms such as Support Vector Machine or Logistic Regression. In this work, we apply one-vs-all L2-regularized L2-loss support vector classification, as it has been shown to yield state-of-the-art performance in the context of large-scale text classification [12]. For flat classification one stores weight vectors $w_y, \forall y$, and hence, in a K class problem in a d dimensional feature space, the space complexity for flat classification is:

$$Size_{flat} = d \times K \tag{12}$$

which represents the size of the matrix consisting of K weight vectors, one for each class, spanning the entire input space.
We need a more sophisticated analysis for computing the space complexity for hierarchical classification. In this case, the total number of weight vectors is much larger, since these are computed for all the nodes in the tree and not only for the leaves as in flat classification. In spite of this, the size of the hierarchical model can be much smaller than that of the flat model in large-scale classification. Intuitively, when the feature set size is high (top levels in the hierarchy), the number of classes is low, and on the contrary, when the number of classes is high (at the bottom), the feature set size is low.
In order to analytically compare the relative sizes of hierarchical and flat models in the context of large-scale classification, we assume power law behaviour with respect to the number of features, across levels in the hierarchy. More precisely, if the categories at a level in the hierarchy are ordered with respect to the number of features, we observe a power law behaviour. This has also been verified empirically, as illustrated in Figure 12 for various levels in the hierarchy, for one of the datasets used in our experiments. More formally, the feature size $d_{l,r}$ of the r-th ranked category, according to the number of features, for level l, $1 \le l \le L - 1$, is given by:

$$d_{l,r} \approx d_{l,1} \, r^{-\beta_l} \tag{13}$$
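The definitions above translate directly into a few lines of code. The following sketch is only illustrative: the helper names `leaves`, `size_flat` and `feature_size` are hypothetical, and `feature_size` assumes that Equation 13 has the rank-size form $d_{l,r} \approx d_{l,1} r^{-\beta_l}$ with a positive exponent, called `beta` here.

```python
def leaves(nodes, edges):
    """Target classes Y: nodes of the tree with no outgoing parent-to-child edge."""
    parents = {u for (u, v) in edges}
    return {u for u in nodes if u not in parents}


def size_flat(d, K):
    """Equation 12: K weight vectors, each of dimension d."""
    return d * K


def feature_size(d_top, rank, beta):
    """Assumed form of Equation 13: feature-set size of the category of a given
    rank at one level, where d_top is the size of the top-ranked category."""
    return d_top * rank ** (-beta)


if __name__ == "__main__":
    # A toy hierarchy with made-up node names.
    nodes = {"root", "a", "b", "a1", "a2"}
    edges = {("root", "a"), ("root", "b"), ("a", "a1"), ("a", "a2")}
    classes = leaves(nodes, edges)
    print(sorted(classes))
    print(size_flat(d=1000, K=len(classes)))
    print(feature_size(d_top=1000, rank=4, beta=0.5))
```

In the toy hierarchy the inner node "a" is not a target class, which matches the definition of Y as the set of leaves. A flat classifier would store one d-dimensional weight vector per element of `classes`.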
Figure 12: Power-law variation for features in different levels for the LSHTC2-a dataset. The Y-axis represents the feature set size, plotted against the rank of the categories on the X-axis.

We now state a proposition that shows that, under some conditions on the depth of the hierarchy, its number of leaves, its branching factors and power law parameters, the size of a hierarchical classifier is below that of its flat version.

Proposition 1. For a hierarchy of categories of depth L and K leaves, let $\beta = \min_{1 \le l \le L} \beta_l$ and $b = \max_{l,r} b_{l,r}$. Denoting the space complexity of a hierarchical classification model by $Size_{hier}$ and the one of its corresponding flat version by $Size_{flat}$, one has:

$$\text{For } \beta > 1, \text{ if } \beta > \frac{K}{K - b(L-1)} \; (> 1), \text{ then } Size_{hier} < Size_{flat} \tag{15}$$

$$\text{For } 0 < \beta < 1, \text{ if } \frac{b^{(L-1)(1-\beta)} - 1}{b^{(1-\beta)} - 1} < \frac{1-\beta}{b} K, \text{ then } Size_{hier} < Size_{flat} \tag{16}$$

Proof. As $d_{l,1} \le d_1$ and $B_l \le b^{(l-1)}$ for $1 \le l \le L$, one has, from Equation 14 and the definitions of $\beta$ and b:

$$Size_{hier} \le b \, d_1 \sum_{l=1}^{L-1} \sum_{r=1}^{b^{(l-1)}} r^{-\beta}$$

One can then bound $\sum_{r=1}^{b^{(l-1)}} r^{-\beta}$ using ([32]):

$$\sum_{r=1}^{b^{(l-1)}} r^{-\beta} < \frac{b^{(l-1)(1-\beta)} - \beta}{1 - \beta} \quad \text{for } \beta \ne 0, 1 \tag{17}$$

Summing this bound over the levels $l = 1, \dots, L-1$ gives, for $\beta \ne 0, 1$:

$$Size_{hier} < b \, d_1 \left[ \frac{b^{(L-1)(1-\beta)} - 1}{(b^{(1-\beta)} - 1)(1 - \beta)} - (L-1)\frac{\beta}{1-\beta} \right] \tag{18}$$

For $\beta > 1$ (and $b > 1$), the first term in the brackets is $< 0$. Therefore, Inequality 18 can be re-written as:

$$Size_{hier} < b \, d_1 (L-1) \frac{\beta}{\beta - 1}$$

Using our notation, the size of the corresponding flat classifier is $Size_{flat} = K d_1$, where K denotes the number of leaves. Thus:

$$\text{If } \beta > \frac{K}{K - b(L-1)} \; (> 1), \text{ then } Size_{hier} < Size_{flat}$$

which proves Condition 15.
The proof for Condition 16 is similar: assuming $0 < \beta < 1$, it is this time the second term in Equation 18, $-(L-1)\frac{\beta}{1-\beta}$, which is negative, so that one obtains:

$$Size_{hier} < b \, d_1 \frac{b^{(L-1)(1-\beta)} - 1}{(b^{(1-\beta)} - 1)(1 - \beta)}$$

and then:

$$\text{If } \frac{b^{(L-1)(1-\beta)} - 1}{b^{(1-\beta)} - 1} < \frac{1-\beta}{b} K, \text{ then } Size_{hier} < Size_{flat}$$

which concludes the proof of the proposition.

It can be shown, but this is beyond the scope of this paper, that Condition 16 is satisfied for a range of values of $\beta \in \, ]0, 1[$. However, as is shown in the experimental part, it is Condition 15 of Proposition 1 that holds in practice.
The previous proposition complements the analysis presented in [32], in which it is shown that the training and test time of hierarchical classifiers is substantially decreased with respect to that of their flat counterparts. In this work we show that the space complexity of hierarchical classifiers is also better, under a condition that holds in practice, than that of their flat counterparts. Therefore, for large-scale taxonomies whose feature size distribution exhibits power law decay, hierarchical classifiers should be better in terms of speed than flat ones, due to the following reasons:

1. As shown above, the space complexity of a hierarchical classifier is lower than that of flat classifiers.
2. For K classes, only O(log K) classifiers need to be evaluated per test document, as against O(K) classifiers in flat classification.

In order to empirically validate the claim of Proposition 1, we measured the trained model sizes of a standard top-down hierarchical scheme (TD), which uses a linear classifier at each parent of the hierarchy, and the flat one.
We use the publicly available DMOZ data of the LSHTC challenge, which is a subset of Directory Mozilla. More specifically, we used the large dataset of the LSHTC-2010
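Condition 15 of Proposition 1 and the bound used in its proof can be checked numerically. The sketch below is an illustration under assumptions, not the authors' experimental code: the function names are hypothetical, `beta` stands for the smallest power-law exponent across levels, and the hierarchy in the usage example (b = 3, L = 4, K = 27, d1 = 10000) is made up.

```python
def hier_size_bound(b, d1, L, beta):
    """Upper bound from the proof:
    b * d1 * sum_{l=1}^{L-1} sum_{r=1}^{b^(l-1)} r^(-beta)."""
    total = 0.0
    for l in range(1, L):
        total += sum(r ** (-beta) for r in range(1, b ** (l - 1) + 1))
    return b * d1 * total


def condition_15(K, b, L, beta):
    """Condition 15: beta > 1 and beta > K / (K - b(L-1)).
    The denominator must be positive for the threshold to exceed 1."""
    denom = K - b * (L - 1)
    return beta > 1 and denom > 0 and beta > K / denom


if __name__ == "__main__":
    b, L, K, d1, beta = 3, 4, 27, 10000, 2.0
    flat = K * d1  # Size_flat = K * d1, as in the proof
    print(condition_15(K, b, L, beta))
    print(hier_size_bound(b, d1, L, beta) < flat)
```

When `condition_15` returns true, the proof guarantees that the bound, and therefore the hierarchical model size, stays below the flat model size. The direct summation in `hier_size_bound` makes it easy to see how much slack the closed-form bound leaves for a given hierarchy.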
gregory@kdnuggets.com
ABSTRACT
We discuss the most important database research advances, industry developments, the role of relational and NoSQL databases, Computing Reality, Data Curation, Cloud Computing, the Tamr and Jisto startups, what Michael Brodie learned as Chief Scientist of Verizon, Knowledge Discovery, Privacy Issues, and more.

Keywords
Data Curation, NoSQL, Cloud Computing, Verizon, Privacy, Computing Reality.

1. INTRODUCTION
I had the pleasure of working with Michael Brodie when we were both at GTE Laboratories in the 1990s, where he was already a world-famous researcher and a department manager. I recently met him at another conference, and our discussion led to this interview. Michael is still very sharp, very active, and busy - he answered these questions while flying from Boston to Doha, Qatar, where he is advising Qatar Computing Research Institute.
Parts of this interview were published in KDnuggets [1-3].

2. BACKGROUND
Dr. Michael L. Brodie [4] has served as Chief Scientist of a Fortune 20 company, an Advisory Board member of leading national and international research organizations, and an invited speaker and lecturer. In his role as Chief Scientist Dr. Brodie has researched and analyzed challenges and opportunities in advanced technology, architecture, and methodologies for Information Technology strategies. He has guided advanced deployments of emergent technologies at industrial scale, most recently Cloud Computing and Big Data.

Throughout his career Dr. Brodie has been active in both advanced, academic research and large-scale industrial practice, attempting to obtain mutual benefits from the industrial deployment of innovative technologies while helping research to understand industrial requirements and constraints. He has contributed to multi-disciplinary problem solving at scale in contexts such as Terrorism and Individual Privacy, and Information Technology Challenges in Healthcare Reform.

3. INTERVIEW

Gregory Piatetsky: You have started as a researcher in Databases (PhD from Toronto) and had a very distinguished and varied career spanning academia, industry, and government, in US, Europe, Australia, and Latin America over the last 25+ years. From your unique vantage point, what were 3 most important database research advances?

Michael Brodie: Three most important database research advances:

1. Ted Codd's Relational model of data (1970) is the most important database research advance as it launched what is now a $28 BN/year market still growing at 11% CAGR with over 215 RDBMSs on the market. More important to me it launched four decades of amazing research advances starting with query optimization (Selinger) and transactions (Gray) and innovation that has probably grown at 20% CAGR.
2. The next most important research advance or stage was a change in perspective that specific domains require their own DBMS such as graph databases, array stores, document stores, key-value stores, NoSQL, NewSQL, and many more to come. DB-Engines.com lists twelve DBMS categories thus bumping the database world from managing 8% of the world's data to about 12% but due to the growth of non-database data back to 10%. Soon, due to the role of data in our digitized world there will be data management systems for many more domains. While this is amazingly cool, how do we solve multi-disciplinary (multi-data domain) problems in a consistent rather than disjoint way?
3. The next most important research advance is just emerging and is mind blowing. I call it Computing Reality, acknowledging that every datum (every real world observation) is not definitive but probabilistic. Unlike conventional databases and more like reality, Computing Reality has no single version of truth. How do we model such worlds, more realistic worlds and compute over them? The simple answer is that it is already in Big Data sources. There are many related attempts to address Computing Reality including social computing, probabilistic computing, probabilistic databases, Open Worlds in AI, Web Science, Approximate Computing, Crowd Computing, and more. Perhaps this will be the next generation of computing.
MB: Alas the database industry, like all industries, has a legacy problem that stifles innovation. It has taken over 30 years to emerge from the relational era. The most important recent database industry development came from outside the database industry: it is Big Data and its marketing arm called MapReduce and its data sidekicks, Hadoop and NoSQL. Frankly, the database industry has been insular and protected its relational turf for FAR too long. Smart folks at Yahoo!, Google and other places saw value in data, non-database data, and thus emerged MapReduce, Hadoop, and NoSQL, generally crappy database ideas, but it woke up the database industry¹. Hadoop and NoSQL are growing in demand. In time it will be seen that they are amazing for a very specific problem domain, embarrassingly parallel problems, but it is a money pit for everything else. The importance of MapReduce is that it forced the database industry to get out of their hammocks.

GP: What is the role of Relational Databases, NoSQL databases, Graph databases, and other databases today?

MB: Relational Databases have two extremely well established roles. Conventional row stores serve the OLTP community as the backbone of enterprise operations. These blindingly fast transaction processors are moving in-memory. OLTP stores are modest in number and size (< 1 TB), growing and declining in lock step with business growth and decline. Column stores, OLAP, are the backbone of data warehouses and until recently business intelligence. In general there are huge numbers of these, often of very large size in the Petabyte and Exabyte range. This is where Big Data battle lines are being drawn. What fun!!

This is also where we turn from polishing the relational round ball [5] and focus on the other dozen or so DBMS categories. Taking over is relative; none of the 12 other categories has more than 3% of the database market. Graph databases serve graph applications like networking in communications, telecom, social networks, and of course NSA applications! But what is wonderful about these emerging classes of data-domain specific DBMSs is that we are only now discovering the rich use cases that they serve.

The use cases define the DBMSs and the DBMSs help formulate the use cases. SciDB is a superb example of managing scientific data and computation at scale. It is awkward for both communities: database folks who don't speak linear algebra or matrices, and scientists who only speak R. Exciting times. For a little fun look at the database-engines list [6].

Database Engines
Popularity changes per category, April 2014, over 1 year
• Graph DBMS growing dramatically 3.5X
• Wide column stores 2X
• Document stores 2X
• Native XML DBMS 1.5X
• Key-value stores 1.5X
• Search engines 1.5X
• RDF stores 1.5X
• Object oriented DBMS - flat
• Multivalue DBMS - flat
• Relational DBMS - flat

GP: You have held an amazing variety of positions in academy, industry, government organization, VC firms, and start-ups, in US, Brazil, Canada, Australia, and Europe. Which 3 positions were most satisfying to you and why?

MB: What a great question. Thank you for asking because it caused me to think about what I have really enjoyed over 40 years. Somehow CSAIL at MIT and the Faculty of Computing and Communications at EPFL jump to mind.

There are scary smart people at those places. Like climbing mountains it both scares and exhilarates me. To be frank my jobs at big enterprises in hindsight are confusing. I guess I was window dressing because my role did not feel like it had impact. So getting motivated and scared at MIT and EPFL are probably top, so there's number one. Why? Just look down 5,000 feet and ask why am I here?
A pernicious aspect of What are the biases that we bring to it. On a personal note, my biased recall of 1989 was how marvelous your ideas were and the amazing potential of data mining. I accept your view that I was skeptical rather than enthusiastic as I recall. You see I modified reality to fit my desire to be on the winning side, which I was not then. Hence, what we think that we thought may bear little resemblance to reality or, more precisely, other people's reality. As Richard Feynman said,

"The first principle is that you must not fool yourself - and you are the easiest person to fool."

That said, I see the main successes of this trend as a nascent trajectory along the lines of Big Data, Data Analytics, Business Intelligence, Data Science, and whatever the current trendy term is. The World of What is phenomenal: machines proposing potential correlations that are beyond our ability to identify. Humans consider seven plus or minus 2 variables at a time, a rather simple model, while models, such as Machine Learning, can consider millions or billions of variables at a time. Yet 95% (or even 99.99999%) of the resulting correlations may be meaningless. For example, ~99% of credit card transactions are legitimate with less than 1% that are fraudulent, yet the 1% can kill the profits of a bank. So precision and outlier cases, called anomalies in science, can matter. So it pays to search for apparently anomalous behavior as it is happening!

We have already seen massive benefits of Big Data in the stock market, electoral predictions, marketing success, and many more that underlie the Big Data explosion. Yet there is a potential Big ...

MB: As an undergraduate at the University of Toronto, I was extremely fortunate to have had Kelly Gotlieb, the Father of Computing in Canada, as a mentor. I was a student in his 1971 course, Computers and Society, later to become the first book on the topic. Kelly and the issues, including privacy, have resonated with me throughout my career. Kelly observed that privacy, like many other cultural norms, varies over time. So yes, privacy will fluctuate from Alan Westin's notion of determining how your personal information is communicated to the Facebook-esque "Get over it".

While personal privacy is undergoing significant change, disclosure of information assets that are part of the digital economy or of government or corporate strategy may have very significant impacts on our economy and democracy. Hence, this raises issues of security, protection, and cultural and social issues too complex to be treated here.

However, there are a number of very smart people looking at various aspects. The quote you cite is from Craig Mundie [13], who explores changes that Big Data brings, debating the balancing of economic versus privacy issues.

Very smart folks, like Butler Lampson and Mike Stonebraker, are commenting on practical solutions to this age-old problem. Their arguments are along the following lines. Due to the massive scale of Big Data, and what I call Computing Reality, previously top-down solutions for security, such as anticipating and preventing security breaches, will simply not scale to Big Data. They must be
Data Winter ahead if people blindly apply Big Data and more augmented with new approaches including bottom-up solutions
specifically Machine Learning. The failures concern limited such as Stonebrakers logging to detect and stem previously
models of phenomena and the human tendency of bias. People can unanticipated security breaches and Weitzners accountable
and do use What (Big Data, etc.) to support their biases and systems.
limited models, e.g., used to support the claim of the absence of
climate change or lack of human impact on climate change, rather To beat the Heartbleed bug and others like it, Organizations need
GP: What interesting technical developments do you expect in
Database and Cloud Technology in the next 5 years?

MB: I call the Big Picture Computing Reality, in which we model
the world from whatever reasonable perspectives emerge from the
data and are appropriate, e.g., have veracity, and make decisions
symbiotically with machines and people collaborating to optimize
resources while achieving measures of veracity for each result.

One subspace of this world is what we currently know with high
levels of confidence, the type of information that we store in
relational databases. Another encompassing space is what we
know but forgot or don't want to remember (unknown knowns),
and a third is what we speculate but do not know (known
unknowns); these are all the hypotheses that we make but do not
know in science, business, and life.

The rest of the data space - unknown unknowns - is infinite;
otherwise learning would be at an end. That is the space of
discovery.

(Michael Brodie and his son on a peak in New Hampshire)

My activities include the gym (4 times a week); hiking/climbing
~75 mountains in the USA, Nepal, Greece, Italy, France, Switzerland,
and even Australia; 42 of the 48 4,000-footers in NH (most with
Mike Stonebraker); cooking (daily and special occasions with my
son Justin, an amazing chef and brewer, when he's not doing his
PhD); travel; and my garden; all of these - except the gym and
garden - with family and close friends.

Very cool Big Data Books:
Big Data: A Revolution That Will Transform How We Live, Work,
and Think, by Viktor Mayer-Schönberger and Kenneth Cukier,
Houghton Mifflin Harcourt, confused and inspired me; then
The Signal and the Noise: Why So Many Predictions Fail-but
Some Don't, by Nate Silver, Penguin Press, inspired me.