
Open Challenges for Data Stream Mining Research
Georg Krempl, University Magdeburg, Germany, georg.krempl@iti.cs.uni-magdeburg.de
Indre Zliobaite, Aalto University and HIIT, Finland, indre.zliobaite@aalto.fi
Dariusz Brzezinski, Poznan U. of Technology, Poland, dariusz.brzezinski@cs.put.poznan.pl
Eyke Hullermeier, University of Paderborn, Germany, eyke@upb.de
Mark Last, Ben-Gurion U. of the Negev, Israel, mlast@bgu.ac.il
Vincent Lemaire, Orange Labs, France, vincent.lemaire@orange.com
Tino Noack, TU Cottbus, Germany, noacktin@tu-cottbus.de
Ammar Shaker, University of Paderborn, Germany, ammar.shaker@upb.de
Sonja Sievi, Astrium Space Transportation, Germany, sonja.sievi@astrium.eads.net
Myra Spiliopoulou, University Magdeburg, Germany, myra@iti.cs.uni-magdeburg.de
Jerzy Stefanowski, Poznan U. of Technology, Poland, jerzy.stefanowski@cs.put.poznan.pl

ABSTRACT
Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered as one of the main sources of what is called big data. While predictive modeling for data streams and big data has received a lot of attention over the last decade, many research approaches are typically designed for well-behaved, controlled problem settings, overlooking important challenges imposed by real-world applications. This article presents a discussion on eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve such problems as: protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream mining algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.

1. INTRODUCTION
The volumes of automatically generated data are constantly increasing. According to the Digital Universe Study [18], over 2.8 ZB of data were created and processed in 2012, with a projected increase of 15 times by 2020. This growth in the production of digital data results from our surrounding environment being equipped with more and more sensors. People carrying smart phones produce data, database transactions are being counted and stored, and streams of data are extracted from virtual environments in the form of logs or user generated content. A significant part of such data is volatile, which means it needs to be analyzed in real time as it arrives. Data stream mining is a research field that studies methods and algorithms for extracting knowledge from volatile streaming data [14; 5; 1]. Although data streams, online learning, big data, and adaptation to concept drift have become important research topics during the last decade, truly autonomous, self-maintaining, adaptive data mining systems are rarely reported. This paper identifies real-world challenges for data stream research that are important but still unsolved. Our objective is to present to the community a position paper that could inspire and guide future research in data streams. This article builds upon discussions at the International Workshop on Real-World Challenges for Data Stream Mining (RealStream, http://sites.google.com/site/realstream2013) in September 2013, in Prague, Czech Republic.

Several related position papers are available. Dietterich [10] presents a discussion focused on predictive modeling techniques that are applicable to streaming and non-streaming data. Fan and Bifet [12] concentrate on challenges presented by large volumes of data. Zliobaite et al. [48] focus on concept drift and adaptation of systems during online operation. Gaber et al. [13] discuss ubiquitous data mining with attention to collaborative data stream mining. In this paper, we focus on research challenges for streaming data inspired and required by real-world applications. In contrast to existing position papers, we raise issues connected not only with large volumes of data and concept drift, but also such practical problems as privacy constraints, availability of information, and dealing with legacy systems.

The scope of this paper is not restricted to algorithmic challenges; it aims to cover the full cycle of knowledge discovery from data (CRISP [40]), from understanding the context of the task, to data preparation, modeling, evaluation, and deployment. We discuss eight challenges: making models simpler, protecting privacy and confidentiality, dealing with legacy systems, stream preprocessing, timing and availability of information, relational stream mining, analyzing event data, and evaluation of stream mining algorithms. Figure 1 illustrates the positioning of these challenges in the CRISP cycle. Some of these apply to traditional (non-streaming) data mining as well, but they are critical in streaming environments. Along with further discussion of these challenges, we present our position on where the forthcoming focus of research and development efforts should be directed to address these challenges.

In the remainder of the article, section 2 gives a brief introduction to data stream mining, sections 3 to 7 discuss each identified challenge, and section 8 highlights action points for future research.
SIGKDD Explorations Volume 16, Issue 1 Page 1

Figure 1: CRISP cycle with data stream research challenges.

2. DATA STREAM MINING
Mining big data streams faces three principal challenges: volume, velocity, and volatility. Volume and velocity require a high volume of data to be processed in limited time. Starting from the first arriving instance, the amount of available data constantly increases from zero to potentially infinity. This requires incremental approaches that incorporate information as it becomes available, and online processing if not all data can be kept [15]. Volatility, on the other hand, corresponds to a dynamic environment with ever-changing patterns. Here, old data is of limited use, even if it could be saved and processed again later. This is due to change, which can affect the induced data mining models in multiple ways: change of the target variable, change in the available feature information, and drift.

Changes of the target variable occur for example in credit scoring, when the definition of the classification target "default" versus "non-default" changes due to business or regulatory requirements. Changes in the available feature information arise when new features become available, e.g. due to a new sensor or instrument. Similarly, existing features might need to be excluded due to regulatory requirements, or a feature might change in its scale, if data from a more precise instrument becomes available. Finally, drift is a phenomenon that occurs when the distributions of features x and target variables y change in time. The challenge posed by drift has been subject to extensive research, thus we provide here solely a brief categorization and refer to recent surveys like [17].

In supervised learning, drift can affect the posterior P(y|x), the conditional feature P(x|y), the feature P(x) and the class prior P(y) distribution. The distinction based on which distribution is assumed to be affected, and which is assumed to be static, serves to assess the suitability of an approach for a particular task. It is worth noting that the problem of changing distributions is also present in unsupervised learning from data streams.

A further categorization of drift can be made by:

- smoothness of concept transition: Transitions between concepts can be sudden or gradual. The former is sometimes also denoted in literature as shift or abrupt drift.

- singular or recurring contexts: In the former case, a model becomes obsolete once and for all when its context is replaced by a novel context. In the latter case, a model's context might reoccur at a later moment in time, for example due to a business cycle or seasonality; therefore, obsolete models might still regain value.

- systematic or unsystematic: In the former case, there are patterns in the way the distributions change that can be exploited to predict change and perform faster model adaptation. Examples are subpopulations that can be identified and show distinct, trackable evolutionary patterns. In the latter case, no such patterns exist and drift occurs seemingly at random. An example for the latter is fickle concept drift.

- real or virtual: While the former requires model adaptation, the latter corresponds to observing outliers or noise, which should not be incorporated into a model.

Stream mining approaches in general address the challenges posed by volume, velocity and volatility of data. However, in real-world applications these three challenges often coincide with other, to date insufficiently considered ones.

The next sections discuss eight identified challenges for data stream mining, providing illustrations with real world application examples, and formulating suggestions for forthcoming research.

3. PROTECTING PRIVACY AND CONFIDENTIALITY
Data streams present new challenges and opportunities with respect to protecting privacy and confidentiality in data mining. Privacy preserving data mining has been studied for over a decade (see, e.g., [3]). The main objective is to develop data mining techniques that do not uncover information or patterns which compromise confidentiality and privacy obligations. Modeling can be done on original or anonymized data, but when the model is released, it should not contain information that may violate privacy or confidentiality. This is typically achieved by controlled distortion of sensitive data by modifying the values or adding noise.

Ensuring privacy and confidentiality is important for gaining the trust of users and society in autonomous stream data mining systems. While in offline data mining a human analyst working with the data can do a sanity check before releasing the model, in data stream mining privacy preservation needs to be done online. Several existing works relate to privacy preservation in publishing streaming data (e.g. [46]), but no systematic research in relation to broader data stream challenges exists.

We identify two main challenges for privacy preservation in mining data streams. The first challenge is incompleteness of information. Data arrives in portions and the model is updated online. Therefore, the model is never final and it is difficult to judge privacy preservation before seeing all the data. For example, suppose GPS traces of individuals are being collected for modeling the traffic situation. Suppose person A at the current time travels from the campus to the airport. The privacy of this person will be compromised if there are no similar trips by other persons in the very near future. However, near future trips are unknown at the current time, when the model needs to be updated.

On the other hand, data stream mining algorithms may have some inherent privacy preservation properties due to the fact that they do not need to see all the modeling data at once, and can be incrementally updated with portions of data. Investigating privacy preservation properties of existing data stream algorithms is another interesting direction for future research.
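The GPS example above can be made concrete with a small sketch. It is not a method from this paper: it assumes a simple k-anonymity-style rule in which a trip is only passed on to the model once at least `k` distinct persons have travelled the same route within a time window, and is silently dropped otherwise. The class name `DelayedTripRelease` and its parameters are hypothetical.

```python
from collections import defaultdict, deque


class DelayedTripRelease:
    """Hold trips back until k distinct persons share the same route.

    Hypothetical illustration: a route is an (origin, destination) pair,
    and `window` is the maximum time a trip may wait for similar trips.
    """

    def __init__(self, k=3, window=60):
        self.k = k
        self.window = window
        self.pending = defaultdict(deque)  # route -> deque of (time, person)

    def add(self, time, person, origin, destination):
        """Register a trip and return the trips that may now be released."""
        route = (origin, destination)
        queue = self.pending[route]
        queue.append((time, person))
        # Trips that waited longer than the window are suppressed for good.
        # Expiry is only checked when a new trip arrives on the same route.
        while queue and time - queue[0][0] > self.window:
            queue.popleft()
        persons = {p for _, p in queue}
        if len(persons) >= self.k:
            # Release without person identifiers, then reset the buffer.
            released = [(t, route) for t, _ in queue]
            queue.clear()
            return released
        return []
```

The sketch also shows the cost of this kind of rule: the model sees a trip only after similar trips have arrived, so privacy is bought with latency, and rare trips never reach the model at all.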

The second important challenge for privacy preservation is concept drift. As data may evolve over time, fixed privacy preservation rules may no longer hold. For example, suppose winter comes, snow falls, and far fewer people commute by bike. Knowing that a person comes to work by bike and having a set of GPS traces, it may not be possible to identify this person uniquely in summer, when there are many cyclists, but it may be possible in winter. Hence, an important direction for future research is to develop adaptive privacy preservation mechanisms that would diagnose such a situation and adapt themselves to preserve privacy in the new circumstances.

4. STREAMED DATA MANAGEMENT
Most of the data stream research concentrates on developing predictive models that address a simplified scenario, in which data is already pre-processed, and completely and immediately available for free. However, successful business implementations depend strongly on the alignment of the used machine learning algorithms with both the business objectives and the available data. This section discusses often omitted challenges connected with streaming data.

4.1 Streamed Preprocessing
Data preprocessing is an important step in all real world data analysis applications, since data comes from complex environments, and may be noisy, redundant, and contain outliers and missing values. Many standard procedures for preprocessing offline data are available and well established, see e.g. [33]; however, the data stream setting introduces new challenges that have not received sufficient research attention yet.

While in traditional offline analysis data preprocessing is a once-off procedure, usually done by a human expert prior to modeling, in the streaming scenario manual processing is not feasible, as new data continuously arrives. Streaming data needs fully automated preprocessing methods that can optimize the parameters and operate autonomously. Moreover, preprocessing models need to be able to update themselves automatically along with evolving data, in a similar way as predictive models for streaming data do. Furthermore, all updates of preprocessing procedures need to be synchronized with the subsequent predictive models; otherwise, after an update in preprocessing the data representation may change and, as a result, the previously used predictive model may become useless.

Except for some studies, mainly focusing on feature construction over data streams, e.g. [49; 4], no systematic methodology for data stream preprocessing is currently available.

As an illustrative example for challenges related to data preprocessing, consider predicting traffic jams based on mobile sensing data. People using navigation services on mobile devices can opt to send anonymized data to the service provider. Service providers, such as Google, Yandex or Nokia, provide estimations and predictions of traffic jams based on this data. First, the data of each user is mapped to the road network, the speed of each user on each road segment of the trip is computed, data from multiple users is aggregated, and finally the current speed of the traffic is estimated.

There are a lot of data preprocessing challenges associated with this task. First, noisiness of GPS data might vary depending on location and load of the telecommunication network. There may be outliers, for instance, if somebody stopped in the middle of a segment to wait for a passenger, or a car broke down. The number of pedestrians using mobile navigation may vary, and require adaptive instance selection. Moreover, road networks may change over time, leading to changes in average speeds, in the number of cars and even car types (e.g. heavy trucks might be banned, new optimal routes emerge). All these issues require automated preprocessing actions before feeding the newest data to the predictive models.

The problem of preprocessing for data streams is difficult due to the nature of the data (continuously arriving and evolving). An analyst cannot know for sure what kind of data to expect in the future, and cannot deterministically enumerate possible actions. Therefore, not only models, but also the procedure itself needs to be fully automated.

This research problem can be approached from several angles. One way is to look at existing predictive models for data streams, and try to integrate them with selected data preprocessing methods (e.g. feature selection, outlier definition and removal).

Another way is to systematically characterize the existing offline data preprocessing approaches, try to find a mapping between those approaches and problem settings in data streams, and extend preprocessing approaches for data streams in the same way as traditional predictive models have been extended for data stream settings.

In either case, developing individual methods and methodology for preprocessing of data streams would bridge an important gap in the practical applications of data stream mining.

4.2 Timing and Availability of Information
Most algorithms developed for evolving data streams make simplifying assumptions on the timing and availability of information. In particular, they assume that information is complete, immediately available, and received passively and for free. These assumptions often do not hold in real-world applications, e.g., patient monitoring, robot vision, or marketing [43]. This section is dedicated to the discussion of these assumptions and the challenges resulting from their absence. For some of these challenges, corresponding situations in offline, static data mining have already been addressed in literature. We will briefly point out where a mapping of such known solutions to the online, evolving stream setting is easily feasible, for example by applying windowing techniques. However, we will focus on problems for which no such simple mapping exists and which are therefore open challenges in stream mining.

4.2.1 Handling Incomplete Information
Completeness of information assumes that the true values of all variables, that is of features and of the target, are revealed eventually to the mining algorithm.

The problem of missing values, which corresponds to incompleteness of features, has been discussed extensively for the offline, static settings. A recent survey is given in [45]. However, only a few works address data streams, and in particular evolving data streams. Thus several open challenges remain, some of which are pointed out in the review by [29]: how to address the problem that the frequency in which missing values occur is unpredictable, but largely affects the quality of imputations? How to (automatically) select the best imputation technique? How to proceed in the trade-off between speed and statistical accuracy?

Another problem is that of missing values of the target variable. It has been studied extensively in the static setting as semi-supervised learning (SSL, see [11]). A requirement for applying SSL techniques to streams is the availability of at least some labeled data from the most recent distribution. While first attempts at this problem have been made, e.g. the online manifold regularization approach in [19] and the ensembles-based approach suggested by [11], improvements in speed and the provision of performance guarantees remain open challenges. A special case of incomplete information is censored data in Event History Analysis (EHA), which is described in section 5.2. A related problem discussed below is active learning (AL, see [38]).

4.2.2 Dealing with Skewed Distributions
Class imbalance, where the class prior probability of the minority class is small compared to that of the majority class, is a frequent problem in real-world applications like fraud detection or credit scoring. This problem has been well studied in the offline setting (see e.g. [22] for a recent book on that subject), and has also been studied to some extent in the online, stream-based setting (see [23] for a recent survey). However, among the few existing stream-based approaches, most do not pay attention to drift of the minority class, and as [23] pointed out, a more rigorous evaluation of these algorithms on real-world data has yet to be done.

4.2.3 Handling Delayed Information
Latency means information becomes available with significant delay. For example, in the case of so-called verification latency, the value of the preceding instance's target variable is not available before the subsequent instance has to be predicted. On evolving data streams, this is more than a mere problem of streaming data integration between feature and target streams, as due to concept drift patterns show temporal locality [2]. It means that feedback on the current prediction is not available to improve the subsequent predictions, but will only become available for much later predictions. Thus, there is no recent sample of labeled data at all that would correspond to the most recent unlabeled data, and semi-supervised learning approaches are not directly applicable.

A related problem in static, offline data mining is that addressed by unsupervised transductive transfer learning (or unsupervised domain adaptation): given labeled data from a source domain, a predictive model is sought for a related target domain in which no labeled data is available. In principle, ideas from transfer learning could be used to address latency in evolving data streams, for example by employing them in a chunk-based approach, as suggested in [43]. However, adapting them for use in evolving data streams has not been tried yet and constitutes a non-trivial, open task, as adaptation in streams must be fast and fully automated and thus cannot rely on iterated careful tuning by human experts.

Furthermore, consecutive chunks constitute several domains, thus the transitions between several subsequent chunks might provide exploitable patterns of systematic drift. This idea has been introduced in [27], and a few so-called drift-mining algorithms that identify and exploit such patterns have been proposed since then. However, the existing approaches cover only a very limited set of possible drift patterns and scenarios.

4.2.4 Active Selection from Costly Information
The challenge of intelligently selecting among costly pieces of information is the subject of active learning research. Active stream-based selective sampling [38] describes a scenario in which instances arrive one-by-one. While the instances' feature vectors are provided for free, obtaining their true target values is costly, and the definitive decision whether or not to request this target value must be taken before proceeding to the next instance. This corresponds to a data stream, but not necessarily to an evolving one. As a result, only a small subset of stream-based selective sampling algorithms is suited for non-stationary environments. To make things worse, many contributions do not state explicitly whether they were designed for drift, nor do they provide experimental evaluations on such evolving data streams, thus leaving the reader the arduous task of assessing their suitability for evolving streams. A first, recent attempt to provide an overview on the existing active learning strategies for evolving data streams is given in [43]. The challenges for active learning posed by evolving data streams are:

- uncertainty regarding convergence: in contrast to learning in static contexts, due to drift there is no guarantee that with additional labels the difference between model and reality narrows down. This leaves the formulation of suitable stop criteria a challenging open issue.

- necessity of perpetual validation: even if there has been convergence due to some temporary stability, the learned hypotheses can get invalidated at any time by subsequent drift. This can affect any part of the feature space and is not necessarily detectable from unlabeled data. Thus, without perpetual validation the mining algorithm might lock itself to a wrong hypothesis without ever noticing.

- temporal budget allocation: the necessity of perpetual validation raises the question of optimally allocating the labeling budget over time.

- performance bounds: in the case of drifting posteriors, no theoretical work exists that provides bounds for errors and label requests. However, deriving such bounds will also require assuming some type of systematic drift.

The task of active feature acquisition, where one has to actively select among costly features, constitutes another open challenge on evolving data streams: in contrast to the static, offline setting, the value of a feature is likely to change with its drifting distribution.

5. MINING ENTITIES AND EVENTS
Conventional stream mining algorithms learn over a single stream of arriving entities. In subsection 5.1, we introduce the paradigm of entity stream mining, where the entities constituting the stream are linked to instances (structured pieces of information) from further streams. Model learning in this paradigm involves the incorporation of the streaming information into the stream of entities; learning tasks include cluster evolution, migration of entities from one state to another, and classifier adaptation as entities re-appear with another label than before.

Then, in subsection 5.2, we investigate the special case where entities are associated with the occurrence of events. Model learning then implies identifying the moment of occurrence of an event on an entity. This scenario might be seen as a special case of entity stream mining, since an event can be seen as a degenerate instance consisting of a single value (the event's occurrence).

5.1 Entity Stream Mining
Let $T$ be a stream of entities, e.g. customers of a company or patients of a hospital. We observe entities over time, e.g. on a company's website or at a hospital admission vicinity: an entity appears and re-appears at discrete time points, and new entities show up. At a time point $t$, an entity $e \in T$ is linked with different pieces of information: the purchases and ratings performed by a customer, or the anamnesis, the medical tests and the diagnosis recorded for the patient. Each of these information pieces $i_j(t)$ is a structured record or an unstructured text from a stream $T_j$, linked to $e$ via the foreign key relation. Thus, the entities in $T$ are in 1-to-1 or 1-to-n relation with entities from further streams $T_1, \ldots, T_m$ (stream of purchases, stream of ratings, stream of complaints etc). The schema describing the streams $T, T_1, \ldots, T_m$ can be perceived as a conventional relational schema, except that it describes streams instead of static sets.

In this relational setting, the entity stream mining task corresponds to learning a model over $T$, thereby incorporating information from the adjoint streams $T_1, \ldots, T_m$ that feed the entities in $T$.
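As a rough sketch of this relational setting (an illustration under stated assumptions, not the authors' design), the structure below keeps, for each entity of the target stream, a bounded buffer of the instances that reference it in each adjoint stream. The names `EntityStream`, `observe` and `summary`, the bound `max_per_stream`, and the choice of per-stream counts as features are all hypothetical.

```python
from collections import defaultdict, deque


class EntityStream:
    """Link instances from adjoint streams to entities via a key.

    Hypothetical illustration: each entity keeps at most `max_per_stream`
    recent (time, record) pairs per adjoint stream, oldest dropped first.
    """

    def __init__(self, stream_names, max_per_stream=100):
        self.stream_names = list(stream_names)
        self.max_per_stream = max_per_stream
        # entity key -> stream name -> bounded deque of (time, record)
        self.entities = defaultdict(
            lambda: {
                name: deque(maxlen=self.max_per_stream)
                for name in self.stream_names
            }
        )

    def observe(self, stream, entity_key, time, record):
        """Attach one instance from an adjoint stream to its entity."""
        self.entities[entity_key][stream].append((time, record))

    def summary(self, entity_key):
        """Return per-stream instance counts as a simple feature vector."""
        streams = self.entities[entity_key]
        return [len(streams[name]) for name in self.stream_names]
```

A conventional stream learner could then be trained on the output of `summary`, at the price of discarding most of what the buffered instances say about how the entity evolved.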

Albeit the members of each stream are entities, we use the term entity only for stream $T$, the target of learning, while we denote the entities in the other streams as instances. In the unsupervised setting, entity stream clustering encompasses learning and adapting clusters over $T$, taking into account the other streams that arrive at different speeds. In the supervised setting, entity stream classification involves learning and adapting a classifier, notwithstanding the fact that an entity's label may change from one time point to the next, as new instances referencing it arrive.

5.1.1 Challenges of Aggregation
The first challenge of the entity stream mining task concerns information summarization: how to aggregate into each entity $e$ at each time point $t$ the information available on it from the other streams? What information should be stored for each entity? How to deal with differences in the speeds of the individual streams? How to learn over the streams efficiently? Answering these questions in a seamless way would allow us to deploy conventional stream mining methods for entity stream mining after aggregation.

The information referencing a relational entity cannot be held perpetually for learning, hence aggregation of the arriving streams is necessary. Information aggregation over time-stamped data is traditionally practiced in document stream mining, where the objective is to derive and adapt content summaries on learned topics. Content summarization on entities, which are referenced in the document stream, is studied by Kotov et al., who maintain for each entity the number of times it is mentioned in the news [26].

In such studies, summarization is a task by itself. Aggregation of information for subsequent learning is a bit more challenging, because summarization implies information loss, notably of information about the evolution of an entity. Hassani and Seidl monitor health parameters of patients, modeling the stream of recordings on a patient as a sequence of events [21]: the learning task is then to predict forthcoming values. Aggregation with selective forgetting of past information is proposed in [25; 42] in the classification context: the former method [25] slides a window over the stream, while the latter [42] forgets entities that have not appeared for a while, and summarizes the information in frequent itemsets, which are then used as new features for learning.

5.1.2 Challenges of Learning
Even if information aggregation over the streams $T_1, \ldots, T_m$ is performed intelligently, entity stream mining still calls for more than conventional stream mining methods. The reason is that entities of stream $T$ re-appear in the stream and evolve. In particular, in the unsupervised setting, an entity may be linked to conceptually different instances at each time point, e.g. reflecting a customer's change in preferences. In the supervised setting, an entity may change its label; for example, a customer's affinity to risk may change in response to market changes or to changes in family status. This corresponds to entity drift, i.e. a new type of drift beyond the conventional concept drift pertaining to the model over $T$. Hence, how should entity drift be traced, and how should the interplay between entity drift and model drift be captured?

In the unsupervised setting, Oliveira and Gama learn and monitor clusters as states of evolution [32], while [41] extend that work to learn Markov chains that mark the entities' evolution. As pointed out in [32], these states are not necessarily predefined; they must be learned. In [43], we report on further solutions to the entity evolution problem and to the problem of learning with forgetting over multiple streams and over the entities referenced by them.

Conventional concept drift also occurs when learning a model over entities, thus the challenges pertinent to stream mining also apply here. One of these challenges, and one much discussed in the context of big data, is volatility. In relational stream mining, volatility refers to the entity itself, not only to the stream of instances that reference the entities. Finally, an entity is ultimately big data by itself, since it is described by multiple streams. Hence, next to the problem of dealing with new forms of learning and new aspects of drift, the subject of efficient learning and adaptation in the Big Data context becomes paramount.

5.2 Analyzing Event Data
Events are an example for data that occurs often yet is rarely analyzed in the stream setting. In static environments, events are usually studied through event history analysis (EHA), a statistical method for modeling and analyzing the temporal distribution of events related to specific objects in the course of their lifetime [9]. More specifically, EHA is interested in the duration before the occurrence of an event or, in the recurrent case (where the same event can occur repeatedly), the duration between two events. The notion of an event is completely generic and may indicate, for example, the failure of an electrical device. The method is perhaps even better known as survival analysis, a term that originates from applications in medicine, in which an event is the death of a patient and survival time is the time period between the beginning of the study and the occurrence of this event. EHA can also be considered as a special case of entity stream mining described in section 5.1, because the basic statistical entities in EHA are monitored objects (or subjects), typically described in terms of feature vectors $x \in \mathbb{R}^n$, together with their survival time $s$. Then, the goal is to model the dependence of $s$ on $x$. A corresponding model provides hints at possible cause-effect relationships (e.g., what properties tend to increase a patient's survival time) and, moreover, can be used for predictive purposes (e.g., what is the expected survival time of a patient).

Although one might be tempted to approach this modeling task as a standard regression problem with input (regressor) $x$ and output (response) $s$, it is important to notice that the survival time $s$ is normally not observed for all objects. Indeed, the problem of censoring plays an important role in EHA and occurs in different facets. In particular, it may happen that some of the objects survived till the end of the study at time $t_{end}$ (also called the cut-off point). They are censored or, more specifically, right censored, since $t_{event}$ has not been observed for them; instead, it is only known that $t_{event} > t_{end}$. In snapshot monitoring [28], the data stream may be sampled multiple times, resulting in a new cut-off point for each snapshot. Unlike standard regression analysis, EHA is specifically tailored for analyzing event data of that kind. It is built upon the hazard function as a basic mathematical tool.

5.2.1 Survival function and hazard rate
Suppose the time of occurrence of the next event (since the start or the last event) for an object $x$ is modeled as a real-valued random variable $T$ with probability density function $f(\cdot \mid x)$. The hazard function or hazard rate $h(\cdot \mid x)$ models the propensity of the occurrence of an event, that is, the marginal probability of an event to occur at time $t$, given that no event has occurred so far:

\[ h(t \mid x) = \frac{f(t \mid x)}{S(t \mid x)} = \frac{f(t \mid x)}{1 - F(t \mid x)} , \]

where $S(\cdot \mid x)$ is the survival function and $F(\cdot \mid x)$ the cumulative distribution of $f(\cdot \mid x)$. Thus,

\[ F(t \mid x) = P(T \le t) = \int_0^t f(u \mid x) \, du \]
0
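The relations between the density, the cumulative distribution, the survival function, and the hazard rate can be checked numerically. The following Python sketch is only an illustration under stated assumptions: it uses an exponential density with a hypothetical rate of 0.5, and the function names (`density`, `cumulative`, `survival`, `hazard`) are chosen here for illustration and do not belong to any library. It approximates $F$ by trapezoidal integration of $f$, sets $S = 1 - F$, and evaluates $h = f / S$.

```python
import math


def density(t, rate):
    """Assumed example density: exponential, f(t) = rate * exp(-rate * t)."""
    return rate * math.exp(-rate * t)


def cumulative(t, rate, steps=10_000):
    """F(t) = integral of f(u) du over [0, t], via the trapezoidal rule."""
    if t <= 0:
        return 0.0
    width = t / steps
    # End points get weight 1/2, interior points weight 1.
    total = 0.5 * (density(0.0, rate) + density(t, rate))
    for k in range(1, steps):
        total += density(k * width, rate)
    return total * width


def survival(t, rate):
    """S(t) = 1 - F(t): probability that no event occurred until time t."""
    return 1.0 - cumulative(t, rate)


def hazard(t, rate):
    """h(t) = f(t) / S(t) = f(t) / (1 - F(t))."""
    return density(t, rate) / survival(t, rate)


if __name__ == "__main__":
    rate = 0.5  # hypothetical value, chosen only for the illustration
    for t in (0.5, 1.0, 2.0, 4.0):
        print(
            f"t={t:.1f}  F={cumulative(t, rate):.4f}  "
            f"S={survival(t, rate):.4f}  h={hazard(t, rate):.4f}"
        )
```

For the exponential density, the computed hazard stays at the chosen rate for every $t$ (up to integration error), which is the constant-hazard situation. Replacing `density` by another density yields a time-dependent hazard rate in the same way.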

The quantity $F(t \mid x)$ is the probability of an event occurring before time $t$. Correspondingly, $S(t \mid x) = 1 - F(t \mid x)$ is the probability that the event did not occur until time $t$ (the survival probability). It can hence be used to model the probability of the right-censoring of the time for an event to occur.

A simple example is the Cox proportional hazard model [9], in which the hazard rate is constant over time; thus, it does depend on the feature vector $x = (x_1, \ldots, x_n)$ but not on time $t$. More specifically, the hazard rate is modeled as a log-linear function of the features $x_i$:

\[ h(t \mid x) = \lambda(x) = \exp(x^\top \beta) \]

The model is proportional in the sense that increasing $x_i$ by one unit increases the hazard rate $\lambda(x)$ by a factor of $\alpha_i = \exp(\beta_i)$. For this model, one easily derives the cumulative distribution $F(t \mid x) = 1 - \exp(-\lambda(x) \cdot t)$, hence the survival function $S(t \mid x) = \exp(-\lambda(x) \cdot t)$, and an expected survival time of $1/\lambda(x)$.

5.2.2 EHA on data streams

Although the temporal nature of event data naturally fits the data stream model and, moreover, event data is naturally produced by many data sources, EHA has been considered in the data stream scenario only very recently. In [39], the authors propose a method for analyzing earthquake and Twitter data, namely an extension of the above Cox model based on a sliding window approach. The authors of [28] modify standard classification algorithms, such as decision trees, so that they can be trained on a snapshot stream of both censored and non-censored data.

As in the case of clustering [35], where one distinguishes between clustering observations and clustering data sources, two different settings can be envisioned for EHA on data streams:

1. In the first setting, events are generated by multiple data sources (representing monitored objects), and the features pertain to these sources; thus, each data source is characterized by a feature vector $x$ and produces a stream of (recurrent) events. For example, data sources could be users in a computer network, and an event occurs whenever a user sends an email.

2. In the second setting, events are produced by a single data source, but now the events themselves are characterized by features. For example, events might be emails sent by an email server, and each email is represented by a certain set of properties.

Statistical event models on data streams can be used in much the same way as in the case of static data. For example, they can serve predictive purposes, i.e., to answer questions such as "How much time will elapse before the next email arrives?" or "What is the probability of receiving more than 100 emails within the next hour?". What is particularly interesting, however, and indeed distinguishes the data stream setting from the static case, is the fact that the model may change over time. This is a subtle aspect, because the hazard model $h(t \mid x)$ itself may already be time-dependent; here, however, $t$ is not the absolute time but the duration time, i.e., the time elapsed since the last event. A change of the model is comparable to concept drift in classification, and means that the way in which the hazard rate depends on time $t$ and on the features $x_i$ changes over time. For example, consider the event "increase of a stock rate" and suppose that $\beta_i = \log(2)$ for the binary feature $x_i$ = "energy sector" in the above Cox model (which, as already mentioned, does not depend on $t$). Thus, this feature doubles the hazard rate and hence halves the expected duration between two events. Needless to say, however, this influence may change over time, depending on how well the energy sector is doing.

Dealing with model changes of that kind is clearly an important challenge for event analysis on data streams. Although the problem is to some extent addressed by the works mentioned above, there is certainly scope for further improvement, and for using these approaches to derive predictive models from censored data. Besides, there are many other directions for future work. For example, since the detection of events is a main prerequisite for analyzing them, the combination of EHA with methods for event detection [36] is an important challenge. Indeed, this problem is often far from trivial, and in many cases, events (such as frauds, for example) can only be detected with a certain time delay; dealing with delayed events is therefore another important topic, which was also discussed in section 4.2.

6. EVALUATION OF DATA STREAM ALGORITHMS

All of the aforementioned challenges are milestones on the road to better algorithms for real-world data stream mining systems. To verify if these challenges are met, practitioners need tools capable of evaluating newly proposed solutions. Although such tools exist in the field of static classification, they are insufficient in data stream environments due to problems such as concept drift, limited processing time, verification latency, multiple stream structures, evolving class skew, censored data, and changing misclassification costs. In fact, the myriad of additional complexities posed by data streams makes algorithm evaluation a highly multi-criterial task, in which optimal trade-offs may change over time.

Recent developments in applied machine learning [6] emphasize the importance of understanding the data one is working with and using evaluation metrics which reflect its difficulties. As mentioned before, data streams set new requirements compared to traditional data mining, and researchers are beginning to acknowledge the shortcomings of existing evaluation metrics. For example, Gama et al. [16] proposed a way of calculating classification accuracy using only the most recent stream examples, therefore allowing for time-oriented evaluation and aiding concept drift detection. Methods which test the classifier's robustness to drifts and noise on a practical, experimental level are also starting to arise [34; 47]. However, all these evaluation techniques focus on single criteria such as prediction accuracy or robustness to drifts, even though data streams make evaluation a constant trade-off between several criteria [7]. Moreover, in data stream environments there is a need for more advanced tools for visualizing changes in algorithm predictions with time.

The problem of creating complex evaluation methods for stream mining algorithms lies mainly in the size and evolving nature of data streams. It is much more difficult to estimate and visualize, for example, prediction accuracy if evaluation must be done online, using limited resources, and the classification task changes with time. In fact, the algorithm's ability to adapt is another aspect which needs to be evaluated, although the information needed to perform such an evaluation is not always available. Concept drifts are known in advance mainly when using synthetic or benchmark data, while in more practical scenarios occurrences and types of concepts are not directly known and only the label of each arriving instance is known. Moreover, in many cases the task is more complicated, as labeling information is not instantly available. Other difficulties in evaluation include processing complex relational streams and coping with class imbalance when class distributions evolve with time. Finally, not only do we need measures for evaluating single aspects of stream mining algorithms, but also ways of combining several of these aspects into global evaluation models, which would take into
account expert knowledge and user preferences.

Clearly, evaluation of data stream algorithms is a fertile ground for novel theoretical and algorithmic solutions. In terms of prediction measures, data stream mining still requires evaluation tools that would be immune to class imbalance and robust to noise. In our opinion, solutions to this problem should involve not only metrics based on relative performance to baseline (chance) classifiers, but also graphical measures similar to PR-curves or cost curves. Furthermore, there is a need for integrating information about concept drifts in the evaluation process. As mentioned earlier, possible ways of considering concept drifts will depend on the information that is available. If true concepts are known, algorithms could be evaluated based on how often they detect drift, how early they detect it, how they react to it, and how quickly they recover from it. Moreover, in this scenario, evaluation of an algorithm should depend on whether it takes place during drift or during times of concept stability. A possible way of tackling this problem would be the proposal of graphical methods, similar to ROC analysis, which would work online and visualize concept drift measures alongside prediction measures. Additionally, these graphical measures could take into account the state of the stream, for example, its speed, number of missing values, or class distribution. Similar methods could be proposed for scenarios where concepts are not known in advance; however, in these cases measures should be based on drift detectors or label-independent stream statistics. Above all, due to the number of aspects which need to be measured, we believe that the evaluation of data stream algorithms requires a multi-criterial view. This could be done by drawing inspiration from multiple criteria decision analysis, where trade-offs between criteria are achieved using user feedback. In particular, a user could showcase his/her criteria preferences (for example, between memory consumption, accuracy, reactivity, self-tuning, and adaptability) by deciding between alternative algorithms for a given data stream. It is worth noticing that such a multi-criterial view on evaluation is difficult to encapsulate in a single number, as is usually done in traditional offline learning. This might suggest that researchers in this area should turn towards semi-qualitative and semi-quantitative evaluation, for which systematic methodologies should be developed.

Finally, a separate research direction involves rethinking the way we test data stream mining algorithms. The traditional train, cross-validate, test workflow in classification is not applicable for sequential data, which makes, for instance, parameter tuning much more difficult. Similarly, ground truth verification in unsupervised learning is practically impossible in data stream environments. With these problems in mind, it is worth stating that there is still a shortage of real and synthetic benchmark datasets. Such a situation might be a result of non-uniform standards for testing algorithms on streaming data. As a community, we should decide on such matters as: What characteristics should benchmark datasets have? Should they have prediction tasks attached? Should we move towards online evaluation tools rather than datasets? These questions should be answered in order to solve evaluation issues in controlled environments before we create measures for real-world scenarios.

7. FROM ALGORITHMS TO DECISION SUPPORT SYSTEMS

While a lot of algorithmic methods for data streams are already available, their deployment in real applications with real streaming data presents a new dimension of challenges. This section points out two such challenges: making models simpler and dealing with legacy systems.

7.1 Making models simpler, more reactive, and more specialized

In this subsection, we discuss aspects like the simplicity of a model, its proper combination of offline and online components, and its customization to the requirements of the application domain. As an application example, consider the French Orange Portal (www.orange.fr), which registers millions of visits daily. Most of these visitors are only known through anonymous cookie IDs. For all of these visitors, the portal has the ambition to provide specific and relevant contents as well as printing ads for targeted audiences. Using information about visits on the portal, the questions are: what part of the portal does each cookie visit, when and which contents did it consult, what advertisement was sent, and when (if at all) it was clicked. All this information generates hundreds of gigabytes of data each week. A user profiling system needs to have a back end part to preprocess the information required at the input of a front end part, which will compute appetency to advertising (for example) using stream mining techniques (in this case a supervised classifier). Since the ads to print change regularly, based on marketing campaigns, extensive parameter tuning is infeasible as one has to react quickly to change. Currently, these tasks are either solved using bandit methods from game theory [8], which impairs adaptation to drift, or done offline in big data systems, resulting in slow reactivity.

7.1.1 Minimizing parameter dependence

Adaptive predictive systems are intrinsically parametrized. In most cases, setting these parameters or tuning them is a difficult task, which in turn negatively affects the usability of these systems. Therefore, it is strongly desired for the system to have as few user-adjustable parameters as possible. Unfortunately, the state of the art does not produce methods with trustworthy or easily adjustable parameters. Moreover, many predictive modeling methods use a lot of parameters, rendering them particularly impractical for data stream applications, where models are allowed to evolve over time, and input parameters often need to evolve as well.

The process of predictive modeling encompasses fitting of parameters on a training dataset and subsequently selecting the best model, either by heuristics or principled methods. Recently, model selection methods have been proposed that do not require internal cross-validation, but rather use the Bayesian machinery to design regularizers with data dependent priors [20]. However, they are not yet applicable in data streams, as their computational time complexity is too high and they require all examples to be kept in memory.

7.1.2 Combining offline and online models

Online and offline learning are mostly considered as mutually exclusive, but it is their combination that might enhance the value of data the most. Online learning, which processes instances one-by-one and builds models incrementally, has the virtue of being fast, both in the processing of data and in the adaptation of models. Offline (or batch) learning has the advantage of allowing the use of more sophisticated mining techniques, which might be more time-consuming or require a human expert. While the first allows the processing of fast data that requires real-time processing and adaptivity, the second allows processing of big data that requires longer processing time and larger abstraction.

Their combination can take place in many steps of the mining process, such as the data preparation and the preprocessing steps. For example, offline learning on big data could extract fundamental and sustainable trends from data using batch processing and massive parallelism. Online learning could then take real-time decisions
from online events to optimize an immediate pay-off. In the online advertisement application mentioned above, the user-click prediction is done within a context, defined for example by the currently viewed page and the profile of the cookie. The decision which banner to display is done online, but the context can be preprocessed offline. By deriving meta-information such as "the profile is a young male, the page is from the sport cluster", the offline component can ease the online decision task.

7.1.3 Solving the right problem

Domain knowledge may help to solve many issues raised in this paper, by systematically exploiting particularities of application domains. However, this is seldom considered, as typical data stream methods are created to deal with a large variety of domains. For instance, in some domains the learning algorithm receives only partial feedback upon its prediction, i.e. a single bit of right-or-wrong, rather than the true label. In the user-click prediction example, if a user does not click on a banner, we do not know which one would have been correct, but solely that the displayed one was wrong. This is related to the issues on timing and availability of information discussed in section 4.2.

However, building predictive models that systematically incorporate domain knowledge or domain-specific information requires choosing the right optimization criteria. As mentioned in section 6, the data stream setting requires optimizing multiple criteria simultaneously, as optimizing only predictive performance is not sufficient. We need to develop learning algorithms that minimize an objective function which intrinsically and simultaneously includes memory consumption, predictive performance, reactivity, self-monitoring and tuning, and (explainable) auto-adaptivity. Data streams research is lacking methodologies for forming and optimizing such criteria.

Therefore, models should be simple so that they do not depend on a set of carefully tuned parameters. Additionally, they should combine offline and online techniques to address challenges of big and fast data, and they should solve the right problem, which might consist in solving a multi-criteria optimization task. Finally, they have to be able to learn from a small amount of data and with low variance [37], in order to react quickly to drift.

7.2 Dealing with Legacy Systems

In many application environments, such as financial services or health care systems, business-critical applications have been in operation for decades. Since these applications produce massive amounts of data, it becomes very promising to process these data with real-time stream mining approaches. However, it is often impossible to change existing infrastructures in order to introduce fully fledged stream mining systems. Rather than changing existing infrastructures, approaches are required that integrate stream mining techniques into legacy systems. In general, problems concerning legacy systems are domain-specific and encompass both technical and procedural issues. In this section, we analyze challenges posed by a specific real-world application with legacy issues: the ISS Columbus spacecraft module.

7.2.1 ISS Columbus

Spacecraft are very complex systems, exposed to very different physical environments (e.g. space), and associated with ground stations. These systems are under constant and remote monitoring by means of telemetry and commands. The ISS Columbus module has been in operation for more than 5 years. For some time, it has been pointed out that the monitoring process is not as efficient as previously expected [30]. However, we assume that data stream mining can make a decisive contribution to enhance and facilitate the required monitoring tasks. Currently, we are planning to use the ISS Columbus module as a technology demonstrator for integrating data stream processing and mining into the existing monitoring processes [31]. Figure 2 exemplifies the failure management system (FMS) of the ISS Columbus module. While it is impossible to simply redesign the FMS from scratch, we can outline the following challenges.

[Figure 2: ISS Columbus FMS. Components shown: 1. ISS Columbus module; 2. Ground control centre; 3. Engineering support centre; 4. Mission archive; 5. Assembly, integration, and test facility.]

7.2.2 Complexity

Even though spacecraft monitoring is very challenging by itself, it becomes increasingly difficult and complex due to the integration of data stream mining into such legacy systems. However, this integration was intended to enhance and facilitate current monitoring processes. Thus, appropriate mechanisms are required to integrate data stream mining into the current processes in a way that decreases complexity.

7.2.3 Interlocking

As depicted in Figure 2, the ISS Columbus module is connected to ground instances. Real-time monitoring must be applied aboard, where computational resources are restricted (e.g. processor speed and memory or power consumption). Near real-time monitoring or long-term analysis must be applied on-ground, where the downlink suffers from latencies because of a long transmission distance, is subject to bandwidth limitations, and is continuously interrupted due to loss of signal. Consequently, new data stream mining mechanisms are necessary which ensure a smooth interlocking functionality of aboard and ground instances.

7.2.4 Reliability and Balance

The reliability of spacecraft is indispensable for astronauts' health and mission success. Accordingly, spacecraft pass very long and expensive planning and testing phases. Hence, potential data stream mining algorithms must ensure reliability, and the integration of such algorithms into legacy systems must not cause critical side effects. Furthermore, data stream mining is an automatic process which neglects interactions with human experts, while spacecraft monitoring is a semi-automatic process and human experts (e.g.
the flight control team) are responsible for decisions and consequent actions. This problem poses the following question: How to integrate data stream mining into legacy systems when automation needs to be increased but the human expert needs to be maintained in the loop? Abstract discussions on this topic are provided by expert systems [44] and the MAPE-K reference model [24]. Expert systems aim to combine human expertise with artificial expertise, and the MAPE-K reference model aims to provide an autonomic control loop. A balance must be struck which considers both aforementioned aspects appropriately.

Overall, the Columbus study has shown that extending legacy systems with real-time data stream mining technologies is feasible and that it is an important area for further stream-mining research.

8. CONCLUDING REMARKS

In this paper, we discussed research challenges for data streams, originating from real-world applications. We analyzed issues concerning privacy, availability of information, relational and event streams, preprocessing, model complexity, evaluation, and legacy systems. The discussed issues were illustrated by practical applications including GPS systems, Twitter analysis, earthquake predictions, customer profiling, and spacecraft monitoring. The study of real-world problems highlighted shortcomings of existing methodologies and showcased previously unaddressed research issues.

Consequently, we call on the data stream mining community to consider the following action points for data stream research:

• developing methods for ensuring privacy with incomplete information as data arrives, while taking into account the evolving nature of data;

• considering the availability of information by developing models that handle incomplete, delayed and/or costly feedback;

• taking advantage of relations between streaming entities;

• developing event detection methods and predictive models for censored data;

• developing a systematic methodology for streamed preprocessing;

• creating simpler models through multi-objective optimization criteria, which consider not only accuracy, but also computational resources, diagnostics, reactivity, interpretability;

• establishing a multi-criteria view towards evaluation, dealing with absence of the ground truth about how data changes;

• developing online monitoring systems, ensuring reliability of any updates, and balancing the distribution of resources.

As our study shows, there are challenges in every step of the CRISP data mining process. To date, modeling over data streams has been viewed and approached as an extension of traditional methods. However, our discussion and application examples show that in many cases it would be beneficial to step aside from building upon existing offline approaches, and to start from a blank slate, considering what is required in the stream setting.

Acknowledgments

We would like to thank the participants of the RealStream2013 workshop at ECMLPKDD2013 in Prague, and in particular Bernhard Pfahringer and George Forman, for suggestions and discussions on the challenges in stream mining. Part of this work was funded by the German Research Foundation, projects SP 572/11-1 (IMPRINT) and HU 1284/5-1, the Academy of Finland grant 118653 (ALGODAN), and the Polish National Science Center grants DEC-2011/03/N/ST6/00360 and DEC-2013/11/B/ST6/00963.

9. REFERENCES

[1] C. Aggarwal, editor. Data Streams: Models and Algorithms. Springer, 2007.

[2] C. Aggarwal and D. Turaga. Mining data streams: Systems and algorithms. In Machine Learning and Knowledge Discovery for Engineering Systems Health Management, pages 4-32. Chapman and Hall, 2012.

[3] R. Agrawal and R. Srikant. Privacy-preserving data mining. SIGMOD Rec., 29(2):439-450, 2000.

[4] C. Anagnostopoulos, N. Adams, and D. Hand. Deciding what to observe next: Adaptive variable selection for regression in multivariate data streams. In Proc. of the 2008 ACM Symp. on Applied Computing, SAC, pages 961-965, 2008.

[5] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS, pages 1-16, 2002.

[6] C. Brodley, U. Rebbapragada, K. Small, and B. Wallace. Challenges and opportunities in applied machine learning. AI Magazine, 33(1):11-24, 2012.

[7] D. Brzezinski and J. Stefanowski. Reacting to different types of concept drift: The accuracy updated ensemble algorithm. IEEE Trans. on Neural Networks and Learning Systems, 25:81-94, 2014.

[8] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal. Mortal multi-armed bandits. In Proc. of the 22nd Conf. on Neural Information Processing Systems, NIPS, pages 273-280, 2008.

[9] D. Cox and D. Oakes. Analysis of Survival Data. Chapman & Hall, London, 1984.

[10] T. Dietterich. Machine-learning research. AI Magazine, 18(4):97-136, 1997.

[11] G. Ditzler and R. Polikar. Semi-supervised learning in non-stationary environments. In Proc. of the 2011 Int. Joint Conf. on Neural Networks, IJCNN, pages 2741-2748, 2011.

[12] W. Fan and A. Bifet. Mining big data: current status, and forecast to the future. SIGKDD Explorations, 14(2):1-5, 2012.

[13] M. Gaber, J. Gama, S. Krishnaswamy, J. Gomes, and F. Stahl. Data stream mining in ubiquitous environments: state-of-the-art and current directions. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(2):116-138, 2014.

[14] M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: A review. SIGMOD Rec., 34(2):18-26, 2005.

[15] J. Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 2010.

[16] J. Gama, R. Sebastiao, and P. Rodrigues. On evaluating stream learning algorithms. Machine Learning, 90(3):317-346, 2013.

[17] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept-drift adaptation. ACM Computing Surveys, 46(4), 2014.

[18] J. Gantz and D. Reinsel. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east, December 2012.

[19] A. Goldberg, M. Li, and X. Zhu. Online manifold regularization: A new learning setting and empirical study. In Proc. of the European Conf. on Machine Learning and Principles of Knowledge Discovery in Databases, ECMLPKDD, pages 393-407, 2008.

[20] I. Guyon, A. Saffari, G. Dror, and G. Cawley. Model selection: Beyond the bayesian/frequentist divide. Journal of Machine Learning Research, 11:61-87, 2010.

[21] M. Hassani and T. Seidl. Towards a mobile health context prediction: Sequential pattern mining in multiple streams. In Proc. of IEEE Int. Conf. on Mobile Data Management, MDM, pages 55-57, 2011.

[22] H. He and Y. Ma, editors. Imbalanced Learning: Foundations, Algorithms, and Applications. IEEE, 2013.

[23] T. Hoens, R. Polikar, and N. Chawla. Learning from streaming data with concept drift and imbalance: an overview. Progress in Artificial Intelligence, 1(1):89-101, 2012.

[24] IBM. An architectural blueprint for autonomic computing. Technical report, IBM, 2003.

[25] E. Ikonomovska, K. Driessens, S. Dzeroski, and J. Gama. Adaptive windowing for online learning from multiple inter-related data streams. In Proc. of the 11th IEEE Int. Conf. on Data Mining Workshops, ICDMW, pages 697-704, 2011.

[26] A. Kotov, C. Zhai, and R. Sproat. Mining named entities with temporally correlated bursts from multilingual web news streams. In Proc. of the 4th ACM Int. Conf. on Web Search and Data Mining, WSDM, pages 237-246, 2011.

[27] G. Krempl. The algorithm APT to classify in concurrence of latency and drift. In Proc. of the 10th Int. Conf. on Advances in Intelligent Data Analysis, IDA, pages 222-233, 2011.

[28] M. Last and H. Halpert. Survival analysis meets data stream mining. In Proc. of the 1st Worksh. on Real-World Challenges for Data Stream Mining, RealStream, pages 26-29, 2013.

[29] F. Nelwamondo and T. Marwala. Key issues on computational intelligence techniques for missing data imputation - a review. In Proc. of World Multi Conf. on Systemics, Cybernetics and Informatics, volume 4, pages 35-40, 2008.

[30] E. Noack, W. Belau, R. Wohlgemuth, R. Muller, S. Palumberi, P. Parodi, and F. Burzagli. Efficiency of the columbus failure management system. In Proc. of the AIAA 40th Int. Conf. on Environmental Systems, 2010.

[33] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., 1999.

[34] T. Raeder and N. Chawla. Model monitor (m2): Evaluating, comparing, and monitoring models. Journal of Machine Learning Research, 10:1387-1390, 2009.

[35] P. Rodrigues and J. Gama. Distributed clustering of ubiquitous data streams. WIREs Data Mining and Knowledge Discovery, pages 38-54, 2013.

[36] T. Sakaki, M. Okazaki, and Y. Matsuo. Tweet analysis for real-time event detection and earthquake reporting system development. IEEE Trans. on Knowledge and Data Engineering, 25(4):919-931, 2013.

[37] C. Salperwyck and V. Lemaire. Learning with few examples: An empirical study on leading classifiers. In Proc. of the 2011 Int. Joint Conf. on Neural Networks, IJCNN, pages 1010-1019, 2011.

[38] B. Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers, 2012.

[39] A. Shaker and E. Hullermeier. Survival analysis on data streams: Analyzing temporal events in dynamically changing environments. Int. Journal of Applied Mathematics and Computer Science, 24(1):199-212, 2014.

[40] C. Shearer. The CRISP-DM model: the new blueprint for data mining. J Data Warehousing, 2000.

[41] Z. Siddiqui, M. Oliveira, J. Gama, and M. Spiliopoulou. Where are we going? predicting the evolution of individuals. In Proc. of the 11th Int. Conf. on Advances in Intelligent Data Analysis, IDA, pages 357-368, 2012.

[42] Z. Siddiqui and M. Spiliopoulou. Classification rule mining for a stream of perennial objects. In Proc. of the 5th Int. Conf. on Rule-based Reasoning, Programming, and Applications, RuleML, pages 281-296, 2011.

[43] M. Spiliopoulou and G. Krempl. Tutorial mining multiple threads of streaming data. In Proc. of the Pacific-Asia Conf. on Knowledge Discovery and Data Mining, PAKDD, 2013.

[44] D. Waterman. A Guide to Expert Systems. Addison-Wesley, 1986.

[45] W. Young, G. Weckman, and W. Holland. A survey of methodologies for the treatment of missing values within datasets: limitations and benefits. Theoretical Issues in Ergonomics Science, 12, January 2011.

[46] B. Zhou, Y. Han, J. Pei, B. Jiang, Y. Tao, and Y. Jia. Continuous privacy preserving publishing of data streams. In Proc. of the 12th Int. Conf. on Extending Database Technology, EDBT, pages 648-659, 2009.

[47] I. Zliobaite. Controlled permutations for testing adaptive learning models. Knowledge and Information Systems, In Press, 2014.
[31] E. Noack, A. Luedtke, I. Schmitt, T. Noack, E. Schaumloffel,
E. Hauke, J. Stamminger, and E. Frisk. The columbus module [48] I. Zliobaite, A. Bifet, M. Gaber, B. Gabrys, J. Gama,
as a technology demonstrator for innovative failure manage- L. Minku, and K. Musial. Next challenges for adaptive learn-
ment. In German Air and Space Travel Congress, 2012. ing systems. SIGKDD Explorations, 14(1):4855, 2012.
[32] M. Oliveira and J. Gama. A framework to monitor clusters [49] I. Zliobaite and B. Gabrys. Adaptive preprocessing for
evolution applied to economy and finance problems. Intelli- streaming data. IEEE Trans. on Knowledge and Data Engi-
gent Data Analysis, 16(1):93111, 2012. neering, 26(2):309321, 2014.

Twitter Analytics: A Big Data Management Perspective
Oshini Goonetilleke, Timos Sellis, Xiuzhen Zhang, Saket Sathe
School of Computer Science and IT, RMIT University, Melbourne, Australia
IBM Melbourne Research Laboratory, Australia
{oshini.goonetilleke, timos.sellis, xiuzhen.zhang}@rmit.edu.au, ssathe@au.ibm.com

ABSTRACT
With the inception of the Twitter microblogging platform in 2006, a myriad of research efforts have emerged studying different aspects of the Twittersphere. Each study exploits its own tools and mechanisms to capture, store, query and analyze Twitter data. Inevitably, platforms have been developed to replace this ad-hoc exploration with a more structured and methodological form of analysis. Another body of literature focuses on developing languages for querying tweets. This paper addresses issues around the big data nature of Twitter and emphasizes the need for new data management and query language frameworks that address limitations of existing systems. We review existing approaches that were developed to facilitate Twitter analytics, followed by a discussion on research issues and technical challenges in developing integrated solutions.

1. INTRODUCTION
The massive growth of data generated from social media sources has resulted in a growing interest in efficient and effective means of collecting, analyzing and querying large volumes of social data. The existing platforms exploit several characteristics of big data, including large volumes of data, velocity due to the streaming nature of data, and variety due to the integration of data from the web and other sources. Hence, social network data presents an excellent testbed for research on big data.

In particular, the online social networking and microblogging platform Twitter has seen exponential growth in its user base since its inception in 2006, with now over 200 million monthly active users producing 500 million tweets (Twittersphere, the postings made to Twitter) daily¹. A wide research community has been established since then with the hope of understanding interactions on Twitter. For example, studies have been conducted in many domains exploring different perspectives of understanding human behavior. Prior research focuses on a variety of topics including opinion mining [12, 14, 38], event detection [46, 65, 76], spread of pandemics [26, 58, 68], celebrity engagement [74] and analysis of political discourse [28, 40, 70]. These types of efforts have enabled researchers to understand interactions on Twitter related to the fields of journalism, education, marketing, disaster relief etc.

¹ http://tnw.to/s0n9u

The systems that perform analysis in the context of these interactions typically involve the following major components: focused crawling, data management and data analytics. Here, data management comprises information extraction, pre-processing, data modeling and query processing components. Figure 1 shows a block diagram of such a system and depicts interactions between various components. Until now, there has been a significant amount of prior research around improving each of the components shown in Figure 1, but to the best of our knowledge, there have been no frameworks that propose a unified approach to Twitter data management that seamlessly integrates all these components. Following these observations, in this paper we extensively survey the techniques that have been proposed for realising each of the components shown in Figure 1, and then motivate the need and challenges of a unified framework for managing Twitter data.

Figure 1: An abstraction of a Twitter data management platform

In our survey of existing literature we observed ways in which researchers have tried to develop general platforms to provide a repeatable foundation for Twitter data analytics. We review the tweet analytics space primarily focusing on the following key elements:

Data Collection. Researchers have several options for collecting a suitable data set from Twitter. In Section 2 we briefly describe mechanisms and tools that focus primarily on facilitating the initial data acquisition phase. These tools systematically capture the data using any of
Twitter's publicly accessible APIs.

Data management frameworks. In addition to providing a module for crawling tweets, these frameworks provide support for pre-processing, information extraction and/or visualization capabilities. Pre-processing deals with preparing tweets for data analysis. Information extraction aims to derive more insight from the tweets that is not directly reported by the Twitter API, e.g. a sentiment for a given tweet. Several frameworks provide functionality to present the results in many output forms of visualizations, while others are built exclusively to search over a large tweet collection. In Section 3 we review existing data management frameworks.

Languages for querying tweets. A growing body of literature proposes declarative query languages as a mechanism for extracting structured information from tweets. Languages present end users with a set of primitives beneficial in exploring the Twittersphere along different dimensions. In Section 4 we investigate declarative languages and similar systems developed for querying a variety of tweet properties.

We have identified the essential ingredients for a unified Twitter data management solution, with the intention that an analyst will easily be able to extend its capabilities for specific types of research. Such a solution will allow the data analyst to focus on the use cases of the analytics task by conveniently using the functionality provided by an integrated framework. In Section 5 we present our position and emphasize the need for integrated solutions that address limitations of existing systems. Section 6 outlines research issues and challenges associated with the development of integrated platforms. Finally, we conclude in Section 7.

2. OPTIONS FOR DATA COLLECTION
Researchers have several options when choosing an API for data collection, i.e. the Search, Streaming and the REST API. Each API has varying capabilities with respect to the type and the amount of information that can be retrieved. The Search API is dedicated to running searches against an index of recent tweets. It takes keywords as queries, with the possibility of multiple queries combined as a comma separated list. A request to the Search API returns a collection of relevant tweets matching a user query. The Streaming API provides a stream to continuously capture the public tweets, where parameters are provided to filter the results of the stream by hashtags, keywords, Twitter user ids, usernames or geographic regions. The REST API can be used to retrieve a fraction of the most recent tweets published by a Twitter user. All three APIs limit the number of requests within a time window, and rate limits are imposed at the user and the application level. Responses obtained from the Twitter API are generally in the JSON format. Third party libraries² are available in many programming languages for accessing the Twitter API. These libraries provide wrappers and methods for authentication and other functions to conveniently access the API.

² https://dev.twitter.com/docs/twitter-libraries

Publicly available APIs do not guarantee complete coverage of the data for a given query, as the feeds are not designed for enterprise access. For example, the Streaming API only provides a random sample of 1% (known as the Spritzer stream) of the public Twitter stream in real-time. Applications where this rate limitation is too restrictive rely on third party resellers like GNIP, DataSift or Topsy³, who provide access to the entire collection of the tweets, known as the Twitter FireHose. At a cost, resellers can provide unlimited access to archives of historical data, real-time streaming data or both. It is mostly the corporate businesses who opt for such alternatives to gain insights into their consumer and competitor patterns.

³ http://gnip.com/, http://datasift.com/, http://about.topsy.com/

In order to obtain a dataset sufficient for an analysis task, it is necessary to efficiently query the respective API methods within the bounds of imposed rate limits. The requests to the API may have to run continuously, spanning several days or weeks. Creating the users' social graph for a community of interest requires additional modules that crawl user accounts iteratively. Large crawls with more complete coverage were made possible with the use of whitelisted accounts [21, 45] and using the computation power of cloud computing [55]. Due to Twitter's current policy, whitelisted accounts are discontinued and are no longer an option as a means of large data collection. Distributed systems have been developed [16, 45] to make continuously running, large scale crawls feasible. There are other solutions that provide extended features to process the incoming Twitter data, as discussed next.

3. DATA MANAGEMENT FRAMEWORKS

3.1 Focused Crawlers
The focus in studies like TwitterEcho [16] and Byun et al. [19] is data collection, where the primary contributions are driven by crawling strategies for effective retrieval and better coverage. TwitterEcho describes an open source distributed crawler for Twitter. Data can be collected from a focused community of interest, and it adopts a centralized distributed architecture in which multiple thin clients are deployed to create a scalable system. TwitterEcho devises a user expansion strategy by which the users' follower lists are crawled iteratively using the REST API. The system also includes modules responsible for controlling the expansion strategy. The user selection feature identifies user accounts to be monitored by the system, with modules for user profile analysis and language identification. The user selection feature is customized to crawl the focused community of Portuguese tweets but can be adapted to target other communities. Byun et al. [19] in their work propose a rule-based data collection tool for Twitter with the focus of analysing sentiment of Twitter messages. It is a Java-based open source tool developed using the Drools⁴ rule engine. They stress the importance of an automated data collector that also filters out unnecessary data such as spam messages.

⁴ http://drools.jboss.org/

3.2 Pre-processing and Information Extraction
Apart from data collection, several frameworks implement methods to perform extensive pre-processing and information extraction of the tweets. Pre-processing tasks of TrendMiner [61] take into account the challenges posed by the noisy genre of tweets. Tokenization, stemming and POS
tagging are some of the text processing tasks that better prepare tweets for the analysis task. The platform provides separate built-in modules to extract information such as location, language, sentiment and named entities that are deemed very useful in data analytics. The creation of a pipeline of these tools allows the data analyst to extend and reuse each component with relative ease.

TwitIE [17] is another open-source information extraction NLP pipeline customized for microblog text. For the purpose of information extraction (IE), the general purpose IE pipeline ANNIE is used, and it consists of components such as sentence splitter, POS tagger and gazetteer lists (for location prediction). Each step of the pipeline addresses drawbacks in traditional NLP systems by addressing the inherent challenges in microblog text. As a result, individual components of ANNIE are customized. Language identification, tokenisation, normalization, POS tagging and named entity recognition are performed, with each module reporting accuracy on tweets.

Baldwin [11] presents a system designed for event detection on Twitter with functionality for pre-processing. JSON results returned by the Streaming API are parsed and piped through language filtering and lexical normalisation components. Messages that do not have location information are geo-located using probabilistic models, since it is a critical issue in identifying where an event occurs. Information extraction modules require knowledge from external sources and are generally more expensive tasks than language processing. Platforms that support real-time analysis [11, 76] require processing tasks to be conducted on-the-fly, where the speed of the underlying algorithms is a crucial consideration.

3.3 Generic Platforms
There are several proposals in which researchers have tried to develop generic platforms to provide a repeatable foundation for Twitter data analytics. TwitterZombie [15] is a platform to unify the data gathering and analysis methods by presenting a candidate architecture and methodological approach for examining specific parts of the Twittersphere. It outlines an architecture for standard capture, transformation and analysis of Twitter interactions using Twitter's Search API. This tool is designed to gather data from Twitter by executing a series of independent search jobs on a continual basis, and the collected tweets and their metadata are kept in a relational DBMS. One of the interesting features of TwitterZombie is its ability to capture hierarchical relationships in the data returned by Twitter. A network translator module performs post-processing on the tweets and

phasis on selection of a proper data set for the definition of a campaign. The platform consists of three layers: the campaign crawling layer, the integrated modeling layer and the data analysis layer. In the campaign crawling layer, a configuration module follows an iterative approach to ensure the campaign converges to a proper set of filters (keywords). Collected tweets, metadata and the community data (relationships among Twitter users) are stored in a graph database. This study should be highlighted for its distinction to allow for a flexible querying mechanism on top of a data model built on raw data. The model is generated in the integrated modeling layer and comprises a representation of associations between terms (e.g. hashtags) used in tweets and their evolution in time. Their approach is interesting as it captures the often overlooked temporal dimension. In the third layer, the data analysis layer, a query language is used to design a target view of the campaign data that corresponds to a set of tweets that contain, for example, the answer to an opinion mining question.

While including components for capture and storage of tweets, additional tools have been developed to search through the collected tweets. The architecture of CoalMine [72] presents a social network data mining system demonstrated on Twitter, designed to process large amounts of streaming social data. The ad-hoc query tool provides an end user with the ability to access one or more data files through a Google-like search interface. Appropriate support is provided for a set of boolean and logical operators for ease of querying on top of a standard Apache Lucene index. The data collection and storage component is responsible for establishing connections to the REST API and for storing the returned JSON objects in compressed formats.

In building support platforms it is necessary to make provisions for practical considerations such as processing big data. TrendMiner [61] facilitates real-time analysis of tweets and takes into consideration scalability and efficiency of processing large volumes of data. TrendMiner makes an effort to unify some of the existing text processing tools for Online Social Networking (OSN) data, with emphasis on adapting to real-life scenarios that include processing batches of millions of data. They envision the system to be developed for both batch-mode and online processing. TwitIE [17], as discussed in the previous section, is another open-source information extraction NLP pipeline customized for microblog text.

3.4 Application-specific Platforms
Apart from the above mentioned general purpose platforms, there are many frameworks targeted at conducting specific
continual basis and the collected tweets separately
and their metadata text.
stores hashtags, mentions and retweets, from the types of analysis with Twitter data. Emergency Situation
is kepttext.
tweet in a relational
Raw tweets DBMS. One of the interesting
are transformed features
into a representa- Awareness (ESA) [76] is a platform developed to detect, as-
of TwitterZombie
tion of interactionsistoitscreate ability to capture
networks hierarchical
of retweets, rela-
mentions 3.4 Application-specific Platforms
sess, summarise and report messages of interest published
tionships
and usersinmentioning
the data returned
hashtags. by Twitter.
This feature A network
captured trans-by Apart
on from the
Twitter for above mentioned general
crisis coordination tasks.purpose platforms,
The objective of
lator module performs
TwitterZombie, which other post-processing
studies pay on theattention
little tweets and to, there are many
their work is to frameworks
convert large targeted
streamsatofconducting
social media specific
data
stores
is hashtags,
helpful mentions
in answering and retweets,
different types ofseparately from the
research questions typesuseful
into of analysis
situation with Twitterinformation
awareness data. Emergency in real-time.Situation
The
tweet text. Raw
with relative ease.tweets
Socialare transformed
graphs are created intoina therepresenta-
form of Awareness
ESA platform (ESA) [76] isofa modules
consists platformto developed to detect,con-
detect incidents, as-
tion of interactions
a retweet or mention tonetwork
create networks
and theyofdoretweets,
not crawl mentions
for the sess, summarise and report messages
dense and summarise messages, classify messages of high of interest published
and
user users
graphmentioning
with traditional hashtags.followingThisrelationships.
feature captured It alsoby on Twitter
value, identify forand crisis coordination
track issues and tasks.
finally to The objective
conduct of
foren-
TwitterZombie, which other studies pay
draws discussion on how multi-byte tweets in languages like little attention to, their work isoftohistorical
sic analysis convert largeevents. streams of socialare
The modules media data
enriched
is helpful
Arabic or in answering
Chinese can be different
storedtypes of researchtranslitera-
by performing questions into
by a useful
suite of situation awareness
visualisation information
interfaces. Baldwin in real-time. The
et al. [11] pro-
with
tion. relative ease. Social graphs are created in the form of ESA another
pose platformsupport consistsplatform
of modules to detect
focused incidents,
on detecting con-
events
a retweet
More or mention
recently, TwitHoardnetwork [69]and they do
suggests not crawl of
a framework forsup-
the dense
on and summarise
Twitter. The Twitter messages,
stream isclassify
queried messages
with a setof of high
key-
user graph
porting with traditional
processors for data following
analytics relationships.
on Twitter with It also
em- value, specified
words identify and track
by the user issues
withandthe finally
objective to conduct
of filteringforen-
the
draws discussion on how multi-byte tweets in languages like sic analysis of historical events. The modules are enriched
Arabic or Chinese can be stored by performing translitera- by a suite of visualisation interfaces. Baldwin et al. [11] pro-
tion. pose another support platform focused on detecting events
More recently, TwitHoard [69] suggests a framework of sup- on Twitter. The Twitter stream is queried with a set of key-
porting processors for data analytics on Twitter with em- words specified by the user with the objective of filtering the
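
As a rough illustration of keyword-based stream filtering of the kind described above, the
sketch below keeps only those tweets whose text contains at least one user-specified keyword.
It is not the implementation of Baldwin et al. [11]; the dictionary field `text`, the function
name `filter_by_keywords` and the tokenisation rule are assumptions made for this example.

```python
import re
from typing import Iterable, Iterator

# Assumption: a token is any run of word characters, so "#bushfire" yields "bushfire".
TOKEN = re.compile(r"\w+")


def filter_by_keywords(tweets: Iterable[dict], keywords: Iterable[str]) -> Iterator[dict]:
    """Yield tweets whose text shares at least one token with the keyword set."""
    wanted = {keyword.lower() for keyword in keywords}
    for tweet in tweets:
        tokens = set(TOKEN.findall(tweet.get("text", "").lower()))
        if wanted & tokens:
            yield tweet


if __name__ == "__main__":
    # Hypothetical sample stream; a real stream would come from the Twitter API.
    sample = [
        {"id": 1, "text": "Bushfire near the highway"},
        {"id": 2, "text": "Lovely coffee this morning"},
        {"id": 3, "text": "Stay safe, #bushfire spreading"},
    ]
    for tweet in filter_by_keywords(sample, ["bushfire"]):
        print(tweet["id"], tweet["text"])
```

Because the function is a generator, it can consume an unbounded stream one tweet at a time
and pass matches on to later processing stages without buffering the whole stream.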

The results are piped through text processing components and the geo-located tweets are
visualised on a map for better interaction. Clearly, platforms of this nature that deal with
incident exploration need to make provisions for real-time analysis of the incoming Twitter
stream and produce suitable visualizations of detected incidents.

3.5 Data Model and Storage Mechanisms
Data models are not discussed in detail in most studies, as a simple data model is sufficient
to conduct basic forms of analysis. When standard tweets are collected, flat files [11, 72]
are the preferred choice. Several studies that capture the social relationships [15, 19] of
the Twittersphere employ the relational data model but do not necessarily store the
relationships in a graph database. As a consequence, many analyses that can be performed
conveniently on a graph are not captured by these platforms. Only TwitHoard [69] in their
paper models co-occurrence of terms as a graph with temporally evolving properties. Twitter
Zombie [15] and TwitHoard [69] should be highlighted for capturing interactions including the
retweets and term associations apart from the traditional follower/friend social
relationships. TrendMiner [61] draws explicit discussion on making provisions for processing
millions of data items and takes advantage of the Apache Hadoop MapReduce framework to perform
distributed processing of the tweets stored as key-value pairs. CoalMine [72] also has Apache
Hadoop at the core of their batch processing component, responsible for efficient processing
of large amounts of data.

3.6 Support for Visualization Interfaces
There are many platforms designed with integrated tools predominantly for visualization, to
analyse data in spatial, temporal and topical perspectives. One tool is tweetTracker [44],
which is designed to aid monitoring of tweets for humanitarian and disaster relief.
TweetXplorer [54] also provides useful visualization tools to explore Twitter data. For a
particular campaign, visualizations in TweetXplorer help analysts to view the data along
different dimensions: the most interesting days in the campaign (when), important users and
their tweets (who/what) and important locations in the dataset (where). Systems like TwitInfo
[48], Twitcident [8] and Torettor [65] also provide a suite of visualisation capabilities to
explore tweets in different dimensions relating to specific applications like fighting fire
and detecting earthquakes. Web-mashups like Trendsmap [5] and Twitalyzer [6] provide a web
interface and enterprise business solutions to gain real-time trends and insights of user
groups.
Table 1 illustrates an overview of related approaches and features of different platforms.
Pre-processing in Table 1 indicates if any form of language processing tasks such as POS
tagging or normalization are conducted. Information extraction refers to the types of post
processing performed to infer additional information, such as sentiment or named entities
(NEs). Multiple ticks correspond to a task that is carried out extensively. In addition to
collecting tweets, some studies also capture the user's social graph while others propose the
need to regard interactions of hashtags, retweets and mentions as separate properties. Backend
data models supported by the platform shape the types of analysis that can be conveniently
done on each framework. From the summary in Table 1, we can observe that each study on data
management frameworks concentrates on a set of challenges more than others. We aim to
recognize the key ingredients of an integrated framework that takes into account shortcomings
of existing systems.

4. LANGUAGES FOR QUERYING TWEETS
The goal of proposing declarative languages and systems for querying tweets is to put forward
a set of primitives or an interface for analysts to conveniently query specific interactions
on Twitter exploring the user, time, space and topical dimensions. High level languages for
querying tweets extend capabilities of existing languages such as SQL and SPARQL. Queries are
either executed on the Twitter stream in real-time or on a stored collection of tweets.

4.1 Generic Languages
TweeQL [49] provides a streaming SQL-like interface to the Twitter API and provides a set of
user defined functions (UDFs) to manipulate data. The objective is to introduce a query
language to extract structure and useful information that is embedded in unstructured Twitter
data. The language exploits both relational and streaming semantics. UDFs allow for operations
such as location identification, string processing, sentiment prediction, named entity
extraction and event detection. In the spirit of streaming semantics, it provides SQL
constructs to perform aggregations over the incoming stream on a user-specified time window.
The result of a given query can be stored in a relational fashion for subsequent querying.
Models for representing any social network in RDF have been proposed by Martin and Gutierrez
[50], allowing queries in SPARQL. The work explores the feasibility of adoption of this model
by demonstrating their idea with an illustrative prototype but does not focus on a single
social network like Twitter in particular. TwarQL [53] extracts content from tweets and
encodes it in RDF format using shared and well known vocabularies (FOAF, MOAT, SIOC), enabling
querying in SPARQL. The extraction facility processes plain tweets and expands their
description by adding sentiment annotations, DBPedia entities, hashtag definitions and URLs.
The annotation of tweets using different vocabularies enables querying and analysis in
different dimensions such as location, users, sentiment and related named entities. The
infrastructure of TwarQL enables subscription to a stream that matches a given query and
returns streaming annotated data in real-time.
Temporal and topical features are of paramount importance in an evolving microblogging stream
like Twitter. In the languages above, time and topic of a tweet (topic can be represented
simply by a hashtag) are considered meta-data of the tweet and are not treated any differently
from other metadata reported. Topics are regarded as part of the tweet content or what drives
the data filtering task from the Twitter API. There have been efforts to exploit features that
go well beyond a simple filter based on time and topic. Plachouras and Stavrakas [59] stress
the need for temporal modelling of terms in Twitter to effectively capture changing trends. A
term refers to any word or short phrase of interest in a tweet, including hashtags or output
of an entity recognition process. Their proposed query operators can express complex queries
for associations between terms over varying time granularities, to discover context of
collected data.
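
To make the idea of term associations over varying time granularities more concrete, the
sketch below counts how often pairs of terms co-occur in the same tweet within each hour, day
or month. It is only a toy illustration and does not reproduce the query operators of
Plachouras and Stavrakas [59]; the input format (a timestamp plus a list of already extracted
terms), the function name `cooccurrence_by_bucket` and the supported granularities are
assumptions made for this example.

```python
from collections import Counter
from datetime import datetime
from itertools import combinations

# Assumption: a time granularity is expressed as a strftime pattern naming the bucket.
GRANULARITY_FORMATS = {
    "hour": "%Y-%m-%d %H",
    "day": "%Y-%m-%d",
    "month": "%Y-%m",
}


def cooccurrence_by_bucket(tweets, granularity="day"):
    """Count co-occurring term pairs per time bucket.

    `tweets` is an iterable of (created_at, terms) pairs, where `created_at`
    is a datetime and `terms` is a list of terms found in the tweet.
    Returns a dict mapping bucket label -> Counter of sorted term pairs.
    """
    fmt = GRANULARITY_FORMATS[granularity]
    counts = {}
    for created_at, terms in tweets:
        bucket = created_at.strftime(fmt)
        # Sort and de-duplicate so that each unordered pair has one canonical form.
        pairs = combinations(sorted(set(terms)), 2)
        counts.setdefault(bucket, Counter()).update(pairs)
    return counts


if __name__ == "__main__":
    # Hypothetical sample data.
    sample = [
        (datetime(2014, 3, 1, 9, 15), ["#flood", "rescue"]),
        (datetime(2014, 3, 1, 9, 40), ["rescue", "#flood", "road"]),
        (datetime(2014, 3, 1, 17, 5), ["#flood", "road"]),
        (datetime(2014, 3, 2, 8, 0), ["#flood", "rescue"]),
    ]
    for bucket, pair_counts in sorted(cooccurrence_by_bucket(sample, "day").items()):
        print(bucket, pair_counts.most_common(3))
```

Switching the `granularity` argument from "day" to "hour" regroups the same tweets into finer
buckets, which is one simple way to observe how an association between two terms strengthens
or fades over time.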

Table 1: Overview of related approaches in data management frameworks.
System | Preprocessing | Examples of extracted information | Social and/or other interactions captured? | Data store
TwitterEcho [16] | | Language | Yes | Not given
Byun et al. [19] | | Location | Yes | Relational
Twitter Zombie [15] | | | Yes | Relational
TwitHoard [69] | | | Yes | Graph DB
CoalMine [72] | | | No | Files
TrendMiner [61] | | Location, Sentiment, NEs | No | Key-value pairs
TwitIE [17] | | Language, Location, NEs | No | Not given
ESA [76] | | Location, NEs | No | Not given
Baldwin et al. [11] | | Language, Location | No | Flat files
Operators also allow retrieving a subset of tweets satisfying these complex conditions on term
associations. This enables the end user to select a good set of terms (hashtags) that drive
the data collection, which has a direct impact on the quality of the results generated from
the analysis.
Spatial features are another property of tweets often overlooked in complex analysis.
Previously discussed work uses the location attribute as a mechanism to filter tweets in
space. To complete our discussion we briefly outline two studies that use geo-spatial
properties to perform complex analysis using the location attribute. Doytsher et al. [31]
introduced a model and query language suited for integrated data connecting a social network
of users with a spatial network to identify places visited frequently. Edges named
life-patterns are used to associate the social and spatial networks. Different time
granularities can be expressed for each visited location represented by the life-pattern edge.
Even though the implementation employs a partially synthetic dataset, it will be interesting
to investigate how the socio-spatial networks and the life-pattern edges that are used to
associate the spatial and social networks can be represented in a real social network dataset
with location information, such as Twitter. GeoScope [18] finds information trends by
detecting significant correlations among trending location-topic pairs in a sliding window.
This gives rise to the importance of capturing the notion of spatial information trends in
social networks in analysis tasks. Real-time detection of crisis events from a location in
space exhibits the possible value of GeoScope. In one of the experiments Twitter is used as a
case study to demonstrate its usefulness, where a hashtag is chosen to represent the topic and
the city from which the tweet originates is chosen to capture the location.

4.2 Data Model for the Languages
Relational, RDF and Graphs are the most common choices of data representation. There is a
close affiliation in these data models observing that, for instance, a graph can easily
correspond to a set of RDF triples or vice versa. In fact, some studies like Plachouras and
Stavrakas [59] have put forward their data model as a labeled multi digraph and have chosen a
relational database for its implementation. None of these query systems models the Twitter
social network with following or retweet relationships among users. Doytsher et al. [31]
implement their algebraic query operators with the use of both a graph and a relational
database as the underlying data storage. They experimentally compare relational and graph
databases to demonstrate the feasibility of the model. Languages that operate on the Twitter
stream like TweeQL and TwarQL generate the output in real-time; TweeQL [49] allows the
resulting tweets to be collected in batches and then stores them in a relational database,
while in TwarQL [53], at the end of the information extraction phase, annotated tweets are
encoded in RDF.

4.3 Query Languages for Social Networks
To the best of our knowledge, there is no existing work focusing on high level languages
operating on Twitter's social graph. However, it is important to note proposals for
declarative query languages tailored for querying social networks in general [10, 30, 32, 50,
51, 64]. Among the supported queries are path queries satisfying a set of conditions on the
path, and the languages in general take advantage of inherent properties of social networks.
Semantics of the languages are based on Datalog [51], SQL [32, 64] or RDF/SPARQL [50].
Implementations are conducted on bibliographical networks [32], Facebook and evolving social
content sites like Yahoo! Travel [10] and are not tested on Twitter networks taking Twitter
specific affordances into consideration.

4.4 Information Retrieval - Tweet Search
Another class of systems presents textual queries to efficiently search over a corpus of
tweets. The challenges in this area are similar to those of information retrieval, with the
addition of having to deal with the peculiarities of tweets. The short length of tweets in
particular creates added complexity to text-based search tasks as it is difficult to identify
relevant tweets matching a user query [13, 35]. Expanding tweet content is suggested as a way
to enhance the meaning. The goal of such systems is to express a user's information need in
the form of a simple text query, much like in search engines, and return a tweet list in
real-time with effective strategies for ranking and relevance measurements [33, 34, 71].
Indexing mechanisms are discussed in [23] as they directly impact efficient retrieval of
tweets. The TREC micro-blogging track⁵ is dedicated to calling participants to conduct
real-time ad-hoc search tasks over a given tweet collection. Publications of TREC [56]
document the findings of all systems in the task of ranking the most relevant tweets matching
a pre-defined set of user queries.
Table 2 illustrates an overview of related approaches in systems for querying tweets. Data
models and dimensions investigated in each system are depicted in Table 2.

⁵ http://trec.nist.gov/
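
The tweet search task of Section 4.4 can be illustrated with a very small inverted index and
a term-weighting score. This is a generic textbook-style sketch, not the ranking strategy of
any system cited above; the function names, the inverse-document-frequency style weight and
the tie-breaking rule are assumptions chosen for brevity.

```python
import math
import re
from collections import defaultdict

# Assumption: a token is any run of word characters, lower-cased.
TOKEN = re.compile(r"\w+")


def tokenize(text):
    return TOKEN.findall(text.lower())


def build_index(tweets):
    """Build an inverted index: term -> set of ids of tweets containing the term."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for term in set(tokenize(text)):
            index[term].add(tweet_id)
    return index


def search(query, index, n_tweets, top_k=10):
    """Rank tweets by the summed rarity weight of the query terms they contain."""
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = index.get(term, ())
        if not postings:
            continue
        # Rarer terms contribute more; tweets are short, so term frequency is ignored.
        weight = math.log(1 + n_tweets / len(postings))
        for tweet_id in postings:
            scores[tweet_id] += weight
    # Highest score first; ties are broken by tweet id for a deterministic order.
    return sorted(scores, key=lambda tweet_id: (-scores[tweet_id], tweet_id))[:top_k]


if __name__ == "__main__":
    # Hypothetical tweet collection keyed by tweet id.
    collection = {
        1: "Earthquake felt in the city centre",
        2: "City marathon this weekend",
        3: "Strong earthquake, stay safe",
    }
    idx = build_index(collection)
    print(search("earthquake city", idx, len(collection)))
```

Ignoring term frequency is a deliberate simplification here: in a text as short as a tweet a
term rarely repeats, so the presence of a rare query term carries most of the signal.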

Table 2: Overview of approaches in systems for querying tweets.
Column groups: Data Model (Relational, RDF, Graph); Explored dimensions (Text, Time, Space, Social Network); Real-Time.
System | Relational | RDF | Graph | Text | Time | Space | Social Network | Real-Time
TweeQL [49] | | | | | | | | Yes
TwarQL [53] | | | | | | | | Yes
Plachouras et al. [60] | | | | | | | | No
Doytsher et al. [31] | | | | | | | | No
GeoScope et al. [18] | | | | | | | | Yes
Languages on social networks | | | | | | | | No
Tweet search systems | | | | | | | | Yes
Systems that have made provision for the real-time streaming nature of the tweets are
indicated in the Real-Time column. Multiple ticks correspond to a dimension explored in
detail. Note that the systems marked with an asterisk (*) are not implemented specifically
targeting tweets, though their application is meaningful and can be extended to the
Twittersphere. We observe potential for developing languages for querying tweets that include
querying by dimensions that are not captured by existing systems, especially the social graph.

5. THE NEED FOR INTEGRATED SOLUTIONS
There is a need to assimilate individual efforts with the goal of providing a unified
framework that can be used by re-

mation extraction considering the inherent peculiarities of tweets and not all frameworks we
discussed provided this functionality. Pre-processing components, for example normalization
and tokenization, should be implemented, with the option for the end users to customize the
modules to suit their requirements. Information extraction modules such as location
prediction, sentiment classification and named entity recognition are useful in conducting
analysis on tweets and attempt to derive more information from plain tweet text and their
metadata. Ideally, a user should be able to integrate any combination of the components into
their own applications.
Data model: Much of the literature presented (Sections 3.5 and 4.2) does not emphasize or draw
explicit discussions on the data model in use. The logical data model greatly influences the
types of analysis that can be done with relative ease on collected data. A physical
representation of the
5. THEandNEED
searchers FOR across
practitioners INTEGRATED
many disciplines. SOLU- Inte- model
and
of large4.2)volumes
suitable
does notof emphasize
indexing
data is an or
storage mechanisms
draw explicit
important discussions
consideration for
grated solutions should ideally handle the entire workflow
TIONS
of the data analysis life cycle from collecting the tweets to
on the data model in use. The logical
efficient retrieval. We notice that current research pays data model greatly
lit-
There is a need to assimilate influences theto types of analysis that interactions,
can be done with rela-
presenting the results to theindividual
user. Theefforts with we
literature the have
goal tle attention queries on Twitter the social
of providing a unifiedsectionsframework that efforts
can bethat usedsupport
by re- tive
graph ease
in on collected A
particular. data.
graphA physical
view of representation
the Twittersphere of the is
reviewed in previous outlines
searchers and practitioners across model involving suitableand indexing and storage
great mechanisms
different parts of the workflow. In many disciplines.
this section, Inte-
we present consistently overlooked we recognize potential in
grated solutions of
thislarge
area.volumes of data is an important consideration for
our position withshouldthe aim ideally handle the
of outlining entire workflow
significant compo- The graph construction on Twitter is not lim-
of the ofdata analysis life cycle from collecting the tweets to efficient retrieval. We notice that current
ited to considering the users as nodes and links as follow- research pays lit-
nents an integrated solution addressing the limitations of
presenting the results to the user. The literature we have tle attention
ing relationships;to queries
embracing on Twitter
useful interactions,
characteristics thesuchsocial
as
existing systems.
reviewed in previous sections outlines efforts that support graph
retweet, in mention
particular. and Ahashtag
graph view of the Twittersphere
(co-occurrence) networks is
in
According to a review of literature conducted on the mi-
different parts of the workflow. In this section, we present consistently
the data modeloverlooked
will createandopportunity
we recognize to great
conduct potential
complex in
croblogging platform [25], the majority of published work
our positionconcentrates
with the aim this area.on Thethesegraph construction on ofTwitter
tweets.is We not envi-
lim-
on Twitter on of theoutlining
user domain significant
and the compo-
mes- analysis structural properties
nentsdomain.
of an integrated ited to considering the users as nodes and links as follow-
sage The usersolutiondomain addressing the limitations
explores properties of Twit- of sion a new data model that proactively captures structural
existing ing relationships; embracing useful characteristics such as
ter userssystems.
in the microblogging environment while the mes- relationships taking into consideration efficient retrieval of
According to a review of literature conducted on the mi- retweet,
relevant mention
data to and hashtag
perform the (co-occurrence) networks in
queries.
sage domain deals with properties exhibited by the tweets
croblogging platform [25], the majority the data model will create opportunity to conduct complex
themselves. In comparison to the extentofofpublished
work done workon Query
on Twitter concentrates on the user domain and the mes- analysis language: Languages
on these structural described
properties in Section
of tweets. We4,envi-
de-
the microblogging platform, only a few investigates the de- fines both simple operators to be appliedcaptures
on tweets and ad-
sage domain. The user domain explores properties of Twit- sion a new data model that proactively structural
velopment of data management frameworks and query lan- vanced operators that extract complex patterns, which can
ter users in the microblogging environment while the mes- relationships taking into consideration efficient retrieval of
guages that describe and facilitate processing of online so- be manipulated in different types of applications. Some lan-
sage domain deals with properties exhibited by the tweets relevant data to perform the queries.
cial networking data. In consequence, there is opportunity guages provide support for continuous queries on the stream
themselves.
for improvement In comparison
in this areatoforthe extent
future of work
research done on
addressing Query
or queries language:
on a storedLanguages
collection, described
while others in offer
Section 4, de-
flexibility
the microblogging
the challenges in dataplatform, only a few
management. Weinvestigates the de-
elicit the following finesboth.
for both Thesimple operators
advent to be view
of a graph applied on tweets
makes crucialand ad-
contri-
velopment of data management
high-level components and envisage frameworks
a platform andfor
query lan-
Twitter vanced
butions operators
in analyzing that theextract complex allowing
Twittersphere patterns,uswhich can
to query
guages
that that describe
encompasses suchand facilitate processing of online so-
capabilities: be manipulated
twitter data in in different
novel types offorms.
and varying applications.
It willSome lan-
be inter-
cial networking data. In consequence, there is opportunity guagesto
esting provide support
investigate how fortypical
continuous queries on
functionality the
[73] stream
provided
Focused crawler:
for improvement Responsible
in this area for for retrieval
future researchandaddressing
collection
or
by queries
graph queryon a stored
languagescollection,
can be while otherstooffer
adapted flexibility
Twitter net-
of Twitter
the challengesdata in bydata crawling
management.the publiclyWe elicitaccessible Twit-
the following for both.
works. In The adventquery
developing of a graph view makes
languages, one crucial
could contri-
investigate
ter APIs. A focused crawler should
high-level components and envisage a platform for Twitter allow the user to de-
butions in analyzing
the distinction between the real-time
Twittersphere allowing
and batch modeus to query
process-
fine
that aencompasses
campaign with such suitable
capabilities:filters, monitor output and
twitter data in novel and varying forms.
ing. Visualizing the data retrieved as a result of a query It will be inter-
iteratively crawl Twitter for large volumes of data until its
Focused crawler: Responsible for retrieval and collection esting
in a to investigate
suitable manner how
is typical
also an functionality
important [73] provided
concern. A set of
coverage of relevant tweets is satisfactory.
of Twitter data by crawling the publicly accessible Twit- by graph query
pre-defined output languages
formats can will bebe useful
adapted to Twitter
in order to providenet-
Pre-processor:
ter APIs. A focused As highlighted
crawler should in Section
allow the 3.2,user
thistostagede- works.
an In developing
informative query languages,
visualization over a mapone forcould
a queryinvestigate
that re-
usually compriseswith
fine a campaign of modules
suitablefor pre-processing
filters, monitor output and infor-and the distinction between real-time and batch mode process-
iteratively crawl Twitter for large volumes of data until its ing. Visualizing the data retrieved as a result of a query
coverage of relevant tweets is satisfactory. in a suitable manner is also an important concern. A set of
pre-defined output formats will be useful in order to provide
Pre-processor: As highlighted in Section 3.2, this stage an informative visualization over a map for a query that re-
usually comprises of modules for pre-processing and infor-

turns an array of locations. Another interesting avenue to characters, which make use of informal language, undoubt-
explore is the introduction of a ranking mechanism on edly making a simple task of POS tagging more challeng-
the query result. Ranking criteria may involve relevance, ing. Besides the length limit, heavy and inconsistent usage
timeliness or network attributes like the reputation of users of abbreviations, capitalizations and uncommon grammar
in the case of a social graph. Ranking functions are a stan- constructions pose additional challenges to text processing.
turns an array of in
dard requirement locations. Another
the field of interesting
information avenue
retrieval to
[23,39] characters,
With the volume which of make tweets usebeing
of informal
orders language,
of magnitude undoubt-
more
explore is the introduction of a ranking mechanism
and studies like SociQL [30] report the use of visibility and on edly making a simple task of
than news articles, most of the conventional methodsPOS tagging more challeng-
cannot
the query result.
reputations metricsRanking
to rank criteria may involve
results generated fromrelevance,
a social ing.directly
be Besidesapplied
the length to the limit,
noisyheavy genreand inconsistent
of Twitter data.usageAny
timeliness or network
graph. A query languageattributes
with a like
graphtheview
reputation of users
of the Twitter- of abbreviations,
effort that uses Twitter capitalizations
data needsand uncommon
to make grammar
use of appropri-
in the case
sphere alongofwith
a social graph. Ranking
capabilities functionsand
for visualizations are ranking
a stan- constructions
ate twitter-specific pose additional
strategies to challenges
pre-process to text
text processing.
addressing
dardcertainly
will requirement in the
benefit thefield of information
upcoming retrieval
data analysis [23,39]
efforts of With
the the volume
challenges of tweets
associated being
with ordersproperties
intrinsic of magnitude more
of tweets.
and studies like SociQL [30] report the use of visibility and
Twitter. than newsinformation
Similarly, articles, most of the conventional
extraction from tweetsmethods cannot
is not straight-
reputations metrics
Here, we focused on to
therank
key results generated
ingredients from
required foraasocial
fully be directly
forward as it applied to the
is difficult tonoisy
derivegenre contextof Twitter
and topics data. fromAny a
graph.
developed A query language
solution with a improvements
and discussed graph view of the Twitter-
we can make effort that is
tweet that uses Twitter data
a scattered part of needs to make useThere
a conversation. of appropri-
is sep-
sphere alongliterature.
on existing with capabilities for visualizations
In the next and ranking
section, we identify chal- ate
aratetwitter-specific strategies to
literature on identifying pre-process
entities text addressing
(references to organi-
will certainly
lenges benefitissues
and research the upcoming
involved. data analysis efforts of the challenges
zations, places,associated
products, persons) with intrinsic properties
[47,63], languages of [20,36],
tweets.
Twitter. Similarly,
sentiment information
[57] present extraction
in the tweetfrom texttweets is not source
for a richer straight- of
Here, we focused on the key ingredients required for a fully forward
information.as it is difficult istoanother
Location derive context and topics
vital property from a
represent-
6. CHALLENGES
developed ANDimprovements
solution and discussed RESEARCH we ISSUES
can make tweet
ing that isfeatures
spatial a scattered either partof ofthe a conversation.
tweet or of the There
user.is sep-
The
on
To existing
completeliterature. In the next
our discussion, section,
in this sectionweweidentify chal-
summarize arate
locationliterature
of eachon identifying
tweet may beentities optionally (references
recorded to iforgani-
using
lenges
key and research
research issues
issues in datainvolved.
management and present tech- zations, places, products,
a GPS-enabled device. Apersons) user can [47,63], languages
also specify his[20,36],
or her
nical challenges that need to be addressed in the context of sentiment
location as[57] present
a part of the in theuser tweet
profile textand
for isa richer
often source
reported of
building a data analytics platform for Twitter. information. Location is another
in varying granularities. The drawback is that only a smallvital property represent-
6. CHALLENGES AND RESEARCH ISSUES ing spatial
portion features
of about 1% either of the tweet
of the tweets or of the [24].
are geo-located user. Since
The
6.1complete
To Data Collection Challenges
our discussion, in this section we summarize location of eachalways
analysis almost tweet may requires be optionally
the location recorded
property, if using
when
key research issues in data management and present tech- a GPS-enabled
absent, device. A
studies conduct userown
their canmechanisms
also specifytohisinfer or her
lo-
Once a suitable Twitter API has been identified, we can
nical challenges that need to be addressed in the context of locationofas
cation the a part
user, of the user
a tweet, or profile
both. and There is often
are two reported
major
define a campaign with a set of parameters. The focused
building a data analytics platform for Twitter. in varying granularities.
approaches for location The drawback
prediction: is thatanalysis
content only a small with
crawler can be programmed to retrieve all tweets matching
the query of the campaign. If a social graph is necessary, portion of about
probabilistic 1% of the
language modelstweets [24,are27,geo-located
37] or inference[24]. Since
from
6.1 Data Collection Challenges
separate modules would be responsible to create this net- analysis
social and almost
otheralways
relations requires
[22, 29,the 67].location property, when
Once iteratively.
work a suitable Exhaustively
Twitter API crawling
has beenallidentified, we can
the relationships absent, studies conduct their own mechanisms to infer lo-
define a campaign with a set of parameters.
between Twitter users is prohibitive given the restrictions The focused 6.3
cation Data
of the Management
user, a tweet, or Challenges both. There are two major
crawler
set by the canTwitter
be programmed
API. Hence to itretrieve all tweets
is required for thematching
focused approaches for location prediction: content analysis with
In Section 3.5 and Section 4.2 we outlined several alterna-
the query
crawler to of the campaign.
prioritize If a social
the relationships to graph is necessary,
crawl based on the probabilistic language models [24, 27, 37] or inference from
tive approaches in literature for a data model to charac-
separateand
impact modules wouldofbespecific
importance responsible
Twitter to accounts.
create thisInnet- the social and other relations [22, 29, 67].
terise the Twittersphere. The relational and RDF models
work iteratively.
case that Exhaustively
the platform handlescrawling
multipleallcampaigns
the relationships
in par- are frequently chosen while graph-based models are acknowl-
between
allel, there Twitter users to
is a need is optimize
prohibitive thegiven
accessthetorestrictions
the API. 6.3 Data Management Challenges
edged, however not realized concretely at the implementa-
set by the Twitter API. Hence itofisarequired
crawler for the focused In
Typically, the implementation should aim to tion phase.3.5
Section WaysandinSectionwhich 4.2 we we canoutlined
apply graphseveral dataalterna-
man-
crawler
minimizetothe prioritize
number theofrelationships
API requests, to considering
crawl based the on the
re- tive approaches in literature for a diverse
data model to charac-
agement in Twitter are extremely and interesting;
impact
strictions, while fetching data for many campaigns in In
and importance of specific Twitter accounts. the terise thetypes
Twittersphere.
paral- different of networksThe canrelational
be constructedand RDF apart models
from
case that the
lel. Hence platform
building handles multiple
an effective crawling campaigns
strategy is in par-
a chal- are frequently chosen while graph-based models are acknowl-
the traditional social graph as outlined in Section 5. With
allel, there is a need to optimize the access to the API. edged, however not realized
lenging task, in order to optimize the use of API requests the advent of a graph view to concretely at the implementa-
model the Twittersphere, gives
Typically, the implementation of a crawler should aim to tion phase. Ways in which
available. rise to a range of queries thatwe cancan be apply
performedgraph ondata man-
the struc-
minimize
Appropriate thecoverage
number of ofthe
API requests,is considering
campaign the re-
another significant agement intweets
Twitter are extremely diverse and range
interesting;
ture of the essentially capturing a wider of use
strictions, while fetching data for many campaigns in paral- different typesused of networks
concern and denotes whether all the relevant information has case scenarios in typicalcan data be analytics
constructed apart from
tasks.
lel. Hence building an effective crawling strategy is to adefine
chal- the traditional
been collected. When specifying the parameters As discussed in social
Sectiongraph as outlined
4.3, there in Section
are already languages5. Withsim-
lenging task, ina order
the campaign, to optimize
user needs a very the good useknowledge
of API requestson the the advent of a graph view to model the Twittersphere, gives
ilar to SQL adapted to social networks. Many of the tech-
available. rise to ainrange of queries
relevant keywords. Depending on the specified keywords, niques literature are that
for the can case
be performed
of genericonsocialthe struc-
net-
Appropriate
a collection coverage
may missofrelevant
the campaign tweetsis inanother
addition significant
to the ture of the tweets essentially capturing a wider range of use
works under a number of specific assumptions. For example
concern and denotes whether all theby relevant
APIs. information has casesocial
scenarios used satisfy
in typical data analytics
tweets removed due to restrictions Plachouras and the networks properties such astasks.
the power law
been collected.
Stavrakas work [60]Whenis anspecifying
initial stepthe parameters
in this directiontoasdefineit in- As discussed in Section 4.3,small
therediameters
are already languages sim-
distribution, sparsity and [52]. We envision
the campaign,
vestigates a user of
this notion needs a very
coverage andgood knowledge
proposes on the
mechanisms ilar to SQL adapted to social networks. Many of the tech-
queries that take another step further and executes on Twit-
relevant keywords. Depending on the specified keywords, niques in literature are forlanguages
the case FQL of generic social
to automatically adapt the campaign to evolving hashtags. ter graphs. Simple query [1] and YQLnet- [7]
a collection may miss relevant tweets in addition to the works
provideunderfeaturesa number
to explore of specific
properties assumptions.
of Facebook For example
and Ya-
6.2 Pre-processing
tweets Challenges
removed due to restrictions by APIs. Plachouras and the social
hoo APIs but networks satisfy
are limited toproperties
querying only suchpart(usually
as the power law
a sin-
Stavrakas work [60] is an initial step in this direction as it in- distribution, sparsity and small diameters [52]. We
Many problems associated with summarization, topic de- gle users connections) of the large social graph. As envision
pointed
vestigates this notion of coverage and proposes mechanisms queries thatreport
take another
tection and part-of-speech (POS) tagging, in the case of out in the on the step furtherand
Databases and Web
executes2.0 onPanelTwit-at
to automatically adapt the campaign to evolving hashtags. ter graphs. Simple query languages FQL [1] trust,
and YQL [7]
well-formed documents, e.g. news articles, have been ex- VLDB 2007 [9], understanding and analyzing author-
tensively studied in the literature. Traditional named entity provide features to
ity, authenticity, and explore
other properties
quality measures of Facebook and net-
in social Ya-
6.2 Pre-processing Challenges
recognizers (NERs) heavily depend on local linguistic fea- hoo APIs but are limited to querying
works pose major research challenges. While there are in- only part(usually a sin-
Many [62]
tures problems associated
of well-formed with summarization,
documents like capitalizationtopic andde- gle users connections)
vestigations on these quality of themeasures
large social graph. As
in Twitter, itspointed
about
tection
POS tagging and part-of-speech
of previous words. (POS) Nonetagging,
of the in the case of
characteristics out
timeinthatthewe report
enable ondeclarative
the Databases and Web
querying 2.0 Panel
of networks usingat
well-formed
hold for tweets documents,
with shorte.g. news articles,
utterances of tweetshave limitedbeen ex-
to 140 VLDB
such 2007 [9], understanding and analyzing trust, author-
measurements.
tensively studied in the literature. Traditional named entity ity, authenticity, and other quality measures in social net-
recognizers (NERs) heavily depend on local linguistic fea- works pose major research challenges. While there are in-
tures [62] of well-formed documents like capitalization and vestigations on these quality measures in Twitter, its about
POS tagging of previous words. None of the characteristics time that we enable declarative querying of networks using
hold for tweets with short utterances of tweets limited to 140 such measurements.
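As a rough illustration of the graph view discussed in Sections 5 and 6.3, the following sketch derives retweet, mention and hashtag co-occurrence networks from a handful of tweet records and then filters users by a simple network measure. The record layout (`user`, `mentions`, `retweet_of`, `hashtags`), the function names and the use of weighted in-degree as a stand-in for a reputation measure are all assumptions made for this example; they do not correspond to any Twitter API schema or to a method proposed in the surveyed systems.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tweet records; the field names are illustrative assumptions,
# not the schema of any Twitter API.
TWEETS = [
    {"user": "alice", "mentions": ["bob"], "retweet_of": None,
     "hashtags": ["bigdata", "graphs"]},
    {"user": "carol", "mentions": [], "retweet_of": "alice",
     "hashtags": ["bigdata"]},
    {"user": "bob", "mentions": ["alice", "carol"], "retweet_of": None,
     "hashtags": ["graphs", "nosql", "bigdata"]},
    {"user": "dave", "mentions": [], "retweet_of": "alice",
     "hashtags": []},
]


def build_networks(tweets):
    """Build weighted edge maps for three networks derived from tweets."""
    retweet = defaultdict(int)  # (retweeting user, original author) -> count
    mention = defaultdict(int)  # (author, mentioned user) -> count
    cooccur = defaultdict(int)  # (hashtag_a, hashtag_b) with a < b -> count
    for t in tweets:
        if t["retweet_of"]:
            retweet[(t["user"], t["retweet_of"])] += 1
        for m in t["mentions"]:
            mention[(t["user"], m)] += 1
        # Every unordered pair of distinct hashtags in one tweet co-occurs once.
        for a, b in combinations(sorted(set(t["hashtags"])), 2):
            cooccur[(a, b)] += 1
    return retweet, mention, cooccur


def weighted_in_degree(edges):
    """Sum incoming edge weights per node (a crude reputation proxy)."""
    score = defaultdict(int)
    for (_, dst), weight in edges.items():
        score[dst] += weight
    return dict(score)


def users_with_min_score(edges, threshold):
    """Return the nodes whose weighted in-degree is at least `threshold`."""
    scores = weighted_in_degree(edges)
    return sorted(u for u, s in scores.items() if s >= threshold)


retweet_net, mention_net, hashtag_net = build_networks(TWEETS)
popular = users_with_min_score(retweet_net, 2)
```

Keeping the three networks as separate edge maps is only one possible design; a platform of the kind envisaged here would more likely store them as typed edges in a graph database so that a declarative query language can combine them.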

One of the predominant challenges is the management of large graphs that inevitably result from modeling users, tweets and their properties as graphs. With the large volume of data involved in any practical task, a data model should be information rich, yet a concise representation that enables expression of useful queries. Queries on graphs should be optimized for large networks and should ideally run independent of the size of the graph. There are already approaches that investigate efficient algorithms on very large graphs [41-43, 66]. Efficient encoding and indexing mechanisms should be in place, taking into account variations of indexing systems already proposed for tweets [23] and indexing of graphs [75] in general. We need to consider maintaining indexes for tweets, keywords, users and hashtags for efficient access to data in advanced queries. In certain situations it may be impractical to store the entire raw tweet and the complete user graph, and it may be desirable to either compress or drop portions of the data. It is important to investigate which properties of tweets and the graph should be compressed.

Besides the above challenges, tweets impose general research issues related to big data. Challenges should be addressed in the same spirit as any other big data analytics task. In the face of challenges posed by large volumes of data being collected, the NoSQL paradigm should be considered as an obvious choice for dealing with them. Developed solutions should be extensible for upcoming requirements and should indeed scale well. When collecting data, a user needs to consider scalable crawling as there are large volumes of tweets received, processed and indexed per second. With respect to implementation, it is necessary to investigate paradigms that scale well, like MapReduce, which is optimized for offline analytics on large data partitioned on hundreds of machines. Depending on the complexity of the queries supported, it might be difficult to express graph algorithms intuitively in MapReduce graph models; consequently, databases such as Titan [4], DEX [3], and Neo4j [2] should be compared for graph implementations.

7. CONCLUSION

In this paper we addressed issues around the big data nature of Twitter analytics and the need for new data management and query language frameworks. By conducting a careful and extensive review of the existing literature we observed ways in which researchers have tried to develop general platforms to provide a repeatable foundation for Twitter data analytics. We reviewed the tweet analytics space by exploring mechanisms primarily for data collection, data management and languages for querying and analyzing tweets. We have identified the essential ingredients required for a unified framework that addresses the limitations of existing systems. The paper outlines research issues and identifies some of the challenges associated with the development of such integrated platforms.

8. REFERENCES

[1] FaceBook Query Language (FQL) overview. https://developers.facebook.com/docs/technical-guides/fql.

[2] Neo4j: The world's leading graph database. http://www.neo4j.org/.

[3] Sparksee: Scalable high-performance graph database. http://www.sparsity-technologies.com/.

[4] Titan: distributed graph database. http://thinkaurelius.github.io/titan.

[5] TrendsMap, Realtime local twitter trends. http://trendsmap.com/.

[6] Twitalyzer: Serious analytics for social business. http://twitalyzer.com.

[7] Yahoo! Query Language guide on YDN. https://developer.yahoo.com/yql/.

[8] F. Abel, C. Hauff, and G. Houben. Twitcident: fighting fire with information from social web streams. In WWW, pages 305-308, 2012.

[9] S. Amer-Yahia, V. Markl, A. Halevy, A. Doan, G. Alonso, D. Kossmann, and G. Weikum. Databases and Web 2.0 panel at VLDB 2007. In SIGMOD Record, volume 37, pages 49-52, Mar. 2008.

[10] S. Amer-Yahia, L. V. Lakshmanan, and C. Yu. SocialScope: Enabling information discovery on social content sites. In CIDR, 2009.

[11] T. Baldwin, P. Cook, and B. Han. A support platform for event detection using social intelligence. In Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 69-72, 2012.

[12] L. Barbosa and J. Feng. Robust sentiment detection on Twitter from biased and noisy data. pages 36-44, Aug. 2010.

[13] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi. Eddi: interactive topic-based browsing of social status streams. In 23rd annual ACM symposium on User interface software and technology - UIST, pages 303-312, Oct. 2010.

[14] A. Bifet and E. Frank. Sentiment knowledge discovery in twitter streaming data. Discovery Science. Springer Berlin Heidelberg, pages 1-15, Oct. 2010.

[15] A. Black, C. Mascaro, M. Gallagher, and S. P. Goggins. Twitter Zombie: Architecture for capturing, socially transforming and analyzing the Twittersphere. In International conference on Supporting group work, pages 229-238, 2012.

[16] M. Boanjak and E. Oliveira. TwitterEcho - A distributed focused crawler to support open research with twitter data. In International conference companion on World Wide Web, pages 1233-1239, 2012.

[17] K. Bontcheva and L. Derczynski. TwitIE: an open-source information extraction pipeline for microblog text. In International Conference on Recent Advances in Natural Language Processing, 2013.

[18] C. Budak, T. Georgiou, and D. E. Abbadi. GeoScope: Online detection of geo-correlated information trends in social networks. PVLDB, 7(4):229-240, 2013.
technical-guides/fql.
[18] C. Budak, T. Georgiou, and D. E. Abbadi. GeoScope:
[2] Neo4j: The worlds leading graph database. http:// Online detection of geo-correlated information trends
www.neo4j.org/. in social networks. PVLDB, 7(4):229240, 2013.

[19] C. Byun, H. Lee, Y. Kim, and K. K. Kim. Twitter data collecting tool with rule-based filtering and analysis module. International Journal of Web Information Systems, 9(3):184–203, 2013.
[20] S. Carter, W. Weerkamp, and M. Tsagkias. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195–215, June 2012.
[21] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring user influence in twitter: The million follower fallacy. In ICWSM, pages 10–17, 2010.
[22] S. Chandra, L. Khan, and F. B. Muhaya. Estimating twitter user location using social interactions: a content based approach. In IEEE Conference on Privacy, Security, Risk and Trust, pages 838–843, Oct. 2011.
[23] C. Chen, F. Li, C. Ooi, and S. Wu. TI: An efficient indexing mechanism for real-time search. In SIGMOD, pages 649–660, 2011.
[24] Z. Cheng, J. Caverlee, K. Lee, and C. Science. A content-driven framework for geo-locating microblog users. ACM Transactions on Intelligent Systems and Technology, 2012.
[25] M. Cheong and S. Ray. A literature review of recent microblogging developments. Technical report, Clayton School of Information Technology, Monash University, 2011.
[26] Chew, Cynthia, and G. Eysenbach. Pandemics in the age of twitter: content analysis of tweets during the 2009 H1N1 outbreak. PloS one, 5(11), 2010.
[27] B. O. Connor, N. A. Smith, and E. P. Xing. A latent variable model for geographic lexical variation. In Conference on Empirical Methods in Natural Language Processing, pages 1277–1287, 2010.
[28] Conover, Michael, J. Ratkiewicz, M. Francisco, B. Goncalves, F. Menczer, and A. Flammini. Political polarization on Twitter. In ICWSM, 2011.
[29] J. David. That's what friends are for: inferring location in online social media platforms based on social relationships. In ICWSM, 2013.
[30] Diego Serrano, Eleni Stroulia, Denilson Barbosa and V. Guana. SociQL: A query language for the social Web. In E. Kranakis, editor, Advances in Network Analysis and its Applications, chapter 17, pages 381–406. 2013.
[31] Y. Doytsher and B. Galon. Querying geo-social data by bridging spatial networks and social networks. In 2nd ACM SIGSPATIAL International Workshop on Location Based Social Networks, pages 39–46, 2010.
[32] A. Dries, S. Nijssen, and L. De Raedt. A query language for analyzing networks. In CIKM, pages 485–494, 2009.
[33] M. Efron. Hashtag retrieval in a microblogging environment. pages 787–788, 2010.
[34] S. Frénot and S. Grumbach. An in-browser microblog ranking engine. In International conference on Advances in Conceptual Modeling, volume 7518, pages 78–88, 2012.
[35] G. Golovchinsky and M. Efron. Making sense of Twitter search. In CHI, 2010.
[36] M. Graham, S. A. Hale, and D. Gaffney. Where in the world are you? Geolocation and language identification in Twitter. In ICWSM, pages 518–521, 2012.
[37] B. Hecht, L. Hong, B. Suh, and E. Chi. Tweets from Justin Bieber's heart: the dynamics of the location field in user profiles. In Conference on Human Factors in Computing Systems, pages 237–246, 2011.
[38] B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology, 60(11):2169–2188, Nov. 2009.
[39] J. Jiang, L. Hidayah, T. Elsayed, and H. Ramadan. BEST of KAUST at TREC-2011: Building effective search in Twitter. TREC, 2011.
[40] P. Jürgens, A. Jungherr, and H. Schoen. Small worlds with a difference: new gatekeepers and the filtering of political information on Twitter. In International Web Science Conference-WebSci, pages 1–5, June 2011.
[41] U. Kang, D. H. Chau, and C. Faloutsos. Managing and mining large graphs: Systems and implementations. In SIGMOD, volume 1, pages 589–592, 2012.
[42] U. Kang and C. Faloutsos. Big graph mining: Algorithms and discoveries. SIGKDD Explorations, 14(2):29–36, 2013.
[43] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos. Gbase: An efficient analysis platform for large graphs. VLDB Journal, 21(5):637–650, June 2012.
[44] S. Kumar, G. Barbier, M. Abbasi, and H. Liu. TweetTracker: An analysis tool for humanitarian and disaster relief. In ICWSM, pages 661–662, 2011.
[45] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010.
[46] C.-H. Lee, H.-C. Yang, T.-F. Chien, and W.-S. Wen. A novel approach for event detection by mining spatio-temporal information on microblogs. In International Conference on Advances in Social Networks Analysis and Mining, pages 254–259, July 2011.
[47] C. Li, J. Weng, Q. He, Y. Yao, and A. Datta. TwiNER: named entity recognition in targeted twitter stream. In SIGIR, pages 721–730, 2012.
[48] A. Marcus, M. Bernstein, and O. Badar. Tweets as data: demonstration of TweeQL and Twitinfo. In SIGMOD, pages 1259–1261, 2011.
[49] A. Marcus, M. Bernstein, and O. Badar. Processing and visualizing the data in tweets. SIGMOD Record, 40(4), 2012.

[50] M. S. Martín and C. Gutierrez. Representing, querying and transforming social networks with RDF/SPARQL. In European Semantic Web Conference, pages 293–307, 2009.
[51] P. T. W. Mauro San Martín, Claudio Gutierrez. SNQL: A social network query and transformation language. In 5th Alberto Mendelzon International Workshop on Foundations of Data Management, 2011.
[52] M. Mcglohon and C. Faloutsos. Statistical properties of social networks. In C. C. Aggarwal, editor, Social Network Data Analytics, chapter 2, pages 17–42. 2011.
[53] P. Mendes, A. Passant, and P. Kapanipathi. Twarql: tapping into the wisdom of the crowd. In Proceedings of the 6th International Conference on Semantic Systems, pages 3–5, 2010.
[54] F. Morstatter, S. Kumar, H. Liu, and R. Maciejewski. Understanding Twitter data with TweetXplorer. In SIGKDD, pages 1482–1485, 2013.
[55] P. Noordhuis, M. Heijkoop, and A. Lazovik. Mining Twitter in the cloud: A case study. In IEEE 3rd International Conference on Cloud Computing, pages 107–114, July 2010.
[56] I. Ounis, C. Macdonald, J. Lin, and I. Soboroff. Overview of the TREC-2011 Microblog Track. In 20th Text REtrieval Conference (TREC), 2011.
[57] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In International Conference on Language Resources and Evaluation, pages 1320–1326, 2010.
[58] Paul, M. J, and M. Dredze. In ICWSM, pages 265–272.
[59] V. Plachouras and Y. Stavrakas. Querying term associations and their temporal evolution in social data. In International VLDB Workshop on Online Social Systems, 2012.
[60] V. Plachouras, Y. Stavrakas, and A. Andreou. Assessing the coverage of data collection campaigns on Twitter: A case study. In On the Move to Meaningful Internet Systems: OTM 2013 Workshops, pages 598–607. 2013.
[61] D. Preotiuc-Pietro, S. Samangooei, and T. Cohn. Trendminer: An architecture for real time analysis of social media text. In Workshop on Real-Time Analysis and Mining of Social Streams, pages 4–7, 2012.
[62] L. Ratinov and D. Roth. Design challenges and misconceptions in named entity recognition. In Conference on Computational Natural Language Learning (CoNLL), number June, pages 147–155, 2009.
[63] A. Ritter, S. Clark, and O. Etzioni. Named entity recognition in tweets: an experimental study. In Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, 2011.
[64] R. Ronen and O. Shmueli. SoQL: A language for querying and creating data in social networks. In ICDE, pages 1595–1602, Mar. 2009.
[65] T. Sakaki. Earthquake shakes twitter users: Real-time event detection by social sensors. In WWW, pages 851–860, 2010.
[66] S. Salihoglu and J. Widom. GPS: A graph processing system. In International Conference on Scientific and Statistical Database Management, pages 1–31, 2013.
[67] A. Schulz, A. Hadjakos, and H. Paulheim. A multi-indicator approach for geolocalization of tweets. In ICWSM, pages 573–582, 2013.
[68] A. Signorini, A. M. Segre, and P. M. Polgreen. The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PloS one, 6(5), Jan. 2011.
[69] Y. Stavrakas and V. Plachouras. A platform for supporting data analytics on twitter challenges and objectives. Intl. Workshop on Knowledge Extraction & Consolidation from Social Media, (Ict 270239), 2013.
[70] Tumasjan, Andranik, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment. In ICWSM, pages 178–185, 2010.
[71] J. Weng, E.-p. Lim, and J. Jiang. TwitterRank: Finding topic-sensitive influential twitterers. In WSDM, pages 261–270, 2010.
[72] J. S. White, J. N. Matthews, and J. L. Stacy. Coalmine: an experience in building a system for social media analytics. In I. V. Ternovskiy and P. Chin, editors, Proceedings of SPIE, volume 8408, 2012.
[73] P. T. Wood. Query languages for graph databases. SIGMOD Record, 41(1):50–60, Apr. 2012.
[74] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on twitter. In WWW, pages 705–714, Mar. 2011.
[75] X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequent structure-based approach. In SIGMOD, pages 335–346, 2004.
[76] J. Yin, S. Karimi, B. Robinson, and M. Cameron. ESA: emergency situation awareness via microbloggers. In CIKM, pages 2701–2703, 2012.

What is Tumblr: A Statistical Overview and Comparison
Yi Chang, Lei Tang, Yoshiyuki Inagaki, Yan Liu

Yahoo Labs, Sunnyvale, CA 94089
@WalmartLabs, San Bruno, CA 94066

University of Southern California, Los Angeles, CA 90089
yichang@yahoo-inc.com, leitang@acm.org, inagakiy@yahoo-inc.com, yanliu.cs@usc.edu
ABSTRACT
Tumblr, as one of the most popular microblogging platforms, has gained momentum recently. It is reported to have had 166.4 million users and 73.4 billion posts by January 2014. While many articles about Tumblr have been published in the major press, there has not been much scholarly work so far. In this paper, we provide a pioneering analysis of Tumblr from a variety of aspects. We study the social network structure among Tumblr users, analyze its user generated content, and describe reblogging patterns to analyze its user behavior. We aim to provide a comprehensive statistical overview of Tumblr and compare it with other popular social services, including the blogosphere, Twitter and Facebook, in answering two key questions: What is Tumblr? How is Tumblr different from other social media networks? In short, we find that Tumblr has richer content than other microblogging platforms, and that it combines characteristics of social networking, the traditional blogosphere, and social media. This work serves as an early snapshot of Tumblr that later work can leverage.

1. INTRODUCTION
Tumblr, as one of the most prevalent microblogging sites, has become phenomenal in recent years, and it was acquired by Yahoo! in 2013. By mid-January 2014, Tumblr had 166.4 million users and 73.4 billion posts¹. It is reported to be the most popular social site among the young generation, as half of Tumblr's visitors are under 25 years old². Tumblr is ranked as the 16th most popular site in the United States; it is the 2nd most dominant blogging site, the 2nd largest microblogging service, and the 5th most prevalent social site³. In contrast to the momentum Tumblr has gained in the recent press, little academic research has been conducted on this burgeoning social service. Naturally, questions arise: What is Tumblr? What is the difference between Tumblr and other blogging or social media sites?

Traditional blogging sites, such as Blogspot⁴ and Live Journal⁵, have high quality content but little social interaction. Nardi et al. [17] investigated blogging as a form of personal communication and expression, and showed that the vast majority of blog posts are written by ordinary people with a small audience. On the contrary, popular social networking sites like Facebook⁶ have richer social interactions, but lower quality content compared with the blogosphere. Since most social interactions are either unpublished or less meaningful for the majority of the public audience, it is natural for Facebook users to form different communities or social circles. Microblogging services, positioned between traditional blogging and online social networking services, have intermediate quality content and intermediate social interactions. Twitter⁷, which is the largest microblogging site, limits each post to 140 characters, and the Twitter following relationship is not reciprocal: a Twitter user does not need to follow back if the user is followed by another. As a result, Twitter is considered a new form of social media [11], and short messages can be broadcast to a Twitter user's followers in real time.

Tumblr is also positioned as a microblogging platform. Tumblr users can follow another user without being followed back, which forms a non-reciprocal social network; a Tumblr post can be re-broadcast by a user to their own followers via reblogging. But unlike Twitter, Tumblr has no length limitation for each post, and Tumblr also supports multimedia posts, such as images, audio or video. With these differences in mind, are the social network, user generated content, or user behavior on Tumblr dramatically different from other social media sites?

In this paper, we provide a statistical overview of Tumblr from assorted aspects. We study the social network structure among Tumblr users and compare its network properties with those of other commonly used networks. Meanwhile, we study the content generated in Tumblr and examine the content generation patterns. One step further, we also analyze how a blog post is reblogged and propagated through the network, both topologically and temporally. Our study shows that Tumblr provides hybrid microblogging services: it has the dual characteristics of both social media and traditional blogging. Meanwhile, surprising patterns surface. We describe these intriguing findings and provide insights, which hopefully can be leveraged by other researchers to understand more about this new form of social media.

¹ http://www.tumblr.com/about
² http://www.webcitation.org/64UXrbl8H
³ http://www.alexa.com/topsites/countries/US
⁴ http://blogspot.com
⁵ http://livejournal.com
⁶ http://facebook.com
⁷ http://twitter.com

2. TUMBLR AT FIRST SIGHT
Tumblr is ranked the second largest microblogging service, right after Twitter, with over 166.4 million users and 73.4 billion posts by January 2014. Registering on Tumblr is easy: one can sign up for the Tumblr service with a valid email address within 30 seconds. Once signed in to Tumblr, a user can follow other users. Different from Facebook, the connections in Tumblr do not require mutual confirmation. Hence the social network in Tumblr is unidirectional.
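Because following on Tumblr does not require confirmation, the follow relation can be stored as a set of directed edges. The Python sketch below illustrates this under stated assumptions: the user names and the `follows` edge list are hypothetical, and the reciprocity measure shown (the share of directed edges whose reverse edge also exists) is one common definition that may differ from the one behind the statistics reported in this paper.

```python
from collections import defaultdict

# Hypothetical follow edges as (follower, followee); names are illustrative only.
follows = [
    ("alice", "bob"),
    ("bob", "alice"),    # alice and bob follow each other
    ("carol", "alice"),  # carol follows alice; alice does not follow back
    ("dave", "alice"),
]


def degree_counts(edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for follower, followee in edges:
        out_deg[follower] += 1  # the follower follows one more user
        in_deg[followee] += 1   # the followee gains one more follower
    return dict(in_deg), dict(out_deg)


def reciprocal_fraction(edges):
    """Fraction of distinct directed edges whose reverse edge also exists."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


in_deg, out_deg = degree_counts(follows)
print(in_deg)                        # followers per user
print(out_deg)                       # followees per user
print(reciprocal_fraction(follows))  # share of edges that are mutual
```

In this toy graph alice has three followers but follows only one user, and half of the edges are reciprocated, which is the kind of asymmetry a directed representation is meant to capture.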

Both Twitter and Tumblr are considered microblogging platforms. Compared with Twitter, Tumblr exhibits several differences:

• There is no length limitation for each post;
• Tumblr supports multimedia posts, such as images, audio and video;
• Similar to hashtags in Twitter, bloggers can also tag their blog posts, which is commonplace in traditional blogging. But tags in Tumblr are separate from blog content, while in Twitter the hashtag can appear anywhere within a tweet;
• Tumblr recently (Jan. 2014) allowed users to mention and link to specific users inside posts. This @user mechanism needs more time to be adopted by the community;
• Tumblr does not differentiate verified accounts.

Figure 2: Distribution of Posts (Better viewed in color). Photo: 78.11%; Text: 14.13%; Quote: 2.27%; Audio: 2.01%; Video: 1.35%; Chat: 0.85%; Answer: 0.82%; Link: 0.46%.
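The post-type shares listed for Figure 2 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the percentages are copied from the figure, while the dictionary and function names are hypothetical. It confirms that the eight shares sum to 100% and that photo and text posts together account for about 92% of all posts.

```python
# Post-type shares in percent, copied from the values listed for Figure 2.
post_share = {
    "photo": 78.11, "text": 14.13, "quote": 2.27, "audio": 2.01,
    "video": 1.35, "chat": 0.85, "answer": 0.82, "link": 0.46,
}


def combined_share(shares, types):
    """Sum the percentage shares of the given post types."""
    return sum(shares[t] for t in types)


total = sum(post_share.values())
photo_text = combined_share(post_share, ["photo", "text"])
ranked = sorted(post_share, key=post_share.get, reverse=True)

print(round(total, 2))       # all eight types together
print(round(photo_text, 2))  # photo and text combined
print(ranked[:2])            # the two dominant post types
```

Ranking the types by share makes it easy to see which ones are worth a dedicated content analysis and which contribute only a small tail.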
Figure 1: Post Types in Tumblr

Specifically, Tumblr defines 8 types of posts: photo, text, quote, audio, video, chat, link and answer. As shown in Figure 1, one has the flexibility to start a post in any type except answer. Text, photo, audio, video and link allow one to post, share and comment on any multimedia content. Quote and chat, which are not available in most other social networking platforms, let Tumblr users share quotes or chat history from iChat or MSN. Answer occurs only when one tries to interact with other users: when a user posts a question, that is, writes a post whose text box ends with a question mark, the user can enable the option for others to answer the question, which is disabled automatically after 7 days. A post can also be reblogged by another user to broadcast it to their own followers. The reblogged post quotes the original post by default and allows the reblogger to add comments.

Figure 2 shows the distribution of Tumblr post types, based on the 586.4 million posts we collected. As seen in the figure, even though all kinds of content are supported, photo and text dominate the distribution, accounting for more than 92% of the posts. Therefore, we will concentrate on these two types of posts for our content analysis later.

Since Tumblr has a strong presence of photos, it is natural to compare it to other photo or image based social networks like Flickr⁸ and Pinterest⁹. Flickr is mainly an image hosting website, and Flickr users can add contacts, and comment on or like others' photos. Yet, different from Tumblr, one cannot reblog another's photo in Flickr. Pinterest is designed for curators, allowing one to share photos or videos of her taste with the public. Pinterest links a pin to the commercial website where the product presented in the pin can be purchased, which accounts for a stronger e-commerce behavior. Therefore, the target audiences of Tumblr and Pinterest are quite different: the majority of users in Tumblr are under age 25, while Pinterest is heavily used by women aged 25 to 44 [16].

We directly sample a sub-graph snapshot of the social network from Tumblr in August 2013, which contains 62.8 million nodes and 3.1 billion edges. Though this graph is not up-to-date, we believe that many network properties should be well preserved given the scale of this graph. Meanwhile, we sample about 586.4 million Tumblr posts from August 10 to September 6, 2013. Unfortunately, Tumblr does not require users to fill in basic profile information, such as gender or location. Therefore, it is impossible for us to conduct user profile analysis as done in other works. In order to handle such a large volume of data, most statistical patterns are computed on a MapReduce cluster, and some of the algorithms are tricky. We skip the involved implementation details and concentrate solely on the derived patterns.

Most statistical patterns can be presented in three different forms: probability density function (PDF), cumulative distribution function (CDF) or complementary cumulative distribution function (CCDF), describing $\Pr(X = x)$, $\Pr(X \le x)$ and $\Pr(X \ge x)$ respectively, where $X$ is a random variable and $x$ is a certain value. Due to the space limit, it is impossible to include all of them. Hence, we decide which form(s) to include depending on presentation and on convenience of comparison with other relevant papers. That is, if CCDF is reported in a relevant paper, we try to also report CCDF here so that rigorous comparison is possible.

Next, we study properties of Tumblr through different lenses, in particular, as a social network, a content generation website, and an information propagation platform, respectively.

⁸ http://flickr.com
⁹ http://pinterest.com

3. TUMBLR AS SOCIAL NETWORK
We begin our analysis of Tumblr by examining its social network topology structure. Numerous social networks have been analyzed in the past, such as the traditional blogosphere [21], Twitter [10; 11], Facebook [22], and an instant messenger communication network [13]. Here we run an array of standard network analyses to compare with other networks, with results summarized in Table 1¹⁰.

¹⁰ Even though we wish to include results over other popular social media networks like Pinterest, Sina Weibo and Instagram, analyses of those websites are either not available or are only small-scale case studies that are difficult to generalize to a comprehensive scale for a fair comparison. Actually, in the Table, we observe quite a discrepancy between the numbers reported over a small Twitter data set and another comprehensive snapshot.

Degree Distribution. Since Tumblr does not require mutual confirmation when one follows another user, we represent the follower-followee network in Tumblr as a directed graph: the in-degree of a user represents how many followers the user has attracted, while the out-degree indicates how many other users the user has been following. Our sampled sub-graph contains 62.8 million nodes and 3.1 billion
Table 1: Comparison of Tumblr with other popular social networks. The numbers of Blogosphere, Twitter-small, Twitter-huge, Facebook,
and MSN are obtained from [21; 10; 11; 22; 13], respectively. In the table, implies the corresponding statistic is not available or not
applicable; GCC denotes the giant connected component; the symbols in parenthesis m, d, e, r respectively represent mean, median, the
90% effective diameter, and diameter (the maximum shortest path in the network).
Metric Tumblr Blogosphere Twitter-small Twitter-huge Facebook MSN
#nodes 62.8M 143,736 87,897 41.7M 721M 180M
#links 3.1B 707,761 829,467 1.47B 68.7B 1.3B
in-degree distr k2.19 k2.38 k2.4 k2.276
degree distr in r-graph = power-law = power-law k0.8 e0.03k
direction directed directed directed directed undirected undirected
reciprocity 29.03% 3% 58% 22.1%
degree correlation 0.106 >0 0.226
avg distance 4.7(m), 5(d) 9.3(m) 4.1(m), 4(d) 4.7(m), 5(d) 6.6(m), 6(d)
diameter 5.4(e), 29(r) 12(r) 6(r) 4.8(e), 18(r) < 5(e) 7.8(e), 29(r)
GCC coverage 99.61% 75.08% 93.03% 99.91% 99.90%
edges. Within this social graph, 41.40% of nodes have 0 in-degree, point due to the Tumblr limit of 5000 followees for ordinary users.
and the maximum in-degree of a node is 4.06 million. By con- The reciprocity relationship on Tumblr does not follow the power
trast, 12.74% of nodes have 0 out-degree, the maximum out-degree law distribution, since the curve mostly is convex, similar to the
of a node is 155.5k. Top popular Tumblr users include equipo11 , pattern reported over Facebook[22].
instagram12 , and woodendreams13 . This indicates the media char- Meanwhile, it has been observed that ones degree is correlated
acteristic of Tumblr: the most popular user has more than 4 million with the degree of his friends. This is also called degree correlation
audience, while more than 40% of users are purely audience since or degree assortativity [18; 19]. Over the derived r-graph, we obtain
they dont have any followers. a correlation of 0.106 between terminal nodes of reciprocate con-
Figure 3(a) demonstrates the distribution of in-degrees in the blue nections, reconrming the positive degree assortativity as reported
curve and that of out-degrees in the red curve, where y-axis refers in Twitter [11]. Nevertheless, compared with the strong social net-
to the cumulated density distribution function (CCDF): the proba- work Facebook, Tumblrs degree assortativity is weaker (0.106 vs.
bility that accounts have at least k in-degrees or out-degrees, i.e., 0.226).
P (K >= k). It is observed that Tumblr users in-degree follows Degree of Separation. Small world phenomenon is almost uni-
a power-law distribution with exponent 2.19, which is quite sim- versal among social networks. With this huge Tumblr network,
ilar from the power law exponent of Twitter at 2.28 [11] or that we are able to validate the well-known six degrees of separation
of traditional blogs at 2.38 [21]. This also conrms with earlier as well. Figure 4 displays the distribution of the shortest paths in
empirical observation that most social network have a power-law the network. To approximate the distribution, we randomly sample
exponent between 2 and 3 [6]. 60,000 nodes as seed and calculate for each node the shortest paths
In regard to out-degree distribution, we notice the red curve has a to other nodes. It is observed that the distribution of paths length
big drop when out-degree is around 5000, since there was a limit reaches its mode with the highest probability at 4 hops, and has a
that ordinary Tumblr users can follow at most 5000 other users. median of 5 hops. On average, the distance between two connected
Tumblr users out-degree does not follow a power-law distribution, nodes is 4.7. Even though the longest shortest path in the approxi-
which is similar to blogosphere of traditional blogging [21]. mation has 29 hops, 90% of shortest paths are within 5.4 hops. All
If we explore users in-degree and out-degree together, we could these numbers are close to those reported on Facebook and Twitter,
generate normalized 3-D histogram in Figure 3(b). As both in- yet signicantly smaller than that obtained over blogosphere and
degree and out-degree follow the heavy-tail distribution, we only instant messenger network [13].
zoom in those user who have less than 210 in-degrees and out- Component Size. The previous result shows that those users who
degrees. Apparently, there is a positive correlation between inare connected have a small average distance. It relies on the as-
degree and out-degree because of the dominance of diagonal bars. sumption that most users are connected to each other, which we
In aggregation, a user with low in-degree tends to have low out- shall conrm immediately. Because the Tumblr graph is directed,
degree as well, even though some nodes, especially those top pop- we compute out all weakly-connected components by ignoring the
ular ones, have very imbalanced in-degree and out-degree. direction of edges. It turns out the giant connected component
Reciprocity. Since Tumblr is a directed network, we would like to (GCC) encompasses 99.61% of nodes in the graph. Over the de-
examine the reciprocity of the graph. We derive the backbone of the rived r-graph, 97.55% are residing in the corresponding GCC. This
Tumblr network by keeping those reciprocal connections only, i.e., nding suggests the whole graph is almost just one connected com-
user a follows b and vice versa. Let r-graph denote the correspond- ponent, and almost all users can reach others through just few hops.
ing reciprocal graph. We found 29.03% of Tumblr user pairs have To give a palpable understanding, we summarize commonly used
reciprocity relationship, which is higher than 22.1% of reciprocity network statistics in Table 1. Those numbers from other popular
on Twitter [11] and 3% of reciprocity on Blogosphere [21], indicat- social networks (blogosphere, Twitter, Facebook, and MSN) are
ing a stronger interaction between users in the network. Figure 3(c) also included for comparison. From this compact view, it is obvi-
shows the distribution of degrees in the r-graph. There is a turning ous traditional blogs yield a signicantly different network struc-
ture. Tumblr, even though originally proposed for blogging, yields
11
http://equipo.tumblr.com a network structure that is more similar to Twitter and Facebook.
12
http://instagram.tumblr.com
13
http://woodendreams.tumblr.com
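The sampling procedure described under Degree of Separation can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes an in-memory adjacency dict and assumes paths follow edge direction, which the text does not specify. The function names and the five-node toy ring are hypothetical. The effective diameter here is a simplified integer quantile, whereas the reported value of 5.4 suggests an interpolated estimate.

```python
import math
import random
from collections import deque
from statistics import mean, median


def bfs_distances(adj, source):
    """Hop counts from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def sampled_path_lengths(adj, num_seeds, seed=0):
    """Shortest-path lengths from randomly sampled seed nodes to all nodes they reach."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    seeds = rng.sample(nodes, min(num_seeds, len(nodes)))
    lengths = []
    for s in seeds:
        # Skip the zero-length path from a seed to itself.
        lengths.extend(d for d in bfs_distances(adj, s).values() if d > 0)
    return lengths


def effective_diameter(lengths, q=0.9):
    """Smallest hop count h such that at least a fraction q of paths have length <= h."""
    ordered = sorted(lengths)
    return ordered[math.ceil(q * len(ordered)) - 1]


# Toy follower graph (an assumption for illustration): a directed ring of five users.
ring = {i: [(i + 1) % 5] for i in range(5)}
lengths = sampled_path_lengths(ring, num_seeds=5)
print(mean(lengths), median(lengths), effective_diameter(lengths), max(lengths))
```

On a graph of the size studied here, a full BFS from each of 60,000 seeds would need a distributed or out-of-core implementation; the sketch only shows how mean, median, effective diameter and diameter are derived from the sampled path lengths.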

Figure 3: Degree Distribution of Tumblr Network. (a) in/out degree distribution (CCDF vs. InDegree or OutDegree); (b) in/out degree correlation (Percentage of Users over InDegree = 2^X and OutDegree = 2^Y); (c) degree distribution in r-graph (CCDF vs. InDegree, same as OutDegree).
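The quantities behind Figure 3 and the reciprocity figure can be illustrated on a toy edge list. This is a hedged sketch: the edge list and function names are hypothetical, and `reciprocity` implements one possible reading of "user pairs have a reciprocal relationship", namely the share of connected unordered pairs that are linked in both directions. The paper does not spell out its exact definition.

```python
from collections import Counter


def in_degrees(edges):
    """Map each node to its number of followers; edges are (follower, followee)."""
    nodes = {n for edge in edges for n in edge}
    counts = Counter(followee for _, followee in edges)
    return {n: counts.get(n, 0) for n in nodes}


def ccdf(values):
    """Return {k: P(K >= k)} for every distinct value k in values."""
    n = len(values)
    freq = Counter(values)
    result = {}
    remaining = n  # number of observations that are >= the current k
    for k in sorted(freq):
        result[k] = remaining / n
        remaining -= freq[k]
    return result


def reciprocity(edges):
    """Share of connected user pairs that follow each other (assumed definition)."""
    edge_set = {(u, v) for u, v in edges if u != v}
    pairs = {frozenset(e) for e in edge_set}
    mutual = {frozenset((u, v)) for u, v in edge_set if (v, u) in edge_set}
    return len(mutual) / len(pairs) if pairs else 0.0


# Hypothetical follower edges: (follower, followee).
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a"), ("c", "a"), ("d", "c")]
deg = in_degrees(edges)
print(ccdf(list(deg.values())))
print(reciprocity(edges))
```

Plotting the CCDF on log-log axes and checking for a straight line is the usual visual test for a power law; a rigorous fit would follow the procedure in [6].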
Figure 4: Shortest Path Distribution (PDF and CDF vs. Shortest Path Length)

4. TUMBLR AS BLOGOSPHERE FOR CONTENT GENERATION
As Tumblr was initially proposed for the purpose of blogging, here we analyze its user generated contents. As described earlier, photo and text posts account for more than 92% of total posts. Hence, we concentrate only on these two types of posts. One text post may contain a URL, a quote or a raw message. In this study, we are mainly interested in the authentic contents generated by users. Hence, we extract raw messages as the content information of each text post, by removing quotes and URLs. Similarly, photo posts contain 3 categories of information: photo URL, quoted photo caption, and raw photo caption. While the photo URL might contain lots of additional meta information, it would require tremendous effort to analyze all images in Tumblr. Hence, we focus on raw photo captions as the content of each photo post. We end up with two datasets of content: one is text post, and the other is photo caption.
What's the effect of no length limit for post? Both Tumblr and Twitter are considered microblogging platforms, yet there is one key difference: Tumblr has no length limit while Twitter enforces a strict limit of 140 bytes for each tweet. How does this key difference affect user post behavior?
It has been reported that the average length of posts on Twitter is 67.9 bytes and the median is 60 bytes^14. Corresponding statistics of Tumblr are shown in Table 2. For the text post dataset, the average length is 426.7 bytes and the median is 87 bytes, both of which, as expected, are longer than those of Twitter. Keep in mind that Tumblr's numbers are obtained after removing all quotes, photos and URLs, which further discounts the discrepancy between Tumblr and Twitter. The big gap between mean and median is due to a small percentage of extremely long posts. For instance, the longest text post is 446K bytes in our sampled dataset. As for photo captions, naturally we expect them to be much shorter than text posts. The average length is around 64.3 bytes, but the median is only 29 bytes. Although photo posts are dominant in Tumblr, the numbers of text posts and photo captions in Table 2 are comparable, because the majority of photo posts don't contain any raw photo captions.

^14 http://www.quora.com/Twitter-1/What-is-the-average-length-of-a-tweet

                     Text Post Dataset   Photo Caption Dataset
# Posts              21.5 M              26.3 M
Mean Post Length     426.7 Bytes         64.3 Bytes
Median Post Length   87 Bytes            29 Bytes
Max Post Length      446.0 K Bytes       485.5 K Bytes
Table 2: Statistics of User Generated Contents

A further related question: is the 140-byte limit sensible? We plot the post length distribution of the text post dataset, and zoom into lengths below 280 bytes in Figure 5. About 24.48% of posts are beyond 140 bytes, which indicates that at least around one quarter of posts would have to be rewritten in a more compact version if the limit were enforced in Tumblr.

Figure 5: Post Length Distribution (CCDF vs. Post Length in Bytes)

Blending all numbers above together, we can see at least two types of posts: one is more like posting a reference (URL or photo) with added information or short comments, the other is authentic user generated content like in traditional blogging. In other words, Tumblr is a mix of both types of posts, and its no-length-limit policy encourages its users to post longer high-quality content directly.
What are people talking about? Because there is no length limit on Tumblr, the blog post tends to be more meaningful, which allows us to run topic analysis over the two datasets to have an overview of the content. We run LDA [4] with 100 topics on both datasets, and showcase several topics and their corresponding keywords in Tables 3 and 4, which also clearly show the high quality of textual content on Tumblr. Medical, Pets, Pop Music, and Sports are shared interests across the 2 datasets, although representative topical keywords might be different even for the same topic. Finance and Internet attract enough attention only from text posts, while only photo posts show a significant amount of interest in the Photography and Scenery topics. We want to emphasize that most of these keywords are semantically meaningful and representative of the topics.

Topic        Topical Keywords
Pop Music    music song listen iframe band album lyrics video guitar
Sports       game play team win video cookie ball football top sims fun beat league
Internet     internet computer laptop google search online site facebook drop website app mobile iphone
Pets         big dog cat animal pet animals bear tiny small deal puppy
Medical      anxiety pain hospital mental panic cancer depression brain stress medical
Finance      money pay store loan online interest buying bank apply card credit
Table 3: Topical Keywords from Text Post Dataset

Topic        Topical Keywords
Pets         cat dog cute upload kitty batch puppy pet animal kitten adorable
Scenery      summer beach sun sky sunset sea nature ocean island clouds lake pool beautiful
Pop Music    music song rock band album listen lyrics punk guitar dj pop sound hip
Photography  photo instagram pic picture check daily shoot tbt photography
Sports       team world ball win football club round false soccer league baseball
Medical      body pain skin brain depression hospital teeth drugs problems sick cancer blood
Table 4: Topical Keywords from Photo Caption Dataset

Who are the major contributors of contents? There are two potential hypotheses. 1) One supposes that socially popular users post more. This is derived from the fact that popular users are followed by many users, so blogging is one way to attract more audience as followers. Meanwhile, it might be true that blogging is an incentive for celebrities to interact with or reward their followers. 2) The other assumes that long-term users (in terms of registration time) post more, since they are accustomed to this service, and they are more likely to have their own focused communities or social circles. These peer interactions encourage them to generate more authentic content to share with others.
Do socially popular users or long-term users generate more contents? In order to answer this question, we choose a fixed time window of two weeks in August 2013 and examine how frequently each user blogs on Tumblr. We sort all users based on their in-degree (or duration time since registration) and then partition them into 10 equi-width bins. For each bin, we calculate the average blogging frequency. For easy comparison, we consider the maximal value of all bins as 1, and normalize the relative ratio for the other bins. The results are displayed in Figure 6, where the x-axis from left to right indicates increasing in-degree (or decreasing duration time). For brevity, we just show the result for the text post dataset, as similar patterns were observed over photo captions.

Figure 6: Correlation of Post Frequency with User In-degree or Duration Time since Registration (mean and median of normalized post frequency; panels: InDegree from Low to High along x-axis, Registration Time from Early to Late along x-axis)

The patterns are strong in both figures. Those users who have higher in-degree tend to post more, in terms of both mean and median. One caveat is that what we observe and report here is merely correlation, and it does not establish causality. Here we draw the conservative conclusion that social popularity is highly positively correlated with user blog frequency. A similar positive correlation is also observed in Twitter [11].
In contrast, the pattern in terms of user registration time was beyond our imagination until we drew the figure. Surprisingly, those users who either registered earliest or registered latest tend to post less frequently. Those who are in between are inclined to post more frequently. Obviously, our initial hypothesis about the incentive for new users to blog more is invalid. There could be different explanations in hindsight. Rather than guessing the underlying explanation, we decide to leave this phenomenon as an open question to future researchers.
For reference, we also look at the average post length of users, because it has been adopted as a simple metric to approximate the quality of blog posts [1]. The corresponding correlations are plotted in Figure 7. In terms of post length, the tail users in social networks are the winners. Meanwhile, long-term or recently-joined users tend to post longer blogs. Apparently, this pattern is exactly opposite to post frequency. That is, the more frequently one blogs, the shorter the blog post is, and less frequent bloggers tend to have longer posts. That is reasonable considering that each individual has limited time and resources. We even changed the post length to the maximum for each individual user rather than the average, but the pattern remains.

Figure 7: Correlation of Post Length with User In-degree or Duration Time since Registration (mean and median of normalized post length; panels: InDegree from Low to High along x-axis, Registration Time from Early to Late along x-axis)

In summary, without the post length limitation, Tumblr users are inclined to write longer blogs, leading to higher-quality user generated content, which can be leveraged for topic analysis. The social celebrities (those with a large number of followers) are the main contributors of contents, which is similar to Twitter [24]. Surprisingly, long-term users and recently-registered users tend to blog less frequently. The post length in general has a negative correlation with post frequency. The more frequently one posts, the shorter those posts tend to be.

5. TUMBLR FOR INFORMATION PROPAGATION
Tumblr offers one feature which is missing in traditional blog services: reblog. Once a user posts a blog, other users in Tumblr can reblog to comment or broadcast to their own followers. This enables information to be propagated through the network. In this section, we examine the reblogging patterns in Tumblr. We examine all blog posts uploaded within the first 2 weeks, and count reblog events in the subsequent 2 weeks right after the blog is posted, so that there would be no bias because of the time window selection in our blog data.
Who are reblogging? Firstly, we would like to understand which users tend to reblog more. Those people who reblog frequently serve as the information transmitters. Similar to the previous section, we examine the correlation of reblogging behavior with users' in-degree. As shown in Figure 8, social celebrities, who are the major source of contents, reblog a lot more compared with other users. This reblogging is propagated further through their huge number of followers. Hence, they serve as both content contributors and information transmitters. On the other hand, users who registered earlier reblog more as well. The socially popular and long-term users are the backbone of the Tumblr network, making it a vibrant community for information propagation and sharing.

Figure 8: Correlation of Reblog Frequency with User In-degree or Duration Time since Registration (mean and median of normalized reblog frequency; panels: InDegree from Low to High along x-axis, Registration Time from Early to Late along x-axis)

Reblog size distribution. Once a blog is posted, it can be reblogged by others. Those reblogs can be reblogged even further, which leads to a tree structure called a reblog cascade, with the first author being the root node. The reblog cascade size indicates the number of reblog actions that have been involved in the cascade. Figure 9 plots the distribution of reblog cascade sizes. Not surprisingly, it follows a power-law distribution, with the majority of reblog cascades involving few reblog events. Yet, within a time window of two weeks, the maximum cascade could reach 116.6K. In order to have a detailed understanding of reblog cascades, we zoom into the short head and plot the CCDF up to a reblog cascade size of 20 in Figure 9. It is observed that only about 19.32% of reblog cascades have size greater than 10. By contrast, only 1% of retweet cascades have size larger than 10 [11]. The reblog cascades in Tumblr tend to be larger than retweet cascades in Twitter.

Figure 9: Distribution of Reblog Cascade Size (PDF and CCDF vs. Reblog Cascade Size)

Reblog depth distribution. As shown in previous sections, almost any pair of users is connected through a few hops. How many hops does one blog take to propagate to another user in reality? To answer this, we look at the reblog cascade depth, the maximum number of nodes to pass in order to reach one leaf node from the root node in the reblog cascade structure. Note that reblog depth and size are different. A cascade of depth 2 can involve hundreds of nodes if every other node in the cascade reblogs the same root node.
Figure 10 plots the distribution of the number of hops: again, the reblog cascade depth distribution follows a power law according to the PDF; when zooming into the CCDF, we observe that only 9.21% of reblog cascades have depth larger than 6. That is, the majority of cascades reach just a few hops, which is consistent with the findings reported for Twitter [3]. Actually, 53.31% of cascades in Tumblr have depth 2. Nevertheless, the maximum depth among all cascades can reach 241 based on two weeks of data. This looks unlikely at first glimpse, considering that any two users are just a few hops away. Indeed, this is because users can add comments while reblogging, and thus one user is likely to be involved in one reblog cascade multiple times. We notice that some Tumblr users adopt reblog as one way of conversation or chat.
Reblog Structure Distribution. Since most reblog cascades span few hops, here we show the cascade tree structure distribution up to size 5 in Figure 11. The structures are sorted based on their coverage. Apparently, a substantial percentage of cascades (36.05%) are of size 2, i.e., a post being reblogged merely once. Generally speaking, a reblog cascade of a flat structure tends to have a higher probability than a reblog cascade of the same size but with a deep structure. For instance, a reblog cascade of size 3 has two variants, of which the flat one covers 9.42% of cascades while the deep one drops to 5.85%. The same pattern applies to reblog cascades of size 4 and 5. In other words, it is easier to spread a message widely rather than deeply in general. This implies that it might be acceptable to consider only the cascade effect within a few hops and to focus on those nodes with a larger audience when one tries to maximize influence or information propagation.
Temporal pattern of reblog. We have investigated the information propagation spatially in terms of network topology; now we study how fast one blog gets reblogged. Figure 12 displays the distribution of the time gap between a post and its first reblog. There is a strong bias toward recency. The larger the time gap since a blog is posted, the less likely it would be reblogged. 75.03% of first reblogs arrive within the first hour after a blog is posted, and 95.84% of first reblogs appear within one day. Comparatively, it has been reported that half of retweeting occurs within an hour and 75% within a day [11] on Twitter. In short, Tumblr reblog has a strong bias toward recency, and information propagation on Tumblr is fast.

6. RELATED WORK
There is a rich literature on both existing and emerging online social network services. Statistical patterns across different types of social networks are reported, including the traditional blogosphere [21], user-generated content platforms like Flickr, Youtube and LiveJournal [15], Twitter [10; 11], an instant messenger network [13], Facebook [22], and Pinterest [7; 20]. The majority of them observe shared patterns such as a long tail distribution for user degrees (power law or power law with exponential cut-off), small (90% quantile effective) diameter, positive degree association, and a homophily effect in terms of user profiles (age or location), but not with respect to gender. Indeed, people are more likely to talk to the opposite sex [13]. The recent study of Pinterest observed that ladies tend to be more active and engaged than men [20], and women and men have different interests [5]. We have compared Tumblr's patterns with other social networks in Table 1 and observed that most of those trends hold in Tumblr except for some differences in the numbers.
Lampe et al. [12] did a set of survey studies on Facebook users, and showed that people use Facebook to maintain existing offline connections. Java et al. [10] presented one of the earliest research papers on Twitter, and found that users leverage Twitter to talk about their daily activities and to seek or share information. In addition, Gilbert et al. [7] is one of the early studies on Pinterest, and shows from a statistical point of view that female users repin more but have fewer followers than male users. Hochman and Schwartz [8] published an early paper using Instagram data, and indicated differences in local color usage and cultural production rate, for the analysis of location-based visual information flows.
Existing studies on user influence are based on social networks or

content analysis. McGlohon et al. [14] found that topology features can help distinguish blogs, and that the temporal activity of blogs is very non-uniform and bursty, but self-similar. Bakshy et al. [3] investigated the attributes and relative influence of users based on the Twitter follower graph, and concluded that word-of-mouth diffusion can only be harnessed reliably by targeting large numbers of potential influencers, thereby capturing average effects. Hopcroft et al. [9] studied Twitter user influence based on two-way reciprocal relationship prediction. Weng et al. [23] extended the PageRank algorithm to measure the influence of Twitter users, and took both the topical similarity between users and the link structure into account. Kwak et al. [11] study the topological and geographical properties of the entire Twittersphere and observe some notable properties of Twitter, such as a non-power-law follower distribution, a short effective diameter, and low reciprocity, marking a deviation from known characteristics of human social networks.
However, due to data access limitations, the majority of the existing scholarly papers are based on either Twitter data or traditional blogging data. This work closes the gap by providing the first overview of Tumblr so that others can leverage it as a stepping stone to investigate this evolving social service further or to compare it with other related services.

Figure 10: Distribution of Reblog Cascade Depth (PDF and CCDF vs. Reblog Cascade Depth)

Figure 11: Cascade Structure Distribution up to Size 5. The percentage at the top is the coverage of cascade structure. Coverage values in the order shown: 36.05%, 9.42%, 5.85%, 3.58%, 1.69%, 1.44%, 2.78%, 1.20%, 1.15%, 0.58%, 0.51%, 0.42%, 0.33%, 0.31%, 0.24%, 0.21%.

Figure 12: Distribution of Time Lag between a Blog and its first Reblog (CDF vs. Lag Time of First Reblog, from 1m to 1w)

7. CONCLUSIONS AND FUTURE WORK
In this paper, we provide a statistical overview of Tumblr in terms of social network structure, content generation and information propagation. We show that Tumblr serves as a social network, a blogosphere and social media simultaneously. It provides high quality content with rich multimedia information, which offers unique characteristics to attract youngsters. Meanwhile, we also summarize and offer as rigorous a comparison as possible with other social services based on numbers reported in other papers. Below we highlight some key findings:
• With multimedia support in Tumblr, photos and text account for the majority of blog posts, while audios and videos are still rare.
• Tumblr, though initially proposed for blogging, yields a significantly different network structure from the traditional blogosphere. Tumblr's network is much denser and better connected. Close to 29.03% of connections on Tumblr are reciprocal, while the blogosphere has only 3%. The average distance between two users in Tumblr is 4.7, which is roughly half of that in the blogosphere. The giant connected component covers 99.61% of nodes as compared to 75% in the blogosphere.
• The Tumblr network is highly similar to Twitter and Facebook, with a power-law in-degree distribution, a non-power-law out-degree distribution, positive degree assortativity for reciprocal connections, small distance between connected nodes, and a dominant giant connected component.
• Without post length limitation, Tumblr users tend to post longer. Approximately 1/4 of text posts have authentic contents beyond 140 bytes, implying a substantial portion of high quality blog posts for other tasks like topic

Those social celebrities tend to be more active. They post [8] N. Hochman and R. Schwartz. Visualizing instagram: Tracing
analysis and text mining. and reblog more frequently, serv- cultural visual rhythms. In Proceedings of the Workshop on
ing as both content generators and information transmitters. Social Media Visualization (SocMedVis) in conjunction with
Moreover, frequent bloggers like to write short, while infre- ICWSM, 2012.
quent bloggers spend more effort in writing longer posts.
[9] J. E. Hopcroft, T. Lou, and J. Tang. Who will follow you
In terms of duration since registration, those long-term users back?: reciprocal relationship prediction. In Proceedings of
and recently registered users post less frequently. Yet, long- ACM International Conference on Information and Knowl-
term users reblog more. edge Management (CIKM), pages 11371146, 2011.
Majority of reblog cascades are tiny in terms of both size [10] A. Java, X. Song, T. Finin, and B. Tseng. Why we twit-
and depth, though extreme ones are not uncommon. It is rel- ter: understanding microblogging usage and communities. In
atively easier to propagate a message wide but shallow rather WebKDD/SNA-KDD 07, pages 5665, New York, NY, USA,
than deep, suggesting the priority for inuence maximization 2007. ACM.
or information propagation. [11] H. Kwak, C. Lee, H. Park, and S. B. Moon. What is twitter, a
social network or a news media. In Proceedings of 19th Inter-
Compared with Twitter, Tumblr is more vibrant and faster in
national World Wide Web Conference (WWW), 2010.
terms of reblog and interactions. Tumblr reblog has a strong bias toward recency. Approximately 3/4 of the first reblogs occur within the first hour and 95.84% appear within one day.

This snapshot study is by no means complete. There are several directions to extend this work. First, some patterns described here are correlations. They do not illustrate the underlying mechanism. It is imperative to differentiate correlation and causality [2] so that we can better understand the user behavior. Secondly, it is observed that Tumblr is very popular among young users, with half of Tumblr's visitor base being under 25 years old. Why is it so? We need to combine content analysis and social network analysis together with user profiles to find out. In addition, since more than 70% of Tumblr posts are images, it is necessary to go beyond photo captions and analyze image content together with other meta information.

8. REFERENCES
[1] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Identifying the influential bloggers in a community. In Proceedings of WSDM, 2008.
[2] A. Anagnostopoulos, R. Kumar, and M. Mahdian. Influence and correlation in social networks. In Proceedings of KDD, 2008.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on twitter. In Proceedings of WSDM, 2011.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[5] S. Chang, V. Kumar, E. Gilbert, and L. Terveen. Specialization, homophily, and gender in a social curation site: Findings from pinterest. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW '14, 2014.
[6] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. arXiv, 706, 2007.
[7] E. Gilbert, S. Bakhshi, S. Chang, and L. Terveen. "I need to try this!": A statistical overview of pinterest. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2013.
[12] C. Lampe, N. Ellison, and C. Steinfield. A familiar face(book): Profile elements as signals in an online social network. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2007.
[13] J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. In WWW '08: Proceeding of the 17th International Conference on World Wide Web, pages 915-924, New York, NY, USA, 2008. ACM.
[14] M. McGlohon, J. Leskovec, C. Faloutsos, M. Hurst, and N. S. Glance. Finding patterns in blog shapes and blog evolution. In Proceedings of ICWSM, 2007.
[15] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In IMC '07: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pages 29-42, New York, NY, USA, 2007. ACM.
[16] S. Mittal, N. Gupta, P. Dewan, and P. Kumaraguru. The pin-bang theory: Discovering the pinterest world. arXiv preprint arXiv:1307.4952, 2013.
[17] B. Nardi, D. J. Schiano, S. Gumbrecht, and L. Swartz. Why we blog. Commun. ACM, 47(12):41-46, 2004.
[18] M. E. J. Newman. Assortative mixing in networks. Physical Review Letters, 89(20):208701, 2002.
[19] M. E. J. Newman. Mixing patterns in networks. Physical Review E, 67(2):026126, 2003.
[20] R. Ottoni, J. P. Pesce, D. Las Casas, G. Franciscani, P. Kumaraguru, and V. Almeida. Ladies first: Analyzing gender roles and behaviors in pinterest. In Proceedings of ICWSM, 2013.
[21] X. Shi, B. Tseng, and L. A. Adamic. Looking at the blogosphere topology through different lenses. In Proceedings of ICWSM, 2007.
[22] J. Ugander, B. Karrer, L. Backstrom, and C. Marlow. The anatomy of the facebook social graph. arXiv preprint arXiv:1111.4503, 2011.
[23] J. Weng, E.-P. Lim, J. Jiang, and Q. He. Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of WSDM, pages 1137-1146, 2010.
[24] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on twitter. In Proceedings of WWW, 2011.

Change Detection in Streaming Data in the Era of Big Data: Models and Issues

Dang-Hoan Tran
Vietnam Maritime University
484 Lach Tray, Ngo Quyen, Haiphong, Vietnam
dang-hoan.tran@vimaru.edu.vn

Mohamed Medhat Gaber
School of Computing Science and Digital Media, Robert Gordon University
Riverside East, Garthdee Road, Aberdeen AB10 7GJ, UK
m.gaber1@rgu.ac.uk

Kai-Uwe Sattler
Ilmenau University of Technology
PO Box 100565, Ilmenau, Germany
kus@tu-ilmenau.de
ABSTRACT
Big Data is identified by its three Vs, namely velocity, volume, and variety. The area of data stream processing has long dealt with the former two Vs, velocity and volume. Over a decade of intensive research, the community has provided many important research discoveries in the area. The third V of Big Data has been the result of social media and the large unstructured data it generates. Streaming techniques have also been proposed recently to address this emerging need. However, a hidden factor can represent an important fourth V, that is, variability or change. Our world is changing rapidly, and accounting for variability is a crucial success factor. This paper provides a survey of change detection techniques as applied to streaming data. The review is timely with the rise of Big Data technologies and the need to have this important aspect highlighted and its techniques categorized and detailed.

1. INTRODUCTION
Today's world is changing very fast. Changes occur in every aspect of life. Therefore, the ability to detect, adapt to, and react to change plays an important role in all aspects of life. The physical world is often represented in some model or some information system. Changes in the physical world are reflected as changes in the data or in the model built from the data. Therefore, the nature of data is changing.

The advance of technology has resulted in a data deluge. The data volume is increasing at an estimated rate of 50% per year [39]. The data flood makes traditional methods, including traditional distributed frameworks and parallel models, inappropriate for processing, analyzing, storing, and understanding these massive data sets. The data deluge needs a new generation of computing tools that Jim Gray calls the 4th paradigm in scientific computing [25]. Recently, some computing paradigms have emerged that meet the requirements of Big Data, as follows. The parallel batch processing model only deals with stationary massive data [17]. However, evolving data arrives continuously at high speed. In fact, online data stream processing is the main approach to dealing with the three characteristics of Big Data: big volume, big velocity, and big variety. Streaming data processing is a model of Big Data processing. Streaming data is temporal in nature. In addition to the temporal nature, streaming data may include spatial characteristics. For example, geographic information systems can produce spatial-temporal data streams. Streaming data processing and mining have been deployed in real-world systems such as InfoSphere Streams (IBM)¹, RapidMiner Streams Plugin², StreamBase³, MOA⁴, and AnduIN⁵. In order to deal with high-speed data streams, a hybrid model that combines the advantages of both the parallel batch processing model and the streaming data processing model has been proposed. Some projects for such a hybrid model include S4⁶, Storm⁷, and Grok⁸.

One of the challenges facing data stream processing and mining is the changing nature of streaming data. Therefore, the ability to identify trends, patterns, and changes in the underlying processes generating the data contributes to the success of processing and mining massive high-speed data streams.

A model of continuous distributed monitoring has recently been proposed to deal with streaming data coming from multiple sources. This model has many observers, where each observer monitors a single data stream. The goal of continuous distributed monitoring is to perform tasks that need to aggregate the incoming data from the observers. Continuous distributed monitoring is applied to monitor networks such as sensor networks, social networks, and ISP networks [11].

Change detection is the process of identifying differences in the state of an object or phenomenon by observing it at different times or different locations in space. In the streaming context, change detection is the process of segmenting a data stream into different segments by identifying the points where the stream dynamics change [53]. A change detection method consists of the following tasks: change detection and localization of change. Change detection identifies whether a change occurs and responds to the presence of such a change. Besides change detection, localization of changes determines the location of the change. The problem of locating the change has been studied in statistics as the problem of change point detection.

This paper presents the background issues and notation relevant to the problem of change detection in data streams.

1 http://www-01.ibm.com/software/data/infosphere/streams/
2 http://www-ai.cs.uni-dortmund.de/auto?self=$eit184kc
3 http://www.streambase.com/
4 http://moa.cs.waikato.ac.nz/
5 http://www.tu-ilmenau.de/dbis/research/anduin/
6 http://incubator.apache.org/s4/
7 https://github.com/nathanmarz/storm/wiki/Tutorial
8 https://www.numenta.com/grok_info.html

2. CHANGE DETECTION IN STREAMING DATA
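The two tasks named in the introduction, deciding whether a change occurred and localizing it, can be illustrated with a minimal Python sketch. It is not a method from the surveyed literature: the functions `detect_and_localize` and `segment`, the mean-difference criterion, the `threshold` and `min_size` parameters, and the sample readings are all assumptions chosen for illustration.

```python
from statistics import mean


def detect_and_localize(xs, threshold, min_size=2):
    """Return the index of the most likely change point, or None.

    Every admissible split position k is scored by the absolute difference
    between the mean before k and the mean from k onward (an assumed,
    deliberately simple criterion). A change is reported only if the best
    score exceeds `threshold`.
    """
    best_idx, best_gap = None, 0.0
    for k in range(min_size, len(xs) - min_size + 1):
        gap = abs(mean(xs[:k]) - mean(xs[k:]))
        if gap > best_gap:
            best_idx, best_gap = k, gap
    return best_idx if best_gap > threshold else None


def segment(xs, change_points):
    """Split xs into consecutive segments at the given change-point indices."""
    bounds = [0, *change_points, len(xs)]
    return [xs[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical readings whose level shifts after the fourth observation.
readings = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.9, 5.1]
idx = detect_and_localize(readings, threshold=2.0)   # detection + localization
segments = segment(readings, [idx] if idx is not None else [])
print(idx, segments)
```

Here detection answers whether any split exceeds the threshold, while localization is the index of the best split; the segments are then read off from that index.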
The streaming computational model is considered one of the most widely used models for processing and analyzing massive data. Streaming data processing supports decision making in real time. A data stream is defined as follows.

DEFINITION 1. A data stream is an infinite sequence of elements

$$S = \{(X_1, T_1), \ldots, (X_j, T_j), \ldots\} \qquad (1)$$

Each element is a pair $(X_j, T_j)$, where $X_j$ is a d-dimensional vector $X_j = (x_1, x_2, \ldots, x_d)$ arriving at time stamp $T_j$. The time stamp is defined over a discrete domain with a total order. There are two types of time stamps: an explicit time stamp is generated when the data arrive; an implicit time stamp is assigned by the data stream processing system.

Streaming data has the following fundamental characteristics. First, data arrives continuously. Second, streaming data evolves over time. Third, streaming data is noisy and may be corrupted. Fourth, intervening in a timely manner is important. Given these characteristics of streaming data and the data stream model, data stream processing and mining pose the following challenges. First, as streaming data arrives rapidly, the techniques for streaming data processing and analysis must keep up with the data rate to prevent the loss of important information as well as to avoid data redundancy. Second, as the speed of streaming data is very high, the data volume exceeds the processing capacity of existing systems. Third, the value of data decreases over time, and the recent streaming data is sufficient for many applications. Therefore, one can only capture and process the data as soon as it is generated.

2.1 Change Detection: Definitions and Notation
This section presents concepts and a classification of changes and change detection methods. To develop a change detection method, we should understand what a change is.

DEFINITION 2. Change is defined as the difference in the state of an object or phenomenon over time and/or space [52; 1].

From a system point of view, change is the process of transition from one state of a system to another. In other words, a change can be defined as the difference between an earlier state and a later state. An important distinction between change and difference is that a change refers to a transition in the state of an object or a phenomenon over time, while a difference means the dissimilarity in the characteristics of two objects. A change can reflect a short-term trend or a long-term trend. For example, a stock analyst may be interested in the short-term change of a stock price.

Change detection is defined as the process of identifying differences in the state of an object or phenomenon by observing it at different times [54]. In this definition, a change is detected on the basis of differences of an object at different times, without considering the differences of an object across locations in space. In many real-world applications, changes can occur in terms of both time and space. For example, multiple spatial-temporal data streams representing triples (latitude, longitude, time) are created in traffic information systems using GPS [23]. Hence, change detection can be defined as follows.

DEFINITION 3. Change detection is the process of identifying differences in the state of an object or phenomenon by observing it at different times and/or different locations in space.

A distinction between concept drift detection and change detection is that concept drift detection focuses on labeled data, while change detection can deal with both labeled and unlabeled data. Change analysis both detects and explains the change. Hido et al. [26] proposed a method for change analysis using supervised learning.

DEFINITION 4. Change point detection is identifying time points at which properties of time series data change [32].

Depending on the specific application, change detection goes by different names, such as burst detection, outlier detection, or anomaly detection. Burst detection is a special kind of change detection. A burst is a period on the stream whose aggregated sum exceeds a threshold [31]. Outlier detection is a special kind of change detection. Anomaly detection can be seen as a special type of change detection in streaming data.

To find a solution to the problem of change detection, we should consider the aspects of change of the system in which we want to detect changes. As shown in [52], the aspects of change that must be considered include the subject of change, type of change, cause of change, effect of change, response to change, temporal issues, and spatial issues. In particular, to design an algorithm for detecting changes in sensor streaming data, the major questions we need to answer include: What is the system in which the changes need to be detected? What are the principles used to model the problem? What is the data type? What are the constraints of the problem? What is the physical subject of change? What is the meaning of the change to the user? How should one respond and react to this change? How should this change be visualized?

A change detection method can fall into one of two types: batch change detection and sequential change detection. Given a sequence of N observations $x_1, \ldots, x_N$, where N is fixed, the task of a batch change detection method is to decide whether a change occurs at some point in the sequence by using all N available observations. When the arrival speed of data is too high, batch change detection is suitable. In other words, a change detection method using the two adjacent windows model will be used. However, the drawback of batch change detection is that its running time is very large when detecting changes in a large amount of data. In contrast, sequential change detection is based on the observations received so far. If no change is detected, the next observation is processed. Whenever a change is detected, the change detector is reset.

Change detection methods can be classified into the following approaches: threshold-based change detection; state-based change detection; trend-based change detection.

A change detection algorithm should meet three main requirements [37]: accuracy, promptness, and online operation. The algorithm should detect as many actual change points as possible and generate as few false alarms as possible. The algorithm should detect a change point as early as possible. The algorithm should be sufficiently efficient for a real-time environment.

Change detection in data streams allows us to identify time-evolving trends and time-evolving patterns. Research issues on mining changes in data streams include modeling and representation of changes, change-adaptive mining methods, and interactive exploration of changes [19]. Change detection plays an important role in the field of data stream analysis. Since a change in the model may convey interesting time-dependent information and knowledge, the change of the data stream can be used for understanding the nature of several applications. Basically, interesting research problems on mining changes in data streams can be classified into three categories: modeling and representation of changes, mining methods, and interactive exploration of changes. A change detection algorithm can be used as a sub-procedure in many other data stream mining algorithms in order to deal with the changing data in data streams [28; 4]. A definition of change detection for streaming data is given as follows.

DEFINITION 5. Change detection is the process of segmenting a data stream into different segments by identifying the points where the stream dynamics change [53].
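The sequential setting described above, in which observations are processed one at a time and the detector is reset after each detected change, can be sketched as follows. The sketch assumes one-dimensional elements (X_j, T_j), compares the means of two adjacent windows against a fixed threshold, and uses hypothetical names (`SequentialWindowDetector`, `window_size`, `threshold`); it is an illustration under these assumptions, not a specific algorithm from the cited works.

```python
from collections import deque


class SequentialWindowDetector:
    """Threshold-based detector over two adjacent sliding windows (sketch)."""

    def __init__(self, window_size, threshold):
        self.window_size = window_size
        self.threshold = threshold
        self.reset()

    def reset(self):
        # Older observations live in `reference`, the newest in `current`.
        self.reference = deque(maxlen=self.window_size)
        self.current = deque(maxlen=self.window_size)

    def update(self, x):
        """Process one observation; return True if a change is signalled."""
        if len(self.reference) < self.window_size:
            self.reference.append(x)          # still filling the reference window
            return False
        if len(self.current) == self.window_size:
            # Slide: the oldest current value becomes the newest reference
            # value (the bounded deque drops its own oldest value).
            self.reference.append(self.current.popleft())
        self.current.append(x)
        if len(self.current) < self.window_size:
            return False
        gap = abs(sum(self.current) - sum(self.reference)) / self.window_size
        if gap > self.threshold:
            self.reset()                      # start afresh after a detection
            return True
        return False


# Hypothetical stream of (X_j, T_j) pairs: the level shifts at T = 6.
stream = [(1.0, t) for t in range(6)] + [(5.0, t) for t in range(6, 12)]
detector = SequentialWindowDetector(window_size=3, threshold=2.0)
alarms = [t for x, t in stream if detector.update(x)]
print(alarms)  # [7]: the shift at T = 6 is reported one step later
```

The one-step delay between the shift and the alarm illustrates the tension between promptness and accuracy: a smaller window or threshold would react sooner but would also raise more false alarms on noisy data.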

Figure 1: A general diagram for detecting changes in a data stream (an incoming data stream feeds data stream processing/mining, which produces a model; changes are detected both in the data generating process and in the generated model).

In this view, a data stream is divided into different segments by identifying the points where the stream dynamics change [53].

As data streams evolve over time, there is growing emphasis on detecting changes not only in the underlying data distribution but also in the models generated by data stream processing and data stream mining. As can be seen in Figure 1, a change can occur in the data stream or in the streaming model. Therefore, there are two types of change detection problems: change detection in the data generating process and change detection in the model generated by data stream processing or mining. The fundamental issues of detecting changes in data streams include characterizing and quantifying changes and detecting changes. A change detection method for streaming data needs a trade-off among space efficiency, detection performance, and time efficiency.

2.2 Change Detection Methods in Streaming Data
Over the last 50 years, change detection has been widely studied and applied in both academic research and industry. For example, it has been studied for a long time in the following fields: statistics, signal processing, and control theory. In recent years, many change detection methods have been proposed for streaming data. The approaches to detecting changes in data streams can be classified as follows.

Data stream model: A data stream can fall into one of the

whose individual example is associated with a given class label; otherwise, it is an unlabeled data stream. A change detection algorithm that identifies changes in a labeled data stream is a supervised change detection algorithm [34; 5], while one detecting changes in an unlabeled data stream is called an unsupervised change detection algorithm [7]. The advantage of the supervised approach is that the detection accuracy is high. However, the ground truth data must be generated. Thus an unsupervised change detection approach is preferred to the supervised one when the ground truth data is unavailable.

Completeness of statistical information: On the basis of the completeness of statistical information, a change detection algorithm can fall into one of the three following categories. Parametric change detection schemes are based on knowing the full prior information before and after the change. For example, in distributional change detection methods, the data distributions before and after the change are known [41; 42]. A recently introduced method for detecting changes in stock order streams is a parametric method in which the stream of stock orders is assumed to conform to the Poisson distribution [37]. The advantage of parametric change detection approaches is that they can produce more accurate results than semi-parametric and nonparametric methods. However, in many real-time applications, data may not conform to any standard distribution, and thus parametric approaches are inapplicable. Semi-parametric methods are based on the assumption that the distribution of observations belongs to some class of distribution functions, and that the parameters of the distribution function change at the disorder moments. Recently, Kuncheva [36] has proposed a semi-parametric method using a semi-parametric log-likelihood for testing a change. Nonparametric methods make no distribution assumptions on the data. Nonparametric methods for detecting changes in the underlying data distribution include the Wilcoxon test, kernel methods, the Kullback-Leibler distance, and the Kolmogorov-Smirnov test. Nonparametric methods can be classified into two categories: nonparametric methods using a window [33]; nonparametric methods without using a window [27]. We have paid
gories:particular attentionmethods
nonparametric to the nonparametric
using window change [33]; non- de-
following models: time series model, cash register model,
tection
parametricmethods
methodsusingwithout
windowusing because windowin many [27].real-world
We have
and
Dataturnstile model [41].
stream model: A data Onstream
the basis of the
can fall intodata
onestream
of the applications, theattention
distributions
model, there are change paid particular to theofnonparametric
both null hypothesis change and de-
following models: time detection algorithms
series model, developed
cash register model,for
alternative hypothesis are unknown ininadvance. Further-
the corresponding data stream model. Krishnamurthy et al tection methods using window because many real-world
and turnstile model [41]. On the basis of the data stream more, we arethe only interested of in both
recent data. A common
presented a sketch-based change detection applications, distributions null hypothesis and
model, there are change detection algorithmsmethod for the
developed for approach
most general streaming model model.
Turnstile model [35]. et al alternativetohypothesis
identifyingaretheunknown change is in to comparing
advance. two
Further-
the corresponding data stream Krishnamurthy samples
more, weinare orderonlyto interested
find out theindifference
recent data. between
A common them,
presented a sketch-based change detection method for the which is called two-sample change detection, or window-
Data characteristics: Change detection methods can be approach to identifying the change is to comparing two
most general streaming model Turnstile model [35]. based change detection.
classified on the basis of the data characteristics of stream- samples in order to find As out data stream is infinite,
the difference betweenathem, slid-
ing
Datadata such as data dimensionality,
characteristics: Change detection datamethods
label, andcandatabe ing
whichwindow is often
is called used to detect
two-sample changechanges.
detection, Window based
or window-
type. A data
classified on item coming
the basis from
of the datathecharacteristics
data stream can be uni-
of streamchange detection
based change incurs the
detection. Ashighdatadelay
stream [37].is Window-based
infinite, a slid-
variate
ing dataorsuch
multi-dimensional. It would data
as data dimensionality, be great if we
label, andcould
data change
ing windowdetection scheme
is often used is to based
detecton the dissimilarity
changes. Window based mea-
develop a general
type. A data algorithm
item coming able
from thetodata
detect changes
stream can in
be both
uni- sure
changebetween
detectiontwoincurs
distributions
the highordelay synopses extracted from
[37]. Window-based
univariate and multidimensional
variate or multi-dimensional. data streams.
It would be great Change de-
if we could the reference
change detectionwindow
scheme andisthe current
based on thewindow.
dissimilarity mea-
tection
developalgorithms in streaming
a general algorithm ablemultivariate data have
to detect changes in been
both sure between two distributions or synopses extracted from
Velocity of data change: Aggarwal proposes a framework
presented
univariate [14; 34; 36]. Data streams
and multidimensional datacan be classified
streams. Changeintode- the reference window and the current window.
that can deal with the changes in both spatial velocity pro-
categorial data stream
tection algorithms and numerical
in streaming data stream.
multivariate Webeen
data have can
Velocity
file of data change:
and temporal velocityAggarwal
profile [1;proposes
2]. In this approach,
a framework
presentedthe
develop change
[14; detection
34; 36]. algorithm
Data streams canforbecategorial
classified data
into thatchanges
the can dealin datathedensity
with changes occurring at eachvelocity
in both spatial locationpro- are
categorial
stream data stream
or numerical and
data numerical
stream. In realdata stream.
world We can
applications, file and temporal velocityvelocity
profile [1; 2]. In in thissome
approach,
estimated by estimating density user-
develop
each datatheitem
change detection
in data streamalgorithm for categorial
may include multipledataat- the changes in data densityAn occurring at advantage
each location are
defined temporal window. important of this
stream orofnumerical
tributes both numericaldata stream. In real world
and categorial data. applications,
In such situ- estimated
approach isbythat estimating
it visualizes velocity density This
the changes. in some user-
visualiza-
each data
ations, theseitem
datainstreams
data stream may include
can be projected multiple
by each at-
attribute defined temporal helpswindow.
tion of changes userAn important the
understand advantage
changesofintu-this
tributes
or groupofofboth numerical
attributes. and categorial
Change detectiondata. In such
methods cansitu-
be approach
itively. is that it visualizes the changes. This visualiza-
ations, these
applied to thedata streams can be
corresponding projected
projected by streams
data each attribute
after- tion of changes helps user understand the changes intu-
or group
wards. Dataof streams
attributes. areChange
classifieddetection methods
into labeled data can
streambe Speed
itively. of response: If a change detection method needs
applied to the corresponding
and unlabeled data streams. Aprojected
labeled data
data streams
stream isafter-
one to react to the detected changes as fast as possible, the
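
The window-based, two-sample scheme described above (a reference window, a sliding current window, and a dissimilarity measure between them) can be illustrated with a minimal sketch. This is not the method of any cited work: the function names `ks_statistic` and `detect_changes`, the window size, and the fixed threshold are assumptions chosen for illustration, and the Kolmogorov-Smirnov statistic is used here simply as one of the nonparametric dissimilarity measures listed in the text.

```python
from collections import deque


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def detect_changes(stream, window_size=100, threshold=0.25):
    """Yield the indices at which the current window departs from the reference window.

    window_size and threshold are illustrative values (assumptions), not tuned settings.
    """
    reference = []                        # fixed reference window
    current = deque(maxlen=window_size)   # sliding current window
    for t, x in enumerate(stream):
        if len(reference) < window_size:
            reference.append(x)           # fill the reference window first
            continue
        current.append(x)
        if len(current) == window_size and ks_statistic(reference, current) > threshold:
            yield t
            # Restart: the window that raised the alarm becomes the new reference.
            reference = list(current)
            current.clear()


if __name__ == "__main__":
    # Hypothetical stream with an abrupt shift at index 300.
    stream = [0.0] * 300 + [1.0] * 300
    print(list(detect_changes(stream)))
```

In this sketch the alarm is raised only after enough post-change items have entered the current window, which illustrates the detection delay of window-based schemes mentioned above. A practical detector would typically replace the fixed threshold with a significance level derived from the test's distribution.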

Speed of response: If a change detection method needs to react to detected changes as fast as possible, a quickest change detection method should be used. Quickest change detection can help a system raise a timely alarm. Timely alarms bring economic benefits and, in some cases, may save human lives, as in fire-fighting systems. Change detection methods using two overlapping windows can react quickly to changes in streaming data, while methods using the adjacent windows model may incur a high delay. As a change can be abrupt or gradual, there are abrupt change detection algorithms and gradual change detection algorithms [46; 40].

Decision making methodology: Based on the decision making methodology, a change detection method can fall into one of the following categories: rank-based methods [33], density-based methods [55], and information-theoretic methods [15]. A change detection problem can also be classified as batch change detection or sequential change detection. Based on the detection delay that a change detector suffers from, a change detection method can fall into one of two types: real-time change detection and retrospective change detection. Based on the spatial or temporal characteristics of the data, a change detection algorithm can fall into one of three kinds: spatial change detection, temporal change detection, or spatio-temporal change detection [6].

Application: On the basis of the applications that generate them, data streams can be classified into transactional data streams, sensor data streams, network data streams, stock order data streams, astronomy data streams, video data streams, etc. There are change detection methods for specific applications, such as change detection methods for sensor streaming data [56] and change detection methods for transactional streaming data [45; 57; 8]. For example, van Leeuwen and Siebes [57] have presented a change detection method for transactional streaming data based on the principle of Minimum Description Length.

Stream processing methodology: Based on the methodology for processing the data stream, a data stream can be classified as an online data stream or an off-line data stream [38]. In some work, an online data stream is called a live stream, while an off-line data stream is called an archived data stream [18]. An online data stream needs to be processed online because of its high speed. Such online data streams include streams of stock tickers, streams of network measurements, and streams of sensor data. An off-line stream is a sequence of updates to warehouses or backup devices. Queries over off-line streams can be processed off-line. However, as there is insufficient time to process off-line streams, techniques for summarizing data are necessary. In an off-line change detection method, the entire data set is available to the analysis process that detects the change. An online method detects the change incrementally, based on the most recently arrived data items. An important distinction between the two is that the online method is constrained in detection time and reaction time by the requirements of real-time applications, while the off-line method is free from these constraints. Methods for detecting changes can be useful for streaming data warehouses, where both live streams of data and archived data streams are available [24; 29]. In this work, we focus on developing methods for detecting changes in online data streams, in particular sensor data streams.

The first work on model-based change detection, proposed in [21; 22], is FOCUS. The central idea behind FOCUS is that the models can be divided into structural and measurement components. To detect deviation between two models, they compare specific parts of the corresponding models. The models obtained by data mining algorithms include frequent item sets, decision trees, and clusters. A change in the model may convey interesting information or knowledge about an event or phenomenon. Model change is defined in terms of the difference between the two sets of parameters of two models and the quantitative characteristics of the two models. As such, model change detection means finding the difference between the two sets of parameters of two models and the quantitative characteristics of these two models. We should distinguish between detecting changes in the data distribution by using models and detecting changes in a model built from streaming data. While model change detection aims to identify the difference between two models, change detection in the underlying data distribution by using models infers the changes between two data sets from the difference between the two models constructed from them. Changes in the underlying data distribution can induce corresponding changes in the model produced from the data generating process.

As models can be generated by statistical methods or data mining methods, change detection in models can be classified by whether it concerns data mining models or statistical models. The two kinds of models in which we are interested in detecting changes are predictive models and explanatory models. A predictive model is used to predict changes in the future. Detecting changes in the pattern can be beneficial for many applications. In an explanatory model, a change that occurred is both detected and explained. There are several approaches to change detection: the one-model approach, the two-model approach, and the multiple-model approach.

A model-based change detection algorithm consists of two phases: model construction and change detection. First, a model is built by using some stream mining method such as decision trees, clustering, or frequent patterns. Second, a difference measure between two models is computed based on the characteristics of the model; this step is also called the quantification of model difference. Therefore, one fundamental issue here is to quantify the changes between two models and to determine criteria for deciding whether and when a change in the model occurs. Recently, some change detection methods for streaming data based on clustering have been proposed [10; 3]. Based on the data stream mining model, we may have the corresponding problems of detecting changes in the model, as follows. Ikonomovska et al. [30] have presented an algorithm for learning regression trees from streaming data in the presence of concept drifts. Their change detection method is based on sequential statistical tests that monitor the changes of the local error at each node of the tree and inform the learning process of the local changes.

Detecting changes in stream cluster models has received increasing attention. Zhou et al. [59] have presented a method for tracking the evolution of clusters over sliding windows by using temporal cluster features and the exponential histogram, which is called the exponential histogram of cluster features. Chen and Liu [9] have presented a framework for detecting changes in clustering structures constructed from categorical data streams by using hierarchical entropy trees to capture the entropy characteristics of clusters, and then detecting changes in clustering structures based on these entropy characteristics.

Based on the data stream mining model, we may have the corresponding problems of detecting changes in the model, as follows [14]. Recently, Ng and Dash [44] have introduced an algorithm for mining frequent patterns from evolving data streams. Their algorithm is capable of updating the frequent patterns based on algorithms for detecting changes in the underlying data distributions. Two windows are used for change detection: the reference window and the current window. At the initial stage, the
reference is initialized with the first batch of transactions from the data stream. The current window moves along the data stream and captures the next batch of transactions. Two frequent item sets are constructed from the two corresponding windows by using the Apriori algorithm. A statistical test is performed on the two absolute support values computed by Apriori from the reference window and the current window. Based on the statistical test, the deviation is either significant or insignificant. If the deviation is significant, then a change in the data stream is reported. Chang and Lee [8] have presented a method for monitoring recent changes of frequent item sets in a data stream by using a sliding window.

2.3 Design Methodology

There are two design methodologies for developing change detection algorithms for streaming data. The first methodology is to adapt existing change detection methods to streaming data. However, many traditional change detection methods, such as some kernel-based and density-based change detection methods, cannot be extended to streaming data because of their high computational complexity. The second methodology is to develop new change detection methods for streaming data.

There are two common approaches to the problem of change detection in streaming data distributions: distance-based change detectors and predictive model-based change detectors. In the former, two windows are used to extract two data segments from the data stream. The change is quantified by using some dissimilarity measure. If the dissimilarity measure is greater than a given threshold, then a change is detected. In the latter, as with distance-based change detectors, two windows are used for detecting changes. Instead of comparing the dissimilarity measure between the two windows with a given threshold, a change is detected by using the prediction error of the model built from the current window and the predictive model constructed from the reference window.

3. DISTRIBUTED CHANGE DETECTION IN STREAMING DATA

Knowledge discovery from massive amounts of streaming data can be achieved only when we can develop change detection frameworks that monitor streaming data created by multiple sources such as sensor networks and the WWW [13]. The objectives of designing a distributed change detection scheme are maximizing the lifetime of the network, maximizing the detection capability, and minimizing the communication cost [58].

There are two approaches to the problem of change detection in streaming data that is created from multiple sources. In the centralized approach, all remote sites send raw data to the coordinator. The coordinator aggregates all the raw streaming data received from the remote sites. Detection of changes is performed on the aggregated streaming data. In most cases, communication consumes the largest amount of energy. The lifetime of sensors is therefore drastically reduced when they communicate raw measurements to a centralized server for analysis. Centralized approaches suffer from the following problems: communi-

[...]

erance. The scalability refers to the ability to extend the size of the network without significantly reducing the performance of the framework. As faults may occur due to transmission errors and the effects of noisy channels between the local sensors and the fusion center, a distributed change detection method should be able to tolerate these faults in order to assure the functioning of the system.

Distributed change detection using the local approach is directly relevant to the problem of multiple hypothesis testing and data fusion, because each local change detector needs to perform a hypothesis test to determine whether a change occurs. Therefore, besides the detection performance of the local change detection algorithms, including the probability of detection and the probability of false alarm at the node level, the detection performance of the distributed change detection method at the fusion center must be taken into account.

Distributed detection and data fusion have been widely studied for many decades. However, only recently has distributed detection in streaming data received attention.

3.1 Distributed Detection: One-shot versus Continuous

Distributed detection of changes can be classified into two types of models as follows.

One-shot distributed detection of changes: Figure 2 shows two models of one-shot distributed change detection. In a one-shot change detection method, a change detector detects a change and reacts to it once the change is detected. One-shot distributed change detection has received a great deal of attention for a long time. One-shot distributed change detection includes two models: distributed detection with decision fusion, as shown in Figure 2(a), and distributed detection without decision fusion, as illustrated in Figure 2(b). What are the differences between one-shot detection and continuous detection?

Continuous distributed detection of changes: In this chapter, we propose two continuous distributed detection models, as shown in Figure 3. An important distinction between continuous and one-shot distributed detection of changes is that the inputs to one-shot distributed change detection are batches of data, while the inputs to continuous distributed detection of changes are data streams in which data items arrive continuously.

The distributed detection model without fusion is a truly distributed detection model in which the decision-making process occurs at each sensor.

3.2 Locality in Distributed Computing

As one of the properties of distributed computational systems is locality [43], a distributed algorithm for detecting changes in
raw measurements to a centralized server for analysis. Central- As one of the properties of distributed computational systems
cation constraint, power consumption, robustness, and privacy. streaming data should meet the locality. A local algorithm is de-
ized approaches suffer from the following problems: communi- is locality [43], a distributed algorithm for detecting changes in
Distributed detection of changes in streaming data addresses the fined as one whose resource consumption is independent of the
cation constraint, power consumption, robustness, and privacy. streaming data should meet the locality. A local algorithm is de-
challenges that come from the problem of change detection, data system size. The scalability of distributed stream mining algo-
Distributed detectionand
of changes in streaming data addresses the fined
rithmsascan
onebewhose resource
achieved consumption
by using is independent
the local change detectionof the
algo-
stream processing, the problem of distributed computing. system
challenges
The challengesthat come
comingfrom thethe
from problem of change
distributed detection,
computing data
environ- rithms size. The scalability of distributed stream mining algo-
stream rithms can be achieved by using the local change detection algo-
ment areprocessing,
as follows and the problem of distributed computing. Local algorithms can fall into one of two categories [16]:Exact
rithms
The challenges coming from the distributed computing environ- local algorithms are defined as ones that produce the same results
ment are as follows
Distributed change detection in streaming data is a problem Local algorithmsalgorithm;
as a centralized can fall into one of twolocal
Approximate categories [16]:Exact
algorithms are al-
of distributed computing in nature. Therefore, a distributed local algorithms are defined as ones that produce the same
gorithms that produce approximations of the results that central-results
Distributed
framework for change detection
detecting in streaming
changes data isthe
should meet a problem
proper- as a centralized
ized algorithms algorithm; Approximate
would produce. local algorithms
Two attractive properties are al-
of lo-
of distributed computing in nature. Therefore, a distributed
ties of distributed computing such scalability, and fault tol- gorithms that produce approximations of the results that central-
cal algorithms are scalability and fault tolerance. A distributed
framework for detecting changes should meet the proper- ized algorithms would produce. Two attractive properties of lo-
ties of distributed computing such scalability, and fault tol- cal algorithms are scalability and fault tolerance. A distributed
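As a loose illustration of the locality requirement, the following minimal sketch keeps a constant amount of state on each node, so that per-node memory and per-item work do not depend on how many nodes the system contains. The class name `LocalChangeDetector`, the parameters `warmup` and `k`, and the simple deviation test are assumptions made for this example; they stand in for whichever local change detection algorithm is actually used and are not taken from the cited works.

```python
import math


class LocalChangeDetector:
    """Constant-memory detector run independently on one node's stream."""

    def __init__(self, warmup=30, k=3.0):
        self.warmup = warmup  # number of items used to build the reference model
        self.k = k            # alarm threshold, in reference standard deviations
        self.n = 0            # items absorbed into the reference model so far
        self.mean = 0.0       # running mean of the reference items
        self.m2 = 0.0         # running sum of squared deviations (Welford)

    def update(self, x):
        """Consume one item; return True if it deviates from the reference."""
        if self.n < self.warmup:
            # Still learning the reference model: update mean and variance.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) > self.k * std
```

Each node would run its own instance on its own stream. The state is three numbers regardless of stream length or network size, which is the property the definition of a local algorithm asks for; the sketch assumes `warmup` is at least 2.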

Figure 2: One-shot distributed change detection models. (a) Distributed detection without decision fusion. (b) Distributed detection with decision fusion.

Figure 3: Continuous distributed change detection models. (a) Distributed continuous detection without decision fusion. (b) Distributed continuous detection with decision fusion.
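The two continuous models in Figure 3 can be sketched as follows, under stated assumptions. Each node compares the mean of its current window with the mean of its reference window and emits a local decision u_i; without fusion, the local decisions are the final output, while with fusion a center combines them. The mean-shift test, the k-out-of-n fusion rule, and the names `local_decision`, `fuse`, and `detect` are illustrative choices, not the models proposed by the authors.

```python
from statistics import mean


def local_decision(reference, current, threshold):
    """Return u_i = 1 if the node's current window mean has shifted, else 0."""
    return int(abs(mean(current) - mean(reference)) > threshold)


def fuse(decisions, k):
    """k-out-of-n fusion rule: global alarm if at least k nodes report 1."""
    return int(sum(decisions) >= k)


def detect(streams, window, threshold, k=None):
    """Yield (start index, local decisions, global decision) per window.

    streams: list of lists, one per node.
    k=None models detection without decision fusion (global decision is None);
    an integer k models detection with decision fusion.
    """
    # The first window of each stream serves as that node's reference window.
    references = [s[:window] for s in streams]
    length = min(len(s) for s in streams)
    for start in range(window, length - window + 1, window):
        u = [local_decision(ref, s[start:start + window], threshold)
             for ref, s in zip(references, streams)]
        yield start, u, (None if k is None else fuse(u, k))
```

For example, with `streams = [[0, 0, 0, 0, 5, 5, 5, 5], [0] * 8]`, `window=2`, and `threshold=1`, only the first node reports a change from index 4 onward; the fused decision is 1 with `k=1` and 0 with `k=2`. In the fused variant only the decisions u_i need to reach the center, not the raw windows.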

A distributed framework for mining streaming data should be robust to network partitions and node failures. The advantage of local approaches is the ability to preserve privacy [20]. A drawback of the local approach to the problem of distributed change detection is the synchronization problem. The local change detection approach can, for example, meet the principle of localized algorithms in wireless sensor networks, in which data processing is performed at node level as much as possible in order to reduce the amount of information to be sent in the network.

3.3 Distributed Detection of Changes in Streaming Data

Over the last decades, the problem of decentralized detection has received much attention. There are two directions of research on decentralized detection. The first approach focuses on aggregating measurements from multiple sensors to test a single hypothesis. The second focuses on dealing with multiple dependent testing/estimation tasks from multiple sensors [51]. Distributed change detection usually involves a set of sensors that receive observations from the environment and then transmit those observations back to a fusion center in order to reach the final consensus of detection. Decentralized detection and data fusion are therefore two closely related tasks that arise in the context of sensor networks [48; 47]. Two traditional approaches to decentralized change detection are data fusion and decision fusion. In data fusion, each node detects change and sends a quantized version of its observation to a fusion center responsible for making the decision on the detected changes and for further relaying information. In contrast, in decision fusion, each node performs local change detection by using some local change detection algorithm, updates its decision based on the received information, and broadcasts its new decision again. This process repeats until consensus among the nodes is reached. Compared to data fusion, decision fusion can reduce the communication cost because sensors need only transmit the local decisions, which are represented by small data structures.

Although there is a great deal of work on distributed detection and data fusion, most of this work focuses on one-time change detection solutions. A one-time query is defined as a query that needs to process data once in order to provide the answer [12]. Likewise, a one-time change detection method is a change detection method that needs to process data once in response to the change that occurred. In real-world applications, we need approaches capable of continuously monitoring the changes of the events occurring in the environment. Recently, work on continuous detection and monitoring of changes has started receiving attention, such as [49; 13; 50]. Das et al. [13] have presented a scalable distributed framework for detecting changes in astronomy data streams using local, asynchronous eigen monitoring algorithms. Palpanas et al. [49] proposed a distributed framework for outlier detection in real-time data streams. In their framework, each sensor estimates and maintains a model for its underlying distribution by using kernel density estimators. However, they did not show how to reach the global detection decision.

4. CONCLUDING REMARKS

We argued in this paper that variability, or simply change, is crucial in a world full of affecting factors that alter the behavior of the data, and consequently the underlying model. The ability to detect such changes in centralized as well as distributed systems plays an important role in identifying the validity of data models. The paper presented the state of the art in this area of paramount importance. Techniques, in some cases, are tightly coupled with application domains. However, most of the techniques reviewed in this paper are generic and could be adapted to different domains of applications.

With Big Data technologies reaching a mature stage, future work in change detection is expected to exploit such scalable data processing tools to efficiently detect, localize, and classify occurring changes. For example, distributed change detection models can make use of the MapReduce framework to accelerate their respective processes.

5. REFERENCES

[1] C. Aggarwal. A framework for diagnosing changes in evolving data streams. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 575-586. ACM New York, NY, USA, 2003.
[2] C. Aggarwal. On change diagnosis in evolving data streams. IEEE Transactions on Knowledge and Data Engineering, pages 587-600, 2005.
[3] C. Aggarwal. A segment-based framework for modeling and mining data streams. Knowledge and information systems, 30(1):1-29, 2012.
[4] C. Aggarwal and P. Yu. A survey of synopsis construction in data streams. Data streams: models and algorithms, page 169, 2007.
[5] A. Bondu and M. Boullé. A supervised approach for change detection in data streams. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 519-526. IEEE, 2011.
[6] S. Boriah, V. Kumar, M. Steinbach, C. Potter, and S. Klooster. Land cover change detection: a case study. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 857-865. ACM, 2008.
[7] G. Cabanes and Y. Bennani. Change detection in data streams through unsupervised learning. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 1-6. IEEE, 2012.
[8] J. Chang and W. Lee. estwin: adaptively monitoring the recent change of frequent itemsets over online data streams. In Proceedings of the twelfth international conference on Information and knowledge management, pages 536-539. ACM, 2003.
[9] K. Chen and L. Liu. HE-Tree: a framework for detecting changes in clustering structure for categorical data streams. The VLDB Journal, pages 1-20.
[10] T. Chen, C. Yuan, A. Sheikh, and C. Neubauer. Segment-based change detection method in multivariate data stream, Apr. 9 2009. WO Patent WO/2009/045,312.
[11] G. Cormode. The continuous distributed monitoring model. SIGMOD Record, 42(1):5, 2013.
[12] G. Cormode and M. Garofalakis. Efficient strategies for continuous distributed tracking tasks. IEEE Data Engineering Bulletin, 28(1):33-39, 2005.
[13] K. Das, K. Bhaduri, S. Arora, W. Griffin, K. Borne, C. Giannella, and H. Kargupta. Scalable Distributed Change Detection from Astronomy Data Streams using Local, Asynchronous Eigen Monitoring Algorithms. In SIAM International Conference on Data Mining, Nevada, 2009.

[14] T. Dasu, S. Krishnan, D. Lin, S. Venkatasubramanian, and K. Yi. Change (Detection) You Can Believe in: Finding Distributional Shifts in Data Streams. In Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII, page 34. Springer, 2009.
[15] T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. In 38th Symposium on the Interface of Statistics, Computing Science, and Applications. Citeseer, 2005.
[16] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta. Distributed data mining in peer-to-peer networks. IEEE Internet Computing, pages 18-26, 2006.
[17] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[18] N. Dindar, P. M. Fischer, M. Soner, and N. Tatbul. Efficiently correlating complex events over live and archived data streams. In ACM DEBS Conference, 2011.
[19] G. Dong, J. Han, L. Lakshmanan, J. Pei, H. Wang, and P. Yu. Online mining of changes from data streams: Research problems and preliminary results. Citeseer.
[20] A. R. Ganguly, J. Gama, O. A. Omitaomu, M. M. Gaber, and R. R. Vatsavai. Knowledge discovery from sensor data, volume 7. CRC, 2008.
[21] V. Ganti, J. Gehrke, and R. Ramakrishnan. A framework for measuring changes in data characteristics. In Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 126-137. ACM, 1999.
[22] V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Loh. A framework for measuring differences in data characteristics. Journal of Computer and System Sciences, 64(3):542-578, 2002.
[23] S. Geisler, C. Quix, and S. Schiffer. A data stream-based evaluation framework for traffic information systems. In Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming, pages 11-18. ACM, 2010.
[24] L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk. Stream warehousing with datadepot. In Proceedings of the 35th SIGMOD international conference on Management of data, pages 847-854. ACM, 2009.
[25] A. J. Hey, S. Tansley, and K. M. Tolle. The fourth paradigm: data-intensive scientific discovery. Microsoft Research Redmond, WA, 2009.
[26] S. Hido, T. Idé, H. Kashima, H. Kubo, and H. Matsuzawa. Unsupervised change analysis using supervised learning. In Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining, pages 148-159. Springer-Verlag, 2008.
[27] S. Ho and H. Wechsler. Detecting changes in unlabeled data streams using martingale. In Proceedings of the 20th international joint conference on Artificial intelligence, pages 1912-1917. Morgan Kaufmann Publishers Inc., 2007.
[28] W. Huang, E. Omiecinski, and L. Mark. Evolution in Data Streams. 2003.
[29] W. Huang, E. Omiecinski, L. Mark, and M. Nguyen. History guided low-cost change detection in streams. Data Warehousing and Knowledge Discovery, pages 75-86, 2009.
[30] E. Ikonomovska, J. Gama, R. Sebastião, and D. Gjorgjevik. Regression trees from data streams with drift detection. In Discovery Science, pages 121-135. Springer, 2009.
[31] M. Karnstedt, D. Klan, C. Pölitz, K.-U. Sattler, and C. Franke. Adaptive burst detection in a stream engine. In Proceedings of the 2009 ACM symposium on Applied Computing, pages 1511-1515. ACM, 2009.
[32] Y. Kawahara and M. Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In Proceedings of 2009 SIAM International Conference on Data Mining (SDM2009), pages 389-400, 2009.
[33] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, page 191. VLDB Endowment, 2004.
[34] A. Kim, C. Marzban, D. Percival, and W. Stuetzle. Using labeled data to evaluate change detectors in a multivariate streaming environment. Signal Processing, 89(12):2529-2536, 2009.
[35] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: Methods, evaluation, and applications. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pages 234-247. ACM New York, NY, USA, 2003.
[36] L. Kuncheva. Change detection in streaming multivariate data using likelihood detectors. Knowledge and Data Engineering, IEEE Transactions on, (99):1-1, 2011.
[37] X. Liu, X. Wu, H. Wang, R. Zhang, J. Bailey, and K. Ramamohanarao. Mining distribution change in stock order streams. Proc. of ICDE, pages 105-108, 2010.
[38] G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th international conference on Very Large Data Bases, pages 346-357. VLDB Endowment, 2002.
[39] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Byers. Big data: The next frontier for innovation, competition and productivity. McKinsey Global Institute, May, 2011.
[40] A. Maslov, M. Pechenizkiy, T. Kärkkäinen, and M. Tähtinen. Quantile index for gradual and abrupt change detection from cfb boiler sensor data in online settings. In Proceedings of the Sixth International Workshop on Knowledge Discovery from Sensor Data, pages 25-33. ACM, 2012.
[41] S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.
[42] S. Muthukrishnan, E. van den Berg, and Y. Wu. Sequential change detection on data streams. ICDM Workshops, 2007.
[43] M. Naor and L. Stockmeyer. What can be computed locally? pages 184-193, 1993.
[44] W. Ng and M. Dash. A change detector for mining frequent patterns over evolving data streams. In Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on, pages 2407-2412. IEEE, 2008.

[45] W. Ng and M. Dash. A test paradigm for detecting changes in transactional data streams. In Database Systems for Advanced Applications, pages 204-219. Springer, 2008.
[46] D. Nikovski and A. Jain. Fast adaptive algorithms for abrupt change detection. Machine learning, 79(3):283-306, 2010.
[47] R. Niu and P. K. Varshney. Performance analysis of distributed detection in a random sensor field. Signal Processing, IEEE Transactions on, 56(1):339-349, 2008.
[48] R. Niu, P. K. Varshney, and Q. Cheng. Distributed detection in a large wireless sensor network. Information Fusion, 7(4):380-394, 2006.
[49] T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. Distributed deviation detection in sensor networks. ACM SIGMOD Record, 32(4):77-82, 2003.
[50] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya. Anomaly detection in large-scale data stream networks. Data Mining and Knowledge Discovery, 28(1):145-189, 2014.
[53] G. Ross, D. Tasoulis, and N. Adams. Online annotation and prediction for regime switching data streams. In Proceedings of the 2009 ACM symposium on Applied Computing, pages 1501-1505. ACM, 2009.
[54] A. Singh. Review Article Digital change detection techniques using remotely-sensed data. International Journal of Remote Sensing, 10(6):989-1003, 1989.
[55] X. Song, M. Wu, C. Jermaine, and S. Ranka. Statistical change detection for multi-dimensional data. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 667-676. ACM, 2007.
[56] D.-H. Tran and K.-U. Sattler. On detection of changes in sensor data streams. In Proceedings of the 9th International Conference on Advances in Mobile Computing and Multimedia, pages 50-57. ACM, 2011.
[57] M. van Leeuwen and A. Siebes. Streamkrimp: Detecting change in data streams. Machine Learning and Knowledge Discovery in Databases, pages 672-687, 2008.
[51] R. Rajagopal, X. Nguyen, S. C. Ergen, and P. Varaiya. Distributed online simultaneous
[51] R. Rajagopal, X. Nguyen, S. C.fault detection
Ergen, and P.forVaraiya.
multi- [58] V. Veeravalli and P. Varshney. Distributed inference in wire-
ple sensors. In Information Processing in Sensor Networks,
Distributed online simultaneous fault detection for multi- [58] less sensor networks.
V. Veeravalli Philosophical
and P. Varshney. DistributedTransactions
inference inofwire-
the
2008. IPSN08.
ple sensors. International
In Information Conference
Processing on, pages
in Sensor 133
Networks, Royal Societynetworks.
A: Mathematical, Physical and Engineering
less sensor Philosophical Transactions of the
144. IEEE, 2008.
2008. IPSN08. International Conference on, pages 133 Sciences, 370(1958):100117, 2012.
Royal Society A: Mathematical, Physical and Engineering
144.Roddick,
[52] J. IEEE, 2008.L. Al-Jadir, L. Bertossi, M. Dumas, Sciences, 370(1958):100117, 2012.
[59] A. Zhou, F. Cao, W. Qian, and C. Jin. Tracking clusters
[52] J. Roddick, L.K. Al-Jadir,
H. Gregersen, Hornsby, L.J. Bertossi,
Lufter, F.M.Mandreoli,
Dumas, [59] in
A. evolving
Zhou, F.data
Cao,streams overand
W. Qian, sliding windows.
C. Jin. Knowledge
Tracking clusters
T. Mannisto, E. Mayol, et al. Evolution
H. Gregersen, K. Hornsby, J. Lufter, and F.
change in data
Mandreoli, and Information Systems, 15(2):181214, 2008.
management:issues andetdirections. ACM in evolving data streams over sliding windows. Knowledge
T. Mannisto, E. Mayol, al. Evolution andSigmod
changeRecord,
in data and Information Systems, 15(2):181214, 2008.
29(1):2125, 2000. and directions. ACM Sigmod Record,
management:issues
29(1):2125, 2000.

Contextual Crowd Intelligence
Beng Chin Ooi, Kian-Lee Tan, Quoc Trung Tran, James W. L. Yip,
Gang Chen#, Zheng Jye Ling, Thi Nguyen, Anthony K. H. Tung, Meihui Zhang
National University of Singapore, National University Health System, # Zhejiang University
{ooibc, tankl, tqtrung, thi, atung, zmeihui}@comp.nus.edu.sg
{james yip, zheng jye ling}@nuhs.edu.sg, # cg@zju.edu.cn
ABSTRACT
Most data analytics applications are industry/domain specific, e.g., predicting patients at high risk of being admitted to the intensive care unit in the healthcare sector, or predicting malicious SMSs in the telecommunication sector. Existing solutions are based on best practices, i.e., the system's decisions are knowledge-driven and/or data-driven. However, there are rules and exceptional cases that can only be precisely formulated and identified by subject-matter experts (SMEs) who have accumulated many years of experience. This paper envisions a more intelligent database management system (DBMS) that captures such knowledge to effectively address the industry/domain specific applications. At the core, the system is a hybrid human-machine database engine where the machine interacts with the SMEs as part of a feedback loop to gather, infer, ascertain and enhance the database knowledge and processing. We discuss the challenges towards building such a system through examples in healthcare predictive analysis, a popular area for big data analytics.

1. INTRODUCTION
Most data analytics applications are industry or domain specific. For example, many prediction tasks in healthcare require prior medical knowledge, such as identifying patients at high risk of being admitted to the intensive care unit, or predicting the probability of patients being readmitted into the hospital within 30 days after discharge. Another example from the telecommunication sector is the identification of malicious SMSs, which requires inputs from security experts. Building competent tools to effectively address these problems is important, as industrial organizations face increasing pressures to improve outcomes while reducing costs [3].

Existing solutions to industry or domain specific tasks are based on best practices. These solutions are knowledge-driven (i.e., utilizing general guidelines such as existing clinical guidelines or literature from medical journals) and/or data-driven (i.e., deriving rules from observational data) [31]. Let us consider the task of identifying the risk factors related to heart failure. The knowledge-driven solution uses risk factors identified from existing clinical knowledge or literature, such as age, hypertension and diabetes status. However, it may miss other unknown risk factors specific to the population of interest. The reason is that the guidelines are generic and based on existing knowledge, which results in models that may not adequately represent the underlying complex disease processes in the population with a comprehensive list of risk factors [31]. The data-driven solution employs machine learning algorithms to derive risk factors solely from observational data. An alternative approach combines the knowledge-driven and data-driven approaches in the data analytics applications [31]. However, there are exceptional situations that are not easy to capture or formalize, and where neither general guidelines are available nor rules can be derived from data (e.g., in rare conditions). Instead, it is only through many years of experience that subject-matter experts (SMEs) can formulate and identify these situations. The challenge then is to be able to capture and utilize such knowledge to effectively support industry/domain specific applications, e.g., improving the accuracy of the prediction tasks.

This paper proposes building the next generation of intelligent database management systems (DBMSs) that exploit contextual crowd intelligence. The crowd intelligence here refers to the knowledge and experience of subject-matter experts (SMEs). Although such knowledge is an important component in transforming data into information, it is currently not captured by a structured system. The participants in an intelligent crowd are domain experts rather than the unknown lay-persons in existing systems that use crowdsourcing as part of database query processing (e.g., CrowdDB [13], Deco [24], Qurk [23], CDAS [12; 22]) and information extraction or knowledge acquisition (e.g., HIGGINS [21] and CASTLE [28]). For applications where data confidentiality and privacy are important (e.g., healthcare analytics), the intelligent crowd may consist only of experts from within the organization, since the tasks cannot be outsourced to external parties. Given that the crowd is known a priori, there is an assurance of user accountability, which translates to an assurance in the quality of the answers. A recent system, called Data Tamer [30], also proposed to leverage an expert crowdsourcing system to enhance machine computation, but in the context of data curation. Our proposition differs from Data Tamer in several aspects. First, the target applications of our work (i.e., data analytics) are different from those in Data Tamer (i.e., data curation). Thus, each system needs to address a unique, different set of challenges. Second, the domain experts in our context are also users/reviewers of the system. Thus, the experts are likely to take ownership and hence are motivated to improve the accuracy of the analytics and the usability of the applications. This would reduce the need to localize/customize the system since the experts/users are continuously interacting with the system; these experts define the best practices for the system. For example, doctors in a particular department may use a different convention or notation from another department, e.g., when doctors write "PID" in the orthopedic department, the acronym refers to the Prolapsed Intervertebral Disc only and not the Pelvic Inflammatory Disease. Clearly, such knowledge can only
be provided by internal domain experts. In contrast, experts in Data Tamer are not the users of the system and hence there is a need to customize/localize the system for different use-cases.

In order to entrench the crowd intelligence into the DBMS, the system needs to keep SMEs as part of the feedback loop. The system can then further utilize feedback provided by the SMEs to infer, ascertain and enhance its processing, thus continuously improving the effectiveness of the system. For example, when predicting the risk of unplanned patient readmissions, the system asks the doctors to label patients whose readmissions it predicts with low confidence, and to provide the rules/hypotheses that they used to do the labeling. One example of such an expert rule is that an elderly patient who lives alone and has had several severe diseases is likely to be readmitted into the hospital frequently. The system would then verify or adjust these rules/hypotheses and revert to the doctors with evidence to support or reject their rules/hypotheses. Such interactions are beneficial to both the system and the doctors. Eventually, the application system evolves over time. SMEs become part of this evolving process by sharing their domain knowledge and rich experience, thereby contributing to the improvement and development of the system. Hence, the experts are more willing and comfortable to use the system to alleviate the burden of their duties.

This work is part of our CIIDAA project on building large scale, Comprehensive IT Infrastructure for Data-intensive Applications and Analysis [2]. Our collaborators are clinicians in the National University Health System (NUHS) [5]. The project aims to harness the power of cloud computing to solve big data problems in the real world, with healthcare predictive analytics being a popular area for big data analytics [26].

Organization. The remainder of this paper is organized as follows. Section 2 presents motivating examples in healthcare predictive analytics. Section 3 discusses the architecture of an intelligent DBMS that aims to embed contextual crowd intelligence. Section 4 elaborates on research problems that we need to address in order to build an intelligent DBMS. Section 5 presents our preliminary results on the problem of predicting the risk of unplanned patient readmissions. Section 6 presents the related work. Finally, Section 7 concludes our work.

2. MOTIVATING EXAMPLES
Let us consider a hospital that has an integrated view of the medical care records of patients as shown in Table 1. The table contains two types of information:

Structured information, including the case identifier, patient's name, age, gender, race, the number of days that the patient stayed at the hospital during a particular visit (LengthOfStay), and the number of days before the patient was readmitted into the hospital after discharge (Readmission); and

Unstructured information, i.e., free text from a doctor's note that contains additional and useful information about a patient's healthcare profile, such as his past medical history, social factors, previous medications, complaints of patients based on a doctor's investigations, major lab results, issues and progress, etc.

The tuples in this table are extracted from real cases of patients admitted to the National University Hospital (NUH) in Singapore. Healthcare professionals often have queries relating to predicting the severity of a patient's condition, such as identifying patients at high risk of being admitted to the intensive care unit, or predicting the probability of patients being readmitted into the hospital soon after discharge. There are also queries that monitor real-time data of patients in critical conditions for unusual conditions, such as whether patients are at high risk of collapsing. With correct predictions, doctors can intervene early to alleviate the deterioration of patients' health outcomes. This can potentially reduce the burden on limited healthcare resources in the primary and acute care facilities. For instance, if a patient is at high risk of unplanned post-discharge readmission, he can potentially benefit from close follow-up after discharge, e.g., the hospital sends a case manager or nurse to examine him once every three days. In addition, important queries related to public health surveillance can be answered in a timely fashion. For example, it is critical to provide real-time, early information to alert decision-makers of emerging threats that need to be addressed in a particular population. The ultimate goal of these predictive queries is to predict, pre-empt and prevent for better healthcare outcomes.

3. AN INTELLIGENT DBMS FOR BIG DATA ANALYTICS
In this section, we discuss the challenges of addressing big data analytics and present an overview of a hybrid human-machine system for these tasks.

3.1 Challenges of Big Data Analytics
Essentially, many tasks of big data analytics can be viewed as conventional data mining problems, such as classifying patients into different class labels (high or low risk of being admitted to intensive care units). There are, however, three important aspects that differentiate big data analytics from traditional machine learning problems.

First, many valuable features for the analytics tasks are stored in unstructured data, for example, doctors' notes [25]. We cannot simply treat these notes as traditional bag-of-words documents. Instead, we need powerful tools to extract from these documents the right entities (such as diseases, medications, laboratory tests) and domain-specific relationships (such as the relationship between a disease and a laboratory test). The text in unstructured data has to be contextualized to each organization's practice, e.g., doctors in a particular department may use a different convention or notation from another department.

Second, there is usually a lack of training samples with well-defined class labels. For instance, when predicting the risk of committing suicide for each patient, the total number of patients known to have committed suicide (i.e., class 1) is very small. However, it does not mean that all the remaining patients did not commit suicide (i.e., class 0). Hence we need to infer the correct class labels for these patients. This problem also occurs in other domains such as home security and banking. For example, one important task that many national security agencies need to perform is identifying persons or groups of people who will likely commit a crime [4]. In this setting, the agency maintains a very small set of people who have committed crime. However, we cannot simply assume that the remaining people are not likely to commit crime. As before, we need to infer the correct class labels for these people. Another example is in telecommunication, where a service provider wants to predict whether an SMS is malicious. In this case, we do not
have any predefined class labels and might need to ask security experts to provide the class labels for some sample cases.

Lastly, data in different domains (e.g., healthcare, telecommunication, home security) is expected to grow dramatically in the years ahead [26]. For instance, patients in intensive care units are constantly being monitored, and their historical records have to be retained. This can easily result in hundreds of millions of (historical) records of patients. As another example, during a mass casualty disaster (e.g., SARS, H5N1), there is an overwhelming number of patients who have to be monitored and tracked, and information about each patient is huge by itself. Furthermore, streaming data arrive continuously, e.g., new data from the real-time data feed are constantly being inserted. Hence, the system in a healthcare setting must provide real-time predictions, e.g., predicting the survival of patients in the next 6 hours.

The three aspects mentioned above call for a new generation of intelligent DBMSs that can provide effective solutions for big data analytics. Our proposition of exploiting contextual crowd intelligence is, we believe, a big step towards this goal.

CaseID | Name      | Age | Gender | Race      | LengthOfStay | Readmission | Doctor's note
Case 1 | Patient 1 | 71  | Female | Chinese   | 5            | 20          | PMH: 1 IHD - on GTN 0.5mg prn; 2 DM - on Metformin 750mg - HbA1c 7.5% 09/12; 3 HL. Stays with son.
Case 2 | Patient 2 | 60  | Male   | Malaysian | 10           | 20          | Social issues: Single, no child. Used to live with friend in a shophouse. Now at sheltered home since Sept 2011. No next-of-kin or visitor.

Table 1: Medical care table

3.2 Contextual Data Management
The central theme of crowd intelligence is to get domain experts engaged as both the participants who fine-tune the system and the end-users of the system. Figure 1 presents an intelligent system that exploits contextual crowd intelligence for big data analytics. The system first builds a knowledge base that will subsequently be used for the analytics tasks, based on historical data, domain knowledge from SMEs (e.g., doctors), and other sources such as general clinical guidelines. Each source contributes to building some weak classifiers. The system needs to combine these classifiers to derive a final classifier that achieves a high level of accuracy for prediction purposes. The system also needs to go through several iterations of interaction with the experts to refine, for example, the final classifier. As such, the experts participate in the entire process of fine-tuning the system and decide on the best practices. When real-time data or a feed arrives, the system performs the prediction on-the-fly and alerts the experts immediately. Hence, the experts become the end-users of the system.

We have developed the epiC system [1; 10; 19] to support large scale data processing, and are extending it to support healthcare analytics. Figure 2 shows the software stack of epiC. At the bottom, the storage layer supports different storage systems (e.g., Hadoop Distributed File System (HDFS) and a key-value storage system, ES2 [8]) for both unstructured and structured data. The next layer (which is the security layer) enables users to protect data privacy by encryption. The third layer (which is the distributed processing layer) provides a distributed processing infrastructure called E3 [9] that supports different parallel processing logics such as MapReduce [11], Directed Acyclic Graph (DAG) and SQL. The top layer (which is the analytics layer) exploits the contextual crowd intelligence for big data analytics. The details of this layer are shown in Figure 1. In Figure 2, KB is the knowledge base and iCrowd is the component that interacts with the domain experts. Different components of the analytics layer (e.g., scalable machine learning algorithms) can process their data with the most appropriate data processing model, and their computations will be automatically executed in parallel by the lower layers.

In the remainder of this paper, we focus only on the analytics layer. For more details of the other layers of the epiC system, please refer to [1; 10; 19].

4. RESEARCH PROBLEMS
In this section, we elaborate on the research problems that we need to address in order to build an intelligent system for big data analytics.

4.1 Asking Experts The Right Questions
Given a large volume of data and a limited amount of time that domain experts can participate in building the systems, we need to ask the experts the right questions. In the context of healthcare analytics, we plan to ask doctors for the following kinds of domain knowledge.

Labelings. The system asks doctors to label tuples for which the system has low confidence in performing the prediction task. There are two important issues here. First, doctors have different levels of confidence when answering different questions, i.e., doctors are reluctant to assess patient profiles outside their specialties. Second, since there is so much information about patients, selecting the relevant features of each patient to present to the doctors, so as not to overwhelm them, is also a major issue.

In essence, what we need is a diverse set of labeled patients that covers the whole data space as much as possible. One possible solution is to group similar patient profiles together and show these groups to doctors. The purpose is to let the doctors select the groups of patients that they are comfortable providing labels for. In addition, for each
group, we present only the features on which the patients in the group have similar values. In this way, we can avoid overwhelming the doctors with information. Note that, in some cases, we need to perform hierarchical clustering to reduce the number of patients shown to the doctors each time. Selecting the right clustering algorithms and developing effective visualization tools to present patients' profiles are important here.

Rules/Hypotheses. The system collects the expert rules/hypotheses that the doctors used to do the labeling. For example, to predict the risk of unplanned patient readmissions, the doctors suggested a hypothesis that social factors and the status of the diseases are important risk indicators for readmission. The system would then verify or adjust these hypotheses and revert to the doctors with evidence to support or reject their hypotheses. Such interactions are beneficial to both the system and the doctors.

Inferred implicit knowledge. The system can also infer implicit and valuable knowledge based on the answers/reactions of the domain experts. For instance, if the doctors label two patients who belong to a given cluster differently, then the system can adjust the distance function used to compute the similarity between two patients, and thus infer which features are more important. Such knowledge is implicit, as the doctors themselves may not be aware of it.

We can also ask the same kinds of questions for the analytics tasks in other domains. For instance, to predict malicious SMSs, we need to select a small set of messages (by utilizing some clustering algorithms) and ask the experts to provide labels for these samples. We also collect rules and heuristics that the experts utilize to label the SMSs.

Figure 1: Contextual crowd intelligence for big data analytics.
Figure 2: The software stack of epiC for big data analytics (from top to bottom: Analytics Layer, Distributed Processing Layer, Security Layer, Storage Layer).

4.2 Extracting Domain Entities From Unstructured Data
Feature selection is very important for any machine learning task and can greatly affect the algorithm's quality. Processing doctors' notes to extract important features is an important step for healthcare analytics problems. There are several state-of-the-art Natural Language Processing (NLP) engines for processing clinical documents, such as MedLEE [14] and cTAKES [27]. These engines process clinical notes, identifying types of clinical entities (e.g., medications, diseases, procedures, lab tests) from various medical dictionaries (a.k.a. knowledge bases), such as the Unified Medical Language System (UMLS) [6]. We now discuss several problems raised by the nature of the unstructured data and the incompleteness of the knowledge base, and subsequently discuss a hybrid human-machine approach to solve these problems. The discussion uses the following running example. We run cTAKES on the doctor's note of patient 1 (in Table 1), and obtain the following clinical entities: (1) diseases: IHD (Ischemic Heart Disease) and DM; (2) medications: GTN and Metformin; and (3) laboratory test: HbA1c.

Ambiguous mentions. In many cases, a mention in the free text may refer to different domain entities. For instance, in the running example, "DM" may refer to two different diseases: Dystrophy Myotonic and Diabetes Mellitus. We note that this problem is not uncommon, as doctors tend to use abbreviations in their notes. For example, "CCF" refers to either Congestive heart failure or Carotid-Cavernous Fistula; "PID" refers to either Prolapsed Intervertebral Disc or Pelvic Inflammatory Disease.

There are also cases where only a human, but not the machine, can understand the meaning of some mentions in the text. For example, assume that we are extracting the social factor of the patients in Table 1. It is rather easy to extract the social factor for patient 1, since the text contains the phrase "stays with son". However, it is challenging, if not impossible, for the machine to extract the social factor for patient 2. The reason is that the paragraph contains several different keywords relating to the social factor, such as "single", "no child", "live with friend", "sheltered home" and "next-of-kin".

Incomplete knowledge base. The knowledge base is incomplete for the following reasons. First, the terms used in the doctors' notes could be specific to a country or a particular hospital, whereas the existing knowledge bases may only cover the universal ones. Thus, these terms do not exist in the dictionary. One example is the term "HL" in our running example, which refers to the Hyperlipidemia disease but is not captured in UMLS. Second, the relationships between entities covered in existing medical knowledge bases (like UMLS) are far from complete. In the running example, the fact that the medication Metformin is used to treat Diabetes Mellitus (DM) is also missing in UMLS. The relationships that exist between domain entities can be used to derive implicit and useful information. For instance, from the laboratory result of the lab test HbA1c, we can infer whether the DM condition is well-controlled (i.e., the relationship between a disease and a lab test).
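
The two failure modes above (ambiguous mentions and mentions missing from the knowledge base) can be illustrated with a minimal Python sketch. It assumes a toy dictionary, `TOY_KB`, standing in for a real knowledge base such as UMLS; the names `TOY_KB`, `Status`, `lookup_mention` and `triage` are hypothetical and are not part of cTAKES or UMLS. A mention with exactly one candidate entity is treated as resolved, a mention with several candidates as ambiguous, and a mention with no entry as unknown.

```python
from enum import Enum


class Status(Enum):
    RESOLVED = "resolved"    # exactly one candidate entity
    AMBIGUOUS = "ambiguous"  # several candidate entities
    UNKNOWN = "unknown"      # mention not covered by the knowledge base


# Hypothetical toy knowledge base: mention -> candidate entities.
# The entries are taken from the running example; "HL" is deliberately absent.
TOY_KB = {
    "IHD": ["Ischemic Heart Disease"],
    "DM": ["Dystrophy Myotonic", "Diabetes Mellitus"],
    "CCF": ["Congestive heart failure", "Carotid-Cavernous Fistula"],
    "PID": ["Prolapsed Intervertebral Disc", "Pelvic Inflammatory Disease"],
}


def lookup_mention(kb, mention):
    """Return (status, candidates) for a single mention."""
    candidates = kb.get(mention, [])
    if not candidates:
        return Status.UNKNOWN, []
    if len(candidates) == 1:
        return Status.RESOLVED, candidates
    return Status.AMBIGUOUS, candidates


def triage(kb, mentions):
    """Group mentions by lookup status, preserving input order."""
    report = {status: [] for status in Status}
    for mention in mentions:
        status, _ = lookup_mention(kb, mention)
        report[status].append(mention)
    return report


# Disease-like mentions from the doctor's note of patient 1.
report = triage(TOY_KB, ["IHD", "DM", "HL"])
for status, mentions in report.items():
    print(status.value, mentions)
```

In this sketch, the ambiguous and unknown groups are the mentions that cannot be settled by dictionary lookup alone and would need additional context or expert input.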

A hybrid human-machine approach. To infer the correct Round 3 C7 Final classifier
entities from unstructured data, a hybrid human-machine solution
should be employed. The system can leverage the information
from the knowledge base (e.g., UMLS) together with the implicit Intermediate
Round 2 C5 C6
information (signals) inherent in the unstructured data (e.g., classifiers
doctors notes) to improve the accuracy of its inference process

and enhance the knowledge base as well. The system will pose
Round 1 C1 C2 C3 C4
questions to the healthcare professionals for verification. Based on
the answers from the experts, the system adjusts its inference Classifier from Classifier from Classifier from Classifier from
doctor A doctor B
results. The inference process gets more accurate and complete as data general guidelines
the system runs more iterations. Meanwhile, the knowledge base

becomes more comprehensive and customized to each Figure 3: An example of several rounds of learning for healthcare
organizations practice. More specifically, in our running example: predictive analytics
Since DM is attached with the laboratory test HbA1c
in the paragraph, the machine conjectures that DM would There are many ways to achieve the goal. Figure 3 shows an
refer to the Diabetes Mellitus disease only. The reason is example of a process consisting of three rounds of learning for the
that HbA1c is a laboratory test that monitors the control of task of predicting the severity of patients. In the first round, the
diabetes and HbA1c does not have any relationship with the system computes four classifiers: C1 and C2 are the classifiers
other disease related to DM (i.e., Dystrophy Myotonic). derived from rules provided by SMEs (i.e., doctors); C3 is the
To correctly infer the disease Hyperlipidemia for HL, classifier derived from historical data; and C4 is the classifier
the machine infers a pattern of num d where num is a derived from clinical guidelines. It is essential to resolve
fraction annotation and d is a disease. (1 IHD and 2 disagreeing opinions from various sources. There are several ways
DM are two examples.) The machine then infers that HL to combine different classifiers, such as, using majority-voting for
may refer to a disease since the phrase 3 HL follows the the outputs of different rules/classifiers or combining features
pattern. The machine then poses a question to a doctor: being used in different input classifiers.
which disease "HL" stands for? In this case, the doctor confirms that "HL" stands for the disease Hyperlipidemia. Based on the answer, the machine adds the mapping between the mention "HL" and the disease Hyperlipidemia to the knowledge base. Hence, the knowledge base becomes more comprehensive and customized to NUH's practice.

• To identify the missing relationship between the medication Metformin and the disease DM, the machine infers a pattern of "d on med", where d is a disease, med is a medication and med is used to treat d. ("IHD on GTN" is an example.) The machine conjectures that there should be a relationship between DM and Metformin, since the phrase "DM on Metformin" follows the pattern. The machine then verifies this inference with the doctors. The doctors confirm that they typically write the medications that are used to treat a disease right next to the disease, and connect the two with the preposition "on". Clearly, such a rule is very useful: the machine can then infer other missing relationships using this expert rule, with fewer questions being posed to the doctors.

• To derive the social factor for patient 2, the machine can first attempt to derive the information using a simple strategy such as analyzing the NLP structure of sentences containing patterns like "stay with" or "live with". For complicated cases where the machine cannot find the information, we need to tap the knowledge of the experts.

4.3 Combining Multiple Weak Classifiers
We can obtain different classifiers from multiple sources, such as classifiers built from the observational data, rules used by the doctors, and general clinical guidelines. Each source of knowledge can be considered a weak classifier, and the task is to combine these classifiers to derive a final classifier that achieves very high prediction accuracy.

It is likely that the classifiers built after the first round do not all agree with each other on the prediction tasks. Thus, in this example, the system performs two additional rounds of learning to improve the accuracy of the classifier. It is also possible that there is no way to reconcile the classifiers, i.e., there will be multiple different classifiers. In such situations, it may be necessary to rank the results of the different classifiers and pick the answer that is ranked highest. How to do this is an open question.

4.4 Scalable Processing
Big data analytics is characterized by the so-called 3V features: Volume (a huge amount of data), Velocity (a high data ingestion rate), and Variety (a mix of structured, semi-structured and unstructured data). These requirements force us to rethink the whole software stack to address big data analytics efficiently and effectively, ranging from the storage layer, which should manipulate both structured and unstructured data, to the application layer, which should support scalable machine learning algorithms. To illustrate these points, let us reconsider the problem of predicting malicious SMSs. The collection of SMSs is huge, e.g., on the order of hundreds of terabytes. As discussed in Section 4.1, we need to pick a set of SMSs for domain experts to label. Conventional clustering algorithms may not work well here, as we need to handle such a large amount of data. The problem is even more challenging in our context, as we need to frequently get the domain experts involved in building the system. The delay caused by human reaction time may be a large factor affecting the latency of the system.

The scalability problem also arises from the high-dimensional data space. Our data set inherently contains a large number of features. For instance, the information about patients covers thousands of different diseases and lab tests. One solution to reduce the dimensions is to group these attributes semantically, e.g., grouping together different diseases that share the same root. For instance, Hypertension, Hypotension and Ischaemic Heart disease can be grouped together under the category of Cardiovascular disease.
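The semantic grouping described above can be illustrated with a minimal sketch. It assumes a hand-built disease-to-category map: only the three cardiovascular entries come from the text, while the function name, the "Other" fallback and the sample patient are hypothetical and would need review by domain experts.

```python
from collections import Counter

# Hypothetical mapping: only these three entries are taken from the text.
DISEASE_TO_CATEGORY = {
    "Hypertension": "Cardiovascular disease",
    "Hypotension": "Cardiovascular disease",
    "Ischaemic Heart disease": "Cardiovascular disease",
}


def group_diseases(diseases, mapping=DISEASE_TO_CATEGORY, default="Other"):
    """Collapse per-disease indicators into per-category counts.

    Diseases without a known category fall into `default`, an
    illustrative choice rather than anything prescribed by the paper.
    """
    return Counter(mapping.get(disease, default) for disease in diseases)


# Illustrative patient record: three raw disease features become two
# category-level features.
patient = ["Hypertension", "Ischaemic Heart disease", "Hyperlipidemia"]
features = group_diseases(patient)
```

Here thousands of per-disease columns would shrink to one column per category, at the cost of losing distinctions inside a category; which distinctions are safe to lose is exactly the kind of decision that calls for expert input.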

Clearly, to perform such tasks, we need to consult the domain experts, as different hospitals/doctors may have different opinions/reasoning in performing this task. This is, again, an example of getting the domain experts involved in building the systems.

4.5 Engaging Expert Users
As the system needs to interact with SMEs frequently, it is important to engage the experts throughout the process of building and using the system. The system should provide several functionalities for this purpose:

• A user-friendly interface for the experts to provide their inputs such as rules, hypotheses, labels, etc.

• The system should provide not only the final outcome (e.g., whether the patient is at high/low risk of being sent to ICU) but also the reasons that drive its decision. Therefore, keeping track of the provenance of the knowledge is important. For instance, when the system makes a decision that differs from experts' opinions, the system should be able to trace back whether the mismatch is mainly due to the use of some general guidelines, or due to other experts' opinions.

• Presenting feedback to the experts. For instance, the system can explain how well an expert performs compared to other colleagues. As another example, the system can reveal comments and annotations by other experts to see whether an expert would change her decision. It is also interesting to present new patterns of knowledge that an expert may lack and potentially educate her.

5. PRELIMINARY RESULTS
We are studying the problem of predicting the probability of patients being readmitted into the hospital within 30 days after discharge. We refer to the task as readmission prediction for short. We use the clinical data drawn from the National University Hospital's Computerized Clinical Data Repository (CCDR) and focus only on elderly patients (i.e., patients older than 60) admitted to the hospital in 2012. The table used for the prediction task is the medical care table¹, which has a similar schema to the one presented in Table 1. In total, 29049 elderly patients were admitted to NUH in 2012, of whom 5658 were readmitted within 30 days, i.e., the proportion of patients who were readmitted (i.e., class label 1) is 0.195.

¹ To derive the medical care table, we joined information from various relations in CCDR, including: Discharge Summary, Patient Demographics, Visit and Encounter, Lab Results and Emergency Department.

5.1 Interacting with Domain Experts
We have been getting the doctors involved in the following tasks.

Hypothesis/Rules. Our clinician collaborators have suggested a hypothesis that the following features (indicators) might be important for the readmission prediction:

• Social-economic factors, e.g., who the care-givers are and the patient's economic status.

• Lab findings. We should extract the lab findings that the doctors mentioned in their notes instead of using the labs recorded in the structured data in CCDR. The reason is that patients typically have hundreds of lab tests, but only a small number of them are important and are captured in the doctors' notes. As a result, selecting lab findings mentioned by doctors naturally reduces the dimensions of the data set.

• Comorbidity influence, i.e., we should take into account the past medical history of the patient together with the disease status (whether the disease has been well-controlled).

Participants in a crowd-sourcing system. We adopted a hybrid human-machine approach to extract the social factors and lab findings from doctors' free-text notes.

To extract the social factors, we use an NLP technique to analyze sentences containing phrases related to the social factor, such as "live (with)", "stay (with)" and "main care-giver", to pinpoint keywords such as "daughter", "family", "spouse", etc. The system then asks the doctors to handpick a set of predefined categories of social factors. For instance, living with family and being taken care of by professional helpers (e.g., maid, domestic helpers) are in the same group. As another example, living alone and living in a community nursing home are in the same group. The system also performs a postprocessing step to pull out cases that can be assigned more than one category of social factors. The system then asks the doctors to label these cases manually. (There are about 200 cases that need to be manually labeled.)

To extract the lab findings, the system first uses a simple pattern matching technique to extract all possible lab tests mentioned in the note. For instance, if the note contains a pattern of the form "word num", where word is some word and num is a number, then word is a candidate lab test. A candidate is accepted as a lab test if it exists in the medical dictionary under the category of lab tests. For candidates that are not present in the dictionary but appear frequently in the notes, the system asks the doctors to verify them. This step uncovered some actual lab tests that are missing from the dictionary, such as TW, which is a local convention used inside NUH.

Extracting medical concepts. We run the cTAKES NLP engine over the UMLS dictionary to extract the past medical history of a patient. We are in the process of developing algorithms to improve the accuracy of extraction (to resolve problems mentioned in Section 4.2). Thus, we use the number of diseases that the patient has as an indicator instead of the actual diseases.

                       # actual class 1   # actual class 0
# predicted class 1          1071               1321
# predicted class 0          4587              22070
(a) Using only structured features

                       # actual class 1   # actual class 0
# predicted class 1          2679               4250
# predicted class 0          2979              19141
(b) Using both structured and derived features

Table 2: The accuracy of our classifier.

5.2 Results
After interacting with the doctors to extract relevant features, we obtained two sets of features for the prediction task:

• Structured features: patients' demographics (age, gender, race), the number of days that the patient stayed at the hospital, the number of previous hospitalizations, and the

number of prior emergency visits in the last six months before admission.

• Derived features from free-texts (we refer to these features as derived features for short): social factors, lab findings, and past medical history (i.e., diseases).

We used WEKA [15] to run a 10-fold cross-validation with the Bayesian Network classifier to construct a readmission classifier². Table 2 reports the accuracy of the prediction across all 10 validation folds. If only structured features are used to build the classifier (Table 2(a)), the resulting classifier correctly predicts 1071 cases that are readmitted (within 30 days). The precision and recall in this case are 0.448 and 0.189, respectively. Meanwhile, if both structured and derived features are used to build the classifier (Table 2(b)), the resulting classifier correctly predicts 2679 cases that are readmitted. The precision and recall are 0.387 and 0.473, respectively. Clearly, the recall improves significantly with the use of the derived features from the doctors' free-text notes. The result is also very promising when compared to the predictions made manually by domain experts such as physicians, case managers, and nurses [7]. The recall reported in [7] is in the range [0.149, 0.306]. The conclusion in [7] is that care-providers were not able to accurately predict which patients were at highest risk of readmission. However, we believe that a hybrid machine-human solution would greatly alleviate the problem.

We would like to emphasize that there is much room to further improve the accuracy of the prediction, such as enhancing the feature extraction process, employing additional features (such as disease status, specific diagnoses, and medications), and using special classifiers for highly imbalanced data sets.

² We also used other classifiers such as decision tree, rule-based classifier, SVM, etc., and observed that the Bayesian Network classifier provides the best result.

6. RELATED WORK
Related works to our proposition can be broadly classified into the following three categories.

Existing solutions for industry/domain specific applications. Existing solutions are currently built based on best practices. One direction is the knowledge-driven approach, which is based on general guidelines such as clinical guidelines, e.g., IBM Watson [3]. Another direction is the data-driven approach, which is based on rules extracted from the observational data, e.g., [16; 18; 20]. Recently, IBM proposed to combine the strengths of the two directions [31]. However, these solutions have not explored the exceptionally complicated rules/patterns that can only be provided by internal domain experts with years of working experience. Our research aims to fill this gap: we seek to engage the experts as users of the system, and tap their expertise to enhance the database knowledge and processing. There are several benefits of employing internal domain experts. First, we do not need to customize/localize the system for different use-cases; the experts themselves define the best practices for the system. Second, in terms of the data used to build the knowledge base, our system is mainly based on observational data and knowledge provided by domain experts, whereas others (e.g., IBM Watson) need to process a much larger amount of inputs such as medical journals, white papers, medical policies and practices, information on the web, etc. Third, the system should become more intelligent over time as the expert users continuously enhance the system with their expert knowledge.

Crowdsourcing in database. There has been a lot of recent interest in the database community in using crowdsourcing as part of database query processing (e.g., CrowdDB [13], Deco [24], Qurk [23], CDAS [12; 22]). As discussed, the intelligent crowds in our context are domain experts (rather than lay-persons as in the existing crowds) who are also users/reviewers of the system. Furthermore, exploiting an intelligent crowd can be much more collaborative in nature. In typical crowdsourcing, the crowds are not aware of each other's answers. But in our context, we can actually go through several iterations and see whether the experts will change their decisions when they are provided with comments and annotations by other experts.

A recent system, called Data Tamer [30], also leveraged an expert crowdsourcing system to enhance machine computation, but in the context of data curation. As discussed in Section 1, the key difference between our proposition and Data Tamer lies in the fact that the domain experts in our context are also users/reviewers of the system. Thus, the experts are likely to take ownership and hence are motivated to improve the accuracy of the analytics and the usability of the applications. This would reduce the need to localize/customize the system. Also, each system needs to address a different set of challenges, since the targeted applications are different.

Active learning. In the active learning model, the data come unlabeled but the goal is to ultimately learn a classifier (e.g., [17; 29; 32]). The idea is to query the labels of just a few points that are especially informative in order to obtain an accurate classifier. The labels are obtained from highly-trained experts (e.g., doctors). The scope of our proposition is much more general than active learning in the following respects. First, we would like to exploit as much domain knowledge from experts as possible, not restricting ourselves to only the class labels as in active learning. For instance, rules and hypotheses provided by experts with many years of experience must be exploited in several cases. Second, active learning focuses on getting a better classifier, so the query points presented to the crowd are usually those data points that are at the boundary of the separating plane. However, these are also the data points that the experts are usually not very clear about. As such, we need to be able to identify additional information that should be provided for the experts to be able to make an informed decision. Lastly, we need to handle a large amount of data, whereas existing solutions on active learning usually deal with small data sets.

7. CONCLUSION
Each of us is a subject-matter expert (SME) of our profession, and we carry with us a vast amount of knowledge and insights not captured by a structured system. This might explain the emergence of Knowledge Management systems. However, there are many rules and exceptional cases that can only be formulated by experts with many years of experience. Such rules, when properly coded, can help in facilitating contextual decision making. This paper envisions a more intelligent DBMS that captures such information or knowledge. At the core, the system is a hybrid human-machine database processing engine where the machine keeps the SMEs as part of the feedback loop to gather, infer, ascertain and enhance the database knowledge and processing. This paper discussed many open challenges that we need to tackle in order to build such a system.

8. ACKNOWLEDGEMENTS
This work was supported by the National Research Foundation, Prime Minister's Office, Singapore under Grant No.

NRF-CRP8-2011-08. We thank Associate Professor Gerald C.H. Koh and Dr. Chuen Seng Tan (Saw Swee Hock School of Public Health, National University Health System) for sharing with us domain knowledge in healthcare.

9. REFERENCES
[1] http://www.comp.nus.edu.sg/epic.
[2] The comprehensive it infrastructure for data-intensive applications and analysis project. http://www.comp.nus.edu.sg/ciidaa/.
[3] Ibm big data for healthcare. http://www.ibm.com.
[4] The minority report: Chicago's new police computer predicts crimes, but is it racist? http://www.theverge.com/2014/2/19/5419854/the-minority-report-this-computer-predicts-crime-but-is-it-racist.
[5] National university health system. http://www.nuhs.edu.sg/.
[6] Unified medical language system. http://www.nlm.nih.gov/research/umls/.
[7] N. Allaudeen, J. L. Schnipper, E. J. Orav, R. M. Wachter, and A. R. Vidyarthi. Inability of providers to predict unplanned readmissions. J Gen Intern Med, 26(7):771-776.
[8] Y. Cao, C. Chen, F. Guo, D. Jiang, Y. Lin, B. C. Ooi, H. T. Vo, S. Wu, and Q. Xu. Es2: A cloud data storage system for supporting both oltp and olap. In ICDE, pages 291-302, 2011.
[9] G. Chen, K. Chen, D. Jiang, B. C. Ooi, L. Shi, H. T. Vo, and S. Wu. E3: an elastic execution engine for scalable data processing. JIP, 20(1):65-76, 2012.
[10] G. Chen, H. Jagadish, D. Jiang, D. Maier, B. Ooi, K. Tan, and W. Tan. Federation in cloud data management: Challenges and opportunities. TDKE, 2014.
[11] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1), Jan. 2008.
[12] J. Fan, M. Lu, B. C. Ooi, W.-C. Tan, and M. Zhang. A hybrid machine-crowdsourcing system for matching web tables. In ICDE, pages 976-987, 2014.
[13] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: answering queries with crowdsourcing. In SIGMOD Conference, pages 61-72, 2011.
[14] C. Friedman, P. O. Alderson, J. H. Austin, J. J. Cimino, and S. B. Johnson. A general natural-language text processor for clinical radiology. JAMIA, 1(2):161-174, 1994.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: An update. SIGKDD Explorations, 11(1), 2009.
[16] J. Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[17] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In ICML, pages 417-424, 2006.
[18] A. Hosseinzadeh, M. T. Izadi, A. Verma, D. Precup, and D. L. Buckeridge. Assessing the predictability of hospital readmission using machine learning. In IAAI, 2013.
[19] D. Jiang, G. Chen, B. C. Ooi, K.-L. Tan, and S. Wu. epic: an extensible and scalable system for processing big data. In PVLDB, 2014.
[20] P. S. Keenan, S.-L. T. Normand, Z. Lin, E. E. Drye, K. R. Bhat, J. S. Ross, J. D. Schuur, B. D. Stauffer, S. M. Bernheim, A. J. Epstein, Y. Wang, J. Herrin, J. Chen, J. J. Federer, J. A. Mattera, Y. Wang, and H. M. Krumholz. An administrative claims measure suitable for profiling hospital performance on the basis of 30-day all-cause readmission rates among patients with heart failure. Circ Cardiovasc Qual Outcomes, 1(1):29-37, 2008.
[21] K. S. Kumar, P. Triantafillou, and G. Weikum. Human computing games for knowledge acquisition. In CIKM, pages 2513-2516, 2013.
[22] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. Cdas: a crowdsourcing data analytics system. PVLDB, 5(10):1040-1051, 2012.
[23] A. Marcus, E. Wu, S. Madden, and R. C. Miller. Crowdsourced databases: Query processing with people. In CIDR, pages 211-214, 2011.
[24] A. G. Parameswaran, H. Park, H. Garcia-Molina, N. Polyzotis, and J. Widom. Deco: declarative crowdsourcing. In CIKM, pages 1203-1212, 2012.
[25] S. Perera, A. Sheth, K. Thirunarayan, S. Nair, and N. Shah. Challenges in understanding clinical notes: Why nlp engines fall short and where background knowledge can help. In CIKM Workshop, 2013.
[26] W. Raghupathi and V. Raghupathi. Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2014.
[27] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. K. Schuler, and C. G. Chute. Mayo clinical text analysis and knowledge extraction system (ctakes): architecture, component evaluation and applications. JAMIA, 17(5):507-513, 2010.
[28] T. K. Sean Goldberg, Daisy Zhe Wang. Castle: Crowd-assisted system for textual labeling & extraction. HCOM, 2013.
[29] B. Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, 2010.
[30] M. Stonebraker, D. Bruckner, I. Ilyas, G. Beskales, M. Cherniack, S. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The data tamer system. In CIDR, 2013.
[31] J. Sun, J. Hu, D. Luo, M. Markatou, F. Wang, S. Edabollahi, S. E. Steinhubl, Z. Daar, and W. F. Stewart. Combining knowledge and data driven insights for identifying risk factors using electronic health records. In AMIA, 2012.
[32] J. Wiens and J. Guttag. Active learning applied to patient-adaptive heartbeat classification. In NIPS, pages 2442-2450, 2010.

On Power Law Distributions in Large-scale Taxonomies
Rohit Babbar, Cornelia Metzig, Ioannis Partalas, Eric Gaussier and Massih-Reza Amini
Université Grenoble Alpes, CNRS
F-38000 Grenoble, France
rohit.babbar@imag.fr, cornelia.metzig@imag.fr, ioannis.partalas@imag.fr, eric.gaussier@imag.fr, massih-reza.amini@imag.fr

ABSTRACT
In many large-scale physical and social complex systems, fat-tailed distributions occur, for which different generating mechanisms have been proposed. In this paper, we study models of generating power law distributions in the evolution of large-scale taxonomies such as the Open Directory Project, which consist of websites assigned to one of tens of thousands of categories. The categories in such taxonomies are arranged in tree or DAG structured configurations having parent-child relations among them. We first quantitatively analyse the formation process of such taxonomies, which leads to power law distributions as the stationary distributions. In the context of designing classifiers for large-scale taxonomies, which automatically assign unseen documents to leaf-level categories, we highlight how the fat-tailed nature of these distributions can be leveraged to analytically study the space complexity of such classifiers. Empirical evaluation of the space complexity on publicly available datasets demonstrates the applicability of our approach.

1. INTRODUCTION
With the tremendous growth of data on the web from various sources such as social networks, online business services and news networks, structuring the data into conceptual taxonomies leads to better scalability, interpretability and visualization. Yahoo! directory, the open directory project (ODP) and Wikipedia are prominent examples of such web-scale taxonomies. The Medical Subject Heading hierarchy of the National Library of Medicine is another instance of a large-scale taxonomy in the domain of life sciences. These taxonomies consist of classes arranged in a hierarchical structure with parent-child relations among them and can be in the form of a rooted tree or a directed acyclic graph. ODP, for instance, which is in the form of a rooted tree, lists over 5 million websites distributed among close to 1 million categories and is maintained by close to 100,000 human editors. Wikipedia, on the other hand, represents a more complicated directed graph taxonomy structure consisting of over a million categories. In this context, large-scale hierarchical classification deals with the task of automatically assigning labels to unseen documents from a set of target classes which are represented by the leaf level nodes in the hierarchy.

In this work, we study the distribution of data and the hierarchy tree in large-scale taxonomies with the goal of modelling the process of their evolution. This is undertaken by a quantitative study of the evolution of large-scale taxonomy using models of preferential attachment, based on the famous model proposed by Yule [33] and showing that throughout the growth process, the taxonomy exhibits a fat-tailed distribution. We apply this reasoning to both category sizes and tree connectivity in a simple joint model.

Formally, a random variable X is defined to follow a power law distribution if for some positive constant a, the complementary cumulative distribution is given as follows:

$$P(X > x) \propto x^{-a}$$

Power law distributions, or more generally fat-tailed distributions that decay slower than Gaussians, are found in a wide variety of physical and social complex systems, ranging from city population and distribution of wealth to citations of scientific articles [23]. They are also found in network connectivity, where the internet and Wikipedia are prominent examples [27; 7]. Our analysis in the context of large-scale web-taxonomies leads to a better understanding of such large-scale data, and is also leveraged to present a concrete analysis of space complexity for hierarchical classification schemes. Due to the ever increasing scale of training data size in terms of the number of documents, feature set size and number of target classes, the space complexity of the trained classifiers plays a crucial role in the applicability of classification systems in many applications of practical importance.

The space complexity analysis presented in this paper provides an analytical comparison of the trained model for hierarchical and flat classification, which can be used to select the appropriate model a-priori for the classification problem at hand, without actually having to train any models. Exploiting the power law nature of taxonomies to study the training time complexity for hierarchical Support Vector Machines has been performed in [32; 19]. The authors therein justify the power law assumption only empirically, unlike our analysis in Section 3 wherein we describe the generative process of large-scale web taxonomies more concretely, in the context of similar processes studied in other models. Despite the important insights of [32; 19], space complexity has not been treated formally so far.

The remainder of this paper is as follows. Related work on reporting power law distributions and on large scale hierarchical classification is presented in Section 2. In Section 3, we recall important growth models and quantitatively justify the formation of power laws as they are found in hierarchical large-scale web taxonomies by studying the evolution dynamics that generate them. More specifically, we present a process that jointly models the growth in the size of categories, as well as the growth of the hierarchical tree structure. We derive from this growth model why the class size distribution at a given level of the hierarchy also exhibits power law decay. Building on this, we then appeal to Heaps' law in Section 4, to explain the distribution of features among categories, which is then exploited in Section 5 for analysing the space complexity for hierarchical classification schemes. The analysis is empirically validated on publicly available DMOZ datasets from the Large Scale Hierarchical Text Classification Challenge (LSHTC)¹ and patent data (IPC)² from the World Intellectual Property Organization. Finally, Section 6 concludes this work.

get classes for the purpose of text classification have been studied in [18; 6] and [8] wherein the number of target classes were limited to a few hundreds. However, the work by [19] is among the pioneering studies in hierarchical classification towards addressing web-scale directories such as Yahoo! directory consisting of over 100,000 target classes. The authors analyse the performance with respect to accuracy and training time complexity for flat and hierarchical classification. More recently, other techniques for large-scale hierarchical text classification have been proposed. Prevention of error propagation by applying Refined Experts trained on a validation set was proposed in [4]. In this approach, bottom-up information propagation is performed by utilizing the output of the lower level classifiers in order to improve classification at top level. The deep classification method pro-
posed in [31] firstby applying
applies Refinedpruning
hierarchy Experts to trained
identify on a
on publicly available DMOZ datasets from the Large Scale validation
much smaller set was proposed
subset of targetin [4]. In thisPrediction
classes. approach, of bottom-
a test
2. RELATED
Hierarchical WORK Challenge (LSHTC)1 and
Text Classification up information
instance is then propagation
performed byisre-training
performedNaive by utilizing
Bayes clas- the
2
patent data (IPC) fromare
Power law distributions reported
World in a wide
Intellectual varietyOr-
Property of output
sifier onofthethesubset
lower of level classifiers
target classesinidentified
order to fromimprove the clas-
first
physical
ganization. andFinally,
social complex
Section 6systemsconcludes [22],thissuch
work. as in inter- sification
step. More atrecently,
top level.Bayesian
The deep classification
modelling method hier-
of large-scale pro-
net topologies. For instance [11; 7] showed that internet posed inclassification
archical [31] first applies has beenhierarchy
proposed pruning
in [15]toinidentify
which hi- a
topologies exhibit power laws with respect to the in-degree much smaller subset of target classes. Predictionnodes of a test
2.
of theRELATED
nodes. AlsoWORK the size distribution of website cate-
erarchical
instance
dependencies between the parent-child are
modelledisbythen performed
centring by re-training
the prior of the childNaive nodeBayesat theclas-
pa-
Power law
gories, distributions
measured in termsare reportedofinwebsites,
of number a wide exhibits
variety of a sifier on values
rameter the subset of itsofparent.
target classes identified from the first
physical and
fat-tailed social complex
distribution, systems [22],
as empirically such as inininter-
demonstrated [32; step. More recently, Bayesian modelling
In addition to prediction accuracy, otherofmetrics
large-scale hier-
of perfor-
net topologies.
19] for the OpenFor instanceProject
Directory [11; 7] (ODP).
showed Variousthat internetmod- archical classification has been proposed in [15] in which hi-
mance such as prediction and training speed as well as space
topologies
els have been exhibit powerfor
proposed laws
thewith respectpower
generation to thelaw in-degree
distri- erarchical dependencies
complexity of the modelbetween the parent-child
have become increasingly nodesimpor-are
of the nodes.
butions, Also the that
a phenomenon size may
distribution
be seen ofas website
fundamental cate- modelled by iscentring the true
priorinof the
the context
child node
tant. This especially of at the pa-
challenges
gories,
in complex measured
systems in terms
as theofnormal
number of websites,inexhibits
distribution statistics a rameter
posed byvalues
problems of its
in parent.
the space of Big Data, wherein an opti-
fat-tailed
[25]. distribution,
However, in contrast as to
empirically demonstrated
the straight-forward in [32;
derivation In addition
mal trade-offtoamong prediction accuracy,
such metrics other metrics
is desired. of perfor-
The significance
19] for thedistribution
of normal Open Directory via theProject
central (ODP).
limit theorem,Variousmodels mod- mance such asspeedprediction
of prediction in suchand traininghas
scenarios speedbeenashighlighted
well as space in
els have been
explaining proposed
power for the generation
law formation all rely on power law distri-
an approxima- complexity of such
the model have
recent studies as [3; 13; 24;become
5]. Theincreasingly
prediction speed impor- is
butions,
tion. Some a phenomenon
explanationsthat may be
are based onseen as fundamental
multiplicative noise tant. This is especially true in theof context of challenges
directly related to space complexity the trained model, as
in
or complex systems as thegroup
on the renormalization normal distribution
formalism in statistics
[28; 30; 16]. For posed
it mayby notproblems
be possiblein the tospace
load of Big Data,
a large wherein
trained modelaninopti-the
[25].growth
the However, in contrast
process to the straight-forward
of large-scale taxonomies, models derivation
based mal trade-off
main memoryamong due to such metrics
sheer size. isDespite
desired.its The significance
direct impact
of
on normal distribution
preferential attachment via thearecentral limit theorem,
most appropriate, models
which are of
onprediction
predictionspeed speed, in nosuch scenarios
earlier workhas hasbeen highlighted
focused on space in
explaining
used in thispower
paper.law formation
These modelsall arerely on on
based an theapproxima-
seminal recent studies such as [3; 13; 24; 5]. The prediction speed is
complexity of hierarchical classifiers.
tion. Some
model by Yule explanations
[33], originallyare based on multiplicative
formulated for the taxonomy noise directly related to space complexity
Additionally, while the existence of of the trained
power model, as
law distributions
or on the renormalization
of biological species, detailed group formalism
in section 3. It[28; 30; 16].
applies For
to sys- it
hasmaybeen not be possible
used for analysis to load a large
purposes in trained
[32; 19] model in the
no thorough
the growth
tems where process
elementsofoflarge-scale
the systemtaxonomies,
are groupedmodels based
into classes, main memory
justification is due
giventoon sheer
the size.
existenceDespite its direct
of such impact
phenomenon.
on preferential
and the systemattachment
grows bothare in most appropriate,
the number which and
of classes, are on
Ourprediction
analysis inspeed, Section no 3,earlier
attemptsworktohas focused
address thisonissue
spacein
used
in theintotal
this number
paper. These models
of elements are based
(which are hereon the seminal
documents complexity of hierarchical classifiers.
a quantitative manner. Finally, power law semantics have
model
or by Yule In
websites). [33],
itsoriginally
original form,formulated
Yulesfor the taxonomy
model serves as Additionally,
been used for whilemodelthe existence
selection andof evaluation
power law of distributions
large-scale
of biological for
explanation species,
powerdetailed in section
law formation 3. Ittaxonomy,
in any applies to irre- sys- has been used for analysissystems
purposes
hierarchical classification [1].inUnlike
[32; 19] no thorough
problems stud-
tems where
spective of anelements
eventual of hierarchy
the systemamong are grouped
categories. into classes,
Similar justification is given on the existence of which
such phenomenon.
ied in classical machine learning sense deal with a
and the system
dynamics have been grows bothtoinexplain
applied the numberscalingofinclasses,
the connec- and Our analysis in Section 3, classes,
attemptsthis to address this forms
issue in
limited number of target application a
in the of
tivity total numberwhich
a network, of elements
grows in (which
termsare here documents
of nodes and edges a quantitative manner. hidden Finally,information
power law semantics have
blue-print on extracting in big data.
or websites).
via preferentialInattachment
its original[2]. form, Yules
Recent modelgeneraliza-
further serves as been used for model selection and evaluation of large-scale
explanation
tions apply the for power law formation
same growth processintoany treestaxonomy,
[17; 14; irre-29]. hierarchical classification systems [1]. Unlike problems stud-
spective
In of an eventual
this paper, describe hierarchy among categories.
the approximate power-lawSimilar in the 3. POWER LAW IN LARGE-SCALE WEB
ied in classical machine learning sense which deal with a
dynamics have been
child-to-parent applied
category to explain
relations by thescaling
model in the
by connec-
Klemm limitedTAXONOMIES
number of target classes, this application forms a
tivity
et al. of a network,
[17]. Furthermore,whichwe grows
combinein termsthisofformation
nodes and edges
process We begin by
blue-print onintroducing
extracting hiddenthe complementary
information cumulative
in big data.size
viaa preferential
in simple manner attachment
with the [2]. originalRecent
Yulefurther
model generaliza-
in order to distribution for category sizes. Let Ni denote the size of cat-
tions apply
explain also the same law
a power growth process sizes,
in category to trees i.e. [17;
we 14; 29].
provide
In this paper, describe
a comprehensive the approximate
explanation for the formation power-law in the
process of 3. POWER
egory i (in terms LAW of number INofLARGE-SCALE
documents), then the WEB proba-
bility that Ni > N is given by
child-to-parent
large-scale web category
taxonomies relations
such as byDMOZ.
the model From by theKlemmsec- TAXONOMIES
et
ond,al. we[17]. Furthermore,
infer a third scaling we combine
distribution this for
formation
the number processof We begin by introducing P (Nithe
>N ) N
complementary (1)
cumulative size
in a simple
features permanner
category. with theisoriginal
This done viaYule model in order
the empirical Heapss to distribution
where > 0fordenotes
category sizes.
the exponent denote
Let Ni of the size
the power of cat-
law dis-
explain
law [10],also a power
which law in
describes thecategory
scaling sizes, i.e. we between
relationship provide egory i (in3 terms of number of documents),
tribution. Empirically, it can be assessed then the proba-
by plotting the
a comprehensive
text length and the explanation for the formation process of
size of its vocabulary. bility of
that Ni > N issize
given by its size (see Figure 1) The
rank a categorys against
large-scale
Some of theweb taxonomies
earlier such as DMOZ.
works on exploiting hierarchy From among the sec-
tar- derivative of this distribution,
P (Ni > N )theN category
size probability
(1)
ond, we infer a third scaling distribution for the number of
1 3
http://lshtc.iit.demokritos.gr/
features per category. This is done via the empirical Heapss To avoid
2 where >confusion,
0 denoteswethe
denote the power
exponent lawpower
of the exponents for
law dis-
http://web2.wipo.int/ipcpub/
law [10], which describes the scaling relationship between in-degree 3distribution and feature size distribution and .
tribution. Empirically, it can be assessed by plotting the
text length and the size of its vocabulary. rank of a categorys size against its size (see Figure 1) The
Some of the earlier works on exploiting hierarchy among tar- derivative of this distribution, the category size probability
1 3
http://lshtc.iit.demokritos.gr/ To avoid confusion, we denote the power law exponents for
2
http://web2.wipo.int/ipcpub/ in-degree distribution and feature size distribution and .

density p(Ni ), then also follows a power law with exponent 100000
(+1)
( + 1), i.e. p(Ni ) Ni .
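A short derivation may help make the step from Equation 1 to the density explicit. It is a sketch under two assumptions that the text does not state: the tail is an exact power law, $P(N_i > N) = C N^{-\beta}$ with a constant $C > 0$, and $N$ can be treated as a continuous variable.

```latex
\begin{align*}
p(N) &= -\frac{\mathrm{d}}{\mathrm{d}N}\, P(N_i > N)
      = -\frac{\mathrm{d}}{\mathrm{d}N}\, C N^{-\beta} \\
     &= \beta\, C\, N^{-\beta - 1}
      \;\propto\; N^{-(\beta + 1)} .
\end{align*}
```

For example, a cumulative exponent of $\beta = 1$ would correspond to a density that decays as $N^{-2}$.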
Two of our empirical findings are a power law for both the complementary cumulative category size distribution and the counter-cumulative in-degree distribution, shown in Figures 1 and 2, for the LSHTC2-DMOZ dataset, which is a subset of ODP. The dataset^4 contains 394,000 websites and 27,785 categories. The number of categories at each level of the hierarchy is shown in Figure 3.
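The exponent in Equation 1 can be estimated from a list of category sizes with the rank-size construction described in the text. The following Python sketch is illustrative only. The function names, the synthetic Pareto sample that stands in for the LSHTC2-DMOZ sizes (which are not reproduced here), and the unweighted least-squares fit on log-log axes are all assumptions, not the authors' procedure.

```python
import math
import random


def synthetic_sizes(n, beta, seed=0):
    """Draw n sizes with P(size > x) = x**(-beta) for x >= 1 (assumed test data)."""
    rng = random.Random(seed)
    # Inverse-transform sampling; 1 - random() lies in (0, 1], so no division by zero.
    return [(1.0 - rng.random()) ** (-1.0 / beta) for _ in range(n)]


def ccdf_exponent(sizes):
    """Estimate beta as minus the slope of log(rank / n) against log(size)."""
    xs = sorted((s for s in sizes if s > 0), reverse=True)
    n = len(xs)
    # The k-th largest size has rank k, so rank / n estimates P(N_i >= size).
    log_size = [math.log(x) for x in xs]
    log_ccdf = [math.log((k + 1) / n) for k in range(n)]
    mean_x = sum(log_size) / n
    mean_y = sum(log_ccdf) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_size, log_ccdf))
    sxx = sum((x - mean_x) ** 2 for x in log_size)
    return -sxy / sxx


if __name__ == "__main__":
    sizes = synthetic_sizes(20000, beta=1.1, seed=0)
    print("estimated beta:", round(ccdf_exponent(sizes), 3))
```

On real, integer-valued category sizes many categories share the same size, so the ranks of tied values are somewhat arbitrary. A maximum-likelihood estimator would be a reasonable alternative to the plain least-squares fit used here.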
^4 http://lshtc.iit.demokritos.gr/LSHTC2_datasets

Figure 1: Category size vs rank distribution for the LSHTC2-DMOZ dataset. (Log-log axes: category size $N$ against the number of categories with $N_i > N$; annotated exponent $\beta = 1.1$.)

Figure 2: Indegree vs rank distribution for the LSHTC2-DMOZ dataset. (Log-log axes: number of indegrees $dg$ against the number of categories with $dg_i > dg$; annotated exponent $\gamma = 1.9$.)

Figure 3: Number of categories at each level in the hierarchy of the LSHTC2-DMOZ database. (Axes: level 1 to 5 against the number of categories, log scale.)

We explain the formation of these two laws via models by Yule [33] and a related model by Klemm [17], detailed in Sections 3.1 and 3.2, which are then related in Section 3.3.

Table 1: Summary of notation used in Section 3

Variables
  $N_i$             Number of elements in class $i$
  $dg_i$            Number of subclasses of class $i$
  $d_i$             Number of features of class $i$
  $\kappa$          Total number of classes
  $DG$              Total number of in-degrees (= subcategories)
  $p_{N,\kappa}$    Fraction of classes having $N$ elements when the total number of classes is $\kappa$
Constants
  $m$               Number of elements added to the system after which a new class is added
  $w \in [0, 1]$    Probability that attachment of subcategories is preferential
Indices
  $i$               Index for the class

3.1 Yule's model
Yule's model describes a system that grows in two quantities, in elements and in classes to which the elements are assigned. It assumes that for a system having $\kappa$ classes, the probability that a new element will be assigned to a certain class $i$ is proportional to its current size,
$$ p(i) = \frac{N_i}{\sum_{i'=1}^{\kappa} N_{i'}} \tag{2} $$
It further assumes that for every $m$ elements that are added to the pre-existing classes in the system, a new class of size 1 is created.^5

^5 The initial size may be generalized to other small sizes; for instance Tessone et al. consider entrant classes with size drawn from a truncated power law [29].

The described system is constantly growing in terms of elements and classes, so strictly speaking, a stationary state does not exist [20]. However, a stationary distribution, the so-called Yule distribution, has been derived using the approach of the master equation with similar approximations by [26; 23; 17]. Here, we follow Newman [23], who considers as one time-step the duration between the creation of two consecutive classes. From this it follows that the average number of elements per class is always $m + 1$, and the system contains $\kappa(m + 1)$ elements at a moment where the number of classes is $\kappa$. Let $p_{N,\kappa}$ denote the fraction of classes having $N$ elements when the total number of classes is $\kappa$. Between two successive time instances, the probability for a given pre-existing class $i$ of size $N_i$ to gain a new element is $m N_i / ((m + 1)\kappa)$. Since there are $\kappa\, p_{N,\kappa}$ classes of size $N$, the expected number of such classes which gain a new element (and grow to size $(N + 1)$) is given by:
$$ \frac{m N}{(m + 1)\kappa}\, \kappa\, p_{N,\kappa} = \frac{m}{m + 1}\, N p_{N,\kappa} \tag{3} $$
The number of classes with $N$ websites is thus reduced by the above quantity, but some classes which had $(N - 1)$ websites prior to the addition of a new class now have one more website. This step, depicting the change of the state of the system from $\kappa$ classes to $(\kappa + 1)$ classes, is shown in Figure 4. Therefore, the expected number of classes with $N$ documents when the number of classes is $(\kappa + 1)$ is given by the following equation:
$$ (\kappa + 1)\, p_{N,(\kappa+1)} = \kappa\, p_{N,\kappa} + \frac{m}{m + 1} \left[ (N - 1)\, p_{(N-1),\kappa} - N p_{N,\kappa} \right] \tag{4} $$
The first term on the right hand side of Equation 4 corresponds to classes with $N$ documents when the number of classes is $\kappa$. The second term corresponds to the contribution from classes of size $(N - 1)$ which have grown to size $N$; this is shown by the left arrow (pointing rightwards) in Figure 4. The last term corresponds to the decrease resulting from classes which have gained an element and have become of size $(N + 1)$; this is shown by the right arrow (pointing rightwards) in Figure 4. The equation for the class of size 1 is given by:
$$ (\kappa + 1)\, p_{1,(\kappa+1)} = \kappa\, p_{1,\kappa} + 1 - \frac{m}{m + 1}\, p_{1,\kappa} \tag{5} $$

Figure 4: Illustration of Equation 4. Individual classes grow constantly, i.e., move to the right over time, as indicated by arrows. A stationary distribution means that the height of each bar remains constant. (Bar chart: class size at $N-1$, $N$ and $N+1$ against the number of classes.)

As the number of classes (and therefore the number of elements $\kappa(m + 1)$) in the system increases, the probability that a new element is classified into a class of size $N$, given by Equation 3, is assumed to remain constant and independent of $\kappa$. Under this hypothesis, the stationary distribution for class sizes can be determined by solving Equation 4 and using Equation 5 as the initial condition. This is given by
$$ p_N = (1 + 1/m)\, B(N, 2 + 1/m) \tag{6} $$
where $B(\cdot, \cdot)$ is the beta function. Equation 6 has been termed the Yule distribution [26]. Written for a continuous variable $N$, it has a power law tail:
$$ p(N) \propto N^{-2 - \frac{1}{m}} $$
From the above equation, the exponent of the density function is between 2 and 3. Its cumulative size distribution $P(N_k > N)$, as given by Equation 1, has an exponent given by
$$ \beta = 1 + (1/m) \tag{7} $$
which is between 1 and 2. The higher the frequency $1/m$ at which new classes are introduced, the bigger $\beta$ becomes, and the lower the average class size. This exponent is stable over time although the taxonomy is constantly growing.

3.2 Preferential attachment models for networks and trees
A similar model has been formulated for network growth by Barabási and Albert [2], which explains the formation of a power law distribution in the connectivity degree of nodes. It assumes that the networks grow in terms of nodes and edges, and that every newly added node to the system connects with a fixed number of edges to existing nodes. Attachment is again preferential, i.e. the probability for a newly added node $i$ to connect to a certain existing node $j$ is proportional to the number of existing edges of node $j$.
A node in the Barabási-Albert (BA) model corresponds to a class in Yule's model, and a new edge to two newly assigned elements. Every added edge counts both towards the degree of an existing node $j$ and towards that of the newly added node $i$. For this reason the existing nodes $j$ and the newly added node $i$ always grow by the same number of edges, implying $m = 1$ and consequently $\beta = 2$ in the BA-model, independently of the number of edges that each new node creates.
The seminal BA-model has been extended in many ways. For hierarchical taxonomies, we use a preferential attachment model for trees by [17]. The authors considered growth via directed edges, and explain power law formation in the in-degree, i.e. the edges directed from children to parent in a tree structure. In contrast to the BA-model, newly added nodes and existing nodes do not increase their in-degree by the same amount, since new nodes start with an in-degree of 0. Leaf nodes thus cannot attract attachment of nodes, and preferential attachment alone cannot lead to a power law. A small random term ensures that some nodes attach to existing ones independently of their degree, which is analogous to the start of a new class in the Yule model. The probability $v$ that a new node attaches as a child to the existing node $i$ with indegree $dg_i$ becomes
$$ v(i) = w\, \frac{dg_i - 1}{DG} + (1 - w)\, \frac{1}{DG}, \tag{8} $$
where $DG$ is the size of the system measured in the total number of in-degrees. $w \in [0, 1]$ denotes the probability that the attachment is preferential, and $(1 - w)$ the probability that it is random to any node, independently of their numbers of indegrees. As has been done for the Yule process [26; 23; 14; 29], the stationary distribution is again derived via the master equation (Equation 4). The exponent of the asymptotic power law in the in-degree distribution is $\gamma = 1 + 1/w$. This model is suitable to explain scaling properties of the tree or network structure of large-scale web taxonomies, which have also been analysed empirically, for instance for subcategories of Wikipedia [7]. It has also been applied to directory trees in [14].
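The attachment rule of Section 3.2 can be illustrated with a small simulation. The Python sketch below rests on a simplifying assumption: it reads the mixture literally, so that with probability w the parent is drawn proportionally to its current in-degree, and otherwise uniformly among all existing nodes. It does not reproduce the exact normalisation of Equation 8, and `grow_tree`, `in_degrees` and their parameters are hypothetical names introduced only for this illustration.

```python
import random
from collections import Counter


def grow_tree(n_nodes, w, seed=0):
    """Grow a rooted tree node by node; return parent[], with parent[0] = None."""
    rng = random.Random(seed)
    parent = [None]      # node 0 is the root and has no parent
    edge_targets = []    # node j appears once for every child it already has
    for new in range(1, n_nodes):
        if edge_targets and rng.random() < w:
            # Preferential step (assumed reading): probability proportional
            # to the current in-degree, so childless nodes are never chosen.
            p = rng.choice(edge_targets)
        else:
            # Random step: any existing node, including leaves.
            p = rng.randrange(new)
        parent.append(p)
        edge_targets.append(p)
    return parent


def in_degrees(parent):
    """Count the children (in-degree) of every node that has at least one."""
    return Counter(p for p in parent if p is not None)


if __name__ == "__main__":
    for w in (0.0, 0.9):
        deg = in_degrees(grow_tree(5000, w=w, seed=1))
        print("w =", w, "largest in-degree:", max(deg.values()))
```

Comparing the largest in-degrees for a small and a large value of w gives a quick impression of how the preferential term concentrates children on a few parents, while the random term is what allows leaf nodes to acquire their first child.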
3.3 Model for hierarchical web taxonomies
We now apply these models to large-scale web taxonomies like DMOZ. Empirically, we uncovered two scaling laws: (a)
one for the size distribution of leaf categories and (b) one for the indegree (child-to-parent link) distribution of categories (shown in Figure 2). These two scaling laws are linked in a non-trivial manner: a category may be very small or even not contain any websites, but nevertheless be highly connected. Since on the other hand (a) and (b) arise jointly, we propose here a model generating the two scaling laws in a simple generic manner. We suggest a combination of the two processes detailed in Sections 3.1 and 3.2 to describe the growth process: websites are continuously added to the system, and classified into categories by human referees. At the same time, the categories are not a mere set, but form a tree structure, which itself grows in two quantities: in the number of nodes (categories) and in the number of in-degrees of nodes (child-to-parent links, i.e. subcategory-to-category links). Based on the rules for voluntary referees of DMOZ on how to classify websites, we propose a simple combined description of the process. Altogether, the database grows in three quantities:

(i) Growth in websites. New websites are assigned into categories $i$, with probability $p(i) \propto N_i$ (Figure 5). This assignment happens independently of the hierarchy level of the category. However, only leaf categories may receive documents.

Figure 5: (i): A website is assigned to existing categories with $p(i) \propto N_i$.

(ii) Growth in categories. With probability $1/m$, the referees assign a website into a newly created category,

its in-degree $d_i$ of the category $i$ (Figure 7). Like in [17], the attachment probability to a parent $i$ is
$$ v(i) = w\, \frac{dg_i - 1}{DG} + (1 - w)\, \frac{\epsilon_i}{DG}. \tag{9} $$

Figure 7: (iii): Growth in children categories.

Equation 8, where $\epsilon_i = 1$, would suffice to explain power laws in in-degrees $dg_i$ and in category sizes $N_i$. To link the two processes more plausibly, it can be assumed that the second term in Equation 9, denoting assignment of new first children, depends on the size $N_i$ of parent categories,
$$ \epsilon_i = \frac{N_i}{N}, \tag{10} $$
since this is closer to the rules by which the referees create new categories, but is not essential for the explanation of the power laws. It reflects that the bigger a leaf category, the higher the probability that referees create a child category when assigning a new website to it.

To summarize, the central idea of this joint model is to consider two measures for the size of a category: the number of its websites $N_i$ (which governs the preferential attachment of new websites), and its in-degree, i.e. the number of its children $dg_i$, which governs the preferential attachment of
at any level of the hierarchy (Figure 6).
i
To summarize,
new categories.the Tocentral
explainideatheofpower
this jointlaw model
in the iscategory
to con-
This assumption would suffice to create a power law in sider two
sizes, measures(i)
assumptions forand
the (ii)
sizeare
of athecategory: the number
requirements. of
For the
its websites
law inNthe i (which governs the preferential attachment
(ii) the category
Growth size distribution,
in categories. but since a 1/m,
With probability tree-structure
the ref- power number of indegrees, assumptions (ii) and
among of neware websites), and its in-degree, i.e. the number of its
erees assign a website into a newly created the
categories exists, we also assume that event
category, (iii) the requirements. The empirically found exponents
of children
= 1.1 dg i , which governs
yield the preferential attachment of
at category
any levelcreation is also attaching
of the hierarchy (Figureat6).particular places and = 1.9 a frequency of new categories
to the tree structure. The probability v(i) that a cate- new categories.
1/m=0.1 To explain
and a frequency of the
newpower
indegrees law (1
in
the
w)category
= 0.9.
This assumption
gory is created aswould suffice
the child of ato createparent
certain a power law in
category sizes, assumptions (i) and (ii) are the requirements. For the
the category
i can dependsize
in distribution,
addition on but the since a tree-structure
in-degree di of that 3.4 law
power Other in the interpretations
number of indegrees, assumptions (ii) and
among
categorycategories exists,9).
(see Equation we also assume that the event (iii) are of
Instead theassuming
requirements. The empirically
in Equations 9 and 10 found exponents
that referees de-
of category creation is also attaching at particular places = to
cide 1.1open
and a single
= 1.9child
yieldcategory,
a frequency it isofmore
new realistic
categoriesto
to the tree structure. The probability 2 v(i) that a cate- 1/m=0.1
assume that andan a frequency of new indegrees
existing category (1 w)
is restructured, i.e.=one
0.9.or
gory is created as the child of a certain
2
parent
3 category several child categories are created, and websites are moved
i can depend in addition on the in-degree di of that 3.4 Other
into these interpretations
new categories such that the parent category con-
category (see Equation 9). 0 0 0 0 0 0
Instead
tains less websites orin even
of assuming Equations
none at9 and
all. 10If that
one referees
of the newde-
cide to open
children a single
categories child category,
inherits all websitesit isofmore realisticcat-
the parent to
2
Figure 6: (ii): Growth in categories is equivalent to growth assume
egory (seethatFigure
an existing
8), thecategory is restructured,
Yule model i.e. one or
applies directly. If
2 3 several child categories are created, and websites arecontains
moved
of the tree structure in terms of in-degrees. the websites are partitioned differently, the model
0
into theseshrinking
effective new categories such thatThis
of categories. the parent
is not category
describedcon-by
0 0 0 0 0
tains less model,
the Yule websites orthe
and even none Equation
master at all. If 4one of the only
considers new
children categories.
growing categories inherits
However,all it
websites
has been of the parent
shown [29; cat-
21]
Figure 6: (ii): Growth in categories is equivalent to growth egory (see Figure
that models 8), the
including Yule model
shrinking applies
categories also directly.
lead to the If
of(iii)
theGrowth
tree structure in terms
in children of in-degrees.
categories. Finally, the hierarchy the websites
formation of are partitioned
power differently,
laws. Further the model compati-
generalizations contains
may also grow in terms of levels, since with a certain effective
ble shrinking
with power of categories.
law formation Thisnew
are that is not described
categories by
do not
probability (1 w), new children categories are as- the Yule model,
necessarily and the
start with one master
document,Equation
and that 4 considers only
the frequency
signed independently of the number of children, i.e. growing
of categories.
new categories doesHowever,
not needittohas been shown [29; 21]
be constant.
that models including shrinking categories also lead to the
(iii) Growth in children categories. Finally, the hierarchy formation of power laws. Further generalizations compati-
may also grow in terms of levels, since with a certain ble with power law formation are that new categories do not
probability (1 w), new children categories are as- necessarily start with one document, and that the frequency
signed independently of the number of children, i.e. of new categories does not need to be constant.
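The growth rules (i) to (iii) of Section 3.3 can be illustrated with a short simulation. The following is only a minimal sketch under simplifying assumptions, not the authors' implementation: the function name `simulate_growth` is hypothetical, every new category is assumed to start with exactly one website, and the two terms of Equation 9 are approximated by a mixture (with probability `w` the parent is drawn in proportion to its in-degree, otherwise in proportion to its size, in the spirit of Equation 10). The default values `m = 10` and `w = 0.1` correspond to the frequencies $1/m = 0.1$ and $(1 - w) = 0.9$ quoted in the text.

```python
import random


def simulate_growth(n_websites, m=10.0, w=0.1, seed=0):
    """Simplified simulation of growth rules (i)-(iii).

    Returns two lists indexed by category: sizes (N_i) and in-degrees (dg_i).
    """
    rng = random.Random(seed)
    sizes = [1]       # N_i: the root category starts with one website (assumption)
    in_degrees = [0]  # dg_i: number of child categories of category i
    for _ in range(n_websites):
        if rng.random() < 1.0 / m:
            # (ii)/(iii): a new category is created and attached to a parent.
            if rng.random() < w and sum(in_degrees) > 0:
                weights = in_degrees  # preferential attachment on in-degree
            else:
                weights = sizes       # assumed size-based term (cf. Equation 10)
            parent = rng.choices(range(len(sizes)), weights=weights)[0]
            in_degrees[parent] += 1
            sizes.append(1)           # the new category holds the new website
            in_degrees.append(0)
        else:
            # (i): the website goes to an existing leaf category, p(i) ~ N_i.
            weights = [n if dg == 0 else 0 for n, dg in zip(sizes, in_degrees)]
            target = rng.choices(range(len(sizes)), weights=weights)[0]
            sizes[target] += 1
    return sizes, in_degrees


sizes, in_degrees = simulate_growth(5000)
print("categories:", len(sizes))
print("largest category size:", max(sizes))
print("largest in-degree:", max(in_degrees))
```

Sorting `sizes` and `in_degrees` in decreasing order and plotting them against rank on log-log axes gives a rough visual check for heavy tails. The restriction to leaf categories in step (i) follows the rule stated in the text; the mixture used for choosing the parent is a design shortcut of this sketch.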

Figure 8: Model without and with shrinking categories. In the left figure, a child category inherits all the elements of its parent and takes its place in the size distribution.

Figure 9: Category size distribution for each level of the LSHTC2-DMOZ dataset. (Log-log plot: number of categories with $N_i > N$ vs. category size $N$, for Levels 2 to 5.)

3.5 Limitations

However, Figures 1 and 2 do not exhibit perfect power law decay for several reasons. Firstly, the dataset is limited. Secondly, the hypothesis that the assignment probability (Equation 2) depends uniquely on the size of a category might be too strong for web directories, neglecting the change in importance of topics. In reality, big categories can exist which receive only few new documents or none at all. Dorogovtsev and Mendes [9] have studied this problem by introducing an assignment probability that decays exponentially with age. For a low decay parameter they show that the stronger this decay, the steeper the power law; for strong decay, no power law forms. A last reason might be that referees re-structure categories in ways strongly deviating from the rules (i) - (iii).

3.6 Statistics per hierarchy level

The tree structure of a database also allows us to study the sizes of classes belonging to a given level of the hierarchy. As shown in Figure 3, the DMOZ database contains 5 levels of different size. If only classes on a given level $l$ of the hierarchy are considered, we equally found a power law in the category size distribution, as shown in Figure 9. Per-level power law decay has also been found for the in-degree distribution. This result may equally be explained by the model introduced above: Equations 2 and 9, respectively, are also valid if instead of $p(k)$ one considers the conditional probability $p(l)\,p(i|l)$, where $p(l) = \frac{\sum_{i'} N_{i',l}}{\sum_{i'} N_{i'}}$ is the probability of assignment to a given level (the numerator sums over the categories $i'$ of level $l$), and $p(i|l) = \frac{N_{i,l}}{\sum_{i'} N_{i',l}}$ is the probability of being assigned to a given class within that level. The formation process may be seen as a Yule process within a level if $\sum_{i'} N_{i',l}$ is used for the normalization in Equation 2, and this formation happens with probability $p(l)$ that a website gets assigned into level $l$. Thereby, the rate $m_l$ at which new classes are created need not be the same for every level, and therefore the exponent of the power law fit may vary from level to level. Power law decay for the per-level class size distribution is a straightforward corollary of the described formation process, and will be used in Section 5 to analyse the space complexity of hierarchical classifiers.

Figure 11: Number of features vs rank distribution. (Log-log plot: number of categories with $d_i > d$ vs. category size in features $d$; fitted exponent 1.9.)

4. RELATION BETWEEN CATEGORY SIZE AND NUMBER OF FEATURES

Having explained the formation of two scaling laws in the database, a third one has been found for the number of features $d_i$ in each category, $G(d)$ (see Figures 11 and 12). This is a consequence of the category size distribution (shown in Figure 1) in combination with another power law, termed Heaps' law [10]. This empirical law states that the number of distinct words $R$ in a document is related to the length $n$ of a document as follows

$$R(n) = K n^{\alpha}, \qquad (11)$$

where the empirical exponent $\alpha$ is typically between 0.4 and 0.6. For the LSHTC2-DMOZ dataset, Figure 10 shows that similar exponents are found for the collection of words and for the collection of websites. An interpretation of this result is that the total number of words in a category can be measured approximately by the number of websites in a category, although not all websites have the same length.

Figure 10 shows that bigger categories also contain more features, but this increase is weaker than the increase in websites. This implies that fewer very feature-rich categories exist, which is also reflected in the high decay exponent of 1.9 of the power-law fit in Figure 11 (compared to the slower decay of the category size distribution shown in Figure 1, where the exponent is 1.1). Composing the size distribution measured in features with Heaps' law yields again the size distribution measured in websites: $P(i) = R(G(d_i))$, i.e. multiplying the two exponents yields 1.1, which confirms our empirically found value of 1.1.
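The exponent argument of Section 4 can be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions, not the fitting procedure used by the authors: the function names, the prefactor `k = 10.0` and the use of the Hill (maximum-likelihood) estimator are all choices of this example, and the Heaps exponent 0.59 is the value read from Figure 10. Category sizes are drawn from a power law with exponent 1.1, converted to feature counts with a Heaps-type relation, and the tail exponent of the feature counts is then estimated.

```python
import math
import random


def sample_pareto(n, exponent, x_min=1.0, seed=0):
    """Draw n values with P(X > x) = (x / x_min) ** -exponent for x >= x_min."""
    rng = random.Random(seed)
    # Inverse-transform sampling; 1 - random() lies in (0, 1].
    return [x_min * (1.0 - rng.random()) ** (-1.0 / exponent) for _ in range(n)]


def hill_exponent(values, x_min):
    """Maximum-likelihood (Hill) estimate of the tail exponent above x_min."""
    tail = [v for v in values if v >= x_min]
    return len(tail) / sum(math.log(v / x_min) for v in tail)


def features_from_sizes(sizes, k=10.0, heaps_exponent=0.59):
    """Heaps-type relation: number of features = k * size ** heaps_exponent."""
    return [k * n ** heaps_exponent for n in sizes]


sizes = sample_pareto(100_000, exponent=1.1)
features = features_from_sizes(sizes)
size_exp = hill_exponent(sizes, 1.0)
feature_exp = hill_exponent(features, 10.0)
print("estimated size exponent:", round(size_exp, 2))
print("estimated feature exponent:", round(feature_exp, 2))
print("feature exponent times Heaps exponent:", round(feature_exp * 0.59, 2))
```

Under these assumptions the feature-count distribution decays faster than the size distribution, and multiplying its exponent by the Heaps exponent recovers the size exponent, which is the relation the text uses to connect the values 1.9 and 1.1.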

Figure 10: Heaps law: number of distinct words vs. number of words, and vs number of documents. (Two log-log panels: number of features vs. number of words, and number of features vs. number of docs in collection; fitted exponents 0.59 and 0.53.)

5. SPACE COMPLEXITY OF LARGE-SCALE HIERARCHICAL CLASSIFICATION

Fat-tailed distributions in large-scale web taxonomies highlight the underlying structure and semantics, which are useful to visualize important properties of the data, especially in big data scenarios. In this section we focus on applications in the context of large-scale hierarchical classification, wherein the fit of a power law distribution to such taxonomies can be leveraged to concretely analyse the space complexity of large-scale hierarchical classifiers, in the context of a generic linear classifier deployed in a top-down hierarchical cascade.

In the following sections we first formally present the task of hierarchical classification and then proceed to the space complexity analysis for large-scale systems. Finally, we empirically validate the derived bounds.

5.1 Hierarchical Classification

In single-label multi-class hierarchical classification, the training set can be represented by $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$. In the context of text classification, $x^{(i)} \in \mathcal{X}$ denotes the vector representation of document $i$ in an input space $\mathcal{X} \subseteq \mathbb{R}^d$. The hierarchy in the form of a rooted tree is given by $G = (V, E)$, where $V \supseteq Y$ denotes the set of nodes of $G$, and $E$ denotes the set of edges with parent-to-child orientation. The leaves of the tree, which usually form the set of target classes, are given by $Y = \{u \in V : \nexists v \in V, (u, v) \in E\}$. Assuming that there are $K$ classes, the label $y^{(i)} \in Y$ represents the class associated with the instance $x^{(i)}$. The hierarchical relationship among categories implies a transition from generalization to specialization as one traverses any path from the root towards the leaves. This implies that the documents which are assigned to a particular leaf also belong to the inner nodes on the path from the root to that leaf node.

5.2 Space Complexity

The prediction speed for large-scale classification is crucial for its application in many scenarios of practical importance. It has been shown in [32; 3] that hierarchical classifiers are usually faster at training and test time than flat classifiers. However, given the large physical memory of modern systems, what also matters in practice is the size of the trained model with respect to the available physical memory. We, therefore, compare the space complexity of hierarchical and flat methods, which governs the size of the trained model in large-scale classification. The goal of this analysis is to determine the conditions under which the size of the hierarchically trained linear model is lower than that of the flat model.

As a prototypical classifier, we use a linear classifier of the form $w^T x$, which can be obtained using standard algorithms such as Support Vector Machine or Logistic Regression. In this work, we apply one-vs-all L2-regularized L2-loss support vector classification, as it has been shown to yield state-of-the-art performance in the context of large-scale text classification [12]. For flat classification one stores weight vectors $w_y, \forall y$, and hence in a $K$-class problem in a $d$-dimensional feature space, the space complexity for flat classification is:

$$\mathrm{Size}_{Flat} = d \times K \qquad (12)$$

which represents the size of the matrix consisting of $K$ weight vectors, one for each class, spanning the entire input space.

We need a more sophisticated analysis for computing the space complexity for hierarchical classification. In this case, the total number of weight vectors is much larger, since these are computed for all the nodes in the tree and not only for the leaves as in flat classification. In spite of this, the size of the hierarchical model can be much smaller than that of the flat model in large-scale classification. Intuitively, when the feature set size is high (top levels in the hierarchy), the number of classes is small, and on the contrary, when the number of classes is high (at the bottom), the feature set size is low.

In order to analytically compare the relative sizes of hierarchical and flat models in the context of large-scale classification, we assume power law behaviour with respect to the number of features, across levels in the hierarchy. More precisely, if the categories at a level in the hierarchy are ordered with respect to the number of features, we observe a power law behaviour. This has also been verified empirically, as illustrated in Figure 12 for various levels in the hierarchy, for one of the datasets used in our experiments. More formally, the feature size $d_{l,r}$ of the $r$-th ranked category, according to the number of features, for level $l$, $1 \le l \le L - 1$, is given by:

$$d_{l,r} \approx d_{l,1}\, r^{-\beta_l} \qquad (13)$$

where dl,1 represents the feature size of the category ranked leading to, for = 0, 1:
1 at level l and > 0 is the parameter of the power law.
b(l1)(1)
L1
Using this ranking as above, let bl,r represent the number Sizehier < bd1
of children of the r-th ranked category at level l (bl,r is the 1
l=1
branching factor for this category), and let Bl represents the (L1)(1)
where dl,1 represents
total number the feature
of categories sizel.ofThen
at level the category
the size ranked
of the leading to, for = 0,b1: 1
= bd1 (L 1)
1 at level l and > 0 is the parameter of the power law. (b(1) 1)(1 )
b(l1)(1)
entire hierarchical classification model is given by: L1 (1 )
Using this ranking as above, let bl,r represent the number Sizehier < bd1 (18)
of children of the r-th
L1 Bl
ranked category
L1 Bl
at level l (bl,r is the 1
Size Hier = b l,r d l,r
branching factor for this category), and let bl,rBdl,1 r l (14) where the last equality
l=1 is based on the sum of the first terms
(L1)(1)
l represents the
l=1 r=1 l=1 r=1
total number of categories at level l. Then the size of the of the geometric series b (b(1) )l . 1
= bd1 (L 1)
(L1)(1)
Here
entirelevel l = 1 corresponds
hierarchical classification to the rootisnode,
model by: B1 = 1.
given with If > 1, since b >(b(1) 1)(1 that
1, it implies ) b(1) (1 1
(b
)< 0.
1)(1)
Therefore, Inequality 18 can be re-written as: (18)

L1 Bl

L1 Bl
10000
Size Hier = bl,r dl,r bl,r dl,1 rl (14) where the last equality is based on the sum of the first terms
Level 2
l=1 r=1 l=1 r=1
Level 3 of the geometricSize
series <(1)
hier (b )l . 1) ( 1)
bd1 (L
(L1)(1)
1
Here level l = 1 corresponds to the rootLevel
node,4 with B1 = 1. If > 1, since b > 1, it implies that (bb(1) 1)(1)
with di>d
< 0.
1000 Using our notation, the size of the corresponding flat clas-
Therefore, Inequality
sifier is: Size 18 can be re-written as:
f lat = Kd1 , where K denotes the number of
10000 leaves. Thus:
Level 2 Sizehier < bd1 (L 1)
of categories
100 Level 3 K ( 1)
Level 4 If > (> 1), then Sizehier < Sizef lat
di>d
K b(L the
Using our notation, 1) size of the corresponding flat clas-
1000
sifier is:
which SizeCondition
proves f lat = Kd115. , where K denotes the number of
# with
10 leaves. Thus:
The proof for Condition 16 is similar: assuming 0 < < 1, it
# of categories

100 is this time the second K term in Equation 18 ((L 1) (1) )
If > (> 1), then Sizehier < Sizef lat
which is negative,K b(L 1) so that one obtains:
1
100 1000 10000 100000
which proves Condition 15. b(L1)(1) 1
10 # of features d Sizehier < bd1
The proof for Condition 16 is(bsimilar: (1) assuming
1)(1 ) 0 < < 1, it

is this time the second term in Equation 18 ((L 1) (1) )
and then:
Figure 12: Power-law variation for features in different levels which is negative, so that one obtains:
1
for LSHTC2-a
100 dataset, Y-axis
1000 represents the feature set
10000 size
100000 b(L1)(1) 1 1 (L1)(1)
If < K,
b then Sizehier 1 < Sizef lat
plotted against rank of the categories on X-axis (b(1) 1)
Sizehier < bd1 b
# of features d (b(1) 1)(1 )
We now state a proposition that shows that, under some con- which concludes the proof of the proposition.
and then:
Figure
ditions 12: Power-law
on the depth ofvariation for features
the hierarchy, in different
its number levels
of leaves,
for It canb(L1)(1)
be shown, but1 this 1 is beyond the scope of this paper,
its branching factors and power law parameters, the sizesize
LSHTC2-a dataset, Y-axis represents the feature set of If Condition <satisfiedK,
plotted againstclassifier
rank of is
the categories that 16 is forthen Sizehier
a range < Sizeof
of values
f lat
a hierarchical below that ofonitsX-axis
flat version. (b (1) 1) b
]0, 1[. However, as is shown in the experimental part, it is
WeProposition 1. For a hierarchy
now state a proposition that shows of that,
categories
underof depth
some L
con- which
Condition concludes the proof of1the
15 of Proposition thatproposition.
holds in practice.
and K leaves,
ditions on the let
depth =ofminthe1lL l andits
hierarchy, b= maxl,rof
number bl,rleaves,
. De- The previous proposition complements the analysis presented
noting the space complexity of alawhierarchical classification It can
in [32] be shown,it but
in which this isthat
is shown beyond the scope
the training andof test
this time
paper,of
its branching factors and power parameters, the size of
model by Sizehier and the one ofthat
its corresponding flat ver- that Condition
hierarchical 16 is satisfied
classifiers for a range
is importantly of values
decreased of
with respect
a hierarchical classifier is below of its flat version.
sion by Sizef lat , one has: ]0,
to 1[.
the However,
ones of their as isflatshown in the experimental
counterpart. In this workpart, it is
we show
Proposition 1. For a hierarchy of categories of depth L Condition
that the space 15 ofcomplexity
Propositionof1 hierarchical
that holds inclassifiers
practice. is also
K The previous
and KForleaves,
> 1, > min1lL l and
letif = (> b1),
= then
maxl,r bl,r . De- better, underproposition
a conditioncomplements
that holds in thepractice,
analysis presented
than the
noting the space complexity K b(Lof a hierarchical
1) (15)
classification in [32]
one in which
of their flat it is shown thatTherefore,
counterparts. the training forand
largetest time
scale of
tax-
model by Sizehier and the one Size of its corresponding
hier < Sizef lat flat ver- hierarchical
onomies whose classifiers
featureis size
importantly
distribution decreased
exhibitwith respect
power law
sion by Sizef lat , one has: to the ones
decay, of their classifiers
hierarchical flat counterpart.
should be In this work
better we show
in terms of
b(L1)(1) 1 1 that the
speed than space
flat complexity
ones, due toofthe hierarchical classifiers is also
following reasons:
0 < ><1,1,ifif > (1) K
ForFor <(> 1), K, then better, under a condition that holds in practice, than the
(bK b(L 1) 1) b then (16)
(15) one1.ofAs theirshown above, the space
flat counterparts. complexity
Therefore, of hierarchical
for large scale tax-
Size
Size <
hier
hier < Size
Size f
f lat lat onomies classifier
whoseisfeaturelower thansize flat classifiers.
distribution exhibit power law
decay, hierarchical classifiers should be better in terms of
Proof. As dl,1 b(L1)(1)
d1 and Bl 1b(l1)
1 for 1 l L, one 2. For
speed than K flat
classes,
ones,only dueO(log
to theK) classifiers
following need to be eval-
reasons:
For
has, 0 <Equation
from < 1, if 14 and
(1)
<
the definitions of K, thenb:
and uated per test document as against O(K) classifiers in
(b 1) b (16) flat shown
classification.
(l1)
1. As above, the space complexity of hierarchical
Size
L1 b
hier < Size f lat classifier is lower than flat classifiers.
Sizehier bd1 r In order to empirically validate the claim of Proposition 1,
Proof. As dl,1 d1 and Bll=1 (l1)
br=1 for 1 l L, one we2. measured
For K classes,the trained modelK)
only O(log sizes of a standard
classifiers need to top-down
be eval-
has, from Equation 14and b the
(l1) definitions
of and b: hierarchical
uated per scheme (TD), which
test document uses a O(K)
as against linear classifiers
classifier at in
One can then bound r=1 r using ([32]): each parent of the hierarchy, and the flat one.
flat classification.
(l1)
b(l1) (l1)(1) b
L1
We use the publicly available DMOZ data of the LSHTC
b bd1 r
Sizehier In order towhich empirically validate
r < for = 0, 1 (17) challenge is a subset of the claim ofMozilla.
Directory Proposition More 1,
1 l=1
r=1 we measuredwe
specifically, theusedtrained
the model sizes of aofstandard
large dataset top-down
the LSHTC-2010
r=1
b(l1) hierarchical scheme (TD), which uses a linear classifier at
One can then bound r=1 r using ([32]): each parent of the hierarchy, and the flat one.
b(l1)
We use the publicly available DMOZ data of the LSHTC
b(l1)(1) challenge which is a subset of Directory Mozilla. More
r < for = 0, 1 (17)
r=1
1 specifically, we used the large dataset of the LSHTC-2010

edition and two datasets were extracted from the LSHTC-2011 edition. These are referred to as LSHTC1-large, LSHTC2-a and LSHTC2-b respectively in Table 2. The fourth dataset (IPC) comes from the patent collection released by the World Intellectual Property Organization. The datasets are in the LibSVM format and have been preprocessed by stemming and stopword removal. Various properties of interest for the datasets are shown in Table 2.

Dataset        #Tr./#Test      #Classes   #Feat.
LSHTC1-large   93,805/34,880   12,294     347,255
LSHTC2-a       25,310/6,441    1,789      145,859
LSHTC2-b       36,834/9,605    3,672      145,354
IPC            46,324/28,926   451        1,123,497

Table 2: Datasets for hierarchical classification with the properties: Number of training/test examples, target classes and size of the feature space. The depth of the hierarchy tree for LSHTC datasets is 6 and for the IPC dataset is 4.

Table 3 shows the difference in trained model size (actual value of the model size on the hard drive) between the two classification schemes for the four datasets, along with the values defined in Proposition 1. The last column gives the quantity $\frac{K}{K - b(L-1)}$ of condition 15.

Dataset        TD     Flat   $\beta$   b     $K/(K - b(L-1))$
LSHTC1-large   2.8    90.0   1.62      344   1.12
LSHTC2-a       0.46   5.4    1.35      55    1.14
LSHTC2-b       1.1    11.9   1.53      77    1.09
IPC            3.6    10.5   2.03      34    1.17

Table 3: Model size (in GB) for flat and hierarchical models along with the corresponding values defined in Proposition 1. The last column gives the quantity $\frac{K}{K - b(L-1)}$.

As shown for the three DMOZ datasets, the trained model for flat classifiers can be an order of magnitude larger than for hierarchical classification. This results from the sparse and high-dimensional nature of the problem which is quite typical in text classification. For flat classifiers, the entire feature set participates for all the classes, but for top-down classification, the number of classes and features participating in classifier training are inversely related, when traversing the tree from the root towards the leaves. As shown in Proposition 1, the power law exponent $\beta$ plays a crucial role in reducing the model size of the hierarchical classifier.

6. CONCLUSIONS
In this work we presented a model in order to explain the dynamics that exist in the creation and evolution of large-scale taxonomies such as the DMOZ directory, where the categories are organized in a hierarchical form. More specifically, the presented process models jointly the growth in the size of the categories (in terms of documents) as well as the growth of the taxonomy in terms of categories, which to our knowledge have not been addressed in a joint framework. From one of them, the power law in category size distribution, we derived power laws at each level of the hierarchy, and with the help of Heaps' law a third scaling law in the features size distribution of categories which we then exploit for performing an analysis of the space complexity of linear classifiers in large-scale taxonomies. We provided a grounded analysis of the space complexity for hierarchical and flat classifiers and proved that the complexity of the former is always lower than that of the latter. The analysis has been empirically validated in several large-scale datasets showing that the size of the hierarchical models can be significantly smaller than the ones created by a flat classifier.

The space complexity analysis can be used in order to estimate beforehand the size of trained models for large-scale data. This is of importance in large-scale systems where the size of the trained models may impact the inference time.

7. ACKNOWLEDGEMENTS
This work has been partially supported by ANR project Class-Y (ANR-10-BLAN-0211), BioASQ European project (grant agreement no. 318652), LabEx PERSYVAL-Lab ANR-11-LABX-0025, and the Mastodons project Garguantua.

8. REFERENCES
[1] R. Babbar, I. Partalas, C. Metzig, E. Gaussier, and M.-R. Amini. Comparative classifier evaluation for web-scale taxonomies using power law. In European Semantic Web Conference, 2013.
[2] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509-512, 1999.
[3] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In Neural Information Processing Systems, pages 163-171, 2010.
[4] P. N. Bennett and N. Nguyen. Refined experts: improving classification in large taxonomies. In Proceedings of the 32nd international ACM SIGIR Conference on Research and Development in Information Retrieval, pages 11-18, 2009.
[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances In Neural Information Processing Systems, pages 161-168, 2008.
[6] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 78-87, 2004.
[7] A. Capocci, V. D. Servedio, F. Colaiori, L. S. Buriol, D. Donato, S. Leonardi, and G. Caldarelli. Preferential attachment in the growth of social networks: The internet encyclopedia wikipedia. Physical Review E, 74(3):036116, 2006.
[8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proceedings of the twenty-first international conference on Machine learning, ICML '04, pages 27-34, 2004.
[9] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networks with aging of sites. Physical Review E, 62(2):1842, 2000.

[10] L. Egghe. Untangling Herdan's law and Heaps' law: Mathematical and informetric arguments. Journal of the American Society for Information Science and Technology, 58(5):702-709, 2007.
[11] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. SIGCOMM.
[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
[13] T. Gao and D. Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In IEEE International Conference on Computer Vision (ICCV), pages 2072-2079, 2011.
[14] M. M. Geipel, C. J. Tessone, and F. Schweitzer. A complementary view on the growth of directory trees. The European Physical Journal B, 71(4):641-648, 2009.
[15] S. Gopal, Y. Yang, B. Bai, and A. Niculescu-Mizil. Bayesian models for large-scale hierarchical classification. In Neural Information Processing Systems, 2012.
[16] G. Jona-Lasinio. Renormalization group and probability theory. Physics Reports, 352(4):439-458, 2001.
[17] K. Klemm, V. M. Eguíluz, and M. San Miguel. Scaling in the structure of directory trees in a computer cluster. Physical Review Letters, 95(12):128701, 2005.
[18] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, 1997.
[19] T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, and W.-Y. Ma. Support vector machines classification with a very large-scale taxonomy. SIGKDD, 2005.
[20] B. Mandelbrot. A note on a class of skew distribution functions: Analysis and critique of a paper by H. A. Simon. Information and Control, 2(1):90-99, 1959.
[21] C. Metzig and M. B. Gordon. A model for scaling in firms' size and growth rate distribution. Physica A, 2014.
[22] M. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323-351, 2005.
[23] M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 2005.
[24] I. Partalas, R. Babbar, E. Gaussier, and C. Amblard. Adaptive classifier selection in large-scale hierarchical classification. In ICONIP, pages 612-619, 2012.
[25] P. Richmond and S. Solomon. Power laws are disguised Boltzmann laws. International Journal of Modern Physics C, 12(03):333-343, 2001.
[26] H. A. Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425-440, 1955.
[27] C. Song, S. Havlin, and H. A. Makse. Self-similarity of complex networks. Nature, 433(7024):392-395, 2005.
[28] H. Takayasu, A.-H. Sato, and M. Takayasu. Stable infinite variance fluctuations in randomly amplified Langevin systems. Physical Review Letters, 79(6):966-969, 1997.
[29] C. J. Tessone, M. M. Geipel, and F. Schweitzer. Sustainable growth in complex networks. EPL (Europhysics Letters), 96(5):58005, 2011.
[30] K. G. Wilson and J. Kogut. The renormalization group and the ε expansion. Physics Reports, 12(2):75-199, 1974.
[31] G.-R. Xue, D. Xing, Q. Yang, and Y. Yu. Deep classification in large-scale text hierarchies. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, pages 619-626, 2008.
[32] Y. Yang, J. Zhang, and B. Kisiel. A scalability analysis of classifiers in text categorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '03, pages 96-103, 2003.
[33] G. U. Yule. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character, 213:21-87, 1925.
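As a small numerical companion to Proposition 1, the sketch below checks the bound of Equation (17) for a few values and tests condition (15) against the exponent and threshold values copied from Table 3. It is only an illustration under stated assumptions: the function names are hypothetical, the bound is used in the form $(n^{1-\beta} - \beta)/(1-\beta)$ with $n = b^{(l-1)}$, and the thresholds are taken as reported in the table rather than recomputed.

```python
def partial_sum(n, beta):
    """Left-hand side of Equation (17): sum_{r=1}^{n} r^(-beta)."""
    return sum(r ** -beta for r in range(1, n + 1))


def bound_17(n, beta):
    """Right-hand side of Equation (17); not defined for beta == 1."""
    return (n ** (1.0 - beta) - beta) / (1.0 - beta)


def threshold_15(K, b, L):
    """The quantity K / (K - b(L-1)) that beta must exceed in condition (15)."""
    return K / (K - b * (L - 1))


def satisfies_15(beta, threshold):
    """Condition (15): the hierarchical model is smaller when beta > threshold."""
    return beta > threshold


# (beta, reported threshold) pairs copied from Table 3; nothing is recomputed here.
TABLE_3 = {
    "LSHTC1-large": (1.62, 1.12),
    "LSHTC2-a": (1.35, 1.14),
    "LSHTC2-b": (1.53, 1.09),
    "IPC": (2.03, 1.17),
}

if __name__ == "__main__":
    # The bound is strict for n >= 2 (for n == 1 both sides equal 1).
    for beta in (0.5, 1.5, 2.0):
        for n in (2, 10, 1000):
            assert partial_sum(n, beta) < bound_17(n, beta)

    for name, (beta, threshold) in TABLE_3.items():
        print(name, "condition (15) holds:", satisfies_15(beta, threshold))

    # Hypothetical taxonomy (not from the paper): K=100 leaves, b=5, L=3.
    print("example threshold:", threshold_15(100, 5, 3))
```

For every dataset in Table 3 the reported exponent exceeds the reported threshold, which agrees with the smaller TD model sizes listed there.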

Interview: Michael Brodie, Leading Database Researcher,
Industry Leader, Thinker
Gregory Piatetsky
KDnuggets
Brookline, MA
gregory@kdnuggets.com
ABSTRACT
We discuss the most important database research advances, industry developments, role of relational and NoSQL databases, Computing Reality, Data Curation, Cloud Computing, Tamr and Jisto startups, what he learned as Chief Scientist of Verizon, Knowledge Discovery, Privacy Issues, and more.

Keywords
Data Curation, NoSQL, Cloud Computing, Verizon, Privacy, Computing Reality.

1. INTRODUCTION
I had the pleasure of working with Michael Brodie when we were both at GTE Laboratories in the 1990s, where he was already a world-famous researcher and a department manager. I recently met him at another conference, and our discussion led to this interview. Michael is still very sharp, very active, and busy - he answered these questions while flying from Boston to Doha, Qatar where he is advising Qatar Computing Research Institute.
Parts of this interview were published in KDnuggets [1-3].

2. BACKGROUND
Dr. Michael L. Brodie [4] has served as Chief Scientist of a Fortune 20 company, an Advisory Board member of leading national and international research organizations, and an invited speaker and lecturer. In his role as Chief Scientist Dr. Brodie has researched and analyzed challenges and opportunities in advanced technology, architecture, and methodologies for Information Technology strategies. He has guided advanced deployments of emergent technologies at industrial scale, most recently Cloud Computing and Big Data.

Throughout his career Dr. Brodie has been active in both advanced, academic research and large-scale industrial practice attempting to obtain mutual benefits from the industrial deployment of innovative technologies while helping research to understand industrial requirements and constraints. He has contributed to multi-disciplinary problem solving at scale in contexts such as Terrorism and Individual Privacy, and Information Technology Challenges in Healthcare Reform.

3. INTERVIEW

Gregory Piatetsky: You have started as a researcher in Databases (PhD from Toronto) and had a very distinguished and varied career spanning academia, industry, and government, in US, Europe, Australia, and Latin America over the last 25+ years. From your unique vantage point, what were the 3 most important database research advances?

Michael Brodie: Three most important database research advances:

1. Ted Codd's Relational model of data (1970) is the most important database research advance as it launched what is now a $28 BN/year market still growing at 11% CAGR with over 215 RDBMSs on the market. More important to me it launched four decades of amazing research advances starting with query optimization (Selinger) and transactions (Gray) and innovation that has probably grown at 20% CAGR.
2. The next most important research advance or stage was a change in perspective that specific domains require their own DBMS such as graph databases, array stores, document stores, key-value stores, NoSQL, NewSQL, and many more to come. DB-Engines.com lists twelve DBMS categories thus bumping the database world from managing 8% of the world's data to about 12% but due to the growth of non-database data back to 10%. Soon, due to the role of data in our digitized world there will be data management systems for many more domains. While this is amazingly cool, how do we solve multi-disciplinary (multi-data domain) problems in a consistent rather than disjoint way?
3. The next most important research advance is just emerging and is mind blowing. I call it Computing Reality, acknowledging that every datum (every real world observation) is not definitive but probabilistic. Unlike conventional databases and more like reality, Computing Reality has no single version of truth. How do we model such worlds, more realistic worlds and compute over them? The simple answer is that it is already in Big Data sources. There are many related attempts to address Computing Reality including social computing, probabilistic computing, probabilistic databases, Open Worlds in AI, Web Science, Approximate Computing, Crowd Computing, and more. Perhaps this will be the next generation of computing.

GP: What about the most important database industry developments?

MB: Alas the database industry, like all industries, has a legacy problem that stifles innovation. It has taken over 30 years to emerge from the relational era. The most important recent database industry development came from outside the database industry, it is Big Data and its marketing arm called MapReduce and its data sidekicks, Hadoop and NoSQL. Frankly, the database industry has been insular and protected its relational turf for FAR too long. Smart folks at Yahoo!, Google and other places saw value in data, non-database data, and thus emerged MapReduce, Hadoop, and NoSQL - generally crappy database ideas but it woke up the database industry¹. Hadoop and NoSQL are growing in demand. In time it will be seen that they are amazing for a very specific problem domain, embarrassingly parallel problems, but it is a money pit for everything else. The importance of MapReduce is that it forced the database industry to get out of their hammocks.

¹ On June 25, 2014 Google launched Cloud Dataflow replacing MapReduce and marking the decline of MR and Hadoop as predicted at launch in 2010 by Mike Stonebraker.

GP: What is the role of Relational Databases, NoSQL databases, Graph databases, and other databases today?

MB: Relational Databases have two extremely well established roles. Conventional row stores serve the OLTP community as the backbone of enterprise operations. These blindingly fast transaction processors are moving in-memory. OLTP stores are modest in number and size (< 1 TB) growing and declining in lock step with business growth and decline. Column stores, OLAP, are the backbone of data warehouses and until recently business intelligence. In general there are huge numbers of these, often of very large size in the Petabyte and Exabyte range. This is where Big Data battle lines are being drawn. What fun!!

This is also where we turn from polishing the relational round ball [5] and focus on the other dozen or so DBMS categories. Taking over is relative; none of the 12 other categories has more than 3% of the database market. Graph databases serve graph applications like networking in communications, telecom, social networks, and of course NSA applications! But what is wonderful about these emerging classes of data-domain specific DBMSs is that we are only now discovering the rich use cases that they serve.

The use cases define the DBMSs and the DBMSs help formulate the use cases. SciDB is a superb example of managing scientific data and computation at scale. It is awkward for both communities: database folks who don't speak linear algebra or matrices, and scientists who only speak R. Exciting times. For a little fun look at the database-engines list [6].

Database Engines
DB-Engines lists 216 different database management systems, which are classified according to their database model (e.g. relational DBMS, key-value stores etc.). This pie-chart shows the number of systems in each category. Some of the systems belong to more than one category.

Popularity changes per category, April 2014, over 1 year
- Graph DBMS growing dramatically 3.5X
- Wide column stores 2X
- Document stores 2X
- Native XML DBMS 1.5X
- Key-value stores 1.5X
- Search engines 1.5X
- RDF stores 1.5X
- Object oriented DBMS - flat
- Multivalue DBMS - flat
- Relational DBMS - flat

GP: You have held an amazing variety of positions in academy, industry, government organization, VC firms, and start-ups, in US, Brazil, Canada, Australia, and Europe. Which 3 positions were most satisfying to you and why?

MB: What a great question. Thank you for asking because it caused me to think about what I have really enjoyed over 40 years. Somehow CSAIL at MIT and the Faculty of Computing and Communications at EPFL jump to mind.

There are scary smart people at those places. Like climbing mountains it both scares and exhilarates me. To be frank my jobs at big enterprises in hindsight are confusing. I guess I was window dressing because my role did not feel like it had impact. So getting motivated and scared at MIT and EPFL are probably top, so there's number one. Why? Just look down 5,000 feet and ask why am I here?

Second is a combination of Advisory roles at US Academy of Science, DERI, STI, ERCIM, Web Science Trust, and others because they gave me a sense of collaborating, challenging, and contributing. How cool is that?

Third would be working at startups like Tamr and Jisto. Imagine

waking up in the morning and thinking you might change the world. That requires that I conceive the world not just differently, but so that it solves someone's REAL problem. Even more cool.

Gregory Piatetsky: Currently you are an adviser at a startup called Tamr [7], co-founded by another leading DB researcher and serial entrepreneur Michael Stonebraker. What can you tell us about Tamr and its product?

Michael Brodie: Consider the data universe. Since the 1980s I have said in keynotes that the database and business worlds deal with less than 10% of the world's data most of which is structured, discrete, and conforms to some schema. With the Web and Internet of Things in the 1990s massive amounts of unstructured data began to emerge with a growth rate that was inconceivable while shrinking database data to less than 8%. EMC/IDC claims [14] that our Digital Universe is 4.4 zettabytes and will double every two years until 2020 when it will be 44 zettabytes.

["If you are constantly amazed at the growth of the Digital World, you don't understand it yet." A profound, casual comment of my departed friend, Gerard Berry, Academie Francais.]

In 1988 or so you, Gregory, and a few others saw the potential of data with your knowledge discovery in databases, then a radical idea. Little did others, including me, realize the potential of this, now named Big Data. Even though Big Data is hot in 2014, almost 30 years later, its application, tools, and technologies are in their infancy, analogous to the emergence of the Web in the early 1990s. Just as the Web has and is changing the world, so too will Big Data.

Compared with database data, Big Data is crazy. It's largely not understood hence it is schema-less or model-less. Big Data is inconceivably massive, dirty, imprecise, incomplete, and heterogeneous beyond anything we've seen before. Yet it trumps finite, precise, database data in many ways hence is a treasure trove of value. Big Data is qualitatively different from database data that is a small subset of Big Data - EMC/IDC claims 1% as of 2013. It offers far greater potential thus value and requires different thinking, tools, and techniques. Database data is approached top-down. Telco billing folks know billing inside out so they create models that they impose, top-down on data. Data that does not comply is erroneous. Database data, like Telco bills must be precise with a single version of truth, so that the billing amount is justifiable. Due in part to scale, Big Data must be approached bottom up. More fundamentally, we should let data speak; see what models or correlations emerge from the data, e.g., to discover if adding strawberry to the popsicle line-up makes sense (a known unknown) or to discover something we never thought of (unknown unknowns). Rather than impose a preconceived, possibly biased, model on data we should investigate what possible models, interpretations, or correlations are in the data (possibly in the phenomena) that might help us understand it.

Hence, the new paradigm is to approach Big Data bottom-up due to the scale of the data and to let the data speak. Big Data is a different, larger world than the database world. The database world (small data) is a small corner of the Big Data world. Correspondingly Big Data requires new tools, e.g., Big Data Analytics, Machine Learning (the current red haired child), Fourier transforms, statistics, visualizations, in short any model that might help elucidate the wisdom in the data. But how do you get Big Data, e.g., 100 data sources, 1,000, 100,000 or even 500,000, into these tools? How do you identify the 5,000 data sources that include Sally Blogs and consolidate them into a coherent, rational, consistent view of dear Sally? When questions arise in consolidating Sally's data, how do you bring the relevant human expertise, if needed, to bear at scale on 1 million people? Many successful Big Data projects report that this data curation process takes 80% of the project resources leaving 20% for the problem at hand. Data curation is so costly because it is largely manual hence it is error prone. That's where Tamr comes to the rescue. It is a solution to curate data at scale.

We call it collaborative data curation because it optimizes the use of indispensable human experts. Data Curation is for Big Data what Data Integration is for small data.

Data Curation is bottom up and Data Integration is top down. It took me about a year to understand that fundamental difference. I have spent over 20 years of my professional life dealing with those amazing Data Integration platforms and some of the world's largest data integration applications. Those technologies and platforms apply beautifully to database data - small data; they simply do not apply to Big Data.

To emphasize what is ahead, here is a prediction. Data Integration is increasingly crucial to combining top-down data into meaningful views. Data Integration is a huge challenge and huge market that will not go away. Big Data is orders of magnitude larger than small or database data. Correspondingly Data Curation will be orders of magnitude larger than Data Integration. The world will need Data Curation solutions like Tamr to let data scientists focus on analytics, the essential use and value of big data, while containing the costs of data preparation. In addition to Tamr there are over 65 very cool data curation products contributing to addressing the growing need and creating a new software market. What is also cool about data curation is that it can be used to enrich the existing information assets that are the core of most enterprises' applications and operations. Of course, the really cool potential of data curation is that it makes Big Data analytics more efficiently available to allow users to discover things that they never knew! How cool is that?

For more on Data Curation at scale, see Stonebraker [8].

GP: You also advise another startup, Jisto. What can you tell
us about your role there?

MB: I am having a blast with Jisto [9]: some amazingly talented
young engineers [PhDs actually] with lots of energy and a killer
idea. Jisto is an exceptional example of the quality you ask about
in the next question.

Cloud computing enabled by virtualization is radically changing
the world by reducing the cost and increasing the availability of
computing resources. Can you imagine that only 50% of the
world's servers are virtualized?

Pop quiz [do not cheat and read ahead].

What is the average CPU utilization of physical servers,
worldwide? Of virtual servers?

Answer: Virtual machine CPU utilization is typically in the 30-50%
range while physical servers are 10-20%, due to risk and
scheduling, but mostly cultural challenges.

Jisto enables enterprises to transparently run more
compute-intensive workloads on these paid-for but unused resources
whether on premises or in public or private clouds, thus reducing
costs by 75-90% over acquiring more hardware or cloud
resources.

Jisto provides a high-performance, virtualized, elastic
cloud-computing environment from underutilized enterprise or cloud
computing resources (servers, laptops, etc.) without impacting the
primary task on those resources. Organizations that will benefit
most from Jisto are those that run parallelized compute-intensive
applications in the data center or in private and public clouds (e.g.,
Amazon Web Services, Windows Azure, Google Cloud Platform,
IBM SmartCloud).

Jisto is currently looking for early adopters for its beta program
who will gain significant reduction in the cost of their computing,
possibly avoiding costly data center expansion.

GP: You are also a Principal at First Founders Limited. What
do you look for in young business ventures - how do you
determine quality?

MB: There are armies of people who evaluate the potential of
startups. The professional ones are called "Venture Capitalists"
(VCs). The retired ones are called Angels. Like any serious
problem there is due diligence to determine and evaluate the
factors relevant to the business opportunity, the technology, the
business plan, etc. as the many books [10] and formulas suggest.

If you are reading a book, then you don't know. Ultimately it
comes down to good taste developed over years of successful
experience. Andy Palmer, a serial entrepreneur, good friend, and
very smart guy said "Do it once really well then repeat." Andy
ought to know, Tamr is about his 25th startup.

At First Founders I can do some technology, Jim can do finance,
Howard can do business plans. Collectively we make a judgment.
But good VCs are the wizards. They have Rolodexes. When their
taste says maybe they refer the startup to the relevant folks in their
network who essentially do the due diligence for them. Like at
First Founders, the judgment is crowd sourced, actually what we
call at Tamr, it is expert sourced. I have a growing trust of the
crowd and especially of the expert crowd.

GP: You were a Chief Scientist at Verizon for over 10 years
(and before that at GTE Labs which became part of Verizon).
What were some of the most interesting projects you were
involved in at GTE and Verizon?

MB: The technical challenge that stays with me is that addressed
by the Verizon Portal, Verizon's solution for Enterprise
Telecommunications, providing Telecommunication services to
enterprise customers, such as Microsoft. Verizon, like all large
Telcos, is the result of the merger & acquisition of 300+ smaller
Telcos. Each had at least 3 billing systems; hence Verizon
acquired over 1,000 billing systems. Billing is only one of over a
dozen systems categories, including sales, marketing, ordering,
and provisioning. Providing a customer like Microsoft with a
telephone bill for each Microsoft organization requires integrating
data potentially from over 1,000 databases. As is the case for most
enterprises, Verizon and Microsoft reorganize constantly,
complicating the sources to be integrated, like Microsoft, and the
targets, Verizon's changing businesses, e.g., wireline and FiOS.
Every service company faces this little-discussed massive
challenge.

Integrating 1,000s of operational systems is a backward looking
problem. The cool forward-looking problem was Verizon IT's
Standard Operating Environment (SOE). Prior to cloud platforms
and cloud providers, Verizon IT (actually one team) sought to
develop an SOE onto which Verizon's major applications (over
6,000) could be migrated to be managed virtually on an internal
cloud. What a fun challenge. When the team left Verizon as a
group, over 60 major corporate applications, including SAP, had
been migrated. Smart folks, good solution that failed in Verizon.
In industry, challenges are 80-20; 80% political, 20%
technical. The SOE is being reborn in the infrastructure of
another major infrastructure corporation.

Finally, the next most interesting and yet unsolved industry
challenge was getting over the legacy that Mike Stonebraker and I
addressed in [11].
How do you keep a massive system up to date in terms of the
application requirements and the underlying technology or
migrate it to a modern, efficient, more cost effective platform?

Enterprises tend to invest only in new revenue generating
opportunities, often leaving the legacy problem to grow and grow.
So existing systems like billing languish and accumulate. It's like
a teenager never tidying their room for 60 years. "Now where
are my blue shoes?" I suggested to Mike Stonebraker that we
rewrite our 1995 book. He did not even respond to the email,
suggesting that it is largely a political problem and not technical,
no matter the brilliant technical solution provided.

Lesson: If you are a CIO, clean up your goddamn room; you're
not going out until you do!

GP: Around 1989 when you were a manager at GTE Labs and
I was a member of technical staff there, you were somewhat
skeptical of the idea I proposed for research into Knowledge
Discovery in Databases (then called KDD or Data Mining, and
more recently Predictive Analytics, and Data Science). The
field has progressed significantly since then. From your point
of view, what are the main successes and disappointments of
KDD/Data Mining/Predictive Analytics and can Data Science
become an actual science?

MB: My current research concerns the scientific and
philosophical underpinnings of Big Data and Data Science. With
Big Data we are undergoing a fundamental shift in thinking and in
computing. Big Data is a marvelous tool to investigate "What":
correlations or patterns that suggest that things might have or will
occur.

Big Data's weakness is that it says nothing about "Why":
causation or why a phenomenon occurred or will occur.
A pernicious aspect of "What" is the biases that we bring to it. On
a personal note, my biased recall of 1989 was how marvelous
your ideas were and the amazing potential of data mining. I accept
your view that I was skeptical rather than enthusiastic as I recall.
You see I modified reality to fit my desire to be on the winning
side, which I was not then. Hence, what we think that we thought
may bear little resemblance to reality or, more precisely, other
people's reality. As Richard Feynman said,

"The first principle is that you must not fool yourself - and you
are the easiest person to fool."

That said, I see the main successes of this trend as a nascent
trajectory along the lines of Big Data, Data Analytics, Business
Intelligence, Data Science, and whatever the current trendy term
is. The World of "What" is phenomenal: machines proposing
potential correlations that are beyond our ability to identify.
Humans consider seven plus or minus 2 variables at a time, a
rather simple model, while models, such as Machine Learning,
can consider millions or billions of variables at a time. Yet 95%
(or even 99.99999%) of the resulting correlations may be
meaningless. For example, ~99% of credit card transactions are
legitimate with less than 1% that are fraudulent, yet the 1% can
kill the profits of a bank. So precision and outlier cases, called
anomalies in science, can matter. So it pays to search for
apparently anomalous behavior as it is happening!

We have already seen massive benefits of Big Data in the stock
market, electoral predictions, marketing success, and many more
that underlie the Big Data explosion. Yet there is a potential Big
Data Winter ahead if people blindly apply Big Data and more
specifically Machine Learning. The failures concern limited
models of phenomena and the human tendency of bias. People can
and do use "What" (Big Data, etc.) to support their biases and
limited models, e.g., used to support the claim of the absence of
climate change or lack of human impact on climate change, rather
than letting the data speak to suggest directions and models that
we may never have thought of. As it has always been, it takes
courage to change from a discrete world of top-down models [I
know how this works!] to an ambiguous, probabilistic world
[What possible ways does this work?].

Those are natural successes and limitations of an emerging field.
The direction, opportunities, and changes are profound. I
experience a mix of fear and tingles thinking of asking the data to
speak. Hoping that I can be open to what it says and
distinguishing s..t from Shinola.

I call the vision Computing Reality. It may be the Next
Generation of Computing.

GP: In your very insightful report of the White House-MIT
Big Data Privacy Workshop [12] you have a quote "Big data
has rendered obsolete the current approach to protecting
privacy and civil liberties". Will people get used to much less
privacy (as the digitally-savvy younger people seem to be) or
will government regulation and/or technology be able to
protect privacy? How will this play in US vs. Europe vs. other
regions of the world?

MB: As an undergraduate at the University of Toronto, I was
extremely fortunate to have had Kelly Gotlieb, the Father of
Computing in Canada, as a mentor. I was a student in his 1971
course, Computers and Society, later to become the first book on
the topic. Kelly and the issues, including privacy, have resonated
with me throughout my career. Kelly observed that privacy, like
many other cultural norms, varies over time. So yes, privacy will
fluctuate from Alan Westin's notion of determining how your
personal information is communicated to the Facebook-esque "Get
over it".

While personal privacy is undergoing significant change,
disclosure of information assets that are part of the digital
economy or of government or corporate strategy may have very
significant impacts on our economy and democracy. Hence, this
raises issues of security, protection, and cultural and social issues
too complex to be treated here.

However, there are a number of very smart people looking at
various aspects. The quote you cite is from Craig Mundie [13] who
explores changes that Big Data brings, debating the balancing of
economic versus privacy issues.

Very smart folks, like Butler Lampson and Mike Stonebraker, are
commenting on practical solutions to this age-old problem. Their
arguments are along the following lines. Due to the massive scale
of Big Data, and what I call Computing Reality, previously
top-down solutions for security, such as anticipating and preventing
security breaches, will simply not scale to Big Data. They must be
augmented with new approaches including bottom-up solutions
such as Stonebraker's logging to detect and stem previously
unanticipated security breaches and Weitzner's accountable
systems.

"To beat the Heartbleed bug and others like it, Organizations need
to be able to detect attackers and issues well after they have made
it through their gates, find them, and stop them before damage can
occur," Gazit, a leading cyber security expert, said recently. "The
only way to achieve such a laser-precision level of detection is
through the use of hyper-dimensional big data analytics,
deploying it as part of the very core of the defense mechanisms."
"Big Data has rendered obsolete the current approach to
protecting privacy and civil liberties."

Hence, Big Data requires a shift from a focus on top-down
methods of controlling data generation and collection to a focus
on data usage. Not only do top-down methods not scale, "Tightly
restricting data collection and retention could rob society of a
hugely valuable resource" [13]. Adequate, let alone complete,
solutions will take years to develop.

GP: What interesting technical developments do you expect in
Database and Cloud Technology in the next 5 years?

MB: I call the Big Picture Computing Reality in which we model
the world from whatever reasonable perspectives emerge from the
data and are appropriate, e.g., have veracity, and make decisions
symbiotically with machines and people collaborating to optimize
resources while achieving measures of veracity for each result.

One subspace of this world is what we currently know with high
levels of confidence, the type of information that we store in
relational databases. Another encompassing space is what we
know but forgot or don't want to remember (unknown knowns)
and a third is what we speculate but do not know (known
unknowns); these are all the hypotheses that we make but do not
know in science, business, and life.

The rest of the data space - unknown unknowns - is infinite;
otherwise learning would be at an end. That is the space of
discovery.

I am investigating Computing Reality to investigate the entire
space with the objective of accelerating Scientific Discovery.
This is practically interesting because very little of our world is
discrete, bounded, finite, or involves a single version of truth, yet
that is the world of most computing. With Computing Reality we
hope to be far more pragmatic and realistic. This is technically
and theoretically interesting because we have almost no
mathematical or computing models in these areas. Those that exist
are just emerging or are massively complex. How cool is that?
You see what old retired guys get to do?

GP: What do you like to do in your free time? What recent
book did you like?

MB: Free time, what a concept! My yoga teacher, Lynne,
recommended that I should try to do nothing one day, and I will. I
will. Soon. Really. Life is such a blast; it's hard to keep still.

(Michael Brodie and his son on a peak in New Hampshire)

My activities include the gym (4 times a week); hiking/climbing
~75 mountains: USA, Nepal, Greece, Italy, France, Switzerland,
and even Australia; 42 of the 48 4,000 footers in NH (most with
Mike Stonebraker); cooking (daily and special occasions with my
son Justin, an amazing chef and brewer, when he's not doing his
PhD), travel, and my garden; all of these - except the gym and
garden - with family and close friends.

Very cool Big Data Books:
Big Data: A Revolution That Will Transform How We Live, Work,
and Think, by Viktor Mayer-Schonberger, Kenneth Cukier,
Houghton Mifflin Harcourt, confused and inspired me, then
The Signal and the Noise: Why So Many Predictions Fail-but
Some Don't, by Nate Silver, Penguin Press, inspired me.

Real books
- Ken Follett's The Pillars of the Earth; Century Trilogy
  (Fall of Giants, Winter of the World and Edge of
  Eternity)
- Henning Mankell's The Fifth Woman (A Kurt
  Wallander Mystery)

GP: You just returned from Doha, Qatar where you were
advising the Qatar Computing Research Institute (QCRI) -
quite far from Silicon Valley, New York, or Boston. What is
happening there and what computing research are they
doing?

MB: This was my first visit to Qatar that was remarkable
culturally and intellectually. Culturally I saw the spectacular result of
hydrocarbon wealth and vision, e.g., amazing architecture
emerging from the desert. Intellectually I saw the beginnings of
Qatar's National Vision 2030 to transform Qatar's economy from
hydrocarbon-based to knowledge-based.

One step in this direction by the Qatar Foundation was to create
the Qatar Computing Research Institute (QCRI). In less than three
years QCRI has established the beginnings of a world-class
computer science research group seeded with world-class
researchers in strategically important areas such as Social
Computing, Data Analysis, Cyber Security, and Arabic Language
Technologies (e.g., Machine Learning and Translation) amongst
others. Each group already has multiple publications over several
years in the leading conferences in their areas, e.g., SIGMOD and
VLDB for Data Analysis. I spent my time reviewing with them
what I consider to be some of the most challenging issues in Big
Data.

4. REFERENCES
[1] Gregory Piatetsky, Exclusive Interview: Michael Brodie,
Leading Database Researcher, Industry Leader, Thinker, in
KDnuggets, April 2014,
http://www.kdnuggets.com/2014/04/michael-brodie-database-researcher-leader-thinker.html
[2] Gregory Piatetsky, Interview (part 2): Michael Brodie on Data
Curation, Cloud Computing, Startup Quality, Verizon, in
KDnuggets, May 2014,
http://www.kdnuggets.com/2014/04/interview-michael-brodie-2-data-curation-cloud-computing-verizon.html
[3] Gregory Piatetsky, Interview (part 3): Michael Brodie on
Industry Lessons, Knowledge Discovery, and Future Trends, in
KDnuggets, May 2014,
http://www.kdnuggets.com/2014/05/interview-michael-brodie-3-industry-lessons-knowledge-discovery-trends-qcri.html
[4] www.michaelbrodie.com/michael_brodie.asp
[5] Michael Stonebraker. Are We Polishing a Round Ball? Panel
Abstract. ICDE, page 606. IEEE Computer Society, 1993
[6] Database Engines List, http://db-engines.com/en/blog_post/23
[7] Gregory Piatetsky, Exclusive: Tamr at the New Frontier of Big
Data Curation, KDnuggets, May 2014,
http://www.kdnuggets.com/2014/05/tamr-new-frontier-big-data-curation.html
[8] Stonebraker et al., Data Curation at Scale: The Data Tamer
System. In CIDR 2013 (Conference on Innovative Data Systems
Research).
[9] http://jisto.com/
[10] R. Field, "Disciplined Entrepreneurship: 24 Steps to a
Successful Startup by Bill Aulet", Journal of Business & Finance
Librarianship, vol. 19, no. 1, pp. 83-86, Jan. 2014.
[11] M. Brodie and M. Stonebraker. Legacy Information Systems
Migration: The Incremental Strategy, Morgan Kaufmann
Publishers, San Francisco, CA (1995) ISBN 1-55860-330-1
[12] M. Brodie, White House-MIT Big Data Privacy Workshop
Report, KDnuggets, Mar 27, 2014.
http://www.kdnuggets.com/2014/03/white-house-mit-big-data-privacy-workshop-report.html
[13] Craig Mundie, Privacy Pragmatism: Focus on Data Use, Not
Data Collection, Foreign Affairs, March/April 2014.
[14] The Digital Universe of Opportunities: Rich Data and the
Increasing Value of the Internet of Things. IDC/EMC, April 2014.

About the authors:

Gregory Piatetsky is a Data Scientist, co-founder of KDD
conferences and ACM SIGKDD society for Knowledge
Discovery and Data Mining, and President and Editor of
KDnuggets. He tweets about Analytics, Big Data, Data Science,
and Data Mining at @kdnuggets.

SIGKDD Explorations Volume 16, Issue 1