Modeling and Simulation in Science, Engineering and Technology

Series Editor
Nicola Bellomo
Politecnico di Torino
Italy

Advisory Editorial Board

M. Avellaneda (Modeling in Economics)
Courant Institute of Mathematical Sciences
New York University
251 Mercer Street
New York, NY 10012, USA
avellaneda@cims.nyu.edu

K.J. Bathe (Solid Mechanics)
Department of Mechanical Engineering
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
kjb@mit.edu

P. Degond (Semiconductor and Transport Modeling)
Mathématiques pour l’Industrie et la Physique
Université P. Sabatier Toulouse 3
118 Route de Narbonne
31062 Toulouse Cedex, France
degond@mip.ups-tlse.fr

A. Deutsch (Complex Systems in the Life Sciences)
Center for Information Services and High Performance Computing
Technische Universität Dresden
01062 Dresden, Germany
andreas.deutsch@tu-dresden.de

M.A. Herrero Garcia (Mathematical Methods)
Departamento de Matematica Aplicada
Universidad Complutense de Madrid
Avenida Complutense s/n
28040 Madrid, Spain
herrero@sunma4.mat.ucm.es

W. Kliemann (Stochastic Modeling)
Department of Mathematics
Iowa State University
400 Carver Hall
Ames, IA 50011, USA
kliemann@iastate.edu

H.G. Othmer (Mathematical Biology)
Department of Mathematics
University of Minnesota
270A Vincent Hall
Minneapolis, MN 55455, USA
othmer@math.umn.edu

L. Preziosi (Industrial Mathematics)
Dipartimento di Matematica
Politecnico di Torino
Corso Duca degli Abruzzi 24
10129 Torino, Italy
luigi.preziosi@polito.it

V. Protopopescu (Competitive Systems, Epidemiology)
CSMD
Oak Ridge National Laboratory
Oak Ridge, TN 37831-6363, USA
vvp@epmnas.epm.ornl.gov

K.R. Rajagopal (Multiphase Flows)
Department of Mechanical Engineering
Texas A&M University
College Station, TX 77843, USA
krajagopal@mengr.tamu.edu

Y. Sone (Fluid Dynamics in Engineering Sciences)
Professor Emeritus
Kyoto University
230-133 Iwakura-Nagatani-cho
Sakyo-ku Kyoto 606-0026, Japan
sone@yoshio.mbox.media.kyoto-u.ac.jp

Dynamics On and Of
Complex Networks
Applications to Biology,
Computer Science, and the Social
Sciences

Niloy Ganguly
Andreas Deutsch
Animesh Mukherjee
Editors

Birkhäuser
Boston • Basel • Berlin
Editors
Niloy Ganguly
Indian Institute of Technology
Department of Computer Science and Engineering
Kharagpur 721302
India
niloy@cse.iitkgp.ernet.in

Andreas Deutsch
Center for Information Services and High Performance Computing
Technische Universität Dresden
01062 Dresden
Germany
andreas.deutsch@tu-dresden.de

Animesh Mukherjee
Indian Institute of Technology
Department of Computer Science
and Engineering
Kharagpur 721302
India
animeshm@cse.iitkgp.ernet.in

ISBN: 978-0-8176-4750-6 e-ISBN: 978-0-8176-4751-3


DOI: 10.1007/978-0-8176-4751-3

Library of Congress Control Number: 2009921285

Mathematics Subject Classification (2000): 05C85, 68M10, 82B43, 90B15, 90B18, 90B40, 90C35, 91D30,
92D30, 94C15

© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009


All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Birkhäuser Boston, c/o Springer Science+Business Media, LLC, 233 Spring
Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.

Printed on acid-free paper

Birkhäuser Boston is part of Springer Science+Business Media (www.birkhauser.com)


Preface

In the context of network theory, complex networks can be defined as a
collection of nodes connected by edges representing various complex interactions
among the nodes. Almost any large-scale system, be it natural or
man-made, can be viewed as a complex network of interacting entities, which
is dynamically evolving over time. Naturally occurring networks include bi-
ological, ecological and social networks (e.g., metabolic networks, gene reg-
ulatory networks, protein interaction networks, signaling networks, epidemic
networks, food webs, scientific collaboration networks and acquaintance net-
works), whereas man-made networks include communication networks and
transportation infrastructures (e.g., the Internet, the World Wide Web, peer-
to-peer networks, power grids and airline networks).
This edited volume is a sequel to the workshop Dynamics on and of Complex
Networks (http://www.cel.iitkgp.ernet.in/~eccs07/) held as a satellite
event of the fourth European Conference on Complex Systems in Dresden,
Germany from October 1–5, 2007. The primary aim of this workshop was to
systematically explore the statistical dynamics “on” and “of” complex net-
works that prevail across a large number of scientific disciplines. Dynamics
on networks refers to the different types of processes, for instance, prolifera-
tion and diffusion, that take place on networks. The functionality/efficiency
of these processes is strongly tied to the underlying topology as well as the
dynamic behavior of the network. On the other hand, dynamics of networks
mainly refers to the phenomena of self-organization, which in turn lead to the
emergence of the complex structure of the network.
Another important motivation of the workshop was to create a forum
for researchers applying the theories of complex networks to various do-
mains as well as across several disciplines such as computer science, statistical
physics, nonlinear dynamics, econometrics, biology, sociology and linguistics.
The workshop received a large number of quality submissions from authors
pursuing research in multiple disciplines, thus making the forum truly inter-
disciplinary. The total number of participants who attended the workshop
was approximately 40. There were around 20 speakers, including both senior
researchers and young scientists, who spoke about the dynamics on and of
different systems exhibiting a complex network structure.
The theme of this edited volume is identical to that of the workshop. Its
primary aim is to show how the theories of complex networks are being suc-
cessfully used by researchers to tackle numerous difficult problems in various
domains. Towards this aim, it presents extended versions of some of the
high-quality submissions received at the workshop together with new
invited contributions, which can play an extremely important role in the un-
derstanding as well as advancement of the field. Since the target audience
of this book is expected to be largely cross-disciplinary, the chapters have
been made as readable as possible, explaining all the intricate technicalities
wherever necessary in sufficient detail.
The uniqueness of this volume lies in the fact that it presents an equal
mix of (a) reviews (eight chapters) of important works in the field, which
give the reader an up-to-date picture of the state of the art, and
(b) independent research reports (eight chapters), which show how complex
networks can be used to tackle even the hardest problems of a particular
discipline. The editors feel that research in this area has reached a stage
where there is an urgent need for comprehensive knowledge of the past and
the present before the future can be planned. The blend of reviews and
contributed chapters presented in this volume strives to achieve this
objective and, thereby, to set the platform for “Phase II” research in
complex networks.
The volume consists of three parts. The contributions in Part I center
around the application of complex networks in the understanding of biolog-
ical problems. This part consists of five chapters. The first chapter is From
Network Structure to Dynamics and Back Again: Relating Dynamical Stability
and Connection Topology in Biological Complex Systems, in which Sitabhra
Sinha presents a study of how the topology of a biological network influences
the nature of its dynamics, and conversely, how dynamical considerations put
constraints on the network structure. The next chapter deals with Regula-
tion of Apoptosis via the NFκB Pathway: Modeling and Analysis, in which
Madalena Chaves et al. model and analyze, in the framework of complex net-
works, the interaction of the nuclear factor κB with the apoptosis signaling
pathway. In the third chapter, Network-Based Models in Molecular Biology,
Andreas Beyer presents a survey on the extensive literature that employs
complex networks to understand numerous intricate phenomena in biology.
The fourth chapter, Ecological Networks: Structure, Interaction Strength, and
Stability, by Samit Bhattacharyya and Somdatta Sinha, presents a detailed
survey of the various studies conducted on ecological networks and especially
on food webs. In the last chapter, Signaling and Feedback in Biological Net-
works, Sandeep Krishna et al. review some important studies on the signaling
and feedback mechanisms that are observed in different biological networks.
Part II is also spread over five chapters and focuses on social networks. This
part begins with a chapter on Topographic Spreading Analysis of an Empirical
Sex Workers’ Network, by Johannes Bjelland et al., where the authors present
a “topographic” analysis of spreading (of HIV) on an empirical network of fe-
male sex workers. The authors find that the HIV graph breaks into small
components, thereby reducing the spreading if perfect condom protection is
made possible. The next chapter, Spectral Characterization of Network Struc-
tures and Dynamics, by Anirban Banerjee and Jürgen Jost, centers around
the investigation of the spectral properties of complex networks with a special
thrust on social networks. The third chapter, Dynamics of Social Complex
Networks: Some Insights into Recent Research, is authored by Sergi Lozano
and presents a comprehensive review of how complex network theory has been
instrumental in explaining the structure and the dynamics of a society. The
last two chapters show how complex networks can be applied to explain the
dynamics of human languages. The first one, titled The Structure and Dynam-
ics of Linguistic Networks, by Monojit Choudhury and Animesh Mukherjee,
is a review of the current literature on linguistic networks. The second one,
Networks Generated from Natural Language Text, by Chris Biemann and Uwe
Quasthoff, presents a survey focusing on how corpus linguistics (i.e., the study
of language as expressed in corpora) can be studied within the framework of
complex networks.
Part III presents a comprehensive overview of the networks that are preva-
lent in information sciences. This part is laid out in six chapters. The first
chapter in this part, Efficiency of Navigation in Indexed Networks, by Petter
Holme, explores the efficiency of navigation of data packets on “indexed”
graphs. The second chapter, Evolution of Apache Open Source Software, by
Haoran Wen et al., attempts to explain the evolution of the Apache open
source software through the analysis of its call graphs. The next chapter, Some
New Applications of Network Growth Models, by Gourab Ghoshal, presents
new models of growth for peer-to-peer file-sharing networks. The fourth chap-
ter, The Big Friendly Giant: The Giant Component in Clustered Random
Graphs, by Yakir Berchenko et al., is a theoretical study of the properties of
the giant component in a special kind of random graph, which is relevant for
various information networks. The fifth chapter, Technological Networks, by
Bivas Mitra, presents a detailed review of the large number of studies that
have been conducted on information networks, especially the World Wide
Web and peer-to-peer networks. The last chapter, Advances in the Theory of
Complex Networks, by Fernando Peruani, presents a survey of some of the the-
oretical advancements that have taken place and helps in providing a better
understanding of the structure and dynamics of information networks.
These contributions collectively demonstrate that complex networks in-
deed provide an elegant research framework relevant to a variety of scientific
disciplines. The chapters are designed to serve as the state of the art not only
for students and newcomers who intend to pursue research in this field but
also for the experts. All the chapters have been carefully peer reviewed for
their scientific content as well as readability and self-consistency.
We would like to thank the authors for their contributions, construc-
tive co-operation and gracious acceptance of the editorial comments. We are
also indebted to Ranjita Bhagwan, Chris Biemann, Lutz Brusch, Geoffrey
Canright, Michael Gamon, Gourab Ghoshal, Petter Holme, A. Kumaran,
Abyayananda Maiti, Pabitra Mitra, Luis Morelli, Gautam Mukherjee, Romit
Roy Choudhury, Gustavo Sibona and Biplab K. Sikdar for their constructive
criticisms, comments and suggestions, which have significantly improved the
quality of the chapters. In addition, we would also like to extend our grat-
itude to Rishabh Singh for his painstaking effort in helping to prepare the
Glossary of Essential Terms. Finally, we are also grateful to Tom Grasso and
the Birkhäuser team for all their help and support towards the publication of
this volume.

Kharagpur, India Niloy Ganguly


Dresden, Germany Andreas Deutsch
Kharagpur, India Animesh Mukherjee
Contents

Preface

List of Contributors

Part I Biological Sciences

From Network Structure to Dynamics and Back Again: Relating Dynamical Stability and Connection Topology in Biological Complex Systems
Sitabhra Sinha

Regulation of Apoptosis via the NFκB Pathway: Modeling and Analysis
Madalena Chaves, Thomas Eissing, and Frank Allgöwer

Network-Based Models in Molecular Biology
Andreas Beyer

Ecological Networks: Structure, Interaction Strength, and Stability
Samit Bhattacharyya and Somdatta Sinha

Signaling and Feedback in Biological Networks
Sandeep Krishna, Mogens H. Jensen, and Kim Sneppen

Part II Social Sciences

Topographic Spreading Analysis of an Empirical Sex Workers’ Network
Johannes Bjelland, Geoffrey Canright, Kenth Engø-Monsen, and Valencia P. Remple

Spectral Characterization of Network Structures and Dynamics
Anirban Banerjee and Jürgen Jost

Dynamics of Social Complex Networks: Some Insights into Recent Research
Sergi Lozano

The Structure and Dynamics of Linguistic Networks
Monojit Choudhury and Animesh Mukherjee

Networks Generated from Natural Language Text
Chris Biemann and Uwe Quasthoff

Part III Information Sciences

Efficiency of Navigation in Indexed Networks
Petter Holme

Evolution of Apache Open Source Software
Haoran Wen, Raissa M. D’Souza, Zachary M. Saul, and Vladimir Filkov

Some New Applications of Network Growth Models
Gourab Ghoshal

The Big Friendly Giant: The Giant Component in Clustered Random Graphs
Yakir Berchenko, Yael Artzy-Randrup, Mina Teicher, and Lewi Stone

Technological Networks
Bivas Mitra

Advances in the Theory of Complex Networks
Fernando Peruani

Glossary of Essential Terms

Index

List of Contributors

Frank Allgöwer
Institute for Systems Theory and Automatic Control
University of Stuttgart
Pfaffenwaldring 9
70550 Stuttgart
Germany
allgower@ist.uni-stuttgart.de

Yael Artzy-Randrup
Biomathematics Unit
Faculty of Life Sciences
Tel Aviv University
Ramat Aviv 69978
Israel
artzyra@post.tau.ac.il

Anirban Banerjee
Max Planck Institute for Molecular Genetics
Ihnestr. 63–73
14195 Berlin
Germany
banerjee@molgen.mpg.de

Yakir Berchenko
Interdisciplinary Brain Research Center
Bar Ilan University
Ramat Gan 52900
Israel
byakir@gmail.com

Andreas Beyer
Biotechnology Center
Technische Universität Dresden
01062 Dresden
Germany
andreas.beyer@biotec.tu-dresden.de

Samit Bhattacharyya
Mathematical Modelling and Computational Biology Group
Centre for Cellular and Molecular Biology, CSIR
Hyderabad 500007
India
samit@ccmb.res.in

Chris Biemann
Institute for Computer Science
NLP Department
University of Leipzig
Johannisgasse 26
04103 Leipzig
Germany
biem@informatik.uni-leipzig.de

Johannes Bjelland
Telenor R&I
1331 Fornebu
Norway
johannes.bjelland@telenor.com

Geoffrey Canright
Telenor R&I
1331 Fornebu
Norway
geoffrey.canright@telenor.com

Madalena Chaves
COMORE, INRIA
2004 Route des Lucioles, BP 93
06902 Sophia-Antipolis
France
mchaves@sophia.inria.fr

Monojit Choudhury
Microsoft Research India
Sadashivnagar
Bangalore 560080
India
monojitc@microsoft.com

Raissa M. D’Souza
Department of Mechanical and Aeronautical Engineering
Center for Computational Science and Engineering
University of California
Davis, CA 95616
USA
rmdsouza@ucdavis.edu

Thomas Eissing
Bayer Technologies Services GmbH
PT-AS Systems Biology
51368 Leverkusen
Germany
thomas.eissing@bayertechnology.com

Kenth Engø-Monsen
Telenor R&I
1331 Fornebu
Norway
kenth.engo-monsen@telenor.com

Vladimir Filkov
Department of Computer Science
University of California
Davis, CA 95616
USA
vfilkov@ucdavis.edu

Gourab Ghoshal
Department of Physics, and Michigan Center for Theoretical Physics
University of Michigan
Ann Arbor MI, 48109
USA
gghoshal@umich.edu

Petter Holme
Department of Physics
Umeå University
90187 Umeå
Sweden
petter.holme@physics.umu.se

Mogens H. Jensen
Center for Models of Life
Niels Bohr Institute
Blegdamsvej 17
2100 Copenhagen
Denmark
mhjensen@nbi.dk

Jürgen Jost
Max Planck Institute for Mathematics in the Sciences
Inselstr. 22
04103 Leipzig
Germany
Santa Fe Institute
Santa Fe, NM 87501
USA
jost@mis.mpg.de

Sandeep Krishna
Center for Models of Life
Niels Bohr Institute
Blegdamsvej 17
2100 Copenhagen
Denmark
sandeep@nbi.dk

Sergi Lozano
ETH Zürich
Swiss Federal Institute of Technology
UNO D11
Universitätstr. 41
8092 Zürich
Switzerland
slozano@ethz.ch

Bivas Mitra
Department of Computer Science and Engineering
Indian Institute of Technology
Kharagpur 721302
India
bivasm@cse.iitkgp.ernet.in

Animesh Mukherjee
Department of Computer Science and Engineering
Indian Institute of Technology
Kharagpur 721302
India
animeshm@cse.iitkgp.ernet.in

Fernando Peruani
Service de Physique de l’Etat Condensé (SPEC/CEA) and
Complex System Institute Paris - Ile-de-France (ISC-PIF)
F-75005, Paris
France
fernando.peruani@iscpif.fr

Uwe Quasthoff
Institute for Computer Science
NLP Department
University of Leipzig
Johannisgasse 26
04103 Leipzig
Germany
quasthoff@informatik.uni-leipzig.de

Valencia P. Remple
BC Centre for Disease Control
Epidemiology
University of British Columbia
Vancouver, BC V5Z 4R4
Canada
Valencia.Remple@bccdc.ca

Zachary M. Saul
Department of Computer Science
University of California
Davis, CA 95616
USA
zmsaul@ucdavis.edu

Sitabhra Sinha
The Institute of Mathematical Sciences
CIT Campus
Taramani
Chennai 600113
India
sitabhra@imsc.res.in

Somdatta Sinha
Mathematical Modelling and Computational Biology Group
Centre for Cellular and Molecular Biology, CSIR
Hyderabad 500007
India
sinha@ccmb.res.in

Kim Sneppen
Center for Models of Life
Niels Bohr Institute
Blegdamsvej 17
2100 Copenhagen
Denmark
sneppen@nbi.dk

Lewi Stone
Biomathematics Unit
Faculty of Life Sciences
Tel Aviv University
Ramat Aviv 69978
Israel
lewi@post.tau.ac.il

Mina Teicher
Interdisciplinary Brain Research Center
Bar Ilan University
Ramat Gan 52900
Israel
teicher@macs.biu.ac.il

Haoran Wen
Department of Mechanical and Aeronautical Engineering
Center for Computational Science and Engineering
University of California
Davis, CA 95616, USA
hrwen@ucdavis.edu

From Network Structure to Dynamics
and Back Again: Relating Dynamical Stability
and Connection Topology in Biological
Complex Systems

Sitabhra Sinha

The Institute of Mathematical Sciences, CIT Campus, Taramani, Chennai 600113,
India; sitabhra@imsc.res.in

1 Introduction
To see a world in a grain of sand,
And a heaven in a wild flower,
Hold infinity in the palm of your hand,
And eternity in an hour.
– William Blake, Auguries of Innocence

Like Blake, physicists look for universal principles that are valid across
many different systems, often spanning several length or time scales. While
the domain of physical systems has often offered examples of such widely ap-
plicable “laws,” biological phenomena tended to be, until quite recently, less
fertile in terms of generating similar universalities, with the notable exception
of allometric scaling relations [20]. However, this situation has changed with
the study of complex networks emerging into prominence. Such systems com-
prise a large number of nodes (or elements) linked with each other according
to specific connection topologies, and are seen to occur widely across the bi-
ological, social and technological worlds [4, 9, 16]. Examples range from the
intra-cellular signaling system which consists of different kinds of molecules af-
fecting each other via enzymatic reactions, to the internet composed of servers
around the world which exchange enormous quantities of information packets
regularly, and food webs which link, via trophic relations, large numbers of
inter-dependent species. While the existence of complex networks in various
domains had been known for some time, the recent excitement among physi-
cists working on such systems has to do with the discovery of certain universal
principles among systems which had hitherto been considered very different
from each other.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_1,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
Reflecting the development of the modern theory of critical phenomena,
the rise of the physics of complex networks has been driven by the simultaneous
occurrence of detailed empirical studies of extremely large networks that
were made possible by the advent of affordable high-power computing and the
development of statistical mechanics tools to analyze the new network mod-
els. Prior to these developments, the networks that were studied by physicists
belonged to either the class of (i) regular networks, defined on geometrical lat-
tices, where each node interacted with all the neighboring nodes belonging to a
specified neighborhood, or (ii) random networks, where any pair of nodes had
a fixed probability of being linked, i.e., interacting with each other. The first
work that focused public attention on the new network approach presented a
class of network models that were neither regular nor random, but exhibited
properties of both [28]. Such small world networks, as they were referred to,
exhibited high clustering (with nodes sharing a common neighbor having a
higher probability of being connected to each other than to other nodes) and
a very low average path length (where the path length between any two nodes
is defined as the smallest number of connected nodes one has to go through in
order to reach one node starting from the other). As the former property char-
acterized a regular network, while the latter was typical for a random network,
this new class of networks was somehow intermediate between the extremes
of the two well-known network models, which was manifest in their construc-
tion procedure (Fig. 1). Several networks occurring in reality, in particular,
the power grid, the actor collaboration network and the neural connection
patterns of the C. elegans worm, were shown to have the small-world prop-
erty. Later, other examples were added to this list, including the network of
co-active functional brain areas [1] and the Indian railway system [21].
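
The interpolation between a regular and a random network can be tried out numerically. The following is only a minimal sketch under stated assumptions: it uses a one-dimensional ring rather than the two-dimensional lattice of Fig. 1, counts path length in links, averages it over reachable pairs only, and all function names and parameter values (200 nodes, 4 neighbors, the seed) are hypothetical choices made for illustration.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Regular ring: each node is linked to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def rewire(adj, p, rng):
    """With probability p, replace each link by a link between a random unlinked pair."""
    n = len(adj)
    edges = [(i, j) for i in adj for j in adj[i] if i < j]
    for i, j in edges:
        if rng.random() < p:
            adj[i].discard(j)
            adj[j].discard(i)
            while True:
                a, b = rng.randrange(n), rng.randrange(n)
                if a != b and b not in adj[a]:
                    break
            adj[a].add(b)
            adj[b].add(a)
    return adj


def average_clustering(adj):
    """Mean fraction of a node's neighbour pairs that are themselves linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute zero
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length (in links) over reachable pairs, via breadth-first search."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Hypothetical parameters, chosen only for illustration.
for p in (0.0, 0.1, 1.0):
    net = rewire(ring_lattice(200, 4), p, random.Random(1))
    print(p, round(average_clustering(net), 3), round(average_path_length(net), 2))
```

For small p one expects the clustering to stay close to its lattice value while the average path length drops sharply, which is the small-world regime described above.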
Very soon afterwards, it was discovered that the frequency distribution
of node degree (i.e., the number of links a node has) exhibits a power-law
scaling form for a large variety of systems, including the World Wide Web [3].

Fig. 1. Constructing a small-world network on a 2-dimensional square lattice
substrate. Starting from a regular network (left) where each node is connected to its
nearest and next-nearest neighbors, a fraction p of the links are rewired among ran-
domly chosen pairs of nodes. When all the links are rewired, i.e., p = 1, the system
is identical to a random network (right). For small p, the resulting network (center)
still retains the local properties of the regular network (e.g., high clustering), while
exhibiting global properties of a random network (e.g., short average path length).
This further underlined the fact that most networks occurring in reality are
neither regular (in which case the degree distribution would be close to a
delta function) nor random (which has a Poisson degree distribution), as for
both cases the probability of having a node with large degree (i.e., a hub)
would be significantly smaller than that indicated by the power-law tail of
empirically obtained degree distributions. In addition, it was observed that
there exist non-trivial degree correlations among linked pairs of nodes. For
example, a network where nodes with high degree tend to preferentially con-
nect with other high degree nodes is said to show assortative mixing [15]. On
the other hand, in a disassortative network, nodes with a large number of
links prefer to connect with nodes having low degree. Empirical studies indi-
cate that most biological and technological networks are disassortative, while
social networks tend to be assortative [16]. As assortative mixing promotes
percolation and makes a network more robust to vertex removal, it may be
hard to understand why natural evolution in the biological world has favored
disassortativity. However, in a recent study, we have shown that when one
considers the stability of dynamical states of a network, disassortative net-
works would tend to be more robust, and this may be one of the reasons why
they are preferred [6].
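
The notion of degree correlation among linked pairs can be made concrete with a small calculation. The sketch below assumes that the correlation is measured as the Pearson coefficient between the degrees at the two ends of each link, with every link counted in both directions; the text does not spell out a formula, so this choice, the function name and the two toy networks are illustrative assumptions.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees found at the two ends of a link."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Count every link in both directions so that the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) * (x - mean) for x in xs)
    if var == 0:
        raise ValueError("all link ends have the same degree; correlation undefined")
    return cov / var


# Toy example 1: a star, where a single hub links only to degree-1 nodes.
star = [(0, leaf) for leaf in range(1, 6)]

# Toy example 2: a triangle plus a separate 4-clique, so links join equal degrees.
triangle = [(0, 1), (0, 2), (1, 2)]
clique4 = [(a, b) for a in range(3, 7) for b in range(a + 1, 7)]

print("star:", degree_assortativity(star))
print("two cliques:", degree_assortativity(triangle + clique4))
```

A negative value corresponds to disassortative mixing (hubs attached to low-degree nodes) and a positive value to assortative mixing.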
This brings us to the thrust of recent work in the area of complex networks
which has shifted from the initial focus on purely structural aspects of the con-
nection topology to the role such features play in determining the dynamical
processes defined on a network [27]. Over the past few years, much effort has
been made to understand not only how structure affects dynamics, and hence
function, in a network, but also the reverse problem of how functional cri-
teria, such as the need for dynamical stability, can constrain the topological
properties of a network. In this chapter, some of the principal results obtained
by our group will be briefly described. The goal of our research program is
to understand the evolution of robust yet complex biological structures, viz.,
networks occurring in reality that are stable against perturbations and, yet,
which can adapt to a changing environment.

2 Biological Networks: Some Examples Across Length Scales
Before describing our results, which are applicable to a wide range of net-
works, we provide motivation for our general approach by briefly discussing
in this section a few examples of biological networks. Although they span an
enormous range of length scales, from $\sim 10^{-8}$ m in the case of protein contact
networks to $\sim 10^{5}$ m in the case of ecological interaction networks, they are
often subject to similar constraints and may share common structural and dy-
namical properties. Questions about networks in one domain may often have
answers and ramifications in another domain.
Molecular scale: protein contact network. Protein structure, viewed as
a network of non-covalent connections between the constituent amino acids,
is one of the smallest length scale networks in the natural world. Its nodes
are the Cα atoms of each amino acid, and their interaction strength is determined
by their proximity to each other. Two nodes are considered to be linked
if the Euclidean distance between them (in 3-dimensional space) is less than a
cutoff value $d_c$, usually between 8 and 14 Å, which is the relevant distance for
non-covalent interactions. Figure 2 shows the KirBac1.1 protein, which belongs to
the family of potassium ion channels involved in transmission of inward rec-
tifying current across a cellular membrane [13]. The protein consists of four
identical subunits spanning the membrane and intra-cellular regions. The cor-
responding protein contact network (PCN) manifests the existence of the iden-
tical subunits in the approximately block diagonal structure of the adjacency
matrix. In addition, each of these four blocks can be divided into two modules,
corresponding approximately to the membrane and intra-cellular regions.
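
The distance-cutoff rule for linking nodes can be sketched in a few lines. The coordinates below are a hypothetical straight chain of points spaced 3.8 Å apart, used only to show the mechanics; a real PCN would use the Cα coordinates of an actual structure, which are not reproduced here, and the function name is likewise an assumption.

```python
import math


def contact_network(coords, cutoff):
    """Adjacency matrix: nodes i and j are linked if their distance is below cutoff."""
    n = len(coords)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency


# Hypothetical coordinates (in angstroms): six points on a straight line.
chain = [(3.8 * i, 0.0, 0.0) for i in range(6)]

for cutoff in (8.0, 12.0):
    print("cutoff", cutoff)
    for row in contact_network(chain, cutoff):
        print(row)
```

Raising the cutoff from 8 Å to 12 Å adds links to more distant nodes, so the choice of $d_c$ directly controls the density of the resulting network.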
It is easy to see that the PCN shares the features of a small-world net-
work, with the majority of connections between spatially neighboring nodes,
although there are a few long-range connections. This small-world property
of PCNs for different protein molecules has indeed been noted several times
in the literature (see, e.g., Ref. [2]). This is probably not very surprising,
given that it is also true for a randomly folded polymer. However, in addition,
the PCN adjacency matrix shows a modular structure, with a majority of
connections occurring between nodes belonging to the same module. This is
a feature not seen in conventional models of small-world networks (e.g., the
Watts–Strogatz model [28]). It is all the more intriguing, as we have recently

[Figure 2, right panel: adjacency matrix with blocks labeled subunit I, subunit II, subunit III and subunit IV, each divided into an intra-cellular domain and a membrane domain; axis ticks run from 200 to 1000.]

Fig. 2. Structure of the KirBac1.1 protein (left), which comprises four identical
subunits spanning the membrane and intra-cellular regions [13]. The PCN is constructed
using a cutoff distance of $d_c = 12$ Å; its adjacency matrix is shown for
the entire network (right). Each of the four blocks corresponding to a subunit shows a
clear partition into membrane and intra-cellular compartments, indicating a modular
structure.
shown that modular networks (whatever the connection topology of individual


modules) exhibit the small-world properties of high clustering and low aver-
age path length [18]. To identify whether the existence of modules indeed
has a significant effect on protein dynamics (e.g., during folding), we look
at the spectral properties of the Laplacian matrix¹ $L$, defined by $L_{ii} = k_i$,
where $k_i$ is the degree of node $i$, and $L_{ij} = -1$ if nodes $i$ and $j$ are connected, 0 otherwise.
The eigenvector for the smallest eigenvalue (= 0), $c^{(1)}$, corresponds to the
time-invariant properties of the system and has uniform contribution from all
components. The next few smallest eigenvalues dominate the time-dependent
behavior of the protein and show a relatively large spectral gap with the bulk
of the eigenvalue spectra. This indicates the existence of very distinct time
scales in the protein dynamics which approximately correspond to the inter-
and intra-modular modes of motion. As we shall see, the occurrence of mod-
ular structures in complex networks and their effect on dynamics is not just
confined to PCNs but appears in many other biological networks.
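
The spectral signature described above can be illustrated on a synthetic network. The following is a minimal sketch, not the analysis performed on the KirBac1.1 PCN: it assumes a random modular network in which links occur with probability p_in inside a module and p_out between modules (the function names and all parameter values are illustrative choices), builds the Laplacian as defined in the text, and prints its smallest eigenvalues.

```python
import numpy as np

def modular_adjacency(n_modules, module_size, p_in, p_out, seed=0):
    """Symmetric 0/1 adjacency matrix of a random modular network (assumed model)."""
    rng = np.random.default_rng(seed)
    n = n_modules * module_size
    labels = np.repeat(np.arange(n_modules), module_size)
    same_module = labels[:, None] == labels[None, :]
    prob = np.where(same_module, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # draw each pair once
    return (upper | upper.T).astype(float)

def laplacian(adj):
    """L_ii = k_i (degree of node i), L_ij = -1 if i and j are linked, 0 otherwise."""
    return np.diag(adj.sum(axis=1)) - adj

A = modular_adjacency(n_modules=4, module_size=50, p_in=0.5, p_out=0.005)
eigvals = np.linalg.eigvalsh(laplacian(A))   # returned in ascending order
print("smallest eigenvalues:", eigvals[:6])
print("gap between 4th and 5th eigenvalue:", eigvals[4] - eigvals[3])
```

With four modules, one expects a zero eigenvalue, three small eigenvalues associated with inter-modular modes, and then a gap separating them from the bulk of the spectrum; the size of the gap depends on the assumed ratio of p_in to p_out.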
Intra-cellular scale: signaling network. Signal transduction pathways,
through which a cell responds appropriately to a signal or stimulus, involve
ordered sequences of biochemical reactions carried out by enzymes inside the
cell. One of the most commonly observed classes of enzymes in intra-cellular
signaling is that of kinases, which activate target molecules (usually proteins)
by transferring phosphate groups from energy donor molecules such as ATP to
the targets. This process of phosphorylation is mirrored by the reverse process
of deactivation by phosphatases through dephosphorylation. Such reaction
cascades are activated by second messengers (e.g., cyclic AMP or calcium
ions) and may last for a few minutes, with the number of kinase proteins and
other molecules involved in the process increasing with every reaction step
away from the initial stimulation. Thus, such a signaling cascade can result
in a large response for a relatively low-amplitude signal.
Research over the past decade has, however, shown that the classical pic-
ture of almost isolated cascades linking a unique signal to a specific response
does not explain many experimental results. The adaptability of intra-cellular
signaling is now thought to be a result of multiple signaling pathways interact-
ing with one another to form complex networks. In this picture, complexity
arises from the large number of components, many of which have partially
overlapping functions, from the large number of links (through enzymatic re-
actions) among components and from the spatial relationship between the
components [29]. Figure 3 shows a small fraction of the signaling network
downstream of the B-cell antigen receptor (BCR) involved in immune re-
sponse. As the breakdown of communication in this network can lead to dis-
ease (a fact that may be utilized by infectious agents for proliferation), it is
of obvious importance to understand the mechanisms by which the network
allows the cell response to be sensitive to different stimuli and yet to be robust
in the presence of intra-cellular noise. With this in mind, the time evolution
¹ The Laplacian matrix is also referred to as the Kirchhoff matrix (e.g., see Ref. [10]).
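
[Figure 3: network diagram of signaling molecules downstream of the BCR. Node labels recoverable from the figure: IgG receptor, Igα, Igβ, PI3K, Pyk2, Syk, Lyn, PIP2, PIP3, Shc, BLNK, Btk, PDK1, Grb2, PLCg2, Vav, SOS, Rac, DAG, IP3, Akt, Raf-1, MEKK, Ca2+, PKC, MEK 1/2, MKK 3/4/6, MKK 4/7, IKK, CaMK2, Erk 1/2, Jnk 1/2, p38, Bad, Bcl2, IkB, NFAT.]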

Fig. 3. A subset of the signal transduction network of the BCR [12]. The kinases
are represented by squares, while other molecules (such as second messengers and
adapters) are depicted as circles.

of the activity (i.e., phosphorylation) of about 20 signaling molecules in this
network was recorded in a recent experiment by Kumar et al. [12]. Apart
from observing the activation profiles under normal conditions, the network
was also subjected to a series of perturbations by serially blocking each of
these molecules from activating any of the other molecules in the network.
The resulting experimental data, capturing the behavior of these molecules
under 21 different conditions, enabled the detection of correlations between
the activity of these molecules. This showed that the existing picture of in-
teractions (Fig. 3) is grossly inadequate in explaining these correlations, e.g.,
the fact that p38 kinase seems to influence the activation of a majority of the
other molecules, although it occurs at the end of a particular pathway. The
results suggest that the signaling network is, in fact, a far more densely connected
system than had been previously suspected. This also raises the question
of how certain signals can elicit very specific responses, without significant
risk of cross-talk between interacting pathways. This brings us to the issue of
whether functional modules can exist in networks, such that by using positive
and negative interactions one can channel information from the stimulus to
the response along specific subnetworks only.
Inter-cellular scale: neuronal network. The previous question is of impor-
tance not only for information processing within a cell, but also between cells.
The most important example of the latter process is, of course, the networks
of neurons occurring in the brain. As the nervous system of the nematode
C. elegans comprising 302 neurons has been completely mapped out (in terms
of the positions of the neurons, as well as all their interconnections), it pro-
vides a model system for studying these issues. We have recently analyzed
the connection topology of the non-pharyngeal portion of the nervous system
to which the majority of the neurons (280) belong [7]. One of the striking
observations is that many of the sensory neurons belonging to different modalities,
viz., chemosensation, mechanosensation, etc., send signals to the same
set of densely connected interneurons which forms the innermost core of the
nervous system. Subsequently, signals are sent from these interneurons to spe-
cific motor neurons which generate appropriate muscle response, e.g., moving
along a chemical gradient, egg laying, etc. It is vital that the signals coming
from different sensory neurons to the same interneurons should not interfere
with each other, as this may result in activation of the incorrect motor response.
A preliminary investigation of a dynamical model for the neuronal network
shows that a complex set of excitatory and inhibitory links between the in-
terneurons manages to achieve segregation of the different functional circuits.
This means that, e.g., a mechanical tap signal will not elicit egg laying, even
if the tap withdrawal circuit shares many common interneurons with the egg-
laying circuit. Even more interesting is the fact that such functional modules
do not need the existence of structural modules in the underlying networks. It
underscores the importance of looking at the nature of the interactions, which
can create complicated control mechanisms to prevent cross-talk and enable
robust response in the presence of environmental noise.
Inter-organism scale: epidemic propagation network. At the scale of
individual organisms, such as human beings, one of the most widely studied
networks is that which leads to propagation of epidemics. The ubiquity of
small-world networks in nature implies that some of the classic theories of
epidemiological transmission, based on assumptions of random connections,
may need to be reviewed. In particular, the global spread of diseases like SARS
shows that even a few long-range links can drastically enhance the propagation
of epidemics [8]. This has led to a series of studies of different disease propaga-
tion models on Watts–Strogatz or related network models (e.g., see Ref. [19]).
However, as mentioned above, all the structural features of such networks
are also shared by modular networks, although modular networks have very
different dynamical properties. We have recently shown that while Watts–
Strogatz networks have a continuous range of time scales, modular networks
exhibit very distinct time scales that are related to intra- and inter-modular
events [18]. Thus, an effective strategy to counter the spread of epidemics
must take into account a detailed knowledge of such structures in the social
network of contagious and susceptible individuals.
Inter-species scale: food webs. Possibly the largest (in terms of length
scale) biological networks on earth are those of interactions between different
species in an ecosystem. While general ecological networks consist of all possi-
ble links, such as cooperation and competition, food webs describe the trophic
relations, i.e., between predator and prey. A food web is a directed network
where the nodes are the various species, with prey connected by arrows to
predators, the direction of the arrow indicating the flow of biomass. The links
are usually weighted to represent the amount of energy that is transferred.

It is in the context of these networks that questions first arose on the con-
nection between the structural properties of a network and the stability of
its dynamical behavior (see Section 4). Indeed, one not only asks what kind
of structures allow complex networks to be stable against ever-present per-
turbations, but also how the requirement to be robust constrains the kind of
structures such networks can evolve. To stress the universality of the questions
asked by physicists about networks, we note that, like many other networks,
food webs also have been shown to have a modular structure, with species in
each module interacting strongly among themselves and only weakly with
other species [11]. As in the other systems discussed earlier, the role that
modularity plays in stabilizing the dynamics of ecosystems can be seen as a
specific instance of a much more general question.
Having discussed a few instances of how universal principles about net-
works can appear by investigating very different systems in the biological
world, we now describe certain results of our studies on general network mod-
els. However, we stress that each of these results has relevance to problems
appearing in the context of specific biological systems.

3 From Structure to Dynamics

The role that the connection topology of a network plays in the nature of
its dynamics has been extensively investigated for spin models occurring in
physics. In fact, such systems had been explored for a long time prior to the
recent interest in complex networks, and many results are known regarding ordering
transitions in both regular and random structures. More recently,
it has been shown that, for partial random rewiring in a system of sufficiently
large size, any finite value of p (the rewiring probability) causes a transition
to the small-world regime, with the Ising model defined on such a network ex-
hibiting a finite temperature ferromagnetic phase transition [5]. However, spin
models are extremely restricted in their dynamical repertoire; therefore, re-
searchers have looked at the effect of introducing other kinds of node dynamics
in such network structures, e.g., oscillators. Motivated by recent observations
that the brain may have a connection structure with small-world properties
(see e.g., Ref. [1]), we have examined the effect of long-range connections (i.e.,
non-local diffusion) over an otherwise regular network of nodes with links be-
tween nearest neighbors on a square lattice [25]. The dynamics considered
is that of the excitable type, with the variable having a single stable state
and a threshold. If a perturbation causes the system variable to exceed the
threshold, we see a rapid transition to a metastable excited state followed by a
slow recovery phase when the system gradually converges to the stable state.
As a result of coupling the dynamics of individual nodes through diffusive
coupling, various spatial patterns (which may be temporally varying) are ob-
served. Such a dynamics is commonly observed in a large variety of biological

[Figure 4: schematic with three panels (Spatial Patterns, Temporal Patterns, Burn-out), each plotting Activity (0 to 0.5) against time, arranged along a $p$ axis running from 0 to 1 and marked at $p_c^l$ and $p_c^u$.]
Fig. 4. Schematic diagram indicating the different dynamical regimes in a
2-dimensional small-world excitable medium as a function of the rewiring probability,
p. For low p, the system exhibits spatial patterns characterized by single or multiple
spirals. At $p = p_c^l$, there is a transition to a state dominated by temporally periodic
patterns that are spatially relatively homogeneous. Above $p = p_c^u$, all activity ceases
after a brief transient.

cells such as neurons and cardiac myocytes, as well as in non-linear chemical
systems such as the Belousov–Zhabotinsky reaction.
In our simulations, by varying the probability of long-range connections,
p, we have observed three categories of patterns. For $0 < p < p_c^l$, after an
initial transient period where multiple coexisting circular waves are observed,
the system is eventually spanned by one or more rotating spiral waves
whose temporal behavior is characterized by a flat power spectral density.
At $p = p_c^l$, the system undergoes a transition from a regime with temporally
irregular, spatial patterns to one with spatially homogeneous, temporally periodic
patterns (Fig. 4). The latter behavior occurs over the range $p_c^l < p < p_c^u$
as a result of the increased number of long-range connections, whereby a large
fraction of the system is synchronously active and subsequently goes into the
recovery phase. Beyond the upper critical value $p_c^u$, there is no longer any
self-sustained activity in the system, as all nodes converge to the stable state.
The patterns in each regime were found to be extremely robust against even
large perturbations or disorder in the system.
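
To make the setup concrete, the following sketch simulates an excitable medium on a square lattice with long-range links. It is not the model of Ref. [25], which uses continuous, diffusively coupled node dynamics; here a discrete Greenberg–Hastings cellular automaton is swapped in for simplicity, and the rewiring scheme, the refractory period and the initial condition are all illustrative assumptions. No claim is made that this toy reproduces the three regimes reported above.

```python
import numpy as np

def small_world_lattice(side, p, rng):
    """Neighbor arrays for a side x side periodic square lattice in which each
    nearest-neighbor link is replaced, with probability p, by a link to a
    randomly chosen node (an assumed way of adding long-range connections)."""
    n = side * side
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        x, y = divmod(i, side)
        for j in (((x + 1) % side) * side + y, x * side + (y + 1) % side):
            if rng.random() < p:
                j = int(rng.integers(n))
            if j != i:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return [np.array(sorted(s), dtype=int) for s in neighbors]

def simulate(neighbors, state, steps, refractory=4):
    """Greenberg-Hastings rule: 0 = resting, 1 = excited, 2..refractory = recovering.
    Returns the fraction of excited nodes at each time step."""
    state = state.copy()
    activity = np.empty(steps)
    for t in range(steps):
        excited = state == 1
        # excited and recovering nodes advance one stage, then return to rest
        new = np.where(state > 0, (state + 1) % (refractory + 1), 0)
        # a resting node fires if at least one neighbor is excited
        for i in np.flatnonzero(state == 0):
            if excited[neighbors[i]].any():
                new[i] = 1
        state = new
        activity[t] = (state == 1).mean()
    return activity

rng = np.random.default_rng(1)
side = 30
nbrs = small_world_lattice(side, p=0.05, rng=rng)
state = np.zeros(side * side, dtype=int)
state[:side // 2] = 1                # half a row of excited nodes ...
state[side:side + side // 2] = 2     # ... backed by recovering nodes: a broken wavefront
activity = simulate(nbrs, state, steps=200)
print(activity[-10:])
```

Repeating the run for several values of p and inspecting the activity time series is one way to explore how the density of long-range links changes the pattern of activity in this simplified setting.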
Our model explains several hitherto unexplained observations in experi-
mental systems where non-local diffusion had been implemented [26]. In ad-
dition, by identifying the long-range connections with those made by neurons
and the regular network with that formed by the glial cells in the brain, our
results provide a possible explanation of why evolution may have preferred to
increase the number of glial cells over neurons (with a ratio of more than 10:1
for certain parts of the human brain) in order to maintain robust dynamical
patterns as brain size increased. It also points towards a possible functional
role of the small-world brain topology in the occurrence of dynamical dis-
eases such as epileptic seizures and bursts. More generally, our work shows
how non-standard network topologies can influence system dynamics by generating
different kinds of spatio-temporal patterns depending on the extent of
non-local diffusion.

4 From Dynamics to Structure

An important functional criterion for most networks occurring in nature and
society is the stability of their dynamical states. While earlier studies have
concentrated on the robustness of the network when subjected to structural
perturbations (e.g., removal of nodes or links), we have looked at the effect
of perturbations on the steady states of network dynamics. In particular, the
question we ask is whether networks become more susceptible to small per-
turbations as their size (i.e., number of nodes N ) increases, the connections
between the nodes become denser (i.e., increased connection probability C)
and the average strength of interaction (s) increases. This is related to a
decades-old controversy, often referred to as the stability-complexity debate.
In the early 1970s, May [14] had shown that for a model ecological network,
where species are assumed to interact with a randomly chosen subset of all
other species, an arbitrarily chosen equilibrium state of the system becomes
unstable if any of the parameters determining the network’s complexity (e.g.,
N , C or s) is increased. In fact, by using certain results of random matrix
theory, the critical condition for the stability of the network was shown to be
$N C s^2 < 1$ (May–Wigner theorem) [14]. This ran counter to common wisdom,
gleaned from a large number of empirical studies as well as naive reasoning,
which dictated that increased diversity and/or stronger interactions between
species result in more robust ecosystems. Thus, ever since the publication
of these results, there have been attempts to understand the reason behind
the apparent paradox, especially as this result relates not only to ecologi-
cal systems but extends to all dynamical networks for which the stability of
equilibria has functional significance, e.g., in intra-cellular biochemical net-
works where the concentrations of different molecules need to be maintained
within physiological levels. Two of the common charges leveled against the
theoretical model of May are that (i) it assumes the interaction network to
be random, whereas naturally occurring networks may have certain kinds of
structures, and (ii) the linear stability analysis assumes the existence of simple
steady states (viz., fixed point attractors), which may not be the case for real
systems that may either be oscillating or in a chaotic state.
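
The stability criterion can be checked numerically with a short experiment. The sketch below is an illustration under stated assumptions, not May's original calculation: the linearized dynamics around the equilibrium is represented by a random matrix whose diagonal entries are all -1, and whose off-diagonal entries are nonzero with probability C and drawn from a normal distribution with zero mean and standard deviation s. The equilibrium is counted as stable when every eigenvalue has a negative real part. The function names and parameter values are hypothetical.

```python
import numpy as np

def random_community_matrix(n, c, s, rng):
    """Random interaction matrix: diagonal entries -1, off-diagonal entries
    nonzero with probability c and drawn from Normal(0, s) (assumed ensemble)."""
    mask = rng.random((n, n)) < c
    m = np.where(mask, rng.normal(0.0, s, size=(n, n)), 0.0)
    np.fill_diagonal(m, -1.0)
    return m

def stable_fraction(n, c, s, trials=50, seed=0):
    """Fraction of sampled matrices whose eigenvalues all have negative real part."""
    rng = np.random.default_rng(seed)
    stable = 0
    for _ in range(trials):
        eig = np.linalg.eigvals(random_community_matrix(n, c, s, rng))
        stable += int(eig.real.max() < 0)
    return stable / trials

n, c = 100, 0.2
for s in (0.1, 0.2, 0.3, 0.4):
    print(f"N C s^2 = {n * c * s * s:.2f}  stable fraction = {stable_fraction(n, c, s):.2f}")
```

For a finite network the transition is not perfectly sharp, so values of $N C s^2$ close to 1 can give intermediate stable fractions; well below 1 essentially all samples should be stable, and well above 1 essentially none.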
In our work on dynamical systems defined on networks, we have tried to
address both of these lines of criticism (see Ref. [31] for a recent discussion
of our results from the perspective of ecosystem robustness). For example,
focusing on the question of the inadequacy of linear stability analysis, we
have considered networks with non-trivial dynamics at the nodes, spanning
the range from simple steady states to periodic oscillation and fully developed
chaos, and measured the robustness of the dynamics with respect to variations
in N , C and s [23, 24].

Each node in our model network has a dynamical variable associated with
it, which evolves according to a well-known class of difference equations com-
monly used for modeling population dynamics. By varying a non-linear pa-
rameter, the nature of the dynamics (i.e., whether it converges to a steady
state or undergoes chaotic fluctuations) at each node can be controlled. How-
ever, in the absence of coupling, each node will always have a finite, positive
value for its dynamical variable. When coupled in a network (initially in a ran-
dom fashion) with links that can have either positive or negative weights, it is
possible that as a result of dynamical fluctuations, the variable for some nodes
can become negative or zero. As this implies the absence of any activity, the
corresponding node is considered to be “extinct” and thus isolated from the
network. This procedure may create further fluctuations and cause more nodes
to become “extinct,” resulting in gradual reduction of the size of the network
(Fig. 5). The final asymptotic size of the network, relative to its initial size,
is a measure of its robustness—the more robust network is one with a higher
fraction of nodes having persistent activity. Analysis showed that the network
robustness (as measured by the above global criterion) not only decreased with
N , C and s, as expected from local stability analysis, but actually matched the
May–Wigner theorem quantitatively [23]. In addition, the asymptotic network
exhibited robust macroscopic features: (a) the number of persistently active
nodes was independent of the initial network size, and (b) the asymptotic
number of links between these persistently active nodes was independent of
both the initial size and connectivity [24]. This is all the more surprising, as
the removal of nodes (and hence, links) is not guided by any explicit fitness
criterion, but rather emerges naturally from the nodal dynamics through fluc-
tuations of individual node properties. Our results imply that asymptotically


Fig. 5. Evolution of a network with non-trivial dynamics at the nodes. The initial
(left) and final asymptotic (right) networks are shown. Only nodes having persistent
activity are connected to the network. The figures were drawn using Pajek software.

active networks are non-extensive: when two networks of size N are coupled to
each other (with the same connectance as the individual networks), although
the resulting network initially has a size 2N , the ensuing dynamical fluctua-
tions will reduce its size to N . This implies that simply increasing the number
of redundant elements is not a good strategy for designing robust systems.
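
The extinction procedure described above can be sketched as follows. The text does not spell out the difference equation, so the form used here is an assumption: each node follows the exponential logistic (Ricker) map $f(x) = x\,e^{r(1-x)}$, with Lotka–Volterra-type coupling $x_i(n+1) = f\big(x_i(n)\,[1 + \sum_j J_{ij} x_j(n)]\big)$, and any node whose argument becomes zero or negative is set to zero and stays there. The names `ricker` and `persistent_fraction` and all parameter values are hypothetical.

```python
import numpy as np

def ricker(x, r):
    """Exponential logistic map; non-positive arguments are mapped to 0 (extinction)."""
    x = np.maximum(x, 0.0)
    return x * np.exp(r * (1.0 - x))

def persistent_fraction(n, c, s, r=4.0, steps=2000, seed=0):
    """Fraction of nodes that still have positive activity after `steps` iterations."""
    rng = np.random.default_rng(seed)
    # random coupling: links present with probability c, weights Normal(0, s)
    J = np.where(rng.random((n, n)) < c, rng.normal(0.0, s, (n, n)), 0.0)
    np.fill_diagonal(J, 0.0)
    x = rng.random(n) + 0.1          # strictly positive initial values
    for _ in range(steps):
        # an extinct node (x_i = 0) stays at 0 and no longer influences others
        x = ricker(x * (1.0 + J @ x), r)
    return np.count_nonzero(x > 0) / n

for s in (0.0, 0.1, 0.3):
    print(f"s = {s}: persistent fraction = {persistent_fraction(50, 0.1, s):.2f}")
```

Because an extinct node contributes nothing to the coupling sum, setting its variable to zero is equivalent to isolating it from the network, which is the mechanism the text describes. Comparing runs with different initial sizes is one way to probe the reported independence of the asymptotic size from the initial size.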
We have also looked at the effect of empirically reported structures, such
as small-world connection topology and scale-free degree distribution, on the
dynamical stability of networks. Our results indicate that, in general, intro-
ducing such structural features does not alter the outcome expected from
the May–Wigner theorem [6, 22]. However, these details can indeed affect
the nature of the stability-instability transition; for example, the transition
exhibits a cross-over from being very sharp (resembling a first-order phase
transition) for a random network to a more gradual change as the network
becomes more regular in the small-world regime [22].

5 Evolution of Robust Networks

This brings us to the issue of how complex networks can be stable at all, given
that the May–Wigner theorem seems to hold even for networks that have
structures similar to those seen in reality and where non-trivial dynamical
situations have also been considered. The solution to this apparent paradox
lies in the observation that most networks that we see around us did not
occur fully formed but emerged through a process of gradual evolution, where
stability with respect to dynamical fluctuations is likely to be one of the key
criteria for survival. In earlier work, we have shown that a simple model,
where nodes are gradually added to or removed from a network according
to whether this results in a dynamically stable network or not, leads to a
non-equilibrium steady state in which the network is extremely robust [30].
The robustness is manifested by increased resistance and resilience, as well as
decreased probability of large extinction cascades, when the network size (i.e.,
the system diversity) is increased. Thus, our results reconcile the apparently
contradictory conclusions of the May–Wigner theorem and a large number of
empirical studies.
More recently, we have shown that model networks can evolve many of
the observed structural features seen among networks in the natural world,
by taking into account the fact that the majority of such systems must opti-
mize between several (often conflicting) constraints, which may be structural
as well as dynamical in nature. In particular, most networks need to have
high communication efficiency (i.e., low average path length) and low connec-
tivity (to reduce the resource cost involved in maintaining many links) while
being stable with respect to dynamical perturbations. If a network satisfied
only the first two constraints, the optimal structure would have been that of
a star (Fig. 6). Even if the resource cost constraint is somewhat relaxed, so
that the network can have more links than the minimum necessary to make it

Fig. 6. Networks with (I) star and (II) clustered star connection topologies can form
the fundamental building blocks of different types of modular networks. Network con-
figurations with clustered star modules can be constructed by (A) connecting different
modules by single undirected links among the hub nodes, or (B) connecting nodes of a
module to another module only through the hub node of the latter, or (C) connecting
nodes of a module randomly to any node of another module.

connected, the resulting optimal configuration is slightly modified to that of a
“clustered” star. However, we note that the dynamical equilibria in such sys-
tems would be extremely unstable with respect to small perturbations. This
happens because the rate of growth of small perturbations is related to the
maximum degree of the network, which, in the case of a star or a clustered star,
is almost identical to the system size. It is easy to see that dividing the network
into multiple stars, connected to each other, will reduce the maximum degree
and hence increase the stability. Indeed, our results show that simultaneous
optimization of all three constraints results in networks with modular struc-
ture, i.e., subnetworks with a high density of connections within themselves
compared to between distinct subnetworks, where each module possesses a
prominent hub [17] (see Fig. 6 for possible configurations of such modular
networks). As these evolved systems also exhibit heterogeneous degree dis-
tribution, our findings have implications for a wide range of systems in the
biological and technological worlds where such features have been observed.
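
The effect of splitting a star into several connected stars can be illustrated numerically. In the sketch below, the largest eigenvalue of the adjacency matrix is used as a rough proxy for the growth rate of small perturbations; this proxy is an assumption made for illustration (the text only states that the growth rate is related to the maximum degree), and joining the hubs in a ring is just one hypothetical way of connecting the modules.

```python
import numpy as np

def star(n):
    """Adjacency matrix of a star: node 0 is the hub, linked to all other nodes."""
    a = np.zeros((n, n))
    a[0, 1:] = 1.0
    a[1:, 0] = 1.0
    return a

def connected_stars(n_modules, module_size):
    """Several stars whose hub nodes are joined in a ring (illustrative layout)."""
    n = n_modules * module_size
    a = np.zeros((n, n))
    hubs = [m * module_size for m in range(n_modules)]
    for h in hubs:
        a[h:h + module_size, h:h + module_size] = star(module_size)
    for h, g in zip(hubs, hubs[1:] + hubs[:1]):
        a[h, g] = a[g, h] = 1.0
    return a

for name, a in [("single star", star(64)), ("four linked stars", connected_stars(4, 16))]:
    max_degree = int(a.sum(axis=1).max())
    top_eigenvalue = np.linalg.eigvalsh(a).max()
    print(f"{name}: max degree = {max_degree}, largest eigenvalue = {top_eigenvalue:.3f}")
```

Both networks have 64 nodes, but the modular arrangement has a much smaller maximum degree and a smaller leading eigenvalue, in line with the argument that modularity reduces the influence of any single hub.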

Acknowledgments

I would like to thank my collaborators with whom the work described here
has been carried out, in particular, R. K. Pan, S. Sinha, N. Chatterjee, M.
Brede, C. C. Wilmers, J. Saramäki and K. Kaski, as well as S. Vemparala,
D. Kumar, K. V. S. Rao and B. Saha for helpful discussions.

References
1. Achard, S., Salvador, R., Whitcher, B., Suckling, J., Bullmore, E.: A resilient,
low-frequency, small-world human brain functional network with highly connected
association cortical hubs. J. Neurosci., 26, 63–72 (2006)
2. Aftabuddin, M., Kundu, S.: Hydrophobic, hydrophilic and charged amino acid
networks within protein. Biophys. J., 93, 225–231 (2007)
3. Albert, R., Barabási, A.L.: Emergence of scaling in random networks. Science,
286, 509–512 (1999)
4. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod.
Phys., 74, 47–97 (2002)
5. Barrat, A., Weigt, M.: On the properties of small-world network models. Eur.
Phys. J.B, 13, 547–560 (2000)
6. Brede, M., Sinha, S.: Assortative mixing by degree makes a network more unstable.
Arxiv preprint, cond-mat/0507710 (2005)
7. Chatterjee, N., Sinha, S.: Understanding the mind of a worm: Hierarchical network
structure underlying nervous system function in C. elegans. Prog. Brain Res., 168,
145–153 (2007)
8. Deem, M.W.: Mathematical adventures in biology. Physics Today, 60(1), 42–47
(2007)
9. Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of Networks: From Biological Nets
to the Internet and WWW. Oxford Univ. Press, Oxford (2003)
10. Haliloglu, T., Bahar, I., Erman, B.: Gaussian dynamics of folded proteins. Phys.
Rev. Lett., 79, 3090–3093 (1997)
11. Krause, A.E., Frank, K.A., Mason, D.M., Ulanowicz, R.U., Taylor, W.W.: Com-
partments revealed in food-web structure. Nature, 426, 282–284 (2003)
12. Kumar, D., Srikanth, R., Ahlfors, H., Lahesmaa, R., Rao, K.V.S.: Capturing cell-
fate decisions from the molecular signatures of a receptor-dependent signaling
response. Molecular Systems Biology, 3, 150 (2007)
13. Kuo, A., Gulbis, J.M., Antcliff, J.F., Rahman, T., Lowe, E.D., Zimmer, J., Cuth-
bertson, J., Ashcroft, F.M., Ezaki, T., Doyle, D.A.: Crystal structure of the potas-
sium channel KirBac1.1 in the closed state. Science, 300, 1922–1926 (2003)
14. May, R.M.: Stability and Complexity in Model Ecosystems. Princeton Univ. Press,
Princeton (1973)
15. Newman, M.E.J.: Assortative mixing in networks. Phys. Rev. Lett., 89, 208701
(2002)
16. Newman, M.E.J.: The structure and function of complex networks. SIAM Review,
45, 167–256 (2003)
17. Pan, R.K., Sinha, S.: Modular networks emerge from multiconstraint optimization.
Phys. Rev. E, 76, 045103(R) (2007)
18. Pan, R.K., Sinha, S.: The small world of modular networks. Arxiv preprint,
arXiv:0802.3671 (2008)
19. Saramäki, J., Kaski, K.: Modelling development of epidemics with dynamic small-
world networks. J. Theor. Biol., 234, 413–421 (2005)
20. Schmidt-Nielsen, K.: Scaling: Why is Animal Size So Important? Cambridge Univ.
Press, Cambridge (1984)
21. Sen, P., Dasgupta, S., Chatterjee, A., Sreeram, P.A., Mukherjee, G., Manna, S.S.:
Small-world properties of the Indian railway network. Phys. Rev. E, 67, 036106
(2003)
22. Sinha, S.: Complexity vs. stability in small-world networks. Physica A, 346, 147–
153 (2005)
23. Sinha, S., Sinha, S.: Evidence of universality for the May-Wigner stability theorem
for random networks with local dynamics. Phys. Rev. E, 71, 020902(R) (2005)
24. Sinha, S., Sinha, S.: Robust emergent activity in dynamical networks. Phys. Rev.
E, 74, 066117 (2006)
25. Sinha, S., Saramäki, J., Kaski, K.: Emergence of self-sustained patterns in small-
world excitable media. Phys. Rev. E, 76, 015101(R) (2007)
26. Steele, A.J., Tinsley, M., Showalter, K.: Spatiotemporal dynamics of networks of
excitable nodes. Chaos, 16, 015110 (2006)
27. Strogatz, S.H.: Exploring complex networks. Nature, 410, 268–276 (2001)
28. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature,
393, 440–442 (1998)
29. Weng, G., Bhalla, U.S., Iyengar, R.: Complexity in biological signaling systems.
Science, 284, 92–96 (1999)
30. Wilmers, C.C., Sinha, S., Brede, M.: Examining the effects of species richness on
community stability: An assembly model approach. Oikos, 99, 363–367 (2002)
31. Wilmers, C.C.: Understanding ecosystem robustness. Trends Ecol. Evol., 22,
504–506 (2007)
Regulation of Apoptosis via the NFκB
Pathway: Modeling and Analysis

Madalena Chaves,1 Thomas Eissing,2 and Frank Allgöwer3


1 COMORE, INRIA, 2004 Route des Lucioles, BP 93, 06902 Sophia-Antipolis,
  France; mchaves@sophia.inria.fr
2 Bayer Technologies Services GmbH, PT-AS Systems Biology, Germany;
  thomas.eissing@bayertechnology.com
3 Institute for Systems Theory and Automatic Control, University of Stuttgart,
  Pfaffenwaldring 9, 70550 Stuttgart, Germany; allgower@ist.uni-stuttgart.de

1 Introduction

Programmed cell death (or apoptosis) has an essential biological function, en-
abling successful embryonic development, as well as maintenance of a healthy
living organism [6]. Apoptosis is a physiological process which enables an
organism to remove unwanted or damaged cells. Malfunctioning apoptotic
pathways can lead to many diseases, including cancer and inflammatory or
immune system related problems. A family of proteins called caspases is
primarily responsible for execution of the apoptotic process: basically, in re-
sponse to appropriate stimuli, initiator caspases (for instance, caspases 8, 9)
activate effector caspases (for instance, caspases 3, 7), which will then cleave
various cellular substrates to accomplish the cell death process [22].
Nuclear factor κB (NFκB) is a transcription factor for a large group of
genes which are involved in several different pathways. For instance, NFκB
activates its own inhibitor (IκB) [14] as well as groups of pro-apoptotic and
anti-apoptotic genes [21]. Among the latter, NFκB activates transcription of
a gene encoding for inhibitor of apoptosis protein (IAP). This protein in turn
helps to downregulate the activity of the caspase cascade which forms
the core of the apoptotic pathway [6, 8].
The canonical NFκB pathway is induced, among other stimuli, by the
cytokine tumor necrosis factor α (TNFα) [21]. Binding of TNFα to death
receptor TNFR1 forms a first complex which eventually activates NFκB.
A second complex is later formed, which will activate the initiator caspase
8 [6], and hence activate the apoptotic process. The same signal (TNFα stim-
ulation) thus triggers two parallel but contrary pathways: the pro-apoptotic
caspase cascade and the anti-apoptotic NFκB-IκB-IAP pathway. These two
pathways, together with the interactions among their components, form a

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,


Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3 2,

c Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
20 M. Chaves et al.

complex network which shapes the decision on cell survival or initiation of
programmed cell death. To contribute to a better understanding of the role
of NFκB in the regulation of apoptosis, we propose a qualitative study of this
system and its dynamics, based on a discrete (Boolean) model of the complex
network. This discrete model closely follows a continuous one, recently
developed and studied in [23, 24]. The model integrates a well-known model of
the NFκB pathway [17] with a model of the caspase cascade [8].
Boolean models provide a convenient formalism to describe protein and
gene networks [25]. The states of the network components (e.g., proteins or
messenger RNAs) are characterized as “expressed” or “not expressed” and
are represented by logical variables (with values 0 or 1). The interactions
among the various components are classified as “inhibition” or “activation”
links (these can generally be deduced from gene/protein expression data).
Boolean models thus describe the network structure of a system without in-
volving any kinetic details. The qualitative behaviour of a system can be seen
as an emergent property of this structure. Boolean models are especially use-
ful in the case of large networks [1, 9], for which kinetic parameters are often
unknown, but qualitative properties such as generation of specific gene expres-
sion patterns, stability or multistability, and oscillatory modes can be studied.
Several methods have been developed for analysis of discrete and qualitative
models [2, 5, 7, 13, 26]. Using an approach which combines discrete rules with
continuous degradation rates, our model reproduces many of the known prop-
erties of the system, notably the oscillatory dynamics that can be induced
by the NFκB-IκB negative feedback loop [14, 15, 19]. We explore different
configurations for the network structure and predict their effects on the decision
between cell survival and apoptosis.

2 The Model

The network of interactions among the NFκB pathway and the apoptosis sig-
naling cascade to be studied here is shown in Fig. 1. The various components
of the network (here messenger RNAs, proteins, or protein complexes) form
the set of variables or nodes (Xi , i = 1, . . . , n) of the Boolean model. The
system will evolve according to a set of logical rules which are deduced from
the interactions or links depicted in the schematic diagram of Fig. 1. The in-
teractions among nodes can be classified as “activation” or “inhibition” links:
a directed arrow Xi → Xj means that a high concentration of component
Xi activates component Xj , while the symbol Xi ⊣ Xj means that a high
concentration of component Xi inhibits Xj .
The components in our model and the activation or inhibition links
among them are based on existing literature data. For general aspects, the
reviews [6, 21] were used. However, some pathways of regulation among the
NFκB pathway and the caspase cascade are not yet clear, and more work is
needed to understand how these two signaling pathways are interconnected.
In this chapter, we aim to investigate and test several possible hypotheses for
Regulation of Apoptosis via the NFκB Pathway: Modeling and Analysis 21

Fig. 1. Schematic diagram of the NFκB pathway and the caspase cascade (light
shaded regions). The oval dark grey shaded region represents the cellular nucleus.
Both pathways are activated by binding of TNFα to death receptor TNFR1 (the
resulting complex is represented simply by the rectangle TNF). Messenger RNAs are
represented by ellipses, while transcription factors, caspases, and other proteins are
represented by squares. To study the interconnections between the two pathways, four
network variants, based on different combinations of the links A, L, and C, will be
analysed and compared (see Table 2).

the combined network structure. We will consider four model variants and
try to discriminate between them by comparing our numerical analysis with
experimental data from the literature. The four network variants (see Table 2)
are based on different combinations of three links (A, L, C in Fig. 1) which
have been suggested but are not fully established in the apoptosis literature.
The NFκB pathway follows very closely the model presented in [17]. Stim-
ulation of death receptors with TNFα leads (see for instance [6]), first, to the
formation of a complex I (T1 in Fig. 1) which will recruit and activate inhibitor
of IκB kinases (IKK). Inhibitor of NFκB, or IκB, acts by binding to NFκB
molecules and preventing their transcriptional function. Active IKK (IKKa)
phosphorylates IκB which releases NFκB, thus enabling its translocation to
the nucleus and transcription of NFκB-dependent genes, including genes for
inhibitor of apoptosis protein (iap), inhibitor of NFκB (iκB), a protein
associated with inhibition of complex T2 (flip), and a protein regulating IKK
activity (a20) [21]. Transcription of IκB mRNA generates a negative feedback
loop in the NFκB pathway [14, 20], which may lead to oscillatory behaviour
in NFκB and IκB concentrations [19]. In a second step, after dissociation of
components of complex I from the death receptor, a second complex is formed
(T2 in Fig. 1) which will recruit and activate initiator caspase 8 (C8a). As a
result of the signaling cascade [8, 22], effector caspase 3 is also activated (C3a).
Thus, complex T1 activates the anti-apoptotic pathway and, after a certain
delay, complex T2 activates the pro-apoptotic pathway.
Two well-documented points of regulation of the apoptotic pathway by
NFκB are inhibition of C3a by IAP and regulation of complex T2 by FLIP [6].
Active caspase 8 was found to be negatively regulated by caspase-8 and
caspase-10-associated RING proteins (CARPs) [18], which seem to play a
role analogous to that of IAPs, but are less well studied. It was found that CARPs
are overexpressed in tumors, and that their suppression leads to restoration of
the apoptotic pathway, with the CARP being rapidly cleaved. In addition, it
was observed that inhibitors of caspase 3 block CARP cleavage. In our model,
we introduced CARP and a pre-complex CARP0 , which is inhibited by C3a.
Inhibition by C3a is, however, not sufficient to control CARP, and there are
probably other regulators. Since CARP plays a role with respect to caspases 8
and 10 similar to that of IAP with respect to caspases 3 and 9 (and in the
absence of further details),
we assume that the pre-complex CARP0 is also regulated by a product of the
NFκB pathway.
The points where the caspase cascade influences the NFκB pathway are
less well documented. We will use our model to test different hypotheses by
studying and comparing the network dynamics for the following cases (see
also Table 2): inhibition of IKKa (link L) and/or NFκB (link A) by C3a, or
neither of these links present.
To obtain the logical rules shown in Table 1, some simplifications of the
biological processes were inevitably introduced. For instance, the bound com-
plex NFκB−IκB (either in the cytoplasm or in the nucleus) was not explicitly
considered in the system, but was simply treated as an inhibition effect: the
rule for NFκB says that it vanishes whenever IκB is expressed. Thus, any
state with NFκB = 0 and IκB = 1 represents in fact a high concentration of
bound complex NFκB − IκB, while any state with NFκB = 1 and IκB = 0
represents a high concentration of free NFκB and low concentration of free
IκB. To translate our diagram into a set of logical rules, the convergence of
two or more arrows (either activation or inhibition) at the same node was al-
ways treated as a logical AND, except in three cases: IκB, IAP, and CARP0 .
For these proteins, the overall effect was treated as an AND in the presence of
TNF stimulation, but treated as an OR in the absence of TNF. These three
proteins represent inhibitors whose levels should be stable in the absence of
any stimulus [8]: IAP and CARP0 (or CARP) should be effective inhibitors of
the caspases, and IκB should be at approximately constant levels to control
NFκB transcriptional activity. In contrast, with TNF stimulation, the degra-
dation rates of these proteins can vary and lead to rapid changes in their
concentrations (different degradation rates in the presence or absence of TNF
have been observed, notably for bound IκB [20]). For instance, under TNF
treatment, the rule for inhibition of NFκB is simplified to IκB+ = [iκB and not
IKKa]. Suppose that IKK becomes activated at time t1 , that is IKKa(t1 ) = 1.
Then, in the next iteration of the model, the IκB rule implies that IκB will
degrade very fast, with IκB(t1 +Δ) = 0. In contrast, in the absence of the TNF
stimulus, the rule is IκB+ = [iκB or not IKKa]. If IKK becomes active at time
t1 , one has IκB(t1 + Δ) = iκB(t1 ), meaning that IκB is only rapidly degraded
if no more of its messenger RNA is available. A similar reasoning justifies the
rules for IAP and CARP0 . The rules for these three proteins with inhibiting
roles reflect the fact that their degradation rates, and hence turnover, can be
much faster in response to TNF stimulation.
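The switch between AND and OR logic for IκB can be made concrete in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function name `ikb_next` and its argument names are chosen here for illustration, and the `tnf` flag stands in for the node T1 that gates the rule in Table 1.

```python
def ikb_next(tnf: bool, ikb_mrna: bool, ikka: bool) -> bool:
    """Next value of the IkB protein node (illustrative encoding of the rule).

    With TNF stimulation the two inputs are combined with AND, so active IKK
    removes IkB in one iteration. Without TNF they are combined with OR, so
    IkB persists as long as its messenger RNA is available.
    """
    if tnf:
        return ikb_mrna and not ikka
    return ikb_mrna or not ikka


# IKK has just become active (ikka = True) while the mRNA is still present.
print(ikb_next(tnf=True, ikb_mrna=True, ikka=True))    # False: fast degradation
print(ikb_next(tnf=False, ikb_mrna=True, ikka=True))   # True: follows the mRNA
print(ikb_next(tnf=False, ikb_mrna=False, ikka=True))  # False: no mRNA left
```

The same gating pattern, with C3a as the inhibiting input, would apply to the rules for IAP and CARP0.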

3 Analysis of Boolean Models

Boolean networks are a representation of a system, consisting of a set of n
variables or nodes X = (X1 , . . . , Xn ), together with a set of logical rules
(Fi (X), i = 1, . . . , n) describing the evolution of the system from the current
state (Xi at time t) to the next state (Xi at time t + Δ). The variables
or nodes take values in the discrete set {0, 1}, where 1 (resp., 0) denotes
the “expressed” (resp., “not expressed”) state of the node. The associated
rules are typically a composition of logical OR and AND functions, which can
be determined from gene/protein expression patterns (from Western blots or
microarray data, for instance). The set of rules Fi given in Table 1 for the
NFκB pathway and the caspase cascade is a translation of the diagram shown
in Fig. 1. The temporal evolution of the system, X(t), t ∈ (0, ∞), is determined
by successively iterating the logical rules Fi , for which several algorithms are
available. Synchronous algorithms assume that all nodes are simultaneously
updated:

\[ X_i^{+} = F_i(X_1, \dots, X_n), \quad i = 1, \dots, n, \tag{1} \]

where Xi ∈ {0, 1}, X = (X1 , . . . , Xn ) denotes the state of the system at time t,
and X+ = (X1+ , . . . , Xn+ ) denotes the next state (at t + Δ). Alternatively, with
asynchronous algorithms, at each iteration the nodes are sequentially updated,
according to a given order (which can be prespecified or randomly chosen).
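The difference between the two update schemes can be seen on a very small example. The sketch below is illustrative only: the function names and the two-node negative feedback loop (X1+ = not X2, X2+ = X1) are assumptions made here and are not part of the model in Table 1.

```python
def sync_step(state, rules):
    """Synchronous update: every rule reads the same current state (Eq. 1)."""
    return tuple(int(bool(rule(state))) for rule in rules)


def async_step(state, rules, order):
    """Asynchronous update: nodes are updated one by one in the given order,
    so later nodes already see the new values of earlier ones."""
    s = list(state)
    for i in order:
        s[i] = int(bool(rules[i](s)))
    return tuple(s)


def orbit(step, state, n_steps):
    """Iterate a one-step map and collect the visited states."""
    visited = [state]
    for _ in range(n_steps):
        state = step(state)
        visited.append(state)
    return visited


# Hypothetical toy network: X1+ = not X2, X2+ = X1.
toy_rules = [lambda s: not s[1], lambda s: s[0]]

sync_orbit = orbit(lambda s: sync_step(s, toy_rules), (0, 0), 4)
async_orbit = orbit(lambda s: async_step(s, toy_rules, (0, 1)), (0, 0), 2)
print(sync_orbit)   # [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(async_orbit)  # [(0, 0), (1, 1), (0, 0)]
```

For this toy loop the synchronous scheme visits four states before returning to the start, while the asynchronous scheme with order (X1, X2) visits only two, which illustrates that the choice of update algorithm can change the observed dynamics.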
Discrete models focus on the structure of the network (links), thus offering
a more qualitative description of the system’s dynamics. Continuous models
may offer more detailed descriptions of a system, but they also have the dis-
advantage of involving a large set of kinetic parameters, many of which are
unknown. A method for analysis of Boolean models was introduced in [12, 13],
which provides a bridge between discrete and continuous approaches. In this
method, each node Xi of the network is represented by one continuous
variable (xi ) and one discrete variable (Xi , as before). The continuous variables are

Table 1. Boolean rules for the model of regulation of apoptosis via the NFκB pathway.
TNF is a constant input. Identification of the nodes is given in the text. The letter “a”
juxtaposed to a variable name denotes the active form of a molecule. The subscript
“nuc” denotes the given component in the cellular nucleus. Alternative rules are given
for the presence/absence of links A, C, L.

Node        Boolean rule

T1+         TNF
T2+         T1 and not FLIP
IKKa+       {L}     T1 and not A20a and not C3a
            {no L}  T1 and not A20a
NFκB+       {A}     not IκB and not C3a
            {no A}  not IκB
NFκBnuc+    NFκB and not IκBnuc
iκB+        NFκBnuc
IκB+        [T1 and (iκB and not IKKa)] or [not T1 and (iκB or not IKKa)]
IκBnuc+     IκB
a20+        NFκBnuc
A20+        a20
A20a+       T1 and A20
iap+        NFκBnuc
IAP+        [T1 and (iap and not C3a)] or [not T1 and (iap or not C3a)]
flip+       NFκBnuc
FLIP+       flip
C3a+        not IAP and C8a
C8a+        {C}     not CARP and (C3a or T2 )
            {no C}  C3a or T2
CARP0+      [T1 and (NFκBnuc and not C3a)] or [not T1 and (NFκBnuc or not C3a)]
CARP+       CARP0

governed by ordinary differential equations, which combine a synthesis rate
(based on its Boolean rule) and a linear degradation rate:

\[ \frac{dx_i}{dt} = -a_i x_i + b_i F_i(X_1, X_2, \dots, X_n), \quad i = 1, \dots, n. \tag{2} \]

At each instant t, the discrete variable Xi is defined as a function of the
continuous variable according to a threshold value of its maximal concentration:

\[ X_i(t) = \begin{cases} 0, & x_i(t) \le \theta_i \dfrac{b_i}{a_i}, \\[4pt] 1, & x_i(t) > \theta_i \dfrac{b_i}{a_i}, \end{cases} \tag{3} \]

where θi ∈ (0, 1) represents the fraction of maximal concentration which is
necessary for component Xi to become “active” and perform its biological
functions. Initial conditions are equal for discrete and continuous variables:
Xi (0) = xi (0). It is easy to see that the hypercube [0, b1 /a1 ] × · · · × [0, bn /an ]
is an invariant set for system (2). The continuous variables denote concentra-
tions of molecules; they are translated into a Boolean 0/1 response according
to θi . The discrete variables Xi represent expression (1) or non-expression (0)
of species i, according to whether its continuous concentration xi is above or
below the threshold θi bi /ai . Letting the parameters ai , bi , and θi be specific
for each node i allows us to study different time scales for different biological
processes (for instance, transcription, translation, or post-translational pro-
cesses, as in [5]), or investigate the relative turnover rates of two molecules.
Similar piecewise linear systems have also been studied in [7, 26].
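Between two consecutive threshold crossings the Boolean inputs are constant, so each equation in (2) is linear and can be integrated in closed form. The following worked solution is a sketch; the symbols t_0 (time of the most recent switch), t_1 (time of the next switch), and x_i^* (current target level) are notation introduced here and do not appear in the original text.

```latex
% Target level of node i while the discrete state X(t_0) is held fixed:
x_i^{*} = \frac{b_i}{a_i}\, F_i\bigl(X(t_0)\bigr) \in \Bigl\{0,\ \frac{b_i}{a_i}\Bigr\}

% Solution of dx_i/dt = -a_i x_i + b_i F_i on the interval t_0 \le t < t_1:
x_i(t) = x_i^{*} + \bigl(x_i(t_0) - x_i^{*}\bigr)\, e^{-a_i (t - t_0)}
```

Each x_i therefore relaxes monotonically towards either 0 or b_i/a_i, which is one way to see why a trajectory starting in the hypercube [0, b1/a1] × · · · × [0, bn/an] cannot leave it.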

3.1 Steady States

The steady states of a Boolean model are given by all the possible solutions
X ∗ of the equations:

Xi∗ = Fi (X1∗ , . . . , Xn∗ ), i = 1, . . . , n.

It is easy to see that any steady state of the Boolean model yields a steady
state of the piecewise linear equations (2), since

\[ \frac{dx_i}{dt} = 0 \;\Leftrightarrow\; x_i = \frac{b_i}{a_i} F_i(X_1, X_2, \dots, X_n), \quad i = 1, \dots, n, \]
independently of θi . Because the right-hand side of this equation is discontin-
uous, it is difficult to provide general results on the existence and uniqueness
of solutions for system (2) (see for instance [3] and [11]). In view of this dif-
ficulty, in the present study we will assume that trajectories are well defined
and analyze their dynamical behavior.
For the model of Table 1, the steady states depend on the value of TNF
(see Table 2). It is not difficult to check that (both with and without link A)
there are exactly two distinct steady states when TNF = 0, characterized by
the presence or absence of caspases 3 and 8, and hence corresponding to the
survival or apoptotic responses (nodes not indicated below are zero):

(Ap0 ) T1 = T2 = 0, C3a = C8a = 1, IκB = IκBnuc = 1, (4)


(Lf0 ) T1 = T2 = 0, IκB = IκBnuc = 1, CARP0 = CARP = IAP = 1.

This is in agreement with the idea that, under typical conditions, the cell
should be capable of stably maintaining either an apoptotic or a survival

Table 2. Steady states of the Boolean model, for each model variant, in the presence
and absence of TNF.
Model   Links            TNF = 0      TNF = 1   Oscillations?
I       A, C, no L       Ap0 , Lf0    Ap1       Yes
II      L, C, no A       Ap0 , Lf0    none      Yes
III     C, no A, no L    Ap0 , Lf0    none      Yes
IV      L, no A, no C    Ap0 , Lf0    none      Yes
state [8, 4]. If TNF = 1, there is only one possible steady state for models
with link A:

(Ap1 ) T1 = T2 = 1, C3a = C8a = 1. (5)

For models with no link A, there is no possible steady state when TNF = 1,
and there are only periodic orbits of period higher than 1.
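The steady-state counts can be checked by direct enumeration. The sketch below encodes the rules of Table 1 in Python and searches for fixed points. It is one possible encoding, not the authors' code: the ASCII node names, the helper `gated`, and the functions `step` and `steady_states` are introduced here for illustration. To keep the search small, nodes whose rule simply copies another node are set equal to their source, which every fixed point must satisfy anyway.

```python
from itertools import product

NODES = ("T1", "T2", "IKKa", "NFkB", "NFkBnuc", "ikb", "IkB", "IkBnuc",
         "a20", "A20", "A20a", "iap", "IAP", "flip", "FLIP",
         "C3a", "C8a", "CARP0", "CARP")
# Nodes that are not plain copies of another node (and are not the input T1).
FREE = ("T2", "IKKa", "NFkB", "NFkBnuc", "IkB", "A20a",
        "IAP", "C3a", "C8a", "CARP0")


def step(s, tnf, links):
    """One synchronous update of Table 1; `links` is a string such as "AC"."""
    (T1, T2, IKKa, NFkB, NFkBnuc, ikb, IkB, IkBnuc, a20, A20, A20a,
     iap, IAP, flip, FLIP, C3a, C8a, CARP0, CARP) = s
    L, A, C = "L" in links, "A" in links, "C" in links

    def gated(act, inh):
        # AND when T1 = 1, OR when T1 = 0 (rules for IkB, IAP, CARP0).
        return (act and not inh) if T1 else (act or not inh)

    nxt = (tnf,                                   # T1
           T1 and not FLIP,                       # T2
           T1 and not A20a and not (L and C3a),   # IKKa
           not IkB and not (A and C3a),           # NFkB
           NFkB and not IkBnuc,                   # NFkBnuc
           NFkBnuc,                               # ikb (mRNA)
           gated(ikb, IKKa),                      # IkB
           IkB,                                   # IkBnuc
           NFkBnuc,                               # a20 (mRNA)
           a20,                                   # A20
           T1 and A20,                            # A20a
           NFkBnuc,                               # iap (mRNA)
           gated(iap, C3a),                       # IAP
           NFkBnuc,                               # flip (mRNA)
           flip,                                  # FLIP
           not IAP and C8a,                       # C3a
           not (C and CARP) and (C3a or T2),      # C8a
           gated(NFkBnuc, C3a),                   # CARP0
           CARP0)                                 # CARP
    return tuple(int(bool(v)) for v in nxt)


def steady_states(tnf, links):
    """All states s with step(s) == s, for a constant input TNF."""
    found = []
    for bits in product((0, 1), repeat=len(FREE)):
        v = dict(zip(FREE, bits))
        v["T1"] = tnf
        for mrna in ("ikb", "a20", "iap", "flip"):
            v[mrna] = v["NFkBnuc"]
        v["A20"], v["FLIP"] = v["a20"], v["flip"]
        v["IkBnuc"], v["CARP"] = v["IkB"], v["CARP0"]
        s = tuple(v[name] for name in NODES)
        if step(s, tnf, links) == s:
            found.append(s)
    return found


VARIANTS = {"I": "AC", "II": "LC", "III": "C", "IV": "L"}
for name, links in VARIANTS.items():
    print(name, [len(steady_states(tnf, links)) for tnf in (0, 1)])
# I [2, 1]
# II [2, 0]
# III [2, 0]
# IV [2, 0]
```

With this encoding, every variant has two steady states for TNF = 0, and only variant I has a steady state for TNF = 1, in line with Table 2. In that state IKKa also evaluates to 1 under the no-L rule as encoded here, since T1 = 1 and A20a = 0.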
Therefore, during TNF treatment, models with link A may at any time
make a decision towards the apoptotic pathway, while models with no link
A will exhibit oscillatory behaviour and can only make a decision when TNF
treatment ceases. Upon removal of TNF stimulation, trajectories of system (2)
may be expected to converge to either the apoptotic or survival state. The
choice of one or the other state will depend on the initial condition and the
set of parameters ai , bi , and θi . Since these parameters are very likely to
vary from cell to cell, it is reasonable to consider several (randomly chosen)
sets of parameters and then compute the probability of convergence to each
steady state. To examine the dynamics of system (2), and its dependence on
parameters and the structure of the network of interactions, several numerical
studies were performed, as described next.

3.2 Numerical Experiments

To test the model and analyse the effects of links A and L (Fig. 1), system (2)
was simulated several times, with randomly chosen sets of parameters. For
simplicity, the synthesis rates and threshold constants were fixed (bi = 1 and
θi = 0.5 for all i), and only parameters ai were allowed to vary, chosen from
a uniform distribution in the interval [1/3, 3] (h⁻¹). This seems reasonable, as
the degradation rates used in [17] are roughly between 0.5 and 4 h⁻¹. Observe
that ai plays a double role: it represents a degradation rate, but also defines the
0/1 threshold concentration (0.5/ai ). Hence, high degradation rates also imply
that a lower concentration is needed to achieve the 0/1 transition. Different
durations of TNF stimulation were considered, namely: 2, 6, 11, 16, and 21
hours. For these simulations, one initial condition was chosen: IκB(0) = 1 and
all other nodes set to zero. This is based on a natural physiological starting
point of the system: previous to stimulation, IKK is in its inactive form, while
IκB is bound to NFκB, preventing transcriptional activity. Caspases reside in
the cytosol in dormant forms [22].
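The double role of ai can be made explicit with a short worked calculation. It is a sketch under the assumption that node i starts from xi = 0 and that its Boolean rule stays at Fi = 1 until the threshold is reached; the symbol t_on for the switch-on time is introduced here.

```latex
% Solution of (2) from x_i(0) = 0 with F_i = 1:
x_i(t) = \frac{b_i}{a_i}\bigl(1 - e^{-a_i t}\bigr)

% Setting x_i(t_{\mathrm{on}}) equal to the threshold \theta_i b_i / a_i of (3):
1 - e^{-a_i t_{\mathrm{on}}} = \theta_i
\quad\Longrightarrow\quad
t_{\mathrm{on}} = \frac{-\ln(1 - \theta_i)}{a_i}

% With \theta_i = 0.5 and a_i \in [1/3,\, 3]\ \mathrm{h}^{-1}:
t_{\mathrm{on}} = \frac{\ln 2}{a_i} \in [0.23,\ 2.08]\ \mathrm{h}
```

Under these assumptions the switch-on time does not depend on bi, so sampling ai alone already spreads the response times of individual nodes over roughly one order of magnitude.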
To understand the importance of the links A, C, and L (the least well
documented), four variants of the model depicted in Fig. 1 are compared: (I)
links A and C present, (II) links L and C present, (III) only link C present,
and (IV) only link L present (as listed in Table 2). The first three variants aim
at comparing the effects of links A and L, and the last aims at evaluating the
effect of link C. Other alternatives gave similar results (for example, a model
with all three links gave results very similar to I) and thus are not detailed
here. For each variant, the response of the system to each of the five TNF
durations was simulated 500 times. Since different sets of parameters {ai }
introduce different time scales, variations in the dynamics from one simulation
to another are expected. These variations may also be interpreted as a result
of natural variability in biological systems. The average response over the 500
simulations will then yield the probability of the system converging towards
each of the steady states.
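The sampling procedure can be sketched on a deliberately small stand-in network. The code below is not the apoptosis model of Table 1: it uses two mutually inhibiting nodes (a hypothetical toy system with two stable outcomes), a forward-Euler discretization of (2) chosen here for simplicity, and illustrative names such as `simulate` and `outcome_fractions`. Only the sampling scheme (bi = 1, θi = 0.5, ai uniform in [1/3, 3], 500 runs) follows the text.

```python
import random


def simulate(rules, a, t_end, dt=0.01, b=1.0, theta=0.5):
    """Integrate Eq. (2) with forward Euler, thresholding as in Eq. (3).

    All nodes start at zero. Returns the final discrete state as a list.
    """
    n = len(rules)
    x = [0.0] * n
    X = [0] * n
    for _ in range(int(round(t_end / dt))):
        F = [int(bool(rule(X))) for rule in rules]
        x = [x[i] + dt * (-a[i] * x[i] + b * F[i]) for i in range(n)]
        X = [int(x[i] > theta * b / a[i]) for i in range(n)]
    return X


# Hypothetical toy network with two stable outcomes: X1+ = not X2, X2+ = not X1.
toy_rules = [lambda X: not X[1], lambda X: not X[0]]


def outcome_fractions(n_runs=500, seed=0):
    """Estimate the probability of each outcome over random degradation rates."""
    rng = random.Random(seed)
    counts = {"node1": 0, "node2": 0, "undecided": 0}
    for _ in range(n_runs):
        a = [rng.uniform(1 / 3, 3) for _ in toy_rules]
        X = simulate(toy_rules, a, t_end=5.0)
        if X == [1, 0]:
            counts["node1"] += 1
        elif X == [0, 1]:
            counts["node2"] += 1
        else:
            counts["undecided"] += 1  # near-ties that have not settled
    return {key: value / n_runs for key, value in counts.items()}


print(simulate(toy_rules, [3.0, 1.0], t_end=5.0))  # [1, 0]: faster node wins
print(outcome_fractions())
```

In this toy system the node with the larger degradation rate crosses its threshold first and switches the other off, so the estimated fractions reflect how the random time scales decide between the two steady states. For the full model, the same loop would wrap a simulation of the Table 1 rules with the TNF input switched off after the chosen stimulation time.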
Other open questions that may be studied with our model include com-
petition between the pro- and anti-apoptotic pathways and the point of ir-
reversibility of the apoptotic decision. For instance, how long after caspase
activation is recovery from the apoptotic pathway still possible [22]? To
address these questions, numerical experiments were conducted by letting
NFκB(0) = 1, setting all others to zero, and maintaining C3a(t) = 1 for
durations of 10, 30, 60, and 360 minutes.
For analysis of the numerical results, a “peak” in the trajectory of node
Xj will be defined as a time interval [T0 , T1 ], during which Xj (t) = 1, and such
that Xj (T0 − Δ) = Xj (T1 + Δ) = 0. The period of oscillations is calculated
as the average time interval between the onset of two consecutive peaks, i.e.,

\[ \text{Period} = \frac{1}{N_p - 1} \sum_{i=2}^{N_p} \left( T_{0,i} - T_{0,i-1} \right), \]

where Np is the number of peaks observed during the simulation time.
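The peak definition and the period formula translate directly into code. The sketch below assumes the discrete trajectory is available as a list of 0/1 samples with matching time stamps; the function names `find_peaks` and `mean_period` are chosen here for illustration.

```python
def find_peaks(times, X):
    """Return (T0, T1) for every complete peak: a run of X = 1 that is
    preceded and followed by a sample with X = 0."""
    peaks = []
    start = None
    for k in range(1, len(X)):
        if X[k - 1] == 0 and X[k] == 1:
            start = times[k]                     # onset T0 of a new peak
        elif X[k - 1] == 1 and X[k] == 0 and start is not None:
            peaks.append((start, times[k - 1]))  # last sample with X = 1
            start = None
    return peaks


def mean_period(peaks):
    """Average interval between the onsets of consecutive peaks."""
    if len(peaks) < 2:
        return None  # the period is undefined with fewer than two peaks
    onsets = [t0 for t0, _ in peaks]
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    return sum(gaps) / (len(peaks) - 1)


times = list(range(12))
X = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0]
peaks = find_peaks(times, X)
print(peaks)               # [(1, 2), (5, 5), (9, 10)]
print(mean_period(peaks))  # 4.0
```

A run of ones that is already under way at the first sample, or still under way at the last one, is not counted, because the definition requires a zero on both sides of the interval.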

4 Results and Discussion


In the numerical simulations, it is observed that, once TNF stimulation ceases,
a steady state pattern is always achieved, corresponding to either the
apoptosis (Ap0 ) or survival (Lf0 ) state in (4). In the former case IκB is bound to NFκB,
so that mRNAs and proteins downstream of NFκB are not expressed, and
the cell has chosen the apoptotic pathway. The latter case represents survival
of the cell, with IAP stably expressed preventing C3a activation, and CARP
preventing C8a activation (see Fig. 2). In the presence of TNF stimulation,
IκB, NFκB, and its dependent mRNAs/proteins may exhibit oscillatory dy-
namics, as observed experimentally in [14, 19]. In fact, computation of steady
states shows that the models with no link A have no alternative but to exhibit
oscillatory behaviour in the presence of TNF, since no possible steady states
exist (except possible special solutions of the associated differential inclusion).
The oscillatory behaviour (see analysis below) is in very good agreement with
the experimental data reported in [19].
Qualitatively, all model variants respond in a similar fashion to TNF stim-
ulation. As the stimulus duration increases, more cells choose the apoptotic
pathway. Testing the four model variants shows that link A is very strong:
not surprisingly, models with link A favour the apoptotic pathway, with 80%
of cells reaching the apoptotic state, as opposed to around 50% or 40% in
[Fig. 2 panels: two columns (left, right), each with rows TNF, IKK, IkBn, NFkBn, IAP, C8a, C3a; horizontal axis: Time (hours), 0 to 20; vertical axis: 0 to 1. Degradation rates shown in the panels (left / right): IKK 2.6733 / 1.1631; IkBn 1.898 / 2.9469; NFkBn 2.3488 / 2.5784; IAP 0.90041 / 1.8348; C8a 0.79962 / 2.5642; C3a 0.4439 / 0.69736.]

Fig. 2. Example of network dynamics with the hybrid model (variant II), corre-
sponding to cell survival (left) or apoptosis (right) solution. Numbers indicate the
degradation rates for these numerical experiments. Solid lines represent normalized
continuous variables (xi ) and dashed lines represent discrete variables (Xi ).

[Fig. 3 plot: Survival rate (%) versus TNF duration (hours), with one curve for each of the model variants I, II, III, and IV.]

Fig. 3. Percentage of surviving cells for the four model variants.

models II and IV, or 30% in the model with only link C (which favours the
anti-apoptotic pathway) (Fig. 3). These values appear to be in agreement with
experimental data: Rehm et al. [22] report that, for 8 hour treatments with
[Fig. 4 panels: top row, Average period (hours) versus TNF duration (hours) for models I, II, and III, with curves labelled survival and apoptosis; bottom row, TPeak i − TPeak i−1 (hours) versus Peak i for models I, II, and III.]

Fig. 4. Top row: Average period of nuclear IκB oscillations for apoptotic or surviving
cells, as a function of TNF stimulus duration. Vertical lines represent standard
deviation over the 500 numerical experiments. Bottom row: Relative timing of successive
peaks in IκB oscillations, for apoptotic (grey) or surviving (black) cells. The “+” signs
mark the experimental peak timing in [19].

high and low concentrations of TNFα, the percentage of cells undergoing ac-
tivation of effector caspases was, respectively, 86% and 24%. The numerical
experiments with our model capture the response to high (or significant)
concentrations of TNFα, so variant I (followed by II and IV) is closest to the
real system.
Quantitative analysis of the oscillatory behaviour reveals some interesting
facts (Fig. 4). To characterize the oscillatory dynamics, the following quan-
tities were computed for nuclear IκB: period of oscillations (approximated),
number of peaks, and relative timing between peaks. First, in all cells os-
cillations cease when TNF stimulation ceases, in agreement with observa-
tions. Second, the timing of successive peaks is also in remarkable quantitative
agreement with experimental data [19], see Fig. 4 (bottom row). The first peak
in nuclear IκB concentration was observed about 72 minutes from the start
of TNF stimulation, and the second peak appears about 4 hours later, very
close to the 75 minutes and 4.5 hours reported in [19]. It is striking that the
time span of the first peak is typically longer than that of the following peaks,
and that the time lapse between consecutive peaks decreases (see Figs. 2, 4).
Third, the average period of oscillations is fairly constant, but “depends” on
the apoptosis/survival decision. Statistical analysis of the period of oscillations
(calculated as indicated in Section 3.2) in nuclear IκB indicates that there is a
natural period (for TNF treatment longer than 3 hours) for cells that
eventually survived. This period is about 3.5 ± 1 hours for models I, II, and IV, and
slightly higher at 4 ± 1 hours for model III. In contrast, for cells that chose the
apoptotic pathway, the period of oscillations can be much smaller. For models
with link A, essentially no oscillations are observed in apoptotic cells (Fig. 4,
top, left): this is because cell death is decided very early on, with link A im-
mediately preventing any further NFκB activity. For model II (links C and L
only), oscillations are observed in apoptotic cells with a natural period which
is lower (about 3 ± 1 hours) than that for surviving cells (Fig. 4, top, mid-
dle). Results for model IV (not shown) are quite similar to those of model II.
For model variant III, there is no difference between observed periods (Fig. 4,
top, right). These results provide indications for discriminating between the
four model variants and also suggest that the period of oscillations may play a
role in the survival/apoptosis decision: lower periods/higher frequencies would
lead towards the apoptotic pathway. A similar result has been reported, for
instance, in the p53-Mdm2 system [16], where more peaks (higher frequency)
were detected in response to higher (and more damaging) γ-irradiation doses.
The p53-Mdm2 system also contains a negative feedback loop similar to the
NFκB-IκB loop.
To address the question of irreversibility of the apoptotic decision, we
checked the capacity of the network to recover from overexpression of active
caspase 3. Fixing node C3a at its maximal value for intervals of 10, 30, 60, and
360 minutes (that is, setting the discrete variable C3a(t) = 1 for t < 10, 30, 60, or 360 minutes), we
calculated the percentage of surviving cells. With model I there are no sur-
viving cells after 1 hour of C3a overexpression but, with model II or IV, this
percentage drops very fast from 45% to 30% survival at 1 hour overexpression
and remains at this value for continued C3a overexpression (see Fig. 5). This
suggests that a significant percentage of cells can still invert the apoptotic

[Fig. 5 plot: Survival rate (%) versus C3a overexpression interval (mins.), with one curve for each of the model variants I, II, III, and IV.]

Fig. 5. Percentage of surviving cells under increasing intervals of C3a overexpression
for the four model variants and TNF treatment for 16 hours.
decision, while for the largest part (70% of all cells) the apoptotic pathway
is chosen early on, within an hour of TNF stimulation. Not surprisingly, ex-
amination of the relative values of the parameters ai shows that two thirds of
cells that were able to recover from the apoptotic pathway had degradation
rates for C3a higher than those for NFκB or IκB.
Based on our study of regulation of apoptosis and the NFκB pathway, it
seems clear that the links A and L play quite important roles, and at least
one of these should definitely be included for faithful modeling of apoptosis
via TNF receptors. This eliminates model III. Both links contribute to the
same physiological function: downregulation of NFκB transcriptional activity.
However, link A (direct inhibition of NFκB by C3a) achieves this objective
in a much faster way than link L (“indirect” inhibition of NFκB by C3a,
through complex IKK). The essential difference between models I and II is
thus the length of the pathway representing inhibition of NFκB by C3a. The
shorter path (model I, with link A) leads to much higher apoptosis rates than
the longer path (models II or IV, with link L). The shorter path also renders
recovery from the apoptosis pathway practically impossible, with apoptosis
rates higher than 95% after only half an hour with C3a overexpression (Fig. 5).
The longer path allows a higher recovery rate from the apoptotic pathway:
the probability of apoptosis does not increase above 70%, even after
6 hours of C3a overexpression. Recent experimental evidence [10] points to
the existence of a link L, that is, caspases are responsible for cleavage or
degradation of (parts of) complex IKK. To further discriminate between a
short or long pathway for the influence of caspases on the NFκB pathway, the
results shown in Fig. 4 suggest the following experiment. First, measure the
period of oscillations during TNF stimulation and then monitor cells for some
time after TNF removal. Next, compare the frequency of oscillations in cells
that survive and in cells that eventually go through the apoptotic program.
If the frequency of oscillations is similar for both groups of cells, or slightly
higher in apoptotic cells, then model II (longer pathway) provides a better
description of the system. If oscillations stopped after a short time interval (as
compared to TNF duration) in apoptotic cells, then model I (shorter pathway)
should be chosen.

5 Conclusion
The present study illustrates the usefulness of Boolean and piecewise linear
models in the analysis of large complex networks. The qualitative dynamics
that emerges from the network structure was studied, leading to predictions on
the response to increasing duration of stimulation, response to overexpression
of a given protein, or indication of which links/interactions play crucial roles
in the regulation of apoptosis. Some quantitative aspects were also analyzed,
such as the probabilities of survival or apoptosis and the frequency/period of
oscillations, and were shown to be in remarkable agreement with experimental
data. Many other questions can be examined in this hybrid framework: for
instance, extending the set of parameters (degradation and synthesis rates,
threshold concentrations) and varying the relative strengths of anti- and pro-
apoptotic links will lead to more refined models, capturing a wider range of
kinetic variability. Although writing the logical rules requires some simplifica-
tions of the biological processes, discrete and hybrid models retain the essen-
tial qualitative properties of the network. The effect of the network structure
on the qualitative dynamics of the system can be easily studied, even when
kinetic details are not well known. This class of models can thus be a pow-
erful method to generate predictions and test new hypotheses for complex
biological networks.

Acknowledgments

The authors thank Peter Scheurich and Monica Schliemann for their many
interesting and fruitful discussions.

References
1. R. Albert and H.G. Othmer. The topology of the regulatory interactions predicts
the expression pattern of the Drosophila segment polarity genes. J. Theor. Biol.,
223:1–18, 2003.
2. G. Bernot, J.-P. Comet, A. Richard, and J. Guespin. Application of formal meth-
ods to biological regulatory networks: extending Thomas’ asynchronous logical
approach with temporal logic. J. Theor. Biol., 229:339–347, 2004.
3. R. Casey, H. de Jong, and J.L. Gouzé. Piecewise-linear models of genetic regulatory
networks: equilibria and their stability. J. Math. Biol., 52:27–56, 2006.
4. M. Chaves, T. Eissing, and F. Allgöwer. Bistable biological systems: a charac-
terization through local compact input-to-state stability. IEEE Trans. Automat.
Control, 53:87–100, 2008.
5. M. Chaves, E.D. Sontag, and R. Albert. Methods of robustness analysis for Boolean
models of gene control networks. IEE Proc. Syst. Biol., 153:154–167, 2006.
6. N.N. Danial and S.J. Korsmeyer. Cell death: critical control points. Cell, 116:
205–216, 2004.
7. H. de Jong, J.L. Gouzé, C. Hernandez, M. Page, T. Sari, and J. Geiselmann.
Qualitative simulation of genetic regulatory networks using piecewise linear mod-
els. Bull. Math. Biol., 66:301–340, 2004.
8. T. Eissing, H. Conzelmann, E.D. Gilles, F. Allgöwer, E. Bullinger, and
P. Scheurich. Bistability analysis of a caspase activation model for receptor-induced
apoptosis. J. Biol. Chem., 279:36892–36897, 2004.
9. A. Fauré, A. Naldi, C. Chaouiya, and D. Thieffry. Dynamical analysis of a
generic boolean model for the control of the mammalian cell cycle. Bioinformatics,
22(14):e124–e131, 2006.
10. C. Frelin, V. Imbert, V. Bottero, N. Gonthier, A.K. Samraj, K. Schulze-Osthoff,
P. Auberger, G. Courtois, and J.F. Peyron. Inhibition of the NF-κB survival path-
way via caspase-dependent cleavage of the IKK complex scaffold protein and NF-
κB essential modulator NEMO. Cell Death Differ., 15:152–160, 2008.
11. T. Gedeon. Attractors in continuous-time switching networks. Communications
on Pure and Applied Analysis, 2:187–209, 2003.
12. L. Glass. Classification of biological networks by their qualitative dynamics.
J. Theor. Biol., 54:85–107, 1975.
13. L. Glass and S.A. Kauffman. The logical analysis of continuous, nonlinear bio-
chemical control networks. J. Theor. Biol., 39:103–129, 1973.
14. A. Hoffmann, A. Levchenko, M.L. Scott, and D. Baltimore. The IκB-NFκB sig-
naling module: temporal control and selective gene activation. Science, 298:1241–
1245, 2002.
15. A.E.C. Ihekwaba, D. Broomhead, R. Grimley, N. Benson, and D.B. Kell. Sensitiv-
ity analysis of parameters controlling oscillatory signalling in the NF-κB pathway:
the roles of IKK and IκBα. IEE Syst. Biol., 1:93–103, 2004.
16. G. Lahav, N. Rosenfeld, A. Sigal, N. Geva-Zatorsky, A.J. Levine, M. Elowitz,
and U. Alon. Dynamics of the p53-Mdm2 feedback loop in individual cells. Nat.
Genetics, 36:147–150, 2004.
17. T. Lipniacki, P. Paszek, A.R. Brasier, B. Luxon, and M. Kimmel. Mathematical
model of NFκB regulatory module. J. Theor. Biol., 228:195–215, 2004.
18. E.R. McDonald and W.S. El-Deiry. Suppression of caspase-8 and -10-associated
RING proteins results in sensitization to death ligands and inhibition of tumor
cell growth. Proc. Natl. Acad. Sci. USA, 101:6170–6175, 2004.
19. D.E. Nelson, A.E.C. Ihekwaba, M. Elliott, J.R. Johnson, C.A. Gibney,
B.E. Foreman, G. Nelson, V. See, C.A. Horton, D.G. Spiller, S.W. Edwards,
H.P. McDowell, J.F. Unitt, E. Sullivan, R. Grimley, N. Benson, D. Broomhead,
D.B. Kell, and M.R.H. White. Oscillations in NF-κB signaling control the dynam-
ics of gene expression. Science, 306:704–708, 2004.
20. E.L. O’Dea, D. Barken, R.Q. Peralta, K.T. Tran, S.L. Werner, J.D. Kearns,
A. Levchenko, and A. Hoffmann. A homeostatic model of IκB metabolism to
control constitutive NFκB activity. Mol. Syst. Biol., 3:111, 2007.
21. N.D. Perkins. Integrating cell-signalling pathways with NF-κB and IKK function.
Nat. Rev. Mol. Cell Biol., 8:49–62, 2007.
22. M. Rehm, H. Düßmann, R.U. Jänicke, J.M. Tavaré, D. Kögel, and J.H.M. Prehn.
Single-cell fluorescence resonance energy transfer analysis demonstrates that cas-
pase activation during apoptosis is a rapid process. J. Biol. Chem., 277:24506–
24514, 2002.
23. M. Schliemann. Modelling and experimental validation of TNFα induced pro- and
antiapoptotic signalling. Master’s thesis, University of Stuttgart, Germany, 2006.
24. M. Schliemann, T. Eissing, P. Scheurich, and E. Bullinger. Mathematical modelling
of TNF-α induced apoptotic and anti-apoptotic signalling pathways in mammalian
cells based on dynamic and quantitative experiments. In Proc. 2nd Int. Conf.
Foundations Systems Biology in Engineering (FOSBE), Stuttgart, Germany, pages
213–218, 2007.
25. R. Thomas. Boolean formalization of genetic control circuits. J. Theor. Biol.,
42:563–585, 1973.
26. R. Thomas, D. Thieffry, and M. Kaufman. Dynamical behaviour of biological
regulatory networks - i. biological rule of feedback loops and practical use of the
concept of the loop-characteristic state. Bull. Math. Biol., 57:247–276, 1995.
Network-Based Models in Molecular Biology

Andreas Beyer

Biotechnology Center, Technische Universität Dresden, 01062 Dresden, Germany


andreas.beyer@biotec.tu-dresden.de

1 Introduction
Biological systems are characterized by a large number of diverse interactions.
Interaction maps have been used to abstract those interactions at all biolog-
ical scales ranging from food webs at the ecosystem level down to protein
interaction networks at the molecular scale.
Organisms consist of thousands of cells with hundreds of different types.
Cells in turn contain millions of molecules comprising thousands of different
chemical species. Our genome contains about 23,000 protein coding genes [32],
and the estimated number of chemically different proteins (considering splice
variants and posttranslational modifications) is at least an order of magnitude
larger. It is difficult to estimate the true number of different proteins, because
there are no reliable methods yet for predicting splice variants. For example,
the NCBI database (www.ncbi.nlm.nih.gov) currently lists about 440,000 pro-
tein entries—many of them may however be redundant. In addition, our cells
contain many other molecules with catalytic or regulatory functions, such as
ribosomal RNA, tRNA, and small interfering RNA (siRNA). Further, the cells
contain thousands of different lipid species and other small molecules serving
as structural components of the cell or as substrates for the biochemical re-
actions executed by the metabolic program. Hence, our body is coordinating
the activity and reactions of hundreds of thousands if not millions of different
chemical species [3]. Even a single cell is a prototypic example of a complex
system [27]. Although biological systems follow all basic physical and chem-
ical principles, they cannot be modeled sufficiently using standard methods
from those two disciplines. Typical physical models describe a system as ei-
ther a small number of different entities (e.g. mechanics) or a large number of
very similar or even identical elements (e.g. thermodynamics). Likewise,
chemical reaction systems can only be appropriately described if the number of
reacting species is small. However, the behavior and fate of organisms cannot
be described appropriately without considering the fact that they consist of
a large number of very different interacting elements.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_3,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

Networks or interaction graphs are one way to formalize the complex
interactions of heterogeneous entities within and between cells. In their
simplest form those interaction graphs only indicate the possibility of an
interaction between two genes or proteins (Fig. 1, top). At the other end of the scale
are detailed models based on ordinary or partial differential equations (Fig. 1,
bottom). Those latter models are used for small, well-studied subsystems and
they provide significantly more detailed insight into the dynamics than simple
interaction maps. Many formal methods have evolved that cover the interme-
diate space between these two extremes [2, 11, 23, 30, 38, 49, 66, 72, 76]. The
best models utilize the available data in an optimal way to provide as much
insight into the biological system as possible.
So, what is the distinct advantage of network-based methods in compar-
ison to other approaches? As opposed to many alternative methods (such
as ordinary or partial differential equations), network models establish a de-
scription of the system including all the different entities and their properties,
even if the detailed knowledge about the properties and behavior of individual
molecules is still very sparse. Network models are a popular way to formalize
all the available knowledge about cellular systems in a consistent framework.
For example, protein and gene interaction maps have been created for many
important model species and for humans, covering thousands of genes and tens
of thousands of interactions [45, 59, 60, 67]. Although these interaction maps
are still incomplete and even though they only represent a static picture of pos-
sible interactions, they constitute the most complete picture of a cell that we
have today (i.e. covering the largest number of genes or biomolecules). It is this
ability to integrate and formalize knowledge and data about very diverse en-
tities of the system (each gene is different from all other genes) in a consistent
way that distinguishes this concept from basically any other existing modeling
approach [7]. One should be aware though that “network-based approaches”
actually cover a very broad and heterogeneous group of methods (Fig. 1), rang-
ing from the above-mentioned static interaction graphs to systems of coupled
differential equations (used e.g. for modeling metabolic networks).
During the previous decades network-based modeling approaches have
been used for describing cellular regulation and cellular metabolism. Networks
have helped to structure and formalize existing knowledge, to summarize and
integrate large measurements, and to predict system behavior. Important ap-
plications include a better understanding of diseases with the ultimate goal
of developing new therapies. Even conceptually simple static protein interac-
tion maps have been used to gain significant insights into how certain genes
are associated with specific diseases. The study of Lage and co-workers [42]
presents an excellent example: many genetic loci (i.e. regions on the genome)
are statistically associated with the occurrence of certain diseases. However,
usually this information is insufficient to mechanistically explain the
relationship between the genetic locus and the disease. First of all, many
genes may be located in the respective region of the genome and it is mostly
unknown which of the genes is causal for the disease. Second, even if the
causal gene is known, the molecular mechanisms linking the gene to the disease
are usually elusive. Lage et al. addressed these problems by mapping all genes
located in disease-related loci onto a protein-protein interaction network.
They hypothesized that truly causal genes would cluster together in common
protein complexes of the network. Indeed, the authors found protein complexes
significantly enriched with candidate genes. Often, these complexes also had a
molecular relationship with disease phenotypes. Hence, the investigators not
only identified potentially causal genes, but they also identified protein
complexes that could aid in understanding the molecular mechanisms by which
mutations alter disease susceptibility. This example demonstrates that even
comparatively simple networks can yield new insight for our understanding of
diseases.

[Figure 1: schematic of five network model types, ordered from top to bottom
by increasing level of detail and decreasing coverage.]
  Static interaction maps: show potential of interaction.
  Conditional interactions: need condition-specific protein abundance or
    protein activation data.
  Causal (directed) interactions: can predict who affects whom; need
    regulatory information.
  Logical networks: consider type of effect (repressive/activating) and
    potentially also Boolean rules (such as "need A AND B to activate C").
  Quantitative models: kinetic rate constants are known or derived from the
    data; can predict dynamics of system response.

Fig. 1. A hierarchy of network models. Depending on the available data and
the research question, different network modeling approaches are chosen. The figure
shows model types with increasing levels of detail and increasing data demand (top
to bottom). Less detailed models tend to cover a larger number of genes/biomolecules.
Note that only the most detailed category of models ("quantitative models") allows for
true simulation of network dynamics, e.g. using ordinary differential equations. Causal
networks and logical networks allow at most for the simulation of the sequence of
events or the order in which proteins/genes are activated, but quantifying the speed
of processes is impossible with these simplified approaches.
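The enrichment step in the study of Lage et al. [42] can be illustrated with a small sketch. It assumes a hypergeometric test for the overlap between each protein complex and the set of candidate genes; the text does not specify the statistic the authors actually used, and all gene names, complexes, and the network size below are hypothetical.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N genes, K of which are candidates."""
    upper = min(n, K)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)

# Hypothetical toy data (not taken from the study).
N_GENES = 100                                  # assumed number of genes in the network
candidates = {"g1", "g2", "g3", "g4", "g5"}    # genes located in disease-related loci
complexes = {
    "complexA": {"g1", "g2", "g3", "g50"},
    "complexB": {"g4", "g60", "g61", "g62"},
}

def enrichment(complexes, candidates, n_genes):
    """Return {complex: (number of candidate genes, enrichment p-value)}."""
    results = {}
    for name, members in complexes.items():
        hits = len(members & candidates)
        p = hypergeom_tail(hits, len(members), len(candidates), n_genes)
        results[name] = (hits, p)
    return results

scores = enrichment(complexes, candidates, N_GENES)
for name, (hits, p) in sorted(scores.items()):
    print(f"{name}: {hits} candidate genes, p = {p:.2e}")
```

In this toy setting the complex containing three of the five candidates receives a much smaller p-value than the complex containing only one, which is the kind of signal used to prioritize complexes and their member genes.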

2 Molecular Biological Networks


Currently, three types of molecular networks are receiving the most attention
in the scientific literature: metabolic networks, protein interaction networks,
and gene interaction networks. Other popular network types are transcrip-
tional regulatory cascades, which are derived from genome-wide expression
data (Appendix 1). The availability of large datasets for these most popu-
lar networks has enabled extensive theoretical and computational analysis of
the networks. Importantly, this preference does not always reflect biological
significance. As discussed below, posttranscriptional regulatory networks are
probably as important as transcriptional networks. However, methods for mea-
suring the relevant interactions (e.g. protein-RNA binding) on a large scale are
either not yet available or not as established as other methods (Appendix 1).
Metabolic networks. These networks describe systems of biochemical reac-
tions catalyzed by enzymes. Depending on the available data and the research
question, different formalizations can be used [28, 63]: (i) enzymes can be the
nodes (vertices) of the graph and two enzymes are connected if they catalyze
subsequent steps in a reaction chain (i.e. the product of the first enzyme is
a substrate for the second); (ii) metabolites can be nodes in the graph and
metabolites are connected if they participate in the same reaction; (iii) sto-
ichiometry can be considered if known, e.g. in metabolic flux analysis; and
(iv) if kinetic parameters are known one can create systems of differential
equations describing the dynamics of the system. Many other types of model-
ing schemes are used for depicting metabolic networks, including for example
stochastic processes [62].
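Formalizations (i) and (ii) can be derived from the same reaction list, as the following minimal sketch shows. The three-reaction pathway and the names of enzymes and metabolites are hypothetical and serve only as an illustration.

```python
from itertools import combinations

# Hypothetical toy pathway: enzyme -> (substrates, products)
reactions = {
    "E1": ({"A"}, {"B"}),
    "E2": ({"B"}, {"C"}),
    "E3": ({"C", "D"}, {"E"}),
}

def enzyme_graph(reactions):
    """Formalization (i): edge E_i -> E_j if a product of E_i is a substrate of E_j."""
    edges = set()
    for e1, (_, products) in reactions.items():
        for e2, (substrates, _) in reactions.items():
            if e1 != e2 and products & substrates:
                edges.add((e1, e2))
    return edges

def metabolite_graph(reactions):
    """Formalization (ii): undirected edge between metabolites of the same reaction."""
    edges = set()
    for substrates, products in reactions.values():
        for a, b in combinations(sorted(substrates | products), 2):
            edges.add((a, b))
    return edges

print(sorted(enzyme_graph(reactions)))
print(sorted(metabolite_graph(reactions)))
```

The same data thus yield a directed enzyme-centric graph and an undirected metabolite-centric graph; which of the two is more useful depends on the research question.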
Protein-protein networks. Likewise, many different types of protein-
protein interaction networks have been used, utilizing the data in different
ways; e.g. static protein interaction maps summarize either known measured
or predicted protein interactions [59, 60, 67]. Here, the edges often have
weights quantifying the probability of a true physical binding between the
two proteins [15]. Other types of protein networks are regulatory networks,
such as kinase-substrate cascades [55], protein complexes (often representing
molecular machines such as the ribosome) [24, 40], or more detailed structural
models of protein interactions [4].
Gene-gene networks. Gene-gene networks are not molecular networks, but
in fact logical networks: they describe functional relationships between genes.
For example, two genes may be linked if their products participate in the same
process or pathway. Such functional networks have been created by integrat-
ing diverse evidence for common functions of genes [45, 59]. For example, the
fact that two genes appear together in many species (common phylogenetic
profile) indicates that the genes participate in a common process. Also, com-
mon expression patterns across a wide range of different conditions suggest
similar functions of two genes. By integrating such evidence quantitatively
using machine learning approaches it has been possible to create relatively
large maps of functionally related genes. Such a “functional network” has for
example been used to better predict substrates of kinases [47]. This study
nicely demonstrated the value of such data integration, since previous meth-
ods relying exclusively on kinase binding motifs suffered from a large number
of false positives.
Geneticists define a genetic interaction based on the phenotypes observed
when the genes are knocked out: if the knock-out of one gene “masks” the
phenotype of the other knock-out they are said to be linked [7, 9]. A proto-
typical example is the synthetic lethal interaction. In this case the knock-out
of any single gene has no or only very little effect on viability, whereas the
double knock-out of both genes creates a lethal phenotype. Such a synthetic
lethal phenotype can be explained by redundant functions of the two genes
e.g. in two independent pathways that can compensate for each other [35].
Hence, the fact that two genes create a synthetic lethal phenotype indicates
that they participate in distinct pathways.
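The definition of a synthetic lethal interaction can be written down as a simple rule on fitness measurements. The sketch below assumes that fitness is expressed relative to the wild type (1.0) and uses hypothetical thresholds for "viable" and "lethal". The interaction score based on a multiplicative expectation is a widely used convention that is added here as an assumption; it is not taken from the text.

```python
def is_synthetic_lethal(fit_a, fit_b, fit_ab, viable=0.8, lethal=0.1):
    """True if both single knock-outs are viable but the double knock-out is lethal.

    Fitness values are relative to the wild type (1.0); the thresholds are
    hypothetical and would have to be chosen for the assay at hand.
    """
    return fit_a >= viable and fit_b >= viable and fit_ab <= lethal

def interaction_score(fit_a, fit_b, fit_ab):
    """Deviation of the double knock-out from the multiplicative expectation."""
    return fit_ab - fit_a * fit_b

# Hypothetical measurements for two gene pairs.
print(is_synthetic_lethal(0.95, 0.90, 0.0))   # single knock-outs healthy, double lethal
print(is_synthetic_lethal(0.95, 0.30, 0.0))   # second single knock-out already sick
print(interaction_score(0.95, 0.90, 0.0))     # strongly negative score
```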
Underlying the functional or genetic relationships are sequences of physical
or biochemical interactions “connecting” the two genes. Thus, genetic interac-
tions provide important functional information that can be used for inferring
molecular pathways [7].
Other network types. Many other types of molecular interactions have
systematically been studied using network approaches. For example, protein-
DNA interactions are important for understanding transcriptional regulation,
and they have been studied on a large scale for almost a decade [58, 77]. On
the other hand, protein-RNA networks are substantially less researched, al-
though the relevance of alternative splicing is immediately apparent and it
is known that translation is also heavily regulated via RNA binding proteins
[5]. Yet another prominent example is the use of logical networks for describ-
ing transcriptional regulatory cascades [19, 57]. These networks are similar
to the above-mentioned transcription factor-DNA networks; however, such
logical networks may not always explicitly model the molecular mechanism
underlying the regulatory relationship.
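A logical network of this kind can be simulated with a few lines of code. The sketch below uses a hypothetical four-node cascade with synchronous updates; the rules (including an AND rule for node C) are invented for illustration and do not correspond to a specific published network.

```python
# Hypothetical Boolean rules; each rule maps the current state to the next value.
rules = {
    "A": lambda s: s["A"],              # external input, held constant
    "B": lambda s: s["A"],              # A activates B
    "C": lambda s: s["A"] and s["B"],   # A AND B are needed to activate C
    "D": lambda s: not s["C"],          # C represses D
}

def step(state, rules):
    """Synchronous update: all nodes are evaluated on the same previous state."""
    return {node: bool(rule(state)) for node, rule in rules.items()}

def trajectory(state, rules, n_steps):
    """Return the list of states visited, starting with the initial state."""
    states = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        states.append(state)
    return states

start = {"A": True, "B": False, "C": False, "D": True}
traj = trajectory(start, rules, 3)
for t, s in enumerate(traj):
    print(t, s)
```

The trajectory shows the order of events (B switches on, then C, then D switches off) but carries no information about how fast these events happen, which is the limitation of logical models noted in the caption of Fig. 1.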

3 Identifying Molecular Biological Networks

In recent years several new technologies have been developed for measur-
ing all kinds of physical and genetic interactions on a large scale (Fig. 2,
Appendix 1). For example, protein-protein interactions can be measured
with yeast two-hybrid (Y2H) or tandem affinity purification coupled with
mass-spectrometry-based protein identification (TAP-MS). Protein-DNA
interactions can be measured with chromatin immunoprecipitation, after which
DNA microarrays (Appendix 1) are used to identify the bound DNA fragments
(ChIP-Chip). Likewise, techniques for the large-scale measurement of
gene-gene interactions have been developed. These are just some examples to
demonstrate the fact that today a wide range of interactions can be measured
practically at a genomic scale. However, all of these methods are subject to
considerable noise, and often results from different techniques only agree to
a small extent [73]. Therefore, numerous bioinformatic approaches are under
development for physical and genetic network quality assessment, integration,
assembly, and annotation.

[Figure 2: schematic of four interaction types and the methods used to infer
them. Legend: protein; target gene; undirected interaction; directed interaction.]
  Kinase-substrate interactions: protein chips; binding motifs + other
    binding evidence; eQTL + other binding evidence.
  Protein-protein interactions: yeast two-hybrid (Y2H); TAP-MS; known
    interacting protein domains.
  Transcription factor-DNA interactions: TF binding motifs; ChIP-Chip,
    ChIP-Seq; knock-out + expression change ('buffering'); overexpression +
    expression change.
  Transcriptional cascades (TF-TF interactions): all methods of the above
    section; time course analysis.

Fig. 2. Inferring regulatory networks with high-throughput methods. The
four types of interactions can be inferred using experimental high-throughput meth-
ods, computational methods, or combinations of the two. The methods listed are not
comprehensive. Refer to Appendix 1 for details about the various methods. Advanced
computational methods combine different types of evidence for each type of interaction. For
example, TF motif information can be used in combination with ChIP-Chip experi-
ments [6]. Likewise, time course data have been combined with TF binding motifs to
infer regulatory cascades [57].

Although all large-scale studies are subject to noise, the rationale for data
integration is that observations of true interactions will reinforce or comple-
ment one another when combined across different studies and/or experimental
techniques. For example, the independent observation of a protein-protein in-
teraction by both Y2H and TAP-MS methods, or by two independent TAP-MS
studies, renders this interaction more likely to be true [73].
Such evidence for interaction can be further supported by including other
types of data that are not necessarily measurements of direct physical contact.
For example, if the two genes have correlated expression profiles or similar
patterns of occurrence across several conditions, these findings lend further
support to the raw interaction measurement [53, 69]. Beyer et al. [7] review
methods for integrating genetic interactions with physical binding data to
further support the various types of interactions.
Modern methods for interaction data integration use machine learning ap-
proaches or other statistical means for combining heterogeneous types of data
in a consistent manner. Importantly, the methods assign different weights to
different input data, acknowledging that not all types of evidence are equally
predictive. These methods rely heavily on a set of “gold-standard”, or highly
accurate, interactions which are used to evaluate the predictive utility of dif-
ferent types of evidence [50, 73]. The result is a statistical measure quantifying
the likelihood that any given pair of biomolecules interacts (e.g. two proteins
or a protein-DNA pair) [6, 33, 59, 64, 70]. Strictly speaking, these quantitative
confidence scores describe the probability or reproducibility of the interaction,
not the interaction strength. Nonetheless, there is some evidence that stronger
interactions should be more reproducible, leading to higher scores [20].
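One simple way to realize such a weighting is a naive Bayes scheme, sketched below. It is only one of several possible approaches and is not claimed to be the method of the cited studies. The gold-standard sets, the raw observations, the pseudocount, and the prior are all hypothetical, and for brevity the sketch ignores the information carried by the absence of an observation.

```python
def pair(a, b):
    """Order-independent key for an undirected interaction."""
    return frozenset((a, b))

# Hypothetical gold standard: known true and known false interactions.
positives = {pair("a", "b"), pair("c", "d"), pair("e", "f"), pair("g", "h")}
negatives = {pair("a", "c"), pair("b", "d"), pair("e", "g"), pair("f", "h")}

# Hypothetical raw observations from two screens.
evidence = {
    "Y2H": {pair("a", "b"), pair("c", "d"), pair("a", "c"), pair("x", "y")},
    "TAP-MS": {pair("a", "b"), pair("c", "d"), pair("e", "f"),
               pair("x", "y"), pair("x", "z")},
}

def likelihood_ratios(evidence, positives, negatives, pseudo=1.0):
    """Weight per evidence type: P(observed | true pair) / P(observed | false pair)."""
    ratios = {}
    for name, observed in evidence.items():
        p_true = (len(observed & positives) + pseudo) / (len(positives) + 2 * pseudo)
        p_false = (len(observed & negatives) + pseudo) / (len(negatives) + 2 * pseudo)
        ratios[name] = p_true / p_false
    return ratios

def confidence(key, evidence, ratios, prior=0.01):
    """Posterior probability of an interaction, assuming independent evidence."""
    odds = prior / (1.0 - prior)
    for name, observed in evidence.items():
        if key in observed:
            odds *= ratios[name]
    return odds / (1.0 + odds)

ratios = likelihood_ratios(evidence, positives, negatives)
print(ratios)
print(confidence(pair("x", "y"), evidence, ratios))   # seen in both screens
print(confidence(pair("x", "z"), evidence, ratios))   # seen in one screen only
```

The evidence type that recovers more gold-standard positives and fewer negatives receives the larger weight, and a pair observed in both screens obtains a higher score than a pair observed in only one.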
The scores resulting from such analyses can be used to filter for high-
confidence interactions and thereby remove potentially many false positive
interactions contained in individual high-throughput measurements. Yet, the
reverse problem of false negatives is equally pressing. Because most large-scale
screens are conducted under only one or few conditions, it is not possible to
fully capture the space of all possible interactions. Here again, data integration
can help to mitigate false negatives, as interactions missing from one study
can be detected using high-confidence interactions from another. Note that
integrating more information can simultaneously reduce both the false-negative
and false-positive rates: as the number of ways of detecting an interaction
increases, the chance that a true interaction is found by several of these
methods rises, and the chance that it is missed altogether or found by only one falls.
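A short calculation makes this argument concrete. The sketch assumes that the detection methods are statistically independent and that each has the same sensitivity and the same false-positive rate; both numbers are hypothetical. It compares a single screen with the rule "accept an interaction if at least two of four screens report it".

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability that at least k of n independent methods report a pair."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

SENSITIVITY = 0.5   # hypothetical chance that one method detects a true interaction
FALSE_POS = 0.05    # hypothetical chance that one method reports a false interaction

single_tp = prob_at_least(1, 1, SENSITIVITY)
single_fp = prob_at_least(1, 1, FALSE_POS)
combined_tp = prob_at_least(2, 4, SENSITIVITY)
combined_fp = prob_at_least(2, 4, FALSE_POS)

print(f"one screen:          detect true {single_tp:.3f}, accept false {single_fp:.3f}")
print(f"two of four screens: detect true {combined_tp:.3f}, accept false {combined_fp:.3f}")
```

Under these assumptions the combined rule detects more true interactions and accepts fewer false ones than a single screen. In practice the methods are not fully independent, so the gain will be smaller than in this idealized calculation.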
The same notion applies to basically any other type of molecular or genetic
interaction. For instance, another important problem is the identification of
transcription factor (TF) target genes. Various approaches have been used
to infer those interactions (Fig. 2, Appendix 1); however, no single method is
perfect. Combining clues for TF-target interactions from different independent
sources increases confidence and coverage [6]. For example, Lähdesmäki et al.
[44] combined evidence from DNA binding motifs of TFs with other data such
as nucleosome occupancy to infer a transcriptional regulatory network for the
mouse. Likewise, Beyer et al. [6] combined experimental evidence from ChIP-
Chip with TF motifs, phylogenetic information, expression data, and even
physical protein binding data to infer a high-coverage and high-confidence
transcriptional network for yeast.

4 Dynamics of Molecular Biological Networks

Many aspects of biological networks are time dependent or condition specific.
Although all of the above-mentioned networks have been valuable in the past
for better understanding biological processes, most of them ignore some im-
portant features of biological networks. Biological interactions are dynamic
and condition specific. Only a small subset of all interactions (be it physical
protein binding or logical genetic interactions) is constitutively active. Protein
expression and activation depends on the cell type, cell state, environmental
signals, and the history of the cell (previous states). Further, molecular in-
teractions depend on the genomic sequence of the specific individual since
mutations may alter interactions. Hence, the presence of the interactors as
well as their ability to interact is highly variable. Even genetic interactions
can be condition specific: the same double knock-out may be viable under
one condition but lethal under another [9]. The reasons why current mod-
els ignore these aspects are certainly manifold, yet availability of data is one
of the most important aspects. For example, protein-protein interactions are
usually measured for only a single condition, sometimes not even in the orig-
inal organism (e.g. in the case of Y2H, see Appendix 1). It is thus difficult to
consider condition specificity of interactions in mathematical models.
One way to address this problem is to take mRNA expression data into
account, which are now routinely measured genome-wide using DNA microar-
rays (Appendix 1). Using these data it is possible to predict the presence or ab-
sence of proteins under specific conditions. Since a physical interaction is only
possible if both partners are present, this also allows for predicting conditional
interactions [16]. However, this approach still does not consider protein local-
ization and activation. Even if two proteins are co-expressed, they may not be
located to the same subcellular compartment, impeding an in vivo interaction.
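The filtering of a static interaction map by expression data can be sketched as follows. Protein names, expression values, and the presence threshold are hypothetical, and mRNA abundance is used as a stand-in for protein presence, with the caveats about localization and activation mentioned above.

```python
# Hypothetical static interaction map (undirected pairs).
interactions = {("P1", "P2"), ("P2", "P3"), ("P3", "P4")}

# Hypothetical mRNA expression levels under two conditions.
expression = {
    "heat":    {"P1": 5.0, "P2": 3.2, "P3": 0.1, "P4": 4.0},
    "control": {"P1": 0.2, "P2": 2.5, "P3": 2.8, "P4": 3.1},
}

def conditional_network(interactions, levels, threshold=1.0):
    """Keep an interaction only if both partners are expressed above the threshold."""
    present = {gene for gene, level in levels.items() if level >= threshold}
    return {(a, b) for a, b in interactions if a in present and b in present}

for condition, levels in expression.items():
    print(condition, sorted(conditional_network(interactions, levels)))
```

The same static map thus gives rise to different condition-specific networks, depending on which interaction partners are predicted to be present.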
The best-studied aspect of molecular network dynamics is transcriptional
adaptation. The broad availability of DNA microarrays allows for measuring
genome-wide transcriptional profiles relatively easily and at low cost. Today it
is a routine technology for measuring mRNA concentrations, and thousands of
studies have been conducted during the last decade measuring mRNA expres-
sion changes in prokaryotes, in many eukaryotic model species, and in human
samples. The two main databases of publicly available microarray data are
ArrayExpress (www.ebi.ac.uk/arrayexpress/) and the Gene Expression Om-
nibus (www.ncbi.nlm.nih.gov/geo/), currently listing 3874 and 8408 experi-
ments, respectively (as of April 2008). mRNA changes have been measured in
response to external stimuli, following differentiation, between different tissue
types, in diseased versus healthy tissue, and for numerous other applications.
The transcript signatures have been used to identify not just individual genes
responding to the given signal, but also entire pathways or subnetworks acti-
vated or deactivated under specific conditions. DNA microarrays are the most
abundant tool in molecular biology for measuring the cellular response in an
unbiased way. These studies are termed “unbiased” or “systematic” because
there is no need for a priori assumptions about which genes/proteins are likely
to respond during the experiment.
Kiesel et al. [36] used microarray time course data to study the tran-
scriptional change during osteoclastogenesis (i.e. during the transition from
precursor cells to mature osteoclasts). The differentiation of precursor cells
into mature osteoclasts involves dramatic changes of the transcriptional pro-
gram, thereby affecting the topology of the interaction network at the protein
level. The authors identified co-expression networks associated with early and
late response to the differentiation stimulus. A co-expression network is a
graph linking two genes if the two genes are similarly expressed either during
the specific experiment or at a range of different conditions. For the Kiesel
study it was necessary to create two distinct networks to fully capture the
complexity of the dynamical changes. One network described the early, and
the second one the late response during differentiation. Accordingly, the two
networks contained different pathways that are known to be associated with
osteoclastogenesis. These findings emphasize the importance of considering
the dynamics of transcriptional changes—often one may lose important de-
tails when looking at only two time points (before and after treatment, before
and after differentiation, etc.).
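The construction of separate early and late co-expression networks can be illustrated with a minimal sketch. It links two genes if the Pearson correlation of their profiles within a time window exceeds a threshold. The expression profiles, the window boundaries, and the threshold are hypothetical, and the procedure is a generic illustration rather than the method used by Kiesel et al.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if one profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def coexpression_network(profiles, min_r=0.9):
    """Link two genes if their profiles correlate with r >= min_r."""
    return {(g, h) for g, h in combinations(sorted(profiles), 2)
            if pearson(profiles[g], profiles[h]) >= min_r}

# Hypothetical time course with six time points per gene.
profiles = {
    "g1": [0, 1, 2, 2, 2, 2],
    "g2": [0, 2, 4, 4, 5, 3],
    "g3": [1, 1, 1, 1, 2, 3],
    "g4": [0, 0, 0, 2, 4, 6],
}

early = coexpression_network({g: p[:3] for g, p in profiles.items()})
late = coexpression_network({g: p[3:] for g, p in profiles.items()})
print("early:", sorted(early))
print("late: ", sorted(late))
```

In this toy example the early and late windows yield disjoint networks, which would be invisible in a comparison of only the first and last time points.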
Whereas analyzing microarray time course data can in itself reveal impor-
tant insights into the dynamics of transcriptional networks, combining those
data with other interaction data is significantly more powerful. Expression
data can be combined with transcription factor binding data, adding the di-
mension of protein-DNA interactions [23]. Thereby it becomes possible to
infer the molecular mechanisms by which transcriptional networks change
their state. For example, Ramsey et al. [57] combined time course expression
data with transcription factor binding data to assess the regulatory program
responding to macrophage activation. Putative regulatory relationships were
identified by employing a novel method for identifying time-lagged correla-
tion between transcription factors and potential target genes. Those interac-
tions were later corroborated by additionally taking the binding affinity of
transcription factors in upstream regulatory regions into account. Subsequent
experiments confirmed that this combined analysis of expression and binding
data significantly improved the quality of the inferred regulatory network. In
a similar approach, Ernst et al. [19] analyzed yeast TF-DNA binding data
in combination with respective expression data under various different stress
conditions. They identified bifurcation points in the time course expression
data indicating regulatory events. Along with the TF binding data they were
able to identify TFs that were likely regulators of those bifurcations, i.e. they
were regulating a specific subset of the genes.
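The idea of a time-lagged correlation between a transcription factor and a potential target can be sketched as follows. This is a generic illustration, not the specific method of Ramsey et al.; the expression profiles and the maximum lag are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if one profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def lagged_correlation(tf, target, lag):
    """Correlate TF expression at time t with target expression at time t + lag."""
    if lag == 0:
        return pearson(tf, target)
    return pearson(tf[:-lag], target[lag:])

def best_lag(tf, target, max_lag=2):
    """Lag (in time points) at which the correlation is highest."""
    return max(range(max_lag + 1), key=lambda lag: lagged_correlation(tf, target, lag))

# Hypothetical profiles: the target follows the TF with a delay of one time point.
tf_profile = [0, 1, 3, 2, 1, 0, 0]
target_profile = [0, 0, 1, 3, 2, 1, 0]

lag = best_lag(tf_profile, target_profile)
print(lag, lagged_correlation(tf_profile, target_profile, lag))
```

A high correlation at a positive lag is compatible with the transcription factor acting upstream of the target, but it does not prove regulation. This is why the cited studies add binding data as independent evidence.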
Alternatively, time course data of expression changes can be combined
with physical protein interaction networks in order to identify pathways or
pathway components that are differentially expressed [11, 30]. Ideker et al.
[31] combined physical interaction networks with expression data and devised
a method based on simulated annealing for identifying relevant subnetworks
(modules) of the physical network. In this case, the network is not a co-
expression network, but a network of proteins binding to other proteins or
to DNA. Expression changes are mapped onto the physical network, i.e. they
become the nodes’ attributes. The algorithm’s task is to identify the most
significant subnetwork enriched for differentially expressed genes. The result
will depend on the topology of the physical network and the strength (extent)
of differential regulation of the individual genes. Several variants of this idea
have been published since then [11, 13, 56, 61]. Most important however is the
central idea: combining dynamic expression data with independently derived
interaction networks significantly improves the statistical power of the analysis
and provides much more insight into the underlying molecular mechanisms [7].
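The search for an active subnetwork can be sketched with a simplified procedure. The sketch assumes that each gene carries a z-score for differential expression and scores a module by the sum of its z-scores divided by the square root of the module size, a commonly used aggregate that is introduced here as an assumption. Instead of simulated annealing it uses a plain greedy expansion from a seed node, which is easier to follow but more likely to end in a local optimum. The network and the z-scores are hypothetical.

```python
from math import sqrt

def module_score(nodes, z):
    """Aggregate score of a module: sum of z-scores divided by sqrt(size)."""
    return sum(z[n] for n in nodes) / sqrt(len(nodes))

def greedy_module(seed, neighbors, z):
    """Grow a connected module from a seed while the aggregate score improves."""
    module = {seed}
    while True:
        frontier = {n for m in module for n in neighbors[m]} - module
        best = max(frontier, key=lambda n: module_score(module | {n}, z), default=None)
        if best is None or module_score(module | {best}, z) <= module_score(module, z):
            return module
        module.add(best)

# Hypothetical physical interaction network (adjacency sets) and z-scores.
neighbors = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a"},
    "d": {"b", "e"},
    "e": {"d"},
}
z = {"a": 3.0, "b": 2.5, "c": -1.0, "d": 2.0, "e": 0.1}

module = greedy_module("a", neighbors, z)
print(sorted(module), round(module_score(module, z), 3))
```

The result depends on both the network topology (only neighbors of the current module can be added) and the strength of differential regulation, which mirrors the two factors named above.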
It is well established that protein concentrations are not only regulated
at the transcriptional level, but also at the level of translation and protein
turnover [5, 10]. These posttranscriptional processes affect the topology of
interaction networks just as much as transcriptional changes. For example,
Brockmann et al. [10] have shown that proteins responding early in signaling
cascades and transcription factors in particular are subject to “translation
on demand.” The coding mRNA of such proteins is constitutively expressed,
but translation is blocked until the protein is actually needed. This allows
for a much faster response compared to transcriptional regulation. Hence,
the presence/absence of these network components is highly dependent on
posttranscriptional processes, which are often neglected in studies assessing
the dynamics of protein expression. It is a very important finding that the
regulatory network components themselves are often not regulated at the
transcriptional level and are therefore missed by studies only applying DNA
microarrays for measuring transcriptional changes [8, 74].
The main bottleneck for studying posttranscriptional network changes
more systematically is the experimental limitation of protein detection and
quantification. Current state-of-the-art techniques employ mass spectrometry
for identifying, characterizing, and quantifying proteins [1]. The current lim-
its of this technology are high costs, relatively complicated protocols and data
processing, a limited number of detectable proteins (in the range of a few hun-
dred), and limited reproducibility [14, 48]. Recently, significant progress has
been made because of much more sensitive instruments, improved protocols,
and better data analysis tools [14, 43, 48]. Hence, this progress suggests that
in the near future posttranscriptional network dynamics can also be studied
at a level of detail and scope comparable to that of mRNA changes [14, 22].

5 Dynamics on Molecular Biological Networks

The previous section focused on the dynamic adaptation of the network topol-
ogy, e.g. the presence or absence of network components. Here, we will ad-
dress state changes of the nodes themselves, i.e. alterations or activities of
biomolecules in response to external or internal stimuli.
Prototypic examples for such networks are signaling networks, in which
proteins transmit information (signals) from one to the other via protein
state changes. Most intra-cellular signals are transmitted through covalent
protein modifications, e.g. via phosphorylation of specific residues. Kinase cas-
cades “send” signals from membrane receptors or from intra-cellular receptors
to proteins and other molecules, which ultimately changes the phenotype of
the cell or the entire organism. State changes of regulatory proteins such as
G-proteins, kinases, transcription factors or histones can be detected with
antibodies specific for the respective protein changes. The alternative method
of measuring those changes via mass spectrometry holds greater promise for
unbiased large-scale studies, because it does not require a specific antibody
for every possible protein modification. So far, those mass spectrometry-based
techniques are subject to the same limitations as the protein concentration
measurements discussed above [17]. Fortunately, the same improved technolo-
gies that are developed for protein quantification can also be used to study
dynamic changes of protein modifications and localization [41, 54, 75].
However, existing studies trying to elucidate the dynamics of cell signal-
ing largely had to rely on the tedious measurement of isolated proteins. Only
recently have new methods been developed for systematically identifying tar-
gets of kinases either experimentally [55] or using computational predictions
[47] (see also Appendix 1). Despite being important for determining the topol-
ogy of regulatory networks, those techniques still do not provide data on the
dynamics or kinetics of network changes on a larger scale. Therefore, kinetic
modeling of signaling cascades or protein transport has been restricted to ei-
ther well-studied pathways such as the cell cycle or relatively simple pathways
such as the osmotic shock response of yeast [38, 39, 72].
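
To make concrete what kinetic modeling requires, the following minimal sketch integrates a single phosphorylation/dephosphorylation cycle with mass-action kinetics. The rate constants, the concentrations, and the explicit Euler scheme are illustrative assumptions and are not taken from the cited models. The point is that every reaction needs a rate constant, which is exactly the kind of data that is rarely available on a larger scale.

```python
def simulate_cycle(kinase, k_phos=0.5, k_dephos=0.2, total=1.0,
                   dt=0.01, t_end=50.0):
    """Euler integration of dP/dt = k_phos*kinase*(total - P) - k_dephos*P.

    P is the phosphorylated amount of a substrate; all parameter values
    are hypothetical and chosen only for illustration.
    """
    p = 0.0
    trajectory = [p]
    for _ in range(int(t_end / dt)):
        dp = k_phos * kinase * (total - p) - k_dephos * p
        p += dt * dp
        trajectory.append(p)
    return trajectory


def steady_state(kinase, k_phos=0.5, k_dephos=0.2, total=1.0):
    """Analytical steady state obtained by setting dP/dt = 0."""
    return k_phos * kinase * total / (k_phos * kinase + k_dephos)


for signal in (0.1, 1.0, 10.0):
    final = simulate_cycle(signal)[-1]
    print(f"kinase={signal:5.1f}  simulated={final:.4f}  "
          f"analytical={steady_state(signal):.4f}")
```

Even this one-step model has two kinetic parameters; a cascade of several kinases multiplies that number, which illustrates why such models have been limited to well-characterized pathways.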
Due to the lack of sufficient kinetic parameters, researchers have begun
to develop new methods for simulating information processing in biological
signaling networks [25]. Most of these approaches formalize the logical in-
formation processing rather than quantifying the dynamics. For example,
information such as “Gene A is activated if gene B is deactivated” can be
formalized in logical models such as Boolean networks [2, 21, 29]. Various
alternative methods have been used to infer the logical relationships in reg-
ulatory networks, including Petri nets [26, 65], Bayesian networks [51], and
factor graphs [76].
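
A statement such as "Gene A is activated if gene B is deactivated" translates directly into a Boolean update rule. The sketch below assumes a hypothetical three-gene network (only the rule for A is taken from the text; the rules for B and C are invented) and a synchronous update scheme, and it reports the attractor that is reached from every possible start state.

```python
from itertools import product

# Hypothetical update rules; each maps the current state to a gene's next value.
RULES = {
    "A": lambda s: not s["B"],            # "A is activated if B is deactivated"
    "B": lambda s: s["C"],                # B follows its activator C
    "C": lambda s: s["A"] and not s["B"],
}
GENES = sorted(RULES)


def step(state):
    """Synchronous update: all genes switch simultaneously."""
    return {g: bool(RULES[g](state)) for g in GENES}


def attractor(state):
    """Iterate until a state repeats; return the cycle that was entered."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen[seen.index(state):]


def as_bits(state):
    """Compact string representation, e.g. '101' for A on, B off, C on."""
    return "".join(str(int(state[g])) for g in GENES)


for values in product([False, True], repeat=len(GENES)):
    start = dict(zip(GENES, values))
    print(as_bits(start), "->", [as_bits(s) for s in attractor(start)])
```

No kinetic constant enters this model: the sequence of states and the attractors follow from the logic alone, at the price of losing all information about time scales.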
A new method for the inference of regulatory pathways that combines
physical with functional data was recently introduced by Suthram et al. [71].
The authors used expression quantitative trait loci (eQTL) data in combina-
tion with a physical protein network to infer molecular regulatory pathways
in yeast. eQTL are statistical relationships between positions in the genome
(loci) and the expression of a target gene. A strong correlation indicates that
the respective locus contains a regulator of the target gene. Suthram et al.
simulated the flow of information between the locus and the target gene as an
electric current. The physical protein network serves in this case as the wiring
diagram on which the information “flows.” The strength of current on any
edge (interaction) is indicative of the importance of the interaction for the
regulation of the target gene. By applying their method to yeast eQTL data
the authors could infer several known and new regulatory relationships, and
they were able to predict the directionality of information flow for hundreds
of protein-protein interactions.
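
The electric-circuit analogy can be sketched in a few lines. The example below is not the implementation of Suthram et al.; it assumes a hypothetical network with unit conductance on every edge, injects one unit of current at the locus, grounds the target gene, and solves Kirchhoff's equations with a small Gaussian elimination routine. All node names and values are invented for illustration.

```python
# Hypothetical wiring diagram: edges of a physical network with conductances.
EDGES = {
    ("locus", "P1"): 1.0,
    ("locus", "P2"): 1.0,
    ("P1", "target"): 1.0,
    ("P2", "P3"): 1.0,
    ("P3", "target"): 1.0,
}


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def edge_currents(edges, source, sink, total_current=1.0):
    """Inject current at `source`, ground `sink`, return the current per edge."""
    nodes = sorted({n for e in edges for n in e} - {sink})
    idx = {n: i for i, n in enumerate(nodes)}
    lap = [[0.0] * len(nodes) for _ in nodes]   # Laplacian without the sink row
    for (u, v), g in edges.items():
        for a, b in ((u, v), (v, u)):
            if a in idx:
                lap[idx[a]][idx[a]] += g
                if b in idx:
                    lap[idx[a]][idx[b]] -= g
    rhs = [total_current if n == source else 0.0 for n in nodes]
    volt = dict(zip(nodes, solve(lap, rhs)))
    volt[sink] = 0.0
    # A positive value means current flows from the first to the second node.
    return {(u, v): g * (volt[u] - volt[v]) for (u, v), g in edges.items()}


for edge, current in edge_currents(EDGES, "locus", "target").items():
    print(edge, round(current, 3))
```

The shorter route via P1 carries more current than the longer route via P2 and P3, and the sign of each current gives a direction for the corresponding interaction, which mirrors the two kinds of prediction described above.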
The method developed by Suthram et al. enables the inference of causal
relationships, but it does not provide any insight into the effect of interactions.
For example, the currents do not predict whether the regulator activates or
represses the target. Workman et al. [74] went further in this
respect. Using ChIP-Chip data and knock-out expression measurements under
DNA damaging conditions they could infer a causal network for DNA damage
response. They used the method of factor graphs, which is a generalization of
Bayesian networks. Factor graphs are minimal graph models explaining the
observed (expression) data. Importantly, the method predicts whether any
given interaction is activating or repressing. The downside is that this method
requires significantly more comprehensive data than simpler approaches.
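
How binding data and knock-out expression data jointly determine the sign of an interaction can be shown with a deliberately simple heuristic. This is not the factor graph method of Workman et al., only a sketch of the underlying reasoning; the gene names, the log ratios, and the threshold are hypothetical.

```python
# Hypothetical log2 expression ratios (knock-out / wild type) of target genes
# after deleting one regulator; the values are made up for illustration.
KNOCKOUT_LOG_RATIOS = {
    "TF1": {"geneA": -1.8, "geneB": 0.1, "geneC": 1.4},
    "TF2": {"geneA": 0.2, "geneB": -2.3, "geneC": -0.1},
}
# Hypothetical physical evidence (e.g. from ChIP-Chip): TF binds the promoter.
BOUND = {("TF1", "geneA"), ("TF1", "geneC"), ("TF2", "geneB"), ("TF2", "geneC")}


def signed_edges(ratios, bound, threshold=1.0):
    """Label bound TF-target pairs by the direction of the knock-out effect."""
    edges = {}
    for tf, targets in ratios.items():
        for gene, log_ratio in targets.items():
            if (tf, gene) not in bound or abs(log_ratio) < threshold:
                continue  # no binding evidence or no clear expression change
            # Expression drops without the TF, so the TF was activating.
            edges[(tf, gene)] = "activates" if log_ratio < 0 else "represses"
    return edges


for edge, sign in sorted(signed_edges(KNOCKOUT_LOG_RATIOS, BOUND).items()):
    print(edge, sign)
```

The sketch also shows why the data requirements grow: a signed edge needs both a binding measurement and a knock-out experiment for the same regulator under the same condition.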
These logical network models cannot fully capture the kinetics of signaling.
However, at least some of them predict state changes in response to different
inputs, they provide insights into the sequence of events, and they allow for
analyzing the stability of the regulatory system and for finding “weak spots.”
A weak spot is a gene in the signaling network whose knock-out would max-
imally alter the output. Those genes could be interesting drug targets, e.g.
when looking for new targets in pathogens or when attacking tumor cells.
Also, those weak spots could be causal for diseases, for example if they are
mutated in patients carrying a certain inheritable disease.
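
A search for weak spots can be sketched as an in silico knock-out scan of a logical model. The feed-forward network below, its node names, and its rules are hypothetical; the impact of a knock-out is measured, as one possible choice, by the number of input combinations for which the output changes.

```python
from itertools import product

# Hypothetical feed-forward signaling network, listed in evaluation order.
# Each rule computes a node's state from the states of upstream nodes.
RULES = [
    ("receptor", lambda s: s["ligand"]),
    ("kinase1", lambda s: s["receptor"]),
    ("kinase2", lambda s: s["receptor"] or s["stress"]),
    ("tf", lambda s: s["kinase1"] or s["kinase2"]),
    ("output", lambda s: s["tf"]),
]
INPUTS = ("ligand", "stress")


def response(inputs, knocked_out=None):
    """Propagate the inputs through the network; a knocked-out node stays off."""
    state = dict(inputs)
    for node, rule in RULES:
        state[node] = False if node == knocked_out else bool(rule(state))
    return state["output"]


def knockout_impact(gene):
    """Number of input combinations whose output changes when `gene` is deleted."""
    changed = 0
    for values in product([False, True], repeat=len(INPUTS)):
        inputs = dict(zip(INPUTS, values))
        changed += response(inputs) != response(inputs, knocked_out=gene)
    return changed


impact = {node: knockout_impact(node) for node, _ in RULES if node != "output"}
for node, n in sorted(impact.items(), key=lambda kv: -kv[1]):
    print(f"{node:9s} changes the output for {n} of {2 ** len(INPUTS)} inputs")
```

In this toy network the transcription factor is the weak spot, whereas kinase1 is fully buffered by the parallel kinase2, so its deletion would go unnoticed at the output.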
Metabolic networks are another important application of dynamic network
modeling. They too are highly dynamic, and fully capturing their kinetics
would allow for developing new drugs and for optimizing yields in biochemi-
cal reactors. However, modelers face problems similar to those in regulatory
networks: although the kinetic properties of enzymes have been measured for
decades, we are still far from completely covering all relevant enzymes in any
multicellular eukaryote [49]. In addition, enzymes may behave completely dif-
ferently in vitro than in vivo, where pH, temperature,
and many other important parameters may differ [68]. Hence, complete dy-
namic modeling using differential equations is possible only for a relatively
small set of well-studied subsystems. Fortunately, methods have been devel-
oped that do not require kinetic constants for the analysis of metabolic net-
works. For instance, Petri nets have also been used for analyzing metabolic
networks [12]. One of the most mature methods is flux balance analysis (FBA)
[34, 52]. FBA simulates a metabolic network assuming steady state (input bal-
ancing output), which greatly simplifies the data requirements. For example,
each elementary mode represents a minimal set of reactions necessary to produce
a given product at steady state [66] (Fig. 3). These elementary modes can
be deduced just from the stoichiometric matrix. Hence, one only has to know
the possible reactions in the system along with their educts and products to

[Figure 3: four panels (a)–(d), each showing the network of S1, M1, M2, P1, and P2 connected by reactions R1–R5.]

Fig. 3. Elementary flux modes. (a) A simple metabolic network consuming sub-
strate S1 and producing products P1 and P2 via the reactions R1 through R5. (b–d)
Elementary modes (highlighted) are minimum sets of reactions creating the products
P1 (b, c) or P2 (d). Note that removing R1 affects all elementary modes, i.e. synthesis
of all products. Removal of R4 disables synthesis of P1 only. {R1}, {R4}, and {R2,
R3} are minimal cut sets with respect to P1.

predict all possible chemical fluxes that do not lead to the accumulation of
products under the steady state assumption. Depending on the metabolic net-
work there might be many elementary modes leading from certain substrate(s)
to specific product(s). Such a network would be redundant. However, even if
there are many elementary modes, all of them might require one specific en-
zyme, thus this enzyme would be essential for synthesizing the respective
product (e.g. the enzyme catalyzing reaction R1 in Fig. 3). The concept of
elementary modes has been used to make a range of important predictions:
for example Stelling et al. [66] were able to predict lethal genes in Escherichia
coli by searching for enzymes whose knock-out would remove all possible ele-
mentary modes leading to essential products. Klamt [37] extended this idea to
the concept of minimal cut sets: whereas Stelling and co-authors were looking
for single genes whose knock-out would be detrimental to the organism, a
minimal cut set is a smallest set of genes whose joint knock-out turns off the
synthesis of a given product (Fig. 3). This analysis could be instrumental for develop-
ing combinatorial antibiotics targeting different enzymes in bacteria such that
their synergistic interaction would be lethal to the pathogens.
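
The relation between the stoichiometric matrix, elementary modes, essential reactions, and minimal cut sets can be reproduced on a toy network. The topology below is an assumption chosen to be consistent with the caption of Fig. 3 (R1: S1 to M1, R2 and R3: M1 to M2, R4: M2 to P1, R5: M1 to P2); the actual figure may differ. The brute-force enumeration over 0/1 fluxes works only because this network is tiny, all reactions are irreversible, and all stoichiometric coefficients are 1; it is not how dedicated tools compute elementary modes.

```python
from itertools import combinations, product

REACTIONS = ["R1", "R2", "R3", "R4", "R5"]
# Assumed stoichiometric matrix (rows: internal metabolites M1 and M2).
# R1: S1->M1, R2: M1->M2, R3: M1->M2, R4: M2->P1, R5: M1->P2
N = [
    [1, -1, -1, 0, -1],   # M1
    [0, 1, 1, -1, 0],     # M2
]
PRODUCED_BY = {"P1": "R4", "P2": "R5"}   # reaction releasing each product


def is_steady_state(flux):
    """N * v = 0: no internal metabolite accumulates or is depleted."""
    return all(sum(n * v for n, v in zip(row, flux)) == 0 for row in N)


def elementary_modes():
    """Minimal supports of steady-state fluxes, by brute force over 0/1 fluxes."""
    supports = [
        frozenset(r for r, v in zip(REACTIONS, flux) if v)
        for flux in product([0, 1], repeat=len(REACTIONS))
        if any(flux) and is_steady_state(flux)
    ]
    return [s for s in supports if not any(other < s for other in supports)]


def minimal_cut_sets(modes):
    """Smallest reaction sets that hit (disable) every given mode."""
    cuts = []
    for size in range(1, len(REACTIONS) + 1):
        for combo in combinations(REACTIONS, size):
            cut = frozenset(combo)
            if any(c <= cut for c in cuts):
                continue  # a subset is already a cut set, so this is not minimal
            if all(cut & mode for mode in modes):
                cuts.append(cut)
    return cuts


modes = elementary_modes()
p1_modes = [m for m in modes if PRODUCED_BY["P1"] in m]
print("elementary modes:", [sorted(m) for m in modes])
print("essential for all products:", sorted(frozenset.intersection(*modes)))
print("minimal cut sets for P1:", [sorted(c) for c in minimal_cut_sets(p1_modes)])
```

Under the assumed topology the sketch recovers the statements of the figure caption: R1 is part of every elementary mode, and {R1}, {R4}, and {R2, R3} are the minimal cut sets with respect to P1. Only the reaction list with educts and products enters the calculation; no kinetic constant is needed.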
In summary, the “reduced modeling approaches” that are currently pop-
ular do not strictly simulate the dynamics on (or of) the networks, but they
simulate dynamic networks in a way that still leads to important conclu-
sions. In most cases it would be impossible to derive those insights without
these computational tools, given the complexity of regulatory or metabolic
networks. Also, these less quantitative approaches have a higher chance of
truly reaching a genome-wide scale and thus actually achieving a system-wide
perspective.
One of the main driving forces for progress in computational methods is
the development of new experimental techniques. New types of data open up
new possibilities for network simulation. For example, relatively cheap deep
sequencing methods will aid the identification of all transcripts (protein cod-
ing and non-coding) in time courses, which will require a new dimension in
transcriptional network modeling. Those future models will be able to incor-
porate the regulatory effect of micro-RNAs and any other type of non-coding
RNA. Likewise, these technologies will generate detailed data on alternative
splicing, since every transcript will be known in its entire sequence. Thus, new
computational methods capturing the regulation of alternative splicing at a
genomic scale will emerge. Another example is the above-mentioned progress
in proteomics. It is hoped that it will lead to the creation of the first compre-
hensive maps of posttranscriptional regulation.

Acknowledgments

I wish to thank Angela Simeone, Jacob Michaelson, and Antigoni Elefsinioti
for critically reading the manuscript. This work has been funded by the Klaus
Tschira Foundation.

Appendix 1: Large-Scale Detection of Interaction Networks

Microarrays are used to measure the expression of all genes of an organ-
ism in a single experiment. By measuring time course samples or samples
from different tissues, conditions, etc., it is possible to reveal transcrip-
tional changes in response to stimuli or under disease conditions. Algo-
rithms have been devised to infer regulatory dependences between genes
(transcriptional regulatory networks) from those data.
Protein chips are made to measure protein-protein interactions on a large
scale. Here, selected proteins are fixed to a glass surface and interactions
with unknown proteins in a sample can be measured, e.g. via fluorescence.
If the “probe proteins” are antibodies for proteins of interest, the chips can
be used to quantify protein amounts in the sample. Ptacek et al. [55] de-
tected kinase substrates by fixing 4400 proteins onto a protein array. They
incubated arrays with kinases (two arrays per kinase) and subsequently
identified proteins that were phosphorylated.
Kinase binding motifs plus other binding evidence. Prediction of ki-
nase substrates via the protein sequence alone generates many false pos-
itive predictions because short kinase binding motifs are not specific
enough. However, provided a certain putative substrate contains a binding
motif, actual binding can be corroborated if there is additional indepen-
dent evidence that the two proteins bind directly or that they are at least
involved in the same biological process.
eQTL plus other binding evidence. Here again, weak evidence from ex-
pression quantitative trait loci (eQTL) is combined with other indepen-
dent evidence for physical binding of the two proteins.
Yeast two-hybrid. Two potentially interacting proteins are genetically
fused to the two parts of a transcription factor: one to its DNA-binding domain
and the other to its transcriptional activation domain. If both proteins bind in
the nucleus of the yeast cells, the reconstituted factor binds the DNA and activates a
reporter gene (e.g. GFP). Genes from other species (e.g. mouse or human)
have to be transferred into yeast for this method. Interactions between
proteins that cannot interact in the yeast nucleus but would bind in their
native environment cannot be detected with this method.
TAP-MS. “Bait proteins” are purified from a sample using tandem affin-
ity purification (TAP). Other proteins associated with the bait (“prey
proteins”) are identified with subsequent mass spectrometry (MS). Al-
though TAP-MS measures the native in vivo situation, it cannot dis-
tinguish whether binding of a prey to the bait is direct or indirect (i.e.
mediated via another intermediate prey protein). Also, the method cannot
detect transiently binding proteins (unstable binding). Physical interac-
tions measured with yeast two-hybrid (Y2H) or TAP-MS are influenced by artifacts due to
gene tagging, which can influence the functioning of the protein produced
[18, 46].
Known interacting domains. This computational method searches for
known protein-protein interaction domains in the sequences of candidate
genes. The domains may be taken from crystal structures of interacting
proteins. If the same two domains are found in other proteins with high
sequence similarity, this indicates potential physical interactions. This
method is applicable genome-wide. However, it is limited by the avail-
able crystal structures and it does not take the protein 3D structure into
account.
TF binding motifs are short DNA sequences that are targets of a specific
transcription factor. They can be inferred, e.g. from a set of known binding
regions/promoter regions of known target genes. The presence of a bind-
ing motif in a promoter of a potential target gene is usually not sufficient
for clearly identifying the gene as a target. Therefore, binding motif in-
formation is usually supplemented with additional evidence, e.g. whether
the motif is conserved upstream of orthologous genes in other species or
whether the putative target is co-expressed with another known target
gene of the same transcription factor.
ChIP-Chip. Transcription factors (TF) are cross-linked (“fixed”) with DNA,
and after fragmenting the DNA the TF-DNA complexes are purified via im-
munoprecipitation (i.e. with antibodies). Cross-linking is reversed and the
DNA fragments are identified by hybridizing them to a DNA microarray.
Thereby, it is possible to identify all binding sites of a TF for a given
condition genome-wide in a single experiment. A related method (ChIP-
Seq) replaces the final step of DNA identification by high-throughput
deep sequencing. Both methods only measure binding under the specific
condition, i.e. DNA targets bound under different conditions are missed.
Knock-out and expression change. Here one knocks out a regulator gene
of interest and measures the expression difference between wild-type and
the knock-out. Genes that are differentially expressed are likely to be
targets of the regulator. The method can only detect target genes if the
transcription factor is activated under the conditions tested and it cannot
distinguish direct from indirect targets. Also, the knock-out itself will
trigger a range of indirect responses that are not directly related to the
function of the TF, because the cell tries to compensate for the knock-out.
Over-expression and expression change. This approach is complemen-
tary to the preceding method, in that the transcription factor of interest
is constitutively expressed at high levels. One then compares the expres-
sion of genes under normal and high expression. Genes that change their
expression are likely to be targets. This method suffers from several of the
above-mentioned problems as well. For example, it does not distinguish
direct from indirect targets. However, it does not require that the TF is
normally active under the condition tested.
Time course analysis can be used to infer transcriptional regulatory cas-
cades. The underlying hypothesis is that the activity of a transcription
factor can to some degree be predicted from its mRNA level. One mea-
sures the expression levels of all genes with DNA microarrays for several
time points. Using appropriate statistical methods one can then infer likely
target genes from the fact that they are expressed after a certain TF is up-
regulated (or downregulated, depending on whether the TF is an activator
or repressor).
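
The time course approach described in the last item can be sketched with a lagged correlation: the expression of a candidate target is compared with the expression of the TF at earlier time points. The profiles below are hypothetical, and Pearson correlation at a few fixed lags is only one of many possible statistics.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(tf, gene, lag):
    """Correlate TF expression at time t with gene expression at time t + lag."""
    if lag == 0:
        return pearson(tf, gene)
    return pearson(tf[:-lag], gene[lag:])


# Hypothetical expression time courses (arbitrary units, equidistant samples).
TF_LEVEL = [0.1, 0.9, 1.8, 1.2, 0.4, 0.2, 0.1, 0.1]
CANDIDATES = {
    "geneX": [0.2, 0.1, 1.0, 1.9, 1.1, 0.5, 0.2, 0.1],   # follows the TF
    "geneY": [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 0.9, 1.0],   # unrelated
}

for gene, profile in CANDIDATES.items():
    scores = {lag: lagged_correlation(TF_LEVEL, profile, lag) for lag in range(3)}
    best_lag = max(scores, key=scores.get)
    print(gene, "best lag:", best_lag, "r =", round(scores[best_lag], 2))
```

A strongly negative correlation at a positive lag would point to a repressor instead of an activator. As noted above, the whole approach rests on the hypothesis that TF activity can be read off its mRNA level.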

Appendix 2: Some Important Definitions


Alternative splicing, splice variant. Alternative splicing is a mechanism
used by cells to generate different protein sequences from the same gene.
All genes are transcribed into RNA and usually only a part of the tran-
script is used for synthesizing proteins. Some parts of the transcript (called
introns) are “spliced out” (i.e. removed) before translation. Many genes
splice different parts of their transcript depending on cell type or exter-
nal conditions. This process of conditional splicing is called alternative
splicing and the resulting transcripts are called splice variants.
Binding motif. A short DNA or RNA sequence that is recognized by a
binding protein. For example transcription factors recognize the specific
site on the DNA to which they should bind based on a specific sequence
of nucleotides.
DNA hybridization. The binding of single stranded DNA to its comple-
ment (according to Watson–Crick base pairing). DNA hybridization is
utilized to specifically bind sample DNA to probe DNA with a known
sequence (e.g. on DNA microarrays).
DNA microarrays are used to measure RNA concentrations as well as to
identify DNA sequences in biological samples, and are also used for SNP
detection, for re-sequencing, and for a range of other applications. Such
arrays consist of glass slides to which short DNA sequences (“probes”) are
fixed. The DNA probes are either synthesized oligonucleotides or amplified
DNA fragments. DNA concentrations in a given sample are measured
by hybridizing the labeled DNA from the sample to the complementary
probes on the array. More DNA hybridizing to a given probe will be
indicated by a stronger signal. Hence, the signal intensity is a measure of
the abundance of the respective DNA sequence in the sample. RNA first
has to be transformed into cDNA using reverse transcriptases.
Genetic fusion. A variant of genetic manipulation; adding genes or gene
fragments to another gene.
Genotype. An individual’s specific genome sequence. Many genes have dif-
ferent variants (alleles). The pattern of alleles that someone inherited is
the individual’s genotype.
Kinase (protein kinase). A signaling protein adding phosphate groups
onto substrate proteins. The substrate is thereby activated (it acquires
a higher energy level), which may for example alter the structure of the
substrate. The substrate itself can also be a kinase, in which case the sub-
strate in turn can activate its substrates. Such chains of kinases are called
kinase cascades.
Macrophage activation. Macrophages are immune cells responsible for
killing pathogens such as bacteria. The immune response of macrophages
is triggered by pathogen-specific molecules such as bacterial lipids. Upon
such signals, macrophages undergo a range of morphological and other
changes to prepare for attacking pathogens and for “warning” the im-
mune system.
Mass spectrometry (MS). Used for identifying chemical molecules and for
measuring their concentrations in a sample. MS separates fragments of
molecules and measures the mass-to-charge ratio in different types of de-
tectors. By computationally assembling the information about individ-
ual fragments it is possible to deduce the nature of the input molecules
in the sample. Small molecules are measured directly without prior
fragmentation.
Osteoclasts are bone cells responsible for the resorption (destruction) of
bone. Their counterparts are osteoblasts, which generate new bone material.
Bone is constantly degraded and newly formed by these two types of
cells. An excess of osteoclasts leads to osteoporosis (brittle bones).
Phenotype. The expression of a genotype. Individuals may have different
physiological or molecular characteristics based on their genotype. For
example, eye and hair color are phenotypes determined by the respective
gene variants (genotype). A phenotype is generally determined by both
environmental and genetic factors. Biologists often refer to “the phenotype
of a gene” as the physiological change in response to knocking out the
respective gene.
Phylogenetic profile. Describes the occurrence pattern of a gene in differ-
ent species. Two genes occurring in the same set of species are said to have
similar phylogenetic profiles.
Simulated annealing is an optimization technique for finding global max-
ima (or minima) in complex fitness landscapes with many local optima.
Simulated annealing starts searching for an optimum from some (random)
parameter configuration. After a number of iterations the current param-
eters are randomized to some extent in order to overcome boundaries be-
tween local maxima/minima (“heating” of parameters). This procedure
is repeated until convergence, while reducing the level of parameter ran-
domization each time (“annealing”).
Stoichiometry describes the type and number of molecules consumed and
the type and number of molecules produced by a chemical reaction.
Substrate. A molecule chemically changed/consumed by an (enzymatic) re-
action. For example, proteins that are phosphorylated by kinases are called
substrates of the kinases.
Transcription. The process of copying a gene’s sequence into RNA. Poly-
merases are protein machines “reading” the sequence of a gene and pro-
ducing the complementary RNA.
Transcription factor (TF). A regulatory protein controlling the transcrip-
tion of genes. TFs bind directly or indirectly (bridged via other proteins)
to DNA and change the 3D structure of DNA, attract or block transcrip-
tional machinery at the site, or alter other proteins in the vicinity (e.g.
histones) to manipulate the transcription rate of the target gene.
Translation. The process of synthesizing a protein from the respective mes-
senger RNA (mRNA). Ribosomes are molecular machines (consisting of
RNA and proteins) reading an mRNA sequence and translating it into
the corresponding amino acid sequence.
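
As an illustration of one of the definitions above, phylogenetic profiles can be represented as presence/absence vectors and compared pairwise. The gene names and occurrence patterns below are hypothetical, and the Jaccard index is just one possible similarity measure.

```python
# Hypothetical presence (1) / absence (0) of genes across six genomes.
PROFILES = {
    "geneA": [1, 1, 0, 1, 0, 0],
    "geneB": [1, 1, 0, 1, 0, 0],   # same occurrence pattern as geneA
    "geneC": [0, 0, 1, 0, 1, 1],   # complementary pattern
    "geneD": [1, 1, 1, 1, 0, 0],
}


def jaccard(p, q):
    """Shared genomes divided by genomes containing at least one of the genes."""
    both = sum(a and b for a, b in zip(p, q))
    either = sum(a or b for a, b in zip(p, q))
    return both / either if either else 0.0


genes = list(PROFILES)
pairs = sorted(
    ((jaccard(PROFILES[g], PROFILES[h]), g, h)
     for i, g in enumerate(genes) for h in genes[i + 1:]),
    reverse=True,
)
for similarity, g, h in pairs:
    print(f"{g}-{h}: {similarity:.2f}")
```

Gene pairs at the top of the list occur in the same set of species and are therefore candidates for a functional relationship.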

References
1. Aebersold R, Mann M. (2003) Mass spectrometry-based proteomics. Nature.
422(6928):198–207.
2. Albert R, Othmer HG. (2003) The topology of the regulatory interactions predicts
the expression pattern of the segment polarity genes in Drosophila melanogaster.
J Theor Biol. 223(1):1–18.
3. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. (2002) Molecular
biology of the cell. Garland Science, New York.
4. Aloy et al. (2004) Structure-based assembly of protein complexes in yeast. Science.
303(5666):2026–9.
5. Beyer A, Hollunder J, Nasheuer HP, Wilhelm T. (2004) Post-transcriptional ex-
pression regulation in the yeast Saccharomyces cerevisiae on a genomic scale. Mol
Cell Proteomics. 3(11):1083–92.
6. Beyer A et al. (2006) Integrated assessment and prediction of transcription factor
binding. PLoS Comput Biol. 2:e70.
7. Beyer A, Bandyopadhyay S, Ideker T. (2007) Integrating physical and genetic
maps: from genomes to interaction networks. Nature Rev Genet. 8:699–710.
8. Birrell GW, Brown JA, Wu HI, Giaever G, Chu AM, Davis RW, Brown JM.
(2002) Transcriptional response of Saccharomyces cerevisiae to DNA-damaging
agents does not identify the genes that protect against these agents. Proc Natl
Acad Sci USA. 99(13):8778–83.
9. Boone C, Bussey H, Andrews BJ. (2007) Exploring genetic interactions and net-
works with yeast. Nat Rev Genet. 8(6):437–49.
10. Brockmann R, Beyer A, Heinisch JJ, Wilhelm T. (2007) Posttranscriptional
expression regulation: what determines translation rates? PLoS Comput Biol.
3(3):e57.
11. Calvano SE et al. (2005) A network-based analysis of systemic inflammation in
humans. Nature. 437(7061):1032–7.
12. Chen M, Hofestaedt R. (2003) Quantitative Petri net model of gene regulated
metabolic networks in the cell. In Silico Biol. 3:347–365.
13. Chuang HY, Lee E, Liu YT, Lee D, Ideker T. (2007) Network-based classification
of breast cancer metastasis. Mol Syst Biol. 3:140.
14. Collins SR et al. (2007) Toward a comprehensive atlas of the physical interactome
of Saccharomyces cerevisiae. Mol Cell Proteomics. 6(3):439–50.
15. Cox J, Mann M. (2007) Is proteomics the new genomics? Cell. 130(3):395–8.
16. de Lichtenberg U, Jensen LJ, Brunak S, Bork P. (2005) Dynamic complex forma-
tion during the yeast cell cycle. Science. 307(5710):724–7.
17. Domon B, Aebersold R. (2006) Mass spectrometry and protein analysis. Science.
312(5771):212–7.
18. Downard KM. (2006) Ions of the interactome: the role of MS in the study
of protein interactions in proteomics and structural biology. Proteomics.
6:5374–5384.
19. Ernst J, Vainas O, Harbison CT, Simon I, Bar-Joseph Z. (2007) Reconstructing
dynamic regulatory maps. Mol Syst Biol. 3:74.
20. Estojak J, Brent R, Golemis EA. (1995) Correlation of two-hybrid affinity data
with in vitro measurements. Mol Cell Biol. 15:5820–5829.
21. Fauré A, Naldi A, Chaouiya C, Thieffry D. (2006) Dynamical analysis of a
generic Boolean model for the control of the mammalian cell cycle. Bioinformatics.
22(14):e124–31.
22. Foss EJ, Radulovic D, Shaffer SA, Ruderfer DM, Bedalov A, Goodlett DR,
Kruglyak L. (2007) Genetic basis of proteome variation in yeast. Nat Genet.
39(11):1369–75.
23. Gao F, Foat BC, Bussemaker HJ. (2004) Defining transcriptional networks through
integrative modeling of mRNA expression and transcription factor binding data.
BMC Bioinformatics. 5:31.
24. Gavin et al. (2006) Proteome survey reveals modularity of the yeast cell machinery.
Nature. 440(7084):631–6.
25. Gilbert D, Fuss H, Gu X, Orton R, Robinson S, Vyshemirsky V, Kurth MJ,
Downes CS, Dubitzky W. (2006) Computational methodologies for modelling,
analysis and simulation of signalling networks. Brief Bioinform. 7(4):339–53.
26. Goss PJ, Peccoud J. (1998) Quantitative modeling of stochastic systems in molecu-
lar biology by using stochastic Petri nets. Proc Natl Acad Sci USA. 95(12):6750–5.
27. Han JD. (2008) Understanding biological functions through molecular networks.
Cell Res. 18(2):224–37.
28. Heinrich R, Schuster S. (1998) The modelling of metabolic systems. Structure,
control and optimality. Biosystems. 47(1–2):61–77.
29. Helikar T, Konvalina J, Heidel J, Rogers JA. (2008) Emergent decision-making in
biological signal transduction networks. Proc Natl Acad Sci USA. 105(6):1913–8.
30. Ideker T et al. (2001) Integrated genomic and proteomic analyses of a systemati-
cally perturbed metabolic network. Science. 292:929–934.
31. Ideker T, Ozier O, Schwikowski B, Siegel AF. (2002) Discovering regulatory
and signalling circuits in molecular interaction networks. Bioinformatics. 18
Suppl 1:S233–40.
32. International Human Genome Sequencing Consortium (2004). Finishing the eu-
chromatic sequence of the human genome. Nature. 431:931–945.
33. Jansen RC. (2003) Studying complex biological systems using multifactorial per-
turbation. Nature Rev Genet. 4:145–151.
34. Joyce AR, Palsson BO. (2008) Predicting gene essentiality using genome-scale in
silico models. Methods Mol Biol. 416:433–57.
35. Kelley R, Ideker T. (2005) Systematic interpretation of genetic interactions using
protein networks. Nature Biotechnol. 23:561–566.
36. Kiesel J, Miller C, Abu-Amer Y, Aurora R. (2007) Systems level analysis of os-
teoclastogenesis reveals intrinsic and extrinsic regulatory interactions. Dev Dyn.
236(8):2181–97.
37. Klamt S, Gilles ED. (2004) Minimal cut sets in biochemical reaction networks.
Bioinformatics. 20(2):226–34.
38. Klipp E, Nordlander B, Kruger R, Gennemark P, Hohmann S. (2005) Integrative
model of the response of yeast to osmotic shock. Nature Biotechnol. 23:975–982.
39. Klipp E. (2007) Modelling dynamic processes in yeast. Yeast. 24(11):943–59.
40. Krogan et al. (2006) Global landscape of protein complexes in the yeast Saccha-
romyces cerevisiae. Nature. 440(7084):637–43.
41. Krüger M, Kratchmarova I, Blagoev B, Tseng YH, Kahn CR, Mann M. (2008)
Dissection of the insulin signaling pathway via quantitative phosphoproteomics.
Proc Natl Acad Sci USA. 105(7):2451–6.
42. Lage K et al. (2007) A human phenome-interactome network of protein complexes
implicated in genetic disorders. Nature Biotechnol. 25:309–316.
43. Lange V et al. (2008) Targeted quantitative analysis of Streptococcus pyogenes
virulence factors by multiple reaction monitoring. Mol Cell Proteomics. [Epub
ahead of print]
44. Lähdesmäki H, Rust AG, Shmulevich I. (2008) Probabilistic inference of transcrip-
tion factor binding from multiple data sources. PLoS ONE. 3(3):e1820.
45. Lee I, Date SV, Adai AT, Marcotte EM. (2004) A probabilistic functional network
of yeast genes. Science. 306:1555–1558.
46. Legrain P, Wojcik J, Gauthier JM. (2001) Protein–protein interaction maps: a
lead towards cellular functions. Trends Genet. 17:346–352.
47. Linding R et al. (2007) Systematic discovery of in vivo phosphorylation networks.
Cell. 129(7):1415–26.
48. Malmström J, Lee H, Aebersold R. (2007) Advances in proteomic workflows for
systems biology. Curr Opin Biotechnol. 18(4):378–84.
49. Mo ML, Jamshidi N, Palsson BØ. (2007) A genome-scale, constraint-based ap-
proach to systems biology of human metabolism. Mol Biosyst. 3(9):598–603.
50. Myers CL, Barrett DR, Hibbs MA, Huttenhower C, Troyanskaya OG. (2006) Find-
ing function: evaluation methods for functional genomic data. BMC Genomics.
7:187.
51. Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR. (2007) A primer on learning
in Bayesian networks for computational biology. PLoS Comput Biol . 3(8 ):e129.
52. Papin JA, Stelling J, Price ND, Klamt S, Schuster S, Palsson BØ. (2004) Compar-
ison of network-based pathway analysis methods. Trends Biotechnol . 22(8 ):400–5.
53. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. (1999) As-
signing protein functions by comparative genome analysis: protein phylogenetic
profiles. Proc Natl Acad Sci USA. 96:4285–4288.
54. Pflieger D, Jünger MA, Müller M, Rinner O, Lee H, Gehrig PM, Gstaiger M,
Aebersold R. (2008) Quantitative proteomic analysis of protein complexes: con-
current identification of interactors and their state of phosphorylation. Mol Cell
Proteomics. 7(2 ):326–46.
55. Ptacek J et al. (2005) Global analysis of protein phosphorylation in yeast. Nature.
438:679–684.
56. Rajagopalan D, Agarwal P. (2005) Inferring pathways from gene lists us-
ing a literature-derived network of biological relationships. Bioinformatics.
21(6 ):788–93.
57. Ramsey SA et al. (2008) Uncovering a macrophage transcriptional program by
integrating evidence from motif scanning and expression dynamics. PLoS Comput
Biol . 4(3 ):e1000021.
58. Ren B et al. (2000) Genome-wide location and function of DNA binding proteins.
Science. 290(5500 ):2306–9.
59. Rhodes DR et al. (2005) Probabilistic model of the human protein–protein inter-
action network. Nature Biotechnol. 23:951–959.
60. Rual JF et al. (2005) Towards a proteome-scale map of the human protein–protein
interaction network. Nature. 437:1173–1178.
61. Samoilov M, Plyasunov S, Arkin AP. (2005) Stochastic amplification and signaling
in enzymatic futile cycles through noise-induced bistability with oscillations. Proc
Natl Acad Sci USA. 102(7):2310–5.
62. Schilling CH, Letscher D, Palsson BØ. (2000) Theory for the systemic definition
of metabolic pathways and their use in interpreting metabolic function from a
pathway-oriented perspective. J Theor Biol. 203(3):229–48.
63. Scott MS, Perkins T, Bunnell S, Pepin F, Thomas DY, Hallett M. (2005) Iden-
tifying regulatory subnetworks for a set of genes. Mol Cell Proteomics. 4(5 ):
683–92.
64. Sprinzak E, Altuvia Y, Margalit H. (2006) Characterization and prediction of
protein–protein interactions within and between complexes. Proc Natl Acad Sci
USA. 103:14718–14723.
65. Steggles LJ, Banks R, Shaw O, Wipat A. (2007) Qualitatively modelling and
analysing genetic regulatory networks: a Petri net approach. Bioinformatics.
23(3 ):336–43.
56 A. Beyer

66. Stelling J, Klamt S, Bettenbrock K, Schuster S, Gilles ED. (2002) Metabolic net-
work structure determines key aspects of functionality and regulation. Nature.
420(6912 ):190–3.
67. Stelzl U et al. (2005) A human protein–protein interaction network: a resource for
annotating the proteome. Cell. 122:957–968.
68. Stryer L. (1995) Biochemistry. Freeman & Co, New York.
69. Stuart JM, Segal E, Koller D, Kim SK. (2003) A gene coexpression network for
global discovery of conserved genetic modules. Science. 302:249–255.
70. Suthram S, Shlomi T, Ruppin E, Sharan R, Ideker T. (2006) A direct comparison
of protein interaction confidence assignment schemes. BMC Bioinformatics. 7:360.
71. Suthram S, Beyer A, Karp RM, Eldar Y, Ideker T. (2008) eQED: an efficient
method for interpreting eQTL associations using protein networks. Molec Syst
Biol. 4:162.
72. Tyson JJ. (1991) Modeling the cell division cycle: cdc2 and cyclin interactions.
Proc Natl Acad Sci USA. 88(16 ):7328–32.
73. von Mering C et al. (2002) Comparative assessment of largescale data sets of
protein–protein interactions. Nature. 417:399–403.
74. Workman CT et al. (2006) A systems approach to mapping DNA damage response
pathways. Science. 312:1054–1059.
75. Yan W, Hwang D, Aebersold R. (2008) Quantitative proteomic analysis to profile
dynamic changes in the spatial distribution of cellular proteins. Methods Mol Biol .
432:389–401.
76. Yeang CH, Mak HC, McCuine S, Workman C, Jaakkola T, Ideker T. (2005) Val-
idation and refinement of gene-regulatory pathways on a network of physical in-
teractions. Genome Biol. 6(7 ):R62.
77. Zhu J, Zhang MQ. (1999) SCPD: a promoter database of the yeast Saccharomyces
cerevisiae. Bioinformatics. 15:607–611.
Ecological Networks: Structure, Interaction
Strength, and Stability

Samit Bhattacharyya and Somdatta Sinha

Mathematical Modelling and Computational Biology Group,
Centre for Cellular and Molecular Biology, CSIR, Hyderabad 500007, India;
samit@ccmb.res.in, sinha@ccmb.res.in

1 Introduction

Food webs, assemblages of species linked through various interconnections,
are the fundamental building blocks of any ecosystem and a central concept
in ecology. The study of a food web allows abstractions of the complexity
and interconnectedness of natural communities that transcend the specific
details of the underlying systems. For example, Fig. 1 shows a typical food web,
where the species are connected through their feeding relationships. The top
predator, Heliaster (starfish), feeds on many gastropods such as Hexaplex, Morula,
and Cantharus, some of which prey on each other [52]. Interactions between
species in a food web can be of many types, such as predation, competition,
mutualism, commensalism, and amensalism (see Section 1.1, Fig. 2).
Mathematical ecologists have used dynamic models to explore how the
size and connectivity of food webs determine the stability and long-term per-
sistence of a community under fluctuations in density [41], invasion of new
species [11], or nonlinear population dynamics [24]. There are two different
approaches for modeling a food web: the static model and dynamic model.
Static models describe the food web by a graph whose vertices are species
and whose links are the interactions/relations between them. These models
are primarily concerned with the robustness of the food web structure against
modifications (i.e., removal and addition) of vertices and links. Based on the
hierarchical position of the species in a food web, there exist two types of
static models: the cascade model and the niche model. The dynamic models,
on the other hand, account for the stability of food webs, and are represented
by coupled ordinary differential equations, where different functional forms
describe the type of interactions between the species. However, neither the
static nor the dynamic models are useful for making long-term predictions
of the changes in structural organization of food webs due to extinction or
invasion of new species in the community. Other models of food webs, namely the
assembly model and the evolutionary model, mainly focus on this aspect. One
basic assumption for construction of these models is that, instead of being
random as in earlier models, the existence and strengths of links between
species here are based on their web history [17] (see Section 1.2).

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_4,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

Fig. 1. Feeding relationship between a predator Heliaster and marine snails in the
northern Gulf of California (adapted from [52]).

1.1 Some Basic Definitions

A community in ecology comprises all species populations interacting in an
area. An example of a community is a coral reef, where numerous populations
of fishes, crustaceans, and corals coexist and interact.
A food web in an ecosystem is an assemblage of various organisms that are
interconnected with each other through their different life history processes,
such as feeding and shelter.
A trophic level in a food web consists of all the species that prey on the
same species and are also preyed upon by the same species.
Ecological interactions are the relationships between two species in an
ecosystem. Based on either effects, or on mechanisms, these relationships can
be categorized into many different classes of interactions as described below
and shown in Fig. 2. Here, the arrow signifies the flow of resources in the net-
work, and the sign represents the effect of one species on the other. These
interactions vary greatly with respect to their duration and strength. In many
cases, the interactions of two species may have different impacts under dif-
ferent conditions. This is particularly true in, but not limited to, cases where
species have multiple and drastically different life stages.
Predation is a biological interaction in which one species feeds on another.
Most of the interactions in a food web are predatory. Figure 2(i) shows
the network for this interaction, where species 2 preys on species 1. This
interaction enhances the fitness of the predator (indicated by “+”), but reduces
the fitness of the prey species (shown by “−”). Example: Carnivores commonly
prey on herbivores in a grazing food web. Parasitism is
similar to predation by mechanism, as it enhances the fitness of the parasite
but impairs the host. Example: The mite Varroa jacobsoni is a parasite of the
honeybee. It disrupts the honeybee’s colony formation, although not severely.

Fig. 2. Types of ecological interactions: (i) predation, (ii) competition, (iii) mutualism,
(iv) commensalism, (v) amensalism.

Competition between two species occurs when they share a limited resource
and each tends to prevent the other from accessing it. This reduces the fitness
of one or both species, as is shown by “−” in Fig. 2(ii).
In mutualism or symbiosis, two species provide resources or services to
each other. This enhances the fitness of both species (shown by “+” in
Fig. 2(iii)). Example: Pollination is an example of mutualism, which enhances
the fitness of both the plant and the pollinator.
Commensalism is an interaction in which one species receives a benefit from
another species. This enhances the fitness of one species without any effect on
the fitness of the other species (shown by “0” in Fig. 2(iv)). Example: In marine
ecosystems, the clown fish and sea anemone share such a relationship.
In amensalism, one species impedes or restricts the success of the other
without being affected positively or negatively by its presence (shown by “−”
and “0”, respectively, in Fig. 2(v)). Example: The black walnut tree (Juglans
nigra) secretes a chemical (juglone) from its roots that harms or kills some
species of neighboring plants.

1.2 Ecological Network Models

Ecological networks (i.e., food chains and food webs) have been mathemati-
cally described by different types of models, which focus on different aspects
of their structure. We give simple descriptions of a few as follows:

Cascade model. These models are built on the hierarchical positions of the
species in the food chain and use the role of top-down forces (predator to
prey) in shaping the ecological communities. In cascade models, high-ranked
species prey on lower-ranked species in the food chain, and the probability of
consumption depends on the number of connections with these other species.
It has been shown that the mean number of links in the cascade model
grows linearly with the number of species present in the food chain [13].
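
The construction of a cascade-type web can be sketched in a few lines. The sketch below assumes the commonly used formulation in which species are ranked and each species eats each lower-ranked species independently with probability d/S; the function names, the parameter `link_density` (d), and the example values are illustrative choices, not taken from the text.

```python
import random


def cascade_food_web(num_species, link_density, seed=None):
    """Return eats[i][j] == True when species i eats species j.

    Assumption: species are ranked 0..S-1 and species i may eat species j
    only if j < i, each such link being present with probability d / S.
    """
    if not 0.0 <= link_density <= num_species:
        raise ValueError("link_density must lie between 0 and num_species")
    rng = random.Random(seed)
    link_probability = link_density / num_species
    eats = [[False] * num_species for _ in range(num_species)]
    for predator in range(num_species):
        for prey in range(predator):  # strictly lower-ranked species only
            eats[predator][prey] = rng.random() < link_probability
    return eats


def expected_link_count(num_species, link_density):
    """Expected number of links: S(S-1)/2 possible pairs times d/S."""
    return link_density * (num_species - 1) / 2.0


# Hypothetical example: 20 species, d = 4
web = cascade_food_web(num_species=20, link_density=4.0, seed=1)
link_count = sum(row.count(True) for row in web)
```

Under this assumption the matrix is strictly lower triangular, so neither feeding loops nor cannibalism can occur, and the expected link count d(S − 1)/2 is linear in the number of species.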
Niche model. Like the cascade models, the niche models are also structural
in nature. But unlike the cascade model, the rigid hierarchical effect is relaxed
in the niche model by allowing looping and cannibalism. The “niche model”
takes its name from the premise that each trophic species (i.e., a group of
species that share the same predators and prey) belongs to a specific niche
based on what it eats and, in turn, on what eats it. Recent work [74] has
shown that many diverse real food webs can be better described by the niche
model than by the cascade model, in particular with respect to features such
as cycles and species similarities.
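
A minimal sketch of a niche-type construction follows. It assumes the widely used recipe in which each species receives a niche value on [0, 1], a feeding range proportional to that value (scaled by a Beta-distributed factor chosen to give a target connectance C), and a range center drawn at or below its own niche value. The function name, the parameter names, and the omission of refinements such as forcing the lowest-ranked species to be basal are simplifications made for illustration.

```python
import random


def niche_food_web(num_species, connectance, seed=None):
    """Return (niche, centers, ranges, eats) for a niche-type food web.

    eats[i][j] is True when the niche value of species j falls inside
    the feeding range of species i.
    """
    if not 0.0 < connectance < 0.5:
        raise ValueError("connectance must lie strictly between 0 and 0.5")
    rng = random.Random(seed)
    beta_shape = 1.0 / (2.0 * connectance) - 1.0  # gives mean range factor 2C
    niche = sorted(rng.random() for _ in range(num_species))
    centers, ranges = [], []
    for value in niche:
        width = value * rng.betavariate(1.0, beta_shape)
        center = rng.uniform(width / 2.0, value)  # center at or below own niche
        ranges.append(width)
        centers.append(center)
    eats = [
        [abs(niche[j] - centers[i]) <= ranges[i] / 2.0 for j in range(num_species)]
        for i in range(num_species)
    ]
    return niche, centers, ranges, eats


# Hypothetical example: 25 species with target connectance 0.15
niche, centers, ranges, eats = niche_food_web(25, 0.15, seed=7)
```

Because a feeding range may extend above its owner's niche value, a species can eat itself or a higher-ranked species, which is how this construction relaxes the strict hierarchy of the cascade model.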
Assembly model and evolutionary model. In this class of models, as
the names suggest, the species composition and structure of the food web
can change with time due to the ongoing introduction of new species and
by species extinctions. As a consequence, these models focus mainly on the
features of the food web after a sufficiently long time, when the size of the
food web and its other properties stabilize. The primary concern of these
models relates to the stability of the food web due to the introduction of
new species into the system by immigration or speciation (assembly) or by
altering one or a few individuals of existing species in the system (evolution-
ary), though the time scales for evolutionary models are longer than those
for assembly ones [60]. A few species from a “pool” are added one by one to
the existing network. The new species may add stability to the web, become
extinct immediately after introduction because of its poor adaptation, or may
cause one or more other species to become extinct due to competition for
resources. Studies of these models usually measure various
properties of the underlying network and compare them with data from real
food webs to determine the robustness in a given time period [9].
Keystone species. This concept provides an important representation for
understanding the organizing forces in ecological communities. A keystone
species is one whose presence contributes critically to the diversity of life in
the food web, and whose removal has a strong adverse impact on the com-
munity structure, even though the species may occupy only a small part of
the ecosystem in terms of biomass or productivity. Keystone species play an
important role in conservation biology [53, 55].
Community matrix. In an ecological network, which describes interactions
between multiple species at different trophic levels, the community matrix
represents the per capita direct effect of one species on other species in the
community [35]. In simple words, the community matrix is a spreadsheet in which
the rows and columns are species and other elements of their environment,
and each entry quantifies the interaction between the corresponding row and
column elements. The matrix can be used to assess the stability of ecological
networks and their sensitivity to change. Alternative but related definitions
are common in the ecological literature [41, 78].
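
How such a matrix is used to assess stability can be illustrated for two species. The sketch below assumes one of the alternative readings mentioned above, namely that the community matrix is the Jacobian of the population dynamics evaluated at an equilibrium, in which case the equilibrium is locally stable when every eigenvalue has a negative real part. The matrix entries are hypothetical numbers chosen only to show the sign pattern of a prey-predator pair.

```python
import cmath


def eigenvalues_2x2(matrix):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    (a11, a12), (a21, a22) = matrix
    trace = a11 + a22
    det = a11 * a22 - a12 * a21
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


def is_locally_stable(matrix):
    """True when all eigenvalues have negative real parts."""
    return all(value.real < 0.0 for value in eigenvalues_2x2(matrix))


# Rows and columns: (prey, predator). Hypothetical entries:
# prey self-limitation (-), effect of predator on prey (-),
# effect of prey on predator (+), no predator self-effect (0).
damped = [[-0.10, -4.0], [0.03, 0.0]]
# Same pattern but with a positive prey self-effect at equilibrium.
growing = [[0.10, -4.0], [0.03, 0.0]]

stable_flags = (is_locally_stable(damped), is_locally_stable(growing))
```

For a 2×2 matrix the criterion reduces to a negative trace together with a positive determinant, so in this example only the sign of the prey self-effect separates the stable case from the unstable one.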
Interaction strength is a critical descriptor of the magnitude of effect of
one species on the other in an ecological network. There are several interpre-
tations with the common theme of being a measure of the effect of one species
on another or on all others in the network [38, 40]. In this work, we have used
“interaction strength” to represent the per capita direct effects of a species
on another (e.g., the per capita rate of consumption of the prey species by
the predator, “α” in the first equation of Model I in Section 2, which can also
be modulated by prey preference).

The complexity of a large ecological network or a whole food web can be de-
scribed by many indicators: number of trophic levels, number of species and
their connections, density of interactions, etc. Each trophic module embedded
in the entire food web may also have a similar complexity in its structure and
interactions. There are two mutually nonexclusive aspects that underscore a
long-standing problem in food web theory: Do “the details of population dy-
namics in one or a few modules change the structure of the whole system
over time” [64]? The first aspect is structural, and involves the distributive
pattern of trophic structures or motifs which modulates the real food web
system [3, 10, 46]. The other aspect is related to the study of specific trophic
modules in order to infer properties of the entire ecosystem dynamics [4, 23].
The first essentially concerns how robust the functional integrity of the
ecological network is against the removal of nodes. In particular,
the study involves how removing or replacing native species with exotic in-
vaders can alter the food web structure, which could be measured by the
number of secondary extinctions, and by the breakup of the network into
smaller components. For instance, conceptualizing food webs as energy-flow
networks, Allesina and Bodini [2] have indicated that it is the “dominating
nodes” (removal of which would make a number of species disappear from the
network) that act as an energy bottleneck for resources flowing to the other
members of the food web. However, it has also been shown that removal of
species with low connectivity sometimes may have a large effect on the persis-
tence of a community, which reinforces the notion of keystone species in food
web theory [18, 31].
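
Counting secondary extinctions after a removal can be sketched with a purely topological rule. The sketch assumes that a consumer goes extinct once none of its prey remain, that basal species (those with an empty diet) need no prey, and that cannibalism alone cannot sustain a species. The species names and the small web are hypothetical.

```python
def secondary_extinctions(diet, removed):
    """Return the set of species lost as a consequence of removing one species.

    diet maps every species to the set of species it eats; basal species
    map to an empty set.
    """
    alive = set(diet) - {removed}
    basal = {species for species, prey in diet.items() if not prey}
    changed = True
    while changed:  # repeat until no further consumer loses all its prey
        changed = False
        for species in sorted(alive):
            if species in basal:
                continue
            remaining_prey = (diet[species] - {species}) & alive
            if not remaining_prey:
                alive.discard(species)
                changed = True
    return set(diet) - alive - {removed}


# Hypothetical four-species web
diet = {
    "plant": set(),
    "herbivore": {"plant"},
    "omnivore": {"plant", "herbivore"},
    "carnivore": {"herbivore"},
}

lost_without_plant = secondary_extinctions(diet, "plant")
lost_without_herbivore = secondary_extinctions(diet, "herbivore")
```

In this toy web, removing the plant eliminates every consumer, whereas removing the herbivore eliminates only the carnivore because the omnivore still has the plant to feed on. The rule ignores population dynamics and interaction strengths entirely.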
The second aspect that is particularly important for understanding real
ecological network dynamics is related to the distribution of interaction
strengths, which determines how strongly the links in the trophic cascade
are coupled. For example in a consumer-resource interaction, the strength is
a function of the metabolic rate, ingestion rate, and preference of the con-
sumer species for the resource species [43]. The theoretical finding that inter-
action strength is one of the key properties promoting persistence in nonlinear
models of food webs has attracted considerable attention [19, 43, 63, 67].
It is indeed challenging for ecologists to quantify the strengths of species
interactions, identify the patterns that occur between species, and determine
the mechanisms that cause interactions to vary across space and time in natural
ecosystems. These topics are also important for several other reasons. First,
ecosystems, on the whole, provide hospitable conditions for life. Understanding
how the provision of this hospitability is affected by extinctions and alien intro-
ductions is important, because without knowledge of the strengths of species
interactions, our predictions on the consequences of environmental impacts
become indeterminate for any ecosystem with a reasonable degree of com-
plexity [6]. Second, development of a general understanding of how ecological
communities are structured can benefit from an analysis of the general prop-
erties of multispecies population dynamics models [41, 58]. Knowledge of the
pattern of interaction strengths in natural ecosystems can help to guide the
development of appropriate multispecies models.
A number of recent studies consider the influence of interaction strength
on the stability of food webs for real communities [6, 7, 49, 51] and in food
web models with a large number of species [12, 25, 28, 61]. Martinez et al. [39]
studied the effect of the variation of the interaction strength of omnivore links
on the stability of large food webs. General patterns of food web structure also
appear to be an emergent property of dynamical constraints on species inter-
actions [5, 17]. However, in a more realistic study, the food web dynamics has
been considered by introducing time-varying interactions among the adaptive
foragers [20, 21, 30, 56, 71, 72]. Adaptation modifies predator-prey interac-
tion strengths, and thus, acts on the topology of the network by eventually
removing certain links with zero strength. As a consequence, the complexity
of the food web is molded by how adaptive predators design their foraging
strategy. Recently, it has also been shown that adaptation of foraging behav-
ior and stability of food webs can lead to a rise in basal species richness and
link density, which in turn, increases the emergent complexity of the food web
[21, 30].
However, there still remains a significant gap in predictions of food web
stability and of the role of the distribution of interaction strengths in the
food web. This key problem in ecology arises due to the basic difference between
empiricists and theoreticians in understanding the concepts of stability, and
interaction strength [6, 32]. Many theoretical investigations have focused on
the definition of stability in model communities [11, 25, 36, 41, 42, 44]. Multi-
ple definitions of stability have been proposed, some of them designed to have
closer ties to empirical data [14, 15, 33, 37]. This is important because most of
the earlier analytical studies evaluate linear stability of a community at equi-
librium in the face of small perturbations, whereas empirical investigations
focus on community changes (with no assumed equilibrium) in response to
comparatively large perturbations, such as species removals, species additions,
and physical disturbances [75]. The measurement of interaction strength, on
the other hand, centers on different concepts in different studies, and
the only consistent aspect is the use of the words “interaction strength”. Laska
and Wootton [32] have clearly described how theoreticians and empiricists differ
in their understanding of interaction strength. However,
even within theoretical or empirical investigations, there exists a diversity of
indices that measure the link weight, or interaction strength.
This diversity creates a critical problem in predicting the relative effects of strong or
weak interactions in a community. For example, strong consumption intensity
by a predator, or large energy flow from prey to predator, is not necessarily
a good predictor of large dynamical effects on prey abundance [6, 50, 54, 62],
nor is it necessarily a good predictor of strong interaction coefficients in the
community matrix [16]. Similarly, strong interaction coefficients in the com-
munity matrix, which are defined by small perturbations, may not necessarily
predict strong effects of a large perturbation such as species addition or re-
moval [1, 79]. This gap between experiment and theory in understanding the
concepts has remained a long-standing problem in ecological research.
In spite of all the recent and ongoing research on networks in general, and
food webs in particular, a new synthesis of the relationship between struc-
ture and dynamics of complex networks for ecological systems remains elu-
sive. Ecological networks are unlike other networks (including other biological
networks) [47], as the process of various interactions (predation, competition,
mutualism, etc.) between organisms at different trophic levels ties up the en-
tire organization in a unique way, which is an inherent and distinct feature
of the ecological systems. Here, we propose, with simple examples, to explain
how the dynamic interplay between the ecological network structure and in-
teraction strengths regulates the structure of the network and the dynamics
of the species in it. This may provide some interesting insights about the
importance of interaction strengths in ecological network research.

2 Food Web Structure, Interaction Strength, and Stability

Food web models are extensions of bioenergetic consumer-resource models,
which by definition focus exclusively on trophic interactions. In a recent study,
it was shown that predation is the most important process determining the
community structure and dynamics [51]. There are a few important related
factors that regulate the strength of this process, such as metabolic efficien-
cies, handling times, foraging strategies, and frequencies of encounters [21].
In the following section, we consider two simple ecological networks of the
prey-predator interaction and discuss how the emergence of new functional
components that alter interaction strengths can regulate the stability in food
web dynamics.

2.1 The Models

Fig. 3. Food web configurations of (i) Model I and (ii) Model II.

Model I. Figure 3(i) shows a prey-predator system where the predator species
feeds on the prey species. In this simple model, in the absence of
the predator (Y), the prey species (X) follows a density-dependent logistic
growth with r as its intrinsic growth rate, and K as the carrying capacity
of the environment. However, in the presence of the predator, the growth of
the prey is reduced due to predation of Y on X. This interaction follows a
hyperbolic function with γ denoting the half-saturation coefficient of predation
and α deciding the strength of interaction, i.e., the per capita consumption
rate (see [43] for an actual measure of the interaction strength). In the absence
of prey, the predator species dies out exponentially at a rate d. The rate at
which consumed prey adds to the growth of the predator population
is given by the conversion rate β. The rates of change of the prey (dX/dt)
and predator (dY/dt) populations with time are governed by the following
equations:

$$
\begin{aligned}
\frac{dX}{dt} &= rX\left(1 - \frac{X}{K}\right) - \frac{\alpha XY}{\gamma + X}, \\
\frac{dY}{dt} &= Y\left(-d + \frac{\beta\alpha X}{\gamma + X}\right).
\end{aligned}
$$

For this study, the parameter values are taken as r = 0.5, K = 5, d = 3,
γ = 0.8, and β = 0.7. The main parameter, α, which indicates the interaction
strength of predation, is varied from 5.5 to 6.5 to describe the change in
dynamics in this model network.
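
A minimal numerical sketch of Model I is given below. It integrates the two equations with a fixed-step fourth-order Runge-Kutta scheme and reports how much the prey density still swings after a long transient, evaluated only at the two ends of the scanned range of α. The integrator, the step size, the initial condition (X, Y) = (1, 0.1), the run length, and all function names are assumptions made for illustration; the text does not state how the original simulations were performed.

```python
def model1_rhs(x, y, alpha, r=0.5, K=5.0, d=3.0, gamma=0.8, beta=0.7):
    """Right-hand side of Model I with the parameter values quoted in the text."""
    feeding = alpha * x / (gamma + x)  # consumption rate per predator
    dx = r * x * (1.0 - x / K) - feeding * y
    dy = y * (-d + beta * feeding)
    return dx, dy


def rk4_step(x, y, dt, alpha):
    """One classical Runge-Kutta step of size dt."""
    k1x, k1y = model1_rhs(x, y, alpha)
    k2x, k2y = model1_rhs(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y, alpha)
    k3x, k3y = model1_rhs(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y, alpha)
    k4x, k4y = model1_rhs(x + dt * k3x, y + dt * k3y, alpha)
    x_next = x + dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
    y_next = y + dt * (k1y + 2.0 * k2y + 2.0 * k3y + k4y) / 6.0
    return x_next, y_next


def prey_swing(alpha, t_end=1500.0, dt=0.02, tail=300.0, x0=1.0, y0=0.1):
    """Max minus min of the prey density over the last `tail` time units."""
    steps = round(t_end / dt)
    tail_start = round((t_end - tail) / dt)
    x, y = x0, y0
    lowest, highest = float("inf"), float("-inf")
    for step in range(steps):
        x, y = rk4_step(x, y, dt, alpha)
        if step >= tail_start:
            lowest, highest = min(lowest, x), max(highest, x)
    return highest - lowest


swing_low_alpha = prey_swing(5.5)   # lower end of the scanned range
swing_high_alpha = prey_swing(6.5)  # upper end of the scanned range
```

A swing close to zero indicates that the populations have settled on a steady state, while a persistent swing indicates sustained oscillations. Scanning intermediate values of α with the same function would give a rough, simulation-based picture of where the transition occurs.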
Model II. Here, we consider the addition of a new structural component to
the food web of Fig. 3(i); the resulting network is shown in Fig. 3(ii). It is
assumed that the introduction of a virus species (V), which can infect the prey
species (X) of Model I, divides the prey population into two compartments,
Susceptible (S) and Infected (I), with S + I = X, where the susceptible class
follows the same growth law as X in Model I. The flux from the susceptible to
the infected compartment depends on the strength with which the virus attacks
the prey, following a simple law of mass action. This compartmentalization
of the prey species also affects the predation strength: although the predator
(Y) can consume both the susceptible and the infected prey, it may have a
higher preference for the uninfected prey (S), and a concomitantly lower
one for the infected prey (I). Thus, the earlier strong interaction α (in
Model I) is now split into two interactions of variable strength: one
strong interaction augmented with one weak interaction (Fig. 3(ii)). The rate
of change of the virus population (dV/dt) depends on the infected prey
population (I), as every dead and lysed infected prey releases virus into the
environment, initiating a new infection cycle. The temporal evolution of this
ecological network is given as follows:

$$
\begin{aligned}
\frac{dS}{dt} &= rS\left(1 - \frac{S}{K}\right) - \lambda SV - \frac{\xi\alpha SY}{\gamma + (S + I)}, \\
\frac{dI}{dt} &= \lambda SV - \frac{(1 - \xi)\alpha IY}{\gamma + (S + I)} - \eta I, \\
\frac{dY}{dt} &= Y\left(-d + \frac{\beta\alpha\,(\xi S + (1 - \xi)I)}{\gamma + (S + I)}\right), \\
\frac{dV}{dt} &= -\mu V + \kappa\eta I,
\end{aligned}
$$
where V is the virus density and λ defines the strength of viral infection on the
susceptible class of prey, representing the “effective per host contact rate
with viruses.” Parameter η denotes the death rate of the infected prey and μ is
the death rate of the virus. κ denotes the “virus replication parameter,” i.e.,
the number of virus particles produced per infected individual through lysis.
The other parameter that regulates the interaction strength of predation,
indicating the prey preference of the predator, is ξ (ξ ∈ (0, 1)). The exact
choice of these parameter values is arbitrary, but they are kept within the
same range as in [8, 22].
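
The four equations of Model II translate directly into a right-hand-side function, sketched below. The defaults for r, K, d, γ, and β are the Model I values, and those for λ, η, μ, and κ are the values quoted in the Results section; the Python names (`lam` for λ, and so on) are illustrative. The second function evaluates the steady state of the system when the predator is absent, obtained by setting Y = 0 and solving the remaining three equations by hand; whether that state is stable is not examined here.

```python
def model2_rhs(S, I, Y, V, alpha, xi, r=0.5, K=5.0, d=3.0, gamma=0.8,
               beta=0.7, lam=0.002, eta=0.7, mu=0.05, kappa=13.0):
    """Right-hand side of Model II (susceptible, infected, predator, virus)."""
    saturation = gamma + S + I
    dS = r * S * (1.0 - S / K) - lam * S * V - xi * alpha * S * Y / saturation
    dI = lam * S * V - (1.0 - xi) * alpha * I * Y / saturation - eta * I
    dY = Y * (-d + beta * alpha * (xi * S + (1.0 - xi) * I) / saturation)
    dV = -mu * V + kappa * eta * I
    return dS, dI, dY, dV


def predator_free_steady_state(r=0.5, K=5.0, lam=0.002, eta=0.7,
                               mu=0.05, kappa=13.0):
    """Positive steady state (S, I, V) of the prey-virus subsystem (Y = 0)."""
    S = mu / (lam * kappa)           # from dI/dt = 0 and dV/dt = 0
    V = r * (1.0 - S / K) / lam      # from dS/dt = 0
    I = mu * V / (kappa * eta)       # from dV/dt = 0
    return S, I, V


S_star, I_star, V_star = predator_free_steady_state()
residual = model2_rhs(S_star, I_star, 0.0, V_star, alpha=6.0, xi=0.5)
```

With I = V = 0 the function reduces to Model I with α replaced by ξα in the prey equation, which gives a quick consistency check on the implementation.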

3 Results
In Model I, the predation strength α is an important determinant of the
dynamics of the prey and predator populations. As seen in the bifurcation
diagrams of prey and predator populations (Fig. 4), both exhibit equilibrium
dynamics for α < 5.6, but the steady state loses its stability and bifurcates
to limit cycle oscillations of increasing amplitude for α > 5.6. We
now show the effect of modifying the structure of this simple
two-species network by adding a new link through the virus, which
not only separates the prey species into two compartments, but also modifies
the predation strength (Model II). For simulation of the Model II network,
the new parameter values are chosen as λ = 0.002, η = 0.7, μ = 0.05, and
κ = 13.
[Figure 4: two panels, Prey and Predator, plotted against α from 5.5 to 6.5.]

Fig. 4. Bifurcation diagram of the prey and predator in Model I with increasing
predation strength α. At α = 5.6 (approx.), the system undergoes a Hopf
bifurcation from a stable steady state to limit cycle oscillations.

[Figure 5: panels (A) and (B), each showing Susceptible, Infected, Predator, and
Virus plotted against α from 5.5 to 6.5.]

Fig. 5. Bifurcation diagram of all four populations (susceptible prey, infected prey,
predator, and virus) in Model II as a function of the interaction strength parameter α,
for two prey preferences: (A) ξ = 0.5, (B) ξ = 0.99.

The introduction of the new node (V) and links to the existing module
(Model I) has interesting effects on the population dynamics of the species, which
depend on the interaction strength. As ξ regulates the predation strength by
changing the prey preference of the predator, we analyzed Model II for two
different values of this parameter: ξ = 0.99, indicating high preference
for the susceptible prey and very low preference for the infected prey, and
ξ = 0.5, where the predator has no preference for one over the other. Figure
5 shows the bifurcation diagrams of Model II for the two cases, ξ = 0.5 and
0.99. Figure 5(A) shows that, at ξ = 0.5, there are two important changes
that occur in the same range of predation strength, i.e., 5.5 < α < 6.5.
First, the network reduces to only a “prey (S and I) and virus (V)” system
with the predator population going to zero. This happens because, in the
absence of predation, all of S is available for strong viral infection,
which converts the susceptible prey class to the infected one, and the predator
does not have enough prey to survive through predation. Second, the dynamics
of this prey-virus system remain stable with a large virus population and
low prey populations. When the predator has a strong preference for the S
population, i.e., at ξ = 0.99, this situation continues for low predation strength
(until α = 6.2), and the reduced prey-virus system remains stable (Fig. 5(B)).
However, at higher predation strength (α > 6.2), the predator succeeds in
surviving on predation and reduces the population of I strongly enough to
reduce the production of V, which in turn reduces infection, thereby increasing
S, which is then available for predation. This kind of delayed feedback
on S eventually induces oscillations in all four populations, albeit at higher
α compared to Model I. This interesting phenomenon underscores
the fact that the distribution of the type (+ or −) and the strength of interactions
can play a significant role in food web structure and dynamics. It can change
the structure of the network by inducing a species to go extinct, and also
promote stability in an otherwise oscillatory system.
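
A simple bound on the predator equation of Model II may help in reading the ξ = 0.5 case. Because (ξS + (1 − ξ)I)/(γ + S + I) is always smaller than max(ξ, 1 − ξ) when γ > 0, the per capita growth rate of the predator can never exceed −d + βα·max(ξ, 1 − ξ), whatever the prey densities. The sketch below evaluates this bound with the parameter values quoted in the text; the function names are illustrative, and the bound is only a necessary condition for predator persistence, not a sufficient one.

```python
def predator_growth_upper_bound(alpha, xi, beta=0.7, d=3.0):
    """Upper bound on the per capita predator growth rate in Model II."""
    return -d + beta * alpha * max(xi, 1.0 - xi)


def minimum_alpha_for_predator(xi, beta=0.7, d=3.0):
    """Smallest alpha at which the bound above stops being negative."""
    return d / (beta * max(xi, 1.0 - xi))


alpha_needed_no_preference = minimum_alpha_for_predator(0.5)
alpha_needed_strong_preference = minimum_alpha_for_predator(0.99)
bound_at_top_of_range = predator_growth_upper_bound(6.5, 0.5)
```

Under these assumptions the bound stays negative over the whole scanned range of α when ξ = 0.5, so the predator cannot persist there regardless of how much prey is available. For ξ = 0.99 the necessary condition is already met at the lower end of the range, and persistence then depends on the actual prey densities.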

4 Discussion and Conclusion

Community stability in ecology is primarily decided by the topological and
functional architecture of the entire organization. Some studies have indicated
that weak interactions are among the most important threads holding natural
communities together [50, 77], which is also reasserted by our simple models.
Weak interactions have been proposed as the “glue” that binds large networks
together [43], with ramifications for biodiversity. In particular, this has impor-
tant implications for those species whose low abundance and weak per capita
consumption rates might otherwise be taken as evidence of a negligible role
[42]. Large network simulations have shown that the distribution of interac-
tion strengths is strongly skewed towards weak interactions [29, 61]. Although
the experimental quantification of interaction strength in field studies is dif-
ficult, preliminary contributions on the nature of distributions of interaction
strengths within real food webs are slowly emerging [16, 76]. Similarly, the
importance of weak interactions for dynamic stability and species coexistence
has been suggested from matrix analyses of soil food webs, numerical simula-
tions of small and large webs, and experimental manipulations [6, 32, 48, 59].
Our study, with two very simple yet realistic ecological networks, points to-
wards some intriguing features. One point of interest is that the introduction of
another species in a two-species prey-predator interaction network compart-
mentalizes the single prey species into two subgroups leading to additional
diversity in the network. Such a node can modify the network structure by
pushing the predator species to extinction simply based on the interaction
strength and its preference level. At higher values of both these interaction
parameters, the full network structure persists. These features, i.e., the inter-
action strength and network structure, also regulate the population dynamics
of the species. A combination of type and strength of interactions determines
the dynamical stability of the species in the network. One natural extension of
our study would be to introduce yet another class of prey species, Recovered,
which represents the population of individuals that recover from the infection
after a time, and either return to the susceptible class, or may be immune to
further infections. This would, obviously, increase the complexity of the net-
work by adding new nodes and interactions among them. However, this would
contribute towards understanding the concept “diversity leads to stability” on
large-scale food web processes.
Most of the recent research on food web theory in ecology centers around
the local dynamics of a community, but the evolution of food web dynam-
ics across different spatial scales has also received considerable attention
[26, 27, 45, 57, 73]. “Habitat fragmentation and its impact on life” is one of the
most important issues of present research [66]. The destruction of habitat oc-
curs due to a variety of environmental threats, such as habitat removal, invad-
ing alien species, or hunting, each of which may have different effects on food
web structure. Given that they often act concomitantly, these may also inter-
act with each other in unpredictable ways. Introduction of alien species poses
a significant threat to global biodiversity by altering ecosystem processes, such
as nutrient cycling, or disturbance regimes in a community [65], which, in turn,
also affect the strength of the links. If the performance of interacting species
is habitat dependent, then interaction strength may change with scale. Cer-
tain approaches such as hierarchical communities of competitors [69, 70] and
neutral and quasi-neutral communities [68] have been adapted to show that
community organization is relevant in determining the effects of habitat loss
and spatial patterning. Such research on ecological network theory in the fu-
ture would involve rigorous modeling approaches, both analytical and through
simulations, in combination with field and laboratory experimental studies,
to resolve the crucial questions in conservation and restoration ecology.

Acknowledgments
The authors are thankful to the anonymous referees for constructive, criti-
cal comments, and to the Department of Science and Technology, India, for
financial support.

References
1. Abrams, P. et al. The role of indirect effects in food webs. In Food Webs: Inte-
gration of Patterns and Dynamics (eds G.A. Polis & K.O. Winemiller), 371–395,
Chapman & Hall, New York (1996)
2. Allesina, S. and Bodini, A. Who dominates whom in the ecosystem? Energy flow
and bottlenecks and cascading extinctions. J. Theor. Biol., 230, 351–358 (2004)
3. Bascompte, J. and Melian, C. J. Simple trophic modules for complex food webs.
Ecology, 86, 2868–2873 (2005)
4. Bascompte, J. et al. Interaction strength combinations and the overfishing of a
marine food web. Proc. Natl. Acad. Sci. USA, 102, 5443–5447 (2005)
5. Bastolla, U., Lassig, M., Manrubia, S. C. and Valleriani, A. Diversity patterns
from ecological models at dynamical equilibrium. J. Theor. Biol., 212, 11–34
(2001)
6. Berlow, E. L. et al. Interaction strengths in food webs: issues and opportunities.
J. Anim. Ecol., 73, 585–598 (2004)
7. Berlow, E. L., Brose U., and Martinez, N. D. The “Goldilocks factor” in food
webs. Proc. Natl. Acad. Sci. USA, 105, 4079–4080 (2008)
8. Bhattacharyya, S. and Bhattacharya, D. K. Pest control through viral diseases:
mathematical modeling and analysis. J. Theor. Biol., 238, 177–197 (2006)
9. Caldarelli, G., Higgs, P. G. and McKane, A. J. Modelling coevolution in multi-
species communities, J. Theor. Biol., 193, 345–358 (1998)
10. Camacho, J. et al. Quantitative analysis of the local structure of food webs.
J. Theor. Biol., 246, 260–268 (2007)
11. Case, T. J. Invasion resistance arises in strongly interacting species-rich model
competition communities. Proc. Natl. Acad. Sci. USA, 87, 9610–9614 (1990)
12. Chen, X. and Cohen, J. E. Global stability, local stability and permanence in
model food webs. J. Theor. Biol., 212, 223–305 (2001)
13. Cohen, J. E., Briand, F. and Newman, C. M. Community food webs. Biomathe-
matics, 20, Springer-Verlag, Berlin (1990)
14. Dambacher, J. M. et al. Relevance of community structure in assessing indeter-
minacy of ecological predictions. Ecology, 83, 1372–1385 (2002)
15. Dambacher, J. M. et al. Qualitative stability and ambiguity in model ecosystems.
Am. Nat., 161, 876–888 (2003)
16. De Ruiter, P., Neutel, A. M. and Moore, J. C. Energetics, patterns of interaction
strengths, and stability in real ecosystems. Science, 269, 1257–1260 (1995)
17. Drossel, B. and McKane, A. J. Modelling food webs. In Handbook of Graphs
and Networks (eds S. Bornholdt & H. G. Schuster), 218–247, Wiley-VCH, Berlin
(2003)
18. Dunne, J. A. et al. Network structure and biodiversity loss in food webs: robustness
increases with connectance. Ecol. Lett., 5, 558–567 (2002)
19. Emmerson, M. C. and Raffaelli, D. Predator-prey body size, interaction strength
and the stability of a real food web. J. Anim. Ecol., 73, 399–409 (2004)
20. Garcia-Domingo, J. L. and Saldana, J. Food-web complexity emerging from eco-
logical dynamics on adaptive networks. J. Theor. Biol., 247, 819–826 (2007)
21. Garcia-Domingo, J. L. and Saldana, J. Effects of heterogeneous interaction
strengths on food web complexity. Oikos, 117, 336–343 (2008)
22. Ghosh, S., Bhattacharyya, S. and Bhattacharya, D. K. Role of viral infection in
pest control: a mathematical study. Bull. Math. Biol., 69, 2649–2691 (2007)
23. Gross, T. et al. Long food chains are in general chaotic. Oikos, 109, 135–144 (2005)
24. Hastings, A. and Powell, T. Chaos in a 3-species food-chain. Ecology, 72, 896–903
(1991)
25. Jansen, V. A. A. and Kokkoris, G. D. Complexity and stability revisited, Ecol.
Lett., 6, 498–502 (2003)
26. Keitt, T. H. Network theory: an evolving approach to landscape conservation.
Ecological and Modeling for Resource Managers, Springer Berlin, 125–134, (2003)
27. Keitt, T. H. and Economo, E. P. Species diversity in neutral metacommunities:
a network approach. Ecol. Lett., 11(1), 52–62 (2008)
28. Kokkoris, G. D. et al. Variability in interaction strength and implications for
biodiversity. J. Anim. Ecol., 71, 362–371 (2002)
29. Kokkoris, G. D., Jansen, V. A. A., Loreau, M. and Troumbis, A. Y. Variability in
interaction strength and implications for biodiversity. J. Anim. Ecol., 71, 362–371
(2002)
30. Kondoh, M. Does foraging adaptation create the positive complexity-stability
relationship in realistic food-web structure? J. Theor. Biol., 238, 646–651 (2006)
31. Krause, A. E. et al. Compartments revealed in food-web structure. Nature, 426,
282–285 (2003)
32. Laska, M. S. and Wootton, J. T. Theoretical concepts and empirical approaches
for measuring interaction strength. Ecology, 79, 461–476 (1998)
33. Law, R. and Morton, R.D. Permanence and the assembly of ecological commu-
nities. Ecology, 77, 762–775 (1996)
34. Lawton, J. H. Food webs. In Ecological Concepts: the Contribution of Ecology
to an Understanding of the Natural World (ed. J. Cherret), 43–78, Blackwell,
Boston (1990)
35. Levins, R. Evolution in Changing Environments: Some Theoretical Explorations.
Princeton University Press, Princeton, NJ, USA (1968)
36. Logofet, D. O. Stronger-than-Lyapunov notions of matrix stability, or how ‘flow-
ers’ help solving problems in mathematical ecology. Linear Algebra and Its Ap-
plications, 398, 75–100 (2005)
37. Loreau, M. et al. A new look at the relationship between diversity and stabil-
ity. In Biodiversity and Ecosystem Functioning: Synthesis and Perspectives (eds
M. Loreau, S. Naeem and P. Inchausti), 79–91, Oxford University Press, Oxford
(2002)
38. MacArthur, R. H. and Levins, R. Strong, or weak interactions? Transactions of
the Connecticut Academy of Arts and Sciences, 44, 177–188 (1972)
39. Martinez, N. D. et al. Diversity, complexity, and persistence in large model ecosys-
tems. In Ecological Networks, Linking Structure to Dynamics in Food Webs (eds
Pascual, M. and Dunne, J. A.) Santa Fe Inst., Studies in the sciences of complex-
ity. Oxford Univ. Press, 163–185 (2006)
40. May, R. M. Will a large complex system be stable? Nature, 238, 413–414 (1972)
41. May, R. M. Stability and Complexity in Model Ecosystems, Princeton University
Press, Princeton, NJ, USA (1973)
42. McCann, K. S. The diversity–stability debate. Nature, 405, 228–233 (2000)
43. McCann, K. et al. Weak trophic interactions and the balance of nature. Na-
ture, 395, 794–798 (1998)
44. McCann, K. and Hastings, A. Re-evaluating the omnivory–stability relationship
in food-webs. Proc. Roy. Soc. of London, Series B, 264, 1249–1254 (1998)
45. Memmott, J. et al. Biodiversity loss and ecological network structure. In Eco-
logical Networks: Linking Structure to Dynamics in Food Webs (eds. M. Pascual
and J.A. Dunne), Oxford University Press, Oxford (2006)
46. Milo, R. et al. Network motifs: simple building blocks of complex networks. Sci-
ence, 298, 824–827 (2002)
47. Montoya, J. M., Pimm, S. L. and Sole, R. V. Ecological networks and their
fragility. Nature, 442, 259–264 (2006)
48. Montoya, J. M. and Sole, R.V. Topological properties of food webs: from real
data to community assembly models. Oikos, 102, 614–622 (2003)
49. Navarrete, S. A. and Berlow, E. L. Variable interaction strengths stabilize marine
community patterns. Ecol. Lett., 9, 526–536 (2006)
50. Navarrete, S. A. and Castilla, J. C. Experimental determination of predation
intensity in an intertidal predator guild: dominant versus subordinate prey. Oikos,
100, 251–262 (2003)
51. Otto, S. B., Berlow, E. L., Rand, N. E., Smiley, J. and Brose, U. Predator diver-
sity and identity drive interaction strength and trophic cascades in a food web.
Ecology, 89, 134–144 (2008)
52. Paine, R. T. Food web complexity and species diversity. Am. Nat., 100, 65–75
(1966)
53. Paine, R. T. A note on trophic complexity and community stability. Am. Nat.,
103(929), 91–93 (1969)
54. Paine, R. T. Food webs - road maps of interactions or grist for theoretical devel-
opment. Ecology, 69, 1648–1654 (1988)
55. Paine, R. T. A conversation on refining the concept of keystone species. Conservation
Biology, 9(4), 962–964 (1995)
56. Petchey, O. L., Beckerman, A. P, Riede, J. O. and Warren, P. H. Size, foraging,
and food web structure. Proc. Natl. Acad. Sci. USA, 105, 4191–4196 (2008)
57. Peterson, E. E., Theobald, D. M. and Ver Hoef, J. M. Geostatistical modeling
on stream networks: developing valid covariance matrices based on hydrologic
distance and stream flow. Freshwater Biology, 52, 267–279 (2007)
58. Pimm, S. L. The complexity and stability of ecosystems. Nature, 307, 321–326
(1984)
59. Polis, G. A. Stability is woven by complex webs. Nature, 395, 744–745 (1998)
60. Post, W. M. and Pimm, S. L. Community assembly and food web stability, Math.
Biosci., 64, 169–192 (1983)
61. Quince, C. et al. Topological structure and interaction strengths in model food
webs. Ecol. Model., 187, 389–412 (2005)
62. Raffaelli, D. G. Trends in research on shallow water food webs. Journal of Experimental
Marine Biology and Ecology, 250, 223–232 (2000)
63. Rooney, N. et al. Structural asymmetry and the stability of diverse food webs.
Nature, 442, 265–269 (2006)
64. Sabo, J. L. et al. Population dynamics and food web structure - predicting mea-
surable food web properties with minimal detail and resolution. In Dynamic
Food Webs, Multispecies Assemblages, Ecosystem Development and Environmen-
tal Change (eds. de Ruiter, P. C. et al.) Theor. Ecol. Ser., Academic Press, 437–
452 (2005)
65. Schmitz, D. C. and Simberloff, D. Biological invasions: a growing threat. Issues in
Sci. & Tech. 13, 33–40 (1997)
66. Singh, B. K., Subba Rao, J., Ramaswamy, R. and Sinha, S. The role of heterogeneity
on the spatiotemporal dynamics of host-parasite metapopulation. Ecol.
Model., 180, 435–443 (2004)
67. Singh, B. K., Chattopadhyay, J. and Sinha, S. The role of virus infection in a
simple phytoplankton zooplankton system. J. Theor. Biol., 231, 153–166 (2004)
68. Sole, R. V., Alonso, D. and McKane, A. Self-organized instability in complex
ecosystems. Phil. Trans. Roy. Soc. Lond. Ser., B-Biol. Sci. 357, 667–681 (2002)
69. Stone, L. Biodiversity and habitat destruction - a comparative study of model
forest and coral-reef ecosystems. Proc. Natl. Acad. Sci. USA, 261, 381–388 (1995)
70. Tilman, D. et al. Habitat destruction and the extinction debt. Nature, 371, 65–66
(1994)
71. Uchida, S. and Drossel, B. Relation between complexity and stability in food
webs with adaptive behavior. J. Theor. Biol., 247, 713–722 (2007)
72. Uchida, S., Drossel, B. and Brose, U. The structure of food webs with adaptive
behaviour. Ecol. Model., 206, 263–276 (2007)
73. Urban, D. L., Goslee, S., Pierce K. B. and Lookingbill, T.R. Extending commu-
nity ecology to landscapes. Ecoscience, 9, 200–212 (2002)
74. Williams, R. J. and Martinez, N. D. Simple rules yield complex food webs. Nature,
404, 180–183 (2000)
75. Woodward, G. and Hildrew, A. G. Body-size constraints on niche overlap and
intraguild predation in a complex food web. J. Anim. Ecol., 71, 1063–1074 (2002)
76. Wootton, J. T. Estimates and tests of per-capita interaction strength: diet, abun-
dance, and impact of intertidally-foraging birds. Ecological Monographs, 67, 45–
64 (1997)
77. Wootton, J. T. and Emmerson M. Measurement of interaction strength in nature.
Annu. Rev. Ecol. Evol. Syst., 36, 419–444 (2005)
78. Yodzis, P. The indeterminacy of ecological interactions as perceived through per-
turbation experiments. Ecology, 69, 508–515 (1988)
79. Yodzis, P. and Innes, S. Body-size and consumer-resource dynamics. Am. Nat.,
139, 1151–1175 (1992)
Signaling and Feedback in Biological Networks

Sandeep Krishna, Mogens H. Jensen, and Kim Sneppen

Center for Models of Life, Niels Bohr Institute, Blegdamsvej 17, 2100 Copenhagen,
Denmark; sandeep@nbi.dk, mhjensen@nbi.dk, sneppen@nbi.dk

1 Introduction
Cellular processes operate on a wide range of time and length scales to produce
complex and intricate dynamics. It is a great challenge to understand both
how these dynamical patterns are produced and why they are produced;
that is, what functional or evolutionary role do they play? This is one of the
most fruitful areas in which to apply the ideas of complex networks. Living
cells have all the prerequisites for a useful representation as networks. First,
cellular systems contain numerous non-identical active components—genes,
proteins, RNA, etc. These are the nodes of the network. Second, there are
many interactions between these components, which form the links between
the nodes. Not every pair of components interacts, so the resulting network
is not fully connected, nor is it a tree or other simple topology. Thus, cellular
networks provide plenty of scope for analysing their structure and graph-
theoretic properties, and numerous studies have taken advantage of this (see
[1] for reviews and [2–9] for some examples).
Network representations of cellular systems can easily be augmented to
address dynamical issues. Each node can be associated with a dynamical vari-
able which could represent, for example, the concentration of that protein or
the level of expression of that gene. Equations or rules governing the tem-
poral dynamics of these variables can then be written, where the network
structure determines which variables interact with each other. This usually
requires encoding more information about the interactions into the network
representation. For instance, apart from knowing that one node links to an-
other, one needs to know the sign and strength of the interaction. However, in
a network picture it is sometimes difficult to encode more detailed molecular
information, such as whether the binding of a protein to DNA is accompanied
by DNA looping, or whether a small molecule that binds to a protein can also
bind equally well when that protein is bound to DNA.
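
As a minimal sketch of this idea, the Python fragment below attaches one variable to each node and lets a signed, weighted interaction matrix decide which variables enter each equation. The linear rate law, the basal production terms, the decay constant and the three-node example (A activates B, B inhibits C) are illustrative assumptions, not a model of any particular cellular system.

```python
def simulate(weights, basal, x0, decay=1.0, dt=0.01, steps=2000):
    """Euler-integrate dx_i/dt = basal_i + sum_j weights[i][j] * x_j - decay * x_i.

    weights[i][j] is the signed strength of the link j -> i
    (positive = activation, negative = inhibition, 0 = no link).
    """
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        dx = [
            basal[i] + sum(weights[i][j] * x[j] for j in range(n)) - decay * x[i]
            for i in range(n)
        ]
        x = [x[i] + dt * dx[i] for i in range(n)]
    return x


# Hypothetical network. Nodes: 0 = A, 1 = B, 2 = C.
# A activates B with strength +2.0; B inhibits C with strength -1.0.
weights = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
]
basal = [1.0, 0.0, 3.0]  # assumed constant production of A and C

final_state = simulate(weights, basal, x0=[0.0, 0.0, 0.0])
```

With these assumed numbers the variables settle near A = 1, B = 2 and C = 1. Only the non-zero entries of `weights` couple the equations, which is the sense in which the network structure determines which variables interact.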

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_5,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
The question then is: What kind of physiologically useful processes can be
illuminated by the kind of information that is easily represented in a network
picture of a cell? One broad class of such processes is signal propagation.
Signals need to be sent in response to environmental conditions in order to
trigger the appropriate functional proteins, and need to be sent between pro-
teins in order to perform necessary computations. For example, the presence
of food metabolites in the surroundings triggers signals to proteins involved in
transport and metabolism of those molecules; or a sudden change in the tem-
perature triggers signals to proteins which buffer the cell against the shock.
Network representations of cellular systems are particularly suited to study
signal propagation because they precisely delineate the paths along which sig-
nals could travel. The next level of complication occurs when a signal loops
back onto itself. Such feedback loops are at the core of every non-trivial com-
putation performed by a cell [10–16]. Feedback loops are necessary for much
non-trivial dynamical behaviour, in particular, oscillations and multistability,
both of which are important for proper cellular function in different organisms.
Our review will therefore introduce biological networks specifically with
the intention of investigating signal propagation and feedback. We will de-
scribe simple measures for examining signal propagation on networks. We
will use the organism-wide cellular network of E. coli to discuss whether the
network structure has any particular properties which would affect the cost
and specificity of signal propagation. The review will then continue by dis-
cussing feedback in sub-networks of mammalian and yeast cells. We will take
one example each from a biological setting where, respectively, negative and
positive feedback in the network structure play a crucial role in the dynamical
behaviour of the system. Finally, we will conclude by looking at combinations
of feedback loops. We show that two entangled feedback loops, which are com-
mon in bacterial cells, have dynamical properties that are quite different from
those of their individual loops.

2 Signaling
An organism-wide protein network of the bacterium E. coli can be extracted
from the database EcoCyc [17] and represented as a directed, bipartite graph
with 2846 protein nodes and 2774 reaction nodes [18]. The reaction nodes
include all kinds of cellular reactions between proteins: transcription reactions,
complex formations, protein modifications and metabolic reactions. Figure 1A
shows the giant weakly connected component of this graph, consisting of 1938
reactions (of which 812 are transcription reactions, squares) and 1897 proteins
(circles). Figure 1A also illustrates that the E. coli graph is composed of a
large number of relatively small strong components (a strong component is
a sub-graph where there is a directed path between every pair of nodes).
Figure 1B compares this with the strong component structure of a randomised
network with exactly the same number of nodes and links, as well as the same
in- and out-degree (number of in- and out-links) of each node. The E. coli
protein network is much more modular than the randomized network, an
overall feature of regulation/signaling that was first suggested in [19].

Fig. 1. E. coli protein reaction network. (A, Left) The graph is the largest weak
component of a bipartite network, consisting of proteins (circles) and reaction nodes
(promoters (squares), complex formations and modifications (black squares)). The two
largest hubs, σ70 and CRP, and their links, have been removed for ease of visualisation.
(A, bottom left) Illustration of the procedure of making the strong component graph.
(A, Right) The resulting strong component graph of the E. coli network. An arrow
in the strong component graph indicates that there is a path connecting the two
strong components in the original graph; nodes correspond to strong components of
minimum size two. (B) The strong component graph for a randomized version of the
E. coli network. The randomisation preserves the total number of nodes, total number
of links and the number of in- and out-links of each node [18].

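
The two operations behind this comparison, finding strong components and randomising a directed graph while keeping every node's in- and out-degree, can be sketched in Python as follows. This is an illustrative sketch, not the procedure of [18]: the function names and the toy graph are assumptions, Kosaraju's algorithm stands in for whichever method was actually used, and the swap routine ignores node types, so for a bipartite protein-reaction graph one would additionally restrict swaps to links of the same kind.

```python
import random


def strong_components(nodes, edges):
    """Return the strong components (as sets) of a directed graph (Kosaraju)."""
    succ = {v: [] for v in nodes}
    pred = {v: [] for v in nodes}
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)

    # First pass: record nodes in order of depth-first completion.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:  # all successors explored
                order.append(node)
                stack.pop()

    # Second pass: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        comp, todo = set(), [root]
        assigned.add(root)
        while todo:
            node = todo.pop()
            comp.add(node)
            for prv in pred[node]:
                if prv not in assigned:
                    assigned.add(prv)
                    todo.append(prv)
        components.append(comp)
    return components


def randomise(edges, n_swaps, seed=0):
    """Degree-preserving randomisation: exchange the targets of two random links."""
    rng = random.Random(seed)
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        # Reject swaps that would create self-loops or duplicate links.
        if a == d or c == b or new1 in present or new2 in present:
            continue
        present -= {(a, b), (c, d)}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges


# Hypothetical toy graph: a 3-cycle feeding a 2-cycle feeding a sink node.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (5, 6)]
components = strong_components(nodes, edges)
shuffled = randomise(edges, n_swaps=100)
```

Keeping only components with at least two nodes, for example `[c for c in components if len(c) >= 2]`, mirrors the convention stated in the caption of Fig. 1.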
In such a network, what we call “signals” are perturbations in the dynami-
cal variables associated with the nodes. For instance, if they were all proteins,
then a perturbation in the concentration of one protein would alter the con-
centration of all the proteins downstream from the original one. The simplest
aspect of the structure of the network that influences signaling is the number
of nodes that are downstream of any given starting node (note that this is a
quantity that can be sensibly studied only with a directed graph representation
of the network; in any connected undirected graph all nodes are downstream
of each other). The possible signals emanating from the starting node are
obviously limited to reach only these nodes. The strong component graphs
in Fig. 1 show particularly clearly how the network structure affects signal-
ing possibilities. Within each strong component, every node can, in principle,
send a signal to another node. But between strong components the possibil-
ities are hugely reduced. Thus, the E. coli network structure already seems
to be set up to allow plentiful signaling on short length scales, but to allow
only very specific paths on longer length scales. In the random network, how-
ever, most nodes can send signals to almost the entire network (because most
of the nodes are part of one giant strong component). A percolating struc-
ture like this is not conducive to specific signaling because every node has
almost the entire network downstream of it. Figure 2 bolsters this conclusion,
showing that in the E. coli network proteins have a much smaller number of
downstream targets than in the randomised network.

Fig. 2. The cumulative distribution of number of downstream targets s for nodes of
the E. coli network (lower curve) and the randomised network (upper curve) [18].
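
Counting downstream targets amounts to a reachability search from each node of the directed graph. The Python sketch below uses breadth-first search. The toy graph is hypothetical, and the convention for the cumulative distribution (the fraction of nodes with at least s targets) is an assumption, since the text does not state which convention Fig. 2 uses.

```python
from collections import deque


def downstream(succ, start):
    """Return the set of nodes reachable from `start` along directed links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # the starting node is not counted as its own target
    return seen


def cumulative_distribution(succ, nodes):
    """Fraction of nodes having at least s downstream targets, for each s."""
    counts = [len(downstream(succ, v)) for v in nodes]
    n = len(counts)
    return {s: sum(c >= s for c in counts) / n for s in range(max(counts) + 1)}


# Hypothetical toy graph: A -> B, B <-> C, and an isolated node D.
succ = {"A": ["B"], "B": ["C"], "C": ["B"], "D": []}
distribution = cumulative_distribution(succ, ["A", "B", "C", "D"])
```

In this toy graph A reaches two nodes, B and C reach one each, and D reaches none, so the resulting distribution is `{0: 1.0, 1: 0.75, 2: 0.25}`.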

2.1 Cost of Signaling

Signaling is not just about reaching a downstream target. As a signal propagates,
it needs other molecules to help it pass the message across consecutive
reactions. Consider, for example, a signal initiated by an increase in the
concentration of a given transcription factor. The promoter it influences may
depend on other transcription factors, for example, in an or-gate construction.
If that is the case, and the other transcription factor is already abundant, the
promoter activity will not be influenced and thus the signal will not be trans-
mitted. More generally, for each additional reactant along a reaction pathway,
signal propagation becomes increasingly coupled to the overall state of the
molecules in the cell. The more reactions in the path, and the more reactants
in each reaction, the more conditions that must be met for propagation of the
signal.

Fig. 3. (A) Schematic showing how the “cost” of a signaling path, A → F, is measured.
In this case proteins B and D are necessary, giving a cost C = 2. (B) Cost of a signaling
path as a function of its length for the real (solid) and randomised (dashed) E. coli
networks [18].

We quantify this cost C = C(path) for an arbitrary path from a starting
protein to a target protein by simply counting the number of reactants along
the entire path (not counting the protein nodes which are part of the path), as
described schematically in Fig. 3A. If the same reactant is used several times,
it is only counted once. Notice that the propagation of a signal does not
necessarily mean an increased level of the proteins involved. The key point is
that a change in input state should be transmitted to a changed output state
of the end product. Our cost function is a simple measure of the complexity
of handling such a signal and it could, in principle, be calculated between any
pair of proteins where a path exists in the directed network.
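
The counting rule just described can be sketched in a few lines of Python. The data layout (a path given as an alternating list of protein and reaction nodes, plus a dictionary of reaction inputs) and the reaction names are assumptions made for illustration. The example is a hypothetical arrangement chosen to reproduce the cost C = 2 quoted for the path A → F, not a transcription of the actual schematic.

```python
def path_cost(path, reaction_inputs):
    """Number of distinct reactants needed along a path, excluding path proteins.

    path: alternating list [protein, reaction, protein, ..., protein].
    reaction_inputs: dict mapping each reaction to the proteins it consumes or needs.
    """
    on_path = set(path[0::2])  # proteins that are part of the path itself
    helpers = set()            # a set, so a reused reactant is counted once
    for reaction in path[1::2]:
        helpers |= set(reaction_inputs[reaction]) - on_path
    return len(helpers)


# Hypothetical reactions: r1 needs A and B, r2 needs C and D, r3 needs E and B again.
reaction_inputs = {
    "r1": ["A", "B"],
    "r2": ["C", "D"],
    "r3": ["E", "B"],
}
path = ["A", "r1", "C", "r2", "E", "r3", "F"]
cost = path_cost(path, reaction_inputs)
```

Here `cost` is 2: proteins B and D are required but lie off the path, and B is counted once even though two reactions use it.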
Figure 3B shows the average cost of signals propagating from one protein
to another along the shortest path connecting them, as a function of the length
l of that path. Each data point is the average over all pairs which are at the
given distance. Except for paths of length two, the average cost for signals
is significantly smaller for the real E. coli network than for the randomised
network (error bars are smaller than the symbol size).

Fig. 4. The six largest strong components of the E. coli network (A–F), along with
plots of the average cost, C(l), as a function of signaling distance. The grey areas show
the range spanned by C(l) for 100 randomised versions of the subgraphs [18].

Figure 4 repeats this analysis for each of the six largest strong components
in the network. These strong components capture distinct functional units
associated, respectively, with (A) predominantly fatty acid metabolism, (B) the
transcription network around σ factors, (C) PTS-sugar transport, (D) ABC
transporters, (E) the FeII and FeIII transport system and (F) the chemotaxis
module. Overall, we see that the cost within each module is fairly similar to
the random expectation.

2.2 Conclusions About Signaling

We have shown that the molecular network of E. coli is designed in a way which
facilitates local signaling. On longer distances, signal transmission is a priori
nearly impossible, but we find statistical evidence for signal pathways in terms
of a lower signaling “cost” when we measure this by the number of co-factors
needed to transmit a given signal. The fact that the E. coli network has a
lower than randomly expected cost of signaling for paths longer than two steps
shows that it contains many linear chains which have few incoming branches.
That is, the real network is “stringy,” while the randomised network is more
“bushy,” having relatively many more branched pathways. Topologically, a
low cost is equivalent to less cross talk, which is indeed desirable [3, 19].
This picture of a stringy network of long linear chains applies to the large
scale: the place where the real network optimizes specific signaling is between
strong component modules, rather than within them. A final intriguing point
is that at small scales, within modules, the network has widely different design
features, as seen from Fig. 4. Some modules (C,F) are dominated by complex
formation reactions, and others (D,E) by linear pathways, while the remaining
(A,B) are densely interconnected.
Obviously, signaling is not only limited by the topology of the network, but
also by the type of chemical reactions that facilitate the signals. For example,
in pure protein-protein interaction networks, Refs. [20, 21] show that proteins
with high concentrations propagate signals to proteins at low concentrations,
but not vice versa. Further, when most of a protein is present in an unbound
form, rather than in a complex with other proteins, it inhibits propagation of
signals through that node of the network. Thus, the overall picture of signaling
in biological networks is that one needs careful engineering of both topology
and protein binding chemistry in order to facilitate signal propagation over
more than one or two reactions.

3 Feedback
Figure 5 shows a number of feedback loops. Each node in each loop receives
signals (perturbations) from the previous node and sends them on to the next
node in the cycle. When the signal travels all the way around the loop, it will
act to either dampen (negative feedback) or enhance (positive feedback) the
original perturbation. Whether the feedback is positive or negative depends
on how the nodes interact. In Fig. 5 we use an ordinary arrow to indicate that
a node activates the next node, and a barred arrow if a node inhibits the next
node. Then, clearly, all loops with an odd number of repressors are negative
feedback loops, while those with an even number of repressors are positive
feedback loops.

[Figure 5. Negative feedback loops: (a) Hes1; (b) p53, Mdm2; (c) lactose,
β-galactosidase, LacI; (d) NF-κB, IκBα mRNA, IκBα. Positive feedback loops:
(e) cI; (f) cI, Cro; (g) lactose, LacI, lactose transporter.]

Fig. 5. Examples of positive and negative feedback loops. An ordinary arrow indi-
cates activation, a barred arrow indicates inhibition. (a)–(d) Negative feedback loops
involving proteins important for, respectively, development [29], apoptosis [30],
lactose consumption [31, 32] and the immune system [33]. (e)–(g) Positive feedback
loops involving proteins important for, respectively, λ phage lysis-lysogeny decision
and induction [34], and import of extracellular lactose [31, 32].

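
The parity rule for the sign of a loop is easy to state in code. In the Python sketch below, the function name and the encoding of links as +1 (activation) and -1 (inhibition) are assumptions made for illustration. The example loops are ones mentioned in this chapter, with the mutual repression of cI and Cro taken as a well-known property of phage λ.

```python
def feedback_type(signs):
    """Classify a loop from the signs of its links, listed once around the cycle.

    signs: iterable of +1 (activation) or -1 (inhibition).
    An odd number of inhibitory links gives negative feedback.
    """
    repressors = sum(1 for s in signs if s < 0)
    return "negative" if repressors % 2 == 1 else "positive"


loops = {
    "Hes1 self-repression": [-1],
    "NF-kB activates IkBa, IkBa inhibits NF-kB": [+1, -1],
    "repressilator (three repressors in a ring)": [-1, -1, -1],
    "cI and Cro mutual repression": [-1, -1],
}
classified = {name: feedback_type(signs) for name, signs in loops.items()}
```

The first three loops are classified as negative feedback and the last as positive, since two inhibitions in sequence amount to a net activation.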
In cellular networks such feedback loops are quite common. Previous stud-
ies which searched for small “motifs” in cellular networks found very few feed-
back loops and an overabundance of feedforward loops [7]. However, these
studies looked only at transcription factor networks. As soon as one includes
metabolism, it quickly becomes apparent that feedback loops are by far
the most common motif, especially at the interface between the metabolic
and regulatory networks of the cell [22]. This interface is quite extensive, as
evidenced by the fact that around half of all transcription factors in E. coli
have a binding site for small metabolic molecules. The sugar lactose is one
such example, being involved in both a negative (Fig. 5c) and a positive feed-
back loop (Fig. 5g). Figure 5 also shows some other examples of negative and
positive feedback loops without small molecules.
Positive feedback loops are closely related to the existence of multiple sta-
ble states of the system, while negative feedback loops are associated with
oscillations. In fact, for a very general class of systems, it has been shown
that the existence of at least one negative feedback loop is necessary (but
not sufficient) for oscillations, and a similar result holds for positive feedback
and multistability [23–25]. References [26–28] study, both theoretically and
through the construction of synthetic gene circuits, multistability in positive
feedback networks. Ref. [35] further explores the connection between oscilla-
tions and negative feedback, showing how the structure of the underlying loop
can be extracted from oscillating time series.

3.1 Negative Feedback and Oscillations in Mammalian Immune Response

The simplest negative feedback loop is, of course, a protein which represses
itself (Fig. 5a). There are many examples of such proteins: the main regulator
of the E. coli response to UV damage, LexA, represses its own production
[36]; Hes1, involved in development in mammalian cells, also represses tran-
scription of its own gene [29]. A well-known synthetic negative feedback loop
is the repressilator, which consists of three proteins, each repressing the next in a cycle
[37] (the same structure as Fig. 5c). Here we will concentrate on the negative
feedback loop shown in Fig. 5d containing the transcription factor, NF-κB,
which is one of the central regulators of the immune system in mammalian
cells. The NF-κB family of proteins is one of the most studied, being involved
in a variety of cellular processes including immune response, inflammation
and development. NF-κB can be activated by a number of external stimuli
including bacteria, viruses and various stresses and proteins. In response to
these signals it controls, directly and indirectly, over 150 genes including many
chemokines, immunoreceptors, stress response genes and acute phase
inflammation response proteins [33].
Nuclear NF-κB is known to activate production of IκBα, an inhibitor
protein which inhibits nuclear import of NF-κB by sequestering it in the
cytoplasm, thus forming a negative feedback loop. Experimentally, when the
NF-κB system is suitably excited, the concentration of NF-κB in the nucleus
begins to oscillate [10, 38].
How does the negative feedback loop of NF-κB produce oscillations? Phys-
ically, what is required for instability of the fixed point, and hence oscillations,
is a time delay, i.e., a sufficient slowing down of the signal going around the
loop. (If a perturbation in the concentration of one variable instantaneously
affects the concentration of the next one, and so on, then for a negative feed-
back loop, any perturbation will be immediately cancelled and the steady state
will be stable.) In cellular systems many processes could produce time delays:
(i) a process that takes a finite minimum time, (ii) many intermediate steps,
(iii) a sharp response by some of the variables, (iv) saturated degradation, or
(v) autocatalysis (see Ref. [39] for more details).
In the NF-κB system it is, in fact, saturated degradation of IκB that is be-
hind the oscillations. NF-κB forms a complex with its inhibitor protein IκBα.
This complex has the curious property that the external stimulus (a protein
kinase called IKK) leads to a degradation of IκBα only when it is bound in
the complex, and not when it is unbound. As a result, the degradation rate
of IκBα has an upper limit, i.e., is saturated, due to the limited amount of
NF-κB present and hence the limited amount of complex that can form.
Mathematically, it is possible to describe all the essential features of the
NF-κB system using a very simple model consisting of only three variables
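[11], nuclear NF-κB ($N_n$), cytoplasmic IκB ($I$) and IκB mRNA ($I_m$):

\begin{align}
\frac{dN_n}{dt} &= A\,\frac{1 - N_n}{\epsilon + I} - B\,\frac{I N_n}{\delta + N_n}, \tag{1}\\
\frac{dI_m}{dt} &= N_n^2 - I_m, \tag{2}\\
\frac{dI}{dt} &= I_m - C\,\frac{(1 - N_n)\,I}{\epsilon + I}. \tag{3}
\end{align}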
The saturated degradation is the second term in the last equation. Other
terms in the equations model processes like nuclear import and export of
NF-κB, production of IκB, etc. (see [11] for more details).
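
To make the saturation concrete, the right-hand sides of Eqs. (1)-(3) can be evaluated numerically. The sketch below uses the parameter values quoted in the caption of Fig. 6; the small constant in the denominators of Eqs. (1) and (3) is written `EPS` and is assumed to be the parameter quoted there as 2 × 10⁻⁵. The function names are illustrative and do not come from Ref. [11].

```python
# Parameter values quoted in the caption of Fig. 6.
A, B, C = 0.007, 954.5, 0.035
DELTA, EPS = 0.029, 2e-5  # EPS: assumed name for the small denominator constant

def ikb_degradation(nn, i):
    """Saturated degradation term of Eq. (3): C (1 - Nn) I / (EPS + I)."""
    return C * (1.0 - nn) * i / (EPS + i)

def rhs(nn, im, i):
    """Right-hand sides of Eqs. (1)-(3) for the state (Nn, Im, I)."""
    d_nn = A * (1.0 - nn) / (EPS + i) - B * i * nn / (DELTA + nn)
    d_im = nn**2 - im
    d_i = im - ikb_degradation(nn, i)
    return d_nn, d_im, d_i

# The degradation rate grows with I only while I is comparable to EPS;
# for larger I it levels off just below C (1 - Nn).
for i in (1e-6, 1e-5, 1e-4, 1e-2, 1.0):
    print(i, ikb_degradation(0.0, i))
```

Feeding `rhs` to a numerical integrator would give trajectories to compare with Fig. 6; that step is left out here because the choice of solver and step size is not specified in the text.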
An obvious question is why the cell requires oscillations in NF-κB in re-
sponse to inflammation. This is a subject of much debate currently, and there
is no clear answer. However, our model of NF-κB provides a possible clue:
One property of the oscillations of nuclear NF-κB (in Fig. 6) that stands out
is that they are extremely spiky. The spikiness is extremely robust to changes

Fig. 6. (Left) Oscillations of nuclear NF-κB (Nn) (black curve) and cytoplasmic IκB
(grey curve) for simulations of the model with A = 0.007, B = 954.5, C = 0.035,
δ = 0.029 and ε = 2 × 10⁻⁵ (these parameter values are derived from the ones used
in Ref. [10], see [11]). In order to facilitate comparison with the experimental plot
(right, obtained from Ref. [38]), the x-axis has been limited to 600 minutes, but the
oscillations are sustained.

Fig. 7. Sensitivity to IKK. (Left) Spike duration, the fraction of time Nn spends above
its mean value, as a function of IKK concentration. (Right) Spike peak, the maximum
concentration of nuclear NF-κB, as a function of IKK concentration. In both plots,
the black dot shows the IKK value used in Fig. 6, which separates regions of spiky and
soft oscillations [11].

in parameter values. More generally, both the existence and the spikiness of the
oscillations are robust to changes in most of the parameters of the model [11].
However, the system shows a very sensitive response to change in one param-
eter: the external stimulus, IKK. Figure 7 shows that both the spike height
(or peak level), as well as the spike duration, can change by large amounts
in response to small changes in the IKK level. Notice that this sensitivity
is particularly high in IKK ranges which are near the transition from spiky
to soft oscillations. It can be shown that this sensitivity can be transmitted
to genes that are affected by NF-κB, producing a gene response sensitivity
that is much larger than that obtained by other typical mechanisms which do
not involve oscillations [40, 41]. Thus, oscillations could be a by-product of
designing the system to have a very high sensitivity to small changes in the
external stimulus.
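
The two quantities plotted in Fig. 7 are defined in its caption: spike duration is the fraction of time Nn spends above its mean value, and spike peak is the maximum concentration of nuclear NF-κB. A minimal sketch of how they could be computed from a time series, assuming uniform sampling in time (the function name and the two toy series are made up for illustration):

```python
def spike_stats(series):
    """Return (duration, peak) for a uniformly sampled time series.

    duration: fraction of samples lying strictly above the series mean
    peak:     maximum value in the series
    """
    if not series:
        raise ValueError("series must be non-empty")
    mean = sum(series) / len(series)
    above = sum(1 for x in series if x > mean)
    return above / len(series), max(series)

# Toy data: brief, tall excursions versus a smoother oscillation.
spiky = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
soft = [0.0, 0.5, 1.0, 0.5, 0.0, 0.5, 1.0, 0.5]
print(spike_stats(spiky))
print(spike_stats(soft))
```

A spiky series spends only a small fraction of its time above the mean, so a short duration in this sense is a simple numerical signature of spikiness.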

3.2 Positive Feedback and Bistability in Yeast Epigenetics

Cells carry information handed down from their ancestors and are able to pass
on information to their descendants. In many cases this “memory” is epige-
netic—not stored in the DNA sequence—allowing cells with identical DNA to
maintain distinct properties. Epigenetic cell memory implies alternative states
that are stable over time and are inherited through cell division.
One proposed mechanism for epigenetic cell memory invokes positive feed-
back loops in nucleosome modification [42]. Nucleosomes are protein com-
plexes that package eukaryotic DNA, with a density of about one nucleosome
per 200 base pairs (bp). The core nucleosome is composed of two molecules
each of four core histone proteins. Nucleosomes may carry various chemi-
cal modifications (e.g. acetylation and methylation) at different amino acid
positions on the different histones, conferring a large potential information
capacity on each nucleosome. Specific additions and removals of these nucle-
osome modifications are carried out by classes of enzymes, including histone
acetyltransferases (HATs), histone methylases (HMTs), histone deacetylases
(HDACs) and histone demethylases (HDMs). At least some of these modifi-
cations affect the activity of nearby genes, in part because the modifications
can alter the binding of regulatory proteins to the DNA.
Positive feedbacks are present in this system because nucleosomes that
carry a particular modification may recruit (directly or indirectly) the en-
zymes that catalyse similar modification of neighbouring nucleosomes. Thus,
a cluster of nucleosomes may be able to maintain itself stably in a particular
modification state. These states can be inherited through DNA replication
because nucleosomes on the parental DNA strand are distributed to both
daughter strands [43], and the enzymes recruited by these parental nucle-
osomes may then establish the parental modification pattern on the newly
deposited nucleosomes.
A specific case in which positive feedbacks in nucleosome modification
result in multiple stable states occurs in the mating-type system of the eu-
karyote S. pombe (fission yeast) [44]. A ∼20 kbp region of S. pombe DNA
containing two mating-type cassettes is normally in a stable “silenced” state,
with the mating-type genes not expressed. In certain mutants where part of
the silenced region is modified, the system is bistable, flipping between states
where the ura4 gene is either expressed (active) or not (silenced). Each state
is stable and heritable, with transitions occurring at roughly equal frequencies
of ≈ 5 × 10⁻⁴ per cell division [44]. Switching appears to be stochastic and is
determined by factors associated with the region itself. In the silenced state,
but not the active state, the region is dominated by nucleosomes that are

Fig. 8. Illustration of basic ingredients of the model: Each oval represents a nucleosome
that can be methylated (M), unmodified (U) or acetylated (A). Enzymatic transitions
(solid arrows) between the three states are in part random (controlled by a noise level
1 − α), and in part autoregulated by recruitment (dotted lines) of enzymes (open
symbols) by nucleosomes in the M or A state [45].

methylated at a particular site. An HMT that can catalyse this modification
and certain HDAC proteins are known to be important for silencing.
One can construct a simple network model [45] (schematically shown in
Fig. 8) of the nucleosome modification system that exhibits all this behaviour,
based on three simplifying assumptions. (1) There are only three relevant
kinds of nucleosomes: unmodified, methylated and acetylated; methylation
and acetylation are mutually exclusive. (2) The nucleosomes are enzymatically
interconverted as shown in Fig. 8, by HMT, HDAC, HDM and HAT enzyme(s).
(3) The HDAC and HMT enzyme(s) are recruited by methylated nucleosomes;
the HDM and HAT enzymes are recruited by acetylated nucleosomes. This is
what makes the feedback positive.
To model S. pombe we take a system consisting of a fixed number of
N = 60 nucleosomes, arranged on a 1-dimensional (1D) string. The region is
isolated from neighbouring DNA by boundary elements [46], which we assume
to be inert. Each nucleosome may be methylated (M), unmodified (U) or
acetylated (A). At each time step one selects a random nucleosome n1 and
attempts one of two changes:
(a) With probability α one attempts a change associated with the enzymatic
activity of an enzyme recruited by another nucleosome in the modeled region.
That is, one selects another random nucleosome n2 and if this is in either an
M or A state, the nucleosome n1 is changed one step toward this state. For
example, when nucleosome n2 is an M: if n1 is an A, then it is changed to U
and if n1 is a U it is changed to M. If nucleosome n1 and n2 are in the same
state, or if n2 is a U, then no changes are made.
(b) With probability 1 − α one attempts a change of the selected nucleosome
n1: A U is changed to an M with probability 1/3, or to an A with probability
1/3, whereas an A or an M is changed to U with probability 1/3.
One may view process (a) as occurring due to the action of enzymes re-
cruited by nucleosomes in the region within the isolating boundaries, whereas
(b) reflects extrinsic noise caused by unrecruited enzymes. Thus, a lower α
value indicates a higher noise level.
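
The update rules (a) and (b) translate directly into a short simulation. The sketch below is an illustrative reading of those rules, not the code used in Ref. [45]. The function names, the random initial state and the seed are assumptions. For simplicity n2 is drawn from all nucleosomes, so it may coincide with n1, in which case nothing changes.

```python
import random

def recruited_change(s1, s2):
    """Process (a): state of n1 after an enzyme recruited by n2 acts on it."""
    if s2 == "U" or s1 == s2:
        return s1                      # no recruitment, or nothing to change
    return s2 if s1 == "U" else "U"    # move n1 one step toward the state of n2

def noisy_change(s1, r):
    """Process (b): noisy conversion of n1; r is uniform in [0, 1)."""
    if s1 == "U":
        if r < 1 / 3:
            return "M"
        if r < 2 / 3:
            return "A"
        return "U"
    return "U" if r < 1 / 3 else s1

def simulate(n=60, alpha=0.64, sweeps=1000, seed=0, state=None):
    """Return the (M, U, A) counts after each sweep of n attempted updates."""
    rng = random.Random(seed)
    if state is None:
        state = [rng.choice("MUA") for _ in range(n)]
    else:
        state = list(state)
        n = len(state)
    history = []
    for _ in range(sweeps):
        for _ in range(n):
            n1 = rng.randrange(n)
            if rng.random() < alpha:
                n2 = rng.randrange(n)
                state[n1] = recruited_change(state[n1], state[n2])
            else:
                state[n1] = noisy_change(state[n1], rng.random())
        history.append((state.count("M"), state.count("U"), state.count("A")))
    return history

print(simulate(alpha=0.64, sweeps=200)[-1])
```

One sweep corresponds to one attempted update per nucleosome, the time unit used in the caption of Fig. 9.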
In Fig. 9 we illustrate the dynamics of the model. One observes a fluc-
tuating number of the three kinds of nucleosomes. In the upper panel α is
small (noise is high) and the system has only one stable state, in which the
nucleosome modifications are distributed randomly along the chain. In the
lower panel, with a higher α, the system exists either in a state dominated
by methylated nucleosomes or a state dominated by acetylated nucleosomes,
with occasional switches between the two states. As α is increased further
(i.e., noise is reduced) the states become more stable, and the switching oc-
curs less often. However, the fact that the epigenetic states in the mutant S.
pombe have a finite stability demonstrates that noise in the form of disordered
methylation-acetylation events plays a crucial role.

Fig. 9. Time development of the standard model [45] for a system consisting of N = 60
nucleosomes with respectively α = 0.40 (upper figure) and α = 0.64 (lower figure). The
light grey curve shows the number of methylated, dark grey the number of acetylated
and black the number of unmodified nucleosomes. Time t is measured in number of
attempted nucleosome updates per nucleosome.

This simplified model of epigenetic inheritance in eukaryotes provides some
unexpected insights. First, it is very important that nucleosomes are modified
by enzymes recruited by non-neighbouring nucleosomes. A “1D” variant of
the model where nucleosomes can recruit enzymes to modify only one of their
neighbours along the string does not produce bistability [45]. The difficulty of
obtaining a clear two-state behavior in 1D arises for reasons similar to those
preventing spontaneous magnetization in the 1D Ising model, or the helix-coil
transition in polymer models [47, 48]. Second, it is also very important that the
transition from, say, an M state to an A state requires two consecutive acety-
lation recruitments by nucleosomes in the A state, and therefore effectively
has a rate ∝ A². Bistability is lost in variants where this two-step process
is replaced by a single step [45]. The non-linearity produced by this kind of
“cooperative” two-step modification appears to be essential for bistability.
Most importantly, however, at low α, where the modification-demodification
events are completely random (and hence there is no feedback), there is only one
state where the nucleosome modifications are distributed completely randomly
along the string. Thus, we can conclude that positive feedback is essential for
bistability.

4 Combining Multiple Feedback Loops


In the previous sections we investigated the basic properties of single negative
and positive feedback loops. In cellular networks, however, there are multiple
entangled feedback loops. This can already be seen in Fig. 5, where some of
the proteins are present in more than one example (LacI in Fig. 5c and g; cI in
Fig. 5e and f). In an effort to understand how feedback loops interact and the
range of dynamical behaviour possible, we begin by examining two interacting
feedback loops.
Such two-loop network motifs are seen in a large class of cellular response
systems designed to regulate the flux and concentration of small molecules.
These systems control, via two feedback loops, the transport and metabolism
pathways. Typically, these two loops are connected by a common transcrip-
tional regulator that senses the concentration of the small molecule. For
instance, in the arabinose utilization system in E. coli, when intracellular
arabinose binds to the regulator AraC it alters its binding to DNA such that
RNA polymerase and the protein CRP can bind and initiate expression of
genes that increase import of extracellular arabinose as well as its metabolic
consumption [49]. This is schematically shown in Fig. 10. Here, the transport
is controlled by a positive feedback loop, while the metabolism is a negative
feedback loop. This is, of course, not the only logical combination of feedback
loops possible.
Figure 11 (left column) shows four logically distinct combinations of entan-
gled transport and metabolism feedback loops. In each case, the two feedback
loops are connected by a transcriptional regulator (R) that senses the concen-
tration of a particular small molecule (s). One loop regulates transcription of

Fig. 10. Schematic illustration of molecular processes in a two-loop motif. This motif
is found in the regulation of uptake and metabolism of, for example, maltose and ara-
binose [50, 49]. σ, s denote, respectively, extracellular and intracellular concentrations
of the small molecule. The molecule binds to the regulator, R, forming the complex
{Rs} which activates production of transport proteins, T, and metabolic enzymes, E.
γ is a parameter controlling the metabolic rate per enzyme [13].

the transport proteins (T) facilitating the influx of the small molecule, while
the other controls transcription of enzymes (E) responsible for the metabolism
of s. The signs show the logic of each feedback loop: positive (+) or negative
(–). Each motif can then be described by a notation of two signs, e.g. (+ –),
which means that the transport loop is positive and the metabolism loop neg-
ative. Thus, there are four logical structures: the socialist (– –), the consumer
(+ –), the fashion (– +) and the collector (+ +) [13]. Each can, in turn,
be implemented in two distinct but logically equivalent ways, depending on
whether s inhibits or activates R. This we denote using the notation (+ – i)
or (+ – a), where the i (respectively, a) indicates inhibition (activation) of R
by s. The i- and a-motifs with the same logic behave very similarly, so here we
will concentrate on only the a-motifs.
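
The notation can be summarised as a small lookup table. This is only an illustrative sketch: the dictionary, the function name and the use of ASCII "+" and "-" for the two signs are assumptions made for the example.

```python
# Keys are (transport loop sign, metabolism loop sign).
MOTIF_NAMES = {
    ("-", "-"): "socialist",
    ("+", "-"): "consumer",
    ("-", "+"): "fashion",
    ("+", "+"): "collector",
}

def describe(transport, metabolism, mode="a"):
    """Return a label such as '(+ - a) consumer'.

    mode is 'a' if s activates the regulator R and 'i' if s inhibits it.
    """
    if mode not in ("a", "i"):
        raise ValueError("mode must be 'a' or 'i'")
    name = MOTIF_NAMES[(transport, metabolism)]
    return f"({transport} {metabolism} {mode}) {name}"

for signs in MOTIF_NAMES:
    print(describe(*signs))
```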
The socialist motif. We call the (– –) motif the socialist because at low levels
of extracellular s (low σ) it increases transport and reduces the metabolism,
while at high levels of extracellular s, it does the opposite. Thus, the two
negative feedback loops help maintain s robustly within a small concentration
range. Such behaviour would be ideal for a system responsible for maintaining
homeostasis. And indeed, a regulatory system with this logic is found in the
iron homeostasis system in bacteria [51]: iron activates the ferric uptake
regulator (Fur), which represses transcription initiation of iron uptake genes,
and enhances production of iron-using proteins. For most organisms iron is
essential for several proteins, but is poisonous at high concentrations. There,

Fig. 11. Behaviour of four entangled feedback loop motifs. Plots show the steady state
values of s (middle column) and influx (σT = γEs + s, right column) as a function
of σ. In all plots, the black curve shows the behaviour for the two-loop motif. The two
other curves show the behaviour when only the transport loop is active (E = 1) and
when only the metabolism loop is active (T = 1) [13].

the (– –) motif maintains the loosely bound iron within a narrow concentration
range, and at the same time allows a high consumption of iron molecules by
certain proteins that bind iron strongly.
The consumer motif. The (+ –) motif we term the consumer, because
any amount of extracellular small molecule results in the increase of both
transport and metabolism. Thus, it is ideal for food molecules. This logic
is in fact typical for sugar transport and metabolism in prokaryotes. The
gal [52] and lac [31, 32] operons in E. coli are the best studied of such
systems. They both use the sugar molecule to inhibit the transcription factor
regulating transport and metabolism, the (+ – i) motif. In contrast, maltose
[50] and arabinose [49] work by activating the regulation of transport and
metabolism, the (+ – a) motif. In natural systems, transport and metabolic
genes can be part of a single operon, as in lac [31], or separate operons, as in gal
[52]. The latter arrangement allows non-coordinated regulation of transport
and metabolism and therefore can be engineered to become bistable. This was
also demonstrated by experiments on modified lactose and arabinose systems
[53, 54], where the accompanying negative feedback loop was eliminated by
inactivating E or using a non-metabolisable analogue of s, in agreement with
our predictions from a similar cutting of the metabolic loop in Fig. 11.
The fashion motif. As the fashion motif (– +) is indeed the opposite of the
consumer motif, both logically and functionally, it is not surprising that we
have not found any simple example of it in the regulation of small molecules in
living cells. However, its behaviour (and the reason we call it the fashion motif)
can be illustrated in terms of a market model for a product which is desirable
in small amounts. In such a scenario, the resource, s, is analogous to a fashion
product, E to the consumers, and T to the producers. R can be considered the
value of the product, measured in terms of how much people desire it. When
there is plenty of the product s in the market, its value R decreases, which
in turn decreases its consumption (a positive metabolism feedback loop) as
well as the desire amongst producers to make more of it (a negative transport
feedback loop), making it a (– +) motif. The non-monotonicity of the flux of
the fashion motif translates in this analogy to a saturation of the market when
a fashion product becomes too abundant: Fashion products are most profitable
when their availability is below a certain threshold. When the fashion motif
is supplemented with a positive feedback of R to itself, the collapse of fashion
goods can occur with a remarkably small change in external supply, which is
reminiscent of fashion “bubbles” in society [55]. Although the fashion motif
does not make much rational sense for small molecule response systems, it
may be seen as a mechanism for coherent behaviour in social organization.
The collector motif. The collector motif (+ +) is the logical opposite of the
(– –) motif. Functionally it allows accumulation of a large amount of s, and is
thus also functionally opposite to the socialist motif. Accumulation could be
important for short periods of time, for instance, when an animal is preparing
for hibernation. However, in such cases the (+ +) motif should eventually be
overridden by another system which starts the consumption of the molecule.
Such double positive feedback loops may be found in transcription regulatory
networks and circuits involved in development and cell differentiation, but we
failed to find any examples of them in small molecule regulation. Turning to
a human analogy, the collector motif can be illustrated by letting s stand for
the weight of a person. This weight increases with the
intake of food (the analogue of transport), and is consumed by exercise (the
analogue of metabolism). In this analogy R represents the internal “state” of
the person, his or her mindset. An increase in a person’s weight, s, increases,
via this internal state, their likelihood to eat more (positive transport feedback
loop) and also decreases their chance to exercise (positive metabolism feedback
loop), thus forming a collector motif. The bistable behaviour of the collector
motif would then contribute to a broadening of the weight distribution in
human populations [60].

4.1 Two-Loop Motifs are More Than the Sum of Their Single Loops

Figure 11 also shows the behaviour of individual loops in these motifs, ob-
tained by keeping either E or T fixed, thereby cutting feedback in one of the
loops. The near constant value of s in (– –) comes from the metabolic loop’s
ability to constrain s for low σ, and the transport loop’s ability to constrain s
at high σ. Thus, the functionality of (– –) is dominated by the sub-motif that
best prevents large variation of s and flux. The (+ –) obtains a steady increase
in s and a step-like increase in flux with σ by using the negative metabolic
loop’s ability to “smooth out” the bistability associated with the positive transport
loop. The (– +) motif exhibits a remarkable non-monotonic behaviour of
flux, which cannot be obtained from any of the sub-motifs. The (+ +) motif
maximizes bistability, by extending it to the extreme of the two bistable re-
gions of its sub-motifs. Overall, we can conclude that whole two-loop motifs
are more than a simple sum of their parts.

4.2 Going Beyond Two Loops

Our analysis of two entangled feedback loops creates a framework for analysing
small molecule regulatory circuits composed of multiple entangled feedback
loops. For instance, the regulation of iron in E. coli, while being dominated
by interactions that form a socialist motif [56, 57], also contains a positive
feedback on the metabolism side involving usage of iron in FeS clusters [58].
An investigation of this three-loop motif suggests that two metabolism loops,
connected like this in “parallel” (as opposed to the “series” connection between
a transport and metabolism loop), are additive in behaviour [13, 59]. Due to
this additiveness, iron regulation in E. coli is able to minimise variation of
both the concentration of iron (a property of the socialist part) as well as the
flux (a property of the fashion part) [56]. This indicates that an interesting
direction to extend these ideas might be to try to formulate “design principles”
for combinations of parallel and serially connected feedback loops.

5 Concluding Remarks

To extract a useful network representation to describe a particular cellular
system, it is necessary to ascertain the sensible level of coarse-graining for that
system — is it the whole-cell network, individual proteins/genes or something
in between? There is, of course, no one answer to this question. In the examples
above we have looked at a wide range of scales, from the entire E. coli network,
to three or four component sub-networks, down to nucleosomes on DNA. On
all these scales the dynamical behaviour is, however, constrained first by the
available communication channels, and second by the logical properties of
feedback loops in the network. To summarise, we extract the following main
“lessons” from our case studies:
• The E. coli protein network is highly modular.
• The real E. coli network is more “stringy” than the randomised version,
and this reduces constraints on signal propagation.
• Most feedback loops go through small molecules; there are very few in the
transcription network.
• Biological function is coupled to the logic (positive/negative) of the feed-
back.
• Entangled feedback loops are “more” than a simple sum of their parts.

Acknowledgments
We thank our collaborators, with whom much of the work described here was
done: J. Axelsen, I. Dodd, M. Micheelsen, S. Pigolotti, S. Semsey, G. Thon
and G. Tiana. We acknowledge support from The Danish National Research
Foundation and the Villum Kann Rasmussen Foundation.

References
1. S. Bornholdt and H.G. Schuster, eds., Handbook of Graphs and Networks: From
the Genome to the Internet, Wiley-VCH, Weinheim (2002).
2. E. Ravasz, A.L. Somera, D.A. Mongru, Z.N. Oltvai and A.-L. Barabasi, Science,
297, 1551–1555 (2002).
3. S. Maslov and K. Sneppen, Science, 296, 910–913 (2002).
4. K. Sneppen, A. Trusina and M. Rosvall, Europhys. Lett., 69, 853 (2005).
5. A. Trusina, S. Maslov, P. Minnhagen and K. Sneppen, Phys. Rev. Lett., 92, 178702
(2004).
6. J. B. Axelsen, S. Bernhardsson and K. Sneppen, BMC Systems Biology, 2, 25
(2008).
7. S.S. Shen-Orr, R. Milo, S. Mangan and U. Alon, Nat. Genetics, 31, 64–68 (2002).
8. A. Samal, S. Singh, V. Giri, S. Krishna, N. Raghuram and S. Jain, BMC Bioin-
formatics, 7, 118 (2006).
9. S. Singh, A. Samal, V. Giri, S. Krishna, N. Raghuram and S. Jain, Eur. Phys. J.
B, 57, 75–80 (2007).
10. A. Hoffmann, A. Levchenko, M.L. Scott and D. Baltimore, Science, 298, 1241–
1245 (2002).
11. S. Krishna, M.H. Jensen and K. Sneppen, Proc. Natl. Acad. Sci. USA, 103, 10840–
10845 (2006).
12. E. Aurell, S. Brown, J. Johansen and K. Sneppen, Phys. Rev. E, 65, 51914 (2002).
13. S. Krishna, S. Semsey and K. Sneppen, Proc. Natl. Acad. Sci. USA, 104, 20815–
20819 (2007).
14. K.B. Arnvig, S. Pedersen and K. Sneppen, Phys. Rev. Lett., 84, 3005 (2000).
15. G. Tiana, M.H. Jensen and K. Sneppen, Eur. Phys. J. B, 29, 135 (2002).
16. M.H. Jensen, G. Tiana and K. Sneppen, FEBS Lett., 541, 176 (2003).
17. P.D. Karp et al., Nucl. Acids Res., 35, 7577–7590 (2007).
18. J.B. Axelsen, S. Krishna and K. Sneppen, J. Stat. Mech., P01018 (2008).
19. L.H. Hartwell, J.J. Hopfield, S. Leibler and A.W. Murray, Nature, 402(6761),
C47–52 (1999).
20. S. Maslov, K. Sneppen and I. Ispolatov, New J. Phys., 9, 273 (2007).
21. S. Maslov and I. Ispolatov, Proc. Natl. Acad. Sci. USA, 104, 13655–13660 (2007).
22. S. Krishna, A.M.C. Andersson, S. Semsey and K. Sneppen, Nucl. Acids Res.,
34, 2455 (2006).
23. R. Thomas, Quantum noise, Springer Series in Synergetics 9, Ed. Gardiner,
Springer, Berlin, pp. 180–193 (1981).
24. E.H. Snoussi, J. Biol. Syst., 6, 3–9 (1998).
25. J.L. Gouzé, J. Biol. Syst., 6, 11–15 (1998).
26. J.E. Ferrell Jr., Curr. Opin. Cell Biol., 14, 140–148 (2002).
27. D. Angeli, J.E. Ferrell and E.D. Sontag, Proc. Natl. Acad. Sci. USA, 101, 1822–
1827 (2004).
28. F.J. Isaacs, J. Hasty, C.R. Cantor and J.J. Collins, Proc. Natl. Acad. Sci. USA,
100, 7714–7719 (2003).
29. H. Hirata, S. Yoshiura, T. Ohtsuka, Y. Bessho, T. Harada, K. Yoshikawa and
R. Kageyama, Science, 298, 840–843 (2002).
30. S.L. Harris and A.J. Levine, Oncogene, 24, 2899–2908 (2005).
31. F. Jacob and J. Monod, J. Mol. Biol., 3, 318–356 (1961).
32. P. Wong, S. Gladney and J.D. Keasling, Biotechnol. Prog., 13, 132–143 (1997).
33. H.L. Pahl, Oncogene, 18, 6853–6866 (1999).
34. M. Ptashne, A Genetic Switch: Phage Lambda Revisited, Cold Spring Harbor
Laboratory Press, Cold Spring Harbor (2004).
35. S. Pigolotti, S. Krishna and M.H. Jensen, Proc. Natl. Acad. Sci. USA, 104, 6533–
6537 (2007).
36. M. Schnarr et al., Biochimie, 73, 423–431 (1991).
37. M.B. Elowitz and S. Leibler, Nature, 403, 335–338 (2000).
38. D.E. Nelson, A.E.C. Ihekwaba, M. Elliott, J.R. Johnson, C.A. Gibney,
B.E. Foreman, G. Nelson, V. See, C.A. Horton, D.G. Spiller et al., Science, 306,
704–708 (2004).
39. G. Tiana, S. Krishna, S. Pigolotti, M. H. Jensen and K. Sneppen, Phys. Biol., 4,
R1 (2007).
40. C.Y. Huang and J.E. Ferrell Jr., Proc. Natl. Acad. Sci. USA, 93, 10078–10083
(1996).
41. A. Goldbeter and D.E. Koshland, Proc. Natl. Acad. Sci. USA, 78, 6840–6844
(1981).
42. G. Felsenfeld and M. Groudine, Nature, 421, 448 (2003).
43. A.T. Annunziato, J. Biol. Chem., 280, 12065 (2005).
44. G. Thon and T. Friis, Genetics, 145, 685 (1997).
45. I.B. Dodd, M.A. Micheelsen, K. Sneppen and G. Thon, Cell, 129, 813–822 (2007).
46. G. Thon, P. Bjerling, C.M. Brunner and J. Verhein-Hansen, Genetics, 161, 611
(2002).
47. B.H. Zimm, Proc. Natl. Acad. Sci. USA, 45, 1601 (1959).
48. H.A. Scheraga, Pure and Applied Chemistry, 36, 1 (1972).
49. R. Schleif, Trends Genet., 16, 559–565 (2000).
50. E. Richet and O. Raibaud, EMBO J., 8, 981–987 (1989).
51. E. Massé and M. Arguin, Trends Biochem. Sci., 30, 462–468 (2005).
52. M.J. Weickert and S. Adhya, Mol. Microbiol., 10, 245–251 (1993).
53. E.M. Ozbudak, M. Thattai, H.N. Lim, B.I. Shraiman and A. van Oudenaarden,
Nature, 427, 737–740 (2004).
54. W.P. Smits, O.P. Kuipers and J.W. Veening, Nat. Rev. Microbiol., 4, 259–271
(2006).
55. R. Donangelo and K. Sneppen, Physica A, 316, 581–591 (2002).
56. S. Semsey, A.M.C. Andersson, S. Krishna, M.H. Jensen, E. Massé and K. Sneppen,
Nucl. Acids Res., 34, 4960–4967 (2006).
57. N. Mitarai, A.M.C. Andersson, S. Krishna, S. Semsey and K. Sneppen, Phys. Biol.,
4, 164–171 (2007).
58. F.W. Outten, O. Djaman and G. Storz, Mol. Microbiol., 52, 861–872 (2004).
59. M. Werner, S. Semsey, K. Sneppen and S. Krishna, preprint (2008).
60. U.S. EPA Exposure Factors Handbook, 1997, http://www.epa.gov/ncea/efh/

Topographic Spreading Analysis of an Empirical Sex Workers’ Network

Johannes Bjelland¹, Geoffrey Canright¹, Kenth Engø-Monsen¹ and Valencia P. Remple²

¹ Telenor R&I, 1331 Fornebu, Norway
johannes.bjelland@telenor.com, geoffrey.canright@telenor.com,
kenth.engø-monsen@telenor.com
² BC Centre for Disease Control Epidemiology, University of British Columbia,
Vancouver, BC, Canada; Valencia.Remple@bccdc.ca

1 Introduction

The problem of epidemic spreading over networks has received considerable
attention in recent years, due both to its intrinsic intellectual challenge and to
its practical importance. A good recent summary of such work may be found
in Newman [8], while [9] gives an outstanding example of a non-trivial predic-
tion which is obtained from explicitly modeling the network in the epidemic
spreading. In the language of mathematicians and computer scientists, a net-
work of nodes connected by edges is called a graph. Most work on epidemic
spreading over networks focuses on whole-graph properties, such as the per-
centage of infected nodes at long time. Two of us have, in contrast, focused on
understanding the spread of an infection over time and space (the network)
[1, 3, 2]. This work involves decomposing any given network into subgraphs
called regions [1]. Regions are precisely defined as disjoint subgraphs which
may be viewed as coarse-grained units of infection—in that, once one node in
a region is infected, the progress of the infection over the remainder of the re-
gion is relatively fast and predictable [3]. We note that this approach is based
on the ‘Susceptible-Infected’ (SI) model of infection, in which nodes, once
infected, are never cured. This model is reasonable for some infections, such
as HIV—which is one of the diseases studied here. We also study gonorrhea
and chlamydia, for which a more appropriate model is Susceptible-Infected-
Susceptible (SIS) [7] (since nodes can be cured); we discuss the limitations of
our approach for these cases below.
In this paper we apply the “topographic” regions-analysis approach to an
empirical sex network, built from interviews with female sex workers (FSWs)
in Vancouver, Canada. (See [3] for a detailed discussion of the “topographic”
approach.) The network consists of the FSWs themselves, plus their sex
partners (paid and unpaid), as well as any partners of these partners who
were known to the FSW. This method, beginning with 49 interviewed FSWs,
gave a highly connected network of 553 nodes [10]. Furthermore, STI (sexually
transmitted infection) status was obtained for many of these nodes. In
particular, two of the nodes were identified as being HIV-positive, while 11
other nodes have either gonorrhea, chlamydia, or both.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_6,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

From the collected network data we build an adjacency matrix, where
element a_ij = 1 if node i has a link to node j, and is zero otherwise. (In the case of a
weighted graph, element a_ij equals the strength of the link from node i to
node j.) The principal eigenvector of the adjacency matrix gives a measure of each
node’s centrality in the graph; this measure is called the eigenvector centrality, or EVC.
The EVC scores for the nodes in the (weighted or unweighted) network give
the starting point for our approach: they are used for assigning the nodes
to regions, and for predicting the spreading of disease within and between
regions.
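As an illustration of how EVC scores can be obtained, the following is a minimal sketch using power iteration on a small toy graph. It is not the authors' code: the function name, the shift by the identity matrix, the convergence tolerance, and the toy path graph are all assumptions made for this example, and the matrix is assumed to be symmetric and to describe a connected graph.

```python
def eigenvector_centrality(adj, max_iter=10000, tol=1e-12):
    """Principal eigenvector of a symmetric, non-negative adjacency matrix.

    Iterates with (A + I) instead of A: the eigenvectors are unchanged, but the
    shift avoids the oscillation that plain power iteration shows on bipartite
    graphs. Scores are normalised to sum to 1.
    """
    n = len(adj)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        # One multiplication by (A + I): own score plus weighted neighbour scores.
        nxt = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        nxt = [v / total for v in nxt]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Toy path graph 0 - 1 - 2 (hypothetical, not the FSW data).
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
evc = eigenvector_centrality(path)  # the middle node gets the highest score
```

The same function accepts a weighted matrix, since link strengths simply replace the 0/1 entries.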
The aims of this work are several. One goal is to extend our earlier topo-
graphic approach to a graph with weighted links. As we will see, this seem-
ingly small change can have very large effects; but we will also see that the
validity of our approach is confirmed, in spite of these large effects. This is
because the modified approach (presented here for the first time) is consistent:
we use the link weights to modify the graph’s adjacency matrix, and hence
the nodes’ EVC values; and we use them again when we define the regions
via the steepest-ascent graph (SAG).
A second aim of this work is to try to exploit the insights gained from the
topographic analysis, in order to find novel suggestions for preventive actions
to hinder the spread of the disease in question. We find that our progress to-
wards this second goal is considerably more modest than that towards the first
goal. We will show “thought experiments,” based on the empirical graph topol-
ogy and link strengths, for which our analysis is extremely useful. However,
we will not find practical suggestions which are immediately promising for the
given Vancouver FSW graph. There are several reasons for this. First, the HIV
graph is so thoroughly protected by condom use that we find little to add in
terms of ideas for preventive measures. Second, the graphs for gonorrhea and
for chlamydia are so thoroughly well connected, and also so well infected, that
we do not find small topological changes which can make a large difference.
We note that our approach treats the network as static; hence any effects
of network dynamics are not taken into account. We believe, however, that
our qualitative results are fairly robust to the likely dynamics of this network,
since its overall structure is thought to be fairly stable over time. Also, our
analysis (once the network is mapped out—which can be time consuming!) is
not computationally demanding, and so may be performed in essentially zero
time compared to the time scale of epidemic spreading. Hence any suggestions
resulting from the analysis may be implemented in something approaching
real time.

2 Uniform Transmission Model


First we study the FSW graph without taking into account the link weights.
That is, each sexual contact is given strength “1” in the adjacency matrix. This
is logically equivalent to giving each link the same probability of transmission
per unit time.
Our purpose in doing this analysis is to be able to compare with the
analysis done using non-uniform link strengths (transmission probabilities).
As we will see, the differences are large and important.

2.1 Visualization and Bipartiteness

Our topographic analysis includes a novel approach to graph visualization:
we group the nodes into their respective regions, and lay out the whole graph
according to the SAG [4]. We present the basic ideas here, and refer the reader
to earlier papers [1, 3, 4] for details. We view the EVC of a node as a measure
of its “well-connectedness” and hence of its “spreading power.” Then we single
out local maxima of the EVC as being particularly important in spreading;
we call these nodes Centers. Also, since EVC (being recursively defined) is
“smooth,” we can speak of “neighborhoods” in the graph as having a typical
EVC; and we conclude that spreading is fast in neighborhoods of high EVC,
and slow in “lower” neighborhoods. We then define regions of the graph—
one for each Center. Each node finds its region (mountain) by following a
steepest-ascent path until it terminates at a local maximum (mountaintop, or
Center). The set of steepest-ascent paths then forms a directed hierarchical
tree graph (the SAG), which is useful both for visualizing the graph and for
predicting the likely paths of fastest epidemic spreading. In a tree graph, any
two nodes are connected by exactly one path, and there are no cycles (closed
loops of links).
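The region construction described above can be sketched as follows for an unweighted graph, where steepest ascent reduces to stepping to the highest-EVC neighbor. This is an illustrative sketch only: the function name, the dictionary-based graph representation, and the EVC values in the toy example are hypothetical, and ties in EVC (plateaus) are simply treated as local maxima.

```python
def assign_regions(neighbors, evc):
    """Return (parent, region): the SAG links and each node's Center."""

    def uphill(node):
        # Highest-EVC neighbour; the node is its own parent if none is higher.
        best = max(neighbors[node], key=lambda m: evc[m], default=node)
        return best if evc[best] > evc[node] else node

    parent = {node: uphill(node) for node in neighbors}  # directed tree links
    region = {}
    for node in neighbors:
        top = node
        while parent[top] != top:  # climb until a local maximum (Center)
            top = parent[top]
        region[node] = top
    return parent, region


# Toy chain a - b - c - d - e with two "mountaintops" (values are made up).
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}
evc = {"a": 0.1, "b": 0.5, "c": 0.2, "d": 0.6, "e": 0.3}
parent, region = assign_regions(neighbors, evc)
```

Because every step strictly increases the EVC, the climb cannot revisit a node, which is why the resulting links form a tree within each region.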
The SAG for the unweighted FSW graph is shown in Fig. 1. We note several
interesting points from this visualization: (i) There are many regions (17). (ii)
All the Centers (most central nodes in each region) are men. (iii) Many regions
are small, i.e., 1–3 nodes, while (iv) the bulk of the nodes (517/553) lie in one
of the three largest regions (red—marked R in the figure, blue (B), dark grey
(G)). (v) Every region is well connected to the largest, red, region. Hence the
red region is expected to play a dominant role in any epidemic spreading. (vi)
One HIV-positive node is in the red region, and the other is (while in its own
region) well connected to the central part of the red region.
Now we comment on these points. We believe that points (i)–(iii) derive
from the fact that the graph is nearly bipartite. A bipartite graph consists of
two sets of nodes, such that all links are made between the two sets, and there
are no links between nodes in the same set. Now we suppose (which is almost
true) that the FSW graph is a strictly bipartite graph composed of M and
F nodes. If we further assume that an M node is a Center (local maximum
of centrality), then all of its neighbors are (a) female, (b) highly central,
and (c) automatically excluded from being a Center. Thus bipartiteness will
tend to favor one gender over another. By the same token, highly central M
nodes are never neighbors of other M Centers, and so are candidate Centers
themselves. Hence there may be a tendency for more, and smaller, regions.

Fig. 1. Regions visualization of the FSW network, with all links set to equal strength.
Only the links in the SAG are shown here for visual clarity. The most central node in
each region is enlarged. The three largest regions, which will be discussed further in
the text, are labeled R (red), G (grey), and B (blue). Node symbols distinguish male
and female nodes.

Points (iv)–(vi) tell us that this network is highly prone to infection: the
many regions are not well isolated from one another, because of their common
connection to the dense, infectious red region. Also, the two start nodes are
in or near the central part of the red region, where spreading is fast.

2.2 Infectious Spreading on the Unweighted Graph

We have simulated spreading on the uniform FSW network, by giving each
link the same probability per unit time for spreading. The value used is thus
arbitrary, as is the unit of time. We typically use a value of a few percent, since
much larger values give a very unsmooth time evolution (equivalent to a poor
time resolution). We report the results here because they are illustrative of
the strengths and weaknesses of our method, for the case of multiple regions.
(For reasons given below, these are the only multi-region simulations that we
can perform with this graph.)

Fig. 2. Same visualization as Fig. 1, except that all links are shown. The arrows mark
the known HIV-positive nodes.

Taking the start (infected at t = 0) nodes as shown in Fig. 2 above, we
find, as expected, that the regions as we define them here are again valid
coarse units of infection. We also find that it is difficult to stop or even retard
the infection, because of the topology of the graph. The upper part of Fig. 3
shows a typical epidemic progression, with the growth in the red, blue, and
grey regions resolved. All three “take off” at about the same time, and the
infection spreads rapidly. Measures to retard spreading in the red region—
without resorting to large topological change—are not found to be effective.
We find however that protecting one node—the Center of the grey region—
drastically weakens the red/grey connection. We see in the bottom part of
Fig. 3 the results when this is done: the red and blue regions take off as
before, but the grey region’s takeoff is greatly retarded. This is an example of
the kind of benefit that we believe can be obtained from our analysis.
We also considered the more promising problem of an infection starting in
the grey region—again motivated by the observed red ⇐⇒ grey bottleneck in
the topology. The top of Fig. 4 shows that takeoff is retarded by a factor of
about 3, compared to the former case (top of Fig. 3). It is retarded even further
(about 7 times as slow) if we in addition protect the grey Center (bottom of
Fig. 4).

Fig. 3. HIV spreading simulation without (top) and with (bottom) measures to isolate
the grey region from the red region. In each plot, there are four growth curves, showing
the total growth of the infection (‘Sum’), and the growth for the red (R), grey (G),
and blue (B) regions (the largest regions in the network).

Fig. 4. Same simulation as Fig. 3, except that the infection starts from a peripheral
node in the grey region.

3 Links Weighted with Transmission Probabilities


In this section we add an important further element of realism by weighting
the links of our FSW graph with transmission probabilities. We are forced
in many cases to use rather crude approximations. Nevertheless, we feel that
the resulting model is considerably closer to reality than the uniform model.
Also (as we will see) it is strikingly different—in particular, each disease will
have its own graph. That is, while the basic topology is the same as that in
Fig. 2, the set of link weights depends on the disease—because these weights
represent transmission rates (probability/time). In fact, for the HIV case, the
topology itself is changed, since we set some link strengths to exactly zero.
In practice, incorporating the link strengths into the analysis involves (1)
building a weighted adjacency matrix W using the link strengths, (2) finding
the corrected EVC as the dominant eigenvector of this matrix W , and (3)
redefining “steepest ascent” to take account of the varying link strengths.
The first two steps are clear; and we describe step (3) in Section 3.2.
Of course, before doing any of this, we must find the link strengths. We
describe our procedure for doing so in the next section.

3.1 Estimating the Probabilities

For each link we want a single weight (number) which gives the probability
per unit time of transmission from an infected node to an uninfected node.
This probability is based on a number of factors which must be estimated
from limited data. We list these factors schematically as follows:

Transmission probability/unit time =
    [(unprotected probability/contact) × (non-condom use prevalence)
        × (contacts/time)]
  + [(protected probability/contact) × (condom use prevalence)
        × (contacts/time)]

Now we discuss each factor in turn. For each disease (HIV, gonorrhea or
‘NG’, and chlamydia or ‘CT’) we estimate (unprotected probability/contact)
from Ref. [6]. See Table 1. To correct for condom use, we must know the
frequency of condom use for each link (condom use prevalence). For 256 links
(about 17% of them) we have an estimate for (condom use prevalence) from
survey data [10]. We know very little about the remaining links, except for
whether they are a “client” relationship or a “non-client” relationship. We
explain below how we generate link weights for the links for which we have
no survey data.

Table 1. Transmission probabilities/contact for NG (gonorrhea), CT (chlamydia),
and HIV.

              NG      CT      HIV
Unprotected   0.43    0.10    0.05
Protected     0.16    0.074   0

Estimates for (contacts/time) were available (again) for those links for
which we obtained survey information; however, here we have yet another
source of uncertainty. That is, each interviewed FSW reported contacts with
“regulars” and also contacts with new or “non-regular” customers. We take
the reported estimates of (contacts/time) for regulars as given. For the non-
regulars, we assume that either (i) they will become regular in the future, or
(ii) they will be replaced by other non-regular customers who play essentially
the same role in the network. In short: we ignore the distinction between cases
(i) and (ii).
We still need a reasonable estimate of contacts/time for non-regulars. We
proceed as follows: for each FSW, we define T to be the total number of
contacts per unit time (summed over all neighbors). Also we let P be the
fraction of contacts that come from regulars, and let C be the number of contacts/time
from regulars. Then clearly C = PT; and since we can estimate both P and
C from the survey data, we get an estimate of T = C/P. We then estimate
the total contacts/time N for non-regulars to be N = T − C. Finally, we take,
from the survey data, the expected number of non-regular neighbors (still
for each FSW), and call this number K. We then (finally) get the expected
contacts/time for each non-regular as N/K. Our model is clearly very crude,
treating each non-regular in a very average way; but it enables us to move (as
we will see) well beyond the equal-transmission-probability model, and so, we
believe, much closer to reality.
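The arithmetic of this estimate can be written out as a short sketch. The function name and the example numbers are hypothetical; P is taken as a fraction between 0 and 1 so that C = PT holds as stated.

```python
def nonregular_contact_rate(C, P, K):
    """Expected contacts per unit time for each non-regular partner.

    C: contacts/time with regulars
    P: fraction of all contacts that come from regulars (0 < P <= 1)
    K: expected number of non-regular partners (K > 0)
    """
    if not 0 < P <= 1:
        raise ValueError("P must be a fraction in (0, 1]")
    if K <= 0:
        raise ValueError("K must be positive")
    T = C / P      # total contacts/time, from C = P * T
    N = T - C      # contacts/time attributable to non-regulars
    return N / K   # shared equally among the K non-regulars


# Made-up example: 3 regular contacts/time making up 75% of all contacts,
# and 4 expected non-regular partners.
rate = nonregular_contact_rate(C=3, P=0.75, K=4)
```

With these illustrative inputs, T = 4 and N = 1, so each non-regular is assigned 0.25 contacts per unit time.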
Now we come to the term due to protected sex. We estimate (protected
probability/contact) by correcting the (unprotected probability/contact)
data, using data for the correction due to condom use from [5]. We note here
that we set (protected probability/contact) for HIV to be exactly zero. Not
surprisingly, this will have dramatic effects on the spreading behavior—as we
will see in Section 3.3.
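Putting the two terms together, a link weight can be computed as in the sketch below. The per-contact probabilities are those of Table 1; the function name and arguments are hypothetical, and the sketch assumes that the non-condom use prevalence equals one minus the condom use prevalence.

```python
# Per-contact transmission probabilities taken from Table 1.
PER_CONTACT = {
    "NG":  {"unprotected": 0.43, "protected": 0.16},
    "CT":  {"unprotected": 0.10, "protected": 0.074},
    "HIV": {"unprotected": 0.05, "protected": 0.0},
}


def link_weight(disease, condom_use, contacts_per_time):
    """Transmission probability per unit time for one link.

    condom_use: fraction of contacts on this link that use a condom (0..1).
    Assumes non-condom use prevalence = 1 - condom_use.
    """
    if not 0.0 <= condom_use <= 1.0:
        raise ValueError("condom_use must lie in [0, 1]")
    p = PER_CONTACT[disease]
    per_contact = (p["unprotected"] * (1.0 - condom_use)
                   + p["protected"] * condom_use)
    return per_contact * contacts_per_time


# Illustrative inputs only: 50% condom use, 0.1 contacts per day.
w_ng = link_weight("NG", condom_use=0.5, contacts_per_time=0.1)
# 100% condom use gives an HIV link weight of exactly zero.
w_hiv = link_weight("HIV", condom_use=1.0, contacts_per_time=2.0)
```

The zero result for HIV under full condom use is what removes links from the HIV graph altogether.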
This completes our prescription for estimating link weights for those links
for which we have survey data. We then used a very simple approach—which
we find appropriate to the high degree of uncertainty in our data—to esti-
mate the remaining link weights (transmission probabilities/time). Our so-
lution here is to first divide all links (surveyed and not surveyed) into two
groups: client and non-client. Then, for each group, we simply reproduced the
distribution over the “surveyed” links so as to also assign transmission prob-
abilities to all of the “non-surveyed” links. Since the survey data is discrete,
the link-weight distribution obtained is never smooth. Hence we reproduced
these discrete distributions by simply repeating (sampling) each value in the
discrete distribution with a probability equal to its frequency in the distribu-
tion. That is: we do not attempt to create distributions for each parameter in
the link-weight estimate; instead we simply copy the discrete link weight val-
ues obtained from the survey data onto the unknown links, with appropriate
probabilities.
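One way to read this copying procedure is as sampling, with replacement, from the list of surveyed weights in the same group (client or non-client). The sketch below follows that reading; the function name, data layout, seed, and all weights are hypothetical.

```python
import random


def fill_unsurveyed(surveyed, unsurveyed, seed=0):
    """Assign weights to unsurveyed links by resampling surveyed ones.

    surveyed:   {link: (group, weight)} with group "client" or "non-client"
    unsurveyed: {link: group}
    Drawing uniformly from the raw list of surveyed weights reproduces each
    discrete value with probability equal to its observed frequency.
    """
    rng = random.Random(seed)
    pools = {}
    for group, weight in surveyed.values():
        pools.setdefault(group, []).append(weight)
    return {link: rng.choice(pools[group]) for link, group in unsurveyed.items()}


# Made-up weights for illustration.
surveyed = {
    ("f1", "m1"): ("client", 0.002),
    ("f1", "m2"): ("client", 0.002),
    ("f2", "m3"): ("client", 0.010),
    ("f1", "m4"): ("non-client", 0.030),
}
unsurveyed = {("f2", "m5"): "client", ("f3", "m6"): "non-client"}
filled = fill_unsurveyed(surveyed, unsurveyed)
```

No smoothing or parametric fit is attempted here, in keeping with the simple approach described in the text.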

3.2 SAG∗

Now we address another complication arising from the use of weighted links:
we must reconsider the definition of the steepest-ascent graph (SAG), which is
used both for assigning region membership and for visualization purposes. Our
point here is simple, namely that the definition of steepest ascent should take
account of the link strength. This rather obvious point has not been addressed
in our earlier use of the SAG [1, 2, 3], because these earlier studies were applied
to unweighted graphs. Hence we offer a brief account here of the modification
used for weighted links.
We recall that region membership is assigned by in essence asking each
node to find the steepest path to the “top,” i.e., to the “nearest” local maximum
of the EVC. The notion of local maximum is independent of link strength.
Suppose, however, that a node N has two local maxima (Centers, C1 and C2 )
as neighbors: which region do we place N in? Since we want steepest-ascent
paths to represent most likely spreading, it seems reasonable that a neighbor
C1 with a very weak link to N should not be assigned the steepest-ascent
path—even if it is somewhat higher (in EVC) than C2 . In other words, if we
retain the notion that steepest ascent gives the right answer, then we clearly
want to define the slope as being

slope = Δy/Δx, (1)

with Δx (‘distance’) decreasing with increasing link strength.
Clearly, Δy is the EVC difference, as in earlier (unweighted) work; hence
we simply need some reasonable definition for the “distance” Δx. We take
here the simple heuristic Δx(i, j) = 1/W(i, j), with W(i, j) the link strength
(transmission probability) between nodes i and j. Our point here is then that
node N may find that it is not simply in the region of its highest neighbor:
instead, it will be placed in the same region as the neighbor N∗ with the
highest slope Δy/Δx = [EVC(N∗) − EVC(N)] W(N, N∗). In short, if its
link to the highest neighbor is very weak, then (reasonably) it will be placed
instead in the region of a neighbor with a stronger link. We believe this is
consistent with our aim for defining regions—namely, that a region is a coarse-
grained unit of infection, such that infection within a region is relatively fast
and predictable.
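The weighted steepest-ascent rule can be illustrated with the two-Center situation discussed above. The sketch below is an assumption-laden example: the function name, the nested-dictionary layout for W, and all numerical values are hypothetical.

```python
def steepest_neighbor(node, neighbors, evc, W):
    """Neighbour with the largest slope [EVC(m) - EVC(node)] * W(node, m).

    Returns the node itself when no neighbour gives a positive slope, i.e.
    when the node is a local maximum (or is only reachable by zero-weight links).
    """
    best, best_slope = node, 0.0
    for m in neighbors[node]:
        slope = (evc[m] - evc[node]) * W[node][m]
        if slope > best_slope:
            best, best_slope = m, slope
    return best


# Node N sits next to two Centers; C1 is higher but its link is very weak.
neighbors = {"N": ["C1", "C2"], "C1": ["N"], "C2": ["N"]}
evc = {"N": 0.5, "C1": 0.9, "C2": 0.7}
W = {"N": {"C1": 0.001, "C2": 0.02},
     "C1": {"N": 0.001},
     "C2": {"N": 0.02}}
choice = steepest_neighbor("N", neighbors, evc, W)
```

Here the slope towards C1 is 0.4 × 0.001 and the slope towards C2 is 0.2 × 0.02, so N joins the region of C2 even though C1 has the higher EVC.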
We call the resulting steepest-ascent graph SAG∗ (to distinguish it from the
SAG, which does not take link strengths into account). We will see below that
our spreading simulations can only give a limited test of our SAG∗ definition—
since in one case (HIV) the weighted network breaks down, while in the other
two (NG and CT) we only obtain a single region. Hence—while we retain
a belief that our definition is promising—a thorough test will have to await
application to a weighted graph which (i) has several regions, but yet (ii) is
better connected than our HIV graph of the following section.

3.3 HIV Graph

The SAG∗ for our weighted HIV graph is shown in Fig. 5. We see immediately
that the contrast with Fig. 1 is enormous.
In particular, the 17 regions of Fig. 1 have multiplied many times. In
addition (which is not so easily seen in the figure) some nodes are completely
disconnected due to the zero-weight links, and hence do not appear in the
figure at all. The apparently isolated nodes in the corner of the figure are
one-node regions; such regions occur typically on the periphery of a graph,
where all EVC values are small.
What is even more striking is that adding all non-zero links to the SAG∗
picture of Fig. 5 makes very little change; that is, there are only six non-zero
links which are not shown in the figure (four connecting the one-node regions
to one other node each, and two other inter-region links). Hence we do not
show the full graph: it is essentially that of Fig. 5. This means in turn that
HIV spreading, while seemingly unstoppable in the picture obtained from
Fig. 2, is in fact not a problem for this FSW network. In particular, the
two HIV-positive (male) nodes (marked with large squares in Fig. 5) are each
confined to an effective two-node network, consisting of themselves and their
non-client partner. Hence our expected picture of condom use for this empirical
network implies that HIV spreading will be limited to the non-client partner
relationships of the two infected nodes, and so has effectively zero probability
of reaching the rest of this dense sexual network.

Fig. 5. Regions analysis for the HIV graph, corrected with the transmission probability
on each link. Note that the graph breaks into very many small regions, due to the
(assumed) zero transmission probability for reliable condom use. The two enlarged
nodes are known to be HIV-infected; the four nodes in the upper left corner are
single-node regions in the weighted graph.

Because the effective graph is so fragmented, and also because the HIV-
infected nodes are effectively isolated, we have not performed spreading sim-
ulations on the weighted HIV graph. We note that the largest region in Fig. 5
has 24 nodes, with a FSW as the most central node in the region. In fact the
strongly bipartite picture obtained from the unweighted graph (Fig. 1) has
also broken down here: both male and female Centers of the many regions are
found. This is however not so surprising, given the fragmented nature of the
effective graph.

3.4 Gonorrhea

Figure 6 shows the steepest-ascent (SAG∗) graph when we use link strengths
appropriate to gonorrhea. Since 100% condom use does not give 100% protection
[5], the effective gonorrhea graph has all the same links as were present
in Fig. 2; but they are reweighted. We see that the reweighting has still had a
dramatic effect. In particular, the 17 regions found for the unweighted graph
are now a single region for the weighted graph. Also, the Center of this one
region (and so of the entire graph) is an FSW.

Fig. 6. Region (SAG∗) visualization for the gonorrhea network NG. The enlarged
nodes are known to be STI-infected.

An interesting aspect of the gonorrhea SAG∗ is that one of the few existing
homosexual (FSW ⇐⇒ FSW) links plays a very central role in the graph: the
link between the Center and the head of the large red subregion is homosexual.
This means that the two women involved are highly central in the weighted
graph, and also that the link strength between them (transmission probability
for gonorrhea) is not too small. One might then propose to remove this link—
which (as it is certainly requested and paid for by a male customer) should
be possible. However as we will see below, removal of this link—or any single
link—has little or no beneficial effect. (This conclusion is perhaps intuitively
grasped from the fully linked visualization of Fig. 7 below.)
SAGs of either type are strict hierarchical structures—that is, they are
directed trees, with links pointing strictly towards the root (Center). This
means that, for any given region, one can readily define subregions in terms
of branches of the tree. We have picked out the five largest branches of the
gonorrhea SAG∗ and color coded them. We see that it is visually meaningful
to think in terms of subregions for this region.
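Since each region is a directed tree, a subregion can be identified with the branch hanging from one child of the Center. A possible sketch, assuming the tree is stored as a hypothetical `parent` mapping in which the Center points to itself:

```python
from collections import Counter


def subregions(parent):
    """Map each non-Center node to the head of its branch (a child of the Center)."""
    branch = {}
    for node in parent:
        if parent[node] == node:
            continue  # the Center itself belongs to no branch
        head = node
        # Climb until the next step up would be the Center.
        while parent[parent[head]] != parent[head]:
            head = parent[head]
        branch[node] = head
    return branch


# Toy tree (made up): Center "c" with two branches headed by "h1" and "h2".
parent = {"c": "c", "h1": "c", "h2": "c", "x": "h1", "y": "x", "z": "h2"}
branch = subregions(parent)
sizes = Counter(branch.values())  # branch sizes, e.g. to pick the largest ones
```

Sorting `sizes` then gives the largest branches, which is one way the five color-coded subregions could be selected.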

Fig. 7. Same layout as in Fig. 6, but with all non-zero links displayed.

Figure 7 shows the NG-graph again, but with all links displayed. We note
that presently infected nodes are enlarged and marked yellow (lighter grey
in printed version) in Fig. 6 and in Fig. 7. From Fig. 6 we see two infected
nodes lying at the heads of their (large) respective subregions, and hence only
one hop from the Center. Also we see that every major subregion is already
infected. This immediately suggests that preventing the further spreading of
gonorrhea on this graph will be quite difficult.
This pessimistic prognosis is also supported by the visualization of Fig. 7.
Here we see that all the major subregions are well connected to one another,
with infected nodes lying in the heart of a dense cloud of links. We will
test (and confirm) this pessimistic prediction via stochastic simulations—see
Section 4.

3.5 Chlamydia
In Fig. 8 we show the SAG∗ visualization of the chlamydia graph. Qualitatively
we see much the same picture as for the NG graph: a single region, with an
FSW at the Center of the region. In fact, the homosexual dyad that we found
lying centrally in the NG graph is also central here—with the one difference
that here the two FSWs have exchanged roles (Center and subregion head).
Our SAG∗ visualizations suggest that the CT graph is perhaps even better
connected than the NG graph, in that there are very few subregions,
and they are very large. And since (again) every major subregion is infected,
we arrive at the same qualitative prognosis for this graph: it will be difficult
to hinder the further spreading of the disease.

Fig. 8. Region (SAG∗) visualization for the chlamydia network CT. Enlarged nodes
are known STI-infected nodes.

We have also plotted the analog of Fig. 7 for chlamydia—that is, the full
graph with all non-zero links. The result is again qualitatively like that of
Fig. 7; hence we do not show it here.

4 Spreading on the Gonorrhea Graph


For reasons already given, we have not run spreading simulations on all
three disease graphs. The HIV graph is so heavily disconnected by the many
condom-use-induced zero links that we see no point in running simulations
on it. Of course, these links, involving as they do real sexual contact, do not
have exactly zero probability for infectious spreading, even with 100% condom
use. Also the reported rates of 100% condom use are most likely overstated
in many cases. Hence it would be of interest to set the strength of these “zero
HIV links” to some small but positive value, and to examine the resulting
graph. We reserve this idea for future work.
The remaining two graphs (NG and CT) are qualitatively very similar.
Hence we have chosen to focus on one of them—the NG (gonorrhea) graph.
We must emphasize immediately however that our simulations, being based on
SI dynamics [8], do not accurately model the long-time dynamics of diseases
such as gonorrhea and chlamydia. A more appropriate model would be the SIS
model [7] in which Infected nodes become again Susceptible after a variable
time period.
We expect the SI model to give qualitatively correct results in the early
stage of any infectious process—when few nodes are infected, and they have
not had time to recover. Beyond this early stage the SI model can only over-
estimate the degree of spreading. Hence we present simulation results in this
section, based on the SI model, with two principal caveats:
• Takeoff of the disease will likely occur later for the more realistic SIS model
than what we show here.
• The long-time infected fraction will not approach 100%, but rather a lower
value.
With these caveats clearly in mind, we present some simulations on the
gonorrhea graph. Our aim is to see what insights we can gain from our SAG∗
picture. We will focus principally on when the infection takes off. Because we
simply compare different scenarios (and their takeoff times) with one another,
we feel that our (comparative) conclusions are not greatly weakened by the
caveats given above.
Our procedure for simulation is the same as before: at each time step, each
link ij has a probability p_ij = W(i, j) of transmitting the infection if exactly
one of the pair ij is already infected. Our link strength data, when the unit
of time is one day, have values which vary from a few percent down to about
10⁻⁴. With these small values we can increment the simulator with a time
step of one day, and get smooth results.
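A minimal sketch of such an SI simulation is given below. It is not the authors' simulator: the function name, the edge-dictionary layout, the synchronous update within each time step, and the rule that immunity overrides start-node status are all assumptions. Blocking a link corresponds to leaving it out of W.

```python
import random


def simulate_si(W, start, immune=(), steps=400, seed=0):
    """Run SI spreading and return the infected count after each time step.

    W: {(i, j): p_ij} for undirected links, p_ij = transmission probability/step.
    start: nodes infected at t = 0; immune: nodes that can never be infected.
    """
    rng = random.Random(seed)
    immune = set(immune)
    infected = set(start) - immune
    history = [len(infected)]
    for _ in range(steps):
        newly = set()
        for (i, j), p in W.items():
            # A link can transmit only if exactly one endpoint is infected.
            if (i in infected) != (j in infected):
                target = j if i in infected else i
                if target not in immune and rng.random() < p:
                    newly.add(target)
        infected |= newly  # nodes infected in this step transmit from the next
        history.append(len(infected))
    return history


# Toy chain a - b - c with certain transmission (made-up probabilities).
chain = {("a", "b"): 1.0, ("b", "c"): 1.0}
free_run = simulate_si(chain, start=["a"], steps=3)
blocked = simulate_si(chain, start=["a"], immune=["b"], steps=3)
```

On the toy chain the infection advances one hop per step, and immunizing the middle node stops it entirely; on a real weighted graph one would average many runs with different seeds.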
Our simulations differ from one another in three ways: (i) the choice of
“start” nodes which are infected at t = 0; (ii) the choice of a set of “immune”
nodes which cannot be infected; and (iii) sometimes, the choice of links which
are to be blocked from transmission (removed). Choices (ii) and (iii) allow us
to test various strategies for hindering spreading. In the real world of human
sexual behavior, accomplishing either of these effects may be quite difficult;
but we test them here simply to see what can be achieved.
First we simulate the reference case, in which those nodes which are known
to be infected are the start nodes (see again Figs. 6 and 7), and we immunize
no nodes or links. We find (Fig. 9) that the infection takes off very fast—as
anticipated in Section 3.4. Specifically, we see that the takeoff time is very
short—just a few days. This is consistent with the fact that the infection has
already reached three very central (as defined by EVC) nodes. This latter
fact is consistent with two interpretations: either (i) the infection has recently
come to this dense network, and it is on the verge of taking off, or (ii) the
infection has been present for a long time, and has reached an equilibrium
(and rather low) level.

[Plot for Fig. 9: infected nodes (0–600) versus time (0–400). Curves: As-is; Center;
Within 1 hop from center; STI red region + head of region; 50 random.]

Fig. 9. Spreading simulations for gonorrhea, based on the SI model, and using various
prevention strategies. “As-is” = known infected start nodes and no strategy; the other
scenarios involve immunizing various nodes, as described in the text. The unit of time
is one day.

We do not have sufficient empirical information to favor one of these interpretations
over the other. If the first one is correct, it implies that one can
expect a strong growth of infection rate in a relatively short time. If the second
is correct, then our model is likely inadequate, not only in the SI aspect but
probably in other aspects as well. We remind the reader that our topographic
analysis is most useful in understanding the spreading of new infections over
fairly static networks; hence it may be useful in case (i), but has little to say
about case (ii).
Now, in order to test our ideas further, we assume case (i). Based on our
SAG∗ picture, we formulate various immunization strategies and test them via
simulation. We have tried (a) immunizing the Center node; (b) immunizing
the Center and all nodes within one hop of the Center (subregion heads);
(c) immunizing the two infected nodes in the large red subregion, plus that
subregion’s head node; and (d) immunizing 50 nodes chosen at random.
Results for all of these cases are shown in Fig. 9. A simple conclusion is
starkly obvious: none of these immunization strategies is able to retard the
takeoff. In fact, the only clear difference is the trivial and useless one: that
the long-time infected fraction is reduced by the number of immunized nodes
[for example, by 14 for scenario (b), and by 50 for scenario (d)].
In short: as strongly suggested by Fig. 7, the NG network is sufficiently well
connected, and sufficiently well infected, so that we find no simple strategy
which is at all effective in retarding the takeoff.
In order to investigate a different kind of test of the utility of our method
of analysis, we next “cure” all infected nodes, and explore scenarios in which
we can choose the start nodes freely. Our principal aim is to test the following
hypothesis: that time to takeoff is strongly determined by distance from the
Center of the SAG∗ .
Some simple tests of this hypothesis are shown in Fig. 10. Here we show the
progression of infection for three scenarios: (e) the Center is the only infected
start node; (f) a node roughly halfway between the Center and the periphery
is the start node; and (g) a very peripheral node is the start node.
The results of Fig. 10 strongly support our hypothesis. Takeoff times vary
from a few days to about 50 days to almost 150 days, as we move the start
node outward in the SAG∗ .
We also see, in the bottom half of the figure, that our earlier picture [3, 2] of
the movement of the infection “front” over the topography is confirmed here:
the infection [assuming it doesn’t start at the top as in (e)] moves slowly at
first, until it begins to reach more central nodes, at which point it speeds up,
while moving “uphill” (towards the Center); subsequently it moves “downhill,”
slowing down all the while. While we have seen this dynamic pattern many
times before, this is the first time we have tested it on a graph with weighted
links (and with the EVC appropriately corrected via the weighted adjacency
matrix).
While Fig. 10 offers anecdotal evidence for our hypothesis, we also have
statistical data. We have in fact run one-start-node simulations for each node
on the graph, 10 times for each node, and recorded the average time needed to
reach an infection number of 300 nodes (about 60%). To measure “distance”
from the Center, we define the dual notion of “closeness”: a node’s closeness
to the Center is simply the product of the link strengths over the (unique)
path to the Center in SAG∗. Thus many weak links give low closeness, while
few strong links give high closeness; and both the number of hops and the
link strengths of the hops affect the result.

[Figure 10: two stacked panels sharing a time axis (0–400). Top: infected nodes (0–600) for the start nodes “Central node”, “Medium Central”, and “Not central”. Bottom: mean EVC of infected nodes (0–0.2).]

Fig. 10. Three spreading simulations, based on three chosen scenarios, each with a
single start node. We see that distance from the Center node (in a metric defined by
the SAG∗) correlates strongly with time to takeoff. The lower part of the figure shows
the average EVC of the newly infected nodes.
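The closeness measure just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the encoding of the SAG∗ tree as two dictionaries (`parent`, `strength`) and the toy link strengths are hypothetical and are not taken from the Vancouver data.

```python
from typing import Dict, Hashable, Optional

Node = Hashable


def closeness(node: Node,
              parent: Dict[Node, Optional[Node]],
              strength: Dict[Node, float]) -> float:
    """Product of link strengths on the path from `node` to the Center.

    parent[v]   -- next node on v's path towards the Center (None for the Center)
    strength[v] -- strength of the link between v and parent[v]
    Both dictionaries are a hypothetical encoding of the SAG* tree.
    """
    product = 1.0
    while parent[node] is not None:
        product *= strength[node]
        node = parent[node]
    return product


# Toy tree: "c" is the Center; a -> c is one strong hop, d -> b -> c is two weak hops.
parent = {"c": None, "a": "c", "b": "c", "d": "b"}
strength = {"a": 0.5, "b": 0.1, "d": 0.1}

close_a = closeness("a", parent, strength)  # few strong links: high closeness
close_d = closeness("d", parent, strength)  # many weak links: low closeness
```

With this convention the Center itself has closeness 1 (an empty product), and every other node has a closeness between 0 and 1 when the link strengths are probabilities.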
Figure 11 gives a scatter plot for average infection time vs closeness, for
all nodes in the graph except the Center node. We see a strong decreasing
relationship: closer nodes need less time to infect the graph. Thus we find
from these results further strong support for our hypothesis.
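The one-start-node experiment behind this scatter plot can be sketched as follows. The sketch assumes a discrete-time SI process in which each infected node gets one independent transmission attempt per neighbor per day; that update rule, the function names, and the toy graph are assumptions for illustration, not a description of the authors' simulator.

```python
import random
from typing import Dict, Hashable, Optional

Node = Hashable
Graph = Dict[Node, Dict[Node, float]]  # node -> {neighbor: daily transmission probability}


def time_to_reach(graph: Graph, start: Node, target: int,
                  rng: random.Random, max_days: int = 10_000) -> Optional[int]:
    """Days until `target` nodes are infected, starting from one infected node."""
    infected = {start}
    day = 0
    while len(infected) < target:
        if day >= max_days:
            return None  # target never reached, e.g. on a disconnected graph
        newly = set()
        for i in infected:
            for j, p in graph[i].items():
                if j not in infected and rng.random() < p:
                    newly.add(j)
        infected |= newly  # SI: infected nodes stay infected
        day += 1
    return day


def mean_time(graph: Graph, start: Node, target: int,
              runs: int = 10, seed: int = 0) -> float:
    """Average of `runs` independent simulations from the same start node."""
    rng = random.Random(seed)
    times = []
    for _ in range(runs):
        t = time_to_reach(graph, start, target, rng)
        if t is None:
            raise ValueError("target size was not reached")
        times.append(t)
    return sum(times) / runs


# Toy path a - b - c - d with certain transmission, so the outcome is deterministic.
path = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0},
        "c": {"b": 1.0, "d": 1.0}, "d": {"c": 1.0}}
t_end = mean_time(path, "a", target=4)     # start at an end of the path
t_middle = mean_time(path, "b", target=4)  # start nearer the middle
```

Pairing `mean_time` for each start node with that node's closeness would give the kind of data plotted in Fig. 11; on the toy path, the more central start node needs less time.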

5 Summary and Discussion


In this chapter we have extended the topographic approach to the problem
of epidemic spreading over networks to a problem involving two new features.
First, the network is real: it is an empirical sex network, with some nodes
known to be infected with the STIs HIV, gonorrhea, and chlamydia. Second,
we have data which allow us to assign non-uniform link strengths (trans-
mission probabilities), and we have generalized the topographic approach to
incorporate these link strengths.

[Figure 11: log-log scatter plot of time (vertical axis, 10^1 to 10^3) against closeness (horizontal axis, 10^−12 to 10^0).]

Fig. 11. Time needed for a single start node to infect 300 nodes, as a function of that
start node’s “closeness” to the graph’s Center (averaged over 10 experiments for each
start node). Closeness is measured entirely in terms of the modified steepest-ascent
graph SAG∗ . We see a thorough statistical corroboration of the results of Fig. 10.

To help in illuminating the effects of incorporating link strengths, we first
performed the analysis by ignoring these weights. We visualized the resulting
unweighted FSW network, and simulated the progress of HIV on this network
(using uniform transmission probabilities). We found some interesting effects
from the almost-bipartite nature of the unweighted network. We also found
that the network is very highly connected—with the two HIV-infected nodes
very close to the network’s Center—so that retarding the spread of HIV was
difficult. Nevertheless we were able to show significant benefits to be obtained
from our analysis, for some hypothetical cases involving start nodes placed
elsewhere.
Incorporation of empirically obtained link strengths had large conse-
quences. Each disease yielded a distinct weighted graph, by affecting the trans-
mission probabilities. We found (using our assumption that perfect condom
protection was possible) that the HIV graph broke down into many small
components. While our visualization may still have some value, we saw no
value in running simulations on these small components.

Simulations on the gonorrhea graph gave results much like those on the
unweighted FSW graph: the graph was very well connected, and the already-
infected nodes had rather central positions. The result was that we were unable
to find simple topological fixes, inspired by our analysis, which could sig-
nificantly retard spreading. However, we were able to find strong evidence
confirming the basic applicability of our analysis to spreading. Specifically,
we showed that our own notion of a node’s distance from the Center of the
graph correlated strongly with the time needed for that node to infect the
graph.
We emphasize that this is the first application of the topographic approach
to a weighted graph. Performing this analysis has required generalizing our
earlier definition [3] of steepest ascent. The results we obtain here, based on
this new, generalized definition, are very promising. Hence—even as we fail
to come up with promising, concrete suggestions for hindering the spread
of STIs in the Vancouver sex network—we feel that our results confirm the
applicability of our approach to understanding spreading in the real-world
case of a network with non-uniformly weighted links.
We see a clear need for two obvious extensions of this work. First, it would
be useful to reconnect the HIV graph, by assigning small but non-zero proba-
bilities to the 100%-condom-use links. This would allow for a more meaningful
regions analysis and the accompanying testing by simulations (perhaps over
a long time scale).
Second, our approach is most simply understood and applied for diseases
for which SI spreading is appropriate (such as HIV). The application to gon-
orrhea or chlamydia would be greatly strengthened if one could generalize
the method to the SIS and/or SIR case. This is an interesting challenge for
future work.
The data used derive from self-reported infection status [10]. To validate
our model, empirically collected retrospective data on actual prevalence and
incidence of the infections could be obtained. This is also recommended for
future work.
Finally, we remind the reader of the motivation for this work. We believe
that the topographic analysis, based on EVC, is extremely useful for under-
standing epidemic spreading on a coarse scale. The analysis itself is not com-
putationally demanding; hence it can be performed in essentially real time.
Thus, we hope that our approach can be useful for disease prevention, in those
cases for which the network can be mapped in reasonably short time—that
is, short compared to both the time scale for infectious spreading, and the
time scale for significant topology changes. The results presented here do not
offer any immediate solution to the problem of STIs in the Vancouver FSW
network, but they do add further support to our belief that this approach may
be useful for this problem, and for others.

Acknowledgments

GC and KEM acknowledge partial support from the Future and Emerg-
ing Technologies unit of the European Commission through Project DELIS
(IST-2002-001907). VPR acknowledges the financial and in-kind support,
respectively, of the BC Medical Services Fdn and HIV/STI Prevention and
Control, BC Centre for Disease Control.

References
1. G. Canright and K. Engø-Monsen. Roles in networks. Science of Computer
Programming, pages 195–214, 2004.
2. G. Canright and K. Engø-Monsen. Epidemic spreading over networks: a view from
neighbourhoods. Telektronikk, 101:65–85, 2005.
3. G. Canright and K. Engø-Monsen. Spreading on networks: a topographic view.
In Proceedings, European Conference on Complex Systems, 2005.
4. G. S. Canright and K. Engø-Monsen. Some relevant aspects of network analysis
and graph theory. In J. Bergstra and M. Burgess, editors, Handbook of Network
and Systems Administration. Elsevier, Amsterdam, 2007.
5. K. Holmes, R. Levine, and M. Weaver. Effectiveness of condoms in preventing
sexually transmitted infections. Bull World Health Organ, 82:454–461, 2004.
6. A. M. Jolly, M. E. Moffatt, M. V. Fast, and R. C. Brunham. Sexually transmitted
disease thresholds in Manitoba, Canada. Ann Epidemiol, 15:781–788, 2005.
7. M. Kretzschmar, Y. T. P. H. van Duynhoven, and A. J. Severijnen. Modeling
prevention strategies for gonorrhea and chlamydia using stochastic network
simulations. American Journal of Epidemiology, 144:306–317, 1996.
8. M. Newman. The structure and function of complex networks. SIAM Review,
45:167–256, 2003.
9. R. Pastor-Satorras and A. Vespignani. Epidemic spreading in scale-free networks.
Phys Rev Lett, 86:3200–3203, 2001.
10. V. P. Remple, D. M. Patrick, C. Johnston, M. W. Tyndall, and A. Jolly. Clients of
indoor commercial sex workers: Heterogeneity in patronage patterns and
implications for HIV and STI propagation through sexual networks. Sexually
Transmitted Diseases, May 2007.
Spectral Characterization of Network
Structures and Dynamics

Anirban Banerjee1 and Jürgen Jost2


1 Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin,
Germany; banerjee@molgen.mpg.de
2 Max Planck Institute for Mathematics in the Sciences, Inselstr. 22, 04103 Leipzig,
Germany, and Santa Fe Institute, Santa Fe, NM 87501, USA; jost@mis.mpg.de

1 Introduction

Mathematically, graphs defy a systematic and complete classification, and
empirically, the graphs representing networks come in a bewildering multi-
tude. We have developed some tools [8, 9, 10] that at least allow for a rough
classification of graphs that reflects the difference in the empirical domains
from which network data are produced and that does not depend on sophis-
ticated visualization tools.
As such, a graph is a rather simple formal structure. It consists of nodes
or vertices that are connected by edges or links. These nodes then repre-
sent the elements of a network (and we shall often not distinguish between
the network and its underlying graph), and the edges represent relations
between them. These could be chemical interactions as in intracellular net-
works of genes, proteins, or metabolites, synaptic connections between neu-
rons, physical links in infrastructural networks, links between Internet pages,
co-occurrences between words in sentences or on text pages, email contacts
between people, co-authorships between scientists, and so on. This structure
then can be expected to be somehow adapted to the function of the network,
by evolution, self-organization, or design. In turn, any dynamics supported by
the network will be constrained by this underlying structure.
Our approach is based on associating certain mathematical objects—which
ultimately just yield some numbers—to a graph which reflect its structural
properties and which in particular encode the constraints on the dynamics
that it can support. The mathematical objects will be an operator, the graph
Laplacian (a discrete analogue of the Laplace operator in real analysis), and
its eigenfunctions, and the numbers alluded to will be the eigenvalues of that
operator.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_7,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

2 Growing Networks

Empirical networks usually do not spring into existence, but rather grow to
their present or final state from smaller beginnings. Naturally, such a growth
process involves the sequential addition of nodes and links (connections). Usu-
ally, nodes are added at random, but their link formation with other nodes
(already present in the network) is often not entirely random. This link for-
mation will follow some rule that typically is still stochastic but also involves
properties of those nodes that are candidates for receiving a link. When that
rule is such that nodes that already have many connections have a higher
chance of receiving links than nodes with fewer connections,
we have some form of preferential attachment. Such a rule is known to lead to
a scale-free degree distribution of the nodes in the network; that is, the number
of nodes in the final network that have k links behaves like some power $k^{-\alpha}$,
for some positive exponent α. The first such rule was proposed by Simon [44],
and it directly stipulated that those nodes that have more connections also
have a higher chance of receiving additional ones (“the-rich-get-richer” prin-
ciple). This rule and the effects resulting from it were then systematically
investigated by Barabási–Albert [2, 11], and subsequently, many empirical
networks were found to exhibit such a power-law degree distribution.
It would be, however, premature to draw systematic consequences about
other network properties from such a power-law degree distribution. In fact,
there are many rules for network growth that are plausible in many areas
of application that indirectly lead to such a kind of preferential attachment,
but can lead to networks with properties that are otherwise rather different
from those of the schemes of Simon and Barabási–Albert. For instance, Jost–
Joy [28] investigated the “make-friends-with-the-friends-of-your-friends” rule
where a new node first forms one link with a randomly selected node in the
network and then preferentially makes further links with neighbors of that
node. Since the chance of a node being a neighbor of some randomly chosen
node depends on its degree, these subsequent links then also constitute some
preferential attachment, and the resulting degree distribution will follow a
power law. However, other properties of that network are rather different from
those obtained by the direct preferential attachment scheme. In particular,
because of the preference for local connections, the network diameter will be
typically much larger. Even the opposite scheme, where a node preferentially
forms additional links with nodes from which it has a large distance, does not
lead to a network with a very small diameter. For creating a network with a
small diameter, it is rather more efficient that nodes directly use preferential
attachment, that is, preferentially form links with other nodes that have a
high degree and are therefore well connected in the network. Of course, the
most efficient way to achieve a small diameter in a sparse network is to connect
every node to one single central node.
Another crucial difference between a “make-friends-with-the-friends-of-
your-friends” network and a “the-rich-get-richer” network is that the first
eigenvalue of the “make-friends-with-the-friends-of-your-friends” network will be much smaller, implying for
instance that dynamics on such a network are much more difficult to synchro-
nize, as will be explained below. In fact, spectral properties like the behavior
of the first eigenvalue of scale-free networks were analyzed in [3, 4], and it
was pointed out that the scaling exponent and the first eigenvalue are essen-
tially independent parameters for a network. Of course, when networks are
produced by a certain stochastic scheme or drawn from some probability dis-
tribution on the space of networks, then that scheme or distribution will also
lead to some typical spectral behavior, as systematically investigated in [29].
However, when we only know whether a network is scale free, we should be
careful about inferring other network properties. It might be a wiser strategy
to find out more about the underlying network evolution rule, like the above
“make-friends-with-the-friends-of-your-friends” principle, the Cameo princi-
ple of Blanchard–Krüger [12], or whatever is plausible in the given empirical
domain. One important class of rules for which there is much evidence in
various domains is the one of node duplications. That means that instead of
randomly attaching a new external node, we take some node i already present
in the network and double it in the sense that we create a new node i′ that
forms links with all or some of the neighbors of i. It may or may not also
form a link with i itself. Again, since the chance of another node j of being a
neighbor of the randomly chosen node i and therefore receiving new connections
from i′ depends on the degree of j, we do get a preferential attachment
scheme. Again, however, as we shall see below, such a node duplication leads
to some specific spectral properties that are not shared by networks arising
from different schemes.
There also exist other distinctions within the class of scale-free networks.
An important one is whether the nodes of high degree are assortative, i.e., pre-
fer connections with other high degree nodes, or disassortative, i.e., avoid con-
nections with high degree nodes and rather form links with low degree nodes.

3 Graph Operators and their Spectral Properties

We have already seen several important network parameters or properties,
like the diameter, the synchronizability, the degree sequence (counting the
number of nodes of degree k in the network as a function of k ∈ N), and the
assortativity. Of course, there are many others, like the clustering coefficient,
which expresses the relative frequency of triangles, that is, triples of nodes
that are pairwise connected. The clustering coefficient is defined as

$$C := \frac{3 \times \text{number of triangles}}{\text{number of connected triples of nodes}}. \tag{1}$$

The normalization is that C becomes one for a fully connected graph.
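A minimal Python sketch of definition (1) is given below. It assumes an undirected graph stored as a dictionary of neighbor sets and counts connected triples by their middle node; the function name and the example graphs are illustrative choices, not part of the original text.

```python
from itertools import combinations
from typing import Dict, Hashable, Set


def clustering_coefficient(adj: Dict[Hashable, Set[Hashable]]) -> float:
    """C = 3 * (number of triangles) / (number of connected triples)."""
    closed = 0   # triples whose two end nodes are also linked; equals 3 * triangles
    triples = 0  # connected triples, each counted once via its middle node
    for middle, neighbors in adj.items():
        for j, k in combinations(neighbors, 2):
            triples += 1
            if k in adj[j]:
                closed += 1
    return closed / triples if triples else 0.0


# Complete graph on 4 vertices: every triple is closed.
complete4 = {i: {j for j in range(4) if j != i} for i in range(4)}
# 4-cycle (bipartite): no triangles at all.
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
# Triangle 0-1-2 with a pendant vertex 3 attached to 0: one triangle, five triples.
pendant = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

c_complete = clustering_coefficient(complete4)
c_cycle = clustering_coefficient(cycle4)
c_pendant = clustering_coefficient(pendant)
```

Each triangle contributes three closed triples (one per choice of middle node), which is why the counter `closed` already contains the factor 3 from (1).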



Certain properties characterize specific classes of graphs. Complete graphs
are those where every vertex is connected with all others. Of course, for large
graphs, this is an unrealistic situation, as they are typically sparse, in the
sense that the average vertex has connections to only a small fraction of the
vertices present in the graph. A graph is bipartite when it consists of two
classes inside each of which there are no connections. A graph is bipartite
iff it has no closed paths of odd length. In particular, for a bipartite graph,
the clustering coefficient C vanishes. A complete bipartite graph is one where
each member of one class is connected with all members of the other class.
Trees are special bipartite graphs. They have the minimal number of edges,
N − 1, that is needed to make a graph of N vertices connected.
One may also consider more general structural properties, like cohesion,
or functional aspects, like robustness against the destruction of links or the
elimination of nodes. Clearly, no such list of parameters and properties can
be exhaustive. Also, it may not be easy to understand the relations, if any,
between those parameters and properties. In this situation, we have developed
the spectral approach to the description of networks. As we shall explain,
this means the analysis of the density of eigenvalues of a natural operator
associated to a network, the graph Laplacian. While these eigenvalues do
not always fully determine a graph, they nevertheless capture all important
geometric properties, in a more or less explicit form. Plotting the density of
eigenvalues also yields a representation of a graph that can be readily visually
inspected. (In contrast, explicit presentation of the nodes and links becomes
rather opaque once the graph exceeds some moderate size of, say, 1–200 nodes.
Moreover, the appearance of such a presentation can easily be manipulated by
moving the nodes around in a plane.)
We now formally introduce the graph Laplacian and its spectrum. We
represent our network structurally as a graph Γ which we assume to be finite
and connected; let it have N vertices. Vertices i, j ∈ Γ connected by an edge
of Γ are called neighbors, i ∼ j. The number of neighbors of a vertex i ∈ Γ
is called its degree ni . For functions v from the vertices of Γ to R, we define
the normalized Laplacian (henceforth simply called the Laplacian) as
$$\Delta v(i) := \frac{1}{n_i} \sum_{j,\, j \sim i} v(j) - v(i). \tag{2}$$

This operator is different from the algebraic graph Laplacian
$Lv(i) := n_i v(i) - \sum_{j,\, j \sim i} v(j)$; see, e.g., [13, 14, 20, 32, 35]. In particular, the spectrum of Δ is
different from that of L; Δ, however, has the same spectrum as the Laplacian
investigated in [15] (in fact, the two operators are equivalent, differing only by
a multiplier). The normalized Laplacian is the operator underlying random
walks and conservative diffusion processes on graphs. Therefore, it seems to be
the more natural operator from a geometric or physical perspective. However,
the algebraic Laplacian does possess certain nice algebraic properties that
are not shared by the normalized Laplacian, like a trace formula, see [22].

Nevertheless, in our empirical studies, we have found that the Laplacian con-
sidered here seems to be a better tool for distinguishing different classes of
graphs by spectral properties.
We now recall some elementary properties, see, e.g., [15, 26]. The Laplacian
is symmetric for the product

$$(u, v) := \sum_{i \in V} n_i\, u(i)\, v(i) \tag{3}$$

for real-valued functions u, v on the vertices of Γ (and because of this symmetry,
we need not consider complex-valued functions). The eigenvalues of Δ
therefore are real.
Δ is nonpositive in the sense that (Δu, u) ≤ 0 for all u. With the following
convention, the eigenvalues λ then are nonnegative:

Δu + λu = 0. (4)

A nonzero solution u is called an eigenfunction for the eigenvalue λ. Since Γ
has N vertices, Δ has N eigenvalues, not necessarily distinct, as some of them
might occur with higher multiplicity.
The smallest eigenvalue is λ0 = 0, with a constant eigenfunction. This
eigenvalue is simple because we assume that Γ is connected; in general, the
multiplicity of the eigenvalue 0 equals the number of connected components,
with the corresponding eigenfunctions being ≡ 1 on one and ≡ 0 on all other
components. Returning to our case of a connected graph Γ, we then have

$$\lambda_k > 0 \tag{5}$$

for k > 0, where we order the eigenvalues as

$$\lambda_0 = 0 < \lambda_1 \le \dots \le \lambda_{N-1}.$$

For the largest eigenvalue, we have

$$\lambda_{N-1} \le 2. \tag{6}$$

In particular, the spectrum of Δ is always confined to the interval [0, 2], re-
gardless of the size of the graph. This is not true for the algebraic graph
Laplacian L, and this property of Δ allows for an easy comparison of the
spectra of graphs irrespective of their sizes.
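To make these statements concrete, the following Python sketch computes the spectrum of Δ (with the sign convention Δu + λu = 0) for small graphs. It relies on the fact that this spectrum coincides with that of the symmetric matrix with entries δ_ij − a_ij/√(n_i n_j), where a_ij is 1 for neighbors and 0 otherwise. The cyclic Jacobi rotation routine is a stand-in chosen only to keep the example free of external libraries; it is not a method from the text, and a numerical linear algebra library would normally be used instead. All function names are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Set


def jacobi_eigenvalues(matrix: List[List[float]], max_sweeps: int = 50) -> List[float]:
    """Eigenvalues of a real symmetric matrix by cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(max_sweeps):
        off = sum(a[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off < 1e-24:  # off-diagonal part is numerically zero
            break
        for p in range(n):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # a <- a J: mix columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # a <- J^T a: mix rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def normalized_laplacian_spectrum(adj: Dict[Hashable, Set[Hashable]]) -> List[float]:
    """Eigenvalues of the normalized Laplacian of a connected graph without self-loops."""
    nodes = list(adj)
    index = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    sym = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for v in nodes:
        for w in adj[v]:
            sym[index[v]][index[w]] = -1.0 / math.sqrt(len(adj[v]) * len(adj[w]))
    return jacobi_eigenvalues(sym)


path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}  # path a - b - c
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}       # complete graph on 3 vertices

spectrum_path = normalized_laplacian_spectrum(path3)
spectrum_triangle = normalized_laplacian_spectrum(triangle)
```

For the three-vertex path the eigenvalues come out as 0, 1, 2, and for the triangle as 0, 3/2, 3/2, so both spectra lie in [0, 2] with smallest eigenvalue 0, as stated above.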
We have equality in (6) iff the graph is bipartite. Thus, a single eigenvalue
determines the global property of bipartiteness. More generally, a graph is
bipartite iff whenever λ is an eigenvalue, then so is 2 − λ. Thus, the character-
istic spectral property of a bipartite graph is that its spectrum is symmetric
about 1.
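One direction of this symmetry statement can be verified directly from the definition (2) and the convention (4). The short derivation below is a sketch; the names V_1 and V_2 for the two vertex classes are introduced here only for illustration.

```latex
% Let u satisfy \Delta u + \lambda u = 0 on a bipartite graph with classes V_1, V_2:
\[
  \frac{1}{n_i} \sum_{j \sim i} u(j) = (1 - \lambda)\, u(i) \quad \text{for all } i .
\]
% Flip the sign of u on one of the two classes:
\[
  \tilde{u}(i) :=
  \begin{cases}
     u(i),  & i \in V_1, \\
    -u(i),  & i \in V_2 .
  \end{cases}
\]
% Every neighbor j of i lies in the other class, so the neighbor average changes sign
% relative to \tilde{u}(i):
\[
  \frac{1}{n_i} \sum_{j \sim i} \tilde{u}(j)
    = -(1 - \lambda)\, \tilde{u}(i)
    = \bigl(1 - (2 - \lambda)\bigr)\, \tilde{u}(i) .
\]
% Hence \tilde{u} is an eigenfunction for the eigenvalue 2 - \lambda.
% Applied to the constant eigenfunction (\lambda_0 = 0), this gives the eigenvalue 2.
```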
For instance, for a complete graph of N vertices,

$$\lambda_1 = \dots = \lambda_{N-1} = \frac{N}{N-1}, \tag{7}$$

that is, there is only one nontrivial eigenvalue, $\frac{N}{N-1}$, occurring with multi-
plicity N − 1. Among all graphs with N vertices, this is the largest possible
value for λ1 and the smallest possible value for λN−1. Thus, the characteristic
spectral property of complete graphs is that there is this eigenvalue with the
highest possible multiplicity.
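The value N/(N − 1) can be checked by hand from the definition (2). The following lines are a sketch of that computation, assuming only that in a complete graph every vertex has degree N − 1 and is adjacent to all other vertices.

```latex
% Complete graph on N vertices: n_i = N - 1, and every j \neq i is a neighbor of i.
% Take any u with \sum_j u(j) = 0, so that \sum_{j \neq i} u(j) = -u(i). Then
\[
  \Delta u(i)
    = \frac{1}{N-1} \sum_{j \neq i} u(j) - u(i)
    = -\frac{u(i)}{N-1} - u(i)
    = -\frac{N}{N-1}\, u(i) .
\]
% Comparing with \Delta u + \lambda u = 0 gives \lambda = N/(N-1).
% The functions with \sum_j u(j) = 0 form a space of dimension N - 1,
% which accounts for the multiplicity N - 1.
```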
Many qualitative properties of graphs can be characterized by inequalities
or other relationships between their eigenvalues. For instance, Monasson [36]
carried out a systematic investigation of the spectrum of a small-world graph
as the superposition of a regular ring and a random graph. Also, [23] develops
a method for (re)constructing a graph from its spectrum. We should point
out, however, that in general it is not possible to uniquely determine a graph
from its spectrum. In fact, there exist isospectral graphs, that is, different
graphs with the same eigenvalues. For instance, all complete bipartite graphs
with the same number N of vertices have the same eigenvalues. Actually, they
possess the eigenvalues 0 and 2 with multiplicity 1 and the eigenvalue 1 with
multiplicity N − 2. Any graph with that spectrum is a complete bipartite
graph, but among complete bipartite graphs with N vertices, the two classes may
have different sizes N1, N2, as long as N1 + N2 = N, of course.
We now rewrite the eigenvalue equation (4) as
$$\frac{1}{n_i} \sum_{j \sim i} u(j) = (1 - \lambda)\, u(i) \quad \text{for all } i. \tag{8}$$

We observe that when the eigenfunction u vanishes at i, then also

$$\sum_{j \sim i} u(j) = 0. \tag{9}$$

The converse also holds, except for the case λ = 1 when (9) holds at all points
regardless of whether the eigenfunction vanishes there or not.
We now consider motifs, that is, small subgraphs of Γ of a particular type,
and analyze what happens to the spectrum when performing some natural
operations with motifs. As our motif, we take some graph Λ.
We start with motif joining: Here, the motif Λ is a graph that is inde-
pendent of Γ . Let j0 be a vertex of Λ. We assume that Λ has eigenvalue λ
and an eigenfunction uλ that vanishes at j0 , i.e., uλ (j0 ) = 0. We then form
a graph Γ̄ by identifying the vertex j0 with an arbitrary vertex i of Γ . The
new graph then also possesses the eigenvalue λ, with an eigenfunction that
agrees with uλ on Λ and vanishes at the other vertices, that is, those coming
from Γ . Thus, a motif Λ can be joined to an existing graph with a preserved
eigenvalue and a localized eigenfunction when the joining occurs at one (or
several) vertices where that eigenfunction vanishes.
We next consider motif duplication: Here, the motif Λ is a subgraph of Γ ,
with vertices j1 , . . . , jm . Let the function u on the vertex set of Λ satisfy
$$\frac{1}{n_i} \sum_{j \in \Lambda,\, j \sim i} u(j) = (1 - \lambda)\, u(i) \quad \text{for all } i \in \Lambda \text{ and some } \lambda, \tag{10}$$

where ni is the degree of the vertex i in Γ. Let Γ̄ be obtained from Γ by
doubling the motif Λ, that is, by adding vertices i1, . . . , im and their connec-
tions as in Λ and connecting each iα with all i ∉ Λ that are neighbors of jα.
Then the graph Γ̄ possesses the eigenvalue λ with an eigenfunction uλ that
is nonzero only at vertices of Λ and its double; it agrees with u on
Λ, and with −u on the double of Λ. Thus, the eigenvalue λ is produced by motif
duplication with symmetric eigenfunction balancing. We point out that for
this effect it is essential that there be no connections between a node jα and
its double iα .
The simplest motif is a single vertex, and the corresponding motif dupli-
cation is the doubling of a single vertex j0 ∈ Γ . According to the general
scheme, we add a new vertex i0 and connect i0 with all neighbors of j0 . This
generates an eigenvalue 1, with an eigenfunction u1 that is nonzero only at
j0 and i0 , with u1 (j0 ) = 1, u1 (i0 ) = −1. In the analysis of empirical networks,
we often find that the spectral plot has a high peak at the eigenvalue 1.
In such a situation, a natural hypothesis is that this network evolved via
a sequence of vertex doublings. In fact, vertex duplication with subsequent
random edge deletion has been proposed in different application fields as a
mechanism for network growth that can reproduce qualitative properties of
empirical networks, e.g., for the Internet [30], for protein-interaction networks
[6, 45, 46, 47], or for citation networks [31], although the precise rules can
differ between those investigations, for instance, whether the duplicated node
and its copy are connected or not.
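The eigenvalue 1 produced by vertex doubling can be checked numerically on a small example. The Python sketch below doubles one vertex of a toy graph and verifies that the function equal to 1 on the original vertex, −1 on its copy, and 0 elsewhere satisfies Δu + λu = 0 with λ = 1. The graph, the vertex labels, and the function names are assumptions chosen for illustration.

```python
from typing import Dict, Hashable, Set

Adjacency = Dict[Hashable, Set[Hashable]]


def double_vertex(adj: Adjacency, j0: Hashable, i0: Hashable) -> Adjacency:
    """Add a new vertex i0 linked to all neighbors of j0 (but not to j0 itself)."""
    doubled = {v: set(neighbors) for v, neighbors in adj.items()}
    doubled[i0] = set(adj[j0])
    for k in adj[j0]:
        doubled[k].add(i0)
    return doubled


def apply_laplacian(adj: Adjacency, u: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Normalized Laplacian: (neighbor average of u) minus u, at every vertex."""
    return {i: sum(u[j] for j in neighbors) / len(neighbors) - u[i]
            for i, neighbors in adj.items()}


# Toy graph: pendant vertex 0 attached to 1, plus the triangle 1-2-3.
graph = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
doubled = double_vertex(graph, j0=2, i0="2'")

# Candidate eigenfunction: +1 on the vertex, -1 on its double, 0 elsewhere.
u1 = {v: 0.0 for v in doubled}
u1[2], u1["2'"] = 1.0, -1.0

delta_u = apply_laplacian(doubled, u1)
residual = max(abs(delta_u[v] + 1.0 * u1[v]) for v in doubled)  # |Δu + λu| with λ = 1
```

The residual vanishes because the contributions of the vertex and its double cancel at every common neighbor, while at the two doubled vertices themselves the neighbor average of u is zero.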
The next simplest motif consists of two connected vertices. Thus, we con-
sider an edge in Γ connecting two vertices j1 , j2 . Equation (10) then becomes
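$$\frac{1}{n_{j_1}}\, u(j_2) = (1 - \lambda)\, u(j_1), \qquad \frac{1}{n_{j_2}}\, u(j_1) = (1 - \lambda)\, u(j_2), \tag{11}$$

with the solutions

$$\lambda_\pm = 1 \pm \frac{1}{\sqrt{n_{j_1} n_{j_2}}}. \tag{12}$$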
The duplication of an edge thus yields the eigenvalues λ± which are symmetric
about 1. Also, when the degree of j1 or j2 is large, λ± are close to 1.
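The step from the two-vertex equations to their solutions can be filled in as follows. This is a sketch of the elimination; the degrees in the closing example are chosen purely for illustration.

```latex
% Multiply the two equations of the edge motif. For a nonzero solution both
% u(j_1) and u(j_2) are nonzero, since each equation forces one to vanish with the other:
\[
  \frac{u(j_1)\, u(j_2)}{n_{j_1} n_{j_2}} = (1 - \lambda)^2\, u(j_1)\, u(j_2)
  \;\Longrightarrow\;
  (1 - \lambda)^2 = \frac{1}{n_{j_1} n_{j_2}}
  \;\Longrightarrow\;
  \lambda_\pm = 1 \pm \frac{1}{\sqrt{n_{j_1} n_{j_2}}} .
\]
% Illustration: n_{j_1} = 2 and n_{j_2} = 8 give \sqrt{16} = 4, hence
% \lambda_\pm = 1 \pm 1/4, i.e. 0.75 and 1.25.
```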
The next motifs consist of three vertices. When we have a chain of vertices
j1 , j2 , j3 for which j2 is connected to both j1 and j3 , but without a connec-
tion between j1 and j3 (that is, the motif is not a triangle), we obtain the
eigenvalues
$$\lambda = 1, \quad 1 \pm \sqrt{\frac{1}{n_{j_2}} \left( \frac{1}{n_{j_1}} + \frac{1}{n_{j_3}} \right)}. \tag{13}$$
The other motif with three vertices is a triangle, with vertices j1 , j2 , j3 . In this
case, from (10), we obtain the cubic equation
$$(1 - \lambda)^3\, n_{j_1} n_{j_2} n_{j_3} - (1 - \lambda)(n_{j_1} + n_{j_2} + n_{j_3}) - 2 = 0 \tag{14}$$
for λ.

4 Functional and Dynamical Aspects Determined by the First Eigenvalue
In this section, we shall argue that the first nontrivial eigenvalue λ1 plays a
special role for understanding important network properties. λ1 is also called
the spectral gap, because it is equal to the difference λ1 − λ0 as λ0 = 0.
λ1 admits the variational characterization

$$\lambda_1 = \min_{v} \left\{ \frac{\sum_{j \sim i} (v(i) - v(j))^2}{\sum_i n_i\, v(i)^2} \;:\; \sum_i n_i\, v(i) = 0 \right\}. \tag{15}$$

A function v attaining this minimum then is an eigenfunction for λ1. Since the
numerator in (15) only takes pairs of neighboring vertices into account, λ1 can
become quite small when the graph consists of two large subgraphs that are
connected by few edges. In (15), we can then achieve a small value by taking
some function that equals a positive constant on one of those subgraphs and
a negative constant on the other, where the two constants are adjusted
so that the normalization $\sum_i n_i v(i) = 0$ is satisfied. Therefore, it is intuitively
clear that λ1 can be estimated against the Polya–Cheeger constant h(Γ ) of
our graph Γ , which is defined as follows. Letting |E| denote the number of
edges contained in an edge set E, we define
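$$h(\Gamma) := \inf \left\{ \frac{|E_0|}{\min\left( \sum_{i \in V_1} n_i,\ \sum_{i \in V_2} n_i \right)} \right\}, \tag{16}$$

where removing E0 disconnects Γ into the components V1, V2. We then have
the estimates (see [15] for proofs)

$$\frac{1}{2}\, h(\Gamma)^2 \le \lambda_1 \le 2\, h(\Gamma). \tag{17}$$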
2
Incidentally, this implies the inequality

h(Γ ) ≤ 4 (18)

for any connected graph.
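For very small graphs, the constant h(Γ) can be evaluated by brute force. The Python sketch below reads (16) as an infimum over all splits of the vertex set into two nonempty parts V1 and V2, with E0 the set of edges running between them; that reading, the function name, and the example graphs are assumptions made for illustration, and the enumeration is exponential in the number of vertices.

```python
from itertools import combinations
from typing import Dict, Hashable, Set


def cheeger_constant(adj: Dict[Hashable, Set[Hashable]]) -> float:
    """Brute-force h for a small connected graph with at least two vertices."""
    nodes = list(adj)
    degree = {v: len(adj[v]) for v in nodes}
    total_degree = sum(degree.values())
    best = float("inf")
    for size in range(1, len(nodes)):
        for part in combinations(nodes, size):
            v1 = set(part)
            # Edges with exactly one endpoint in V1.
            cut = sum(1 for v in v1 for w in adj[v] if w not in v1)
            vol1 = sum(degree[v] for v in v1)
            best = min(best, cut / min(vol1, total_degree - vol1))
    return best


path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}                  # path 0 - 1 - 2 - 3
complete4 = {i: {j for j in range(4) if j != i} for i in range(4)}

h_path = cheeger_constant(path4)          # best split cuts the middle edge
h_complete = cheeger_constant(complete4)  # best split is two pairs of vertices
```

On the path, cutting the middle edge removes one edge and leaves degree sum 3 on each side, giving 1/3; on the complete graph with four vertices the best split is into two pairs, giving 4/6 = 2/3. The well-connected graph thus has the larger constant, in line with the intuition behind (15).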


Turning to dynamical aspects, we consider a dynamical system with cou-
pling structure given by Γ . More specifically, we consider the coupled equation
for a function u depending on the nodes i ∈ Γ and evolving in discrete time
n ∈ N,

$$u(i, n+1) = f(u(i, n)) + \frac{\epsilon}{n_i} \sum_{j,\, j \sim i} \bigl( f(u(j, n)) - f(u(i, n)) \bigr). \tag{19}$$

Here, f : [0, 1] → [0, 1] is some function; the functions we have in mind are
those whose iteration generates some chaotic dynamics, like the logistic map

f (x) = 4x(1 − x). (20)



What is important about f is its Lyapunov exponent,


\[
\mu_0 = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \log |f'(\bar{u}(n))|.
\]

The Lyapunov exponent μ0 is positive for chaotic dynamics f .
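
The Lyapunov exponent can be estimated numerically by truncating the limit at a finite N. The sketch below does this for the logistic map (20), whose exponent is known to equal log 2; the starting point, the transient length, and the number of steps are arbitrary choices made for illustration, not values from the text.

```python
import math


def logistic(x):
    """The map f from (20)."""
    return 4.0 * x * (1.0 - x)


def lyapunov_exponent(x0=0.3, transient=1000, steps=50_000):
    """Finite-N estimate of mu_0 = lim (1/N) sum log |f'(u(n))| along one orbit."""
    x = x0
    for _ in range(transient):        # discard the initial transient
        x = logistic(x)
    total, used = 0.0, 0
    for _ in range(steps):
        slope = abs(4.0 - 8.0 * x)    # |f'(x)| for f(x) = 4x(1 - x)
        if slope > 0.0:               # skip the single point x = 1/2
            total += math.log(slope)
            used += 1
        x = logistic(x)
    return total / used


mu0 = lyapunov_exponent()
print(f"estimated mu_0 = {mu0:.3f}, log 2 = {math.log(2.0):.3f}")
```

A positive estimate is the numerical signature of the chaotic regime referred to above.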


In (19), ε is a coupling parameter, usually in the range 0 ≤ ε ≤ 1. The specific question
we wish to ask is whether, or better, under what circumstances, the solution
u of (19) synchronizes, that is, asymptotically,

\[
\lim_{n \to \infty} \left( u(i, n) - u(j, n) \right) = 0 \quad \text{for all nodes } i, j. \tag{21}
\]

This question can be understood as asking about the stability of a synchronized
solution
u(i, n) = ū(n) (22)
that solves
ū(n + 1) = f (ū(n)). (23)
Systematic studies of synchronization can be found in [42, 41]. It was then found in [27, 43]
that a sufficient condition for such stability is

\[
\frac{1 - e^{-\mu_0}}{\lambda_1} < \epsilon < \frac{1 + e^{-\mu_0}}{\lambda_{N-1}}. \tag{24}
\]
In practice, the left inequality, the one involving λ1 , is the crucial one here. In
particular, when the eigenvalues satisfy appropriate conditions, we can have
a stable synchronized solution that is chaotic (μ0 > 0).
Note that the first eigenvalue even determines the synchronization of dy-
namics with transmission delays between the nodes, see [5].
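
The coupled system (19) and the condition (24) can be explored with a short simulation. The sketch below is only an illustration under stated assumptions: the example graph is the complete graph on N = 5 nodes (whose nonzero normalized Laplacian eigenvalues all equal N/(N − 1)), μ0 = log 2 is the known Lyapunov exponent of the logistic map (20), and the function names, the random initial condition, and the clamping to [0, 1] (a guard against rounding) are choices made here rather than part of the original analysis.

```python
import math
import random


def logistic(x):
    """The map f from (20)."""
    return 4.0 * x * (1.0 - x)


def coupled_step(u, adj, eps):
    """One iteration of (19) on the graph given by adjacency lists `adj`."""
    fu = [logistic(x) for x in u]
    new_u = []
    for i, neighbors in enumerate(adj):
        coupling = sum(fu[j] - fu[i] for j in neighbors) / len(neighbors)
        value = fu[i] + eps * coupling
        new_u.append(min(1.0, max(0.0, value)))  # guard against rounding outside [0, 1]
    return new_u


def sync_window(mu0, lam_1, lam_max):
    """Interval of eps allowed by the sufficient condition (24)."""
    return (1.0 - math.exp(-mu0)) / lam_1, (1.0 + math.exp(-mu0)) / lam_max


def spread_after(adj, eps, steps=300, seed=0):
    """Difference max_i u(i) - min_i u(i) after `steps` iterations."""
    rng = random.Random(seed)
    u = [rng.random() for _ in adj]
    for _ in range(steps):
        u = coupled_step(u, adj, eps)
    return max(u) - min(u)


# Assumed example: complete graph on N = 5 nodes.
N = 5
ADJ = [[j for j in range(N) if j != i] for i in range(N)]
LAM = N / (N - 1)          # every nonzero eigenvalue of this graph
MU0 = math.log(2.0)        # Lyapunov exponent of the logistic map

low, high = sync_window(MU0, LAM, LAM)
print(f"condition (24): {low:.3f} < eps < {high:.3f}")
for eps in (0.1, 0.7):
    print(f"eps = {eps}: spread after 300 steps = {spread_after(ADJ, eps):.3e}")
```

For this graph the coupling strength 0.7 lies inside the window given by (24) and 0.1 lies below it, so comparing the two printed spreads shows the role of the left inequality.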

5 Spectral Plots and What They May Tell Us


In this final section, we describe how (a smoothed version of) the density plot
for the eigenvalues of the Laplacian of a network yields a good heuristic clus-
tering scheme for networks from different empirical domains. More precisely,
we shall see that the spectral plots of different networks from the same do-
main typically look rather similar to each other, but different from those for
networks from different domains. Also, these spectral plots often suggest suit-
able hypotheses about the dominant evolution mechanisms of the underlying
networks. Let us give some examples that summarize some of the discussion
in the preceding sections.
• A high peak at the eigenvalue 1 may indicate many successive node dupli-
cations. This is readily visible in many of our spectral plots.
• Likewise, as analyzed above, see (12), (13), (14), duplications of small
motifs leave characteristic traces in the spectrum.
[Figure 1: spectral plots in four panels, (a) to (d); the horizontal axis of each panel runs from 0 to 2.]

Fig. 1. (a) Protein-protein interaction network of Helicobacter pylori. Network size
= 710. Data collected from http://www.cosinproject.org [Download date: 25 Sept.
2005]. (b) Metabolic network of Helicobacter pylori. Size of the network = 940. Nodes
represent substrates, enzymes, and intermediate complexes. Data used in [24]. Data
source: http://www.nd.edu/∼networks/resources.htm. [Download date: 22 Nov. 2004].
(c) Autonomous Systems (AS) topology of the Internet. Every vertex represents an
AS, and two vertices are connected if there is at least one physical link between the
two corresponding ASs. AS graph of 1998/04/02. Network size = 3522. Data collected
from http://www.cosinproject.org and data used in [18] [Download date: 23 September
2005]. Main source: BGP routing data collected by University of Oregon Route Views
Project, then processed and made available in various formats at the Global ISP
interconnectivity by AS number page of NLANR (National Laboratory of Applied Network
Research). (d) Word-adjacency networks of a text in Spanish language. Size of the
network = 11558. Data downloaded from http://www.weizmann.ac.il/mcb/UriAlon
[Download date 3rd Feb. 2005]. Data used in [34].
[Figure 2: spectral plots in panels (a) to (d); the horizontal axis of each panel runs from 0 to 2.]

Fig. 2. (a) Foodweb network from “Florida bay in wet season”. Data downloaded
from http://vlado.fmf.uni-lj.si/pub/networks/data (main data resource: Chesapeake
Biological Laboratory. Web link: http://www.cbl.umces.edu/). [Download date 21
Dec. 2006]. Network size 128. (b) Foodweb network from “Ythan estuary”. Data
downloaded from http://www.cosinproject.org. [Download Date 21 Dec. 2006]. Network
size 135. (c) The network of hyperlinks between weblogs on US politics,
recorded in 2005 by Adamic and Glance [1]. Network size 1222. Data downloaded
from http://www-personal.umich.edu/∼mejn/netdata [Download date: 23
April 2007]. (d) Neuronal connectivity of Caenorhabditis elegans. Network size 297.
Data used in [49, 50]. Data Source: http://cdg.columbia.edu/cdg/datasets [Download
date: 18 Dec. 2006]. (e) E-mail interchanges between members of the University
Rovira i Virgili (Tarragona) [21]. Network size 1133. Data downloaded from
http://deim.urv.cat/∼aarenas/data/welcome.htm [Download date: 21 March, 2007].
[Figure 2, continued: panel (e); the horizontal axis runs from 0 to 2.]

• As follows from Section 4, the presence of many small eigenvalues indicates
that the graph consists of many components that, while possibly
connected densely inside, are only very loosely connected to each other.
That is, the graph consists of many different “communities.” As indicated,
this has important dynamical implications for the synchronizability of the
graph.
• When the highest eigenvalue equals 2, or, more generally, when the spec-
trum is symmetric about 1, the graph is bipartite; see the discussion after
(6). Thus, an approximate such symmetry, or an eigenvalue very close to 2,
will indicate that the graph is close to being bipartite (we hope to present
more precise estimates elsewhere). Also, a bipartite graph can readily sup-
port period 2 oscillations of coupled dynamics, so again, there are direct
dynamical implications here. Also, when a graph is bipartite, a random
walk on it need not converge to a stationary distribution. More generally,
such convergence properties are related to the small and large (close to 2)
eigenvalues. Thus, these eigenvalues will affect the properties of random
search schemes on the underlying graph.
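
Two of the observations in this list can be checked directly on toy graphs: duplicating a node produces an eigenfunction with eigenvalue 1, and a bipartite graph has the eigenvalue 2. The sketch below assumes the normalized Laplacian in the form Δv(i) = v(i) − (1/n_i) Σ_{j∼i} v(j), which is consistent with the variational characterization (15); the two example graphs and the function names are hypothetical and chosen only for illustration.

```python
def laplacian_apply(adj, v):
    """(Delta v)(i) = v(i) - (1 / n_i) * sum of v(j) over the neighbors j of i."""
    return {i: v[i] - sum(v[j] for j in adj[i]) / len(adj[i]) for i in adj}


def is_eigenfunction(adj, v, lam, tol=1e-12):
    """Check Delta v = lam * v at every node, up to a small tolerance."""
    dv = laplacian_apply(adj, v)
    return all(abs(dv[i] - lam * v[i]) <= tol for i in adj)


# Assumed example 1: node 3 duplicates node 0 (same neighbors 1 and 2, no edge 0-3).
duplicated = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
v_dup = {0: 1.0, 1: 0.0, 2: 0.0, 3: -1.0}   # +1 on the node, -1 on its copy
print("duplication gives eigenvalue 1:", is_eigenfunction(duplicated, v_dup, 1.0))

# Assumed example 2: a 4-cycle, bipartite with classes {0, 2} and {1, 3}.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
v_bip = {0: 1.0, 1: -1.0, 2: 1.0, 3: -1.0}  # +1 on one class, -1 on the other
print("bipartite graph has eigenvalue 2:", is_eigenfunction(cycle, v_bip, 2.0))
```

In both cases the eigenfunction is written down by hand rather than computed, which keeps the check independent of any eigenvalue solver.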
In Figs. 1 through 4, we can clearly see that networks from the same
empirical domain yield similar spectral plots. Also, we can distinguish different
classes of spectral plots with specific characteristic features. A more detailed
analysis of those classes can be found in [10].
The investigation of the graph properties that can be detected from spec-
tral plots has just begun, and we expect significant advances in the detailed
understanding of classes of empirical graphs from systematic investigations of
their spectra.
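
A smoothed spectral plot of the kind discussed in this section can be produced by convolving the eigenvalues with a narrow kernel. The following sketch assumes a Gaussian kernel with width sigma = 0.03 and takes the eigenvalues as given input; both the kernel and its width are assumptions made here, since the smoothing actually used for Figs. 1 through 4 is not specified in this section. The example spectrum is that of the complete graph on four nodes.

```python
import math


def spectral_density(eigenvalues, grid, sigma=0.03):
    """Gaussian-smoothed eigenvalue density evaluated at the points of `grid`."""
    norm = 1.0 / (len(eigenvalues) * sigma * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-((x - lam) ** 2) / (2.0 * sigma ** 2))
                       for lam in eigenvalues)
            for x in grid]


# Assumed example: complete graph on 4 nodes, spectrum {0, 4/3, 4/3, 4/3}.
eigenvalues = [0.0, 4.0 / 3.0, 4.0 / 3.0, 4.0 / 3.0]
grid = [k * 0.01 for k in range(201)]            # sample points in [0, 2]
density = spectral_density(eigenvalues, grid)
peak = grid[max(range(len(grid)), key=density.__getitem__)]
print(f"highest peak near eigenvalue {peak:.2f}")
```

Plotting `density` against `grid` gives a curve on the interval [0, 2]; a smaller sigma resolves individual eigenvalues, a larger one emphasizes the overall shape.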
[Figure 3: spectral plots in four panels, (a) to (d); the horizontal axis of each panel runs from 0 to 2.]

Fig. 3. (a) Topology of the Western states power grid of the United States [49]. Network
size 4941. Data downloaded from http://cdg.columbia.edu/cdg/datasets [Download
date: 1 March 2007]. (b) Jazz band network. Nodes represent jazz bands. Two
bands are connected if the same musician played in both bands. Network size 198.
Data downloaded from http://deim.urv.cat/∼aarenas/data/welcome.htm [Download
date: 17 March 2008]. Data used in [19]. (c) Co-authorships between scientists posting
preprints on the High-Energy Theory E-Print Archive, http://arxiv.org/archive/hepth
between 1 Jan. 1995 and 31st Dec. 1999 [37]. Network size 5835. (d) Co-authorships of
scientists working on network theory and experiment [38]. Network size 379. (c,d)
Data downloaded from http://www-personal.umich.edu/∼mejn/netdata [Download
date: 23 April 2007].
[Figure 4: spectral plots in three panels, (a) to (c); the horizontal axis of each panel runs from 0 to 2.]

Fig. 4. Electronic circuits. (a) With size = 122. (b) With size = 252. (c) With
size = 512. Data downloaded from http://www.weizmann.ac.il/mcb/UriAlon
[Download date: 15 March 2005]. Data used in [33].

References
1. L.A. Adamic and N. Glance, The political blogosphere and the 2004 US election:
Divided they blog, in Proceedings of the WWW-2005 Workshop on the Weblogging
Ecosystem (2005)
2. R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Reviews of
Modern Physics 74, 2002, 47–97
3. F.M. Atay, T. Bıyıkoğlu, J. Jost, Synchronization of networks with prescribed
degree distributions, IEEE Trans. Circuits and Systems I 53(1), 2006, 92–98
4. F.M. Atay, T. Bıyıkoğlu, J. Jost, Network synchronization: Spectral versus statis-
tical properties, Phys. D 224, 2006, 35–41
5. F.M. Atay, J. Jost, A. Wende, Delays, connection topology, and synchronization
of coupled chaotic maps, Phys. Rev. Lett. 92(14), 2004, 144101
6. A. Banerjee, J. Jost, Laplacian spectrum and protein-protein interaction networks,
preprint
7. A. Banerjee, J. Jost, On the spectrum of the normalized graph Laplacian, Lin.
Alg. Appl. 428, 2008, 3015–3022
8. A. Banerjee, J. Jost, Graph spectra as a systematic tool in computational biology,
Discr. Appl. Math., to appear
9. A. Banerjee, J. Jost, Spectral plots and the representation and interpretation of
biological data, Theory Biosc. 126, 2007, 15–21
10. A. Banerjee, J. Jost, Spectral plot properties: Towards a qualitative classification
of networks, NHM 3, 2008, 395–411
11. A.-L. Barabási, R.A. Albert, Emergence of scaling in random networks, Science
286, 1999, 509–512
12. P. Blanchard, T. Krüger, The “Cameo” principle and the origin of scale-free graphs
in social networks, J. Stat. Phys. 114, 1399–1416, 2004
13. T. Bıyıkoğlu, J. Leydold, P. Stadler, Laplacian Eigenvectors of Graphs, Springer
Berlin, 2007
14. B. Bollobás, Modern Graph Theory, Springer, Berlin, 1998
15. F. Chung, Spectral Graph Theory, AMS, Providence, RI, 1997
16. F. Chung, L.Y. Lu, Complex Graphs and Networks, AMS, Providence, RI, 2006
17. S.N. Dorogovtsev, J.F.F. Mendes, Evolution of Networks, Oxford University Press,
Oxford, 2003.
18. M. Faloutsos et al., On power-law relationships of the Internet topology, SIG-
COMM, 1999.
19. P.M. Gleiser, L. Danon, Community structure in Jazz, Advances in Complex Sys-
tems (ACS) 6(4), 2003, 565–573
20. C. Godsil, G. Royle, Algebraic Graph Theory, Springer, Berlin, 2001
21. R. Guimera et al., Self-similar community structure in a network of human inter-
actions, Physical Review E 68, 2003, 065103(R)
22. M. Horton, H. Stark, A. Terras, What are zeta functions of graphs and what are
they good for? In Quantum graphs and their applications, Contemp. Math., Amer.
Math. Soc., Providence, RI, 415, 2006, 173–189
23. M. Ipsen, A.S. Mikhailov, Evolutionary reconstruction of networks, Phys. Rev. E
66(4), 046109, 2002
24. H. Jeong et al., The large-scale organization of metabolic networks, Nature 407,
2000, 651–654
25. J. Jost, Mathematical methods in biology and neurobiology, monograph, to appear
26. J. Jost, in: J.F. Feng, J. Jost, M.P. Qian (eds.), Networks: From Biology to Theory,
35–62, Springer, Berlin, 2007
27. J. Jost, M.P. Joy, Spectral properties and synchronization in coupled map lattices,
Phys. Rev. E 65(1), 2002, 016201
28. J. Jost, M.P. Joy, Evolving networks with distance preferences, Phys. Rev. E 66,
2002, 36126–36132
29. D.H. Kim, A. Motter, Ensemble averageability in network spectra, Phys. Rev.
Lett. 98, 2007, 248701
30. J. Kleinberg et al., The Web as a Graph: Measurements, Models, and Methods,
LNCS 1627, 1999, 1–17
31. P. Krapivsky, S. Redner, Network growth by copying, Phys. Rev. E 71, 2005,
036118
32. R. Merris, Laplacian matrices of graphs – A survey, Lin. Alg. Appl. 198, 1994,
143–176
33. R. Milo et al., Network motifs: Simple building blocks of complex networks, Science
298, 2002, 824–827
34. R. Milo et al., Superfamilies of evolved and designed networks, Science 303, 2004,
1538–1542
35. B. Mohar, Some applications of Laplace eigenvalues of graphs, in: G. Hahn,
G. Sabidussi (eds.), Graph Symmetry: Algebraic Methods and Applications, 227–
277, Springer, Berlin, 1997
36. R. Monasson, Diffusion, localization and dispersion relations on “small-world”
lattices, Europ. Phys. J. B 12, 1999, 555–567
37. M.E.J. Newman, The structure of scientific collaboration networks, Proc. Natl.
Acad. Sci. USA 98, 2001, 404–409
38. M.E.J. Newman, Finding community structure in networks using the eigenvectors
of matrices, Phys. Rev. E 74, 2006, 036104
39. M. Newman, The structure and function of complex networks, SIAM Review 45,
2003, 167–256
40. S. Ohno, Evolution by Gene Duplication, Springer, Berlin, 1970
41. L.M. Pecora, T.L. Carroll, Synchronization in chaotic systems, Phys. Rev. Lett.
64, 1990, 821–824
42. A. Pikovsky, M. Rosenblum, J. Kurths, Synchronization – A Universal Concept
in Nonlinear Science, Cambridge University Press, Cambridge, 2001
43. G. Rangarajan, M.Z. Ding, Stability of synchronized chaos in coupled dynamical
systems, Phys. Lett. A 296, 2002, 204–212
44. H. Simon, On a class of skew distribution functions, Biometrika 42, 1955, 425–440
45. R. Solé et al., A model of large scale proteome evolution, Adv. Compl. Syst. 5,
2002, 43–54
46. A. Vazquez et al., Modelling of protein interaction networks, ComPlexUs 1, 2003,
38–44
47. A. Wagner, How the global structure of protein interaction networks evolves, Proc.
Roy. Soc. B 270, 2003, 457–466
48. A. Wagner, Evolution of gene networks by gene duplications — A mathematical
model and its implications on genome organization, Proc. Nat. Acad. Sciences
USA 91(10), 1994, 4387–4391
49. D.J. Watts, S.H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature
393, 1998, 440–442
50. J.G. White et al., The structure of the nervous system of the nematode Caenorhab-
ditis elegans, Phil. Trans. Royal Soc. of London Series B-Bio. Sc. 314, 1986, 1–340
51. P. Zhu, R.C. Wilson, A study of graph spectra for comparing graphs. In Proc. of
British Machine Vision Conf. (BMVC), Sep 2005
52. K.H. Wolfe, D.C. Shields, Molecular evidence for an ancient duplication of the
entire yeast genome, Nature 387(6634), 1997, 708–713
Dynamics of Social Complex Networks: Some
Insights into Recent Research

Sergi Lozano

ETH Zurich, Swiss Federal Institute of Technology, UNO D11, Universitätstr. 41,
8092 Zurich, Switzerland; slozano@ethz.ch

1 Introduction: Social Networks as Complex Networks


Social networks analysis (that is, the study of interactions among social ac-
tors from a structural viewpoint) has a long tradition covering several decades
[1, 2, 3]. This sort of study has usually been performed over small social net-
works, and the limitation of size has conditioned the visibility of complexity
[4, 5]. However, the situation has changed significantly in recent times for
two basic reasons. First, there is an increasing availability of larger social
datasets (obtained in most cases from information and communication
technologies). Second, a large number of physicists and other scholars from
complexity science have started to take an active interest in the field. These
‘newcomers’ have provided new perspectives and tools which, in combination
with the expertise and knowledge accumulated by ‘classical’ social network
analysts, have formed the basis of a multidisciplinary field suitably
termed the science of networks [6, 7].
This research has made it possible to assess the complexity exhibited
by social networks formally, against the following simple ‘check list’ [5].
1. The network must consist of a large number of nodes showing substan-
tial heterogeneity. Here we understand heterogeneity to mean diversity of
degree.
2. Its structure has to present an ‘intricate architecture’, that is, a topology
that cannot be expressed in terms of simple patterns (like ‘regular’ or
‘completely random’) but must include several degrees of freedom.
3. This topological complexity is translated into the global system behavior
in the form of ‘emergent phenomena’, i.e. even simple local interaction
rules lead to a performance of the whole system that is richer than the
sum of local effects.
4. This influence of local feedbacks on the macroscopic behavior can be
manifested, in particular, as nonlinearities in the operation of the processes
that shape the network itself (i.e. the sudden emergence of certain structural
features is observed when a certain external parameter exceeds a
certain threshold value).

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_8,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

Regarding the fulfillment of this list of requirements by social networks,
Vega-Redondo refers to the results of previous studies about social structure
to confirm that social networks satisfy the first two. Following the same rea-
soning, we notice that the other two requirements (covering dynamic aspects)
are repeatedly recognized in social phenomena, for instance, collective behav-
ior and social mobilization [8, 9] (third point), or the emergence of hierarchical
social structures from interactions at an individual level [10, 11] (fourth point).
Having confirmed that social networks are indeed complex networks, we focus in
this chapter on the dynamic aspects of this complexity (the last two
points in the check list above). More concretely, we will overview some of the
recent research that addresses dynamics on and of social networks from the
perspective of complex systems. The rest of the chapter is structured as fol-
lows. The second section is devoted to works dealing, as separate topics, with
the analysis of social phenomena over static social networks and with the time
evolution of the social structure. The third section focuses on the coevolution
of social structure and phenomena, stressing the importance of this interplay
from the complexity viewpoint. Finally, the last section summarizes the whole
chapter and points out some ideas about the future evolution of the field.

2 Approaching the Dynamics on and of Social Networks Separately

The majority of recent studies on social networks, from a complexity perspective,
treat dynamics on and of social networks as different lines of research.
In the first case, each node (social actor) is considered to be a dynamical
system whose state evolves, in part, as a function of the topological features
of the underlying static social substrate. Taking into account the intricate
patterns (using the same expression as that in the Introduction) character-
izing social networks, this scenario results in the nonlinear global behaviors
already mentioned. In the second case, the whole network is considered to be
a dynamical system with a topological state that evolves according to local
rules. Investigations along this line have discovered that certain social rules at
a local (individual) interaction level can forge some of the referred ‘intricate
structural patterns’. In accordance with this scheme, we will address these
two research lines separately.

2.1 Dynamics on Social Networks

Topology is an important aspect that is always present in social dynamics [12].
Accordingly, social networks analysis has placed great importance on studying
the influence of social networks and the individual’s role in the evolution
of different social phenomena. A good example of this can be found in the
research devoted to diffusion of innovations [13, 14].
This perspective has resulted in an in-depth knowledge of the most im-
portant structural characteristics of social networks and their influence on
the behavior of the social actors, as has been recognized by scholars recently
entering the field from complexity science [6, 7] (although some ‘traditional’
social network analysts claim that this recognition by the ‘newcomers’ has been
insufficient [1]). The incorporation of these ‘newcomers’ has not changed this
orientation, but has reinforced it by contributing new analyses and modeling
methodologies.
The works ensuing from this combination of tools and perspectives have
uncovered very relevant results. Some of them, for example, have related the
emergence and resilience of cooperation in social groups to certain structural
features of their social networks, such as degree heterogeneity [15, 16, 17]
or the community structure [18]. Others have shown that scale-freeness and
the small-world phenomenon can influence the consensus time of opinions
in a population [19, 20] and even force scenarios with coexisting domains
of opposite opinions [21]. In order to further understand the various tools
and perspectives developed for explaining and modeling social networks, it is
useful to resort to the exhaustive recent reviews on game theory [22], opinion
dynamics [12, 23], language dynamics [12] or spreading phenomena [5].
Finally, as a sample of work addressing social dynamics on networks, the
first chapter of Part in this book presents a work centered on the study of
epidemic spreading [24]. In this work, the authors apply a mesoscopic (neither
individual nor global, but intermediate) structural approach to predict and
understand the spreading of an incurable disease (like HIV) over an empiri-
cal static network. First, they study the division into subnetworks or regions
of a real social network of sexual contacts obtained by means of interviews.
They also deduce qualitative predictions about infection spreading from the
observed topological features. Second, they use a computational model to test
these predictions numerically and to design possible protection strategies
suitable for this particular case. This work represents an important contribution
to the literature on disease spreading, since it highlights the analysis
and visualization possibilities of mesoscopic approximations.

2.2 Dynamics of Social Networks

The second separated approach that we are going to consider in this section
is based on the study of network processes, that is, “series of events that cre-
ate, sustain and dissolve social structures” [25]. Logically, this sort of study
requires the use of time in addition to the structural description. However,
in the past, social networks analysis mainly focused on the study of static social
networks and their influence over individual and collective behavior [6].
Borgatti [26] argues that one of the reasons behind such an orientation was the
difficulty of obtaining longitudinal empirical data. As has been pointed out in the
Introduction, this scenario has changed lately with the increasing availability
of large social datasets obtained from different information and communi-
cation technologies (email traffic, mobile phone calls, activities within peer-
to-peer systems, social media and social networking websites, etc.). Taking
advantage of this new availability, scholars have developed different method-
ologies to understand the evolution of social networks using these data as
input [25, 27, 28].
The (generally) large size of these datasets has given rise to especially in-
teresting applications from a complex network perspective. On one side, we
find works that try to deduce the basic mechanisms ruling social network pro-
cesses. To do that, the authors analyze the evolution of these social datasets
from a statistical point of view (macroscopic level) [4], focusing on their mod-
ular structure (mesoscopic or intermediate level) [29, 30, 31], or addressing
key individual properties such as centrality (microscopic level) [32].
On the other side, following the example of seminal works by Watts and
Strogatz [33] and Barabási and Albert [34], datasets are also used to validate
simple models based on single mechanisms that forge complex social-like features.
In these works, empirical data are compared with the models’ simulations
in terms of structural parameters at different topological scales. For
example, some of these works present extensions of Barabási’s preferential
attachment models and are focused on the degree distribution [35, 36]. Oth-
ers present variants of the seceder model (where the mechanism conditioning
topological evolution is based on each agent’s efforts to differentiate from the
crowd) [37]. Finally, in Ref. [38] the authors propose a model where each agent
is assigned a set of social values (representing different social attributes), and
ties are established as a function of the social distances among agents
(differences between their social attributes) and α, a parameter quantifying the
homophily in the system (the individuals’ preference to establish and maintain
links with other individuals they feel similar to). Interestingly, for different
values of α the resulting social network presents different modular structures,
while preserving general topological features of social networks (such as as-
sortativity or high clustering).

3 Coevolution: Social Networks and Phenomena


The separation into two different lines of research presented in the previous
section has been the common approach to social complex networks until
recently. However, real-life observations suggest that there is
normally a certain interdependency between the evolution of the social
structure and the behavior of each of the social actors [39]. Consider a
friendship network as an example. On one side, friendship relationships (net-
work links) are the path used, for instance, to cooperate, inform or imitate
behaviors. Thus, the structure conditions different social processes related to
these actions (like cooperation and diffusion of habits, for example). On the
other side, the stronger the friendship relation between two people, the more
probable it is that they introduce each other to new friends, modifying their mutual
‘friendship local neighborhood’ and, consequently, the whole structure of
the network. In general, networks exhibiting such a feedback loop are called
coevolutionary or adaptive networks [40].
This interdependency has clear implications from a complexity point of
view. If structural patterns of social networks can induce nonlinearity in social
phenomena evolving over them and, likewise, social network processes forge
the emergence of complex structural features, a coevolutive scheme has to lead,
necessarily, to scenarios exhibiting extremely rich behaviors. In their recent
review on adaptive networks, Gross and Blasius [40] support this assertion by
reporting a list of four ‘hallmarks’ typically presented by adaptive networks
in general (and social networks in particular):
• Self-organization towards a dynamical critical state.
• Emergence of ‘specialized’ roles from an initially homogeneous population.
• Formation of complex global topologies (even from very simple local rules).
• Highly complex macroscopical dynamics due to the interaction of local
states and topological complexity.
In the following, we will review some recent works that have addressed
interesting sociological topics from an adaptive networks perspective. We will
also identify some of the preceding hallmarks in the examples reviewed.

3.1 Cooperation in Coevolutive Models

In Ref. [41], Skyrms and Pemantle claim to “(..) create models that are more
true to life (..)” by incorporating coevolution between structure and strategies
in evolutionary game theory models. Since then, some authors have proposed
models where players’ strategies depend on the structure but, at the same
time, they can modify the connectivity in their local neighborhood in order to
maximize the payoff of a certain strategy (modifying, as an aggregated effect,
the whole topology at the macroscopic level).
Cooperation among individuals and, more concretely, the evolution of one-
shot versions of the Prisoner’s Dilemma played over adaptive networks, have
been intensively studied. In the results of these works, we can find some of the
four ‘complexity hallmarks’ in coevolving networks listed previously. For example,
in some cases the authors identify the formation of scale-free topologies
(which present a power-law degree distribution) [42, 43] and the emergence of
differentiated roles and hierarchies [42, 44, 45]. Moreover, regarding system dynamics,
Ebel and Bornholdt [46], Eguiluz and co-workers [42] and Zimmermann
and Eguiluz [44] report large avalanches of strategy changes when the system
approaches the final state, identifying a sort of self-organized critical behavior.
As a particular case, [47] analyzes a scenario where topological changes
occur much faster than changes of individuals’ strategies. The authors find
that the evolution of individual strategies in this situation no longer corresponds
to the Prisoner’s Dilemma, but to a sort of coordination game, leading
to a situation more favorable to cooperation. This result highlights the effect
of separating the time scales of structural and individual dynamics. Notice
that the scenarios considered in the previous section (with static networks
or nonevolving nodes) can be seen as particular extreme cases of coevolving
networks with completely separated time scales (i.e. one of the two time scales
is so large compared with the other that it is not considered). In accordance
with the importance of the relation between the two different time scales in
coevolutive scenarios, we find (as we will see in the following subsections) sev-
eral works that analyze this influence and that consider the cases with one
aspect (network or individuals) static as bounding cases.

3.2 Communication and Diffusion of Information in Social Networks

The interplay between communication within a population of socioeconomic
agents and its underlying social structure is an interesting social topic that
deserves further study [48]. Taking business relationships as an example, an
agent would presumably like to occupy a network position that is as strategic
as possible in terms of information reception and processing (close to the
other agents in terms of average distance or with a high betweenness, for
instance). Moreover, since the socioeconomical environment is usually volatile
(keeps changing), actors need to be continuously looking for better contacts
and ‘fresh opportunities’ [49, 50].
Taking into account such a dynamical scenario, where the “who communi-
cates with whom” and the social structure are strongly entangled, this issue is
especially suitable to be studied from a coevolutive viewpoint. Following this
perspective, we can find recent works focused on an individual’s movements
across the social structure to reach strategic positions while minimizing linking
costs [51, 52], or works targeting key positioned individuals [48]. Other authors
investigate the impact of communication on social structure both quantitatively
(more or less communication) and qualitatively (different communication
strategies) [53].
In general, the models employed in these works generate social structures
that present complex patterns like modular structures [48]. Furthermore, some
of these works report interesting behaviors of the modeled system like self-
organization to states close to the transition between fragmented and ordered
states [51], sharp phase transitions and resilience of the structure [49, 50].

3.3 Opinion and Cultural Dynamics

Opinion and cultural dynamics are other important social topics which have
been addressed from a coevolutive viewpoint.
Centola and co-workers [54] presented a coevolutionary version of
Axelrod’s model on dissemination of culture [55]. As in the seminal model by
Axelrod, they represent cultural traits and features by numerical values that
are transmitted (copied) among individuals in contact, with the difference
that the topology of interactions among individuals also evolves. More con-
cretely, agents can erase and rewire links to neighbors with whom they have
no common social trait (i.e. the affinity among them is 0). The model presents
a complex relationship between heterogeneity and cultural diversity, in which
a high diversity can reduce cultural group formation while simultaneously
increasing social connectedness.
The coevolutive approach has also been used in several recent works ad-
dressing opinion formation processes. In Refs. [56, 57, 58, 59], the authors propose
coevolutive versions of the two-state voter model [60] to study consensus in
populations’ opinions. In this kind of model, interactions between agents are
enhanced or penalized (or even broken) according to whether they succeed in
reaching an agreement or not. From a complex network point of view, these
models are used to explore the transition between different states, with a spe-
cial interest in the emergence and duration of metastable states reached before
the consensus.
Another model of opinion formation based on a coevolutive approach that
has received considerable attention is proposed in [61]. This model is espe-
cially interesting regarding time-scale separation. In each time step, a rewiring
(structural change) or an opinion imitation (evolution of local state) occurs
with certain probabilities φ and 1 − φ, respectively. Therefore, by tuning φ
the authors can easily recover one of the two extremal single-evolving cases
(static network or nonevolving nodes) or travel along different intermediate
scenarios. By studying the whole range of possible situations, the authors find
that the model undergoes a continuous phase transition as φ is varied, from a
regime in which opinions are highly diverse to one in which most individuals
hold the same opinion.
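
The update rule described above can be illustrated with a minimal Python sketch. It is not the implementation used in [61]: the way the rewired link is chosen, the restriction of new links to like-minded nodes, and all names (random_graph, coevolving_step, simulate) are assumptions made for illustration. Only the split between rewiring with probability φ and imitation with probability 1 − φ is taken from the text.

```python
import random

def random_graph(n, m, rng):
    """Undirected simple graph with n nodes and m random edges (adjacency sets)."""
    adj = {v: set() for v in range(n)}
    edges = 0
    while edges < m:
        a, b = rng.sample(range(n), 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            edges += 1
    return adj

def coevolving_step(adj, opinion, phi, rng):
    """One update: rewire a link with probability phi, otherwise imitate."""
    nodes = sorted(adj)
    i = rng.choice(nodes)
    if not adj[i]:
        return  # isolated node, nothing to update
    j = rng.choice(sorted(adj[i]))
    if rng.random() < phi:
        # Structural change (assumed rule): move the i-j link to a node
        # that shares i's opinion and is not yet a neighbour of i.
        candidates = [k for k in nodes
                      if k != i and k not in adj[i] and opinion[k] == opinion[i]]
        if candidates:
            k = rng.choice(candidates)
            adj[i].remove(j)
            adj[j].remove(i)
            adj[i].add(k)
            adj[k].add(i)
    else:
        # Local state change: i copies the opinion of neighbour j.
        opinion[i] = opinion[j]

def simulate(adj, opinion, phi, steps, rng):
    """Apply the update rule repeatedly, mutating adj and opinion in place."""
    for _ in range(steps):
        coevolving_step(adj, opinion, phi, rng)
```

Setting phi = 0 leaves the network untouched (only opinions evolve), whereas phi = 1 freezes the opinions (only links evolve); these are the two extremal cases mentioned above. The number of links is conserved in both regimes.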

3.4 Spreading Phenomena


Last but not least, there is a growing literature on spreading (epidemics,
diseases, infections) phenomena from an adaptive network perspective.
We find an example of this in the series of works proposing and analyzing
an adaptive version of the SIS (Susceptible-Infected-Susceptible) model, where
susceptible individuals try to avoid infection by erasing their links with the
infected population [62, 63, 64]. This sort of work analyzes how different levels
of rewiring modify the dynamics of the adaptive SIS model (note that this
implies, once again, studying the effect of having separated time scales). One
common observation is that high levels of rewiring lead to the self-organization
of the susceptible population into a unique, densely connected cluster. In the
case of eventual infection of an individual in the cluster, this sort of organi-
zation favors a rapid spreading of the disease, which is seen as an avalanche
of state change from a macroscopic viewpoint.
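
As a rough illustration of how such an adaptive SIS model can be simulated, the following Python sketch performs one synchronous update. It is a simplified reading of the models in [62, 63, 64], not their actual implementation: the parameter names (p for infection, r for recovery, w for rewiring), the rule that a rewired link is reattached to a random susceptible non-neighbour, and the function name adaptive_sis_step are all assumptions.

```python
import random

def adaptive_sis_step(adj, infected, p, r, w, rng):
    """One synchronous update; mutates adj and returns the new infected set."""
    nodes = sorted(adj)
    new_infected = set(infected)
    # All susceptible-infected links present at the start of the step.
    si_links = [(s, i) for s in nodes if s not in infected
                for i in sorted(adj[s]) if i in infected]
    for s, i in si_links:
        if rng.random() < w:
            # Rewiring: s drops the link to i and connects to a random
            # susceptible node that is not yet its neighbour (assumed rule).
            pool = [k for k in nodes
                    if k not in infected and k != s and k not in adj[s]]
            if pool:
                k = rng.choice(pool)
                adj[s].discard(i)
                adj[i].discard(s)
                adj[s].add(k)
                adj[k].add(s)
        elif rng.random() < p:
            # Infection along a link that was not rewired.
            new_infected.add(s)
    for i in sorted(infected):
        if rng.random() < r:
            new_infected.discard(i)  # recovery: back to susceptible
    return new_infected
```

With w = 0 the sketch reduces to an SIS process on a static network; increasing w shortens the time scale of the structural dynamics relative to that of the infection, which is the separation of time scales discussed above.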
We also mention the work presented in [65], where the authors propose an
innovative coevolutionary model of HIV infection spreading through the use
of dynamic complex networks. On one hand, the state of each individual (her
health situation) is determined by means of a Markov process that takes into
account both topological data (such as the number of infected neighbours)
and information regarding the HIV infections (probability of infection and
progression from HIV to AIDS, for instance). On the other hand, the social
structure of the population is defined at each time step as a function of certain
statistical features and the state of nodes (nodes with AIDS are removed from
the network). The authors find a good correspondence between simulation
results and real demographic historical epidemiological data from the United
States. Moreover, this epidemiological prediction model could be integrated
in related decision support systems (regarding anti-drug policy, for instance).

4 Conclusions

Summarizing, the analysis of social networks’ dynamics has proved to
be an outstanding application of complex network theory, as is demon-
strated by the huge (and increasing) amount of work developed in the field
during recent years. Two factors have contributed decisively to this success:
the availability of large longitudinal social datasets obtained from communi-
cation technologies, and the massive integration of scientists from complexity
science (especially physicists) into social network analysis.
In this chapter, we have provided a general view of recent research in
this area. Following the evolution of the literature in the field, we have first
referred to works treating dynamics on and of social networks separately,
and later have addressed a more recent approach integrating both sorts of
dynamics in a coevolutive scheme. In both cases, but especially in the latter,
we have also echoed the results reported by authors regarding some of the
points on Vega-Redondo’s checklist cited in the Introduction (emergence
of nontrivial structural patterns, nonlinear macroscopic behaviors induced
by local processes, etc.). When talking about coevolutive models, we have also
stressed the effect of having more or less separated time scales for dynamics
on and of social networks.
Finally, research on the dynamics of social complex networks is expected
to keep growing, as the availability of datasets is increasing and the field
continues to attract scholars. Nevertheless, to sustain this growth, issues
like the ethical implications of social data collection and analysis [66, 67]
and the integration of different disciplines and perspectives within the
aforementioned science of networks should be seriously addressed.

References
1. Freeman, L.C.: The Development of Social Network Analysis: A Study in the
Sociology of Science. Empirical Press, Vancouver (BC Canada) (2004).
2. Scott, J.: Social Network Analysis: A Handbook. SAGE Publications, London
(2000).
3. Wasserman, S., Faust, K.: Social Networks Analysis: Methods and Applications.
Cambridge University Press, New York (1994).
4. Holme, P., Edling, C.R., Liljeros, F.: Structure and time evolution of an Internet
dating community. Social Networks 26, 155–174 (2004).
5. Vega-Redondo, F.: Complex Social Networks. Cambridge University Press, New
York (2007).
6. Watts, D.J.: Six Degrees: The Science of a Connected Age. W. W. Norton &
Company Inc., New York (2003).
7. Barabási, A.-L.: Linked: The New Science of Networks. Perseus Publishing,
Cambridge (USA) (2002).
8. Coleman, J.: Foundations of Social Theory. Harvard University Press, Cambridge,
MA (1990).
9. Gould, R.V.: Collective action and network structure. American Sociological Re-
view 58 (2), 182–196 (1993).
10. Gould, R.V.: The origins of status hierarchies: A formal theory and empirical test.
American Journal of Sociology 107 (5), 1143–78 (2002).
11. Epstein, J.M.: Generating classes without conquest. In: Generative Social Science:
Studies in Agent-Based Computational Modeling. Princeton University Press,
Princeton, NJ (2007).
12. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics.
Reviews of Modern Physics (Accepted) 348 (2008).
13. Rogers, E.M.: Diffusion of Innovations (5th ed.). Free Press, New York (2003).
14. Valente, T.W.: Models and methods for innovation diffusion. In: Carrington, P.,
Scott, J., Wasserman, S. (ed) Models and Methods in Social Network Analysis.
Cambridge University Press, New York (2005).
15. Abramson, G., Kuperman, M.: Social games in a social network. Phys. Rev. E 63,
030901 (2001).
16. Duran, O., Mulet, R.: Evolutionary prisoner’s dilemma in random graphs. Physica
D 208 (3–4), 257–265 (2005).
17. Santos, F.C., Pacheco, J.M., Lenaerts, T.: Evolutionary dynamics of social dilem-
mas in structured heterogeneous populations. Proc. Natl. Acad. Sci. 103, 3490–
3494 (2006).
18. Lozano, S., Arenas, A., Sanchez, A.: Mesoscopic structure conditions the emer-
gence of cooperation on social networks. PLoS ONE 3(4): e1892 doi: 10.1371/
journal.pone.0001892 (2008).
19. Castellano, C., Loreto, V., Barrat, A., Cecconi, F., Parisi, D.: Comparison of voter
and Glauber ordering dynamics on networks. Phys. Rev. E 71 (6), 066107 (2005).
20. Sood, V., Redner, S.: Voter model on heterogeneous graphs. Phys. Rev. Lett. 94
(17), 178701 (2005).
21. Castellano, C., Vilone, D., Vespignani, A.: Incomplete ordering of the voter model
on small-world networks. Europhys. Lett. 63 (1), 153–158 (2003).
22. Szabó, G., Fáth, G.: Evolutionary games on graphs. Phys. Rep. 446 (4–6), 97–216
(2007).
23. Stauffer, D.: Sociophysics Simulations II: Opinion Dynamics.
arXiv:physics/0503115v1 [physics.soc-ph] (2005).
24. Bjelland, J., Canright, G., Engø-Monsen, K., Remple, V.P.: Topographic spreading
analysis of an empirical sex workers network. In: (ed). Springer, Berlin (2008).
25. Doreian, P., Stokman, F.N. (ed): Evolution of Social Networks. Routledge, London
(1997).
26. Borgatti, S.P.: The State of Organizational Social Network Research Today. Dept.
of Organization Studies. Boston College, Boston, MA (2003).
27. Snijders, T.A.B.: Models for longitudinal network data. In: Carrington, P.,
Scott, J., Wasserman, S. (ed) Models and Methods in Social Network Analysis.
Cambridge University Press, New York (2005).
28. Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of Networks: From Biological Nets
to the Internet and WWW. Oxford University Press, Oxford (2003).
29. Palla, G., Barabási, A-L., Vicsek, T.: Quantifying social group evolution. Nature
446 (5), 664–667 (2007).
30. Eckmann, J.-P., Moses, E., Sergi, D.: Entropy of dialogues creates coherent struc-
tures in e-mail traffic. PNAS 101 (40), 14333–14337 (2004).
31. Onnela, J.-P., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., Kertész,
J., Barabási, A.-L.: Structure and tie strengths in mobile communication networks.
PNAS 104 (18), 7332–7336 (2007).
32. Braha, D., Bar-Yam, Y.: From centrality to temporary fame: Dynamic centrality
in complex networks. Complexity 12 (2), 59–63 (2006).
33. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature
393, 440–442 (1998).
34. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science
286, 509–512 (1999).
35. Jin, E.M., Girvan, M., Newman, M.E.J.: Structure of growing social networks.
Phys. Rev. E 64, 046132 (2001).
36. Roth, C.: Generalized Preferential Attachment: Towards Realistic Social Network
Models. ISWC 4th Intl Semantic Web Conference. (2005).
37. Grönlund, A., Holme, P.: Networking the seceder model: Group formation in social
and economic systems. Phys. Rev. E 70, 036108 (2004).
38. Boguñá, M., Pastor-Satorras, R., Díaz-Guilera, A., Arenas, A.: Models of social
networks based on social distance attachment. Phys. Rev. E 70, 056122 (2004).
39. Lazer, D.: The co-evolution of individual and network. J. Math. Sociol. 25, 69–108
(2001).
40. Gross, T., Blasius, B.: Adaptive coevolutionary networks: A review. J. R. Soc.
Interface 5 (20), 259–271 (2007).
41. Skyrms, B., Pemantle, R.: A dynamic model of social network formation. Proc.
Nat. Acad. Sci. 97 (16), 9340–9346 (2000).
42. Eguiluz, V.M., Zimmermann, M.G., Cela-Conde, C.J., San Miguel, M.: Coopera-
tion and the emergence of role differentiation in the dynamics of social networks.
AJS 110 (4), 977–1008 (2005).
43. Biely, C., Dragosits, K., Thurner, S.: The prisoner’s dilemma on co-evolving net-
works under perfect rationality. Physica D 228, 40–48 (2007).
44. Zimmermann, M.G., Eguíluz, V.M.: Cooperation, social networks, and the emer-
gence of leadership in a prisoner’s dilemma with adaptive local interactions. Phys.
Rev. E 72, 056118 (2005).
45. Zimmermann, M.G., Eguíluz, V.M., San Miguel, M.: Coevolution of dynamical
states and interactions in dynamic networks. Phys. Rev. E 69, 065102(R) (2004).
46. Ebel, H., Bornholdt, S.: Coevolutionary games on networks. Phys. Rev. E 66,
056118 (2002).
47. Pacheco, J.M., Traulsen, A., Nowak, M.A.: Coevolution of strategy and structure
in complex networks with dynamical linking. Phys. Rev. Lett. 97, 258103 (2006).
48. Rosvall, M., Sneppen, K.: Dynamics of opinions and social structures.
arXiv:0708.0368v2 [physics.soc-ph] (2007).
49. Marsili, M., Vega-Redondo, F., Slanina, F.: The rise and fall of a networked society:
A formal model. Proc. Nat. Acad. Sci. 101, 1439–1442 (2004).
50. Ehrhardt, G.C.M.A, Marsili, M., Vega-Redondo, F.: Phenomenological models of
socioeconomic network dynamics. Phys. Rev. E 74, 036106 (2006).
51. Holme, P., Ghoshal, G.: Dynamics of networking agents competing for high cen-
trality and low degree. Phys. Rev. Lett. 96, 098701 (2006).
52. König, M.D, Battiston, S., Napoletano, M., Schweitzer, F.: On algebraic graph
theory and the dynamics of innovation networks. Networks and Heterogeneous
Media 3 (2) 201–220 (2007).
53. Rosvall, M., Sneppen, K.: Modeling self-organization of communication and topol-
ogy in social networks. Phys. Rev. E 74, 016108 (2006).
54. Centola, D., González-Avella, J.C., Eguíluz, V.M., San Miguel, M.: Homophily,
cultural drift, and the co-evolution of cultural groups. J. of Conflict Resolution 51
(6), 905–929 (2007).
55. Axelrod, R.: The dissemination of culture: A model with local convergence and
global polarization. The Journal of Conflict Resolution 41 (2), 203–226 (1997).
56. Benczik, I.J., Benczik, S.Z., Schmittmann, B., Zia, V.: Lack of consensus in social
systems. EPL 82, 48006 (2007).
57. Vázquez, F., Eguíluz, V.M., San Miguel, M.: Generic absorbing transition in co-
evolution dynamics. Phys. Rev. Lett. 100, 108702 (2007).
58. Zanette, D.H., Gil, S.: Opinion spreading and agent segregation on evolving net-
works. Phys. D 224, 156–165 (2006).
59. Gil, S., Zanette, D.H.: Coevolution of agents and networks: Opinion spreading and
community disconnection. Phys. Lett. A 356, 89–95 (2006).
60. Liggett, T.M.: Interacting Particle Systems. Springer, New York (1985).
61. Holme, P., Newman, M.E.J.: Nonequilibrium phase transition in the coevolution
of networks and opinions. Phys. Rev. E 74, 056108 (2006).
62. Gross, T., D’Lima, C.J.D., Blasius, B.: Epidemic dynamics on an adaptive net-
work. Phys. Rev. Lett. 96, 208701 (2006).
63. Gross, T., Kevrekidis, I.G.: Coarse-graining adaptive coevolutionary network dy-
namics via automated moment closure. arXiv:nlin/0702047v1 [nlin.AO] (2007).
64. Zanette, D.: Coevolution of agents and networks in an epidemiological model.
arXiv:0707.1249v2 [physics.soc-ph] (2007).
65. Sloot, P.M.A., Ivanov, S.V., Boukhanovsky, A.V., Vijver, D., Boucher, C.A.:
Stochastic simulation of HIV population dynamics through complex network mod-
eling, Int. J. of Computer Mathematics 85 (8), 1175–1187 (2008).
66. Borgatti, S.P., Molina, J.L.: Toward ethical guidelines for network research in
organizations. Social Networks. 27 (2), 107–117 (2005).
67. Birnbaum, M.H.: Methodological and ethical issues in conducting social psychol-
ogy research via the Internet. In: Sansone, C., Morf, C.C., Panter, A.T. (ed) Hand-
book of Methods in Social Psychology. Sage, Thousand Oaks, CA (2004).
The Structure and Dynamics of Linguistic
Networks

Monojit Choudhury (1) and Animesh Mukherjee (2)

(1) Microsoft Research India, Sadashivnagar, Bangalore, India – 560080
monojitc@microsoft.com
(2) Department of Computer Science and Engineering,
Indian Institute of Technology, Kharagpur, India – 721302
animeshm@cse.iitkgp.ernet.in

1 Introduction
Human beings are unique in the biological world, for they
are the only organisms known to be capable of thinking, communicating and
preserving a potentially infinite number of ideas that form the pillars of
modern civilization. This unique ability is a consequence of the complex and
powerful human languages characterized by their recursive syntax and compo-
sitional semantics [40]. It has been argued that language is a dynamic complex
adaptive system that has evolved through the process of self-organization to
serve human communication needs [80]. The complexity of hu-
man languages has always attracted the attention of physicists, who have tried
to explain several linguistic phenomena through models of physical systems
(see e.g., [32, 42]).
Like any physical system, a linguistic system (i.e., a language) can be
viewed from three different perspectives [52]. On one extreme, a language is a
collection of utterances that are produced by the speakers of a linguistic com-
munity during the course of their interactions with other speakers of the same
community. This is analogous to the microscopic view of a thermodynamic
system, where every utterance and its corresponding context contributes to
the identity of the language, i.e., the grammar. On the other extreme, a lan-
guage can be characterized by a set of grammar rules and a vocabulary. This
is analogous to a macroscopic view. Sandwiched between these two extremes,
one can also conceive of a mesoscopic view of language, where linguistic enti-
ties, such as the letters, words or phrases are the basic units and the grammar
is an emergent property of the interactions among them.
Complex networks provide a suitable framework to model and study the
structure and dynamics of linguistic systems from a mesoscopic perspec-
tive. Although multi-agent simulation is the preferred modeling paradigm for
microscopic studies in linguistics (see e.g., [15, 80]), there have been some
works where networks are also involved. For instance, in [67], the interaction
patterns between the agents are modeled as a social network, and the diffusion
of linguistic innovations (which are key to language change) is studied on
various network topologies. This survey is confined to works on linguistic
networks at the mesoscopic level.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_9,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

There has been a plethora of works on linguistic networks with various
motivations and at various levels of linguistic structure. On the basis of the
primary goal of the research, the work in this area can be broadly classified
into two categories: (1) those which investigate the structural properties of
language from the perspective of language evolution and, thereby, explain
the emergence of certain universal characteristics of languages, and (2) those
which try to exploit the network-based representations to develop certain
useful practical systems such as machine translation, information retrieval
and summarization systems. This article focuses on the former, but a
brief overview of the latter is also presented in Section 5.
The survey is organized from the perspective of linguistic structure.
Section 2 describes lexical networks, where the nodes are words and edges
represent the lexical relationship between two words such as phonetic and
semantic similarity. In Section 3 we present an overview of various networks
where again the nodes are the words, but unlike the case of lexical networks,
the edges represent their co-occurrences in similar context. These networks
are representations of the interactions among words as governed by the gram-
mar rules of a language. Section 4 describes the phonological networks, where
the nodes are sub-lexical units such as phonemes or syllables. Applications
of linguistic networks in natural language processing (NLP) and information
retrieval (IR) are discussed in Section 5. Section 6 concludes the survey by
enumerating some open problems in the area of linguistic networks.

2 Lexical Networks
The phrase “mental lexicon” (ML) usually refers to the repository of word
forms that is assumed to reside in the human brain. The average size of the
receptive vocabulary for a normal high school student has been found to be
more than 100,000 words [63]. Quite surprisingly, speakers are capable of navigating
this huge lexicon very efficiently; judging whether a word form is
legitimate takes less than 100 milliseconds. Consequently, two important
questions arise about ML: (a) how the words are stored
in the long-term memory, i.e., how ML is organized, and (b) how these words
are retrieved from ML. Note that these questions are highly interrelated—to
predict the organization one can investigate how words are retrieved from ML
and vice versa.
One of the earliest attempts to model the organization of ML was made
in [13]. In this work, the authors propose a hierarchical structure of ML, where
the concepts are arranged in the form of a tree and the attributes of a partic-
ular concept in this tree can be inherited by all the child concepts. Figure 1
shows a representative example formed from the concepts “animal”, “mam-
mal” and “fish”. While early studies like [13] focused mainly on representation
of the local structure of ML, its global structure remained largely unexplored.

Fig. 1. The hierarchical structure of ML.

Recently, researchers have also started to investigate the global structure of
ML primarily within the framework of complex systems and, more specifically,
complex networks (see [36, 45, 77, 83, 86] for reference). In all of these studies
ML is modeled as a web of interconnected nodes, where each node corresponds
to a word form and the interconnections may be based on any one (or more)
of the following:
• Phonological similarity (e.g., the words banana, bear and bean may be
connected since they start with the same phoneme),
• Semantic similarity (e.g., the words banana, apple and pear may be con-
nected since all of them are names of fruits),
• Frequency of usage,
• Age at which the word forms are acquired,
• Parts of speech, and
• Orthographic properties.
In the rest of this section we review one representative study each (refer-
ring, wherever applicable, to the other relevant ones) of such complex networks
constructed based on (a) phonological, (b) semantic, and (c) orthographic
similarities of the word forms. Syntactic similarity-based networks will be dis-
cussed in detail in the next section.

2.1 Phonological Similarity-Based Networks

Phonological similarity among the word forms has been extensively studied
in the past to infer the structure of ML and, consequently, the nature of a
linguistic system [4, 35, 71, 81]. This large-scale phonological ML has also
been studied in the framework of complex networks in which the word forms
represent the nodes and two nodes (read words) are connected by an edge
if they differ only by the addition, deletion or substitution of one or more
phonemes [36, 45, 83, 86]. Reference [45] reports one of the most popular studies, where
the author constructs a phonological neighborhood network (PNN) in order
to uncover the organizing principles of ML. In PNN there is an edge (u, v)
connecting the nodes u and v iff at least two-thirds of the phonemes that
occur in the word represented by u also occur in the word represented by
v. For instance, if the word is 6 phonemes long, then one can derive all its
neighbors by changing at most two phonemes through insertions, deletions,
and substitutions.
The author uses the Hoosier Mental Lexicon database [68] and builds the
above network from the phonologically transcribed forms of each word present
in the database. More specifically, he constructs a directed network, where a
long word can have a short word as its neighbor without the short word
being the neighbor of the long word. For instance, if the number of segments
in which the two words, say w1 and w2 , differ is less than 1/3 of the length
of w1 , then there will be a directed edge from the node corresponding to w1
to the node corresponding to w2 . The fraction 1/3 is chosen, because it has
been useful in earlier experiments for predicting reaction times and familiarity
ratings (see [53] for reference).
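
The directed construction can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: “the number of segments in which the two words differ” is read as the edit distance (insertions, deletions and substitutions) between phoneme sequences, the toy lexicon and its one-symbol-per-phoneme transcriptions are hypothetical, and the function names are not taken from [45].

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def pnn_edges(lexicon):
    """Directed edges w1 -> w2 when the distance is below 1/3 of len(w1)."""
    edges = set()
    for w1, p1 in lexicon.items():
        for w2, p2 in lexicon.items():
            # Integer form of: edit_distance < len(p1) / 3
            if w1 != w2 and 3 * edit_distance(p1, p2) < len(p1):
                edges.add((w1, w2))
    return edges

# Hypothetical toy lexicon: word -> tuple of phoneme symbols.
toy_lexicon = {
    "strands": tuple("strandz"),
    "stand": tuple("stand"),
    "cat": tuple("kat"),
}
toy_edges = pnn_edges(toy_lexicon)
```

In this toy lexicon the 7-phoneme word has the 5-phoneme word as a neighbor (distance 2 is below 7/3), but not vice versa (2 is not below 5/3), which reproduces the asymmetry between long and short words described above.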
The author shows that PNN is characterized by a very high clustering
coefficient (0.235) but at the same time exhibits a long average path length
(6.06) and diameter (20). This indicates that, like a small-world network,
the lexicon has many densely interconnected neighborhoods. However, unlike
in a small-world network, links between two nodes from different neighborhoods
are harder to find.
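
The quantities quoted above (clustering coefficient, average path length and diameter) can be computed for any network with a short Python sketch. The code below is a generic illustration, not a reproduction of the analysis in [45]: it treats the network as undirected and connected, averages the local clustering coefficients over all nodes, and uses hypothetical function names and an adjacency-set representation.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj):
    """Average of the local clustering coefficients (0 for degree < 2)."""
    total = 0.0
    for v, nb in adj.items():
        k = len(nb)
        if k >= 2:
            links = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
            total += 2 * links / (k * (k - 1))
    return total / len(adj)

def path_stats(adj):
    """Average shortest-path length and diameter over reachable pairs."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for t, d in dist.items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return total / pairs, diameter

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
toy_clustering = clustering_coefficient(toy_graph)
toy_avg_path, toy_diameter = path_stats(toy_graph)
```

A network such as PNN combines a high value of the first quantity with comparatively large values of the other two.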
Low mean path lengths are necessary in networks that are to be traversed
quickly; the purpose of traversal being search in most cases. However, in
the case of ML, the search should not inhibit the neighbors of the stimulus
neighbors that are non-neighbors of the stimulus itself and are, therefore, not
similar to the stimulus. Hence, it can be conjectured that, in order to search in
PNN, traversal of links between distant nodes is usually not required. In con-
trast, the search involves an activation of the structured neighborhoods that
share a single sub-lexical chunk, which could be acoustically related during
word recognition [55].
Further, the author shows that the degree distribution of the nodes in PNN
is exponential rather than scale free. Thus, one can posit that the structure
of ML is not consistent with “growth via preferential attachment”—at least
for the neighborhood density metrics used for this study. The reason is that, in
the standard preferential attachment model, the emergent degree distribution
of the network is known to be scale free [5]. The cause for the emergence of
the exponential degree distribution for PNN is not yet well understood and is
quite an open area for further research.

2.2 Semantic Similarity-Based Networks

One of the classic examples of semantic similarity-based networks is the
WordNet [20]. In this network, concepts (known as synsets) are the nodes,
and semantic relationships between them are represented through the edges.
In [77] the authors analyze the structure of the nouns in the English WordNet
database (version 1.6). The semantic relationships between the nouns can
be primarily of four types: (i) hypernymy/hyponymy (e.g., animal/cat), (ii)
antonymy (e.g., day/night), (iii) meronymy/holonymy (e.g., trunk/tree) and
(iv) polysemy (e.g., the concepts “the main stem of a tree”, “the body exclud-
ing the head and neck and limbs”, “a long flexible snout as of an elephant”
and “luggage consisting of a large strong case used when traveling or for stor-
age” are connected to each other due to the polysemous word “trunk” which
can mean all of these). Some of the important findings of this work are as
follows.
• Semantic relationships are scale invariant.
• The hypernymy tree forms the skeleton of the network.
• Inclusion of polysemy reorganizes the network into a small world.
• The nodes with the most traffic (i.e., nodes with the maximum number
of paths passing through them) correspond to those concepts which are
expressed by the most polysemous words. They are also found to have very
high clustering coefficients.
• In the presence of polysemous edges, the distance between two nodes across
the network is not in correspondence with the depth at which they are
found in the hypernymy tree.
Further references to the studies on such semantic relationship-based networks
can be found in [1, 82]. Although there are several works attempting to analyze
the structure of the semantic network of words, one hardly finds any study
explaining the emergence of these topological properties through models of
network synthesis. It would be very interesting to study the correlates of
semantic acquisition and symbol grounding with the model parameters.

2.3 Orthographic Similarity-Based Networks

Like phonological similarity networks, one can also construct networks based
on orthographic similarity, where the nodes are the words and the edit distance
between two words defines the edge weight between the nodes corresponding
to them. Such networks have been studied in order to investigate the diffi-
culties involved in spelling error detection and correction [11]. In this work
the authors construct such networks (SpellNet) for three different languages
(Bengali, Hindi and English) and analyze them to show the following.
• For a particular language, the probability of real word errors can be
equated to the average weighted degree of SpellNet.
• The difficulty of non-word error correction correlates with the average clus-
tering coefficient for a language.
• The basic topological properties are invariant in nature for all the lan-
guages; for instance, the authors find that the SpellNet for all of the three
languages is characterized by an exponential degree distribution, high clus-
tering coefficient and positive correlation between the degree and clustering
coefficient of the nodes.

3 Word Co-Occurrence Networks


In this section, we review the work on word co-occurrence networks, where
the nodes are the words and an edge between two words indicates that the
words have co-occurred in the language in certain context(s). Depending on
the definition of the context, various networks can be defined. We describe
in detail two such networks: the collocation network and the syntactic de-
pendency network. As an application, we discuss the work by [79] where the
collocation network has been used for unsupervised induction of the gram-
matical structure of a language.

3.1 Collocation Network


One of the most basic and well-studied co-occurrence network types is that
of word collocation networks, where two words are linked if they are neigh-
bors, that is, if they collocate, in a sentence [24]. In this work, two types of
collocation networks, unrestricted and restricted ones, were constructed for
English from the British National Corpus. In an unrestricted network, all the
collocation edges are preserved, whereas in a restricted one only those edges
are preserved for which the probability of co-occurrence of the two words is higher
than would be expected if the two words occurred independently. All these networks
are undirected and unweighted, even though in language the order of words
(“ticket book” is different from “book ticket”) as well as the frequency of the
collocations have obvious significance.
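
The two constructions can be sketched as follows. This is only an illustration, not the procedure of [24]: the probabilities are estimated here as simple relative frequencies (pair counts over all adjacent pairs, word counts over all tokens), the independence test is taken to be p_ij > p_i p_j, and the toy corpus and function name are hypothetical.

```python
from collections import Counter

def collocation_networks(sentences):
    """Return (unrestricted, restricted) edge sets of undirected word pairs."""
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        word_counts.update(sent)
        for a, b in zip(sent, sent[1:]):  # adjacent words collocate
            if a != b:
                pair_counts[frozenset((a, b))] += 1
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    unrestricted = set(pair_counts)
    restricted = set()
    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        p_joint = count / n_pairs
        p_indep = (word_counts[a] / n_words) * (word_counts[b] / n_words)
        if p_joint > p_indep:  # keep only more-than-chance collocations
            restricted.add(pair)
    return unrestricted, restricted

# Hypothetical toy corpus of tokenized sentences.
toy_corpus = [["a", "b"]] + [["a", "c", "b"]] * 4
toy_unrestricted, toy_restricted = collocation_networks(toy_corpus)
```

In the toy corpus the pair (a, b) occurs once, which is less often than the frequencies of a and b would predict under independence, so it appears in the unrestricted network only. Both edge sets are undirected and unweighted, as in the networks described above.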
The authors found that both the networks exhibit small-world properties.
The average path length between any two nodes is small (around 2 to 3), and
the clustering coefficients are high (0.69 for the unrestricted and 0.44 for the
restricted networks). However, the most striking observation regarding these
networks is that the degree distributions follow a two-regime power law. The
degree distribution of the 5000 most connected words follows a power law with
an exponent −3.07, which is surprisingly close to that of the Barabási-Albert
growth model [5]. These findings led the authors to argue that the word usage
of the human languages is preferential in nature, where the frequency of a word
defines the comprehensibility and production capability. Thus, the higher the
usage frequency of a word, the higher the probability that the speakers will be
able to produce it easily and the listeners will comprehend it quickly. This is
known as the recency effect in linguistics [3]. The small-world property of the
collocation network, on the other hand, makes it easier to search the mental
lexicon (ML). In essence, the authors conclude that the evolution of language
has resulted in an optimal structure of the word interactions that facilitate
easier and faster production, perception and navigation of the words.
It does not follow, however, from the collocation networks that a word
with high degree is indeed a word with high usage frequency (unless the word
co-occurrences are completely independent in nature, which essentially is not
the case). In a separate study, Cancho and Solé [25] have shown that the
rank-degree distribution of the words in a very large corpus also follows a
two-regime power law, supporting their claim regarding the presence of a
core lexicon whose size is about 5000 words. In order to explain the two-
regime power law in word collocation networks, Dorogovtsev and Mendes [18]
proposed a preferential attachment-based growth model. At every time step t,
a new word (i.e., a node) enters the language (i.e., the network) and connects
itself preferentially to one of the pre-existing nodes. Simultaneously, ct (where
c is a positive constant) new edges are grown between pairs of old nodes that
are chosen preferentially. Through mathematical analysis and simulations, the
authors establish that this model gives rise to a two-regime power law with
exponents very close to those observed in [24].
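The growth rule just described can be simulated in a few lines. The sketch below is a simplified reading of the model in [18], not the authors' own code. Preferential choice is implemented by drawing uniformly from a list that holds both endpoints of every edge. The seed network of two linked words, the treatment of a fractional ct by random rounding, the acceptance of repeated edges and the skipping of self-loops are all assumptions made for illustration.

```python
import random


def dorogovtsev_mendes(steps, c=0.01, seed=0):
    """Grow a word web and return the degree of every node (multi-edges counted)."""
    rng = random.Random(seed)
    degree = [1, 1]  # assumed seed network: two words joined by one edge
    ends = [0, 1]    # each edge stores both endpoints, so a uniform draw from
                     # this list selects a node with probability ~ its degree
    for t in range(1, steps + 1):
        # the new word will attach preferentially to this old word
        target = rng.choice(ends)
        # about c*t new edges appear between preferentially chosen old words
        expected = c * t
        n_new = int(expected) + (rng.random() < expected - int(expected))
        for _ in range(n_new):
            u, v = rng.choice(ends), rng.choice(ends)
            if u == v:
                continue  # assumption: self-loops are simply skipped
            degree[u] += 1
            degree[v] += 1
            ends.extend((u, v))
        # finally add the new word and its single edge
        new = len(degree)
        degree.append(1)
        degree[target] += 1
        ends.extend((new, target))
    return degree
```

Plotting a histogram of the returned degrees on log-log axes for a large number of steps is one way to look for the two regimes.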
There have been studies on the properties of collocation networks for lan-
guages other than English, including Russian [46] and many others [41]. The
basic topological properties of the networks (e.g., scale-free, small-world,
assortative) are similar across languages, which suggests that, like Zipf’s
law, these characteristics are linguistic universals that call for a non-trivial
psycholinguistic account of their emergence and existence.

3.2 Syntactic Dependency Network

Although collocation networks are easier to construct, they do not necessarily
capture the syntactic and semantic relationships between the words, because
syntactic and semantic relations often extend beyond the local neighborhood
of a word. Syntactic relations between the words of a language are governed by
the underlying grammar. There are various formalisms, such as phrase struc-
ture grammar, tree-adjoining grammar and dependency grammar, to capture
these relationships. In the dependency grammar formalism, a relationship,
often shown as a directed edge, connects two words—the head and the de-
pendent. The dependent word modifies the head word in a certain way. For
example, the nouns are the heads of the adjectives that modify them. Simi-
larly, the verbs are the heads of their subjects, objects and other arguments.
Thus, in the dependency formalism, every sentence is represented as a directed
acyclic graph or a dependency tree as illustrated in Fig. 2. Usually, the finite
verb is the head of the whole sentence and is not dependent on any other word.
Cancho and his co-authors [21, 26] defined the syntactic dependency net-
work (SDN) where the words are the nodes and there is a directed edge be-
tween two words if in any of the sentences of a given corpus there is a directed
dependency relation between these words. The direction of the dependencies
in their construction is from the dependent word to the head word. In order
to construct the SDN, one needs to know the dependency relations between
the words of a sentence. Fortunately, there are large dependency treebanks for
some languages consisting of human annotated dependency trees for several
thousand sentences. The authors studied the SDN for three languages: Czech,
German and Romanian, and observed strikingly similar characteristics.

Fig. 2. Example of a dependency tree. The arrows are labeled by the type of
dependency relation and run from the dependent to the head words.

All the networks exhibit power-law degree distributions and small-world
structures. Some of the very interesting topological properties observed are
the following.
• Disassortative mixing. This shows that words that are used for linking
other words (such as prepositions) and, therefore, have high degree in the
networks, tend not to be linked to one another.
• Hierarchical organization. This implies that there is a top-down hierarchy
that is the basis of phrase structure formalism.
• Small-world structure. This is necessary for recursion and fast navigation
of the mental lexicon.
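The construction of the SDN itself is simple once a treebank is available. The sketch below assumes a hypothetical input format in which each sentence is a list of (word, head index) pairs and the root carries None. It aggregates the edges over all sentences, directed from dependent to head as in the construction of [21, 26]. The helper names are ours.

```python
from collections import Counter, defaultdict


def build_sdn(treebank):
    """Return {dependent: set of heads} aggregated over all sentences."""
    sdn = defaultdict(set)
    for sentence in treebank:
        for word, head_idx in sentence:
            if head_idx is None:
                continue  # the root (usually the finite verb) has no head
            sdn[word].add(sentence[head_idx][0])
    return sdn


def in_degree(sdn):
    """Number of distinct dependents attached to each head word."""
    deg = Counter()
    for heads in sdn.values():
        deg.update(heads)
    return deg
```

For the two toy sentences "she loved me" and "he loved her", the verb collects four incoming edges while every other word has in-degree zero.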
It is a well-known fact that syntactic dependency links usually do not
intersect in any of the world’s languages. In [22], the author conjectured that
this phenomenon is an outcome of minimization of the Euclidean distance
between the syntactically related words of a sentence, where the Euclidean
distance between two words is given by the number of words separating them (see footnote 1).
Later on, Cancho et al. [23] showed that spectral clustering of SDN classifies
words belonging to the same syntactic categories in the same cluster. As we
shall see in Section 5, quite similar techniques are being used in the field
of NLP for unsupervised induction of syntactic categories.

1 While it is true that syntactic dependencies have a tendency to avoid crossing,
there are systematic exceptions to that generalization in languages with relatively free
constituent order. In German, for example, about one-third of all relative clauses are
extraposed, thus creating cross dependencies.

3.3 Unsupervised Grammar Induction

One of the fascinating applications of word collocation networks, illustrated
in [79], is related to unsupervised induction of grammar. Explaining the
process of language acquisition is one of the greatest challenges to modern science.
Children learn languages that they are exposed to quite accurately and
effortlessly. This is one of the strongest pieces of evidence in support of our
instinctive capacities towards languages [70], which is dubbed the universal
grammar by Noam Chomsky [10]. In [79], the authors proposed a very simple
algorithm for learning hierarchical structures from the collocation graph of a
raw text corpus. The algorithm, ADIOS, works as follows.
A directed collocation graph is constructed from the corpus, where the
words are the nodes, and an edge is drawn from word w to word v if v follows w
in a sentence. In fact, each sentence is represented as a separate path in the
graph. The algorithm then iteratively searches for motifs that are shared by
different sentences. A linguistic motif is defined as a sequence of words, which
tends to occur quite frequently in the language and also serves some special
functions. For example, “that the X is Y” is a very commonly occurring motif
in English, where X and Y can be substituted by a large number of words and
this whole pattern can be embedded in various parts of a sentence.
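The graph construction and the idea of a shared segment can be illustrated as follows. This sketch is not the ADIOS motif criterion, which is defined in terms of flows. It only stores which sentence paths use each directed edge and lists fixed-length word sequences that recur in several sentences, as a crude stand-in for motif candidates. Function names and thresholds are hypothetical.

```python
from collections import defaultdict


def collocation_paths(sentences):
    """Directed graph: edges[(w, v)] is the set of sentence ids whose path uses w -> v."""
    edges = defaultdict(set)
    for sid, sent in enumerate(sentences):
        for w, v in zip(sent, sent[1:]):
            edges[(w, v)].add(sid)
    return edges


def shared_segments(sentences, length=2, min_sentences=2):
    """Word sequences of the given length found in at least `min_sentences` paths."""
    seen = defaultdict(set)
    for sid, sent in enumerate(sentences):
        for i in range(len(sent) - length + 1):
            seen[tuple(sent[i : i + length])].add(sid)
    return {seg for seg, sids in seen.items() if len(sids) >= min_sentences}
```

With the sentences "that the cat is black" and "that the dog is white", the only shared two-word segment is ("that", "the"), and the slots filled by cat/dog and black/white are the positions where interchangeable words would later be merged.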
Solan et al. [79] define the probability of a particular structure being a
motif in terms of network flows. After finding the motifs, the algorithm pro-
ceeds to identify interchangeable motifs and merge them into a single node.
Thus, at every step the network becomes smaller and a hierarchical structure
emerges. This structure can then be presented as a set of phrase structure
grammar rules.
ADIOS has a high precision (≈70%), but low recall (≈40%). Through a
comparative analysis of the induced grammars, the authors were able to con-
struct a dendrogram of 6 languages that have been studied. Quite surprisingly,
the dendrogram reflects the phylogenetic relations between these 6 languages.
There are other graph-based methods for unsupervised induction of syntac-
tic structures, but unlike ADIOS, these algorithms are based on standard
probability theory and Bayesian models.

4 Phonological Networks
In the earlier sections, we have seen how complex networks can be used to
study the different types of interactions (phonological, syntactic and semantic)
between the words of a language. In this section, we shall review some of
the works where the networks are constructed from linguistic units that are
smaller than words, e.g., phonemes and syllables.

4.1 Network of Human Speech Sounds

The most basic units of human languages are the speech sounds. The repertoire
of sounds that makes up the sound inventory of a language is not chosen
arbitrarily, even though the speakers are capable of perceiving and producing
a plethora of them. On the contrary, the inventories show exceptionally regular
patterns across the languages of the world, which is arguably an outcome of
the self-organization that goes on in shaping their structure. In fact, numerous
computational models have been proposed in the literature in order to
explain the self-organization of the vowel inventories [15, 47, 51, 76]. A few
attempts have also been made in the area of linguistics to explain the observed
patterns across the consonant inventories. Most of these works confine
themselves to explaining certain individual principles rather than formulating a
general theory describing the pattern emergence. However, complex networks
have been recently used quite successfully to explain the self-organization of
the consonant inventories. In [65] the authors construct a bipartite network
called PlaNet, or the Phoneme-Language Network, in which one of the par-
titions consists of nodes representing the languages while the other partition
consists of nodes representing the consonants. There is an edge between the
nodes of these two partitions if a particular consonant occurs in a particular
language. The authors further construct PhoNet (Phoneme-Phoneme Network),
which is the one-mode projection of PlaNet onto the consonant nodes, i.e.,
a network of consonants in which the nodes are linked as many times as
they have co-occurred across the language inventories. The data used for con-
structing the above networks is drawn from the UCLA Phonological Segment
Inventory Database (UPSID) [54], which consists of 317 languages and 541
consonants that are found across these languages. Several important observa-
tions are made from the study of PlaNet and PhoNet. The observations are
noted below.
From the study of PlaNet [65]
• The degree distribution of the consonant nodes in PlaNet roughly follows
a power law with an exponential cut-off towards the tail.
• A synthesis model based on preferential attachment (a language node
attaches itself to a consonant node depending on the current degree (k)
of the consonant node) can explain the emergence of the degree distribu-
tion of PlaNet. The results match the empirical data more accurately if
the attachment kernel is super-linear (i.e., the attachment probability is
proportional to k^α, where α > 1).
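A preferential attachment synthesis of this kind can be sketched as below. It is an illustration under stated assumptions, not the model of [65]: language nodes arrive one at a time with a prescribed inventory size, and each picks distinct consonants with probability proportional to k^α + ε. The smoothing constant ε (here `eps`) is our addition so that consonants with degree zero can be selected at all, and each inventory size must not exceed the number of consonants.

```python
import random


def synthesize_planet(inventory_sizes, n_consonants, alpha=1.0, eps=1.0, seed=0):
    """Return (inventories, consonant degrees) for a synthetic bipartite network."""
    rng = random.Random(seed)
    degree = [0] * n_consonants
    inventories = []
    for size in inventory_sizes:
        chosen = set()
        while len(chosen) < size:
            candidates = [c for c in range(n_consonants) if c not in chosen]
            # attachment kernel: k**alpha, smoothed by eps (an assumption)
            weights = [degree[c] ** alpha + eps for c in candidates]
            chosen.add(rng.choices(candidates, weights=weights, k=1)[0])
        for c in chosen:
            degree[c] += 1  # degrees are updated once the language is complete
        inventories.append(chosen)
    return inventories, degree
```

Setting alpha above 1 gives the super-linear kernel mentioned above, and comparing the resulting degree distribution for several values of alpha is a simple way to explore its effect.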
From the study of PhoNet [64, 65]
• The degree distribution of the consonant nodes in PhoNet also roughly
indicates a power-law behavior with exponential cut-offs.
• The clustering coefficient of PhoNet (=0.89) is significantly higher than
that of a random graph with the same number of nodes and edges (=0.08).
• Community structure analysis of PhoNet can capture the strong patterns
of co-occurrence of consonants that are prevalent across the languages of
the world.
• The driving force that leads to the emergence of these communities is
feature economy, which states that languages tend to use a small number
of distinctive features and maximize their combinatorial possibilities to
generate a large number of consonants.
• The emergence of the degree distribution and the clustering coefficient of
PhoNet can be explained through a synthesis model that is based on both
preferential attachment and triad (i.e., fully connected triplet) formation.
While the preferential part of the model reproduces the degree distribution
of the network, the triad formation part imposes a large number of triangles
onto the generated network, thereby increasing the clustering coefficient.
• The emergence of feature economy can be explained by having a synthesis
model, which is a linear combination of two different parts, one driven
by the usual degree-dependent preference and the other by a factor that
favors the choice of those consonants that share many features with the
already chosen ones.
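The one-mode projection that turns PlaNet into PhoNet is easy to state in code. The sketch below assumes the bipartite network is given as a hypothetical mapping from language names to consonant inventories. The weight of an edge is the number of languages in which the two consonants co-occur.

```python
from collections import Counter
from itertools import combinations


def project_onto_consonants(inventories):
    """Weighted one-mode projection of a language-consonant bipartite network."""
    weights = Counter()
    for consonants in inventories.values():
        # sorting gives every unordered pair a single canonical key
        for a, b in combinations(sorted(set(consonants)), 2):
            weights[(a, b)] += 1
    return weights
```

For three toy languages with inventories {p, t, k}, {p, t} and {t, k}, the pairs (p, t) and (k, t) receive weight 2 and the pair (k, p) receives weight 1.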
The authors postulate that the physical significance of the synthesis models is
grounded in the process of language change. Language change is a collective
phenomenon that functions at the level of a population of speakers [80]. They
also conjecture that it is possible to explain the significance of the models
at the level of an individual, primarily in terms of the process of language
acquisition. Further, they argue that there are two orthogonal preferences:
(a) the occurrence frequency of a consonant, and (b) the feature-dependent
preference (that increases the ease of learning), which are instrumental in
the acquisition of the inventories. The synthesis model is essentially a linear
combination of these two mutually orthogonal factors.

4.2 Network of Syllables


The syllable inventory of each language can also be modeled and analyzed in
the framework of a complex network. Each node in this network is a syllable,
and links are established between two syllables each time they are shared by
a word. In [78] the authors report the study of the network of Portuguese
syllables from two different sources: a Portuguese dictionary (DIC) and the
complete work of a very popular Brazilian writer—Machado de Assis (MA).
The authors show that
• The networks have a low average shortest path length (DIC: 2.44, MA: 2.61),
• The networks have a high clustering coefficient (DIC: 0.65, MA: 0.50),
• Both networks show power-law behavior.
Since in Portuguese the syllables are close to the basic phonetic units, unlike
the case in English, the authors argue that the properties of the English
syllabic network should be different from those of the Portuguese one. The
authors further conjecture that since Italian has a strong parallelism between
its structure and syllable hyphenation, it is possible that the Italian syllabic
network has properties close to those of the Portuguese network, pointing to
certain universal characteristics of language.

5 Applications in NLP and IR


Graph-based approaches are quite common in the areas of natural language
processing (NLP) and information retrieval (IR). Interestingly, although there
are no obvious technical differences between the scope of graph theory in these
areas and in complex networks, the terminologies used and the objectives are
often quite different. The works on linguistic networks discussed in the last
three sections were primarily targeted to the statistical physics community,
and the objective was to unfurl the structure of languages and their dynamics.
In this section, we will survey some equally interesting and significant works,
which use the same set of mathematical tools, but the objective is to develop
practical applications concerning languages.

5.1 Induction of Syntactic and Semantic Categories

One of the earliest and recurrent applications of networks in NLP has been
in automatic induction of syntactic and semantic categories based on the
distributional hypothesis [39]. The distributional hypothesis states that words
of similar syntactic (semantic) category are found in similar contexts [39]. To
illustrate this concept, consider two unknown words X and Y that occur in
the following sentences:
(1) The red X is very beautiful.
(2) If you Y then I shall punish you.
Even though we do not know what X and Y are, it is easy to infer that the
former is a noun and the latter is a verb. We can draw such inferences about
the syntactic categories (in this case the parts of speech) of words based on
our knowledge that nouns, but not verbs, can be preceded by articles (the) and
adjectives (red). The concept of distributional hypothesis is equally relevant
for semantic categories. Words belonging to the same domain club together.
Thus, the word student is expected to be in the vicinity of the word school, rather
than market.
Measuring to what extent two words appear in similar contexts defines
their similarity [62]. The general methodology [12, 27, 31, 72, 74, 75] for
inducing word class information can be outlined as follows.
1. Define the context of a word as a vector. It could be just the set of words
which occur in the same sentence, or only the immediate neighbors of the
words. For syntactic class induction, usually the word order is preserved
during construction of the vectors and the context vectors are defined only
in terms of the function words (such as is, of, the and a).
2. Collect global context vectors for the words by summing up the local con-
texts.
3. Construct a weighted network, where the nodes are the words and the
weight of the edge between two words is the distance between their context
vectors. There are several ways to define the distance between the vectors.
Some of the common measures are Euclidean distance, cosine similarity
and correlation coefficients.
4. Apply a clustering algorithm on these networks to obtain the word classes.
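Steps 1 to 3 above can be sketched as follows. This is one possible instantiation, not the method of any single cited work. The context of a word is taken to be the function words immediately to its left and right, with the side recorded so that word order is preserved. Global vectors are sums of these local contexts, and edges are weighted by cosine similarity. The function names and the similarity threshold are assumptions.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, function_words):
    """Global context vector per word: counts of adjacent function words by side."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            if i > 0 and sent[i - 1] in function_words:
                vectors[word][("L", sent[i - 1])] += 1
            if i + 1 < len(sent) and sent[i + 1] in function_words:
                vectors[word][("R", sent[i + 1])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_network(vectors, threshold=0.0):
    """Weighted network: an edge for every word pair whose similarity exceeds threshold."""
    words = sorted(vectors)
    edges = {}
    for i, a in enumerate(words):
        for b in words[i + 1 :]:
            s = cosine(vectors[a], vectors[b])
            if s > threshold:
                edges[(a, b)] = s
    return edges
```

The resulting edge dictionary can then be handed to any graph clustering algorithm for step 4.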
In the syntactic category induction literature, the 150–250 words with the
highest frequency are considered as function words, and the context vectors
are defined based on them. Some authors employ a much larger number of
features and reduce the dimensions of the resulting matrix using singular value
decomposition [72, 74]. [27] uses the Spearman rank correlation coefficient and
hierarchical clustering, [74, 75] use the cosine of the angle between vectors and
buckshot clustering, [31] uses cosine on mutual information vectors for
hierarchical agglomerative clustering and [12] applies Kullback–Leibler divergence
in his CDC algorithm.
[28] does not sum up the contexts of each word in a context vector, but uses
the most frequent instances of four-word windows in a co-clustering algorithm
[16]: rows and columns (here words and contexts) are clustered simultaneously.
Two-step clustering is undertaken by [74]: clusters from the first step are
used as features in the second step. More recently, Biemann [6] proposed the
Chinese Whispers algorithm for clustering, which is fast and does not require
any parameters to be specified. [7] reports application of Chinese Whispers for
part-of-speech (POS) induction in English, Finnish and German; the approach has
also been applied very recently to Bengali [66]. In this work, the authors also
investigate the topological properties of the word networks so constructed and
report a scale-free degree distribution, high clustering coefficient and power-
law cluster size distribution.
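The text above does not spell out how Chinese Whispers operates. As we understand the description in [6], every node starts in its own class, and nodes are then visited repeatedly in random order, each adopting the class that carries the largest total edge weight among its neighbors. The sketch below follows that reading. The fixed number of sweeps and the random tie-breaking are implementation choices of ours rather than details taken from the source.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, iterations=20, seed=0):
    """edges: {(u, v): weight} of an undirected graph. Returns {node: class label}."""
    rng = random.Random(seed)
    nbrs = defaultdict(dict)
    for (u, v), w in edges.items():
        nbrs[u][v] = w
        nbrs[v][u] = w
    label = {node: node for node in nbrs}  # every node starts in its own class
    nodes = sorted(nbrs)
    for _ in range(iterations):
        rng.shuffle(nodes)
        for node in nodes:
            # total edge weight per class in the neighborhood of this node
            score = defaultdict(float)
            for other, w in nbrs[node].items():
                score[label[other]] += w
            best = max(score.values())
            ties = sorted(c for c, s in score.items() if s == best)
            label[node] = rng.choice(ties)
    return label
```

The number of classes is not fixed in advance. It emerges from the graph, which is what makes the method attractive for POS induction where the tag set is unknown.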
Widdows and Dorow [87] propose an unsupervised incremental clus-
ter building approach for acquisition of semantic classes. There are also
graph-based algorithms to infer semantic classes (sets of synonyms, to be
specific) from the lexicons (see, e.g., [17, 43]).
Identification of syntactic or semantic classes is of great importance to NLP
and IR. For instance, POS tagging is the first step towards parsing. However,
the supervised machine learning techniques for POS tagging demand a large
amount of human annotated data, which is expensive as well as non-existent
for most of the languages. Since automatic induction of POS tags through
graph clustering does not require annotated data, it might turn out to be a
very useful technique in NLP for resource-poor languages. Similarly, semantic
clustering of the words is useful for search and IR.

5.2 Word Sense Disambiguation

Word sense disambiguation (WSD) refers to the task of assigning the appropri-
ate sense or meaning to a word in a given context (i.e., sentence or paragraph)
out of the several possibilities. For example, the English word bank has two
different meanings as a noun: 1) river bank, and 2) a financial institution.
However, as shown in the following sentences, in a given context only one of
the senses is appropriate.
(1) They were walking down the bank enjoying the cool river breeze.
(2) She went to the bank to cash her check.
There are several ways in which graph-based techniques have been applied
for WSD. Examples include lexical chaining [29], semantic relatedness
measures based on path lengths and random walks on semantic networks [57,
61] and lexicon graphs [50]. Due to the paucity of space, here we discuss
in detail only one of the approaches, HyperLex [85], which relies on word
co-occurrence graphs.

Fig. 3. Example of HyperLex: (a) the network of words for disambiguation of the word
“light”; (b) the minimal spanning tree obtained after introduction of the word “light”.
The hubs are shown in bold font.

Consider the problem of automatically identifying and disambiguating the
various senses of the word light. The HyperLex algorithm works as follows.
A sub-corpus consisting of all the paragraphs featuring at least one occurrence
of the word light is extracted from a raw text corpus. A word co-occurrence
graph is constructed from this sub-corpus, where the nodes are the content
words except for the word light. Two words are connected by an edge if they
co-occur in a paragraph more than a preset number of times. The weight of
an edge decreases as the number of times the words co-occur increases. It
has been found that word co-occurrence graphs built in this manner exhibit
small-world properties.
In this co-occurrence network, nodes with very high degree are identified
as hubs. The word light, for which we want to build the disambiguator, is then
introduced to the network and connected to the hubs. A minimal spanning
tree is constructed from the co-occurrence graph, where light is the root node
and the first level consists of the hubs. Figure 3 illustrates this process. Each
node in the spanning tree can be thought of as a sense. Thus, the hubs denote
the basic senses and, as we move further down the tree, we have more refined
senses of the word. This tree can then be used for disambiguating the sense
of the target word (here light) in a particular context.
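The pipeline just described can be sketched in a simplified form. Several details below are assumptions that the text does not fix: the edge weight is taken as the reciprocal of the co-occurrence count, hubs are picked greedily by degree while skipping neighbors of hubs already chosen, the target is tied to the hubs with zero-weight edges, and a context is disambiguated by a majority vote over the hubs under which its words sit in the tree. The spanning tree is grown with Prim's algorithm from the target word.

```python
import heapq
from collections import Counter
from itertools import combinations


def cooccurrence_graph(paragraphs, target, min_count=1):
    """Weighted graph over words co-occurring in more than `min_count` paragraphs."""
    counts = Counter()
    for para in paragraphs:
        counts.update(combinations(sorted(set(para) - {target}), 2))
    graph = {}
    for (a, b), n in counts.items():
        if n > min_count:
            graph.setdefault(a, {})[b] = 1.0 / n  # assumed weight: shrinks as n grows
            graph.setdefault(b, {})[a] = 1.0 / n
    return graph


def sense_tree(graph, target, n_hubs=2):
    """Pick hubs, attach the target to them, and grow a minimum spanning tree."""
    hubs, excluded = [], set()
    for w in sorted(graph, key=lambda x: (-len(graph[x]), x)):
        if len(hubs) == n_hubs:
            break
        if w not in excluded:
            hubs.append(w)
            excluded.add(w)
            excluded.update(graph[w])
    parent = {target: None}
    heap = [(0.0, hub, target) for hub in hubs]  # zero-weight edges to the hubs
    heapq.heapify(heap)
    while heap:
        _, node, par = heapq.heappop(heap)
        if node in parent:
            continue
        parent[node] = par
        for nxt, weight in graph[node].items():
            if nxt not in parent:
                heapq.heappush(heap, (weight, nxt, node))
    return hubs, parent


def hub_of(word, parent, target):
    """Follow parent links up to the first-level node (a hub), or None if absent."""
    if word == target or word not in parent:
        return None
    while parent[word] != target:
        word = parent[word]
    return word


def disambiguate(context, parent, target):
    """Return the hub that most context words fall under, or None."""
    votes = Counter()
    for w in context:
        hub = hub_of(w, parent, target)
        if hub is not None:
            votes[hub] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Each hub acts as a label for one basic sense, so the value returned by `disambiguate` names the sense selected for that context.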

5.3 Information Retrieval


The central problem of IR is to rank a given collection of documents by
their similarity to a query. Queries are usually very short and the collection of
documents huge. In a typical IR setup, the whole web consisting of billions of
webpages represents this collection of documents to be ranked and the query
is only one or two words long. One of the challenges of IR is to utilize the
network structure of the web to compute the ranks of the documents. The web
can be conceptualized as a directed graph where the nodes are the webpages
and a hyperlink from webpage A to webpage B represents a directed edge
between the nodes corresponding to A and B.
The very popular PageRank [9] is one of the first ranking algorithms and is
reportedly used by the Google search engine. The basic idea behind the PageRank
algorithm is that the rank (or popularity) of a node is a function of the rank of
its neighbors. In other words, the page which has a hyperlink from a popular
page is also popular. An alternative view of the PageRank algorithm involves a
random walker (here a random surfer). A random walker starts from a random
node and follows the edges of the graph randomly to reach other nodes. The
PageRank of a page is proportional to the probability that a random surfer
reaches that page by following random hyperlinks on the web. Yet another way
to define PageRank is as the components of the principal eigenvector of the
(suitably normalized) adjacency matrix of the web graph. Thus, PageRank is a
variant of what is known as eigenvector centrality in the complex network
literature.
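The random surfer view translates directly into a power iteration. The following is a minimal sketch, not a description of any production system. The damping factor of 0.85, the uniform random jump and the treatment of pages without outgoing links are assumptions that the discussion above does not specify.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}; scores sum to 1."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # with probability 1 - damping the surfer jumps to a random page
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # dangling page: its rank is spread uniformly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

After enough iterations the scores approximate the stationary distribution of the surfer, which is the principal eigenvector of the corresponding transition matrix.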
PageRank considers only the incoming edges of a node. Kleinberg [48]
proposed another ranking algorithm, called HITS, where every node has two
scores, hub and authority. The authority scores are similar to PageRank,
whereas the hub scores are based on the outgoing links, but computed in
the same way. The final rank of a node is the combination of its hub and
authority scores. Kleinberg and co-authors [33] also demonstrated how eigen-
vectors of the web structure can be used to cluster and disambiguate the pages
corresponding to ambiguous words such as “Jaguar” (referring to an animal
or a football team or the car).
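The mutual reinforcement between hub and authority scores can be sketched as an alternating update with normalization. This is an illustrative implementation of the idea in [48], with our own function name and a fixed iteration count, and it leaves the final combination of the two scores to the caller.

```python
import math


def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub scores, authority scores)."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority: sum of the hub scores of pages linking in
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # hub: sum of the authority scores of pages linked to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

In a toy graph where page h1 links to a1 and a2 while h2 links only to a1, a1 ends up with the higher authority score and h1 with the higher hub score.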
One drawback of both PageRank and HITS is that the algorithms assume
that all the hyperlinks have the same importance. There are various modifi-
cations of these algorithms, which use machine learning techniques to learn
weights of the different types of hyperlinks. Examples include RankNet [73],
TrustRank [37] and NetRank [2]. Link analysis, as this field is popularly called,
is a very active area of research in the IR community. Some of the other emerg-
ing applications of complex networks in IR include mining social networks and
blogs. The blogosphere [49], for example, can be represented as a multi-tier
network, where blogs, bloggers and other webpages (typically news articles)
are the nodes, and there are various types of edges representing the social
network of bloggers, the links between blogs and those between the blogs and
other webpages. Analysis of the blogosphere network is useful in classification
and personalized suggestion of blogs, opinion and sentiment analysis, as well
as in investigating the dynamics of the world of blogs.

5.4 Other Applications

Due to space limitations, it is impossible to do justice to the network-based
techniques in NLP and IR. There are a variety of NLP tasks, ranging from
parsing to text summarization, where graph-based methods have been applied.
In the previous three subsections we have discussed three specific problems
to illustrate the various usages of such techniques. Before we wrap up this
section, we list a few more example applications to demonstrate the extent
and potential of graph-based techniques in these areas.
Text summarization is a notably important and challenging application
of NLP, which has been elegantly modeled within the framework of complex
networks. The problem of text summarization involves identification of a small
number of sentences from a set of given documents that best summarize the
content of the documents. In [19] summarization has been reformulated as
the problem of finding out the node centrality in a network whose nodes are
the sentences and whose edges represent the word-level similarity between two
sentences. The most central sentences are those which cover most of the ideas
present in the given documents.
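The centrality formulation of summarization can be sketched with very simple ingredients. The code below is not the method of [19]. As assumptions for illustration, it measures word-level similarity by the Jaccard overlap of lower-cased word sets and uses weighted degree (the sum of similarities to all other sentences) as the centrality score.

```python
def summarize(sentences, k=1):
    """Return the k most central sentences, kept in their original order."""
    bags = [set(s.lower().split()) for s in sentences]

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    # weighted degree centrality in the sentence similarity network
    centrality = [
        sum(jaccard(bags[i], bags[j]) for j in range(len(bags)) if j != i)
        for i in range(len(bags))
    ]
    top = sorted(range(len(sentences)), key=lambda i: (-centrality[i], i))[:k]
    return [sentences[i] for i in sorted(top)]
```

Other centrality measures, such as the eigenvector-based scores used for ranking webpages, can be substituted for the weighted degree without changing the rest of the sketch.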
Other application areas include dependency parsing [56], textual entail-
ment [38], sentiment classification [34, 69], keyword extraction [60], novelty
detection [30] and prepositional phrase disambiguation [84]. See [8, 58, 59] for
further references.

6 Conclusion

So far we have seen that there has been a substantial amount of work to under-
stand the structure and dynamics of languages at the mesoscopic level within
the framework of complex networks. A parallel thread of research in the field
of NLP and IR tries to achieve a different goal, but uses very much the same
means. Nevertheless, mesoscopic models of language as well as network-based
approaches to NLP are in a nascent state, especially when compared to similar
lines of research in the fields of biology, economics and other social sciences
(refer to the surveys in this volume). On the other hand, there seems to be a
great potential for application of complex network theory to a variety of open
problems in linguistics and language engineering.
One of the fundamental problems of linguistics is characterization and
explanation of linguistic universals, i.e., properties that are common to all
human languages. Differences among the languages, on the other hand, are
restricted by the typologies and implicational hierarchies [14]. We have seen
that, like Zipf’s law, there are many linguistic universals observable in the lin-
guistic networks. For example, the SDNs as well as word collocation networks
of all languages exhibit scale-free degree distributions and the small-world
property. A systematic investigation of topological universals of linguistic net-
works can substantially improve our understanding of languages. At the same
time, there are properties for which the linguistic networks vary across
languages. For example, the average degrees of the SpellNets are very different
for English, when compared to Hindi or Bengali. This difference has been
attributed to the different writing systems used by English (an alphabet)
and the two Indo-Aryan languages (abugidas). Typological variations
have also been predicted in the topological properties of syllable networks.
Thus, it would be interesting to have a typological theory of languages based
on the structure of the linguistic networks.
Another question of great importance for any linguistic network is on the
emergence of its structural properties. It is far from clear why the word
collocation networks should display small-world and scale-free properties. Even
though the Dorogovtsev and Mendes model [18] can reproduce the two-regime
power law observed in the collocation networks, this does not by itself
establish the validity or the physical significance of a model based on
preferential attachment. In other words, the phenomenon of preferential
attachment at the mesoscopic level needs an independent microscopic
explanation in terms of psycholinguistic factors, because words cannot volun-
tarily link to other words. Similar microscopic explanations are required for
the non-trivial topological properties of the other linguistic networks, such as
ML, SDN, PhoNet and SpellNet. This is presumably a hard problem, but any
mesoscopic explanation is incomplete without a corresponding microscopic
model.
In the context of NLP and IR applications, network-based models are
mostly ad hoc and this reduces their credibility and, thereby, the popularity,
as compared to the more principled Bayesian approaches. A network-based
language model can bridge this gap and provide us with a more systematic
way of solving the NLP problems within this framework. Although there have
been some initiatives in this direction [44], this area is largely unexplored
and presents numerous challenging problems. Another relatively unexplored,
but potentially fecund, area of research is processes “on” linguistic networks.
Navigation of the ML can be modeled as guided random walks on the ML
network; similarly, typographical errors can be modeled as walks on SpellNet.
The exact nature of such guided walks is still to be explored and can provide
a strong understanding of underlying cognitive principles.
In the previous sections we have seen several ways to define networks
where the nodes represent words. One can conceive of a universal word net-
work obtained through superimposition of these partial representations of a
linguistic system into a multi-tier network where the nodes are the words and
two nodes can be connected by several labeled edges signifying their pho-
netic, collocational, syntactic, orthographic, semantic and various other kinds
of similarities. Studies on such a network can reveal a holistic picture of the
interaction patterns between the words, thereby providing a unified model of
grammar at different levels of linguistic structure.

References
1. A. E. Motter, A. P. S. de Moura, Y. C. Lai, and P. Dasgupta. Topology of the
conceptual network of language. Physical Review E, 65(065102):1–4, 2002.
2. A. Agarwal, S. Chakrabarti, and S. Aggarwal. Learning to rank networked entities.
In Proceedings of KDD, 2006.
3. A. Akmajian. Linguistics. An introduction to Language and Communication. MIT
Press, Cambridge, MA, 1995.
4. A. Albright and B. Hayes. Rules vs. analogy in English past tenses: A
computational/experimental study. Cognition, 90:119–161, 2003.
5. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science,
286:509–512, 1999.
6. C. Biemann. Chinese whispers - an efficient graph clustering algorithm and its ap-
plication to natural language processing problems. In Proceedings of TextGraphs:
the Second Workshop on Graph Based Methods for Natural Language Process-
ing, pages 73–80, New York, NY, June 2006. Association for Computational
Linguistics.
7. C. Biemann. Unsupervised part-of-speech tagging employing efficient graph
clustering. In Proceedings of the COLING/ACL 2006 Student Research Work-
shop, pages 7–12, Sydney, Australia, July 2006. Association for Computational
Linguistics.
8. C. Biemann, I. Matveeva, R. Mihalcea, and D. Radev, editors. Proceedings of the
Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language
Processing. Association for Computational Linguistics, Rochester, NY, 2007.
9. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine.
Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
10. N. Chomsky. The Minimalist Program. MIT Press, Cambridge, MA, 1995.
11. M. Choudhury, M. Thomas, A. Mukherjee, A. Basu, and N. Ganguly. How difficult
is it to develop a perfect spell-checker? A cross-linguistic analysis through complex
network approach. In Proceedings of the Second Workshop on TextGraphs: Graph-
Based Algorithms for Natural Language Processing, pages 81–88, Rochester, NY,
2007. Association for Computational Linguistics.
12. A. Clark. Inducing syntactic categories by context distribution clustering. In
C. Cardie, W. Daelemans, C. Nédellec, and E. T. K. Sang, editors, Proceedings of
the Fourth Conference on Computational Natural Language Learning and of the
Second Learning Language in Logic Workshop, Lisbon, 2000, pages 91–94. Asso-
ciation for Computational Linguistics, Somerset, NJ, 2000.
13. A. M. Collins and M. R. Quillian. Retrieval time from semantic memory. Journal
of Verbal Learning and Verbal Behavior, 8:240–247, 1969.
14. W. Croft. Typology and Universals. Cambridge University Press, Cambridge, MA,
1990.
15. B. de Boer. Self-organisation in vowel systems. Journal of Phonetics, 28(4):
441–465, 2000.
16. I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In
Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD-2003), pages 89–98, 2003.
17. W. B. Dolan, L. Vanderwende, and S. Richardson. Automatically deriving struc-
tured knowledge base from on-line dictionaries. In Proceedings of the Pacific As-
sociation for Computational Linguistics, 1993.
18. S. N. Dorogovtsev and J. F. F. Mendes. Language as an evolving word Web.
Proceedings of the Royal Society of London B, 268(1485):2603–2606, December
22, 2001.
19. G. Erkan and D. Radev. LexRank: Graph-based lexical centrality as salience in
text summarization. JAIR, 22:457–479, December 4, 2004.
20. C. Fellbaum. WordNet, an Electronic Lexical Database for English. MIT Press,
Cambridge, MA, 1998.
21. R. Ferrer-i-Cancho. The structure of syntactic dependency networks: insights from
recent advances in network theory. In: “The Problems of Quantitative Linguistics”,
G. Altmann, V. Levickij, and V. Perebyinis (eds.). Chernivtsi: Ruta. 60–75, 2005.
22. R. Ferrer-i-Cancho. Why do syntactic links not cross? Europhysics Letters,
76:1228–1235, 2006.
23. R. Ferrer-i-Cancho, A. Capocci, and G. Caldarelli. Spectral methods cluster words
of the same class in a syntactic dependency network. International Journal of
Bifurcation and Chaos, 17(7):2453–2463, 2007.
24. R. Ferrer-i-Cancho and R. V. Solé. The small world of human language.
Proceedings of The Royal Society of London. Series B, Biological Sciences,
268(1482):2261–2265, November 2001.
25. R. Ferrer-i-Cancho and R. V. Solé. Two regimes in the frequency of words and the
origin of complex lexicons: Zipf’s law revisited. Journal of Quantitative Linguistics,
8:165–173, 2001.
26. R. Ferrer-i-Cancho and R. V. Solé. Patterns in syntactic dependency networks.
Physical Review E, 69(051915), 2004.
27. S. Finch and N. Chater. Bootstrapping syntactic categories using statistical meth-
ods. In Background and Experiments in Machine Learning of Natural Language:
Proceedings of the 1st SHOE Workshop, pages 229–235. Katholieke Universiteit,
Brabant, Holland, 1992.
28. D. Freitag. Toward unsupervised whole-corpus tagging. In COLING ’04: Proceed-
ings of the 20th International Conference on Computational Linguistics, page 357,
Morristown, NJ, 2004. Association for Computational Linguistics.
29. M. Galley and K. McKeown. Improving word sense disambiguation in lexical chain-
ing. In Proceedings of IJCAI, 2003.
30. M. Gamon. Graph-based text representation for novelty detection. In Proceedings
of the Workshop on TextGraphs at HLT-NAACL, pages 17–24, 2006.
31. S. Gauch and R. Futrelle. Experiments in Automatic Word Class and Word Sense
Identification for Information Retrieval. In Proceedings of the 3rd Annual Sympo-
sium on Document Analysis and Information Retrieval, pages 425–434, Las Vegas,
NV, April 1994.
32. M. Gell-Mann. Language and complexity. In J. W. Minett and W. S.-Y. Wang,
editors, Language Acquisition, Change and Emergence: Essays in Evolutionary
Linguistics. City University of Hong Kong Press, July 2005.
33. D. Gibson, J. M. Kleinberg, and P. Raghavan. Inferring Web communities from
link topology. In Proceedings of the Ninth ACM Conference on Hypertext and
Hypermedia, pages 225–234, 1998.
34. A. B. Goldberg and J. Zhu. Seeing stars when there aren’t many stars: Graph-
based semi-supervised learning for sentiment categorization. In HLT-NAACL 2006
Workshop on Textgraphs: Graph-based Algorithms for Natural Language Process-
ing, 2006.
35. J. H. Greenberg and J. J. Jenkins. Studies in the psychological correlates of the
sound system of American English. Word, 20:157–177, 1964.
36. T. M. Gruenenfelder and D. B. Pisoni. Modeling the mental lexicon as a complex
system: Some preliminary results using graph theoretic measures. In Research
on Spoken Language Processing Progress Report No. 27, Bloomington, Indiana
University, 27–47, 2005.
37. Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with
TrustRank. In Proceedings of VLDB, pages 576–587, 2004.
38. A. D. Haghighi, A. Y. Ng, and C. D. Manning. Robust textual inference via
graph matching. In HLT ’05: Proceedings of the Conference on Human Language
Technology and Empirical Methods in Natural Language Processing, pages
387–394, Morristown, NJ, 2005. Association for Computational Linguistics.
39. Z. S. Harris. Mathematical Structures of Language. Wiley, New York, 1968.
40. M. D. Hauser, N. Chomsky, and W. T. Fitch. The faculty of language: What is it,
who has it, and how did it evolve? Science, 298:1569–1579, 2002.
41. R. Ferrer-i-Cancho, A. Mehler, O. Pustylnikov, and A. Díaz-Guilera. Correlations in
the organization of large-scale syntactic dependency networks. In TextGraphs-2:
Graph-Based Algorithms for Natural Language Processing, pages 65–72. Associa-
tion for Computational Linguistics, 2007.
42. Y. Itoh and S. Ueda. The Ising model for changes in word ordering rules in natural
languages. Physica D: Nonlinear Phenomena, 198(3-4):333–339, 2004.
43. J. Jannink and G. Wiederhold. Thesaurus entry extraction from an on-line dictio-
nary. In Proceedings of Fusion, 1999.
44. B. Jedynak and D. Karakos. Unigram language models using diffusion smoothing
over graphs. In Proceedings of the Second Workshop on TextGraphs: Graph-Based
Algorithms for Natural Language Processing, pages 33–36, Rochester, NY, 2007.
Association for Computational Linguistics.
45. V. Kapatsinski. Sound similarity relations in the mental lexicon: Modeling the
lexicon as a complex network. Speech Research Lab Progress Report, Indiana
University, Bloomington, IN, 2006.
46. V. Kapustin and A. Jamsen. Vertex degree distribution for the graph of word co-
occurrences in Russian. In Proceedings of the Second Workshop on TextGraphs:
Graph-Based Algorithms for Natural Language Processing, pages 89–92, Rochester,
NY, 2007. Association for Computational Linguistics.
47. J. Ke, M. Ogura, and W. S.-Y. Wang. Optimization models of sound systems using
genetic algorithms. Computational Linguistics, 29(1):1–18, 2003.
48. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of
ACM, 46, 1999.
49. R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. Structure and evolution of
blogspace. Communications of the ACM, 47(12):35–39, 2004.
50. M. Lesk. Automatic sense disambiguation using machine readable dictionaries:
How to tell a pine cone from an ice cream cone. In Proceedings of SIGDOC, 1986.
51. J. Liljencrants and B. Lindblom. Numerical simulation of vowel quality systems:
the role of perceptual contrast. Language, 48:839–862, 1972.
52. H. Liljenstrom. Micro Meso Macro: Addressing Complex Systems Couplings. World
Scientific Publishing, Singapore, 2005.
53. P. A. Luce and D. B. Pisoni. Recognizing spoken words: The neighborhood acti-
vation model. Ear and Hearing, 19:1–36, 1998.
54. I. Maddieson. Patterns of Sounds. Cambridge University Press, Cambridge, 1984.
55. W. Marslen-Wilson. Activation, competition, and frequency in lexical access. In:
G. T. M. Altmann (ed.), Cognitive Models of Speech Processing: Psycholinguis-
tic and Computational Perspectives, MIT Press, Cambridge, MA, pages 148–173,
1990.
56. R. McDonald, F. Pereira, K. Ribarov, and J. Hajič. Non-projective dependency
parsing using spanning tree algorithms. In HLT ’05: Proceedings of the confer-
ence on Human Language Technology and Empirical Methods in Natural Language
Processing, pages 523–530, Morristown, NJ, 2005. Association for Computational
Linguistics.
57. R. Mihalcea. Graph-based ranking algorithms for large vocabulary word sense
disambiguation. In Proceedings of HLT-EMNLP, 2005.
58. R. Mihalcea and D. Radev. Graph-based algorithms for information retrieval and
natural language processing. Tutorial at HLT/NAACL 2006, 2006.
59. R. Mihalcea and D. Radev, editors. Proceedings of the Second Workshop on
TextGraphs: Graph-Based Algorithms for Natural Language Processing. Associ-
ation for Computational Linguistics, 2006.
60. R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of
EMNLP, 2004.
61. R. Mihalcea, P. Tarau, and E. Figa. PageRank on semantic networks with appli-
cations to word sense disambiguation. In Proceedings of COLING, 2004.
62. G. A. Miller and W. G. Charles. Contextual correlates of semantic similarity.
Language and Cognitive Processes, 6(1):1–28, 1991.
63. G. A. Miller and P. M. Gildea. How children learn words. Scientific American,
257(3):86–91, 1987.
64. A. Mukherjee, M. Choudhury, A. Basu, and N. Ganguly. Modeling the co-
occurrence principles of the consonant inventories: A complex network approach.
International Journal of Modern Physics C, 18(2):281–295, 2007.
65. A. Mukherjee, M. Choudhury, A. Basu, and N. Ganguly. Self-organization of
sound inventories: Analysis and synthesis of the occurrence and co-occurrence
networks of consonants. Journal of Quantitative Linguistics, http://arXiv.org/
physics/0610120.
66. J. Nath, M. Choudhury, A. Mukherjee, C. Biemann, and N. Ganguly. Unsupervised
parts-of-speech induction for Bengali. In Proceedings of the Sixth International
Language Resources and Evaluation Conference (LREC), 2008.
67. D. Nettle. Using social impact theory to simulate language change. Lingua, 108:
95–117, 1999.
68. H. G. Nusbaum, D. B. Pisoni, and C. K. Davis. Sizing up the Hoosier mental
lexicon: Measuring the familiarity of 20,000 words, Indiana University. Research
on Speech Perception Progress Report No. 10, pages 357–376, 1984.
69. B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectiv-
ity summarization based on minimum cuts. In Proceedings of the 42nd Meeting
of the Association for Computational Linguistics (ACL’04), Main Volume, pages
271–278, Barcelona, Spain, July 2004.
70. S. Pinker. The Language Instinct: How the Mind Creates Language. HarperCollins,
New York, 1994.
71. S. Pinker and A. Price. On language and connectionism: Analysis of a parallel
distributed processing model of language acquisition. Cognition, 28:195–247, 1988.
72. R. Rapp. A practical solution to the problem of automatic part-of-speech induction
from text. In Conference Companion Volume of the 43rd Annual Meeting of the
Association for Computational Linguistics (ACL-05), Ann Arbor, MI, 2005.

73. M. Richardson, A. Prakash, and E. Brill. Beyond PageRank: Machine learning for
static ranking. In Proceedings of WWW, pages 707–715, 2006.
74. H. Schütze. Part-of-speech induction from scratch. In Proceedings of the 31st An-
nual Meeting on Association for Computational Linguistics, pages 251–258, Mor-
ristown, NJ, 1993. Association for Computational Linguistics.
75. H. Schütze. Distributional part-of-speech tagging. In Proceedings of the 7th Con-
ference on European Chapter of the Association for Computational Linguistics,
pages 141–148, San Francisco, CA, 1995. Morgan Kaufmann Publishers Inc.
76. J.-L. Schwartz, L.-J. Boë, N. Vallée, and C. Abry. The dispersion-focalization
theory of vowel systems. Journal of Phonetics, 25:255–286, 1997.
77. M. Sigman and G. A. Cecchi. Global organization of the WordNet lexicon. Proceed-
ings of the National Academy of Sciences, 99(3):1742–1747, 2002.
78. M. M. Soares, G. Corso, and L. S. Lucena. The network of syllables in Portuguese.
Physica A: Statistical Mechanics and its Applications, 355(2-4): 678–684, 2005.
79. Z. Solan, D. Horn, E. Ruppin, and S. Edelman. Unsupervised learning of natural
languages. Proceedings of National Academy of Sciences, 102(33):11629–11634,
2005.
80. L. Steels. Language as a complex adaptive system. In Proceedings of PPSN VI,
pages 17–26, 2000.
81. D. Steriade. Knowledge of similarity and narrow lexical override. BLS, 29: 583–598,
2004.
82. M. Steyvers and J. B. Tenenbaum. The large-scale structure of semantic networks:
Statistical analyses and a model of semantic growth. Cognitive Science, 29(1):
41–78, 2005.
83. M. Tamariz. Exploring the Adaptive Structure of the Mental Lexicon. Ph.D. the-
sis, Department of Theoretical and Applied Linguistics, University of Edinburgh,
Scotland, 2005.
84. K. Toutanova, C. D. Manning, and A. Y. Ng. Learning random walk models for
inducing word dependency distributions. In ICML ’04: Proceedings of the Twenty-
First International Conference on Machine Learning, page 103, New York, NY,
2004.
85. J. Véronis. HyperLex: Lexical cartography for information retrieval. Computer
Speech and Language, 18(3):223–252, 2004.
86. M. S. Vitevitch. Phonological neighbors in a small world (network): What can
graph theory tell us about the mental lexicon? Departmental Colloquy co-sponsored
by the Linguistics and Psychology Departments, Rice University, January 27, 2006.
87. D. Widdows and B. Dorow. A graph model for unsupervised lexical acquisition.
In Proceedings of COLING, 2002.
Networks Generated from Natural
Language Text

Chris Biemann and Uwe Quasthoff

Institute for Computer Science, NLP Department, University of Leipzig,
Johannisgasse 26, 04103 Leipzig, Germany; biem@informatik.uni-leipzig.de,
quasthoff@informatik.uni-leipzig.de

1 Introduction

The study of large-scale characteristics of graphs that arise in natural language
processing is an essential step in finding structural regularities. Structure
discovery processes have to be designed with an awareness of these properties.
Examining and contrasting the effects of processes that generate graph
structures similar to those observed in language data sheds light on the structure
of language and its evolution.
In this chapter, we examine power-law distributions and small world
graphs (SWGs) originating from natural language data. There are several
reasons for the special interest in these structures.
1. Power laws appear in many rank-frequency statistics. Furthermore, we can
construct graphs with words as nodes and use various rules to introduce
edges between words. In many cases, this results in SWGs, which again
often have a power-law distribution for their node degrees.
2. SWGs appear in many other kinds of real-world data, such as social networks
of many kinds, the link structure of the World Wide Web or traffic networks.
It is interesting to analyze all these networks in more detail to identify
similarities and differences.
3. From an application-driven view, SWGs allow effective clustering strate-
gies in nearly linear time. Because these clusters are often related to the
growth process of the underlying graph, they are often meaningful. In
the case of natural language these clusters usually reflect semantic and/or
syntactic structures.
Section 2 discusses several data sources that exhibit power-law distributions
with respect to rank frequency; Section 3 then turns to graphs with small world
properties in language data. We shall see that these characteristics are
omnipresent in language data, and we should be aware of them when designing
structure discovery processes. For example, the knowledge that a few hundred
words make up the bulk of the word tokens in a text allows one to use only
these words as contextual features with only a minor loss in text coverage.
Knowing that word co-occurrence networks possess the scale-free small world
property has implications for clustering these networks.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_10,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

An interesting aspect is whether these characteristics are only inherent to
real natural language data or whether they can be produced with generators
of linear sequences in a much simpler way than our intuition about language
complexity would suggest. In other words, we shall see how distinctive these
characteristics are with respect to tests deciding whether a given sequence is
natural language or not.

2 Power Laws in Rank-Frequency Distribution

G. K. Zipf [31, 32] described the following phenomenon: if all words in a corpus
of natural language are arranged in decreasing order of frequency, then the
relation between a word’s frequency and its rank in the list follows a power
law. Since then, a significant amount of research has been devoted to the
question of how this property emerges and what kinds of processes generate
such Zipfian distributions. In the following, some datasets related to language
are presented that exhibit a power law in their rank-frequency distribution. For
this discussion, basic units of language are examined.

2.1 Word Frequency

The relation between the frequency of a word at rank r and its rank is given
by $f(r) \sim r^{-z}$, where z is the exponent of the power law, which
corresponds to the (negative) slope of the curve in a log-log plot. The exponent
z was assumed to be exactly 1 by Zipf. In natural language data, slightly
differing exponents in the range of about 0.7 to 1.2 are also observed [30].
B. Mandelbrot [21] provided a formula that more closely approximates the
frequency distributions in language data after noticing that Zipf’s law holds
only for the medium range of ranks, whereas the curve is flatter for very
frequent words and steeper for high ranks. Figure 1 displays the word
rank-frequency distributions of corpora of different languages taken from the
Leipzig Corpora Collection.¹
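
As a rough illustration of how the exponent z can be read off a rank-frequency list, the following sketch fits a straight line to log f(r) against log r by ordinary least squares. The function names and the synthetic frequency list are assumptions made for this example; a plain log-log regression is only a crude estimator and is not necessarily the procedure behind the figures in this chapter.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word frequencies sorted in decreasing order (index 0 is rank 1)."""
    return sorted(Counter(tokens).values(), reverse=True)


def estimate_zipf_exponent(frequencies, min_rank=1, max_rank=None):
    """Least-squares slope of log f(r) against log r, returned as z = -slope."""
    max_rank = max_rank or len(frequencies)
    ranks = range(min_rank, max_rank + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(frequencies[r - 1]) for r in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic check (hypothetical data): frequencies that follow f(r) = 10000 / r.
ideal = [10000.0 / r for r in range(1, 1001)]
print(estimate_zipf_exponent(ideal))
```

The `min_rank` and `max_rank` arguments allow the fit to be restricted to the medium range of ranks, where, as noted above, Zipf’s law holds best.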
[Figure 1: log-log plot “Zipf's law for various corpora”; x-axis: rank, y-axis:
frequency; curves: German 1M, English 300K, Italian 300K, Finnish 100K, with
reference lines power law gamma=1 and power-law gamma=0.8.]

Fig. 1. Zipf’s law for various corpora. The numbers next to the language give the
corpus size in sentences. Enlarging the corpus does not affect the slope of the curve,
but merely moves it upwards in the plot. Most lines are almost parallel to the ideal
power-law curve with z = 1. Finnish exhibits a lower slope of γ ≈ 0.8, consistent
with its higher morphological productivity.

There exist several exhaustive collections of research capitalising on Zipf’s
law and related distributions² ranging over a wide area of datasets; here, only
findings related to natural language will be reported. A related distribution
is the lexical spectrum [16], which gives the probability of choosing a word
from the vocabulary with a given frequency. For natural language, the lexical
spectrum follows a power law with slope $\gamma = \frac{1}{z} + 1$, where z is
the exponent of the Zipfian rank-frequency distribution. For the relation
between lexical spectrum, Zipf’s law and Pareto’s law, see [1].

¹ LCC, see http://www.corpora.uni-leipzig.de [July 7th, 2007].
² e.g. http://www.nslij-genetics.org/wli/zipf/index.html [April 1, 2007].
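
The relation between the two exponents can be made plausible with a short continuum argument. This is only a sketch, assuming that ranks and frequencies may be treated as continuous quantities; it is not claimed to be the derivation given in [1].

```latex
f(r) \propto r^{-z}
\;\Longrightarrow\;
r(f) \propto f^{-1/z}
\quad \text{(number of word types with frequency at least } f\text{)},
\qquad
P(f) \propto \left| \frac{\mathrm{d}r}{\mathrm{d}f} \right|
     \propto f^{-\frac{1}{z} - 1} = f^{-\gamma},
\quad
\gamma = \frac{1}{z} + 1 .
```

For Zipf’s original assumption z = 1, this gives γ = 2.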
But Zipf’s law in its original form is just the tip of the iceberg of power-law
distributions in a quantitative description of language. While a Zipfian
distribution for word frequencies can be obtained by a simple model of generating
letter sequences with space characters as word boundaries [21, 22], these
models based on “intermittent silence” can neither reproduce the distributions of
sentence length [26] nor explain the relations of words in sequence. Next, more
power-law distributions in natural language are discussed and exemplified.
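
A minimal sketch of such a letter-sequence generator is given below. The alphabet size, the space probability and the sequence length are arbitrary assumptions chosen for illustration, not parameters taken from [21, 22]. With equiprobable letters the resulting rank-frequency curve is stepped (all words of the same length are about equally frequent), but its overall trend in a log-log plot is roughly linear.

```python
import random
from collections import Counter


def intermittent_silence_text(n_chars, alphabet="abcde", p_space=0.2, seed=0):
    """Emit n_chars symbols; each is a space with probability p_space,
    otherwise a letter drawn uniformly from the alphabet."""
    rng = random.Random(seed)
    chars = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    return "".join(chars)


# Words are the maximal letter runs between spaces; empty words are dropped.
text = intermittent_silence_text(200_000)
freqs = sorted(Counter(text.split()).values(), reverse=True)

for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```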

2.2 Letter N-Grams

To continue with a counterexample, letter frequencies do not obey a power
law in the rank-frequency distribution. This also holds for letter N-grams
(including the space character), yet for higher N, the rank-frequency plots
show a large power-law regime with exponential tails for high ranks. Figure 2
shows the rank-frequency plots for letter N-grams up to N = 6 for the first
10,000 sentences of the British National Corpus (BNC,³ [10]).
³ http://www.natcorp.ox.ac.uk/ [April 1, 2007]

[Figure 2: log-log plot “rank-frequency letter N-gram”; x-axis: rank, y-axis:
frequency; curves: letter 1gram to letter 6gram, with reference line power-law
gamma=0.55.]

Fig. 2. Rank-frequency distributions for letter N-grams for the first 10,000 sentences
in the BNC. Letter N-gram rank-frequency distributions do not exhibit power laws on
the full scale, but increasing N results in a larger power-law regime for low ranks.

Still, letter frequency distributions can be used to show that single letters
do not combine independently into letter bigrams, but that there are
restrictions on their combination. While this intuitively seems obvious for
letter combination, the following test is proposed for quantitatively examining
the effects of these restrictions: from letter unigram probabilities, a text is
generated that follows the letter unigram distribution by randomly and
independently drawing letters according to their distribution and concatenating
them. The letter bigram frequency distribution of this generated text can be
compared to the letter bigram frequency distribution of the real text from
which the unigram distribution was measured. Figure 3 shows the generated
plot and the real rank-frequency plot, again from the small BNC sample.
The two curves clearly differ. The generated bigrams without restrictions
predict a higher number of different bigrams and lower frequencies for bigrams
of high ranks as compared to the real text bigram statistics. This shows that
letter combination restrictions do exist, as not all bigrams predicted by the
generation process were observed, resulting in higher counts for valid bigrams
in the sample.
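
The test described above can be sketched in a few lines. The toy sample below is a hypothetical stand-in for the BNC sample, and the function names are assumptions made for this example; the sketch only shows the comparison of bigram inventories, not the plotting.

```python
import random
from collections import Counter


def letter_bigrams(text):
    """Count all overlapping letter bigrams (spaces included)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def generate_from_unigrams(text, seed=0):
    """Draw len(text) characters independently from the letter unigram
    distribution measured on text."""
    rng = random.Random(seed)
    counts = Counter(text)
    symbols = list(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=len(text)))


# Hypothetical toy sample; the chapter uses the first 10,000 BNC sentences.
sample = ("the quick brown fox jumps over the lazy dog and then the dog "
          "chases the fox through the thick green forest ") * 50

real = letter_bigrams(sample)
generated = letter_bigrams(generate_from_unigrams(sample))

print("distinct real bigrams:     ", len(real))
print("distinct generated bigrams:", len(generated))
```

Sorting `real.values()` and `generated.values()` in decreasing order yields the two rank-frequency lists that are compared in the text.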

[Figure 3: log-log plot “letter bigram: generated and real”; x-axis: rank, y-axis:
frequency; curves: letter 2-grams generated by letter-1-gram distribution, letter
2-gram real.]

Fig. 3. Rank-frequency plots for letter bigrams, for a text generated from letter
unigram probabilities and for the BNC sample.

2.3 Word N-Grams

For word N-grams, the relation between rank and frequency follows a power
law, just as in the case of words (unigrams). Figure 4 (left) shows the
rank-frequency plots up to N = 4, based on the first 1 million sentences of the
BNC. As more different word combinations are possible with increasing N,
the curves become flatter as the same total frequency is shared amongst more
units, as previously observed (e.g. [27, 18]). Testing concatenation restrictions
quantitatively as above for letters, it might at first seem surprising that the
curve for a text generated with word unigram frequencies differs only very
little from the real word bigram curve, as Fig. 4 (right) shows. Small differences
are only observable for low ranks: more top-rank generated bigrams reflect
that words are usually not repeated in the text. More low-ranked and fewer
high-ranked real bigrams indicate that word concatenation takes place not
entirely without restrictions, yet is subject to much more variety than letter
concatenation. This coincides with the intuition that it is, for a given word
pair, almost always possible to form a correct English sentence in which these
words are neighbours. Regarding quantitative (as opposed to syntactic or
semantic) aspects, the frequency distribution of word bigrams can be produced
by a generation process based on word unigram probabilities.

[Figure 4: two log-log plots; x-axis: rank, y-axis: frequency. Left panel, “word
N gram rank-frequency”: word 1-gram, word 2-gram, word 3-gram, word 4-gram.
Right panel, “word bigram: generated and real”: word 1-gram-generated word
2-grams, word 2-grams.]

Fig. 4. Left: Rank-frequency distributions for word N-grams for the first one million
sentences in the BNC. Word N-gram rank-frequency distributions exhibit power laws.
Right: Rank-frequency plots for word bigrams, for a text generated from word unigram
probabilities and for the BNC sample.

2.4 Sentence Frequency


Larger corpora that are compiled from a variety of sources contain a
considerable number of duplicate sentences. In the full BNC, which serves as
the data source in this case, 7.3% of the sentences occur two or more times.
The most frequent sentences are “Yeah.”, “Mm.”, “Yes.” and “No.”, which
are mostly found in the section of spoken language. But longer expressions
like “Our next bulletin is at 10.30 p.m.” also have a count of over 250. The
sentence frequencies follow a power law with an exponent close to 1 (see
Fig. 5), indicating that Zipf’s law also holds for sentence frequencies.

[Figure 5: log-log plot “rank-frequency for sentences in the BNC”; x-axis: rank,
y-axis: frequency; curves: sentences, power-law gamma=0.9.]

Fig. 5. Rank-frequency plot for sentence frequencies in the full BNC, following a
power law with γ ≈ 0.9, but with a high fraction of sentences occurring only once.

2.5 Other Power Laws in Language Data

The preceding results strongly suggest that when counting document
frequencies in large collections such as the World Wide Web, another power-law
distribution would be found, but such an analysis has not been carried out
and would require access to the index of a web search engine. Further, there are
more power laws in language-related areas; some are mentioned here briefly
to illustrate their omnipresence.
• Web page requests follow a power law, which was employed for a caching
mechanism in [17].
• Related to this, frequencies of web search queries during a fixed time span
also follow a power law, as exemplified in Fig. 6 for a log of 7 million queries
of AltaVista⁴ as used by Lempel [19].
• The number of authors of Wikipedia⁵ articles was found to follow a power
law with γ ≈ 2.7 for a large regime in [29]. The same paper further
discusses various other power-law relationships.

[Figure 6: log-log plot “rank-frequency for search queries”; x-axis: rank, y-axis:
frequency; curves: search queries, power-law gamma=0.75.]

Fig. 6. Rank-frequency plot for AltaVista search queries, following a power law with
γ ≈ 0.75.

3 Scale-Free Small Worlds in Language Data


The previous section discussed the shape of rank-frequency distributions for
natural language units. Now the properties of graphs with units represented
as vertices and relations between them as edges will be the focus of interest.
Internal as well as contextual features can be employed for computing
similarities between language units; these similarities are represented as
(possibly weighted) edges in the graph. Some of the graphs discussed here can be
classified as scale-free SWGs; others have different characteristics and
represent other, but related, graph classes.

⁴ http://www.altavista.com
⁵ http://www.wikipedia.org

3.1 Word Co-Occurrence Graph

The notion of word co-occurrence is used to model dependencies between
words. If two words X and Y occur together in some contextual unit of
information (as neighbours, in a word window of 5, in a clause, in a
sentence, in a paragraph), they are said to co-occur. When regarding words as
vertices and edge weights as the number of times two words co-occur, the
word co-occurrence graph of a corpus is given by the entirety of all word
co-occurrences. In the following, two specific types of co-occurrence graphs are
considered: the graph as induced by neighbouring words, henceforth called
the neighbour-based graph, and the graph as induced by sentence-based
co-occurrence, henceforth called the sentence-based graph. The neighbour-based
graph can be undirected or directed, with edges going from the left to the right
words as found in the corpus; the sentence-based graph is undirected.
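
The two graph types can be sketched as edge-weight tables. The representation (a `Counter` keyed by word pairs), the function names and the two toy sentences are assumptions made for this illustration. Counting a word pair at most once per sentence in the sentence-based graph is likewise an assumption, as the text does not fix this detail.

```python
from collections import Counter
from itertools import combinations


def neighbour_based_graph(sentences):
    """Directed graph: edge (left, right) weighted by the number of times
    the two words occur next to each other in this order."""
    edges = Counter()
    for tokens in sentences:
        for left, right in zip(tokens, tokens[1:]):
            edges[(left, right)] += 1
    return edges


def sentence_based_graph(sentences):
    """Undirected graph: edge (a, b) with a < b, weighted by the number of
    sentences that contain both words."""
    edges = Counter()
    for tokens in sentences:
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical toy corpus of tokenised sentences.
sentences = [
    "the dog chases the cat".split(),
    "the cat sleeps".split(),
]
nb = neighbour_based_graph(sentences)
sb = sentence_based_graph(sentences)
print(nb[("the", "cat")], sb[("cat", "the")])
```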
To find out whether the co-occurrence of two specific words A and B is
merely due to chance or exhibits a statistical dependency, measures are used
that compute to what extent the co-occurrence of A and B is statistically
significant. Many significance measures can be found in the literature; for
extensive overviews consult e.g. [9] or [14]. In general, the measures compare the
probability for A and B to co-occur under the assumption of their statistical
independence with the actual probability of their joint co-occurrence in the
corpus. In this work, the log likelihood ratio [13] is used to separate the wheat
from the chaff. It is given in expanded form in [9]:
$$
-2 \log \lambda = 2 \left[
\begin{array}{l}
n \log n - n_A \log n_A - n_B \log n_B + n_{AB} \log n_{AB} \\
+ (n - n_A - n_B + n_{AB}) \log (n - n_A - n_B + n_{AB}) \\
+ (n_A - n_{AB}) \log (n_A - n_{AB}) + (n_B - n_{AB}) \log (n_B - n_{AB}) \\
- (n - n_A) \log (n - n_A) - (n - n_B) \log (n - n_B)
\end{array}
\right],
$$

where $n$ is the total number of contexts, $n_A$ the frequency of A, $n_B$ the
frequency of B and $n_{AB}$ the number of co-occurrences of A and B. As pointed out
by Moore [23], this formula overestimates the co-occurrence significance for
small $n_{AB}$. For this reason, often a frequency threshold t on $n_{AB}$ (e.g. a
minimum of $n_{AB} = 2$) is applied. Further, a significance threshold s regulates the
density of the graph; for the log likelihood ratio, the significance values
correspond to the $\chi^2$ tail probabilities [23], which makes it possible to
translate the significance value into an error rate for rejecting the independence
assumption.⁶ The operation of applying a significance test results in pruning edges
that exist due to random noise and keeping almost exclusively those edges that
reflect a true association between their endpoints. Graphs that contain all
significant co-occurrences of a corpus, with edge weights set to the significance
value between their endpoints, are called significant co-occurrence graphs in
the remainder. For convenience, no singletons in the graph are allowed, i.e. if a
vertex is not contained in any edge because none of the co-occurrences for the
corresponding word is significant, then the vertex is excluded from the graph.

⁶ For example, a log likelihood ratio of 3.84 corresponds to a 5% error in stating
that two words do not co-occur by chance, a significance of 6.63 corresponds to a 1%
error.
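
The expanded log likelihood formula and the two thresholds t and s translate almost directly into code. The following is a sketch under stated assumptions: the function names, the dictionary-based inputs and the convention 0 · log 0 = 0 are choices made for this example and are not taken from [9] or [13].

```python
import math


def xlogx(x):
    """x * log(x), with the convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood_ratio(n, n_a, n_b, n_ab):
    """-2 log lambda for words A and B, term by term as in the expanded formula."""
    return 2.0 * (
        xlogx(n) - xlogx(n_a) - xlogx(n_b) + xlogx(n_ab)
        + xlogx(n - n_a - n_b + n_ab)
        + xlogx(n_a - n_ab) + xlogx(n_b - n_ab)
        - xlogx(n - n_a) - xlogx(n - n_b)
    )


def significant_edges(pair_counts, word_counts, n, t=2, s=6.63):
    """Keep pairs with n_ab >= t and significance >= s.
    The edge weight is the significance value; words without any
    significant edge simply do not appear in the result."""
    graph = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < t:
            continue
        sig = log_likelihood_ratio(n, word_counts[a], word_counts[b], n_ab)
        if sig >= s:
            graph[(a, b)] = sig
    return graph


# Under exact independence (n_ab = n_a * n_b / n) the ratio vanishes.
print(log_likelihood_ratio(1000, 100, 100, 10))
```

Note that the ratio is symmetric: it is also large when two words co-occur much less often than expected, so an implementation that is only interested in positive associations would additionally have to compare $n_{AB}$ with its expected value $n_A n_B / n$.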
As observed previously [15, 24], word co-occurrence graphs exhibit the
scale-free small world property. This is in line with co-occurrence graphs
reflecting human associations [25] and human associations in turn forming
SWGs [28]. The claim is confirmed here on an exemplary basis with the
graph for the Leipzig Corpora Collection’s (LCC’s) 1 million sentence corpus
for German. Figure 7 gives the degree distributions and graph characteristics for
various co-occurrence graphs.
The shape of the distribution is dependent on the language, as Fig. 8 shows.
Some languages (here English and Italian) have a hump-shaped distribution in
the log-log plot where the first regime follows a power law with a lower
exponent than the second regime, as observed in [15]. For the Finnish and
German corpora examined here, this effect could not be found in the data. This
property of two power-law regimes in the degree distribution of word
co-occurrence graphs motivated the Dorogovtsev-Mendes (DM)-model, see [12].
There, the crossover point of the two power-law regimes is motivated by a
kernel lexicon of about 5000 words that can be combined with all words of a
language.

[Figure 7: two log-log panels, "de1M neighbour-based graphs degree distribution" and "de1M sentence-based graphs degree distribution"; x-axis: degree interval; y-axis: fraction of vertices per degree. Curves, left: de1M nb. t=2 indegree and outdegree, de1M sig. nb. t=10 s=10 indegree and outdegree, power law gamma=2. Curves, right: de1M sb t=10, de1M sig. sb t=10 s=10, power law gamma=2.]

Fig. 7. Graph characteristics for various co-occurrence graphs of LCC’s 1-million
sentence German corpus. Abbreviations: nb = neighbour-based, sb = sentence-based,
sig. = significant, t = co-occurrence frequency threshold, s = co-occurrence
significance threshold. While the exact shapes of the distributions are language and
corpus dependent, the overall characteristics are valid for all samples of natural
language of sufficient size. The slope of the distribution is invariant to changes of
thresholds. Characteristic path length and a high clustering coefficient at low
average degrees are characteristic for SWGs.

[Figure 8: log-log plot "significant sentence-based graphs for various languages"; x-axis: degree interval; y-axis: fraction of vertices per degree. Curves: Italian 300K, English 300K and Finnish 100K sig. sentence-based graphs (t=2, s=6.63), with reference power laws gamma=2.5, gamma=1.5 and gamma=2.8.]

Fig. 8. Degree distribution of significant sentence-based co-occurrence graphs of
similar thresholds for Italian, English and Finnish.

[Figure 9: two log-log panels, both titled "degree distribution with window size 2"; x-axis: degree; y-axis: # of vertices for degree. Curves, left: Icelandic window 2, German window 2, power-law gamma=2. Curves, right: Italian window 2, English BNC window 2, power-law gamma=1.6 and gamma=2.6.]

Fig. 9. Degree distributions in word co-occurrence graphs for window size 2. Left: The
distribution for German and Icelandic is approximated by a power law with γ = 2.
Right: For English (BNC) and Italian, the distribution is approximated by two
power-law regimes.
The original experiments of [15] operated on a word co-occurrence graph
with window size 2: an edge is drawn between words if they appear together at
least once in a distance of one or two words in the corpus. Reproducing their
experiment with the first 70 million words of the BNC and corpora of German,
Icelandic and Italian of similar size reveals that the degree distribution of
the English and the Italian graph is in fact approximated by two power-law
regimes. In contrast to this, German and Icelandic show a single power-law
distribution, just as in the experiments above; see Fig. 9. These results
suggest that two power-law regimes in word co-occurrence graphs with window
size 2 are not a language universal, but only hold for some languages.

[Figure 10: two log-log panels, "degree distribution with distance 1" and "degree distribution with distance 2"; x-axis: degree; y-axis: # of vertices for degree. Curves, left: Italian distance 1, English BNC distance 1, power-law gamma=1.8 and gamma=2.2. Curves, right: Italian distance 2, English BNC distance 2, power-law gamma=1.6 and gamma=2.6.]

Fig. 10. Degree distributions in word co-occurrence graphs for distance 1 and
distance 2 for English (BNC) and Italian. The hump-shaped distribution is much more
distinctive for distance 2.

To examine the hump-shaped distributions further, Fig. 10 displays the
degree distribution for the neighbour-based word co-occurrence graphs and
the word co-occurrence graphs connecting only words that appear at a
distance of 2. As the plots make clear, the hump-shaped distribution is
mainly caused by words co-occurring at distance 2, whereas the
neighbour-based graph shows only a slight deviation from a single power law.
Together with the observations from sentence-based co-occurrence graphs of
different languages in Fig. 8, it becomes clear that a hump-shaped
distribution with two power-law regimes is caused by long-distance
relationships between words, if present at all.
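
The three graph variants compared above (distance 1, distance 2, and window
size 2 as their union) can be sketched as follows. This is a minimal
illustration, not the setup used for Figs. 9 and 10: the function names are
hypothetical, the token list is a toy example, sentence boundaries are ignored,
and pairs of identical words are skipped by assumption.

```python
from collections import Counter, defaultdict


def cooccurrence_graph(tokens, distances=(1, 2)):
    """Undirected graph linking words that occur at one of the given distances."""
    neighbours = defaultdict(set)
    for d in distances:
        # zip pairs every token with the token d positions to its right.
        for left, right in zip(tokens, tokens[d:]):
            if left != right:  # assumption: no self-loops
                neighbours[left].add(right)
                neighbours[right].add(left)
    return neighbours


def degree_distribution(neighbours):
    """Map each degree to the number of vertices that have this degree."""
    return Counter(len(adjacent) for adjacent in neighbours.values())


tokens = "the cat sat on the mat and the dog sat on the cat".split()
window2 = cooccurrence_graph(tokens, distances=(1, 2))
distance1 = cooccurrence_graph(tokens, distances=(1,))
distance2 = cooccurrence_graph(tokens, distances=(2,))
```

Plotting degree_distribution for each of the three graphs on log-log axes would
give toy counterparts of the curves discussed above.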

3.1.1 Applications of Word Co-Occurrences

Word co-occurrence statistics are an established standard and have been used
in many language processing systems. The authors have used co-occurrences
in practical applications like bilingual dictionary acquisition [4, 11], semantic
lexicon extension [8] and visualisation of concept trails [7]. The aim of this
chapter is to underpin their applications with a theoretical foundation.

3.2 Co-Occurrence Graphs of Higher Order

The significant word co-occurrence graph of a corpus represents words that
are likely to appear near to each other. When one is interested in words
co-occurring with similar other words, it is possible to transform the
above-defined (first-order) co-occurrence graph into a second-order
co-occurrence graph by drawing an edge between two words A and B if they share
a common neighbour in the first-order graph. Whereas the first-order word
co-occurrence graph represents the global context per word, the corresponding
second-order graph contains relations between words which have similar global
contexts. The edge can be weighted according to the number of common
neighbours, e.g. by weight = |neigh(A) ∩ neigh(B)|. Figure 11 shows
neighbourhoods of the significant sentence-based first-order word
co-occurrence graph from LCC’s English web corpus⁷ for the words jazz and
rock. Taking into account only the data depicted, jazz and rock are connected
with an edge of weight 4 in the second-order graph, corresponding to their
common neighbours album, music, singer and band. The fact that they share an
edge in the first-order graph is ignored.

[Figure 11: two neighbourhood graphs, one centred on jazz and one on rock. Words shown include album, band, music, singer, pop, blues, classical, musician, musicians, saxophonist, trumpeter, pianist, Marsalis, concert, concerts, roll, star, stars, strata, mass, coal, burst and bursts.]

Fig. 11. Neighbourhoods of jazz and rock in the significant sentence-based word
co-occurrence graph as displayed on LCC’s English corpus website. Both
neighbourhoods contain album, music, singer and band, which leads to an edge weight
of 4 in the second-order graph.
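
The second-order edge weight defined above can be illustrated with a short
Python sketch. The neighbourhoods are a hypothetical excerpt: the entries for
jazz and rock loosely follow Fig. 11, while the entry for blues is invented for
the example. The function name is likewise an assumption, and the pairwise loop
is quadratic in the number of words, so it is only suitable for small inputs.

```python
from itertools import combinations


def second_order_graph(first_order):
    """Weight each word pair by the number of first-order neighbours they share."""
    weights = {}
    for a, b in combinations(sorted(first_order), 2):
        common = first_order[a] & first_order[b]
        if common:
            weights[(a, b)] = len(common)
    return weights


# Hypothetical first-order neighbourhoods (word -> set of neighbouring words).
first_order = {
    "jazz": {"album", "music", "singer", "band", "saxophonist", "trumpeter", "rock"},
    "rock": {"album", "music", "singer", "band", "roll", "coal", "strata", "jazz"},
    "blues": {"music", "singer"},
}
second_order = second_order_graph(first_order)
```

Here jazz and rock receive weight 4 (album, music, singer, band); their direct
first-order edge does not contribute, as stated in the text.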
In general, a graph of order N + 1 can be obtained from the graph of order
N , using the same transformation. The higher-order transformation without
thresholding is equivalent to a multiplication of the unweighted adjacency
matrix A with itself, then a zeroing of the main diagonal by subtracting the
degree matrix of A. Since the average path length of scale-free SWGs is short
and local clustering is high, this operation leads to an almost fully connected
graph in the limit, which does not allow one to draw conclusions about the
initial structure. Thus, the graph is pruned in every iteration N in the fol-
lowing way. For each vertex, only the maxN outgoing edges with the highest
weights are taken into account. Notice that this vertex degree threshold maxN
does not limit the maximum degree, as thresholding is asymmetric. This op-
eration is equivalent to only keeping the maxN largest entries per row in
the adjacency matrix A = (aij ), then At = (sign(aij + aji )), resulting in an
undirected graph. To examine quantitative effects of the higher-order trans-
formation, the sentence-based word co-occurrence graph of LCC’s 1-million
German sentence corpus (s = 6.63, t = 2) underwent this operation. Figure 12
depicts the degree distributions for N = 2 and N = 3 for different maxN .
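
One step of the transformation with pruning, as described above, might look as
follows in plain Python. This is a sketch under stated assumptions: the function
name and the tie-breaking rule (lower vertex number first) are not from the
chapter, and the cubic-time matrix product is only meant for tiny examples.

```python
def higher_order(adj, max_n=None):
    """One higher-order step on a symmetric 0/1 adjacency matrix (list of lists)."""
    n = len(adj)
    # Common-neighbour counts, i.e. the entries of A*A; the diagonal, which
    # would hold the vertex degrees, is set to zero.
    counts = [[0 if i == j else sum(adj[i][k] * adj[k][j] for k in range(n))
               for j in range(n)] for i in range(n)]
    if max_n is not None:
        pruned = [[0] * n for _ in range(n)]
        for i, row in enumerate(counts):
            # Keep only the max_n heaviest outgoing edges of vertex i
            # (ties broken by vertex number, an assumption).
            best = sorted((j for j in range(n) if row[j] > 0),
                          key=lambda j: (-row[j], j))[:max_n]
            for j in best:
                pruned[i][j] = row[j]
        counts = pruned
    # Symmetrise: an undirected edge exists if either direction survived.
    return [[1 if counts[i][j] + counts[j][i] > 0 else 0 for j in range(n)]
            for i in range(n)]


# Star graph: vertex 0 is the centre, vertices 1 to 3 are leaves.
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
full = higher_order(star)
pruned = higher_order(star, max_n=1)
```

In the pruned result vertex 1 still has degree 2 although max_n = 1, which
illustrates why the asymmetric threshold does not bound the maximum degree.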
⁷ http://corpora.informatik.uni-leipzig.de/?dict=en [April 1, 2007]

[Figure 12: two log-log panels, "German cooc order 2" and "German cooc order 3"; x-axis: degree; y-axis: vertices per degree interval. Curves, left: German order 2 full, German order 2 max 10, German order 2 max 3. Curves, right: German order 2 max 3, power-law gamma=1 and gamma=4.]

Fig. 12. Degree distributions of word-co-occurrence graphs of higher order. The
first-order graph is the sentence-based word co-occurrence graph of LCC’s 1-million
German sentence corpus (s = 6.63, t = 2). Left: N = 2 for max2 = 3, max2 = 10 and
max2 = ∞. Right: N = 3 for t2 = 3, t3 = ∞, using the second-order graph with
max2 = 3.

Applying the maxN threshold causes the degree distribution to change,
especially for high degrees. In the third-order graph, two power-law regimes
are observable.
Studying the degree distribution of higher-order word co-occurrence graphs
revealed that the characteristic of being governed by power laws is invariant
to the higher-order transformation, yet the power-law exponent changes. This
indicates that the power-law characteristic is inherent at many levels in natu-
ral language data. To examine what this transformation yields on the graphs
generated by other random graph models, Figure 13 shows the degree distribu-
tion of second-order and third-order graphs as generated by the graph gener-
ation models of [3] (Barabási-Albert (BA)-model), [28] (Steyvers-Tenenbaum
(ST)-model) and [12] (DM-model). The underlying first-order graphs are the
undirected graphs of order 10,000 and size 50,000 (k=10) from these three
models.
While the thorough interpretation of second-order graphs of random
graphs might be subject to further studies, the following should be noted:
the higher-order transformation reduces the power-law exponent of the BA-
model graph from γ = 3 to γ = 2 in the second order and to γ ≈ 0.7 in the
third order. For the ST-model, the degree distribution of the full second-order
graph shows a maximum around 2M, then decays with a power law with exponent
γ ≈ 2.7. In the third-order ST-graph, the maximum moves to around 4M² for
sufficient max2. The DM-model second-order graph shows, like the
first-order DM-model graph, two power-law regimes in the full version, and
a power-law with γ ≈ 2 for the pruned versions. The third-order degree dis-
tribution exhibits many more vertices with high degrees than predicted by a
power law.

[Figure 13: six log-log panels (BA order 2, BA order 3, ST order 2, ST order 3, DM order 2, DM order 3); x-axis: degree; y-axis: vertices per degree interval. Order-2 panels show the full graph and the versions pruned with max 10 and max 3; order-3 panels are built from the order-2 graphs with max 10 and max 3. Reference power laws shown: gamma=2 and gamma=0.7 (BA), gamma=2.5 (ST), gamma=2, gamma=1 and gamma=4 (DM).]

Fig. 13. Second- and third-order graph degree distributions for BA-model, ST-model
and DM-model graphs.

In summary, all random graph models exhibit clear differences from word
co-occurrence networks with respect to the higher-order transformation. The
ST-model shows maxima depending on the average degree of the first-order
graph. The BA-model’s power-law exponent decreases with higher orders, but the
model is able to explain a degree distribution with power-law exponent 2. The
full DM-model exhibits the same two power-law regimes in the second order as
observed for German sentence-based word co-occurrences in the third order.

3.2.1 Applications of Co-Occurrence Graphs of Higher Orders

In [6] and [20], the utility of word co-occurrence graphs of higher orders is
examined for lexical semantic acquisition. The highest potential for extracting
paradigmatic semantic relations can be attributed to second- and third-order
word co-occurrences. In [9] second-order graphs are evaluated against lexical
semantic resources.

3.3 Sentence Similarity

Using words as internal features, the similarity of two sentences can be mea-
sured by the number of common words they share. Since the few top frequency
words are contained in most sentences as a consequence of Zipf’s law, their
influence should be downweighted or they should be excluded to arrive at
a useful measure for sentence similarity. Here, the sentence similarity graph
of sentences sharing at least two common words is examined, with the max-
imum frequency of these words bounded by 100. This maximum frequency
threshold was arbitrarily chosen and could be replaced by a weighting scheme
that attributes more weight to less frequent words. However, a hard threshold
reduces the computational cost significantly. The corpus examined here is
LCC’s 3-million-sentence German corpus. Figure 14 shows the component size
distribution for this sentence similarity graph; Figure 15 shows the degree
distributions for the entire graph and for its largest component.
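
A minimal sketch of this construction is given below. It assumes whitespace
tokenisation, lower-casing, and corpus frequency counted over tokens; none of
these details are specified in the text, and the function and parameter names
are hypothetical. The inverted index only holds words at or below the frequency
bound, which is where the hard threshold saves work.

```python
from collections import Counter, defaultdict
from itertools import combinations


def sentence_similarity_graph(sentences, min_common=2, max_freq=100):
    """Link sentences sharing at least min_common words of frequency <= max_freq."""
    tokenised = [set(s.lower().split()) for s in sentences]
    freq = Counter(w for s in sentences for w in s.lower().split())
    # Inverted index over admissible feature words only.
    index = defaultdict(list)
    for sid, words in enumerate(tokenised):
        for w in words:
            if freq[w] <= max_freq:
                index[w].append(sid)
    # Count shared feature words per sentence pair.
    shared = Counter()
    for sids in index.values():
        for pair in combinations(sids, 2):
            shared[pair] += 1
    return {pair: c for pair, c in shared.items() if c >= min_common}


sentences = [
    "the red fox jumps",
    "the red fox sleeps",
    "the blue bird sings",
    "the blue sky",
]
# Toy bound of 2 instead of 100, so that "the" (frequency 4) is excluded.
graph = sentence_similarity_graph(sentences, max_freq=2)
```

With the toy bound, only sentences 0 and 1 are connected (via red and fox);
sentences 2 and 3 share just one admissible word and stay unconnected.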
The degree distribution of the entire graph follows a power law with γ close
to 1 for low degrees and decays faster for high degrees; the largest component’s
degree distribution plot is flatter for low degrees. This can be attributed to
limited sentence length: as sentences are not arbitrarily long, they cannot
be similar to an arbitrarily high number of other sentences with respect to
the measure discussed here, as the number of sentences per feature word is
bounded by the word frequency limit. However, the extremely high values
for transitivity and clustering coefficient and the low γ values for the degree
distribution for low degree vertices and comparably long average shortest path
lengths indicate that the sentence similarity graph belongs to a different graph
class than all other graphs discussed above.

[Figure 14: log-log plot "sentence similarity graph component distribution"; x-axis: component size; y-axis: # of vertices. Curves: sentence similarity components, power-law gamma=2.7.]

Fig. 14. Component size distribution for the sentence similarity graph of LCC’s
3-million sentence German corpus. The component size distribution follows a power
law with γ ≈ 2.7 for small components, the largest component comprises 211,447 out
of 416,922 total vertices. The component size distribution complies with the
theoretical results of [2].

[Figure 15: log-log plot of the degree distribution; x-axis: degree; y-axis: # of vertices. Curves: sentence similarity de3M sentences, >1 common, freq <101; only largest component; power-law gamma=0.6; power-law gamma=1.1.]

Fig. 15. Degree distribution for the sentence similarity graph of LCC’s 3-million
sentence German corpus and its largest component. An edge between two vertices
representing sentences is drawn if the sentences share at least two words with corpus
frequency <101; singletons are excluded.

3.3.1 Applications of the Sentence Similarity Graph

A similar measure is used in [5] for document similarity and obtains well-
correlated results when evaluated against a given document classification.
A precision-recall tradeoff arises when lowering the frequency threshold for
feature words or increasing the minimum number of common feature words
two documents must have in order to be connected in the graph: both improve
precision but result in many singleton vertices, which lowers the total number
of documents that are considered.

3.4 Summary of Scale-Free Small Worlds in Language Data

The preceding examples confirm the claim that graphs built on various aspects
of natural language data often exhibit the scale-free small world property or
similar characteristics. Experiments with generated text corpora suggest that
this is mainly due to the power-law frequency distribution of language units.
The slopes of the power law approximating the degree distributions can often
not be produced using the random graph generation models. Specifically, all
previously discussed generation models fail to explain the properties of word
co-occurrence graphs, where γ ≈ 2 was observed as the power-law exponent
of the degree distribution. Of the generation models inspired by language
data, the ST-model exhibits γ = 3, whereas the universality of the DM-
model to capture word co-occurrence graph characteristics can be doubted
after examining data from different languages.

References
1. Adamic, L. A. (2000). Zipf, power-law, Pareto – a ranking tutorial. Technical
report, Information Dynamics Lab, HP Labs, Palo Alto, CA 94304.
2. Aiello, W., Chung, F., and Lu, L. (2000). A random graph model for massive
graphs. In STOC ’00: Proceedings of the Thirty-Second Annual ACM Symposium
on Theory of Computing, pages 171–180, New York, NY, USA. ACM Press.
3. Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks.
Science, 286, 509.
4. Biemann, C. and Quasthoff, U. (2005). Dictionary acquisition using parallel text
and co-occurrence statistics. In Proceedings of NODALIDA’05 , Joensuu, Finland.
5. Biemann, C. and Quasthoff, U. (2007). Similarity of documents and document col-
lections using attributes with low noise. In Proceedings of the Third International
Conference on Web Information Systems and Technologies (WEBIST-07), pages
130–135, Barcelona, Spain.
6. Biemann, C., Bordag, S., and Quasthoff, U. (2004a). Automatic acquisition of
paradigmatic relations using iterated co-occurrences. In Proceedings of the Fourth
International Conference on Language Resources and Evaluation (LREC-04),
Lisbon, Portugal.
7. Biemann, C., Böhm, C., Heyer, G., and Melz, R. (2004b). Automatically
building concept structures and displaying concept trails for the use in
brainstorming sessions and content management systems. In Proceedings of
Innovative Internet Community Systems (IICS-2004), Springer LNCS, Guadalajara,
Mexico.
8. Biemann, C., Shin, S.-I., and Choi, K.-S. (2004c). Semiautomatic extension of
corenet using a bootstrapping mechanism on corpus-based co-occurrences. In
Proceedings of the 20th International Conference on Computational Linguistics
(COLING-04), Morristown, NJ, USA. Association for Computational Linguistics.
9. Bordag, S. (2007). Elements of Knowledge-free and Unsupervised Lexical Acquisi-
tion. Ph.D. thesis, University of Leipzig.
10. Burnard, L. (1995). Users Reference Guide for the British National Corpus. Oxford
University Computing Service, Oxford, U.K.
11. Cysouw, M., Biemann, C., and Ongyerth, M. (2007). Using Strong’s numbers
in the Bible to test an automatic alignment of parallel texts. Special issue of
Sprachtypologie und Universalienforschung (STUF), pages 66–79.
12. Dorogovtsev, S. N. and Mendes, J. F. F. (2001). Language as an evolving word
web. Proceedings of The Royal Society of London. Series B, Biological Sciences,
268(1485), 2603–2606.
13. Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coinci-
dence. Computational Linguistics, 19(1), 61–74.
14. Evert, S. (2004). The Statistics of Word Co-occurrences: Word Pairs and Collo-
cations. Ph.D. thesis, University of Stuttgart.
15. Ferrer-i-Cancho, R. and Solé, R. V. (2001). The small world of human
language. Proceedings of The Royal Society of London. Series B, Biological
Sciences, 268(1482), 2261–2265.
16. Ferrer-i-Cancho, R. and Solé, R. V. (2002). Zipf’s law and random texts.
Advances in Complex Systems, 5(1), 1–6.
17. Glassman, S. (1994). A caching relay for the world wide web. Computer Networks
and ISDN Systems, 27(2), 165–173.
18. Ha, L. Q., Sicilia-Garcia, E. I., Ming, J., and Smith, F. J. (2002). Extension of
Zipf’s law to words and phrases. In Proceedings of the 19th International Con-
ference on Computational Linguistics (COLING-02), pages 1–6, Morristown, NJ,
USA. Association for Computational Linguistics.
19. Lempel, R. and Moran, S. (2003). Predictive caching and prefetching of query
results in search engines. In Proceedings of the 12th International Conference on
World Wide Web (WWW-03), pages 19–28, New York, NY, USA. ACM Press.
20. Mahn, M. and Biemann, C. (2005). Tuning co-occurrences of higher orders for
generating ontology extension candidates. In Proceedings of the ICML-05 Work-
shop on Ontology Learning and Extension using Machine Learning Methods, Bonn,
Germany.
21. Mandelbrot, B. B. (1953). An information theory of the statistical structure of
language. In Proceedings of the Symposium on Applications of Communications
Theory. Butterworths.
22. Miller, G. A. (1957). Some effects of intermittent silence. American Journal of
Psychology, 70, 311–313.
23. Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events.
In D. Lin and D. Wu, editors, Proceedings of the Conference on Empirical Methods
in Natural Language Processing (EMNLP-04), pages 333–340, Barcelona, Spain.
Association for Computational Linguistics.
24. Quasthoff, U., Richter, M., and Biemann, C. (2006). Corpus portal for search
in monolingual corpora. In Proceedings of the Fifth International Conference on
Language Resources and Evaluation (LREC-06), pages 1799–1802, Genoa, Italy.
25. Rapp, R. (1996). Die Berechnung von Assoziationen: ein korpuslinguistischer
Ansatz . Olms, Hildesheim.
26. Sigurd, B., Eeg-Olofsson, M., and van de Weijer, J. (2004). Word length, sentence
length and frequency – Zipf revisited. Studia Linguistica, 58(1), 37–52.
27. Smith, F. J. and Devine, K. (1985). Storing and retrieving word phrases. Inf.
Process. Manage., 21(3), 215–224.
28. Steyvers, M. and Tenenbaum, J. B. (2005). The large-scale structure of semantic
networks: Statistical analyses and a model of semantic growth. Cognitive Science,
29(1), 41–78.
29. Voss, J. (2005). Measuring Wikipedia. In P. Ingwersen and B. Larsen, editors,
ISSI2005 , volume 1, pages 221–231, Stockholm. International Society for Sciento-
metrics and Informetrics.
30. Zanette, D. H. and Montemurro, M. A. (2005). Dynamics of text generation with
realistic Zipf’s distribution. Journal of Quantitative Linguistics, 12(1), 29–40.
31. Zipf, G. K. (1935). The Psycho-Biology of Language. Houghton Mifflin, Boston.
32. Zipf, G. K. (1949). Human Behavior and the Principle of Least-Effort. Addison-
Wesley, Cambridge, MA.
Efficiency of Navigation in Indexed Networks

Petter Holme¹,²,³
¹ Department of Computer Science, University of New Mexico, Albuquerque,
NM 87131, USA
² School of Computer Science and Communication, Royal Institute of Technology,
10044 Stockholm, Sweden
³ Department of Physics, Umeå University, 90187 Umeå, Sweden;
petter.holme@physics.umu.se

1 Introduction

The interplay between network structure and search dynamics has emerged as
a busy subfield of statistical network studies (see e.g. Refs. [1, 9, 10, 13, 14]).
Consider a simple graph G = (V, E) (where V is a set of n vertices and E is
a set of m edges—unordered pairs of vertices). Assume information packets
travel from a source vertex s to a destination t. We assume the packets are
myopic agents (at a given time step they have access to information about the
vertices in their neighborhood, but not more), and have memory (so they can
e.g. perform a depth-first search) but no previous knowledge of the network.
Let τ (p) be the time for a packet p to travel between its source and destination.
One commonly studied quantity of search efficiency is the expectation value of
τ , τ̄ , for randomly chosen s and t. In this chapter we attempt to find efficient
ways to index V (with numbers from 1 to n) and utilize these indices for
packet navigation. In other words, we try to find ways to compress the global
information about the network into numbers 1, . . . , n so that the information
can be used by packets to find short paths to their destinations.
We propose two schemes of indexing the vertices, and corresponding meth-
ods for packet navigation. These schemes, along with two depth-first search
methods (not using vertex indices for more than remembering the path) are
examined on four network models. We will first present the indexing and
search schemes, then the network models for testing the algorithms and
finally the numerical results.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_11,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
190 P. Holme

2 Indexing and Search Schemes


Now we turn to the schemes for assigning indices to the vertices and using
them in search processes. Our two main schemes are both inspired by search
trees. Packets first move towards a root vertex r, then towards the destination.
Unless the network really is a tree, this approach cannot be exact—a packet
is not guaranteed to find the shortest way both from s to r and from r to t.
However, as we will see, one can assign indices such that the search either
from s to r or from r to t is certain to be as short as possible. One of our
schemes, ASU (accurate search up), will be such that the shortest upward
search is guaranteed. The other, ASD (accurate search down), will have the
shortest possible r to t search.
On a technical note, V is a set of distinct elements and an indexing scheme
is a bijection φ : V → I where I = {1, . . . , n}. In the remainder of the text
we will not explicitly distinguish i ∈ V from φ(i).

2.1 ASD Indexing and Search

The numbers 1, . . . , n can be arranged into a search tree [3] such that the
expected value of τ scales like log n. In Fig. 1(a) we give an example of a search
tree. To go from source s to destination t a packet first moves to the root r
by going to the neighbor with the lowest index value. From the root to the
destination, the packet moves to the neighbor with the largest index smaller
than, or equal to, t. Our strategy for the ASD indexing and search scheme is
to construct a spanning tree T (G) for the network, index the tree to make it a
search tree, and use the algorithm above to navigate from s to t. The problem
is, however, that real networks are not trees. Imagine adding edges between
vertices of the same heights and branches to the tree in Fig. 1(a)—the tree
will still be a spanning tree, but the packets may not take the same path from
s to t any more. As we will see, with certain ways of constructing the tree and
indexing the vertices, the search either from s to r or r to t will be optimal.
We construct T (G) in the following way.
1. Let the root r be a vertex of smallest eccentricity (maximal distance to
another vertex).
2. Construct the tree such that the distances to the root are the same in T (G)
and G. In other words, construct it such that all edges in T go between
different neighborhoods Γl (r) = {i ∈ V : d(i, r) = l} and Γl+1 (r) for some
level 0 ≤ l ≤ h, where h is the height of the tree (by the choice of r, h is
also the radius of the graph). Such a tree can be constructed by finding
the set of followed edges in a breadth-first search [3] starting from r.
When it is not clear which vertex, or edge, to choose in the above
construction, we choose one at random from all the possible candidates. When T
is constructed, let the indices be a preordering of the vertices in T (i.e. the
order of first occurrence of the vertex in a depth-first search of the tree) [3].
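
The construction above can be sketched in Python as follows. This is an
illustration under stated assumptions rather than the implementation used for
the chapter's results: ties are broken by sorted vertex label instead of at
random so that the output is reproducible, the graph is assumed connected, and
all function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source in a connected, undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def asd_index(graph):
    """Return (root, children, index) for a BFS spanning tree with preorder indices."""
    # Step 1: a vertex of smallest eccentricity becomes the root.
    root = min(sorted(graph), key=lambda v: max(bfs_distances(graph, v).values()))
    # Step 2: the edges followed by a breadth-first search form the tree.
    children = {v: [] for v in graph}
    seen = {root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(graph[u]):
            if v not in seen:
                seen.add(v)
                children[u].append(v)
                queue.append(v)
    # Preorder numbering 1..n: a vertex is indexed when it is first visited.
    index = {}
    stack = [root]
    while stack:
        u = stack.pop()
        index[u] = len(index) + 1
        stack.extend(reversed(children[u]))
    return root, children, index


# Small hypothetical graph given as adjacency sets.
graph = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"},
         "d": {"b", "c", "e"}, "e": {"d"}}
root, children, index = asd_index(graph)
```

Because the tree comes from a breadth-first search, tree distances to the root
equal graph distances, and every subtree occupies a contiguous index interval.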

[Figure 1: six panels (a) to (f). (a) A search tree with root r = 1 and vertices indexed 1 to 10. (b) and (c) The same vertices as a network with additional edges; s = 9 and t = 6 are marked in (c). (d) A partition of branches into an even and an odd class. (e) and (f) An ASU-indexed tree and network; s = 9 and t = 8 are marked in (f).]

Fig. 1. Illustration of the ASD (panels (a)–(c)) and ASU (panels (d)–(f)) indexing
and search schemes. (a) shows a search tree where a local search algorithm can find
the shortest path from one vertex to another fast. (b) shows a network indexed by the
ASD scheme. The tree used in the construction is identical to the one shown in (a).
Panel (c) shows an ASD search from s to t (with τ = 4). On the way from s to r
the packet chooses the neighbor (of the current vertex) with lowest index, which here
gives a longer route than the optimal {(9, 10), (10, 1)}. (d) shows a possible partition
of branches of non-root vertices into classes of as similar size as possible (as done in
the ASU indexing scheme). (e) shows a possible indexing based on the partition in (d).
Panel (f) displays a search from s to t with τ = 6. The shortest path from s to r is
accurately found, but a detour to 6 makes the search from r to t suboptimal.

Now we prove that this indexing and the search algorithm always give the
shortest paths from the root to a vertex t. Let ET be the edges of T and let
Ti be the maximal subtree with i as root. By construction, all vertices in Ti
have indices in [i, i + |Ti|] (where | · | denotes the cardinality of a subgraph).
Let i′ be the largest index in i’s neighborhood smaller than t. Assume there
is an edge (i, j) ∈ E \ ET that the search will follow, i.e. that i′ < j < t. This
means that j ∈ Ti′. By construction, i′ is the only vertex in Ti′ at a distance
d(r, i′) (the distance from the rest of Ti′ to the root is at least d(r, i′) + 1).
Since d(r, i′) = d(r, i) + 1, we have d(r, j) ≥ d(r, i) + 2, which contradicts the
existence of an edge (i, j) ∈ E. Thus, searches from r to t will always follow
the edges of T, which also means the r–t searches will be as short as possible.
Searching upwards, from i to r, in a graph indexed as above is harder. We
know that one shortest path goes via a vertex j with smaller index than i, but
there might exist suboptimal paths via indices i′ in the intervals r < i′ < j
and j < i′ < i, and there might also be paths via vertices of index larger
than j that are optimal. For example, assume the search tree in Fig. 1(a) comes
from a graph with the additional edges (5, 9), (8, 9) and (9, 10) (see Fig. 1(b)).
Then, the shortest path from 9 to r via a vertex of lower index is
{(9, 7), (7, 1)}, but there is an equally long path via a vertex of larger index,
{(9, 10), (10, 1)}, and longer paths via vertices both smaller and larger than 7
but smaller than 9. Thus, there is no general way of finding the shortest way
from s to r. Instead, we always choose the vertex with the smallest index in the
neighborhood. By this strategy a packet will come closer to r, in index space,
for every step. Furthermore, in tree-like parts of the graph, the search will
follow the shortest paths. An illustration of the ASD search is shown in
Fig. 1(c).
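
The two search phases can be sketched as below. The edge list is a
reconstruction from the description of Fig. 1(a) and (b): a tree in preorder
with root 1 plus the extra edges (5, 9), (8, 9) and (9, 10). The exact tree
shape is an assumption, as are the function names and the choice to stop the
upward phase early if the packet happens to step onto t.

```python
def build_graph(edges):
    """Adjacency sets of an undirected graph whose vertices are ASD indices."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def asd_search(graph, s, t, root=1):
    """Walk from s up to the root, then down to t; return the visited vertices."""
    path = [s]
    current = s
    # Upward phase: always step to the neighbour with the smallest index.
    while current != root and current != t:
        current = min(graph[current])
        path.append(current)
    # Downward phase: step to the neighbour with the largest index not above t.
    while current != t:
        current = max(v for v in graph[current] if v <= t)
        path.append(current)
    return path


# Assumed reading of Fig. 1(a): tree edges with preorder indices.
tree_edges = [(1, 2), (1, 3), (1, 6), (1, 7), (1, 10),
              (3, 4), (3, 5), (7, 8), (7, 9)]
extra_edges = [(5, 9), (8, 9), (9, 10)]
graph = build_graph(tree_edges + extra_edges)

path = asd_search(graph, s=9, t=6)
tau = len(path) - 1
```

Under these assumptions the packet goes 9, 5, 3, 1, 6, which gives τ = 4 as in
Fig. 1(c): the upward phase takes three steps where two would suffice, while
the downward phase follows the tree.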

2.2 ASU Indexing and Search

Consider a tree T (G) constructed as in the previous section and an indexing
such that d(i, r) < d(j, r) implies i < j (i.e., all indices of a level further from
the root are larger than in levels closer to r). With such an indexing, since the
neighbor of a vertex with the smallest index necessarily is one step closer to
the root, a packet can always find one shortest way to the root. But once the
packet is at the root the indices are not of much help. The search from
r to t has to be, essentially, a depth-first search. There are, however, a few
tricks to speed up the search. First, there is no need to search deeper than t;
if j > t, then t ∉ Tj . Second, one can choose the indices i, . . . , i + |Γl (r)| of
one level in the tree in a way to narrow down the search. For example, one
can divide the vertices into ν classes (defined by e.g. the remainder when the
index is divided by ν) and index vertices of connected regions of the graph
with indices of the same class. The search can then be restricted to the same
class as the destination. We will pursue this idea with ν = 2.
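The upward half of this scheme can be sketched in a few lines. The code below is an illustration under simplifying assumptions: indices are assigned in plain breadth-first order starting from 0 at the root (one of many indexings that satisfy the level condition), and the division into ν classes is ignored. The function names are hypothetical.

```python
from collections import deque

def level_index(adj, root):
    """Index vertices in breadth-first order, so that levels closer to the root get smaller indices."""
    index, queue = {root: 0}, deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in index:
                index[v] = len(index)
                queue.append(v)
    return index

def route_to_root(adj, index, s):
    """Follow the smallest-index neighbor until the root (index 0) is reached."""
    path = [s]
    while index[path[-1]] != 0:
        path.append(min(adj[path[-1]], key=index.get))
    return path

# Small assumed example graph (adjacency lists), rooted at vertex 0.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
index = level_index(adj, 0)
print(route_to_root(adj, index, 4))
```

Every step moves exactly one level closer to the root, so the returned path has the length of a shortest path from s to r.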
To derive the ASU indexing scheme, the first goal is to divide the vertices
into classes of connected subgraphs. Furthermore, we require all classes to
be connected to the root vertex. Another aim is to make the classes as
similar in size as possible. Our first step is to make kr (the degree, or number
of neighbors, of r) parallel depth-first searches.¹ Second, we group the kr
search trees into ν groups with maximally similar sizes. In our case, we seek a
partition of the search trees into two classes such that the sums of vertices in
the respective classes are as close as possible.² Then we go through the levels,
starting from the root, and assign numbers such that vertices of one partition
have even indices, while those of the other have odd indices (this assignment
might not always work). To avoid systematic errors we sample the elements of
levels randomly. This construction scheme is illustrated in Figs. 1(d) and (e).
An illustration of the ASU search scheme is shown in Fig. 1(f).

¹ Every iteration, one step is taken in all branches. The different search branches
mark the visited vertices with their indices. A search proceeds only to vertices not
marked by any search. When there are no unmarked vertices, the search branch is
finished.
² We do this by randomly exchanging search trees between the two classes and
accept changes that improve the partition. The search is continued until their vertex
sums differ by at most one or until the partition has not improved for 1000 trials.

2.3 Degree-Based and Random Search

As a reference, we also run simulations for two depth-first search methods that
do not utilize indices [1]. One of them, Rnd, is a regular depth-first search
where the neighbors are traversed in random order. In the other, Deg, the
neighbors are chosen in order from high to low degree. Just as in the ASU and
ASD methods, a packet is assumed to have knowledge about its neighborhood.
If the destination is in the neighborhood of a vertex, then the search will be
over the next time step.
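Both reference methods differ only in the order in which neighbors are tried, which the following sketch makes explicit. It is an illustration, not the code used for the simulations: it assumes that every edge traversal, forward or backward, costs one time step, and that `adj` maps each vertex to a list of neighbors. The function names are hypothetical.

```python
import random

def dfs_steps(adj, s, t, order):
    """Count edge traversals (forward moves and backtracks) until t is reached."""
    visited, stack, steps = {s}, [s], 0
    while stack:
        cur = stack[-1]
        if cur == t:
            return steps
        if t in adj[cur]:             # the destination is visible from here
            return steps + 1
        nxt = next((v for v in order(cur) if v not in visited), None)
        if nxt is None:
            stack.pop()               # dead end: step back
        else:
            visited.add(nxt)
            stack.append(nxt)
        steps += 1
    return None                       # t is not reachable from s

def deg_order(adj):
    """Deg: try neighbors from high to low degree."""
    return lambda u: sorted(adj[u], key=lambda v: -len(adj[v]))

def rnd_order(adj, rng):
    """Rnd: try neighbors in random order."""
    return lambda u: rng.sample(adj[u], len(adj[u]))

# Assumed example: vertex 2 is a hub that sees the destination 5.
adj = {0: [1, 2], 1: [0], 2: [0, 3, 4, 5], 3: [2], 4: [2], 5: [2]}
print(dfs_steps(adj, 0, 5, deg_order(adj)))
print(dfs_steps(adj, 0, 5, rnd_order(adj, random.Random(0))))
```

In the example, Deg goes to the hub first and finishes in two steps, whereas Rnd needs either two or four steps depending on which neighbor it happens to try first.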

3 Network Models

The efficiency of our indexing and search schemes is more or less directly
affected by the network structure. To investigate this relationship we test the
search schemes on four different types of network models: modified Erdős–Rényi
(ER) graphs [5], square lattices, Barabási–Albert (BA) [2] and
Holme–Kim (HK) [8] networks. To facilitate comparison, we use the same
average degree, four (dictated by the square grid), in all networks.

3.1 Modified ER Graphs

The ER model is the simplest model for randomly generating simple graphs
with n vertices and m edges. The edges are added one by one to randomly
chosen vertex pairs (the only restriction being that loops or multiple edges are
not allowed). A problem for our purpose is that ER graphs are not necessarily
connected (something required to measure τ̄ ). To remedy this we propose a
scheme to make networks connected.
1. Detect the connected components.
2. Go through the connected components sequentially. Denote the current
   component CI .
   a) Pick a component CJ randomly.
   b) Pick a random edge (i, j) whose removal would not fragment CJ . If no
      such edge exists, go to step 2.
   c) Pick a random vertex i′ of CI .
   d) Replace (i, j) by (i′, j). If the edge (i′, j) would exist already (an unlikely
      event), go to step 2a. If there is no vertex i′ ∈ CI such that (i′, j) does
      not already exist, then go to 3.
3. If the network is still disconnected, go to step 1.
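The rewiring idea behind these steps can be sketched as follows. This is a simplified illustration rather than the procedure above verbatim: components are drawn in random pairs instead of being visited sequentially, an edge is tested for being a bridge by removing it and checking reachability, and a cap on the number of rounds (`max_rounds`, an assumed parameter) guards against graphs with too few cycles. The graph is a dict mapping each vertex to a set of neighbors.

```python
import random
from collections import deque

def component_of(adj, start):
    """Vertex set of the connected component containing start."""
    comp, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in comp:
                comp.add(v)
                queue.append(v)
    return comp

def components(adj):
    comps, seen = [], set()
    for v in adj:
        if v not in seen:
            comps.append(component_of(adj, v))
            seen |= comps[-1]
    return comps

def connect(adj, rng, max_rounds=10000):
    """Rewire edges until the graph is connected; the number of edges is unchanged."""
    for _ in range(max_rounds):
        comps = components(adj)
        if len(comps) == 1:
            return True
        c_i, c_j = rng.sample(comps, 2)
        edges = [(i, j) for i in sorted(c_j) for j in sorted(adj[i])]
        rng.shuffle(edges)
        for i, j in edges:
            adj[i].remove(j)
            adj[j].remove(i)
            if j in component_of(adj, i):      # (i, j) was not a bridge of C_J
                i_new = rng.choice(sorted(c_i))
                adj[i_new].add(j)              # replace (i, j) by (i', j)
                adj[j].add(i_new)
                break
            adj[i].add(j)                      # bridge: put the edge back
            adj[j].add(i)
    return len(components(adj)) == 1

# Assumed example: a 4-clique, two isolated vertices and one separate edge.
adj = {v: set() for v in range(8)}
for a, b in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (6, 7)]:
    adj[a].add(b)
    adj[b].add(a)
print(connect(adj, random.Random(1)), len(components(adj)))
```

Each successful rewiring merges two components and uses up one cycle of the donor component, so the procedure can only succeed if the graph has at least n − 1 edges.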
In practice, even for our largest system sizes, the preceding algorithm con-
verges in a few iterations. The number of edges needed to be added never
exceeds a few percent of m, and this addition is made with the greatest pos-
sible randomness; hence, we believe the essential network structure of the ER
model is conserved.

3.2 Square Lattice

We use square lattices with periodic boundary conditions. We have n vertices
spread out regularly on an L × L grid such that the vertex with coordinates
(x, y), 1 ≤ x, y ≤ L, is connected to (x, y + 1), (x + 1, y), (x, y − 1), (x − 1, y)
(if x = 1, we formally let x − 1 = L, if x = L we let x + 1 represent 1; and
correspondingly for y).
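As a small sketch of the periodic boundary rule, the helper below (a hypothetical name) returns the four neighbors of a vertex, using the same 1-based coordinates as the text.

```python
def lattice_neighbors(x, y, L):
    """Neighbors of (x, y), with 1 <= x, y <= L, on a periodic L x L square lattice."""
    def wrap(c):
        return (c - 1) % L + 1        # maps 0 to L and L + 1 to 1
    return [(x, wrap(y + 1)), (wrap(x + 1), y), (x, wrap(y - 1)), (wrap(x - 1), y)]

print(lattice_neighbors(1, 1, 4))
```

For L ≥ 3 every vertex has four distinct neighbors, which gives the average degree of four used for all network models.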

3.3 BA Model

The networks with a power-law degree distribution are constructed as follows
(with our parameter settings). Start with one vertex connected to two
degree-one vertices. Iteratively add vertices connected to two other vertices. Let the
probability of connecting the new vertex to a vertex i already present in the
network be proportional to ki (preferential attachment).

3.4 HK Model

The Holme–Kim model is a modification of the BA model to give the network
a higher number of triangles. When edges are added from the new vertex
to already present vertices, the first edge is added to an existing node i by
preferential attachment. The second edge is added to one of i’s neighbors,
forming a triangle.
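The BA rule of Section 3.3 and the HK rule above differ only in how the second edge is placed, as the following sketch shows. It is an illustration under assumptions: preferential attachment is implemented by drawing from a list in which every vertex appears once per incident edge, and in the BA case the second target is redrawn until it differs from the first. The function name and the `triangles` flag are hypothetical.

```python
import random

def grow_network(n, rng, triangles=False):
    """Grow an n-vertex network; triangles=False follows the BA rule, True the HK rule."""
    adj = {0: {1, 2}, 1: {0}, 2: {0}}     # one vertex connected to two degree-one vertices
    ends = [0, 1, 0, 2]                   # every vertex appears once per incident edge
    for new in range(3, n):
        first = rng.choice(ends)          # preferential attachment: P(i) proportional to k_i
        if triangles:
            second = rng.choice(sorted(adj[first]))   # a neighbor of `first` closes a triangle
        else:
            second = first
            while second == first:        # redraw to avoid a double edge
                second = rng.choice(ends)
        adj[new] = {first, second}
        for old in (first, second):
            adj[old].add(new)
            ends += [new, old]
    return adj

ba = grow_network(50, random.Random(0))
hk = grow_network(50, random.Random(0), triangles=True)
print(sum(len(nb) for nb in ba.values()) / len(ba))   # average degree, close to four
```

Both variants add exactly two edges per new vertex, so the average degree approaches four as n grows.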

4 Numerical Results

We study the search schemes on the four different network topologies numer-
ically. We use 100 independent networks and 100 different s–t-pairs for every
network. The network sizes range from n = 16 to n = 16,384.
In Fig. 2 we display the average search times as a function of system size
for our simulations. The most conspicuous feature is that the ASD scheme is
always, by far, the most efficient. While ASU and Deg are close to the least
efficient method (Rnd), ASD is rather close to the theoretical limit (equal to
the average distance d̄, the upper border of the shaded areas in Fig. 2). To be
more precise, for ASD the ratio of τ̄ to the average distance is quite constant, at
about two. The other search schemes (ASU, Deg and Rnd) follow faster increasing
functional forms. For the square lattice, these three schemes increase, approx-
imately proportional to n (the analytical value for two-dimensional random

Fig. 2. The average search time τ̄ as a function of the graph sizes n (horizontal
axis: network size; vertical axis: transit time τ̄; curves: ASU, ASD, Deg and Rnd).
In all panels, we display data for the different indexing and search schemes. The
shaded areas are unreachable (corresponding to τ̄ values smaller than the theoretical
minimum, the average distance d̄). The different panels correspond to the modified
ER model, square grid, BA model and HK model networks, respectively. Error bars
would have been smaller than the symbol sizes.

walks) whereas for ASD, τ̄ scales like distances in square grids, $n^{1/2}$. One way
of interpreting this result is to say that while ASD manages to find the root
as fast as it finds the destination from the root, ASU fails to find t faster than
a random search. The slow downward performance of ASU is not unexpected.
The r–t-search in ASU only differs from a random depth-first search in that it

Fig. 3. A worst-case scenario for navigating from s to r with the ASD indexing and
search scheme. A packet from n − 2 to 1 will travel along the perimeter to 3 and then
move towards the center.

does not search further than the level of the destination, and that it restricts
the search space to half its original size by dividing the vertices into odd and
even indices. The fast upward search of ASD is more surprising. In Fig. 3
we show a network where ASD performs badly. The average time to search
upwards is $(n^2 + 20n - 13)/(8n) \to n/8$ as $n \to \infty$. The downward search takes
$3(n - 1)/(2n) \sim 3/2$, giving a total expected value of τ̄ ∼ n/8. This can be
compared to the average distance $\bar{d} = 3 - 21/(4n) + 2/n^2 \sim 3$. For this example,
τ̄ and d̄ diverge in a way not seen in the network models. Why is the search so
much faster in the model networks? One point is that the worst-case indexing
seen in Fig. 3 is very unlikely. Since the spokes would be sampled randomly,
the chance that a vertex at the perimeter does not find r in two steps is 1/2,
the probability that it finds r in 3 steps is 1/4, and so on. Continuing this
calculation, a vertex at the perimeter reaches r in $2\sum_k k\,2^{-k} + 2 \sim 6$ time steps
(using $\sum_{k \ge 1} k\,2^{-k} = 2$), giving τ̄ ∼ 5, not too far from the observed τ̄ /d̄ ∼ 2. We note, however, that
for the model networks many other factors that are not present in the wheel
graph of Fig. 3 affect τ̄ . For example, the high density of short triangles in the
HK model networks will introduce many edges between vertices of the same
level in T (G), which will affect the search efficiency.
τ̄ grows approximately linearly with n for the ASU, Deg and Rnd schemes on all network
models. The slopes of these curves are, however, a little different. First, the
Deg method is more efficient (compared to ASU and Rnd) for BA networks
than for the modified ER model. This observation (also made in Ref. [1])
can be explained by the skewed degree distribution in the BA-network—the
packet reaches high-degree vertices quickly. The packet can see a large part
of the network from these hubs, and is therefore more likely to see t. More
interesting, perhaps, is the observation that ASU is more efficient for the
networks with a higher density of short cycles (the square lattice and HK
models). A rough explanation is that the partition procedure of ASU cuts off
many edges between vertices at the same distance from r. Since there are many
such edges in these network models, the network will effectively be sparser
(without changing G’s diameter), which results in a better performance.
5 Discussion

We have investigated navigation in valued graphs, and more specifically in
indexed graphs, that is, graphs where every vertex is associated with a unique
number in the interval [1, n]. These indices can be assigned to facilitate the packet
navigation. The packets are assumed to have no a priori knowledge about the
network, except the neighborhoods of their current positions, but memory
enough to perform a depth-first search. We find that one of our investigated
methods, ASD, is very efficient for four topologically very different network
models. The searches with the ASD scheme are roughly twice as long as the
shortest paths (scaling in the same way as the average distance).
Navigation on indexed graphs has applications in distributed information
systems. If, specifically, the amount of information that can be stored at the
vertices were limited, search strategies such as ours would be useful. One such
system is the Autonomous System level Internet, where the information stored
at each vertex (with the current protocols) increases at least as fast as the
networks themselves. For most real-world applications (other examples being
ad hoc networks [4] or peer-to-peer networks [6, 7, 12]) there are additional
constraints so that the algorithms of this paper cannot immediately be ap-
plied. Such networks are typically changing over time, so ideally it should be
possible to extend the indexing “on the fly” as vertices and edges are added
and deleted from the network. Apart from this, a future direction for research
on indexed graphs is to improve the performance of the algorithms presented
in this work. There might be a fast search-tree-based algorithm that neither
finds the shortest path to the root, nor finds the shortest way to the destina-
tion. For some network topologies there might be faster algorithms that are
not based on constructing a spanning tree. Consider, for example, modular
networks [11] (i.e. networks with tightly connected subgraphs that are only
sparsely interconnected) in which the search can be divided into two stages—
first find the cluster of the destination, then the destination. These two stages
should be reflected in a fast navigation algorithm.

Acknowledgments

PH acknowledges financial support from the Wenner-Gren Foundations,
the Swedish Foundation for Strategic Research and the National Science
Foundation (grant CCR–0331580).

References
1. L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman. Search in
power-law networks. Phys. Rev. E, 64:046135, 2001.
2. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science,
286:509–512, 1999.
3. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to
Algorithms. 2nd edition. The MIT Press, Cambridge MA, 2001.
4. C. de Morais Cordeiro and D. P. Agrawal. Ad Hoc & Sensor Networks: Theory
and Applications. World Scientific, Hackensack, NJ, 2006.
5. P. Erdős and A. Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297,
1959.
6. N. Ganguly, L. Brusch, and A. Deutsch. Design and analysis of a bio-inspired
search algorithm for peer to peer networks. In O. Babaoglu, M. Jelasity,
A. Montresor, C. Fetzer, and S. Leonardi, editors, Self-star Properties in Com-
plex Information Systems, pages 358–372, Springer-Verlag, New York, 2007.
7. G. Ghoshal and M. E. J. Newman. Growing distributed networks with arbitrary
degree distributions. European Physical Journal B, 59:75, 2007.
8. P. Holme and B. J. Kim. Growing scale-free networks with tunable clustering.
Phys. Rev. E, 65:026107, 2002.
9. B. J. Kim, C. N. Yoon, S. K. Han, and H. Jeong. Path finding strategies in scale-
free networks. Phys. Rev. E, 65:027103, 2002.
10. J. M. Kleinberg. Navigation in a small world. Nature, 406:845, 2000.
11. M. E. J. Newman and M. Girvan. Finding and evaluating community structure in
networks. Phys. Rev. E, 69:026113, 2004.
12. N. Sarshar, P. O. Boykin, and V. P. Roychowdhury. Percolation search in power
law networks: Making unstructured peer-to-peer networks scalable. In Proceedings
of Fourth International Conference on Peer-to-Peer Computing, pages 2–9. IEEE,
2004.
13. P. Sen. A novel approach for studying realistic navigations on networks. J. Stat.
Mech., page P04007, 2007.
14. H. Zhu and Z.-X. Huang. Navigation in a small world with local information. Phys.
Rev. E, 70:036117, 2004.
Evolution of Apache Open Source Software

Haoran Wen, Raissa M. D’Souza, Zachary M. Saul, and Vladimir Filkov

University of California, Davis CA 95616, USA;
hrwen@ucdavis.edu, rmdsouza@ucdavis.edu, zmsaul@ucdavis.edu,
vfilkov@ucdavis.edu

1 Software: A General Paradigm for Network Systems?


Our modern infrastructure relies increasingly on computation and computers.
Accompanying this is a rise in the prevalence and complexity of computer
programs. Current software systems (composed of an interacting collection
of programs, functions, classes, etc.) implement a tremendous range of func-
tionality, from simple mathematical operations to intricate control systems.
Software systems are inherently extendable and tend to gain new functionality
over time. Modern computers and programming languages are Turing com-
plete and, thus, capable of implementing any computable function no matter
how complex. The interdependencies between the elements of a software sys-
tem form a network, and, therefore, we believe software systems can provide
useful prototypic examples of how to build complex networked systems which
require minimal maintenance, are robust to bugs and yet are readily extendable.
Thus we ask: What makes for good design in software systems?
We are particularly interested in open source software (OSS)—software
with source code that is freely available for download and modification. A
typical OSS project is a collaborative effort by volunteers, with no central
authority assigning development tasks. Instead individuals, or self-organized
teams of developers, fix bugs and maintain and extend the code. In OSS,
modularity is essential [1, 2], and remarkably, the software resulting from an
OSS process can rival or even surpass the quality of commercial software [3, 4].
Software systems are always evolving, responding to user demands for “bug
fixes” and new features. Invariably, systems grow in size and complexity, even-
tually becoming difficult to parse, maintain and extend further. In response
to this, developers refactor their systems [5], streamlining and restructuring
the entire code base. Thus there are several strong analogies between OSS
systems and biological systems. Both classes of systems are inherently modu-
lar, readily evolvable, must be robust to anomalies and experience periods of
punctuated equilibrium [6]. Yet high-confidence data on the structure of OSS,
unlike data on biological networks, is easily obtained for minimal cost.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_12,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
200 H. Wen et al.

We analyze a series of 50 monthly snapshots of the function call graph of
the Apache 2.0 HTTP Server (called Apache herein). Apache is the most
popular web server on the Internet, and has been since 1996 [7]. It is a mature,
well-established OSS project managed by a group of volunteers worldwide; to
date, hundreds of users have contributed to the code base. Apache is written
in the C programming language, a procedural language. The basic elements
are functions that explicitly invoke one another through function calls which
express the command flow of the program/system. In object-oriented systems,
in contrast, the software networks are made of edges representing abstract
relationships between objects, such as inherits, invokes, etc.
Motivated by advances in network science, we first analyze a collection of
measures on global properties of the Apache call graphs. Certain measures
behave consistently, and we quantify their baselines. Moreover, we find that
punctuated changes in these global measures can signal the points at which a
more detailed, fine-grained examination of code structure is required. Jumps
in global properties can indicate major refactorings, but can also result from
restructuring just a few functions (which can radically reduce interdependencies).
We then turn our focus to a bottom-up approach, studying how observable
attributes of the Apache call graph interact using exponential random graph
models. Ultimately, by coupling top-down and bottom-up approaches, we want
to extract how code is restructured over time to achieve better design. As a ma-
ture project, Apache is more in “maintenance” than in growth mode, and the
details of changes can be subtle. Yet, these changes may be especially impor-
tant given that a major expense associated with software is maintenance [8].
Interest in OSS spans multiple communities, from software engineering, to
network science, to economics and organizational behavior. Raymond’s sem-
inal work [1] is an excellent review of the latter, contrasting the “cathedral”
organization of proprietary software to the open “bazaar” nature of OSS.
Perhaps the first work to consider software systems as complex networks was
that by Valverde, Ferrer Cancho and Solé [9], in which they show that soft-
ware collaboration graphs have “scale-free” properties which may result from
optimal design. Shortly thereafter, Myers conducted a detailed investigation
of software collaboration graphs [10], quantifying many features we discuss
herein. Both [9] and [10] focus primarily on object-oriented software (unlike
Apache, which is procedural software), looking at one time snapshot of the col-
laboration network between classes and objects for several different software
systems. Similar to MacCormack, Rusnak and Baldwin [11], we are interested
in tracking the evolution of a software system, focusing on the function call
graph. In [11], their interest is in understanding the impact of managerial
organization on resulting software structure (primarily the modularity).
This manuscript is organized as follows. In Section 2 our data set is de-
fined. Section 3 presents the top-down approach, studying evolution of global
measures. Section 4 presents the bottom-up approach to understanding the
relative importance of measures of structure via statistical modeling. Section 5
contains the discussion and conclusions.
Fig. 1. The “5-core” of Apache in November 2005. Each node is a function, with size
indicating its relative length in lines of code. Each directed edge is a function call.

2 The Apache Call Graphs


We analyze the evolution of Apache for a 50-month period using call graph
snapshots taken at one month intervals from October 2001 to November 2005.
Each monthly call graph was created via a two-step process (see [12] for more
details). First, the source was checked out from the Apache Concurrent Ver-
sions System (CVS) repository (for that month) along with matching versions
of both the compiler (and associated tools) and the libraries used by Apache
(e.g., the Apache Portable Runtime). Then, the call graph was extracted us-
ing CodeSurfer [13], a proprietary source code analysis tool. The resulting call
graphs are directed graphs where the nodes are functions, and each edge rep-
resents an explicit call from its source node to its target node. The CodeSurfer
tools extract all explicit function calls, including those to functions in libraries.
The resulting call graphs are extremely interconnected. In November 2005,
there were 2909 nodes and 8284 edges (average node degree of 5.7). The largest
connected component contains all but 72 nodes, while the second largest com-
ponent has only 12 nodes. Figure 1 is a subgraph showing the k-core [14, 15]
at k = 5 for Apache functions (excluding library calls).
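A k-core can be obtained by repeatedly deleting vertices of degree smaller than k. The sketch below illustrates this peeling procedure on an undirected graph given as a dict of neighbor sets. Treating the call graph as undirected is an assumption made here for illustration; the text does not state how edge directions were handled, and `k_core` is a hypothetical helper name.

```python
def k_core(adj, k):
    """Return the vertex set of the k-core by repeatedly deleting vertices of degree < k."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    queue = [v for v in adj if degree[v] < k]
    while queue:
        v = queue.pop()
        if v not in alive:
            continue                  # already deleted via an earlier queue entry
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    queue.append(u)
    return alive

# Assumed example: a 4-clique on vertices 0..3 with a two-vertex tail 0 - 4 - 5.
adj = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0, 5}, 5: {4}}
print(sorted(k_core(adj, 3)))
```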

3 Evolution of Apache: Global Measures


3.1 Nodes and Edges
The most basic constituents of the Apache call graph network are the func-
tions (i.e., nodes) and function calls (i.e., edges). We denote the number of
functions and calls at a given time by, respectively, N (t) and E(t). Figure 2

Fig. 2. (Left) Evolution of the number of functions N (left-hand axis) and the number
of function calls E (right-hand axis) during the 50 month period. (Right) E as a
function of N since the first stable release of Apache 2.0 in May 2002 through Nov
2005 (months 8–50). Dots are individual data points. The line is the best fit, $E \sim N^{1.18}$.

(left) shows their evolution over the entire 50-month period. Our first evidence
for a restructuring of the code is observed during the fourth and the
fifth months, when there is a dramatic decrease in N , of approximately 250
functions, accompanied by a much smaller decrease in E, of approximately 75
function calls. Thus the average degree (which scales as E/N ) increases dramatically during
this period. Investigating the Apache release history [16], we find that this pe-
riod (from 2002-1-1 to 2002-2-1) marks the transition from the second to the
third beta release of Apache 2.0. According to the release logs, approximately
130 changes were made to the code, with ten of these changes being the addi-
tion of new features. The bulk of the remaining changes were “bug” fixes along
with a few performance improvements. The functionality of the system was
enhanced during a period where the number of functions decreased. We as-
sume redundancy in functions was eliminated, while “functionality” (perhaps
more closely related to number of edges) was preserved and enhanced.
The first stable (non-beta) releases of Apache 2.0 were issued shortly there-
after, in April and May 2002. From there on, the relationship between E and
N is extremely consistent as shown in Fig. 2 (right). We find that $E \sim N^{1.18}$.
Remarkably, Valverde and Solé find almost identical scaling, of $E \sim N^{1.17}$,
for a collection of 80 object-oriented systems [17], where N is the number of
classes and E is the total number of edges, with each edge representing a
relationship between classes. This suggests some universal trend in software
systems.
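A scaling exponent of this kind can be estimated by a straight-line fit on logarithmic axes. The sketch below is one plausible way to do so, not necessarily the fitting procedure used for Fig. 2: it performs an ordinary least-squares regression of log E on log N. The function name and the synthetic data (including the prefactor 0.65) are assumptions for illustration only.

```python
import math

def fit_power_law(ns, es):
    """Least-squares fit of log E = log c + alpha * log N; returns (c, alpha)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in es]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - alpha * mx), alpha

# Synthetic data that follow E = 0.65 * N^1.18 exactly (not Apache measurements).
ns = [2400, 2600, 2800, 3000]
es = [0.65 * n ** 1.18 for n in ns]
c, alpha = fit_power_law(ns, es)
print(round(c, 3), round(alpha, 3))
```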

3.2 Degree and Degree Distribution

The degree of a function conveys much information, and it is important to
distinguish in-degree (being called) from out-degree (calling another
function). In-degree is a measure of code reuse, and functions with high in-degree

Fig. 3. (Left) In-degree distribution and (right) out-degree distribution p(k) for the
first month (2001-10-1) and final month (2005-11-1). The dashed line is the best fit
functional form for the final month: (left) $p(k) = 0.55 \cdot k^{-1.84}$, and (right)
$p(k) = (2\pi\sigma^2 k^2)^{-1/2} \exp[-(\ln k - \mu)^2 / (2\sigma^2)]$, with μ = 0.75 and σ² = 0.93.

are information producers. Nodes of high out-degree are information
consumers/brokers, consolidating information from many external sources. In the
Apache call graphs the largest observed in-degree is approximately 200, while
the largest out-degree is approximately 30. Due to these differences, we ex-
amine in- and out-degree independently.
One of the most investigated aspects of “complex networks” is their de-
gree distributions, found to exhibit extreme heterogeneity, with node degrees
spanning decades of range. Here too, we find such broad-scale features.
Figures 3 (left) and (right) show respectively the in- and out-degree distributions
for the first and the last of the 50 months investigated, where p(k) is the fraction of
nodes observed with degree k.
Following [18], we assess the best fit to the data between power-law, log-
normal and stretched-exponential distributions, using a weighted least squares
fit. The weight given to each data point reflects inversely how much uncer-
tainty there is in that point (more uncertainty in the tail where the values
are much smaller). The quality of a fit between the set of data points {h_i},
measured at values {x_i}, and a function f is quantified as

$$Q = \sum_{i=1}^{k} \frac{1}{h_i} \, [h_i - f(x_i)]^2$$

with a smaller Q being better. We find that, for in-degree, a power law pro-
vides the best fit for each of the 50 months with Q ≈ 0.04. Fitting a log-
normal distribution to in-degree gives Q ≈ 0.08, and stretched-exponential
gives Q ≈ 0.15. For out-degree, log-normal provides the best fit for all 50
months with Q ≈ 0.02. A stretched-exponential distribution gives the next
best fit with Q ≈ 0.06, and a power-law fit is the worst, with Q ≈ 0.16.
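The quality measure Q is straightforward to evaluate once a candidate function has been chosen. The sketch below is a minimal illustration: it assumes that bins with h_i = 0 are skipped (the text does not say how empty bins are treated), and the data are synthetic values generated from the power law quoted in Fig. 3, not the Apache measurements.

```python
import math

def fit_quality(h, x, f):
    """Q = sum_i [h_i - f(x_i)]^2 / h_i over bins with h_i > 0; smaller is better."""
    return sum((hi - f(xi)) ** 2 / hi for hi, xi in zip(h, x) if hi > 0)

def power_law(k):
    return 0.55 * k ** -1.84          # form quoted for the in-degree fit

def exponential(k):
    return 0.55 * math.exp(-k)        # arbitrary competing form, for comparison only

ks = list(range(1, 21))
hs = [power_law(k) for k in ks]       # synthetic "observations"
print(fit_quality(hs, ks, power_law), fit_quality(hs, ks, exponential) > 0)
```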
There are small, almost indiscernible changes to the distributions over
the 50 months. For in-degree we find the exponent of the best fit power law
slowly decreases from γ ≈ 1.9 to γ ≈ 1.84, reflecting that the maximum values
of in-degree slowly increase with time. For out-degree, the parameter μ (the mean of ln k) of
the best fit log-normal distribution slowly increases from μ ≈ 0.64 to μ ≈ 0.75.
However, the shapes of both the in- and out-degree distributions (power-
law and log-normal, respectively) are global properties which are established
before our data sampling begins and remain invariant throughout.

3.3 Dependencies, Visibility and Propagation Cost

A simple call graph is shown in Fig. 4 (left). The corresponding dependency
(or adjacency) matrix, Fig. 4 (right), captures the complete call graph
information. Matrix element Mij = 1 if function i calls function j, and is zero
otherwise. As edges are directed, M is not symmetric about the diagonal.
The dependency matrix captures direct dependencies. However, indirect
dependencies are also important. For instance, as shown in Fig. 4 (left), a
change in function C could potentially destroy or change the functionality
implemented by A. Changing function F also has indirect impact on A. Yet,
it is less direct, as the shortest path between A and F is of length 3, whereas
the shortest path between A and C has length 2. We can quantify these
indirect dependencies as a function of path length using the “reachability
matrix” [19] and the related “visibility matrix” [20]. The reachability matrix
at path length d, is denoted R(d). Matrix element R(d)ij = 1 if there is a path
of exactly length d connecting function i to j. Note the convenient relationship,
R(d) = M^d (with nonzero entries set to 1), where M is the direct dependency matrix. The visibility matrix
at distance d, denoted V(d), is the binary sum of the reachability matrices,
V(d) = R(1) ∨ R(2) ∨ · · · ∨ R(d) = M^1 ∨ M^2 ∨ · · · ∨ M^d , where the operator
“∨” (logical or) is equivalent to the binary sum. V(4) for our simple call graph
example is shown in Fig. 5. Matrix element V(d)ij = 1 if there is a path of
length less than or equal to d connecting function i to j. Note we assume
V(d)ii = 1, i.e., functions are visible to themselves.

Dependency Matrix
    A B C D E F
A   0 1 0 1 0 0
B   0 0 1 0 0 0
C   0 0 0 1 0 0
D   0 0 0 0 1 0
E   0 0 0 0 0 1
F   0 0 0 0 0 0

Fig. 4. (Left) A simple call graph, with calls A→B, A→D, B→C, C→D, D→E and
E→F. (Right) The equivalent dependency matrix.

Visibility Matrix with n=4


A B C D E F
A 1 1 1 1 1 1
B 0 1 1 1 1 1
C 0 0 1 1 1 1
D 0 0 0 1 1 1
E 0 0 0 0 1 1
F 0 0 0 0 0 1

Fig. 5. V(4), the visibility matrix up to path length d = 4 for the simple call graph
in Fig. 4 (Left).

3.3.1 Propagation Cost

The propagation cost (PC) was introduced in [11] as a scalar value to quantify
the extent of indirect dependencies in a network. It is defined as the number
of 1’s in V(4) divided by $N^2$ (the total number of 1’s possible). In other
words, PC is the number of pairs of functions connected by a path of length
less than or equal to 4, divided by the number of all possible pairs. We find
that changes in PC (a global variable) can be useful indicators of important
small-scale changes in the code base. Note that we also analyze PC for V(5),
but get almost identical results.
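For the six-function example of Fig. 4, the visibility matrix and the propagation cost can be computed directly from the definitions. The following sketch uses plain nested lists and Boolean products; the function names are hypothetical, and the approach is meant for small illustrative graphs rather than as the implementation used for the Apache data.

```python
def visibility(M, d):
    """Boolean V(d): V[i][j] = 1 if j is reachable from i in at most d calls (V[i][i] = 1)."""
    n = len(M)
    V = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(d):
        # Extend every known path by one more call: V <- I or (V times M), Boolean.
        V = [[int(i == j or any(V[i][k] and M[k][j] for k in range(n)))
              for j in range(n)] for i in range(n)]
    return V

def propagation_cost(M, d=4):
    """Fraction of ordered pairs (i, j) with V(d)_ij = 1."""
    V = visibility(M, d)
    return sum(map(sum, V)) / len(M) ** 2

# Dependency matrix of Fig. 4, rows and columns ordered A..F.
M = [[0, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 0, 0],
     [0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0, 0]]
for row in visibility(M, 4):
    print(row)
print(propagation_cost(M))
```

The printed matrix reproduces Fig. 5, and the propagation cost of this small example is 21/36, since 21 of the 36 entries of V(4) equal one.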
Figure 6 shows the evolution of PC, along with that of N , for the 50
months of Apache data. The baseline behavior indicates an inverse relation-
ship (as N increases PC decreases and vice versa). There is only one re-
gion that violates this trend, encompassing months 24 to 33. Removing these
months from consideration, we see an extremely consistent relation between
PC and N , as shown in Fig. 6 (right), namely $PC \sim N^{-0.70}$. The first anomalous
event which does not conform to this scaling relationship is month 24
(September 2003), when N decreases slightly yet PC jumps disproportion-
ately. The second anomalous event is from months 33 to 34 (June 2004 to July
2004), when PC drops dramatically while N remains essentially constant.
No other global property discussed herein shows marked changes in this
time frame, not even during the second anomaly which is most dramatic. N
and E are both essentially invariant (see Fig. 2). The degree distribution is
invariant, and the average clustering coefficient is invariant.
We attempt to isolate what changes in the details of Apache are responsible
for these two anomalous events. Motivated by findings in [10], which suggest
that functions with simultaneously high in- and high out-degree are
particularly problematic, we isolate functions whose in- or out-degree changed
during the time frame of interest. Functions with simultaneously high in-degree
and out-degree have a tremendous amount of upstream and downstream
dependencies. They are simultaneously information consumers and information
producers. Figure 7 (left) is a scatterplot of in-degree versus out-degree on
June 2004 (open circles) and July 2004 (filled circles), including only
functions with changes in these quantities.

Fig. 6. (Left) Propagation cost (left-hand axis) and N (right-hand axis) as functions
of time in months. (Right) PC as a function of N since the first stable release of
Apache 2.0, with anomalous months (23 through 34) removed. We find that
$\mathrm{PC} \sim N^{-0.70}$.

Fig. 7. (Left) Scatter plot of in-degree and out-degree, using log-log scale, with only
functions whose degree changed in this time period shown (legend: 2004-6-1 and
2004-7-1). (Right) Propagation cost over time. Top line is for the entire system.
Bottom line is the resulting PC if the two functions indicated in (left) are removed,
denoted “w/o 2” in the legend.

Circled in Fig. 7 (left) are two suspicious functions. They have high
in-degree (of 33 and 34) and reasonably high out-degree (of 5 and 4) in June
2004. They maintain the in-degree but drop, as indicated, to an out-degree
of one in July 2004. We remove these two functions (and their edges) from
the call graph for each of the 50 months and plot the resulting evolution of
PC as shown in Fig. 7 (right). The top line is the same as Fig. 6 (left), PC for
the entire system. The bottom line is the resulting PC with the two functions
removed. We no longer see the anomalous behavior and recover the baseline
behavior $\mathrm{PC} \sim N^{-0.70}$ shown in Fig. 6 (right).
These functions (apr_thread_mutex_lock and apr_thread_mutex_unlock)
are members of the Apache Portable Runtime layer that implements function-
ality related to multithreading. Investigating the detailed commit logs written
by developers [21], we find that on August 7, 2003 (between months 23 and
24) attempted “bug” fixes to these two functions were made, with accompa-
nying comments indicating a history of problems with these two functions. On
June 4, 2004 (between months 33 and 34) these two “racy/broken” functions
were dropped from the code entirely and replaced with lower-level system
library calls.
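The screening step used in this section, keeping only functions whose in- or out-degree changed between two monthly snapshots, can be sketched as follows. The edge lists and function names here are invented for illustration and do not come from the Apache data; the helpers `degrees` and `changed_functions` are hypothetical names.

```python
from collections import Counter

def degrees(edges):
    """Return {function: (in_degree, out_degree)} for a list of (caller, callee) pairs."""
    indeg, outdeg = Counter(), Counter()
    for caller, callee in edges:
        outdeg[caller] += 1
        indeg[callee] += 1
    return {f: (indeg[f], outdeg[f]) for f in set(indeg) | set(outdeg)}

def changed_functions(edges_before, edges_after):
    """Map each function whose (in, out) degree differs to its (before, after) degrees."""
    before, after = degrees(edges_before), degrees(edges_after)
    changed = {}
    for f in set(before) | set(after):
        b = before.get(f, (0, 0))   # functions absent from a snapshot have degree (0, 0)
        a = after.get(f, (0, 0))
        if a != b:
            changed[f] = (b, a)
    return changed

# Invented snapshots (assumption): a locking routine loses one of its callees.
snapshot_before = [("main", "lock"), ("worker", "lock"), ("lock", "spin"), ("lock", "yield")]
snapshot_after = [("main", "lock"), ("worker", "lock"), ("lock", "sys_lock")]
changed = changed_functions(snapshot_before, snapshot_after)
for name, (b, a) in sorted(changed.items()):
    print(name, "before (in, out) =", b, "after (in, out) =", a)
```

Plotting the `before` and `after` pairs on log-log axes would give a picture of the kind shown in Fig. 7 (left).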

3.4 Path Lengths, Clustering Coefficient and “Small Worlds”

A simple example call graph is given in Fig. 4 (left). There are directed paths
connecting various functions. For instance, function A is connected to func-
tion F via two paths, one of length 3 and one of length 5, where length is
measured by number of hops in the call graph. The path of length 3 is ob-
viously the shortest path connecting A and F . We consider all such pairs of
functions which are connected by a directed path and calculate the short-
est path between them. The fraction of shortest paths of a specified length
(i.e., the normalized distribution) is shown in Fig. 8 (left), for the first month
(October 2001) and the final month (November 2005) of our study. Similar
distributions result for all 50 months, with the typical shortest path of length
between 4 and 5, and the largest shortest path (i.e., the graph diameter) of
length 14.
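The normalized shortest-path distribution described above can be computed with one breadth-first search per function. The sketch below uses a hypothetical six-function call graph, chosen so that A reaches F by paths of length 3 and 5; it is not claimed to be the graph of Fig. 4, and the function name is illustrative.

```python
from collections import Counter, deque

def shortest_path_distribution(calls):
    """Fraction of connected ordered pairs (u, v), u != v, at each shortest-path length."""
    counts = Counter()
    for src in calls:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in calls.get(u, []):
                if v not in dist:            # first visit in BFS = shortest directed path
                    dist[v] = dist[u] + 1
                    counts[dist[v]] += 1
                    queue.append(v)
    total = sum(counts.values())
    return {length: c / total for length, c in sorted(counts.items())}

# Hypothetical toy call graph (caller -> list of callees); an assumption for illustration.
toy_calls = {"A": ["B", "D"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["F"], "F": []}
distribution = shortest_path_distribution(toy_calls)
diameter = max(distribution)   # largest shortest path among connected pairs
print(distribution, "diameter =", diameter)
```

Only pairs joined by a directed path enter the normalization, matching the convention stated in the text.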
We compare this distribution of shortest paths to those resulting from two
different random graph growth processes. First we consider an ensemble of 20
realizations of Erdős–Rényi random graphs [22, 23] with N = 2909 nodes and
E = 4142 undirected edges (equivalent to the N = 2909 nodes and E = 8284
directed edges in the November 2005 Apache call graph). Here we find the
typical shortest path is of length 7 or 8, much larger than for the Apache call
graphs. However, the diameter is comparable, ranging from length 14 to 16.

Fig. 8. (Left) Normalized shortest paths in Apache, first month (2001-10-1) and last
month (2005-11-1). (Right) Normalized shortest paths averaged over 20 realizations
of random networks with the exact in- and out-degree distributions of Apache on
November 2005 (axis label: skewness = 0.98206). The vertical axis “frequency” means
the fraction of shortest paths having that length.
The degree distributions of the Apache call graphs (see Fig. 3) are much
broader and more heterogeneous than the Poisson distribution which char-
acterizes Erdős–Rényi random graphs [22, 23]. Thus we next compare the
Apache graphs to random graphs constructed to match exactly the Apache
degree distribution by extending the ideas in [24, 25] to directed graphs. We
begin with N = 2909 nodes and map each one to a distinct node in Apache.
We assign to each of these new nodes the in- and out-degree of their cor-
responding Apache node. We do not yet specify the connectivity, only the
final degree. In other words, we assign unconnected half-edges. We next per-
form a random matching and pair up each in-degree half-edge with a different
out-degree half-edge chosen at random. We construct an ensemble of 20 such
random graphs. The resulting normalized shortest path distribution, averaged
over the full ensemble, is shown in Fig. 8 (right). Note that the typical path
length is much larger than for Apache, peaking at length 10, and the max-
imum shortest path is around 30. Matching degree distribution alone is not
enough to reproduce the shortest path lengths observed for Apache.
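The half-edge matching procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the text does not say how self-loops or repeated edges were treated, so this sketch simply allows them, and the degree sequences below are invented rather than taken from Apache.

```python
import random
from collections import Counter

def directed_configuration_model(in_deg, out_deg, rng):
    """Pair each out-degree half-edge with a randomly chosen in-degree half-edge.

    Returns a list of (source, target) edges. Self-loops and repeated edges
    are not filtered in this sketch (an assumption).
    """
    if sum(in_deg.values()) != sum(out_deg.values()):
        raise ValueError("total in-degree must equal total out-degree")
    out_stubs = [node for node, k in out_deg.items() for _ in range(k)]
    in_stubs = [node for node, k in in_deg.items() for _ in range(k)]
    rng.shuffle(in_stubs)                  # uniformly random matching of half-edges
    return list(zip(out_stubs, in_stubs))

# Invented degree sequences (assumption), one entry per node.
out_deg = {"A": 2, "B": 1, "C": 1, "D": 1, "E": 1, "F": 0}
in_deg = {"A": 0, "B": 1, "C": 1, "D": 2, "E": 1, "F": 1}
random_edges = directed_configuration_model(in_deg, out_deg, random.Random(42))
print(random_edges)
```

Every realization reproduces the prescribed in- and out-degree of each node exactly, so any difference in path lengths relative to the original graph cannot be attributed to the degree sequence.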
“Small world” networks are characterized by small diameters and large
clustering. We have established the small diameter above. Throughout the 50-
month period the average clustering coefficient, C, fluctuates in the range
0.09 < C < 0.099. Calculating C over an ensemble of corresponding Erdős–
Rényi random graphs yields C = 0.0018, and for the ensemble of random
graphs with the Apache degree distribution C = 0.023. The Apache call graphs
thus have the “small world” characteristics of short average path length and
relatively large clustering coefficient when compared to a comparable random
graph. Note that to measure C we temporarily assume the edges are undi-
rected. A more thorough treatment is presented in the next section, where
“transitive” triads are distinguished from “cyclic” triads. (Cyclic triads are
rarely seen in software, though transitive ones occur frequently.)
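As a rough illustration of the measurement of C with edges treated as undirected, the sketch below averages the local clustering coefficient over all nodes that appear in the edge list. Nodes with fewer than two neighbors are counted as having zero clustering; this convention is an assumption, since the text does not specify one.

```python
from itertools import combinations

def average_clustering(edges):
    """Average local clustering coefficient, ignoring edge direction."""
    nbrs = {}
    for u, v in edges:
        if u == v:
            continue                        # ignore self-loops
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    if not nbrs:
        return 0.0
    total = 0.0
    for node, ns in nbrs.items():
        k = len(ns)
        if k < 2:
            continue                        # contributes zero (assumed convention)
        links = sum(1 for a, b in combinations(ns, 2) if b in nbrs[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(nbrs)

# Small invented example: a triangle a-b-c with a pendant node d attached to c.
example_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
c_avg = average_clustering(example_edges)
print("average clustering:", c_avg)
```

Applying the same function to randomized graphs with matched size or degree sequence gives the kind of comparison values quoted in the text.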

4 Evolution of Apache: Models of Network Structure

We have made a number of empirical observations about the Apache call
graph using complex network measures, effectively obtaining a multifaceted
characterization of the graph. One can ask: How do these, and possibly other,
measures combine to tell the story of the whole Apache call graph? And in
general, to what extent is its structure determined by any given observations?
To answer these questions, here we present the statistical modeling ap-
proach of exponential random graph models (ERGMs), developed in recent
social network theory [26, 27] for understanding the relationships between a
large class of local network observations and the full network structure. This
bottom-up approach models the extent to which a set of specific observations
(e.g., counts of transitive triads) explains the global structure of a network
(e.g., the Apache call graph), and, in the process, determines which of the
observations best explain its structure. More specifically, given a set of ob-
servations, or explanatory variables, an ERGM models networks as random
samples from an exponential probabilistic space given by linear combinations
of those explanatory variables. Thus, given a network and fitted ERGM, one
can calculate the probability that the network is determined by those vari-
ables, via direct calculations. In practice, these models are very appealing, as
there exist methods for both model fitting (observation available) and simu-
lations (observations unavailable).
The advantage of the ERGM approach is that it is very general and scal-
able. The architecture of the graph is represented by the chosen set of ex-
planatory variables, which can describe either local or global features of the
network, and the values of the model parameters can be quite instructive, in-
dicating the relative importance of the explanatory variables to the maximum
likelihood probability density function (pdf). In addition, ERGMs have been
well studied, and theoretical results exist which can offer some understanding
of the model’s behavior in practice [27].

4.1 ERGM Theory

Here, we describe formally the ERGM statistical framework for modeling
networks, in particular as it pertains to modeling software call graphs. Let X
be a random variable representing the adjacency matrix of a software
network. The pdf for this random variable, $P(X = x)$, tells us the probability
that an observed graph, x, was drawn from X. Unfortunately, the pdf of
X is unknown and cannot be directly calculated. To estimate this pdf, let
$z(x) = (z_1(x), z_2(x), \ldots, z_r(x))$ be a vector of explanatory variables, where
each explanatory variable can be any function of the observed data. We
postulate that there exists $\theta = (\theta_1, \theta_2, \ldots, \theta_r)$ such that

$$\log P(X = x) \propto \theta_1 z_1(x) + \theta_2 z_2(x) + \cdots + \theta_r z_r(x) \propto \theta^{T} z(x). \qquad (1)$$

If we exponentiate both sides and divide by a normalizing constant, $\kappa(\theta)$,
ensuring that the probabilities will sum to one, we get the following model:

$$P(X = x) = \frac{e^{\theta^{T} z(x)}}{\kappa(\theta)}. \qquad (2)$$

This is the standard log linear probability model that is used in a wide range
of fields from the social sciences to biology [28, 29].
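For very small graphs, Eq. (2) can be evaluated exactly by enumerating every possible graph, which makes the role of the normalizing constant explicit. The sketch below does this for directed graphs on three nodes. The choice of explanatory variables (edge count and number of reciprocated pairs) and the parameter values are assumptions made for illustration only; they are not the variables fitted to Apache.

```python
import math
from itertools import product

def ergm_distribution(n, stats, theta):
    """Enumerate all directed graphs on n nodes (no self-loops) and return
    (graphs, probabilities) with P(x) = exp(theta . z(x)) / kappa(theta)."""
    dyads = [(i, j) for i in range(n) for j in range(n) if i != j]
    graphs, weights = [], []
    for bits in product((0, 1), repeat=len(dyads)):
        edges = frozenset(d for d, b in zip(dyads, bits) if b)
        z = stats(edges)
        weights.append(math.exp(sum(t * s for t, s in zip(theta, z))))
        graphs.append(edges)
    kappa = sum(weights)                     # normalizing constant kappa(theta)
    return graphs, [w / kappa for w in weights]

def edge_and_mutual(edges):
    """z(x) = (number of edges, number of reciprocated pairs); an illustrative choice."""
    mutual = sum(1 for (i, j) in edges if i < j and (j, i) in edges)
    return (len(edges), mutual)

# Assumed parameters: edges are discouraged, reciprocated pairs are favored.
graphs, probs = ergm_distribution(3, edge_and_mutual, theta=(-1.0, 0.5))
print("number of graphs:", len(graphs))
print("sum of probabilities:", sum(probs))
```

The enumeration visits 2^(n(n-1)) graphs, 64 for n = 3, so this direct approach is feasible only for toy examples; for a call graph with thousands of functions the sum defining κ(θ) cannot be computed this way.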
To create an ERGM, a set of explanatory variables (virtually any function
from the observed graph to the real numbers) is chosen by the modeler. The
choice of variables is based on the pertinent features of the graph under study,
or on a set of desired features, if the graphs are being simulated. An example,
non-exhaustive, set of explanatory variables is given in Table 1, most of which
are important for modeling the Apache call graph. The coefficients, θ, can be
interpreted as a preference of the observed network for a given explanatory
variable, if its coefficient is positive, and a preference against a variable, if it
is negative.

Table 1. Exponential random graph models are extremely flexible. This table shows
several example explanatory variables, identifying the variables by their names in the
statnet package for R [30].

Variable     Description
istar(k)     The number of k-tuples of edges that point to the same node in the network.
ctriad       The number of 3-cycles in the network.
ttriad       The number of two-edge paths for which there is a one-edge shortcut in the network.
triangle     The sum of ctriad and ttriad for the network.
idegree(k)   The number of nodes with exactly k incoming edges in the network.
odegree(k)   The number of nodes with exactly k outgoing edges in the network.
gwidegree    The sum of the counts of each in-degree, weighted by the geometric
             sequence $(1 - e^{-\theta_k})^i$, where $\theta_k$ is a decay parameter.
edges        The number of edges in the graph.
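Several of the statistics in Table 1 can be counted directly from an edge list. The sketch below follows the verbal descriptions in Table 1; it is not the statnet implementation, and the five-edge example graph is invented. Self-loops and repeated edges are assumed to be absent.

```python
from collections import Counter
from math import comb

def table1_statistics(edge_list, k=2):
    """Count edges, ttriad, ctriad, triangle, istar(k), idegree(k) and odegree(k)."""
    edges = set(edge_list)
    nodes = {u for e in edges for u in e}
    indeg = Counter(v for _, v in edges)
    outdeg = Counter(u for u, _ in edges)
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    # ttriad: two-edge path u -> v -> w together with the shortcut u -> w
    ttriad = sum(1 for u, v in edges for w in succ.get(v, ()) if (u, w) in edges)
    # ctriad: u -> v -> w -> u; each 3-cycle is found once per starting edge
    ctriad = sum(1 for u, v in edges for w in succ.get(v, ()) if (w, u) in edges) // 3
    return {
        "edges": len(edges),
        "ttriad": ttriad,
        "ctriad": ctriad,
        "triangle": ttriad + ctriad,
        "istar": sum(comb(indeg[n], k) for n in nodes),
        "idegree": sum(1 for n in nodes if indeg[n] == k),
        "odegree": sum(1 for n in nodes if outdeg[n] == k),
    }

# Invented example: one transitive triad (a, b, c) and one 3-cycle (b, c, d).
example = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "b")]
stats = table1_statistics(example, k=2)
print(stats)
```

A vector of such counts plays the role of z(x) in Eq. (1).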
Estimating θ based on an observed network is referred to as fitting the
model, while using a predetermined θ to generate networks is referred to as
simulating with the model. Given a set of explanatory variables, the best fit
to the observed network is given by the parameter vector θ which maximizes
the likelihood that the observation is drawn from the probability distribu-
tion given in Eq. 2. In this case, though, the standard maximum likelihood
method to estimate the parameters is difficult because the function for the
normalizing constant κ(θ) is not known a priori. Instead, one typically uses
Markov chain Monte Carlo maximum likelihood estimation (MCMC MLE),
a family of methods based on the Newton-Raphson MLE algorithm [26]. The
maximum likelihood formula for the pdf obtained via fitting can be used with
Markov chain Monte Carlo (MCMC) sampling methods to simulate networks.
There are a number of software packages available for MCMC MLE fitting.
These include the “statnet” package [31] for R [30] and the stand-alone SIENA
software [32].
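To illustrate what simulating with a fixed θ involves, the following is a minimal sketch of a generic Metropolis sampler that proposes toggling one directed edge at a time. It is not the algorithm implemented in statnet or SIENA, and the function names, the choice of a single edge-count statistic and the parameter values are all assumptions. The key point is that the acceptance ratio depends only on the change in θᵀz(x), so κ(θ) never has to be computed.

```python
import math
import random

def simulate_ergm(n, theta, stats, steps, rng):
    """Metropolis sampling of a directed graph on n nodes from P(x) ~ exp(theta . z(x))."""
    edges = set()
    z = stats(edges)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)       # random ordered pair with i != j
        proposal = set(edges)
        proposal ^= {(i, j)}                 # toggle the edge i -> j
        z_new = stats(proposal)
        log_ratio = sum(t * (a - b) for t, a, b in zip(theta, z_new, z))
        # Symmetric proposal: accept with probability min(1, exp(log_ratio)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            edges, z = proposal, z_new
    return edges

def edge_count(edges):
    """z(x) = (number of edges,); the simplest possible explanatory variable."""
    return (len(edges),)

# Assumed setting: 10 nodes and a mildly negative edge parameter (a sparse graph).
sample = simulate_ergm(10, (-1.5,), edge_count, steps=5000, rng=random.Random(0))
print("edges in sampled graph:", len(sample))
```

Recomputing all statistics at every step is wasteful; a practical sampler would update only the change caused by the toggled edge.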
In practice, one rarely knows which explanatory variables to choose to
fully describe a network using ERGMs. To determine whether a particular set of
explanatory variables models an observed network better than another, one
can use several different approaches. For example, the modeler can use the
fitted model to simulate a suite of networks and check how well the simulated
networks match the observed network on any measure of interest (e.g., the
degree distribution). Along this line, the “statnet” statistical package has a
built-in goodness-of-fit function which compares simulated networks to the
observed network on a set of such measures. Another approach for comparing
sets of explanatory variables is to use information-theoretic measures, like
the Akaike information criterion (AIC), to assess how well a model fits the
observed data. In addition to providing information on the goodness of fit, the
AIC (which penalizes more complicated models to protect against overfitting
[33]) can also be used to guide the search through the space of possible models,
helping to identify the best variables to include in the model as follows. If the
modeler suspects that a particular variable might be useful in modeling an
observed network, the AIC can be used to test this hypothesis by toggling
the variable in and out of the model, accepting the hypothesis if significant
improvement in the AIC is observed.

4.2 Modeling Process and Results

As an exploratory first step to our modeling process, we fit models made from
many of the possible combinations of a diverse set of explanatory variables
that we expect to be important in explaining the Apache call graph. We in-
clude the counts of connected triads (ctriad and ttriad, cf. Table 1) in many
of our exploratory models because these small connected graphs (graphlets)
may be important architecturally in many types of larger networks [34, 35].
However, we do not expect the ctriad graphlet to be helpful in modeling
software because it implies indirect recursion, an uncommon and difficult pro-
gramming technique, but we include it in our modeling process as a sanity
check. We also investigate various in- and out-degree counts because these
counts provide a local measure of the network’s topology. Further, the suc-
cess of in-degree count as an explanatory variable leads us to investigate the
related in-star variables. In previous modeling efforts [36], degeneracy in the
fitting algorithms was often observed for models using the variables above.
To circumvent such degeneracies it has become standard ERGM practice to
include the geometrically weighted in-degree distribution and the simple edge
count as variables in every model, and we do so here too.
We toggle the variables described above in and out of several models,
identifying variables that are important in fitting the Apache call graph to
a single representative month (June 2003). The results for several represen-
tative models are given in Table 2, together with the AIC for the model.
As expected, the AIC changes very little when ctriad is added to the basic
edges+gwidegree model, indicating the lack of importance of ctriad, but
the AIC improves significantly when the ttriad variable is added, showing us
that the tendency of Apache programmers to include layer-crossing function
calls is important in determining the global nature of the graph. Given these
results, we further refine our search, looking at many more models that in-
clude the ttriad variable, and we find that the out-degree and the higher
in-degree terms are less important than others we consider.

Table 2. The AIC for a sample of fitted models. Note: For space and readability, the
notation we use here to describe the models omits the θi parameter coefficient from
Eq. (1). Each term (separated by +) is a separate model predictor variable with its
own coefficient.

Model                                                               AIC
edges+gwidegree                                                  104090
edges+gwidegree+ctriad                                           104088
edges+gwidegree+ttriad                                           101473
edges+gwidegree+ttriad+odegree(2)                                100065
edges+gwidegree+ttriad+istar(3)                                   97723
edges+gwidegree+ttriad+idegree(2)                                 97589
edges+gwidegree+ttriad+istar(2)                                   94383
edges+gwidegree+ttriad+idegree(2)+idegree(3)+istar(2)             91017
edges+gwidegree+ttriad+idegree(2)+idegree(3)+istar(2)+istar(3)    89491

Table 2 allows us to see the variables that are important to the AIC and,
hence, are better at predicting the topology of the Apache call graph. For
example, it is interesting that the out-degree of a function is less important to
the global topology than the in-degree, indicating that the emergent structure
of the call graph is more dependent on how many times each function is called
than on how many dependencies they have, which is in line with the findings
in Section 3.
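The comparisons drawn from Table 2 amount to differences in AIC between nested models. The short sketch below tabulates those differences using the values reported in Table 2; the helper name `aic_gain` is hypothetical, and a lower AIC is taken to indicate a better model.

```python
# AIC values copied from Table 2.
aic = {
    "edges+gwidegree": 104090,
    "edges+gwidegree+ctriad": 104088,
    "edges+gwidegree+ttriad": 101473,
    "edges+gwidegree+ttriad+odegree(2)": 100065,
    "edges+gwidegree+ttriad+istar(3)": 97723,
    "edges+gwidegree+ttriad+idegree(2)": 97589,
    "edges+gwidegree+ttriad+istar(2)": 94383,
}

def aic_gain(base, extended):
    """Drop in AIC when moving from `base` to `extended` (positive means improvement)."""
    return aic[base] - aic[extended]

basic = "edges+gwidegree"
with_ttriad = "edges+gwidegree+ttriad"

print("adding ctriad:", aic_gain(basic, basic + "+ctriad"))
print("adding ttriad:", aic_gain(basic, with_ttriad))

# Rank the single terms added on top of the ttriad model by their AIC improvement.
terms = ["odegree(2)", "istar(3)", "idegree(2)", "istar(2)"]
gains = {t: aic_gain(with_ttriad, with_ttriad + "+" + t) for t in terms}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking, gains)
```

The out-degree term yields the smallest improvement of the four single-term extensions, consistent with the observation that out-degree matters less than in-degree.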
Next, we perform a longitudinal, 50-month study of the Apache call graph
using a few of the best-fitting models from the one-month study. This exper-
iment lets us see if the relative importance of explanatory variables changed
throughout the Apache development process. The ranking by AIC of the
models we fit remains constant across all 50 months, but the values of the
parameters do not. Figure 9 shows a plot of the coefficient values over time
for ttriad, idegree(2,3) and istar(2,3). These variables were chosen be-
cause they were contained in our best-fitting model (as determined by AIC)
from Table 2, and we chose not to study any variables (such as odegree)
from other, less well-fitting models. Our exploratory procedure eliminated the
other variables that we considered because they did not contribute as large
an improvement to the AIC as the variables from the final model.
All of the variables that we’ve measured relating to in-degree (istar(2,3),
idegree(2,3) and gwidegree) are generally negative in this model. On
the other hand, the transitivity variable ttriad is consistently positive
throughout the development cycle. This indicates that there are functions
in Apache that call their callee’s callees (perhaps due to the standard library
functions being included in the Apache call graph).
Interestingly, over the 50-month period, idegree(2) is almost perfectly
anti-correlated with idegree(3) (as seen in Fig. 9). One explanation is that
these two variables are measuring two aspects of the same phenomenon (how
many functions are called approximately twice), and, hence, the importance of
the two variables to the model is correlated. Similarly, edges and gwidegree
(not shown) are strongly anti-correlated, perhaps because they both measure
aspects of network density.

Fig. 9. Plots of several interesting coefficients across all 50 months. Top: ttriad.
Middle: idegree(2,3). Bottom: istar(2,3).

5 Discussion and Conclusions


We study the evolution of the function call graph for the Apache 2.0 HTTP
Server over a 50-month period. Apache is a mature OSS project written in a
procedural programming language. We characterize Apache first with several
global measures: 1) nodes and edges, 2) degree distribution, 3) dependency
matrices and propagation cost, 4) path length and clustering. We find that
these measures have certain baseline behaviors and that deviations can indi-
cate important structural changes in the code base. In particular, we find that
propagation cost (introduced in [11]) is a sensitive measure that can signal
when a detailed, fine-grained examination of the code base may be required.
Using ideas proposed in [10] (that functions with simultaneously high in- and
out-degrees are problematic), we are able to establish that the large changes
observed in propagation cost are attributable to just two individual functions
(out of approximately 2900 total functions). By examining the detailed devel-
opment logs we corroborate that indeed these two functions have repeatedly
troubled developers. The techniques presented herein may be useful in general
for code written in procedural programming languages, as they may allow de-
velopers to identify particular functions which, when restructured, can reduce
overall system dependencies.
Using exponential random graph modeling, we investigate the relationships
between the attributes that we empirically observe, and find that the most
important attribute for predicting the global structure of the Apache call
graph is ttriad, the number of transitive triads in the graph. In future work
we intend to explore how the appearance of unexpected features might help
to identify bugs.

Acknowledgments

We are indebted to Christian Bird for supplying the call graph data which
is central to our analysis and to Premkumar Devanbu for many useful dis-
cussions. This work was funded in part by the National Science Foundation
under Grant No. IIS-0613949.

References
1. E. S. Raymond. The Cathedral & the Bazaar. O’Reilly and Associates, Sebastopol,
CA, 1999.
2. T. O’Reilly. Lessons from open source software development. Communications of
the ACM, 42(4), 1999.
3. P. Ball. Openness makes software better sooner. Nature, June 25, 2003.
4. D. Challet and Y. Le Du. Microscopic model of software bug dynamics: Closed
source versus open source. International Journal of Reliability, Quality and Safety
Engineering, 12(6), 2005.
5. M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-
Wesley, Reading, MA, 1999.
6. A. A. Gorshenev and Yu. M. Pis’mak. Punctuated equilibrium in software evolu-
tion. Phys. Rev. E, 70(6):067103, 2004.
7. http://httpd.apache.org.
8. Software Maintenance Costs and references therein,
http://www.cs.jyu.fi/~koskinen/smcosts.htm.
9. S. Valverde, R. Ferrer Cancho, and R. V. Solé. Scale-free networks from optimal
design. Europhys. Lett., 60(4):512–517, 2002.
10. C. R. Myers. Software systems as complex networks: Structure, function, and
evolvability of software collaboration graphs. Phys. Rev. E, 68:046116, 2003.
11. A. MacCormack, J. Rusnak, and C. Y. Baldwin. Exploring the structure of com-
plex software designs: An empirical study of open source and proprietary code.
Management Science, 52(7), 2006.
12. Z. M. Saul, V. Filkov, P. T. Devanbu, and C. Bird. Recommending random walks.
In Proceedings ESEC/SIGSOFT FSE, pages 15–24, 2007.
13. http://www.grammatech.com/products/codesurfer/overview.html.
14. S. B. Seidman. Network structure and minimum degree. Social Networks, 5:269–
287, 1983.
15. B. Bollobas. The evolution of sparse graphs. In Graph Theory and Combinatorics,
pages 35–57. Academic Press, New York, 1984.
16. http://www.apacheweek.com/features/ap2#rh.
17. S. Valverde and R. V. Solé. Hierarchical small worlds in software architecture. In
Dynamics of Continuous Discrete and Impulsive Systems: Series B; Applications
and Algorithms, volume 14, pages 1–11, 2007.
18. G. Baxter, M. Frean, J. Noble, M. Rickerby, H. Smith, M. Visser, H. Melton,
and E. Tempero. Understanding the shape of Java software. In OOPSLA ’06:
Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented
Programming Systems, Languages, and Applications, pages 397–412, New York,
NY, USA, 2006. ACM.
19. J. N. Warfield. Binary matrices in system modeling. IEEE Transactions on Systems,
Man, and Cybernetics, 3:441–449, 1973.
20. D. Sharman and A. Yassine. Characterizing complex product architectures. Sys-
tems Engineering Journal, 7(1), 2004.
21. http://svn.apache.org/viewvc/.
22. P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae, 6:290–
297, 1959.
23. P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst.
Hungar. Acad. Sci., 5(17), 1960.
24. M. Molloy and B. Reed. A critical point for random graphs with a given degree
sequence. Random Struct. Alg., 6:161–179, 1995.
25. M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary
degree distributions and their applications. Phys. Rev. E, 64:026118, 2001.
26. T. A. B. Snijders. Markov chain Monte Carlo estimation of exponential random
graph models. Journal of Social Structure, 3(2), 2002.
27. C. J. Anderson, S. Wasserman, and B. Crouch. A p* primer: Logit models for
social networks. Social Networks, 21:37–66, 1999.
28. D. Kaplan. The Sage Handbook of Quantitative Methodology for the Social Sci-
ences. Sage Publications Inc., London, 2004.
29. C. Infante-Rivard, C. R. Weinberg, and M. Guiguet. Xenobiotic-metabolizing
genes and small-for-gestational-age births: Interaction with maternal smoking.
Epidemiology, 17(1):38–46, 2006.
30. The R Project for Statistical Computing, http://www.r-project.org.
31. M. S. Handcock, D. R. Hunter, C. T. Butts, S. M. Goodreau, and M. Morris.
statnet: An R package for the statistical modeling of social networks, 2003.
http://www.csde.washington.edu/statnet.
32. T. A. B. Snijders, P. E. Pattison, G. L. Robins, and M. S. Handcock. New
specifications for exponential random graph models. Sociological Methodology,
99–153, 2006.
33. S. Konishi and G. Kitagawa. Information Criteria and Statistical Modeling. Springer,
New York, 2008.
34. R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Net-
work motifs: Simple building blocks of complex networks. Science, 298:824–827,
2002.
35. S. Valverde and R. V. Solé. Network motifs in computational graphs: A case study
in software architecture. Phys. Rev. E, 72:026107, 2005.
36. D. R. Hunter and M. S. Handcock. Inference in curved exponential family mod-
els for networks. Technical report, Penn State Department of Statistics, 2004.
Available from http://www.stat.psu.edu/reports/2004/.
Some New Applications of Network
Growth Models

Gourab Ghoshal

Department of Physics and Michigan Center for Theoretical Physics, University of
Michigan, Ann Arbor, MI, 48109, USA; gghoshal@umich.edu

1 Introduction
The study and analysis of complex networks has in recent times sparked
widespread attention from the scientific community [1, 2, 3]. This interest
has been spurred partly by researchers recognizing networks as useful rep-
resentations of real-world complex systems, and also due to the widespread
availability of computing resources, enabling them to gather and analyze data
on a scale much larger than before. Studies have ranged from large-scale empir-
ical analysis of the World Wide Web, social networks and biological systems,
to the development of theoretical models and tools to explore the various
properties of these systems [4, 5].
A topic that has garnered significant interest is the subject of growing
networks, inspired by real-world examples such as that of the Internet, the
World Wide Web and scientific citation networks [6, 7, 8]. The particular
case of the World Wide Web has led to what is perhaps the best-known body
of work on this topic: the preferential attachment model [9, 10], in which
vertices are added to a network with edges that attach to pre-existing vertices
with probabilities depending on those vertices’ degrees. When the attachment
probability is precisely linear in the degree of the target vertex, the resulting
degree sequence has a power-law tail, in the limit of large network size. The
appearance of the power-law tail is what first led to the popularity of growth
models as a method to describe network evolution, as most real-world net-
works appear to have degree distributions that are approximately power laws.
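The growth rule described above can be simulated in a few lines. The sketch below is one simple variant, written under stated assumptions: each new vertex brings m edges, the seed is a small complete graph, and targets are drawn from a list in which every vertex appears once per unit of degree, so the attachment probability is approximately linear in degree (requiring distinct targets introduces a small deviation). None of these details are prescribed by the text.

```python
import random
from collections import Counter

def preferential_attachment(n_vertices, m, rng):
    """Grow an undirected graph in which each new vertex attaches m edges to
    existing vertices chosen with probability proportional to their degree."""
    # Seed: complete graph on m + 1 vertices (an assumption of this sketch).
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    stubs = [v for e in edges for v in e]    # each vertex listed once per unit of degree
    for new in range(m + 1, n_vertices):
        targets = set()
        while len(targets) < m:              # draw m distinct targets
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

graph = preferential_attachment(2000, 2, random.Random(7))
degree = Counter(v for e in graph for v in e)
histogram = Counter(degree.values())         # number of vertices with each degree
for k in sorted(histogram)[:8]:
    print(k, histogram[k])
```

Plotting the histogram on log-log axes for a large number of vertices is the usual way to inspect the tail of the degree distribution.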
The preferential attachment model, though a good starting point, is
insufficient for describing networks such as the World Wide Web. One can imagine
a variety of processes taking place in addition to the mere deposition of
vertices and edges. In particular, it is a matter of common experience that web
pages are sometimes permanently or temporarily removed from the web along
with their links to other web pages. Consequently, there is plenty of room to
build on these models, which are principally growth based, and add another
level of complexity by including processes where vertices and edges are also
removed from the network. It is also possible to extend the model to study
general growth and deletion processes, and not just preferential attachment.
Indeed, in the last couple of years, there has been some activity in this
regard [11, 12, 13]. A notable example is the work done in [12], where, among
other things, the authors extended the preferential attachment model to
include the deletion of vertices, potentially at a rate different from that of the
addition of new vertices. They demonstrated that networks still retain their
power-law tail when the rate of vertex accrual outstrips vertex deletion (with an
exponent dependent on the relative rate). However, when the rates are equal,
the exponent diverges and the degree distribution transitions into a stretched
exponential (Weibull distribution). This could have interesting consequences,
for example, in the future character of the web where web pages would share
a more even load than at present.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_13,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
Most growth models are designed to solve for the degree distribution of
the network. Although one can design a model which describes the evolution
of a number of other network properties, say the clustering coefficient, those
that deal with degree distributions are the most attractive for chiefly two
reasons. The first is in terms of practicality; the degree distribution is rela-
tively straightforward to deal with mathematically, and thus one can calcu-
late a number of properties exactly. The second reason is that the degrees of
vertices typically have a strong effect on the overall behavior of the network;
therefore, they are a useful guide in determining its characteristics. In fact,
the traditional use of growth models has been mostly in this regard, where
researchers define some evolution process and then solve for the equilibrium
degree distribution of the network.
However, it is certainly possible to think of alternative applications. Instead
of defining a set of processes under which a network evolves and then
determining its final structure, one can turn the question around and specify a
final structure, and then solve for the rules which give rise to that structure.
To see when and why this is applicable, consider the following. Generally, we
can divide evolving networks into broadly three classes. There are those that
evolve naturally, in the sense that they are driven by dynamical processes
not under our control; representative examples are social, biological and in-
formation networks. This class is suited to the more traditional use of growth
models. At best, we can measure the degree distribution of these networks,
guess a set of rules that govern their evolution and then check our calcula-
tions against the measurements, to see if we made a reasonable choice. There
is a different class, mostly infrastructure related, such as the transportation
and power grids, communication networks such as the telephone and Internet,
that are designed by a centrally controlled authority. Since the rules for the
evolution process defined in growth models are mostly local in nature, we are
hard pressed to find a suitable application for them in this particular class.
Finally, there is a relatively new class of networks which falls in between these
two types, the classic example being peer-to-peer file-sharing networks. These
networks grow in a collaborative, distributed fashion so that we have no direct
influence over their structure. However, we can manipulate some of the
rules by which they form, giving us a limited but potentially useful influence
over their properties. It is this last class that provides a fertile testbed for
alternative uses for our growth model.
Two applications immediately spring to mind. Consider the case of peer-
to-peer networks. Measurements [14] have shown that the degree distribution
of these networks roughly follows a power law. Based on these findings, recent
papers in the literature [15, 16] have proposed search strategies with costs
(search time, bandwidth, etc.) scaling sublinearly or logarithmically with the
size of the network. While it is certainly a practical and worthwhile approach
to determine the structures of existing networks and then try to find ways
to optimize their properties, it also seems natural to approach the question
from the other direction. Some authors [17] have taken the approach of iden-
tifying desirable properties that a network should possess and then proposing
appropriate designs to generate networks possessing these attributes. Since in
a peer-to-peer network users are continually joining and leaving the network,
it is well suited to be described by network growth models. The idea will be to
specify a suitable structure a priori that optimizes some properties, say effi-
cient information transfer, and solve for a set of local rules which can generate
that same structure.
The other application is in the realm of network resilience. Networks typ-
ically experience a significant amount of node/edge turnover, due to possi-
ble failures of key components and resources, or intentional attacks. These
factors can lead to severe disruption of the network structure and, as a re-
sult, loss of its key properties. Thus, it is worth analyzing the effects of these
failures/attacks and using our limited control to attempt to adaptively restore
the original structure of these networks. Authors who work in this field have
focused mostly on the effects of disruption on static networks, where they
have studied the connectivity structure under the random/targeted removal
of nodes and edges [18, 19, 20]. However, under the aegis of growth models
we can move away from the static regime and instead focus on networks that
evolve in time with sustained node and edge removals. We can allow the net-
work to react to these disruptions by introducing new nodes and edges and
attaching them in a manner such that the network is able to retain its original
degree distribution. These kinds of models are conventionally referred to as
reactive network models and have previously been studied in [21, 22].
In this chapter, we will move away from the more traditional uses of net-
work growth models and focus on some new applications. The outline is as
follows. In Section 2 we will first define our model. The model is based on a
rate equation approach that governs the evolution of the degree distribution
of the networks that we study. Based on this model, in Section 3 we will talk
about applying growth models to design networks with a set of properties that
may be desirable for their functioning, and illustrate our ideas with the specific
example of peer-to-peer file-sharing networks. In Section 4 we will talk about
applying our growth model to the preservation of network degree distributions
from attacks on, or failure of, network resources. In Section 5 we will state
our conclusions. In all cases we combine analytical calculations with
numerical simulations to arrive at our results.

2 The Model

In this section we will define our model for growing a network. Our approach
will be based on the attachment kernel introduced in [23] in addition to a
general deletion kernel. We will assume that nodes join/leave the network
at intervals, and on doing so form/lose connections with other pre-existing
nodes in the network. For the networks that we consider, we will make the
assumption that, on the typical time scales over which nodes enter or leave
the network, the size of the network n does not change substantially. We will,
however, not assume that our networks are uncorrelated (in terms of degree
correlations) and will take this into account explicitly in our evolution process.
The reasons for doing so will be clarified in Section 4.
To start off, let us define $p_k$ to be the fraction of nodes in the network
that at a given time have degree $k$. Alternatively, one can think of it as the
probability that a randomly chosen node has degree $k$. Then by definition,
it satisfies the normalization condition

$$\sum_{k=0}^{\infty} p_k = 1. \tag{1}$$

Let us now define the process by which a newly arriving node chooses to attach
to others extant in the network and how a node is removed from the same.
Let $\pi_k$ be the probability that a given edge from a new node is connected to
a given node of degree $k$, multiplied by the total number of nodes $n$. Since
the total number of nodes in the network with degree $k$ is $np_k$, this implies
that $\pi_k p_k$ is the probability that an edge from a new node is connected to any
node of degree $k$. Similarly, let $a_k$ be the probability that a given node with
degree $k$ fails or is attacked during one node removal, again multiplied by the
total number of nodes $n$. Then $a_k p_k$ is the total probability of removing a node
with degree $k$ during one node removal. Since each newly attached edge goes
to some node with degree $k$, and each removal deletes some node with degree $k$,
we have the following normalization conditions:

$$\sum_k \pi_k p_k = 1, \tag{2}$$

$$\sum_k a_k p_k = 1. \tag{3}$$

Finally, let us also allow ourselves to choose the number of edges of the newly
joining nodes. Let $m_k$ be the distribution from which the degrees of these nodes
are drawn, with the constraint $\sum_k k m_k = c$; in other words, the average
degree of incoming vertices is $c$.

2.1 Rate Equation

Armed with the given definitions, we are now in a position to write a rate
equation governing the evolution of the degree distribution. For a network of
$n$ nodes at a given unit of time, the total number of nodes with degree $k$ is
$np_k$. After one unit of time we add a node and take away another, so the new
number with degree $k$ is $np'_k$, where $p'_k$ is the new value of $p_k$. Therefore,
we have

$$np'_k = np_k + c\pi_{k-1}p_{k-1} - c\pi_k p_k + \sum_j e_{k+1|j}\, j a_j p_j - \sum_j e_{k|j}\, j a_j p_j - a_k p_k + m_k, \tag{4}$$

where $e_{k|j}$ is the conditional probability of following an edge from a node of
degree $j$ and reaching a node of degree $k$. Note that $e_{0|j}$ and $e_{j|0}$ are always
zero, and for an uncorrelated network, $e_{k|j} = kp_k/\langle k \rangle$, where
$\langle k \rangle = \sum_k kp_k$ is the average degree for the entire network. The
$\pi_k$ terms describe the flow of nodes with degree $k-1$ to $k$ and $k$ to $k+1$ as a
consequence of the new edges gained due to the addition of new nodes. The two
terms containing $a_j$ describe the flow of nodes with degree $k+1$ to $k$ and $k$ to
$k-1$ as they lose edges as a result of losing neighbors. The term $-a_k p_k$
represents the direct removal of a node of degree $k$. Finally, $m_k$ represents the
addition of a node with degree $k$. Processes in which a node gains or loses two
or more edges in a single time step have probabilities that vanish in the limit
of large $n$ and are not included in Eq. (4).
The rate equation described above presents a formidable challenge due to
the appearance of $e_{k|j}$ in the terms representing edges lost through the
removal of neighbors, which makes it hard to find a closed-form solution.
Nevertheless, we can make progress in one of two ways. The first will be
described here, and is applicable to our first application of designing
networks. We will, for the moment, leave the description of the second method
for a later section. Equation (4) has a particularly pleasing form if we limit
ourselves to the case of uniformly random deletion, which amounts to setting
$a_k = 1$. The sums over $j$ in Eq. (4) can then be evaluated using

$$\sum_j e_{k|j}\, j p_j = k p_k, \tag{5}$$

which renders Eq. (4) independent of $e_{k|j}$ and thus independent of any degree
correlations. Random deletion thus closes the rate equation for $p_k$, enabling
us to seek a solution for the degree distribution for a given $m_k$ and $\pi_k$. If we
now assume that $p_k$ has an asymptotic form in the limit of large time, which
amounts to setting $p'_k = p_k$, we get the following equation:

$$c\pi_{k-1}p_{k-1} - c\pi_k p_k + (k+1)p_{k+1} - kp_k - p_k + m_k = 0. \tag{6}$$



At this point it is convenient to define the following set of generating functions:

$$G(z) = \sum_{k=0}^{\infty} p_k z^k, \tag{7}$$

$$F(z) = \sum_{k=0}^{\infty} \pi_k p_k z^k, \tag{8}$$

$$M(z) = \sum_{k=0}^{\infty} m_k z^k. \tag{9}$$

If we then multiply Eq. (6) by $z^k$ and sum over the index $k$ with the convention
$p_{-1} = 0$, we find that the generating functions satisfy the following differential
equation:

$$(1-z)\frac{dG}{dz} - G(z) - c(1-z)F(z) + M(z) = 0. \tag{10}$$
Our task will be to solve for a set of rules that generate/preserve the degree
distribution of a network that is specified beforehand. In other words, given
a $G(z)$, our aim is to solve for the attachment kernel $F(z)$. We can rearrange
Eq. (10) to get $F(z)$ in terms of the other two distributions,

$$F(z) = \frac{1}{c}\left[\frac{dG}{dz} + \frac{M(z) - G(z)}{1-z}\right]. \tag{11}$$

It is a relatively straightforward exercise, starting from the equation above, to
show that the average degree of the network is $\langle k \rangle = c$. In other
words, solutions to Eq. (10) require that the average degree $c$ of vertices added
to the network be equal to the average degree of vertices in the network as a
whole. Thus, we can write Eq. (11) as

$$F(z) = G_1(z) + \frac{M(z) - G(z)}{c(1-z)}, \tag{12}$$

where $G_1(z) = G'(z)/G'(1) = \sum_k q_k z^k$ is the generating function for what is
called the excess degree distribution:

$$q_k = \frac{(k+1)p_{k+1}}{\langle k \rangle}. \tag{13}$$

The excess degree refers to the number of edges attached to a vertex that is
reached by following a randomly chosen edge, not counting the edge along which
we arrived. The factor of $k+1$ is present since we are now effectively sampling
vertices in proportion to the number of edges extant on them. Note that the
excess degree of a vertex is one less than the actual degree.
We are now in a position to derive the desired attachment kernel. Noting
that

$$\frac{1}{1-z} = \sum_{k=0}^{\infty} z^k, \tag{14}$$

we can simply read off the coefficient of $z^k$ on either side of Eq. (12) to give

$$\pi_k = \frac{1}{cp_k}\left[(k+1)p_{k+1} + P_{k+1} - M_{k+1}\right], \tag{15}$$

where $P_k$ is the cumulative distribution of the degrees of nodes in the network,
and $M_k$ is the cumulative distribution of the degrees of nodes added,

$$P_k = \sum_{l=k}^{\infty} p_l, \qquad M_k = \sum_{l=k}^{\infty} m_l. \tag{16}$$

We have a number of options for solving Eq. (15); given (almost) any
choice of the distribution $m_k$ of the degrees of added vertices, we can find
the corresponding $\pi_k$ that will give the desired final degree distribution of
the network. A particularly convenient choice would be to make the degree
distribution of the added vertices the same as the desired degree distribution,
so that $M_k = P_k$. Then,

$$\pi_k = \frac{q_k}{p_k} = \frac{(k+1)p_{k+1}}{cp_k}. \tag{17}$$

In other words, if we have some desired degree distribution $p_k$ for our network,
one way to achieve it is to add vertices with exactly that degree distribution
and then arrange the attachment process so that the degree distribution
remains preserved thereafter, even as vertices and edges are added to and
removed from the network. Equation (17) tells us the choice of attachment
kernel that will achieve this.
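
As a numerical illustration of Eq. (15), the following Python sketch computes
the attachment kernel for a degree distribution truncated at a maximum degree.
The function names, the truncation, and the two example choices of $m_k$ are
assumptions made here for illustration; they are not part of the original
analysis.

```python
import math

def attachment_kernel(p, m):
    """Return (flux, pi) with flux[k] = pi_k * p_k, following Eq. (15).

    p, m: lists of p_k and m_k for k = 0..K, assumed normalized, truncated
    at the same maximum degree K, and with (nearly) equal means.
    """
    K = len(p) - 1
    c = sum(k * mk for k, mk in enumerate(m))   # mean degree of added nodes
    # Cumulative distributions P_k and M_k (sums over l >= k), with P_{K+1} = 0.
    P = [0.0] * (K + 2)
    M = [0.0] * (K + 2)
    for k in range(K, -1, -1):
        P[k] = P[k + 1] + p[k]
        M[k] = M[k + 1] + m[k]
    flux, pi = [], []
    for k in range(K + 1):
        p_next = p[k + 1] if k < K else 0.0
        f = ((k + 1) * p_next + P[k + 1] - M[k + 1]) / c
        flux.append(f)
        pi.append(f / p[k] if p[k] > 0 else float("nan"))
    return flux, pi

def truncated_poisson(mu, K):
    """Poisson weights for k = 0..K, renormalized after truncation."""
    w = [math.exp(-mu) * mu**k / math.factorial(k) for k in range(K + 1)]
    total = sum(w)
    return [x / total for x in w]

p = truncated_poisson(4.0, 60)

# Case 1: added nodes are drawn from the target distribution itself, Eq. (17).
flux_same, pi_same = attachment_kernel(p, p)

# Case 2 (hypothetical): every added node has degree exactly 4.
m_fixed = [1.0 if k == 4 else 0.0 for k in range(61)]
flux_fixed, pi_fixed = attachment_kernel(p, m_fixed)
```

In both cases the products $\pi_k p_k$ sum to one, as Eq. (2) requires. Not every
choice of $m_k$ need give a non-negative kernel, which is presumably what the
qualifier "(almost) any" refers to, so a practical implementation would check
the sign of each $\pi_k$.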
For example, say we want to generate a Poisson network with degrees
distributed according to

$$p_k = e^{-\mu}\frac{\mu^k}{k!}, \tag{18}$$

where $\mu$ is the average degree of the network. For this distribution
$(k+1)p_{k+1}/p_k = \mu$, so with $c = \mu$ Eq. (17) gives $\pi_k = 1$ for all $k$.
In other words, all we have to do is to introduce nodes with degrees distributed
according to Eq. (18) and attach them uniformly at random to the pre-existing
vertices in the network. Figure 1 shows the degree distribution of a Poisson
network generated using the method described above.
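
The growth process for the Poisson case can be sketched in a few lines of
Python. This is a minimal simulation under stated assumptions, not the code
used to produce Figure 1: the network size, mean degree, number of steps, the
empty initial graph, and the helper names are all choices made here for
illustration.

```python
import math
import random

def poisson_sample(mu, rng):
    """Draw one Poisson(mu) variate with Knuth's multiplication method."""
    limit = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def grow_poisson_network(n=1000, c=4.0, steps=10000, seed=1):
    """Fixed-size growth: each step deletes a uniformly random node and adds
    a new node whose Poisson(c) edges attach uniformly at random (pi_k = 1)."""
    rng = random.Random(seed)
    neighbors = [set() for _ in range(n)]   # start from an empty graph
    for _ in range(steps):
        v = rng.randrange(n)                # node removed in this step
        for u in neighbors[v]:
            neighbors[u].discard(v)
        neighbors[v] = set()                # slot v is reused by the new node
        k = min(poisson_sample(c, rng), n - 1)
        for u in rng.sample(range(n - 1), k):
            if u >= v:                      # shift indices to skip v itself
                u += 1
            neighbors[v].add(u)
            neighbors[u].add(v)
    return [len(s) for s in neighbors]

degrees = grow_poisson_network()
mean = sum(degrees) / len(degrees)
var = sum((d - mean) ** 2 for d in degrees) / len(degrees)
```

After several node lifetimes (of order $n$ steps each) the mean degree should
settle near $c$ and the variance near the mean, as expected for a Poisson
distribution, up to finite-size fluctuations.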
Having built our mathematical framework, we are now free to move on to
specific applications.

3 Generating Networks with Desired Properties


In this section we will discuss the application of our growth model to gener-
ate networks with desired structural properties. Specifically, we will consider
the case of peer-to-peer file-sharing networks. As motivation for this, consider
the following.

[Figure 1: log-log plot of probability $p_k$ against degree $k$.]

Fig. 1. The degree distribution for a network of fixed size n = 50,000 generated using
the growth mechanism described in the text, with c = 10. The points represent the
simulation results and the solid line is the distribution Eq. (18).

A problem which has gained a lot of attention is that of designing an efficient
search strategy to find items or data stored on the vertices of a network.
Interest in this has been inspired partly by the emergence of networked dis-
tributed databases such as peer-to-peer file-sharing networks. In such net-
works the structure of the network and the distribution of the items stored
on it typically change rapidly and frequently, which means that searches must
be performed in real time. In peer-to-peer networks searches typically con-
sist of queries that are forwarded from one vertex to another until the target
item is found. Real-time searches place heavy demands on computer power
and bandwidth, and there is interest in finding efficient search strategies to
decrease these costs.
As mentioned in the Introduction, direct measurements of real peer-to-peer
networks have shown that typically the degree distribution of these networks
follows a power law, which has led some authors to propose search strategies
that exploit this power-law form to improve efficiency. Here we describe an
alternative approach to the problem: instead of tailoring our algorithm to the
observed network, we tailor the structure of the network to optimize
the performance of the search algorithm. We will start by defining our al-
gorithm and then outline the properties of interest. We will then consider a
candidate network with a structure that optimizes those properties. The ideas
of this section have been discussed in detail in [24].

3.1 Definition of the Problem

Consider a distributed database consisting of a set of computers, each of which
holds some data items. Copies of the same item can exist on more than one
computer, which would make searching easier, but we will not assume this to
be the case. Computers are connected together in a virtual network, meaning
that each computer is designated as a neighbor of some number of other
computers. These connections between computers are purely notional: every
computer can communicate with every other directly over the Internet or
other physical network. The virtual network is used only to limit the amount
of information that computers have to keep about their peers. Each computer
maintains a directory of the data items held by its network neighbors, but
not by any other computers in the network. Searches for items are performed
by passing a request for a particular item from computer to computer until it
reaches one in whose directory that item appears, meaning that one of that
computer’s neighbors holds the item. The identity of the computer holding the
item is then transmitted back to the origin of the search, and the origin and
target computers communicate directly thereafter to negotiate the transfer
of the item. This basic model is essentially the same as that used by other
authors [15] as well as by many actual peer-to-peer networks in the real world.
Note that it achieves efficiency by the use of relatively large directories at
each node of the network, which inevitably use up memory resources on the
computers. However, with standard hash-coding techniques and for databases
of the typical sizes encountered in practical situations (hundreds of thousands
or millions of items) the amounts of memory involved are quite modest by
modern standards.
The two metrics of search performance that we consider are search time
and bandwidth, both of which should be low in a good algorithm. We define
the search time to be the number of steps taken by a propagating search
query before the desired target item is found. We define the bandwidth for a
node as the average number of queries that pass through that node per unit
time. Bandwidth is a measure of the actual communications bandwidth that
vertices must expend to keep the network as a whole running smoothly, but it
is also a rough measure of the CPU time they must devote to searches. Since
these are limited resources, it is crucial that we do not allow the bandwidth
to grow too quickly as vertices are added to the network; otherwise, the size
of the network will be constrained.

3.2 Search Strategy and Search Time

In order to treat the search problem quantitatively, we need to define a search
strategy or algorithm. Our candidate will be a very simple one, the random
walk search, which, though certainly not the most efficient strategy possible,
has two significant advantages. First, it is simple enough to allow us to carry
out analytic calculations of its performance. Second, as we will show, even
this basic strategy can be made to work very well. Our results constitute an
existence proof that good performance is achievable; searches are necessarily
possible that are at least as good as those analyzed here.
In a random walk search a node $i$ originating a search sends a query for
the item it wishes to find to one of its neighbors $j$, chosen at random. If that
item exists in the neighbor’s directory, the identity of the computer holding
the item is transmitted to the originating node and the search ends. If not,
then $j$ passes the query to one of its neighbors chosen at random, and so forth.
Let $p_i$ be the probability that our random walker is at node $i$ at a particular
time. Then the probability $p'_i$ of its being at $i$ one step later, assuming the
target item has not been found, is

$$p'_i = \sum_j \frac{A_{ij}}{k_j}\, p_j, \tag{19}$$

where $k_j$ is the degree of node $j$ and $A_{ij}$ is an element of the adjacency matrix,

$$A_{ij} = \begin{cases} 1 & \text{if there is an edge joining vertices } i, j, \\ 0 & \text{otherwise.} \end{cases} \tag{20}$$

After reaching equilibrium, the probability distribution over nodes then
tends to the fixed point of (19), which is at

$$p_i = \frac{k_i}{2m}, \tag{21}$$

where $m$ is the total number of edges in the network. That is, the random
walk visits nodes with probability proportional to their degrees.
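
That Eq. (21) is a fixed point of Eq. (19) can be checked directly on a small
example. The sketch below uses exact rational arithmetic; the five-node graph
and the function name `walk_step` are assumptions introduced only for this
illustration.

```python
from fractions import Fraction

def walk_step(adj, p):
    """Apply Eq. (19): p'_i = sum_j A_ij p_j / k_j on an undirected graph."""
    new_p = [Fraction(0)] * len(adj)
    for j, nbrs in enumerate(adj):
        share = p[j] / len(nbrs)   # a walker at j moves to each neighbor equally
        for i in nbrs:
            new_p[i] += share
    return new_p

# A small connected example graph with edges 0-1, 0-2, 0-3, 1-2, 3-4.
adj = [[1, 2, 3], [0, 2], [0, 1], [0, 4], [3]]

two_m = sum(len(nbrs) for nbrs in adj)   # 2m equals the sum of all degrees
stationary = [Fraction(len(nbrs), two_m) for nbrs in adj]   # Eq. (21)
after_one_step = walk_step(adj, stationary)
```

Each node $j$ passes $p_j/k_j = 1/2m$ to each of its neighbors, so node $i$
receives $k_i/2m$ in total and the distribution is unchanged by the step.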
When our random walker arrives at a previously unvisited node of degree
$k_i$, it “learns” from that node’s directory about the items held by all of
its immediate neighbors, of which there are $k_i - 1$ excluding the one we arrived
from (whose items by definition we already know about). Thus, at every step
the walker gathers more information about the network. The average number
of nodes it learns about upon making a single step is $\sum_i p_i (k_i - 1)$, with $p_i$
given by (21), and hence the total number it learns about after $\tau$ steps is

$$\frac{\tau}{2m}\sum_i k_i(k_i - 1) = \tau\left[\frac{\langle k^2 \rangle}{\langle k \rangle} - 1\right], \tag{22}$$

where $\langle k \rangle$ and $\langle k^2 \rangle$ represent the mean and mean-square
degrees in the network respectively and we have made use of $2m = n\langle k \rangle$.
The time taken for the walker to find the desired item, of course, depends
on how many instances of the target exist in the network. In many cases of
practical interest, copies of items exist on a fixed fraction of the nodes in
the network, which makes for quite an easy search. Here we will consider the
much harder problem in which copies of the target item exist on only a fixed
number of nodes, where that number could potentially be just 1. In this case,
the walker will need to learn about the contents of $O(n)$ nodes in order to
find the target, and hence the average time to find the target is given by

$$\tau = A\,\frac{n}{\langle k^2 \rangle/\langle k \rangle - 1}, \tag{23}$$

for some constant $A$.
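
Equations (22) and (23) are straightforward to evaluate for any degree
sequence. In the following sketch the degree sequence is a toy example and the
unknown constant $A$ is set to 1; both are assumptions made purely for
illustration.

```python
def nodes_learned_per_step(degrees):
    """Mean number of nodes learned about per step, sum_i p_i (k_i - 1) with
    p_i = k_i / 2m, which equals <k^2>/<k> - 1 as in Eq. (22)."""
    two_m = sum(degrees)
    return sum(k * (k - 1) for k in degrees) / two_m

def search_time(degrees, A=1.0):
    """Eq. (23), with the unspecified constant A set to 1 by assumption."""
    return A * len(degrees) / nodes_learned_per_step(degrees)

degrees = [3, 2, 2, 2, 1]   # toy degree sequence, for illustration only
n = len(degrees)
mean_k = sum(degrees) / n
mean_k2 = sum(k * k for k in degrees) / n
rate = nodes_learned_per_step(degrees)
tau = search_time(degrees)
```

Here `rate` agrees with `mean_k2 / mean_k - 1`, confirming that the sum over
nodes and the moment form in Eq. (22) are the same quantity.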

3.3 Bandwidth

Bandwidth is the mean number of queries reaching a given node per unit
time. Equation (21) tells us that the probability that a particular current
query reaches node $i$ at a particular time is $k_i/2m$, and assuming as discussed
above that the number of queries initiated per unit time is proportional to
the total number of vertices, the bandwidth for node $i$ is

$$\beta_i = Bn\,\frac{k_i}{2m} = B\,\frac{k_i}{\langle k \rangle}, \tag{24}$$

where $B$ is another constant.
This implies that high-degree nodes will be overloaded in comparison with
low-degree ones, which means that networks with power-law or other highly
right-skewed degree distributions may be undesirable, resulting in bottlenecks
around the nodes of highest degree that could in principle harm the perfor-
mance of the entire network. If we wish to distribute load evenly among the
computers in our network, then a network with a tightly peaked degree dis-
tribution is desirable.
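
To make the load imbalance concrete, the sketch below evaluates Eq. (24) in
units of $B$ for two hypothetical eight-node degree sequences with the same
mean degree, one tightly peaked and one containing a single hub. The sequences
and the function name are assumptions chosen for illustration.

```python
def relative_bandwidth(degrees):
    """beta_i / B = k_i / <k> from Eq. (24), one value per node."""
    mean_k = sum(degrees) / len(degrees)
    return [k / mean_k for k in degrees]

# Two hypothetical degree sequences, both with mean degree 2.
peaked = [2, 2, 2, 2, 2, 2, 2, 2]   # e.g. a ring: every node has the same load
skewed = [1, 1, 1, 1, 1, 2, 2, 7]   # one hub connected to all other nodes

peak_load = max(relative_bandwidth(peaked))
hub_load = max(relative_bandwidth(skewed))
```

In the peaked case every node carries the average load, while in the skewed
case the hub carries 3.5 times the average even though the mean degree is
identical.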

3.4 Candidate Network

We wish to choose a structure for our network that gives low search times and
modest bandwidth demands, keeping in mind that the structure we consider
must be realizable in practice. In peer-to-peer networks users continually exit
the network whenever they want. Since we as designers have limited control
over this aspect of the network dynamics, we will assume that nodes are
effectively deleted at random. With this in mind, we are ideally placed to use
our model from Section 2.
A simple and attractive choice for our network is the Poisson distributed
network. For a Poisson degree distribution with mean $\mu$ we have
$\langle k \rangle = \mu$ and $\langle k^2 \rangle = \mu^2 + \mu$. Then, using
Eq. (23), the average search time is

$$\tau = A\,\frac{n}{\mu}. \tag{25}$$

Now if we allow $\mu$ to grow as some power of the size of the entire network,
i.e. $\mu \propto n^{\alpha}$ with $0 \le \alpha \le 1$, then $\tau \propto n^{1-\alpha}$.
For smaller values of $\alpha$ searches will take longer, but the nodes’ degrees are
lower on average, meaning that each vertex will have to devote fewer memory
resources to maintaining its directory. Conversely, for larger $\alpha$, searches will
be completed more quickly at
the expense of greater memory usage. In the limiting case α = 1, searches
are completed in constant time, independent of the network size, despite the
simple nature of the random walk search algorithm. The price we pay for this
good performance is that the network becomes dense, having a number of
edges scaling as $n^{1+\alpha}$. However, remember that this is a virtual network, in
which the edges are a purely notional construct whose creation and mainte-
nance carry essentially zero cost. There is a cost associated with the directories
maintained by nodes, which for α = 1 will contain information on the items
held by a fixed fraction of all the nodes in the network. For instance, each
node might be required to maintain a directory of 1% of all items in the
network. Because of the nature of modern computer technology, however, we
do not expect this to create a significant problem. User time (for performing
searches) and CPU time and bandwidth are scarce resources that must be
carefully conserved, but memory space on hard disks is cheap, and the tens or
even hundreds of megabytes needed to maintain a directory is considered in
most cases to be a small investment. By making the choice α = 1 we can trade
cheap memory resources for essentially optimal behavior in terms of search
time, and this is normally a good deal for the user.
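
The trade-off between search time and density can be tabulated with a short
Python sketch. The proportionality constants `A` and `mu0` are unknown in the
text and are set to 1 here by assumption, so only the scaling with $n$ and
$\alpha$ is meaningful, not the absolute values.

```python
def poisson_design(n, alpha, A=1.0, mu0=1.0):
    """Search time and edge count for a Poisson network with mu = mu0 * n**alpha.

    A and mu0 are unspecified proportionality constants, set to 1 by assumption.
    """
    mu = mu0 * n ** alpha
    tau = A * n / mu       # Eq. (25)
    edges = n * mu / 2     # m = n<k>/2 with <k> = mu
    return tau, edges

for alpha in (0.0, 0.5, 1.0):
    for n in (10_000, 1_000_000):
        tau, edges = poisson_design(n, alpha)
        print(f"alpha={alpha:.1f}  n={n:>9,d}  tau={tau:12.1f}  edges={edges:18.1f}")
```

For $\alpha = 1$ the search time is the same at both sizes while the number of
edges grows as $n^2$; for $\alpha = 0$ the edge count is linear in $n$ but the
search time grows linearly as well.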
As a test of our proposed search scheme, we have performed simulations
of the procedure on Poisson networks generated using the methods described
in Section 2. Figure 2 shows as a function of network size the average time τ
taken by a random walker to find an item placed at a single randomly chosen
node in the network. As we can see, the value of τ does indeed tend to a
constant (about 100 steps in this case) as the network size becomes large.

[Figure 2: plot of search time $\tau$ against network size $n$.]

Fig. 2. The time τ for the random walk search to find an item deposited at a random
vertex, as a function of the number of vertices n.

While we have described here the theoretical ideas to grow a network with
a desired degree distribution, within the constraints outlined above, we have
not provided a realistic way to place edges between nodes with the desired
attachment kernel $\pi_k$. If each node entering the network knew the identities
and degrees of all the others, this would be easy; we would simply select a
degree $k$ at random in proportion to $\pi_k p_k$, and then select a node uniformly
at random with that degree. In the real world, however, and particularly in
peer-to-peer networks, no node knows the identity of all others. Typically,
computers only know the identities (such as IP addresses) of their immediate
network neighbors. One way around this problem is to use biased random walks
to generate the network. The main purpose of this chapter is to discuss ideas
rather than implementation;
consequently, we will not describe this here. For a detailed discussion of the
practical implementation, along with other details such as data replication, we
refer the interested reader to [24], and instead move on to our next application.
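
For reference, the idealized selection rule with global knowledge (choose a
degree $k$ in proportion to $\pi_k p_k$, then a node of that degree uniformly
at random) can be sketched as follows. This is not the biased random walk
scheme of [24]; the function name, the node labels, and the example weights are
hypothetical.

```python
import random
from collections import defaultdict

def choose_target(degrees, weight, rng):
    """Pick an attachment target, assuming global knowledge of all degrees.

    degrees: dict mapping node -> current degree.
    weight:  dict mapping degree k -> pi_k * p_k (need not be normalized here).
    """
    by_degree = defaultdict(list)
    for node, k in degrees.items():
        by_degree[k].append(node)
    # Only degrees actually present in the network, with nonzero weight.
    present = [k for k in by_degree if weight.get(k, 0.0) > 0.0]
    k = rng.choices(present, weights=[weight[d] for d in present])[0]
    return rng.choice(by_degree[k])   # uniform among nodes of that degree

rng = random.Random(0)
degrees = {"a": 1, "b": 2, "c": 2, "d": 3}
weight = {1: 0.0, 2: 0.5, 3: 0.5}     # hypothetical values of pi_k * p_k
picks = [choose_target(degrees, weight, rng) for _ in range(200)]
```

In this toy example node "a" is never selected because its degree has zero
weight, while "b" and "c" share the weight assigned to degree 2.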

4 Preserving Network Structure from Disruptions

We now turn our attention to quite a different topic: the field of network
resilience. Quite a lot of work has been done in this regard, though most of it
has focused on the effects of disruption on static networks. Typically, authors have
studied networks where the nodes and edges are progressively removed in some
fashion, and then measured the effect of these removals on the existence
of a giant component. The giant component constitutes the largest set of
nodes in the network, of size O(n), where n is the size of the network, that
are connected to each other by at least one path. The network is considered
static in that no compensatory measures, such as the (re)-introduction of new
edges or nodes, are permitted.
There is indeed good reason to study the resilience of networks. In the real
world, networks suffer from a variety of disruptions, stemming from failure
of key components, continuous addition/removal of nodes and edges and in-
tentional attacks such as Denial of Service, among other things. Since these
disruptions affect the structure of networks and structure is directly related
to performance, it is important to understand how the networks are affected.
However, we can do better than that. We can try to restore some or all of
the structure of the network by allowing it to react to the disruptions through
the introduction of new nodes and edges. As the previous section shows,
considerable effort can be expended in tailoring a network to have a structure
that optimizes properties of interest, and it is worthwhile to try to maintain
that structure in the face of varied disruptions. Note that in the context of
this chapter, when
we talk about the structure of the network, we limit ourselves to the degree
distribution.
For the purposes of our study we assume that the designers of the network
are only aware of the statistical properties of the removed nodes and have
no ability to influence the existing network beyond the introduction of new
nodes along with their corresponding edges. They thus have two processes
under their control to compensate for the attack. The first is the degree of
the introduced vertices, and the second is the process by which a newly
introduced node chooses to attach to a previously extant one in the network.
Consequently, failure is compensated by adding nodes and edges chosen from
an appropriate degree distribution and attaching them to the network via
specially tailored schemes.
As mentioned before, a variety of models have been proposed to simulate
network evolution and growth where vertices are both added and deleted, but
these have concentrated on the relatively simple case of uniform deletion. We
have already shown in Section 2 that, under uniform failures, the appearance
of degree correlations that typically arise as a result of growth processes can
be neglected. For the case of non-uniform deletion, correlations cannot be
ignored. Here we will proceed by demonstrating how to preserve an initially
uncorrelated network throughout the evolution process, with the introduction
of an additional rate equation for the degree correlations; consequently, our
focus will be on the currently neglected case of non-uniform failures. The
results of this section are based on the work of [26].

4.1 Types of Disruptions

Before we move on to our method for repairing networks, we provide a brief de-
scription of the types of attacks or failures that most networks are subject to.
Random failures are the most generally studied schemes in both static and
evolving networks, because they lend themselves to relatively simple analysis.
These types of failures may be representative, say, of disruption of power lines
or transformers in a power grid owing to extraneous factors such as weather.
However, the functionality of most networks often depends on the performance
of higher-degree nodes; consequently, non-uniform attack schemes focus on
these. For example, in a peer-to-peer network, a high-degree node could be a
central user with large amounts of data. High degree could also be indicative
of the amount of load on a node during its operation, or on the public visibility
of a person in a social network. It is reasonable to assume that a malicious
entity such as a computer virus is more likely to strike these important nodes.
We can simulate these kinds of attacks using preferential failures, $a_k \propto k$,
which sample nodes in proportion to their number of connections, and through an
outright attack on the highest-degree nodes represented by
$a_k \propto \theta(k - k_{\min})$, where $\theta(x)$ is the Heaviside step function.
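
The proportionality constants in these kernels are fixed by the normalization
condition, Eq. (3). A minimal Python sketch, using a toy degree distribution
and helper names that are assumptions introduced here:

```python
def normalize_kernel(raw, p):
    """Scale raw weights so that sum_k a_k p_k = 1, as required by Eq. (3)."""
    total = sum(r * pk for r, pk in zip(raw, p))
    return [r / total for r in raw]

def uniform_kernel(p):
    """Random failure: a_k = 1 for every degree."""
    return normalize_kernel([1.0] * len(p), p)

def preferential_kernel(p):
    """Preferential failure: a_k proportional to k."""
    return normalize_kernel([float(k) for k in range(len(p))], p)

def targeted_kernel(p, k_min):
    """Targeted attack: a_k proportional to theta(k - k_min).

    The convention theta(0) = 1 is an assumption made here.
    """
    return normalize_kernel([1.0 if k >= k_min else 0.0 for k in range(len(p))], p)

p = [0.1, 0.2, 0.3, 0.25, 0.15]   # toy degree distribution for k = 0..4
a_uniform = uniform_kernel(p)
a_pref = preferential_kernel(p)
a_target = targeted_kernel(p, k_min=3)
```

For the preferential kernel the normalization gives $a_k = k/\langle k \rangle$,
and for the targeted kernel $a_k$ is the reciprocal of the fraction of nodes
with degree at least $k_{\min}$.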
Our method of compensation will involve control over two processes: first, the
process by which our newly incoming/repaired node chooses a degree for itself,
drawn from some distribution $m_k$, and second, the process by which this node
decides to attach to any other in the network, governed by the attachment
kernel $\pi_k$.

4.2 Repair Method

The evolution process, specifically non-uniform removal of nodes, can, and in
many cases will, introduce degree correlations into our networks. In order to
confront this issue, we proceed as follows. First we will find choices for $m_k$ and
$\pi_k$ that solve the rate equation for a given $p_k$ in a network
that is uncorrelated. We will then demonstrate that a special subset of those
solutions for $m_k$ and $\pi_k$ is an uncorrelated fixed point of the rate equation for
the degree correlations. Our goal here is to solve for the attachment kernel
$\pi_k$ that will preserve the original probability distribution $p_k$, subject to a
deletion kernel $a_k$, for some choice of $m_k$.
Before we move on, we need to make a slight modification to Eq. (6). In
the earlier instance, we exploited the simplification that arose from uniform
deletion. We will assume here that the initial network is uncorrelated; however,
we will retain the general form of the deletion kernel $a_k$. Let
$\langle k \rangle_a$ be the mean degree of nodes removed from the network
(i.e. $\langle k \rangle_a = \sum_k k a_k p_k$), and $\langle k \rangle$
the mean degree of the original degree distribution $p_k$. Then we have

$$c\pi_{k-1}p_{k-1} - c\pi_k p_k + \frac{\langle k \rangle_a}{\langle k \rangle}(k+1)p_{k+1} - \frac{\langle k \rangle_a}{\langle k \rangle}kp_k - a_k p_k + m_k = 0. \tag{26}$$

Once again, it can be easily shown from Eq. (26) that the average degree of
nodes removed is $\langle k \rangle_a = c$. Introducing the cumulative distributions
for the attacked and newly added nodes, $A_k$ and $M_k$ respectively,

$$A_k = \sum_{l=k}^{\infty} a_l p_l, \qquad M_k = \sum_{l=k}^{\infty} m_l, \tag{27}$$

we sum Eq. (26) from $k = k' + 1$ to $\infty$ and relabel $k' \to k$, to get

$$\pi_k p_k = \frac{(k+1)p_{k+1}}{\langle k \rangle} + \frac{A_{k+1} - M_{k+1}}{c}. \tag{28}$$

Dividing both sides by pk then gives us an expression for the attachment


kernel,
 
1 (k + 1)pk+1 Ak+1 − Mk+1
πk = + . (29)
pk k c

This equation represents the set of possible solutions for the attachment kernel
that will lead to the desired degree distribution, given that the final network
is uncorrelated. The correct choice of solution from the above set must obey
the consistency condition that, when inserted into the rate equation for the
degree correlations, the correlations vanish. The following ansatz chosen from
the above set is such a choice:

\[
m_k = a_k p_k, \qquad
\pi_k = \frac{q_k}{p_k} = \frac{(k+1) p_{k+1}}{\langle k \rangle\, p_k}. \tag{30}
\]

The reason behind this choice will be made clearer in the next section.
Note the similarity with Eq. (17), which was derived in the context of uniform
deletion. Here we see that it holds true even for non-uniform deletion, albeit
with some caveats that we will see shortly. There are two conditions for the
existence of a solution given by this equation: $a_k p_k$ must be a valid
probability distribution, and $\langle k \rangle$ must be finite. These are not
very stringent conditions and are satisfied by most degree distributions. In
other words, barring some pathological cases, it is always possible to find a
solution of the above form.
We are now in a position to effect our repair on the network. Given the
original degree distribution pk and the form of the attack ak , Eq. (30) gives
us the precise recipe for recovering the degree distribution. We need to sam-
ple the degrees of the newly introduced nodes in proportion to the product of
the deletion kernel and the degree distribution, and then attach these edges
in proportion to the excess degree distribution of the network. To test our
repair method, we provide two examples for initially uncorrelated networks
with 10,000 nodes generated using the configuration model [25]. In the config-
uration model, only the degrees of vertices are specified. Apart from this sole
constraint the connections between vertices are made at random.
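
The recipe of Eq. (30) can be sketched numerically. The following Python
fragment is a minimal illustration, not the author's implementation: the
function names `repair_kernels` and `rate_residual`, the truncation of the
degree range at k = 60, and the particular exponential degree distribution with
a step-function attack are assumptions chosen for this example. It builds the
new-node degree distribution and the attachment kernel, then evaluates the
left-hand side of Eq. (26), which should vanish up to rounding error.

```python
import numpy as np

def repair_kernels(p, a_raw):
    """Return (a, m, pi) following Eq. (30).

    p     : degree distribution p_k for k = 0..k_max (assumed > 0 everywhere)
    a_raw : unnormalised deletion kernel a_k on the same range
    """
    p = np.asarray(p, dtype=float)
    k = np.arange(len(p))
    a = np.asarray(a_raw, dtype=float)
    a = a / np.sum(a * p)                        # enforce sum_k a_k p_k = 1
    m = a * p                                    # m_k = a_k p_k
    mean_k = np.sum(k * p)
    q = np.append((k * p)[1:], 0.0) / mean_k     # q_k = (k+1) p_{k+1} / <k>
    pi = q / p                                   # pi_k = q_k / p_k
    return a, m, pi

def rate_residual(p, a, m, pi):
    """Left-hand side of Eq. (26) for every k; zero at a fixed point."""
    k = np.arange(len(p))
    mean_k = np.sum(k * p)
    k_a = np.sum(k * a * p)                      # mean degree of removed nodes
    c = np.sum(k * m)                            # mean degree of added nodes
    gain = np.concatenate(([0.0], (pi * p)[:-1]))   # pi_{k-1} p_{k-1}
    kp = k * p
    kp_next = np.append(kp[1:], 0.0)             # (k+1) p_{k+1}
    return (c * gain - c * pi * p
            + (k_a / mean_k) * (kp_next - kp) - a * p + m)

# Hypothetical example: truncated exponential p_k, attack on degrees >= 5.
k = np.arange(0, 61)
p = np.exp(-0.4 * k)
p /= p.sum()
a, m, pi = repair_kernels(p, (k >= 5).astype(float))
residual = rate_residual(p, a, m, pi)
print(np.max(np.abs(residual)))
```

Dividing by $p_k$ requires $p_k > 0$ on the whole range, which is why the
sketch uses a distribution with $p_0 > 0$; handling degrees with $p_k = 0$
would need extra care that is not attempted here.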
We employ two types of attack kernels on our two example networks: preferential
attack, represented by $a_k \propto k$, and a targeted attack only on
high-degree nodes, represented by $a_k \propto \Theta(k - k_{\min})$. Our first
network has links distributed according to a power law with an exponential
cutoff,

\[
p_k = \begin{cases} C k^{-\gamma} e^{-k/\kappa}, & k \neq 0, \\ 0, & k = 0, \end{cases} \tag{31}
\]

where $C$ is a normalization constant.
Our second choice of network has an exponential degree distribution,

\[
p_k = \left( 1 - e^{-\lambda} \right) e^{-\lambda k}. \tag{32}
\]
In Fig. 3 we show the resulting degree distribution for the power-law network
where nodes were attacked preferentially, while Fig. 4 shows the results for the
exponentially distributed network undergoing targeted attack. Both figures
indicate that the initial and final networks are in excellent agreement.
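
For concreteness, the two example distributions can be tabulated as follows.
This is a hedged sketch: the cutoff `k_max`, the function names and the random
seed are assumptions of the example, while the parameter values are those
quoted in the captions of Figs. 3 and 4. The last lines draw degrees for
replacement nodes under preferential attack.

```python
import numpy as np

def power_law_cutoff(gamma, kappa, k_max):
    """Eq. (31) truncated at k_max; the sum fixes the constant C numerically."""
    k = np.arange(k_max + 1)
    p = np.zeros(k_max + 1)
    p[1:] = k[1:] ** (-float(gamma)) * np.exp(-k[1:] / kappa)
    return p / p.sum()

def exponential(lam, k_max):
    """Eq. (32) truncated at k_max (the neglected tail is tiny for large k_max)."""
    k = np.arange(k_max + 1)
    return (1.0 - np.exp(-lam)) * np.exp(-lam * k)

p_pl = power_law_cutoff(gamma=3, kappa=100, k_max=1000)
p_exp = exponential(lam=0.4, k_max=200)

# Preferential attack a_k proportional to k, normalised so sum_k a_k p_k = 1.
k = np.arange(len(p_pl))
a = k / np.sum(k * p_pl)
m = a * p_pl                      # degree distribution of replacement nodes

rng = np.random.default_rng(0)    # seed chosen arbitrarily for the example
new_degrees = rng.choice(k, size=5, p=m)
```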

4.3 Neglecting Degree Correlations

To demonstrate the validity of our results, we must prove that our initially
uncorrelated networks remain uncorrelated under our repair scheme. Here we
give a brief sketch of the idea; for full details, see [26].
Fig. 3. Log-binned degree distribution (probability $p_k$ versus degree $k$ on
logarithmic axes) of a power-law network ($10^4$ nodes) with exponent
$\gamma = 3$ and exponential cutoff $\kappa = 100$, under preferential attack
$a_k \propto k$ using $\pi_k$ from Eq. (30) after setting $m_k = a_k p_k$. The
data points are averaged over multiple realizations of the network, each
subject to $10^5$ iterations of addition and deletion. The points along with
corresponding error bars represent the final degree distribution, whereas the
solid line represents the initial network.

Fig. 4. Degree distribution (probability $p_k$ versus degree $k$ on logarithmic
axes) of an exponential network ($10^4$ nodes) with $\lambda = 0.4$ under
targeted attack $a_k \propto \Theta(k - 5)$ using $\pi_k$ from Eq. (30) after
setting $m_k = a_k p_k$.

We start off by defining a rate equation for the correlations. The rate
equation describes the evolution of the expected number of edges in the network
with ends of degree $k$ and $l$. Let the number of such edges in the network be

\[
m e_{l,k}, \tag{33}
\]

where $m = n \langle k \rangle / 2$, and $e_{l,k}$ is the probability that a
randomly selected edge has degree $k$ at one end and degree $l$ at the other.
The expected number of edges after one time step, in which we add $c$ and take
away $\langle k \rangle_a$ edges, is then

\[
[m + c - \langle k \rangle_a]\, e'_{l,k} = m e_{l,k} + \Delta, \tag{34}
\]

where $e'_{l,k}$ is the edge distribution after the step and $\Delta$
represents all other edge addition and removal processes.
We have already established that in the steady-state case
$\langle k \rangle_a = c$ irrespective of the degree distribution, so our goal
is equivalent to showing that $\Delta$ is equal to zero for an uncorrelated
network generated/repaired with our special choices of $\pi_k$ and $m_k$. As a
result $e'_{k,l} = e_{k,l}$, implying that the degree correlations (if any)
remain constant over time.
According to Eq. (34), there thus exists a set of solutions for which an
initially uncorrelated network does not develop correlations as a consequence
of the evolution process. The attachment kernel of Eq. (30) that was employed
in the network evolution process belongs to this set. The repair method can
therefore be applied while keeping correlations in the network negligible.
To briefly summarize, we have demonstrated that if a network with a certain
degree structure is subjected to an attack that aims to destabilize that
structure, one can recover the structure by manipulating the rules by which
vertices are added to, or reintroduced into, the network. The rules that we
employ in our repair method depend on the type of attack on the network.

5 Conclusion
In this paper we have discussed some interesting alternative applications of
network growth models. Traditionally these models have been used to determine
the processes by which networks in the real world form. However, the
mathematical framework can be adapted to other uses. Here we have provided
two examples.
In the first example, we have considered the problem of designing networks
by trying to manipulate the rules by which they evolve. For a certain class of
networks, such as peer-to-peer networks, the limited control that this manipu-
lation gives us over network structure may be sufficient to generate significant
improvements in network performance. Using generating function methods,
we have shown that it is possible to create networks with a desired degree
distribution by appropriate choice of the attachment kernel that governs how
newly added vertices connect to the network. We studied in detail one partic-
ularly simple case of a Poisson network that can be realized in straightforward
fashion and allows us to perform decentralized searches in constant time, and
makes only constant bandwidth demands per node, even in the limit where
the database becomes arbitrarily large.
In the second example, we have shown how to preserve a network’s degree
distribution from various forms of attack or failures by allowing it to react to
the disruptions via the introduction of new nodes and edges. Recent empirical
studies [27] have suggested that node removal, for example, in the World Wide
Web, is typically non-uniform in nature. Unfortunately, as we have seen,
non-uniform removal leads to the creation of degree correlations in the network,
which makes analysis difficult. To deal with the special case of non-uniform
deletion we have introduced a rate equation for the evolution of degree cor-
relations and have used that in combination with the equation for the degree
distribution to work around this problem. The structure of many networks
in the real world is crucially related to their performance, and consequently,
loss of these properties can lead to severe constraints on their performance.
In view of this, it is crucial for researchers to come up with effective solutions
to try and manage these types of disruptions.
The ideas in this paper have been presented chiefly to demonstrate the
use and versatility of network evolution models. There remains much oppor-
tunity for other applications than those discussed here, as well as for ways to
execute them in the real world. We hope that this will stimulate the imagina-
tion of researchers working in the field and look forward to new and exciting
developments.

Acknowledgments
The author thanks Mark Newman and Brian Karrer for illuminating discus-
sions. This work was funded by the James S. McDonnell Foundation.

References
1. R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
2. S. N. Dorogovtsev and J. F. F. Mendes, Adv. Phys. 51, 1079 (2002).
3. M. E. J. Newman, SIAM Review 45, 167 (2003).
4. D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).
5. R. J. Williams and N. D. Martinez, Nature 404, 180 (2000).
6. R. Albert, H. Jeong and A.-L. Barabási, Nature 401, 130 (1999).
7. D. J. de S. Price, Science 149, 510 (1965).
8. S. Redner, Eur. Phys. J. B 4, 131 (1998).
9. D. J. de S. Price, J. Amer. Soc. Inform. Sci. 27, 292 (1976).
10. A.-L. Barabási and R. Albert, Science 286, 509 (1999).
11. E. Ben-Naim and P. L. Krapivsky, J. Phys. A: Math. Theor. 40, 8607 (2007).
12. C. Moore, G. Ghoshal and M. E. J. Newman, Phys. Rev. E 74, 036121 (2006).
13. J. Saldaña, Phys. Rev. E 75, 027102 (2007).
14. T. Hong, in Peer-to-Peer, Harnessing the Benefits of a Disruptive Technology,
edited by Andy Oram (O’Reilly, Sebastopol, CA, 2001), Chap. 14, pp. 203–241.
15. L. A. Adamic, R. M. Lukose, A. R. Puniyani and B. A. Huberman, Phys. Rev. E
64, 046135 (2001).
16. N. Sarshar, P. O. Boykin and V. P. Roychowdhury, Fourth International Confer-
ence on Peer-to-Peer Computing, pp. 2–9, Washington, D.C. (2004).
17. G. Paul, T. Tanizawa, S. Havlin and H. E. Stanley, Eur. Phys. J. B 38, 187
(2004).
18. R. Cohen, K. Erez, D. ben-Avraham and S. Havlin, Phys. Rev. Lett. 85, 4626
(2000).
19. D. S. Callaway, M. E. J. Newman, S. H. Strogatz and D. J. Watts, Phys. Rev.
Lett. 85, 5468 (2000).
20. M. E. J. Newman and G. Ghoshal, Phys. Rev. Lett. 100, 138701 (2008).
21. B. Rezai, N. Sarshar, V. Roychowdhury and P. O. Boykin, Physica A 381, 497
(2007).
22. A. E. Motter, Phys. Rev. Lett. 93, 098701 (2004).
23. P. L. Krapivsky and S. Redner, Phys. Rev. E 63, 066123 (2001).
24. G. Ghoshal and M. E. J. Newman, Eur. Phys. J. B 58, 175 (2007).
25. M. Molloy and B. Reed, Random Struct. Algorithms 6, 161 (1995).
26. B. Karrer and G. Ghoshal, Eur. Phys. J. B 62, 239 (2008).
27. J. S. Kong and V. P. Roychowdhury, e-print arXiv:0711.3263v2.
The Big Friendly Giant: The Giant Component
in Clustered Random Graphs

Yakir Berchenko,¹ Yael Artzy-Randrup,² Mina Teicher,¹ and Lewi Stone²

¹ Interdisciplinary Brain Research Center, Bar Ilan University, Ramat Gan 52900,
Israel; byakir@gmail.com, teicher@macs.biu.ac.il
² Biomathematics Unit, Faculty of Life Sciences, Tel Aviv University, Ramat Aviv
69978, Israel; artzyra@post.tau.ac.il, lewi@post.tau.ac.il

1 Introduction

Network theory is a powerful tool for describing and modeling complex systems,
with applications in widely differing areas including epidemiology [16],
neuroscience [34], ecology [20] and the Internet [26]. In the field's early
days, one often compared an empirically given network, whose nodes are the
elements of the system and whose edges represent their interactions, with an
ensemble having the same number of nodes and edges, the most popular example
being the random graphs introduced by Erdős and Rényi [11]. As the field
matured, it became clear that this naive model needed to be refined, because
real-world networks often differ significantly from the Erdős–Rényi random
graphs in having a highly heterogeneous, non-Poisson degree distribution [5, 15]
and in possessing a high level of clustering [33].
Methods for generating random networks with arbitrary degree distribu-
tions and for calculating their statistical properties are now well understood.
This is usually achieved with the aid of the configuration model [6] and by
employing an analysis of a certain branching process based on generating
functions [24]. However, clustering, the other property that characterizes
real-world networks, remains far less understood. Clustering refers to the
relative number of triangles in a network, and is commonly measured by the
coefficient introduced in [24] as $C = 3 N_\triangle / N_3$. Here $N_\triangle$
is the total number of triangles in the network, while $N_3$ is the number of
connected triples of nodes. This definition has the advantage that $C$ is also
the probability that two nodes which connect to a mutual node are connected
themselves, thereby forming a triangle whereby “a friend of a friend is also a
friend.”
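
The definition can be made concrete with a short Python sketch. The
adjacency-dictionary representation, the function name and the small test graph
are assumptions of this example; the code counts connected triples once per
centre node and checks which of them are closed, so that the ratio equals three
times the number of triangles divided by the number of triples.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """C = 3 * (number of triangles) / (number of connected triples).

    adj maps each node to the set of its neighbours (undirected, simple graph).
    """
    triples = 0          # paths of length two, counted once per centre node
    closed = 0           # triples whose two end nodes are also linked
    for nbrs in adj.values():
        for u, w in combinations(sorted(nbrs), 2):
            triples += 1
            if w in adj[u]:
                closed += 1
    # Each triangle closes three triples, so closed == 3 * triangles.
    return closed / triples if triples else 0.0

# Hypothetical graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
triangle_with_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = clustering_coefficient(triangle_with_tail)
print(C)
```

Here there is one triangle and five connected triples, so the coefficient is
3/5.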
The main difficulty when studying clustered networks is that the branching
processes, which are at the heart of the generating function formalism of [24],
no longer seem applicable due to the formation of short loops, namely trian-
gles. The lack of obvious analytical tools [16] and techniques for incorporating
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_14,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
triangles into random graph models with an arbitrary degree distribution [21]
has led researchers to pursue several different avenues. One should mention
several of these attempts:
• Giving up on analytic predictions, and conducting instead descriptive
studies [30], where various clustering indices are defined and measured
for a given real-world network. Resorting to simulations is also quite
common [33].
• Considering special cases that are amenable to analysis. Examples include
constructing a one-mode projection of a bipartite graph [14, 22], the
framework of [25, 29], which generates exponential random graphs, and
Markov random graphs [32], which are flexible but more difficult to
analyze.
• Adopting results and criteria from the unclustered case and wrongly applying
them to clustered graphs, a common but somewhat naive practice. A relevant
example concerns the emergence of the giant component (GC): it was shown [24]
that in the usual, unclustered, case there is a GC if the mean number of
nodes at a distance two ($z_2$) is larger than the mean number of nodes at a
distance one ($z_1$). This result is often (wrongly) taken as the criterion
for clustered networks [22, 31], thereby initiating the quest to calculate
$z_2$ in the presence of clustering [22, 23, 31].
Here we suggest constructing a branching process that is applicable for net-
works with triangles [7, 28]. This recent approach seems very promising, and
we will pay attention to it, using the formalism of [7], rather than that found
in [28]. The latter relies on the restrictive assumption that any two triangles
in a network will never share an edge. Even in this limited setting, the results
are only applicable for relatively low levels of clustering (C), and the concepts
are difficult to interpret and broaden.
In Section 2 we review the application of generating functions for unclus-
tered (C = 0) random networks [24] (2.1), and describe the novel free-excess
degree formalism for clustered networks [7] (2.2). In Section 3 we discuss crit-
icality in random clustered graphs. Most of this section is devoted to the
emergence of the GC, as indeed is the bulk of the literature, but we will also
discuss briefly the second critical point (3.2), where the graph becomes con-
nected, which has impact on processes such as synchronization in networks
[17]. In Section 4 we show how to estimate the size of the GC as shown in [7];
then we broaden the setting to study the robustness and resilience of the GC,
i.e., bond, site and joint bond+site percolation (4.2). In Section 5 we describe
our simulations and compare the theory with data from real-world networks.
We discuss our findings in Section 6.

2 Generating Functions
A generating function is a clothesline on which we hang up a sequence of
numbers for display [35].
For an excellent introduction to generating functions the reader is referred
to the book generatingfunctionology by Wilf [35]. Here we use the terminology
and notation used by Newman and colleagues [24] as it has been adapted for
network theory.

2.1 Unclustered Random Networks: C = 0


We begin by reviewing the application of generating functions for unclustered
($C = 0$) random networks [24]. Define the generating function

\[
G_0(x) = \sum_{k=0}^{\infty} p_k x^k, \tag{1}
\]

where pk is the probability that a randomly chosen node on the graph has
degree k. The distribution pk is assumed to be normalized, so that G0 (1) = 1.
The same will be true for all generating functions considered here. Because
the probability distribution is normalized and positive definite, G0 (x) is also
convergent for all |x| ≤ 1, and hence has no singularities in this region. The
function G0 (x), and indeed any probability generating function, has a number
of properties that will prove useful in subsequent developments.
Moments. The average over the probability distribution generated by a
generating function, for instance, the average degree $z_1$ of a node in the
case of $G_0(x)$, is given by

\[
z_1 = \langle k \rangle = \sum_k k p_k = G_0'(1). \tag{2}
\]
Thus, if we know the values of the coefficients of a generating function, we
can calculate the mean of the probability distribution which it generates.
Powers. If the distribution of a property k of an object is generated by a
given generating function, then the distribution of the total of k summed over
m independent realizations of the object is generated by the m-th power of
that generating function. For example, if we choose $m$ vertices at random from
a large graph, then the distribution of the sum of the degrees of those vertices
is generated by $[G_0(x)]^m$.
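
The "powers" property can be checked numerically, since multiplying generating
functions corresponds to convolving their coefficient sequences. The sketch
below is illustrative only; the three-point distribution and the function name
are assumptions made for the example.

```python
import numpy as np

def sum_distribution(p, m):
    """Coefficients of [G0(x)]^m: distribution of the sum of m i.i.d. degrees."""
    out = np.array([1.0])            # generating function of the constant 0
    for _ in range(m):
        out = np.convolve(out, p)    # polynomial multiplication by G0(x)
    return out

p = np.array([0.2, 0.5, 0.3])        # hypothetical p_0, p_1, p_2
dist = sum_distribution(p, 2)        # total degree of two random vertices
print(dist)
```

For this choice the total degree of two vertices takes the values 0 to 4 with
probabilities 0.04, 0.20, 0.37, 0.30 and 0.09.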
Another quantity that will be important to us is the distribution of the
degree of the vertices that we arrive at by following a randomly chosen edge.
Such an edge arrives at a node with probability proportional to the degree
of that node, and the node therefore has a probability distribution of degree
proportional to $k p_k$. The correctly normalized distribution is generated by

\[
\frac{\sum_k k p_k x^k}{\sum_k k p_k} = x \frac{G_0'(x)}{G_0'(1)}. \tag{3}
\]

Beginning at a randomly chosen node and following one of the edges at
that node, we reach a neighbor $v_1$. We are interested in the distribution of
the outgoing edges of $v_1$, or its “excess degree” (i.e., the node’s degree
minus one, accounting for the edge we arrived along). Since the probability
$q_k$ of having $k$ outgoing edges is $q_k = (k+1) p_{k+1} / z_1$, the
distribution of outgoing edges, or excess degree distribution [24], is
generated by the function

\[
G_1(x) := \sum_k q_k x^k = \frac{G_0'(x)}{G_0'(1)} = \frac{1}{z_1} G_0'(x), \tag{4}
\]

and the average excess degree is thus

\[
z_e = \sum_k k q_k = G_1'(1). \tag{5}
\]
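
As a small numerical illustration of Eqs. (4) and (5), the following sketch
computes the excess degree distribution and its mean from a tabulated degree
distribution. The helper names and the truncation of the Poisson distribution
at k = 60 are assumptions of the example. For a Poisson distribution the mean
excess degree equals the mean degree, which the last lines check.

```python
import numpy as np

def excess_degree_distribution(p):
    """q_k = (k + 1) p_{k+1} / z1, the coefficients of G1(x) in Eq. (4)."""
    p = np.asarray(p, dtype=float)
    k = np.arange(len(p))
    z1 = np.sum(k * p)                  # G0'(1), Eq. (2)
    return (k * p)[1:] / z1             # one entry shorter than p

def mean_excess_degree(p):
    """z_e = sum_k k q_k = G1'(1), Eq. (5)."""
    q = excess_degree_distribution(p)
    return np.sum(np.arange(len(q)) * q)

def poisson(z, k_max):
    """Poisson probabilities p_0..p_kmax built by the recurrence p_k = p_{k-1} z / k."""
    p = np.empty(k_max + 1)
    p[0] = np.exp(-z)
    for k in range(1, k_max + 1):
        p[k] = p[k - 1] * z / k
    return p

q = excess_degree_distribution([0.2, 0.5, 0.3])   # hypothetical p_0, p_1, p_2
z_e_poisson = mean_excess_degree(poisson(2.0, 60))
```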

When the clustering coefficient, $C$, is zero,¹ the probability that any of
these outgoing edges connects to the original node that we started at, or to
any of its other immediate neighbors, scales as $N^{-1}$ and hence can be
neglected in the limit of large $N$.² Thus, making use of the “powers” property
described above, the generating function for the probability distribution of
the number of second neighbors of the original node can be written as

\[
\sum_k p_k [G_1(x)]^k = G_0(G_1(x)). \tag{6}
\]

Similarly, the distribution of the third-nearest neighbors is generated by
$G_0(G_1(G_1(x)))$, and so on. The average number $z_2$ of the second neighbors
is

\[
z_2 = \left[ \frac{d}{dx} G_0(G_1(x)) \right]_{x=1} = G_0'(1) G_1'(1) = G_0''(1) = z_1 z_e, \tag{7}
\]

where we have made use of the fact that $G_1(1) = 1$.
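
Equations (6) and (7) can be verified on a toy example by composing the two
generating functions as polynomials. The sketch below is an illustration under
stated assumptions: the three-point degree distribution and the helper
`compose` are hypothetical, and the composition uses Horner's rule with
polynomial multiplication done by convolution.

```python
import numpy as np

def compose(outer, inner):
    """Coefficients of outer(inner(x)) by Horner's rule on polynomials."""
    result = np.array([outer[-1]], dtype=float)
    for coeff in outer[-2::-1]:
        result = np.convolve(result, inner)   # multiply by inner(x)
        result[0] += coeff                    # add the next coefficient
    return result

p = np.array([0.2, 0.5, 0.3])                 # hypothetical p_0, p_1, p_2
k = np.arange(len(p))
z1 = np.sum(k * p)
q = (k * p)[1:] / z1                          # excess degree distribution
z_e = np.sum(np.arange(len(q)) * q)

second = compose(p, q)                        # coefficients of G0(G1(x))
z2 = np.sum(np.arange(len(second)) * second)  # mean number of second neighbors
```

For this distribution both `z2` and the product `z1 * z_e` equal
$G_0''(1) = 0.6$.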

2.2 The Free-Excess Degree

The preceding calculations can be modified for application to clustered
networks ($C > 0$) [7]. Analogous to the excess degree, beginning at a randomly
chosen node $v_0$ and following one of the edges at that node, we reach a
neighbor $v_1$. We are now interested in $\{e_i\}_{i=0}^{\infty}$, the
distribution of the outgoing edges of $v_1$ that are not connected to a
neighbor of $v_0$.

¹ More accurately: $\forall \varepsilon > 0$, $\Pr(C > \varepsilon) \to 0$ as $N \to \infty$.
² In Section 2.2, when $C > 0$, we will need a similar observation; namely, that
the probability of having a cycle of length four that is not composed of two
triangles scales as $N^{-1}$ and hence can also be neglected for large $N$.

Suppose we travel from node $v_0$ along an edge to node $v_1$ having degree
$d(v_1) = i + 1$ (i.e., with an excess degree of $i$). The probability that it
will have $k$ neighbors that are not connected back to $v_0$ (via a triangle)
is

\[
\binom{i}{k} (1 - C)^k C^{i-k}. \tag{8}
\]

This is just the probability that, of the $i$ outgoing edges of $v_1$, $i - k$
are connected in a triangular formation that includes $v_0$, while the other
$k$ edges are not. Here, as before, $C$ is just the probability of a triangular
formation. When $d(v_1)$ is
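not known, from (8) we obtain

\[
e_k := \sum_{i=0}^{\infty} q_i \binom{i}{k} (1 - C)^k C^{i-k}
= \sum_{i=0}^{\infty} q_i C^i \binom{i}{k} \left( \frac{1 - C}{C} \right)^k. \tag{9}
\]

The generating function, $G_c(x)$, for the distribution is

\[
G_c(x) := \sum_{k=0}^{\infty} e_k x^k
= \sum_{k=0}^{\infty} \sum_{i=0}^{\infty} q_i C^i \binom{i}{k} \left( \frac{1 - C}{C} \right)^k x^k. \tag{10}
\]

The order of summation may be changed to obtain

\[
G_c(x) = \sum_{i=0}^{\infty} q_i C^i \sum_{k=0}^{i} \binom{i}{k} \left( \frac{1 - C}{C}\, x \right)^k. \tag{11}
\]

Using the binomial theorem we obtain

\[
G_c(x) = \sum_{i=0}^{\infty} q_i C^i \left( 1 + \frac{1 - C}{C}\, x \right)^i
= \sum_{i=0}^{\infty} q_i \left( C + (1 - C) x \right)^i = G_1(C + (1 - C) x). \tag{12}
\]

Thus, we arrive at the key relationship

\[
G_c(x) = G_1(C + (1 - C) x). \tag{13}
\]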

Let us remark that in deriving (8)–(13), it is possible to use any other clus-
tering index, such as c(k)—the degree-dependent clustering coefficient used
in [28]. However, it might be hard, if not impossible, to obtain a solution with
such a simple closed form.
As an example of how (13) may be useful, it is possible to determine the
mean free-excess degree:

\[
\sum_i i e_i = \left. \frac{dG_c(x)}{dx} \right|_{x=1} = (1 - C) G_1'(1) = (1 - C) z_e. \tag{14}
\]

Similarly, it will prove useful to calculate the mean number of edges emanating
outwards from nodes at a distance one to nodes at a distance two, beginning
from some arbitrary source node (note that this is not the mean number of
nodes at a distance two, because there is a positive probability that two
edges reach the same node at a distance two). Similarly to (6) and (7), the
mean is

\[
\left. \frac{dG_0(G_c(x))}{dx} \right|_{x=1} = G_0'(1) G_1'(1) (1 - C) = (1 - C) z_1 z_e. \tag{15}
\]
This parameter was also calculated in [23] by a different technique, but as will
be discussed shortly, its importance appears to have been overlooked.
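
The relations (13) and (14) lend themselves to a direct numerical check. The
sketch below is only an illustration: the four-point excess degree
distribution, the value C = 0.25, the evaluation point x = 0.7 and the function
names are all assumptions. It builds the free-excess distribution from the
binomial sum of Eq. (9) and compares its generating function and mean with the
closed forms.

```python
import numpy as np
from math import comb

def free_excess_distribution(q, C):
    """e_k from Eq. (9): each of the i excess edges is 'free' with probability 1 - C."""
    e = np.zeros(len(q))
    for i, q_i in enumerate(q):
        for k in range(i + 1):
            e[k] += q_i * comb(i, k) * (1 - C) ** k * C ** (i - k)
    return e

def G(coeffs, x):
    """Evaluate a probability generating function with the given coefficients."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

q = np.array([0.1, 0.3, 0.4, 0.2])    # hypothetical excess degree distribution
C = 0.25
e = free_excess_distribution(q, C)

x = 0.7
lhs = G(e, x)                          # G_c(x) from its definition
rhs = G(q, C + (1 - C) * x)            # G_1(C + (1 - C) x), Eq. (13)

z_e = sum(k * q_k for k, q_k in enumerate(q))
mean_free = sum(k * e_k for k, e_k in enumerate(e))   # should equal (1 - C) z_e
```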

3 The Critical Point


The interest in random graph theory was initiated by, and is greatly indebted
to, a striking discovery by Erdős and Rényi [11]. They studied the following
simple model of a network, referred to as $G_{N,p}$, or simply as the ER random
graph: Take some number $N$ of nodes and connect each pair with probability
$p$,³ thus defining a probability measure over the ensemble of all such graphs.
Erdős and Rényi demonstrated what is considered to be one of the most
important properties of the random graph, namely that it possesses a phase
transition, from a low-$p$ state ($p(N) < (1 - \epsilon)/N$) in which all
components are small (of size $o(N)$), to a high-$p$ state
($p(N) > (1 + \epsilon)/N$) in which an extensive fraction of all nodes (i.e.,
$\Theta(N)$) are joined together in a single GC.
This result has been extended by Molloy and Reed [18, 19] and [1] to graphs
with an arbitrary degree distribution, thus making them more applicable for
analyzing real-world networks. Here we examine the critical point, where a
GC emerges, in the context of clustered networks (Section 3.1).
There is yet another interesting point, though not as well studied as the
first, where the graph becomes connected, i.e., there is a path from each of
the nodes to any other node. For the ER graph, $G_{N,p}$, this occurs when
$p = \ln(N)/N$ [8].
In Section 3.2 we shall discuss briefly this issue for clustered networks.

3.1 The Emergence of the GC

In their seminal paper, Molloy and Reed [18] introduced the parameter
$Q := \sum_i i p_i (i - 2)$, which identifies the phase transition in random
graphs, i.e., the point where a GC is born. Their procedure utilizes a method
for constructing a random graph, which may be viewed as “walking through a
graph” (Fig. 1a) and assessing the number of unknown nodes encountered
along the way. Suppose one follows a random edge to a node $v$ having degree
$k$. How does this change the number of unknown nodes? First of all, by
arriving at $v$ the number of unknown nodes decreases by one. However, because
$v$ itself has degree $k$, this leads to an increase of $(k - 1)$ in the number
of unknown nodes. The net effect is that the number of unknown nodes increases
by $(k - 2)$. In order to calculate the expected change, the probability
of arriving at $v$, which is proportional to the degree $k$, must also be
factored in. This makes the expected increase in the number of unknown
neighbors proportional to $Q = \sum_i i p_i (i - 2)$. If $Q$ is positive, then
with each step of the walk through the graph the number of unknown nodes, and
the size of the component, grows larger, which is the hallmark of the GC. If
$Q$ is negative, then the number of unknown neighbors reduces to zero;
therefore, we are not walking through a GC. Recalling earlier definitions, the
condition $Q > 0$ may be stated as

\[
z_e > 1. \tag{16}
\]

Since in unclustered ($C = 0$) networks $z_e = z_2 / z_1$, Ref. [24] advocates
the following equivalent criterion.
Criterion A. There is a GC in random networks if $z_2 > z_1$, i.e., the mean
number of second-nearest neighbors is greater than the mean number of neighbors.
This has the intuitive epidemiological interpretation: If the mean number
of infected individuals grows with distance from the source, an epidemic
outbreak will occur.

³ $p$ is usually a function of $N$, $p(N)$.

Fig. 1. Graphical illustration of the exposure procedure. Choose a node at
random, say $V_0$, and start diffusing from it and counting the nodes
encountered on the way. a) When $C = 0$ and the network is tree-like (see
footnote 1), after counting the new nodes ($a_1$–$a_4$) we pick one of them at
random, say $a_1$, and count its new neighboring nodes ($b_1$–$b_3$), which
are distributed according to $\{q_i\}_{i=0}^{\infty}$. In the next step, we
randomly choose one of the nodes ($a_2$–$a_4$, $b_1$–$b_3$) and continue until
the entire component is exposed. b) When $C > 0$, two modifications are
required to deal with cycles due to triangles (the dashed edges): we use
$\{e_i\}_{i=0}^{\infty}$ and diffuse depthwise. After counting $a_1$–$a_4$,
when we count the neighbors of $a_1$ we avoid overcounting $a_2$ because
$\{e_i\}_{i=0}^{\infty}$ governs the distribution of the solid-black edges. In
the next step, if we go from $a_1$ to $b_3$ in order to count the neighbors of
$b_3$, again we avoid overcounting $a_2$ (because it is connected to $a_1$).
The depthwise exposure, which is a permissible scheme [18], is used to avoid
dependencies.
In [7] we have adapted Molloy and Reed’s procedure in a manner that
makes it applicable to clustered networks. Again, suppose we follow a
random edge that begins from a source node and ends at some node $v$.
Previously, if $v$ had degree $k$, the number of “unknown” neighbors would
increase by $k - 2$. However, with triangles there is a possibility that some
of the $k - 1$ outgoing edges will return to nodes that are already known (via
dashed edges in Fig. 1b). It is possible to avoid counting these nodes twice
by counting them in a manner that considers the free-excess degree
distribution $e_k$. Thus, when a node $v$ of free-excess degree $i$ is
encountered, the number of “unknown” neighbors increases by $i - 1$, and the
expected increase in the number of unknown neighbors is thus proportional to
$Q_c = \sum_i e_i (i - 1)$. The criterion for the GC in a clustered network is
just $Q_c > 0$. Since $\sum_i e_i = 1$, it follows from (14) that
$Q_c = (1 - C) z_e - 1$, and the condition becomes

\[
(1 - C) z_e > 1, \tag{17}
\]

which differs from (16) by the scale factor $(1 - C)$. Multiplying both sides
by $z_1$, we obtain $(1 - C) z_1 z_e > z_1$. Recalling (15), this may be
interpreted as the following criterion.
Criterion B. There is a GC if the mean number of edges emanating outwards
from nodes at a distance one to nodes at a distance two (beginning from some
arbitrary source node) is larger than the mean degree.
Note that in the epidemiological sense, the emphasis is on the growth in
the number of outward edges or transmission routes from a typical source
node to its neighbors, and then to its neighbors’ neighbors (Fig. 2a).
Although criterion A was previously used for clustered networks without
any proper justification [22, 31], Fig. 3a shows that it provides poor
predictions of the critical mean degree $z_1^*$ as a function of the
clustering, $C$ (predictions are made using estimates of $z_2$ in the presence
of clustering as detailed in [23, 31]). The accuracy of the prediction can be
assessed against simulations (Fig. 3). In contrast, criterion B is a much
better predictor, as shown in Fig. 2b and Fig. 3a. The latter plots the
analytic result for a Poisson degree distribution, where $z_1 = z_e$ [24] and
$z_1^* = (1 - C)^{-1}$ (from (17)).
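
Criterion B in the form of Eq. (17) is simple enough to state as code. The
following sketch uses hypothetical function names and only restates the
inequality and the Poisson special case; it does not attempt criterion A, whose
estimate of the second-neighbor count under clustering is taken from [31] and
is not reproduced here.

```python
def has_giant_component(z_e, C):
    """Criterion B in the form of Eq. (17): (1 - C) * z_e > 1."""
    return (1.0 - C) * z_e > 1.0

def critical_mean_degree_poisson(C):
    """For a Poisson degree distribution z_1 = z_e, so z_1* = 1 / (1 - C)."""
    return 1.0 / (1.0 - C)

# Parameters of Fig. 2b: C = 0.2 puts the threshold at a mean degree of 1.25.
z1_star = critical_mean_degree_poisson(0.2)
above = has_giant_component(z_e=2.0, C=0.2)    # (1 - C) z_e = 1.6
below = has_giant_component(z_e=1.2, C=0.2)    # (1 - C) z_e = 0.96
```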

Fig. 2. The difference between the new criterion B and the conventional
criterion A. a) Consider the following example: a typical node has a
neighborhood similar to $V_0$, with 3 nodes at a distance one in the first
layer, $L_1$, and 2 nodes at a distance two in the second layer, $L_2$, but 4
edges to the second layer (from $L_1$ to $L_2$). Criterion B predicts a GC,
while criterion A fails to predict a GC. b) The size of the largest component
plotted vs. the size of the network $N$ (simulation compared with
$y = \mathrm{Const} \times N^{2/3}$) for Poisson networks having mean degree
$z_1 = 1.25$ and $C = 0.2$ (i.e., at the critical point according to
criterion B). Indeed the size at the critical point correctly scales as
$\sim N^{2/3}$, as is known for the case $z_1 = 1$, $C = 0$ (see references in
[8]). Note that criterion A would wrongly predict this regime to be below the
critical point (since $z_2 \approx 1.19 < z_1$) and would suggest that all
components should scale as $O(\log N)$.

Fig. 3. The critical mean degree $z_1^*$ for the formation of a GC, plotted as
a function of $C$. a) Poisson degree distribution. Predictions of criterion A
(grey line; $z_2$ estimated as in [31]). Predictions of criterion B (black
line; $z_1^* = (1 - C)^{-1}$ (see text)). Empirical estimates of $z_1^*$
(circles) were obtained through the following procedure in order to overcome
finite size effects: first the value of the size of the largest component was
found for networks with $C = 0$ at the known threshold $z_1^* = 1$ (b; dashed
line, size vs. $z_1$ for $C = 0$ and $C = 0.25$). This value was used to
identify the critical threshold in comparable networks with $C > 0$. c) SF
degree distribution. Symbols as in a. Black and grey lines, which practically
overlap, are based on expressions for $z_1$ and $z_e$ for SF networks [24].

Scale-free (SF) networks, where $p_k \sim k^{-\alpha}$, are usually
characterized by their exponent $\alpha$. However, for the purpose of
discussing criticality, when $\alpha \approx 3.45$ and the tail of the
distribution is not very significant, we can also characterize them by their
mean degree. Taking this approach, Fig. 3c shows that, as opposed to the
Poisson case, the critical mean degree for SF networks is almost constant as a
function of $C$. Its
constancy results from the fact that z1  ze and ze increases to a great ex-
tent with a small increase in z1 [24]. However, criterion A, being based on
the behavior of the second moment of the distribution as well, gives similar
predictions (Fig. 3c) from the same considerations.

3.2 Complete Connectivity

Although the transition to complete connectivity is less well studied, the fol-
lowing example makes clear the need for further work in this area, particularly
for clustered networks.
In a recent series of papers [12, 17], the effect of clustering on a network
of coupled phase oscillators was examined. These authors made the plausible
assumption that by investigating a network with a very high mean degree
their network will be connected. When they [17] found groups of oscillators,
each group oscillating at a different frequency, they named them “dynami-
cal clusters,” in order to distinguish them from the topological clusters (i.e.,
connected components).
[Figure 4: GC size (vertical axis, 0.85 to 1) against C (horizontal axis, 0 to
0.6), with curves for N = 100 and N = 200.]

Fig. 4. Size of the GC vs. C for Poisson network with z1 = 1.5 ln(N).

However, from the previous section we might be tempted to guess that the second
critical point, where the graph becomes connected, scales with $(1 - C)^{-1}$.
Unfortunately, while simulations do not confirm our guess for a disintegration
at $C^* = 1 - \frac{\ln(N)}{z_1 N}$, they do clearly demonstrate that by
introducing clustering to the network, it breaks down quite early (Fig. 4).
When conducting studies such as [12, 17] or considering the validity of
their implications, one should be especially careful when checking complete
connectivity by counting the multiplicity of the eigenvalue 0 of the graph
Laplacian (as done in [17]).⁴ In practice, a numerical implementation will
often return very small, though non-zero, eigenvalues instead of exact
zeros [2].
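The numerical caveat can be illustrated with a minimal Python sketch. It builds L = D − A with NumPy and counts eigenvalues whose magnitude falls below a tolerance rather than testing for exact zeros. The function name `count_components` and the default tolerance `tol=1e-8` are illustrative assumptions and are not taken from [17]; a suitable tolerance depends on the size and scale of the matrix.

```python
import numpy as np

def count_components(adjacency, tol=1e-8):
    """Count connected components as the number of Laplacian eigenvalues
    whose magnitude is below tol (assumed tolerance, not from the source)."""
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a       # L = D - A
    eigenvalues = np.linalg.eigvalsh(laplacian)  # symmetric solver, real output
    return int(np.sum(np.abs(eigenvalues) < tol))

# Example: two disjoint triangles form two connected components.
triangle = np.ones((3, 3)) - np.eye(3)
empty = np.zeros((3, 3))
two_triangles = np.block([[triangle, empty], [empty, triangle]])
n_components = count_components(two_triangles)
```

Testing `eigenvalues == 0` instead of using a tolerance would be fragile, since round-off typically leaves values of order machine precision.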

4 The Size of the GC and Its Robustness


4.1 The Size of the GC

In order to find the size of the GC, Andersson [3] examined the probability of
extinction in a two-phase branching process that mimics the construction of a
random graph (with C = 0). In this branching process the source node has a
number of direct descendants distributed according to $\{p_i\}_{i=0}^{\infty}$
(the first phase), while each of its descendants has a number of direct
descendants distributed according to $\{q_i\}_{i=0}^{\infty}$ (the second phase).
First, consider the probability u for a lineage of a single branch that arrives
at some node, v1, to eventually die out. This necessitates that all k branches
leaving v1 die out, an event that occurs with probability $u^k$. Since the
degree of v1 is unspecified, we obtain the self-consistency condition
$u = \sum_{k=0}^{\infty} q_k u^k = G_1(u)$, which can be solved to find u.
The second step takes into consideration that the branching process begins
from some arbitrary source node. Because all branches originating from the
source must die out in order for the process to become extinct, the probability
of extinction (which is equivalent to belonging to a small component) is equal
to $G_0(u)$, while the probability of persistence (or belonging to a GC) is
$S = 1 - G_0(u)$, which is also the size of the GC.

4
The idea is basically as follows: find the eigenvalues of the matrix L = D − A,
where A is the graph adjacency matrix and D is a diagonal matrix with the degree
of node j as its entry $D_{jj}$; the multiplicity of the eigenvalue 0 is the
number of connected components.

The preceding argument needs to be modified for clustered networks [7].
For the latter, the probability u for the lineage of a single branch to die out
no longer fulfills the condition $u = G_1(u)$, because the progeny in the second
phase are no longer distributed by $\{q_i\}_{i=0}^{\infty}$. Instead, we can
replace $q_i$ with $e_i$ so that the self-consistency condition is, to a close
approximation, $u = G_c(u)$.
The error remaining is largely due to higher order correlations between
nodes in the branching process that occur with probability of the order of $C^2$
(and even smaller when triangles sharing an edge are known to be rare, as is
the focus of Ref. [28]). Indeed $C^2 \ll 1$ in many real-world networks. Thus, we
get the following procedure:
(a) Solve for u such that Gc (u) = u.
(b) Calculate GC size as S = 1 − G0 (u).
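Steps (a) and (b) can be sketched numerically as follows. The fragment iterates u ← Gc(u) from u = 0 and then evaluates S = 1 − G0(u). It is only an illustration for the unclustered case C = 0, where the free-excess distribution reduces to the ordinary excess-degree distribution q; for C > 0 the array `e` would have to be supplied from the free-excess probabilities, which are not computed here. The helper names `pgf`, `gc_size` and `poisson_pmf`, and the truncation at degree 80, are assumptions of this sketch.

```python
import numpy as np

def pgf(pmf, x):
    """Evaluate the generating function G(x) = sum_k pmf[k] * x**k."""
    k = np.arange(len(pmf))
    return float(np.sum(pmf * np.power(float(x), k)))

def gc_size(p, e, tol=1e-12, max_iter=100_000):
    """Step (a): iterate u <- G_c(u) starting from u = 0.
    Step (b): return S = 1 - G_0(u)."""
    u = 0.0
    for _ in range(max_iter):
        u_new = pgf(e, u)
        converged = abs(u_new - u) < tol
        u = u_new
        if converged:
            break
    return 1.0 - pgf(p, u)

def poisson_pmf(z, k_max=80):
    """Poisson degree distribution truncated at k_max (assumed cutoff)."""
    p = np.empty(k_max + 1)
    p[0] = np.exp(-z)
    for k in range(1, k_max + 1):
        p[k] = p[k - 1] * z / k
    return p

def excess_pmf(p):
    """Excess-degree distribution q_k = (k + 1) p_{k+1} / z1."""
    z1 = float(np.sum(np.arange(len(p)) * p))
    return np.arange(1, len(p)) * p[1:] / z1

p = poisson_pmf(2.0)
S = gc_size(p, excess_pmf(p))  # C = 0, so e is taken equal to q
```

Starting the iteration at u = 0 makes it converge to the smallest fixed point in [0, 1], which is the extinction probability; starting at u = 1 would stay at the trivial solution.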

4.2 The Robustness and Resilience of the GC

Another related question concerns the size of the GC in the presence of
dilution, i.e., when a fraction r of the nodes or edges (or a combination of
nodes and edges) has been randomly removed.⁵
This is understood to be related to the robustness and resilience of networks
against breakdowns of their units, the classic example being the World
Wide Web. Although the naive identification of functionality with the
existence of the GC is sometimes considered problematic,⁶ this formalism does
have important applications as in, for example, the study of epidemic
outbreaks [10].

5
Also known respectively as site, bond and joint site+bond percolation.
6
Durrett [10] gives a nice critique of the claim that “the internet is robust..
after dilution (in a certain parameter regime) we still get a GC.” In the regime
referred to, “if all 6 billion people were initially connected then after the
removal only 36 people can check their email.”

We can take the same approach as in the previous section and ask again for
the probability u for a lineage of a single branch that arrives at some node,
v1, to eventually die out. In the case of node removal, in the branching
process, following an edge we reach a node that is unoccupied (was removed)
with probability $r_n$. Therefore, the lineage will die out with probability
$r_n$ plus $1 - r_n$ times the probability that all of the lineages of the
outgoing edges from v1 will eventually die out (found via the self-consistency
condition). Thus, step (a) becomes: Solve for u such that
$r_n + (1 - r_n) G_1(u) = u$. Similarly considering edge removal with
probability $r_e$, replacing the $\{q_i\}_{i=0}^{\infty}$ with the free-excess
probabilities $\{e_i\}_{i=0}^{\infty}$ (or $G_1$ with $G_c$) and demanding that
all branches originating from the source die out eventually, we get the size of
the GC in clustered networks after joint edge+node removal:
(a) Solve for u such that $1 - (1 - r_n)(1 - r_e) + (1 - r_n)(1 - r_e) G_c(u) = u$.
(b) Calculate GC size as $S = 1 - r_n - (1 - r_n) G_0(u)$.
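The two steps after dilution can be sketched in the same way. The fragment below takes G0 and Gc as callables and applies the two formulas directly. It is evaluated only for a Poisson network with C = 0, where G0(x) = Gc(x) = exp(z1 (x − 1)); this choice, the function name `diluted_gc_size` and the value z1 = 2 are assumptions made for illustration, and a clustered network would need its own Gc.

```python
import math

def diluted_gc_size(g0, gc, r_n=0.0, r_e=0.0, tol=1e-12, max_iter=100_000):
    """Step (a): solve 1 - (1-r_n)(1-r_e) + (1-r_n)(1-r_e) Gc(u) = u by
    fixed-point iteration from u = 0.
    Step (b): return S = 1 - r_n - (1 - r_n) G0(u)."""
    keep = (1.0 - r_n) * (1.0 - r_e)
    u = 0.0
    for _ in range(max_iter):
        u_new = 1.0 - keep + keep * gc(u)
        converged = abs(u_new - u) < tol
        u = u_new
        if converged:
            break
    return 1.0 - r_n - (1.0 - r_n) * g0(u)

Z1 = 2.0  # assumed mean degree for the illustration

def poisson_g(x):
    """Generating function of a Poisson degree distribution (C = 0 case)."""
    return math.exp(Z1 * (x - 1.0))

s_intact = diluted_gc_size(poisson_g, poisson_g)
s_nodes = diluted_gc_size(poisson_g, poisson_g, r_n=0.2)
s_edges = diluted_gc_size(poisson_g, poisson_g, r_e=0.2)
s_joint = diluted_gc_size(poisson_g, poisson_g, r_n=0.5, r_e=0.5)
```

In this example removing 20% of the nodes reduces S more than removing 20% of the edges, because removed nodes are themselves excluded from the GC through the factor (1 − rn) in step (b).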
When C = 0, these equations coincide with those in [9]. Indeed, we feel
that our formalism, in contrast to that of [28], has the advantage of being a
natural generalization of previous theory [9, 24].
This theory for the size of the GC is evaluated against simulation and
real-world data in the next section, showing good agreement.

5 Simulations and Real Data

Clustered networks were generated by three different methods, all giving sim-
ilar results, each having its own advantages in terms of efficiency. In all the
methods, a degree sequence was generated by sampling from a desired distri-
bution. In two of the methods, a network was constructed according to the
generated degree sequence by using a fill algorithm [13]. In one case we then
selectively switched links [4] to reach a desired degree of clustering. In the
second case, we selectively reconnected links to nodes of distance two, which
led to an increase in the number of triangles. The third method was based
on distributing triangles in an empty network under the restrictions of the
degree sequence, and later filling in additional links using a fill algorithm [13].
In Fig. 5 we plot simulations against theory for the size of the GC for a
variety of parameters. Figure 5a shows the size of the GC vs. the mean degree
for different values of C, rn and re , the fraction of nodes and/or edges removed
respectively. In order to isolate the effect of clustering, we have also plotted
in Fig. 5b the size of the GC vs. C for a fixed mean degree.
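A dilution experiment of this kind can be sketched in a few lines of Python. The fragment is a simplified stand-in and not one of the three generation methods described above: it samples a sparse random graph with a fixed number of edges (approximately Poisson degrees and C ≈ 0), removes nodes with probability rn and edges with probability re, and measures the largest component with a union-find structure. The function name, the graph model and the parameters are assumptions of this sketch.

```python
import random

def largest_component_fraction(n, z1, r_n=0.0, r_e=0.0, seed=0):
    """Largest component size (as a fraction of the original n nodes) of a
    sparse random graph with mean degree z1 after random dilution."""
    rng = random.Random(seed)
    target_edges = int(n * z1 / 2)
    edges = set()
    while len(edges) < target_edges:
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b:
            edges.add((min(a, b), max(a, b)))

    alive = [rng.random() >= r_n for _ in range(n)]  # node removal
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in sorted(edges):
        # Keep an edge only if both ends survive and the edge itself survives.
        if alive[a] and alive[b] and rng.random() >= r_e:
            parent[find(a)] = find(b)

    sizes = {}
    for v in range(n):
        if alive[v]:
            root = find(v)
            sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0) / n

s_intact = largest_component_fraction(20000, 2.0)
s_diluted = largest_component_fraction(20000, 2.0, r_n=0.2)
```

The result is normalized by the original number of nodes so that it is directly comparable with S = 1 − rn − (1 − rn) G0(u).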
The most revealing plot is that of the case rn = re = 0 (top line in
Fig. 5b), where there is good agreement at the lower values of C (i.e. C < 0.3),
as well as for its higher values (at C ≈ 0.5), as opposed to a deviation at
intermediate values. This is explained by the fact that initially the $O(C^2)$
error in our approximation is rather small, at intermediate values it can grow
(but still $< C^2$), and towards the critical point it needs to converge back to
the exact result, producing again a very small deviation.

[Figure 5: panel a plots the GC size against z1 for (C, rn, re) = (0.1, 0, 0),
(0.1, 0.1, 0), (0.2, 0.2, 0) and (0.2, 0.2, 0.2); panel b plots the GC size
against C for (rn, re) = (0, 0), (0, 0.2), (0.2, 0) and (0.2, 0.2).]

Fig. 5. The size of the GC after dilution. a) As a function of the mean degree for
networks with Poisson degree distribution. A fraction rn and re of the nodes/edges
were removed randomly, for C = 0.1 and C = 0.2. b) As a function of C for networks
with Poisson degree distribution and z1 = 2. A fraction rn and re of the nodes/edges
were removed randomly. Black lines: our prediction for each case.

Fig. 6. The size of the GC after dilution in real-world networks. Grey: simulations with
bars at a width of one std, black: our predictions, broken line: the naive predictions
which do not consider C (i.e., C = 0). a) Node removal for the C. elegans neural
network. N = 453, C = 0.124, z1 = 8.9. b) Edge removal for the yeast protein-protein
interaction network. N = 2112, C = 0.055, z1 = 2.1. c) Joint node+edge removal for
the network of Zachary’s Karate club. N = 34, C = 0.255, z1 = 4.4.

Notice as well that after dilution the deviations become smaller still
(Fig. 5b). This might be explained by the sensitivity of the higher order
correlations to dilution: they require many edges and are therefore destroyed
quickly.
We can also take data from real-world networks and compare their behav-
ior under dilution with the prediction. When doing so, we often find, due to
the skewed degree distribution that characterizes many real-world networks
and their “denseness,” that the network stays almost as one connected unit
for a large range of dilution. It is thus not surprising that allowing for cluster-
ing does not improve the predictions. A distinct example is given in Fig. 6a,
where the size of the GC of the neural network of C. elegans [34] is plotted
vs. rn, the fraction of nodes removed. The size of the GC decreases almost
linearly with rn.
Nevertheless, Fig. 6b, c show two real-world networks, the yeast protein-
protein interaction network and Zachary’s Karate club [36], where considering
the value of C gives an advantage in predicting the size of the GC as a function
of dilution.

6 Discussion
Perhaps the most far-reaching result presented here is our criterion B for the
existence of the GC. This simple and intuitive criterion (Is the mean number
of edges going to the second layer larger than the one going to the first?) is a
natural generalization of the well-established Molloy–Reed condition (Is the
mean number of nodes at the second layer larger than the one at the first?),
which is often misused. It might be that the Molloy–Reed condition gained
much of its appeal due to the interpretation which identifies the existence
of a GC with the possibility of a random walker, originating from a source
node, to reach a large distance from the source (see as well the related and
interesting electrostatic approach [27]). Although this picture is grossly
oversimplified, we may conjecture that it holds in the general case. Indeed,
when inspecting
Fig. 2a, for example, we see that in order to have a positive drift away from the
source we need not have an increasing number of nodes at each layer—rather
an increasing number of edges between layers!
We did not study the topological effects of having z2 > z1 in clustered
networks. We expect still to find interesting behavior at z2 = z1 from quantities
such as the diameter of the network. This is indeed a subject for future work.

Acknowledgments

MT and YB are grateful for the support of the EC (project MATHfSS 15661)
and DIP (project Compositionality F 1.2). LS and YAR are grateful for
the support of the James S. McDonnell Foundation and the Israeli Science
Foundation.

References
1. Aiello, W., Chung, F., Lu, L.: A random graph model for massive graphs, Proc.
of the 32nd Annu. ACM Symposium on Theory of Computing (2000)
2. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du
Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK
User’s Guide, 3rd edition, SIAM, Philadelphia (1999)
3. Andersson, H.: Limit theorems for a random graph epidemic model, Ann. Appl.
Probab. 8, 1331–1349 (1998)
4. Artzy-Randrup, Y., Stone, L.: Generating uniformly distributed random networks,
Phys. Rev. E. 72 (5): 056708 (2005)
5. Barabasi, A.-L., Albert, R.: Emergence of scaling in random networks, Science
286, 509–512 (1999)
6. Bender, E. A., Canfield, E. R.: The asymptotic number of labeled graphs with
given degree sequences, J. Combin. Theory A 24, 296–307 (1978)
7. Berchenko, Y., Artzy-Randrup, Y., Teicher, M., Stone, L.: The emergence and
the size of the giant component in clustered random graphs with a given degree
distribution, submitted.
8. Bollobas, B.: Random Graphs, 2nd edition, Academic Press, New York (2001)
9. Callaway, D. S., Newman, M. E. J., Strogatz, S. H., Watts, D. J.: Network ro-
bustness and fragility: Percolation on random graphs, Phys. Rev. Lett. 85, 5468
(2000)
10. Durrett, R.: Random Graph Dynamics, Cambridge U. Press, Cambridge, UK
(2006)
11. Erdos, P., Renyi, A.: On the evolution of random graphs, Publications of the
Mathematical Institute of the Hungarian Academy of Sciences 5, 17–61 (1960)
12. Gomez-Gardenes, J., Moreno, Y., Arenas, A.: Paths to synchronization on complex
networks, Phys. Rev. Lett. 98, 034101 (2007)
13. Gotelli, N. J., Entsminger, G. L.: Swap and fill algorithms in null model analysis:
Rethinking the Knight’s Tour, Oecologia 129, 281–291 (2001)
14. Guillaume, J. L., Latapy, M.: A realistic model for complex networks, (2003) cond-
mat/0307095.
15. Jeong, H., Mason, S., Barabasi, A.-L., Oltvai, Z. N.: Lethality and centrality in
protein networks, Nature 411, 41–42 (2001)
16. Keeling, M. J.: The effects of local spatial structure on epidemiological invasion.
Proc. R. Soc. London B 266, 859–867 (1999)
17. McGraw, P. N., Menzinger, M.: Analysis of nonlinear synchronization dynamics
of oscillator networks by Laplacian spectral methods, Phys. Rev. E 75, 027104
(2007)
18. Molloy, M., Reed, B.: A critical point for random graphs with a given degree
sequence, Random Structures and Algorithms 6, 161–179 (1995)
19. Molloy, M., Reed, B.: The size of the giant component of a random graph with a
given degree sequence, Combin. Probab. Comput. 7, 295 (1998)
20. Montoya, J. M., Sole, R. V.: Small world patterns in food webs, J. Theor. Bio.,
214, 405–412 (2002)
21. Newman, M. E. J.: The structure and function of complex networks, SIAM Review
45, 167 (2003)
22. Newman, M. E. J.: Properties of highly clustered networks, Phys. Rev. E 68,
026121 (2003)
23. Newman, M. E. J.: Random graphs as models of networks. In: Bornholdt, S.,
Schuster, H. G. (eds.) Handbook of Graphs and Networks, Wiley-VCH, Berlin
(2003)
24. Newman, M. E. J., Strogatz, S. H., Watts, D. J.: Random graphs with arbitrary
degree distributions and their applications, Phys. Rev. E. 64, (2001)
25. Park, J., Newman, M. E. J.: Solution for the properties of a clustered network,
Phys. Rev. E 72, 026136 (2005)
26. Pastor-Satorras, R., Vazquez, A., Vespignani, A.: Dynamical and correlation
properties of the internet, Phys. Rev. Lett. 87, 258701 (2001)
27. Redner, S.: A Guide to First-Passage Processes, Cambridge University Press,
New York (2001)
28. Serrano, M. A., Boguna, M.: Percolation and epidemic thresholds in clustered
networks, Phys. Rev. Lett. 97, 088701 (2006)
29. Strauss, D.: On a general class of models for interaction, SIAM Review 28, 513–527
(1986)
30. Vazquez, A.: Growing networks with local rules: Preferential attachment, cluster-
ing hierarchy and degree correlations, cond-mat/0211528 (2002)
31. Volz, E.: Networks with tunable degree distribution and clustering, Phys. Rev. E
70, 056115 (2003)
32. Wasserman, S., Pattison, P.: Logit models and logistic regressions for social net-
works: I. An introduction to Markov random graphs and p*, Psychometrika 61,
401–426 (1996)
33. Watts, D. J., Strogatz, S. H.: Collective dynamics of small-world networks, Nature
393, 440–442 (1998)
34. White, J. G., Southgate, E., Thompson, J. N., Brenner, S.: Structure of the nervous
system of the nematode C. elegans, Phil. Trans. R. Soc. London 314, 1–340 (1986)
35. Wilf, H. S.: generatingfunctionology, 2nd edition, Academic Press, London (1994)
36. Zachary, W.: An information flow model for conflict and fission in small groups,
Journal of Anthropological Research 33, 452–473 (1977)
Technological Networks

Bivas Mitra

Department of Computer Science and Engineering, Indian Institute of Technology,
Kharagpur, 721302, India; bivasm@cse.iitkgp.ernet.in

1 Introduction
The study of networks in the form of mathematical graph theory is one of
the fundamental pillars of discrete mathematics. However, recent years have
witnessed a substantial new movement in network research. The focus of the
research is shifting away from the analysis of small graphs and the properties
of individual vertices or edges to consideration of statistical properties of large
scale networks. This new approach has been driven largely by the availability
of technological networks like the Internet [12], World Wide Web network [2],
etc. that allow us to gather and analyze data on a scale far larger than pre-
viously possible. At the same time, technological networks have evolved as a
socio-technological system, as the concepts of social systems that are based on
self-organization theory have become unified in technological networks [13].
In today’s society, we have simple and universal access to great amounts
of information and services. These information services are based upon the
infrastructure of the Internet and the World Wide Web. The Internet is the
system composed of ‘computers’ connected by cables or some other form of
physical connections. Over this physical network, it is possible to exchange
e-mails, transfer files, etc. On the other hand, the World Wide Web (com-
monly shortened to the Web) is a system of interlinked hypertext documents
accessed via the Internet where nodes represent web pages and links represent
hyperlinks between the pages. Peer-to-peer (P2P) networks [26] also have re-
cently become a popular medium through which huge amounts of data can
be shared. P2P file sharing systems, where files are searched and downloaded
among peers without the help of central servers, have emerged as a major
component of Internet traffic. An important advantage in P2P networks is
that all clients provide resources, including bandwidth, storage space, and
computing power. In this chapter, we discuss these technological networks in
detail. The review is organized as follows. Section 2 presents an introduction
to the Internet and different protocols related to it. This section also specifies
the socio-technological properties of the Internet, like scale invariance, the
small-world property, network resilience, etc. Section 3 describes the P2P net-
works, their categorization, and other related issues like search, stability, etc.
Section 4 concludes the chapter.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks,
Modeling and Simulation in Science, Engineering and Technology,
DOI: 10.1007/978-0-8176-4751-3_15,
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

2 The Internet
The Internet is a global network connecting millions of computers in a
decentralized form. Each Internet computer, called a host, is independent, and
its operators can choose any of the commercial Internet service providers (ISPs).
Many computer scientists regard the Internet as a “prime example of a
large-scale, highly engineered, yet highly complex system” (Fig. 1). The Internet is
extremely heterogeneous in nature; for instance, data transfer rates and physi-
cal characteristics of connections vary widely. In addition, the Internet evolves
and emerges based upon its large-scale self-organization property. Technically,
the Internet can be defined as the network of networks working with Transmis-
sion Control Protocol (TCP)/Internet Protocol (IP). This definition visualizes
the Internet as a purely technological system. However, this assumption over-
looks the fact that knowledgeable human activities make the Internet work.
Hence, more accurately, the Internet is a global socio-technological system
that is based on a technological structure and a set of protocols [13]. Some of
the important Internet-based services are e-mail, World Wide Web, remote
access, and Internet telephony.

Fig. 1. Internet as complex network.



2.1 Protocols Used in the Internet

Once we have more than one computer, it is theoretically possible to
communicate, provided that the computers ‘speak’ a common language. The Internet
uses a suite of communication protocols, of which the two most important are
the TCP and the IP [19]. These protocols have the following responsibilities:
First, the protocol defines the basic unit of data transfer, called the ‘data-
gram’, used throughout the Internet. Thus, it specifies the exact format of all
data as it passes across the Internet. Second, the TCP/IP software performs
the routing function, choosing a path over which data will be sent. Third, the
protocol includes a set of rules that embody the idea of reliable packet delivery
over unreliable connections.
In addition, these protocols introduce the IP addressing scheme which is
integral to the process of routing datagrams through the Internet to the
particular destination host. Each host on a TCP/IP network is assigned a unique
32-bit IP address that is divided into two main parts: the network number
and the host number (Fig. 2). The network number identifies a network and
must be assigned by the Internet Network Information Center (InterNIC) if
the network is to be part of the Internet. An ISP can obtain blocks of network
addresses from the InterNIC and can itself assign address space as necessary.
The host number identifies a host on a network and is assigned by the local
network administrator. To make them easier to remember, IP addresses are
normally expressed in decimal format as a ‘dotted decimal number’. The four
numbers in an IP address are called octets, because they each have eight bit
positions when viewed in binary form. Currently three classes of networks (A,
B, C) are commonly used. These classes may be segregated by the number of
octets used to identify the network, and also by the range of numbers used
by the first octet. If the value of the first octet is 127, it represents the local
host, regardless of what network it is really in.
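The addressing scheme can be illustrated with a short Python sketch that splits a dotted decimal address into its four octets and assigns the classful network and host parts. The first-octet ranges used below (1 to 126 for class A, 128 to 191 for class B, 192 to 223 for class C) are the conventional classful ranges and are not listed in the text above; the function name `describe_ipv4` is likewise an assumption of this example.

```python
import ipaddress

def describe_ipv4(dotted):
    """Split a dotted-decimal IPv4 address into classful network/host octets."""
    octets = list(ipaddress.IPv4Address(dotted).packed)  # four integers, 0..255
    first = octets[0]
    if first == 127:
        label, n_net = "loopback", 1   # local host, whatever the real network
    elif 1 <= first <= 126:
        label, n_net = "A", 1          # one network octet, three host octets
    elif 128 <= first <= 191:
        label, n_net = "B", 2          # two network octets, two host octets
    elif 192 <= first <= 223:
        label, n_net = "C", 3          # three network octets, one host octet
    else:
        label, n_net = "other", 0      # ranges not covered by classes A to C
    return {
        "class": label,
        "network": octets[:n_net],
        "host": octets[n_net:],
        "binary": ".".join(f"{o:08b}" for o in octets),
    }

info = describe_ipv4("192.168.1.7")
```

The `binary` field shows the eight bit positions of each octet, which together make up the 32-bit address.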

Fig. 2. IP addressing.

2.2 Scale Invariance and Small World Property of the Internet

The topology of the Internet is studied at two different levels. At the router
level, the nodes are the routers, and edges are the physical connections be-
tween them. At the interdomain (or autonomous system) level, each domain,
composed of hundreds of routers and computers, is represented by a single
node, and an edge is drawn between two domains if there is at least one route
that connects them. The topology of large-scale networks like the Internet is
characterized by the degree distribution $p_k$, which is defined as the fraction
of nodes in the network having degree k. In 1999, Faloutsos et al. [12] studied
the Internet at both levels, concluding that in each case the degree distribution
follows a power law (Fig. 3), i.e. $p_k \sim k^{-\gamma}$. The interdomain
topology of the Internet, captured at three different dates between 1997 and the
end of 2002, resulted in degree exponents between γ = 2.15 and γ = 2.2. The 1995
survey of the Internet topology at the router level, containing 3888 nodes, found
γ = 2.48. In 2000, Govindan and Tangmunarunkit [15] mapped the connectivity of
nearly 150,000 router interfaces and nearly 200,000 router adjacencies,
confirming the power-law scaling with γ = 2.3. It is widely believed that the
scale invariance property of the Internet is related to the self-organization
property of the participating nodes. The tendency of nodes to attach
preferentially when joining the network [42] stabilizes the degree distribution
as the size of the Internet becomes very large.

Fig. 3. The first data file holds link directions corresponding to the traceroute direc-
tions, while the second file is an undirected version of the first file. There are a total
of 192,244 nodes, 636,643 directed links, and 609,066 undirected links. The average
and maximum node degrees (undirected) are 6.34 and 1071 respectively, and the node
degree distribution is plotted.
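As a rough illustration of how such exponents are read off from data, the following Python sketch computes the degree distribution of an undirected edge list and estimates γ from the slope of a straight-line fit in log-log coordinates. This is a deliberately crude estimator chosen for brevity and is not claimed to be the procedure used in [12] or [15]; the function names are assumptions. Nodes of degree zero do not appear in an edge list and are therefore ignored.

```python
from collections import Counter
import numpy as np

def degree_distribution(edges):
    """Return {k: fraction of nodes with degree k} for an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n_nodes = len(degree)
    counts = Counter(degree.values())
    return {k: c / n_nodes for k, c in sorted(counts.items())}

def loglog_exponent(pk):
    """Estimate gamma in p_k ~ k**(-gamma) by least squares on log-log data."""
    points = [(k, p) for k, p in pk.items() if k > 0 and p > 0]
    ks, ps = zip(*points)
    slope, _intercept = np.polyfit(np.log(ks), np.log(ps), 1)
    return -slope

# A star with four leaves: one node of degree 4 and four nodes of degree 1.
star_pk = degree_distribution([(0, 1), (0, 2), (0, 3), (0, 4)])

# An exact power law is recovered by the fit.
gamma_hat = loglog_exponent({k: k ** -2.3 for k in range(1, 50)})
```

On real, noisy data a fit of this kind is sensitive to the sparsely populated tail, so it should be read as a first estimate only.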
Internet as small world. An accurate characterization of the emergent
topological properties of the Internet and a better understanding of the un-
derlying processes that yield these characteristics are crucial for proper eval-
uation of network protocols and systems. In that vein, recent works [20, 5]
have shown the prevalence of small-world phenomena [24, 44] in the Internet.
Small-world graphs exhibit a high degree of clustering, yet have typically short
path lengths between arbitrary vertices. Yook [47] and Pastor-Satorras [32]
have studied the Internet at the domain/autonomous system level between
1997 and 1999 and found that its clustering coefficient ranges between 0.18
and 0.3, compared to the clustering coefficient 0.001 for random networks of
similar parameters. On the other hand, the average path length of the Internet
ranges between 3.70 and 3.77 and at the router level it is around 9, indicating
its small-world character. Small-world behavior in the Internet maps to two
possible causes: first, the high variability of node degree distributions and,
second, the preference of vertices to have local connections [20]. With the
high variability of the node degree distribution, it is likely that two
interconnected vertices, say u and v, will have the same neighbor, say w,
specifically when w is a node with extremely large degree. This means that u, v,
and w form a triangle. Such a pattern contributes directly to the computation of
the clustering coefficients of u, v, and w (i.e. Cu, Cv, and Cw) and results in a
larger overall average clustering coefficient C of the network. Thus, C grows
with the variability of vertex degree. Also, notice that with highly variable
vertex degrees, the average distance between two vertices (L) is short. This
happens because the shortest path is usually through those extremely popular
vertices. That is, highly popular vertices serve as good navigators through the
graph. On the other hand, preference for the local connectivity also results
in small-world behavior. The reason behind this is that, with a non-negligible
probability of a local connection, if a node u is connected to v and w, then
it is likely that v and w are also close to each other. As a result, there is a
non-negligible probability that a triangle will form among these vertices, re-
sulting in a higher clustering coefficient. Meanwhile, since there are still many
long-range connections, it is easy to find a short path between two randomly
chosen nodes.
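The two quantities behind the small-world property can be computed on a toy graph as follows. The sketch uses the usual definitions: the local clustering coefficient of a node is the fraction of pairs of its neighbors that are themselves connected, and the average path length is the mean shortest-path distance over connected pairs, found by breadth-first search. The example graph (a triangle u, v, w with one extra node x attached to u) and the function names are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of neighbor pairs of v that are linked (0 if degree < 2)."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs (BFS)."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Triangle u-v-w plus a pendant node x attached to u.
graph = {"u": {"v", "w", "x"}, "v": {"u", "w"}, "w": {"u", "v"}, "x": {"u"}}
C = average_clustering(graph)
L = average_path_length(graph)
```

Here Cv = Cw = 1 because their two neighbors are linked, while Cu = 1/3 because only one of the three pairs of neighbors of u forms a triangle.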
In addition, researchers from Stanford University [37] found that as net-
works grow very large, they become very efficient in the number of steps
a data packet takes to get from one node to another node. The number of
steps grows logarithmically with the size of the network, which means that
for 10,000 nodes we need five steps, but for 100 million the number grows only
to 6.5. They also exhibit a clustering property, i.e. the relationships among
nodes are not randomly distributed, but are grouped. Short path links mean
that there are some very short paths sprinkled throughout the network that
may directly link one group to another. This conforms to Watts and Strogatz’s
model [44], where a low dimensional regular lattice is transformed to a small
world network.

2.3 Fault Tolerance of the Internet

The Internet and other communication networks display a high degree of ro-
bustness: while key components regularly malfunction, local failures rarely
lead to the loss of the global information-carrying ability of the network [3].
It has been observed that network topology plays an important role in the
robustness of the Internet. Consider an arbitrary connected graph of N nodes,
and assume that a fraction f of the nodes have been removed. This leads
to important questions, like: What is the probability that the resulting sub-
graph is connected, and how does it depend on the removal probability f?
For a broad class of graphs there exists a threshold probability fc such that
if f < fc the resulting subgraph is connected, but if f > fc the subgraph
becomes disconnected (Fig. 4). Here fc is termed the percolation threshold.
In the following discussion, we will call a network fault tolerant (or robust) if
it contains a giant component comprising most of the nodes even after a
fraction of its nodes are removed.

2.3.1 Stability Criteria

The topology of the Internet and the failure probability of nodes can be char-
acterized by probability distributions $p_k$ and $f_k$ respectively. Here $p_k$
signifies the degree distribution, which is the probability that a randomly
chosen node has degree k. Similarly, $f_k$ is the probability that a vertex of
degree k will be removed from the network. Nodes leave the Internet due to their
faulty nature [8] or due to an attack mounted on the important nodes [9]. Based
upon these basic parameters, an analytical framework has been derived to
examine the stability of the Internet (or any kind of network) where the
vertices undergo some dynamics [28]. The analytical framework can be expressed
with the help of the following equation:

$$\sum_{k=0}^{\infty} k p_k \bigl( k(1 - f_k) - (1 - f_k) - 1 \bigr) = 0. \qquad (1)$$

Equation (1) states the critical condition for the stability of the Internet
(characterized by $p_k$) undergoing any type of failure or attack (characterized
by $f_k$).

Fig. 4. Illustration of the effects of node removal on an initially connected network.
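As a worked illustration of Eq. (1), consider uniform random failure, fk = f for every k. The left-hand side then reduces to (1 − f)(⟨k²⟩ − ⟨k⟩) − ⟨k⟩, so the critical fraction is fc = 1 − ⟨k⟩ / (⟨k²⟩ − ⟨k⟩). The Python sketch below evaluates both expressions for a Poisson degree distribution, for which ⟨k²⟩ − ⟨k⟩ = ⟨k⟩² and hence fc = 1 − 1/⟨k⟩. The function names, the Poisson test case and its truncation at degree 100 are assumptions of this example.

```python
import math

def stability_lhs(pk, fk):
    """Left-hand side of Eq. (1): sum_k k p_k (k(1 - f_k) - (1 - f_k) - 1)."""
    return sum(k * p * (k * (1 - fk(k)) - (1 - fk(k)) - 1) for k, p in pk.items())

def random_failure_threshold(pk):
    """Solve Eq. (1) for f when f_k = f: f_c = 1 - <k> / (<k^2> - <k>)."""
    mean_k = sum(k * p for k, p in pk.items())
    mean_k2 = sum(k * k * p for k, p in pk.items())
    return 1.0 - mean_k / (mean_k2 - mean_k)

def poisson_pk(z, k_max=100):
    """Poisson degree distribution truncated at k_max (assumed cutoff)."""
    pk, term = {}, math.exp(-z)
    for k in range(k_max + 1):
        pk[k] = term
        term *= z / (k + 1)
    return pk

pk = poisson_pk(4.0)
f_c = random_failure_threshold(pk)          # close to 1 - 1/4 = 0.75
residual = stability_lhs(pk, lambda k: f_c)  # Eq. (1) is satisfied at f_c
```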
Stability analysis of networks under different node disturbance
schemes. The existing empirical and theoretical results indicate that complex
networks can be divided into two major classes based on their degree distri-
bution $p_k$. In the first class of networks, $p_k$ peaks at an average degree
$\langle k \rangle$ and decays exponentially for large k. The most investigated
examples of such exponential networks are the random graph model of Erdos and
Renyi [11] and the small-world model of Watts and Strogatz [44], both leading to
a fairly homogeneous network. In contrast, results on the Internet, World Wide
Web, and other large networks indicate that many systems belong to a class of
inhomogeneous networks, referred to as scale-free networks, for which $p_k$
decays as a power law, i.e. $p_k \sim k^{-\gamma}$ [8]. While nodes with a very
large number of connections ($k \gg \langle k \rangle$) are practically absent in
exponential networks, highly connected nodes are statistically significant in
scale-free networks. In this review, we concentrate on the scale-free network,
as this kind of network is widely used to model the Internet.
In this section, we consider two types of node removal schemes. The first
scheme studies the removal of randomly selected nodes. In this case, the prob-
ability of removal of any randomly chosen node having degree k after this kind
of failure is fk = f (independent of k) [8]. In the second technique, the most
highly connected nodes are removed at each step. This second scheme emulates an
intentional attack on the network [9]. Formally, fk = 0 when k ≤ kmax and
fk = 1 when k > kmax , i.e. all the nodes in the network having degree more
than kmax are removed.
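The two removal schemes can be written as functions fk and inserted into the left-hand side of Eq. (1). The sketch below does this for a power-law degree distribution truncated at a cutoff. The exponent γ = 2.5, the cutoff of 1000, the function names, and the reading of a positive left-hand side as "a giant component survives" (by analogy with the f = 0 case) are assumptions of this illustration rather than statements from the text.

```python
def power_law_pk(gamma, k_min=1, k_cut=1000):
    """Normalized p_k proportional to k**(-gamma) on k_min..k_cut (assumed cutoff)."""
    weights = {k: k ** -gamma for k in range(k_min, k_cut + 1)}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}

def random_failure(f):
    """f_k = f, independent of k."""
    return lambda k: f

def targeted_attack(k_max):
    """f_k = 0 for k <= k_max and f_k = 1 for k > k_max."""
    return lambda k: 0.0 if k <= k_max else 1.0

def stability_lhs(pk, fk):
    """Left-hand side of Eq. (1): sum_k k p_k (k(1 - f_k) - (1 - f_k) - 1)."""
    return sum(k * p * (k * (1 - fk(k)) - (1 - fk(k)) - 1) for k, p in pk.items())

def removed_fraction(pk, fk):
    """Overall fraction of nodes removed, sum_k p_k f_k."""
    return sum(p * fk(k) for k, p in pk.items())

pk = power_law_pk(2.5)
lhs_failure = stability_lhs(pk, random_failure(0.5))  # half the nodes removed at random
lhs_attack = stability_lhs(pk, targeted_attack(2))    # every node with k > 2 removed
f_attack = removed_fraction(pk, targeted_attack(2))
```

In this example the left-hand side stays positive after removing half of the nodes at random, but becomes negative when the attack removes all nodes of degree above 2, even though that attack removes a smaller fraction of the nodes.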
Next we discuss the stability of scale-free networks in the face of failure
and attack. The stability is measured by the change in the size of the giant
component S and the average path length l after removal of a fraction of
nodes. The maximum reduction in the size of the giant component indicates
the breakdown of the network.
Stability against random failure. We start by investigating the stability
of scale-free network to random removal of nodes, looking at the changes in
the relative size of the giant component S and the average path length l [8]. In
a scale-free network, the size of the giant component S decreases slowly from
S = 1 as the fraction of nodes removed f increases (see Fig. 5). In random
failure, most of the removed nodes in the network have low degree; hence,
they have little impact upon the size of the giant component S. Eventually,
Fig. 5. The size of the giant component S and average path length l of an initially
connected network when a fraction f of the nodes are removed. Scale-free network
generated by the scale-free model with N = 10,000 and ⟨k⟩ = 4. Squares indicate
random node removal, while circles correspond to preferential removal of the most
connected nodes [3].

S reaches 0 at some higher f, which is denoted as the percolation threshold
fc. The analytical calculations indicate that the percolation threshold fc → 1
as the size of the network increases to infinity. In simple terms, scale-free net-
works display an exceptional robustness against random node failures. On the
other hand, the average path length l increases with the fraction of removed
nodes f , as paths are disrupted in the network, and eventually l peaks at per-
colation threshold fc . In random failure, the average path length l increases
slowly with f ; hence, its peak becomes less prominent. After the network
breaks into isolated components, l decreases as well since in this regime the
size of the largest component gradually decreases.
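
The limit fc → 1 can be made concrete with the standard percolation criterion for random node removal on uncorrelated random networks. It is quoted here as background and is not taken from the derivation in [8]. With κ evaluated on the intact network:

```latex
f_c = 1 - \frac{1}{\kappa - 1}, \qquad \kappa = \frac{\langle k^2 \rangle}{\langle k \rangle}
```

For pk ∼ k^(−γ) with 2 < γ ≤ 3, the second moment ⟨k²⟩ grows without bound as the network size increases while ⟨k⟩ stays finite, so κ → ∞ and fc → 1.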
Stability against intentional attack. In the case of intentional attack, the
nodes with the highest degrees are targeted for removal. Naturally, in this
kind of attack, the network breaks down into components faster than in the
case of random failure. The stability of the scale-free networks mainly depends
upon a few highly connected nodes. Removal of these key nodes during the
intentional attack severely affects the stability of the scale-free networks [9].
This phenomenon is also evident in the behavior of the average
path length l, which increases rapidly and reaches its peak at percolation
threshold fc . After the network breaks into isolated components, l decreases
quickly since in this regime the size of the largest component decreases.
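
The two removal schemes can be compared with a small simulation. The following is a minimal sketch, not the procedure used in [3] or [8]: the graph is grown by a simple preferential-attachment rule (an assumption standing in for "the scale-free model"), the attack ranks nodes once by their initial degree instead of recomputing degrees after every removal, and S is normalized by the original number of nodes. All function names are illustrative.

```python
import random

def preferential_attachment_graph(n, m, rng):
    """Grow a graph in which each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    adj = {i: set() for i in range(n)}
    pool = []  # each node appears once per incident edge
    for u in range(m + 1):            # seed: complete graph on m + 1 nodes
        for v in range(u + 1, m + 1):
            adj[u].add(v)
            adj[v].add(u)
            pool += [u, v]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        for t in chosen:
            adj[new].add(t)
            adj[t].add(new)
            pool += [new, t]
    return adj

def giant_component_fraction(adj, removed):
    """Largest connected component among surviving nodes, divided by the
    original number of nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best / len(adj)

def random_failure(adj, f, rng):
    """Remove a fraction f of the nodes uniformly at random (fk = f)."""
    return set(rng.sample(sorted(adj), int(f * len(adj))))

def targeted_attack(adj, f):
    """Remove the fraction f of nodes with the highest initial degree."""
    ranked = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    return set(ranked[:int(f * len(adj))])

rng = random.Random(0)
graph = preferential_attachment_graph(2000, 2, rng)   # mean degree close to 4
for f in (0.05, 0.1, 0.2):
    s_rand = giant_component_fraction(graph, random_failure(graph, f, rng))
    s_att = giant_component_fraction(graph, targeted_attack(graph, f))
    print(f, round(s_rand, 3), round(s_att, 3))
```

Sweeping f over a finer grid and plotting both columns gives curves of the kind summarized in Fig. 5.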

2.4 Spreading of Viruses in Internet

Computer viruses and worms are posing serious challenges to the network
research community. In computer science jargon, ‘virus’ refers to malicious
software that spreads from computer to computer and can halt or hin-
der operations at numerous businesses and other organizations, disrupt
cash-dispensing machines, delay airline flights, and even affect emergency call
centers [41, 23, 4]. The structure of contact networks affects the rate and
extent of spreading of computer viruses, just as it does for human diseases;
understanding this structure is a key element in the control of infection. Thus,
recent works in epidemiological models have emphasized the effects of the
virus spread in scale-free networks, in which the degree distribution follows a
power law [16].
There are various epidemic models available in the literature which can be
used to formalize the spread of viruses in the network [33]. In these models,
the susceptible (S) individuals do not have the disease but can contract
it if they come in contact with infected (I) individuals.
The infected individuals may gain permanent or temporary immunity
after some time period and become recovered (R). The R individuals do not
take part in disease transmission. Various epidemic dynamics like SI, SIS,
SIR, SIRS exist in the literature [35, 36]. In SI dynamics, the number of
infected individuals increases until all the S individuals become infected.
If the I individuals in SI dynamics become susceptible again after some time
period, the SIS dynamics results [34]. Computer viruses mostly fall into this
category; they can be ‘cured’ by antivirus software, but without a permanent
virus-checking program the computer has no way to fend off subsequent attacks
by the same virus. For SIR dynamics, let us assume that any susceptible
individual has a uniform probability β per unit time of being infected by any
infected one, and that infected individuals recover and become immune at a
constant rate γ. Then s, i, and r, the fractions of nodes in the states S,
I, and R respectively (with s + i + r = 1), are governed by the following
differential equations:
\[
\frac{ds}{dt} = -\beta i s, \qquad \frac{di}{dt} = \beta i s - \gamma i. \tag{2}
\]
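
Eq. (2) has no closed-form solution in general, but it is easy to integrate numerically. The following is a minimal sketch using a forward Euler step; the function name, step size, and parameter values are illustrative assumptions, not values from the text.

```python
def integrate_eq2(beta, gamma, i0, dt=0.01, steps=5000):
    """Forward-Euler integration of Eq. (2).

    Returns a list of (t, s, i) tuples; the recovered fraction is r = 1 - s - i.
    """
    s, i = 1.0 - i0, i0
    history = [(0.0, s, i)]
    for n in range(1, steps + 1):
        ds = -beta * i * s               # ds/dt = -beta * i * s
        di = beta * i * s - gamma * i    # di/dt = beta * i * s - gamma * i
        s += dt * ds
        i += dt * di
        history.append((n * dt, s, i))
    return history

# Illustrative run: infection rate above the recovery rate.
trajectory = integrate_eq2(beta=0.5, gamma=0.1, i0=0.01)
peak_time, _, peak_i = max(trajectory, key=lambda row: row[2])
print("peak infected fraction", round(peak_i, 3), "at t =", round(peak_time, 2))
```

With β > γ the infected fraction first grows and then decays as susceptibles are depleted; with β < γ it decays from the start.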
The classical SIS model can be applied to a networked system in which the
infection probability is not constant but varies between the nodes
of the network depending on their degree. The quantity βi represents the
average rate at which a susceptible individual becomes infected by its
neighbors. If λ is the rate of infection via contact with a single infective
node and θ(λ) is the probability that a neighbor of a susceptible node of
degree k is infective, then the average rate of infection of a degree-k
susceptible node becomes βi = kλθ(λ). An implicit equation for θ(λ) is
obtained in [35]:

\[
\frac{\lambda}{z} \sum_{k} \frac{k^{2} p_k}{1 + k \lambda \theta(\lambda)} = 1, \tag{3}
\]

where z is the average degree and pk is the degree distribution. For particular
choices of pk , this equation can be solved for θ(λ) either exactly or approxi-
mately. For instance, for a power-law degree distribution, Pastor-Satorras and
Vespignani [34] solve it by making an integral approximation, and hence show
that there is no non-zero epidemic threshold for the SIS model in the power-
law case, i.e. the disease will always persist, regardless of the value of the
infection rate parameter. They have also generalized the solution to a num-
ber of other cases, including other degree distributions, finite-sized networks,
and models that include vaccination of some fraction of individuals [35, 36].
In the latter case, they tackle both random vaccination and vaccination tar-
geted at the vertices with highest degree. The results have shown that the
propagation of the disease turns out to be relatively robust against random
vaccination, at least in networks with right-skewed degree distributions, but
highly susceptible to vaccination of the highest-degree individuals.
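
Eq. (3) can also be solved numerically for a given degree distribution. The sketch below uses bisection, relying on the fact that the left-hand side of Eq. (3) decreases monotonically in θ; a nonzero root exists only when λ exceeds z/⟨k²⟩. The function names and the truncated power law pk ∝ k^(−2.5) are illustrative assumptions, not the integral approximation used in [34].

```python
def solve_theta(lam, pk, tol=1e-12):
    """Solve Eq. (3) for theta by bisection.

    pk maps each degree k to its probability. Returns 0.0 when only the
    trivial (disease-free) solution exists.
    """
    z = sum(k * p for k, p in pk.items())

    def lhs(theta):
        return (lam / z) * sum(k * k * p / (1.0 + k * lam * theta)
                               for k, p in pk.items())

    if lhs(0.0) <= 1.0:          # lam is at or below z / <k^2>
        return 0.0
    lo, hi = 0.0, 1.0            # lhs(0) > 1 and lhs(1) < 1 bracket the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def epidemic_threshold(pk):
    """Smallest lam with a nonzero solution of Eq. (3): z / <k^2>."""
    z = sum(k * p for k, p in pk.items())
    k2 = sum(k * k * p for k, p in pk.items())
    return z / k2

def truncated_power_law(exponent, kmax):
    """Normalized pk proportional to k**(-exponent) for k = 1..kmax."""
    weights = {k: k ** (-exponent) for k in range(1, kmax + 1)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

for kmax in (10, 100, 1000, 10000):
    print(kmax, epidemic_threshold(truncated_power_law(2.5, kmax)))
```

The printed thresholds shrink as the degree cutoff grows, which is the finite-size counterpart of the vanishing epidemic threshold described above.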

3 Peer-to-Peer Networks

In client-server architecture, each computer or process in the network is either
a client or a server. A large number of clients request and receive the service
from the servers, and a fixed set of servers provides the service to those clients.
Peer-to-peer (P2P) networks (shown in Fig. 6) provide a different paradigm
of computer networks, where each workstation has equivalent capabilities and
responsibilities [26, 6]. P2P networks distribute the responsibility among the
participants in a network and pool the bandwidth of those participants
rather than relying on conventional centralized resources. An important
advantage of this kind of network is that all clients provide resources, including
bandwidth, storage space, and computing power. Thus, as nodes arrive and
demand on the system increases, the total capacity of the system also increases
simultaneously. This is not true for a traditional client-server architecture, in
which adding more clients could mean slower data transfer for all users. In
addition, popular items (like songs, movies) in the network become replicated
over multiple peers due to repeated exchange of items, which increases the
robustness of the shared items in the face of frequent joining and leaving of
peers (termed as peer churn).

Fig. 6. Client-server model and P2P model.


Overlay networks. Peers in the P2P networks are typically connected via
ad hoc overlay connections. If a participating peer knows the location of an-
other peer in the network, then there is a link from the former node to the
latter in the overlay network. Based on how the nodes in the overlay network
are linked to each other, the current P2P architecture can be classified into
three types [43]: centralized, decentralized and structured, and decentralized
but unstructured.
1. Centralized: All object index items are kept in a centralized server in
the form of object key, node address etc. Each arriving node needs to
actively notify this server about the objects it holds. Therefore, the
querying node only needs to consult the central server to obtain the address
of a peer holding the searched object. To download the object, the querying
node directly establishes a connection with that peer. This type of P2P
architecture is very simple and easy to deploy, but the central server is a
single point of failure, although several parallel servers can be used to
mitigate this. An example of this network type is Napster [31].
2. Decentralized and structured: A structured P2P network employs a
globally consistent protocol to ensure that any node can efficiently route
a search query to a peer that has the desired file. Most of the struc-
tured P2P networks are based on the distributed hash table (DHT), in
which a variant of consistent hashing is used to assign ownership of each
file to a particular peer [27]. A DHT is a hash table whose table entries
are distributed among different peers located in arbitrary locations. Each
data item is hashed to a unique numeric key. Each node is also hashed
to a unique ID in the same key space. Each node is responsible for a
certain number of keys; that is, the responsible node stores the key and
a pointer to the data item with that key. Keys are mapped to their re-
sponsible nodes. The searching and routing algorithms support two basic
operations: lookup(key) and put(key); lookup(k) is used to find the loca-
tion of the node that is responsible for the key k, and put(k) is used to
store a data item (or a pointer to the data item) with the key k in the
node responsible for k. Searches in structured systems follow
well-defined neighbor links; hence, these systems provide
guarantees on finding existing data within a bounded number of overlay hops.
However, the strict network structure imposes high overhead for handling the
dynamics of P2P networks caused by peer churn. Some well-known DHT based
structured P2P networks are Chord, Pastry, Tapestry, CAN, and Tulip.
3. Decentralized and unstructured: An unstructured and decentralized
P2P network is formed when the overlay links are established arbitrarily.
As no special network structure needs to be maintained, unstructured P2P
systems are extremely resilient to peer churn. Searching in unstructured
networks is often based on flooding or its variation because there is no
control over data storage [26]. The main disadvantage with such networks
is that the queries may not always be resolved. Popular content is likely to
be available at several peers, but if a peer is looking for rare data shared
by only a few other peers, then it is highly unlikely that the search will
be successful [10]. Since there is no correlation between a peer and the
content managed by the peer, there is no guarantee that flooding will find
a peer that has the desired data. However, due to the high dynamicity of
peers, robustness is given the topmost priority. Most of the popular P2P
networks such as Gnutella and FastTrack are unstructured in nature [14].
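
The key-to-node mapping of the structured (DHT) case can be illustrated with a toy example. This is a minimal single-process sketch, assuming a 16-bit identifier ring and a "first node clockwise from the key" ownership rule; it does not reproduce the routing tables of Chord, Pastry, or any other system named above. The class name, the ring size, and the two-argument form of put are illustrative assumptions.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 16  # assumed size of the shared key space

def ring_hash(name):
    """Hash a node name or data key into the shared identifier space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)

class ToyDHT:
    def __init__(self, node_names):
        # Nodes are hashed into the same key space as the data items.
        self.ring = sorted((ring_hash(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.store = {n: {} for n in node_names}

    def lookup(self, key):
        """Return the node responsible for key: the first node whose ID is
        at or after the key's hash, wrapping around the ring."""
        pos = bisect_left(self.ids, ring_hash(key)) % len(self.ring)
        return self.ring[pos][1]

    def put(self, key, pointer):
        """Store a pointer to the data item at the responsible node."""
        node = self.lookup(key)
        self.store[node][key] = pointer
        return node

    def get(self, key):
        return self.store[self.lookup(key)].get(key)

dht = ToyDHT(["peer-%d" % i for i in range(8)])
owner = dht.put("song.mp3", "10.0.0.7")
print(owner, dht.get("song.mp3"))
```

A real DHT resolves lookup through a bounded number of overlay hops between peers; here the whole ring is visible to one process, so only the ownership rule is shown.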

In addition, superpeer topologies have emerged as the most influential
unstructured networks. Here some peers, called dominating nodes or
superpeers, serve the search requests of other regular peers [39, 46]. Most of
the commercial systems, such as KaZaA and Skype, have adopted superpeers in their
design. In these systems, superpeer nodes with higher bandwidth and connec-
tivity connect to each other, forming the upper level in the network hierarchy.
Each superpeer node provides service to a set of regular peers which form the
lower level of the network hierarchy.

3.1 Peer-to-Peer Search Schemes

Searching is one of the most important services and utilities provided by the
P2P networks where users try to locate the desired object in the network.
Existing P2P systems support the simple object lookup by key or identi-
fier. Some existing P2P systems can handle more complex keyword queries,
which find documents containing keywords in queries. Searching techniques
are primarily forwarding based. Starting with the requesting node, a query is
forwarded or routed until the node which has the desired object is reached. To
forward query messages, each node must keep information about some other
nodes called neighbors. The information of these neighbors constitutes the
routing table of a node. The desired features of searching algorithms in P2P
systems include high-quality query results, minimal query packet overhead,
high routing efficiency, load balance, resilience to node failures, and support
of complex queries. The quality of query results is application dependent.
Generally, it is measured by the number of results and relevance. The query
packet overhead signifies the number of packets generated in the network to
satisfy a specific search query. The routing efficiency is generally measured
by the number of overlay hops per query. Different searching techniques make
different trade-offs between these desired characteristics.
Searching in structured P2P networks follows the well-defined neighboring
links to locate some specific object. This provides guarantees on finding exist-
ing data and bounds data lookup efficiency in terms of the number of overlay
hops. But it shows poor performance in the dynamic condition where peers
join and leave the network quite frequently. Searching in the unstructured
P2P systems is more challenging, as the overlay network does not follow any
structure dependent on the data storage. Searching techniques in unstructured
networks can be classified as either flooding based or random walk
based. Broadly, flooding-based techniques are the fastest but the least
efficient in terms of overhead, whereas random-walk-based schemes have low
overhead but are the slowest. The two techniques therefore lie at opposite
ends of the efficiency/speed spectrum. The following sections describe flooding
techniques and their variations, and then the random-walk-based techniques.

3.1.1 Flooding-Based Search Techniques

Searching in unstructured P2P networks is often based on flooding or its
variations because there is no control over the location of objects. In these
techniques, query packets are propagated to all neighbors within a certain radius
until the desired object is found. However, the blind flooding mechanism generates
a large number of redundant query packets in the network, which wastes
valuable bandwidth and makes unstructured P2P systems far from
scalable. Controlled flooding-based schemes such as iterative
deepening/expanding ring, informed search, dynamic query-based flooding,
LightFlood, and Hurricane flooding try to improve bandwidth utilization.
Iterative deepening. Yang and Garcia-Molina [45] borrowed the idea of
iterative deepening from artificial intelligence. As in ordinary flooding,
no node has information about the location of the desired data.
The querying node periodically issues a sequence of breadth-first searches
(BFSs) with increasing depth limits. The query terminates when the query
is satisfied or when the maximum depth limit has been reached.
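
TTL-limited flooding and iterative deepening can be sketched on an adjacency-list overlay as follows. This is a simplified illustration, not the protocol of [45]: each round restarts the flood from the querying node, duplicate queries are dropped on arrival but still counted as messages, and the function names and depth schedule are assumptions.

```python
from collections import deque

def flood(graph, source, has_object, ttl):
    """Flood a query up to ttl hops from source.

    Returns (set of nodes holding the object, number of messages sent).
    """
    seen = {source}
    found = {source} if has_object(source) else set()
    messages = 0
    frontier = deque([(source, None, ttl)])   # (node, sender, remaining hops)
    while frontier:
        node, sender, hops_left = frontier.popleft()
        if hops_left == 0:
            continue
        for nb in graph[node]:
            if nb == sender:          # do not send the query back to the sender
                continue
            messages += 1
            if nb in seen:            # redundant copy: delivered, then dropped
                continue
            seen.add(nb)
            if has_object(nb):
                found.add(nb)
            frontier.append((nb, node, hops_left - 1))
    return found, messages

def iterative_deepening(graph, source, has_object, depth_limits):
    """Issue floods with increasing depth limits until the query is satisfied.

    Returns (holders found, depth at which they were found, total messages).
    """
    total = 0
    for depth in depth_limits:
        found, messages = flood(graph, source, has_object, depth)
        total += messages
        if found:
            return found, depth, total
    return set(), None, total

# Small overlay: a path 0-1-2-3-4 with the object stored at node 3.
overlay = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(iterative_deepening(overlay, 0, lambda n: n == 3, depth_limits=(1, 2, 3)))
```

The total message count includes the earlier, unsuccessful rounds, which is the price iterative deepening pays for not flooding to the maximum depth at once.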
LightFlood. The LightFlood technique [17] (also called the expanding ring)
not only retains the merits of pure flooding, but also eliminates most of the
redundant messages caused by pure flooding. Thus, LightFlood greatly en-
hances the scalability of Gnutella-style P2P systems. The design of LightFlood
is motivated by two observations: first, the majority of redundant messages
are generated within high hops; second, the network coverage growth rates in
low hops are much higher than those within high hops. Thus, the LightFlood
scheme is divided into two stages. In the first stage, the messages are allowed
on their low hops to be flooded by pure flooding (by giving a small time to
live (TTL) number). Those peers reached on the last hop of pure flooding
(TTL = 0) become seeds, from which the flooding is initiated for the second
stage. The initial pure flooding ensures that a considerable number of seeds
are dispersed across the overlay with a small number of redundant messages.
The next stage of flooding ensures that most redundant messages caused by
pure flooding within the rest of its hops are eliminated. The integration of
these two stages retains the advantages of pure flooding: low latency, high
coverage, and high reliability.
Hurricane flooding. In Hurricane flooding [21], the source of a search
cautiously but exponentially expands its search horizon in a spiral pattern.
Like the expanding ring algorithm, Hurricane flooding increases the scope of
flooding after each round. The source peer divides its neighbors into several
groups of approximately the same size. The source sends query packets to its
neighbors in the first group, starting the first round of flooding. These neigh-
bors faithfully broadcast the query packets (but not back to the source). The
source also sets a limit on the scope of these broadcasting query packets, e.g.,
by using a TTL value. The first round of flooding may have a very narrow
scope with small TTL. This round of flooding may not return the desired re-
sult. Then the source sends query packets to its neighbors in the second group,
with a larger limit on the scope of the flooding. This process repeats until the
source obtains the desired result. It has been shown that Hurricane flooding
brings the search cost arbitrarily close to the lower bound for any search
algorithm and bounds the search latency by a logarithmic function of
the location of the target.

3.1.2 Random-Walk-Based Search Techniques

Random walk is a popular alternative to flooding for locating resources in P2P
networks under scarcity of network bandwidth. In the standard random walk
algorithm [25], the querying node forwards the query message to one randomly
selected neighbor with some specific TTL value T . When an intermediate
node receives the random walker, it checks to see if it has the resource. If the
intermediate node does not have the resource, it checks the TTL field, and if
T > 0, it decrements T by 1 and forwards the query to a randomly chosen
neighbor; else if T = 0 the query message is dropped. On the other hand,
if the intermediate node has the resource, the query is not forwarded and a
reply is sent back to the querying node. This random walk technique greatly
reduces the message overhead but causes a longer searching delay.
In the k-walker random walk algorithm [26], k walkers are deployed
by the querying node to search for the desired item. That is, the querying node
forwards k copies of the query message to k randomly selected neighbors.
Each query message takes its own random walk, and each walker stops when
it reaches the destination or when its TTL value reaches zero. In this
way, the k-walker random walk algorithm attempts to reduce the routing
delay by a factor of k. However, an arbitrary increase in the number of
walkers results in a significant increase in redundant visits in the initial
stage, which increases the message overhead. The performance
of the k-walker random walk largely depends on the choice of k and TTL.
Intuitively, the average number of nodes that must be probed to discover a
resource is inversely proportional to the popularity of the resource. Choosing
low values of k and TTL when searching for a resource with low popularity
would result in a low success rate and high delays; choosing high values
of k and TTL when searching for a resource with high popularity would
result in excessive overhead. Thus, the parameters of the random walk must be
chosen according to the popularity of the resource being searched for. The
popularity of a resource may not be known a priori at the querying node. In
addition, the popularity may change due to the arrival/departure of nodes,
replication/deletion/exhaustion of resources, or other random changes in the
network. Thus, the parameters of the random walk must be set in an adaptive
manner.
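
The standard and k-walker random walks described above can be sketched as follows. This is a minimal illustration under stated assumptions: walkers are simulated one after another rather than in parallel, each walker's first hop is drawn from the querying node's neighbors with replacement, and the function names are hypothetical. Setting k = 1 gives the standard random walk.

```python
import random

def random_walk(graph, start, has_object, ttl, rng):
    """Follow one walker that has just been forwarded to `start`.

    Returns (node holding the resource or None, messages used).
    """
    node, messages = start, 1            # the initial forward is one message
    while True:
        if has_object(node):
            return node, messages        # a reply goes back to the querying node
        if ttl == 0:
            return None, messages        # TTL exhausted: the query is dropped
        ttl -= 1
        node = rng.choice(graph[node])   # forward to a randomly chosen neighbor
        messages += 1

def k_walker_search(graph, source, has_object, k, ttl, rng=None):
    """Deploy k independent walkers from source.

    Returns (set of holders found, total messages over all walkers).
    """
    rng = rng or random.Random()
    found, total = set(), 0
    for _ in range(k):
        start = rng.choice(graph[source])    # assumption: drawn with replacement
        holder, messages = random_walk(graph, start, has_object, ttl, rng)
        total += messages
        if holder is not None:
            found.add(holder)
    return found, total

# Star overlay in which every neighbor of the querying node holds the resource.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(k_walker_search(star, 0, lambda n: n != 0, k=4, ttl=5, rng=random.Random(1)))
```

The message count is bounded by k(TTL + 1), which makes the trade-off between overhead and success rate discussed above explicit.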
The modified random BFS technique [22] is a modification of the k-walker
random walk scheme to reduce the unnecessary message overhead. Here the
querying node forwards the query to a randomly selected subset of its neigh-
bors. On receiving a query message, each neighbor forwards the query to a
randomly selected subset of its neighbors excluding the source node. This
procedure continues until the query stop condition is satisfied. It is expected
that this approach visits more nodes and has a higher query success rate than
the k-walker random walk.
Some hybrid schemes have also been developed [25] based on a compromise
between flooding and random walks. One of the hybrid schemes uses local
flooding, until exactly K (predefined) new outer nodes have been discovered.
Then, each of the K nodes initiates an independent random walk.
Gradient-based search in scale-free networks. Recent measurements of
Gnutella networks [7] and simulated Freenet networks [18] have shown that
their topological structure follows a power-law degree distribution. Adamic
et al. [1] proposed a message-passing algorithm that can be used to search
efficiently in scale-free networks such as Gnutella. It has been observed that
random walks in scale-free networks naturally gravitate towards the high degree
nodes, but even better coverage is achieved by intentionally choosing high
degree nodes. Adamic et al. [1] have shown analytically that if the nodes
with the highest degree are visited first and the walk then proceeds down the
degree sequence, a significant portion of the network can be covered very
quickly. In the proposed algorithm, the walker approximately follows the
degree sequence across the entire scale-free network with an exponent close to
2 (2.0 < γ < 2.3). At each step, the walker chooses a node with a
degree higher than the current node, quickly finding the highest degree node.
Once the highest degree node has been visited, it is avoided, and a node
of approximately second highest degree is chosen. Effectively, after a
short initial climb, the walker goes down the degree sequence. This is the most
efficient way to do this kind of sequential search, visiting the highest degree
nodes in sequence. These algorithms are completely decentralized and exploit
the power-law distribution of node degrees. The paper demonstrates that
the search algorithms work well on real Gnutella networks, scale sublinearly
with the number of nodes, and may help to reduce the network search traffic
that tends to cripple such networks.
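
The degree-seeking walk can be sketched as a greedy rule: move to the unvisited neighbor of highest degree, and fall back to the highest-degree neighbor when all neighbors have been visited. This is an illustrative simplification, not the algorithm of [1] in full; in particular it checks only the node the walker currently occupies, and the function name and fallback rule are assumptions.

```python
def high_degree_walk(graph, source, has_object, max_steps):
    """Greedy degree-seeking walk.

    Returns (node holding the object or None, list of nodes visited in order).
    """
    node = source
    visited = {source}
    path = [source]
    for _ in range(max_steps):
        if has_object(node):
            return node, path
        unvisited = [nb for nb in graph[node] if nb not in visited]
        candidates = unvisited or graph[node]   # fall back if all were visited
        if not candidates:
            break
        # Each node only needs to know the degrees of its own neighbors.
        node = max(candidates, key=lambda nb: len(graph[nb]))
        visited.add(node)
        path.append(node)
    return (node if has_object(node) else None), path

# Hub 0 with leaves 2, 3, 4; the walk starts at node 5, two hops from the hub.
overlay = {0: [1, 2, 3, 4], 1: [0, 5], 2: [0], 3: [0], 4: [0], 5: [1]}
print(high_degree_walk(overlay, 5, lambda n: n == 3, max_steps=10))
```

In the example the walker climbs to the hub within two hops and then works through the hub's neighbors, mirroring the "short initial climb" followed by a descent of the degree sequence.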

3.2 Topological Dynamics and Stability of Superpeer Networks


From the point of view of topological dynamics, P2P networks exhibit
behavior similar to that of the Internet. However, the special superpeer
topology exhibited by many commercial P2P networks makes the outcome of the
dynamics different from that of the Internet (mainly scale-free networks).
A superpeer network can be modeled by a bimodal degree distribution, where
a small fraction of nodes are superpeers with high degree and a large fraction
of nodes are low degree peers [28]. Formally, the degree distribution pk of
superpeer networks can be specified as

\[
p_k > 0 \ \text{if } k = k_l, k_m; \qquad p_k = 0 \ \text{otherwise},
\]

where kl and km are the degrees of peers and superpeers respectively. Moreover,
there are some differences in the dynamics of P2P networks and the Internet.
We explain the different kinds of peer dynamics and then illustrate the
outcomes in each case. Peers in a P2P system join and leave the network
randomly without any central coordination. This is termed peer churn. In
addition, important peers are targeted for attack [38]. All these peer dynamics
can be modeled by different kinds of node removal schemes in a random graph.
1. Random failure: Peer churn can be modeled by random removal of nodes
from the graph. This is the simplest model of churn, and the probability
of removal of a node is independent of its degree.
2. Degree-dependent failure: Peers having higher connectivity are more
stable in the network than peers having lower connectivity because those
loosely connected peers enter and leave the network quite frequently. This
observation leads us to model churn in a more realistic manner, where the
probability of removal of a node is inversely proportional to the degree of
that node.
3. Degree-dependent attack: In case of attack, the nodes having higher
degrees are more likely to be removed from the network.
Let the probability distribution fk model the different node removal
techniques. In the following we consider a unified churn/attack model of
the form fk = C k^γ, where γ is a parameter called the attack exponent and C
is a constant. The different node removal techniques can be realized from
this unified model just by changing the parameter γ.
1. Random failure: For γ = 0, fk = C, i.e., the probability of removal of a
node is independent of the degree of the node.
2. Degree-dependent failure: For γ < 0, the probability of removal of a
node having degree k decreases with the degree of the
node, i.e. fk ∝ 1/k^|γ|.
3. Degree-dependent attack: For γ > 0, the probability of removal of a
node having degree k increases with the degree of the node,
i.e., fk ∝ k^γ.
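
The three regimes of the unified model can be checked numerically for a bimodal network. The sketch below is illustrative only: the degrees, fractions, and the default choice of C (the largest value that keeps every fk at or below 1) are assumptions, and the function names are hypothetical.

```python
def removal_probabilities(degrees, gamma, C=None):
    """Return {k: fk} with fk = C * k**gamma for each degree k.

    If C is omitted, the largest C that keeps every fk <= 1 is used
    (an assumption made for this illustration).
    """
    raw = {k: float(k) ** gamma for k in degrees}
    if C is None:
        C = 1.0 / max(raw.values())
    probs = {k: C * v for k, v in raw.items()}
    if max(probs.values()) > 1.0 + 1e-12:
        raise ValueError("C is too large: some fk exceeds 1")
    return probs

def removed_fraction(pk, fk):
    """Overall fraction of nodes removed: sum over k of pk * fk."""
    return sum(pk[k] * fk[k] for k in pk)

# Bimodal example: 90% peers of degree 3, 10% superpeers of degree 30.
pk = {3: 0.9, 30: 0.1}
for gamma in (-1.0, 0.0, 1.0):
    fk = removal_probabilities(pk, gamma)
    print(gamma, fk, removed_fraction(pk, fk))
```

For γ < 0 the low-degree peers are removed preferentially, for γ = 0 both classes are removed with the same probability, and for γ > 0 the superpeers are hit first.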

3.2.1 Outcomes

Next we illustrate the impact of different peer dynamics on the stability
of the superpeer networks. The peer churn has been modeled by random
failure and degree-dependent failure, and the attack has been modeled by
degree-dependent attack.
[Figure 7 plot: percolation threshold fr (y-axis) against the fraction of peers r (x-axis); theoretical and simulation curves for ⟨Ksp⟩ = 30 and ⟨Ksp⟩ = 50.]

Fig. 7. The impact of random failure upon the stability of superpeer networks.

Random failure. The analysis done in [30] shows that superpeer networks
are quite robust against churn (Fig. 7). Since churn affects peers and
superpeers in proportion to their individual fractions in the network, peers are
affected much more than superpeers. The removal of a significant number of
low degree peers along with a few high degree superpeers has little impact upon
the stability of the networks. Practical experience also confirms that superpeer
networks exhibit high robustness in the face of churn. Another significant
observation is that a low fraction of superpeers in the network (specifically
below 5%) results in a sharp fall in the percolation threshold; that
is, the vulnerability of the network drastically increases when the fraction of
superpeers is below 5%.
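
The trend described above can be reproduced with the standard random-failure criterion for uncorrelated random networks, fc = 1 − 1/(κ − 1) with κ = ⟨k²⟩/⟨k⟩. The sketch below applies it to a bimodal distribution; it is a background estimate, not the analysis of [30], and the peer degree kl = 3, superpeer degree km = 30, and function names are illustrative assumptions.

```python
def random_failure_threshold(pk):
    """Critical removed fraction under random failure: 1 - 1/(kappa - 1)."""
    k1 = sum(k * p for k, p in pk.items())        # <k>
    k2 = sum(k * k * p for k, p in pk.items())    # <k^2>
    kappa = k2 / k1
    if kappa <= 2.0:
        return 0.0    # no giant component even before any removal
    return 1.0 - 1.0 / (kappa - 1.0)

def bimodal(r, kl, km):
    """Fraction r of peers with degree kl, fraction 1 - r of superpeers with km."""
    return {kl: r, km: 1.0 - r}

for r in (0.85, 0.90, 0.95, 0.99):
    print(r, random_failure_threshold(bimodal(r, kl=3, km=30)))
```

Under these assumptions the threshold falls as the superpeer fraction 1 − r shrinks, with the steepest drop at the smallest superpeer fractions.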
Degree-dependent failure. Fig. 8 shows that as the superpeer degree km
increases, the value of the critical attack exponent γc
at which the network percolates decreases. This increases the fraction
of superpeers that must be removed to break down the network. Since the
increase of km increases the fraction of peers r, the removal of most of the
low degree peers along with a fraction of superpeers increases the percolation
threshold fd. It is also interesting to observe that the percolating γc remains
quite low, below 0.1, for the entire range of km. The reason is that
small values of γc result in the removal of a higher fraction of superpeer
nodes from the network. Since degree-dependent failure mainly removes
the lower degree nodes, which are not so useful for breaking the network down,
removal of a significant number of superpeers becomes necessary.
Degree-dependent attack. [29] analyzes the behavior of superpeer networks
against degree-dependent attack, where kl and km are the degree of peers and
superpeers respectively and r is the fraction of peers in the network. In [29],
[Figure 8 plots: left panel, critical attack exponent γc against Km (degree of superpeers); right panel, percolation threshold fd against Km (degree of superpeers); curves for several mean degrees ⟨k⟩, with theoretical, simulation, and fitted lines.]

Fig. 8. Change in critical attack exponent γc and percolation threshold fd with respect
to superpeer degree km for superpeer networks undergoing degree-dependent failure.
Here the mean degree ⟨k⟩ varies from 8 to 16. The x-axis represents the superpeer degree (km)
and the y-axis represents the corresponding γc and fd.

the authors have established the critical condition for the stability of the
network against degree-dependent attack:

\[
r k_l^{\gamma+1}(k_l - 1) + (1 - r)\, k_m^{\gamma+1}(k_m - 1) \geq k_m^{\gamma}\left(\langle k \rangle (k_m + k_l) - k_m - 2\langle k \rangle\right). \tag{4}
\]

The inequality gives the set of solutions for the critical exponent γc and sub-
sequently the normalizing constant C, which determines the fraction of peers
and superpeers to be attacked. The nature of the solution set Sγc of the in-
equality has a profound impact upon the fraction of peers and superpeers
required to be removed and the percolation threshold fc . The breakdown of
the network can be due to one of the following three situations.
Case A. Removal of all the superpeers along with a fraction of peers. Net-
works having bounded solution set Sγc where 0 ≤ γc ≤ γcbd exhibit this kind
of behavior at the maximum value of the solution γc = γcbd .
Case B. Removal of only a fraction of superpeers. Networks having un-
bounded solution set Sγc where 0 ≤ γc ≤ +∞ exhibit this kind of behavior as
γc → ∞.
Case C. Removal of some fraction of both superpeers and peers. Intermediate
critical exponent γc ∈ Sγc signifies the fractional removal of both peers and
superpeers.
Figure 9 shows that the solution set Sγc remains bounded for networks up to a
threshold superpeer fraction spth (spth = 0.19 and 0.41 for kl = 3 and kl = 4
respectively). Hence, in these networks the removal of all the superpeers along
with a fraction of the peers is necessary to disintegrate the network. The
figure also shows some instances of case B, where only a fraction of the
superpeers needs to be removed.
[Figure 9 plots: left panel, boundary γc (γcbd) against superpeer fraction for peer degrees kl = 3 and kl = 4; right panel, percolation threshold (fc), peer fraction removed (fp), and superpeer fraction removed (fsp) against superpeer fraction for kl = 3 and kl = 4.]

Fig. 9. Impact of degree-dependent attack on superpeer networks. Behavior of γcbd
and percolation threshold due to the change of superpeer fraction is shown.

4 Conclusion
In this chapter, we have presented a comprehensive study of various aspects of
technological networks. We have considered two different technological
networks: the Internet and P2P networks. The protocols used in
the Internet have been discussed briefly along with their services. An
empirical study of the different topological properties of the Internet, such
as scale invariance and the small-world property, has been presented. The fault
tolerance of the Internet has been discussed in the light of general stability
analysis. The spread of computer viruses has been modeled by network-aware
epidemic models. We have also shed some light on the recent advancements
and classifications of the P2P networks. As search is one of the most
important services provided by the P2P systems, different search techniques and
their comparative study have been provided. The stability of P2P networks
in the face of churn and attack has also been discussed as a continuation of
the Internet fault tolerance.
The advancements of the Internet have also posed some serious challenges
to the network research community. One of the significant problems
is modeling the widely varying Internet traffic. An appropriate modeling of
the Internet is often useful to measure the efficiency of routing algorithms and
the quality of service (QoS) of different web applications. Maintaining specific
QoS in a faulty environment can be another major research issue. There is
always substantial uncertainty when making network management decisions.
A decision maker is limited not only because it possesses only partial infor-
mation due to decentralized control but is also limited by the impossibility
of predicting the future in terms of traffic demand and/or network topology
status. Hence, managing this large-scale Internet is also a non-trivial issue.
Understanding the assortative or disassortative relation among different par-
ticipating nodes and their impact upon the complex structural properties is
also a major research problem.
Advancements in P2P networks also raise some issues regarding security


and trust. The P2P philosophy is based upon the cooperative nature of the
participating peers. However, it has been found that in Gnutella networks,
as many as 65% of the nodes do not contribute resources, but free-ride on
other peers’ resources. Hence, the problem of selfish peers and free riders is a
serious threat to the performance of any P2P system. Development of low
overhead trust-aware protocols to ensure trust among the peers is necessary
to enhance the utility of P2P networks. Understanding the self-organizing
features, evolution, and scalability of the superpeer networks is also interesting
and necessary.

References
1. L. A. Adamic, R. M. Lukose, A. R. Puniyani, B. A. Huberman, Search in power-
law networks, Physical Review E, 64, 046135, 2001.
2. R. Albert, H. Jeong, A.-L. Barabasi, Diameter of the world wide web, Nature, 401,
130–131, 1999.
3. R. Albert, H. Jeong, A.-L. Barabasi, Error and attack tolerance of complex networks,
Nature, 406, 2000.
4. N. Berger, C. Borgs, T. Chayes, A. Saberi, On the spread of viruses on the Internet,
Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms (SODA),
301–310, 2005.
5. T. Bu, D. Towsley, On distinguishing between Internet power law topology gen-
erators, Proceedings of INFOCOM, New York, NY, USA, 2002.
6. D. Clark, Face-to-face with peer-to-peer networking, IEEE Computer, 34 (1),
pp. 18–21, January 2001.
7. Clip2 Company, Gnutella. http://www.clip2.com/gnutella.html.
8. R. Cohen, K. Erez, D. Avraham, S. Havlin, Resilience of the Internet to random
breakdown, Physical Review Letters, 85 (21), 2000.
9. R. Cohen, K. Erez, D. Avraham, S. Havlin, Resilience of the Internet under in-
tentional attack, Physical Review Letters, 86 (16), 2001.
10. Q. Deng, H. Lv, Analyzing unstructured peer-to-peer Search Networks with
QIL Proceedings of the IEEE International Conference on Services Computing,
pp. 547–550, Shanghai, China, 2004.
11. P. Erdos, A. Renyi, On Random Graphs I, Publicationes Mathematicae (Debrecen), 6,
290–297, 1959.
12. M. Faloutsos, P. Faloutsos, C. Faloutsos, On power-law relationships of the internet
topology, Computer Communications Review, 29, 251–262, 1999.
13. C. Fuchs, The Internet as a self-organizing socio-technological system, Cybernetics
and Human Knowing, 12 (31), pp. 37–81, 2005.
14. Gnutella: www.gnutellaforums.com.
15. R. Govindan, H. Tangmunarunkit, Heuristics for internet map discovery, Proceed-
ings of IEEE Infocom, 2000.
16. C. Griffin, R. Brooks, A note on the spread of worms in scale-free networks, IEEE
Transactions on Systems, Man, and Cybernetics, Part B, Feb. 2006.
17. L. Guo, S. Jiang, X. Zhang, H. Wang, LightFlood: Minimizing redundant messages
and maximizing scope of peer-to-peer search, IEEE Transactions on Parallel and
Distributed Systems (TPDS) 19 (5), pp. 601–614, May 2008.
18. T. Hong, in Peer-to-Peer: Harnessing the benefits of a disruptive technology, Andy


Oram (ed), O’Reilly, Sebastopol, CA, Chap. 14, pp. 203–241, 2001.
19. C. Hunt, TCP/IP Network Administration, Second Edition, O’Reilly Networking,
December 1997.
20. S. Jin, A. Bestavros, Small-World Internet topologies possible causes and implica-
tions on scalability of end-system multicast, Boston University, Technical Report
BUCS-TR-2002-004, January 2002.
21. S. Jin, H. Jiang, Novel approaches to efficient flooding search in peer-to-peer net-
works, Computer Networks: The International Journal of Computer and Telecom-
munications Networking, 51(10), pp. 2818–2832, July 2007.
22. V. Kalogeraki, D. Gunopulos, D. Zeinalipour-yazti, A local search mechanism for
peer to peer networks, Proc. of the 11th ACM Conference on Information and
Knowledge Management (ACM CIKM02), 2002.
23. J. O. Kephart, A biologically inspired immune system for computers, Artificial
Life IV: Proceedings of the Fourth International Workshop on the Synthesis and
Simulation of Living Systems, Cambridge, MA, July, 1994.
24. J. M. Kleinberg, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, The Web
as a graph: Measurements, models and methods, in Proceedings of the Interna-
tional Conference on Combinatorics and Computing, Lecture Notes in Computer
Science, pp. 118, Springer, Berlin, 1999.
25. X. Li, J. Wu, Searching techniques in peer-to-peer networks, Handbook of The-
oretical and Algorithmic Aspects of Sensor, Ad Hoc Wireless and Peer-to-Peer
Networks, CRC Press, Ann Arbur, MI, 2005.
26. Q. Lv, P. Cao, E. Cohen, K. Li, S. Shenker, Search and replication in unstructured
peer-to-peer networks, ACM International Conference on Supercomputing, New
York, USA, 2002.
27. G. Manku, Routing networks for distributed hash tables, Annual ACM Symposium
on Principles of Distributed Computing archive Proceedings of the twenty-second
annual symposium on Principles of distributed computing, Boston, Massachusetts,
pp. 133–142, 2003.
28. B. Mitra, F. Peruani, S. Ghose, N. Ganguly, Analyzing the vulnerability of super-
peer networks against attack, 14th ACM Conference on Computer and Commu-
nications Security, Alexandria, USA, 29 Oct–2 Nov, 2007.
29. B. Mitra, Md. M. Afaque, S. Ghose, N. Ganguly, Developing analytical frame-
work to measure robustness of peer-to-peer networks, 8th International Confer-
ence on Distributed Computing and Networking - ICDCN 2006 (formerly IWDC),
December 27–30, 2006, IIT Guwahati, India.
30. B. Mitra, S. Ghose, N. Ganguly, Effect of dynamicity on peer to peer networks,
14th International Conference on High Performance Computing, Goa, India, 19–22
December 2007.
31. Napster: http://www.napster.com/.
32. R. Pastor-Satorras, A. Vázquez, A. Vespignani, Dynamical and correlation properties
of the Internet, Phys Rev Lett, 87, 258701, 2001.
33. R. Pastor-Satorras, A. Vespignani, Epidemics and immunization in scale-free net-
works in S. Bornholdt and H. G. Schuster (eds.), Handbook of Graphs and Net-
works, Wiley-VCH, Berlin, 2003.
34. R. Pastor-Satorras, A. Vespignani, Epidemic dynamics in finite size scale-free net-
works, Physical Review E, 65, 035108, 2002.
35. R. Pastor-Satorras, A. Vespignani, Epidemic dynamics and epidemic states in
complex networks, Physical Review E, 63, 066117, 2001.
36. R. Pastor-Satorras, A. Vespignani, Epidemic spreading in scale-free networks,


Physical Review Letters, 86, 3200–3203, 2001.
37. K. Patch, Internet stays small world, Technology Research News, 2003.
38. B. Pretre, Attacks on peer-to-peer networks, Ph.D. thesis, Swiss Federal Institute
of Technology (ETH) Zurich, 2005.
39. Y. J. Pyun, D. S. Reeves, Constructing a balanced, log(N)-diameter super-peer
topology, Proceedings of the 4th International Conference on Peer-to-Peer Com-
puting, Zurich, Switzerland, August 2004.
40. K. Singh, H. Schulzrinne, peer-to-peer internet telephony Using SIP, Columbia
University Technical Report CUCS-044-04, New York, NY, October, 2004.
41. P. Szor, The art of computer virus research and defense, Symantec Press,
Indianapolis, IN, 2005.
42. A. Vazquez, R. Pastor-Satorras, A. Vespignani, Large-scale topological and dy-
namical properties of the Internet, Physical Rev E, 65, 066130, 2002.
43. C. Wang, B. Li, Peer-to-Peer Overlay Networks: A Survey, Department of Com-
puter Science. The Hong Kong University of Science and Technology, Technical
Report, 2003.
44. D. J. Watts, S. H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature,
393, 440–442, 1998.
45. B. Yang, H. Garcia-Molina, Improving search in peer-to-peer networks, Proc. of the
22nd IEEE International Conference on Distributed Computing (IEEE ICDCS02),
2002.
46. B. Yang, H. Garcia-Molina, Designing a super-peer network, Proceedings of the
International Conference on Data Engineering (ICDE), Los Alamitos, CA, March
2003.
47. S. Yook, H. Jeong, Y. Tu, A. L. Barabasi, Weighted evolution networks, Phys.
Rev. Lett., 86, 5835, 2001.
Advances in the Theory of Complex Networks

Fernando Peruani 1,2
1 CEA-Service de Physique de l’Etat Condensé, Centre d’Etudes de Saclay, 91191 Gif-sur-Yvette, France
2 Institut des Systèmes Complexes de Paris Île-de-France, 57/59, rue Lhomond, F-75005 Paris, France; fernando.peruani@iscpif.fr

1 Introduction
An exhaustive and comprehensive review of the theory of complex networks
would nowadays be a titanic task, and it would result in a lengthy work
containing plenty of technical details of arguable relevance. Instead, this chapter
addresses very briefly the ABC of complex network theory, visiting only
the hallmarks of its theoretical foundations, to finally focus on two of the most
interesting and promising current research problems: the study of dynamical
processes on transportation networks and the identification of communities in
complex networks.

2 The ABC of Complex Networks


A network or a graph is a set of interconnected nodes (or vertices). The node
connection is performed through edges. An edge represents a link between two
nodes. Between two vertices there can run more than one edge. Alternatively,
an edge can have a number associated to it denoting its importance or weight.
Edges can be directed or undirected. A directed edge between a node A and
a node B symbolizes that, for example, node A “speaks” to node B, while the
opposite is not possible. On the other hand, undirected edges are completely
symmetric. This review deals exclusively with undirected edges. For a compre-
hensive review on the theory of complex networks we refer the reader to [1, 2].

2.1 Network Characterization

2.1.1 Degree Distribution

A network can be characterized in many ways. For example, we could measure the mean degree of the network $\langle k \rangle$. Here, $k$ stands for one of the most relevant properties of a node, its degree, which indicates the number of edges attached to it, and $\langle \ldots \rangle$ denotes the average over all nodes of the system.

Fig. 1. The figure shows two network topologies: (a) network with exponential degree distribution and (b) network with power-law (scale-free) degree distribution. Figure taken from Ref. [12].

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_16, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

Though $\langle k \rangle$ is a useful and informative quantity, by itself it cannot characterize the structure of the network, and typically a good characterization also requires higher order moments, such as $\langle k^2 \rangle$, $\langle k^3 \rangle$, etc. How many moments do we need to know to unequivocally characterize the network? All the information about the moments is contained in the degree probability distribution of the network $p_k$ (see Fig. 1). $p_k$ is the probability of picking a node at random and observing that its degree is $k$. The moments are computed as $\langle k^n \rangle = \sum_k k^n p_k$. If the network is such that the vertices (nodes) are statistically independent, that is, the connections are completely at random, then the degree probability distribution unequivocally determines the properties of the network. If this is not the case and there are correlations among nodes, the characterization of the network will require the use of a degree-degree probability distribution, or an even higher n-point probability distribution, etc. Let us assume for the moment that vertices are statistically independent. There are three types of degree distributions which, due to their ubiquity and simplicity, deserve special mention: a) the Poisson distribution, defined as $p_k = e^{-\langle k \rangle} \langle k \rangle^k / k!$, which is the degree distribution of a classical random graph; b) the exponential distribution, defined as $p_k \sim e^{-k/\langle k \rangle}$ (see Fig. 1(a)); and c) the power-law distribution (see Fig. 1(b)), $p_k \sim k^{-\gamma}$ with $\gamma > 0$, for which (in infinite networks) all moments of order $m > \gamma - 1$ diverge (for this reason these distributions are referred to as scale free). For distributions like a) and b), the first moment of the distribution, i.e., $\langle k \rangle$, unequivocally characterizes the network topology, but in general higher moments are required to unequivocally determine the network topology.
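
The moment formula above can be checked numerically. The following is a minimal Python sketch, not part of the original text: the function names, the truncation at k_max, and the parameter values are illustrative assumptions. It evaluates the moments of a truncated Poisson distribution and shows how the second moment of a truncated power law keeps growing with the cutoff when 2 > γ − 1.

```python
import math

def poisson_pmf(mean_degree, k_max):
    """Poisson degree distribution p_k for k = 0..k_max (truncation is an assumption)."""
    pk = [math.exp(-mean_degree)]
    for k in range(1, k_max + 1):
        pk.append(pk[-1] * mean_degree / k)
    return pk

def moment(pk, n):
    """n-th moment <k^n> = sum_k k^n p_k of a distribution given as a list."""
    return sum((k ** n) * p for k, p in enumerate(pk))

def power_law_moment(gamma, n, k_max):
    """<k^n> for p_k ~ k^(-gamma) on k = 1..k_max, normalized on that range."""
    weights = [k ** (-gamma) for k in range(1, k_max + 1)]
    norm = sum(weights)
    return sum((k ** n) * w for k, w in enumerate(weights, start=1)) / norm

pk = poisson_pmf(4.0, 100)
print(moment(pk, 1), moment(pk, 2))          # close to <k> and <k>^2 + <k>
print(power_law_moment(2.5, 2, 10 ** 2))     # second moment with a small cutoff
print(power_law_moment(2.5, 2, 10 ** 4))     # grows as the cutoff is raised
```

For the Poisson case the first two moments stay fixed once the cutoff is large enough, whereas for the power law with γ = 2.5 the second moment depends on the cutoff, which is the finite-size counterpart of the divergence mentioned in the text.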
2.1.2 Clustering Coefficient

Another important quantity used to characterize the network topology is the


clustering coefficient. The clustering coefficient measures the degree of con-
nectivity in the environment close to a node, i.e., the degree of cliquishness
of the closest environment of a node. In a more colloquial way, it is an an-
swer to the question: Are my friends also friends of each other? If a node has
degree z, i.e., z neighbors, and all these z nodes are connected among them,
there would be z(z − 1)/2 edges linking the nodes. The clustering coefficient
is defined as the ratio between the total number y of edges connecting the z
nearest neighbors, and the total number of all possible edges between the z
nearest neighbors,
\[ C = \frac{y}{z(z-1)/2}. \tag{1} \]
Logically, a network is associated with a distribution of clustering coefficients; however, typically only the average clustering coefficient is reported, which is a simple estimate of the probability that any pair of neighbors of a given node are also connected to each other. A simple approximation for the average clustering coefficient of Poissonian (or exponential) random networks with $N$ nodes is given by

\[ C_{\mathrm{rand}} = \frac{\langle k \rangle}{N}. \tag{2} \]

Another definition of average clustering coefficient extensively used in the literature is given by

\[ C_{\mathrm{triangle}} = \frac{3A}{B}, \tag{3} \]

where $A$ stands for the number of triangles, i.e., triplets of nodes in which each node is connected to the other two, and $B$ is the number of connected triplets, i.e., triplets in which each node is connected to at least one of the other two. The factor 3 accounts for the fact that from each triangle three simple triplets can be formed.
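
Both definitions can be illustrated on a small graph. The sketch below is an assumption-laden example rather than the author's procedure: the adjacency-set representation, the function names, and the convention C = 0 for nodes with fewer than two neighbors are choices made here. Connected triplets are counted as pairs of neighbors around a central node, so that each triangle contributes three of them.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Eq. (1): C = y / (z(z-1)/2) for a single node."""
    neighbors = adj[node]
    z = len(neighbors)
    if z < 2:
        return 0.0  # assumption: C is set to 0 when no neighbor pair exists
    # y: number of edges among the z nearest neighbors
    y = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return y / (z * (z - 1) / 2)

def triangle_clustering(adj):
    """Eq. (3): C_triangle = 3A / B."""
    # B: connected triplets, one per pair of neighbors of each central node
    triplets = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    # each triangle is found once from each of its three corners, so this is 3A
    closed = sum(1 for nb in adj.values()
                 for a, b in combinations(nb, 2) if b in adj[a])
    return closed / triplets if triplets else 0.0

# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(adj, 0), local_clustering(adj, 2))
print(triangle_clustering(adj))
```

In this example the graph has one triangle and five connected triplets, so Eq. (3) gives 3/5, while the local coefficient of node 2 is 1/3 because only one of its three neighbor pairs is linked.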

2.1.3 Network Diameter

The network diameter is defined as the maximal distance between any pair of
nodes. The above definition strictly works for fully connected networks; how-
ever, by redefining the diameter as the maximum distance among all fully con-
nected components (clusters) of the system, the definition is applicable to all
kinds of networks. Assuming that the network has a sort of tree-like structure, a simple rough estimate can be obtained by equating $\langle k \rangle^d$ with $N$ as follows:

\[ d \sim \frac{\ln(N)}{\ln(\langle k \rangle)}. \tag{4} \]

It has been shown that Eq. (4) predicts the correct scaling of $d$ with $N$ and $\langle k \rangle$ for random networks. Note that when $\langle k \rangle > \ln(N)$, a random network has a high probability of being totally connected [1]. The concept of network
diameter is closely related to another important quantity, the average path
length, which is the average distance between any pair of nodes.
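
As an illustration of these two quantities, the following Python sketch computes the diameter and the average path length by breadth-first search. It is a minimal example under stated assumptions: the graph is unweighted and stored as a dictionary of neighbor sets, the function names are hypothetical, and for disconnected graphs only pairs inside the same component are considered, in the spirit of the redefinition given above.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average_path_length(adj):
    """Largest and mean distance over all pairs of mutually reachable nodes."""
    longest, total, pairs = 0, 0, 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                longest = max(longest, d)
                total += d
                pairs += 1
    return longest, (total / pairs if pairs else 0.0)

# Hypothetical example: a ring of 10 nodes.
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
print(diameter_and_average_path_length(ring))
```

Running one breadth-first search per node costs of the order of N times the number of edges, which is adequate for small examples but not for very large networks.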

2.1.4 Network Spectrum

The network topology can also be studied through the adjacency matrix A,
which is an N × N symmetric matrix whose elements Ai,j represent the con-
nections among the nodes of the network. If nodes i and j are connected,
then Ai,j = 1, otherwise, Ai,j = 0. The spectrum of the network is the set of
eigenvalues of A, and since A has N eigenvalues, the spectral density takes the form

\[ \rho(\lambda) = \frac{1}{N} \sum_{j=1}^{N} \delta(\lambda - \lambda_j). \tag{5} \]

In the limit of $N \to \infty$, $\rho(\lambda)$ becomes a continuous function. Interestingly, the topology of the network is related to the spectral density through

\[ \int d\lambda\, \lambda^k \rho(\lambda) = \frac{1}{N} \sum_{j} (\lambda_j)^k = \frac{1}{N} \sum_{i_1, i_2, \ldots, i_k} A_{i_1,i_2} A_{i_2,i_3} \cdots A_{i_k,i_1}. \tag{6} \]

The right-hand side of Eq. (6) counts, up to the factor $1/N$, the paths of $k$ steps that return to their starting node. One of the most remarkable results connected to this kind of approach is Wigner's law, which applies to infinite random networks with a connectivity $p \sim N^{-\xi}$. When $0 < \xi < 1$, Wigner's law predicts that the spectral density is a semicircular distribution,

\[ \rho(\lambda) = \frac{\sqrt{4Np(1-p) - \lambda^2}}{2\pi N p(1-p)} \]

for $|\lambda| < 2\sqrt{Np(1-p)}$, and vanishes for larger values of $|\lambda|$, except for the principal eigenvalue, which is isolated from the bulk and increases with network size. For $\xi > 1$ the spectral density deviates from Wigner's law and its odd moments vanish (i.e., $\int d\lambda\, \lambda^{2m+1} \rho(\lambda) = 0$), indicating that the only way a path can come back to the original node is by retracing the nodes previously visited, i.e., there are no closed loops [4–9, 26].
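
The link between spectral moments and returning paths can be made concrete without computing any eigenvalue, because the sum over products of adjacency elements in Eq. (6) is the trace of the k-th power of A. The sketch below is an illustrative example with hypothetical helper names and plain Python lists as matrices; it shows that a tree has no returning paths with an odd number of steps, whereas a triangle does.

```python
def adjacency_from_edges(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph with n nodes."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        a[j][i] = 1
    return a

def matmul(a, b):
    """Product of two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][l] * b[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def closed_walks(adjacency, k):
    """Sum over i1..ik of A[i1][i2] A[i2][i3] ... A[ik][i1], i.e. trace(A^k)."""
    n = len(adjacency)
    power = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        power = matmul(power, adjacency)
    return sum(power[i][i] for i in range(n))

triangle = adjacency_from_edges(3, [(0, 1), (1, 2), (0, 2)])
path = adjacency_from_edges(4, [(0, 1), (1, 2), (2, 3)])  # a small tree
print(closed_walks(triangle, 2), closed_walks(triangle, 3))
print(closed_walks(path, 2), closed_walks(path, 3))
```

Dividing these counts by N gives the k-th moment of the spectral density. For k = 2 the count is twice the number of edges, since every edge can be traversed back and forth starting from either end.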

2.2 Building a Network

There are equilibrium and non-equilibrium random networks. These terms are
associated to the way in which the network was grown. In this subsection we
briefly review how a network can be built.

2.2.1 Equilibrium Random Networks

Given a fixed number N of nodes and a fixed number M of edges, the network
is built by taking for each edge a randomly selected couple of nodes and
inserting an edge between them.
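
A minimal Python sketch of this construction is given below. The function name, the fixed seed, and the decision to reject self-loops and repeated edges are assumptions made for the example; the text itself does not specify how such cases are handled.

```python
import random

def equilibrium_random_graph(n_nodes, n_edges, seed=0):
    """Place n_edges edges between uniformly chosen pairs of n_nodes nodes."""
    max_edges = n_nodes * (n_nodes - 1) // 2
    if n_edges > max_edges:
        raise ValueError("too many edges for a simple graph")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n_nodes), 2)   # two distinct nodes
        edges.add((min(a, b), max(a, b)))      # a repeated pair leaves the set unchanged
    return edges

edges = equilibrium_random_graph(50, 100)
mean_degree = 2 * len(edges) / 50
print(len(edges), mean_degree)
```

With N nodes and M edges the mean degree is 2M/N by construction, independently of the random choices.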
2.2.2 Non-Equilibrium Random Networks


In this case, the network is grown by simultaneously adding vertices and edges.
The procedure is as follows. a) A node is added at each time step. b) Simulta-
neously, a pair (or several pairs) of randomly chosen vertices are connected by
an edge. If at some moment the addition of nodes is stopped while the addi-
tion of edges continues, the network will tend to an equilibrium. However, the
network will never approach the equilibrium state given by equilibrium net-
works, since the growing process produces a sort of correlation by which ‘old’
nodes are more connected than ‘young’ nodes. The only way to achieve an
equilibrium network configuration is by also allowing the removal of old edges.
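
The age correlation mentioned above can be observed in a short simulation. The following sketch is only illustrative: the initial condition of two isolated nodes, the function name, the seed, and the rule of skipping repeated edges are assumptions not stated in the text.

```python
import random

def growing_random_network(t_steps, edges_per_step=1, seed=0):
    """Add one node per time step and link randomly chosen pairs of existing nodes."""
    rng = random.Random(seed)
    degree = [0, 0]            # assumption: start from two isolated nodes
    edges = set()
    for _ in range(t_steps):
        degree.append(0)       # a) a node is added
        n = len(degree)
        for _ in range(edges_per_step):
            a, b = rng.sample(range(n), 2)   # b) a random pair is connected
            edge = (min(a, b), max(a, b))
            if edge not in edges:            # repeated edges are skipped
                edges.add(edge)
                degree[a] += 1
                degree[b] += 1
    return degree              # degree[i] belongs to the i-th oldest node

degree = growing_random_network(2000)
oldest = sum(degree[:200]) / 200
youngest = sum(degree[-200:]) / 200
print(oldest, youngest)
```

Since old nodes have been available as edge endpoints for longer, their mean degree is expected to be clearly larger than that of the most recently added nodes.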

2.2.3 Preferential Attachment


A large number of real-world networks are scale free; i.e., they exhibit a power-law degree distribution. The Barabási–Albert model [33] was the first model that satisfactorily described a non-equilibrium network whose asymptotic degree distribution is a power law. The growth model is as follows. Starting with a small number $m_0$ of nodes, at every time step, add a new node with $m$ ($m \le m_0$) edges and link the new node to $m$ different nodes already present in the system according to the following rule: choose each node to which the new node connects with probability $\Pi$ proportional to the node degree $k_i$,

\[ \Pi(k_i) = \frac{k_i}{\sum_j k_j}. \tag{7} \]

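The attachment rule described by Eq. (7) is referred to as preferential attachment. After $t$ time steps this procedure produces a network with $N = t + m_0$ nodes and $mt$ edges. Asymptotically in $t$, the degree distribution of the network approaches a power law with exponent $\gamma = 3$. This remarkable fact can be understood according to the following simple continuum theory [37]. Assume that $k_i$ is a continuous variable whose growth rate is proportional to $\Pi(k_i)$. Since the $mt$ edges give $\sum_j k_j = 2mt$, $k_i$ evolves according to

\[ \frac{\partial k_i}{\partial t} = m \Pi(k_i) = \frac{k_i}{2t}. \tag{8} \]

The solution of this equation, with the initial condition that every node $i$ at its introduction (at time $t_i$) has $k_i(t_i) = m$, is $k_i(t) = m (t/t_i)^{\beta}$ with $\beta = 1/2$. The condition $k_i(t) < k$ is thus equivalent to $t_i > t\, m^{1/\beta}/k^{1/\beta}$, and since nodes are added at a constant rate, the introduction times $t_i$ are uniformly distributed with density $1/(t + m_0)$. Thus, the cumulative probability takes the form

\[ p[k_i(t) < k] = 1 - \frac{m^{1/\beta} t}{k^{1/\beta} (t + m_0)}. \tag{9} \]

Taking the derivative of Eq. (9) with respect to $k$, we obtain the degree distribution

\[ p_k = \frac{2 m^{1/\beta} t}{(m_0 + t) k^{1/\beta + 1}}. \tag{10} \]

In the limit of $t \to \infty$, $p_k \sim 2 m^{1/\beta} k^{-\gamma}$ with $\gamma = \beta^{-1} + 1 = 3$.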
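
The growth rule can be sketched in a few lines of Python. This is an illustrative implementation under explicit assumptions, not the procedure of Ref. [33]: the initial m0 nodes carry no edges, so the first new node picks its m targets uniformly; later targets are drawn from a list in which each node appears once per edge end, and repeated targets are redrawn so that the m targets are distinct. The function name and the seed are hypothetical.

```python
import random

def barabasi_albert(t_steps, m, m0, seed=0):
    """Grow a network by preferential attachment and return the list of degrees."""
    if not 1 <= m <= m0:
        raise ValueError("need 1 <= m <= m0")
    rng = random.Random(seed)
    degree = [0] * m0
    stubs = []                 # node i appears degree[i] times in this list
    for _ in range(t_steps):
        new = len(degree)
        targets = set()
        if not stubs:
            # assumption: all degrees are zero at the first step, choose uniformly
            targets.update(rng.sample(range(m0), m))
        while len(targets) < m:
            # drawing from stubs selects node i with probability k_i / sum_j k_j
            targets.add(rng.choice(stubs))
        degree.append(0)
        for target in targets:
            degree[target] += 1
            degree[new] += 1
            stubs.extend((target, new))
    return degree

degree = barabasi_albert(2000, 3, 3)
print(len(degree), sum(degree) // 2, max(degree))
```

The printed values are the number of nodes t + m0, the number of edges mt, and the largest degree, which for a heavy-tailed distribution is much larger than m.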
2.3 Network Stability: Breaking Down a Network


A finite network can be formed by many isolated clusters of various sizes, or it
can be fully connected with only one giant component. For infinite networks
this statement has to be rephrased in the following way. An infinite network
can exhibit a giant cluster with an infinite number of nodes contained in it, or
on the contrary, all clusters in the system can be finite. If a network exhibits
a giant cluster, we say that the network is stable and highly connected.
We now review the already classical results on percolation of complex
networks [10–14]. Specifically, we follow the method proposed in [10, 11] and
extended in [15, 16]. The goal is to find the minimum fraction of nodes that
should be removed from a network in order to break down the connectivity of
the network. By definition, a network is no longer connected when the initial
giant component disappears, i.e., when the biggest cluster of connected nodes
in the system is much smaller than the total initial number of nodes.
Let pk be the network degree distribution, i.e., the probability of finding
a randomly chosen vertex with degree k, and let qk be the probability that
a node of degree k survives the failure or attack. Correspondingly, 1 − qk
is the probability that a node of degree k is removed. In consequence, pk qk
represents the fraction of all nodes that have degree k and survive the failure
or attack. The objective is now to characterize the cluster size distribution
of surviving nodes, and determine under which condition cluster sizes can be
infinite. We make use of the generating function formalism and define $G(x)$ as the generating function of the network degree distribution $p_k$:

\[ G(x) = \sum_{k=0}^{\infty} p_k x^k. \tag{11} \]

Recall that the connection between the generating function and the probability distribution it generates is given by

\[ p_k = \lim_{x \to 0} \frac{1}{k!} \frac{d^k G(x)}{dx^k}. \tag{12} \]

We still need to derive the generating function $F_0(x)$ of the probability of finding a node of degree $k$ that has survived the attack. Since $p_k q_k$ is the probability of finding a surviving node of degree $k$ after the disruptive event, applying the definition of generating function, Eq. (11), we find that $F_0(x)$ takes the form

\[ F_0(x) = \sum_{k=0}^{\infty} p_k q_k x^k. \tag{13} \]

Another important generating function is the one associated with the probability of finding a randomly chosen edge connected to a node of degree $k$ (after the attack):

\[ A(x) = \sum_{k=0}^{\infty} \frac{k p_k q_k}{z} x^k = \frac{x F_0'(x)}{G'(1)}, \tag{14} \]

where $z = \langle k \rangle = \sum_{k=0}^{\infty} k p_k = G'(1)$. To obtain an expression for the cluster size distribution, we need first to find the generating function of the probability that one of the outgoing edges of the node we arrived at connects to a surviving node of degree $k$. This is simply $A(x)/x$, and the desired generating function can be expressed as

\[ F_1(x) = F_0'(x)/G'(1) = F_0'(x)/z. \tag{15} \]

Now we look for the generating function H1 (x) of the distribution of cluster
sizes of surviving nodes that are reached by randomly choosing an edge and
following it to one of its ends. If we choose an edge that leads us to a removed
node, regardless of the degree of the node, we say that the cluster size we find
is zero. The probability of following the randomly chosen edge and finding a surviving node of degree zero is zero, the probability of finding a surviving node of degree one is $p_1 q_1/z$, the probability of finding a surviving node of degree two is $2 p_2 q_2/z$, and so on. So, the probability of finding a surviving node, regardless of its degree, is $\sum_{k=0}^{\infty} k p_k q_k/z = F_1(1)$. In consequence, the probability of finding an edge that leads to a removed node is $1 - F_1(1)$. Clearly, this is also the probability of following a randomly chosen edge that leads to a zero size component, and so also the coefficient $s_0$ that accompanies $x^0$ in $H_1(x)$. To find the full expression of $H_1(x)$, we still have to look for the probabilities that accompany non-zero size components, i.e., $x^k$ with $k > 0$. This can be computed from the probability $s_1$ of finding, by following a randomly chosen edge, a component of size 1. This is nothing other than the sum of the probabilities of following an edge and finding a surviving node of degree $k$ which has its other $k - 1$ edges connected to removed nodes:

\[ s_1 = \sum_{k=1}^{\infty} \frac{k p_k q_k}{z} (1 - F_1(1))^{k-1} = F_1(H_1(0)). \tag{16} \]

Similarly for $s_2$, we can obtain

\[ s_2 = \sum_{k=2}^{\infty} (k-1) \frac{k p_k q_k}{z} (1 - F_1(1))^{k-2} s_1 = F_1'(H_1(0)) H_1'(0), \tag{17} \]

where $(1 - F_1(1))^{k-2} s_1$ is the probability of taking randomly $k - 1$ edges and finding that $k - 2$ edges are attached to removed nodes, and one to a size 1 component. The term $k - 1$ indicates that there are $k - 1$ possible configurations for these edges. We observe that Eq. (17) is the derivative with respect to $x$ of Eq. (16) evaluated at $x = 0$. However, from the definition given by Eq. (12), we know that the term $x^1$ is accompanied by a first derivative, while the term $x^2$ is associated with a second derivative and a factor 1/2. We solve this problem by considering that the function we have to differentiate successive times is $x F_1(H_1(x))$. The first derivative of this function evaluated at $x = 0$ is $F_1(H_1(0))$, while the second derivative evaluated at $x = 0$ is $2 F_1'(H_1(0)) H_1'(0)$. This suggests a self-consistent equation for $H_1(x)$ of the form

\[ H_1(x) = (1 - F_1(1)) + x F_1(H_1(x)). \tag{18} \]

It can be easily verified that Eq. (18) leads to the correct expressions of
s0 , s1 , . . . , sn by applying the definition given by Eq. (12). Along similar
lines, we can obtain the generating function H0 (x) of the distribution of the
component size to which a randomly chosen node belongs. The main difference
is that instead of determining the probability of finding a randomly chosen
edge attached to a component size s, we now randomly choose a node and
want to determine the probability of finding this node belonging to a cluster
of size $s$. For this reason, instead of using the edge-based probability $k p_k q_k/z$ as before, we use $p_k q_k$ and its corresponding generating function $F_0(x)$. The expression for $H_0(x)$ takes the form

\[ H_0(x) = (1 - F_0(1)) + x F_0(H_1(x)). \tag{19} \]

Finally from Eq. (19), we can obtain the average size of the components:

\[ H_0'(1) = \langle s \rangle = F_0(1) + \frac{F_0'(1) F_1(1)}{1 - F_1'(1)}. \tag{20} \]

As mentioned above, we are interested in knowing the threshold at which the average cluster size becomes finite, or inversely, when it becomes infinite. Clearly, Eq. (20) diverges when $1 - F_1'(1) = 0$, and this critical condition sets the threshold between finite and infinite cluster sizes. Finally, replacing $F_1'(1)$ by its definition, Eq. (15), we obtain a critical condition for $q_k$, which was our initial goal:

\[ \sum_{k=0}^{\infty} k p_k (k q_k - q_k - 1) = 0. \tag{21} \]

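Equation (21) defines the critical condition for the stability of an uncorrelated infinite network under an arbitrary attack. For failure, i.e., when the attack does not depend on the degree $k$ of the node, $q_k = q$, and Eq. (21) gives the critical survival probability $q = \langle k \rangle / (\langle k^2 \rangle - \langle k \rangle)$. The classical percolation threshold for failure [13, 10], i.e., the critical fraction of removed nodes, is then retrieved as follows:

\[ q_c = 1 - \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle}. \tag{22} \]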

Notice that Eq. (22) defines the percolation threshold for infinite networks.
The critical qc strongly depends on system size and thus Eq. (22) fails to de-
scribe the stability of finite networks [17]. Also notice that a basic assumption
behind Eq. (21) is that the original network is uncorrelated. Expressions for
the percolation threshold of finite and/or correlated networks are still missing.
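
The threshold of Eq. (22) and the self-consistent Eq. (18) can be evaluated numerically for a given degree distribution. The Python sketch below makes several assumptions that are not in the text: the failure is uniform (q_k = q, with q the survival probability), the Poisson distribution is truncated at k_max, Eq. (18) is solved at x = 1 by simple fixed-point iteration starting from zero, and the giant component fraction is taken as 1 − H0(1) = F0(1) − F0(u), with u the solution of Eq. (18) at x = 1. All function names are hypothetical.

```python
import math

def poisson_pmf(mean_degree, k_max):
    """Poisson degree distribution p_k for k = 0..k_max (truncated)."""
    pk = [math.exp(-mean_degree)]
    for k in range(1, k_max + 1):
        pk.append(pk[-1] * mean_degree / k)
    return pk

def critical_removed_fraction(pk):
    """Eq. (22): 1 - <k> / (<k^2> - <k>) for random failure."""
    k1 = sum(k * p for k, p in enumerate(pk))
    k2 = sum(k * k * p for k, p in enumerate(pk))
    return 1.0 - k1 / (k2 - k1)

def giant_component_fraction(pk, q, iterations=5000):
    """Fraction of nodes in the giant cluster when each node survives with probability q."""
    z = sum(k * p for k, p in enumerate(pk))

    def f0(x):  # Eq. (13) with q_k = q
        return q * sum(p * x ** k for k, p in enumerate(pk))

    def f1(x):  # Eq. (15): F0'(x) / z
        return q * sum(k * p * x ** (k - 1) for k, p in enumerate(pk) if k > 0) / z

    f1_at_one = f1(1.0)
    u = 0.0
    for _ in range(iterations):
        u = 1.0 - f1_at_one + f1(u)   # Eq. (18) evaluated at x = 1
    return f0(1.0) - f0(u)

pk = poisson_pmf(4.0, 60)
print(critical_removed_fraction(pk))
print(giant_component_fraction(pk, 0.2), giant_component_fraction(pk, 0.5))
```

For a Poisson distribution with mean degree 4 the threshold is a removed fraction of 3/4. Accordingly, a survival probability of 0.2 leaves no giant cluster, while a survival probability of 0.5 does. Close to the threshold the fixed-point iteration converges slowly, so more iterations would be needed there.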
3 Two Current Hot Problems in Complex Networks

In this section we address two current hot problems in complex networks: dynamics
on transportation networks and community identification in complex
networks. Part of the future advances of complex network theory is clearly going
to be along the lines of the problems reviewed in this section. However, we
warn the reader that this selection of problems just gathers a small number
of timely interesting issues on networks which are particularly attractive for
the author. The amount of relevant open problems in the fast-evolving area
of network theory exceeds by far the small selection presented here.

3.1 Dynamics on Transportation Networks

A transportation network typically models the movement of entities across


the nodes of the network (see Fig. 2). A classical example is the airline trans-
portation network where each node denotes a city (i.e., an airport) and edges
indicate direct flights between cities. If we associate to each node i a number
ni (t) denoting the number of individuals at node i at time t, we can model the
dynamical flow of mass (or individuals) across the network. It is not difficult
to imagine a transportation network moving various types (e.g., species) of
individuals or entities. This means that at a given instant of time there will
be various species of individuals coexisting at each node. If in turn there is
a dynamics among the various types of individuals, on top of the transport
dynamics there will be an inter-species dynamics. A chemical reaction where
the chemical species diffuse across the transportation network [18] would be
an example of this type of dynamical process. Another example would be the
spreading of a disease through the airline transportation network [19, 20, 21],
as occurred in 2002 during the outbreak of the severe acute respiratory syn-
drome (SARS) [19]. In this case, susceptible, infected, and recovered individ-
uals are the reacting species.
In this section we briefly review some recent results [18, 22–25] which have
helped to elucidate some key aspects of the metapopulation dynamics which
occurs on transport networks. Let us start by understanding the transport
dynamics.

3.1.1 Transport Dynamics

For the moment we assume that there is only one species diffusing in the sys-
tem. A metapopulation description of the transport process can be obtained
by thinking in terms of the mean occupation number ñk (t) of nodes of degree
k at time t, which by definition reads as

\[ \tilde{n}_k(t) = \frac{1}{N_k} \sum_{i:\, k^{(i)} = k} n^{(i)}(t), \tag{23} \]

where the sum runs over all nodes whose degree is $k$, $N_k$ refers to the total number of nodes with degree $k$, and $n^{(i)}(t)$ denotes the occupation number (= number of individuals) at node $i$. It is assumed that there is a diffusion rate $d(k, k')$ that controls the migration of individuals from a subpopulation with degree $k$ to another of degree $k'$. In consequence, the probability per unit time $L_k$ for an individual at a node of degree $k$ of leaving the node is $L_k = k \sum_{k'} p(k'|k) d(k, k')$, where $p(k'|k)$ is the conditional probability that an edge departing from a node of degree $k$ points to a node of degree $k'$. Thus, the (mean-field) time evolution of $\tilde{n}_k(t)$ can be expressed as

\[ \partial_t \tilde{n}_k(t) = -L_k \tilde{n}_k(t) + k \sum_{k'} p(k'|k) d(k', k) \tilde{n}_{k'}(t). \tag{24} \]

The reasoning behind Eq. (24) is very simple. The first term on the right-
hand side accounts for the number of individuals that initially are in a node
of degree k and then leave it, while the second term considers the increase
of individuals in k-degree nodes due to the migration of individuals from
subpopulations of degree k  to k. For uncorrelated networks, p(k  |k) takes the
form p(k  |k) = k  pk /k and Eq. (24) reduces to
k 
∂t ñk (t) = −Lk ñk (t) + pk d(k  , k)ñk (t). (25)
k 
k

If in addition it is assumed that the probability for an individual to leave


a given population is independent of its degree, then Lk = L for all k, and
d(k, k  ) = L/k. The stationary solution for Nk (t) then reads:
k
Nk (t → ∞) = N. (26)
k
A more realistic transportation process has to consider the migration of in-
dividuals to be proportional to the traffic intensity along the network edges.
This can be obtained by defining a heterogeneous diffusion probability for
any given individual to go from a subpopulation of degree k to another one
of degree k′ as d(k, k′) = L w_0 (k k′)^θ / T_k, where T_k provides the
normalization that ensures that the overall outflow is still L, θ is a model parameter that
controls the impact of the network topology, and w0 is simply a constant.

3.1.2 Dynamics Among Different Species

In the following discussion we assume that there are multiple species travel-
ing across the network which interact among themselves. We consider three
interacting species: susceptible, infected, and recovered individuals which fol-
low the classical Susceptible-Infected-Recovered (SIR) dynamics (see Fig. 2).

[Figure 2: schematic of a transportation network; labels: "Subpopulation i", "Transportation network", agents: susceptible, infected, recovered.]

Fig. 2. The scheme illustrates a transportation network. Each node is a container of
agents, i.e., a subpopulation. Agents are transported through the network edges, e.g.,
from node j to i. Inside each node, individual agents interact. The figure depicts
SIR dynamics in which susceptible agents get the disease from infected agents, which
in turn become, after a characteristic time, recovered.

For a single population (node), an epidemic outbreak can occur depending
on the basic reproductive number R0, which accounts for the number of secondary
infected cases generated by a primary infected individual. The basic
reproductive number is defined as

\[ R_0 = \frac{\beta}{\mu}, \tag{27} \]

where 1/β is the characteristic time required by a susceptible individual to
acquire the disease from any given neighbor, and 1/μ is the characteristic time
an individual remains infected after getting the disease. If R0 > 1, new
infections are initially generated faster than infected individuals recover,
and the disease spreads. When R0 < 1 the epidemic goes to
extinction. Note that even if R0 > 1, i.e., when the disease at the node level
affects many individuals, the infection does not necessarily spread over the
metapopulation system, in which case a macroscopic fraction of
nodes is never reached by the disease. For a global invasion to occur, we
additionally require a fast enough diffusion of individuals. In the following we review the derivation
of the metapopulation disease invasion predictor R∗, which determines for
which parameters (including R0 and d(k, k′)) a disease infects a finite fraction
of the network.
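The role of R0 inside a single, closed subpopulation can be illustrated with a short numerical experiment. The sketch below assumes the standard well-mixed SIR equations ds/dt = −βsi, di/dt = βsi − μi for the susceptible and infected fractions s and i, integrated with forward Euler steps; the initial infected fraction, the time step, and the function name `sir_attack_rate` are illustrative assumptions. The returned fraction of ever-infected individuals may be read as a rough stand-in for the factor α used in the derivation that follows.

```python
def sir_attack_rate(beta, mu, i0=1e-4, dt=0.01, t_max=400.0):
    """Forward-Euler integration of ds/dt = -beta*s*i, di/dt = beta*s*i - mu*i.

    Returns the fraction of the population ever infected, i.e. 1 - s(t_max).
    """
    s, i = 1.0 - i0, i0
    for _ in range(int(t_max / dt)):
        new_infections = beta * s * i * dt
        recoveries = mu * i * dt
        s -= new_infections
        i += new_infections - recoveries
    return 1.0 - s


mu = 1.0
for beta in (0.5, 1.5, 3.0):
    # below R0 = 1 the outbreak stays microscopic; above it a finite fraction is hit
    print(f"R0 = {beta / mu:.1f}  attack rate = {sir_attack_rate(beta, mu):.3f}")
```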
Let us start out by estimating the number of new infected individuals
(seeds) that may appear in a connected subpopulation of degree k′ during
the duration of an outbreak in a subpopulation of degree k. We denote by
αN_k the number of infected individuals during the evolution of the epidemic
in a closed subpopulation (α depends on the specific disease model). If each
infected individual holds the disease for a characteristic time μ^{−1} during which
it can travel to a neighboring subpopulation of degree k′ with a rate d(k, k′), then the
number of new seeds can be expressed as

\[ \lambda_{kk'} = \frac{d(k, k')\, \alpha N_k}{\mu}. \tag{28} \]
Now we can derive a simple approximate evolution equation for the number
of infected subpopulations D_k^n of degree k at generation n (written simply as D^n
below) for a random graph in which each subpopulation has the same degree k,

\[ D^n = D^{n-1} (k - 1) \left[ 1 - \left( \frac{1}{R_0} \right)^{\lambda_{kk}} \right] \left( 1 - \frac{D^{n-1}}{N} \right). \tag{29} \]

The reasoning behind Eq. (29) is the following. Each of the D^{n−1} infected
subpopulations at generation n − 1 will seed during the next generation a number of
subpopulations proportional to k − 1 times the probability that the neighboring
subpopulations are not yet infected (i.e., 1 − D^{n−1}/N), times the probability
that the newly infected individuals cause a local outbreak (this probability is
proportional to 1 − R0^{−λ_kk}, since the probability that a single individual will
not transmit the disease is R0^{−1} [27]). Assuming, as before, that d(k, k′) = p/k
(with p the mobility rate, written L above), so that λ_kk = pN0α/(μk) (where N0 = N_k),
that R0 ≃ 1 such that 1 − R0^{−λ_kk} ∼ λ_kk (R0 − 1), and that D^{n−1}/N ≪ 1 in the
early stage of the invasion, Eq. (29) reduces to

\[ D^n = \frac{k - 1}{k}\, p N_0 \alpha \mu^{-1} (R_0 - 1)\, D^{n-1}. \tag{30} \]

From Eq. (30) it is easy to observe that a macroscopic outbreak can only
occur if

\[ R_* = \frac{k - 1}{k}\, p N_0 \alpha \mu^{-1} (R_0 - 1) > 1. \tag{31} \]

Thus, the global invasion threshold is defined by Eq. (31). This implies that
to observe global spread the mobility rate has to be such that

\[ p N_0 \geq \frac{\mu k}{\alpha (k - 1)(R_0 - 1)}. \tag{32} \]
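The generational picture behind Eqs. (29)–(31) can be made concrete with a few lines of code. The following is a minimal sketch, assuming d(k, k′) = p/k and purely illustrative parameter values (k = 4, R0 = 1.1, μ = 1, α = 0.1, N0 = 1000, and N = 10000 subpopulations); the function names `invade` and `r_star` are hypothetical. It iterates the full map of Eq. (29) and evaluates the linearized predictor of Eq. (31).

```python
def invade(p, k=4, R0=1.1, mu=1.0, alpha=0.1, N0=1000, N=10_000, generations=30):
    """Iterate Eq. (29) starting from a single infected subpopulation."""
    lam = p * N0 * alpha / (mu * k)      # seeds sent to one neighbor, Eq. (28) with d = p/k
    outbreak = 1.0 - R0 ** (-lam)        # probability that the seeds cause a local outbreak
    D = 1.0
    for _ in range(generations):
        D = D * (k - 1) * outbreak * (1.0 - D / N)
    return D


def r_star(p, k=4, R0=1.1, mu=1.0, alpha=0.1, N0=1000):
    """Linearized invasion predictor of Eq. (31)."""
    return (k - 1) / k * p * N0 * alpha / mu * (R0 - 1.0)


for p in (0.01, 0.5):
    # low mobility: the invasion dies out; high mobility: a finite fraction is infected
    print(p, round(r_star(p), 3), invade(p))
```

Because Eq. (31) relies on R0 being close to 1, the linearized predictor and the full iteration can disagree near the threshold; the two mobility values above are deliberately chosen far from it.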

In a heterogeneous metapopulation network, i.e., when the subpopulation
degree varies across the network, Eq. (29) has to be replaced by

\[ D_k^n = \sum_{k'} D_{k'}^{n-1} (k' - 1)\, \lambda_{k'k} (R_0 - 1)\, p(k|k') \left( 1 - \frac{D_k^{n-1}}{N_k} \right), \tag{33} \]

where again it was assumed that R0 ≃ 1. Since p(k|k′) is the conditional
probability that an edge attached to a node of degree k′ has its other end
connected to a node of degree k, p(k|k′) k′ is the expected number of edges of
such a node that lead to nodes of degree k. In Eq. (33), p(k|k′) (k′ − 1) is the
corresponding expected number for a recently infected node with degree k′, discounting the
edge through which the node got the disease.
As said above, when degree correlations can be neglected, p(k|k′) = k p(k)/⟨k⟩,
and Eq. (33) can be expressed as
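
\[ D_k^n = (R_0 - 1) \frac{k\, p(k)}{\langle k \rangle} \sum_{k'} D_{k'}^{n-1} (k' - 1)\, \lambda_{k'k}. \tag{34} \]

Similarly, Eq. (28) is reduced to

\[ \lambda_{k'k} = \frac{p\, \alpha N_{k'}}{\mu k'}. \tag{35} \]

Consequently, using the stationary occupation of Eq. (26), N_{k′} = k′ N0/⟨k⟩ (with N0
playing the role of N in Eq. (26)), the evolution equation for D_k^n reads:

\[ D_k^n = (R_0 - 1) \frac{k\, p(k)\, p N_0 \alpha}{\mu \langle k \rangle^2} \sum_{k'} D_{k'}^{n-1} (k' - 1). \tag{36} \]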

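Multiplying both sides by (k − 1) and taking the sum over k on both sides,
Eq. (36) can be expressed as

\[ \Theta^n = (R_0 - 1) \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle^2} \frac{p N_0 \alpha}{\mu}\, \Theta^{n-1}, \tag{37} \]

where Θ^n is defined as $\Theta^n = \sum_{k'} D_{k'}^n (k' - 1)$. From Eq. (37) we learn that the
disease spreads only if

\[ R_* = (R_0 - 1) \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle^2} \frac{p N_0 \alpha}{\mu} > 1. \tag{38} \]

Equation (38) defines the global invasion threshold for a heterogeneous network.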
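To see how degree heterogeneity enters Eq. (38), the predictor can be evaluated for two degree distributions with the same mean degree. The sketch below is illustrative only: the distributions, the parameter values, and the function name `r_star_heterogeneous` are assumptions made for the example.

```python
def r_star_heterogeneous(p_k, p, R0, mu, alpha, N0):
    """Evaluate Eq. (38) for a degree distribution p_k (dict: degree -> probability)."""
    k1 = sum(k * w for k, w in p_k.items())      # <k>
    k2 = sum(k * k * w for k, w in p_k.items())  # <k^2>
    return (R0 - 1.0) * (k2 - k1) / k1 ** 2 * p * N0 * alpha / mu


params = dict(p=0.1, R0=1.1, mu=1.0, alpha=0.1, N0=1000)  # illustrative values
regular = {4: 1.0}                    # every subpopulation has degree 4
broad = {1: 0.6, 4: 0.3, 22: 0.1}     # same <k> = 4, but a much larger <k^2>

r_regular = r_star_heterogeneous(regular, **params)
r_broad = r_star_heterogeneous(broad, **params)
print(r_regular, r_broad)
```

For the regular case the topological factor reduces to (k − 1)/k, so Eq. (38) coincides with Eq. (31); with these numbers the broad distribution is above the threshold while the regular one is below it.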
Though in recent years we have observed important progress related to the
dynamics on transportation networks, there are still many open questions to
be answered. For example, the degree of a subpopulation has been considered
so far decoupled from the subpopulation size. However, we know that, in
many cases, as in an airline transportation network which connects cities of
different sizes, degree and subpopulation size are strongly correlated. In fact, a
satisfactory network growth model for transportation networks is still lacking.
Regarding the dynamics on the nodes, birth and death processes are typically
ignored, even though small subpopulations could experience large fluctuations
which in turn could dramatically affect the global flow on the network. Bottleneck
effects due to the limited capacity of transportation channels and of nodes
are further important problems that deserve to be investigated.

3.2 Identifying Communities in Complex Networks

If we observe real-world networks, we notice that typically there are small
sets of nodes which are highly connected to each other but with only few links
to the rest of the network (see Fig. 3). These sets of highly connected nodes
are typically referred to as communities or modules. To fully understand the
internal topological structure of a network it is crucial to correctly detect the
community structure in it.

Fig. 3. The scheme illustrates a network comprising two modules or communities.
Notice the high connectivity exhibited by nodes in each community.

A general method for identification of communities in unipartite networks
is the maximization of the modularity function Q introduced by Newman
and Girvan [28]. The function Q evaluates the “goodness” of a partition of
a network into communities. The basic assumption behind the modularity
function is that a community or module of a network should exhibit a number
of internal links greater than the number expected for the same nodes in a
random network. For a network with N nodes and L links, the modularity function Q
is defined as follows:

\[ Q = \sum_{s=1}^{m} q_s = \sum_{s=1}^{m} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right], \tag{39} \]

where the sum runs over the m modules of the network, ls is the number of
links inside module s, and ds is the total degree of the nodes in module s. The
term l_s/L denotes the fraction of links connecting pairs of nodes belonging
to module s, while (d_s/2L)^2 represents the fraction of links that one would
find in the module if links were placed at random in the network, under the
constraint of respecting the degree distribution of the original network. If q_s
is such that

\[ q_s = \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \geq 0, \tag{40} \]

the module is well defined, in the sense that the module presents more links
than expected by random chance. The greater q_s, the better defined the module.
The identification method implies the maximization of Q, which in turn
involves sampling over all possible partitions of the network. Unfortunately,
the number of possible partitions grows at least exponentially with the network size, and
the modularity optimization is an NP-complete problem [29]. Finding the true
optimum of the measure is therefore typically not feasible.
However, approximations of the maximum can be obtained by applying optimization
algorithms such as simulated annealing, extremal optimization, or
spectral division. Other drawbacks of the Newman–Girvan method are that it
cannot resolve the network below some scale, leaving small modules undetected,
and that it may be affected by the time evolution of the network, i.e., by
network size.
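Evaluating Q for a given partition is straightforward, even though searching for the best partition is not. The sketch below computes Eq. (39) directly from an edge list; the toy graph (two triangles joined by a single link) and the function name `modularity` are illustrative assumptions, and the graph is assumed undirected with no self-loops.

```python
def modularity(edges, communities):
    """Evaluate Eq. (39) for an undirected graph given as a list of edges.

    communities is a list of node sets that partition the graph.
    """
    L = len(edges)
    module_of = {node: s for s, nodes in enumerate(communities) for node in nodes}
    l = [0] * len(communities)  # l_s: number of links inside module s
    d = [0] * len(communities)  # d_s: total degree of the nodes in module s
    for u, v in edges:
        d[module_of[u]] += 1
        d[module_of[v]] += 1
        if module_of[u] == module_of[v]:
            l[module_of[u]] += 1
    return sum(ls / L - (ds / (2 * L)) ** 2 for ls, ds in zip(l, d))


# Two triangles joined by a single link (illustrative toy graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
q_whole = modularity(edges, [{0, 1, 2, 3, 4, 5}])
print(q_split, q_whole)
```

Here splitting the graph into its two triangles gives Q = 2(3/7 − 1/4) = 5/14, whereas placing all nodes in one module gives Q = 0, so the split is the better partition according to Eq. (39).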

3.2.1 Identifying Communities in Bipartite Networks


Bipartite networks are a special and important class of networks in which
nodes are divided into two disjoint subsets and edges link nodes of one subset
with nodes in the other. Bipartite networks have many applications; however,
one application in social science has become the prototypical example of a
bipartite network: the movie-actor network [30–34]. This
network is divided into two sets, the set of actors and the set of movies (also
referred to as teams; see Fig. 4). An edge that connects an actor a and a movie
m indicates that a has participated in the movie m. Note that the behavior of
these networks strongly depends on whether both partitions grow with time,
which leads to scale-free degree distributions of actors, or whether one of
the partitions, e.g., the actor set, is fixed over time while the remaining set
grows unboundedly, which results in a beta-distribution for the degree of actor
nodes [35].

[Figure 4: top panel "Bipartite network" with team nodes A–D and actor nodes 1–4; bottom panel "Unipartite projection" of actor nodes 1–4.]

Fig. 4. The figure shows a scheme of a growing bipartite network. The team node
D represents a new incoming node. The scheme at the bottom indicates the resulting
unipartite projection of actor nodes (see text).

Many relevant properties of bipartite networks become evident in the unipartite
projection of actor nodes. In this unipartite network, an edge running
from an actor a to an actor a′ indicates that a and a′ have co-starred in
the same movie (see Fig. 4). Notice that, in consequence, the actors attached
to a movie m in the bipartite network form a clique in the unipartite
projection. Bipartite networks have intrinsically very strong modularity and
typically exhibit complex structure. Guimerà et al. [36] have recently proposed
a simple and elegant model for bipartite network growth which allows us to
study different levels of modularity in bipartite networks. The model assumes
that each actor and each movie has an associated color. The number of colors is a
model parameter that has to be defined in advance. The next step is to assign
a color to each actor. Once all this has been defined, the network is grown
according to the following steps.
a) Create team m.
b) Select the number μm of actors in the team.
c) Select the color cm of the team.
d) For each of the μm actors in m proceed as follows: with probability p,
select the actor from the pool of actors with the team color cm ; otherwise,
select an actor at random with equal probability.
The parameter p is called team homogeneity and quantifies how homoge-
neous a team is. For p = 1 all the actors in the team belong to the same
module and modules are perfectly segregated, whereas for p = 0 the color of
the team is irrelevant and actors are perfectly mixed and the network does
not have a modular structure.
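Steps a) to d) can be sketched in a few lines. The code below is a minimal illustration, not the implementation of Ref. [36]: it assumes a fixed team size, a uniformly random team color, a round-robin assignment of colors to actors, and it redraws an actor if the same one is picked twice for a team. The function names `grow_bipartite` and `same_color_fraction` are hypothetical.

```python
import random


def grow_bipartite(n_actors, n_colors, n_teams, team_size, p, seed=0):
    """Grow a colored bipartite actor-team network following steps a) to d)."""
    rng = random.Random(seed)
    # Assumption: colors are assigned round-robin so that every color pool is non-empty.
    actor_color = [a % n_colors for a in range(n_actors)]
    pools = {c: [a for a in range(n_actors) if actor_color[a] == c]
             for c in range(n_colors)}
    if team_size > min(len(pool) for pool in pools.values()):
        raise ValueError("team_size exceeds the smallest color pool")
    teams = []
    for _ in range(n_teams):                  # a) create team m
        color = rng.randrange(n_colors)       # c) select the team color c_m
        members = set()
        while len(members) < team_size:       # b) mu_m = team_size (fixed in this sketch)
            if rng.random() < p:              # d) with probability p use the team-color pool
                members.add(rng.choice(pools[color]))
            else:                             #    otherwise pick any actor uniformly
                members.add(rng.randrange(n_actors))
        teams.append((color, sorted(members)))
    return actor_color, teams


def same_color_fraction(actor_color, teams):
    """Fraction of team memberships in which the actor has the team color."""
    hits = sum(actor_color[a] == color for color, members in teams for a in members)
    return hits / sum(len(members) for _, members in teams)


for p in (0.0, 0.5, 1.0):
    colors, teams = grow_bipartite(60, 3, 200, 4, p)
    print(p, round(same_color_fraction(colors, teams), 3))
```

For p = 1 every team is drawn from a single color pool, while for p = 0 the share of same-color members is close to the 1/3 expected by chance with three colors.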
Guimerà et al. [36] have adapted the modularity criterion of Newman–Girvan,
Eq. (39), to account for modularity in bipartite networks. They consider
that the expected number of times a given actor a belongs to a team
composed of μ actors is

\[ p_{a \to m} = \mu \frac{t_a}{\sum_k t_k}, \tag{41} \]

where t_a is the total number of movies in which actor a has participated, i.e.,
the degree of node a. Eq. (41) represents the probability that a given team m
with μ actors is connected to actor a. Thus, the probability that a team m is
connected to both a and a′ is given by

\[ p_{a,a' \to m} = \mu (\mu - 1) \frac{t_a t_{a'}}{\left( \sum_k t_k \right)^2}. \tag{42} \]

In consequence, the average number n_{a,a′} of movies in which a and a′ have
co-starred (assuming a non-correlated random process) is

\[ n_{a,a'} = \frac{\sum_m \mu_m (\mu_m - 1)}{\left( \sum_m \mu_m \right)^2}\, t_a t_{a'}, \tag{43} \]

where $\sum_m \mu_m = \sum_k t_k$. From Eq. (43) the bipartite modularity can be
expressed as the cumulative deviation from the random expectation of
co-starring movies:

\[ Q_B = \sum_{s} \left[ \frac{\sum_{a \neq a' \in s} c_{aa'}}{\sum_m \mu_m (\mu_m - 1)} - \frac{\sum_{a \neq a' \in s} t_a t_{a'}}{\left( \sum_m \mu_m \right)^2} \right], \tag{44} \]

where the outer sum runs over the modules s, the inner sums over m run over the
teams, and c_{aa′} is the actual number of movies in which a and a′ have co-starred.
Notice that the identification of modules through the optimization of Q_B
leads to the same type of problems present in the Newman–Girvan method:
the method leaves small modules undetected and strongly depends on network
size.
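As with Q, the bipartite modularity of a given partition can be evaluated directly. The sketch below implements Eq. (44) under the assumptions that each team lists an actor at most once and that the sums over a ≠ a′ run over ordered pairs; the toy data and the function name `bipartite_modularity` are illustrative.

```python
from itertools import permutations


def bipartite_modularity(teams, modules):
    """Evaluate Eq. (44); teams is a list of actor lists, modules a list of actor sets."""
    t = {}  # t_a: number of teams in which actor a takes part
    c = {}  # c_aa': number of teams shared by the ordered pair (a, a')
    for team in teams:
        for a in team:
            t[a] = t.get(a, 0) + 1
        for a, b in permutations(team, 2):
            c[(a, b)] = c.get((a, b), 0) + 1
    pairs_norm = sum(len(team) * (len(team) - 1) for team in teams)  # sum_m mu_m (mu_m - 1)
    size_norm = sum(len(team) for team in teams) ** 2                # (sum_m mu_m)^2
    q = 0.0
    for module in modules:
        for a, b in permutations(module, 2):
            q += c.get((a, b), 0) / pairs_norm - t.get(a, 0) * t.get(b, 0) / size_norm
    return q


# Two groups of actors that never co-star across groups (illustrative toy data).
teams = [[0, 1, 2], [0, 1, 2], [3, 4, 5], [3, 4, 5]]
q_two = bipartite_modularity(teams, [{0, 1, 2}, {3, 4, 5}])
q_one = bipartite_modularity(teams, [{0, 1, 2, 3, 4, 5}])
print(q_two, q_one)
```

For this toy example the partition into the two groups yields Q_B = 2/3, which is larger than the value 1/6 obtained when all actors are placed in a single module.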
The identification of communities in complex networks is extremely im-
portant, since it can reveal functional relationships between nodes. So far the
available methods for modularity identification are purely phenomenological
and they cannot guarantee the correct identification of the community structure.
A theoretical foundation for modularity identification is still lacking. Due
to the relevance of the problem, we expect to observe important theoretical
progress in this direction in the near future.

4 Concluding Remarks

The complex network community has been growing for years. Every day we
see new articles on complex networks, and the evolution of the field seems
limitless. In such a dynamic research field, any prediction about the future
of complex network theory is extremely risky. The two selected hot topics
in this chapter, dynamical processes on transportation networks and iden-
tification of communities in complex networks, are certainly areas that will
experience important progress in the near future. Very important progress is
also expected in many other areas, as for example, in dynamical networks of
moving agents. In the coming years we will witness substantial new progress
in network research.

References
1. R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
2. S.N. Dorogovtsev and J.F.F. Mendes, Evolution of Networks: From Biological Nets
to the Internet and WWW, Oxford University Press, Oxford, UK (2003).
3. F. Chung and L. Lu, Adv. Appl. Math. 26, 257 (2001).
4. E.P. Wigner, Ann. Math. 62, 548 (1955).
5. E.P. Wigner, Ann. Math. 65, 203 (1957).
6. E.P. Wigner, Ann. Math. 67, 325 (1958).
7. M.L. Mehta, Random Matrices, 2nd ed., Academic Press, New York (1991).
8. A. Crisanti, G. Paladin, and A. Vulpiani, Products of Random Matrices in
Statistical Physics, Springer, Berlin (1993).
9. T. Guhr, A. Mueller-Groeling, and H.A. Weidenmueller, Phys. Rep. 299, 189
(1998).
10. D.S. Callaway, M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Phys. Rev. Lett.
85, 5468 (2000).
11. M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Phys. Rev. E 64, 026118 (2001).
12. R. Albert, H. Jeong, and A.L. Barabási, Nature (London) 406, 6794 (2000); 406,
378 (2000).
13. R. Cohen, K. Erez, D. Ben-Avraham, and S. Havlin, Phys. Rev. Lett. 85, 4626
(2000).
14. R. Cohen, K. Erez, D. Ben-Avraham, and S. Havlin, Phys. Rev. Lett. 86, 3682
(2001).
15. B. Mitra, F. Peruani, S. Ghose, and N. Ganguly, in Proceedings of 14th ACM Con-
ference on Computer and Communications Security (Association for Computing
Machinery, Inc. New York, 2007).
16. B. Mitra, F. Peruani, S. Ghose, and N. Ganguly, in Proceedings of 26th Symposium
on Principles of Distributed Computing (Association for Computing Machinery,
Inc. New York, 2007).
17. B. Mitra, N. Ganguly, S. Ghose, and F. Peruani, Phys. Rev. E 78, 026115 (2008).
18. V. Colizza, R. Pastor-Satorras, and A. Vespignani, Nature Physics 3, 276–282
(2007).
19. L. Hufnagel, D. Brockmann, and T. Geisel, Proc. Natl. Acad. Sci. USA 101, 15124
(2004).
20. Z. Wu, L.A. Braunstein, V. Colizza, R. Cohen, S. Havlin, and H.E. Stanley, Phys.
Rev. E 74, 056104 (2006).
21. V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, Proc. Natl. Acad. Sci.
USA 103, 2015–2020 (2006).
22. V. Colizza and A. Vespignani, J. Theor. Biol. 251, 450–467 (2008).
23. V. Colizza and A. Vespignani, Phys. Rev. Lett. 99, 148701 (2007).
24. V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, Int. J. Bifurcation and
Chaos 17, 2491–2500 (2007).
25. V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, BMC Medicine 5, 34
(2007).
26. I.J. Farkas, I. Derenyi, A.-L. Barabási, and T. Vicsek, Phys. Rev. E 64, 026704
(2001).
27. N.T. Bailey, The Mathematical Theory of Infectious Diseases, 2nd edition, Hodder
Arnold (1975).
28. M.E.J. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
29. S. Fortunato, e-print arXiv:0705.4445.
30. J.J. Ramasco, S.N. Dorogovtsev, and R. Pastor-Satorras, Phys. Rev. E 70, 036106
(2004).
31. D.J. Watts and S.H. Strogatz, Nature (London) 393, 440 (1998).
32. R. Albert and A.-L. Barabási, Phys. Rev. Lett. 85, 5234 (2000).
33. R. Albert and A.-L. Barabási, Science 286, 509 (1999).
34. L.A.N. Amaral, A. Scala, M. Barthélémy, and H.E. Stanley, Proc. Natl. Acad. Sci.
97, 11149 (2000).
35. F. Peruani, M. Choudhury, A. Mukherjee, and N. Ganguly, Europhys. Lett. 79,
28001 (2007).
36. R. Guimerà, M. Sales-Pardo, and L.A. Nunes Amaral, Phys. Rev. E 76, 036102
(2007).
37. A.-L. Barabási, H. Jeong, and R. Albert, Physica A 272, 173 (1999).

Glossary of Essential Terms

Adjacency Matrix: Let G be a graph with n vertices. The n × n matrix A,
such that a_ij = 1 if there is an edge between vertices v_i and v_j and where the
rest of the values are 0, is called the adjacency matrix of graph G.

Assortativity: Assortativity refers to a preference for a network’s nodes
to attach to others that are similar or different in some way.

Assortativity Coefficient: The assortativity coefficient is the Pearson
correlation coefficient r between the degrees of pairs of linked nodes. Hence,
positive values of r indicate a correlation between nodes of similar degree,
while negative values indicate relationships between nodes of different degree.

Automorphic Equivalence: Two vertices u and v of a labeled graph
G are automorphically equivalent if all the vertices can be relabeled to form
an isomorphic graph with the labels of u and v interchanged.

Betweenness Centrality: Betweenness centrality of a node v is defined
as the sum of ratios of the number of shortest paths between vertices s and t
(s, t ∈ V) through v to the total number of shortest paths between s and t.
The betweenness centrality g(v) of v is given by

\[ g(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}. \tag{1} \]

Biological Networks: Biological networks are representations of biological
systems such as metabolic networks, protein interaction networks, etc.

Bipartite Graphs: Bipartite graphs are graphs that contain vertices of
two distinct types, with edges running only between unlike types.

Centrality: The centrality of a node in a network is a measure of the
structural importance of the node.

Citation Networks: A citation network is a network whose nodes are
articles, such that there is a directed edge from node i to j if the article i
cites article j.

Clique: Cliques are complete graphs where all nodes are connected to
all other nodes.

Closeness Centrality: The closeness centrality C_c(v) for a vertex v is
the reciprocal of the sum of geodesic distances to all other vertices in the
graph:

\[ C_c(v) = \frac{1}{\sum_{t \in V} d_G(v, t)}. \tag{2} \]

Clustering Coefficient: The clustering coefficient for a vertex v in a network
is defined as the ratio between the total number of connections among
the neighbors of v to the total number of possible connections between the
neighbors. For a vertex i, the clustering coefficient is given by

\[ C_i = \frac{|\{e_{jk}\}|}{k_i (k_i - 1)} : v_j, v_k \in N_i,\; e_{jk} \in E. \tag{3} \]

Community: A community is a subgraph, where in some reasonable sense
the nodes in the subgraph have more to do with each other than with the
nodes that are outside the subgraph.

Coordination Number: The coordination number of a graph is the average
degree z of the nodes of the network.

Cumulative Advantage: Cumulative advantage means that the more
connected a node is, the more likely it is to receive new links. Nodes with
higher degree have a stronger ability to grab links added to the network. This
concept is more popularly known as “preferential attachment.”

Degree Centrality: Degree centrality is defined as the number of links
incident upon a node.

Degree Distribution: The degree distribution of a network gives the
probability distribution of the degree of a random node in a network.

Diameter: The diameter of a graph is defined as the maximum of all
the shortest distances between any two nodes in the graph.

Dual Graphs: A dual graph of a given planar graph G is a graph that
has a vertex for each plane region of G, and an edge for each edge in G
joining two neighboring regions, for a certain embedding of G.

Edge Connectivity: The edge connectivity of G, κ′(G), is the minimum
size of a disconnecting set.

Edge Cutset: An edge cutset is a set F, a subset of E(G) such that
G − F has more than one component.

Eigenvector Centrality: Eigenvector centrality is a measure of the
importance of a node in a network. It assigns relative scores to all nodes in
the network based on the principle that connections to high-scoring nodes
contribute more to the score of the node in question than equal connections
to low-scoring nodes. Thus, the centrality of a node is proportional to the
centrality of the nodes to which it is connected, and this in a recursive fashion.

Erdős-Rényi Graph: In the E-R graph model, each pair of n vertices
is connected by an edge with some probability p. The probability of a vertex
having degree k is given by (z = np)

\[ p_k = \binom{n}{k} p^k (1 - p)^{n-k} \simeq \frac{z^k e^{-z}}{k!}. \tag{4} \]

Euclidean Distance: The Euclidean distance between two nodes A and B
is defined as

\[ ED(A, B) = \sqrt{\sum_i (A_i - B_i)^2}. \tag{5} \]

Euler’s Formula: If a connected planar graph G has exactly n vertices, e
edges, and f faces, then n − e + f = 2.

Euler Tour: An Euler tour of a connected, directed graph G = (V, E)
is a cycle that traverses each edge of graph G exactly once, although it may
visit a vertex more than once.

Euler Walk: An Euler walk in an undirected graph is a path that uses
each edge exactly once.

Geodesic Path: The geodesic path between two vertices is the shortest
path between them.

Giant Component: The giant component refers to a connected subgraph
that contains a majority of the entire graph’s nodes.

Hierarchical Clustering: Hierarchical clustering builds (agglomerative) or
breaks up (divisive) a hierarchy of clusters.

Hyperedges: The edges in the network that join more than two nodes
together.

Hypergraphs: Hypergraphs are graphs that have hyperedges.

Incidence Matrix: The incidence matrix of a graph gives the (0, 1)-matrix
which has a row for each vertex and column for each edge, and (v, e) = 1 iff
edge e is incident on vertex v.

Jaccard Coefficient: The Jaccard coefficient is defined as the size of the
intersection divided by the size of the union of the sample sets:

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \tag{6} \]

k-core: A k-core is defined as the maximal subset where each node is connected
to at least k members.

k-connected: A connected graph G is k-connected iff every pair of vertices
in G is joined by at least k non-intersecting paths and there exists at
least one pair with exactly k non-intersecting paths.

k-plex: In a k-plex with n nodes, all the nodes have degree at least (n − k).
1-plex represents a clique.

Lagrange’s Matrix: If d_i is the degree of node i, then Lagrange’s matrix
(more commonly called the Laplacian matrix) is defined as follows:

\[ L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -1 & \text{if } i \text{ is connected to } j \\ 0 & \text{otherwise} \end{cases} \tag{7} \]

n-clan: An n-clan is an n-clique S such that the subgraph induced by S has
a diameter (D) less than or equal to n.

n-clique: An n-clique is the maximal subset of the nodes where the distance
between any two nodes u and v is less than or equal to n:

\[ d(u, v) \leq n, \quad \forall u, v. \tag{8} \]

Network Motif: Network motifs are patterns that occur in different parts
of a network at frequencies much higher than those found in randomized networks.

Pearson’s Correlation Coefficient: Pearson’s correlation coefficient between
two variables x and y can be measured as

\[ r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left( \sum x^2 - \frac{(\sum x)^2}{n} \right) \left( \sum y^2 - \frac{(\sum y)^2}{n} \right)}}. \tag{9} \]

Percolation Theory: Percolation theory is based on adding nodes and connections
to an empty graph until a giant component surfaces. A percolation
process is one in which vertices or edges on a graph are randomly designated
as either occupied or unoccupied and one asks about various properties of the
resulting patterns of vertices.

Planar Graphs: A graph is planar if it has a drawing without crossings.

Power Law: A power law is any polynomial relationship that exhibits
the property of scale invariance. The most common power laws relate two
variables and have the form

\[ f(x) = a x^k + o(x^k). \tag{10} \]

Preferential Attachment: Preferential attachment means that the more
connected a node is, the more likely it is to receive new links. Nodes with
higher degree have a stronger ability to grab links added to the network.

Random Graphs: A random graph is a graph that is generated by some
random process.

Reciprocity: Reciprocity is the probability that a pair of vertices in a
directed network are connected to each other by directed edges.

Regular Equivalence: Two nodes are said to be regularly equivalent if
they have the same profile of ties with other nodes that are also regularly
equivalent.

Resilience: Resilience is the ability of a network to withstand the removal
of its vertices.

Scale-Free Network: The defining characteristic of scale-free networks
is that their degree distribution follows the Yule–Simon distribution, a power-law
relationship defined by p_k ∼ k^{−γ}.

SIR Epidemic Model: SIR (Susceptible-Infected-Recovered/Removed)
is a model of disease spread where individuals are susceptible to a disease,
potentially contract the disease, and then recover without becoming susceptible
any further. This can also include individuals who die of the disease.

SIS Epidemic Model: SIS (Susceptible-Infected-Susceptible) is a model
of disease spread where individuals are susceptible to a disease, potentially
contract the disease, and are once again susceptible as soon as they recover.

Small-World Network: A small-world network is a type of mathematical
graph in which most nodes are not neighbors of one another, but most
nodes can be reached from every other node by a small number of hops
or steps. These networks show a large clustering coefficient value and a small
average shortest path distance.

Social Network: A social network is a social structure made of nodes
that are tied by one or more specific types of interdependency, such as values,
visions, ideas, financial exchange, friends, kinship, dislike, conflict, trade, web
links, sexual relations, disease transmission (epidemiology), or airline routes.

Strongly Connected Components: A strongly connected component
of a directed graph G is a maximal set of vertices C ⊂ V, such that, for every
pair of vertices u and v, there is a directed path from u to v and a directed
path from v to u.

Structural Equivalence: Two nodes are said to be structurally equivalent
if they have the same relationships to all other nodes.

Structural Holes: Structural holes are nodes that separate non-redundant
sources of information; that is, they act as a bridge between two networks
that are not directly linked.

Technological Networks: Technological networks are man-made networks
designed typically for distribution of some commodity or resource, such as
electricity or information.

Vertex Connectivity: The connectivity of G, κ(G), is the minimum size of
vertex set S such that G − S is disconnected or has only one vertex.

Vertex Cutset: A vertex cutset of a graph G is a set S, a subset of
V(G) such that G − S has more than one component.

Weakly Connected Components: A weakly connected component is
a maximal subgraph of a directed graph such that, for every pair of vertices
u, v in the subgraph, there is an undirected path between u and v.

Zipf’s Law: Zipf’s law states that given some corpus of natural language
utterances, the frequency of any word is inversely proportional to its rank in
the frequency table.

Index

C. elegans, 4 child concepts, 147


E. coli, 74 Chinese Whispers algorithm, 157
chlamydia, 97
ADIOS, 153 clustering coefficient, 119, 207, 218, 257,
adjacency matrix, 98, 226 277
Akaike information criterion, 211 coevolution, 137, 139
antonymy, 149 community, 58
Apache, 200 matrix, 60
apoptosis, 19, 27, 31 structure, 135, 138, 287
assembly model, 60 competition, see ecological interaction
attachment kernel, 220, 222, 229–231, complex adaptive system, 145
234 complexity science, 133, 135
attack, 258 compositional semantics, 145
authority, 159 computer viruses, 260
condition specific, 42
B-cell antigen receptor, 7 configuration model, 232
bandwidth, 219, 224, 225, 227, 234 context vectors, 156
Barabási–Albert model, 194 core lexicon, 151
Bayesian network, 45, 46 cost, 77
Belousov–Zhabotinsky reaction, 11 function, 77
bifurcation point, 43 cross talk, 78
binding motif, 39
biochemical reaction, 38 deep sequencing, 48
biological systems, 35 degradation, 81
bistability, 86 degree, 222, 232
blogosphere, 159 correlations, 220, 230–232, 235
Boolean distribution, 136, 203, 217, 234, 256,
model, 20, 23, 25, 31 275
network, 45 cumulative, 223, 232
rules, 23 excess, 222, 232
exponential, 232
cascade model, 60 Poisson, 223, 227, 228, 234
cells, 35 power-law, 217–219, 224, 232
cellular system, 36, 73, 90 Weibull, 218

free-excess, 240 fault tolerance, 258


heterogeneity, 133, 135 feature economy, 154
deletion kernel, 220, 230, 231 feedback loops, 74
dendrogram, 153 female sex workers, 97
diameter, 277 fixed point, 81
differential equation, 36, 46 flooding, 265
directed tree, 108 flux balance analysis, 46
disassortative network, 5 food webs, 9, 58
disease, 36
distinctive features, 154 game theory, 135, 137
distributional hypothesis, 156 gene, 35
DNA, 73 generating function, 222, 234
microarray, 42 genome, 35
dynamic modeling, 46 giant component, 229, 242
dynamical, 5 glial cells, 11
global structure, 147
eccentricity, 190 gonorrhea, 97
ecological interaction, 58 grammar
ammensalism, 59 dependency, 151
commensalism, 59 phrase structure, 151
competition, 59 tree-adjoining, 151
mutualism, 59 graph
parasitism, 59 visualization, 99
predation, 58 bipartite, 99, 120, 121, 128
symbiosis, 59 complete, 120
edge complete, 120
duplication, 123 function call, 200
pruning, 174 neighbour-based, 174
eigenfunction, 121–124 random, 122
eigenvalue, 119–125, 128 regions, 97
eigenvector centrality, 98, 159 second order co-occurrence, 177
electric current, 45 sentence-based, 174
elementary mode, 46 small-world, 167
entangled, 86 steepest-ascent, 98
epidemics, 9, 97, 244, 247 topographic analysis, 99
models, 261 graphlet, 211
epigenetic, 83
eukaryotes, 86 habitat fragmentation, 68
evolutionary model, 60 Heaviside step function, 230
explanatory variables, 209 hierarchical structure, 146
exponential random graph models, 200, higher order transformation, 178
209 histones, 83
expression data, 42 HITS, 159
HIV, 97
factor graph, 45, 46 condom use, 103
failures, 258 Holme–Kim model, 194
false holonymy, 149
negative, 41 homeostasis, 87
positive, 41 homosexual, 108
hub, 159 minimal cut set, 47
HyperLex, 158 modeling, 135, 137
hyperlink, 159 modular structure, 136
hypernymy, 149 motif, 80, 122
hyponymy, 149 collector, 89
consumer, 88
IκB, 81 duplication, 122, 123
implicational hierarchies, 160 fashion, 89
indexing, 189–193, 195, 196 joining, 122
inflammation, 80 socialist, 87
information, 189, 197 multistability, 74, 80
inhibitor, 81 mutualism, see ecological interaction
interaction strength, 41, 61
Internet, 197, 253 natural language processing, 167
intra-cellular signaling, 7 negative feedback, 80
network
k-core, 201 adaptive, 137
kernel lexicon, 175 assortative, 5
keystone species, 60 clustered, 237
kinase, 8 co-authorship, 129
cascade, 45 co-expression, 43
substrate cascade, 38 dynamic, 140
kinetic modeling, 45 e-mail interchange, 127
kinetics, 46 electronic circuit, 130
equilibrium, 278
Laplacian, 7 food web, 127
algebraic graph Laplacian, 120 gene-gene, 38
graph Laplacian, 117 Internet, 126
normalized Laplacian, 120 jazz band, 129
Leipzig Corpora Collection, 168 logical, 38
letter frequency distribution, 169 metabolic, 38, 46, 126
lexical spectrum, 168 modular, 9
linguistic neuronal, 9, 127
systems, 145 peer-to-peer, 253
universals, 160 phonological neighborhood, 148
local structure, 147 power-grid, 129
logical model, 45 processes, 135
Lyapunov exponent, 125 protein contact, 6
protein-protein interaction, 38, 126
machine learning, 38, 41 protein-RNA, 39
macroscopic, 145 random, 4
mass spectrometry, 44 randomized, 76
maximum likelihood, 210 reactive, 219
May–Wigner theorem, 12 regular, 4
mental lexicon, 146 regulatory, 38
meronymy, 149 scale-free, 119, 259
mesoscopic, 135, 145 sex, 97
metabolic, 80 signaling, 45
microscopic, 145 small-world, 9
social, 134–136, 138, 140 piecewise linear, 25, 31
star, 14 PlaNet
static, 138 Phoneme-Language Network, 154
structured P2P, 263 Polya–Cheeger constant, 124
superpeer, 268 polysemy, 149
syntactic dependency, 151 population dynamics, 13
technological, 253 positive feedback, 74
transcriptional, 43 posttranscriptional process, 44
transportation, 283 power law, 257
unstructured and decentralized P2P, distribution, 167
263 in language-related areas, 173
weblog, 127 two-regime, 151
word co-occurrence, 167 preferential attachment, 118, 119, 217,
word collocation, 150 218, 279
word-adjacency, 126 prey
neurons, 11 prey-predator, 63
NF-κB, 80 prey-preference, 65
niche model, 60 procedural software, 200
node propagation cost, 205
deletion, 219, 229 protein, 35, 73
preferential, 230, 232 activation, 42
random, 219, 221, 227, 230 complex, 37, 38
targeted, 230, 232 localization, 42
duplication, 119, 125 modification, 45
noise, 40 protein-DNA interaction, 39, 43
non-equilibrium network, 279 turnover, 44
non-local diffusion, 10
nucleosome, 83 qualitative
dynamics, 32
open source software, 199 properties, 20
opinion dynamics, 135, 138 quality assessment, 40
orthographic similarity, 149
oscillations, 29, 31, 74, 80 random
oscillatory behaviour, 27 matrix, 12
walk, 225, 266
PageRank, 159 biased, 229
parasitism, see ecological interaction rank-degree distribution, 151
Pareto’s law, 168 rate equation, 219, 221, 230–232, 235
peer churn, 268 recency effect, 150
peer-to-peer, 197, 218, 219, 223, 227, recursive syntax, 145
230, 234 regulatory, 80
percolation theory, 280 rich-get-richer principle, 118
Petri net, 45, 46 robustness, 13
PhoNet
Phoneme-Phoneme Network, 154 saturated degradation, 81
phonological similarity, 147 scale-free, 135, 137
phylogenetic small-world graphs in language, 173
profile, 38 science of networks, 133, 140
relations, 153 search time, 219, 224–227, 234
search tree, 191 sub-lexical units, 146
searching techniques, 264 symbiosis, see ecological interaction
self-organization, 145 synchronization, 125
semantic similarity, 147 solution, 125
sentence frequency, 172 syntactic similarity, 147
signal propagation, 74 synthetic lethal, 39
signals, 74
simulated annealing, 43 text summarization, 160
SIS model, 139 time course data, 43
small-world, 135, 207, 257 time dependent, 42
social time scales, 138
dynamics, 134, 135 time-lagged correlation, 43
groups, 135 topology, 79
media, 136 transcription
network analysis, 133, 135, 140 factor, 41, 80
networking, 136 factor binding, 43
networks, 134–136, 138, 140 regulatory cascade, 39
phenomena, 135, 137 translation, 39, 44
structure, 138 transmission probability, 103
socio-technological system, 253, 254 tree, 120
software treebanks, 151
engineering, 200 triangle duplication, 123
systems, 199 trophic level, 58
sound inventory, 153 typologies, 160
spectral
gap, 124 UCLA Phonological Segment Inventory
plot, 123, 125, 128 Database, 154
spectrum, 120–122, 125, 128, 278 unsupervised induction, 150
SpellNet, 149
spiky, 81 Watts–Strogatz model, 6
spiral waves, 11 weak spot, 46
spreading, 135, 139 webpages, 159
square lattices, 194 word N-gram frequency, 170
stability, 5 word co-occurrence, 174
state change, 44, 46 word sense disambiguation, 157
steady state, 46 World Wide Web, 253
stimulus, 148
stoichiometry, 38, 46 Zipf’s law, 168
structure discovery, 167 Zipfian distribution, 168
