
Inference of Sparse Networks with Unobserved Variables.
Application to Gene Regulatory Networks
Nikolai Slavov
Princeton University, NJ 08544
nslavov@princeton.edu
Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

Abstract

Networks are becoming a unifying framework for modeling complex systems, and network inference problems are frequently encountered in many fields. Here, I develop and apply a generative approach to network inference (RCweb) for the case when the network is sparse and the latent (not observed) variables affect the observed ones. From all possible factor analysis (FA) decompositions explaining the variance in the data, RCweb selects the FA decomposition that is consistent with a sparse underlying network. The sparsity constraint is imposed by a novel method that significantly outperforms (in terms of accuracy, robustness to noise, complexity scaling and computational efficiency) Bayesian methods and MLE methods using $\ell_1$ norm relaxation such as K-SVD and $\ell_1$-based sparse principal component analysis (PCA). Results from simulated models demonstrate that RCweb recovers exactly the model structures for sparsity as low (as non-sparse) as 50% and with ratio of unobserved to observed variables as high as 2. RCweb is robust to noise, with gradual decrease in the parameter ranges as the noise level increases.

1 INTRODUCTION

Factor analysis (FA) decompositions are useful for explaining the variance of observed variables in terms of fewer unobserved variables that may capture systematic effects and allow for low dimensional representation of the data. Yet, the interpretation of latent variables inferred by FA is fraught with problems. In fact, interpretation is not always expected and intended since FA may not have an underlying generative model. A prime difficulty with interpretation arises from the fact that any rotation of the factors and their loadings by an orthogonal matrix results in a different FA decomposition that explains the variance in the observed variables just as well. Therefore, in the absence of additional information on the latent variables and their loadings, FA cannot identify a unique decomposition, much less generative relationships between latent factors and observed variables.

A frequent choice for a constraint, implemented by principal component analysis (PCA) and resulting in a unique solution, is that the factors are the singular vectors (and thus orthogonal to each other) of the data matrix ordered in descending order of their corresponding singular values. Yet, this choice is often motivated by computational convenience rather than by knowledge about the system that generated the data. Another type of computationally convenient constraint applied to facilitate the interpretation of FA results is sparsity, in the form of sparse Bayesian FA (West 2002, Dueck 2005, Carvalho 2008), sparse PCA (d'Aspremont 2007, Sigg 2008) and FA for gene regulatory networks (Srebro 2001, Pe'er 2002). However, papers that introduce and use sparse PCA do not consider a generative model but rather use sparsity as a convenient tool to produce interpretable factors that are linear combinations of just a few original variables. In sparse PCA, sparsity is a way to gain interpretability at the cost of a slightly lower fraction of explained variance.

The algorithm described in this paper (RCweb) also uses a sparse prior, but RCweb explicitly considers the problem from a generative perspective. RCweb asserts that there is indeed a set of hidden variables that connect to and regulate the observed variables via a sparse network. Based on that model, I derive a network structure learning approach within an explicit theoretical framework. This allows me to propose an approach for
sparse FA which is conceptually and computationally different from all existing approaches such as K-SVD (Aharon 2005), sparse PCA and other LARS (Bradley 2004) based methods (Banerjee 2007, Friedman 2008). RCweb is appropriate for analyzing data arising from any system in which the state of each observed variable is affected by a strict subset of the unobserved variables. To assign the inferred latent variables to physical factors, RCweb needs either data from perturbation experiments or prior knowledge about the factors. This framework generalizes to non-linear interactions, which is discussed elsewhere. Furthermore, I analyze the scaling of the computational complexity of RCweb with the number of observed and unobserved variables, as well as the parameter space where RCweb can accurately infer network topologies, and demonstrate its robustness to noise in the data.

2 DERIVATION

Consider a sparse bipartite graph $\mathcal{G} = (\mathcal{E}, \mathcal{N}, \mathcal{R})$ consisting of two sets of vertices $\mathcal{N}$ and $\mathcal{R}$ and the associated set of directed edges $\mathcal{E}$ connecting $\mathcal{R}$ to $\mathcal{N}$ vertices. Define a graphical model in which each vertex s corresponds to a random variable; N observed random variables indexed by $\mathcal{N}$ ($x_\mathcal{N} = \{x_s | s \in \mathcal{N}\}$) whose states are functions of P unobserved variables indexed by $\mathcal{R}$, $x_\mathcal{R} = \{x_s | s \in \mathcal{R}\}$. Since the states of $x_\mathcal{N}$ depend on (are regulated by) $x_\mathcal{R}$, I will also refer to $x_\mathcal{R}$ as regulators. The functional dependencies are denoted by a set of directed edges $\mathcal{E}$ so that each unobserved variable $x_i$, $i \in \mathcal{R}$, affects (and its vertex is thus connected to) a subset of observed variables $x_{\alpha_i} = \{x_s | s \in \alpha_i \subset \mathcal{N}\}$. Given a dataset $G \in \mathbb{R}^{M \times N}$ of M configurations of the observed variables $x_\mathcal{N}$, RCweb aims to infer the edges $\mathcal{E}$ and the corresponding configurations of the unobserved variables, $x_\mathcal{R}$.

If the state of each observed variable is a linear superposition of a subset of unobserved variables, the data G can be modeled with a very simple generative model (1): The data is a product between $R \in \mathbb{R}^{M \times P}$ (a matrix whose columns correspond to the unobserved variables and whose rows correspond to the M measured configurations) and $C \in \mathbb{R}^{P \times N}$, the weighted adjacency matrix of $\mathcal{G}$. The unexplained variance in the data G is captured by the residual $\Upsilon$.

    $G = RC + \Upsilon$    (1)

This decomposition of G into a product of two matrices can be considered to be a type of factor analysis with R being the factors and C the loadings. Even when $P \ll M$ the decomposition of G does not have a unique solution since $RC \equiv RIC \equiv RQ^T QC \equiv R^* C^*$ for any orthonormal matrix Q. Thus the identification of a unique decomposition corresponding to the structure of $\mathcal{G}$ requires additional criteria constraining the decomposition. The assumption that $\mathcal{G}$ is sparse requires that C be sparse as well, meaning that the state of the ith observed vertex $x_i$, $i \in \mathcal{N}$, is affected by a strict subset of the unobserved variables $x_{\psi_i} = \{x_s | s \in \psi_i \subset \mathcal{R}\}$, the ones whose weights in the ith column of C are not zeros, $C_{i\psi_i} \neq 0$ and $C_{i\bar{\psi}_i} = 0$ where $\bar{\psi}_i$ is the complement of $\psi_i$. Thus, to recover the structure of $\mathcal{G}$, RCweb seeks to find a decomposition of G in which C is sparse. The sparsity can be introduced as a regularization with a Lagrangian multiplier $\lambda$:

    $(\hat{C}, \hat{R}) = \arg\min_{R,C} \|G - RC\|_F^2 + \lambda \|C\|_0$    (2)

In the equations above and throughout the paper $\|C\|_F^2 = \sum_{i,j} C_{ij}^2$ denotes the entry-wise (Frobenius) norm, and the zero norm of a vector or matrix ($\|C\|_0$) equals the number of non-zero elements in the array.

To infer the network topology RCweb aims to solve the optimization problem defined by (2). Since (2) is an NP-hard combinatorial problem, the solution can be simplified significantly by relaxing the $\ell_0$ norm to the $\ell_1$ norm (Bradley 2004). Then the approximated problem can be tackled with interior point methods (Banerjee 2007). As an alternative approach to $\ell_1$ approximation, I propose a novel method based on introducing a degree of freedom in the singular-value decomposition (SVD) of G by inserting an invertible¹ matrix B.

    $G = USV^T \equiv \underbrace{US(B^T)^{-1}}_{\hat{R}} \underbrace{B^T V^T}_{\hat{C}}$    (3)

¹ B is always invertible by construction, see section 4.3 and equation (4).

The prior (constraint) that $\hat{C}$ is sparse determines the B that minimizes (2), and thus a unique decomposition. The goal of introducing B is to reduce the combinatorial problem to one that can be solved with convex minimization. When the factors underlying the observed variance are fewer than the observations in G there is no need to take the full SVD; if P factors are expected, only the first P largest singular vectors and values from the SVD of G are taken in that decomposition so that $USV^T$ is the matrix with rank P that best approximates G in the sense of minimizing $\|G - USV^T\|_F^2$. Since conceivably sparse decompositions may use columns outside of the best $\ell_2$ approximation, RCweb considers taking the first $P^*$ singular pairs, for $P^* > P$. Such an expanded basis is more likely to support the optimal sparse solution and is especially relevant for the case when P is not known. Such a choice can be easily accommodated in light of the ability of RCweb to exclude unnecessary explanatory variables, see section 4.3.
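To make the degree of freedom in equation (3) concrete, the following is a minimal NumPy sketch, not the RCweb implementation. It simulates noise-free data G = RC with a sparse C, takes the rank-P SVD, and checks that any invertible B gives the same fit while the sparsity of the loadings depends on B. The toy sizes, the random seed, and the helper names `decompose` and `nnz` are assumptions made for illustration; `B_true` is built from the known C, which is possible only in a simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P = 40, 30, 5  # toy sizes, chosen only for illustration

# Simulate noise-free data G = R C with a sparse C (about 70% zeros).
R = rng.uniform(size=(M, P))
C = rng.normal(size=(P, N)) * (rng.uniform(size=(P, N)) < 0.3)
G = R @ C

# Rank-P truncated SVD: G = U S V^T, with V of shape (N, P).
U_full, s_full, Vt_full = np.linalg.svd(G, full_matrices=False)
U, S, V = U_full[:, :P], np.diag(s_full[:P]), Vt_full[:P].T


def decompose(B):
    """Return (R_hat, C_hat) of equation (3) for an invertible B."""
    R_hat = U @ S @ np.linalg.inv(B.T)
    C_hat = B.T @ V.T
    return R_hat, C_hat


def nnz(A, tol=1e-8):
    """Count entries whose magnitude exceeds a numerical tolerance."""
    return int(np.sum(np.abs(A) > tol))


# B = I gives the plain SVD factors; a random B is invertible almost surely.
R_svd, C_svd = decompose(np.eye(P))
R_rand, C_rand = decompose(rng.normal(size=(P, P)))

# In simulation, C = B^T V^T implies B^T = C V because V^T V = I.
B_true = (C @ V).T
R_true_hat, C_true_hat = decompose(B_true)

err_svd = np.linalg.norm(G - R_svd @ C_svd)
err_rand = np.linalg.norm(G - R_rand @ C_rand)
print("fit errors:", err_svd, err_rand)
print("non-zeros in C_hat:", nnz(C_svd), nnz(C_rand), nnz(C_true_hat))
```

All three choices of B reproduce G to numerical precision, so the fit alone cannot select among them; only the sparsity of the loadings distinguishes the B that corresponds to the simulated network.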
Next RCweb computes B based on the requirement that C is sparse for the case N > P. To compute B, one may set an optimization problem (4). Once B is inferred, $\hat{R}$ and $\hat{C}$ can be computed easily, $\hat{R} = US(\hat{B}^T)^{-1}$ and $\hat{C} = (V\hat{B})^T$.

    $\hat{B} = \arg\min_{B} \|VB\|_0$, so that $\det(B) > 1$    (4)

The constraint on B in (4) is introduced to avoid trivial and degenerate solutions, such as a rank-deficient B. Thus the introduction of B reduces (2) to a problem (4) that is still combinatorial and might also be approximated with a more tractable problem by relaxing the $\ell_0$ norm to the $\ell_1$ norm and applying heuristics (Cetin 2002, Candès 2007) to enhance the solution. I propose a new approach, RCweb, outlined in the next section.

3 RCweb

Assume that a sparse $c_i^T$ corresponding to an optimal $\hat{b}_i$ (the ith column of $\hat{B}$) from $V\hat{b}_i = c_i^T$ is known. Define the set of indices corresponding to non-zero elements in $c_i^T$ with $\omega_-$ and the set of indices corresponding to zero elements in $c_i^T$ with $\omega_0$. Furthermore, define the matrix $V_{\omega_0}$ to be the matrix containing only the rows of V whose indices are in $\omega_0$. If $\omega_0$ and thus $V_{\omega_0}$ are known, one can easily compute $\hat{b}_i$ as the right singular vector of $V_{\omega_0}$ corresponding to the zero singular value. Since $c_i^T$ and $\omega_0$ are not known, RCweb approximates $\hat{b}_i$ (the smallest² right singular vector of $V_{\omega_0}$) with $v_s$, the smallest right singular vector of V. This approximation relies on assuming that a low rank perturbation in a matrix results in a small change in its smallest singular vectors (Wilkinson 1988, Benaych-Georges 2009). Thus given that RCweb is looking for the sparsest solution and the set $\omega_-$ is small relative to N, the angle between the singular vectors of $V_{\omega_0}$ and V is small as well. Therefore, $v_s$ can serve as a reasonable first approximation of $b_i$. Then RCweb systematically and iteratively uses and updates $v_s$ by removing rows of V until $v_s$ converges to $b_i$ or equivalently $V_{\omega_0}$ becomes singular for the largest set of $\omega_0$ indices. When $V_{\omega_0}$ becomes singular, all elements of $c_i$ whose indices are in $\omega_0$ become zero.

² By smallest singular vector I mean the singular vector corresponding to the smallest singular value.

RCweb also has an intuitive geometrical interpretation. Consider the matrix V mapping the unit sphere in $\mathbb{R}^P$ (the sphere with unit radius from $\mathbb{R}^P$) to an ellipsoid in $\mathbb{R}^N$. The axes of the ellipsoid are the left singular vectors of V. In this picture, starting with $\omega_- = \{\emptyset\}$ and $\omega_0 = \{1, \ldots, N\}$, solving (4) requires moving the fewest number of indices from $\omega_0$ to $\omega_-$ so that $V_{\omega_0}$ maps a vector from $\mathbb{R}^P$ to the origin. How to choose the indices to be moved? At each step RCweb chooses $i_{|max|}$, the index of the largest element (by absolute value) of the smallest axis of the ellipsoid, which is the left singular vector of V with the smallest singular value. RCweb moves $i_{|max|}$ from $\omega_0$ to $\omega_-$, effectively selecting the dimension whose projection is easiest to eliminate and removing its largest component, which minimizes as much as possible the projection in that dimension. RCweb keeps moving indices from $\omega_0$ to $\omega_-$ using the same procedure until the smallest right singular vector of $V_{\omega_0}$ converges to $b_i$ and the smallest singular value of $V_{\omega_0}$ approaches zero. RCweb is guaranteed to stop after at most (N−P+1) steps since after removal of (N−P+1) indices from $\omega_0$, $V_{\omega_0}$ will be at most rank P−1. If RCweb finds a sparse solution it will converge in fewer steps.

1. Task:
   $\hat{b}_i = \arg\min_{b_i^T b_i \geq 1} \|V b_i\|_0$

2. Initialization:
   • $\omega_- = \{\emptyset\}$ and $\omega_0 = \{1, 2, \ldots, N\}$
   • Set $K^{-1}_{\omega_0} = (V_{\omega_0}^T V_{\omega_0})^{-1} = I \in \mathbb{R}^{P \times P}$
   • Set J = 1;
   • $i_{|max|} = \arg\max_i \sum_j |V_{ij}|$;
     $\omega_- = \{i_{|max|}\}$, $\omega_0 = \{i | i \in \omega_0, i \neq i_{|max|}\}$
   • Update $K^{-1}_{\omega_0} = RankUpdate(K^{-1}_{\omega_0}, V_{i_{|max|}})$

3. Cycle: J = J + 1. Repeat until convergence
   • Find the eigenvector v for $K^{-1}_{\omega_0}$ with the largest eigenvalue $\lambda_{max}$
   • If $\lambda_{max}^{-1} \approx 0$ or $v^J \to v^{J-1}$, $\hat{b}_i \equiv v$; STOP
   • Compute the left singular vector of $V_{\omega_0}$: $u = s^{-1} V_{\omega_0} v$
   • $i_{|max|} = \arg\max [(|u_1|, \ldots, |u_i|, \ldots, |u_N|)]$;
   • $\omega_- = \{\omega_-, i_{|max|}\}$,
     $\omega_0 = \{i | i \in \omega_0, i \neq i_{|max|}\}$
   • Update $K^{-1}_{\omega_0} = RankUpdate(K^{-1}_{\omega_0}, V_{i_{|max|}})$

The above algorithm can compute a single vector, $\hat{b}_i$, which is just one column of $\hat{B}$. To find the other columns, RCweb applies the same approach to the modified (inflated) matrix, which for the ith column of B is $V^{(i)} = V^{(i-1)} + V\hat{b}_i\hat{b}_i^T$ for i = 2, ..., P. Thus, after the inference of each column of B RCweb modifies $V^{(i-1)}$ to $V^{(i)}$ ($V^{(1)} \equiv V$) so that the algorithm will not replicate its choice of $\omega_0$. Note that in the ith update of $V^{(i-1)}$ the $V\hat{b}_i\hat{b}_i^T$ will modify only the $\omega_-$ rows in $V^{(i-1)}$ since the rows of $V\hat{b}_i\hat{b}_i^T$ whose indices are in $\omega_0$ contain only zero elements. Applying RCweb to the inflated matrices avoids inferring
multiple times the same $\hat{b}_i$, but a $\hat{b}_i$ inferred from the inflated matrix is generally going to differ at least slightly from the corresponding $\hat{b}_i$ that solves (4) for V. To avoid that, RCweb uses the inflated matrices only for the first few iterations until the largest (by absolute magnitude) of the Pearson correlations between the smallest eigenvector of $V_{\omega_0}$ from the current (ith) iteration and the recovered columns of B is less than $1 - \epsilon$ and monotonically decreasing; $\epsilon$ is chosen for numerical stability and also to reflect the similarity between the connectivity of $\mathcal{R}$ vertices that can be expected in the network whose topology is being recovered. A simpler alternative that works well in practice is to use the inflated matrix for the first k iterations, which are enough to find a new direction for $\hat{b}_i$, and then to switch back to V so that the solution is optimal for V. The switch requires k rank updates of $K^{-1}_{\omega_0} \equiv (V_{\omega_0}^T V_{\omega_0})^{-1}$ and thus choosing k small saves computations. Choosing k too small, however, may not be enough to guarantee that $\hat{b}_i$ will not recapitulate a solution that is already found. Usually k = 10 works well and can be easily increased if the new solution is very close to an old one.

There are a few notable elements that make RCweb efficient. First, RCweb does almost all computations in $\mathbb{R}^P$ and since $P \ll N$, $P < M$, that saves both memory and CPU time. Second, each step requires only a few matrix-vector multiplications for computing the eigenvectors (since the change from the previous step is generally very small) and $K^{-1}_{\omega_0}$ is computed by a rank-one update which obviates matrix inversion.

The approach that RCweb takes in solving (4) does not impose specific restrictions on the distribution of the observed variables (G), the noise in the data ($\Upsilon$) or the latent variables $\hat{R}$. However, the initial approximation of $b_i$ with $v_s$ can be poor for data arising from dense networks or special worst-case datasets. As demonstrated theoretically (Benaych-Georges 2009) and tested numerically in the next section, RCweb performs very well at least in the absence of worst-case scenario special structures in the data.

4 VALIDATION

To evaluate the performance of RCweb, I first apply it to data from simulated random bipartite networks with two different topology types, (1) Erdős-Rényi and (2) scale-free, whose corresponding degree distributions are (1) Poisson and (2) power-law. The network topology is encoded in a weighted adjacency matrix $C^{gold}$ and the values for the unobserved variables are drawn from a standard uniform distribution. The simulations result in data matrices $G \in \mathbb{R}^{M \times N}$ containing M observations of all N observed variables. According to RCweb, the optimal sparse adjacency matrix ($\hat{C}$) and the hidden variables ($\hat{R}$) can be inferred by the decomposition $\hat{G} = \hat{R}\hat{C}$ so that $\hat{C}$ is as sparse as possible while $\hat{G}$ is as close as possible to G.

In addition to RCweb, such a decomposition can be computed by 3 classes of existing algorithms. For a comparison, I use the latest versions for which the authors report best performance: (A) PSMF for Bayesian matrix factorization as implemented by the authors' MatLab function PSMF1 (Dueck 2005); (B) BFRM 2 for Bayesian matrix factorization as implemented by the authors' compiled executable (Carvalho 2008); (C) emPCA for maximum likelihood estimate (MLE) sparse PCA (Sigg 2008); (D) K-SVD (Aharon 2005). All algorithms are implemented using the code published by their authors, and with the default values of the parameters when parameters are required. The results are compared for various M, N, P, sparsity, and noise levels.

4.1 LIMITATIONS

Before comparing the results, consider some of the limitations common to all algorithms and the appropriate metrics for comparing the results. In the absence of any other information, the decomposition of G (no matter how accurate) cannot associate hidden variables (corresponding to columns of $\hat{R}$) to physical factors. Furthermore, all methods can infer $\hat{C}$ and $\hat{R}$ only up to an arbitrary diagonal scaling or permutation matrix. First consider the scaling illustrated by the following transformation by a diagonal matrix D, $\hat{G} = \hat{R}\hat{C} = \hat{R}I\hat{C} = \hat{R}(DD^{-1})\hat{C} = (\hat{R}D)(D^{-1}\hat{C}) = \hat{R}^*\hat{C}^*$. Such a transformation is going to rescale $\hat{R}$ to $\hat{R}^*$ and $\hat{C}$ to $\hat{C}^*$, which is just as sparse as $\hat{C}$, $\|\hat{C}^*\|_0 = \|\hat{C}\|_0$. Since both decompositions explain the variance in G equally well, RCweb (or any of the other methods) cannot distinguish between them. Thus given $\hat{C}$, there is a diagonal matrix $\hat{D}$ that scales $\hat{C}$ to $C^{gold}$, the weighted adjacency matrix of $\mathcal{G}$.

The second limitation is that in the absence of additional information, RCweb can determine $\hat{C}$ and $\hat{R}$ only up to a permutation matrix. Consider comparing $\hat{C}$ to the adjacency matrix used in the simulations, $C^{gold}$. Since the identity of the inferred hidden variables is not known, the rows of $\hat{C}$ do not generally correspond to the rows of $C^{gold}$; $\hat{C}_i$ (the ith row of $\hat{C}$) is most likely to correspond to the $C^{gold}$ row that is most correlated to $\hat{C}_i$, and the Pearson correlation between the two rows quantifies the accuracy for the inference of $\hat{C}_i$. To implement this idea, all rows of $\hat{C}$ and $C^{gold}$ are first normalized to mean zero and unit variance resulting in $C_{nor}$ and $C^{gold}_{nor}$. The correlation matrix then is $\Sigma = C_{nor}^T C^{gold}_{nor}$ and the most likely vertex (index of the unobserved variable) corresponding to
$\hat{C}_i$ is $k = \arg\max_j (|\Sigma_{i1}|, \ldots, |\Sigma_{ij}|, \ldots, |\Sigma_{iP}|)$, where $k \in \mathcal{R}$. The absolute value is required because the diagonal elements of $\hat{D}$ can be negative. The accuracy is measured by the corresponding Pearson correlation, $\rho_i = |\Sigma_{ik}|$. An optimal solution of this matching problem can be found by using a belief propagation algorithm; for the simple case of a bipartite graph, even the LP-relaxed version guarantees an optimal solution (Sanghavi 2007). The overall accuracy is quantified by the mean correlation $\bar{\rho} = (1/p) \sum_{i=1}^{p} \rho_i$ where p is the number of inferred unobserved variables, which may or may not equal P depending on whether the number of unobserved variables is known. In computing $\bar{\rho}$, each row of $C^{gold}$ is allowed to correspond only to one row of $\hat{C}$ and vice versa.

In addition to the two common limitations of permutation and scaling, some algorithms (d'Aspremont 2007) for sparse PCA require M > N and since this is not the case in many real world problems and in some of the datasets simulated here, those methods are not tested. Instead I chose emPCA, which does not require M > N and is the latest MLE algorithm for sparse PCA that according to its authors is more efficient than previous algorithms (Sigg 2008).

4.2 ACCURACY AND COMPLEXITY SCALING

RCweb has a natural way for identifying the mean degree³. However, some of the other algorithms require the mean degree for optimal performance. To avoid underestimating an algorithm simply because it recovers networks that are too sparse or not sparse enough, I assume that the mean degree is known and it is input to all algorithms. First all algorithms are tested on a very easy inference problem, Fig.1. Since PSMF and BFRM have lower accuracy and PSMF is significantly slower than the other algorithms, the rest of the results will focus on the MLE algorithms that also have better performance. PSMF gives less accurate results with power-law networks, which can be understood in terms of the uniform prior used by PSMF. PSMF has the advantage over the MLE algorithms in inferring a probabilistic network structure rather than a single estimate. A special advantage of BFRM is the seamless inclusion of response variables and measured factors in the inference. For the MLE algorithms, the accuracy of network inference increases with the ratio of observed to unobserved variables N/P (Fig.2) and with the number of observed configurations M, Fig.3. In contrast, as the noise in the data and the mean out-degree (mean number of edges from $\mathcal{R}$ to $\mathcal{N}$ vertices) are increased, the accuracy of the inference decreases. All algorithms perform better on Poisson networks and the lower level of noise in the data from power-law networks was chosen to partially compensate for that. An important caveat when comparing the results for different algorithms is that K-SVD iteratively improves the accuracy of the solution, and thus the output is dependent on the maximum number of iterations allowed ($I_{max}$). For the results here, $I_{max} = 20$ and the accuracy of K-SVD may improve with a higher number of iterations even though I did not observe significant improvement with $I_{max} = 100$. Even at 20 iterations K-SVD is significantly slower than RCweb and emPCA, Fig.4. The scaling of the algorithms with respect to a parameter was determined by holding all other parameters constant and regressing the log of the CPU time against the log of the variable parameter, Fig.4. The scaling with respect to some parameters is below the theoretical expectation since the highest complexity steps may not be speed-limiting for the ranges of parameters used in the simulations.

³ When RCweb learns all edges, the smallest singular value of $V_{\omega_0}$ approaches zero and its smallest singular vector converges to $\hat{b}_i$.

[Figure 1: plot of inference accuracy, $\bar{\rho}$ (y-axis, 0 to 1.0), against the number of unobserved variables, P (x-axis, 10 to 60), for RCweb, emPCA, kSVD, PSMF and BFRM.]
Figure 1: Accuracy of network recovery as a function of the number of unobserved variables, P. Number of observed variables N = 500; number of observations, M = 280. All networks are with Poisson mean out-degree 0.10N = 50, with 10 % noise in the observations.

4.3 INTERPRETABILITY

In analyzing real data the number of unobserved variables (P) may not be known. I observe that if a simulated network having P regulators is inferred assuming $P^* > P$ regulators, the elements of B corresponding to the excessive regulators are very close to zero, $|B_{ij}| \leq 10^{-10}$ for i > P or j > P. Thus, if the data truly originate from a sparse network RCweb can discard unnecessary unobserved variables.

The results of RCweb can be valuable even without
identifying the physical factors corresponding to $x_\mathcal{R}$. Yet identifying this correspondence, and thus overcoming the limitations outlined in section 4.1, can be very desirable. One practically relevant situation allowing in-depth interpretation of the results requires measuring (if only in a few configurations) the states of some of the variables that are generally unobserved ($x_\mathcal{R}$). This is relevant, for example, to situations in which measuring some variables is much more expensive than others (such as protein modifications versus messenger RNA concentrations) and some $x_{s \in \mathcal{R}}$ can be observed only once or a few times. Assume that the states of the kth physical factor are measured (data in vector $u_k$) in $n_k$ configurations, whose indices are in the set $\phi_k$. This information can be enough for determining the vertex $x_{s \in \mathcal{R}}$ corresponding to the kth factor and the corresponding $\hat{D}_{ss}$ as follows: 1) Compute the Pearson correlations $\vec{\rho}$ between $u_k$ and the columns of $\hat{R}_{\phi_k}$. Then, the vertex of the inferred network most likely to correspond to the kth physical factor is $s = \arg\max_i (|\rho_1|, \ldots, |\rho_i|, \ldots, |\rho_P|)$. 2) $\hat{D}_{ss} = (\hat{R}_{\phi_k}^T \hat{R}_{\phi_k})^{-1} \hat{R}_{\phi_k}^T u_k$. Similar to section 4.1, weighted matching algorithms (Sanghavi 2007) can be used for finding the optimal solution if there is data for multiple $x_{s \in \mathcal{R}}$.

[Figure 2: panels A, B and C plotting inference accuracy, $\bar{\rho}$, for RCweb, emPCA and kSVD.]
Figure 2: Accuracy of network recovery as a function of the number of unobserved variables, P. The thicker brighter lines with squares correspond to number of observed variables N = 500 and the dashed, thinner lines correspond to N = 1000. In all cases the number of observations is M = 2000. A) Poisson networks with mean out-degree 0.25N {125 and 250}, with 50 % noise in the observations and B) power-law networks with mean out-degree 0.40N {200 and 400}, with 10 % noise in the observations. C) The same as (B) except for wider range of the observed variables.

[Figure 3: panels A and B plotting inference accuracy, $\bar{\rho}$, against the number of observed configurations, M, for RCweb, emPCA and kSVD.]
Figure 3: Accuracy of network recovery as a function of the number of observed configurations, M. Continuous lines & squares, N = 500; dashed lines & circles, N = 1000. A) Poisson networks with mean out-degree 0.20N {100 and 200}, with 50 % noise; B) power-law networks with mean out-degree 0.40N {200 and 400}, with 10 % noise. In all cases P = 30.

[Figure 4: log-log panels A, B and C plotting CPU time in seconds against P, N and M. Scaling exponents read from the legends, in panel order: A: RCweb 1.78, emPCA 1.80, kSVD 1.40; B: RCweb 1.41, emPCA 1.33, kSVD 0.98; C: RCweb -0.01, emPCA 1.03, kSVD 0.73.]
Figure 4: Computational efficiency as a function of P, M and N. The scaling exponents (slopes in log-log space) are reported in the legends. The networks are with power-law degree distribution and 60 % sparsity, with mean out-degree 0.40N.

Partial prior knowledge about the structure of $\mathcal{G}$ can also be used to enhance the interpretability of RCweb results. Assume, for example, that some of the nodes ($x_{s \in \alpha_k \subset \mathcal{N}}$) regulated by the kth physical factor (which is a hidden variable in the inference) are known. Then the matching approach that was just outlined can be used with $\hat{C}$ rather than $\hat{R}$. If the weights are not known, all non-zero elements of $\hat{C}_{\alpha_k}$ can be set to one. The significance of the overlap (fraction of
Nikolai Slavov
common edges) of the regulator most likely to corre-

spond to the k th physical and its known connectiv-
ity (coming from prior knowledge) can be quantified
by a p-val (the probability of observing such overlap
by chance alone) computed from the hyper–geometric
distribution. This approach is exemplified with gene-
expression data in the next section.
4.4 INFERENCE OF A
TRANSCRIPTIONAL NETWORK
Figure 5: Overlap between the first 7 sets of genes
Simulated models have the advantage of having known (column 1) that RCweb identifies to be regulated by a
topology, and thus providing excellent basis for rig- common regulator and sets of genes whose promoters
orous evaluation. Such rigorous evaluation is not are bound by the same TF (column 2) in ChIP-chip
possible for real biological networks. Yet, existing experiments. The second and third columns list the
partial knowledge of transcriptional networks, can number of overlapping genes between the RCweb set
be combined with the approach outlined in section and the ChIP-chip set and the corresponding probabil-
4.3 for evaluation of inference of biological networks. ities of observing such overlaps by chance alone. The
In particular, consider the transcriptional network of fourth column contains a description of the TFs from
budding yeast, Saccharomyces cerevisiae. It can be the second column and the fifth column lists the most
modeled as a two-layer network with transcription enriched GO terms for the RCweb gene sets.
factors (TFs) and their complexes being regulators
(xR ) whose activities cannot be measured on a high-
“gold standard” is the only partially known topology,
throughput scale because of technology limitations.
comparing the performance of the different algorithms
TFs regulate the expression levels of messenger RNAs
in this setting can be hard to interpret; an algorithm
(mRNAs) whose concentrations can be measured eas-
can be penalized for correctly inferred edges if these
ily on a high-throughput scale at multiple physiolog-
edges are absent from the incomplete ChIP topology.
ical states and represent the observed variables, xN .
An objective comparison requires experimental test-
A partial knowledge about the connectivity in this
ing of predicted new edges and that is the subject of
transcriptional network comes from ChIP-chip exper-
iments (MacIsaac 2006) which identify sets of genes regulated by the same transcription factors (TFs).

I first apply RCweb to a set of mRNA configurations and then use the approach of section 4.3 to evaluate the results. The data matrix G contains 423 yeast datasets measured on the Affymetrix S98 microarray platform at a variety of physiological conditions. RCweb was initialized with P* = 160 unobserved variables (corresponding to TFs) and identified P = 153 co-regulated sets of genes. The inferred adjacency matrix Ĉ is compared directly to the adjacency matrix identified by ChIP-chip experiments (MacIsaac 2006). Indeed, sets of genes inferred by RCweb to be co-regulated overlap substantially with sets of genes found to be regulated by the same TF in ChIP-chip studies (Fig. 5). Furthermore, the most enriched gene ontology (GO) terms associated with the RCweb-inferred sets of genes correspond to the functions of the respective TFs. This correspondence further supports the accuracy of the inferred network topology and suggests that many of the genes inferred to be regulated by the corresponding TFs may be bona fide targets that were not detected in the ChIP-chip experiments because of the rather limited set of physiological conditions in which the experiments were performed (MacIsaac 2006). Since the best

a forthcoming paper. The comparison between ChIP-chip gene sets and RCweb gene sets makes it possible to identify the likely correspondences between x_s (s ∈ R) and TFs and to infer the activities of the TFs, which are very hard to measure experimentally but crucially important to understanding the dynamics and logic of gene regulatory networks (Slavov 2009).

5 CONCLUSIONS

This paper introduces an approach (RCweb) for inferring latent (unobserved) factors explaining the behavior of observed variables. RCweb aims at inferring a sparse bipartite graph in which edges connect inferred latent factors (e.g. regulators of mRNA transcription and degradation) to observed variables (e.g. target mRNAs). The salient difference distinguishing RCweb from prior related work is a new approach to attaining a sparse solution that allows the natural inclusion of a generative model and relaxation of assumptions on distributions, and ultimately results in more accurate and computationally efficient inference compared to competing algorithms for sparse data decomposition.
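The comparison of RCweb gene sets with ChIP-chip gene sets described in the application above can be illustrated with a minimal sketch. It is not the evaluation procedure of section 4.3. It assumes that each inferred regulator is paired with the TF sharing the most target genes and that the overlap is scored with a hypergeometric tail probability, which is an assumption introduced here for illustration. The function names (`overlap_pvalue`, `best_matches`) and the toy gene sets are hypothetical.

```python
from math import comb

def overlap_pvalue(n_genes, set_a, set_b):
    """Return (overlap size, P[overlap >= observed]) under a hypergeometric null.

    Assumption: set_b is treated as a random subset of n_genes genes.
    """
    k = len(set_a & set_b)
    a, b = len(set_a), len(set_b)
    total = comb(n_genes, b)
    # Sum the hypergeometric probabilities for overlaps of size k or larger.
    tail = sum(comb(a, i) * comb(n_genes - a, b - i)
               for i in range(k, min(a, b) + 1))
    return k, tail / total

def best_matches(n_genes, inferred_sets, chip_sets):
    """Pair each inferred regulator with the TF sharing the most target genes."""
    matches = {}
    for name, genes in inferred_sets.items():
        scored = [(overlap_pvalue(n_genes, genes, tf_genes), tf)
                  for tf, tf_genes in chip_sets.items()]
        # Pick the TF with the largest number of common targets.
        (k, p), tf = max(scored, key=lambda s: s[0][0])
        matches[name] = (tf, k, p)
    return matches

# Hypothetical toy data: 20 genes, two inferred regulators, two TFs.
inferred = {"x1": {0, 1, 2, 3, 4}, "x2": {10, 11, 12}}
chip = {"TF_A": {0, 1, 2, 3, 9}, "TF_B": {11, 12, 13, 14}}

matches = best_matches(20, inferred, chip)
for regulator, (tf, k, p) in matches.items():
    print(f"{regulator} -> {tf}: {k} common targets, p = {p:.4f}")
```

In this toy example x1 is paired with TF_A and x2 with TF_B. A one-to-one assignment between regulators and TFs would require a matching step, which this greedy sketch does not perform.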
Acknowledgements

I thank Dmitry Malioutov and Rodolfo Ros Zertuche for many insightful discussions and extensive, constructive feedback, as well as David M. Blei and Kenneth A. Dawson for support and comments. This work was supported by the Irish Research Council for Science, Engineering and Technology (IRCSET) and by Science Foundation Ireland under Grant No. [SFI/SRC/B1155].

References

West M. (2002) Bayesian factor regression models in the large p, small n paradigm. Bayesian Stat, 7, pp. 723–732
Carvalho C. et al. (2008) High-dimensional sparse factor modelling: Applications in gene expression genomics. JASA, 103, 484, pp. 1438–1456
Dueck D., Morris Q., Frey B. (2005) Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics, 21, pp. 144–151
d’Aspremont A., Bach F., Ghaoui L. (2007) Optimal Solutions for Sparse Principal Component Analysis. JMLR, 9, pp. 1269–1294
Sigg C., Buhmann J. (2008) Expectation-Maximization for Sparse and Non-Negative PCA. ICML, Helsinki, Finland
Srebro N., Jaakkola T. (2001) Sparse Matrix Factorization for Analyzing Gene Expression Patterns. NIPS
Pe’er D., Regev A., Tanay A. (2002) Minreg: Inferring an active regulator set. Bioinformatics, 18, pp. 258–267
Efron B., Hastie T., Johnstone I., Tibshirani R. (2004) Least Angle Regression. Annals of Statistics, 32, pp. 407–499
Aharon M., Elad M., Bruckstein A. (2005) K-SVD and its non-negative variant for dictionary design. Wavelets XI, 5914, pp. 591411–591424
Banerjee O., Ghaoui L. E., d’Aspremont A. (2007) Model selection through sparse maximum likelihood estimation. JMLR, 9, pp. 485–516
Cetin M., Malioutov D., Willsky A. (2002) A Variational Technique for Source Localization based on a Sparse Signal Reconstruction Perspective. ICASSP, 3, pp. 2965–2968, Orlando, Florida
Candès E., Wakin M., Boyd S. (2007) Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl., 14, pp. 877–905
Wilkinson J. H. (1988) The Algebraic Eigenvalue Problem. Oxford University Press, USA, ISBN-13: 978-0198534181
Benaych-Georges F., Rao R. (2009) The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. arXiv:0910.2120v1
Sanghavi S., Malioutov D., Willsky A. (2007) Linear programming analysis of loopy belief propagation for weighted matching. NIPS
MacIsaac K.D. et al. (2006) An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics, 7:113. doi: 10.1186/1471-2105-7-113
Slavov N., Dawson K.A. (2009) Correlation Signature of the Macroscopic States of the Gene Regulatory Network in Cancer. PNAS, 106, 11, pp. 4079–4084