Nikolai Slavov
Princeton University, NJ 08544
nslavov@princeton.edu
Inference of Sparse Networks with Unobserved Variables. Application to Gene Regulatory Networks
sparse FA which is conceptually and computationally different from all existing approaches such as K-SVD (Aharon 2005), sparse PCA and other LARS (Bradley 2004) based methods (Banerjee 2007, Friedman 2008). RCweb is appropriate for analyzing data arising from any system in which the state of each observed variable is affected by a strict subset of the unobserved variables. To assign the inferred latent variables to physical factors, RCweb needs either data from perturbation experiments or prior knowledge about the factors. This framework generalizes to non-linear interactions, which is discussed elsewhere. Furthermore, I analyze the scaling of the computational complexity of RCweb with the number of observed and unobserved variables, as well as the parameter space where RCweb can accurately infer network topologies and demonstrate its robustness to noise in the data.

2 DERIVATION

Consider a sparse bipartite graph $\mathcal{G} = (\mathcal{E}, \mathcal{N}, \mathcal{R})$ consisting of two sets of vertices $\mathcal{N}$ and $\mathcal{R}$ and the associated set of directed edges $\mathcal{E}$ connecting $\mathcal{R}$ to $\mathcal{N}$ vertices. Define a graphical model in which each vertex $s$ corresponds to a random variable; $N$ observed random variables indexed by $\mathcal{N}$ ($x_{\mathcal{N}} = \{x_s \mid s \in \mathcal{N}\}$) whose states are functions of $P$ unobserved variables indexed by $\mathcal{R}$, $x_{\mathcal{R}} = \{x_s \mid s \in \mathcal{R}\}$. Since the states of $x_{\mathcal{N}}$ depend on (are regulated by) $x_{\mathcal{R}}$, I will also refer to $x_{\mathcal{R}}$ as regulators. The functional dependencies are denoted by a set of directed edges $\mathcal{E}$ so that each unobserved variable

the structure of $\mathcal{G}$ requires additional criteria constraining the decomposition. The assumption that $\mathcal{G}$ is sparse requires that $\mathbf{C}$ be sparse as well, meaning that the state of the $i$th observed vertex $x_i \mid i \in \mathcal{N}$ is affected by a strict subset of the unobserved variables $x_{\psi_i} = \{x_s \mid s \in \psi_i \subset \mathcal{R}\}$, the ones whose weights in the $i$th column of $\mathbf{C}$ are not zeros, $\mathbf{C}_{i\psi_i} \neq 0$ and $\mathbf{C}_{i\bar{\psi}_i} = 0$, where $\bar{\psi}_i$ is the complement of $\psi_i$. Thus, to recover the structure of $\mathcal{G}$, RCweb seeks to find a decomposition of $\mathbf{G}$ in which $\mathbf{C}$ is sparse. The sparsity can be introduced as a regularization with a Lagrangian multiplier $\lambda$:

$$(\hat{\mathbf{C}}, \hat{\mathbf{R}}) = \arg\min_{\mathbf{R},\mathbf{C}} \|\mathbf{G} - \mathbf{R}\mathbf{C}\|_F^2 + \lambda \|\mathbf{C}\|_0 \qquad (2)$$

In the equations above and throughout the paper $\|\mathbf{C}\|_F^2 = \sum_{i,j} C_{ij}^2$ denotes the entry-wise (Frobenius) norm, and the zero norm of a vector or matrix ($\|\mathbf{C}\|_0$) equals the number of non-zero elements in the array.

To infer the network topology RCweb aims to solve the optimization problem defined by (2). Since (2) is an NP-hard combinatorial problem, the solution can be simplified significantly by relaxing the $\ell_0$ norm to the $\ell_1$ norm (Bradley 2004). Then the approximated problem can be tackled with interior point methods (Banerjee 2007). As an alternative approach to the $\ell_1$ approximation, I propose a novel method based on introducing a degree of freedom in the singular-value decomposition (SVD) of $\mathbf{G}$ by inserting an invertible¹ matrix $\mathbf{B}$.

$$\mathbf{G} = \mathbf{U}\mathbf{S}\mathbf{V}^T \equiv \mathbf{U}\mathbf{S}(\mathbf{B}^T)^{-1}\mathbf{B}^T\mathbf{V}^T \qquad (3)$$
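As a numerical illustration of (3), the following minimal sketch checks that inserting an invertible matrix into a truncated SVD leaves the product unchanged. It is only a sketch under stated assumptions: the sizes `M`, `N`, `P`, the random noise-free data, and the particular choice of `B` are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P = 20, 12, 3  # hypothetical sizes: observations, observed variables, regulators

# Noise-free rank-P data matrix G = R C (illustration only)
G = rng.uniform(size=(M, P)) @ rng.normal(size=(P, N))

# Thin SVD, truncated to the P leading components: G = U S V^T
U, s, Vt = np.linalg.svd(G, full_matrices=False)
U, S, V = U[:, :P], np.diag(s[:P]), Vt[:P].T  # V has shape (N, P)

# An arbitrary, well-conditioned invertible P x P matrix (assumption for the demo)
B = np.eye(P) + 0.1 * rng.normal(size=(P, P))

R_hat = U @ S @ np.linalg.inv(B.T)  # left factor  U S (B^T)^{-1}
C_hat = (V @ B).T                   # right factor B^T V^T = (V B)^T

# The product R_hat C_hat reproduces G regardless of the choice of B
reconstruction_error = np.linalg.norm(G - R_hat @ C_hat)
print(reconstruction_error)
```

Because every invertible `B` reproduces `G` equally well, the data alone do not fix the factorization; the sparsity requirement on the right factor is what selects a particular `B`.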
Next RCweb computes $\mathbf{B}$ based on the requirement that $\mathbf{C}$ is sparse for the case $N > P$. To compute $\mathbf{B}$, one may set an optimization problem (4). Once $\mathbf{B}$ is inferred, $\hat{\mathbf{R}}$ and $\hat{\mathbf{C}}$ can be computed easily, $\hat{\mathbf{R}} = \mathbf{U}\mathbf{S}(\hat{\mathbf{B}}^T)^{-1}$ and $\hat{\mathbf{C}} = (\mathbf{V}\hat{\mathbf{B}})^T$.

$$\hat{\mathbf{B}} = \arg\min_{\mathbf{B}} \|\mathbf{V}\mathbf{B}\|_0, \quad \text{so that } \det(\mathbf{B}) > 1 \qquad (4)$$

The constraint on $\mathbf{B}$ in (4) is introduced to avoid trivial and degenerate solutions, such as a rank-deficient $\mathbf{B}$. Thus the introduction of $\mathbf{B}$ reduces (2) to a problem (4) that is still combinatorial and might also be approximated with a more tractable problem by relaxing the $\ell_0$ norm to the $\ell_1$ norm and applying heuristics (Cetin 2002, Candès 2007) to enhance the solution. I propose a new approach, RCweb, outlined in the next section.

3 RCweb

Assume that a sparse $\mathbf{c}_i^T$ corresponding to an optimal $\hat{\mathbf{b}}_i$ (the $i$th column of $\hat{\mathbf{B}}$) from $\mathbf{V}\hat{\mathbf{b}}_i = \mathbf{c}_i^T$ is known. Denote the set of indices corresponding to non-zero elements in $\mathbf{c}_i^T$ by $\omega_-$ and the set of indices corresponding to zero elements in $\mathbf{c}_i^T$ by $\omega_0$. Furthermore, define the matrix $\mathbf{V}_{\omega_0}$ to be the matrix containing only the rows of $\mathbf{V}$ whose indices are in $\omega_0$. If $\omega_0$ and thus $\mathbf{V}_{\omega_0}$ are known, one can easily compute $\hat{\mathbf{b}}_i$ as the right singular vector of $\mathbf{V}_{\omega_0}$ corresponding to the zero singular value. Since $\mathbf{c}_i^T$ and $\omega_0$ are not known, RCweb approximates $\hat{\mathbf{b}}_i$ (the smallest² right singular vector of $\mathbf{V}_{\omega_0}$) with $\mathbf{v}_s$, the smallest right singular vector of $\mathbf{V}$. This approximation relies on assuming that a low rank perturbation in a matrix results in a small change in its smallest singular vectors (Wilkinson 1988, Benaych-Georges 2009). Thus, given that RCweb is looking for the sparsest solution and the set $\omega_-$ is small relative to $N$, the angle between the singular vectors of $\mathbf{V}_{\omega_0}$ and $\mathbf{V}$ is small as well. Therefore, $\mathbf{v}_s$ can serve as a reasonable first approximation of $\mathbf{b}_i$. Then RCweb systematically and iteratively uses and updates $\mathbf{v}_s$ by removing rows of $\mathbf{V}$ until $\mathbf{v}_s$ converges to $\mathbf{b}_i$ or, equivalently, $\mathbf{V}_{\omega_0}$ becomes singular for the largest set of $\omega_0$ indices. When $\mathbf{V}_{\omega_0}$ becomes singular, all elements of $\mathbf{c}_i$ whose indices are in $\omega_0$ become zero.

² By smallest singular vector I mean the singular vector corresponding to the smallest singular value.

RCweb also has an intuitive geometrical interpretation. Consider the matrix $\mathbf{V}$ mapping the unit sphere in $\mathbb{R}^P$ (the sphere with unit radius from $\mathbb{R}^P$) to an ellipsoid in $\mathbb{R}^N$. The axes of the ellipsoid are the left singular vectors of $\mathbf{V}$. In this picture, starting with $\omega_- = \emptyset$ and $\omega_0 = \{1, \dots, N\}$, solving (4) requires moving the fewest number of indices from $\omega_0$ to $\omega_-$ so that $\mathbf{V}_{\omega_0}$ maps a vector from $\mathbb{R}^P$ to the origin. How to choose the indices to be moved? At each step RCweb chooses $i_{|max|}$, the index of the largest element (by absolute value) of the smallest axis of the ellipsoid, which is the left singular vector of $\mathbf{V}$ with the smallest singular value. RCweb moves $i_{|max|}$ from $\omega_0$ to $\omega_-$, effectively selecting the dimension whose projection is easiest to eliminate and removing its largest component, which minimizes as much as possible the projection in that dimension. RCweb keeps moving indices from $\omega_0$ to $\omega_-$ using the same procedure until the smallest right singular vector of $\mathbf{V}_{\omega_0}$ converges to $\mathbf{b}_i$ and the smallest singular value of $\mathbf{V}_{\omega_0}$ approaches zero. RCweb is guaranteed to stop after at most $(N-P+1)$ steps since after removal of $(N-P+1)$ indices from $\omega_0$, $\mathbf{V}_{\omega_0}$ will be at most rank $P-1$. If RCweb finds a sparse solution it will converge in fewer steps.

1. Task: $\hat{\mathbf{b}}_i = \arg\min_{\mathbf{b}_i^T \mathbf{b}_i \geq 1} \|\mathbf{V}\mathbf{b}_i\|_0$

2. Initialization:
• $\omega_- = \emptyset$ and $\omega_0 = \{1, 2, \dots, N\}$
• Set $\mathbf{K}^{-1}_{\omega_0} = (\mathbf{V}_{\omega_0}^T \mathbf{V}_{\omega_0})^{-1} = \mathbf{I} \in \mathbb{R}^{P \times P}$
• Set $J = 1$
• $i_{|max|} = \arg\max_i \sum_j |V_{ij}|$; $\omega_- = \{i_{|max|}\}$, $\omega_0 = \{i \mid i \in \omega_0, i \neq i_{|max|}\}$
• Update $\mathbf{K}^{-1}_{\omega_0} = \mathrm{RankUpdate}(\mathbf{K}^{-1}_{\omega_0}, \mathbf{V}_{i_{|max|}})$

3. Cycle: $J = J + 1$. Repeat until convergence
• Find the eigenvector $\mathbf{v}$ for $\mathbf{K}^{-1}_{\omega_0}$ with the largest eigenvalue $\lambda_{max}$
• If $\lambda_{max}^{-1} \approx 0$ or $\mathbf{v}^{J} \to \mathbf{v}^{J-1}$, $\hat{\mathbf{b}}_i \equiv \mathbf{v}$; STOP
• Compute the left singular vector of $\mathbf{V}_{\omega_0}$: $\mathbf{u} = s^{-1} \mathbf{V}_{\omega_0} \mathbf{v}$
• $i_{|max|} = \arg\max\,[(|u_1|, \dots, |u_i|, \dots, |u_N|)]$
• $\omega_- = \{\omega_-, i_{|max|}\}$, $\omega_0 = \{i \mid i \in \omega_0, i \neq i_{|max|}\}$
• Update $\mathbf{K}^{-1}_{\omega_0} = \mathrm{RankUpdate}(\mathbf{K}^{-1}_{\omega_0}, \mathbf{V}_{i_{|max|}})$

The above algorithm can compute a single vector, $\hat{\mathbf{b}}_i$, which is just one column of $\hat{\mathbf{B}}$. To find the other columns, RCweb applies the same approach to the modified (inflated) matrix, which for the $i$th column of $\mathbf{B}$ is $\mathbf{V}^{(i)} = \mathbf{V}^{(i-1)} + \mathbf{V}\hat{\mathbf{b}}_i\hat{\mathbf{b}}_i^T$ for $i = 2, \dots, P$. Thus, after the inference of each column of $\mathbf{B}$, RCweb modifies $\mathbf{V}^{(i-1)}$ to $\mathbf{V}^{(i)}$ ($\mathbf{V}^{(1)} \equiv \mathbf{V}$) so that the algorithm will not replicate its choice of $\omega_0$. Note that in the $i$th update of $\mathbf{V}^{(i-1)}$ the term $\mathbf{V}\hat{\mathbf{b}}_i\hat{\mathbf{b}}_i^T$ will modify only the $\omega_-$ rows in $\mathbf{V}^{(i-1)}$ since the rows of $\mathbf{V}\hat{\mathbf{b}}_i\hat{\mathbf{b}}_i^T$ whose indices are in $\omega_0$ contain only zero elements. Applying RCweb to the inflated matrices avoids inferring
multiple times the same $\hat{\mathbf{b}}_i$, but a $\hat{\mathbf{b}}_i$ inferred from the inflated matrix is generally going to differ at least slightly from the corresponding $\hat{\mathbf{b}}_i$ that solves (4) for $\mathbf{V}$. To avoid that, RCweb uses the inflated matrices only for the first few iterations until the largest (by absolute magnitude) of the Pearson correlations between the smallest eigenvector of $\mathbf{V}_{\omega_0}$ from the current ($i$th) iteration and the recovered columns of $\mathbf{B}$ is less than $1 - \epsilon$ and monotonically decreasing; $\epsilon$ is chosen for numerical stability and also to reflect the similarity between the connectivity of $\mathcal{R}$ vertices that can be expected in the network whose topology is being recovered. A simpler alternative that works well in practice is to use the inflated matrix for the first $k$ iterations, which are enough to find a new direction for $\hat{\mathbf{b}}_i$, and then switch back to $\mathbf{V}$ so that the solution is optimal for $\mathbf{V}$. The switch requires $k$ rank updates of $\mathbf{K}^{-1}_{\omega_0} \equiv (\mathbf{V}_{\omega_0}^T \mathbf{V}_{\omega_0})^{-1}$ and thus choosing $k$ small saves computations. Choosing $k$ too small, however, may not be enough to guarantee that $\hat{\mathbf{b}}_i$ will not recapitulate a solution that is already found. Usually $k = 10$ works well and can be easily increased if the new solution is very close to an old one.

There are a few notable elements that make RCweb efficient. First, RCweb does almost all computations in $\mathbb{R}^P$ and since $P \ll N$, $P < M$, that saves both memory and CPU time. Second, each step requires only a few matrix-vector multiplications for computing the eigenvectors (since the change from the previous step is generally very small) and $\mathbf{K}^{-1}_{\omega_0}$ is computed by a rank-one update, which obviates matrix inversion.

The approach that RCweb takes in solving (4) does not impose specific restrictions on the distribution of the observed variables ($\mathbf{G}$), the noise in the data ($\boldsymbol{\Upsilon}$) or the latent variables $\hat{\mathbf{R}}$. However, the initial approximation of $\mathbf{b}_i$ with $\mathbf{v}_s$ can be poor for data arising from dense networks or special worst-case datasets. As demonstrated theoretically (Benaych-Georges 2009) and tested numerically in the next section, RCweb performs very well at least in the absence of worst-case scenario special structures in the data.

4 VALIDATION

To evaluate the performance of RCweb, I first apply it to data from simulated random bipartite networks with two different topology types, (1) Erdős-Rényi and (2) scale-free, whose corresponding degree distributions are (1) Poisson and (2) power-law. The network topology is encoded in a weighted adjacency matrix $\mathbf{C}^{gold}$ and the values for the unobserved variables are drawn from a standard uniform distribution. The simulations result in data matrices $\mathbf{G} \in \mathbb{R}^{M \times N}$ containing $M$ observations of all $N$ observed variables. According to RCweb, the optimal sparse adjacency matrix ($\hat{\mathbf{C}}$) and the hidden variables ($\hat{\mathbf{R}}$) can be inferred by the decomposition $\hat{\mathbf{G}} = \hat{\mathbf{R}}\hat{\mathbf{C}}$ so that $\hat{\mathbf{C}}$ is as sparse as possible while $\hat{\mathbf{G}}$ is as close as possible to $\mathbf{G}$.

In addition to RCweb, such decomposition can be computed by 3 classes of existing algorithms. For a comparison, I use the latest versions for which the authors report best performance: (A) PSMF for Bayesian matrix factorization as implemented in the authors' MATLAB function PSMF1 (Dueck 2005); (B) BFRM 2 for Bayesian matrix factorization as implemented in the authors' compiled executable (Carvalho 2008); (C) emPCA for maximum likelihood estimate (MLE) sparse PCA (Sigg 2008); (D) K-SVD (Aharon 2005). All algorithms are run using the code published by their authors, and with the default values of the parameters when parameters are required. The results are compared for various $M$, $N$, $P$, sparsity, and noise levels.

4.1 LIMITATIONS

Before comparing the results, consider some of the limitations common to all algorithms and the appropriate metrics for comparing the results. In the absence of any other information, the decomposition of $\mathbf{G}$ (no matter how accurate) cannot associate hidden variables (corresponding to columns of $\hat{\mathbf{R}}$) to physical factors. Furthermore, all methods can infer $\hat{\mathbf{C}}$ and $\hat{\mathbf{R}}$ only up to an arbitrary diagonal scaling or permutation matrix. First consider the scaling illustrated by the following transformation by a diagonal matrix $\mathbf{D}$, $\hat{\mathbf{G}} = \hat{\mathbf{R}}\hat{\mathbf{C}} = \hat{\mathbf{R}}\mathbf{I}\hat{\mathbf{C}} = \hat{\mathbf{R}}(\mathbf{D}\mathbf{D}^{-1})\hat{\mathbf{C}} = (\hat{\mathbf{R}}\mathbf{D})(\mathbf{D}^{-1}\hat{\mathbf{C}}) = \hat{\mathbf{R}}^{*}\hat{\mathbf{C}}^{*}$. Such a transformation is going to rescale $\hat{\mathbf{R}}$ to $\hat{\mathbf{R}}^{*}$ and $\hat{\mathbf{C}}$ to $\hat{\mathbf{C}}^{*}$, which is just as sparse as $\hat{\mathbf{C}}$, $\|\hat{\mathbf{C}}^{*}\|_0 = \|\hat{\mathbf{C}}\|_0$. Since both decompositions explain the variance in $\mathbf{G}$ equally well, RCweb (or any of the other methods) cannot distinguish between them. Thus given $\hat{\mathbf{C}}$, there is a diagonal matrix $\hat{\mathbf{D}}$ that scales $\hat{\mathbf{C}}$ to $\mathbf{C}^{gold}$, the weighted adjacency matrix of $\mathcal{G}$.

The second limitation is that in the absence of additional information, RCweb can determine $\hat{\mathbf{C}}$ and $\hat{\mathbf{R}}$ only up to a permutation matrix. Consider comparing $\hat{\mathbf{C}}$ to the adjacency matrix used in the simulations, $\mathbf{C}^{gold}$. Since the identity of the inferred hidden variables is not known, the rows of $\hat{\mathbf{C}}$ do not generally correspond to the rows of $\mathbf{C}^{gold}$; $\hat{\mathbf{C}}_i$ (the $i$th row of $\hat{\mathbf{C}}$) is most likely to correspond to the $\mathbf{C}^{gold}$ row that is most correlated to $\hat{\mathbf{C}}_i$, and the Pearson correlation between the two rows quantifies the accuracy for the inference of $\hat{\mathbf{C}}_i$. To implement this idea, all rows of $\hat{\mathbf{C}}$ and $\mathbf{C}^{gold}$ are first normalized to mean zero and unit variance, resulting in $\mathbf{C}_{nor}$ and $\mathbf{C}^{gold}_{nor}$. The correlation matrix then is $\boldsymbol{\Sigma} = \mathbf{C}_{nor}^T \mathbf{C}^{gold}_{nor}$ and the most likely vertex (index of the unobserved variable) corresponding to
is measured by the corresponding Pearson correlation, $\rho_i = |\Sigma_{ik}|$. An optimal solution of this matching prob-

[Figure: inference accuracy, $\bar{\rho}$, for emPCA, kSVD, PSMF, BFRM and RCweb.]
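The row-matching metric of section 4.1 can be illustrated with a short sketch. Several choices here are assumptions rather than the paper's procedure: the function name `row_match_accuracy` is hypothetical, the correlation matrix is computed between rows (each row holding the weights of one unobserved variable), and each inferred row is matched greedily to its most correlated gold-standard row, which is a simplification and not an optimal one-to-one matching.

```python
import numpy as np

def row_match_accuracy(C_hat, C_gold):
    """Match each row of C_hat to its most correlated row of C_gold.

    Returns (k, rho): k[i] is the index of the matched gold row and
    rho[i] = |Sigma[i, k[i]]| is the absolute Pearson correlation.
    """
    def normalize_rows(C):
        # Normalize every row to zero mean and unit variance
        C = C - C.mean(axis=1, keepdims=True)
        return C / C.std(axis=1, keepdims=True)

    n_cols = C_hat.shape[1]
    # Entry (i, k) is the Pearson correlation between row i of C_hat and row k of C_gold
    Sigma = normalize_rows(C_hat) @ normalize_rows(C_gold).T / n_cols
    k = np.argmax(np.abs(Sigma), axis=1)        # greedy per-row match (assumption)
    rho = np.abs(Sigma[np.arange(len(k)), k])
    return k, rho

# Hypothetical example: a sparse gold matrix, then a permuted and rescaled copy
rng = np.random.default_rng(1)
C_gold = rng.normal(size=(4, 50)) * (rng.uniform(size=(4, 50)) < 0.3)
perm = np.array([2, 0, 3, 1])
scale = np.array([-3.0, 0.5, 2.0, -1.0])
C_hat = scale[:, None] * C_gold[perm]

k, rho = row_match_accuracy(C_hat, C_gold)
print(k, rho)
```

Since the Pearson correlation is invariant to rescaling and the absolute value removes the sign, the metric is insensitive to both the diagonal scaling and the permutation ambiguity: in this example the permuted, rescaled copy is matched back to the original rows.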
[Figure residue, panels A and B: inference accuracy, $\bar{\rho}$, for RCweb, emPCA and kSVD; one recoverable x-axis label is "Number of unobserved variables, P".]

[Caption fragments:] brighter lines with squares correspond to number of
% noise in the observations. C) The same as (B) except for wider range of the observed variables.

[Figure 4 plot residue. Legend scaling exponents, in the order of the x-axis labels P, N, M: P: RCweb 1.78, emPCA 1.80, kSVD 1.40; N: RCweb 1.41, emPCA 1.33, kSVD 0.98; M: RCweb -0.01, emPCA 1.03, kSVD 0.73.]
identifying the physical factors corresponding to $x_{\mathcal{R}}$. Yet identifying this correspondence, and thus overcoming the limitations outlined in section 4.1 can be very desirable. One practically relevant situation allowing in-depth interpretation of the results requires measuring (if only in a few configurations) the states of some of the variables that are generally unobserved ($x_{\mathcal{R}}$). This is relevant, for example, to situations in which measuring some variables is much more expensive than others (such as protein modifications versus messenger RNA concentrations) and some $x_{s \in \mathcal{R}}$ can be observed only once or a few times. Assume that the states of the $k$th physical factor are measured (data in vector $\mathbf{u}_k$) in $n_k$ number of configurations, whose indices are in the set $\phi_k$. This information can be enough for determining the vertex $x_{s \in \mathcal{R}}$ corresponding to the $k$th factor and the corresponding $\hat{D}_{ss}$ as follows: 1) Compute the Pearson correlations $\vec{\rho}$ between $\mathbf{u}_k$ and the columns of $\hat{\mathbf{R}}_{\phi_k}$. Then, the vertex of the inferred network most likely to correspond to the $k$th physical factor is $s = \arg\max_i (|\rho_1|, \dots, |\rho_i|, \dots, |\rho_P|)$. 2) $\hat{D}_{ss} = (\hat{\mathbf{R}}_{\phi_k}^T \hat{\mathbf{R}}_{\phi_k})^{-1} \hat{\mathbf{R}}_{\phi_k}^T \mathbf{u}_k$. Similar to section (4.1), weighted matching algorithms (Sanghavi 2007) can be used for finding the optimal solution if there is data for multiple $x_{s \in \mathcal{R}}$.

Partial prior knowledge about the structure of $\mathcal{G}$ can also be used to enhance the interpretability of RCweb results. Assume, for example, that some of the nodes ($x_{s \in \alpha_k \subset \mathcal{N}}$) regulated by the $k$th physical factor (which is a hidden variable in the inference) are known. Then the matching approach that was just outlined can be used with $\hat{\mathbf{C}}$ rather than $\hat{\mathbf{R}}$. If the weights are not known all non-zero elements of $\hat{\mathbf{C}}_{\alpha_k}$ can be set to one. The significance of the overlap (fraction of

Figure 4: Computational efficiency as a function of $P$, $M$ and $N$. The scaling exponents (slopes in log-log space) are reported in the legends. The networks are with power-law degree distribution and 60 % sparsity, with mean out-degree $0.40N$
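The two-step assignment of a measured physical factor to an inferred regulator, described in this section, can be sketched as follows. This is a minimal sketch under assumptions: the function name `match_factor` and the simulated data are hypothetical, and the least-squares scale in step 2 is computed from the single matched column $s$ of $\hat{\mathbf{R}}_{\phi_k}$, which is one reading of the formula for $\hat{D}_{ss}$.

```python
import numpy as np

def match_factor(R_hat, u_k, phi_k):
    """Assign measurements u_k (taken in configurations phi_k) to a column of R_hat.

    Returns (s, D_ss, rho): matched column index, least-squares scale,
    and the Pearson correlations between u_k and every column of R_hat[phi_k].
    """
    R_phi = R_hat[phi_k]  # rows of R_hat for the measured configurations
    # Step 1: Pearson correlation between u_k and each column of R_phi
    rho = np.array([np.corrcoef(R_phi[:, i], u_k)[0, 1]
                    for i in range(R_phi.shape[1])])
    s = int(np.argmax(np.abs(rho)))
    # Step 2: least-squares scale for the matched column (assumed single-column reading)
    r = R_phi[:, s]
    D_ss = (r @ u_k) / (r @ r)
    return s, D_ss, rho

# Hypothetical example: the measured factor is column 3 of R_hat, rescaled by 2.5
rng = np.random.default_rng(2)
R_hat = rng.uniform(size=(30, 5))
phi_k = np.array([0, 3, 7, 12, 20, 25])
u_k = 2.5 * R_hat[phi_k, 3]

s, D_ss, rho = match_factor(R_hat, u_k, phi_k)
print(s, D_ss)
```

With only a few measured configurations the correlations of the unmatched columns can be large by chance, so in practice the margin between the best and the second-best $|\rho_i|$ is worth inspecting before trusting the assignment.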
4.4 INFERENCE OF A TRANSCRIPTIONAL NETWORK

Simulated models have the advantage of having known topology, and thus provide an excellent basis for rigorous evaluation. Such rigorous evaluation is not possible for real biological networks. Yet, existing partial knowledge of transcriptional networks can be combined with the approach outlined in section 4.3 for evaluation of inference of biological networks. In particular, consider the transcriptional network of budding yeast, Saccharomyces cerevisiae. It can be modeled as a two-layer network with transcription factors (TFs) and their complexes being regulators ($x_{\mathcal{R}}$) whose activities cannot be measured on a high-throughput scale because of technology limitations. TFs regulate the expression levels of messenger RNAs (mRNAs) whose concentrations can be measured easily on a high-throughput scale at multiple physiological states and represent the observed variables, $x_{\mathcal{N}}$. A partial knowledge about the connectivity in this transcriptional network comes from ChIP-chip experiments (MacIsaac 2006) which identify sets of genes regulated by the same transcription factors (TFs).

I first apply RCweb to a set of mRNA configurations, followed by the approach of section 4.3 to evaluate the results. The data matrix $\mathbf{G}$ contains 423 yeast datasets measured on the Affymetrix S98 microarray platform at a variety of physiological conditions. RCweb was initialized with $P^{*} = 160$ unobserved variables (corresponding to TFs) and identified $P = 153$ co-regulated sets of genes. The inferred adjacency matrix $\hat{\mathbf{C}}$ is compared directly to the adjacency matrix identified by ChIP-chip experiments (MacIsaac 2006). Indeed, sets of genes inferred by RCweb to be co-regulated overlap substantially with sets of genes found to be regulated by the same TF in ChIP-chip studies, Fig. 5. Furthermore, the most enriched gene ontology (GO) terms associated with the RCweb inferred sets of genes correspond to the functions of the respective TFs. This fact provides further support to the accuracy of the inferred network topology and suggests that many of the genes inferred to be regulated by the corresponding TFs may be bona fide targets that were not detected in the ChIP-chip experiments because of the rather limited set of physiological conditions in which the experiments are performed (MacIsaac 2006). Since the best "gold standard" is the only partially known topology, comparing the performance of the different algorithms in this setting can be hard to interpret; an algorithm can be penalized for correctly inferred edges if these edges are absent from the incomplete ChIP topology. An objective comparison requires experimental testing of predicted new edges and that is the subject of a forthcoming paper.

Figure 5: Overlap between the first 7 sets of genes (column 1) that RCweb identifies to be regulated by a common regulator and sets of genes whose promoters are bound by the same TF (column 2) in ChIP-chip experiments. The second and third columns list the number of overlapping genes between the RCweb set and the ChIP-chip set and the corresponding probabilities of observing such overlaps by chance alone. The fourth column contains a description of the TFs from the second column and the fifth column lists the most enriched GO terms for the RCweb gene sets.

The comparison between ChIP-chip gene sets and RCweb gene sets makes it possible to identify the likely correspondences between $x_{s \in \mathcal{R}}$ and TFs and to infer the activities of the TFs, which are very hard to measure experimentally but crucially important to understanding the dynamics and logic of gene regulatory networks (Slavov 2009).

5 CONCLUSIONS

This paper introduces an approach (RCweb) for inferring latent (unobserved) factors explaining the behavior of observed variables. RCweb aims at inferring a sparse bipartite graph in which vertices connect inferred latent factors (e.g. regulators of mRNA transcription and degradation) to observed variables (e.g. target mRNAs). The salient difference distinguishing RCweb from prior related work is a new approach to attaining a sparse solution that allows the natural inclusion of a generative model, relaxation of assumptions on distributions, and ultimately results in more accurate and computationally efficient inference compared to competing algorithms for sparse data decomposition.